You may find this solution accelerator at https://github.com/databricks-industry-solutions/pixels. 

# De-identifying DICOM Data

Because both the pixel data and the metadata of DICOM images can contain sensitive information, you may want to de-identify this data. But which de-identification scheme should you adopt? Do you want pure anonymization, or two-way de-identification, in which the data is anonymized but can later be restored to its original form?
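
To make the distinction between the two schemes concrete, here is a minimal sketch in plain Python. It is independent of Presidio and of this accelerator; the names `SECRET_KEY`, `one_way_pseudonym`, and `ReversiblePseudonymizer` are hypothetical, and a real deployment would need to protect both the key and the mapping.

```python
import hashlib
import hmac
import secrets

# Hypothetical key for illustration; in practice it would come from a secret store.
SECRET_KEY = secrets.token_bytes(32)


def one_way_pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Keyed hash: the same input always yields the same token,
    but the token cannot be turned back into the input."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"anon-{digest[:12]}"


class ReversiblePseudonymizer:
    """Two-way scheme: keeps a private mapping so originals can be restored."""

    def __init__(self):
        self._forward = {}  # original value -> token
        self._reverse = {}  # token -> original value

    def deidentify(self, value: str) -> str:
        if value not in self._forward:
            token = f"patient-{len(self._forward) + 1:04d}"
            self._forward[value] = token
            self._reverse[token] = value
        return self._forward[value]

    def reidentify(self, token: str) -> str:
        return self._reverse[token]


pseudonymizer = ReversiblePseudonymizer()
token = pseudonymizer.deidentify("DOE^JANE")
original = pseudonymizer.reidentify(token)
```

The trade-off is that the two-way variant is only as safe as the mapping it stores, whereas the one-way variant leaves nothing to restore from.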

Since the previous examples showed how easy it is to get the data into Delta, let's now extend this capability using open source libraries to anonymize and/or de-identify the data.

In [0]:
%run ./config/setup

In [0]:
path,table,volume,write_mode = init_widgets()

## Introducing Presidio

Microsoft has been developing an open source library called Presidio (https://microsoft.github.io/presidio/) that provides this capability not just for tabular data but for images as well. The library is written in Python. Before we begin, we need to install our dependencies, starting with the required Python packages:

In [0]:
%pip install presidio_analyzer presidio_anonymizer presidio-structured
%pip install faker

We also need to install a Python-based NLP engine; in our case, we'll use spaCy:

In [0]:
%sh
python -m spacy download en_core_web_lg

## Revisiting our test data

Before we begin, let's take another look at the sample data from the starter notebooks provided in this repository. First, we select everything from our table, then pull a few individual attributes out of the metadata:

In [0]:
%sql select * from ${table}

In [0]:
%sql
SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection` -- backticks work for numeric keys
FROM ${table}
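
For readers less familiar with the `meta:[...]` path syntax, the sketch below shows roughly the same lookup in plain Python. It assumes `meta` is a JSON string in the DICOM JSON layout (tag mapped to an object with a `Value` list); the sample metadata and the helper `tag_value` are hypothetical and not part of the accelerator.

```python
import json

# Hypothetical, hand-written metadata in the DICOM JSON layout.
meta = json.dumps({
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "DOE^JANE"}]},
    "0008103E": {"vr": "LO", "Value": ["AXIAL T1"]},
})


def tag_value(meta_json, tag, index=0):
    """Return Value[index] for a tag, or None when the tag or value is absent."""
    element = json.loads(meta_json).get(tag, {})
    values = element.get("Value") or []
    return values[index] if index < len(values) else None


# Rough equivalent of meta:['00100010'].Value[0].Alphabetic
patient = tag_value(meta, "00100010")
patient_name = patient["Alphabetic"] if patient else None

# Rough equivalent of meta:['0008103E'].Value[0]
series_description = tag_value(meta, "0008103E")

# A tag that is missing yields None, much as the SQL path yields NULL
study_description = tag_value(meta, "00081030")
```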

We see that the metadata has a patient name in it; however, our test data is already anonymized. For demonstration purposes, we'll clone the table and use the `faker` Python library to generate random names for the rows:

In [0]:
%sql
CREATE TABLE ${table}_presidio DEEP CLONE ${table}

In [0]:
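from faker import Faker
from pyspark.sql.types import StringType

fake = Faker()

def generate_faker_name():
  # Return a new random full name on every call
  return fake.name()

# Register the function so it can be called from the SQL cells below
spark.udf.register("generate_faker_name", generate_faker_name, StringType())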

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW patients_to_faker_names AS
SELECT
    --rowid,
    row_number() over(order by meta:['00100010'].Value[0].Alphabetic) patient_row_id,
    meta:['00100010'].Value[0].Alphabetic patient_name,
    generate_faker_name() as fake_name
FROM ${table}_presidio
WHERE meta:['00100010'].Value[0].Alphabetic IS NOT NULL
GROUP BY patient_name

In [0]:
%sql
select * from patients_to_faker_names

In [0]:
%sql
MERGE INTO ${table}_presidio a
USING patients_to_faker_names b ON meta:['00100010'].Value[0].Alphabetic = b.patient_name
WHEN MATCHED THEN UPDATE SET meta = replace(meta, meta:['00100010'].Value[0].Alphabetic, b.fake_name)
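
The `MERGE` above relies on a plain string `replace` over the whole `meta` column. The sketch below, in plain Python with hypothetical sample metadata and helper names, illustrates what that implies: every occurrence of the old name in the JSON string is replaced, not only the Patient Name tag. A targeted variant that edits only tag `00100010` is shown for comparison; it is an illustration, not what the notebook does.

```python
import json

# Hypothetical metadata in which the placeholder name also appears in another tag.
meta = json.dumps({
    "00100010": {"vr": "PN", "Value": [{"Alphabetic": "Anonymous"}]},
    "00081030": {"vr": "LO", "Value": ["Anonymous study"]},
})


def swap_patient_name(meta_json, fake_name):
    """Mirror SQL replace(meta, current_name, fake_name) on the raw JSON string."""
    current = json.loads(meta_json)["00100010"]["Value"][0]["Alphabetic"]
    return meta_json.replace(current, fake_name)


def swap_patient_name_targeted(meta_json, fake_name):
    """Alternative: parse the JSON and change only the Patient Name tag."""
    parsed = json.loads(meta_json)
    parsed["00100010"]["Value"][0]["Alphabetic"] = fake_name
    return json.dumps(parsed)


replaced = json.loads(swap_patient_name(meta, "Jane Doe"))
targeted = json.loads(swap_patient_name_targeted(meta, "Jane Doe"))
```

For the demo data this difference is unlikely to matter, but it is worth keeping in mind when the original name is a short or common string.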

In [0]:
%sql
SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection` -- backticks work for numeric keys
FROM ${table}_presidio

In [0]:
import pandas as pd
from presidio_structured import StructuredEngine, PandasAnalysisBuilder, StructuredAnalysis
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker

pandas_engine = StructuredEngine()

In [0]:
df = spark.sql(f"""SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection`
    FROM {table}_presidio
  """)

pandas_df = df.toPandas()

In [0]:
display(pandas_df)

In [0]:
# Let Presidio infer an entity type for the columns of the DataFrame
tabular_analysis = PandasAnalysisBuilder().generate_analysis(pandas_df, language="en")
tabular_analysis

In [0]:
# Explicitly map the patient_name column to Presidio's PERSON entity
custom_analysis = StructuredAnalysis(entity_mapping={
    "patient_name": "PERSON"
})

In [0]:
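# Option 1: replace every PERSON value with a fixed placeholder
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"})
}

# Option 2: keep only the first character of each name.
# Because it is assigned last, this definition overrides option 1.
operators = {
    "PERSON": OperatorConfig("custom", {"lambda": lambda x: None if x is None else x[0]})
}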

In [0]:
anonymized_df = pandas_engine.anonymize(pandas_df, custom_analysis, operators=operators)

In [0]:
display(anonymized_df)
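
To see what the custom operator does to individual values, the sketch below applies the same logic as the lambda above in plain Python, outside of Presidio. The function names are hypothetical. It also shows one edge case: an empty string raises `IndexError` with `x[0]`, while slicing with `x[:1]` does not.

```python
def keep_initial(value):
    """Same logic as the lambda passed to the custom operator above."""
    return None if value is None else value[0]


def safe_initial(value):
    """Variant that tolerates empty strings by slicing instead of indexing."""
    return None if value is None else value[:1]


names = ["Jane Doe", "John Smith", None]
initials = [keep_initial(name) for name in names]

# An empty string breaks the indexing version but not the slicing version
try:
    keep_initial("")
    empty_string_failed = False
except IndexError:
    empty_string_failed = True

safe_result = safe_initial("")
```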