You may find this solution accelerator at https://github.com/databricks-industry-solutions/pixels. 

# De-identifying DICOM Data

Because both the pixel data and the metadata of a DICOM file can contain sensitive information, you may want to de-identify this data. But what sort of de-identification scheme should you adopt? Do you want pure anonymization, or two-way de-identification that anonymizes the data but still allows it to be restored to its original form?

Since the previous examples showed you how easy it is to get the data into Delta, let's now extend this capability using open source libraries to anonymize and/or de-identify the data.

### Introducing Presidio

Microsoft maintains an open source library called Presidio (https://microsoft.github.io/presidio/) that provides this capability not just for tabular data, but for images as well. The library is written in Python. Before we begin, we need to install our dependencies, starting with the required Python packages:

In [0]:
%pip install presidio_analyzer presidio_anonymizer presidio-structured presidio-image-redactor
%pip install faker

We also need a Python-based NLP engine. In our case we'll use spaCy, so we download its large English model:

In [0]:
%sh
python -m spacy download en_core_web_lg

In [0]:
%restart_python

In [0]:
%run ./config/setup

In [0]:
path,table,volume,write_mode = init_widgets()

catalog = f"""{table.split(".")[0]}"""
schema = f"""{table.split(".")[0]}.{table.split(".")[1]}"""

In [0]:
import pandas as pd

from dbx.pixels import Catalog
from dbx.pixels.dicom import DicomPlot, DicomMetaExtractor, DicomThumbnailExtractor
from presidio_structured import StructuredEngine, PandasAnalysisBuilder, StructuredAnalysis, JsonAnalysisBuilder, JsonDataProcessor
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import OperatorConfig, RecognizerResult

import instance_counter_anonymizer

from faker import Faker
from json import loads, dumps
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import StringType


## Revisiting our test data

Before we begin, let's take another look at our sample data from the starter notebooks provided in this repository. If we select our data from our table:

In [0]:
%sql select * from ${table}

In [0]:
%sql
SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection` -- backticks work for numeric keys
FROM ${table}

We see that the metadata has a patient name in it; however, our test data is already anonymized. For demonstration purposes, we'll clone the table, and use the `faker` Python library to generate some random names for the rows:

In [0]:
%sql
CREATE OR REPLACE TABLE ${table}_presidio DEEP CLONE ${table}

In [0]:
from faker import Faker
fake = Faker()

from pyspark.sql.types import StringType
def generate_faker_name():
  return fake.name()
spark.udf.register("generate_faker_name", generate_faker_name, StringType())

In [0]:
%sql
CREATE OR REPLACE TEMPORARY VIEW patients_to_faker_names AS
SELECT
    --rowid,
    row_number() over(order by meta:['00100010'].Value[0].Alphabetic) patient_row_id,
    meta:['00100010'].Value[0].Alphabetic patient_name,
    generate_faker_name() as fake_name
FROM ${table}_presidio
WHERE meta:['00100010'].Value[0].Alphabetic IS NOT NULL
GROUP BY patient_name

In [0]:
%sql
select * from patients_to_faker_names

In [0]:
%sql
MERGE INTO ${table}_presidio a
USING patients_to_faker_names b ON meta:['00100010'].Value[0].Alphabetic = b.patient_name
WHEN MATCHED THEN UPDATE SET meta = replace(meta, meta:['00100010'].Value[0].Alphabetic, b.fake_name)

In [0]:
%sql
SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection` -- backticks work for numeric keys
FROM ${table}_presidio
WHERE meta:['00100010'].Value[0].Alphabetic is not null

### Anonymizing Data With Presidio and Pandas

Presidio has several different Python interfaces for identifying and transforming data, but one of the most straightforward is its ability to work directly with Pandas dataframes. Remember that in PySpark you can seamlessly move between a Spark dataframe and a Pandas dataframe: `toPandas()` converts a Spark dataframe to Pandas, and `spark.createDataFrame()` converts it back.

Let's start there. The required libraries were imported above, so we first instantiate Presidio's `StructuredEngine()`:

In [0]:
pandas_engine = StructuredEngine()

Then we'll get our data again, and use `toPandas()` to create a new Pandas dataframe:

In [0]:
df = spark.sql(f"""SELECT
    --rowid,
    meta:['00100010'].Value[0].Alphabetic patient_name, 
    meta:['00082218'].Value[0]['00080104'].Value[0] `Anatomic Region Sequence Attribute decoded`,
    meta:['0008103E'].Value[0] `Series Description Attribute`,
    meta:['00081030'].Value[0] `Study Description Attribute`,
    meta:`00540220`.Value[0].`00080104`.Value[0] `projection`
    FROM {table}_presidio
    WHERE meta:['00100010'].Value[0].Alphabetic is not null
  """)

pandas_df = df.toPandas()

In [0]:
display(pandas_df)

Now it's time to start anonymizing the data. Presidio has two major components that work together:

 * ***Analyzer***: The analyzer is responsible for sampling your data, whether it is a string, JSON, Pandas dataframe, or image, and identifying any potentially sensitive data within what is provided. The output of the analyzer is a dictionary containing the field names or positions, and the type of sensitive data it *thinks* each one is.
 * ***Operator***: The operator is responsible for actually transforming your sensitive data, and can be configured to simply replace values, encrypt or decrypt, or even call custom functions that you create to transform your data.
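
To make the division of labor concrete, here is a minimal, library-free sketch of the analyze-then-operate pattern. It is not Presidio's implementation: `analyze_columns`, `apply_operators`, and the name-based heuristic are hypothetical stand-ins, and the sample rows are invented for illustration.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an analyzer: flag columns by their name only.
# A real analyzer inspects the values themselves.
NAME_HINTS = ("name",)

def analyze_columns(rows: List[dict]) -> Dict[str, str]:
    """Return a column -> entity type mapping for columns that look sensitive."""
    columns = list(rows[0].keys()) if rows else []
    return {c: "PERSON" for c in columns if any(h in c.lower() for h in NAME_HINTS)}

def apply_operators(
    rows: List[dict],
    analysis: Dict[str, str],
    operators: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Return new rows with each flagged column passed through its operator."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        for column, entity_type in analysis.items():
            operator = operators.get(entity_type)
            if operator is not None and new_row.get(column) is not None:
                new_row[column] = operator(new_row[column])
        result.append(new_row)
    return result

# Invented sample rows, not taken from the dataset.
toy_rows = [
    {"patient_name": "Jane Doe", "study": "benign_02"},
    {"patient_name": "John Roe", "study": "case1327"},
]
toy_analysis = analyze_columns(toy_rows)
toy_operators = {"PERSON": lambda value: "REDACTED"}
toy_redacted = apply_operators(toy_rows, toy_analysis, toy_operators)
print(toy_analysis)
print(toy_redacted)
```

The point of the split is that the analysis (which fields are sensitive) and the operators (what to do about each type) can be inspected and overridden independently.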


Let's start with doing some analysis on the dataframe. We can use the library and pass in our Pandas dataframe to let it produce the analysis:

In [0]:
tabular_analysis = PandasAnalysisBuilder().generate_analysis(pandas_df, language = "en")
tabular_analysis

We can see from the output that it thinks that the `patient_name` column is of type `PERSON` and that two other columns are also flagged as `PERSON`. Obviously, in this example, we only really want to anonymize the `patient_name` column, so we can "override" the analysis by manually providing the mapping.

In [0]:
custom_analysis = StructuredAnalysis(entity_mapping={
    "patient_name":"PERSON"
})

Next we need to define our operators. Operators are defined per entity type: for a given type, such as `PERSON`, the configuration states which operation to perform and any options for that operation. For instance, to replace each patient name with a static value, we use the `replace` operation and set `new_value` to `REDACTED`:

In [0]:
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"})
}

Finally, we pass the Pandas dataframe, our analysis, and the operator configuration to the engine, which returns a new dataframe with the data anonymized:

In [0]:
anonymized_df = pandas_engine.anonymize(pandas_df, custom_analysis, operators=operators)

In [0]:
display(anonymized_df)

### Working with PySpark and Presidio

Currently, PySpark isn't a supported Engine in Presidio, but it is on the roadmap. In the meantime though, we can still leverage it through UDFs on our original dataset. Since our main dataset has most of the data in the `meta` column, we can use the JSON Engine combined with Spark UDFs to work with the data as-is.

We have a couple of other challenges, though. First, the data that could contain our patient names sits under a specific key in the metadata, and that key doesn't always exist. Second, the JSON engine provided by Presidio only works with primitive types, so we need to get the exact JSON key and value we want to replace, use the operator to transform it according to our rules, then put it back into the complete document.
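
Stripped of Presidio and Spark, the extract, transform, and reinsert step might look like the following sketch. It uses only the standard library; `replace_patient_name` is a hypothetical helper, and the two sample documents are trimmed-down illustrations of the metadata layout.

```python
import json

def replace_patient_name(meta_json: str, new_value: str = "REDACTED") -> str:
    """Replace the Alphabetic patient name under tag 00100010, if it is present."""
    doc = json.loads(meta_json)
    # The tag, its "Value" list, or the "Alphabetic" key may each be missing.
    values = doc.get("00100010", {}).get("Value") or []
    if values and isinstance(values[0], dict) and "Alphabetic" in values[0]:
        values[0]["Alphabetic"] = new_value  # mutates the list that lives inside doc
    return json.dumps(doc)

with_name = '{"00100010": {"vr": "PN", "Value": [{"Alphabetic": "0687^Patient"}]}, "00100040": {"vr": "CS", "Value": ["F"]}}'
without_value = '{"00100010": {"vr": "PN"}, "00100040": {"vr": "CS", "Value": ["F"]}}'

sample_redacted = json.loads(replace_patient_name(with_name))
sample_untouched = json.loads(replace_patient_name(without_value))
print(sample_redacted["00100010"])
print(sample_untouched["00100010"])
```

Records without a patient name pass through unchanged, which matters because the function will run over every row of the table.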

Fortunately, Spark makes this easy with Pandas UDFs on Spark dataframes.

Let's look at one of the rows in our sample dataset first, in a format that's a little easier to read:

In [0]:
json_str = '{"00080005": {"vr": "CS", "Value": ["ISO_IR 100"]}, "00080008": {"vr": "CS", "Value": ["DERIVED", "SECONDARY"]}, "00080016": {"vr": "UI", "Value": ["1.2.840.10008.5.1.4.1.1.1.2"]}, "00080018": {"vr": "UI", "Value": ["1.2.276.0.7230010.3.1.4.1787169844.2836.1454583672.169790"]}, "00080020": {"vr": "DA", "Value": ["19930927"]}, "00080021": {"vr": "DA", "Value": ["19930927"]}, "00080030": {"vr": "TM"}, "00080050": {"vr": "SH"}, "00080060": {"vr": "CS", "Value": ["MG"]}, "00080068": {"vr": "CS", "Value": ["FOR PRESENTATION"]}, "00080070": {"vr": "LO"}, "00080090": {"vr": "PN"}, "00081030": {"vr": "LO", "Value": ["benign_02"]}, "0008103E": {"vr": "LO", "Value": ["case1327"]}, "00082218": {"vr": "SQ", "Value": [{"00080100": {"vr": "SH", "Value": ["T-04000"]}, "00080102": {"vr": "SH", "Value": ["SNM3"]}, "00080104": {"vr": "LO", "Value": ["BREAST"]}}]}, "00100010": {"vr": "PN", "Value": [{"Alphabetic": "0687^Patient"}]}, "00100020": {"vr": "LO", "Value": ["0687"]}, "00100030": {"vr": "DA"}, "00100040": {"vr": "CS", "Value": ["F"]}, "00101010": {"vr": "AS", "Value": ["058Y"]}, "00181164": {"vr": "DS", "Value": [0.58391, 0.58391]}, "00181508": {"vr": "CS", "Value": ["NONE"]}, "00187004": {"vr": "CS", "Value": ["FILM"]}, "0020000D": {"vr": "UI", "Value": ["1.2.276.0.7230010.3.1.4.1787169844.2836.1454583672.169788"]}, "0020000E": {"vr": "UI", "Value": ["1.2.276.0.7230010.3.1.4.1787169844.2836.1454583672.169789"]}, "00200010": {"vr": "SH", "Value": ["benign_02"]}, "00200011": {"vr": "IS", "Value": [1327]}, "00200013": {"vr": "IS"}, "00200020": {"vr": "CS", "Value": ["P", "L"]}, "00200062": {"vr": "CS", "Value": ["R"]}, "00280002": {"vr": "US", "Value": [1]}, "00280004": {"vr": "CS", "Value": ["MONOCHROME2"]}, "00280010": {"vr": "US", "Value": [4726]}, "00280011": {"vr": "US", "Value": [2011]}, "00280100": {"vr": "US", "Value": [16]}, "00280101": {"vr": "US", "Value": [12]}, "00280102": {"vr": "US", "Value": [11]}, "00280103": {"vr": "US", "Value": [0]}, "00280301": {"vr": "CS", "Value": ["NO"]}, "00281040": {"vr": "CS", "Value": ["LOG"]}, "00281041": {"vr": "SS", "Value": [-1]}, "00281050": {"vr": "DS", "Value": [127.0]}, "00281051": {"vr": "DS", "Value": [254.0]}, "00281052": {"vr": "DS", "Value": [0.0]}, "00281053": {"vr": "DS", "Value": [1.0]}, "00281054": {"vr": "LO", "Value": ["US"]}, "00282110": {"vr": "CS", "Value": ["00"]}, "00400318": {"vr": "CS", "Value": ["BREAST"]}, "00400555": {"vr": "SQ", "Value": []}, "00540220": {"vr": "SQ", "Value": [{"00080100": {"vr": "SH", "Value": ["R-10242"]}, "00080102": {"vr": "SH", "Value": ["SNM3"]}, "00080104": {"vr": "LO", "Value": ["cranio-caudal"]}, "00540222": {"vr": "SQ", "Value": []}}]}, "20500020": {"vr": "CS", "Value": ["IDENTITY"]}, "has_pixel": true, "hash": "c8515a67ea226bb47615abe7f03d3cf7c77cd1cd", "img_min": 0, "img_max": 253, "img_avg": 64.04732908907904, "img_shape_x": 4726, "img_shape_y": 2011, "file_size": 4355734}'

json_obj = loads(json_str)
json_obj

The patient name in this data would normally live in the key '00100010', but this key contains another document, with its own keys and values, which may or may not be present in each record we have. To properly anonymize this data then, we'll create a new UDF that takes in the data, checks for the keys, then uses the `StructuredEngine` anonymizer to replace the values:

In [0]:
anonymized_column = "meta" # name of column to anonymize
anonymizer = StructuredEngine(data_processor=JsonDataProcessor())

# broadcast the engines to the cluster nodes
broadcasted_anonymizer = sc.broadcast(anonymizer)

# plain Python function applied to each row; it is wrapped in a pandas UDF below
def anonymize_patient_name(text: str) -> str:

    json_str_orig = loads(text)

    # the patient name tag and its 'Value' entry are optional
    if '00100010' in json_str_orig.keys():
        sub_json = json_str_orig["00100010"]
        if 'Value' in sub_json.keys():
            patient_name = json_str_orig["00100010"]["Value"][0]

            custom_analysis = StructuredAnalysis(entity_mapping={
                "Alphabetic":"PERSON"
            })

            # named `operators` so the call below uses this local configuration
            # rather than the global `operators` defined in the Pandas example
            operators = {
                "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"})
            }

            anonymizer = broadcasted_anonymizer.value
            anonymized_complex_json = anonymizer.anonymize(patient_name, custom_analysis, operators=operators)

            # put the anonymized sub-document back into the full metadata
            json_str_orig["00100010"]["Value"][0] = anonymized_complex_json

    json_string_result = dumps(json_str_orig)

    return json_string_result

Next, we'll create our Pandas UDF:

In [0]:
def anonymize_json(s: pd.Series) -> pd.Series:
    return s.apply(anonymize_patient_name)

# wrap the series function as a pandas UDF
anonymize = pandas_udf(anonymize_json, returnType=StringType())



Then we'll read our data in from our table, and append a new column with the same data, only anonymized (for comparison purposes):

In [0]:
# select data
input_df = spark.read.table(table)

# apply the udf
anonymized_df = input_df.withColumn(
    "anonymized_meta", anonymize(col("meta"))
)

In [0]:
display(anonymized_df)

### Other examples: Encryption/Decryption and Pseudo-Anonymization

Presidio has other built-in capabilities for dealing with sensitive data, including encryption and decryption. These work much like the anonymization features: create an operator and provide slightly different options. Let's revisit our UDF from the previous example and make a few simple changes. Mostly, this involves defining the analysis and operator configurations outside of the UDF and broadcasting them, since ideally the UDF should only receive the Pandas series objects (a column in a dataframe), and the key can differ depending on which column or dataset you want to operate on.

In [0]:
# demo key only (16 characters = 128 bits); do not hard-code real keys in a notebook
crypto_key = "WmZq4t7w!z%C&F)J"

anonymized_column = "meta" # name of column to anonymize
anonymizer = StructuredEngine(data_processor=JsonDataProcessor())

custom_analysis = StructuredAnalysis(entity_mapping={
    "Alphabetic":"PERSON"
})

operator = {
    "PERSON": OperatorConfig("encrypt", {"key": crypto_key})
}

# broadcast the engines to the cluster nodes
broadcasted_anonymizer = sc.broadcast(anonymizer)
broadcasted_analysis = sc.broadcast(custom_analysis)
broadcasted_operator = sc.broadcast(operator)

# plain Python function applied to each row; it is wrapped in a pandas UDF below
def encrypt_patient_name(text: str) -> str:

    json_str_orig = loads(text)

    if '00100010' in json_str_orig.keys():
        sub_json = json_str_orig["00100010"]
        if 'Value' in sub_json.keys():
            patient_name = json_str_orig["00100010"]["Value"][0]

            anonymizer = broadcasted_anonymizer.value
            analysis = broadcasted_analysis.value
            operators = broadcasted_operator.value

            anonymized_complex_json = anonymizer.anonymize(patient_name, analysis, operators=operators)

            json_str_orig["00100010"]["Value"][0] = anonymized_complex_json

    json_string_result = dumps(json_str_orig)

    return json_string_result

In [0]:
def encrypt_json(s: pd.Series) -> pd.Series:
    return s.apply(encrypt_patient_name)

encrypt = pandas_udf(encrypt_json, returnType=StringType())

In [0]:
# select data
input_df = spark.read.table(table)

# apply the udf
encrypted = input_df.withColumn(
    "encrypted_meta", encrypt(col("meta"))
)

In [0]:
display(encrypted)

Pseudo-anonymization is a little harder, but Presidio is extensible and lets you build your own operators. For instance, we can use the example `InstanceCounterAnonymizer` class, which assigns an incrementing ID to each distinct patient:

In [0]:
import instance_counter_anonymizer
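
The `instance_counter_anonymizer` module itself is not shown in this notebook. As a rough, library-free illustration of what a counter-based operator does, the sketch below hands out a stable placeholder per distinct value. The `<PERSON_0>` format and the nested mapping layout are assumptions made for illustration, not a description of the module's actual code.

```python
from typing import Dict

def counter_pseudonym(
    value: str, entity_type: str, entity_mapping: Dict[str, Dict[str, str]]
) -> str:
    """Return a stable placeholder such as <PERSON_0> for each distinct value."""
    seen = entity_mapping.setdefault(entity_type, {})
    if value not in seen:
        # the next index is simply the number of values seen so far
        seen[value] = f"<{entity_type}_{len(seen)}>"
    return seen[value]

toy_mapping: Dict[str, Dict[str, str]] = {}
toy_names = ["0687^Patient", "0012^Patient", "0687^Patient"]
toy_pseudonyms = [counter_pseudonym(n, "PERSON", toy_mapping) for n in toy_names]
print(toy_pseudonyms)

# Whoever holds the mapping can invert it to recover the original values.
reverse_lookup = {alias: original for original, alias in toy_mapping["PERSON"].items()}
print(reverse_lookup)
```

One caveat when moving this idea to a cluster: a dictionary sent to the workers with `sc.broadcast` is copied to each executor, and updates made there are not sent back or shared, so counters assigned in different partitions may not line up with each other.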

In [0]:
anonymized_column = "meta" # name of column to anonymize
anonymizer = AnonymizerEngine()

anonymizer.add_anonymizer(instance_counter_anonymizer.InstanceCounterAnonymizer)      

entity_mapping = dict()

operator = {
    "PERSON": OperatorConfig("entity_counter", {"entity_mapping": entity_mapping})
}

# broadcast the engines to the cluster nodes
broadcasted_anonymizer = sc.broadcast(anonymizer)
broadcasted_operator = sc.broadcast(operator)

# plain Python function applied to each row; it is wrapped in a pandas UDF below
def psuedo_patient_name(text: str) -> str:

    json_str_orig = loads(text)

    if '00100010' in json_str_orig.keys():
        sub_json = json_str_orig["00100010"]
        if 'Value' in sub_json.keys():
            patient_name = json_str_orig["00100010"]["Value"][0]["Alphabetic"]

            custom_analysis = [
                RecognizerResult(entity_type = "PERSON", start=0, end=len(patient_name), score=1)
            ]

            anonymizer = broadcasted_anonymizer.value
            operators = broadcasted_operator.value

            anonymized_name = anonymizer.anonymize(patient_name, custom_analysis, operators=operators)

            json_str_orig["00100010"]["Value"][0]["Alphabetic"] = anonymized_name.text

    json_string_result = dumps(json_str_orig)

    return json_string_result

In [0]:
def psuedo_json(s: pd.Series) -> pd.Series:
    return s.apply(psuedo_patient_name)

psuedo = pandas_udf(psuedo_json, returnType=StringType())

In [0]:
# select data
input_df = spark.read.table(table)

# apply the udf
psuedo_df = input_df.withColumn(
    "psuedo_meta", psuedo(col("meta"))
)

In [0]:
display(psuedo_df)

### De-identifying DICOM Images

Another feature of Presidio is the ability to identify sensitive data inside images, using the same concepts as with text: first you use an analyzer to detect any sensitive information in the image, then another engine to handle the removal of the sensitive data. Presidio has different methods for analyzing and handling the image, such as built-in support for Tesseract OCR or external services (and is fully extensible like the other engines).

Let's start with some sample DICOM images from the Presidio main repository. We'll download those and load them into Pixels, then use Presidio to redact the sensitive info:
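
Conceptually, the removal stage amounts to overwriting the pixel regions that the analyzer flagged. The toy sketch below shows only that second stage: nested lists stand in for the pixel array, and `toy_boxes` is a hard-coded, hypothetical replacement for what OCR and the analyzer would detect. It does not reflect how Presidio implements redaction internally.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels

def redact_boxes(pixels: List[List[int]], boxes: List[Box], fill_value: int) -> List[List[int]]:
    """Return a copy of the image with every box overwritten by fill_value."""
    redacted = [row[:] for row in pixels]  # copy rows so the input is untouched
    height = len(redacted)
    width = len(redacted[0]) if redacted else 0
    for left, top, box_w, box_h in boxes:
        # clamp each box to the image bounds before filling
        for y in range(max(top, 0), min(top + box_h, height)):
            for x in range(max(left, 0), min(left + box_w, width)):
                redacted[y][x] = fill_value
    return redacted

toy_image = [[9] * 6 for _ in range(4)]  # 6x4 image with a uniform value of 9
toy_boxes = [(1, 0, 3, 2)]               # hypothetical "detected text" region
toy_redacted = redact_boxes(toy_image, toy_boxes, fill_value=0)
for row in toy_redacted:
    print(row)
```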

In [0]:
presidio_volume = volume + "_presidio_images"

# `create volume if not exists` is a no-op when the volume is already there,
# so no separate existence check is needed
spark.sql(f"create volume if not exists {presidio_volume}")

In [0]:
import requests

sample_images = [
  "https://github.com/microsoft/presidio/raw/refs/heads/main/presidio-image-redactor/tests/test_data/0_ORIGINAL.dcm",
  "https://github.com/microsoft/presidio/raw/refs/heads/main/presidio-image-redactor/tests/test_data/dicom_dir_1/dicom_dir_2/1_ORIGINAL.DCM"
]

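for s in sample_images:
  r = requests.get(s, allow_redirects=True)
  r.raise_for_status()  # fail early instead of saving an error page as a DICOM file
  filename = s.rsplit('/', 1)[1]
  # write through a context manager so the file handle is closed promptly
  with open("/Volumes/" + presidio_volume.replace(".","/") + "/" + filename, 'wb') as f:
    f.write(r.content)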

In [0]:
catalog = Catalog(spark, table=table+"_presidio_images", volume=presidio_volume)
catalog_df = catalog.catalog(path="/Volumes/" + presidio_volume.replace(".","/") + "/", extractZip=False)

In [0]:
meta_df = DicomMetaExtractor(catalog).transform(catalog_df)
thumbnail_df = DicomThumbnailExtractor().transform(meta_df)

In [0]:
catalog.save(thumbnail_df, mode=write_mode)

In [0]:
display(meta_df)

In [0]:
display(thumbnail_df)

In [0]:
dcm_df_filtered = Catalog(spark, table=table+"_presidio_images", volume=presidio_volume).load().filter('lower(extension) = "dcm"').limit(1000)
DicomPlot(dcm_df_filtered).display()

In [0]:
path = "/Volumes/" + presidio_volume.replace(".","/") + "/1_ORIGINAL.DCM"
redacted_path= "/".join(path.split("/")[:-1]) + "/REDACTED/" + path.split("/")[-1]

In [0]:
#import pandas as pd
#from pyspark.sql.functions import pandas_udf, col
#from pyspark.sql.types import StringType
import pydicom
from presidio_image_redactor import DicomImageRedactorEngine

engine = DicomImageRedactorEngine()

broadcasted_engine= sc.broadcast(engine)

# plain Python function applied to each path; it is wrapped in a pandas UDF below
def redact_dicom_image(path: str) -> str:

    # redacted copies are written to a REDACTED/ folder next to the source file
    redacted_path = "/".join(path.split("/")[:-1]) + "/REDACTED/"
    engine = broadcasted_engine.value
    engine.redact_from_file(path, redacted_path, padding_width=25, fill="contrast")
    return redacted_path + path.split("/")[-1]

def redact_dicom(s: pd.Series) -> pd.Series:
    return s.apply(redact_dicom_image)

redact = pandas_udf(redact_dicom, returnType=StringType())


In [0]:

# apply the udf
redacted_df = thumbnail_df.withColumn(
    "redacted_image_path", redact(col("local_path"))
)

In [0]:
display(redacted_df)

In [0]:
catalog = Catalog(spark, table=table+"_presidio_images_redacted", volume=presidio_volume)
catalog_df = catalog.catalog(path="/".join(path.split("/")[:-1]) + "/REDACTED/", extractZip=False)

In [0]:
meta_df = DicomMetaExtractor(catalog).transform(catalog_df)
thumbnail_df = DicomThumbnailExtractor().transform(meta_df)
display(thumbnail_df)

In [0]:
DicomPlot(thumbnail_df).display()