# Detect DICOM Image PHI and Redact PHI
- Recommended cluster: Databricks Runtime 15.4 LTS ML or later (avoid Serverless)

## Setup `pixels` package (if you haven't done so)
1. [Create a git folder](https://docs.databricks.com/aws/en/repos/git-operations-with-repos) cloning the [pixels](https://github.com/databricks-industry-solutions/pixels) package
2. Then run the [`config/setup.py`]($./config/setup) script in the repo folder

In [0]:
%run ./config/setup_ai

In [0]:
%pip install -U pydicom gdcm Pillow

In [0]:
import pandas as pd
from dbx.pixels.logging import LoggerProvider

logger = LoggerProvider()

In [0]:
# dbutils.widgets.text("table", "main.pixels_solacc.object_catalog", label="table of DICOM paths")
dbutils.widgets.text("table", "hls_radiology.tcia.object_catalog_htj2k_coalesce", label="table of DICOM paths")
dbutils.widgets.text("output_dir", "/Volumes/main/pixels_solacc/pixels_volume/redacted", label="volumes folder for storing redacted images")

In [0]:
table = dbutils.widgets.get("table")
output_dir = dbutils.widgets.get("output_dir")
# NOTE: this hard-coded path overrides the widget value read above; remove it to use the widget.
output_dir = "/Volumes/yen_customers/pixels/redacted"

### Load input dataframe
This assumes a table of DICOM paths already exists with the UC path set in `table`.<br>
To ingest DICOM files into a Delta table, see [01-dcm-demo]($./01-dcm-demo) for examples.

`VLMPhiExtractor` requires that the input be ONE of the following:
1. a .dcm file path (e.g. `/Volumes/<catalog>/<schema>/2.1.656.0.2.8048482.9.537.165816238/1-1.dcm`)
2. an image file path (e.g. `/Volumes/<catalog>/<schema>/2.1.656.0.2.8048482.9.537.165816238/1-1.jpg`)
3. an image encoded as a base64 string, as required by the VLM (e.g. `/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAgGBgcGBQgHBwc...`)
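
The three accepted input forms can be told apart with a simple heuristic. The sketch below is illustrative only: `classify_input` and `IMAGE_SUFFIXES` are hypothetical names, not part of the `pixels` package, and the extractor's own dispatch logic may differ.

```python
import base64
import binascii
from pathlib import PurePosixPath

# Hypothetical helper for illustration; not part of the pixels package.
# The set of image suffixes is an assumption.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def classify_input(value: str) -> str:
    """Guess which of the three accepted input forms a string is."""
    if not value:
        return "unknown"
    suffix = PurePosixPath(value).suffix.lower()
    if suffix == ".dcm":
        return "dicom_path"
    if suffix in IMAGE_SUFFIXES:
        return "image_path"
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
        return "base64_image"
    except (binascii.Error, ValueError):
        return "unknown"


encoded = base64.b64encode(b"\xff\xd8\xff\xe0 fake jpeg bytes").decode("ascii")
for candidate in ["/Volumes/cat/schema/1-1.dcm", "/Volumes/cat/schema/1-1.jpg", encoded]:
    print(classify_input(candidate))
```

Because `/` is a valid base64 character, an extensionless path could be mistaken for a base64 string, so a production check would need something stricter than this heuristic.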

In [0]:
df = spark.table(table)
display(df)

In [0]:
in_df = df.select("path").limit(10)
in_df = in_df.repartition(8)
display(in_df)

In [0]:
import pydicom
import matplotlib.pyplot as plt
from dbx.pixels.dicom.redactor.utils import get_frame, redact_frame, handle_frame_redaction
from dbruntime.patches import cv2_imshow
import os
import json
from dataclasses import asdict

from dbx.pixels.dicom.dicom_easyocr_redactor import ocr_dcm, multiframe_redactor
from dbx.pixels.dicom.dicom_utils import remove_dbfs_prefix
from dbx.pixels.dicom.redactor.dataclasses import *
from dbx.pixels.dicom.redactor.utils import redact_dcm

In [0]:
path = "dbfs:/Volumes/main/pixels_solacc/pixels_volume/unzipped/benigns_21/benigns/patient5397/5397.LEFT_CC.dcm"
#path = "/Workspace/Users/yen.low@databricks.com/pixel/data/055829-00000000.dcm"
#path = "/Workspace/Users/yen.low@databricks.com/pixel/midib/0_ORIGINAL.dcm"
ds = pydicom.dcmread(remove_dbfs_prefix(path))

In [0]:
%skip
# Test single
output_path = multiframe_redactor(path, outdir=output_dir, max_frames=3, gpu=False, text_threshold=0)

In [0]:
%skip
ds_redacted = pydicom.dcmread(output_path)
plt.imshow(ds_redacted.pixel_array)
display(plt)

In [0]:
plt.imshow(ds.pixel_array)
display(plt)

## Use Spark ML Pipeline to build a PHI detection and redaction workflow
### Pipeline 1: Without a filter (VLMPhiDetector + OcrRedactor)

Build an end-to-end `pyspark.ml.Pipeline` with `VLMPhiDetector` and `OcrRedactor` as `Transformer` stages. Spark applies the stages to the input DataFrame in parallel across its partitions.

1. `VLMPhiDetector` detects and extracts PHI entities using a VLM (set by `endpoint` parameter)
2. `OcrRedactor` masks PHI from images using EasyOCR to detect text boundaries

```
# As a pipeline
Pipeline(stages=[VLMPhiDetector("databricks-claude-3-7-sonnet"), 
                OcrRedactor()])
```
This is wrapped in `DicomPhiPipeline`, which takes the inputs to `VLMPhiDetector` and `OcrRedactor`. Here, `redact_even_if_undetected=True`, so `OcrRedactor` processes all the .dcm files whether or not the preceding `VLMPhiDetector` detected PHI in them.

In [0]:
from dbx.pixels.dicom.dicom_phi import DicomPhiPipeline

pipeline = DicomPhiPipeline(endpoint="databricks-claude-3-7-sonnet",
                            output_dir=output_dir,
                            redact_even_if_undetected=True,
                            inputCol="path",
                            detectCol="response",
                            outputCol="path_redacted",
                            gpu=False,
                            text_threshold=0)
model_redact_undetected = pipeline.fit(in_df)
out_df = model_redact_undetected.transform(in_df)
display(out_df)

## Evaluate against ground truth

In [0]:
from pyspark.sql.functions import split, col, when, size, isnotnull
import pandas as pd

extracted_df = (out_df
    .withColumn("phi_detected", when(size(col("response.content"))>1, True).otherwise(False))
    .withColumn("redacted", isnotnull(col("path_redacted")))
    .select("has_phi", "phi_detected", "redacted")
)
display(extracted_df)

In [0]:
from dbx.pixels.dicom.dicom_utils import get_classifer_metrics

# Performance of VLMPhiDetector
get_classifer_metrics(extracted_df, col_pred="phi_detected")

In [0]:
# Performance of OcrRedactor
get_classifer_metrics(extracted_df, col_pred="redacted")

## Interpreting the performance metrics
The metrics are defined as follows:
- **Recall** measures the proportion of positive labels (i.e. has PHI) correctly predicted
- **Specificity** measures the proportion of negative labels (i.e. non-PHI) correctly predicted
- **Precision** measures, of those predicted to be positive, how many actually have PHI. Its difference from 1 is the false discovery rate
- **Negative Predictive Value (NPV)** measures, of those predicted to be negative, how many actually are non-PHI. Its difference from 1 is the false omission rate
- **F1** is the harmonic mean of Precision and Recall: `F1 = 2*(precision * recall)/(precision + recall)`
- **Accuracy** is the average of Recall and Specificity, weighted by the proportion of positive and negative labels respectively
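
The definitions above can be written out directly from confusion-matrix counts. This is a minimal sketch, not the implementation of `get_classifer_metrics`; the names `ConfusionCounts` and `classifier_metrics` are hypothetical, and the counts are illustrative rather than taken from this notebook's run.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts from comparing predictions with ground-truth PHI labels."""
    tp: int  # has PHI, predicted PHI
    fp: int  # no PHI, predicted PHI (overredaction)
    tn: int  # no PHI, predicted no PHI
    fn: int  # has PHI, predicted no PHI (missed PHI)


def _ratio(num: float, den: float) -> float:
    # Return NaN instead of raising when a denominator is zero.
    return num / den if den else float("nan")


def classifier_metrics(c: ConfusionCounts) -> dict:
    recall = _ratio(c.tp, c.tp + c.fn)
    specificity = _ratio(c.tn, c.tn + c.fp)
    precision = _ratio(c.tp, c.tp + c.fp)
    npv = _ratio(c.tn, c.tn + c.fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "npv": npv,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only: no PHI is missed, but several non-PHI images are flagged.
counts = ConfusionCounts(tp=2, fp=5, tn=7, fn=0)
metrics = classifier_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With no false negatives, recall and NPV are both 1 even though precision and specificity are low, which is the pattern an overredacting stage produces.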

These metrics are all related and trade off against one another, so it is important to decide which error is more costly. Either it is more critical not to miss any PHI (i.e. high recall) at the risk of overredacting non-PHI, or it is acceptable to risk missing PHI (i.e. underredacting) in order to avoid falsely predicting non-PHI as PHI. The latter matters more when falsely predicting a negative as positive carries an unacceptable risk (e.g. being diagnosed with cancer when one does not have it).

In this example, the former is preferred as the risk of missing PHI is too high. **So we want close to zero false omission rate (i.e. high NPV) and high recall which is the case for both the detector and redactor. Both have 100% NPV and 100% recall**

### Pipeline 2: With a filter (VLMPhiDetector + FilterTransformer + OcrRedactor)
As `OcrRedactor` tends to overredact (i.e. has many false positives), for best performance we recommend adding a `FilterTransformer` that filters rows according to the output of `VLMPhiDetector` and passes only the remaining rows to `OcrRedactor`.

1. `VLMPhiDetector` detects and extracts PHI entities using a VLM
2. `FilterTransformer` keeps only the rows in which `VLMPhiDetector` detected PHI
3. `OcrRedactor` works on the filtered PHI rows and masks PHI from images using EasyOCR
```
# As a pipeline
Pipeline(stages=[VLMPhiDetector("databricks-claude-3-7-sonnet"), 
                FilterTransformer(),
                OcrRedactor()])
```

Adding the filter does not seem to cause the redactor to miss PHI (i.e. NPV remains perfect at 1).
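
The effect of the filter on which rows reach the redactor can be sketched on toy records. This is plain Python under stated assumptions, not the `FilterTransformer` implementation: `select_for_redaction` and the records are hypothetical, and the flag is modeled only as this notebook describes it.

```python
# Toy records standing in for the detector output; all values are made up.
detector_output = [
    {"path": "a.dcm", "phi_detected": True},
    {"path": "b.dcm", "phi_detected": False},
    {"path": "c.dcm", "phi_detected": True},
]


def select_for_redaction(rows, redact_even_if_undetected):
    """Hypothetical stand-in for the filter step between detector and redactor."""
    if redact_even_if_undetected:
        # Pipeline 1: every row is passed on to the redactor.
        return list(rows)
    # Pipeline 2: only rows flagged by the detector are passed on.
    return [row for row in rows if row["phi_detected"]]


unfiltered = select_for_redaction(detector_output, redact_even_if_undetected=True)
filtered = select_for_redaction(detector_output, redact_even_if_undetected=False)
print([row["path"] for row in unfiltered])
print([row["path"] for row in filtered])
```

In this sketch the redactor never sees `b.dcm` when the filter is on, so it cannot overredact it; the trade-off is that the redactor's recall is then bounded by the detector's recall.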

In [0]:
pipeline_redact_only_detected = DicomPhiPipeline(
    endpoint="databricks-claude-3-7-sonnet",
    output_dir=output_dir,
    redact_even_if_undetected=False,
    gpu=False,
    text_threshold=0,
)
model_redact_only_detected = pipeline_redact_only_detected.fit(in_df)
out_df_redact_only_detected = model_redact_only_detected.transform(in_df)
display(out_df_redact_only_detected)

In [0]:
pipeline_redact_only_detected = DicomPhiPipeline(endpoint="databricks-claude-3-7-sonnet",
                                                 output_dir=output_dir,
                                                 redact_even_if_undetected=False)
model_redact_only_detected = pipeline_redact_only_detected.fit(in_df)
out_df_redact_only_detected = model_redact_only_detected.transform(in_df)
display(out_df_redact_only_detected)

In [0]:
extracted_df_redact_only_detected = (out_df_redact_only_detected
    .withColumn("phi_detected", when(size(col("response.content"))>1, True).otherwise(False))
    .withColumn("redacted", isnotnull(col("path_redacted")))
    .select("has_phi", "phi_detected", "redacted")
)
display(extracted_df_redact_only_detected)

In [0]:
# Performance of VLMPhiDetector
get_classifer_metrics(extracted_df_redact_only_detected, col_pred="phi_detected")

In [0]:
# Performance of OcrRedactor
get_classifer_metrics(extracted_df_redact_only_detected, col_pred="redacted")

## Conclusion
As stated earlier, we emphasize high Recall (100%) and NPV (100%) which were already achieved in the first pipeline without a filter (`redact_even_if_undetected=True`). However, its Specificity (58.3%) and Precision (28.6%) were low due to overredaction of non-PHI. 

To minimize overredaction, we introduced a filter (`redact_even_if_undetected=False`) in the second pipeline so that only images with detected PHI undergo redaction. This improved both the detector and redactor to 100% in all metrics.

View the files in the [output_dir](https://e2-demo-field-eng.cloud.databricks.com/explore/data/volumes/hls_radiology/tcia/redacted)