## 🧼 DICOM PHI Cleanup Script Documentation
Niall Bourke     
24-4-25     

This notebook processes and de-identifies DICOM files within a Flywheel project. Specifically, it removes Protected Health Information (PHI) fields such as `PatientBirthDate`, then re-uploads the cleaned file to the same acquisition, replacing the original.

---

### 🔧 Prerequisites

- Access to a **Flywheel-hosted Jupyter Lab** environment or a local Python session with `flywheel` and `pydicom` installed (`pathlib` is part of the Python standard library).
- You must already be authenticated to Flywheel and have access to the desired project.
- The helper function `clean_and_replace_dicom()` must be defined before the loop cell is run. This function:
  - Removes PHI fields from a DICOM file (currently only `PatientBirthDate`)
  - Replaces the file in Flywheel with a cleaned version
  - Leaves the local copy in `~/Data/`; it does not delete it

---

### 🔁 Overview of Workflow

This script performs the following steps:

1. **Set up working directories**:
   - A `~/Data/` directory is created to store DICOM files temporarily during processing.

2. **Loop through sessions, acquisitions, and files**:
   - For each DICOM file:
     - Downloads it to a subject/session-specific folder under `~/Data/`
     - Calls `clean_and_replace_dicom()` to:
       - Strip PHI from the file
       - Delete the original file from the acquisition
       - Re-upload the cleaned file

3. **Track and print runtime** for performance monitoring.
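
The per-session download folder described in step 2 can be sketched as a small helper. This is a minimal sketch and not part of the notebook: the function name `session_dir` and its parameters are hypothetical, and it assumes that subject and session labels are valid directory names.

```python
from pathlib import Path


def session_dir(work_dir, subject_label, session_label):
    """Return work_dir/<subject>/<session>, creating it if needed.

    Hypothetical helper: assumes both labels are safe to use as folder names.
    """
    target = Path(work_dir) / subject_label / session_label
    # parents=True builds intermediate folders; exist_ok=True makes reruns harmless
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Using `exist_ok=True` removes the need for a separate `exists()` check before each `mkdir()` call.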

---

### ⚠️ Notes & Best Practices

- **PHI Removal**: This example removes only `PatientBirthDate`. For full compliance, you may wish to expand it to include other PHI DICOM tags (e.g., `PatientName`, `InstitutionName`).
- **Safety**: The original file is deleted from the acquisition before the cleaned copy is uploaded, so the replacement is **permanent**. Back up the data first if needed.
- **Avoiding Private Methods**: This script uses `acq.delete_file()` rather than `fw._fw.delete_acquisition_file()`. The latter is discouraged because it relies on Flywheel's internal (private) client interface.
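
To expand PHI removal beyond a single tag, the check-and-delete step can be driven by a list of keywords. The sketch below is illustrative only: `PHI_KEYWORDS` and `strip_phi` are hypothetical names, and the three tags listed are those mentioned in the note above, not a complete de-identification profile. It assumes `ds` behaves like a `pydicom` dataset, which supports `keyword in ds` and attribute deletion by keyword.

```python
# Hypothetical tag list for illustration; NOT a complete de-identification profile.
PHI_KEYWORDS = ("PatientName", "PatientBirthDate", "InstitutionName")


def strip_phi(ds, keywords=PHI_KEYWORDS):
    """Delete each listed keyword from `ds` if present; return the names removed."""
    removed = []
    for keyword in keywords:
        if keyword in ds:
            # Equivalent to `del ds.<keyword>` for a keyword chosen at runtime
            delattr(ds, keyword)
            removed.append(keyword)
    return removed
```

Returning the removed keywords lets the caller decide whether a re-upload is needed, without logging any tag values.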


### Install required packages

In [None]:
pip install pydicom

### Function for cleaning PHI
- Reads the DICOM header
- If the `PatientBirthDate` field exists, deletes it and saves the file
- Deletes the existing file on Flywheel
- Uploads the new version without the PHI field

In [None]:
import logging

import pydicom

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_and_replace_dicom(acq, file, download_path):
    """
    Remove PHI tags from a DICOM file and re-upload it to the Flywheel acquisition.
    
    Args:
        acq: Flywheel acquisition object
        file: Flywheel file object associated with the acquisition
        download_path: Local path to the downloaded file
    """
    try:
        ds = pydicom.dcmread(download_path)

        if 'PatientBirthDate' in ds:
            # Log only the file name: the tag's value is itself PHI
            logger.info(f"Removing PatientBirthDate from {file.name}")
            del ds.PatientBirthDate

            # Save the cleaned file
            ds.save_as(download_path)

            # Remove original file
            acq.delete_file(file.name)

            # Upload the cleaned file
            acq.upload_file(download_path)
            logger.info(f"Re-uploaded cleaned file: {file.name}")
        else:
            logger.info("PatientBirthDate tag not found, skipping.")

    except Exception as e:
        logger.error(f"Failed to clean {file.name}: {e}")
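
Because the function above overwrites the downloaded file in place and deletes the remote original before uploading, a failed upload can leave no untouched copy. One possible safeguard, sketched here under the assumption that local disk space allows it, is to copy the download aside before cleaning. The helper name `backup_copy` and the backup folder are hypothetical.

```python
import shutil
from pathlib import Path


def backup_copy(download_path, backup_dir):
    """Copy the file at download_path into backup_dir and return the new path."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves timestamps where the platform allows it
    return Path(shutil.copy2(download_path, backup_dir))
```

Calling `backup_copy(download_path, backup_dir)` before `clean_and_replace_dicom(...)` would keep the unmodified file locally. Such a backup still contains PHI and should be stored and removed accordingly.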
        
        

### Loop to find files in project to clean

In [None]:
import flywheel
import os
import time
from pathlib import Path
import pydicom

# get the start time
st = time.time()

# `fw_project` is assumed to be a Flywheel project object defined in an earlier cell
project = fw_project

# Create a work directory in our local "home" directory (no error if it already exists)
work_dir = Path.home() / 'Data'
work_dir.mkdir(parents=True, exist_ok=True)

# Loop over subjects → sessions → acquisitions
for subject in project.subjects.iter():
    # if subject.label == '':
        for session in subject.sessions.iter():
            for acq in session.acquisitions.iter():
                # Work on each DICOM file
                for file in acq.files:
                    file = file.reload()
                    if file.type == 'dicom':
                        # 1) Download into a subject/session-specific folder
                        print(file.name)
                        data_dir = work_dir / subject.label / session.label
                        # Create it if it doesn't exist
                        data_dir.mkdir(parents=True, exist_ok=True)

                        download_path = os.path.join(data_dir, file.name)
                        print(f"Downloading {file.name} to {download_path}")
                        file.download(download_path)

                        clean_and_replace_dicom(acq, file, download_path)


# get the end time
et = time.time()
# get the execution time
elapsed_time = et - st
print('Execution time:', elapsed_time, 'seconds')
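
For measuring elapsed time, `time.perf_counter()` is a monotonic alternative to `time.time()`. A minimal sketch follows; the context manager name `timed` is hypothetical and not part of the notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Execution time"):
    """Print how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} seconds")
```

Wrapping the subject/session/acquisition loop in `with timed():` would report the runtime even when the loop stops early on an error.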