<a href="https://colab.research.google.com/github/georgezero/rsna25-deid-using-chatgpt-llms/blob/main/04_adversarial_phi_attack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adversarial PHI Hunter: Red Team Your Deidentification

**The uncomfortable truth:** Even after removing obvious PHI (names, dates, IDs), a motivated attacker with an LLM might still be able to infer sensitive information about a patient.

In this notebook, we'll flip the script. Instead of *removing* PHI, we'll try to *recover* information from supposedly deidentified reports. This exercise demonstrates:

1. **Quasi-identifiers** - Combinations of non-PHI data that can identify individuals
2. **Inference attacks** - What an LLM can deduce from clinical context
3. **The limits of simple deidentification** - Why redaction alone isn't always enough

**Important:** This is an educational exercise to understand privacy risks. The goal is to build better deidentification practices.

## Quick start
- Run the `pip install` cell.
- Get a free API key from [Google AI Studio](https://aistudio.google.com/app/apikey).
- Set your `GOOGLE_API_KEY` in the code below.

In [None]:
# Install dependencies
%pip install -q -U google-generativeai

You will need to create an API key. For a quick walkthrough, see this [tutorial](https://drive.google.com/file/d/1fGoSsapol9JY9oqu-UulLJfMm-v2egWC/view?usp=drive_link).

In [None]:
import os
import google.generativeai as genai

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY") or "PASTE_YOUR_KEY_HERE"

if GOOGLE_API_KEY == "PASTE_YOUR_KEY_HERE":
    print("⚠️ Please paste your Google API key in the variable above.")
else:
    genai.configure(api_key=GOOGLE_API_KEY)
    print("✅ Gemini API configured")

✅ Gemini API configured


## Scenario: A "Properly" Deidentified Report

Below is a radiology report that has been deidentified following standard practices:
- Patient name → removed
- MRN → removed
- Dates → removed
- Physician names → removed

Looks safe, right? Let's see what an LLM can figure out...
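For context, the kind of "standard" scrubbing listed above is often pattern-based. The cell below is a minimal sketch of that idea, not the method used to produce the report in this notebook: the regular expressions, the `scrub` helper, and the sample sentence (with made-up names and numbers) are all assumptions for illustration. Notice which parts of the sentence survive.

```python
import re

# Hypothetical patterns for direct identifiers; a real pipeline needs far broader coverage.
PATTERNS = {
    "MRN": re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "NAME": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s+[A-Z][a-z]+\b"),
}

def scrub(text: str) -> str:
    """Replace each matched direct identifier with a [REDACTED] tag."""
    for pattern in PATTERNS.values():
        text = pattern.sub("[REDACTED]", text)
    return text

# Fictional sample text, invented for this sketch.
sample = (
    "Ms. Rivera (MRN: 4471203) seen 03/14/2024 by Dr. Okafor. "
    "Professional violinist preparing for a solo debut."
)
scrubbed = scrub(sample)
print(scrubbed)
```

The direct identifiers are gone, but the free-text clinical history passes through untouched, because nothing in it looks like a name, date, or ID.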

In [None]:
# This report has had "standard" PHI removed
deidentified_report_1 = """
EXAMINATION: MRI LEFT HAND WITHOUT CONTRAST

CLINICAL HISTORY: Professional violinist with 2-month history of left 4th finger pain
and weakness affecting performance. Patient reports practicing 6-8 hours daily for
upcoming solo debut at [REDACTED] Symphony Orchestra.

COMPARISON: None available.

TECHNIQUE: Multiplanar multisequence MRI of the left hand was performed on a 3T scanner.

FINDINGS:
There is intermediate T2 signal and mild enhancement within the A2 pulley of the
4th digit, consistent with stenosing tenosynovitis (trigger finger).
Mild associated flexor digitorum superficialis and profundus tenosynovitis is noted.
The extensor mechanism is intact.
No evidence of fracture or osseous abnormality.
The triangular fibrocartilage complex is intact.

IMPRESSION:
1. Stenosing tenosynovitis (trigger finger) of the left 4th digit with associated
   flexor tenosynovitis, likely related to repetitive strain from instrument playing.
2. Recommend hand therapy consultation and activity modification.

Electronically signed.
"""

print("=" * 60)
print("DEIDENTIFIED REPORT #1")
print("=" * 60)
print(deidentified_report_1)

DEIDENTIFIED REPORT #1

EXAMINATION: MRI LEFT HAND WITHOUT CONTRAST

CLINICAL HISTORY: Professional violinist with 2-month history of left 4th finger pain 
and weakness affecting performance. Patient reports practicing 6-8 hours daily for 
upcoming solo debut at [REDACTED] Symphony Orchestra.

COMPARISON: None available.

TECHNIQUE: Multiplanar multisequence MRI of the left hand was performed on a 3T scanner.

FINDINGS:
There is intermediate T2 signal and mild enhancement within the A2 pulley of the 
4th digit, consistent with stenosing tenosynovitis (trigger finger). 
Mild associated flexor digitorum superficialis and profundus tenosynovitis is noted.
The extensor mechanism is intact.
No evidence of fracture or osseous abnormality.
The triangular fibrocartilage complex is intact.

IMPRESSION:
1. Stenosing tenosynovitis (trigger finger) of the left 4th digit with associated 
   flexor tenosynovitis, likely related to repetitive strain from instrument playing.
2. Recommend hand therapy 

## Attack #1: Demographic Inference

Let's ask the LLM what it can infer about the patient from the "deidentified" report.

In [None]:
model = genai.GenerativeModel('models/gemini-2.0-flash-lite')

inference_prompt = """
You are a privacy researcher analyzing deidentified medical records for re-identification risks.

Based ONLY on the information in this "deidentified" radiology report, what can you infer or
deduce about the patient? Consider:

1. Likely age range
2. Gender (if inferable)
3. Occupation details
4. Skill level / career stage
5. Geographic hints
6. Socioeconomic indicators
7. Any other identifying characteristics

For each inference, explain your reasoning and rate confidence (Low/Medium/High).

Report:
"""

response = model.generate_content(inference_prompt + deidentified_report_1)
print("INFERENCE ATTACK RESULTS:")
print("=" * 60)
print(response.text)

INFERENCE ATTACK RESULTS:
Here's an analysis of the deidentified radiology report, outlining potential inferences and associated confidence levels:

1.  **Likely Age Range:**
    *   **Inference:** The patient is likely an adult, given the context of a "professional violinist" and a "solo debut."  While young musicians exist, the level of career stage suggests an older age.
    *   **Reasoning:** The report mentions a "solo debut" and the "upcoming" nature of it suggesting that the patient has put in a lot of time and dedication to get to this point, which takes time.
    *   **Confidence:** Medium

2.  **Gender:**
    *   **Inference:** It is impossible to determine the gender based on the limited information.
    *   **Reasoning:** The report does not mention any clues to the patient's gender.
    *   **Confidence:** Low

3.  **Occupation Details:**
    *   **Inference:** Professional violinist.
    *   **Reasoning:** This is explicitly stated in the "Clinical History."
    *   **Con

## Attack #2: Uniqueness Analysis

How unique is this combination of characteristics? Could we narrow down the patient pool?
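Before asking the model, it can help to see the arithmetic behind "narrowing down the pool". The sketch below is a back-of-the-envelope estimate only: the starting population and every fraction are illustrative placeholders, not measured statistics, and multiplying the fractions assumes the filters are independent, which they may not be. The `anonymity_set_size` helper is a hypothetical name.

```python
# All numbers below are illustrative placeholders, not measured statistics.
def anonymity_set_size(population: float, filters: dict[str, float]) -> float:
    """Shrink a starting population by the fraction that passes each filter."""
    size = population
    for name, fraction in filters.items():
        size *= fraction
        print(f"after {name:<34} ~{size:,.2f} candidates")
    return size

estimate = anonymity_set_size(
    population=10_000,  # hypothetical count of professional violinists in a region
    filters={
        "solo debut this season": 0.002,
        "debut with a symphony orchestra": 0.5,
        "left-hand injury in same period": 0.1,
    },
)
print(f"Estimated anonymity set: {max(estimate, 1):.0f} person(s)")
```

Each quasi-identifier looks harmless on its own, but the candidate pool shrinks multiplicatively as they are combined.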

In [None]:
uniqueness_prompt = """
You are analyzing this deidentified report for re-identification risk.

Consider the combination of quasi-identifiers present:
- Occupation: Professional violinist
- Specific detail: Upcoming solo debut
- Venue hint: Symphony Orchestra
- Condition: Left 4th finger trigger finger
- Practice intensity: 6-8 hours daily

Questions:
1. How many professional violinists have solo debuts at major symphony orchestras in a given year?
2. Of those, how many would have this specific finger condition?
3. If someone had access to public concert schedules and this report, could they identify the patient?
4. What's the estimated "anonymity set" size (how many people could this possibly be)?

Provide your analysis of re-identification feasibility.

Report:
"""

response = model.generate_content(uniqueness_prompt + deidentified_report_1)
print("UNIQUENESS ANALYSIS:")
print("=" * 60)
print(response.text)

UNIQUENESS ANALYSIS:
Here's a breakdown of the re-identification risk analysis:

**1. How many professional violinists have solo debuts at major symphony orchestras in a given year?**

*   **Analysis:** This is the most crucial question. The answer will heavily influence the anonymity set size.  The number is likely very small. Consider the following:
    *   **Professional Violinists:** There are many professional violinists, but the number who perform with major symphony orchestras is a much smaller subset.
    *   **Solo Debut:** A solo debut is a significant career milestone, and a limited number of violinists achieve this.
    *   **Major Symphony Orchestra:**  "Major" suggests highly prominent orchestras (e.g., Boston Symphony, New York Philharmonic, London Symphony). These are the most competitive engagements.
*   **Estimation:**  Likely **less than 10 per orchestra per year**, and possibly even less than 1. Some years, a specific orchestra might not have any solo debuts by viol

## Let's Try Another Report

Here's a different scenario with different quasi-identifiers.

In [None]:
deidentified_report_2 = """
EXAMINATION: CT CHEST WITH CONTRAST

CLINICAL HISTORY: Former professional athlete (NFL linebacker, 12 seasons) presenting
with dyspnea on exertion. History of multiple documented concussions.
BMI 34. Non-smoker. Retired 3 years ago.

COMPARISON: Chest X-ray from [REDACTED].

TECHNIQUE: Helical CT of the chest performed with IV contrast.

FINDINGS:
HEART: Cardiomegaly with left ventricular hypertrophy pattern.
Coronary artery calcifications noted in the LAD (Agatston score ~180).

LUNGS: Clear bilaterally. No pulmonary nodules or masses.
No pleural effusion or pneumothorax.

MEDIASTINUM: No lymphadenopathy. Normal caliber thoracic aorta.

MUSCULOSKELETAL: Incidental note of healed right clavicle fracture with hardware.
Degenerative changes in the cervical spine with multilevel disc disease.
Old healed rib fractures on the right (ribs 7, 8).

IMPRESSION:
1. Cardiomegaly with LVH, likely athletic heart with possible early
   hypertensive changes. Cardiology follow-up recommended.
2. Moderate coronary artery calcification - consider risk stratification.
3. Chronic musculoskeletal changes consistent with prior athletic injuries.

Electronically signed.
"""

print("=" * 60)
print("DEIDENTIFIED REPORT #2")
print("=" * 60)
print(deidentified_report_2)

DEIDENTIFIED REPORT #2

EXAMINATION: CT CHEST WITH CONTRAST

CLINICAL HISTORY: Former professional athlete (NFL linebacker, 12 seasons) presenting 
with dyspnea on exertion. History of multiple documented concussions. 
BMI 34. Non-smoker. Retired 3 years ago.

COMPARISON: Chest X-ray from [REDACTED].

TECHNIQUE: Helical CT of the chest performed with IV contrast.

FINDINGS:
HEART: Cardiomegaly with left ventricular hypertrophy pattern. 
Coronary artery calcifications noted in the LAD (Agatston score ~180).

LUNGS: Clear bilaterally. No pulmonary nodules or masses.
No pleural effusion or pneumothorax.

MEDIASTINUM: No lymphadenopathy. Normal caliber thoracic aorta.

MUSCULOSKELETAL: Incidental note of healed right clavicle fracture with hardware.
Degenerative changes in the cervical spine with multilevel disc disease.
Old healed rib fractures on the right (ribs 7, 8).

IMPRESSION:
1. Cardiomegaly with LVH, likely athletic heart with possible early 
   hypertensive changes. Cardiology fo

In [None]:
response = model.generate_content(inference_prompt + deidentified_report_2)
print("INFERENCE ATTACK RESULTS (Report #2):")
print("=" * 60)
print(response.text)

INFERENCE ATTACK RESULTS (Report #2):
Here's an analysis of the provided radiology report, outlining potential inferences and associated confidence levels:

1.  **Likely Age Range:**

    *   **Inference:** The patient is likely in their late 30s to early 50s.
    *   **Reasoning:** The report mentions the patient is a "former professional athlete (NFL linebacker, 12 seasons)" who retired "3 years ago".  This suggests a career that ended recently, indicating a possible age range based on the length of a typical football career. Furthermore, the presence of coronary artery calcifications (Agatston score 180), degenerative spine changes, and healed fractures are more common in middle-aged individuals.
    *   **Confidence:** Medium

2.  **Gender:**

    *   **Inference:** The patient is male.
    *   **Reasoning:**  The profession "NFL linebacker" is almost exclusively held by men.
    *   **Confidence:** High

3.  **Occupation Details:**

    *   **Inference:** Former professional athle

In [None]:
cross_reference_prompt = """
You are a privacy researcher. Given this deidentified medical report about a former
NFL linebacker, explain how someone could potentially re-identify this patient using
publicly available data sources.

Consider:
1. NFL injury reports (publicly released)
2. Sports news archives
3. Player retirement announcements
4. Social media posts about health issues
5. Public records

What specific data points in this report could be cross-referenced with public sources?

Report:
"""

response = model.generate_content(cross_reference_prompt + deidentified_report_2)
print("CROSS-REFERENCE ATTACK ANALYSIS:")
print("=" * 60)
print(response.text)

CROSS-REFERENCE ATTACK ANALYSIS:
Okay, as a privacy researcher, I will analyze this deidentified medical report and outline potential re-identification risks. The goal is to highlight how seemingly anonymized data can be combined with publicly available information to reveal the identity of the patient.

**Data Points Vulnerable to Re-identification:**

Several data points in this report, when combined with public information, significantly increase the risk of re-identification:

1.  **"Former professional athlete (NFL linebacker, 12 seasons)":** This is the biggest giveaway. The combination of sport (NFL), position (linebacker), and career length (12 seasons) narrows the field dramatically.
2.  **"Retired 3 years ago":** This further refines the timeframe, drastically reducing the number of potential candidates.
3.  **"History of multiple documented concussions":** While not definitive, it adds another layer of information that can be used to search.
4.  **"BMI 34. Non-smoker":** The
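The cross-referencing idea in the prompt above can be sketched as a simple linkage against a public table. Everything here is hypothetical: the roster is fictional, the `Player` and `link` names are made up for illustration, and `report_year` is a guess an attacker would have to make because dates were redacted (the `slack` parameter absorbs some of that uncertainty).

```python
from dataclasses import dataclass

@dataclass
class Player:
    name: str
    position: str
    seasons: int
    retired_year: int

# Entirely fictional roster standing in for a public data source.
PUBLIC_ROSTER = [
    Player("Player A", "linebacker", 12, 2022),
    Player("Player B", "linebacker", 7, 2022),
    Player("Player C", "quarterback", 12, 2022),
    Player("Player D", "linebacker", 12, 2015),
    Player("Player E", "linebacker", 12, 2021),
]

def link(roster, position, seasons, report_year, years_since_retirement, slack=1):
    """Return roster entries consistent with the report's quasi-identifiers."""
    target = report_year - years_since_retirement
    return [
        p for p in roster
        if p.position == position
        and p.seasons == seasons
        and abs(p.retired_year - target) <= slack
    ]

# report_year is an assumption: the report's dates were redacted.
candidates = link(PUBLIC_ROSTER, "linebacker", 12,
                  report_year=2025, years_since_retirement=3)
print([p.name for p in candidates])
```

Three attributes from the clinical history already cut this toy roster from five entries to two. Further details from the report, such as the documented injuries, could narrow it more.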

## Attack #3: The Rare Disease Problem

Some conditions are so rare that the diagnosis itself becomes identifying.
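To make "so rare" concrete, here is a rough steady-state estimate: active cases are approximately the incidence rate times the mean disease duration. The sketch uses the commonly cited sporadic CJD incidence of roughly 1 to 2 cases per million people per year. The catchment populations and the 6-month mean course are assumptions chosen for illustration, and `expected_active_cases` is a hypothetical helper.

```python
def expected_active_cases(population, incidence_per_million, mean_duration_years):
    """Steady-state estimate: new cases per year times mean disease duration."""
    new_cases_per_year = population / 1_000_000 * incidence_per_million
    return new_cases_per_year * mean_duration_years

# Assumed inputs: catchment sizes and a 6-month mean course are placeholders.
for label, population in [("rural hospital", 50_000), ("metro system", 5_000_000)]:
    low = expected_active_cases(population, 1, 0.5)
    high = expected_active_cases(population, 2, 0.5)
    print(f"{label:<15} ~{low:.3f} to {high:.3f} active patients")
```

When the expected number of active patients in a catchment area is at or below one, the diagnosis alone may be enough to single someone out.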

In [None]:
deidentified_report_3 = """
EXAMINATION: MRI BRAIN WITH AND WITHOUT CONTRAST

CLINICAL HISTORY: Progressive neurological symptoms. Known diagnosis of
Creutzfeldt-Jakob disease (sporadic form), confirmed by RT-QuIC testing.
Presenting for disease progression monitoring.

COMPARISON: MRI from 6 weeks prior.

TECHNIQUE: Multiplanar multisequence MRI performed at 3T.

FINDINGS:
Interval progression of cortical ribboning with increased DWI signal involving
bilateral frontal and parietal cortices compared to prior. New involvement of
the left temporal cortex. Basal ganglia signal abnormality is stable.
Progressive cerebral atrophy noted.
No evidence of mass effect or hydrocephalus.

IMPRESSION:
1. Progressive findings consistent with known sCJD, with new left temporal
   cortical involvement.
2. Continued cerebral volume loss.

Electronically signed.
"""

print("=" * 60)
print("DEIDENTIFIED REPORT #3 (Rare Disease)")
print("=" * 60)
print(deidentified_report_3)

DEIDENTIFIED REPORT #3 (Rare Disease)

EXAMINATION: MRI BRAIN WITH AND WITHOUT CONTRAST

CLINICAL HISTORY: Progressive neurological symptoms. Known diagnosis of 
Creutzfeldt-Jakob disease (sporadic form), confirmed by RT-QuIC testing.
Presenting for disease progression monitoring.

COMPARISON: MRI from 6 weeks prior.

TECHNIQUE: Multiplanar multisequence MRI performed at 3T.

FINDINGS:
Interval progression of cortical ribboning with increased DWI signal involving 
bilateral frontal and parietal cortices compared to prior. New involvement of 
the left temporal cortex. Basal ganglia signal abnormality is stable.
Progressive cerebral atrophy noted.
No evidence of mass effect or hydrocephalus.

IMPRESSION:
1. Progressive findings consistent with known sCJD, with new left temporal 
   cortical involvement.
2. Continued cerebral volume loss.

Electronically signed.



In [None]:
rare_disease_prompt = """
Analyze the re-identification risk for this deidentified report containing a rare disease.

Consider:
1. What is the annual incidence of sporadic CJD?
2. At any given time, how many active sCJD patients exist in a typical region?
3. If this report came from a specific hospital system, what's the anonymity set?
4. How does the rarity of the disease itself function as an identifier?
5. What additional data sources might exist (disease registries, research studies)?

Provide a risk assessment.

Report:
"""

response = model.generate_content(rare_disease_prompt + deidentified_report_3)
print("RARE DISEASE RE-IDENTIFICATION RISK:")
print("=" * 60)
print(response.text)

RARE DISEASE RE-IDENTIFICATION RISK:
Let's analyze the re-identification risk of this deidentified report.

**1. What is the annual incidence of sporadic CJD?**

*   Sporadic Creutzfeldt-Jakob disease (sCJD) has a very low incidence. The typical incidence is approximately **1-2 cases per million people per year**. This makes it a rare disease.

**2. At any given time, how many active sCJD patients exist in a typical region?**

*   This depends on the population size of the "typical region." Let's consider a hypothetical region of 1 million people. With an incidence of 1-2 per million per year, we would expect roughly **1-2 new cases per year**. However, sCJD has a rapid disease course (months). Assuming an average survival time of 6 months after diagnosis, and considering there are ongoing diagnosis, we can estimate that **1-2** patients would be active within this specific region at any given time.

**3. If this report came from a specific hospital system, what's the anonymity set?**


## So... How Do We Protect Against This?

Let's ask the LLM to suggest better deidentification strategies for Report #1.

In [None]:
mitigation_prompt = """
You are a health privacy expert. Review this "deidentified" report and suggest
additional steps needed to truly protect patient privacy.

For each quasi-identifier found:
1. Identify the risk
2. Suggest a specific mitigation (generalization, suppression, or perturbation)
3. Explain the tradeoff with clinical utility

Then provide a fully re-deidentified version of the report that balances
privacy protection with clinical usefulness.

Original "deidentified" report:
"""

response = model.generate_content(mitigation_prompt + deidentified_report_1)
print("MITIGATION RECOMMENDATIONS:")
print("=" * 60)
print(response.text)

🛡️ MITIGATION RECOMMENDATIONS:
Okay, I have reviewed the "deidentified" report and will provide an assessment of its privacy risks, suggest mitigation strategies, and offer a re-deidentified version.

**Analysis of Privacy Risks and Mitigation Strategies**

Even though the report has been "deidentified" by removing the patient's name and institution (assuming "[REDACTED]" refers to the hospital or clinic), it still contains information that could potentially re-identify the patient. Here's a breakdown of the risks and mitigation strategies for each quasi-identifier:

| Quasi-Identifier          | Risk                                                                                                                                                                                                 | Mitigation Strategy                                                                                                                                                                        | Tra
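The generalization and suppression strategies named in the prompt can also be sketched as plain string rewrites, without an LLM. This is a minimal illustration under stated assumptions: the `RULES` list and `generalize` helper are hypothetical, the rules are hand-written for Report #1 only, and a real pipeline would need clinical review of what each rewrite costs in utility.

```python
import re

# Hypothetical rewrite rules: (pattern, replacement, strategy).
RULES = [
    (r"Professional violinist", "Professional musician", "generalization"),
    (r"\s*Patient reports practicing.*?Orchestra\.", "", "suppression"),
    (r"2-month history", "several-month history", "generalization"),
    (r"instrument playing", "occupational activity", "generalization"),
]

def generalize(text: str) -> str:
    """Apply each rule in order; DOTALL lets a pattern span line breaks."""
    for pattern, replacement, _strategy in RULES:
        text = re.sub(pattern, replacement, text, flags=re.DOTALL)
    return text

# Clinical history copied from Report #1.
history = (
    "Professional violinist with 2-month history of left 4th finger pain\n"
    "and weakness affecting performance. Patient reports practicing 6-8 hours daily for\n"
    "upcoming solo debut at [REDACTED] Symphony Orchestra."
)
safer = generalize(history)
print(safer)
```

The affected finger is kept because the radiologist needs it. The specific instrument, practice schedule, and debut are coarsened or dropped because they add little clinically but a lot of identifiability.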

## Try It Yourself!

Paste your own "deidentified" report below and see what an LLM can infer.

In [None]:
# Paste your own deidentified report here
your_report = """
PASTE YOUR DEIDENTIFIED REPORT HERE
"""

if your_report.strip() != "PASTE YOUR DEIDENTIFIED REPORT HERE":
    response = model.generate_content(inference_prompt + your_report)
    print("INFERENCE ATTACK ON YOUR REPORT:")
    print("=" * 60)
    print(response.text)
else:
    print("Replace the placeholder text with your report to analyze it.")

## Key Takeaways

1. Standard deidentification may not be enough! Removing names, dates, and IDs doesn't eliminate all re-identification risk.

2. Occupation + condition + timing can be as identifying as a name.

3. Some diagnoses have such low prevalence that the diagnosis itself is an identifier.

4. Context matters: a report from a small rural hospital has a far smaller anonymity set than one from a large urban system.

5. LLMs lower the bar for attackers: inference that once required manual research can now be done in seconds.

6. Consider generalization, suppression, k-anonymity, and differential privacy techniques.
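As a starting point for the last takeaway, here is a minimal sketch of checking k-anonymity before and after generalization. The records are made up, the occupation groupings and 10-year age bands are arbitrary choices, and `k_of` and `generalize` are hypothetical helper names. It illustrates the concept on a toy table and is not a production deidentification tool.

```python
from collections import Counter

# Toy records: (age, occupation, diagnosis). All values are made up.
records = [
    (34, "violinist", "trigger finger"),
    (37, "cellist", "tendinitis"),
    (52, "linebacker", "cardiomegaly"),
    (55, "lineman", "cardiomegaly"),
    (31, "pianist", "trigger finger"),
    (58, "goalkeeper", "tendinitis"),
]

# Arbitrary grouping chosen for this example.
OCCUPATION_GROUPS = {
    "violinist": "musician", "cellist": "musician", "pianist": "musician",
    "linebacker": "athlete", "lineman": "athlete", "goalkeeper": "athlete",
}

def k_of(quasi_ids):
    """k-anonymity level: size of the smallest group sharing the same quasi-identifiers."""
    return min(Counter(quasi_ids).values())

def generalize(age, occupation):
    """Coarsen age to a 10-year band and occupation to a broad category."""
    decade = (age // 10) * 10
    return (f"{decade}-{decade + 9}", OCCUPATION_GROUPS[occupation])

raw = [(age, occ) for age, occ, _ in records]
coarse = [generalize(age, occ) for age, occ, _ in records]
print("k before generalization:", k_of(raw))
print("k after generalization: ", k_of(coarse))
```

A higher k means each record hides among more look-alikes, at the cost of less precise data. Note that k-anonymity alone does not protect the diagnosis column if everyone in a group shares the same diagnosis.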