
# 03 · Jurisdictional Compliance Matrix (HIPAA / GDPR / DUA)

This notebook converts privacy metrics into **policy readiness** across jurisdictions:

- **HIPAA Safe Harbor** (scan of column names for direct identifiers)
- **GDPR anonymization** (risk checklist: low k-anonymity, sub-year dates, precise geography, direct identifiers)
- **DUA policy gates** (purpose limitation, retention, region, third parties)

**Outputs**
- `data/privacy_compliance_report.json` → consolidated compliance results
- `reports/assets/compliance_bar.png` → compliance readiness bar chart


In [None]:

import json
import os
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Import project modules
repo_root = Path.cwd()
if not (repo_root / "notebooks").exists():
    # Running from inside notebooks/: the repo root is one level up
    repo_root = repo_root.parent
scripts_dir = repo_root / "scripts"
visuals_dir = repo_root / "visuals"

# repo_root must be on sys.path for the package-qualified imports below
# (scripts.*, visuals.*); the two subdirectories are kept for flat imports.
for p in (repo_root, scripts_dir, visuals_dir):
    if str(p) not in sys.path:
        sys.path.append(str(p))

from scripts.privacy_checks import load_dataset, detect_direct_identifiers, detect_quasi_identifiers
from scripts.compliance_matrix import build_compliance_report
from visuals.privacy_plots import plot_compliance_bar, save_fig

DATA_DIR = repo_root / "data"
ASSETS = repo_root / "reports" / "assets"
DATA_DIR.mkdir(exist_ok=True, parents=True)
ASSETS.mkdir(exist_ok=True, parents=True)

print("Environment ready.")
print("repo_root =", repo_root)



## Inputs

This notebook uses either:
- The dataset from prior steps (`data/sample_synthetic.csv` by default), or
- Your own dataset if you change `DATA_FILE` below.

It relies on the same direct/quasi-identifier scan as Notebook 01 and recomputes it here, so Notebook 01 does not need to have been run first.


In [None]:

DATA_FILE = DATA_DIR / "sample_synthetic.csv"  # change to your dataset if desired

# Try to load dataset
if DATA_FILE.exists():
    df = load_dataset(DATA_FILE)
    print(f"[info] Loaded dataset: {DATA_FILE} shape={df.shape}")
else:
    print(f"[warn] {DATA_FILE.name} missing; creating a tiny synthetic frame in-memory.")
    df = pd.DataFrame({
        "patient_id": [1,2,3,4,5,6,7,8],
        "birth_date": ["1980-01-01","1980-02-10","1978-05-03","1990-07-21","1985-10-12","1972-03-30","1972-03-30","1972-03-30"],
        "zip_code": ["94110","94110","02139","02139","10027","10027","10027","10027"],
        "sex": ["F","F","M","M","F","F","M","M"],
        "condition": ["Diabetes","Hypertension","Asthma","Cancer","COPD","Diabetes","Asthma","COPD"],
        "visit_date": ["2022-03-10","2022-03-10","2022-03-11","2022-03-11","2022-03-12","2022-03-13","2022-03-13","2022-03-14"],
        "lab_result": [7.1,"130/85","Mild","Stage II","FEV1=65%","7.4","Moderate","FEV1=60%"]
    })
    print(f"[info] Created synthetic df shape={df.shape}")

display(df.head())



## Identify direct & quasi-identifiers

We’ll reuse the same heuristics from Notebook 01. These inform GDPR risk checks and DUA guidance.


In [None]:

direct = detect_direct_identifiers(df)
quasi  = detect_quasi_identifiers(df)

print("Direct identifiers:", sorted(direct) if direct else "None")
print("Quasi-identifiers :", sorted(quasi) if quasi else "None")
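
As a rough illustration of what a column-name scan does, the sketch below matches column names against a short pattern list. The patterns and the helper `scan_direct_identifier_names` are hypothetical and are not taken from `scripts.privacy_checks`; the project's `detect_direct_identifiers` may use different rules.

```python
import re

# Hypothetical name patterns, for illustration only.
DIRECT_ID_PATTERNS = [r"name", r"ssn", r"social", r"email", r"phone", r"address", r"mrn", r"_id$"]


def scan_direct_identifier_names(columns):
    """Return the column names that match any pattern (case-insensitive)."""
    compiled = [re.compile(p, re.IGNORECASE) for p in DIRECT_ID_PATTERNS]
    return sorted(c for c in columns if any(rx.search(c) for rx in compiled))


example_hits = scan_direct_identifier_names(
    ["patient_id", "birth_date", "zip_code", "sex", "condition", "visit_date", "lab_result"]
)
print(example_hits)
```

A scan like this only looks at names, so it can miss identifiers stored under neutral column names and can flag harmless columns whose names happen to match.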



## DUA policy flags (configure for your project)

These represent organizational policy gates often required before data sharing/processing.  
Set to **True** when satisfied.


In [None]:

DUA_FLAGS = {
    "purpose_limited": True,     # The use case is clearly defined and limited
    "retention_defined": True,   # Retention period and deletion policy are documented
    "in_region": True,           # Data stored/processed in allowed regions
    "third_party_vetted": False  # Third-party processors have been assessed
}
DUA_FLAGS
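
To make the gate logic concrete, here is a minimal sketch that turns a flag dictionary into a score and a list of unsatisfied gates. The helper `dua_gate_summary` is hypothetical; it assumes the score is simply the share of gates set to `True`, which may not match how `build_compliance_report` computes it.

```python
def dua_gate_summary(flags):
    """Return the share of satisfied gates and the names of unsatisfied ones."""
    if not flags:
        return {"score": 0.0, "missing": []}
    missing = sorted(name for name, ok in flags.items() if not ok)
    score = (len(flags) - len(missing)) / len(flags)
    return {"score": score, "missing": missing}


# Same shape as DUA_FLAGS above, under a separate name to avoid overwriting it.
example_flags = {
    "purpose_limited": True,
    "retention_defined": True,
    "in_region": True,
    "third_party_vetted": False,
}
summary = dua_gate_summary(example_flags)
print(summary)
```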



## Build the compliance report

This aggregates HIPAA Safe Harbor, GDPR risk checklist, and DUA gates into a single object.


In [None]:

compliance = build_compliance_report(df, quasi_cols=quasi, direct_cols=direct, dua_flags=DUA_FLAGS)

out_json = DATA_DIR / "privacy_compliance_report.json"
out_json.write_text(json.dumps(compliance, indent=2))
print(f"[ok] wrote → {out_json}")

# Preview
compliance
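
One of the GDPR checks listed at the top is low k. As a sketch of what that means, the snippet below computes k as the size of the smallest group of records sharing the same quasi-identifier values. The helper `smallest_group_size` and the sample records are illustrative assumptions, not the logic inside `build_compliance_report`.

```python
from collections import Counter


def smallest_group_size(rows, quasi_cols):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    counts = Counter(tuple(row[c] for c in quasi_cols) for row in rows)
    return min(counts.values()) if counts else 0


# Illustrative records; with pandas, df.to_dict("records") yields the same shape.
example_rows = [
    {"zip_code": "10027", "sex": "F"},
    {"zip_code": "10027", "sex": "M"},
    {"zip_code": "10027", "sex": "M"},
    {"zip_code": "94110", "sex": "F"},
    {"zip_code": "94110", "sex": "F"},
]
k = smallest_group_size(example_rows, ["zip_code", "sex"])
print("k =", k)
```

A value of k = 1 means at least one record is unique on its quasi-identifiers, which is the situation the checklist treats as risky.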



## Visual: Compliance Readiness

A compact bar chart summarizing the HIPAA pass/review status, the normalized GDPR risk, the DUA score, and the overall compliance index.


In [None]:

fig = plot_compliance_bar(compliance, title="Compliance Readiness (HIPAA / GDPR / DUA / Index)")
out = save_fig(fig, ASSETS / "compliance_bar.png")
plt.show()
print("[ok] saved:", out)



## How to interpret

- **HIPAA (PASS/REVIEW)** → `REVIEW` means column names suggest that direct identifiers are present; remove or suppress them before sharing.  
- **GDPR (risk index)** → lower is better. High risk usually comes from sub‑year dates, precise geography, or direct identifiers.  
- **DUA score** → share of gates satisfied (1 means all gates pass); any unsatisfied gate blocks sharing until it is addressed.  
- **Compliance index** → a simple weighted roll‑up (US/EU/DUA). Use it as a directional guide, not a regulatory verdict.
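
The roll-up described above can be sketched as a weighted average. Everything specific here is an assumption: the equal weights, the helper `rollup_index`, and the premise that the GDPR risk index and DUA score both lie between 0 and 1. The actual weighting lives in `scripts.compliance_matrix` and may differ.

```python
# Hypothetical equal weights, for illustration only.
EXAMPLE_WEIGHTS = {"us": 1 / 3, "eu": 1 / 3, "dua": 1 / 3}


def rollup_index(hipaa_pass, gdpr_risk_index, dua_score, weights=EXAMPLE_WEIGHTS):
    """Combine three readiness signals into one 0-1 score (higher is better)."""
    us = 1.0 if hipaa_pass else 0.0
    eu = 1.0 - gdpr_risk_index  # risk is "lower is better", so invert it
    total = sum(weights.values())
    return (weights["us"] * us + weights["eu"] * eu + weights["dua"] * dua_score) / total


example_index = rollup_index(hipaa_pass=False, gdpr_risk_index=0.5, dua_score=0.75)
print(round(example_index, 3))
```

Dividing by the sum of the weights keeps the result on a 0 to 1 scale even if the weights are changed and no longer add up to 1.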



## Compact dashboard table


In [None]:

hipaa = compliance.get("hipaa", {})
gdpr  = compliance.get("gdpr", {})
jur   = compliance.get("jurisdictions", {})
dua   = (jur.get("DUA") or {})

dash = pd.DataFrame([{
    "HIPAA_pass": hipaa.get("pass"),
    "GDPR_risk_index": gdpr.get("risk_index"),
    "DUA_score": dua.get("score"),
    "Compliance_index": jur.get("compliance_index")
}])
dash
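
The dashboard values can also be turned into a short action list. The sketch below uses the thresholds this notebook states (GDPR risk index ≥ 0.33, DUA score < 1); the helper `triage` is hypothetical, and it assumes `HIPAA_pass` is a boolean, treating a missing value as needing review.

```python
def triage(hipaa_pass, gdpr_risk_index, dua_score, gdpr_threshold=0.33):
    """Return a list of follow-up actions; an empty list means no rule fired."""
    actions = []
    if hipaa_pass is not True:  # False or missing both call for review
        actions.append("HIPAA: drop or transform flagged direct identifiers")
    if gdpr_risk_index is not None and gdpr_risk_index >= gdpr_threshold:
        actions.append("GDPR: coarsen dates and geography, recheck direct identifiers")
    if dua_score is not None and dua_score < 1:
        actions.append("DUA: resolve unsatisfied policy gates")
    return actions


example_actions = triage(hipaa_pass=False, gdpr_risk_index=0.5, dua_score=0.75)
for action in example_actions:
    print("-", action)
```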



## Next steps

- If **HIPAA=REVIEW** → drop or transform flagged direct identifiers (Notebook 02 has generalization patterns).  
- If **GDPR risk index ≥ 0.33** → coarsen dates to year (month-level dates still count as sub-year); reduce geo precision; verify no direct identifiers remain.  
- If **DUA < 1** → work with Legal/Data Stewardship to satisfy missing gates (purpose, retention, region, third parties).  
- Finally, pass results to **Notebook 04** to link compliance posture to **ROI** and auto‑build a **PDF report**.
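
The coarsening steps above can be sketched with two small helpers. Both `coarsen_date_to_year` and `coarsen_zip` are hypothetical names, and they assume ISO `YYYY-MM-DD` date strings and 5-digit ZIP codes like those in the sample data; Notebook 02 may use different generalization patterns.

```python
from datetime import date


def coarsen_date_to_year(iso_date):
    """Reduce an ISO 'YYYY-MM-DD' string to its year."""
    return str(date.fromisoformat(iso_date).year)


def coarsen_zip(zip_code, digits=3):
    """Keep only the leading digits of a ZIP code to reduce geographic precision."""
    return str(zip_code)[:digits]


print(coarsen_date_to_year("1980-02-10"), coarsen_zip("94110"))

# With pandas, the same helpers can be applied column-wise, for example:
# df["birth_date"] = df["birth_date"].map(coarsen_date_to_year)
# df["zip_code"] = df["zip_code"].map(coarsen_zip)
```

Note that HIPAA Safe Harbor also expects three-digit ZIP areas with 20,000 or fewer residents to be replaced with `000`; this sketch does not check population counts.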
