
# 01 · Privacy Quasi-Identifier Scan (Rewired)

This notebook is the **starting point**. It scans a dataset for **direct** and **quasi-identifiers**, computes **k‑anonymity** and **l‑diversity**, and saves a JSON summary used by the rest of the workflow.

**You will get:**
- A quick schema profile and sample rows
- A list of **direct** vs **quasi-identifiers**
- **k-anonymity**, **l-diversity**, and an overall **risk** score
- Visuals saved to `reports/assets/`
- `data/privacy_report.json` for downstream notebooks


In [None]:

import json
import os
import sys
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Ensure project modules are importable. The imports below are package-qualified
# (scripts.*, visuals.*), so the repo root itself must be on sys.path, not just
# the scripts/ and visuals/ directories.
repo_root = Path.cwd()
if not (repo_root / "notebooks").exists():
    # If launched inside notebooks/, go up one level
    repo_root = repo_root.parent
scripts_dir = repo_root / "scripts"
visuals_dir = repo_root / "visuals"
for p in (repo_root, scripts_dir, visuals_dir):
    if str(p) not in sys.path:
        sys.path.append(str(p))

from scripts.privacy_checks import (
    load_dataset, build_privacy_report, detect_direct_identifiers,
    detect_quasi_identifiers, k_anonymity, l_diversity, infer_column_roles
)
from visuals.privacy_plots import plot_identifier_heatmap, plot_k_equivalence_hist, save_fig

DATA_DIR = repo_root / "data"
ASSETS = repo_root / "reports" / "assets"
DATA_DIR.mkdir(exist_ok=True, parents=True)
ASSETS.mkdir(exist_ok=True, parents=True)

print("Environment ready.")
print("repo_root =", repo_root)



## 🔧 Configure

- Place your CSV/Parquet in `data/` and set `DATA_FILE` below.  
- By default, `DATA_FILE` points to `data/sample_synthetic.csv` (included); if that file is missing, the notebook falls back to a tiny in-memory synthetic dataframe.  
- Optionally set a **sensitive column** (e.g., diagnosis, outcome) to compute **l‑diversity**.
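
`load_dataset` comes from `scripts/privacy_checks.py` and its implementation is not shown in this notebook. As a rough sketch of what loading by file extension could look like, assuming nothing about the real helper, the hypothetical `sketch_load_dataset` below dispatches on the suffix. The `str_cols` parameter is also an assumption for illustration: it keeps columns such as ZIP codes as strings so that leading zeros (e.g., `02139`) are not lost when pandas infers an integer dtype.

```python
import tempfile
from pathlib import Path

import pandas as pd

def sketch_load_dataset(path, str_cols=()):
    """Hypothetical loader: pick a pandas reader from the file extension."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # Force the listed columns to str; None lets pandas infer every dtype.
        dtype = {col: str for col in str_cols} or None
        return pd.read_csv(path, dtype=dtype)
    if suffix == ".parquet":
        # Requires a Parquet engine such as pyarrow or fastparquet.
        return pd.read_parquet(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")

# Usage on a throwaway CSV written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "toy.csv"
    csv_path.write_text("zip_code,sex\n02139,F\n94110,M\n")
    loaded = sketch_load_dataset(csv_path, str_cols=["zip_code"])

print(loaded["zip_code"].tolist())
```

Reading ZIP codes as strings matters for the later steps, because a ZIP that has lost its leading zero no longer groups with the correct area.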


In [None]:

# Path to your dataset (CSV or Parquet). Leave as default to use the toy sample.
DATA_FILE = DATA_DIR / "sample_synthetic.csv"   # <-- change to e.g., DATA_DIR / "my_extract.csv"

# Sensitive column used for l-diversity (optional, can be None).
# Common choices: 'condition', 'diagnosis', 'outcome', 'lab_result', etc.
SENSITIVE_COL = "condition"



## 📥 Load dataset


In [None]:

if DATA_FILE.exists():
    df = load_dataset(DATA_FILE)
    print(f"[info] Loaded dataset → {DATA_FILE}  shape={df.shape}")
else:
    print(f"[warn] {DATA_FILE} not found; creating a tiny synthetic dataframe in-memory.")
    df = pd.DataFrame({
        "patient_id": [1,2,3,4,5,6,7,8,9,10],
        "birth_date": ["1980-01-01","1979-05-12","1988-07-03","1975-09-21","1992-11-30",
                       "1981-04-10","1978-12-02","1990-01-14","1977-06-25","1985-03-17"],
        "zip_code": ["94110","94110","02139","02139","10027","10027","60616","60616","30303","30303"],
        "sex": ["F","M","F","M","F","M","F","M","F","M"],
        "condition": ["Diabetes","Hypertension","Asthma","Cancer","COPD","Diabetes","Asthma","Cancer","COPD","Hypertension"],
        "visit_date": ["2022-03-10","2022-03-10","2022-03-11","2022-03-11","2022-03-12",
                       "2022-03-12","2022-03-13","2022-03-13","2022-03-14","2022-03-14"],
        "lab_result": ["7.1","130/85","Mild","Stage II","FEV1=65%","7.4","Moderate","Stage I","FEV1=60%","140/90"]
    })
    print(f"[info] Created synthetic df shape={df.shape}")

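# Keep SENSITIVE_COL only if present; else disable l-diversity.
# The None check avoids a misleading "'None' not found" note when no column was configured.
if SENSITIVE_COL is not None and SENSITIVE_COL not in df.columns:
    print(f"[note] Sensitive column '{SENSITIVE_COL}' not found; l-diversity will be skipped.")
    SENSITIVE_COL = None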

display(df.head())



## 🧭 Schema quick‑look


In [None]:

print("Columns:", list(df.columns))
print("Shape:", df.shape)
display(df.describe(include='all').T.head(20))

roles = infer_column_roles(df)
print("\nInferred roles:")
for k,v in roles.items():
    print(f"  - {k}: {v}")



## 🔎 Identifier scan
Detect **direct identifiers** (e.g., names, MRN, email) and **quasi-identifiers** (dates, ZIP, small‑area geos, demographics).
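
The detection rules themselves live in `scripts/privacy_checks.py` and are not shown here. To illustrate the general idea only, the sketch below classifies columns by name patterns alone. The function `sketch_identifier_scan` and both pattern lists are assumptions made for this example, not the project's actual rules, and a real scan would likely also inspect the values.

```python
import re

import pandas as pd

# Hypothetical name patterns, chosen for illustration only.
DIRECT_PATTERNS = re.compile(r"(name|mrn|email|ssn|phone|_id$|^id$)", re.IGNORECASE)
QUASI_PATTERNS = re.compile(r"(date|dob|zip|postal|sex|gender|race)", re.IGNORECASE)

def sketch_identifier_scan(df):
    """Split column names into (direct, quasi) sets using name patterns only."""
    direct, quasi = set(), set()
    for col in df.columns:
        name = str(col)
        if DIRECT_PATTERNS.search(name):
            direct.add(name)
        elif QUASI_PATTERNS.search(name):
            quasi.add(name)
    return direct, quasi

toy = pd.DataFrame({
    "patient_id": [1, 2],
    "birth_date": ["1980-01-01", "1979-05-12"],
    "zip_code": ["94110", "02139"],
    "sex": ["F", "M"],
    "condition": ["Asthma", "COPD"],
})
direct_cols, quasi_cols = sketch_identifier_scan(toy)
print("direct:", sorted(direct_cols))
print("quasi :", sorted(quasi_cols))
```

Name matching is cheap but brittle: a column called `code_1` holding e-mail addresses would be missed, so treat the output as a first pass to review by hand.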


In [None]:

direct = detect_direct_identifiers(df)
quasi  = detect_quasi_identifiers(df)

print("Direct identifiers:", sorted(direct) if direct else "None")
print("Quasi-identifiers :", sorted(quasi) if quasi else "None")



## 📏 Privacy metrics
Compute **k‑anonymity** and **l‑diversity** (the latter only if a sensitive column is provided). The full report is saved for later steps.
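
For readers new to these metrics: k‑anonymity is the size of the smallest group of rows that share the same quasi-identifier values, and distinct l‑diversity is the smallest number of distinct sensitive values found in any such group. The sketch below shows one way to compute both with pandas. The helper names `sketch_k_anonymity` and `sketch_distinct_l_diversity` are hypothetical, and the project's own `k_anonymity` and `l_diversity` may handle edge cases differently.

```python
import pandas as pd

def sketch_k_anonymity(df, quasi_cols):
    """Size of the smallest equivalence class over the quasi-identifier columns."""
    quasi_cols = list(quasi_cols)
    if not quasi_cols:
        return len(df)  # no quasi-identifiers: all rows form a single class
    sizes = df.groupby(quasi_cols, dropna=False).size()
    return int(sizes.min())

def sketch_distinct_l_diversity(df, quasi_cols, sensitive_col):
    """Fewest distinct sensitive values found in any equivalence class."""
    quasi_cols = list(quasi_cols)
    if not quasi_cols:
        return int(df[sensitive_col].nunique())
    distinct = df.groupby(quasi_cols, dropna=False)[sensitive_col].nunique()
    return int(distinct.min())

toy = pd.DataFrame({
    "zip_code": ["94110", "94110", "94110", "02139", "02139"],
    "sex": ["F", "F", "F", "M", "M"],
    "condition": ["Asthma", "COPD", "Asthma", "Cancer", "Cancer"],
})
k_toy = sketch_k_anonymity(toy, ["zip_code", "sex"])
l_toy = sketch_distinct_l_diversity(toy, ["zip_code", "sex"], "condition")
print(k_toy, l_toy)  # 2 1
```

In the toy data the (`02139`, `M`) class has two rows, so k = 2, but both rows share the same condition, so l = 1: knowing someone is in that class reveals their condition even though the class is not unique.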


In [None]:

k_val = k_anonymity(df, quasi)
l_val = l_diversity(df, quasi, SENSITIVE_COL, method="distinct") if SENSITIVE_COL else np.nan

report = build_privacy_report(
    df,
    sensitive_col=SENSITIVE_COL,
    quasi_override=sorted(quasi) if quasi else None
)

# Save JSON
privacy_json = DATA_DIR / "privacy_report.json"
privacy_json.write_text(json.dumps(report, indent=2))
print(f"[ok] Wrote → {privacy_json}")

pd.DataFrame([{
    "k_anonymity": k_val,
    "l_diversity": l_val if not pd.isna(l_val) else None,
    "direct_identifiers": ", ".join(sorted(direct)) if direct else "None",
    "quasi_identifiers": ", ".join(sorted(quasi)) if quasi else "None"
}])



## 📈 Visuals
We save charts into `reports/assets/` for embedding in the final PDF.


In [None]:

# Identifier map (direct vs quasi)
fig1 = plot_identifier_heatmap(df.columns, direct, quasi)
p1 = save_fig(fig1, ASSETS / "identifier_map.png")
plt.show()
print("[ok] saved:", p1)

# k-anonymity equivalence class histogram
fig2 = plot_k_equivalence_hist(df, quasi)
p2 = save_fig(fig2, ASSETS / "k_hist.png")
plt.show()
print("[ok] saved:", p2)



## ✅ What to do next

- If **direct identifiers** were detected → drop or replace them before sharing data.  
- If **k < 5** or **l < 2**, increase generalization (dates → year/month, ZIP → ZIP3, bucket rare categories).  
- Continue to **02_deidentification_scorecard.ipynb** to apply generalizations and compare **before → after**.  
- The `privacy_report.json` you just created is used by **03 (Compliance)** and **04 (ROI + Report)**.
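
The generalizations named above (dates → year/month, ZIP → ZIP3, bucketing rare categories) can be sketched in a few lines of pandas. This is only an illustration under stated assumptions: `sketch_generalize`, the `min_count` threshold, and the `occupation` column are hypothetical and are not taken from the next notebook.

```python
import pandas as pd

def sketch_generalize(df, date_col, zip_col, cat_col, min_count=2):
    """Return a copy with coarser dates, 3-digit ZIPs, and rare categories bucketed."""
    out = df.copy()
    # Dates -> year-month strings such as "1980-01".
    out[date_col] = pd.to_datetime(out[date_col]).dt.strftime("%Y-%m")
    # ZIP -> ZIP3; zfill restores leading zeros if ZIPs were parsed as integers.
    out[zip_col] = out[zip_col].astype(str).str.zfill(5).str[:3]
    # Categories seen fewer than min_count times are replaced with "Other".
    counts = out[cat_col].value_counts()
    rare = counts[counts < min_count].index
    out[cat_col] = out[cat_col].where(~out[cat_col].isin(rare), "Other")
    return out

toy = pd.DataFrame({
    "birth_date": ["1980-01-01", "1980-01-20", "1979-05-12"],
    "zip_code": ["94110", "94117", "02139"],
    "occupation": ["Nurse", "Nurse", "Pilot"],  # hypothetical quasi-identifier
})
gen = sketch_generalize(toy, "birth_date", "zip_code", "occupation")
print(gen)
```

After generalizing, rerunning the k‑anonymity and l‑diversity cells shows whether the coarser values moved the metrics above the thresholds.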
