# OMOP / FHIR Conformity Checks + CT.gov & CDC Context (Updated)

This notebook extends the governance scorecard by adding:
- **OMOP / FHIR / vocabulary** conformity (standards)
- **ClinicalTrials.gov** portfolio context (feasibility proxy from median duration & enrollment)
- **CDC** public-health timeliness (external signal)

It combines these with the basic OpenFDA governance metrics into an expanded **RWD Governance Scorecard** with components:
**Completeness, Consistency, Timeliness, Conformity, Standards, Context**.

In [None]:
import os, json, requests, pandas as pd, numpy as np, matplotlib.pyplot as plt
from pathlib import Path
from datetime import datetime

from scripts.scorecard import (
    compute_basic_metrics, score, plot_scorecard,
    omop_conformity, fhir_conformity, vocabulary_format_score, standards_score
)

pd.set_option('display.max_columns', 50)
DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)
print("Setup complete.")

## 1) OpenFDA sample (as before)

In [None]:
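# Pull a small OpenFDA adverse-event sample.
# Use literal spaces around TO: requests URL-encodes the query string, so a
# literal '+' would be sent as %2B and break the date-range syntax.
BASE = 'https://api.fda.gov/drug/event.json'
params = {'search': 'receivedate:[20240101 TO 20251231]', 'limit': 100}

try:
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()  # surface HTTP errors instead of parsing an error body
    results = resp.json().get('results', [])
except (requests.RequestException, ValueError) as e:
    print("OpenFDA fetch failed:", e)
    results = []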

def flatten(rec):
    return {
        'safetyreportid': rec.get('safetyreportid'),
        'receiptdate': rec.get('receiptdate'),
        'occurcountry': rec.get('occurcountry'),
        'serious': rec.get('serious'),
        'companynumb': rec.get('companynumb'),
    }

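# Fix the column set so an empty fetch still yields a frame with the expected columns
df = pd.DataFrame(
    [flatten(r) for r in results],
    columns=['safetyreportid', 'receiptdate', 'occurcountry', 'serious', 'companynumb'],
)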
df.head()

## 2) Basic governance metrics (OpenFDA)

In [None]:
required_fields = ['safetyreportid','receiptdate','occurcountry','serious']
basic = compute_basic_metrics(df, required_fields)
basic

## 3) OMOP / FHIR / Vocabulary (Standards)

- **OMOP**: minimal required columns present + non-null coverage (illustrative)
- **FHIR**: minimal required keys across a small synthetic bundle (illustrative)
- **Vocabulary**: regex proxy (e.g., ICD-10-like format)
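
As a rough illustration of the vocabulary proxy, the sketch below computes the share of codes that fully match a format regex. `format_match_rate` is a hypothetical stand-in written for this example; the actual `vocabulary_format_score` lives in `scripts/scorecard.py` and may treat nulls or normalization differently.

```python
import re

def format_match_rate(codes, pattern):
    """Share of non-null codes that fully match `pattern` (0.0 if there are none)."""
    compiled = re.compile(pattern)
    values = [str(c).strip() for c in codes if c is not None]
    if not values:
        return 0.0
    hits = sum(1 for v in values if compiled.fullmatch(v))
    return hits / len(values)

# ICD-10-like shape: letter (not U), digit, alphanumeric, optional dotted suffix
ICD10_LIKE = r'[A-TV-Z][0-9][A-Z0-9](?:\.[A-Z0-9]{1,4})?'
sample_codes = ['C34.1', 'I10', 'E11.9', '1234', 'XYZ']
rate = format_match_rate(sample_codes, ICD10_LIKE)
print(rate)
```

A format match only shows that a string is shaped like a code; it does not confirm that the code exists in the vocabulary.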

In [None]:
# OMOP: illustrate "low" on OpenFDA-like structure
omop_like = pd.DataFrame({
    'drug_exposure_id': pd.Series(dtype='Int64'),
    'person_id': pd.Series(dtype='Int64'),
    'drug_concept_id': pd.Series(dtype='Int64'),
    'drug_exposure_start_date': pd.Series(dtype='string'),
})
omop_score = omop_conformity(omop_like, table='drug_exposure')

# Synthetic high-conformity OMOP (for comparison, not used in final aggregate)
omop_synth = pd.DataFrame({
    'drug_exposure_id': range(1, 6),
    'person_id': [101,102,103,104,105],
    'drug_concept_id': [19019073,19019073,1539403,1547504,1539403],
    'drug_exposure_start_date': pd.date_range('2024-01-01', periods=5).astype(str)
})
omop_synth_score = omop_conformity(omop_synth, table='drug_exposure')

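# FHIR: minimal AdverseEvent bundle (synthetic).
# FHIR R4 names the recording-date element `recordedDate`; `dateRecorded` is kept here
# because the keys checked by fhir_conformity are not shown in this notebook.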
fhir_bundle = [
    {'resourceType':'AdverseEvent','id':'ae1','subject':{'reference':'Patient/1'},'dateRecorded':'2024-10-01'},
    {'resourceType':'AdverseEvent','id':'ae2','subject':{'reference':'Patient/2'},'dateRecorded':'2024-11-15'},
]
fhir_score = fhir_conformity(fhir_bundle, resource_type='AdverseEvent')

# Vocabulary proxy: ICD-10-like format check
pattern_icd10 = r'^[A-TV-Z][0-9][A-Z0-9](?:\.[A-Z0-9]{1,4})?$'
df_codes = pd.DataFrame({'diagnosis_code': ['C34.1','I10','E11.9','1234','XYZ']})
vocab_score = vocabulary_format_score(df_codes, 'diagnosis_code', pattern_icd10)

# Aggregate to "standards"
standards = standards_score({'omop': omop_score, 'fhir': fhir_score, 'vocab': vocab_score})
omop_score, omop_synth_score, fhir_score, vocab_score, standards

## 4) ClinicalTrials.gov portfolio context (reuses saved summary if present)

This section computes basic **portfolio proxies**:
- Median **duration** (months)
- Median **enrollment**

If `data/ctgov_summary.json` exists (from your other notebook), we use it; otherwise we fetch from the API directly.
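
The cache-or-fetch pattern described here can be sketched as a small helper. `load_or_fetch` and `stub_fetch` are hypothetical names introduced for this example, and the stub returns a made-up placeholder record instead of calling the network.

```python
import json
import tempfile
from pathlib import Path

def load_or_fetch(cache_path, fetch_fn):
    """Return cached JSON if the file exists; otherwise call fetch_fn() and cache its result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    data = fetch_fn()
    cache_path.write_text(json.dumps(data))
    return data

# Demonstration with a stub fetcher (no network access).
calls = []

def stub_fetch():
    calls.append(1)
    return [{"NCTId": ["NCT00000000"]}]  # made-up placeholder record

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "ctgov_summary.json"
    first_result = load_or_fetch(cache, stub_fetch)   # cache miss: calls stub_fetch
    second_result = load_or_fetch(cache, stub_fetch)  # cache hit: reads the file
```

Because a cached file is always preferred, delete it whenever the query or field list changes; otherwise stale results are silently reused.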

In [None]:
ctgov_cache = DATA_DIR / "ctgov_summary.json"

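def fetch_ctgov(expr="oncology", max_rnk=300):
    # NOTE: this is the classic ClinicalTrials.gov API, which has been retired in favor
    # of the v2 API (https://clinicaltrials.gov/api/v2/studies). If the request fails,
    # the caller below falls back to an empty list.
    BASE = "https://clinicaltrials.gov/api/query/study_fields"
    params = {
        "expr": expr,
        "fields": "NCTId,OverallStatus,StartDate,CompletionDate,EnrollmentCount",
        "min_rnk": 1,
        "max_rnk": max_rnk,
        "fmt": "json"
    }
    r = requests.get(BASE, params=params, timeout=30)
    r.raise_for_status()
    return r.json()["StudyFieldsResponse"]["StudyFields"]

def to_dt(s):
    """Parse a CT.gov date string; return None if no known format matches."""
    if not s:
        return None
    # "%B %d, %Y" covers full dates such as "March 3, 2019"
    for fmt in ("%B %Y", "%B %d, %Y", "%Y-%m", "%Y-%m-%d", "%Y"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    return None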

if ctgov_cache.exists():
    ct = json.loads(ctgov_cache.read_text())
else:
    try:
        ct = fetch_ctgov("oncology", 300)
        ctgov_cache.write_text(json.dumps(ct))
    except Exception as e:
        print("CT.gov fetch failed:", e)
        ct = []

def first(x):
    """Return the first element of a non-empty list, else None."""
    return x[0] if isinstance(x, list) and x else None

rows = []
for rec in ct:
    start = to_dt(first(rec.get("StartDate")))
    comp = to_dt(first(rec.get("CompletionDate")))
    dur_m = (comp - start).days/30.4 if (start and comp) else None
    enroll = first(rec.get("EnrollmentCount"))
    enroll = int(enroll) if enroll and str(enroll).isdigit() else None
    rows.append({"duration_months": dur_m, "enrollment": enroll})

# Name the columns explicitly so an empty result does not raise KeyError below
df_ct = pd.DataFrame(rows, columns=["duration_months", "enrollment"])
med_duration = float(df_ct['duration_months'].dropna().median()) if df_ct['duration_months'].notna().any() else None
med_enroll = int(df_ct['enrollment'].dropna().median()) if df_ct['enrollment'].notna().any() else None
med_duration, med_enroll

### Context score (feasibility proxy)

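We convert portfolio context into a **0–1 score**:
- **Median duration**: full score up to 24 months, then a linear decline that reaches 0 at 72 months  
- **Median enrollment**: highest near 300 participants, falling linearly to 0 at a distance of 500 from that center (triangular shape)  

A missing median falls back to a neutral 0.5. These are *illustrative* heuristics to show how portfolio signals can feed governance.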
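
Written out as formulas, with $d$ the median duration in months and $n$ the median enrollment, the scoring in the next cell is as follows. The constants 24, 48, 300, 500 and the 0.6/0.4 weights are the illustrative choices used in that cell, not calibrated values.

$$
s_{\text{dur}} = \operatorname{clamp}\!\left(1 - \frac{\max(0,\; d - 24)}{48}\right), \qquad
s_{\text{enr}} = \operatorname{clamp}\!\left(1 - \frac{\lvert n - 300 \rvert}{500}\right)
$$

$$
\text{context}_{\text{CT.gov}} = 0.6\, s_{\text{dur}} + 0.4\, s_{\text{enr}}, \qquad \operatorname{clamp}(x) = \min(1, \max(0, x))
$$

For example, with hypothetical medians $d = 36$ and $n = 450$: $s_{\text{dur}} = 1 - 12/48 = 0.75$, $s_{\text{enr}} = 1 - 150/500 = 0.70$, so the score is $0.6 \times 0.75 + 0.4 \times 0.70 = 0.73$.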

In [None]:
def clamp(x, lo=0.0, hi=1.0):
    return max(lo, min(hi, x))

def context_from_ctgov(med_duration, med_enroll):
    # Duration: full score up to 24 months, then a linear decline to 0 at 72 months
    if med_duration is None:
        s_dur = 0.5  # neutral fallback when no duration is available
    else:
        s_dur = clamp(1.0 - max(0.0, (med_duration - 24.0)) / 48.0)

    # Enrollment: triangular score centered on 300, reaching 0 at a distance of 500
    if med_enroll is None:
        s_enr = 0.5  # neutral fallback when no enrollment is available
    else:
        center, width = 300.0, 500.0
        s_enr = clamp(1.0 - abs(med_enroll - center) / width)

    return 0.6 * s_dur + 0.4 * s_enr

ctgov_context = context_from_ctgov(med_duration, med_enroll)
ctgov_context

## 5) CDC public-health timeliness (reuses saved summary if present)

This section computes a **timeliness proxy** from CDC data:
- Share of records in the last **14 days**

If `data/cdc_summary.json` exists, we reuse it; otherwise we fetch from the CDC API directly.
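
The timeliness proxy can be sketched without pandas as follows. `recent_share` is a hypothetical helper for this example, the sample dates are made up, and `now` is pinned to a fixed value so the result is reproducible; all timestamps are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

def recent_share(date_strings, now, window_days=14):
    """Fraction of records dated within `window_days` before `now`.

    Unparseable dates stay in the denominator and count as not recent.
    Returns None when there are no records at all.
    """
    if not date_strings:
        return None
    cutoff = now - timedelta(days=window_days)
    recent = 0
    for s in date_strings:
        try:
            # Assumption: timestamps are UTC
            d = datetime.fromisoformat(s).replace(tzinfo=timezone.utc)
        except (TypeError, ValueError):
            continue
        if d >= cutoff:
            recent += 1
    return recent / len(date_strings)

fixed_now = datetime(2024, 6, 15, tzinfo=timezone.utc)
sample_dates = ["2024-06-10", "2024-06-03", "2024-05-01", "not-a-date"]
share = recent_share(sample_dates, fixed_now)
print(share)
```

Keeping both sides of the comparison timezone-aware avoids the errors that arise when naive and aware datetimes are compared.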

In [None]:
cdc_cache = DATA_DIR / "cdc_summary.json"

def fetch_cdc(limit=5000):
    URL = "https://data.cdc.gov/resource/9mfq-cb36.json"
    params = {"$limit": limit, "$select": "submission_date,state,tot_cases,conf_cases,prob_cases,new_case"}
    r = requests.get(URL, params=params, timeout=30)
    r.raise_for_status()
    return r.json()

if cdc_cache.exists():
    raw = json.loads(cdc_cache.read_text())
else:
    try:
        raw = fetch_cdc(5000)
        cdc_cache.write_text(json.dumps(raw))
    except Exception as e:
        print("CDC fetch failed:", e)
        raw = []

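df_cdc = pd.DataFrame(raw)
if not df_cdc.empty and "submission_date" in df_cdc.columns:
    # Parse as UTC so both sides of the comparison are tz-aware
    # (comparing a tz-naive column with a tz-aware timestamp raises TypeError)
    df_cdc['submission_date'] = pd.to_datetime(df_cdc['submission_date'], errors='coerce', utc=True)
    cutoff = pd.Timestamp.now(tz='UTC') - pd.Timedelta(days=14)
    cdc_timeliness = float((df_cdc['submission_date'] >= cutoff).mean())
else:
    cdc_timeliness = 0.5  # neutral fallback when no CDC data is available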

cdc_timeliness

## 6) Combine to **Standards** and **Context** components

- **Standards**: OMOP + FHIR + vocab (already aggregated)
- **Context**: Blend **CT.gov** (feasibility proxy) and **CDC** (external timeliness) → a single context score

In [None]:
def context_score(ctgov_context, cdc_timeliness, w_ctgov=0.6, w_cdc=0.4):
    return float(w_ctgov*float(ctgov_context or 0.0) + w_cdc*float(cdc_timeliness or 0.0))

context = context_score(ctgov_context, cdc_timeliness)
standards, context

## 7) Final metrics + Expanded Governance Scorecard

Weights (illustrative, tune per org):
- Completeness 0.25
- Consistency 0.15
- Timeliness 0.15
- Conformity 0.10
- **Standards 0.20**
- **Context 0.15**
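
A minimal sketch of how these weights combine into one number, with a guard that they sum to 1. `weighted_score` is a hypothetical helper and the component values are made up for illustration; a missing component is treated as 0.0.

```python
WEIGHTS = {
    "completeness": 0.25,
    "consistency": 0.15,
    "timeliness": 0.15,
    "conformity": 0.10,
    "standards": 0.20,
    "context": 0.15,
}

def weighted_score(metrics, weights):
    """Weighted sum of 0-1 component scores; missing components count as 0.0."""
    total = sum(weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"weights must sum to 1.0, got {total}")
    return sum(w * float(metrics.get(k, 0.0)) for k, w in weights.items())

# Made-up component scores, for illustration only
example_metrics = {
    "completeness": 0.9, "consistency": 0.8, "timeliness": 0.7,
    "conformity": 1.0, "standards": 0.6, "context": 0.5,
}
overall_example = weighted_score(example_metrics, WEIGHTS)
print(round(overall_example, 3))
```

Checking the sum up front keeps the overall score on the same 0–1 scale as its components when the weights are retuned.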

In [None]:
metrics_all = {
    **basic,
    "standards": standards,
    "context": context
}

# Custom plot to include the new "Context" bar
labels = ["Completeness","Consistency","Timeliness","Conformity","Standards","Context"]
vals = [metrics_all.get(k,0) for k in ["completeness","consistency","timeliness","conformity","standards","context"]]

fig, ax = plt.subplots(figsize=(8,4))
ax.bar(labels, vals)
ax.set_ylim(0,1)
ax.set_title("Expanded RWD Governance Scorecard")
for i, v in enumerate(vals):
    ax.text(i, min(0.98, v + 0.02), f"{v:.2f}", ha="center", va="bottom")
plt.tight_layout()

# Weighted overall score, defined here instead of reusing score() so the weights can include "context"
def score_with_context(m):
    weights = {"completeness":0.25,"consistency":0.15,"timeliness":0.15,"conformity":0.10,"standards":0.20,"context":0.15}
    return sum(weights[k]*float(m.get(k,0.0)) for k in weights)

overall = score_with_context(metrics_all)
metrics_all, overall

**Interpretation**
- **Standards** captures interoperability readiness (OMOP/FHIR/vocab).  
- **Context** injects real-world feasibility and external timeliness into governance, grounding go/no-go or investment decisions.  
- You can tune weights by therapy area, geography, and evidence strategy.

**Next**
- Replace regex vocab proxies with real code-set coverage (RxNorm/SNOMED) from cached dictionaries.  
- Swap the synthetic FHIR bundle for actual resources; add schema validation.  
- Persist summaries from the CT.gov and CDC notebooks into `data/` so this runs offline.