
# 🚀 Quick Start

Follow these steps to run the capstone agent and produce **`submission.csv`**.

## 1) Environment
- Python 3.10+
- Recommended packages (install as needed):
  ```bash
  pip install nbformat langgraph
  ```
  > *LangGraph is optional; the notebook runs without it.*

## 2) Data Layout
Create a `data/` folder next to this notebook with the following files:
- `validation_records.json`
- `test_records.json`
- `insurance_policies.json`
- `reference_codes.json`
- `validation_reference_results.csv` *(used only for validation comparison)*

Your structure should look like:
```
./code.ipynb
./data/validation_records.json
./data/test_records.json
./data/insurance_policies.json
./data/reference_codes.json
./data/validation_reference_results.csv
```

## 3) Run Order
1. **Setup & Utilities** — verifies paths.
2. **System Instruction Prompt** — shows the enforced format.
3. **Data Classes & Helpers** — structures.
4. **Tool Implementations** — three tools required by the spec.
5. **Agent Runner (Deterministic Baseline)** — minimal working pipeline.
6. *(Optional)* **Single-Agent ReAct via LangGraph** — compiles if installed.
7. **Load Datasets** — checks for files in `data/`.
8. **Run on Validation Records** — generates decisions for validation set.
9. **Compare with Human Reference** — prints accuracy/mismatches.
10. **Final Run – Generate `submission.csv`** — runs on test set and writes `submission.csv`.

## 4) Expected Outputs
- A preview of validation results in the notebook.
- Accuracy vs. `validation_reference_results.csv`.
- **`submission.csv`** in the same directory as this notebook, with columns:
  - `patient_id`
  - `generated_response` (two lines: `Decision: ...` and `Reason: ...`)
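
To confirm that a written file matches the format above, a small check along these lines can help. This is only a sketch: `check_submission` is a hypothetical helper that is not part of the notebook, and the in-memory sample stands in for a real `submission.csv`.

```python
import csv
import io

EXPECTED_COLUMNS = ["patient_id", "generated_response"]

def check_submission(handle) -> list:
    """Return a list of format problems found in a submission CSV (empty if none)."""
    reader = csv.DictReader(handle)
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {reader.fieldnames}")
        return problems
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        lines = row["generated_response"].splitlines()
        well_formed = (
            len(lines) == 2
            and lines[0].startswith("Decision: ")
            and lines[1].startswith("Reason: ")
        )
        if not well_formed:
            problems.append(f"row {row_number}: expected 'Decision: ...' then 'Reason: ...'")
    return problems

# Hypothetical two-row file held in memory instead of on disk.
sample = io.StringIO(
    'patient_id,generated_response\r\n'
    'P001,"Decision: APPROVE\nReason: Meets all criteria."\r\n'
    'P002,Decision: APPROVE\r\n'
)
problems = check_submission(sample)
print(problems)
```

For a file on disk, the same function can be given `open('submission.csv', newline='', encoding='utf-8')`.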

## 5) Notes
- If required data is missing or an attribute such as age is absent, the agent returns **ROUTE FOR REVIEW** with a clear reason.
- The LangGraph section is optional; the deterministic baseline already satisfies the workflow and output format.



# Capstone 2 – Healthcare Claim Coverage Agent (`code.ipynb`)

This notebook implements a **single-agent, ReAct-style** claim coverage system for the healthcare insurance capstone.  
It follows the checklist requirements and produces `submission.csv` from `test_records.json`.

**Inputs expected in `./data/`:**
- `validation_records.json` — used to develop/evaluate the system
- `test_records.json` — used to generate the final submission
- `insurance_policies.json` — policy rules
- `reference_codes.json` — code normalization / reference data
- `validation_reference_results.csv` — human reference for validation comparison (decision strings)
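
The dataset schema is not documented here, but the tool code in this notebook reads a specific set of keys. The sketch below shows one record and one policy in the shape those reads imply: records as a JSON array of objects, policies as an object keyed by policy id. All values (`P001`, `POL1`, the codes) are invented for illustration, and the real files may carry additional fields.

```python
import json

# Hypothetical record: only the keys that summarize_patient_record and
# run_on_records look up. Values are invented for illustration.
example_record = {
    "patient_id": "P001",
    "age": 54,
    "gender": "F",
    "diagnoses": ["E11.9"],
    "procedures": ["99213"],
    "claim": {
        "policy_id": "POL1",
        "procedure_code": "99213",
        "preauthorization_id": None,
    },
}

# Hypothetical policy file: a mapping from policy id to its rules, using the
# keys that summarize_policy_guideline reads.
example_policies = {
    "POL1": {
        "covered_procedure_codes": ["99213"],
        "diagnosis_requirements": ["E11"],
        "min_age": 18,
        "max_age": 65,
        "gender_restriction": None,
        "preauthorization_required": False,
    }
}

# The records file is iterated directly, so it is assumed to be a JSON array.
records_json = json.dumps([example_record], indent=2)
policies_json = json.dumps(example_policies, indent=2)
print(records_json)
```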



## Checklist Mapping (from requirements)

- Notebook format with headings & comments ✅  
- Dataset loading for all inputs ✅  
- Three tools implemented:  
  - `summarize_patient_record(record_str)` ✅  
  - `summarize_policy_guideline(policy_id)` ✅  
  - `check_claim_coverage(record_summary, policy_summary)` ✅  
- System instruction prompt & enforced output format (Decision + Reason) ✅  
- Single ReAct Agent via **LangGraph** (guarded import; runs if available) ✅  
- Validation: compare our outputs vs. `validation_reference_results.csv` ✅  
- Final output: `submission.csv` with columns `patient_id`, `generated_response` ✅


## Setup & Utilities

In [None]:

import json, csv, sys
from dataclasses import dataclass
from typing import Dict, Any, List
from pathlib import Path

ROOT = Path('.').resolve()
DATA = ROOT / 'data'

def load_json(path: Path):
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)

def exists(path: Path) -> bool:
    try:
        path = Path(path)
        return path.exists()
    except Exception:
        return False

print('Working directory:', ROOT)
print('Data directory:', DATA)


## System Instruction Prompt

In [None]:

SYSTEM_PROMPT = """You are an insurance claim coverage agent.
You must use the following workflow and tools strictly in order:
1) summarize_patient_record(record_str) -> Patient Summary
2) summarize_policy_guideline(policy_id) -> Policy Summary
3) check_claim_coverage(record_summary, policy_summary) -> Decision & Reason

Return final output exactly as:
Decision: APPROVE or ROUTE FOR REVIEW
Reason: <one concise sentence>
"""
print(SYSTEM_PROMPT)
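
Because the prompt fixes the output to two lines, a response can be checked mechanically. The following is a minimal sketch; `is_valid_response` and `RESPONSE_PATTERN` are hypothetical names and are not part of the notebook.

```python
import re

# Exactly two lines: a decision from the allowed set, then a non-empty reason.
RESPONSE_PATTERN = re.compile(r"Decision: (APPROVE|ROUTE FOR REVIEW)\nReason: \S[^\n]*")

def is_valid_response(text: str) -> bool:
    """True if text matches the Decision/Reason format given in SYSTEM_PROMPT."""
    return RESPONSE_PATTERN.fullmatch(text) is not None

print(is_valid_response("Decision: APPROVE\nReason: Meets all criteria."))  # True
print(is_valid_response("Decision: DENY\nReason: Not covered."))            # False
print(is_valid_response("Decision: APPROVE"))                               # False
```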


## Data Classes & Helpers

In [None]:

from dataclasses import dataclass

@dataclass
class PatientSummary:
    patient_id: str
    age: int | None
    gender: str | None
    diagnoses: List[str]
    procedures: List[str]
    claim_procedure_code: str | None
    preauth_id: str | None

@dataclass
class PolicySummary:
    policy_id: str
    covered_procedure_codes: List[str]
    diagnosis_requirements: List[str]
    min_age: int | None
    max_age: int | None
    gender_restriction: str | None
    preauth_required: bool

def normalize_code(code: str | None) -> str | None:
    if not code:
        return None
    return code.strip().upper().replace('.', '')
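
A quick illustration of what `normalize_code` does to typical inputs. The function is repeated here only so the example runs on its own; the sample codes are arbitrary.

```python
from typing import Optional

def normalize_code(code: Optional[str]) -> Optional[str]:
    """Same logic as the notebook helper, repeated so the example runs alone."""
    if not code:
        return None
    return code.strip().upper().replace('.', '')

print(normalize_code(" e11.9 "))  # E119
print(normalize_code("99213"))    # 99213
print(normalize_code(""))         # None
print(normalize_code(None))       # None
```

Stripping the dot means `E11.9` and `E119` compare equal after normalization, which is what lets patient and policy codes written in different styles match.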


## Tool Implementations

In [None]:

from typing import Optional

def summarize_patient_record(record_str: Dict[str, Any], ref_codes: Dict[str, Any]) -> PatientSummary:
    """Summarize one patient record and its claim into a normalized structure.

    `record_str` is the already-parsed record dict (the name follows the spec).
    `ref_codes` is accepted so reference data can be passed in, but this
    baseline does not use it.
    """
    pid = str(record_str.get("patient_id"))
    age = record_str.get("age")
    gender = (record_str.get("gender") or "").upper() or None

    # `or` guards against keys that are present but null in the JSON.
    claim = record_str.get("claim") or {}
    diagnoses = [normalize_code(d) for d in record_str.get("diagnoses") or []]
    procedures = [normalize_code(p) for p in record_str.get("procedures") or []]
    claim_proc = normalize_code(claim.get("procedure_code"))
    preauth_id = claim.get("preauthorization_id")

    return PatientSummary(
        patient_id=pid,
        age=age,
        gender=gender,
        diagnoses=diagnoses,
        procedures=procedures,
        claim_procedure_code=claim_proc,
        preauth_id=preauth_id,
    )

def summarize_policy_guideline(policy_id: str, policies: Dict[str, Any]) -> PolicySummary:
    """Summarize a policy guideline by id."""
    p = policies[str(policy_id)]
    return PolicySummary(
        policy_id=str(policy_id),
        covered_procedure_codes=[normalize_code(c) for c in p.get("covered_procedure_codes", [])],
        diagnosis_requirements=[normalize_code(d) for d in p.get("diagnosis_requirements", [])],
        min_age=p.get("min_age"),
        max_age=p.get("max_age"),
        gender_restriction=(p.get("gender_restriction") or None),
        preauth_required=bool(p.get("preauthorization_required", False)),
    )

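def _age_ok(age: Optional[int], min_age: Optional[int], max_age: Optional[int]) -> bool:
    # A missing age cannot be checked against a bound, so it fails whenever the
    # policy sets one; the claim is then routed for review, as the Notes describe.
    if age is None:
        return min_age is None and max_age is None
    if min_age is not None and age < min_age:
        return False
    if max_age is not None and age > max_age:
        return False
    return True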

def _gender_ok(gender: Optional[str], restriction: Optional[str]) -> bool:
    if not restriction:
        return True
    if not gender:
        return False
    return gender.upper().startswith(restriction.upper())

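def _diagnosis_ok(patient_dx: List[str], required_dx: List[str]) -> bool:
    # No requirement means there is nothing to check.
    if not required_dx:
        return True
    # Drop empty codes (normalize_code returns None for them) before matching.
    pset = {d for d in patient_dx if d}
    reqs = [r for r in required_dx if r]
    # Substring match: a required "E11" is satisfied by a patient code "E119".
    return any(req in d for d in pset for req in reqs)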

def check_claim_coverage(record_summary: PatientSummary, policy_summary: PolicySummary) -> Dict[str, str]:
    """Return {'Decision': 'APPROVE|ROUTE FOR REVIEW', 'Reason': '...'}"""
    reasons = []

    # 1) Procedure coverage
    if record_summary.claim_procedure_code not in policy_summary.covered_procedure_codes:
        reasons.append(f"Procedure {record_summary.claim_procedure_code} not covered by policy {policy_summary.policy_id}.")
        return {"Decision": "ROUTE FOR REVIEW", "Reason": " ; ".join(reasons)}

    # 2) Diagnosis requirement
    if not _diagnosis_ok(record_summary.diagnoses, policy_summary.diagnosis_requirements):
        reasons.append("Required diagnosis criteria not met.")
        return {"Decision": "ROUTE FOR REVIEW", "Reason": " ; ".join(reasons)}

    # 3) Age / Gender checks
    if not _age_ok(record_summary.age, policy_summary.min_age, policy_summary.max_age):
        if record_summary.age is None:
            reasons.append("Patient age is missing.")
        else:
            reasons.append(f"Age {record_summary.age} outside allowed range.")
        return {"Decision": "ROUTE FOR REVIEW", "Reason": " ; ".join(reasons)}

    if not _gender_ok(record_summary.gender, policy_summary.gender_restriction):
        reasons.append(f"Gender restriction {policy_summary.gender_restriction} not satisfied.")
        return {"Decision": "ROUTE FOR REVIEW", "Reason": " ; ".join(reasons)}

    # 4) Preauthorization
    if policy_summary.preauth_required and not record_summary.preauth_id:
        reasons.append("Preauthorization required but missing.")
        return {"Decision": "ROUTE FOR REVIEW", "Reason": " ; ".join(reasons)}

    return {"Decision": "APPROVE", "Reason": "Meets procedure, diagnosis, demographic, and preauthorization criteria."}
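
`check_claim_coverage` stops at the first failed check, so the order of the checks decides which reason is reported. The sketch below restates that short-circuit pattern in a compact, table-driven form. It is an illustration only, not the notebook's implementation: `first_failure`, the plain-dict inputs, and all sample values are hypothetical, and the gender check is left out for brevity.

```python
from typing import Callable, List, Optional, Tuple

def first_failure(checks: List[Tuple[Callable[[], bool], str]]) -> Optional[str]:
    """Return the reason attached to the first failing check, or None if all pass."""
    for passed, reason in checks:
        if not passed():
            return reason
    return None

# Hypothetical, already-normalized claim and policy values.
claim = {"procedure": "99213", "diagnoses": ["E119"], "age": 54, "preauth_id": None}
policy = {"covered": ["99213"], "required_dx": ["E11"],
          "min_age": 18, "max_age": 65, "preauth_required": True}

# Same order as the notebook: procedure, diagnosis, age, preauthorization.
checks = [
    (lambda: claim["procedure"] in policy["covered"],
     "Procedure not covered."),
    (lambda: any(req in dx for dx in claim["diagnoses"] for req in policy["required_dx"]),
     "Required diagnosis criteria not met."),
    (lambda: policy["min_age"] <= claim["age"] <= policy["max_age"],
     "Age outside allowed range."),
    (lambda: not policy["preauth_required"] or bool(claim["preauth_id"]),
     "Preauthorization required but missing."),
]

reason = first_failure(checks)
decision = "ROUTE FOR REVIEW" if reason else "APPROVE"
print(decision, "|", reason)
```

Here the first three checks pass and the preauthorization check fails, so that is the reason reported; any later check would not be evaluated.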


## Agent Runner (Deterministic Baseline)

In [None]:

def run_on_records(records_path: Path, policies_path: Path, refs_path: Path, out_csv: Path | None = None):
    records = load_json(records_path)
    policies = load_json(policies_path)
    refs = load_json(refs_path)

    results = []
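    for rec in records:
        ps = summarize_patient_record(rec, refs)
        policy_id = (rec.get("claim") or {}).get("policy_id")
        if str(policy_id) in policies:
            pol = summarize_policy_guideline(policy_id, policies)
            decision = check_claim_coverage(ps, pol)
        else:
            # Missing or unknown policy: route for review rather than raise KeyError.
            decision = {"Decision": "ROUTE FOR REVIEW", "Reason": f"Policy {policy_id} not found."}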
        generated_response = f"Decision: {decision['Decision']}\nReason: {decision['Reason']}"
        results.append({"patient_id": ps.patient_id, "generated_response": generated_response})

    if out_csv:
        with open(out_csv, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=['patient_id', 'generated_response'])
            writer.writeheader()
            writer.writerows(results)
    return results

print('Agent runner ready.')
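
Each `generated_response` contains a newline, so the CSV writer quotes that field and a single record spans two physical lines in the file. The sketch below writes one hypothetical row to an in-memory buffer to show this, then confirms that `csv.DictReader` restores the original string.

```python
import csv
import io

row = {"patient_id": "P001",
       "generated_response": "Decision: APPROVE\nReason: Meets all criteria."}

# newline="" mirrors the open(..., newline='') call in run_on_records.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["patient_id", "generated_response"])
writer.writeheader()
writer.writerow(row)

raw = buffer.getvalue()
print(repr(raw))

# Reading the text back yields the same two-line value in one field.
restored = list(csv.DictReader(io.StringIO(raw, newline="")))
print(restored[0] == row)
```

Anything that reads `submission.csv` line by line instead of through a CSV parser would see the `Reason:` line as a separate row, so a CSV-aware reader is the safer choice.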


## (Optional) Single-Agent ReAct via LangGraph

In [None]:

# The graph is built only if langgraph is installed; otherwise the except
# branch below reports why and the rest of the notebook still runs.
try:
    from typing import TypedDict
    from langgraph.graph import StateGraph, END

    class State(TypedDict):
        record: dict
        refs: dict
        policies: dict
        patient_summary: dict | None
        policy_summary: dict | None
        decision: dict | None

    def node_patient(state: State) -> State:
        state['patient_summary'] = summarize_patient_record(state['record'], state['refs']).__dict__
        return state

    def node_policy(state: State) -> State:
        pid = state['record']['claim']['policy_id']
        state['policy_summary'] = summarize_policy_guideline(pid, state['policies']).__dict__
        return state

    def node_decide(state: State) -> State:
        state['decision'] = check_claim_coverage(
            PatientSummary(**state['patient_summary']),
            PolicySummary(**state['policy_summary'])
        )
        return state

    def build_graph():
        g = StateGraph(State)
        g.add_node('patient', node_patient)
        g.add_node('policy', node_policy)
        g.add_node('decide', node_decide)
        g.set_entry_point('patient')
        g.add_edge('patient', 'policy')
        g.add_edge('policy', 'decide')
        g.add_edge('decide', END)
        return g.compile()

    graph = build_graph()
    print('LangGraph compiled.')
except Exception as e:
    print('LangGraph not available or failed to compile:', e)


## Load Datasets

In [None]:

paths = {
    'validation_records': DATA / 'validation_records.json',
    'test_records': DATA / 'test_records.json',
    'policies': DATA / 'insurance_policies.json',
    'refs': DATA / 'reference_codes.json',
    'validation_reference': DATA / 'validation_reference_results.csv',
}
for k, p in paths.items():
    print(f"{k}: {p} -> {'FOUND' if exists(p) else 'MISSING'}")


## Run on Validation Records

In [None]:

if all(exists(p) for p in [paths['validation_records'], paths['policies'], paths['refs']]):
    val_results = run_on_records(paths['validation_records'], paths['policies'], paths['refs'], out_csv=None)
    print('Validation sample:', val_results[:3])
else:
    print('Validation run skipped: one or more files missing.')


## Compare with Human Reference (`validation_reference_results.csv`)

In [None]:

def load_reference_csv(path: Path) -> dict:
    ref = {}
    with open(path, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for row in reader:
            ref[row['patient_id']] = row['generated_response'].strip()
    return ref

def extract_decision(resp: str) -> str:
    # Normalize to just 'APPROVE' or 'ROUTE FOR REVIEW'
    for line in resp.splitlines():
        if line.upper().startswith('DECISION:'):
            return line.split(':', 1)[1].strip().upper()
    return ''

if exists(paths['validation_reference']) and 'val_results' in globals():
    ref_map = load_reference_csv(paths['validation_reference'])
    total = 0
    correct = 0
    mismatches = []
    for r in val_results:
        pid = r['patient_id']
        ours_dec = extract_decision(r['generated_response'])
        ref_dec = extract_decision(ref_map.get(pid, ''))
        if ref_dec == '':
            continue
        total += 1
        if ours_dec == ref_dec:
            correct += 1
        else:
            mismatches.append((pid, ours_dec, ref_dec))

    print(f'Compared {total} validation rows. Accuracy: {correct}/{total} = {correct/total:.2%}' if total else 'No comparable rows.')
    if mismatches[:10]:
        print('Sample mismatches (up to 10):')
        for pid, ours, ref in mismatches[:10]:
            print('-', pid, '| ours:', ours, '| ref:', ref)
else:
    print('Reference comparison skipped: validation results or reference file missing.')
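
The comparison above looks only at the decision label, ignores case, and skips patients that have no reference row. A small worked example with invented rows makes the arithmetic concrete. `extract_decision` is repeated so the example runs on its own; the `ours` and `reference` dictionaries are hypothetical.

```python
def extract_decision(resp: str) -> str:
    """Same logic as the notebook helper, repeated so the example runs alone."""
    for line in resp.splitlines():
        if line.upper().startswith('DECISION:'):
            return line.split(':', 1)[1].strip().upper()
    return ''

# Hypothetical outputs and reference rows keyed by patient_id.
ours = {
    "P001": "Decision: APPROVE\nReason: Meets all criteria.",
    "P002": "Decision: ROUTE FOR REVIEW\nReason: Preauthorization required but missing.",
    "P003": "Decision: APPROVE\nReason: Meets all criteria.",
}
reference = {
    "P001": "decision: approve\nReason: Covered.",
    "P002": "Decision: APPROVE\nReason: Covered.",
    # P003 has no reference row, so it is left out of the accuracy total.
}

compared = [pid for pid in ours if extract_decision(reference.get(pid, ""))]
correct = sum(
    extract_decision(ours[pid]) == extract_decision(reference[pid]) for pid in compared
)
print(f"{correct}/{len(compared)}")  # 1/2
```

The reason text never enters the score, so two responses with the same decision but different reasons count as a match.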


## Final Run – Generate `submission.csv` from Test Records

In [None]:

out_path = Path('submission.csv')
if all(exists(p) for p in [paths['test_records'], paths['policies'], paths['refs']]):
    _ = run_on_records(paths['test_records'], paths['policies'], paths['refs'], out_csv=out_path)
    print('Wrote', out_path)
else:
    print('Test run skipped: one or more files missing.')
