# CRC UDS Metric Triage - LayoutLMv3 Model Prototype

This notebook simulates the **NextGen API workflow** where a patient's full 10-year chart is extracted and the model must:

1. **TRIAGE**: Identify which documents are CRC-relevant (colonoscopy, FIT, FOBT, etc.) vs non-CRC (mammograms, paps, gyn notes)
2. **EXTRACT**: Pull document type, dates, and results from CRC documents
3. **EVALUATE**: Apply UDS CRC numerator rules with appropriate lookback periods

**UDS CRC Numerator Rules (Lookback from Reporting Year End):**
- FOBT/FIT: Within reporting year only
- FIT-DNA (Cologuard): Within 3 years
- Sigmoidoscopy/CT Colonography: Within 5 years  
- Colonoscopy: Within 10 years
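
The lookback rules above can be sketched as a small table-driven helper. This is a minimal illustration rather than the notebook's own rule engine: `LOOKBACK_YEARS`, `lookback_window`, and `counts_toward_numerator` are hypothetical names, and each window is assumed to run from January 1 of the earliest covered year through December 31 of the reporting year.

```python
import datetime as dt
from typing import Tuple

# Total calendar years covered by each screening type, counting the
# reporting year itself (hypothetical table mirroring the rules listed above).
LOOKBACK_YEARS = {
    "FOBT": 1,
    "FIT": 1,
    "FIT_DNA": 3,
    "SIGMOIDOSCOPY": 5,
    "CT_COLONOGRAPHY": 5,
    "COLONOSCOPY": 10,
}

def lookback_window(event_type: str, reporting_year: int) -> Tuple[dt.date, dt.date]:
    """Return the (start, end) dates, inclusive, in which an event counts."""
    years = LOOKBACK_YEARS[event_type]
    start = dt.date(reporting_year - (years - 1), 1, 1)
    end = dt.date(reporting_year, 12, 31)
    return start, end

def counts_toward_numerator(event_type: str, event_date: dt.date, reporting_year: int) -> bool:
    """True if the event falls inside the window for its screening type."""
    if event_type not in LOOKBACK_YEARS:
        return False
    start, end = lookback_window(event_type, reporting_year)
    return start <= event_date <= end

for name in LOOKBACK_YEARS:
    start, end = lookback_window(name, 2025)
    print(f"{name}: {start} to {end}")
```

Under these assumptions a colonoscopy dated 2018-05-23 counts for reporting year 2025, while a FIT dated 2024-07-30 does not.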

# CRC UDS Triage with LayoutLMv3 Model (2025)

This notebook uses the **trained LayoutLMv3 model** to triage CRC documents and extract:
- Document type (Colonoscopy, FIT, FOBT, Sigmoidoscopy, CT Colonography)
- Procedure/Collection dates
- Test results (Positive/Negative)
- Clinical findings (polyps, diverticula, etc.)

It outputs per-patient CRC numerator status with model-based confidence scores.

In [1]:
# Imports for LayoutLMv3 model-based triage
import os
import sys
import glob
import datetime as dt
from dataclasses import dataclass, asdict
from typing import Optional, List, Dict, Any, Literal
from pathlib import Path

import pandas as pd
from dateutil.relativedelta import relativedelta

# Add project root to path
sys.path.insert(0, '/opt/UDS_LayoutLM')

# Import our trained model inference
from src.inference import UDSExtractor
from src.processor import PDFProcessor

print("✓ Imports successful")

✓ Imports successful


In [5]:
# Configuration
MODEL_PATH = "models/crc_triage/final_model"
DEVICE = "cpu"  # Use CPU (this GPU's sm_61 compute capability is not supported by the installed PyTorch build)
CONFIDENCE_THRESHOLD = 0.5
REPORTING_YEAR = 2025

# USB drive path - simulating NextGen API extraction of patient's FULL 10-year chart
# This includes ALL document types (mammograms, paps, gyn notes, colonoscopy, FIT, etc.)
USB_PDF_DIR = "/media/jloyamd/UBUNTU 25_1/new_pdf_01_16"

# Get ALL PDFs (simulating full patient chart - NOT filtered by CRC keywords)
PDF_PATHS = sorted(glob.glob(os.path.join(USB_PDF_DIR, "*.pdf")))

print(f"📁 Simulating NextGen API: Loaded {len(PDF_PATHS)} documents from patient's 10-year chart")
print("=" * 70)
for p in PDF_PATHS:
    print(f"  📄 {os.path.basename(p)}")
print("=" * 70)
print("\n⚠️  Model must TRIAGE to find only CRC-relevant documents from this mixed chart")

📁 Simulating NextGen API: Loaded 28 documents from patient's 10-year chart
  📄 104918_colonoscopy.pdf
  📄 104918_mmg.pdf
  📄 146747_mmg.pdf
  📄 147332_mmg.pdf
  📄 147332_pap.pdf
  📄 171516_pap.pdf
  📄 171561_colonoscopy.pdf
  📄 171561_mmg.pdf
  📄 1928298_mmg.pdf
  📄 1928298_pap.pdf
  📄 1936610_colonoscopy.pdf
  📄 1936610_mmg.pdf
  📄 1944227_colonoscopy.pdf
  📄 1944227_gyn_note.pdf
  📄 1944227_mmg.pdf
  📄 1945734_colonoscopy.pdf
  📄 1945734_mmg.pdf
  📄 1945734_pap.pdf
  📄 1946500_pap.pdf
  📄 1947752_ifobt.pdf
  📄 1947752_mmg.pdf
  📄 1947752_pap.pdf
  📄 1948064_mmg.pdf
  📄 1948064_pap.pdf
  📄 197868_mmg.pdf
  📄 57919_fit.pdf
  📄 57919_mmg.pdf
  📄 59208_mmg.pdf

⚠️  Model must TRIAGE to find only CRC-relevant documents from this mixed chart
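
The file names above carry several different numeric prefixes (104918, 171561, 57919, and so on), although this notebook treats the whole folder as one test patient's chart. If per-patient evaluation is wanted later, one option is to group documents by that prefix. A minimal sketch, assuming the text before the first underscore is a patient or chart identifier (the source does not say what the prefix means); `group_by_prefix` and the `/data/` paths are hypothetical.

```python
import os
from collections import defaultdict
from typing import Dict, List

def group_by_prefix(pdf_paths: List[str]) -> Dict[str, List[str]]:
    """Group PDF file names by the text before the first underscore."""
    groups = defaultdict(list)
    for path in pdf_paths:
        name = os.path.basename(path)
        prefix = name.split("_", 1)[0]  # assumed to be a chart identifier
        groups[prefix].append(name)
    return dict(groups)

# Illustrative paths only; the directory is made up for this sketch.
sample_paths = [
    "/data/104918_colonoscopy.pdf",
    "/data/104918_mmg.pdf",
    "/data/57919_fit.pdf",
    "/data/57919_mmg.pdf",
]
charts = group_by_prefix(sample_paths)
for prefix, names in charts.items():
    print(prefix, names)
```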


In [3]:
# Initialize the LayoutLMv3 extractor
print("Loading LayoutLMv3 CRC Triage model...")
extractor = UDSExtractor(
    model_path=MODEL_PATH,
    device=DEVICE,
    confidence_threshold=CONFIDENCE_THRESHOLD,
    labels_module="src.labels_crc_triage"
)
print("✓ Model loaded")

Loading LayoutLMv3 CRC Triage model...
CRC Triage labels loaded: 51 labels
Loading model from models/crc_triage/final_model...
Model loaded. Using device: cpu
Labels: 51 classes from src.labels_crc_triage
✓ Model loaded


In [6]:
# Extract entities from all PDFs using the trained model
# The model will TRIAGE - identifying CRC-relevant vs non-CRC documents
import re
from collections import defaultdict

DATE_PAT = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})')

# CRC document types that the model was trained to detect
CRC_DOC_TYPES = {"DOC_TYPE_COLONOSCOPY", "DOC_TYPE_FIT", "DOC_TYPE_FOBT", 
                 "DOC_TYPE_SIGMOIDOSCOPY", "DOC_TYPE_CT_COLONOGRAPHY"}

def parse_date_from_text(text: str) -> Optional[str]:
    """Extract first valid date from text and return as ISO format."""
    m = DATE_PAT.search(text or "")
    if not m:
        return None
    mm, dd, yy = map(int, m.groups())
    if yy < 100:
        yy = 2000 + yy if yy < 50 else 1900 + yy
    try:
        return dt.date(yy, mm, dd).isoformat()
    except ValueError:
        # Out-of-range month/day (e.g. 13/45/2020) is not a valid date
        return None

def map_doc_type_to_event(entity_type: str) -> Optional[str]:
    """Map model entity types to UDS event types."""
    mapping = {
        "DOC_TYPE_COLONOSCOPY": "COLONOSCOPY",
        "DOC_TYPE_FIT": "FIT",
        "DOC_TYPE_FOBT": "FOBT",
        "DOC_TYPE_SIGMOIDOSCOPY": "SIGMOIDOSCOPY",
        "DOC_TYPE_CT_COLONOGRAPHY": "CT_COLONOGRAPHY",
    }
    return mapping.get(entity_type)

@dataclass
class ModelExtraction:
    """Extraction result from model."""
    pdf_path: str
    is_crc_relevant: bool  # TRIAGE result
    event_type: Optional[str]
    event_date: Optional[str]
    result: Optional[str]  # POSITIVE, NEGATIVE, or None
    confidence: float
    entities: List[Dict]
    needs_review: bool
    evidence: Dict[str, Any]

def extract_from_pdf_with_model(pdf_path: str) -> ModelExtraction:
    """Use LayoutLMv3 model to TRIAGE and extract CRC metrics from a PDF."""
    try:
        result = extractor.extract_from_pdf(pdf_path)
        entities = result.entities
    except Exception as e:
        print(f"    ⚠️ Error: {e}")
        return ModelExtraction(
            pdf_path=pdf_path,
            is_crc_relevant=False,
            event_type=None,
            event_date=None,
            result=None,
            confidence=0.0,
            entities=[],
            needs_review=True,
            evidence={"error": str(e)}
        )
    
    # Aggregate entities by type
    entity_dict = defaultdict(list)
    for ent in entities:
        entity_dict[ent.entity_type].append(ent)
    
    # TRIAGE: Check if any CRC-relevant document type was detected with high confidence
    doc_type = None
    doc_type_conf = 0.0
    for etype in CRC_DOC_TYPES:
        if etype in entity_dict:
            best = max(entity_dict[etype], key=lambda e: e.confidence)
            if best.confidence > doc_type_conf:
                doc_type = map_doc_type_to_event(etype)
                doc_type_conf = best.confidence
    
    # Determine if CRC-relevant based on model detection.
    # The 0.6 cutoff here is hardcoded and separate from CONFIDENCE_THRESHOLD (0.5),
    # which this notebook passes to the extractor.
    is_crc_relevant = doc_type is not None and doc_type_conf >= 0.6
    
    # Extract date based on document type (only if CRC-relevant)
    event_date = None
    date_conf = 0.0
    if is_crc_relevant:
        if doc_type == "COLONOSCOPY":
            if "PROCEDURE_DATE" in entity_dict:
                for ent in entity_dict["PROCEDURE_DATE"]:
                    parsed = parse_date_from_text(ent.text)
                    if parsed and ent.confidence > date_conf:
                        event_date = parsed
                        date_conf = ent.confidence
        else:
            for date_type in ["COLLECTION_DATE", "PROCEDURE_DATE"]:
                if date_type in entity_dict:
                    for ent in entity_dict[date_type]:
                        parsed = parse_date_from_text(ent.text)
                        if parsed and ent.confidence > date_conf:
                            event_date = parsed
                            date_conf = ent.confidence
    
    # Extract result (POSITIVE/NEGATIVE)
    test_result = None
    result_conf = 0.0
    if "RESULT_POSITIVE" in entity_dict:
        best = max(entity_dict["RESULT_POSITIVE"], key=lambda e: e.confidence)
        if best.confidence > result_conf:
            test_result = "POSITIVE"
            result_conf = best.confidence
    if "RESULT_NEGATIVE" in entity_dict:
        best = max(entity_dict["RESULT_NEGATIVE"], key=lambda e: e.confidence)
        if best.confidence > result_conf:
            test_result = "NEGATIVE"
            result_conf = best.confidence
    
    # Calculate overall confidence
    overall_conf = (doc_type_conf + date_conf) / 2 if event_date else doc_type_conf * 0.7
    
    # Determine if needs review
    needs_review = is_crc_relevant and (event_date is None or overall_conf < 0.7)
    
    # Build evidence
    evidence = {
        "file": os.path.basename(pdf_path),
        "num_pages": getattr(result, "num_pages", 1),
        "entities_found": len(entities),
        "doc_type_confidence": doc_type_conf,
        "date_confidence": date_conf,
        "crc_entities": [k for k in entity_dict.keys() if k.startswith("DOC_TYPE_") or 
                         k in ["PROCEDURE_DATE", "COLLECTION_DATE", "RESULT_POSITIVE", "RESULT_NEGATIVE"]],
        "entity_summary": {k: [{"text": e.text, "conf": e.confidence} for e in v[:2]] 
                          for k, v in list(entity_dict.items())[:6]}
    }
    
    return ModelExtraction(
        pdf_path=pdf_path,
        is_crc_relevant=is_crc_relevant,
        event_type=doc_type if is_crc_relevant else None,
        event_date=event_date,
        result=test_result if is_crc_relevant else None,
        confidence=overall_conf if is_crc_relevant else doc_type_conf,
        entities=[asdict(e) for e in entities],
        needs_review=needs_review,
        evidence=evidence
    )

# Process ALL PDFs (simulating full patient chart from NextGen API)
print(f"\n🔍 TRIAGE: Processing {len(PDF_PATHS)} documents from patient chart...\n")
print("=" * 80)
extractions = []
crc_count = 0
non_crc_count = 0

for pdf_path in PDF_PATHS:
    filename = os.path.basename(pdf_path)
    extraction = extract_from_pdf_with_model(pdf_path)
    extractions.append(extraction)
    
    if extraction.is_crc_relevant:
        crc_count += 1
        print(f"✅ CRC: {filename}")
        print(f"       Type: {extraction.event_type} | Date: {extraction.event_date or 'N/A'} | "
              f"Result: {extraction.result or 'N/A'} | Conf: {extraction.confidence:.0%}")
    else:
        non_crc_count += 1
        print(f"⬜ NON-CRC: {filename} (not relevant to CRC metric)")

print("=" * 80)
print(f"\n📊 TRIAGE SUMMARY:")
print(f"   Total documents: {len(PDF_PATHS)}")
print(f"   CRC-relevant: {crc_count} documents")
print(f"   Non-CRC: {non_crc_count} documents (filtered out)")


🔍 TRIAGE: Processing 28 documents from patient chart...

✅ CRC: 104918_colonoscopy.pdf
       Type: COLONOSCOPY | Date: 2018-05-23 | Result: NEGATIVE | Conf: 99%
⬜ NON-CRC: 104918_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 146747_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 147332_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 147332_pap.pdf (not relevant to CRC metric)
✅ CRC: 171516_pap.pdf
       Type: COLONOSCOPY | Date: N/A | Result: NEGATIVE | Conf: 47%
✅ CRC: 171561_colonoscopy.pdf
       Type: COLONOSCOPY | Date: 2022-05-26 | Result: NEGATIVE | Conf: 99%
⬜ NON-CRC: 171561_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1928298_mmg.pdf (not relevant to CRC metric)
✅ CRC: 1928298_pap.pdf
       Type: COLONOSCOPY | Date: N/A | Result: N/A | Conf: 45%
✅ CRC: 1936610_colonoscopy.pdf
       Type: COLONOSCOPY | Date: 2023-07-20 | Result: NEGATIVE | Conf: 89%
⬜ NON-CRC: 1936610_mmg.pdf (not relevant to CRC metric)
✅ CRC: 1944227_colonoscopy.pdf
       Type: COLONOSCOPY | Date: 2024-04-11 | Result: NEGATIVE | Conf: 100%
⬜ NON-CRC: 1944227_gyn_note.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1944227_mmg.pdf (not relevant to CRC metric)
✅ CRC: 1945734_colonoscopy.pdf
       Type: COLONOSCOPY | Date: N/A | Result: N/A | Conf: 70%
⬜ NON-CRC: 1945734_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1945734_pap.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1946500_pap.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1947752_ifobt.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1947752_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1947752_pap.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1948064_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 1948064_pap.pdf (not relevant to CRC metric)
⬜ NON-CRC: 197868_mmg.pdf (not relevant to CRC metric)
✅ CRC: 57919_fit.pdf
       Type: FIT | Date: 2024-07-30 | Result: NEGATIVE | Conf: 95%
⬜ NON-CRC: 57919_mmg.pdf (not relevant to CRC metric)
⬜ NON-CRC: 59208_mmg.pdf (not relevant to CRC metric)

📊 TRIAGE SUMMARY:
   Total documents: 28
   CRC-relevant: 8 documents
   Non-CRC: 20 documents (filtered out)
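
Two of the files flagged as CRC above are pap documents shown at 47% and 45%, which can look inconsistent with the 0.6 triage cutoff in `extract_from_pdf_with_model`. The printed value is the overall confidence, not the document-type confidence: when no date is extracted, the code reports `doc_type_conf * 0.7`. The sketch below replays that arithmetic. The helper names are hypothetical, the `Optional` date argument is a simplification of the notebook's `date_conf = 0.0` default, and the document-type confidences are back-calculated from the unrounded overall values the notebook reports for these two files rather than read from the model.

```python
from typing import Optional

TRIAGE_CUTOFF = 0.6    # hardcoded cutoff used for is_crc_relevant
NO_DATE_PENALTY = 0.7  # multiplier applied when no date was extracted

def overall_confidence(doc_type_conf: float, date_conf: Optional[float]) -> float:
    """Overall score: mean of both confidences, or a penalised doc-type score."""
    if date_conf is None:
        return doc_type_conf * NO_DATE_PENALTY
    return (doc_type_conf + date_conf) / 2

def implied_doc_type_conf(overall: float) -> float:
    """Invert the no-date branch to recover the document-type confidence."""
    return overall / NO_DATE_PENALTY

for name, overall in [("171516_pap.pdf", 0.466638), ("1928298_pap.pdf", 0.454010)]:
    implied = implied_doc_type_conf(overall)
    print(f"{name}: overall={overall:.3f}, implied doc-type confidence={implied:.3f}, "
          f"passes cutoff: {implied >= TRIAGE_CUTOFF}")
```

Both implied document-type confidences come out at roughly 0.65 to 0.67, so both files clear the 0.6 cutoff even though the printed score is below 50%. Printing the document-type confidence next to the overall score would make such false positives easier to spot.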


In [7]:
# Create DataFrame from model extractions - separate CRC and non-CRC
df_all = pd.DataFrame([
    {
        "patient_id": "TEST_PATIENT",  # Treating all PDFs as one patient's chart
        "file": os.path.basename(e.pdf_path),
        "is_crc_relevant": e.is_crc_relevant,
        "event_type": e.event_type,
        "event_date": e.event_date,
        "result": e.result,
        "confidence": e.confidence,
        "needs_review": e.needs_review,
        "evidence": e.evidence,
    }
    for e in extractions
])

# Filter to CRC-relevant only for numerator evaluation
df_crc = df_all[df_all["is_crc_relevant"]].copy()
df_non_crc = df_all[~df_all["is_crc_relevant"]].copy()

print("\n📋 CRC-RELEVANT DOCUMENTS (for numerator evaluation):")
print(df_crc[["file", "event_type", "event_date", "result", "confidence"]].to_string(index=False))

print(f"\n📋 NON-CRC DOCUMENTS FILTERED OUT ({len(df_non_crc)}):")
for _, row in df_non_crc.iterrows():
    print(f"   - {row['file']}")


📋 CRC-RELEVANT DOCUMENTS (for numerator evaluation):
                   file  event_type event_date   result  confidence
 104918_colonoscopy.pdf COLONOSCOPY 2018-05-23 NEGATIVE    0.994118
         171516_pap.pdf COLONOSCOPY       None NEGATIVE    0.466638
 171561_colonoscopy.pdf COLONOSCOPY 2022-05-26 NEGATIVE    0.994408
        1928298_pap.pdf COLONOSCOPY       None     None    0.454010
1936610_colonoscopy.pdf COLONOSCOPY 2023-07-20 NEGATIVE    0.894775
1944227_colonoscopy.pdf COLONOSCOPY 2024-04-11 NEGATIVE    0.995599
1945734_colonoscopy.pdf COLONOSCOPY       None     None    0.698638
          57919_fit.pdf         FIT 2024-07-30 NEGATIVE    0.948372

📋 NON-CRC DOCUMENTS FILTERED OUT (20):
   - 104918_mmg.pdf
   - 146747_mmg.pdf
   - 147332_mmg.pdf
   - 147332_pap.pdf
   - 171561_mmg.pdf
   - 1928298_mmg.pdf
   - 1936610_mmg.pdf
   - 1944227_gyn_note.pdf
   - 1944227_mmg.pdf
   - 1945734_mmg.pdf
   - 1945734_pap.pdf
   - 1946500_pap.pdf
   - 1947752_ifobt.pdf
   - 1947752_
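
Because every file name in this test set ends with a document-type tag, the triage result can be scored against it. A minimal sketch, assuming the tag in the file name reflects the true document type (an assumption, since the names were assigned outside the model) and treating `colonoscopy`, `fit`, and `ifobt` as CRC-relevant. The predicted set is copied from the triage output above; `tag_of` and `score_triage` are hypothetical helpers.

```python
# Tags treated as CRC-relevant ground truth (assumption based on file names only).
CRC_TAGS = {"colonoscopy", "fit", "ifobt"}

ALL_FILES = [
    "104918_colonoscopy.pdf", "104918_mmg.pdf", "146747_mmg.pdf", "147332_mmg.pdf",
    "147332_pap.pdf", "171516_pap.pdf", "171561_colonoscopy.pdf", "171561_mmg.pdf",
    "1928298_mmg.pdf", "1928298_pap.pdf", "1936610_colonoscopy.pdf", "1936610_mmg.pdf",
    "1944227_colonoscopy.pdf", "1944227_gyn_note.pdf", "1944227_mmg.pdf",
    "1945734_colonoscopy.pdf", "1945734_mmg.pdf", "1945734_pap.pdf", "1946500_pap.pdf",
    "1947752_ifobt.pdf", "1947752_mmg.pdf", "1947752_pap.pdf", "1948064_mmg.pdf",
    "1948064_pap.pdf", "197868_mmg.pdf", "57919_fit.pdf", "57919_mmg.pdf", "59208_mmg.pdf",
]

# Files the model flagged as CRC-relevant in the triage run.
PREDICTED_CRC = {
    "104918_colonoscopy.pdf", "171516_pap.pdf", "171561_colonoscopy.pdf",
    "1928298_pap.pdf", "1936610_colonoscopy.pdf", "1944227_colonoscopy.pdf",
    "1945734_colonoscopy.pdf", "57919_fit.pdf",
}

def tag_of(filename: str) -> str:
    """Return the document-type tag, e.g. 'gyn_note' for '1944227_gyn_note.pdf'."""
    stem = filename.rsplit(".", 1)[0]
    return stem.split("_", 1)[1]

def score_triage(files, predicted, crc_tags):
    """Count true/false positives and false negatives against file-name tags."""
    actual = {f for f in files if tag_of(f) in crc_tags}
    tp = len(actual & predicted)
    fp = len(predicted - actual)
    fn = len(actual - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall}

scores = score_triage(ALL_FILES, PREDICTED_CRC, CRC_TAGS)
print(scores)
```

On this chart the sketch counts 6 true positives, 2 false positives (both pap files), and 1 false negative (`1947752_ifobt.pdf`), which gives a precision of 0.75 and a recall of about 0.86 under the file-name assumption.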

In [8]:
# UDS CRC Numerator Rule Engine (applied ONLY to CRC-relevant documents)
year_start = dt.date(REPORTING_YEAR, 1, 1)
year_end = dt.date(REPORTING_YEAR, 12, 31)

def counts_for_crc_2025(event_type: str, event_date_iso: str) -> bool:
    """Apply UDS CRC numerator rules with appropriate lookback periods."""
    if not event_date_iso or not event_type:
        return False
    try:
        d = dt.date.fromisoformat(event_date_iso)
    except (TypeError, ValueError):
        # Non-string values (e.g. NaN) or malformed dates cannot qualify
        return False
    
    # UDS CRC Lookback Rules:
    if event_type in ["FOBT", "FIT"]:
        # FOBT/FIT: Within reporting year only
        return year_start <= d <= year_end
    
    if event_type == "FIT_DNA":
        # FIT-DNA (Cologuard): Within 3 years
        lookback = year_start - relativedelta(years=2)
        return lookback <= d <= year_end
    
    if event_type in ["SIGMOIDOSCOPY", "CT_COLONOGRAPHY"]:
        # Sigmoidoscopy/CT Colonography: Within 5 years
        lookback = year_start - relativedelta(years=4)
        return lookback <= d <= year_end
    
    if event_type == "COLONOSCOPY":
        # Colonoscopy: Within 10 years
        lookback = year_start - relativedelta(years=9)
        return lookback <= d <= year_end
    
    return False

# Apply rules ONLY to CRC-relevant documents
if len(df_crc) > 0:
    df_crc["counts_crc_2025"] = df_crc.apply(
        lambda r: counts_for_crc_2025(r["event_type"], r["event_date"]), axis=1
    )
else:
    df_crc["counts_crc_2025"] = []

# Show results with rule evaluation
print("\n" + "="*80)
print(f"UDS CRC NUMERATOR EVALUATION (Reporting Year: {REPORTING_YEAR})")
print("="*80)
print(f"\nLookback Periods:")
print(f"  • Colonoscopy: {year_start - relativedelta(years=9)} to {year_end} (10 years)")
print(f"  • Sigmoidoscopy/CT: {year_start - relativedelta(years=4)} to {year_end} (5 years)")
print(f"  • FIT-DNA: {year_start - relativedelta(years=2)} to {year_end} (3 years)")
print(f"  • FIT/FOBT: {year_start} to {year_end} (same year only)")
print("="*80)

if len(df_crc) > 0:
    print("\n📊 CRC Evidence Evaluation:")
    print(df_crc[["file", "event_type", "event_date", "confidence", "counts_crc_2025"]].to_string(index=False))
else:
    print("\n⚠️ No CRC-relevant documents found in patient chart")


UDS CRC NUMERATOR EVALUATION (Reporting Year: 2025)

Lookback Periods:
  • Colonoscopy: 2016-01-01 to 2025-12-31 (10 years)
  • Sigmoidoscopy/CT: 2021-01-01 to 2025-12-31 (5 years)
  • FIT-DNA: 2023-01-01 to 2025-12-31 (3 years)
  • FIT/FOBT: 2025-01-01 to 2025-12-31 (same year only)

📊 CRC Evidence Evaluation:
                   file  event_type event_date  confidence  counts_crc_2025
 104918_colonoscopy.pdf COLONOSCOPY 2018-05-23    0.994118             True
         171516_pap.pdf COLONOSCOPY       None    0.466638            False
 171561_colonoscopy.pdf COLONOSCOPY 2022-05-26    0.994408             True
        1928298_pap.pdf COLONOSCOPY       None    0.454010            False
1936610_colonoscopy.pdf COLONOSCOPY 2023-07-20    0.894775             True
1944227_colonoscopy.pdf COLONOSCOPY 2024-04-11    0.995599             True
1945734_colonoscopy.pdf COLONOSCOPY       None    0.698638            False
          57919_fit.pdf         FIT 2024-07-30    0.948372         

In [9]:
# Final CRC Numerator Determination for the Patient
print("\n" + "="*80)
print("🎯 FINAL CRC NUMERATOR DETERMINATION")
print("="*80)

# Empty default so the chart summary at the end works on every branch
qualifying_events = df_crc.iloc[0:0]

if len(df_crc) == 0:
    print(f"\n❌ NUMERATOR NOT MET")
    print(f"   Reason: No CRC-relevant documents found in patient's 10-year chart")
    print(f"   Documents reviewed: {len(df_all)}")
else:
    qualifying_events = df_crc[df_crc["counts_crc_2025"] == True]
    
    if len(qualifying_events) > 0:
        # Patient meets numerator - pick best qualifying evidence
        # (highest confidence first, most recent date as tie-breaker)
        best = qualifying_events.sort_values(
            ["confidence", "event_date"], ascending=[False, False]
        ).iloc[0]
        
        print(f"\n✅ NUMERATOR MET: Patient qualifies for CRC screening metric")
        print(f"\n   Best Qualifying Evidence:")
        print(f"   ─────────────────────────")
        print(f"   📄 Document: {best['file']}")
        print(f"   📋 Type: {best['event_type']}")
        print(f"   📅 Date: {best['event_date']}")
        print(f"   🔬 Result: {best['result'] or 'N/A'}")
        print(f"   🎯 Model Confidence: {best['confidence']:.1%}")
        
        if best["needs_review"]:
            print(f"\n   ⚠️ Note: This extraction may need human review (low confidence)")
        
        # Show all qualifying events
        if len(qualifying_events) > 1:
            print(f"\n   All {len(qualifying_events)} Qualifying Events:")
            for _, row in qualifying_events.iterrows():
                print(f"     • {row['event_type']} on {row['event_date']} ({row['confidence']:.0%}) - {row['file']}")
    
    else:
        print(f"\n❌ NUMERATOR NOT MET: No qualifying CRC screening evidence found")
        
        non_qualifying = df_crc[df_crc["event_date"].notna()]
        if len(non_qualifying) > 0:
            print(f"\n   CRC evidence found but outside lookback window:")
            for _, row in non_qualifying.iterrows():
                reason = ""
                if row["event_type"] == "COLONOSCOPY":
                    cutoff = year_start - relativedelta(years=9)
                    reason = f"(needs to be on or after {cutoff})"
                elif row["event_type"] in ["FIT", "FOBT"]:
                    reason = f"(needs to be in {REPORTING_YEAR})"
                print(f"     • {row['event_type']} on {row['event_date']} {reason}")
        
        no_date = df_crc[df_crc["event_date"].isna()]
        if len(no_date) > 0:
            print(f"\n   CRC documents with no extractable date:")
            for _, row in no_date.iterrows():
                print(f"     • {row['file']} - {row['event_type'] or 'Unknown type'}")

print("\n" + "="*80)
print(f"📊 CHART SUMMARY:")
print(f"   Total documents in chart: {len(df_all)}")
print(f"   CRC-relevant (triaged): {len(df_crc)}")
print(f"   Non-CRC (filtered): {len(df_non_crc)}")
print(f"   Qualifying for {REPORTING_YEAR}: {len(qualifying_events)}")
print("="*80)


🎯 FINAL CRC NUMERATOR DETERMINATION

✅ NUMERATOR MET: Patient qualifies for CRC screening metric

   Best Qualifying Evidence:
   ─────────────────────────
   📄 Document: 1944227_colonoscopy.pdf
   📋 Type: COLONOSCOPY
   📅 Date: 2024-04-11
   🔬 Result: NEGATIVE
   🎯 Model Confidence: 99.6%

   All 4 Qualifying Events:
     • COLONOSCOPY on 2018-05-23 (99%) - 104918_colonoscopy.pdf
     • COLONOSCOPY on 2022-05-26 (99%) - 171561_colonoscopy.pdf
     • COLONOSCOPY on 2023-07-20 (89%) - 1936610_colonoscopy.pdf
     • COLONOSCOPY on 2024-04-11 (100%) - 1944227_colonoscopy.pdf

📊 CHART SUMMARY:
   Total documents in chart: 28
   CRC-relevant (triaged): 8
   Non-CRC (filtered): 20
   Qualifying for 2025: 4
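
Only documents with a parsed date can qualify, so the behaviour of `parse_date_from_text` matters for the result above. The sketch below reuses the notebook's pattern and function (with a narrower `except`) on a few illustrative strings. The sample inputs are assumptions chosen to show edge cases, including OCR noise such as a `$` in place of a leading digit, and are not taken from the model's output.

```python
import datetime as dt
import re
from typing import Optional

DATE_PAT = re.compile(r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})')

def parse_date_from_text(text: str) -> Optional[str]:
    """Extract the first M/D/Y date from text and return it in ISO format."""
    m = DATE_PAT.search(text or "")
    if not m:
        return None
    mm, dd, yy = map(int, m.groups())
    if yy < 100:
        # Two-digit years: 00-49 map to 20xx, 50-99 map to 19xx
        yy = 2000 + yy if yy < 50 else 1900 + yy
    try:
        return dt.date(yy, mm, dd).isoformat()
    except ValueError:
        return None

# Illustrative inputs (hypothetical): clean date, OCR-damaged month,
# two-digit year, impossible month/day, and an ISO-formatted date.
samples = ["5/23/2018", "$/23/2018", "7/4/24", "13/45/2020", "2018-05-23"]
for s in samples:
    print(repr(s), "->", parse_date_from_text(s))
```

The clean and two-digit-year strings parse, while the OCR-damaged string, the impossible date, and the ISO-formatted string all return `None`. A document whose only date entity is damaged or written year-first would therefore be left without an `event_date` and could not count toward the numerator.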


In [10]:
# Detailed Triage Breakdown - What did the model see in each document?
print("\n" + "="*80)
print("🔬 DETAILED TRIAGE ANALYSIS")
print("="*80)

for extraction in extractions:
    filename = extraction.evidence.get('file', 'Unknown')
    status = "✅ CRC" if extraction.is_crc_relevant else "⬜ NON-CRC"
    
    print(f"\n{status}: {filename}")
    print(f"   Pages: {extraction.evidence.get('num_pages', '?')} | "
          f"Entities found: {extraction.evidence.get('entities_found', 0)}")
    
    if extraction.is_crc_relevant:
        print(f"   Type: {extraction.event_type} (conf: {extraction.evidence.get('doc_type_confidence', 0):.0%})")
        print(f"   Date: {extraction.event_date or 'Not extracted'} (conf: {extraction.evidence.get('date_confidence', 0):.0%})")
        print(f"   Result: {extraction.result or 'N/A'}")
        # counts_for_crc_2025 already returns False when the date is missing
        qualifies = counts_for_crc_2025(extraction.event_type, extraction.event_date)
        print(f"   Qualifies for {REPORTING_YEAR}: {'✅ YES' if qualifies else '❌ NO'}")
    else:
        crc_entities = extraction.evidence.get("crc_entities", [])
        if crc_entities:
            print(f"   CRC entities found (low conf): {crc_entities}")
        else:
            print(f"   No CRC-related entities detected")
    
    # Show top entities detected
    summary = extraction.evidence.get("entity_summary", {})
    if summary:
        print(f"   Top entities:")
        for etype, items in list(summary.items())[:3]:
            for item in items[:1]:
                print(f"     [{etype}] \"{item['text'][:40]}\" ({item['conf']:.0%})")


🔬 DETAILED TRIAGE ANALYSIS

✅ CRC: 104918_colonoscopy.pdf
   Pages: 4 | Entities found: 14
   Type: COLONOSCOPY (conf: 100%)
   Date: 2018-05-23 (conf: 99%)
   Result: NEGATIVE
   Qualifies for 2025: ✅ YES
   Top entities:
     [PROCEDURE_DATE] "$/23/2018" (99%)
     [DOC_TYPE_COLONOSCOPY] "Colonoscopy" (100%)
     [INDICATION_SCREENING] "Screcning" (98%)

⬜ NON-CRC: 104918_mmg.pdf
   Pages: 2 | Entities found: 2
   CRC entities found (low conf): ['RESULT_NEGATIVE']
   Top entities:
     [INDICATION_SCREENING] "Screening" (100%)
     [RESULT_NEGATIVE] "NEGATIVE." (99%)

⬜ NON-CRC: 146747_mmg.pdf
   Pages: 2 | Entities found: 2
   CRC entities found (low conf): ['COLLECTION_DATE']
   Top entities:
     [COLLECTION_DATE] "07/11/2025" (64%)
     [INDICATION_SCREENING] "Screening" (100%)

⬜ NON-CRC: 147332_mmg.pdf
   Pages: 2 | Entities found: 5
   CRC entities found (low conf): ['PROCEDURE_DATE']
   Top entities:
     [INDICATION_SCREENING] "Screening" (99%)
     [PROCEDURE_