# 3.03 All Province Alberta Crosswalk

**Consolidated notebook for mapping Alberta billing codes to BC, MB, ON, and SK equivalents.**

## Workflow
1. Upload all files (PDFs, Reference CSVs, Taxonomy)
2. Configure Alberta code to match
3. Run each province individually
4. Combine results
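
Step 4 is not spelled out in this part of the notebook. As a minimal sketch, assuming each province run yields a list of row dictionaries, results whose columns differ by province (for example `Setting` for ON or `Age_Premium_Applies` for SK) could be merged onto one shared column list. The helper name `combine_province_rows` and the sample rows are hypothetical.

```python
def combine_province_rows(rows_by_province):
    """Merge per-province rows onto one shared, ordered column list."""
    # Collect column names in first-seen order across all provinces
    columns = []
    for rows in rows_by_province.values():
        for row in rows:
            for col in row:
                if col not in columns:
                    columns.append(col)

    # Re-emit every row with the full column set; absent fields become None
    combined = []
    for rows in rows_by_province.values():
        for row in rows:
            combined.append({col: row.get(col) for col in columns})
    return columns, combined


# Illustrative rows only; the codes and fees are placeholders, not schedule data
example = {
    'ON': [{'Target_Province': 'ON', 'Code': 'X001', 'Fee': '10.00', 'Setting': 'Hospital'}],
    'SK': [{'Target_Province': 'SK', 'Code': 'X002', 'Fee': '12.00', 'Age_Premium_Applies': True}],
}
columns, combined = combine_province_rows(example)
```

Filling missing fields with `None` keeps province-specific columns visible in the combined output instead of silently dropping them.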

## Province-Specific Features
| Province | Chunking | Special Features |
|----------|----------|------------------|
| BC | Level 1 | Code prefixes (P, G, PG) |
| MB | Level 1 | Specialty-based fees |
| ON | Level 2 | H/P settings, Surg/Asst/Anae fees |
| SK | Level 1 | Referred/Not Referred dual-fees, Age premiums |
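
Chunking splits each schedule into page ranges, one per section. A minimal sketch of the idea, assuming a hypothetical mapping of section names to start pages: each section runs from its start page to the page before the next section starts, and the last section runs to the end of the PDF. The function name and sample page numbers are illustrative.

```python
def section_page_ranges(section_starts, total_pages):
    """Map each section name to an inclusive (start_page, end_page) range."""
    ordered = sorted(section_starts.items(), key=lambda item: item[1])
    ranges = {}
    for i, (name, start) in enumerate(ordered):
        if i + 1 < len(ordered):
            # Design choice of this sketch: clamp so that a section sharing
            # its first page with the next one still covers that page
            end = max(start, ordered[i + 1][1] - 1)
        else:
            end = total_pages
        ranges[name] = (start, end)
    return ranges


ranges = section_page_ranges(
    {'Section A': 83, 'Section B': 120, 'Section C': 120},
    total_pages=300,
)
```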

---
# STEP 1: Setup
---

## Cell 1: Install Dependencies

In [None]:
!pip install openai pandas pdfplumber openpyxl tqdm PyMuPDF -q

import pandas as pd
import pdfplumber
import fitz  # PyMuPDF
import json
import re
from tqdm.notebook import tqdm
from google.colab import files

print("All dependencies loaded.")
print("Ready to proceed.")

---
# STEP 2: Upload Files
---

## Cell 2a: Upload Province PDFs

Upload all 4 province schedule PDFs. Files will be auto-detected by name.

In [None]:
print("="*70)
print("STEP 2a: Upload Province Schedule PDFs")
print("="*70)
print("\nExpected files:")
print("  - BC Payment Schedule - March 31, 2024.pdf")
print("  - MB Payment Schedule - April 1, 2024.pdf")
print("  - ON - February 20, 2024 (effective April 1, 2024).pdf")
print("  - SK Payment Schedule - April 1, 2024.pdf")
print()

uploaded_pdfs = files.upload()

# Auto-detect province from filename
PDF_FILES = {'BC': None, 'MB': None, 'ON': None, 'SK': None}

for filename in uploaded_pdfs.keys():
    # Compare whole alphabetic tokens rather than substrings, so that 'ON'
    # inside a longer word cannot be mistaken for Ontario.
    tokens = set(re.split(r'[^A-Z]+', filename.upper()))
    for prov in PDF_FILES:
        if prov in tokens:
            PDF_FILES[prov] = filename
            break

print("\n" + "="*70)
print("Detected PDFs:")
print("="*70)
for prov, f in PDF_FILES.items():
    status = "✓" if f else "✗ MISSING"
    print(f"  {prov}: {status} {f if f else ''}")

# Warn if any missing
missing = [p for p, f in PDF_FILES.items() if f is None]
if missing:
    print(f"\n⚠️  WARNING: Missing PDFs for: {', '.join(missing)}")
    print("    You can still run the provinces that have PDFs.")
else:
    print("\n✓ All 4 province PDFs loaded successfully.")

## Cell 2b: Upload Section Reference CSVs

Upload all 4 section reference CSVs. Files will be auto-detected by name.
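
Auto-detection by name is easy to get wrong with plain substring checks, because `'on'` also occurs inside `'section'`. One way to sidestep this, sketched here under the assumption that province markers appear as separate tokens in the filename, is to split the name on non-letters and compare whole tokens. The alias table and the `detect_province` helper are hypothetical.

```python
import re

# Hypothetical alias table; extend it to match your own file naming
PROVINCE_ALIASES = {
    'BC': {'bc'},
    'MB': {'mb', 'manitoba'},
    'ON': {'on', 'ontario'},
    'SK': {'sk', 'saskatchewan'},
}


def detect_province(filename):
    """Return the province whose alias appears as a whole token, else None."""
    tokens = set(re.split(r'[^a-z]+', filename.lower()))
    for prov, aliases in PROVINCE_ALIASES.items():
        if tokens & aliases:
            return prov
    return None


detected = {
    name: detect_province(name)
    for name in (
        'bc_section_reference_simple.csv',
        'manitoba_section_reference_final.csv',
        'on_section_reference_full.csv',
        'sk_section_reference_simple.csv',
    )
}
```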

In [None]:
print("="*70)
print("STEP 2b: Upload Section Reference CSVs")
print("="*70)
print("\nExpected files:")
print("  - bc_section_reference_simple.csv")
print("  - manitoba_section_reference_final.csv")
print("  - on_section_reference_full.csv")
print("  - sk_section_reference_simple.csv")
print()

uploaded_refs = files.upload()

# Auto-detect province from filename
REF_FILES = {'BC': None, 'MB': None, 'ON': None, 'SK': None}

for filename in uploaded_refs.keys():
    filename_lower = filename.lower()
    # Check more specific patterns first, and use prefix patterns to avoid false matches
    # (e.g., 'on' in 'section' was causing sk files to match ON)
    if filename_lower.startswith('bc') or '_bc_' in filename_lower:
        REF_FILES['BC'] = filename
    elif 'mb' in filename_lower or 'manitoba' in filename_lower:
        REF_FILES['MB'] = filename
    elif filename_lower.startswith('sk') or '_sk_' in filename_lower:
        REF_FILES['SK'] = filename
    elif filename_lower.startswith('on') or '_on_' in filename_lower:
        REF_FILES['ON'] = filename

print("\n" + "="*70)
print("Detected Reference CSVs:")
print("="*70)
for prov, f in REF_FILES.items():
    status = "✓" if f else "✗ MISSING"
    print(f"  {prov}: {status} {f if f else ''}")

# Warn if any missing
missing = [p for p, f in REF_FILES.items() if f is None]
if missing:
    print(f"\n⚠️  WARNING: Missing reference CSVs for: {', '.join(missing)}")
    print("    You can still run the provinces that have reference files.")
else:
    print("\n✓ All 4 province reference CSVs loaded successfully.")

## Cell 2c: Upload Extraction Taxonomy

Upload the extraction taxonomy Excel file for Phase 2 attribute extraction.
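
The cell below reads four columns from the taxonomy sheet: `attribute`, `data_type`, `definition`, and `taxonomy`. A minimal sketch of checking for those columns before building the prompt reference, assuming the rows have already been converted to dictionaries (for example with `df_taxonomy.to_dict('records')`). The helper name and the sample row are illustrative, not taken from the real taxonomy file.

```python
REQUIRED_COLUMNS = ('attribute', 'data_type', 'definition', 'taxonomy')


def build_taxonomy_reference(rows):
    """Validate taxonomy rows, then format them as one bullet per attribute."""
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise KeyError(f"Taxonomy row {i} is missing: {', '.join(missing)}")
    return "\n".join(
        f"- {row['attribute']} ({row['data_type']}): "
        f"{row['definition']} Taxonomy: {row['taxonomy']}"
        for row in rows
    )


# Placeholder row for illustration only
sample_rows = [{
    'attribute': 'minimum_time_minutes',
    'data_type': 'integer',
    'definition': 'Minimum physician time required.',
    'taxonomy': 'integer or null',
}]
reference = build_taxonomy_reference(sample_rows)
```

Failing early with the row number makes a misnamed spreadsheet column easier to find than a `KeyError` raised later inside the prompt builder.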

In [None]:
print("="*70)
print("STEP 2c: Upload Extraction Taxonomy")
print("="*70)
print("\nExpected file:")
print("  - extraction_taxonomy.xlsx")
print()

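uploaded_tax = files.upload()

# Stop with a clear message if the upload dialog was cancelled
if not uploaded_tax:
    raise ValueError("No taxonomy file uploaded; re-run this cell and select extraction_taxonomy.xlsx.")

TAXONOMY_FILE = next(iter(uploaded_tax))
df_taxonomy = pd.read_excel(TAXONOMY_FILE)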

print("\n" + "="*70)
print(f"Loaded Taxonomy: {TAXONOMY_FILE}")
print("="*70)
print(f"\n{len(df_taxonomy)} attributes:")
for _, row in df_taxonomy.iterrows():
    print(f"  - {row['attribute']}: {row['data_type']}")

# Build taxonomy reference string for prompts
taxonomy_reference = "\n".join([
    f"- {row['attribute']} ({row['data_type']}): {row['definition']} Taxonomy: {row['taxonomy']}"
    for _, row in df_taxonomy.iterrows()
])

print("\n✓ Taxonomy loaded and ready for Phase 2.")

## Cell 3: API Key

Enter your OpenAI API key.
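
Pasting a key into a notebook cell risks saving it with the file. As a hedged alternative, the sketch below prefers a pasted value but otherwise looks for an environment variable, returning `None` so the caller can still fall back to `getpass`. The helper name `resolve_api_key` is hypothetical, and it assumes the key is stored under `OPENAI_API_KEY`.

```python
import os


def resolve_api_key(pasted_key="", env=None):
    """Return the first non-empty key from the pasted value or the environment."""
    env = os.environ if env is None else env
    key = (pasted_key or env.get("OPENAI_API_KEY", "")).strip()
    # None signals that the caller should prompt interactively instead
    return key or None


key_from_env = resolve_api_key(env={"OPENAI_API_KEY": "  placeholder-key  "})
key_missing = resolve_api_key(env={})
```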

In [None]:
print("="*70)
print("STEP 2d: API Key")
print("="*70)

OPENAI_API_KEY = ""  # <-- Paste your key here, or leave blank to use getpass

if not OPENAI_API_KEY:
    from getpass import getpass
    OPENAI_API_KEY = getpass("Enter OpenAI API Key: ")

from openai import OpenAI
client = OpenAI(api_key=OPENAI_API_KEY)

print("\n✓ API client initialized.")
print("\n" + "="*70)
print("SETUP COMPLETE - Ready to configure Alberta code and run provinces")
print("="*70)

---
# STEP 3: Alberta Code Configuration
---

**Edit this cell to change the Alberta code being mapped.**

## Cell 4: Alberta Code Config

⚠️ **EDIT THIS CELL** to map a different Alberta code.
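
Because this cell is edited by hand for each new Alberta code, a quick sanity check can catch a dropped key or a fee typed as text before any API calls are made. This is a minimal sketch: the key list mirrors the fields in `ALBERTA_CODE_CONFIG` below, while the `check_ab_config` helper is hypothetical.

```python
REQUIRED_AB_KEYS = (
    'code', 'description', 'fee', 'clinical_definition',
    'service_context', 'search_criteria', 'exclusion_criteria',
)


def check_ab_config(config):
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing key: {k}" for k in REQUIRED_AB_KEYS if k not in config]
    for key in REQUIRED_AB_KEYS:
        if key != 'fee' and key in config and not str(config[key]).strip():
            problems.append(f"empty value: {key}")
    if 'fee' in config:
        fee = config['fee']
        is_number = isinstance(fee, (int, float)) and not isinstance(fee, bool)
        if not is_number or fee < 0:
            problems.append("fee must be a non-negative number")
    return problems


# Shortened stand-in for the real config, for illustration only
sample_config = {
    'code': '03.03CV',
    'description': 'Telehealth consultation',
    'fee': 25.09,
    'clinical_definition': 'Assessment via telephone or secure videoconference.',
    'service_context': 'Basic patient-facing virtual visit.',
    'search_criteria': 'Virtual visits',
    'exclusion_criteria': 'In-person only codes',
}
problems = check_ab_config(sample_config)
```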

In [None]:
# ============================================================================
# ALBERTA CODE CONFIGURATION
# ============================================================================
# Edit this section to map a different Alberta billing code.
# The province configs (Cell 5) should NOT need to change.
# ============================================================================

ALBERTA_CODE_CONFIG = {
    # Basic code info
    'code': '03.03CV',
    'description': 'Telehealth consultation',
    'fee': 25.09,
    
    # Clinical definition - describes the service in detail
    'clinical_definition': """Assessment of a patient's condition via telephone or secure videoconference.

NOTE:
- At minimum: limited assessment requiring history related to presenting problems, appropriate records review, and advice to the patient
- Total physician time spent providing patient care must be MINIMUM 10 MINUTES
- If less than 10 minutes same day, must use HSC 03.01AD instead
- May only be claimed if service was initiated by the patient or their agent
- May only be claimed if service is personally rendered by the physician
- Benefit includes ordering appropriate diagnostic tests and discussion with patient
- Patient record must include detailed summary of all services including start/stop times
- Time spent on administrative tasks cannot be claimed
- May NOT be claimed same day as: 03.01AD, 03.01S, 03.01T, 03.03FV, 03.05JR, 03.08CV, 08.19CV, 08.19CW, or 08.19CX by same physician for same patient
- May NOT be claimed same day as in-person visit or consultation by same physician for same patient

Category: V Visit (Virtual)
Base rate: $25.09""",
    
    # Service type context - helps LLM understand what we're looking for
    'service_context': """This is a BASIC PATIENT-FACING virtual visit by any physician (not specialist-specific, not physician-to-physician).""",
    
    # What to search for (specific to this AB code type)
    'search_criteria': """
WHAT TO LOOK FOR:
- Virtual visits / virtual care
- Telephone consultations / assessments
- Video consultations / assessments
- Telehealth codes
- Any code that can be billed for a patient-facing virtual encounter
""",
    
    # What to exclude (specific to this AB code type)
    'exclusion_criteria': """
DO NOT INCLUDE:
- Physician-to-physician consultations (e-consults between doctors)
- E-assessments / e-consults (specialist-to-PCP) - not patient-facing
- In-person only codes
- Diagnostic procedures (ECG, imaging, labs)
- Codes you cannot find literally in the text
""",
}

# Display config
print("="*70)
print("ALBERTA CODE CONFIGURATION")
print("="*70)
print(f"\nCode: {ALBERTA_CODE_CONFIG['code']}")
print(f"Description: {ALBERTA_CODE_CONFIG['description']}")
print(f"Fee: ${ALBERTA_CODE_CONFIG['fee']:.2f}")
print("\n✓ Alberta code configured.")

---
# STEP 4: Province Configurations
---

**DO NOT EDIT** unless province schedule structure changes.

## Cell 5: Province Configs

Static configurations for each province's schedule structure.
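
If a schedule's structure does change and these values are edited, a small consistency check may help. The sketch below assumes that rules pages are a 1-based inclusive range, that only chunking levels 1 and 2 are supported, and that `min_clinical_page`, when present, should fall after the rules pages (as it does for MB and SK below). The helper name is hypothetical.

```python
def check_province_config(prov_code, config):
    """Return a list of problems found in one province configuration."""
    problems = []
    start, end = config['rules_pages']
    if not (1 <= start <= end):
        problems.append(f"{prov_code}: rules_pages must satisfy 1 <= start <= end")
    if config['chunking_level'] not in (1, 2):
        problems.append(f"{prov_code}: chunking_level must be 1 or 2")
    min_page = config.get('min_clinical_page')
    if min_page is not None and min_page <= end:
        problems.append(f"{prov_code}: min_clinical_page should come after the rules pages")
    return problems


# Values copied from the MB entry below
mb_problems = check_province_config(
    'MB', {'chunking_level': 1, 'rules_pages': (1, 82), 'min_clinical_page': 83}
)
```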

In [None]:
# ============================================================================
# PROVINCE CONFIGURATIONS
# ============================================================================
# Static configurations for each province's schedule structure.
# These should NOT change when mapping different Alberta codes.
# ============================================================================

PROVINCE_CONFIGS = {
    'BC': {
        'name': 'British Columbia',
        'chunking_level': 1,  # Level 1 sections
        'rules_pages': (1, 52),  # General Preamble
        'skip_sections': [
            "1. GENERAL PREAMBLE TO THE PAYMENT SCHEDULE",
            "2. OUT-OF-OFFICE HOURS PREMIUMS",
        ],
        'special_fields': [],  # No special fields
        'extraction_rules': """
BC-SPECIFIC EXTRACTION RULES:

1. **CODE PREFIXES** (indicate payment type, NOT setting):
   - P = Professional fee
   - G = Group fee
   - PG = Professional + Group

2. **FEE EXTRACTION**: Copy the exact fee value as shown

ACCURACY RULES - YOU MUST FOLLOW:

1. **ONLY REAL CODES**: Return ONLY codes that LITERALLY appear in the text above.
   - Copy the EXACT code as shown (e.g., 00100, 14051, 97017)
   - If you cannot find the exact code string in the text, DO NOT include it
   - NEVER invent, fabricate, or guess codes

2. **EXACT VALUES**: Copy fee EXACTLY as shown in the document
   - Use exact decimal values (e.g., "25.43" not "25.00")
   - If fee is percentage-based premium, use "-" and explain in condition

3. **FULL DESCRIPTIONS - CLIENT READY FORMAT**:
   - Copy the COMPLETE service description as written in the schedule
   - Do NOT abbreviate (write "Telephone/video consultation" not "Tel consult")
   - Do NOT truncate (include the full description text)
   - Use sentence case for consistency (capitalize first word and proper nouns)
   - Include qualifying details (e.g., "minimum 10 minutes")
   - Format: Clear, professional, ready for client delivery

4. **MODALITY**: Only include modalities explicitly stated
   - "telephone" = text says telephone/phone only
   - "video" = text says video/videoconference only
   - "both" = text explicitly allows BOTH, or doesn't restrict

5. **PAGE NUMBERS**: page_found must match the "=== PAGE X ===" marker where code appears

6. **SECTION HEADING**: Extract the subsection heading the code appears under
   - Look for bold/uppercase headings
   - This becomes level_2_subsection
""",
        'json_schema': {
            'primary_codes': ['code', 'description', 'fee', 'modality', 'page_found', 'section_heading', 'reasoning'],
            'add_on_codes': ['code', 'description', 'fee', 'modality', 'page_found', 'section_heading', 'links_to', 'condition']
        },
        'output_columns': [
            'AB_Code', 'AB_Description', 'AB_Fee', 'Target_Province',
            'Code', 'Description', 'Fee', 'Type', 'Modality', 'Specialty',
            'Links_To', 'Condition', 'Reasoning',
            'Level_1_Section', 'Level_2_Subsection', 'Page_Found'
        ]
    },

    'MB': {
        'name': 'Manitoba',
        'chunking_level': 1,  # Level 1 sections
        'rules_pages': (1, 82),  # Rules of Application
        'skip_sections': [
            "APPENDICES",
        ],
        'min_clinical_page': 83,  # Clinical sections start at page 83
        'special_fields': [],  # No special fields
        'extraction_rules': """
MB-SPECIFIC EXTRACTION RULES:

1. **SPECIALTY-BASED FEES**: Each specialty section may have its own fee schedules

ACCURACY RULES - YOU MUST FOLLOW:

1. **ONLY REAL CODES**: Return ONLY codes that LITERALLY appear in the text above.
   - Copy the EXACT code as shown (e.g., 8321, 8340, 8447)
   - If you cannot find the exact code string in the text, DO NOT include it
   - NEVER invent, fabricate, or guess codes

2. **EXACT VALUES**: Copy fee EXACTLY as shown in the document
   - Use exact decimal values (e.g., "59.05" not "59.00")
   - If fee is percentage-based premium, use "-" and explain in condition

3. **FULL DESCRIPTIONS - CLIENT READY FORMAT**:
   - Copy the COMPLETE service description as written in the schedule
   - Do NOT abbreviate (write "Virtual visit by telephone or video" not "Virtual visit")
   - Do NOT truncate (include the full description text)
   - Use sentence case for consistency (capitalize first word and proper nouns)
   - Include qualifying details (e.g., "minimum 10 minutes")
   - Format: Clear, professional, ready for client delivery

4. **MODALITY**: Only include modalities explicitly stated
   - "telephone" = text says telephone/phone only
   - "video" = text says video/videoconference only
   - "both" = text explicitly allows BOTH, or doesn't restrict

5. **PAGE NUMBERS**: page_found must match the "=== PAGE X ===" marker where code appears

6. **SECTION HEADING**: Extract the subsection heading the code appears under
   - Look for bold/uppercase headings like "VIRTUAL VISITS", "HOSPITAL CARE", etc.
   - This becomes level_2_subsection
""",
        'json_schema': {
            'primary_codes': ['code', 'description', 'fee', 'modality', 'page_found', 'section_heading', 'reasoning'],
            'add_on_codes': ['code', 'description', 'fee', 'modality', 'page_found', 'section_heading', 'links_to', 'condition']
        },
        'output_columns': [
            'AB_Code', 'AB_Description', 'AB_Fee', 'Target_Province',
            'Code', 'Description', 'Fee', 'Type', 'Modality', 'Specialty',
            'Links_To', 'Condition', 'Reasoning',
            'Level_1_Section', 'Level_2_Subsection', 'Page_Found'
        ]
    },

    'ON': {
        'name': 'Ontario',
        'chunking_level': 2,  # Level 2 sections (more granular)
        'rules_pages': (1, 126),  # General Preamble
        'skip_sections': [
            "General Preamble",
            "Appendix A",
            "Appendix B",
            "Appendix C",
            "Appendix D",
            "Appendix F",
            "Appendix G",
            "Appendix H",
            "Appendix J",
            "Appendix Q",
            "Numeric Index",
        ],
        'special_fields': ['Fee_Type', 'Setting', 'Level_3_Heading'],
        'extraction_rules': """
ONTARIO-SPECIFIC EXTRACTION RULES:

1. **H/P COLUMNS (Setting)**:
   - If a code has BOTH H (Hospital) and P (Professional/Office) fees, create SEPARATE entries for each
   - H = Hospital setting, P = Professional/Office setting
   - If only one fee exists, use that setting

2. **SURGICAL FEE COLUMNS**:
   - Surg = Surgeon fee -> create entry with fee_type "Surgeon"
   - Asst = Assistant fee -> create entry with fee_type "Assistant" (skip if "nil")
   - Anae = Anaesthesia units -> create entry with fee_type "Anaesthesia" (these are TIME UNITS, not dollars)

3. **CODE PREFIXES** (indicate service type, NOT setting):
   - A = Assessments/consultations
   - E = Diagnostic/therapeutic procedures
   - G = General listings
   - K = Special visit premiums
   - Z = Surgical procedures

ACCURACY RULES:

1. **ONLY REAL CODES**: Return ONLY codes that LITERALLY appear in the text above.
   - Copy the EXACT code as shown (e.g., A003, K017, Z101)
   - NEVER invent, fabricate, or guess codes

2. **EXACT VALUES**: Copy fee EXACTLY as shown in the document
   - Use exact decimal values (e.g., "87.35" not "87.00")
   - For Anae column, these are UNITS not dollars

3. **FULL DESCRIPTIONS - CLIENT READY FORMAT**:
   - Copy the COMPLETE service description as written in the schedule
   - Do NOT abbreviate or truncate
   - Use sentence case for consistency
   - Include qualifying details (e.g., "minimum 50 minutes")

4. **LEVEL 3 EXTRACTION**:
   - Extract the subsection heading the code appears under (e.g., "INCISION", "EXCISION", "GENERAL LISTINGS")
   - This becomes level_3_heading

5. **MODALITY**: Only include modalities explicitly stated
   - "telephone" = text says telephone/phone only
   - "video" = text says video/videoconference only
   - "both" = text explicitly allows BOTH, or doesn't restrict

IMPORTANT: For codes with multiple fee types (Surg/Asst/Anae) or settings (H/P), create SEPARATE entries for each combination.
""",
        'json_schema': {
            'codes': ['code', 'description', 'fee', 'fee_type', 'setting', 'modality', 'page_found',
                     'level_3_heading', 'is_addon', 'links_to', 'condition', 'reasoning']
        },
        'output_columns': [
            'AB_Code', 'AB_Description', 'AB_Fee', 'Target_Province',
            'Code', 'Description', 'Fee', 'Fee_Type', 'Setting', 'Type', 'Modality',
            'Links_To', 'Condition', 'Reasoning',
            'Level_1_Section', 'Level_2_Subsection', 'Level_3_Heading', 'Page_Found'
        ]
    },

    'SK': {
        'name': 'Saskatchewan',
        'chunking_level': 1,  # Level 1 sections
        'rules_pages': (1, 70),  # Preamble/Rules
        'skip_sections': [
            "Introduction",
            "To Request a Change to the Payment Schedule",
            "Services Provided Outside Saskatchewan",
            "Billing For Services Provided To Out-Of-Province Beneficiaries",
            "Definitions",
            "Documentation Requirements",
            "Services Billable by Entitlement or by Approval",
            "Assessment Rules",
            "General Information",
            "Services Not Insured by the Ministry of Health",
            "Assessment of Accounts",
            "Verification Program",
            "Information Sources",
            "Reciprocal Billing",
            "Explanatory Codes for Physicians",
        ],
        'min_clinical_page': 71,  # Clinical sections start at page 71
        'special_fields': ['Fee_Type', 'Age_Premium_Applies'],
        'extraction_rules': """
SASKATCHEWAN-SPECIFIC EXTRACTION RULES:

1. **DUAL-FEE STRUCTURE (Referred vs Not Referred)**:
   - Many SK codes have TWO fees: "Referred" and "Not Referred"
   - If a code has BOTH fees, create SEPARATE entries for each:
     - One entry with fee_type="Referred" and the referred fee
     - One entry with fee_type="Not Referred" and the not-referred fee
   - If only one fee exists, use fee_type="Standard"

2. **AGE PREMIUMS (Section-Wide)**:
   - SK has age-based premiums for patients 0-5 years and 65+ years
   - If the section header or preamble states age premiums apply to ALL codes in the section, note this for EVERY code
   - Do NOT skip age premiums just because they're not repeated per-code

ACCURACY RULES - YOU MUST FOLLOW:

1. **ONLY REAL CODES**: Return ONLY codes that LITERALLY appear in the text above.
   - Copy the EXACT code as shown
   - If you cannot find the exact code string in the text, DO NOT include it
   - NEVER invent, fabricate, or guess codes

2. **EXACT VALUES**: Copy fee EXACTLY as shown in the document
   - Use exact decimal values (e.g., "45.50" not "45.00")
   - If fee is percentage-based premium, use "-" and explain in condition

3. **FULL DESCRIPTIONS - CLIENT READY FORMAT**:
   - Copy the COMPLETE service description as written in the schedule
   - Do NOT abbreviate (write "Telephone/video consultation" not "Tel consult")
   - Do NOT truncate (include the full description text)
   - Use sentence case for consistency (capitalize first word and proper nouns)
   - Include qualifying details (e.g., "minimum 10 minutes")
   - Format: Clear, professional, ready for client delivery

4. **MODALITY**: Only include modalities explicitly stated
   - "telephone" = text says telephone/phone only
   - "video" = text says video/videoconference only
   - "both" = text explicitly allows BOTH, or doesn't restrict

5. **PAGE NUMBERS**: page_found must match the "=== PAGE X ===" marker where code appears

6. **SECTION HEADING**: Extract the subsection heading the code appears under
   - Look for bold/uppercase headings
   - This becomes level_2_subsection

IMPORTANT: For codes with BOTH Referred and Not Referred fees, create SEPARATE entries for each fee type.
""",
        'json_schema': {
            'primary_codes': ['code', 'description', 'fee', 'fee_type', 'modality', 'page_found',
                             'section_heading', 'age_premium_applies', 'reasoning'],
            'add_on_codes': ['code', 'description', 'fee', 'fee_type', 'modality', 'page_found',
                            'section_heading', 'age_premium_applies', 'links_to', 'condition']
        },
        'output_columns': [
            'AB_Code', 'AB_Description', 'AB_Fee', 'Target_Province',
            'Code', 'Description', 'Fee', 'Fee_Type', 'Type', 'Modality', 'Specialty',
            'Links_To', 'Condition', 'Reasoning',
            'Level_1_Section', 'Level_2_Subsection', 'Page_Found', 'Age_Premium_Applies'
        ]
    }
}

# Display loaded configs
print("="*70)
print("PROVINCE CONFIGURATIONS LOADED")
print("="*70)
for prov, config in PROVINCE_CONFIGS.items():
    print(f"\n{prov} ({config['name']}):")
    print(f"  - Chunking: Level {config['chunking_level']}")
    print(f"  - Rules pages: {config['rules_pages'][0]}-{config['rules_pages'][1]}")
    print(f"  - Skip sections: {len(config['skip_sections'])}")
    print(f"  - Special fields: {config['special_fields'] or 'None'}")

print("\n" + "="*70)
print("All province configurations loaded.")

---
# STEP 5: Shared Functions
---

## Cell 6: Shared Functions

Core processing functions used by all provinces.
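
As a worked example of the cost estimate used by `track_cost` in the cell below, assuming its rates of $3 per million input tokens and $15 per million output tokens: a call with 120,000 input tokens and 4,000 output tokens costs 0.12 × 3 + 0.004 × 15 = 0.36 + 0.06 = 0.42 dollars. The constant and function names in this sketch are illustrative.

```python
# Rates taken from track_cost; adjust them to the model actually in use
INPUT_RATE_PER_M = 3.0    # dollars per 1,000,000 input tokens
OUTPUT_RATE_PER_M = 15.0  # dollars per 1,000,000 output tokens


def estimate_cost(inp_tokens, out_tokens):
    """Estimate the dollar cost of one API call from its token counts."""
    return (inp_tokens / 1e6) * INPUT_RATE_PER_M + (out_tokens / 1e6) * OUTPUT_RATE_PER_M


cost = estimate_cost(120_000, 4_000)
print(f"Estimated cost: ${cost:.2f}")
```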

In [None]:
# ============================================================================
# SHARED FUNCTIONS
# ============================================================================
# Core processing functions used by all provinces.
# These work with ALBERTA_CODE_CONFIG and PROVINCE_CONFIGS defined above.
# ============================================================================

# --- Cost Tracking ---
total_cost = 0.0
total_calls = 0

def reset_cost_tracking():
    """Reset cost tracking for a new province run."""
    global total_cost, total_calls
    total_cost = 0.0
    total_calls = 0

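def track_cost(inp_tokens, out_tokens):
    """Accumulate an estimated API cost.

    Assumes $3 per 1M input tokens and $15 per 1M output tokens;
    adjust the rates below to match the model actually in use.
    """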
    global total_cost, total_calls
    total_cost += (inp_tokens/1e6)*3.0 + (out_tokens/1e6)*15.0
    total_calls += 1

# --- Dynamic Token Limits ---
def get_dynamic_max_tokens(char_count):
    """Set max_completion_tokens based on section size for Phase 1."""
    if char_count > 150000:
        return 20000
    elif char_count > 80000:
        return 14000
    elif char_count > 40000:
        return 10000
    elif char_count > 15000:
        return 6000
    else:
        return 4000

def get_phase2_max_tokens(rules_char_count):
    """Set max_completion_tokens for Phase 2 based on rules size.
    
    Larger rules = more context for LLM to process = need more output buffer.
    """
    if rules_char_count > 300000:
        return 4000
    elif rules_char_count > 200000:
        return 3000
    else:
        return 2500

# --- PDF Loading ---
def load_pdf_pages(pdf_path):
    """Load all pages from PDF into a dictionary."""
    pdf_pages = {}
    with pdfplumber.open(pdf_path) as pdf:
        total_pages = len(pdf.pages)
        for i, page in enumerate(tqdm(pdf.pages, desc="Loading pages")):
            page_num = i + 1
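            try:
                text = page.extract_text()
                if text:
                    pdf_pages[page_num] = text
            except Exception as exc:
                # Skip unreadable pages, but report them so gaps are visible
                print(f"  Warning: could not extract page {page_num}: {exc}")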
    return pdf_pages, total_pages

# --- Section Chunking ---
def build_section_chunks_level1(df_ref, pdf_pages, total_pages, prov_config):
    """Build section chunks for Level 1 chunking (BC, MB, SK)."""
    skip_sections = prov_config['skip_sections']
    min_page = prov_config.get('min_clinical_page', 1)
    
    # Get unique Level 1 sections with their minimum page_start
    level_1_sections = df_ref.groupby('level_1')['page_start'].min().sort_values()
    level_1_list = list(level_1_sections.items())
    
    section_chunks = {}
    
    for idx, (section_name, start_page) in enumerate(level_1_list):
        # Skip configured sections
        if section_name in skip_sections:
            continue
        
        # Skip pages before clinical content
        if start_page < min_page:
            continue
        
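        # End page is the page before the next section starts, or the last page.
        # Clamp to start_page so a section that shares its first page with the
        # next section still gets that page instead of an empty range.
        if idx + 1 < len(level_1_list):
            end_page = max(start_page, level_1_list[idx + 1][1] - 1)
        else:
            end_page = total_pages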
        
        # Extract text
        section_text = ""
        pages_in_section = []
        for pg in range(start_page, end_page + 1):
            if pg in pdf_pages:
                section_text += f"\n=== PAGE {pg} ===\n{pdf_pages[pg]}"
                pages_in_section.append(pg)
        
        section_chunks[section_name] = {
            'text': section_text,
            'level_1': section_name,
            'level_2': section_name,
            'start_page': start_page,
            'end_page': end_page,
            'page_count': len(pages_in_section),
            'char_count': len(section_text)
        }
    
    return section_chunks

def build_section_chunks_level2(df_ref, pdf_pages, total_pages, prov_config):
    """Build section chunks for Level 2 chunking (ON)."""
    skip_sections = prov_config['skip_sections']
    
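    # Normalise missing level_2 values to '' (the loop falls back to level_1),
    # and reset the index so positional lookups with iloc line up with idx
    df_ref = df_ref.copy().reset_index(drop=True)
    df_ref['level_2'] = df_ref['level_2'].fillna('')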
    
    section_chunks = {}
    
    for idx, row in df_ref.iterrows():
        level_1 = row['level_1']
        level_2 = row['level_2'] if row['level_2'] else level_1
        start_page = int(row['page_start'])
        
        # Skip configured sections
        if level_1 in skip_sections:
            continue
        
        # Section key
        section_key = f"{level_1} | {level_2}" if level_2 != level_1 else level_1
        
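        # End page is the page before the next row starts, or the last page.
        # Clamp to start_page so a row sharing its first page with the next
        # row still gets that page instead of an empty range.
        if idx + 1 < len(df_ref):
            end_page = max(start_page, int(df_ref.iloc[idx + 1]['page_start']) - 1)
        else:
            end_page = total_pages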
        
        # Extract text
        section_text = ""
        pages_in_section = []
        for pg in range(start_page, end_page + 1):
            if pg in pdf_pages:
                section_text += f"\n=== PAGE {pg} ===\n{pdf_pages[pg]}"
                pages_in_section.append(pg)
        
        section_chunks[section_key] = {
            'text': section_text,
            'level_1': level_1,
            'level_2': level_2,
            'start_page': start_page,
            'end_page': end_page,
            'page_count': len(pages_in_section),
            'char_count': len(section_text)
        }
    
    return section_chunks

# --- Rules Extraction ---
def extract_rules_text(pdf_path, rules_pages):
    """Extract rules/preamble text from PDF."""
    start_page, end_page = rules_pages
    rules_text = ""
    
    with fitz.open(pdf_path) as src_pdf:
        # Clamp to the document length so a short PDF does not raise IndexError
        last_page = min(end_page, src_pdf.page_count)
        for page_num in range(start_page - 1, last_page):
            text = src_pdf[page_num].get_text()
            if text:
                rules_text += f"\n=== RULES PAGE {page_num + 1} ===\n{text}"
    
    return rules_text

# --- Phase 1 Prompt Builder ---
def build_phase1_prompt(section_key, section_info, prov_code, prov_config):
    """Build Phase 1 extraction prompt for a section."""
    ab = ALBERTA_CODE_CONFIG
    section_text = section_info['text']
    start_page = section_info['start_page']
    end_page = section_info['end_page']
    level_1 = section_info.get('level_1', section_key)
    level_2 = section_info.get('level_2', section_key)
    
    # Province-specific JSON schema
    if prov_code == 'ON':
        json_template = '''{
  "section_key": "''' + section_key + '''",
  "found": true/false,
  "codes": [
    {
      "code": "EXACT code",
      "description": "COMPLETE description",
      "fee": "EXACT fee",
      "fee_type": "Standard|Surgeon|Assistant|Anaesthesia",
      "setting": "Hospital|Professional|N/A",
      "modality": "telephone|video|both",
      "page_found": <integer>,
      "level_3_heading": "subsection heading",
      "is_addon": true/false,
      "links_to": [],
      "condition": "",
      "reasoning": "why this matches"
    }
  ]
}'''
        no_match = '{"section_key": "' + section_key + '", "found": false, "codes": []}'
    elif prov_code == 'SK':
        json_template = '''{
  "section_name": "''' + section_key + '''",
  "found": true/false,
  "primary_codes": [
    {
      "code": "EXACT code",
      "description": "COMPLETE description",
      "fee": "EXACT fee",
      "fee_type": "Referred|Not Referred|Standard",
      "modality": "telephone|video|both",
      "page_found": <integer>,
      "section_heading": "subsection heading",
      "age_premium_applies": true/false,
      "reasoning": "why this matches"
    }
  ],
  "add_on_codes": [
    {
      "code": "EXACT code",
      "description": "COMPLETE description",
      "fee": "EXACT fee",
      "fee_type": "Referred|Not Referred|Standard",
      "modality": "telephone|video|both",
      "page_found": <integer>,
      "section_heading": "subsection heading",
      "age_premium_applies": true/false,
      "links_to": [],
      "condition": ""
    }
  ]
}'''
        no_match = '{"section_name": "' + section_key + '", "found": false, "primary_codes": [], "add_on_codes": []}'
    else:  # BC, MB
        json_template = '''{
  "section_name": "''' + section_key + '''",
  "found": true/false,
  "primary_codes": [
    {
      "code": "EXACT code",
      "description": "COMPLETE description",
      "fee": "EXACT fee",
      "modality": "telephone|video|both",
      "page_found": <integer>,
      "section_heading": "subsection heading",
      "reasoning": "why this matches"
    }
  ],
  "add_on_codes": [
    {
      "code": "EXACT code",
      "description": "COMPLETE description",
      "fee": "EXACT fee",
      "modality": "telephone|video|both",
      "page_found": <integer>,
      "section_heading": "subsection heading",
      "links_to": [],
      "condition": ""
    }
  ]
}'''
        no_match = '{"section_name": "' + section_key + '", "found": false, "primary_codes": [], "add_on_codes": []}'
    
    return f"""You are a senior physician billing specialist mapping Alberta fee codes to {prov_config['name']} equivalents.

ALBERTA CODE TO MATCH:
- Code: {ab['code']}
- Description: {ab['description']}
- Fee: ${ab['fee']}

CLINICAL SERVICE DEFINITION:
{ab['clinical_definition']}

{ab['service_context']}

You are reviewing the section: {level_1} > {level_2}
Pages {start_page} to {end_page}

{prov_config['name'].upper()} PAYMENT SCHEDULE - SECTION:

{section_text}

TASK:
Find ALL {prov_config['name']} codes in this section that bill for patient-facing virtual assessments (telephone or video consultations with patients).

{prov_config['extraction_rules']}

{ab['search_criteria']}

{ab['exclusion_criteria']}

JSON only:
{json_template}

If no telehealth/virtual codes in this section: {no_match}"""

# --- Phase 2 Prompt Builder ---
def build_phase2_prompt(code_info, chunk_text, rules_text, prov_code):
    """Build Phase 2 attribute extraction prompt.
    
    IMPORTANT:
    - rules_text: FULL (no truncation) - LLM needs all billing rules
    - chunk_text: Truncated to 30K chars - just the code-specific section
    """
    
    # SK-specific additional rules checklist
    sk_rules = ""
    if prov_code == 'SK':
        sk_rules = """
SASKATCHEWAN STANDARD RULES CHECKLIST:
- 3,000 services per physician per year limit
- Age premiums: 0-5 years and 65+ years eligible for premium
- Verify if this code/section is eligible for age premiums
"""
    
    return f"""You are a senior physician billing specialist extracting detailed attributes for a {prov_code} billing code.

CODE TO ANALYZE:
- Code: {code_info['Code']}
- Description: {code_info['Description']}
- Fee: {code_info['Fee']}
- Type: {code_info['Type']}
- Section: {code_info.get('Level_1_Section', 'N/A')}
- Condition (from Phase 1): {code_info.get('Condition', 'N/A')}

ATTRIBUTES TO EXTRACT:
{taxonomy_reference}
{sk_rules}
RULES/PREAMBLE (FULL - read carefully for billing rules applicable to this code):
{rules_text}

CODE-SPECIFIC SECTION:
{chunk_text[:30000]}

TASK:
Using ALL available information above, extract values for each attribute.

INSTRUCTIONS:
1. Read through the ENTIRE Rules/Preamble to find billing rules that apply to this code
2. Look for: frequency limits, time requirements, same-day exclusions, premiums, conditions
3. For each attribute, extract the value if found, or null if not explicitly stated
4. For same_day_exclusions: return as array of code strings
5. For additional_notes: ONLY include important billing information not captured elsewhere

Return JSON only:
{{
  "modality": "telephone|video|both|in_person|asynchronous|null",
  "minimum_time_minutes": integer or null,
  "frequency_per_day": integer or null,
  "frequency_per_year": integer or null,
  "frequency_per_year_period": "annual|quarterly|90_days|monthly|null",
  "same_day_exclusions": ["code1", "code2"] or [] or null,
  "premium_extended_hours": "rate% code conditions" or null,
  "premium_location": "rate% code conditions" or null,
  "premium_age": "rate% conditions" or null,
  "premium_other": "rate% code conditions" or null,
  "additional_notes": "other important billing info" or null
}}"""

# --- Result Processing ---
def process_phase1_result(result, section_key, section_info, prov_code, code_chunks):
    """Process Phase 1 JSON result into standardized rows."""
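    # Side effect: code_chunks is filled in place, mapping each row's _unique_key
    # to the section text so Phase 2 can look it up later.
    ab = ALBERTA_CODE_CONFIG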
    rows = []
    
    if prov_code == 'ON':
        # Ontario uses 'codes' array with fee_type and setting
        for c in result.get('codes', []):
            code = c.get('code', '')
            fee = str(c.get('fee', ''))
            fee_type = c.get('fee_type', 'Standard')
            setting = c.get('setting', 'N/A')
            modality = c.get('modality', '')
            
            unique_key = f"{code}_{fee}_{fee_type}_{setting}_{section_key}"
            code_chunks[unique_key] = section_info['text']
            
            rows.append({
                'AB_Code': ab['code'],
                'AB_Description': ab['description'],
                'AB_Fee': ab['fee'],
                'Target_Province': prov_code,
                'Code': code,
                'Description': c.get('description', ''),
                'Fee': c.get('fee', ''),
                'Fee_Type': fee_type,
                'Setting': setting,
                'Type': 'ADD-ON' if c.get('is_addon') else 'PRIMARY',
                'Modality': modality,
                'Links_To': ', '.join(c.get('links_to', [])) if c.get('links_to') else '',
                'Condition': c.get('condition', ''),
                'Reasoning': c.get('reasoning', ''),
                'Level_1_Section': section_info.get('level_1', section_key),
                'Level_2_Subsection': section_info.get('level_2', ''),
                'Level_3_Heading': c.get('level_3_heading', ''),
                'Page_Found': c.get('page_found', ''),
                '_unique_key': unique_key
            })
    
    elif prov_code == 'SK':
        # Saskatchewan uses primary_codes/add_on_codes with fee_type and age_premium
        for p in result.get('primary_codes', []):
            code = p.get('code', '')
            fee = str(p.get('fee', ''))
            fee_type = p.get('fee_type', 'Standard')
            modality = p.get('modality', '')
            
            unique_key = f"{code}_{fee}_{fee_type}_{modality}_{section_key}"
            code_chunks[unique_key] = section_info['text']
            
            rows.append({
                'AB_Code': ab['code'],
                'AB_Description': ab['description'],
                'AB_Fee': ab['fee'],
                'Target_Province': prov_code,
                'Code': code,
                'Description': p.get('description', ''),
                'Fee': p.get('fee', ''),
                'Fee_Type': fee_type,
                'Type': 'PRIMARY',
                'Modality': modality,
                'Specialty': section_key,
                'Links_To': '',
                'Condition': '',
                'Reasoning': p.get('reasoning', ''),
                'Level_1_Section': section_key,
                'Level_2_Subsection': p.get('section_heading', ''),
                'Page_Found': p.get('page_found', ''),
                'Age_Premium_Applies': p.get('age_premium_applies', False),
                '_unique_key': unique_key
            })
        
        for a in result.get('add_on_codes', []):
            code = a.get('code', '')
            fee = str(a.get('fee', ''))
            fee_type = a.get('fee_type', 'Standard')
            modality = a.get('modality', '')
            
            unique_key = f"{code}_{fee}_{fee_type}_{modality}_{section_key}"
            code_chunks[unique_key] = section_info['text']
            
            rows.append({
                'AB_Code': ab['code'],
                'AB_Description': ab['description'],
                'AB_Fee': ab['fee'],
                'Target_Province': prov_code,
                'Code': code,
                'Description': a.get('description', ''),
                'Fee': a.get('fee', ''),
                'Fee_Type': fee_type,
                'Type': 'ADD-ON',
                'Modality': modality,
                'Specialty': section_key,
                'Links_To': ', '.join(a.get('links_to', [])) if a.get('links_to') else '',
                'Condition': a.get('condition', ''),
                'Reasoning': '',
                'Level_1_Section': section_key,
                'Level_2_Subsection': a.get('section_heading', ''),
                'Page_Found': a.get('page_found', ''),
                'Age_Premium_Applies': a.get('age_premium_applies', False),
                '_unique_key': unique_key
            })
    
    else:  # BC, MB
        for p in result.get('primary_codes', []):
            code = p.get('code', '')
            fee = str(p.get('fee', ''))
            modality = p.get('modality', '')
            
            unique_key = f"{code}_{fee}_{modality}_{section_key}"
            code_chunks[unique_key] = section_info['text']
            
            rows.append({
                'AB_Code': ab['code'],
                'AB_Description': ab['description'],
                'AB_Fee': ab['fee'],
                'Target_Province': prov_code,
                'Code': code,
                'Description': p.get('description', ''),
                'Fee': p.get('fee', ''),
                'Type': 'PRIMARY',
                'Modality': modality,
                'Specialty': section_key,
                'Links_To': '',
                'Condition': '',
                'Reasoning': p.get('reasoning', ''),
                'Level_1_Section': section_key,
                'Level_2_Subsection': p.get('section_heading', ''),
                'Page_Found': p.get('page_found', ''),
                '_unique_key': unique_key
            })
        
        for a in result.get('add_on_codes', []):
            code = a.get('code', '')
            fee = str(a.get('fee', ''))
            modality = a.get('modality', '')
            
            unique_key = f"{code}_{fee}_{modality}_{section_key}"
            code_chunks[unique_key] = section_info['text']
            
            rows.append({
                'AB_Code': ab['code'],
                'AB_Description': ab['description'],
                'AB_Fee': ab['fee'],
                'Target_Province': prov_code,
                'Code': code,
                'Description': a.get('description', ''),
                'Fee': a.get('fee', ''),
                'Type': 'ADD-ON',
                'Modality': modality,
                'Specialty': section_key,
                'Links_To': ', '.join(a.get('links_to', [])) if a.get('links_to') else '',
                'Condition': a.get('condition', ''),
                'Reasoning': '',
                'Level_1_Section': section_key,
                'Level_2_Subsection': a.get('section_heading', ''),
                'Page_Found': a.get('page_found', ''),
                '_unique_key': unique_key
            })
    
    return rows

print("="*70)
print("SHARED FUNCTIONS LOADED")
print("="*70)
print("\nAvailable functions:")
print("  - reset_cost_tracking()")
print("  - track_cost(inp, out)")
print("  - get_dynamic_max_tokens(char_count) - Phase 1")
print("  - get_phase2_max_tokens(rules_char_count) - Phase 2")
print("  - load_pdf_pages(pdf_path)")
print("  - build_section_chunks_level1(df_ref, pdf_pages, total_pages, prov_config)")
print("  - build_section_chunks_level2(df_ref, pdf_pages, total_pages, prov_config)")
print("  - extract_rules_text(pdf_path, rules_pages)")
print("  - build_phase1_prompt(section_key, section_info, prov_code, prov_config)")
print("  - build_phase2_prompt(code_info, chunk_text, rules_text, prov_code)")
print("  - process_phase1_result(result, section_key, section_info, prov_code, code_chunks)")
print("\nPhase 2: Full rules_text (no truncation), chunk_text[:30000]")
print("Ready to process provinces.")
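
`process_phase1_result` reads two result shapes: Ontario replies carry a single `codes` array, while BC, MB, and SK replies carry `primary_codes` and `add_on_codes`. The following is a minimal sketch of a sanity check that could run before it, assuming a reply may set `found` to true yet list nothing. `summarize_phase1_result` and `ARRAY_KEYS` are hypothetical names, not part of the shared functions listed above.

```python
from typing import Any, Dict

# The three array names that process_phase1_result looks for
ARRAY_KEYS = ("codes", "primary_codes", "add_on_codes")


def summarize_phase1_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Count the entries in each code array and flag a found/empty mismatch."""
    counts = {k: len(result[k]) for k in ARRAY_KEYS if isinstance(result.get(k), list)}
    total = sum(counts.values())
    return {
        "counts": counts,
        "total": total,
        # True when found=true but nothing is listed, or found=false with entries
        "inconsistent": bool(result.get("found")) != (total > 0),
    }


# Mock replies with placeholder codes (not taken from any schedule)
bc_like = {"found": True, "primary_codes": [{"code": "P1"}], "add_on_codes": []}
on_like = {"found": True, "codes": []}

print(summarize_phase1_result(bc_like))
print(summarize_phase1_result(on_like))
```

A section flagged as inconsistent could be logged for manual review rather than silently contributing zero rows.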

---
# STEP 6: Run Provinces
---

Run each province individually. Results are saved and downloaded after each province completes.
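
Every province cell pulls the model's JSON out of the reply with the greedy pattern `\{[\s\S]*\}` and then calls `json.loads`. The following is a minimal sketch of that step as a standalone function, assuming the reply may wrap the JSON in extra prose. `extract_json_object` is a hypothetical helper name, not something the notebook defines.

```python
import json
import re
from typing import Optional


def extract_json_object(content: Optional[str]) -> Optional[dict]:
    """Parse the span from the first '{' to the last '}' as JSON, or return None."""
    if not content:
        return None
    # Greedy match, same pattern as the province cells
    match = re.search(r'\{[\s\S]*\}', content)
    if not match:
        return None
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        # Malformed JSON, or two objects separated by prose
        return None
    return parsed if isinstance(parsed, dict) else None


reply = 'Here is the result: {"found": true, "primary_codes": []} Thanks.'
print(extract_json_object(reply))
print(extract_json_object("no JSON here"))
```

Because the match is greedy, a reply containing two separate JSON objects is captured as one span and fails to parse, so the sketch returns `None` for it instead of raising.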

## Cell 7a: Run British Columbia

In [None]:
# ============================================================================
# BRITISH COLUMBIA - FULL CROSSWALK
# ============================================================================

PROV_CODE = 'BC'
prov_config = PROVINCE_CONFIGS[PROV_CODE]

print("="*70)
print(f"BRITISH COLUMBIA CROSSWALK")
print("="*70)

# Check files exist
if not PDF_FILES.get(PROV_CODE) or not REF_FILES.get(PROV_CODE):
    print(f"ERROR: Missing files for {PROV_CODE}")
    print(f"  PDF: {PDF_FILES.get(PROV_CODE)}")
    print(f"  REF: {REF_FILES.get(PROV_CODE)}")
else:
    reset_cost_tracking()
    
    # Load reference CSV
    df_ref = pd.read_csv(REF_FILES[PROV_CODE])
    df_ref = df_ref.sort_values('page_start').reset_index(drop=True)
    print(f"Loaded {len(df_ref)} section entries from reference CSV")
    
    # Load PDF pages
    print(f"\nLoading PDF: {PDF_FILES[PROV_CODE]}")
    pdf_pages, total_pages = load_pdf_pages(PDF_FILES[PROV_CODE])
    print(f"Loaded {len(pdf_pages)} pages")
    
    # Build section chunks (Level 1)
    section_chunks = build_section_chunks_level1(df_ref, pdf_pages, total_pages, prov_config)
    print(f"\nCreated {len(section_chunks)} section chunks")
    
    # --- PHASE 1: Extract codes ---
    print(f"\n{'='*70}")
    print("PHASE 1: EXTRACTING CODES")
    print("="*70)
    
    all_results_bc = []
    code_chunks_bc = {}
    
    for section_key, section_info in tqdm(section_chunks.items(), desc="Processing sections"):
        char_count = section_info['char_count']
        max_tokens = get_dynamic_max_tokens(char_count)
        
        prompt = build_phase1_prompt(section_key, section_info, PROV_CODE, prov_config)
        
        try:
            resp = client.chat.completions.create(
                model="gpt-5.1-2025-11-13",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                max_completion_tokens=max_tokens
            )
            track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
            
            content = resp.choices[0].message.content
            match = re.search(r'\{[\s\S]*\}', content)
            
            if match:
                result = json.loads(match.group())
                if result.get('found'):
                    rows = process_phase1_result(result, section_key, section_info, PROV_CODE, code_chunks_bc)
                    all_results_bc.extend(rows)
                    print(f"  [{section_key[:40]}] -> {len(rows)} codes")
        except Exception as e:
            print(f"  [{section_key[:40]}] ERROR: {e}")
    
    print(f"\nPhase 1 complete: {len(all_results_bc)} codes found")
    
    # --- PHASE 2: Extract attributes ---
    if len(all_results_bc) > 0:
        print(f"\n{'='*70}")
        print("PHASE 2: EXTRACTING ATTRIBUTES")
        print("="*70)
        
        rules_text = extract_rules_text(PDF_FILES[PROV_CODE], prov_config['rules_pages'])
        print(f"Loaded rules text: {len(rules_text):,} chars")
        
        # Dynamic token limit based on rules size
        phase2_tokens = get_phase2_max_tokens(len(rules_text))
        print(f"Phase 2 max_completion_tokens: {phase2_tokens}")
        
        phase2_results = []
        for code_info in tqdm(all_results_bc, desc="Extracting attributes"):
            unique_key = code_info.get('_unique_key', '')
            chunk_text = code_chunks_bc.get(unique_key, '')
            
            prompt = build_phase2_prompt(code_info, chunk_text, rules_text, PROV_CODE)
            
            try:
                resp = client.chat.completions.create(
                    model="gpt-5.1-2025-11-13",
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.1,
                    max_completion_tokens=phase2_tokens
                )
                track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
                
                content = resp.choices[0].message.content
                match = re.search(r'\{[\s\S]*\}', content)
                
                if match:
                    attrs = json.loads(match.group())
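                    if attrs.get('same_day_exclusions') and isinstance(attrs['same_day_exclusions'], list):
                        # Cast each entry to str so numeric codes do not break the join
                        attrs['same_day_exclusions'] = ', '.join(str(x) for x in attrs['same_day_exclusions'])
                    phase2_results.append({'_unique_key': unique_key, **attrs})
                else:
                    phase2_results.append({'_unique_key': unique_key})
            except Exception as e:
                # Keep the row (attributes stay empty) but report the failure
                print(f"  [{code_info.get('Code', '')}] Phase 2 ERROR: {e}")
                phase2_results.append({'_unique_key': unique_key})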
        
        # Combine Phase 1 + Phase 2
        df_phase1 = pd.DataFrame(all_results_bc)
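        # Keep one Phase 2 row per key so a repeated key cannot multiply rows in the merge
        df_phase2 = pd.DataFrame(phase2_results).drop_duplicates(subset='_unique_key')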
        df_bc = df_phase1.merge(df_phase2, on='_unique_key', how='left')
        df_bc = df_bc.drop(columns=['_unique_key'])
        
        # Save
        output_file = f"3.02_BC_Alberta_Complete.xlsx"
        df_bc.to_excel(output_file, index=False)
        
        print(f"\n{'='*70}")
        print(f"BC COMPLETE")
        print("="*70)
        print(f"Total codes: {len(df_bc)}")
        print(f"  - PRIMARY: {len(df_bc[df_bc['Type'] == 'PRIMARY'])}")
        print(f"  - ADD-ON: {len(df_bc[df_bc['Type'] == 'ADD-ON'])}")
        print(f"API calls: {total_calls} | Cost: ${total_cost:.2f}")
        print(f"\nSaved: {output_file}")
        
        files.download(output_file)
    else:
        print("No codes found for BC")
        df_bc = pd.DataFrame()
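
The Phase 2 loop flattens `same_day_exclusions` from a list into a comma-separated string before the merge. The following is a minimal sketch of a more defensive version of that step, assuming the model may occasionally return numeric codes, stray whitespace, or a bare string instead of a list. `join_exclusions` is a hypothetical helper, and the codes in the demo are placeholders rather than real schedule entries.

```python
from typing import Any, Optional


def join_exclusions(value: Any) -> Optional[str]:
    """Flatten a same_day_exclusions value into a comma-separated string."""
    if value is None:
        return None
    if isinstance(value, (list, tuple)):
        # str() guards against numeric codes; None and blank entries are dropped
        parts = [str(v).strip() for v in value if v is not None and str(v).strip()]
        return ', '.join(parts)
    return str(value).strip()


print(join_exclusions(["A1", 102, " B3 "]))
print(join_exclusions([]))
print(join_exclusions(None))
```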

## Cell 7b: Run Manitoba

In [None]:
# ============================================================================
# MANITOBA - FULL CROSSWALK
# ============================================================================

PROV_CODE = 'MB'
prov_config = PROVINCE_CONFIGS[PROV_CODE]

print("="*70)
print(f"MANITOBA CROSSWALK")
print("="*70)

# Check files exist
if not PDF_FILES.get(PROV_CODE) or not REF_FILES.get(PROV_CODE):
    print(f"ERROR: Missing files for {PROV_CODE}")
    print(f"  PDF: {PDF_FILES.get(PROV_CODE)}")
    print(f"  REF: {REF_FILES.get(PROV_CODE)}")
else:
    reset_cost_tracking()
    
    # Load reference CSV
    df_ref = pd.read_csv(REF_FILES[PROV_CODE])
    df_ref = df_ref.sort_values('page_start').reset_index(drop=True)
    print(f"Loaded {len(df_ref)} section entries from reference CSV")
    
    # Load PDF pages
    print(f"\nLoading PDF: {PDF_FILES[PROV_CODE]}")
    pdf_pages, total_pages = load_pdf_pages(PDF_FILES[PROV_CODE])
    print(f"Loaded {len(pdf_pages)} pages")
    
    # Build section chunks (Level 1)
    section_chunks = build_section_chunks_level1(df_ref, pdf_pages, total_pages, prov_config)
    print(f"\nCreated {len(section_chunks)} section chunks")
    
    # --- PHASE 1: Extract codes ---
    print(f"\n{'='*70}")
    print("PHASE 1: EXTRACTING CODES")
    print("="*70)
    
    all_results_mb = []
    code_chunks_mb = {}
    
    for section_key, section_info in tqdm(section_chunks.items(), desc="Processing sections"):
        char_count = section_info['char_count']
        max_tokens = get_dynamic_max_tokens(char_count)
        
        prompt = build_phase1_prompt(section_key, section_info, PROV_CODE, prov_config)
        
        try:
            resp = client.chat.completions.create(
                model="gpt-5.1-2025-11-13",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                max_completion_tokens=max_tokens
            )
            track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
            
            content = resp.choices[0].message.content
            match = re.search(r'\{[\s\S]*\}', content)
            
            if match:
                result = json.loads(match.group())
                if result.get('found'):
                    rows = process_phase1_result(result, section_key, section_info, PROV_CODE, code_chunks_mb)
                    all_results_mb.extend(rows)
                    print(f"  [{section_key[:40]}] -> {len(rows)} codes")
        except Exception as e:
            print(f"  [{section_key[:40]}] ERROR: {e}")
    
    print(f"\nPhase 1 complete: {len(all_results_mb)} codes found")
    
    # --- PHASE 2: Extract attributes ---
    if len(all_results_mb) > 0:
        print(f"\n{'='*70}")
        print("PHASE 2: EXTRACTING ATTRIBUTES")
        print("="*70)
        
        rules_text = extract_rules_text(PDF_FILES[PROV_CODE], prov_config['rules_pages'])
        print(f"Loaded rules text: {len(rules_text):,} chars")
        
        # Dynamic token limit based on rules size
        phase2_tokens = get_phase2_max_tokens(len(rules_text))
        print(f"Phase 2 max_completion_tokens: {phase2_tokens}")
        
        phase2_results = []
        for code_info in tqdm(all_results_mb, desc="Extracting attributes"):
            unique_key = code_info.get('_unique_key', '')
            chunk_text = code_chunks_mb.get(unique_key, '')
            
            prompt = build_phase2_prompt(code_info, chunk_text, rules_text, PROV_CODE)
            
            try:
                resp = client.chat.completions.create(
                    model="gpt-5.1-2025-11-13",
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.1,
                    max_completion_tokens=phase2_tokens
                )
                track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
                
                content = resp.choices[0].message.content
                match = re.search(r'\{[\s\S]*\}', content)
                
                if match:
                    attrs = json.loads(match.group())
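                    if attrs.get('same_day_exclusions') and isinstance(attrs['same_day_exclusions'], list):
                        # Cast each entry to str so numeric codes do not break the join
                        attrs['same_day_exclusions'] = ', '.join(str(x) for x in attrs['same_day_exclusions'])
                    phase2_results.append({'_unique_key': unique_key, **attrs})
                else:
                    phase2_results.append({'_unique_key': unique_key})
            except Exception as e:
                # Keep the row (attributes stay empty) but report the failure
                print(f"  [{code_info.get('Code', '')}] Phase 2 ERROR: {e}")
                phase2_results.append({'_unique_key': unique_key})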
        
        # Combine Phase 1 + Phase 2
        df_phase1 = pd.DataFrame(all_results_mb)
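        # Keep one Phase 2 row per key so a repeated key cannot multiply rows in the merge
        df_phase2 = pd.DataFrame(phase2_results).drop_duplicates(subset='_unique_key')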
        df_mb = df_phase1.merge(df_phase2, on='_unique_key', how='left')
        df_mb = df_mb.drop(columns=['_unique_key'])
        
        # Save
        output_file = f"3.02_MB_Alberta_Complete.xlsx"
        df_mb.to_excel(output_file, index=False)
        
        print(f"\n{'='*70}")
        print(f"MB COMPLETE")
        print("="*70)
        print(f"Total codes: {len(df_mb)}")
        print(f"  - PRIMARY: {len(df_mb[df_mb['Type'] == 'PRIMARY'])}")
        print(f"  - ADD-ON: {len(df_mb[df_mb['Type'] == 'ADD-ON'])}")
        print(f"API calls: {total_calls} | Cost: ${total_cost:.2f}")
        print(f"\nSaved: {output_file}")
        
        files.download(output_file)
    else:
        print("No codes found for MB")
        df_mb = pd.DataFrame()
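
Both phases catch an API exception once and move on, so a transient failure drops that section or leaves that code without attributes. The following is a minimal sketch of a retry wrapper with exponential backoff that could surround each call. `call_with_retries` is a hypothetical helper, and the retry count and delays are assumptions, not values from the notebook.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def call_with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); after a failure wait base_delay * 2**i, then retry, up to `attempts` calls."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * (2 ** i))
    raise RuntimeError("unreachable")


# Demo with a stand-in function that fails twice, then succeeds
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays: List[float] = []  # record the waits instead of actually sleeping
outcome = call_with_retries(flaky, attempts=4, base_delay=0.5, sleep=delays.append)
print(outcome, delays)
```

In a province cell the request would be passed as a zero-argument callable, for example a `lambda` wrapping the existing `client.chat.completions.create(...)` call, with the surrounding `try`/`except` kept as the final fallback.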

## Cell 7c: Run Ontario

In [None]:
# ============================================================================
# ONTARIO - FULL CROSSWALK
# ============================================================================

PROV_CODE = 'ON'
prov_config = PROVINCE_CONFIGS[PROV_CODE]

print("="*70)
print(f"ONTARIO CROSSWALK")
print("="*70)

# Check files exist
if not PDF_FILES.get(PROV_CODE) or not REF_FILES.get(PROV_CODE):
    print(f"ERROR: Missing files for {PROV_CODE}")
    print(f"  PDF: {PDF_FILES.get(PROV_CODE)}")
    print(f"  REF: {REF_FILES.get(PROV_CODE)}")
else:
    reset_cost_tracking()
    
    # Load reference CSV
    df_ref = pd.read_csv(REF_FILES[PROV_CODE])
    df_ref = df_ref.sort_values('page_start').reset_index(drop=True)
    print(f"Loaded {len(df_ref)} section entries from reference CSV")
    
    # Load PDF pages
    print(f"\nLoading PDF: {PDF_FILES[PROV_CODE]}")
    pdf_pages, total_pages = load_pdf_pages(PDF_FILES[PROV_CODE])
    print(f"Loaded {len(pdf_pages)} pages")
    
    # Build section chunks (Level 2 for Ontario)
    section_chunks = build_section_chunks_level2(df_ref, pdf_pages, total_pages, prov_config)
    print(f"\nCreated {len(section_chunks)} section chunks")
    
    # --- PHASE 1: Extract codes ---
    print(f"\n{'='*70}")
    print("PHASE 1: EXTRACTING CODES")
    print("="*70)
    
    all_results_on = []
    code_chunks_on = {}
    
    for section_key, section_info in tqdm(section_chunks.items(), desc="Processing sections"):
        char_count = section_info['char_count']
        max_tokens = get_dynamic_max_tokens(char_count)
        
        prompt = build_phase1_prompt(section_key, section_info, PROV_CODE, prov_config)
        
        try:
            resp = client.chat.completions.create(
                model="gpt-5.1-2025-11-13",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                max_completion_tokens=max_tokens
            )
            track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
            
            content = resp.choices[0].message.content
            match = re.search(r'\{[\s\S]*\}', content)
            
            if match:
                result = json.loads(match.group())
                if result.get('found'):
                    rows = process_phase1_result(result, section_key, section_info, PROV_CODE, code_chunks_on)
                    all_results_on.extend(rows)
                    print(f"  [{section_key[:40]}] -> {len(rows)} codes")
        except Exception as e:
            print(f"  [{section_key[:40]}] ERROR: {e}")
    
    print(f"\nPhase 1 complete: {len(all_results_on)} codes found")
    
    # --- PHASE 2: Extract attributes ---
    if len(all_results_on) > 0:
        print(f"\n{'='*70}")
        print("PHASE 2: EXTRACTING ATTRIBUTES")
        print("="*70)
        
        rules_text = extract_rules_text(PDF_FILES[PROV_CODE], prov_config['rules_pages'])
        print(f"Loaded rules text: {len(rules_text):,} chars")
        
        # Dynamic token limit based on rules size
        phase2_tokens = get_phase2_max_tokens(len(rules_text))
        print(f"Phase 2 max_completion_tokens: {phase2_tokens}")
        
        phase2_results = []
        for code_info in tqdm(all_results_on, desc="Extracting attributes"):
            unique_key = code_info.get('_unique_key', '')
            chunk_text = code_chunks_on.get(unique_key, '')
            
            prompt = build_phase2_prompt(code_info, chunk_text, rules_text, PROV_CODE)
            
            try:
                resp = client.chat.completions.create(
                    model="gpt-5.1-2025-11-13",
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.1,
                    max_completion_tokens=phase2_tokens
                )
                track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
                
                content = resp.choices[0].message.content
                match = re.search(r'\{[\s\S]*\}', content)
                
                if match:
                    attrs = json.loads(match.group())
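                    if attrs.get('same_day_exclusions') and isinstance(attrs['same_day_exclusions'], list):
                        # Cast each entry to str so numeric codes do not break the join
                        attrs['same_day_exclusions'] = ', '.join(str(x) for x in attrs['same_day_exclusions'])
                    phase2_results.append({'_unique_key': unique_key, **attrs})
                else:
                    phase2_results.append({'_unique_key': unique_key})
            except Exception as e:
                # Keep the row (attributes stay empty) but report the failure
                print(f"  [{code_info.get('Code', '')}] Phase 2 ERROR: {e}")
                phase2_results.append({'_unique_key': unique_key})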
        
        # Combine Phase 1 + Phase 2
        df_phase1 = pd.DataFrame(all_results_on)
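        # Keep one Phase 2 row per key so a repeated key cannot multiply rows in the merge
        df_phase2 = pd.DataFrame(phase2_results).drop_duplicates(subset='_unique_key')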
        df_on = df_phase1.merge(df_phase2, on='_unique_key', how='left')
        df_on = df_on.drop(columns=['_unique_key'])
        
        # Save
        output_file = f"3.02_ON_Alberta_Complete.xlsx"
        df_on.to_excel(output_file, index=False)
        
        print(f"\n{'='*70}")
        print(f"ON COMPLETE")
        print("="*70)
        print(f"Total codes: {len(df_on)}")
        print(f"  - PRIMARY: {len(df_on[df_on['Type'] == 'PRIMARY'])}")
        print(f"  - ADD-ON: {len(df_on[df_on['Type'] == 'ADD-ON'])}")
        print(f"\nBy Fee Type:")
        for ft in df_on['Fee_Type'].unique():
            print(f"  - {ft}: {len(df_on[df_on['Fee_Type'] == ft])}")
        print(f"\nBy Setting:")
        for s in df_on['Setting'].unique():
            print(f"  - {s}: {len(df_on[df_on['Setting'] == s])}")
        print(f"\nAPI calls: {total_calls} | Cost: ${total_cost:.2f}")
        print(f"\nSaved: {output_file}")
        
        files.download(output_file)
    else:
        print("No codes found for ON")
        df_on = pd.DataFrame()
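
For Ontario, `process_phase1_result` builds `_unique_key` from the code, fee, fee type, setting, and section, so the same code listed under different settings gets separate rows. The following is a minimal sketch of that composition, showing that only an exact repeat produces a colliding key. `make_on_key` is a hypothetical helper, and the code and fees are placeholders, not values from the Ontario schedule.

```python
from collections import Counter


def make_on_key(code: str, fee: str, fee_type: str, setting: str, section_key: str) -> str:
    """Compose the same underscore-joined key the Ontario branch builds."""
    return f"{code}_{fee}_{fee_type}_{setting}_{section_key}"


# Hypothetical Phase 1 entries for one section
entries = [
    {"code": "X123", "fee": "50.00", "fee_type": "Standard", "setting": "H"},
    {"code": "X123", "fee": "65.00", "fee_type": "Standard", "setting": "P"},
    {"code": "X123", "fee": "50.00", "fee_type": "Standard", "setting": "H"},  # exact repeat
]

keys = [make_on_key(e["code"], e["fee"], e["fee_type"], e["setting"], "Section A")
        for e in entries]
repeated = [k for k, n in Counter(keys).items() if n > 1]

print(len(set(keys)), "distinct keys")
print("repeated:", repeated)
```

Since Phase 1 and Phase 2 rows are joined on this key, listing repeated keys before the merge is a cheap way to spot entries the model returned twice.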

## Cell 7d: Run Saskatchewan

In [None]:
# ============================================================================
# SASKATCHEWAN - FULL CROSSWALK
# ============================================================================

PROV_CODE = 'SK'
prov_config = PROVINCE_CONFIGS[PROV_CODE]

print("="*70)
print(f"SASKATCHEWAN CROSSWALK")
print("="*70)

# Check files exist
if not PDF_FILES.get(PROV_CODE) or not REF_FILES.get(PROV_CODE):
    print(f"ERROR: Missing files for {PROV_CODE}")
    print(f"  PDF: {PDF_FILES.get(PROV_CODE)}")
    print(f"  REF: {REF_FILES.get(PROV_CODE)}")
else:
    reset_cost_tracking()
    
    # Load reference CSV
    df_ref = pd.read_csv(REF_FILES[PROV_CODE])
    df_ref = df_ref.sort_values('page_start').reset_index(drop=True)
    print(f"Loaded {len(df_ref)} section entries from reference CSV")
    
    # Load PDF pages
    print(f"\nLoading PDF: {PDF_FILES[PROV_CODE]}")
    pdf_pages, total_pages = load_pdf_pages(PDF_FILES[PROV_CODE])
    print(f"Loaded {len(pdf_pages)} pages")
    
    # Build section chunks (Level 1)
    section_chunks = build_section_chunks_level1(df_ref, pdf_pages, total_pages, prov_config)
    print(f"\nCreated {len(section_chunks)} section chunks")
    
    # --- PHASE 1: Extract codes ---
    print(f"\n{'='*70}")
    print("PHASE 1: EXTRACTING CODES")
    print("="*70)
    
    all_results_sk = []
    code_chunks_sk = {}
    
    for section_key, section_info in tqdm(section_chunks.items(), desc="Processing sections"):
        char_count = section_info['char_count']
        max_tokens = get_dynamic_max_tokens(char_count)
        
        prompt = build_phase1_prompt(section_key, section_info, PROV_CODE, prov_config)
        
        try:
            resp = client.chat.completions.create(
                model="gpt-5.1-2025-11-13",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                max_completion_tokens=max_tokens
            )
            track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
            
            content = resp.choices[0].message.content
            match = re.search(r'\{[\s\S]*\}', content)
            
            if match:
                result = json.loads(match.group())
                if result.get('found'):
                    rows = process_phase1_result(result, section_key, section_info, PROV_CODE, code_chunks_sk)
                    all_results_sk.extend(rows)
                    print(f"  [{section_key[:40]}] -> {len(rows)} codes")
        except Exception as e:
            print(f"  [{section_key[:40]}] ERROR: {e}")
    
    print(f"\nPhase 1 complete: {len(all_results_sk)} codes found")
    
    # --- PHASE 2: Extract attributes ---
    if len(all_results_sk) > 0:
        print(f"\n{'='*70}")
        print("PHASE 2: EXTRACTING ATTRIBUTES")
        print("="*70)
        
        rules_text = extract_rules_text(PDF_FILES[PROV_CODE], prov_config['rules_pages'])
        print(f"Loaded rules text: {len(rules_text):,} chars")
        
        # Dynamic token limit based on rules size
        phase2_tokens = get_phase2_max_tokens(len(rules_text))
        print(f"Phase 2 max_completion_tokens: {phase2_tokens}")
        
        phase2_results = []
        for code_info in tqdm(all_results_sk, desc="Extracting attributes"):
            unique_key = code_info.get('_unique_key', '')
            chunk_text = code_chunks_sk.get(unique_key, '')
            
            prompt = build_phase2_prompt(code_info, chunk_text, rules_text, PROV_CODE)
            
            try:
                resp = client.chat.completions.create(
                    model="gpt-5.1-2025-11-13",
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.1,
                    max_completion_tokens=phase2_tokens
                )
                track_cost(resp.usage.prompt_tokens, resp.usage.completion_tokens)
                
                content = resp.choices[0].message.content
                match = re.search(r'\{[\s\S]*\}', content)
                
                if match:
                    attrs = json.loads(match.group())
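                    if attrs.get('same_day_exclusions') and isinstance(attrs['same_day_exclusions'], list):
                        # Cast each entry to str so numeric codes do not break the join
                        attrs['same_day_exclusions'] = ', '.join(str(x) for x in attrs['same_day_exclusions'])
                    phase2_results.append({'_unique_key': unique_key, **attrs})
                else:
                    phase2_results.append({'_unique_key': unique_key})
            except Exception as e:
                # Keep the row (attributes stay empty) but report the failure
                print(f"  [{code_info.get('Code', '')}] Phase 2 ERROR: {e}")
                phase2_results.append({'_unique_key': unique_key})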
        
        # Combine Phase 1 + Phase 2
        df_phase1 = pd.DataFrame(all_results_sk)
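        # Keep one Phase 2 row per key so a repeated key cannot multiply rows in the merge
        df_phase2 = pd.DataFrame(phase2_results).drop_duplicates(subset='_unique_key')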
        df_sk = df_phase1.merge(df_phase2, on='_unique_key', how='left')
        df_sk = df_sk.drop(columns=['_unique_key'])
        
        # Save
        output_file = f"3.02_SK_Alberta_Complete.xlsx"
        df_sk.to_excel(output_file, index=False)
        
        print(f"\n{'='*70}")
        print(f"SK COMPLETE")
        print("="*70)
        print(f"Total codes: {len(df_sk)}")
        print(f"  - PRIMARY: {len(df_sk[df_sk['Type'] == 'PRIMARY'])}")
        print(f"  - ADD-ON: {len(df_sk[df_sk['Type'] == 'ADD-ON'])}")
        print("\nBy Fee Type:")
        for ft in df_sk['Fee_Type'].unique():
            print(f"  - {ft}: {len(df_sk[df_sk['Fee_Type'] == ft])}")
        print(f"\nAge Premium Applies: {df_sk['Age_Premium_Applies'].sum()} codes")
        print(f"\nAPI calls: {total_calls} | Cost: ${total_cost:.2f}")
        print(f"\nSaved: {output_file}")
        
        files.download(output_file)
    else:
        print("No codes found for SK")
        df_sk = pd.DataFrame()
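
The Phase 2 loop above pulls a JSON object out of each model reply with a greedy regex and flattens `same_day_exclusions` into a comma-separated string. A minimal standalone sketch of that parsing step, useful for checking the logic without spending API calls, might look like this. The helper name `parse_phase2_reply` and the sample reply are hypothetical and not part of the notebook.

```python
import json
import re

def parse_phase2_reply(content, unique_key):
    """Hypothetical helper mirroring the Phase 2 parsing step.

    Always returns a dict containing '_unique_key', so the later
    left merge has a row for every code even when parsing fails.
    """
    row = {'_unique_key': unique_key}
    # Greedy match: from the first '{' to the last '}' in the reply
    match = re.search(r'\{[\s\S]*\}', content or '')
    if not match:
        return row
    try:
        attrs = json.loads(match.group())
    except json.JSONDecodeError:
        return row
    exclusions = attrs.get('same_day_exclusions')
    if isinstance(exclusions, list):
        # Flatten the list so the spreadsheet cell holds a plain string
        attrs['same_day_exclusions'] = ', '.join(str(x) for x in exclusions)
    row.update(attrs)
    return row

# Invented sample reply with chatter around the JSON payload
reply = 'Result: {"same_day_exclusions": ["X1", "X2"], "flag": true} Done.'
print(parse_phase2_reply(reply, 'k1'))
print(parse_phase2_reply('no JSON in this reply', 'k2'))
```

Because the regex is greedy, a reply containing two separate JSON objects would be captured as one invalid span and fall through to the placeholder row, which is the same outcome the loop's `except` branch produces.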

---
# STEP 7: Combine & Summary
---

## Cell 8: Combine All Provinces & Final Summary

In [None]:
# ============================================================================
# COMBINE ALL PROVINCES & FINAL SUMMARY
# ============================================================================

print("="*70)
print("COMBINING ALL PROVINCE RESULTS")
print("="*70)

# Collect all province DataFrames
all_dfs = []
province_stats = {}

# Check each province
for prov_code in ['BC', 'MB', 'ON', 'SK']:
    # Look up df_bc, df_mb, ... by name; a province cell that was never run leaves no variable
    df = globals().get(f"df_{prov_code.lower()}")
    if df is not None and len(df) > 0:
        all_dfs.append(df)
        province_stats[prov_code] = {
            'total': len(df),
            'primary': len(df[df['Type'] == 'PRIMARY']),
            'addon': len(df[df['Type'] == 'ADD-ON'])
        }
        print(f"  {prov_code}: {len(df)} codes")
    else:
        print(f"  {prov_code}: No data (skipped or not run)")

if len(all_dfs) > 0:
    # Combine all
    df_combined = pd.concat(all_dfs, ignore_index=True)
    
    # pd.concat already aligns on the union of all columns; attributes a
    # province lacks come through as NaN, which to_excel writes as blank cells.
    
    # Save combined file
    ab_code = ALBERTA_CODE_CONFIG['code'].replace('.', '_')
    combined_file = f"3.03_All_Province_{ab_code}_Complete.xlsx"
    df_combined.to_excel(combined_file, index=False)
    
    print(f"\n{'='*70}")
    print("FINAL SUMMARY")
    print("="*70)
    print(f"\nAlberta Code: {ALBERTA_CODE_CONFIG['code']} - {ALBERTA_CODE_CONFIG['description']}")
    print(f"\nTotal codes across all provinces: {len(df_combined)}")
    
    print("\n--- BY PROVINCE ---")
    for prov, stats in province_stats.items():
        print(f"  {prov}: {stats['total']:3} codes ({stats['primary']} primary, {stats['addon']} add-on)")
    
    print("\n--- BY TYPE ---")
    print(f"  PRIMARY: {len(df_combined[df_combined['Type'] == 'PRIMARY'])}")
    print(f"  ADD-ON: {len(df_combined[df_combined['Type'] == 'ADD-ON'])}")
    
    print("\n--- OUTPUT FILES ---")
    for prov_code in province_stats.keys():
        print(f"  3.02_{prov_code}_Alberta_Complete.xlsx")
    print(f"  {combined_file} (combined)")
    
    print(f"\n{'='*70}")
    print("ALL PROVINCES COMPLETE")
    print("="*70)
    
    # Download combined file
    files.download(combined_file)
else:
    print("\nNo province data available to combine.")
    print("Run the individual province cells (7a-7d) first.")
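
As a rough illustration of how the combine cell handles provinces whose attribute columns differ, the sketch below concatenates two toy frames and skips a province that was never run. All values are invented, and the `Code` and `Setting` column names are assumptions made for the example; only `Type` and `Age_Premium_Applies` appear in the notebook itself.

```python
import pandas as pd

# Toy stand-ins for two province results (values invented for illustration)
df_on = pd.DataFrame({'Code': ['A001'], 'Type': ['PRIMARY'], 'Setting': ['H']})
df_sk = pd.DataFrame({
    'Code': ['5B', '6B'],
    'Type': ['PRIMARY', 'ADD-ON'],
    'Age_Premium_Applies': [True, False],
})

# None simulates a province cell that was skipped
frames = {'BC': None, 'ON': df_on, 'SK': df_sk}
available = [df for df in frames.values() if df is not None and len(df) > 0]

# concat takes the union of columns and leaves NaN where a province has no value
combined = pd.concat(available, ignore_index=True)

print(sorted(combined.columns))
print(combined['Type'].value_counts().to_dict())
print(int(combined['Setting'].isna().sum()), "rows without a Setting value")
```

Since the union of columns is built by `pd.concat` itself, no extra column-filling pass is needed before writing the combined workbook.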