
This notebook applies the Sheet 2 metadata definitions to decode survey values, mapping coded responses (e.g., 1, 2, 3) to human‑readable labels (e.g., Employed, Unemployed).

Dependencies:
- Run `00_Settings.ipynb` and `01_Inventory.ipynb` first.
- Requires outputs from `03_Metadata_Decoder.ipynb` (Header Encoded Surveys).
- Note: Survey CSVs already carry meaning from Sheet 2 reshaping. This notebook applies the final decoding logic.

Output:

- Fully decoded survey CSVs saved into **NEW Fully Decoded Surveys**.
- Reports per survey showing number of columns successfully decoded.

Notes:
- **Sheet 1 metadata** → header translation (column names).  
- **Sheet 2 metadata** → value translation (coded responses).  
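
As a minimal sketch of the value translation described above, the snippet below decodes a few coded responses with a plain dictionary. The codes and the `decode_value` helper are illustrative assumptions, not taken from the actual Sheet 2 files; values with no matching rule are returned unchanged, which mirrors the fallback used by the decoder cells in this notebook.

```python
# Hypothetical lookup for one coded variable (illustrative codes and labels only)
lookup = {1: "Employed", 2: "Unemployed"}

def decode_value(val, lookup):
    """Return the label for a coded value, or the value itself if no rule matches."""
    try:
        return lookup.get(int(float(val)), val)
    except (TypeError, ValueError, OverflowError):
        return val  # non-numeric values pass through unchanged

# Codes may arrive as ints, floats, or strings depending on how the CSV was read
raw_values = [1, "2", 2.0, 9, "n/a"]
decoded = [decode_value(v, lookup) for v in raw_values]
print(decoded)
```

Normalizing through `int(float(...))` lets `"2"`, `2`, and `2.0` all hit the same rule, while unknown codes such as `9` stay visible in the output instead of being silently dropped.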

**INTENT:** This notebook performs the **final decoding stage** of the Labor Force Survey pipeline.  
It applies **Sheet 2 metadata** to translate coded survey values into human‑readable labels.

- Next steps: duplicate variable detection, integrity checks, coverage scanning.


In [1]:
import json
from pathlib import Path
import os
import pandas as pd

# ------------------------------------------------------------
# Load settings from config.json (produced by 00_Settings.ipynb)
# ------------------------------------------------------------
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# ------------------------------------------------------------
# Load inventory (produced by 01_Inventory.ipynb)
# ------------------------------------------------------------
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# Alias for compatibility
base_path = str(BASE_PATH)


### Interim Sample 

#### Intent: Interim Sample Decoder

This interim function generates a **single sample output** (e.g., `NEW Fully Decoded Survey Sample` for **January 2018**) using the same decoding logic as the batch runner.  

Unlike the full batch process, which writes all decoded surveys to Google Drive, this sample run saves directly into the local **interim repository path**.  

The purpose is to give a quick preview of what a fully decoded CSV looks like without downloading the large files from Google Drive.  

Use this when:
- You want to validate decoding logic on a small subset before running the full batch.  
- You need a lightweight example for documentation, testing, or demonstration.  
- You want to inspect decoded values locally without waiting for the complete dataset.
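
When validating on a sample, one thing worth checking is how a metadata description gets matched to a survey column. The sketch below isolates the code-prefix rule used by `find_target_column` in the next cell: a column qualifies when it starts with the same `C<number>` code and the following character is not a digit, so `C10` does not match `C101`. The helper name `columns_with_code` and the column names are invented for illustration.

```python
import re

def columns_with_code(survey_columns, code):
    """Return columns that start with `code` and are not followed by another digit."""
    pattern = re.compile(rf"^{re.escape(code)}(?![0-9])", re.IGNORECASE)
    return [col for col in survey_columns if pattern.match(col.strip())]

# Hypothetical survey headers (not taken from the real files)
columns = ["C10-Sample Column A", "C101-Sample Column B", "c10_Sample Column C"]
matches = columns_with_code(columns, "C10")
print(matches)
```

If more than one column survives this filter, the notebook's matcher then falls back to comparing the normalized description text to pick between them.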


In [2]:
import json
from pathlib import Path
import os
import pandas as pd
import re

# ------------------------------------------------------------
# Shared folder names
# ------------------------------------------------------------
HEADER_ENCODED_FOLDER = "NEW Header Encoded Surveys"
FULLY_DECODED_FOLDER = "NEW Fully Decoded Surveys"
FULLY_DECODED_SAMPLE_FOLDER = "NEW Fully Decoded Survey Sample"

# ------------------------------------------------------------
# Month parsing
# ------------------------------------------------------------
MONTHS = [
    "JANUARY","FEBRUARY","MARCH","APRIL","MAY","JUNE",
    "JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER","DECEMBER"
]

# Built from MONTHS so the two definitions cannot drift apart
MONTH_PATTERN = re.compile("(" + "|".join(MONTHS) + ")", re.IGNORECASE)

# ============================================================
# Metadata loader
# ============================================================
def load_clean_sheet2(base_path, year, month):
    path = os.path.join(
        base_path,
        "NEW Metadata Sheet 2 CSV's",
        str(year),
        f"Sheet2_{month}_{year}.csv"
    )
    if not os.path.exists(path):
        raise FileNotFoundError(f"Metadata not found at: {path}")
    return pd.read_csv(path, dtype=str)

# ============================================================
# Helper: Normalize Text
# ============================================================
def normalize_key(s):
    if pd.isna(s): return ""
    return re.sub(r'[^a-z0-9]', '', str(s).lower())

# ============================================================
# COLUMN MATCHING LOGIC
# ============================================================
def find_target_column(survey_columns, meta_desc):
    if pd.isna(meta_desc): return None
    meta_desc = str(meta_desc).strip()
    
    # 1. EXTRACT ID (e.g. C29)
    meta_code_match = re.match(r'^(C\d+)', meta_desc, re.IGNORECASE)
    required_code = meta_code_match.group(1).upper() if meta_code_match else None
    
    # --- STRATEGY A: CODE MATCH (Priority) ---
    if required_code:
        # Regex: Start with CODE, followed by NOT a digit.
        # This prevents C10 from matching C101.
        pattern = re.compile(f"^{required_code}(?![0-9])", re.IGNORECASE)
        candidates = [col for col in survey_columns if pattern.match(col.strip())]
        
        if not candidates:
            return None
        
        # If multiple candidates exist (e.g. C29 Month vs C29 Year), use text match to decide.
        if len(candidates) > 1:
            target_clean = normalize_key(meta_desc)
            for cand in candidates:
                if normalize_key(cand) == target_clean:
                    return cand
            # Soft match fallback
            meta_text_only = normalize_key(re.sub(r'^(C\d+)', '', meta_desc))
            for cand in candidates:
                if meta_text_only and meta_text_only in normalize_key(cand):
                    return cand
            
        return candidates[0]

    # --- STRATEGY B: TEXT MATCH (Fallback) ---
    else:
        # Exact Match
        if meta_desc in survey_columns: 
            return meta_desc
        # Clean Match
        elif re.sub(r'^C\d+[\s\-_]+', '', meta_desc, flags=re.IGNORECASE).strip() in survey_columns:
            return re.sub(r'^C\d+[\s\-_]+', '', meta_desc, flags=re.IGNORECASE).strip()
        # Suffix Match
        else:
            for col in survey_columns:
                if col.endswith(meta_desc):
                    prefix = col[:-len(meta_desc)].strip()
                    if re.search(r'^C\d+[\s\-_]*$', prefix, re.IGNORECASE) or prefix == "":
                        return col
    
    return None

# ============================================================
# Safe decoder
# ============================================================
def decode_survey_safe(survey_df, meta_df, year=None, month=None):
    # Baseline decoder. The Age Decoder cell (In [3]) redefines decode_survey_safe,
    # so once that cell has run, the runners below call the later definition.
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    for var_code in unique_vars:
        subset = meta_df[meta_df['Variable'] == var_code].copy()
        if subset['Description'].isnull().all(): continue

        raw_desc = subset['Description'].dropna().iloc[0].strip()
        target_col = find_target_column(survey_cols, raw_desc)
        
        if not target_col: continue

        lookup = {}
        for _, row in subset.iterrows():
            label = str(row['Label']).strip()
            if label in ['0', '0.0', 'nan', 'NaN', '']: continue
            
            # --- RANGE EXPANSION LOGIC ---
            range_added = False
            min_val = str(row.get('min_value', '')).strip()
            max_val = str(row.get('max_value', '')).strip()
            if min_val and max_val and min_val != max_val:
                try:
                    start = int(float(min_val))
                    end = int(float(max_val))
                    # Prevent expanding huge accidental ranges
                    if start < end and (end - start) < 1000:
                        for k in range(start, end + 1):
                            lookup[k] = label
                        range_added = True
                except: pass

            if not range_added:
                for field in ['min_value', 'max_value', 'additional_value']:
                    val = str(row.get(field, '')).strip()
                    if val and val not in ['0', 'nan', 'NaN']:
                        try: lookup[int(float(val))] = label
                        except: lookup[val] = label

        if not lookup: continue

        def safe_map(val):
            try: return lookup.get(int(float(val)), val)
            except: return lookup.get(str(val), val)

        survey_df[target_col] = survey_df[target_col].apply(safe_map)
        decoded_count += 1

    return survey_df, decoded_count

# ============================================================
# INTERIM SAMPLE DECODER
# ============================================================
def run_sample_decoding(base_path, year="2018", month="January", interim_root=None):
    if interim_root is None:
        interim_root = os.path.join(base_path, "data", "interim")

    month = month.strip().capitalize()
    if month.upper() not in MONTHS:
        raise ValueError(f"Invalid month: {month}")

    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER, year)
    output_root = os.path.join(interim_root, FULLY_DECODED_SAMPLE_FOLDER, year)
    os.makedirs(output_root, exist_ok=True)

    if not os.path.exists(input_root):
        print(f"[SKIP] Input folder not found: {input_root}")
        return

    files = [f for f in os.listdir(input_root) if f.lower().endswith(".csv")]
    target_file = next((f for f in files if month.upper() in f.upper()), None)

    if not target_file:
        print(f"[SKIP] No survey file found for {month} {year}")
        return

    print("================================================")
    print(f"SAMPLE DECODING: {month.upper()} {year}")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    df_survey = pd.read_csv(os.path.join(input_root, target_file), low_memory=False)
    df_meta = load_clean_sheet2(base_path, year, month)

    df_final, count = decode_survey_safe(df_survey, df_meta, year, month)

    save_path = os.path.join(output_root, target_file)
    df_final.to_csv(save_path, index=False)

    print(f"[OK] Decoded {count} columns")
    print(f"[SAVED] {save_path}")

# ============================================================
# FULL BATCH DECODER
# ============================================================
def run_batch_decoding(base_path):
    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER)
    output_root = os.path.join(base_path, FULLY_DECODED_FOLDER)
    os.makedirs(output_root, exist_ok=True)

    print("================================================")
    print("STARTING FULL BATCH DECODING")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    if not os.path.exists(input_root):
        print(f"[ERROR] Input folder not found: {input_root}")
        return

    success, errors = 0, 0

    year_folders = [
        f for f in os.listdir(input_root)
        if f.isdigit() and os.path.isdir(os.path.join(input_root, f))
    ]

    for year in sorted(year_folders):
        year_in = os.path.join(input_root, year)
        year_out = os.path.join(output_root, year)
        os.makedirs(year_out, exist_ok=True)

        files = [f for f in os.listdir(year_in) if f.lower().endswith(".csv")]
        for filename in files:
            match = MONTH_PATTERN.search(filename)
            if not match: continue

            month = match.group(1).capitalize()
            print(f"Processing: {month.upper()} {year}...")

            try:
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False)
                df_meta = load_clean_sheet2(base_path, year, month)
                
                df_final, count = decode_survey_safe(df_survey, df_meta, year, month)

                save_path = os.path.join(year_out, filename)
                df_final.to_csv(save_path, index=False)

                print(f"[OK] Decoded {count} columns")
                success += 1

            except Exception as e:
                print(f"[ERROR] {e}")
                errors += 1

            print("-" * 40)

    print(f"\nCOMPLETED | Success: {success} | Errors: {errors}")

## Age Decoder
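
The cell below decodes age by range rather than by exact code: each Sheet 2 row supplies a minimum, a maximum, and a label, and a raw age takes the label of the first range that contains it. A minimal sketch of that idea follows. The ranges and the `map_age` helper are illustrative assumptions; the real rules come from the Sheet 2 CSVs.

```python
def map_age(val, ranges):
    """Return the label of the first (low, high, label) range containing val."""
    try:
        age = int(float(val))
    except (TypeError, ValueError, OverflowError):
        return val  # non-numeric values pass through unchanged
    for low, high, label in ranges:
        if low <= age <= high:
            return label
    return val  # no range matched, keep the raw value

# Illustrative ranges only
ranges = [(0, 14, "00 - 14"), (15, 24, "15 - 24"), (65, 99, "65 and Over")]
decoded = [map_age(v, ranges) for v in [7, "19.0", 70, 40, "unknown"]]
print(decoded)
```

Under this inclusive comparison, a rule whose maximum is lower than its minimum (for example a row listed as `99 to 0`) can never match on its own, so such values are only decoded if another range covers them.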

In [3]:
import pandas as pd
import re

# ============================================================
# LOGIC: MAP SELECTS METADATA RULE -> APPLIES TO C05 COLUMN
# ============================================================

AGE_DESCRIPTION_MAP = {
    ("2018", "January"): "C07-Age as of Last Birthday", 
    ("2018", "April"): "C07-Age as of Last Birthday",
    ("2018", "July"): "C07-Age as of Last Birthday", 
    ("2018", "October"): "C07-Age as of Last Birthday",
    ("2019", "January"): "C07-Age as of Last Birthday", 
    ("2019", "April"): "C07-Age as of Last Birthday",
    ("2019", "July"): "C07-Age as of Last Birthday", 
    ("2019", "October"): "C07-Age as of Last Birthday",
    ("2022", "January"): "C07-Age as of Last Birthday", 
    ("2022", "February"): "Age Group",
    ("2022", "March"): "Age Group", 
    ("2022", "April"): "C07-Age as of Last Birthday", 
    ("2022", "May"): "Age Group", 
    ("2022", "June"): "Age Group", 
    ("2022", "July"): "C07-Age as of Last Birthday", 
    ("2022", "August"): "Age Group", 
    ("2022", "September"): "Age Group", 
    ("2022", "October"): "C07-Age as of Last Birthday", 
    ("2022", "November"): "Age Group", 
    ("2022", "December"): "Age Group", 
    ("2023", "January"): "C07-Age as of Last Birthday", 
    ("2023", "February"): "Age Group", 
    ("2023", "March"): "Age Group", 
    ("2023", "April"): "C07-Age as of Last Birthday", 
    ("2023", "May"): "Age Group", 
    ("2023", "June"): "Age Group", 
    ("2023", "July"): "C05-Age as of Last Birthday", 
    ("2023", "August"): "Age Group", 
    ("2023", "September"): "Age Group", 
    ("2023", "October"): "C07-Age as of Last Birthday", 
    ("2023", "November"): "Age Group", 
    ("2023", "December"): "Age Group", 
    ("2024", "January"): "C07-Age as of Last Birthday", 
    ("2024", "February"): "Age Group", 
    ("2024", "March"): "Age Group", 
    ("2024", "April"): "C07-Age as of Last Birthday", 
    ("2024", "May"): "Age Group", 
    ("2024", "June"): "Age Group", 
    ("2024", "July"): "C07-Age as of Last Birthday", 
    ("2024", "August"): "Age Group"
}

def normalize_key(s):
    if pd.isna(s): return ""
    return re.sub(r'[^a-z0-9]', '', str(s).lower())

def find_consistent_age_column(survey_columns):
    for col in survey_columns:
        if col.upper().startswith("C05"): return col
    for col in survey_columns:
        if col.upper().startswith("C07"): return col
    for col in survey_columns:
        if "AGE" in col.upper(): return col
    return None

def find_column_smart_strict(survey_columns, meta_desc):
    desc_str = str(meta_desc).strip()
    
    # 1. EXTRACT ID (e.g. C29)
    meta_code_match = re.match(r'^(C\d+)', desc_str, re.IGNORECASE)
    required_code = meta_code_match.group(1).upper() if meta_code_match else None
    
    # --- STRATEGY A: CODE MATCH (With Tie-Breaker) ---
    if required_code:
        # Regex: Start with CODE, followed by NOT a digit.
        pattern = re.compile(f"^{required_code}(?![0-9])", re.IGNORECASE)
        candidates = [col for col in survey_columns if pattern.match(col.strip())]
        
        if not candidates: return None
        
        # If only one match, return it
        if len(candidates) == 1: return candidates[0]
        
        # TIE-BREAKER: Check text match if multiple candidates exist (e.g. C29 Month vs Year)
        target_clean = normalize_key(desc_str)
        for cand in candidates:
            if normalize_key(cand) == target_clean: return cand
            
        meta_text_only = normalize_key(re.sub(r'^(C\d+)', '', desc_str))
        for cand in candidates:
            if meta_text_only and meta_text_only in normalize_key(cand): return cand
            
        return candidates[0]
    
    # --- STRATEGY B: TEXT MATCH (Fallback) ---
    target_clean = normalize_key(desc_str)
    for col in survey_columns:
        if normalize_key(col) == target_clean:
            return col
            
    return None

def decode_survey_safe(survey_df, meta_df, year, month):
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    # --- PART 1: AGE LOGIC (metadata rule dictated by AGE_DESCRIPTION_MAP) ---
    target_map_desc = AGE_DESCRIPTION_MAP.get((str(year), str(month)), "C05-Age as of Last Birthday")
    age_meta_rows = pd.DataFrame()
    code_match = re.match(r'^(C\d+)', target_map_desc, re.IGNORECASE)
    
    if code_match:
        required_code = code_match.group(1).upper()
        age_meta_rows = meta_df[
            (meta_df['Variable'].str.upper().str.contains("AGE", na=False)) & 
            (meta_df['Description'].str.upper().str.contains(required_code, na=False))
        ]
    else:
        target_clean = normalize_key(target_map_desc)
        meta_df['temp_norm'] = meta_df['Description'].apply(normalize_key)
        age_meta_rows = meta_df[
            (meta_df['Variable'].str.upper().str.contains("AGE", na=False)) & 
            (meta_df['temp_norm'] == target_clean)
        ]
        del meta_df['temp_norm']

    target_col = find_consistent_age_column(survey_cols)

    if target_col and not age_meta_rows.empty:
        ranges = []
        for _, row in age_meta_rows.iterrows():
            try:
                low = int(float(row['min_value']))
                high = int(float(row['max_value']))
                label = str(row['Label']).strip()
                if "2014-10-01" in label or "Oct-14" in label: label = "10 - 14"
                elif "2023-09-05" in label or "Sep-05" in label or "05-09" in label: label = "05 - 09"
                elif "-01-04" in label or "Jan-04" in label: label = "00 - 04"
                ranges.append((low, high, label))
            except: continue
        
        def map_age_range(val):
            s_val = str(val).strip()
            if "2014-10-01" in s_val: return "10 - 14"
            if "09-05" in s_val or "Sep-05" in s_val: return "05 - 09"
            if "-01-04" in s_val: return "00 - 04"
            try:
                v = int(float(val))
                for l, h, lab in ranges:
                    if l <= v <= h: return lab
                return val
            except: return val
        
        survey_df[target_col] = survey_df[target_col].apply(map_age_range)
        decoded_count += 1
        
        used_vars = age_meta_rows['Variable'].unique()
        unique_vars = [v for v in unique_vars if v not in used_vars]

    # --- PART 2: STANDARD LOOP (Strict & Range Fix) ---
    for var_code in unique_vars:
        if "AGE" in str(var_code).upper(): continue

        subset = meta_df[meta_df['Variable'] == var_code].copy()
        if subset['Description'].isnull().all(): continue
        target_desc = subset['Description'].dropna().iloc[0].strip()

        target_col = find_column_smart_strict(survey_cols, target_desc)
        if not target_col: continue

        lookup = {}
        for _, row in subset.iterrows():
            label = str(row['Label']).strip()
            if label in ['0', '0.0', 'nan', 'NaN', '']: continue
            
            # --- FIX: RANGE EXPANSION ---
            range_added = False
            min_val = str(row.get('min_value', '')).strip()
            max_val = str(row.get('max_value', '')).strip()
            if min_val and max_val and min_val != max_val:
                try:
                    start = int(float(min_val))
                    end = int(float(max_val))
                    # Prevent expanding huge accidental ranges
                    if start < end and (end - start) < 1000:
                        for k in range(start, end + 1):
                            lookup[k] = label
                        range_added = True
                except: pass

            if not range_added:
                for field in ['min_value', 'max_value', 'additional_value']:
                    val = str(row.get(field, '')).strip()
                    if val and val not in ['0', 'nan', 'NaN']:
                        try: lookup[int(float(val))] = label
                        except: lookup[val] = label

        if not lookup: continue
        survey_df[target_col] = survey_df[target_col].apply(
            lambda x: lookup.get(int(float(x)), x) if str(x).replace('.','',1).isdigit() else lookup.get(str(x), x)
        )
        decoded_count += 1

    return survey_df, decoded_count

## Age Decoder Verifier
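
The verifier below labels each sampled value by comparing it with its decoded form. The sketch here restates that classification as a small standalone function so the three outcomes are easy to see. The function name `classify_conversion` is hypothetical and not part of the notebook.

```python
def classify_conversion(raw, decoded):
    """Classify a decode result the way the audit table reports it."""
    if str(decoded) != str(raw):
        return "CONVERTED"       # value changed, e.g. 70 -> "65 and Over"
    if "-" in str(decoded) or "Over" in str(decoded):
        return "ALREADY GROUP"   # value was already a group label in the file
    return "RAW NUMBER"          # value unchanged and still looks raw

statuses = [
    classify_conversion(70, "65 and Over"),
    classify_conversion("15 - 24", "15 - 24"),
    classify_conversion(40, 40),
]
print(statuses)
```

The audit counts both `CONVERTED` and `ALREADY GROUP` as successes, so a file whose age column already holds group labels still passes.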

In [4]:
import os
import pandas as pd
import re

# ============================================================
# CONFIGURATION
# ============================================================
BASE_PATH = "G:\\.shortcut-targets-by-id\\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\\Labor Force Survey"
HEADER_ENCODED_FOLDER = "NEW Header Encoded Surveys"

# ============================================================
# HARDCODED MAP
# ============================================================
AGE_DESCRIPTION_MAP = {
    ("2018", "January"): "C07-Age as of Last Birthday", 
    ("2018", "April"): "C07-Age as of Last Birthday",
    ("2018", "July"): "C07-Age as of Last Birthday", 
    ("2018", "October"): "C07-Age as of Last Birthday",
    ("2019", "January"): "C07-Age as of Last Birthday", 
    ("2019", "April"): "C07-Age as of Last Birthday",
    ("2019", "July"): "C07-Age as of Last Birthday", 
    ("2019", "October"): "C07-Age as of Last Birthday",
    ("2022", "January"): "C07-Age as of Last Birthday", 
    ("2022", "February"): "Age Group",
    ("2022", "March"): "Age Group", 
    ("2022", "April"): "C07-Age as of Last Birthday", 
    ("2022", "May"): "Age Group", 
    ("2022", "June"): "Age Group", 
    ("2022", "July"): "C07-Age as of Last Birthday", 
    ("2022", "August"): "Age Group", 
    ("2022", "September"): "Age Group", 
    ("2022", "October"): "C07-Age as of Last Birthday", 
    ("2022", "November"): "Age Group", 
    ("2022", "December"): "Age Group", 
    ("2023", "January"): "C07-Age as of Last Birthday", 
    ("2023", "February"): "Age Group", 
    ("2023", "March"): "Age Group", 
    ("2023", "April"): "C07-Age as of Last Birthday", 
    ("2023", "May"): "Age Group", 
    ("2023", "June"): "Age Group", 
    ("2023", "July"): "C05-Age as of Last Birthday", 
    ("2023", "August"): "Age Group", 
    ("2023", "September"): "Age Group", 
    ("2023", "October"): "C07-Age as of Last Birthday", 
    ("2023", "November"): "Age Group", 
    ("2023", "December"): "Age Group", 
    ("2024", "January"): "C07-Age as of Last Birthday", 
    ("2024", "February"): "Age Group", 
    ("2024", "March"): "Age Group", 
    ("2024", "April"): "C07-Age as of Last Birthday", 
    ("2024", "May"): "Age Group", 
    ("2024", "June"): "Age Group", 
    ("2024", "July"): "C07-Age as of Last Birthday", 
    ("2024", "August"): "Age Group"
}

def normalize_key(s):
    if pd.isna(s): return ""
    return re.sub(r'[^a-z0-9]', '', str(s).lower())

def find_consistent_age_column(survey_columns):
    for col in survey_columns:
        if col.upper().startswith("C05"): return col
    for col in survey_columns:
        if col.upper().startswith("C07"): return col
    for col in survey_columns:
        if "AGE" in col.upper(): return col
    return None

def verify_age_logic(df_survey, df_meta, year, month):
    print(f"\n========================================================")
    print(f"AUDIT REPORT: {month.upper()} {year}")
    print(f"========================================================")

    # 1. CHECK LOGIC SELECTION
    target_map_desc = AGE_DESCRIPTION_MAP.get((str(year), str(month)), "C05-Age as of Last Birthday")
    print(f"[1] LOGIC SELECTED FROM MAP:  '{target_map_desc}'")

    # 2. CHECK METADATA RULES (Sheet 2)
    age_meta_rows = pd.DataFrame()
    code_match = re.match(r'^(C\d+)', target_map_desc, re.IGNORECASE)
    
    if code_match:
        required_code = code_match.group(1).upper()
        age_meta_rows = df_meta[
            (df_meta['Variable'].str.upper().str.contains("AGE", na=False)) & 
            (df_meta['Description'].str.upper().str.contains(required_code, na=False))
        ]
    else:
        target_clean = normalize_key(target_map_desc)
        # Compare against a normalized copy instead of leaving a temporary column on df_meta
        desc_norm = df_meta['Description'].apply(normalize_key)
        age_meta_rows = df_meta[
            (df_meta['Variable'].str.upper().str.contains("AGE", na=False)) & 
            (desc_norm == target_clean)
        ]

    if age_meta_rows.empty:
        print("    ERROR: No matching rules found in Metadata Sheet 2!")
        return
    
    print(f"[2] METADATA RULES FOUND:     {len(age_meta_rows)} rows")
    print("    --- Rules Sample (Sheet 2 Values) ---")
    for _, row in age_meta_rows.iterrows():
        l = row.get('min_value', 'nan')
        h = row.get('max_value', 'nan')
        lbl = row.get('Label', 'No Label')
        print(f"    Rule: {str(l):>5} to {str(h):<5} -> '{lbl}'")

    # 3. CHECK FILE COLUMN
    target_col = find_consistent_age_column(list(df_survey.columns))
    print(f"[3] TARGET FILE COLUMN:       '{target_col}'")
    
    if not target_col:
        print("    ERROR: No Age column found in file!")
        return

    # 4. PREPARE DECODER
    ranges = []
    for _, row in age_meta_rows.iterrows():
        try:
            low = int(float(row['min_value']))
            high = int(float(row['max_value']))
            label = str(row['Label']).strip()
            # Date Sanitizer
            if "2014-10-01" in label or "Oct-14" in label: label = "10 - 14"
            elif "2023-09-05" in label or "Sep-05" in label or "05-09" in label: label = "05 - 09"
            elif "-01-04" in label or "Jan-04" in label: label = "00 - 04"
            ranges.append((low, high, label))
        except: continue

    def map_age_range(val):
        s_val = str(val).strip()
        # File Data Sanitizer
        if "2014-10-01" in s_val: return "10 - 14"
        if "09-05" in s_val or "Sep-05" in s_val: return "05 - 09"
        if "-01-04" in s_val: return "00 - 04"
        try:
            v = int(float(val))
            for l, h, lab in ranges:
                if l <= v <= h: return lab
            return val
        except: return val

    # 5. TEST CONVERSION
    print(f"[4] CONVERSION TEST (Sample of 10 values)")
    print(f"    {'RAW (In File)':<20} | {'DECODED (Output)':<30} | {'STATUS'}")
    print("    " + "-"*70)

    # Get sample values (prioritizing numbers to test grouping)
    raw_vals = df_survey[target_col].dropna().unique()
    sample_vals = [x for x in raw_vals if str(x).replace('.', '', 1).isdigit()]
    if len(sample_vals) < 5: sample_vals = raw_vals[:10]
    else: sample_vals = sample_vals[:10]

    converted_count = 0
    total_checked = 0

    for val in sample_vals:
        decoded = map_age_range(val)
        status = "FAIL"
        
        # Check if it changed from number to text/group
        if str(decoded) != str(val):
            status = "CONVERTED"
            converted_count += 1
        elif "-" in str(decoded) or "Over" in str(decoded):
            status = "ALREADY GROUP"
            converted_count += 1
        else:
            status = "RAW NUMBER"
        
        print(f"    {str(val):<20} | {str(decoded):<30} | {status}")
        total_checked += 1

    if converted_count > 0:
        print(f"\n[5] VERDICT: SUCCESS ({converted_count}/{total_checked} samples converted to groups)")
    else:
        print(f"\n[5] VERDICT: FAILURE (Values remained raw numbers. Check Metadata ranges vs File values)")

# ============================================================
# RUNNER
# ============================================================
def run_verifier():
    input_root = os.path.join(BASE_PATH, HEADER_ENCODED_FOLDER)
    if not os.path.exists(input_root):
        print("Folder not found.")
        return

    year_folders = [
        f for f in os.listdir(input_root)
        if f.isdigit() and os.path.isdir(os.path.join(input_root, f))
    ]
    
    for year in sorted(year_folders):
        year_in = os.path.join(input_root, year)
        files = [f for f in os.listdir(year_in) if f.lower().endswith(".csv")]
        months_done = []

        for filename in files:
            match = re.search(r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)", filename, re.IGNORECASE)
            if not match: continue
            month = match.group(1).capitalize()
            
            # Avoid duplicate checks per month
            if month in months_done: continue
            
            try:
                # Load Metadata
                meta_path = os.path.join(BASE_PATH, "NEW Metadata Sheet 2 CSV's", str(year), f"Sheet2_{month}_{year}.csv")
                if not os.path.exists(meta_path):
                    print(f"SKIP: No metadata for {month} {year}")
                    continue
                df_meta = pd.read_csv(meta_path, dtype=str)

                # Load Survey (Fast Load)
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False, nrows=1000)
                
                verify_age_logic(df_survey, df_meta, year, month)
                months_done.append(month)

            except Exception as e:
                print(f"Error checking {month} {year}: {e}")

run_verifier()


AUDIT REPORT: JANUARY 2018
[1] LOGIC SELECTED FROM MAP:  'C07-Age as of Last Birthday'
[2] METADATA RULES FOUND:     14 rows
    --- Rules Sample (Sheet 2 Values) ---
    Rule:    15 to 24.0  -> '15 - 24'
    Rule:    25 to 34.0  -> '25 - 34'
    Rule:    35 to 44.0  -> '35 - 44'
    Rule:    45 to 54.0  -> '45 - 54'
    Rule:    55 to 64.0  -> '55 - 64'
    Rule:    65 to 98.0  -> '65  and Over'
    Rule:    99 to 0     -> 'Not Reported'
    Rule:     0 to 14.0  -> '00 - 14'
    Rule:    15 to 24.0  -> '15 - 24'
    Rule:    25 to 34.0  -> '25 - 34'
    Rule:    35 to 44.0  -> '35 - 44'
    Rule:    45 to 54.0  -> '45 - 54'
    Rule:    55 to 64.0  -> '55 - 64'
    Rule:    65 to 99.0  -> '65 and Over'
[3] TARGET FILE COLUMN:       'C05-Age as of Last Birthday'
[4] CONVERSION TEST (Sample of 10 values)
    RAW (In File)        | DECODED (Output)               | STATUS
    ----------------------------------------------------------------------
    70                   | 65  and Over   

In [5]:
# ============================================================
# PIPELINE ENTRY POINT
# ============================================================

def show_decoding_options():
    print("==============================================")
    print("SURVEY VALUE DECODING OPTIONS")
    print("==============================================")
    print("1. run_sample()      → Small GitHub-safe sample")
    print("2. run_full_batch() → Full local decoding")
    print("==============================================\n")


def run_sample():
    run_sample_decoding(
        base_path=BASE_PATH,
        year="2018",
        month="January"
    )


def run_full_batch():
    run_batch_decoding(base_path=BASE_PATH)


show_decoding_options()


SURVEY VALUE DECODING OPTIONS
1. run_sample()      → Small GitHub-safe sample
2. run_full_batch() → Full local decoding



In [6]:
run_sample()

SAMPLE DECODING: JANUARY 2018
Source: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Header Encoded Surveys\2018
Dest:   G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018

[OK] Decoded 39 columns
[SAVED] G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018\JANUARY_2018.CSV


In [7]:
run_full_batch()

STARTING FULL BATCH DECODING
Source: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Header Encoded Surveys
Dest:   G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Fully Decoded Surveys

Processing: JANUARY 2018...
[OK] Decoded 39 columns
----------------------------------------
Processing: APRIL 2018...
[OK] Decoded 38 columns
----------------------------------------
Processing: JULY 2018...
[OK] Decoded 38 columns
----------------------------------------
Processing: OCTOBER 2018...
[OK] Decoded 39 columns
----------------------------------------
Processing: APRIL 2019...
[OK] Decoded 39 columns
----------------------------------------
Processing: JULY 2019...
[OK] Decoded 42 columns
----------------------------------------
Processing: JANUARY 2019...
[OK] Decoded 39 columns
----------------------------------------
Processing: OCTOBER 2019...
[OK] Decoded 42 columns
----------------------------------------
Process