
This notebook applies the Sheet 2 metadata definitions to decode survey values, mapping coded responses (e.g., 1, 2, 3) to human‑readable labels (e.g., Employed, Unemployed).

Dependencies:
- Run `00_Settings.ipynb` and `01_Inventory.ipynb` first.
- Requires outputs from `03_Metadata_Decoder.ipynb` (Header Encoded Surveys).
- Note: Survey CSVs already carry meaning from Sheet 2 reshaping. This notebook applies the final decoding logic.

Output:

- Fully decoded survey CSVs saved into **NEW Fully Decoded Surveys**.
- Per-survey reports showing the number of columns successfully decoded.

Notes:
- **Sheet 1 metadata** → header translation (column names).  
- **Sheet 2 metadata** → value translation (coded responses).  

**INTENT:** This notebook performs the **final decoding stage** of the Labor Force Survey pipeline.  
It applies **Sheet 2 metadata** to translate coded survey values into human‑readable labels.

- Next steps: duplicate variable detection, integrity checks, coverage scanning.
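
To make the value translation concrete, here is a minimal sketch in plain Python (no pandas required). It assumes Sheet 2 style rows with `Label`, `min_value`, and `max_value` fields, as used by the decoder later in this notebook. The sample rows and the helper names `build_lookup` and `decode_value` are hypothetical and only illustrate the idea; the real decoder handles more cases.

```python
def build_lookup(rows, max_span=1000):
    """Build a code -> label dictionary from Sheet 2 style rows."""
    lookup = {}
    for row in rows:
        label = row["Label"].strip()
        try:
            start = int(float(row["min_value"]))
            end = int(float(row["max_value"]))
        except (ValueError, OverflowError):
            continue  # skip rows whose bounds are not numeric
        if end < start or end - start >= max_span:
            continue  # ignore inverted or implausibly wide ranges
        for code in range(start, end + 1):
            lookup[code] = label
    return lookup


def decode_value(value, lookup):
    """Return the label for a coded value, or the value unchanged."""
    try:
        return lookup.get(int(float(value)), value)
    except (ValueError, TypeError, OverflowError):
        return value


# Hypothetical metadata rows for an employment-status variable.
sample_rows = [
    {"Label": "Employed", "min_value": "1", "max_value": "1"},
    {"Label": "Unemployed", "min_value": "2", "max_value": "2"},
    {"Label": "Not in Labor Force", "min_value": "3", "max_value": "5"},
]

lookup = build_lookup(sample_rows)
decoded = [decode_value(v, lookup) for v in ["1", 2.0, "4", "9", None]]
print(decoded)
```

Values with no matching code (such as `"9"` or a missing value) pass through unchanged, so an incomplete codebook does not destroy data.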


In [1]:
import json
from pathlib import Path
import os
import pandas as pd

# ------------------------------------------------------------
# Load settings from config.json (produced by 00_Settings.ipynb)
# ------------------------------------------------------------
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# ------------------------------------------------------------
# Load inventory (produced by 01_Inventory.ipynb)
# ------------------------------------------------------------
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# Alias for compatibility
base_path = str(BASE_PATH)


### Interim Sample 

#### Intent: Interim Sample Decoder

This interim function generates a **single sample output** (e.g., `NEW Fully Decoded Survey Sample` for **January 2018**) using the same decoding logic as the batch runner.  

Unlike the full batch process, which redirects all decoded surveys to Google Drive, this sample run saves directly into the local **interim repository path**.  

The purpose is to provide a quick preview of a fully decoded CSV without requiring you to download the large files from Google Drive.

Use this when:
- You want to validate decoding logic on a small subset before running the full batch.  
- You need a lightweight example for documentation, testing, or demonstration.  
- You want to inspect decoded values locally without waiting for the complete dataset.
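
The difference between the two destinations can be sketched as follows. The folder names come from this notebook; the base path `/projects/lfs` and the helper `output_dir` are hypothetical, and `PurePosixPath` is used only so the printed result does not depend on the operating system.

```python
from pathlib import PurePosixPath

# Folder names taken from this notebook; the base path is a hypothetical example.
FULLY_DECODED_FOLDER = "NEW Fully Decoded Surveys"
FULLY_DECODED_SAMPLE_FOLDER = "NEW Fully Decoded Survey Sample"


def output_dir(base_path, year, sample=False):
    """Return the folder a decoded survey for `year` would be written to."""
    base = PurePosixPath(base_path)
    if sample:
        # Sample runs stay inside the local interim area of the repository.
        return base / "data" / "interim" / FULLY_DECODED_SAMPLE_FOLDER / year
    # Batch runs write next to the other survey folders.
    return base / FULLY_DECODED_FOLDER / year


sample_dir = output_dir("/projects/lfs", "2018", sample=True)
batch_dir = output_dir("/projects/lfs", "2018")
print(sample_dir)
print(batch_dir)
```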


In [2]:
import json
from pathlib import Path
import os
import pandas as pd
import re

# ------------------------------------------------------------
# Shared folder names
# ------------------------------------------------------------
HEADER_ENCODED_FOLDER = "NEW Header Encoded Surveys"
FULLY_DECODED_FOLDER = "NEW Fully Decoded Surveys"
FULLY_DECODED_SAMPLE_FOLDER = "NEW Fully Decoded Survey Sample"

# ------------------------------------------------------------
# Month parsing
# ------------------------------------------------------------
MONTHS = [
    "JANUARY","FEBRUARY","MARCH","APRIL","MAY","JUNE",
    "JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER","DECEMBER"
]

MONTH_PATTERN = re.compile(
    r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)",
    re.IGNORECASE
)

# ============================================================
# Metadata loader
# ============================================================
def load_clean_sheet2(base_path, year, month):
    path = os.path.join(
        base_path,
        "NEW Metadata Sheet 2 CSV's",
        str(year),
        f"Sheet2_{month}_{year}.csv"
    )
    if not os.path.exists(path):
        raise FileNotFoundError(f"Metadata not found at: {path}")
    return pd.read_csv(path, dtype=str)

# ============================================================
# Helper: Normalize Text
# ============================================================
def normalize_key(s):
    if pd.isna(s): return ""
    return re.sub(r'[^a-z0-9]', '', str(s).lower())

# ============================================================
# COLUMN MATCHING LOGIC
# ============================================================
def find_target_column(survey_columns, meta_desc):
    if pd.isna(meta_desc): return None
    meta_desc = str(meta_desc).strip()
    
    # 1. EXTRACT ID (e.g. C29)
    meta_code_match = re.match(r'^(C\d+)', meta_desc, re.IGNORECASE)
    required_code = meta_code_match.group(1).upper() if meta_code_match else None
    
    # --- STRATEGY A: CODE MATCH (Priority) ---
    if required_code:
        # Regex: Start with CODE, followed by NOT a digit.
        # This prevents C10 from matching C101.
        pattern = re.compile(f"^{required_code}(?![0-9])", re.IGNORECASE)
        candidates = [col for col in survey_columns if pattern.match(col.strip())]
        
        if not candidates:
            return None
        
        # If multiple candidates exist (e.g. C29 Month vs C29 Year), use text match to decide.
        if len(candidates) > 1:
            target_clean = normalize_key(meta_desc)
            for cand in candidates:
                if normalize_key(cand) == target_clean:
                    return cand
            # Soft match fallback
            meta_text_only = normalize_key(re.sub(r'^(C\d+)', '', meta_desc, flags=re.IGNORECASE))
            for cand in candidates:
                if meta_text_only and meta_text_only in normalize_key(cand):
                    return cand
            
        return candidates[0]

    # --- STRATEGY B: TEXT MATCH (Fallback) ---
    else:
        # Exact Match
        if meta_desc in survey_columns: 
            return meta_desc
        # Clean Match
        elif re.sub(r'^C\d+[\s\-_]+', '', meta_desc, flags=re.IGNORECASE).strip() in survey_columns:
            return re.sub(r'^C\d+[\s\-_]+', '', meta_desc, flags=re.IGNORECASE).strip()
        # Suffix Match
        else:
            for col in survey_columns:
                if col.endswith(meta_desc):
                    prefix = col[:-len(meta_desc)].strip()
                    if re.search(r'^C\d+[\s\-_]*$', prefix, re.IGNORECASE) or prefix == "":
                        return col
    
    return None

# ============================================================
# Safe decoder
# ============================================================
def decode_survey_safe(survey_df, meta_df, year=None, month=None):
    # NOTE: The Age Decoder cell later in this notebook redefines decode_survey_safe
    # with age-aware logic; this baseline version is kept for completeness.
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    for var_code in unique_vars:
        subset = meta_df[meta_df['Variable'] == var_code].copy()
        if subset['Description'].isnull().all(): continue

        raw_desc = subset['Description'].dropna().iloc[0].strip()
        target_col = find_target_column(survey_cols, raw_desc)
        
        if not target_col: continue

        lookup = {}
        for _, row in subset.iterrows():
            label = str(row['Label']).strip()
            if label in ['0', '0.0', 'nan', 'NaN', '']: continue
            
            # --- RANGE EXPANSION LOGIC ---
            range_added = False
            min_val = str(row.get('min_value', '')).strip()
            max_val = str(row.get('max_value', '')).strip()
            if min_val and max_val and min_val != max_val:
                try:
                    start = int(float(min_val))
                    end = int(float(max_val))
                    # Prevent expanding huge accidental ranges
                    if start < end and (end - start) < 1000:
                        for k in range(start, end + 1):
                            lookup[k] = label
                        range_added = True
                except (ValueError, OverflowError): pass

            if not range_added:
                for field in ['min_value', 'max_value', 'additional_value']:
                    val = str(row.get(field, '')).strip()
                    if val and val not in ['0', 'nan', 'NaN']:
                        try: lookup[int(float(val))] = label
                        except (ValueError, OverflowError): lookup[val] = label

        if not lookup: continue

        def safe_map(val):
            try: return lookup.get(int(float(val)), val)
            except (ValueError, TypeError, OverflowError): return lookup.get(str(val), val)

        survey_df[target_col] = survey_df[target_col].apply(safe_map)
        decoded_count += 1

    return survey_df, decoded_count

# ============================================================
# INTERIM SAMPLE DECODER
# ============================================================
def run_sample_decoding(base_path, year="2018", month="January", interim_root=None):
    if interim_root is None:
        interim_root = os.path.join(base_path, "data", "interim")

    month = month.strip().capitalize()
    if month.upper() not in MONTHS:
        raise ValueError(f"Invalid month: {month}")

    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER, year)
    output_root = os.path.join(interim_root, FULLY_DECODED_SAMPLE_FOLDER, year)
    os.makedirs(output_root, exist_ok=True)

    if not os.path.exists(input_root):
        print(f"[SKIP] Input folder not found: {input_root}")
        return

    files = [f for f in os.listdir(input_root) if f.lower().endswith(".csv")]
    target_file = next((f for f in files if month.upper() in f.upper()), None)

    if not target_file:
        print(f"[SKIP] No survey file found for {month} {year}")
        return

    print("================================================")
    print(f"SAMPLE DECODING: {month.upper()} {year}")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    df_survey = pd.read_csv(os.path.join(input_root, target_file), low_memory=False)
    df_meta = load_clean_sheet2(base_path, year, month)

    df_final, count = decode_survey_safe(df_survey, df_meta, year, month)

    save_path = os.path.join(output_root, target_file)
    df_final.to_csv(save_path, index=False)

    print(f"[OK] Decoded {count} columns")
    print(f"[SAVED] {save_path}")

# ============================================================
# FULL BATCH DECODER
# ============================================================
def run_batch_decoding(base_path):
    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER)
    output_root = os.path.join(base_path, FULLY_DECODED_FOLDER)
    os.makedirs(output_root, exist_ok=True)

    print("================================================")
    print("STARTING FULL BATCH DECODING")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    if not os.path.exists(input_root):
        print(f"[ERROR] Input folder not found: {input_root}")
        return

    success, errors = 0, 0

    year_folders = [
        f for f in os.listdir(input_root)
        if f.isdigit() and os.path.isdir(os.path.join(input_root, f))
    ]

    for year in sorted(year_folders):
        year_in = os.path.join(input_root, year)
        year_out = os.path.join(output_root, year)
        os.makedirs(year_out, exist_ok=True)

        files = [f for f in os.listdir(year_in) if f.lower().endswith(".csv")]
        for filename in files:
            match = MONTH_PATTERN.search(filename)
            if not match: continue

            month = match.group(1).capitalize()
            print(f"Processing: {month.upper()} {year}...")

            try:
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False)
                df_meta = load_clean_sheet2(base_path, year, month)
                
                df_final, count = decode_survey_safe(df_survey, df_meta, year, month)

                save_path = os.path.join(year_out, filename)
                df_final.to_csv(save_path, index=False)

                print(f"[OK] Decoded {count} columns")
                success += 1

            except Exception as e:
                print(f"[ERROR] {e}")
                errors += 1

            print("-" * 40)

    print(f"\nCOMPLETED | Success: {success} | Errors: {errors}")
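
The code-prefix strategy in `find_target_column` relies on a negative lookahead so that a code such as `C10` does not also match `C101`. The sketch below isolates that one step. The column names are made up for illustration and the helper name `columns_for_code` is hypothetical.

```python
import re


def columns_for_code(columns, code):
    """Return columns whose name starts with `code` not followed by a digit."""
    pattern = re.compile(rf"^{re.escape(code)}(?![0-9])", re.IGNORECASE)
    return [col for col in columns if pattern.match(col.strip())]


# Hypothetical header-encoded column names.
columns = ["C10-Occupation", "C101-Other Job", "c10_Remarks", "C1-Region"]

matches = columns_for_code(columns, "C10")
print(matches)
```

When more than one column survives this filter, the description text is what breaks the tie, which is why the function falls back to normalized text comparison.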

## Age Scheme Decoder Verifier
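
The verifier below reduces each month's age rules to an order-independent signature (a sorted tuple) and groups months by that signature; a single group means every audited month shares one coding scheme. A minimal sketch of the idea, using made-up rules for three hypothetical months:

```python
from collections import defaultdict

# Hypothetical (label, min, max) rules per month; the real ones come from Sheet 2.
rules_by_month = {
    "JANUARY 2018": [("00 - 14", "0", "14"), ("15 - 24", "15", "24")],
    "APRIL 2018": [("15 - 24", "15", "24"), ("00 - 14", "0", "14")],  # same rules, other order
    "JULY 2018": [("00 - 17", "0", "17"), ("18 - 24", "18", "24")],
}

schemes = defaultdict(list)
for month, rules in rules_by_month.items():
    signature = tuple(sorted(rules))  # sorting makes row order irrelevant
    schemes[signature].append(month)

for number, months in enumerate(schemes.values(), 1):
    print(f"Scheme #{number}: {', '.join(months)}")
```

Tuples are used because they are hashable and can therefore serve as dictionary keys, which a list of rules could not.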

In [3]:
import pandas as pd
import os
import numpy as np

# --- CONFIGURATION ---
# Change this to match the actual path of your folder
base_path = r"G:\My Drive\Labor Force Survey"

# Use your provided mapping for descriptions
AGE_VAR_MAP = {
    "JANUARY 2018": "C07-Age as of Last Birthday", "APRIL 2018": "C07-Age as of Last Birthday",
    "JULY 2018": "C07-Age as of Last Birthday", "OCTOBER 2018": "C07-Age as of Last Birthday",
    "JANUARY 2019": "C07-Age as of Last Birthday", "APRIL 2019": "C07-Age as of Last Birthday",
    "JULY 2019": "C07-Age as of Last Birthday", "OCTOBER 2019": "C07-Age as of Last Birthday",
    "JANUARY 2022": "C07-Age as of Last Birthday", "FEBRUARY 2022": "Age Group",
    "MARCH 2022": "Age Group", "APRIL 2022": "C07-Age as of Last Birthday",
    "MAY 2022": "Age Group", "JUNE 2022": "Age Group", "JULY 2022": "C07-Age as of Last Birthday",
    "AUGUST 2022": "Age Group", "SEPTEMBER 2022": "Age Group", "OCTOBER 2022": "C07-Age as of Last Birthday",
    "NOVEMBER 2022": "Age Group", "DECEMBER 2022": "Age Group", "JANUARY 2023": "C07-Age as of Last Birthday",
    "FEBRUARY 2023": "Age Group", "MARCH 2023": "Age Group", "APRIL 2023": "C07-Age as of Last Birthday",
    "MAY 2023": "Age Group", "JUNE 2023": "Age Group", "JULY 2023": "C05-Age as of Last Birthday",
    "AUGUST 2023": "Age Group", "SEPTEMBER 2023": "Age Group", "OCTOBER 2023": "C07-Age as of Last Birthday",
    "NOVEMBER 2023": "Age Group", "DECEMBER 2023": "Age Group", "JANUARY 2024": "C07-Age as of Last Birthday",
    "FEBRUARY 2024": "Age Group", "MARCH 2024": "Age Group", "APRIL 2024": "C07-Age as of Last Birthday",
    "MAY 2024": "Age Group", "JUNE 2024": "Age Group", "JULY 2024": "C07-Age as of Last Birthday",
    "AUGUST 2024": "Age Group"
}

YEARS = ["2018", "2019", "2022", "2023", "2024"]
MONTHS_FULL = ["JANUARY", "FEBRUARY", "MARCH", "APRIL", "MAY", "JUNE", "JULY", "AUGUST", "SEPTEMBER", "OCTOBER", "NOVEMBER", "DECEMBER"]

schemes = {}
total_months_processed = 0

print("--- STARTING AGE METADATA VERIFICATION ---\n")

for year in YEARS:
    for month in MONTHS_FULL:
        key = f"{month} {year}"
        if key not in AGE_VAR_MAP: continue
        
        target_desc = AGE_VAR_MAP[key]
        # Build the path used to locate the metadata files
        path = os.path.join(base_path, "NEW Metadata Sheet 2 CSV's", year, f"Sheet2_{month.capitalize()}_{year}.csv")
        
        if not os.path.exists(path):
            continue
            
        df_meta = pd.read_csv(path, dtype=str)
        
        # Normalize the string matching so it is not sensitive to extra spaces
        age_meta = df_meta[df_meta['Description'].str.strip() == target_desc].copy()
        
        if age_meta.empty:
            continue
            
        def clean_num(x):
            try: return str(int(float(x)))
            except (ValueError, TypeError, OverflowError): return "0"

        rules = []
        for _, row in age_meta.iterrows():
            label = str(row['Label']).strip()
            # Clean the numbers so the comparison is standardized
            mi = clean_num(row.get('min_value', '0'))
            ma = clean_num(row.get('max_value', '0'))
            rules.append((label, mi, ma))
        
        # Turn the rules into a unique signature to detect whether other months share the same scheme
        scheme_id = tuple(sorted(rules))
        
        if scheme_id not in schemes:
            schemes[scheme_id] = []
        
        schemes[scheme_id].append(key)
        total_months_processed += 1

# --- PRINTING THE ANALYSIS ---
print(f"Total Months Audited: {total_months_processed}")
print(f"Detected {len(schemes)} unique coding scheme(s).")

# Consistency check
if len(schemes) == 1:
    print(" SUCCESS: The Age coding scheme is IDENTICAL across all audited months.")
else:
    print(f" NOTICE: Found {len(schemes)} different coding schemes.")

print("\n" + "="*60 + "\n")

for idx, (sig, months_list) in enumerate(schemes.items(), 1):
    num_months = len(months_list)
    print(f"### SCHEME #{idx} ({num_months} months)")
    print(f"Applies to: {', '.join(months_list)}")
    print(f"{'Label Category':<25} | {'Min':<5} | {'Max':<5}")
    print("-" * 45)
    for label, mi, ma in sig:
        print(f"{label:<25} | {mi:<5} | {ma:<5}")
    print("\n")

--- STARTING AGE METADATA VERIFICATION ---

Total Months Audited: 40
Detected 1 unique coding scheme(s).
 SUCCESS: The Age coding scheme is IDENTICAL across all audited months.


### SCHEME #1 (40 months)
Applies to: JANUARY 2018, APRIL 2018, JULY 2018, OCTOBER 2018, JANUARY 2019, APRIL 2019, JULY 2019, OCTOBER 2019, JANUARY 2022, FEBRUARY 2022, MARCH 2022, APRIL 2022, MAY 2022, JUNE 2022, JULY 2022, AUGUST 2022, SEPTEMBER 2022, OCTOBER 2022, NOVEMBER 2022, DECEMBER 2022, JANUARY 2023, FEBRUARY 2023, MARCH 2023, APRIL 2023, MAY 2023, JUNE 2023, JULY 2023, AUGUST 2023, SEPTEMBER 2023, OCTOBER 2023, NOVEMBER 2023, DECEMBER 2023, JANUARY 2024, FEBRUARY 2024, MARCH 2024, APRIL 2024, MAY 2024, JUNE 2024, JULY 2024, AUGUST 2024
Label Category            | Min   | Max  
---------------------------------------------
00 - 14                   | 0     | 14   
15 - 24                   | 15    | 24   
25 - 34                   | 25    | 34   
35 - 44                   | 35    | 44   
45 - 54     

## Age Decoder
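
The age decoder in the next cell maps a raw age to the first bracket whose bounds contain it, with a fallback label for ages of 65 and above when no bracket matches. A minimal sketch of that lookup, assuming a hypothetical bracket list in which the senior bracket is missing:

```python
# Hypothetical (low, high, label) brackets; the 65+ bracket is left out on purpose.
AGE_BRACKETS = [
    (0, 14, "00 - 14"),
    (15, 24, "15 - 24"),
    (25, 34, "25 - 34"),
]


def map_age(value, brackets=AGE_BRACKETS):
    """Return the bracket label for an age, or the value unchanged."""
    try:
        age = int(float(value))
    except (ValueError, TypeError, OverflowError):
        return value  # non-numeric input is passed through
    for low, high, label in brackets:
        if low <= age <= high:
            return label
    if age >= 65:
        return "65 and Over"  # fallback when the senior bracket is absent
    return value


results = [map_age(v) for v in ["5", 19.0, "70", "40", "n/a"]]
print(results)
```

An age that falls in no bracket and is below 65 (here `"40"`) is returned as-is, so gaps in the metadata stay visible in the output instead of being mislabeled.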

In [4]:
import pandas as pd
import re

# ============================================================
# CONFIGURATION: AGE VARIABLE MAPPING
# ============================================================
AGE_VAR_MAP = {
    ("2018", "January"): "C07-Age as of Last Birthday", ("2018", "April"): "C07-Age as of Last Birthday",
    ("2018", "July"): "C07-Age as of Last Birthday", ("2018", "October"): "C07-Age as of Last Birthday",
    ("2019", "January"): "C07-Age as of Last Birthday", ("2019", "April"): "C07-Age as of Last Birthday",
    ("2019", "July"): "C07-Age as of Last Birthday", ("2019", "October"): "C07-Age as of Last Birthday",
    ("2022", "January"): "C07-Age as of Last Birthday", ("2022", "February"): "Age Group",
    ("2022", "March"): "Age Group", ("2022", "April"): "C07-Age as of Last Birthday",
    ("2022", "May"): "Age Group", ("2022", "June"): "Age Group", ("2022", "July"): "C07-Age as of Last Birthday",
    ("2022", "August"): "Age Group", ("2022", "September"): "Age Group", ("2022", "October"): "C07-Age as of Last Birthday",
    ("2022", "November"): "Age Group", ("2022", "December"): "Age Group", ("2023", "January"): "C07-Age as of Last Birthday",
    ("2023", "February"): "Age Group", ("2023", "March"): "Age Group", ("2023", "April"): "C07-Age as of Last Birthday",
    ("2023", "May"): "Age Group", ("2023", "June"): "Age Group", ("2023", "July"): "C05-Age as of Last Birthday",
    ("2023", "August"): "Age Group", ("2023", "September"): "Age Group", ("2023", "October"): "C07-Age as of Last Birthday",
    ("2023", "November"): "Age Group", ("2023", "December"): "Age Group", ("2024", "January"): "C07-Age as of Last Birthday",
    ("2024", "February"): "Age Group", ("2024", "March"): "Age Group", ("2024", "April"): "C07-Age as of Last Birthday",
    ("2024", "May"): "Age Group", ("2024", "June"): "Age Group", ("2024", "July"): "C07-Age as of Last Birthday",
    ("2024", "August"): "Age Group"
}

def normalize_key(s):
    if pd.isna(s): return ""
    return re.sub(r'[^a-z0-9]', '', str(s).lower())

def find_consistent_age_column(survey_columns):
    # Priority order for finding the age column in survey files
    for col in survey_columns:
        if col.upper().startswith("C05"): return col
    for col in survey_columns:
        if col.upper().startswith("C07"): return col
    for col in survey_columns:
        if "AGE" in col.upper(): return col
    return None

def find_column_smart_strict(survey_columns, meta_desc):
    desc_str = str(meta_desc).strip()
    meta_code_match = re.match(r'^(C\d+)', desc_str, re.IGNORECASE)
    required_code = meta_code_match.group(1).upper() if meta_code_match else None
    
    if required_code:
        # Match code but ensure it's not a longer ID (e.g., C10 matches C10 but not C101)
        pattern = re.compile(f"^{required_code}(?![0-9])", re.IGNORECASE)
        candidates = [col for col in survey_columns if pattern.match(col.strip())]
        if not candidates: return None
        if len(candidates) == 1: return candidates[0]
        
        # Tie-breaker using normalized description
        target_clean = normalize_key(desc_str)
        for cand in candidates:
            if normalize_key(cand) == target_clean: return cand
        return candidates[0]
    return None

def decode_survey_safe(survey_df, meta_df, year, month):
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    # --- PART 1: UNIFIED AGE LOGIC ---
    target_map_desc = AGE_VAR_MAP.get((str(year), str(month)))
    age_meta_rows = pd.DataFrame()
    
    if target_map_desc:
        age_meta_rows = meta_df[meta_df['Description'].str.strip() == target_map_desc].copy()

    target_col = find_consistent_age_column(survey_cols)

    if target_col and not age_meta_rows.empty:
        # Build range list but filter out non-standard/broad categories
        standard_ranges = []
        for _, row in age_meta_rows.iterrows():
            try:
                label = str(row['Label']).strip()
                # FILTER: Skip broad rules like '31 and over' or 'not reported'
                if "and over" in label.lower() and "65" not in label: continue
                if "reported" in label.lower(): continue
                
                low = int(float(row['min_value']))
                high = int(float(row['max_value']))
                standard_ranges.append((low, high, label))
            except (KeyError, ValueError, TypeError, OverflowError): continue
        
        def map_age_range(val):
            try:
                v = int(float(val))
                for l, h, lab in standard_ranges:
                    if l <= v <= h: return lab
                # Final fallback for seniors if the bracket is missing
                if v >= 65: return "65 and Over"
                return val
            except (ValueError, TypeError, OverflowError): return val
        
        survey_df[target_col] = survey_df[target_col].apply(map_age_range)
        decoded_count += 1
        
        used_vars = age_meta_rows['Variable'].unique()
        unique_vars = [v for v in unique_vars if v not in used_vars]

    # --- PART 2: STANDARD VARIABLE LOOP ---
    for var_code in unique_vars:
        if "AGE" in str(var_code).upper(): continue
        subset = meta_df[meta_df['Variable'] == var_code].copy()
        if subset['Description'].isnull().all(): continue
        target_desc = subset['Description'].dropna().iloc[0].strip()

        target_col = find_column_smart_strict(survey_cols, target_desc)
        if not target_col: continue

        lookup = {}
        for _, row in subset.iterrows():
            label = str(row['Label']).strip()
            if label in ['0', '0.0', 'nan', 'NaN', '']: continue
            
            # --- RANGE EXPANSION LOGIC ---
            range_added = False
            min_val = str(row.get('min_value', '')).strip()
            max_val = str(row.get('max_value', '')).strip()
            if min_val and max_val and min_val != max_val:
                try:
                    start = int(float(min_val))
                    end = int(float(max_val))
                    # Prevent expanding huge accidental ranges (>1000 numbers)
                    if start < end and (end - start) < 1000:
                        for k in range(start, end + 1): lookup[k] = label
                        range_added = True
                except (ValueError, OverflowError): pass

            if not range_added:
                # Standard individual value lookup
                for field in ['min_value', 'max_value', 'additional_value']:
                    val = str(row.get(field, '')).strip()
                    if val and val not in ['0', 'nan', 'NaN']:
                        try: lookup[int(float(val))] = label
                        except (ValueError, OverflowError): lookup[val] = label

        if not lookup: continue
        survey_df[target_col] = survey_df[target_col].apply(
            lambda x: lookup.get(int(float(x)), x) if str(x).replace('.','',1).isdigit() else lookup.get(str(x), x)
        )
        decoded_count += 1

    return survey_df, decoded_count

## Age Decoder Verifier
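
The verifier flags any converted value that is not one of the expected labels. The sketch below shows that check in isolation, with hypothetical rules and raw ages; the helper name `audit_ages` is illustrative only and is not part of the pipeline.

```python
def audit_ages(raw_values, rules, fallback="65 and Over"):
    """Convert raw ages and return (converted, unexpected) lists."""
    allowed = {label for label, _, _ in rules} | {fallback}
    converted = []
    for raw in raw_values:
        try:
            age = int(float(raw))
        except (ValueError, TypeError, OverflowError):
            converted.append(f"INVALID: {raw}")
            continue
        label = next((lab for lab, lo, hi in rules if lo <= age <= hi), None)
        if label is None:
            label = fallback if age >= 65 else f"UNEXPECTED: {raw}"
        converted.append(label)
    unexpected = sorted({v for v in converted if v not in allowed})
    return converted, unexpected


# Hypothetical rules with a gap between 25 and 64.
rules = [("00 - 14", 0, 14), ("15 - 24", 15, 24)]
converted, unexpected = audit_ages(["7", "20", "30", "80", "abc"], rules)
print(converted)
print(unexpected)
```

An empty `unexpected` list corresponds to the `[SUCCESS]` status printed by the verifier; anything else corresponds to `[WARNING]`.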

In [5]:
import os
import pandas as pd
import re

# ============================================================
# CONFIGURATION
# ============================================================
BASE_PATH = r"G:\My Drive\Labor Force Survey"
HEADER_ENCODED_FOLDER = "NEW Header Encoded Surveys"

AGE_VAR_MAP = {
    ("2018", "January"): "C07-Age as of Last Birthday", ("2018", "April"): "C07-Age as of Last Birthday",
    ("2018", "July"): "C07-Age as of Last Birthday", ("2018", "October"): "C07-Age as of Last Birthday",
    ("2019", "January"): "C07-Age as of Last Birthday", ("2019", "April"): "C07-Age as of Last Birthday",
    ("2019", "July"): "C07-Age as of Last Birthday", ("2019", "October"): "C07-Age as of Last Birthday",
    ("2022", "January"): "C07-Age as of Last Birthday", ("2022", "February"): "Age Group",
    ("2022", "March"): "Age Group", ("2022", "April"): "C07-Age as of Last Birthday",
    ("2022", "May"): "Age Group", ("2022", "June"): "Age Group", ("2022", "July"): "C07-Age as of Last Birthday",
    ("2022", "August"): "Age Group", ("2022", "September"): "Age Group", ("2022", "October"): "C07-Age as of Last Birthday",
    ("2022", "November"): "Age Group", ("2022", "December"): "Age Group", ("2023", "January"): "C07-Age as of Last Birthday",
    ("2023", "February"): "Age Group", ("2023", "March"): "Age Group", ("2023", "April"): "C07-Age as of Last Birthday",
    ("2023", "May"): "Age Group", ("2023", "June"): "Age Group", ("2023", "July"): "C05-Age as of Last Birthday",
    ("2023", "August"): "Age Group", ("2023", "September"): "Age Group", ("2023", "October"): "C07-Age as of Last Birthday",
    ("2023", "November"): "Age Group", ("2023", "December"): "Age Group", ("2024", "January"): "C07-Age as of Last Birthday",
    ("2024", "February"): "Age Group", ("2024", "March"): "Age Group", ("2024", "April"): "C07-Age as of Last Birthday",
    ("2024", "May"): "Age Group", ("2024", "June"): "Age Group", ("2024", "July"): "C07-Age as of Last Birthday",
    ("2024", "August"): "Age Group"
}

# Container for Final Consistency Summary
schemes = {}

def find_consistent_age_column(survey_columns):
    for col in survey_columns:
        if col.upper().startswith("C05"): return col
    for col in survey_columns:
        if col.upper().startswith("C07"): return col
    for col in survey_columns:
        if "AGE" in col.upper(): return col
    return None

def verify_age_logic(df_survey, df_meta, year, month):
    target_map_desc = AGE_VAR_MAP.get((str(year), str(month)))
    if not target_map_desc: return
    
    age_meta_rows = df_meta[df_meta['Description'].str.strip() == target_map_desc].copy()
    
    # 1. Filter Rules (Applying the Broad Rule Filter)
    active_rules = []
    for _, row in age_meta_rows.iterrows():
        try:
            label = str(row['Label']).strip()
            if "and over" in label.lower() and "65" not in label: continue
            if "reported" in label.lower(): continue
            low = int(float(row['min_value']))
            high = int(float(row['max_value']))
            active_rules.append((label, low, high))
        except (KeyError, ValueError, TypeError, OverflowError): continue

    # 2. Scheme Identification for Final Summary
    scheme_id = tuple(sorted(active_rules))
    key = f"{month.upper()} {year}"
    if scheme_id not in schemes:
        schemes[scheme_id] = []
    schemes[scheme_id].append(key)

    allowed_labels = {r[0] for r in active_rules}
    allowed_labels.add("65 and Over")

    def test_map(val):
        try:
            v = int(float(val))
            for lab, l, h in active_rules:
                if l <= v <= h: return lab
            if v >= 65: return "65 and Over"
            return f"UNEXPECTED: {val}"
        except (ValueError, TypeError, OverflowError): return f"INVALID: {val}"

    # 3. Individual Audit Printing
    print(f"\nAUDIT: {key}")
    print("-" * 105)
    print(f"{'Age Range':<15} | {'Expected Label':<20} | {'Row':<8} | {'Conversion Check'}")
    print("-" * 105)

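    target_col = find_consistent_age_column(list(df_survey.columns))
    if target_col is None:
        # Without an age column there is nothing to audit for this month.
        print(f"   [SKIP] No age column found for {key}.")
        return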
    matched_ranges = set()
    all_converted_values = []

    for idx, row_data in df_survey.iterrows():
        raw_val = row_data[target_col]
        if pd.isna(raw_val): continue
        output = test_map(raw_val)
        all_converted_values.append(output)
        
        for label, low, high in active_rules:
            if (label, low, high) not in matched_ranges:
                try:
                    v = int(float(raw_val))
                    if low <= v <= high:
                        print(f"{str(low)+'-'+str(high):<15} | {label:<20} | {str(idx):<8} | {str(raw_val):>3} -> {output}")
                        matched_ranges.add((label, low, high))
                except (ValueError, TypeError, OverflowError): pass

    # --- Individual Month Status ---
    unexpected = [v for v in set(all_converted_values) if v not in allowed_labels]
    
    if not unexpected:
        print(f"   [SUCCESS] 100% Alignment for {key}. Standard rules only.")
    else:
        print(f"   [WARNING] {key} has non-standard values: {unexpected}")

def run_full_verifier():
    input_root = os.path.join(BASE_PATH, HEADER_ENCODED_FOLDER)
    if not os.path.exists(input_root):
        print(f"CRITICAL ERROR: Folder not found: {input_root}")
        return

    for year in sorted([d for d in os.listdir(input_root) if d.isdigit()]):
        year_path = os.path.join(input_root, year)
        for filename in [f for f in os.listdir(year_path) if f.lower().endswith(".csv")]:
            match = re.search(r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)", filename, re.IGNORECASE)
            if not match: continue
            month_cap = match.group(1).capitalize()
            month_upper = match.group(1).upper()
            
            # Try both ALL CAPS and Capitalized metadata filenames
            meta_paths = [
                os.path.join(BASE_PATH, "NEW Metadata Sheet 2 CSV's", year, f"Sheet2_{month_upper}_{year}.csv"),
                os.path.join(BASE_PATH, "NEW Metadata Sheet 2 CSV's", year, f"Sheet2_{month_cap}_{year}.csv")
            ]
            
            meta_path = next((p for p in meta_paths if os.path.exists(p)), None)
            
            if meta_path:
                try:
                    df_meta = pd.read_csv(meta_path, dtype=str)
                    df_survey = pd.read_csv(os.path.join(year_path, filename), low_memory=False, nrows=2000)
                    verify_age_logic(df_survey, df_meta, year, month_cap)
                except Exception as e:
                    print(f"Error {month_cap} {year}: {e}")

    # --- FINAL GLOBAL CONSISTENCY SUMMARY ---
    print("\n" + "="*80)
    print("FINAL GLOBAL CONSISTENCY CHECK")
    print("="*80)
    print(f"Unique Decoding Patterns: {len(schemes)}")
    
    if len(schemes) == 1:
        total_months = sum(len(months) for months in schemes.values())
        print(f"SUCCESS: 100% Consistent logic across all {total_months} months. Unified scheme verified.")
    else:
        print(f"WARNING: {len(schemes)} unique patterns detected. Review overlapping rules.")

    for idx, (sig, months_list) in enumerate(schemes.items(), 1):
        print(f"\nPATTERN #{idx} (Shared by {len(months_list)} months):")
        print(f"Applies to: {', '.join(months_list)}")
        print(f"{'Label':<25} | {'Min':<5} | {'Max':<5}")
        print("-" * 45)
        for label, mi, ma in sig:
            print(f"{label:<25} | {mi:<5} | {ma:<5}")

if __name__ == "__main__":
    run_full_verifier()


AUDIT: JANUARY 2018
---------------------------------------------------------------------------------------------------------
Age Range       | Expected Label       | Row      | Conversion Check
---------------------------------------------------------------------------------------------------------
65-99           | 65 and Over          | 0        |  70 -> 65 and Over
45-54           | 45 - 54              | 1        |  45 -> 45 - 54
0-14            | 00 - 14              | 3        |   5 -> 00 - 14
25-34           | 25 - 34              | 4        |  30 -> 25 - 34
35-44           | 35 - 44              | 9        |  44 -> 35 - 44
15-24           | 15 - 24              | 11       |  19 -> 15 - 24
55-64           | 55 - 64              | 25       |  60 -> 55 - 64
   [SUCCESS] 100% Alignment for JANUARY 2018. Standard rules only.

AUDIT: APRIL 2018
---------------------------------------------------------------------------------------------------------
Age Range       | Expected Label 

In [6]:
# ============================================================
# PIPELINE ENTRY POINT
# ============================================================

def show_decoding_options():
    print("==============================================")
    print("SURVEY VALUE DECODING OPTIONS")
    print("==============================================")
    print("1. run_sample()      → Small GitHub-safe sample")
    print("2. run_full_batch() → Full local decoding")
    print("==============================================\n")


def run_sample():
    run_sample_decoding(
        base_path=BASE_PATH,
        year="2018",
        month="January"
    )


def run_full_batch():
    run_batch_decoding(base_path=BASE_PATH)


show_decoding_options()


SURVEY VALUE DECODING OPTIONS
1. run_sample()      → Small GitHub-safe sample
2. run_full_batch() → Full local decoding



In [7]:
run_sample()

SAMPLE DECODING: JANUARY 2018
Source: G:\My Drive\Labor Force Survey\NEW Header Encoded Surveys\2018
Dest:   G:\My Drive\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018

[OK] Decoded 34 columns
[SAVED] G:\My Drive\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018\JANUARY_2018.CSV


In [8]:
run_full_batch()

STARTING FULL BATCH DECODING
Source: G:\My Drive\Labor Force Survey\NEW Header Encoded Surveys
Dest:   G:\My Drive\Labor Force Survey\NEW Fully Decoded Surveys

Processing: JANUARY 2018...
[OK] Decoded 34 columns
----------------------------------------
Processing: APRIL 2018...
[OK] Decoded 33 columns
----------------------------------------
Processing: JULY 2018...
[OK] Decoded 33 columns
----------------------------------------
Processing: OCTOBER 2018...
[OK] Decoded 34 columns
----------------------------------------
Processing: APRIL 2019...
[OK] Decoded 34 columns
----------------------------------------
Processing: JULY 2019...
[OK] Decoded 37 columns
----------------------------------------
Processing: JANUARY 2019...
[OK] Decoded 34 columns
----------------------------------------
Processing: OCTOBER 2019...
[OK] Decoded 37 columns
----------------------------------------
Processing: JUNE 2022...
[OK] Decoded 29 columns
----------------------------------------
Processing: AUG