## Survey Value Decoding (Metadata Sheet 2)

This notebook applies **Metadata Sheet 2** definitions to decode coded survey values into human-readable labels (e.g., `1`, `2` → `Employed`, `Unemployed`).  
It represents the **final value-level decoding stage** of the Labor Force Survey data pipeline.

To handle known metadata inconsistencies, particularly for the `C05-Age as of Last Birthday` variable, a **two-pass decoding strategy** is used to ensure accuracy, completeness, and reproducibility across all survey months.

**Dependencies**

- `00_Settings.ipynb` – project paths and configuration  
- `01_Inventory.ipynb` – survey and metadata inventory  
- Output from `03_Metadata_Decoder.ipynb` (Header-Encoded Surveys)  

> Note: Survey headers have already been standardized using **Metadata Sheet 1**.  
> This notebook focuses exclusively on **value decoding** using **Metadata Sheet 2**.

#### Decoding Strategy

**Pass 1: General Value Decoder (Age Excluded)**

- Decodes all variables whose metadata descriptions are stable across survey months.
- Numeric-only variables (e.g., line number, hours worked) are preserved as raw values.
- Age-related variables are explicitly excluded due to inconsistent metadata descriptions.
- All applicable coded responses are decoded using numeric range rules.

Output Folder: 
- `NEW Fully Decoded Surveys`

**Pass 2: Targeted Age Decoder (Special Case Handling)**

- Applies a month-specific mapping to identify the correct age metadata description.
- Decodes the age column using verified numeric range rules.
- Cleans Excel-induced formatting artifacts prior to decoding.
- Ensures all age values are mapped into standardized, consistent categories.

Output Folder: 
- `NEW Fully Decoded Surveys (with Special Case Variables)`

#### Outputs

- Fully decoded survey CSVs with human-readable values
- Consistent decoding across all survey months
- Verified and standardized age categories
- Clean separation between general decoding and special-case handling

#### Notes

- **Metadata Sheet 1** → column/header translation  
- **Metadata Sheet 2** → value/response decoding  
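
The division of labor between the two sheets can be illustrated with a minimal sketch. The variable code `C13`, the header `C13-Example Status`, and both lookup dictionaries are hypothetical stand-ins for the real metadata files, not values taken from them.

```python
# Hypothetical raw record and lookups (illustration only)
raw_row = {"C13": "1"}
sheet1_headers = {"C13": "C13-Example Status"}
sheet2_values = {"C13-Example Status": {"1": "Employed", "2": "Unemployed"}}

# Sheet 1 step: rename the column, leave the value coded
header_decoded = {sheet1_headers.get(col, col): val for col, val in raw_row.items()}

# Sheet 2 step: keep the column name, replace the coded value with its label
fully_decoded = {
    col: sheet2_values.get(col, {}).get(val, val)
    for col, val in header_decoded.items()
}

print(header_decoded)
print(fully_decoded)
```

Unknown columns or codes fall through unchanged in this sketch, which mirrors the notebook's preference for preserving raw values over guessing.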


In [1]:
import json
from pathlib import Path
import os
import pandas as pd

# ------------------------------------------------------------
# Load settings from config.json (produced by 00_Settings.ipynb)
# ------------------------------------------------------------
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# ------------------------------------------------------------
# Load inventory (produced by 01_Inventory.ipynb)
# ------------------------------------------------------------
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# Alias for compatibility
base_path = str(BASE_PATH)


## Two-Pass Decoding Strategy for Survey Values  
### Handling Inconsistent Age Metadata Across Survey Months
During metadata inspection, inconsistencies were identified in the metadata description used to define the age variable across survey months. Although the survey column representing age (`C05-Age as of Last Birthday`) is present under the same name in all survey datasets, its corresponding `Description` field in the metadata dictionary varies by month (e.g., `C07-Age as of Last Birthday`, `Age Group`, `C05-Age as of Last Birthday`).

To avoid mismatched mappings and silent decoding failures, a two-pass framework is adopted: the age variable is excluded from the general pass and decoded separately with month-specific handling.
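
To make the risk concrete, the sketch below imitates a generic description-to-column matcher that tries an exact normalized match and then falls back to the leading variable code. The helper names `normalize` and `match_column` and the column `C07-Example Variable` are hypothetical; only the age column name and the three description strings come from this notebook.

```python
import re

def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def match_column(columns, description):
    """Exact normalized match first, then fall back to the leading 'Cnn' code."""
    target = normalize(description)
    for col in columns:
        if normalize(col) == target:
            return col
    code = re.match(r"^(C\d+)", description, re.IGNORECASE)
    if code:
        prefix = code.group(1).upper()
        for col in columns:
            if col.upper().startswith(prefix):
                return col
    return None

# Hypothetical survey header; only the age column name comes from the notebook
columns = ["C05-Age as of Last Birthday", "C07-Example Variable"]

same_name = match_column(columns, "C05-Age as of Last Birthday")
other_code = match_column(columns, "C07-Age as of Last Birthday")
no_code = match_column(columns, "Age Group")

print(same_name)   # the age column itself
print(other_code)  # a different column that merely shares the C07 prefix
print(no_code)     # None: nothing is decoded and no error is raised
```

Under these assumptions, a `C07` description would apply age rules to an unrelated column, and an `Age Group` description would leave age undecoded without any warning.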

---
#### Pass 1: General Value Decoder (Age Variable Excluded)

**Objective:**  
Decode all survey variables using Metadata Sheet 2 except `C05-Age as of Last Birthday`, which requires special handling.

**Approach:**  
- Each survey file is decoded using its corresponding Metadata Sheet 2 CSV.
- Numeric-only variables (those whose only metadata label is `0`) are skipped to preserve raw values.

**Output Folder:** `NEW Fully Decoded Surveys`

**Rationale:** Isolating the age variable ensures that its inconsistent metadata descriptions cannot cause mismatched or missed decoding in this pass.
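
For the variables that are decoded in this pass, the range-rule idea can be sketched in plain Python (the notebook itself works on pandas columns). The function `decode_value`, the rule tuples, and the labels below are hypothetical; the point is only that wide "umbrella" ranges are applied first so that narrower, more specific ranges overwrite them.

```python
def decode_value(raw, rules):
    """Return the label of the narrowest range containing raw, else raw unchanged."""
    try:
        number = float(raw)
    except (TypeError, ValueError):
        return raw  # non-numeric text is left as-is

    result = raw
    # Widest ranges are applied first so that narrower ones overwrite them
    for low, high, label in sorted(rules, key=lambda r: r[1] - r[0], reverse=True):
        if low <= number <= high:
            result = label
    return result

# Hypothetical rules: (min_value, max_value, label)
rules = [
    (1, 9, "In the labor force"),  # umbrella range
    (1, 1, "Employed"),
    (2, 2, "Unemployed"),
]

decoded = [decode_value(v, rules) for v in ["1", "2", "5", "99", "n/a"]]
print(decoded)
```

Values outside every range, such as `99` here, are returned untouched rather than mapped to a guess.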

In [2]:
import pandas as pd
import numpy as np
import os
import re
import csv
from pathlib import Path

def load_clean_sheet2(base_path, year, filename):
    # Extracts month from filename
    match = re.search(r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)", filename, re.IGNORECASE)
    if not match:
        raise ValueError(f"Could not determine month from filename: {filename}")
    month = match.group(1).capitalize()
    
    path = os.path.join(base_path, "NEW Metadata Sheet 2 CSV's", str(year), f"Sheet2_{month}_{year}.csv")
    if not os.path.exists(path):
        raise FileNotFoundError(f"Metadata not found at: {path}")
    return pd.read_csv(path, dtype=str)

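def find_target_column(survey_columns, meta_desc):
    """Return the survey column matching a metadata description, or None."""
    # 1. Exact match after stripping case, spaces, and punctuation
    norm_meta = re.sub(r'[^a-z0-9]', '', str(meta_desc).lower())
    for col in survey_columns:
        if re.sub(r'[^a-z0-9]', '', col.lower()) == norm_meta:
            return col
    # 2. Fallback: match on the leading variable code (e.g. "C07")
    match = re.match(r'^(C\d+)', str(meta_desc), re.IGNORECASE)
    if match:
        code = match.group(1).upper()
        for col in survey_columns:
            if col.upper().startswith(code):
                return col
    return None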

In [3]:
def decode_survey_safe(survey_df, meta_df, year=None, month=None):
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    for var_code in unique_vars:
        subset = meta_df[meta_df['Variable'] == var_code].copy()
        
        # --- NEW LOGIC: DETECT NUMERIC-ONLY VARIABLES ---
        # If the only unique label is "0" or "0.0", it's a numeric column (Hours, Line Number, etc.)
        # We skip decoding to preserve the raw values.
        unique_labels = subset['Label'].dropna().unique()
        if len(unique_labels) == 1 and str(unique_labels[0]).strip() in ['0', '0.0']:
            continue

        raw_desc = str(subset['Description'].dropna().iloc[0]).strip()
        
        # --- PINPOINT SKIP LOGIC (AGE) ---
        desc_clean = raw_desc.upper()
        is_age_var = ("AGE" in desc_clean and "BIRTHDAY" in desc_clean) or (desc_clean == "AGE GROUP")
        is_grade_var = "GRADE" in desc_clean or "EDUCATION" in desc_clean or "SCHOOL" in desc_clean

        if is_age_var and not is_grade_var:
            continue 

        target_col = find_target_column(survey_cols, raw_desc)
        if not target_col: continue

        # Vectorized application setup
        original_col = survey_df[target_col].copy()
        numeric_series = pd.to_numeric(survey_df[target_col], errors='coerce')

        rules = []
        for _, row in subset.iterrows():
            try:
                m_val = float(str(row.get('min_value', 0)).strip())
                x_val = float(str(row.get('max_value', 0)).strip())
            except (TypeError, ValueError):
                continue  # skip rows whose bounds are not numeric
            width = (x_val - m_val) if (x_val > m_val) else 0
            rules.append({'min': m_val, 'max': x_val, 'label': str(row['Label']).strip(), 'width': width})

        # SORT: Widest (Umbrella) first, Smallest (Specific) last
        sorted_rules = sorted(rules, key=lambda x: x['width'], reverse=True)
        # Writable object-dtype copy; .values can be a read-only view under pandas copy-on-write
        new_values = original_col.to_numpy(dtype=object, copy=True)
        
        for r in sorted_rules:
            if r['width'] == 0:
                mask = (numeric_series == r['min'])
            else:
                mask = (numeric_series >= r['min']) & (numeric_series <= r['max'])
            
            # Update matching indices with text labels
            new_values[mask] = r['label']

        survey_df[target_col] = new_values
        decoded_count += 1
        
    return survey_df, decoded_count

In [4]:
def run_batch_pass_1(base_path):
    input_root = os.path.join(base_path, "NEW Header Encoded Surveys")
    output_root = os.path.join(base_path, "NEW Fully Decoded Surveys")
    os.makedirs(output_root, exist_ok=True)
    
    # Process years numerically
    years = sorted([f for f in os.listdir(input_root) if f.isdigit()])
    
    for year in years:
        year_in = os.path.join(input_root, year)
        year_out = os.path.join(output_root, year)
        os.makedirs(year_out, exist_ok=True)
        
        files = [f for f in os.listdir(year_in) if f.lower().endswith(".csv")]
        for filename in sorted(files):
            try:
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False, dtype=str)
                df_meta = load_clean_sheet2(base_path, year, filename) 
                
                df_final, _ = decode_survey_safe(df_survey, df_meta, year, filename)
                
                # QUOTE_NONNUMERIC wraps text in "" to prevent Excel from guessing data types
                df_final.to_csv(os.path.join(year_out, filename), index=False, quoting=csv.QUOTE_NONNUMERIC)
                print(f"[OK] Pass 1 (General) Saved: {filename}")
            except Exception as e: 
                print(f"[ERROR] {filename}: {e}")

# EXECUTION
run_batch_pass_1(base_path)

[OK] Pass 1 (General) Saved: APRIL_2018.CSV
[OK] Pass 1 (General) Saved: JANUARY_2018.CSV
[OK] Pass 1 (General) Saved: JULY_2018.CSV
[OK] Pass 1 (General) Saved: OCTOBER_2018.CSV
[OK] Pass 1 (General) Saved: APRIL_2019.CSV
[OK] Pass 1 (General) Saved: JANUARY_2019.CSV
[OK] Pass 1 (General) Saved: JULY_2019.CSV
[OK] Pass 1 (General) Saved: OCTOBER_2019.CSV
[OK] Pass 1 (General) Saved: APRIL_2022.csv
[OK] Pass 1 (General) Saved: AUGUST_2022.CSV
[OK] Pass 1 (General) Saved: DECEMBER_2022.CSV
[OK] Pass 1 (General) Saved: FEBRUARY_2022.csv
[OK] Pass 1 (General) Saved: JANUARY_2022.csv
[OK] Pass 1 (General) Saved: JULY_2022.CSV
[OK] Pass 1 (General) Saved: JUNE_2022.csv
[OK] Pass 1 (General) Saved: MARCH_2022.csv
[OK] Pass 1 (General) Saved: MAY_2022.csv
[OK] Pass 1 (General) Saved: NOVEMBER_2022.CSV
[OK] Pass 1 (General) Saved: OCTOBER_2022.CSV
[OK] Pass 1 (General) Saved: SEPTEMBER_2022.CSV
[OK] Pass 1 (General) Saved: APRIL_2023.CSV
[OK] Pass 1 (General) Saved: AUGUST_2023.CSV
[OK] Pass 1

---
#### Verification Step: Consistency of Age Coding Schemes Across Survey Months

**Objective:**
Determine whether the age category labels are consistent across survey months despite differences in metadata description names.

In [5]:
from pathlib import Path
import pandas as pd

# 1. Define the map (The script needs to see this to run the loop)
AGE_VAR_MAP = {
    ("2018", "January"): "C07-Age as of Last Birthday", ("2018", "April"): "C07-Age as of Last Birthday",
    ("2018", "July"): "C07-Age as of Last Birthday", ("2018", "October"): "C07-Age as of Last Birthday",
    ("2019", "January"): "C07-Age as of Last Birthday", ("2019", "April"): "C07-Age as of Last Birthday",
    ("2019", "July"): "C07-Age as of Last Birthday", ("2019", "October"): "C07-Age as of Last Birthday",
    ("2022", "January"): "C07-Age as of Last Birthday", ("2022", "February"): "Age Group",
    ("2022", "March"): "Age Group", ("2022", "April"): "C07-Age as of Last Birthday",
    ("2022", "May"): "Age Group", ("2022", "June"): "Age Group", ("2022", "July"): "C07-Age as of Last Birthday",
    ("2022", "August"): "Age Group", ("2022", "September"): "Age Group", ("2022", "October"): "C07-Age as of Last Birthday",
    ("2022", "November"): "Age Group", ("2022", "December"): "Age Group", ("2023", "January"): "C07-Age as of Last Birthday",
    ("2023", "February"): "Age Group", ("2023", "March"): "Age Group", ("2023", "April"): "C07-Age as of Last Birthday",
    ("2023", "May"): "Age Group", ("2023", "June"): "Age Group", ("2023", "July"): "C05-Age as of Last Birthday",
    ("2023", "August"): "Age Group", ("2023", "September"): "Age Group", ("2023", "October"): "C07-Age as of Last Birthday",
    ("2023", "November"): "Age Group", ("2023", "December"): "Age Group", ("2024", "January"): "C07-Age as of Last Birthday",
    ("2024", "February"): "Age Group", ("2024", "March"): "Age Group", ("2024", "April"): "C07-Age as of Last Birthday",
    ("2024", "May"): "Age Group", ("2024", "June"): "Age Group", ("2024", "July"): "C07-Age as of Last Birthday",
    ("2024", "August"): "Age Group"
}

# 2. Audit Logic
schemes = {}
total_months_processed = 0

print("--- STARTING AGE METADATA VERIFICATION ---\n")

def clean_num(x):
    """Normalize values such as '14.0' to '14'; unparseable values become '0'."""
    try:
        return str(int(float(x)))
    except (TypeError, ValueError, OverflowError):
        return "0"

for (year, month), target_desc in AGE_VAR_MAP.items():
    # Path construction using pathlib (BASE_PATH comes from the config cell)
    path = Path(BASE_PATH) / "NEW Metadata Sheet 2 CSV's" / year / f"Sheet2_{month.capitalize()}_{year}.csv"
    
    if not path.exists():
        continue
            
    df_meta = pd.read_csv(path, dtype=str)
    age_meta = df_meta[df_meta['Description'].str.strip() == target_desc].copy()
    
    if age_meta.empty:
        continue

    rules = []
    for _, row in age_meta.iterrows():
        label = str(row['Label']).strip()
        mi = clean_num(row.get('min_value', '0'))
        ma = clean_num(row.get('max_value', '0'))
        rules.append((label, mi, ma))
    
    scheme_id = tuple(sorted(rules))
    if scheme_id not in schemes:
        schemes[scheme_id] = []
    
    schemes[scheme_id].append(f"{month} {year}")
    total_months_processed += 1

# 3. Print Results
print(f"Total Months Audited: {total_months_processed}")
print(f"Detected {len(schemes)} unique coding scheme(s).")

for idx, (sig, months_list) in enumerate(schemes.items(), 1):
    num_months = len(months_list)
    print(f"\n### SCHEME #{idx} ({num_months} months)")
    print(f"Applies to: {', '.join(months_list)}")
    print(f"{'Label Category':<25} | {'Min':<5} | {'Max':<5}")
    print("-" * 45)
    for label, mi, ma in sig:
        print(f"{label:<25} | {mi:<5} | {ma:<5}")

--- STARTING AGE METADATA VERIFICATION ---

Total Months Audited: 40
Detected 1 unique coding scheme(s).

### SCHEME #1 (40 months)
Applies to: January 2018, April 2018, July 2018, October 2018, January 2019, April 2019, July 2019, October 2019, January 2022, February 2022, March 2022, April 2022, May 2022, June 2022, July 2022, August 2022, September 2022, October 2022, November 2022, December 2022, January 2023, February 2023, March 2023, April 2023, May 2023, June 2023, July 2023, August 2023, September 2023, October 2023, November 2023, December 2023, January 2024, February 2024, March 2024, April 2024, May 2024, June 2024, July 2024, August 2024
Label Category            | Min   | Max  
---------------------------------------------
00 - 14                   | 0     | 14   
15 - 24                   | 15    | 24   
25 - 34                   | 25    | 34   
35 - 44                   | 35    | 44   
45 - 54                   | 45    | 54   
55 - 64                   | 55    | 64   
6

The audit confirms that **age category definitions are consistent across survey months**, even though the metadata description field varies. This verification establishes that a unified decoding logic can be applied once the correct metadata description is identified per month.
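
The comparison behind this audit can be sketched with an order-independent signature per month. The two months of rows below are hypothetical and shortened to two brackets each; the helper name `signature` is also an assumption made for illustration.

```python
from collections import defaultdict

# Hypothetical, shortened metadata rows: (description, label, min_value, max_value)
metadata_by_month = {
    "January 2022": [
        ("C07-Age as of Last Birthday", "00 - 14", "0", "14"),
        ("C07-Age as of Last Birthday", "15 - 24", "15.0", "24.0"),
    ],
    "February 2022": [
        ("Age Group", "15 - 24", "15", "24"),
        ("Age Group", "00 - 14", "0", "14"),
    ],
}

def signature(rows):
    """Fingerprint of the (label, min, max) rules, ignoring description and row order."""
    return tuple(sorted((label, int(float(lo)), int(float(hi))) for _, label, lo, hi in rows))

months_by_scheme = defaultdict(list)
for month, rows in metadata_by_month.items():
    months_by_scheme[signature(rows)].append(month)

print(len(months_by_scheme))  # number of distinct coding schemes
```

Because the description is excluded from the signature and numeric strings such as `15.0` are normalized, months that differ only in naming or formatting collapse into the same scheme.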

---
#### Pass 2: Targeted Age Variable Decoder (Special Case Handling)

**Objective:**
Accurately decode the age column using the correct metadata description specific to each survey month.

**Approach:**

- The appropriate metadata definition is dynamically selected for each file.
- Raw age values are cleaned to correct Excel-induced date formatting artifacts.
- Age values are decoded using numeric range matching into standardized age group labels.

**Output Folder:** `NEW Fully Decoded Surveys (with Special Case Variables)`

**Rationale:**
This second pass pairs each file with the metadata description that actually defines its age variable, which resolves the mapping conflicts identified during metadata inspection.
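
The cleaning step can be illustrated with a stricter, regex-based variant of the idea. It assumes that an artifact has the exact shape `YYYY-MM-DD HH:MM:SS` and that the original age survives in the day component; the pattern `GHOST`, the function `recover_age`, and the sample values are hypothetical.

```python
import re

# Assumed artifact shape: 'YYYY-MM-DD HH:MM:SS', with the age in the day part
GHOST = re.compile(r"^\d{4}-\d{2}-(\d{2}) \d{2}:\d{2}:\d{2}$")

def recover_age(value):
    """Return the day component of a date-time artifact, else the stripped value."""
    text = str(value).strip()
    m = GHOST.match(text)
    if m:
        return str(int(m.group(1)))  # int() drops the leading zero
    return text

samples = ["2023-09-05 00:00:00", "37", "00 - 14", " 8 "]
cleaned = [recover_age(s) for s in samples]
print(cleaned)
```

An anchored pattern leaves already-decoded labels such as `00 - 14` untouched, since they contain a hyphen but do not look like a timestamp.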

In [6]:
# Month-to-description mapping, consistent with AGE_VAR_MAP in the verification step.
# Most months use "Age Group", so only the "C07" and "C05" exceptions are listed explicitly.
C07_MONTHS = ["January", "April", "July", "October"]

def get_age_meta_description(year, month):
    y_str = str(year)
    m_str = month.capitalize()
    
    # Special case: July 2023 uses C05
    if y_str == "2023" and m_str == "July":
        return "C05-Age as of Last Birthday"
    
    # Quarterly months usually use C07 (except July 2023)
    if m_str in C07_MONTHS:
        return "C07-Age as of Last Birthday"
    
    # All other months (monthly series) use "Age Group"
    return "Age Group"

def scrub_date_ghosts(val):
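    """
    Recover a plain value from an Excel date-time artifact.

    Strings containing both '-' and ':' (e.g. '2023-09-05 00:00:00') are treated
    as artifacts, and the day component is returned without leading zeros ('5').
    Any other value is returned stripped but otherwise unchanged.
    """
    s = str(val).strip()
    if "-" in s and ":" in s:
        # Take the text after the last '-', drop the time part, strip leading zeros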
        clean_val = s.split('-')[-1].split(' ')[0].lstrip('0')
        return clean_val if clean_val != "" else "0"
    return s

In [7]:
def decode_age_dynamically(survey_df, meta_df, target_desc):
    # 1. Filter metadata for the specific Age description for this month
    age_meta = meta_df[meta_df['Description'].str.strip() == target_desc].copy()
    
    if age_meta.empty:
        print(f"   [WARNING] No metadata found for description: {target_desc}")
        return survey_df

    # 2. Build mathematical rules from metadata min/max values
    rules = []
    for _, row in age_meta.iterrows():
        try:
            m_val = float(str(row.get('min_value', 0)).strip())
            x_val = float(str(row.get('max_value', 0)).strip())
        except (TypeError, ValueError):
            continue  # skip rows whose bounds are not numeric
        label = str(row['Label']).strip()

        # Keep the standard age brackets (width < 20) and the "65 and Over"
        # category; wider umbrella ranges are dropped.
        width = (x_val - m_val) if (x_val > m_val) else 0
        if width < 20 or "65" in label:
            rules.append({'min': m_val, 'max': x_val, 'label': label})

    # 3. Identify the Survey Column (Fixed as 'C05-Age as of Last Birthday')
    # We use a flexible search in case of minor naming variations
    age_col = next((c for c in survey_df.columns if "AGE" in c.upper() and "BIRTHDAY" in c.upper()), None)
    
    if age_col and rules:
        # A. Clean the 'Date Ghost' corruption (element-wise via apply)
        survey_df[age_col] = survey_df[age_col].apply(scrub_date_ghosts)
        
        # B. Convert to numeric to allow for range comparison math
        numeric_series = pd.to_numeric(survey_df[age_col], errors='coerce')
        # Writable object-dtype copy; .values can be a read-only view under pandas copy-on-write
        new_values = survey_df[age_col].to_numpy(dtype=object, copy=True)

        # C. Map values using Numeric Ranges (Vectorized)
        for r in rules:
            mask = (numeric_series >= r['min']) & (numeric_series <= r['max'])
            new_values[mask] = r['label']
            
        survey_df[age_col] = new_values
        
    return survey_df

In [8]:
def run_batch_pass_2_final(base_path):
    # Folders
    input_root = os.path.join(base_path, "NEW Fully Decoded Surveys")
    output_root = os.path.join(base_path, "NEW Fully Decoded Surveys (with Special Case Variables)")
    os.makedirs(output_root, exist_ok=True)
    
    years = sorted([f for f in os.listdir(input_root) if f.isdigit()])
    
    for year in years:
        year_in = os.path.join(input_root, year)
        year_out = os.path.join(output_root, year)
        os.makedirs(year_out, exist_ok=True)
        
        files = sorted([f for f in os.listdir(year_in) if f.lower().endswith(".csv")])
        for filename in files:
            try:
                # Load Survey from Pass 1
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False, dtype=str)
                
                # Identify Month from filename
                match = re.search(r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)", filename, re.IGNORECASE)
                month = match.group(1).capitalize()
                
                # 1. Get correct metadata description for this file
                target_desc = get_age_meta_description(year, month)
                
                # 2. Load the metadata CSV
                df_meta = load_clean_sheet2(base_path, year, filename)
                
                # 3. Process Age via dynamic math
                df_final = decode_age_dynamically(df_survey, df_meta, target_desc)
                
                # 4. FINAL SAVE: QUOTE_ALL is mandatory here
                # It wraps every value in " " to stop Excel from auto-formatting brackets into dates
                df_final.to_csv(os.path.join(year_out, filename), index=False, quoting=csv.QUOTE_ALL)
                print(f"[SUCCESS] Pass 2: {filename} processed using '{target_desc}'")
                
            except Exception as e: 
                print(f"[ERROR] Pass 2 failed for {filename}: {e}")

# EXECUTION
run_batch_pass_2_final(base_path)

[SUCCESS] Pass 2: APRIL_2018.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: JANUARY_2018.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: JULY_2018.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: OCTOBER_2018.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: APRIL_2019.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: JANUARY_2019.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: JULY_2019.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: OCTOBER_2019.CSV processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: APRIL_2022.csv processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pass 2: AUGUST_2022.CSV processed using 'Age Group'
[SUCCESS] Pass 2: DECEMBER_2022.CSV processed using 'Age Group'
[SUCCESS] Pass 2: FEBRUARY_2022.csv processed using 'Age Group'
[SUCCESS] Pass 2: JANUARY_2022.csv processed using 'C07-Age as of Last Birthday'
[SUCCESS] Pa

---
#### Final Verification: Unique Age Values After Decoding

**Objective:**
Confirm that age decoding produced a clean and consistent set of age categories across all survey months.

**Approach:**

- All fully decoded survey files are scanned to identify the age column.
- Unique decoded age labels are collected across all months.
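
A stricter form of this check compares the collected values against the expected label set and reports anything else. The sketch below uses the seven labels shown in this notebook's output; the function `unexpected_labels` and the two sample columns are hypothetical.

```python
EXPECTED_AGE_LABELS = {
    "00 - 14", "15 - 24", "25 - 34", "35 - 44",
    "45 - 54", "55 - 64", "65 and Over",
}

def unexpected_labels(values, expected=EXPECTED_AGE_LABELS):
    """Return the sorted values that are not recognized age labels."""
    return sorted({str(v).strip() for v in values} - expected)

# Hypothetical column contents
clean_column = ["00 - 14", "25 - 34", "65 and Over"]
dirty_column = ["00 - 14", "37", "2023-09-05 00:00:00"]

print(unexpected_labels(clean_column))
print(unexpected_labels(dirty_column))
```

An empty result means every value is a known category, while leftover raw numbers or date artifacts are listed explicitly instead of being spotted by eye.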

In [9]:
import pandas as pd
import os
from pathlib import Path

# SETUP PATH
FINAL_FOLDER_PATH = Path(BASE_PATH) / "NEW Fully Decoded Surveys (with Special Case Variables)"

def print_unique_age_values(input_root):
    all_unique_labels = set()
    files_checked = 0

    # Iterate through year folders
    for year_dir in sorted([d for d in input_root.iterdir() if d.is_dir()]):
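        # Iterate through CSV files; the suffix check is case-insensitive because
        # Pass 2 writes both ".csv" and ".CSV" files
        csv_files = [p for p in year_dir.iterdir() if p.suffix.lower() == ".csv"]
        for file_path in sorted(csv_files):
            try:
                # Load only the age column by passing a callable to usecols
                df = pd.read_csv(file_path, usecols=lambda x: "AGE" in x.upper() and "BIRTHDAY" in x.upper(), dtype=str)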
                
                if not df.empty:
                    col_name = df.columns[0]
                    all_unique_labels.update(df[col_name].dropna().unique())
                    files_checked += 1
            except Exception:
                continue

    # PRINT FINAL LIST
    print(f"Total Files Audited: {files_checked}")
    print("=" * 30)
    print("UNIQUE AGE VALUES FOUND:")
    print("=" * 30)
    
    # Sort for a clean, organized list
    for label in sorted(list(all_unique_labels)):
        print(label)

# EXECUTE
if FINAL_FOLDER_PATH.exists():
    print_unique_age_values(FINAL_FOLDER_PATH)
else:
    print(f"Folder not found: {FINAL_FOLDER_PATH}")

Total Files Audited: 40
UNIQUE AGE VALUES FOUND:
00 - 14
15 - 24
25 - 34
35 - 44
45 - 54
55 - 64
65 and Over


**Outcome:**
- The final verification confirms that age values are consistently decoded into the intended standardized categories.
- No unexpected labels or mixed numeric-text values remain.