
This notebook applies the Sheet 2 metadata definitions to decode survey values, mapping coded responses (e.g., 1, 2, 3) to human‑readable labels (e.g., Employed, Unemployed).

Dependencies:
- Run `00_Settings.ipynb` and `01_Inventory.ipynb` first.
- Requires outputs from `03_Metadata_Decoder.ipynb` (Header Encoded Surveys).
- Note: Survey CSVs already carry meaning from Sheet 2 reshaping. This notebook applies the final decoding logic.

Output:

- Fully decoded survey CSVs saved into **NEW Fully Decoded Surveys**.
- Reports per survey showing number of columns successfully decoded.

Notes:
- **Sheet 1 metadata** → header translation (column names).  
- **Sheet 2 metadata** → value translation (coded responses).  
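
As a minimal sketch of the value translation described above, the snippet below maps a toy coded column to labels. The column name `employment_status`, the third label, and the unmapped code `9` are illustrative assumptions, not values taken from the actual Sheet 2 metadata.

```python
import pandas as pd

# Hypothetical code-to-label lookup (illustrative only; real pairs come from Sheet 2).
lookup = {1: "Employed", 2: "Unemployed", 3: "Not in Labor Force"}

# Toy survey column holding coded responses; 9 has no label in the lookup.
survey = pd.DataFrame({"employment_status": [1, 2, 3, 9]})

# Translate each code, keeping the original value when no label exists.
survey["employment_status"] = survey["employment_status"].map(
    lambda code: lookup.get(code, code)
)

print(survey["employment_status"].tolist())
```

Codes without a label are left unchanged, so an incomplete lookup does not silently drop data.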

**INTENT:** This notebook performs the **final decoding stage** of the Labor Force Survey pipeline: it applies **Sheet 2 metadata** to translate coded survey values into human‑readable labels.

**Next steps:** duplicate variable detection, integrity checks, and coverage scanning.


In [3]:
import json
from pathlib import Path
import os
import pandas as pd

# ------------------------------------------------------------
# Load settings from config.json (produced by 00_Settings.ipynb)
# ------------------------------------------------------------
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# ------------------------------------------------------------
# Load inventory (produced by 01_Inventory.ipynb)
# ------------------------------------------------------------
with open(INTERIM_DIR / "inventory.json") as f:  # INTERIM_DIR is already a Path
    inventory = json.load(f)

# Alias for compatibility
base_path = str(BASE_PATH)
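
The loading cell assumes `config.json` exists and contains every key it reads. A minimal sketch of a stricter loader is shown below; the helper name `load_config` and the `REQUIRED_KEYS` list are hypothetical additions, with the key names copied from the cell above.

```python
import json
from pathlib import Path

# Keys this notebook reads from config.json (copied from the loading cell).
REQUIRED_KEYS = ["BASE_PATH", "INTERIM_DIR", "PROCESSED_DIR", "LOG_DIR", "MONTH_ORDER"]


def load_config(config_path):
    """Load config.json and fail early with a clear message if something is missing."""
    config_path = Path(config_path)
    if not config_path.exists():
        raise FileNotFoundError(
            f"Config not found at {config_path}; run 00_Settings.ipynb first."
        )
    with open(config_path) as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"config.json is missing keys: {missing}")
    return cfg
```

Calling `load_config("./data/interim/config.json")` would then return the same dictionary as the cell above, or stop with a message naming what is missing.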


### Interim Sample 

#### Intent: Interim Sample Decoder

This interim function generates a **single sample output** (e.g., `NEW Fully Decoded Survey Sample` for **January 2018**) using the same decoding logic as the batch runner.  

Unlike the full batch process, which writes all decoded surveys to Google Drive, this sample run saves directly into the local **interim repository path**.  

The purpose is to give a quick preview of a fully decoded CSV without downloading the large files from Google Drive.  

Use this when:
- You want to validate decoding logic on a small subset before running the full batch.  
- You need a lightweight example for documentation, testing, or demonstration.  
- You want to inspect decoded values locally without waiting for the complete dataset.
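
For local inspection, reading only the first few rows keeps the preview cheap even when the decoded file is large. The helper below is a sketch; the name `preview_decoded_csv` and its `n_rows` parameter are hypothetical and not part of the pipeline.

```python
import pandas as pd


def preview_decoded_csv(csv_path, n_rows=5):
    """Read only the first n_rows of a CSV so large files stay cheap to inspect."""
    return pd.read_csv(csv_path, nrows=n_rows)
```

Pass it the path printed on the `[SAVED]` line of a sample run to see the decoded labels in the first rows.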


In [4]:
import json
from pathlib import Path
import os
import pandas as pd
import re

# ------------------------------------------------------------
# Shared folder names
# ------------------------------------------------------------
HEADER_ENCODED_FOLDER = "NEW Header Encoded Surveys"
FULLY_DECODED_FOLDER = "NEW Fully Decoded Surveys"
FULLY_DECODED_SAMPLE_FOLDER = "NEW Fully Decoded Survey Sample"

# ------------------------------------------------------------
# Month parsing
# ------------------------------------------------------------
MONTHS = [
    "JANUARY","FEBRUARY","MARCH","APRIL","MAY","JUNE",
    "JULY","AUGUST","SEPTEMBER","OCTOBER","NOVEMBER","DECEMBER"
]

MONTH_PATTERN = re.compile(
    r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)",
    re.IGNORECASE
)

# ============================================================
# Metadata loader
# ============================================================
def load_clean_sheet2(base_path, year, month):
    """
    Load the Clean Sheet 2 Metadata for value decoding.
    Official folder: NEW Metadata Sheet 2 CSV's
    """
    path = os.path.join(
        base_path,
        "NEW Metadata Sheet 2 CSV's",
        str(year),
        f"Sheet2_{month}_{year}.csv"
    )
    if not os.path.exists(path):
        raise FileNotFoundError(f"Metadata not found at: {path}")
    return pd.read_csv(path, dtype=str)

# ============================================================
# Column matching logic
# ============================================================
def find_target_column(survey_columns, meta_desc):
    """
    Align metadata descriptions with survey columns.
    """
    if pd.isna(meta_desc):
        return None

    meta_desc = str(meta_desc).strip()
    if not meta_desc:
        # An empty description would match the first column via endswith("")
        return None

    # Exact match
    if meta_desc in survey_columns:
        return meta_desc

    # Remove metadata prefix (e.g. C06-Status -> Status)
    clean_meta = re.sub(r'^C\d+[\s\-_]+', '', meta_desc, flags=re.IGNORECASE).strip()
    if clean_meta in survey_columns:
        return clean_meta

    # Survey column has prefix
    for col in survey_columns:
        if col.endswith(meta_desc):
            prefix = col[:-len(meta_desc)].strip()
            if re.search(r'^C\d+[\s\-_]*$', prefix, re.IGNORECASE) or prefix == "":
                return col

    return None

# ============================================================
# Safe decoder
# ============================================================
def decode_survey_safe(survey_df, meta_df):
    """
    Decode survey values using metadata Sheet 2.
    Uses Variable, Description, Label, min_value, max_value, additional_value.
    Modifies survey_df in place; returns it with the number of decoded columns.
    """
    unique_vars = meta_df['Variable'].unique()
    decoded_count = 0
    survey_cols = list(survey_df.columns)

    for var_code in unique_vars:
        subset = meta_df[meta_df['Variable'] == var_code].copy()
        if subset['Description'].isnull().all():
            continue

        raw_desc = subset['Description'].dropna().iloc[0].strip()
        target_col = find_target_column(survey_cols, raw_desc)
        if not target_col:
            continue

        lookup = {}
        for _, row in subset.iterrows():
            label = str(row['Label']).strip()
            if label in ['0', '0.0', 'nan', 'NaN', '']:
                continue

            # Collect codes
            codes = []
            for field in ['min_value', 'max_value', 'additional_value']:
                val = str(row.get(field, '')).strip()
                if val and val not in ['0', 'nan', 'NaN']:
                    codes.append(val)

            # Build mapping
            for code in codes:
                try:
                    lookup[int(float(code))] = label
                except (ValueError, OverflowError):
                    lookup[code] = label  # fallback for non-numeric codes

        if not lookup:
            continue

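        def safe_map(val):
            try:
                return lookup.get(int(float(val)), val)
            except (ValueError, TypeError, OverflowError):
                # Non-numeric or missing value: try a string match, else keep as-is
                return lookup.get(str(val), val)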

        survey_df[target_col] = survey_df[target_col].apply(safe_map)
        decoded_count += 1

    return survey_df, decoded_count

# ============================================================
# INTERIM SAMPLE DECODER
# ============================================================
def run_sample_decoding(base_path, year="2018", month="January", interim_root=None):
    """
    Decode a single survey file for demonstration purposes.
    Output is GitHub-safe (small sample only).
    """
    if interim_root is None:
        interim_root = os.path.join(base_path, "data", "interim")

    month = month.strip().capitalize()
    if month.upper() not in MONTHS:
        raise ValueError(f"Invalid month: {month}")

    year = str(year)  # accept an int or a str year
    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER, year)
    output_root = os.path.join(interim_root, FULLY_DECODED_SAMPLE_FOLDER, year)

    if not os.path.exists(input_root):
        print(f"[SKIP] Input folder not found: {input_root}")
        return

    # Create the output folder only once the input folder is known to exist
    os.makedirs(output_root, exist_ok=True)

    files = [f for f in os.listdir(input_root) if f.lower().endswith(".csv")]
    target_file = next((f for f in files if month.upper() in f.upper()), None)

    if not target_file:
        print(f"[SKIP] No survey file found for {month} {year}")
        return

    print("================================================")
    print(f"SAMPLE DECODING: {month.upper()} {year}")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    df_survey = pd.read_csv(os.path.join(input_root, target_file), low_memory=False)
    df_meta = load_clean_sheet2(base_path, year, month)

    df_final, count = decode_survey_safe(df_survey, df_meta)

    save_path = os.path.join(output_root, target_file)
    df_final.to_csv(save_path, index=False)

    print(f"[OK] Decoded {count} columns")
    print(f"[SAVED] {save_path}")

# ============================================================
# FULL BATCH DECODER
# ============================================================
def run_batch_decoding(base_path):
    """
    Decode all survey files using Sheet 2 metadata.
    Intended for local execution only.
    """
    input_root = os.path.join(base_path, HEADER_ENCODED_FOLDER)
    output_root = os.path.join(base_path, FULLY_DECODED_FOLDER)

    print("================================================")
    print("STARTING FULL BATCH DECODING")
    print(f"Source: {input_root}")
    print(f"Dest:   {output_root}")
    print("================================================\n")

    if not os.path.exists(input_root):
        print(f"[ERROR] Input folder not found: {input_root}")
        return

    # Create the output folder only once the input folder is known to exist
    os.makedirs(output_root, exist_ok=True)

    success, errors = 0, 0

    year_folders = [
        f for f in os.listdir(input_root)
        if f.isdigit() and os.path.isdir(os.path.join(input_root, f))
    ]

    for year in sorted(year_folders):
        year_in = os.path.join(input_root, year)
        year_out = os.path.join(output_root, year)
        os.makedirs(year_out, exist_ok=True)

        files = [f for f in os.listdir(year_in) if f.lower().endswith(".csv")]
        for filename in files:
            match = MONTH_PATTERN.search(filename)
            if not match:
                continue

            month = match.group(1).capitalize()
            print(f"Processing: {month.upper()} {year}...")

            try:
                df_survey = pd.read_csv(os.path.join(year_in, filename), low_memory=False)
                df_meta = load_clean_sheet2(base_path, year, month)
                df_final, count = decode_survey_safe(df_survey, df_meta)

                save_path = os.path.join(year_out, filename)
                df_final.to_csv(save_path, index=False)

                print(f"[OK] Decoded {count} columns")
                success += 1

            except Exception as e:
                print(f"[ERROR] {e}")
                errors += 1

            print("-" * 40)

    print(f"\nCOMPLETED | Success: {success} | Errors: {errors}")
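
The prefix handling in `find_target_column` can be checked in isolation. The sketch below reuses the same regular expression; the helper name `strip_prefix` is hypothetical, and every example string other than `C06-Status` is an illustrative assumption.

```python
import re

# Same prefix pattern as find_target_column: "C", digits, then one or more separators.
PREFIX = re.compile(r"^C\d+[\s\-_]+", re.IGNORECASE)


def strip_prefix(description):
    """Remove a leading metadata prefix such as 'C06-' from a description."""
    return PREFIX.sub("", description).strip()


examples = ["C06-Status", "c12_Sex", "C101 Age", "Status", "C06Status"]
for text in examples:
    print(f"{text!r} -> {strip_prefix(text)!r}")
```

The last example has no separator after the digits, so the pattern does not match and the string is returned unchanged.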

In [5]:
# ============================================================
# PIPELINE ENTRY POINT
# ============================================================

def show_decoding_options():
    print("==============================================")
    print("SURVEY VALUE DECODING OPTIONS")
    print("==============================================")
    print("1. run_sample()      → Small GitHub-safe sample")
    print("2. run_full_batch() → Full local decoding")
    print("==============================================\n")


def run_sample():
    run_sample_decoding(
        base_path=BASE_PATH,
        year="2018",
        month="January"
    )


def run_full_batch():
    run_batch_decoding(base_path=BASE_PATH)


show_decoding_options()


SURVEY VALUE DECODING OPTIONS
1. run_sample()      → Small GitHub-safe sample
2. run_full_batch() → Full local decoding



In [6]:
run_sample()

SAMPLE DECODING: JANUARY 2018
Source: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Header Encoded Surveys\2018
Dest:   G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018

[OK] Decoded 39 columns
[SAVED] G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\data\interim\NEW Fully Decoded Survey Sample\2018\JANUARY_2018.CSV


In [7]:
run_full_batch()

STARTING FULL BATCH DECODING
Source: G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Header Encoded Surveys
Dest:   G:\.shortcut-targets-by-id\1VctTphaltRx4xcPxmTJlRTrxLalyuEt8\Labor Force Survey\NEW Fully Decoded Surveys

Processing: APRIL 2018...
[OK] Decoded 37 columns
----------------------------------------
Processing: JULY 2018...
[OK] Decoded 37 columns
----------------------------------------
Processing: JANUARY 2018...
[OK] Decoded 39 columns
----------------------------------------
Processing: OCTOBER 2018...
[OK] Decoded 39 columns
----------------------------------------
Processing: APRIL 2019...
[OK] Decoded 39 columns
----------------------------------------
Processing: JULY 2019...
[OK] Decoded 41 columns
----------------------------------------
Processing: OCTOBER 2019...
[OK] Decoded 41 columns
----------------------------------------
Processing: JANUARY 2019...
[OK] Decoded 38 columns
----------------------------------------
Process