
This notebook performs **post-decoding verification checks** to ensure the integrity and correctness of survey decoding.  

It follows the decoder notebook (`04_Survey_Decoder_sheet2.ipynb`) and provides two complementary QA steps:

1. **Record Integrity Check**  
   - Confirms that the number of rows in each decoded survey matches the raw header-encoded survey.  
   - Flags months where the counts differ (records dropped or added) or where the decoded file is missing.  
   - Produces a summary DataFrame with PASS/FAIL status per month.

2. **Coverage Scanner (Value Decoding Integrity)**  
   - Verifies that variables were decoded correctly against metadata.  
   - Distinguishes between:  
     - Successfully decoded categorical variables.  
     - Quantitative variables (numeric-only, e.g., Household Size, Year).  
     - Columns with no metadata (left unchanged).  
     - Failed decodes (should have been text but remained numeric).  
   - Produces detailed reports per month and highlights failures.

**Purpose**
- Provide **auditability** and **confidence** in the decoding pipeline.  
- Allow users to validate both **row-level integrity** and **column-level decoding coverage** before moving on to imputation, reshaping, or analysis.  
- Ensure reproducibility and transparency in survey preprocessing.

**Usage**

Run this notebook after completing the decoder notebook.  
It will:
- Print progress and summary reports.  
- Produce DataFrames (`record_integrity_df`, `df_integrity_check`) for further inspection or export.  

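For the export mentioned above, a minimal sketch follows. It assumes the two DataFrames already exist and that reports should be written under `LOG_DIR`; the helper name `export_qa_reports` and the output file names are hypothetical, not part of the pipeline.

```python
from pathlib import Path

import pandas as pd


def export_qa_reports(record_df, coverage_df, out_dir):
    """Hypothetical helper: write both QA reports to CSV and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    record_path = out_dir / "record_integrity_report.csv"     # assumed file name
    coverage_path = out_dir / "decoding_coverage_report.csv"  # assumed file name

    # index=False keeps the positional index out of the CSV, so no
    # "Unnamed: 0" column appears when the file is read back.
    record_df.to_csv(record_path, index=False)
    coverage_df.to_csv(coverage_path, index=False)
    return record_path, coverage_path
```

Once the execution cell has run, `export_qa_reports(record_integrity_df, df_integrity_check, LOG_DIR)` would write both files.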


In [1]:
import json
from pathlib import Path
import os
import pandas as pd

# ------------------------------------------------------------
# Load settings from config.json (produced by 00_Settings.ipynb)
# ------------------------------------------------------------
with open(Path("./data/interim/config.json")) as f:
    cfg = json.load(f)

BASE_PATH = Path(cfg["BASE_PATH"])
INTERIM_DIR = Path(cfg["INTERIM_DIR"])
PROCESSED_DIR = Path(cfg["PROCESSED_DIR"])
LOG_DIR = Path(cfg["LOG_DIR"])
MONTH_ORDER = cfg["MONTH_ORDER"]

# ------------------------------------------------------------
# Load inventory (produced by 01_Inventory.ipynb)
# ------------------------------------------------------------
with open(Path(INTERIM_DIR) / "inventory.json") as f:
    inventory = json.load(f)

# Alias for compatibility
base_path = str(BASE_PATH)

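The settings cell above reads five keys from `config.json`, so a missing key surfaces as a bare `KeyError` partway through the cell. As a sketch only, a loader could check for all keys up front and report them together. The helper name `load_config` is hypothetical; the key names are the ones used above.

```python
import json
from pathlib import Path

# Keys the settings cell reads from config.json.
REQUIRED_KEYS = ("BASE_PATH", "INTERIM_DIR", "PROCESSED_DIR", "LOG_DIR", "MONTH_ORDER")


def load_config(config_path):
    """Hypothetical helper: load config.json and report any missing keys."""
    with open(Path(config_path)) as f:
        cfg = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise KeyError(f"config.json is missing keys: {', '.join(missing)}")
    return cfg
```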

### Record Integrity Verifier

In [2]:
import re

# ============================================================
# Shared regex for month detection
# ============================================================
MONTH_PATTERN = re.compile(
    r"(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)",
    re.IGNORECASE
)

# ============================================================
# Record Integrity Verifier
# Purpose: Ensure decoded surveys have same row counts as raw
# ============================================================
def verify_decoded_record_integrity(base_path,
                                    raw_folder="NEW Header Encoded Surveys",
                                    decoded_folder="NEW Fully Decoded Surveys"):
    """Compare row counts of raw and decoded CSVs; return one result row per month file."""
    results = []
    raw_root = os.path.join(base_path, raw_folder)
    decoded_root = os.path.join(base_path, decoded_folder)

    if not os.path.exists(raw_root) or not os.path.exists(decoded_root):
        raise FileNotFoundError("Missing input or decoded folder.")

    year_folders = [y for y in os.listdir(raw_root) if y.isdigit()]
    for year in sorted(year_folders):
        year_raw = os.path.join(raw_root, year)
        year_dec = os.path.join(decoded_root, year)

        for filename in os.listdir(year_raw):
            # Skip anything that is not a CSV with a month name in its file name.
            if not filename.lower().endswith(".csv"):
                continue
            match = MONTH_PATTERN.search(filename)
            if not match:
                continue
            month = match.group(1).capitalize()

            raw_path = os.path.join(year_raw, filename)
            dec_path = os.path.join(year_dec, filename)

            try:
                raw_count = len(pd.read_csv(raw_path, low_memory=False))
            except Exception as e:
                results.append({"Year":year,"Month":month,
                                "Raw Total Records":f"ERROR {e}",
                                "Decoded Total Records":"N/A",
                                "Integrity Status":"FAIL"})
                continue

            if not os.path.exists(dec_path):
                results.append({"Year":year,"Month":month,
                                "Raw Total Records":raw_count,
                                "Decoded Total Records":"Missing",
                                "Integrity Status":"FAIL"})
                continue

            try:
                dec_count = len(pd.read_csv(dec_path, low_memory=False))
            except Exception as e:
                results.append({"Year":year,"Month":month,
                                "Raw Total Records":raw_count,
                                "Decoded Total Records":f"ERROR {e}",
                                "Integrity Status":"FAIL"})
                continue

            status = "PASS" if raw_count == dec_count else "FAIL"
            results.append({"Year":year,"Month":month,
                            "Raw Total Records":raw_count,
                            "Decoded Total Records":dec_count,
                            "Integrity Status":status})

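    # Note: Month is sorted as text here, so the order is alphabetical
    # (April before January), not chronological.
    df_summary = pd.DataFrame(results).sort_values(["Year","Month"]).reset_index(drop=True)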
    print("\n===== RECORD DECODING INTEGRITY CHECK COMPLETE =====")
    fails = (df_summary["Integrity Status"]!="PASS").sum()
    print("SUCCESS: All decoded surveys match raw counts." if fails==0 else f"WARNING: {fails} months failed integrity checks.")
    return df_summary

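Equal row counts do not rule out a dropped record being offset by an extra one. Where a survey has a column that uniquely identifies each record, an outer merge with `indicator=True` can list which records are missing or extra. The sketch below assumes such a key exists; the function name `find_record_mismatches` and the `record_id` column in the toy data are hypothetical.

```python
import pandas as pd


def find_record_mismatches(df_raw, df_decoded, key):
    """Hypothetical helper: list key values present in only one of the two frames."""
    raw_keys = df_raw[[key]].drop_duplicates()
    dec_keys = df_decoded[[key]].drop_duplicates()

    # indicator=True adds a "_merge" column: "left_only", "right_only" or "both".
    merged = raw_keys.merge(dec_keys, on=key, how="outer", indicator=True)
    mismatches = merged[merged["_merge"] != "both"].copy()

    # Translate pandas' labels into the terms used in this notebook.
    mismatches["Issue"] = mismatches["_merge"].astype(str).map(
        {"left_only": "Missing from decoded", "right_only": "Extra in decoded"}
    )
    return mismatches[[key, "Issue"]].reset_index(drop=True)


# Toy data: record 3 was dropped and record 9 appeared, yet both frames have 3 rows.
raw = pd.DataFrame({"record_id": [1, 2, 3]})
decoded = pd.DataFrame({"record_id": [1, 2, 9]})
print(find_record_mismatches(raw, decoded, "record_id"))
```

Duplicate keys are collapsed before the merge, so this sketch reports only keys that are absent from one side, not repeated rows.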

### Coverage Scanner

In [3]:
from IPython.display import display, HTML

# os, pd, re and MONTH_PATTERN are already defined in the cells above.

# ============================================================
# Coverage Scanner
# Purpose: Verify variables decoded correctly vs metadata
# ============================================================
def check_value_decoding_integrity_smart(base_path,
                                         decoded_folder="NEW Fully Decoded Surveys",
                                         meta_folder="NEW Metadata Sheet 2 CSV's"):
    """Classify every column of each decoded survey against its Sheet 2 metadata."""
    input_folder = os.path.join(base_path, decoded_folder)
    meta_root = os.path.join(base_path, meta_folder)
    all_results = []

    for year in sorted(os.listdir(input_folder)):
        year_path = os.path.join(input_folder, year)
        if not os.path.isdir(year_path):
            continue

        for file in sorted(os.listdir(year_path)):
            if not file.lower().endswith(".csv"):
                continue
            match = MONTH_PATTERN.search(file)
            if not match:
                continue
            month = match.group(1).capitalize()

            df_survey = pd.read_csv(os.path.join(year_path, file), low_memory=False)
            meta_path = os.path.join(meta_root, year, f"Sheet2_{month}_{year}.csv")
            if not os.path.exists(meta_path):
                print(f"[SKIP] Metadata missing for {month} {year}")
                continue

            df_meta = pd.read_csv(meta_path, dtype=str)
            df_meta['Description_Clean'] = df_meta['Description'].fillna('').astype(str).str.strip()
            meta_descriptions = set(df_meta["Description_Clean"].unique())

            decoded_count, unchanged_count, failed_count = 0, 0, 0
            sheet_results = []

            for col in df_survey.columns:
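                col_values = df_survey[col].dropna().astype(str)
                # A column counts as numeric only if every non-null value is a
                # string of digits (a trailing ".0" is tolerated). Signed or
                # decimal values such as "-1" or "2.5" are treated as text.
                is_numeric = (
                    col_values.str.replace(r'\.0$', '', regex=True).str.isnumeric().all()
                    if not col_values.empty else False
                )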
                exists_in_meta = col in meta_descriptions

                if not exists_in_meta:
                    status = "OK (No Metadata)"; unchanged_count += 1
                elif not is_numeric:
                    status = "OK (Decoded)"; decoded_count += 1
                else:
                    subset = df_meta[df_meta['Description_Clean'] == col]
                    labels = subset['Label'].astype(str).replace(['0','0.0','nan','None'],'')
                    real_labels = labels[labels != '']
                    if real_labels.empty:
                        status = "OK (Quantitative - No Labels)"; unchanged_count += 1
                    elif real_labels.str.isnumeric().all():
                        status = "OK (Quantitative - Numeric Labels)"; unchanged_count += 1
                    else:
                        status = "FAILED (Should be Text)"; failed_count += 1

                sheet_results.append({
                    "Column": col,
                    "In_Metadata": "Yes" if exists_in_meta else "No",
                    "Data_Type": "Numeric" if is_numeric else "Text",
                    "Status": status
                })

            print("\n" + "="*70)
            print(f"VERIFICATION: {month.upper()} {year}")
            print("="*70)
            print(f"Total Columns: {len(df_survey.columns)}")
            print(f"Successful Decodes: {decoded_count}")
            # This tally also includes columns with no metadata, not only quantitative ones.
            print(f"Correctly Numeric: {unchanged_count}")
            print(f"Failures: {failed_count}")

            failures = [res for res in sheet_results if "FAILED" in res['Status']]
            if failures:
                print(f"\nWARNING: {len(failures)} columns failed to decode:")
                display(HTML(pd.DataFrame(failures).to_html(index=False)))
            else:
                print("\nPASSED: All columns accounted for.")

            all_results.extend(sheet_results)

    return pd.DataFrame(all_results)

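The `is_numeric` test in the scanner relies on `str.isnumeric()`, which is false for strings containing a minus sign or a decimal point, so a column holding values such as `-1` or `2.5` would be classed as text. Whether that matters depends on the survey's value ranges, which this notebook does not state. As an alternative sketch, `pd.to_numeric` with `errors="coerce"` accepts signed and decimal values; the helper name `is_numeric_column` is hypothetical.

```python
import pandas as pd


def is_numeric_column(series):
    """Hypothetical helper: True if every non-null value parses as a number."""
    values = series.dropna().astype(str).str.strip()
    if values.empty:
        return False  # mirror the scanner: an all-null column is not numeric
    # errors="coerce" turns unparseable strings into NaN instead of raising.
    return bool(pd.to_numeric(values, errors="coerce").notna().all())


print(is_numeric_column(pd.Series(["1", "2.0", "-3", "4.5"])))  # signed and decimal values
print(is_numeric_column(pd.Series(["1", "Urban", "3"])))        # contains a text label
```

Swapping this into the scanner would change which columns reach the label comparison, so the per-month counts could differ from the output recorded in this notebook.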

In [4]:
# ============================================================
# EXECUTION
# ============================================================

# Run record integrity check
record_integrity_df = verify_decoded_record_integrity(base_path,
                                                     raw_folder="NEW Header Encoded Surveys",
                                                     decoded_folder="NEW Fully Decoded Surveys")
display(record_integrity_df)

# Run coverage scanner
df_integrity_check = check_value_decoding_integrity_smart(base_path,
                                                          decoded_folder="NEW Fully Decoded Surveys",
                                                          meta_folder="Metadata Sheet 2 CSV's")
display(df_integrity_check.head())

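Sorting on the `Month` column as plain text orders months alphabetically (April before January), as the summary table in the output shows. A sketch of chronological ordering with an ordered categorical follows. It assumes `MONTH_ORDER` from `config.json` is a list of capitalized month names in calendar order, which the notebook does not confirm, so the list is spelled out here. The helper name `sort_chronologically` is hypothetical.

```python
import pandas as pd

# Assumed to match cfg["MONTH_ORDER"]; spelled out so the example is self-contained.
MONTH_ORDER = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]


def sort_chronologically(df, month_order=MONTH_ORDER):
    """Hypothetical helper: sort a summary frame by Year, then calendar month."""
    out = df.copy()
    # An ordered categorical sorts by position in month_order, not alphabetically.
    out["Month"] = pd.Categorical(out["Month"], categories=month_order, ordered=True)
    out = out.sort_values(["Year", "Month"]).reset_index(drop=True)
    out["Month"] = out["Month"].astype(str)  # back to plain strings for display
    return out


summary = pd.DataFrame({
    "Year": ["2018", "2018", "2018", "2018"],
    "Month": ["April", "January", "July", "October"],
})
print(sort_chronologically(summary)["Month"].tolist())
# ['January', 'April', 'July', 'October']
```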

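`df_integrity_check` holds one row per column per survey, but as returned it does not record which month or year each row came from. For a quick overall tally, the statuses can be counted directly. The sketch below uses a small stand-in frame with the same four columns; the helper name `summarize_statuses` is hypothetical.

```python
import pandas as pd


def summarize_statuses(df_check):
    """Hypothetical helper: count columns per Status and list the failed ones."""
    counts = (
        df_check["Status"]
        .value_counts()
        .rename_axis("Status")
        .reset_index(name="Columns")
    )
    failed = df_check[df_check["Status"].str.startswith("FAILED")]
    return counts, failed


# Stand-in data shaped like the scanner's output.
demo = pd.DataFrame({
    "Column": ["Region", "province", "Survey Month"],
    "In_Metadata": ["Yes", "No", "Yes"],
    "Data_Type": ["Text", "Numeric", "Numeric"],
    "Status": ["OK (Decoded)", "OK (No Metadata)", "FAILED (Should be Text)"],
})
counts, failed = summarize_statuses(demo)
print(counts)
print(failed["Column"].tolist())
```

To trace a failure back to its survey, the scanner would need to add the year and month to each result row, which it currently does not do.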

===== RECORD DECODING INTEGRITY CHECK COMPLETE =====
SUCCESS: All decoded surveys match raw counts.


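|   | Year | Month | Raw Total Records | Decoded Total Records | Integrity Status |
|---|------|-------|-------------------|-----------------------|------------------|
| 0 | 2018 | April | 179815 | 179815 | PASS |
| 1 | 2018 | January | 180262 | 180262 | PASS |
| 2 | 2018 | July | 182956 | 182956 | PASS |
| 3 | 2018 | October | 179204 | 179204 | PASS |
| 4 | 2019 | April | 172284 | 172284 | PASS |
| 5 | 2019 | January | 181233 | 181233 | PASS |
| 6 | 2019 | July | 175438 | 175438 | PASS |
| 7 | 2019 | October | 178067 | 178067 | PASS |
| 8 | 2022 | April | 184237 | 184237 | PASS |
| 9 | 2022 | August | 45054 | 45054 | PASS |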



VERIFICATION: APRIL 2018
Total Columns: 50
Successful Decodes: 40
Correctly Numeric: 10
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JANUARY 2018
Total Columns: 50
Successful Decodes: 41
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JULY 2018
Total Columns: 51
Successful Decodes: 40
Correctly Numeric: 11
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: OCTOBER 2018
Total Columns: 51
Successful Decodes: 41
Correctly Numeric: 10
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: APRIL 2019
Total Columns: 49
Successful Decodes: 41
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JANUARY 2019
Total Columns: 49
Successful Decodes: 41
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JULY 2019
Total Columns: 49
Successful Decodes: 41
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: OCTOBER 2019
Total Columns

| Column | In_Metadata | Data_Type | Status |
|--------|-------------|-----------|--------|
| Survey Month | Yes | Numeric | FAILED (Should be Text) |



VERIFICATION: JANUARY 2022
Total Columns: 52
Successful Decodes: 44
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JULY 2022
Total Columns: 52
Successful Decodes: 45
Correctly Numeric: 7
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: JUNE 2022
Total Columns: 42
Successful Decodes: 33
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: MARCH 2022
Total Columns: 41
Successful Decodes: 31
Correctly Numeric: 9
Failures: 1



| Column | In_Metadata | Data_Type | Status |
|--------|-------------|-----------|--------|
| Survey Month | Yes | Numeric | FAILED (Should be Text) |



VERIFICATION: MAY 2022
Total Columns: 42
Successful Decodes: 33
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: NOVEMBER 2022
Total Columns: 42
Successful Decodes: 34
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: OCTOBER 2022
Total Columns: 52
Successful Decodes: 44
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: SEPTEMBER 2022
Total Columns: 42
Successful Decodes: 33
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: APRIL 2023
Total Columns: 52
Successful Decodes: 44
Correctly Numeric: 8
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: AUGUST 2023
Total Columns: 41
Successful Decodes: 32
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: DECEMBER 2023
Total Columns: 41
Successful Decodes: 32
Correctly Numeric: 9
Failures: 0

PASSED: All columns accounted for.

VERIFICATION: FEBRUARY 2023
Total Co

|   | Column | In_Metadata | Data_Type | Status |
|---|--------|-------------|-----------|--------|
| 0 | Region | Yes | Text | OK (Decoded) |
| 1 | province | No | Numeric | OK (No Metadata) |
| 2 | province_recode | No | Numeric | OK (No Metadata) |
| 3 | household_seq_number | No | Numeric | OK (No Metadata) |
| 4 | 2010Urban-RuralFIES | Yes | Text | OK (Decoded) |
