# 02 - Data Validation & Cleaning

**Goal:** Validate schema, types, missingness, and logical consistency; then produce a clean dataset for downstream preprocessing.

**Inputs:** `resources/data/raw/telco_customer_churn.csv`  
**Outputs:** 
- `resources/data/processed/telco_clean.csv`
- `Level_3/reports/validation_summary.csv` (issues & fixes applied)
- `Level_3/reports/missingness_summary.csv`

---


In [1]:
# ==========================================================
# 🏗️ Setup: env, paths, and load raw dataset
# ==========================================================
from pathlib import Path
import pandas as pd
import numpy as np

# 1) Reuse 00_Setup if available; else minimal inline setup
try:
    %run ./00_Setup.ipynb
    print("✅ Environment loaded from 00_Setup.ipynb")
except Exception as e:
    print(f"⚠️ Fallback setup (00_Setup failed: {e})")
    current_path = Path.cwd().resolve()
    for parent in [current_path] + list(current_path.parents):
        if parent.name == "Telco":
            PROJECT_ROOT = parent
            break
    else:
        raise FileNotFoundError("❌ 'Telco' repo root not found.")
    DATA_ROOT = PROJECT_ROOT / "resources" / "data"
    DATA_RAW_DIR = DATA_ROOT / "raw"
    DATA_PROCESSED_DIR = DATA_ROOT / "processed"
    RAW_PATH = DATA_RAW_DIR / "telco_customer_churn.csv"
    CLEAN_PATH = DATA_PROCESSED_DIR / "telco_clean.csv"

# 2) Level_3 outputs
LEVEL_DIR = PROJECT_ROOT / "Level_3"
REPORTS = LEVEL_DIR / "reports"
REPORTS.mkdir(parents=True, exist_ok=True)
DATA_PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3) Load raw (no mutations here)
df = pd.read_csv(RAW_PATH)
print(f"📥 Raw loaded: {df.shape[0]:,} rows × {df.shape[1]:,} cols")


✅ Found raw dataset: /Users/b/DATA/PROJECTS/Telco/resources/data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv
✅ Environment loaded from 00_Setup.ipynb
📥 Raw loaded: 7,043 rows × 21 cols
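The fallback branch in the setup cell walks up from the working directory until it reaches a folder named `Telco`. A minimal sketch of that search as a reusable function follows; `find_project_root` is a hypothetical helper name, not something defined in `00_Setup.ipynb`.

```python
from pathlib import Path


def find_project_root(start: Path, root_name: str = "Telco") -> Path:
    """Return the nearest ancestor of `start` (or `start` itself) named `root_name`."""
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root
    for candidate in [start, *start.parents]:
        if candidate.name == root_name:
            return candidate
    raise FileNotFoundError(f"'{root_name}' repo root not found at or above {start}")
```

Calling `find_project_root(Path.cwd())` from any notebook inside the repo would then give the same `PROJECT_ROOT` as the inline loop.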



## 1) Structural Validation (Schema & Dtypes)
- Validate required columns
- Flag extras
- Normalize dtypes (coerce where appropriate)


In [2]:
# ==========================================================
# 🔎 Schema & Dtypes
# ==========================================================
validation_log = []

# Expected baseline columns (adjust per dataset if needed)
expected_columns = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents",
    "tenure", "PhoneService", "MultipleLines", "InternetService",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
    "StreamingTV", "StreamingMovies", "Contract", "PaperlessBilling",
    "PaymentMethod", "MonthlyCharges", "TotalCharges", "Churn"
]

missing = [c for c in expected_columns if c not in df.columns]
extra   = [c for c in df.columns if c not in expected_columns]

if missing:
    validation_log.append({"issue": "missing_columns", "detail": ",".join(missing), "action": "review upstream or adjust expectations"})
if extra:
    validation_log.append({"issue": "extra_columns", "detail": ",".join(extra), "action": "retain for now (non-blocking)"})


# Coerce common problematic dtypes safely (examples)
# Leave cleaning/imputation decisions for later steps
coerce_numeric_cols = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]
for c in coerce_numeric_cols:
    if c in df.columns:
        before_na = df[c].isna().sum()
        df[c] = pd.to_numeric(df[c], errors="coerce")
        after_na = df[c].isna().sum()
        if after_na > before_na:
            validation_log.append({"issue": "dtype_coercion_to_numeric",
                                   "detail": f"{c}: {before_na}→{after_na} NaN after coercion",
                                   "action": "handle missing in step 2"})

print("✅ Schema check complete.")


✅ Schema check complete.
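The coercion loop relies on `pd.to_numeric(..., errors="coerce")` turning unparseable strings into NaN, and it logs only the before and after counts. The sketch below uses made-up values (not rows from the Telco file) and additionally recovers which raw strings were lost, which can help when deciding how to handle them later.

```python
import pandas as pd

# Made-up values for illustration only
raw = pd.Series(["29.85", " ", "108.15", "n/a"])

before_na = int(raw.isna().sum())
coerced = pd.to_numeric(raw, errors="coerce")  # unparseable strings become NaN
after_na = int(coerced.isna().sum())

# Raw values that were present before coercion but are NaN afterwards
lost_values = raw[coerced.isna() & raw.notna()].tolist()

print(f"{before_na} -> {after_na} NaN after coercion; lost values: {lost_values}")
```

Keeping the lost strings makes it easier to tell blank fields apart from genuinely malformed numbers.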


---

## 2) Missingness & Empty Strings
- Quantify missing/blank/space-only
- Decide imputation vs. leave as NaN


In [None]:

# ==========================================================
# 🧭 Missingness & blanks
# ==========================================================
# Empty-string & space-only counts
empty_counts = {}
for c in df.columns:
    if df[c].dtype == "object":
        empty_counts[c] = {
            "empty_string": (df[c] == "").sum(),
            "single_space": (df[c] == " ").sum(),
            "strip_to_empty": df[c].astype(str).str.strip().eq("").sum()
        }
empty_df = pd.DataFrame(empty_counts).T.sort_index()
empty_df.to_csv(REPORTS / "missingness_summary.csv")

validation_log.append({"issue": "blank_values", "detail": "see missingness_summary.csv", "action": "normalize below"})


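# Normalize object columns: strip whitespace; set "" → NaN
obj_cols = df.select_dtypes(include=["object"]).columns
for c in obj_cols:
    # .str.strip() keeps existing NaN as NaN; astype(str) would turn it into the string "nan"
    stripped = df[c].str.strip()
    # assign back instead of using inplace replace on a column selection
    df[c] = stripped.mask(stripped.eq(""))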

# Example policy (adjust to your needs):
# - Leave NaN for modeling pipeline to handle OR
# - Minimal imputation for known logic (commented examples below)

# Telco-specific example (optional): tenure==0 often implies TotalCharges≈0
# if "tenure" in df.columns and "TotalCharges" in df.columns:
#     mask = df["tenure"].fillna(-1).eq(0) & df["TotalCharges"].isna()
#     df.loc[mask, "TotalCharges"] = 0
#     if mask.sum():
#         validation_log.append({"issue": "impute_TotalCharges",
#                                "detail": f"Set TotalCharges=0 for {mask.sum()} rows where tenure=0",
#                                "action": "documented explicit rule"})

print("✅ Missingness normalization complete.")
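
Blank handling is easy to get subtly wrong, so here is a small sketch on made-up values. It computes the same three counts that go into `missingness_summary.csv`, normalizes blanks to NaN, and shows one pitfall worth knowing: `astype(str)` converts a real NaN into the literal string `"nan"`.

```python
import numpy as np
import pandas as pd

# Made-up values: padded text, a single space, an empty string, and a real NaN
s = pd.Series(["Yes", "  No ", " ", "", np.nan], dtype="object")

counts = {
    "empty_string": int((s == "").sum()),
    "single_space": int((s == " ").sum()),
    "strip_to_empty": int(s.str.strip().eq("").sum()),
}

# Normalize: strip whitespace, then turn empty strings into NaN
stripped = s.str.strip()
cleaned = stripped.mask(stripped.eq(""))

# Pitfall: after astype(str) the missing value is no longer detected by isna()
as_text = s.astype(str)
```

Here `cleaned` holds three missing values (the space, the empty string, and the original NaN), while `as_text.isna()` reports none.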


## 3) Domain & Logical Consistency
- Range checks (non-negative, reasonable upper bounds)
- Category membership checks
- Cross-field logic (e.g., internet service vs. downstream features)


In [None]:
# ==========================================================
# ⚖️ Domain & Logical Rules
# ==========================================================
def add_log(issue, detail, action):
    validation_log.append({"issue": issue, "detail": detail, "action": action})

# --- Range checks (examples) ---
if "tenure" in df.columns:
    bad = df.query("tenure < 0").shape[0]
    if bad:
        add_log("invalid_tenure", f"{bad} rows tenure<0", "clip to 0")
        df["tenure"] = df["tenure"].clip(lower=0)

for col in ["MonthlyCharges", "TotalCharges"]:
    if col in df.columns:
        neg = df[df[col] < 0].shape[0]
        if neg:
            add_log("negative_values", f"{col}: {neg} rows", "set negatives to NaN")
            df.loc[df[col] < 0, col] = np.nan

# --- Category membership (examples) ---
def enforce_membership(col, allowed):
    if col in df.columns:
        vals = set(df[col].dropna().unique())
        invalid = vals - set(allowed)
        if invalid:
            add_log("invalid_category", f"{col}: {invalid}", "strip whitespace; then set invalid → NaN")
            # strip whitespace only (no case change); NaN becomes the string "nan" here
            # and is reset to NaN by the membership filter on the next statement
            df[col] = df[col].astype(str).str.strip()
            # strategy: set truly invalid values to NaN
            df.loc[~df[col].isin(allowed), col] = np.nan

enforce_membership("Contract", {"Month-to-month", "One year", "Two year"})
enforce_membership("InternetService", {"DSL", "Fiber optic", "No"})
enforce_membership("PaperlessBilling", {"Yes", "No"})
enforce_membership("Churn", {"Yes", "No"})

# --- Cross-field logic (examples) ---
# If no internet service, dependent features should reflect that
if "InternetService" in df.columns:
    internet_no_mask = df["InternetService"].eq("No")
else:
    internet_no_mask = pd.Series(False, index=df.index)
internet_dependent = ["OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies"]
allowed_dependent = {"No internet service", "Yes", "No"}
for c in internet_dependent:
    if c in df.columns:
        # count only non-null unexpected values; rows that are already NaN are not mismatches
        unexpected = df[c].notna() & ~df[c].isin(allowed_dependent)
        if unexpected.any():
            add_log("unexpected_values", f"{c}: {int(unexpected.sum())} rows", "coerce unexpected → NaN")
            df.loc[unexpected, c] = np.nan
        # enforce logical mapping
        fix_mask = internet_no_mask & df[c].ne("No internet service")
        if fix_mask.any():
            df.loc[fix_mask, c] = "No internet service"
            add_log("logical_fix", f"{c}: set 'No internet service' for {int(fix_mask.sum())} rows", "rule-based correction")

print("✅ Domain & logical rules applied.")
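
To make the cross-field rule concrete, the sketch below applies it to a toy frame. The rows are invented; only the column names and the allowed labels come from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows
toy = pd.DataFrame({
    "InternetService": ["DSL", "No", "No", "Fiber optic"],
    "OnlineSecurity": ["Yes", "No internet service", "No", "maybe"],
})
allowed = {"Yes", "No", "No internet service"}

# 1) Non-null values outside the allowed set become missing
unexpected = toy["OnlineSecurity"].notna() & ~toy["OnlineSecurity"].isin(allowed)
toy.loc[unexpected, "OnlineSecurity"] = np.nan

# 2) Rows without internet must carry the "No internet service" label
no_internet = toy["InternetService"].eq("No")
fix_mask = no_internet & toy["OnlineSecurity"].ne("No internet service")
toy.loc[fix_mask, "OnlineSecurity"] = "No internet service"

print(toy)
```

Row 2 (`"No"` on a line with no internet) is relabeled, and row 3 (`"maybe"`) is set to NaN because the customer does have internet and the value is not a known label.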


## 4) Duplicates & Outliers (flag; minimal mutation)
- Remove exact-duplicate rows
- Flag extreme numeric outliers (don't drop by default in cleaning)


In [None]:

# ==========================================================
# 🧾 Duplicates & Outlier Flags
# ==========================================================
# Exact duplicates
dups = df.duplicated().sum()
if dups:
    validation_log.append({"issue": "duplicate_rows", "detail": f"{dups} duplicates", "action": "drop duplicates"})
    df = df.drop_duplicates()

# Simple outlier flag (IQR): do not remove here, just log counts
def iqr_outlier_count(series):
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
    return int(((series < lower) | (series > upper)).sum())

for col in ["MonthlyCharges", "TotalCharges", "tenure"]:
    if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
        count = iqr_outlier_count(df[col].dropna())
        if count:
            validation_log.append({"issue": "outliers_flagged", "detail": f"{col}: {count} flagged (IQR)", "action": "review in 03_Preprocessing"})
print("✅ Duplicates handled; outliers flagged.")
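
A worked example of the IQR rule on made-up numbers may help when reading the flagged counts. `iqr_bounds` is a hypothetical variant of the helper above that returns the fences instead of a count, so the flagged rows can be inspected.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values: eight evenly spaced points and one extreme value
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 200], dtype="float64")
lower, upper = iqr_bounds(values)
flagged = values[(values < lower) | (values > upper)]

print(lower, upper, flagged.tolist())
```

With pandas' default linear interpolation, Q1 is 14 and Q3 is 22, so the IQR is 8 and the fences are 14 - 12 = 2 and 22 + 12 = 34; only 200 falls outside them.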



## 5) Final sanity check & save artifacts
- Save validation logs
- Save cleaned dataset (CSV)


In [None]:
# ==========================================================
# 💾 Save reports & clean dataset
# ==========================================================
val_df = pd.DataFrame(validation_log) if validation_log else pd.DataFrame(columns=["issue","detail","action"])
val_path = REPORTS / "validation_summary.csv"
val_df.to_csv(val_path, index=False)

clean_path = DATA_PROCESSED_DIR / "telco_clean.csv"
df.to_csv(clean_path, index=False)

print(f"📝 Validation summary: {val_path}")
print(f"✅ Clean dataset saved: {clean_path}")
df.info()  # info() prints directly and returns None, so print(df.info()) adds a stray "None"


---

## (Optional) Upgrade path
- Swap ad-hoc checks for **Pandera** or **Great Expectations** in a future pass.
- Add unit tests to assert schema and critical invariants.
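
As a starting point for the unit-test idea, here is a minimal sketch in plain pandas, without Pandera or Great Expectations. `check_clean_invariants` and `EXPECTED_COLUMNS` are hypothetical names, and the column set is a subset of the real schema kept short for brevity. The invariants mirror rules already applied in this notebook: non-negative tenure and charges, and `Churn` limited to `Yes`/`No`.

```python
import pandas as pd

# Subset of the expected schema, for brevity (assumption: extend to all 21 columns)
EXPECTED_COLUMNS = {"customerID", "tenure", "MonthlyCharges", "Churn"}


def check_clean_invariants(df: pd.DataFrame) -> list[str]:
    """Return descriptions of violated invariants; an empty list means all passed."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        # Without the columns, the remaining checks cannot run
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    if (df["tenure"].dropna() < 0).any():
        problems.append("negative tenure")
    if (df["MonthlyCharges"].dropna() < 0).any():
        problems.append("negative MonthlyCharges")
    if not df["Churn"].dropna().isin({"Yes", "No"}).all():
        problems.append("Churn outside {'Yes', 'No'}")
    return problems
```

A test could then load `telco_clean.csv` and assert that the returned list is empty, which gives a readable failure message when an invariant breaks.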


