# 📚 Modularization
># 00_Setup


Short answer: **A is the best starting point** for a disciplined Telco project: it's explicit about project roots and scales well.
**B** is too fragile (relative to wherever you run it).
**C** is great for *bootstrapping folders + copying data*, but its `BASE_DIR = Path.cwd().resolve().parent` is brittle and the copy step may hide mistakes.

### Why A wins

* Uses a **stable project root** (via `__file__`), not just “where I launched the notebook”.
* Keeps paths centralized (`BASE_DIR / ...`): clean, testable, refactor-friendly.
* Easy to reuse across modules and scripts.

### Nice upgrade: A + a bit of C (create dirs, validate file)

Use A's root handling, then add safe directory creation and a clear check for the CSV:

```python
from pathlib import Path

def get_base_dir() -> Path:
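    # In a script or module, __file__ exists: the project root is taken to be the
    # parent of the folder that holds this file (e.g. project_root/scripts/paths.py).
    try:
        return Path(__file__).resolve().parent.parent
    except NameError:
        # In a notebook, __file__ is undefined: fall back to the parent of the working
        # directory, which assumes the notebook lives in project_root/notebooks/.
        return Path.cwd().resolve().parent  # use Path.cwd().resolve() if the notebook sits at project_root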

BASE_DIR = get_base_dir()

DATA_DIR = BASE_DIR / "data"
DATA_RAW = DATA_DIR / "raw"
DATA_PROCESSED = DATA_DIR / "processed"
OUTPUTS = BASE_DIR / "outputs"
FIGURES = OUTPUTS / "figures"
REPORTS = OUTPUTS / "reports"
MODELS = BASE_DIR / "models"

# Ensure expected folders exist
for p in [DATA_RAW, DATA_PROCESSED, FIGURES, REPORTS, MODELS]:
    p.mkdir(parents=True, exist_ok=True)

TELCO_CSV = DATA_RAW / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

if TELCO_CSV.exists():
    print(f"✅ Dataset found: {TELCO_CSV}")
else:
    print(f"⚠️ Missing dataset. Place it here:\n   {TELCO_CSV}")
```

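**Usage tip:** keep this in a shared module such as `src/telco_utils/paths.py` and import it everywhere. At that depth the project root is `Path(__file__).resolve().parents[2]` rather than `.parent.parent`, so adjust `get_base_dir()` to match.
If you also want an optional one-time “copy from downloads” step, do it in a **separate** bootstrap script to avoid silently masking missing-data bugs.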
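
A minimal sketch of such a bootstrap step, kept separate from the path module. The helper name `bootstrap_dataset` and the source location are hypothetical; the point is only that the copy happens once and that a missing source raises instead of passing silently:

```python
import shutil
from pathlib import Path


def bootstrap_dataset(source: Path, target: Path) -> bool:
    """Copy `source` to `target` once. Returns True if a copy happened.

    Hypothetical helper: it raises when the source file is missing, so a
    missing dataset stays visible instead of being papered over.
    """
    if target.exists():
        return False  # never overwrite a dataset that is already in place
    if not source.is_file():
        raise FileNotFoundError(f"Source dataset not found: {source}")
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, target)
    return True
```

It could be called once from a script, for example `bootstrap_dataset(Path.home() / "Downloads" / "WA_Fn-UseC_-Telco-Customer-Churn.csv", TELCO_CSV)`, assuming the file was downloaded to that folder.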


`telco_paths.py` (you can store this under `/Level_3/scripts/`):

```python
from pathlib import Path

def get_telco_paths():
    """Locate the Telco project root and return the key project paths."""
    current_path = Path.cwd().resolve()
    for parent in [current_path] + list(current_path.parents):
        if parent.name == "Telco":
            PROJECT_ROOT = parent
            break
    else:
        raise FileNotFoundError(f"❌ Could not find 'Telco' above {current_path}")

    DATA_ROOT = PROJECT_ROOT / "resources" / "data"
    DATA_RAW_DIR = DATA_ROOT / "raw"
    DATA_PROCESSED_DIR = DATA_ROOT / "processed"

    LEVEL_DIR = PROJECT_ROOT / "Level_3"
    FIGURES = LEVEL_DIR / "figures"
    REPORTS = LEVEL_DIR / "reports"
    MODELS = LEVEL_DIR / "models"
    OUTPUTS = LEVEL_DIR / "outputs"

    for d in [DATA_RAW_DIR, DATA_PROCESSED_DIR, FIGURES, REPORTS, MODELS, OUTPUTS]:
        d.mkdir(parents=True, exist_ok=True)

    DATA_RAW_FILE = DATA_RAW_DIR / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

    return {
        "PROJECT_ROOT": PROJECT_ROOT,
        "DATA_RAW_FILE": DATA_RAW_FILE,
        "DATA_PROCESSED_DIR": DATA_PROCESSED_DIR,
        "FIGURES": FIGURES,
        "REPORTS": REPORTS,
        "MODELS": MODELS,
        "OUTPUTS": OUTPUTS
    }
```

Then in `01_EDA.ipynb`:

```python
# Requires Level_3/ on sys.path (e.g. the notebook runs from Level_3/)
from scripts.telco_paths import get_telco_paths
import pandas as pd

paths = get_telco_paths()
df = pd.read_csv(paths["DATA_RAW_FILE"])

print(f"✅ Dataset loaded: {len(df):,} rows × {len(df.columns)} columns")
display(df.head())
```

---

Perfect, here's exactly what you asked for:

1. A **functionless version** of your `telco_paths.py` (for modular use)
2. A **ready-to-run cell** for your `01_EDA.ipynb` to safely and automatically load your Telco dataset

Both follow your Level-3 structure conventions and are fully self-contained 👇

---

## 🧩 **telco_paths.py** (Functionless, stored in `/Level_3/scripts/`)

```python
# ==========================================================
# 📂 Telco Project Path Setup (functionless version)
# ==========================================================
from pathlib import Path

# --- 1. Auto-detect /Telco project root ---
current_path = Path.cwd().resolve()

for parent in [current_path] + list(current_path.parents):
    if parent.name == "Telco":
        PROJECT_ROOT = parent
        break
else:
    raise FileNotFoundError(f"❌ Could not find 'Telco' above: {current_path}")

# --- 2. Define shared data directories ---
DATA_ROOT = PROJECT_ROOT / "resources" / "data"
DATA_RAW_DIR = DATA_ROOT / "raw"
DATA_PROCESSED_DIR = DATA_ROOT / "processed"
DATA_RAW_FILE = DATA_RAW_DIR / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

# --- 3. Define Level 3 output directories ---
LEVEL_DIR = PROJECT_ROOT / "Level_3"
FIGURES = LEVEL_DIR / "figures"
REPORTS = LEVEL_DIR / "reports"
MODELS = LEVEL_DIR / "models"
OUTPUTS = LEVEL_DIR / "outputs"

# --- 4. Create missing directories (safe) ---
for d in [DATA_RAW_DIR, DATA_PROCESSED_DIR, FIGURES, REPORTS, MODELS, OUTPUTS]:
    d.mkdir(parents=True, exist_ok=True)

# --- 5. Optional dataset fallback ---
if not DATA_RAW_FILE.exists():
    # sorted() makes the fallback pick deterministic when several files match
    alt_files = sorted(DATA_RAW_DIR.glob("*Telco*Churn*.csv"))
    if alt_files:
        DATA_RAW_FILE = alt_files[0]
        print(f"⚠️ Default file not found, using detected file:\n   {DATA_RAW_FILE}")
    else:
        raise FileNotFoundError(
            f"❌ Could not find any Telco dataset in:\n   {DATA_RAW_DIR}"
        )

# --- 6. Print confirmation ---
print("✅ Telco directory structure ready.")
print(f"Project root:       {PROJECT_ROOT}")
print(f"Raw dataset:        {DATA_RAW_FILE}")
print(f"Processed data dir: {DATA_PROCESSED_DIR}")
print(f"Figures dir:        {FIGURES}")
print(f"Reports dir:        {REPORTS}")
print(f"Models dir:         {MODELS}")
print(f"Outputs dir:        {OUTPUTS}")
```

This script is not imported as a module; run it from any notebook with the `%run` magic (the path is relative to the notebook's working directory):

```python
%run ../scripts/telco_paths.py
```

It will automatically set the variables
`PROJECT_ROOT`, `DATA_RAW_FILE`, `FIGURES`, etc. in your notebook's global scope.

---

## 🧠 **Now for 01_EDA.ipynb: Functionless Data Load Script**

Place this cell near the top of your EDA notebook (right after your environment imports):

```python
# ==========================================================
# 🧭 01_EDA Notebook: Dataset Load & Verification
# ==========================================================
from pathlib import Path
import pandas as pd

# --- Auto-detect Telco root ---
current_path = Path.cwd().resolve()
for parent in [current_path] + list(current_path.parents):
    if parent.name == "Telco":
        PROJECT_ROOT = parent
        break
else:
    raise FileNotFoundError(f"❌ Could not find 'Telco' above: {current_path}")

# --- Define key data paths ---
DATA_RAW_DIR = PROJECT_ROOT / "resources" / "data" / "raw"
DATA_RAW_FILE = DATA_RAW_DIR / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

# --- Verify dataset presence ---
if not DATA_RAW_FILE.exists():
    # sorted() makes the fallback pick deterministic when several files match
    alt_files = sorted(DATA_RAW_DIR.glob("*Telco*Churn*.csv"))
    if alt_files:
        DATA_RAW_FILE = alt_files[0]
        print(f"⚠️ Using detected dataset file:\n   {DATA_RAW_FILE}")
    else:
        raise FileNotFoundError(f"❌ Missing dataset. Expected in:\n   {DATA_RAW_DIR}")

# --- Load dataset ---
df = pd.read_csv(DATA_RAW_FILE)

print("\n✅ Dataset loaded successfully:")
print(f"   → {len(df):,} rows × {len(df.columns)} columns")
print(f"   Source: {DATA_RAW_FILE}")

# --- Preview dataset ---
display(df.head(3))
display(df.tail(3))
df.info()
```

---

### ✅ **How This Works**

* **No functions**: the cell executes top-down, populating all variables in your notebook's global scope.
* Works from **any subfolder** inside `/Telco` (it walks upward from the working directory until it finds the root).
* Automatically detects your dataset name if it differs slightly (fallback via `glob("*Telco*Churn*.csv")`).
* Prints helpful diagnostics (row/column count, dataset location, and structure).
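
For reference, the upward walk can be isolated so it is easy to test outside a notebook. This is only a sketch: `find_project_root` and its parameters are hypothetical names, and it assumes the project folder is literally named `Telco`, as in the cells above.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Optional[Path] = None, root_name: str = "Telco") -> Path:
    """Walk upward from `start` (default: the working directory) to the folder named `root_name`."""
    current = (start or Path.cwd()).resolve()
    # Check the starting folder itself first, then each ancestor in turn
    for candidate in (current, *current.parents):
        if candidate.name == root_name:
            return candidate
    raise FileNotFoundError(f"Could not find '{root_name}' at or above: {current}")
```

If a path contains more than one folder with that name, the nearest one wins, which matches the behavior of the loop with `break` used above.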

---

Would you like me to show the **next cell** you can add right after this (to create `df_raw`, `df_eda`, and your EDA-ready copies safely)?


Here's the **clean, professional answer** you'll want to follow for your Telco project structure:

---

## 🎯 Situation

You've already done all the heavy lifting in **`00_Setup.ipynb`**:

* Established project root (`PROJECT_ROOT`)
* Defined all data directories (`DATA_RAW_DIR`, `DATA_PROCESSED_DIR`, etc.)
* Possibly saved your cleaned dataset to `resources/data/processed/`
* Imported foundational libraries (e.g., `pandas`, `pathlib`, etc.)

Now in **`01_EDA.ipynb`**, your goal is to:

> Load only what you *need to analyze*, not reconfigure the whole project.

---

## ✅ Best Practice Approach

### **1. Keep notebooks independent**

Even though `00_Setup` created the directories, `01_EDA` should still *define its own lightweight imports and path logic*, so it can run standalone in your portfolio.

That means:

* Import libraries again
* Redefine or import your path setup logic
* Load dataset directly from the processed directory

---

### **2. Recommended EDA Notebook Start**

```python
# ==========================================================
# 📊 01_EDA: Exploratory Data Analysis
# ==========================================================

# --- 1. Imports ---
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

# --- 2. Locate Project Root (same logic as 00_Setup) ---
current_path = Path.cwd().resolve()
for parent in [current_path] + list(current_path.parents):
    if parent.name == "Telco":
        PROJECT_ROOT = parent
        break
else:
    raise FileNotFoundError("❌ Could not find 'Telco' directory.")

# --- 3. Define Data Paths ---
DATA_PROCESSED_DIR = PROJECT_ROOT / "resources" / "data" / "processed"
dataset_path = DATA_PROCESSED_DIR / "telco_clean.csv"

# --- 4. Load Dataset ---
df = pd.read_csv(dataset_path)

print(f"✅ Loaded dataset: {df.shape[0]} rows × {df.shape[1]} columns")
df.head()
```

---

### **3. Why This Is Ideal**

| Reason                            | Benefit                                                             |
| --------------------------------- | ------------------------------------------------------------------- |
| 🔁 Independent execution          | Anyone can open `01_EDA.ipynb` directly and reproduce your analysis |
| 🧱 Reusable path logic            | Same logic as `00_Setup`, no hardcoded paths                        |
| 🧩 Clean imports                  | Each notebook shows exactly which libraries it uses                 |
| 📦 No dependency on kernel memory | You don't need to “Run All” from the start for it to work           |

---

### **4. Optional: Centralize the Setup**

If you find this pattern repeating in multiple notebooks, move your setup logic to a **utility file** like:

```bash
Telco/
└── src/
    └── utils/
        └── paths.py
```

**`paths.py`**

```python
from pathlib import Path

def get_project_paths():
    current_path = Path.cwd().resolve()
    for parent in [current_path] + list(current_path.parents):
        if parent.name == "Telco":
            PROJECT_ROOT = parent
            break
    else:
        raise FileNotFoundError("❌ Could not find 'Telco' directory.")

    DATA_PROCESSED = PROJECT_ROOT / "resources" / "data" / "processed"
    return {"project_root": PROJECT_ROOT, "data_processed": DATA_PROCESSED}
```

Then in your **`01_EDA.ipynb`**:

```python
from src.utils.paths import get_project_paths
import pandas as pd

paths = get_project_paths()
df = pd.read_csv(paths["data_processed"] / "telco_clean.csv")
```
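
Note that `from src.utils.paths import ...` only resolves when the folder that contains `src/` is on `sys.path`, which is usually not the case for a notebook opened from a subfolder. A minimal sketch of one way to handle that, assuming the `Telco/src/` layout above; `add_to_sys_path` is a hypothetical helper, not part of the project:

```python
import sys
from pathlib import Path


def add_to_sys_path(project_root: Path) -> str:
    """Put `project_root` at the front of sys.path (only once) and return the entry."""
    entry = str(project_root.resolve())
    if entry not in sys.path:
        sys.path.insert(0, entry)
    return entry


# Example use in a notebook that sits one level below Telco/ (an assumption):
# add_to_sys_path(Path.cwd().parent)
# from src.utils.paths import get_project_paths
```

Installing the project in editable mode is another common option, but it needs packaging metadata that these notes do not set up.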

---

### **5. What *Not* to Do**

* 🚫 Don't rely on variables like `PROJECT_ROOT` or `df` from the previous notebook: they exist only in that notebook's kernel and disappear when it restarts.
* 🚫 Don't re-run setup cells that create directories or rewrite files; that belongs only in `00_Setup`.
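
In practice this means handing data between notebooks through files on disk. A small sketch of a guard that makes a missing upstream file obvious; `require_artifact` is a hypothetical helper, and the file and notebook names in the usage note are the ones used in these notes:

```python
from pathlib import Path


def require_artifact(path: Path, produced_by: str) -> Path:
    """Return `path` if the file exists, else raise with a hint about which notebook creates it."""
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected file not found: {path}\nRun {produced_by} first to create it."
        )
    return path
```

For example: `df = pd.read_csv(require_artifact(DATA_PROCESSED_DIR / "telco_clean.csv", "00_Setup.ipynb"))`.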

---

Would you like me to show you a **clean header markdown cell + first two code cells** (ready to paste at the top of `01_EDA.ipynb`) that make it look fully polished and consistent with your `00_Setup` style?


# 02_Data_Validation_and_Cleaning.ipynb

For **Level 3**, keep it **monolithic + notebook-only** so you can see every step, then refactor at Level 4.

Here's a tidy way to do it now (no project modules), plus a clean migration path later.

# What to do now (Level 3: monolithic notebook)

Use **one notebook cell per step**, with no imports from your own modules and only a few small inline helpers. Keep it linear and explicit.

1. **Imports & paths (inline constants)**

```python
import json, re, hashlib
from pathlib import Path
import pandas as pd, numpy as np, yaml

SCHEMA_YAML = Path("configs/schema.yaml")
CONFIG_YAML = Path("configs/config.yaml")

with open(SCHEMA_YAML) as f: SCHEMA = yaml.safe_load(f)
with open(CONFIG_YAML) as f: CFG = yaml.safe_load(f)

RAW = Path(CFG["data"]["raw_path"])
CLEAN = Path(CFG["data"]["processed_path"])
REPORTS = Path(CFG["data"]["reports_dir"]); REPORTS.mkdir(parents=True, exist_ok=True)
ISSUE_LOG = Path(CFG["validation"]["issue_log_path"])
VALID_JSON = Path(CFG["exports"]["validation_summary"])
BASELINE_JSON = Path(CFG["exports"]["baseline_stats"])
ENV_JSON = Path(CFG["exports"]["environment_snapshot"])
```

2. **Load data (before/after)**

```python
df_before = pd.read_csv(RAW)
df_after = pd.read_csv(CLEAN) if CLEAN.exists() else None  # may be None on first run
```

3. **Schema & dtype checks (inline)**

```python
expected = SCHEMA["expected_columns"]
required = set(SCHEMA["required_columns"])
present = set(df_before.columns)

missing_required = list(required - present)
unexpected_cols = list(present - set(expected.keys()))

dfv = df_before.copy()
for col, dt in expected.items():
    if col in dfv and dt in ("float64","float32","int64","int32"):
        dfv[col] = pd.to_numeric(dfv[col], errors="coerce")
    elif col in dfv and dt in ("string","category"):
        dfv[col] = dfv[col].astype("string")
```

4. **Missing & empties (inline)**

```python
def _empty_like(s):
    # Count nulls plus values that are empty or whitespace-only
    s = s.astype("string")
    return int(s.isna().sum() + s.str.strip().eq("").sum())

allow_missing = set(SCHEMA.get("allow_missing", []))
missing_counts = {c: _empty_like(dfv[c]) for c in dfv.columns}
```

5. **PK, target, leakage, numeric bounds & outliers (inline)**

```python
issues = []
def log_issue(idx, col, rule, sev, details): issues.append(
    {"id":idx, "column":col, "rule":rule, "severity":sev, "details":details}
)

pk = SCHEMA.get("primary_key"); target = SCHEMA.get("target_column")
exp_cats = SCHEMA.get("expected_categories", {})
bounds = SCHEMA.get("numeric_bounds", {})
z_thr = float(CFG["outlier_detection"]["zscore_threshold"])
iqr_k = float(CFG["outlier_detection"]["iqr_multiplier"])
hi_card = int(CFG["categorical"]["high_cardinality_threshold"])

# PK
if pk and pk in dfv:
    if dfv[pk].isna().any():
        log_issue("ALL", pk, "pk_nulls", "high", f"count={int(dfv[pk].isna().sum())}")
    dup = int(dfv.duplicated(pk).sum())
    if dup: log_issue("ALL", pk, "pk_duplicates", "high", f"count={dup}")

# Target + leakage
if target in dfv and exp_cats.get(target):
    bad = dfv[~dfv[target].isin(exp_cats[target])][target].dropna().unique().tolist()
    if bad: log_issue("ALL", target, "invalid_target_labels", "high", f"{bad}")
leak_cols = [c for c in dfv.columns if re.search(r"(churn|cancel|termination|disconnect)", c, re.I) and c!=target]

# Numeric bounds + outliers (log only)
for c, b in bounds.items():
    if c not in dfv: continue
    s = pd.to_numeric(dfv[c], errors="coerce")
    below = int((s < b.get("min", -np.inf)).sum()); above = int((s > b.get("max", np.inf)).sum())
    if below or above: log_issue("ALL", c, "bounds_violation", "medium", f"below={below}, above={above}, bounds={b}")
    mu, sd = s.mean(), s.std(ddof=0)
    if sd and np.isfinite(sd) and ((s-mu).abs()>z_thr*sd).any():
        log_issue("ALL", c, "zscore_outliers", "low", f"z_thr={z_thr}")
    q1,q3 = s.quantile(0.25), s.quantile(0.75); iqr=q3-q1
    if pd.notna(iqr) and iqr>0:
        lo,hi = q1-iqr_k*iqr, q3+iqr_k*iqr
        if ((s<lo)|(s>hi)).any(): log_issue("ALL", c, "iqr_outliers", "low", f"iqr_k={iqr_k}")
```

6. **Categorical hygiene & logical check (inline)**

```python
# Unexpected categories
for c, allowed in exp_cats.items():
    if c in dfv:
        uniq = set(pd.Series(dfv[c], dtype="string").dropna().unique())
        bad = sorted(list(uniq - set(allowed)))
        if bad: log_issue("ALL", c, "unexpected_categories", "medium", f"{bad}")

# High cardinality
for c in dfv.select_dtypes(include=["object","string"]).columns:
    if dfv[c].nunique(dropna=True) > hi_card:
        log_issue("ALL", c, "high_cardinality", "low", f"nunique={int(dfv[c].nunique())}")

# tenure vs TotalCharges
if {"tenure","TotalCharges"}.issubset(dfv.columns):
    t = pd.to_numeric(dfv["tenure"], errors="coerce")
    tc = pd.to_numeric(dfv["TotalCharges"], errors="coerce")
    bad = int(((t>0) & ((tc<=0)|tc.isna())).sum())
    if bad: log_issue("ALL", "tenure|TotalCharges", "logical_inconsistency", "medium", f"count={bad}")
```

7. **Issue log + baseline stats (inline)**

```python
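ISSUE_LOG.parent.mkdir(parents=True, exist_ok=True)
# Fixed column list so the CSV keeps its header even when no issues were logged
pd.DataFrame(issues, columns=["id", "column", "rule", "severity", "details"]).to_csv(ISSUE_LOG, index=False)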

def baseline_stats(df):
    out={"numeric":{}, "categorical":{}}
    for c in df.columns:
        s=df[c]
        if pd.api.types.is_numeric_dtype(s):
            out["numeric"][c]={ "count": int(s.count()), "mean": float(s.mean()),
                                "std": float(s.std(ddof=0)), "min": float(s.min()),
                                "p50": float(s.quantile(0.5)), "max": float(s.max()) }
        else:
            vc = s.astype("string").value_counts(dropna=True).head(20).to_dict()
            out["categorical"][c] = {"top_values": {str(k): int(v) for k,v in vc.items()}}
    return out

with open(BASELINE_JSON, "w") as f:
    json.dump(baseline_stats(dfv), f, indent=2)
```

8. **DQ score + summary JSON (inline)**

```python
def _hash(path):
    if not Path(path).exists(): return ""
    h=hashlib.sha256()
    with open(path,"rb") as f:
        for chunk in iter(lambda: f.read(1<<20), b""): h.update(chunk)
    return h.hexdigest()

base_checks=[]
base_checks.append(len(missing_required)==0)
if pk and pk in dfv: base_checks.append(dfv[pk].notna().all() and dfv[pk].nunique(dropna=False)==len(dfv))
dq_score = round(100.0 * (sum(base_checks)/len(base_checks) if base_checks else 1.0), 1)

summary = {
  "dataset": {
    "raw_path": str(RAW), "clean_path": str(CLEAN) if CLEAN.exists() else "",
    "rows_before": int(len(df_before)),
    "rows_after": int(len(df_after)) if isinstance(df_after, pd.DataFrame) else None,
    "columns": list(df_before.columns)
  },
  "schema": {"missing_required_columns": missing_required, "unexpected_columns": unexpected_cols,
             "dtype_expected": SCHEMA["expected_columns"]},
  "quality": {"dq_score_percent": dq_score, "missing_counts": {k:int(v) for k,v in missing_counts.items()}},
  "rules": {"numeric_bounds": bounds, "expected_categories": exp_cats, "primary_key": pk,
            "target_column": target, "leakage_name_hits": leak_cols,
            "outlier_params": {"zscore_threshold": z_thr, "iqr_multiplier": iqr_k}},
  "artifacts": {"issue_log_csv": str(ISSUE_LOG), "baseline_stats_json": str(BASELINE_JSON),
                "environment_snapshot_json": str(ENV_JSON)},
  "provenance": {"raw_sha256": _hash(RAW), "clean_sha256": _hash(CLEAN) if CLEAN.exists() else ""}
}
with open(VALID_JSON, "w") as f: json.dump(summary, f, indent=2)
print(f"✅ Wrote\n- {VALID_JSON}\n- {ISSUE_LOG}\n- {BASELINE_JSON}")
```
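
The summary records `ENV_JSON` under `artifacts`, but none of the cells above writes that file. A minimal sketch of one way to produce it with the standard library; the field names and the default package list are assumptions, not a fixed format:

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def environment_snapshot(packages=("pandas", "numpy", "pyyaml")) -> dict:
    """Collect interpreter, OS, and package versions (None when a package is not installed)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


def write_environment_snapshot(path: Path) -> dict:
    """Write the snapshot as JSON, creating the parent folder if needed."""
    snap = environment_snapshot()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(snap, indent=2))
    return snap
```

In the notebook this would be a single call, `write_environment_snapshot(ENV_JSON)`, placed before the summary cell.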

> This is **pure notebook**: sequential, with a few inline helpers but no project modules. It produces your `validation_summary.json`, `issue_log.csv`, and `baseline_stats.json`.
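
The cells read a fixed set of keys from `configs/config.yaml`, and a missing one only surfaces as a `KeyError` partway through the run. A small sketch of an early check: the key names are the ones the cells above read, while `REQUIRED_CONFIG_KEYS` and `missing_config_keys` are hypothetical names.

```python
# Sections and keys that the validation cells read from the loaded config
REQUIRED_CONFIG_KEYS = {
    "data": ["raw_path", "processed_path", "reports_dir"],
    "validation": ["issue_log_path"],
    "exports": ["validation_summary", "baseline_stats", "environment_snapshot"],
    "outlier_detection": ["zscore_threshold", "iqr_multiplier"],
    "categorical": ["high_cardinality_threshold"],
}


def missing_config_keys(cfg: dict) -> list:
    """Return 'section.key' names that the notebook needs but `cfg` lacks."""
    missing = []
    for section, keys in REQUIRED_CONFIG_KEYS.items():
        block = cfg.get(section) or {}
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing
```

Right after loading `CFG`, a cell could call `missing_config_keys(CFG)` and raise if the returned list is not empty.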

---

# How to migrate later (Level 4: modular)

When you're ready:

* Move logic into `src/validation/report.py`:

  * `load_yaml()`, `hash_file()`, `run_checks(df, schema, cfg)`, `write_issue_log()`, `write_summary()`
* Replace notebook cells with:

  ```python
  from src.validation.report import generate_validation_summary
  generate_validation_summary(schema_path="configs/schema.yaml", config_path="configs/config.yaml")
  ```
* Add a small CLI (`python -m src.validation.report --schema configs/schema.yaml --config configs/config.yaml`)
* Add tests for rules (e.g., bounds violations, PK duplicates).
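
A minimal sketch of the CLI entry point with `argparse`, using the two flags named above. `generate_validation_summary()` is not defined in these notes, so the sketch takes the function to call as a `runner` parameter; `build_parser` and `main` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the parser with the --schema and --config flags."""
    parser = argparse.ArgumentParser(description="Generate the validation summary.")
    parser.add_argument("--schema", default="configs/schema.yaml", help="Path to the schema YAML")
    parser.add_argument("--config", default="configs/config.yaml", help="Path to the config YAML")
    return parser


def main(argv=None, runner=None) -> int:
    """Parse arguments and call `runner(schema_path=..., config_path=...)`."""
    args = build_parser().parse_args(argv)
    if runner is None:
        # Placeholder: no runner supplied, so only report what would be validated
        print(f"Would validate with schema={args.schema} config={args.config}")
    else:
        runner(schema_path=args.schema, config_path=args.config)
    return 0
```

Inside `src/validation/report.py`, the module would end with an `if __name__ == "__main__":` guard that calls `main(runner=generate_validation_summary)`, which is what makes the `python -m` invocation work.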

If you want, I can produce the **exact file skeletons** and a minimal `generate_validation_summary()` function when you hit Level 4.


### Standard Investigation Routine:

```python
import pandas as pd

def investigate_column(df, col):
    """Standard data detective routine for any column"""
    print(f"=== INVESTIGATING {col} ===")
    print(f"Data type: {df[col].dtype}")
    print(f"Null count: {df[col].isnull().sum()}")
    print(f"Unique values: {df[col].nunique()}")
    print(f"Sample values: {df[col].head(5).tolist()}")
    
    if df[col].dtype == 'object':
        print(f"All unique values: {df[col].unique()}")
        
        # Check if it should be numeric
        print("Testing numeric conversion...")
        converted = pd.to_numeric(df[col], errors='coerce')
        conversion_issues = converted.isnull().sum() - df[col].isnull().sum()
        if conversion_issues > 0:
            print(f"🚨 {conversion_issues} values can't convert to numeric!")
            problematic = df[converted.isnull() & df[col].notnull()][col].unique()
            print(f"Problematic values: {problematic}")

# Use it on suspicious columns
investigate_column(df, 'TotalCharges')
```




```
├── src/
│   └── utils.py  # Start this immediately!
```


### 0.2 Data Acquisition & First Look

* Load the CSV file
* Display first/last rows
* Check shape and size
* Identify column names
* Save working copy

Extract first function immediately:

```python
# src/utils.py
import pandas as pd

def load_telco_data(filepath):
    """Load and display basic info about telco data."""
    df = pd.read_csv(filepath)
    print(f"Loaded {df.shape[0]} rows, {df.shape[1]} columns")
    return df
```


```python
# src/utils.py
def get_data_overview(df):
    """Generate comprehensive data overview."""
    return {
        'shape': df.shape,
        'dtypes': df.dtypes.value_counts(),
        'missing': df.isnull().sum(),
        'memory': df.memory_usage().sum() / 1024**2  # in MB (shallow; pass deep=True to count string contents)
    }
```


```python
# 6.6 Detect cardinality & low-frequency categories
# Assumes earlier cells defined `df`, `cat_cols`, and imported pandas as pd and numpy as np

# Cardinality summary and dominance of top category (helps spot imbalance)
def top_prop(s: pd.Series) -> float:
    vc = s.value_counts(normalize=True)
    return float(vc.iloc[0]) if not vc.empty else np.nan

card_summary = pd.DataFrame({
    "unique_count": df[cat_cols].nunique(),
    "top_level_prop": df[cat_cols].apply(top_prop).round(3)
}).sort_values(["unique_count", "top_level_prop"], ascending=[False, False])

print("Categorical cardinality / dominance summary:")
display(card_summary)

# Rare-level flags (e.g., levels with < 1% of rows)
rare_threshold = 0.01
rare_levels = {}
for c in cat_cols:
    prop = df[c].value_counts(normalize=True)
    rare = prop[prop < rare_threshold]
    if not rare.empty:
        rare_levels[c] = rare.round(4)
if rare_levels:
    print("\nLevels with <1% frequency (consider grouping into 'Other'):")
    for k, v in rare_levels.items():
        print(f"- {k}:")
        display(v.to_frame("proportion"))
else:
    print("\nNo rare levels (<1%) detected.")
```
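
The "consider grouping into 'Other'" hint can be sketched as a small helper. This is only an illustration under assumptions: `group_rare_levels` is a hypothetical name, it expects a plain object or string column (a `category` column would need the new level added first), and missing values are left untouched.

```python
import pandas as pd


def group_rare_levels(s: pd.Series, threshold: float = 0.01, other: str = "Other") -> pd.Series:
    """Replace levels whose share of non-null rows is below `threshold` with `other`."""
    share = s.value_counts(normalize=True)
    rare = share[share < threshold].index
    # Keep values that are not rare; everything rare becomes the `other` label
    return s.where(~s.isin(rare), other)
```

Applied per column, for example `df[c] = group_rare_levels(df[c], rare_threshold)`, it keeps the same 1% cut-off as the detection cell.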


"""
Analyze distribution of numerical features.
Pattern: repetitive plotting code that can later be extracted to a function.
"""

`__file__` is a **special variable** that exists when Python runs a script (e.g., `python myscript.py`).
In **Jupyter notebooks**, there's no script file being executed, so `__file__` is *undefined*, which causes the `NameError`.

```py
# 📂 Unified project paths
from pathlib import Path

# Step 1: Define the "project root" (always the Telco folder)
try:
    PROJECT_ROOT = Path(__file__).resolve().parents[2]  # works in scripts
except NameError:
    # 👉 Inside a Jupyter notebook __file__ is undefined, so fall back to the parent
    # of the working directory (assumes the notebook sits one level below Telco)
    PROJECT_ROOT = Path.cwd().resolve().parent

# Step 2: Define shared data paths (always same regardless of Level)
DATA_ROOT = PROJECT_ROOT / "resources" / "data"
DATA_RAW = DATA_ROOT / "raw" / "WA_Fn-UseC_-Telco-Customer-Churn.csv"
DATA_PROCESSED = DATA_ROOT / "processed"

# Step 3: Confirm
print(f"📁 Project root: {PROJECT_ROOT}")
print(f"📄 Raw dataset:  {DATA_RAW}")
print(f"📂 Processed:    {DATA_PROCESSED}")

# Step 4: Safety check
for p in [DATA_RAW.parent, DATA_PROCESSED]:
    p.mkdir(parents=True, exist_ok=True)

if DATA_RAW.exists():
    print("✅ Dataset found.")
else:
    print(f"⚠️ Missing dataset: {DATA_RAW}")
```


Awesome, here's a **Skill Level 3/10 Telco Churn mini-project** that starts in notebooks and then shows how to extract what you built into a tidy, modular codebase. Minimal math, strong guardrails, and clear copy-paste cells.

---

# Telco Churn (Skill 3/10)

## What you'll build

1. A simple churn classifier (Logistic Regression) using scikit-learn.
2. Four small notebooks to go from data → model → evaluation → batch inference.
3. Then you'll refactor that workflow into a clean `src/` Python package + tiny CLIs.

---

## Quick setup

```bash
# (zsh/macOS)
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip

pip install pandas numpy scikit-learn matplotlib joblib pyyaml
# (optional) seaborn
```

Put your CSV in `data/raw/` (e.g., `data/raw/telco.csv`).
You should have typical columns like: `customerID`, `gender`, `SeniorCitizen`, `Partner`, `Dependents`, `tenure`, `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, `PaperlessBilling`, `PaymentMethod`, `MonthlyCharges`, `TotalCharges`, `Churn`.
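
A quick way to confirm the file matches that column list before opening it in pandas is to read only the header row. This is a sketch: `EXPECTED_COLUMNS` repeats the list above, and `missing_columns` is a hypothetical helper name.

```python
import csv
from pathlib import Path

EXPECTED_COLUMNS = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents", "tenure",
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "Churn",
]


def missing_columns(csv_path: Path, expected=EXPECTED_COLUMNS) -> list:
    """Read only the header row and return the expected columns that are absent."""
    # utf-8-sig drops a byte-order mark if the file has one
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        header = next(csv.reader(f), [])
    present = {name.strip() for name in header}
    return [col for col in expected if col not in present]
```

An empty list from `missing_columns(Path("data/raw/telco.csv"))` means every expected column is present; extra columns are ignored.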

---

## Notebook 1: Load & EDA (`01_load_eda.ipynb`)

### Cell 1: Imports & read

```python
import pandas as pd

df = pd.read_csv("data/raw/telco.csv")
print(df.shape)
df.head()
```

### Cell 2: Basic quality checks

```python
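df.info()
# Only the last expression in a cell is shown automatically, so display each summary explicitly
display(df.isna().sum().sort_values(ascending=False).head(10))
display(df.describe(include='all').T.head(20))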
```

### Cell 3: Light cleaning (Skill-3 friendly)

```python
# Fix TotalCharges: sometimes it's blank string; coerce to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

# Drop rows missing the label if any
df = df.dropna(subset=['Churn']).copy()

# Fill any remaining numeric NaNs with column medians
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Standardize target to 0/1
df['Churn'] = (df['Churn'].astype(str).str.strip().str.lower() == 'yes').astype(int)

# Drop non-predictive identifiers
if 'customerID' in df.columns:
    df = df.drop(columns=['customerID'])

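from pathlib import Path
Path("data/processed").mkdir(parents=True, exist_ok=True)  # to_csv does not create missing folders
df.to_csv("data/processed/telco_clean.csv", index=False)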
df.head()
```

### Cell 4: Quick plots (optional)

```python
import matplotlib.pyplot as plt

df['Churn'].value_counts().plot(kind="bar")
plt.title("Churn distribution (0=No, 1=Yes)")
plt.show()

if 'Contract' in df.columns:
    (df.groupby('Contract')['Churn']
      .mean()
      .sort_values()
      .plot(kind="bar"))
    plt.title("Mean churn rate by Contract")
    plt.ylabel("Churn rate")
    plt.show()
```

---

## Notebook 2: Preprocess & Train (`02_preprocess_train.ipynb`)

### Cell 1: Imports & split

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/processed/telco_clean.csv")

y = df['Churn']
X = df.drop(columns=['Churn'])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
X_train.shape, X_test.shape
```

### Cell 2: Build preprocessing (numeric + categorical)

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = X_train.select_dtypes(include='number').columns.tolist()
cat_cols = X_train.select_dtypes(exclude='number').columns.tolist()

preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(with_mean=False), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), cat_cols),
    ],
    remainder="drop"
)
```

### Cell 3: Model pipeline (LogReg)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline(steps=[
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=200))
])

clf.fit(X_train, y_train)
```

### Cell 4: Save artifacts

```python
import joblib, os
os.makedirs("models", exist_ok=True)
joblib.dump(clf, "models/churn_pipeline.joblib")
joblib.dump({"num_cols": num_cols, "cat_cols": cat_cols}, "models/columns.joblib")
```

---

## Notebook 3: Evaluate (`03_evaluate.ipynb`)

```python
import joblib, matplotlib.pyplot as plt
import pandas as pd  # needed for read_csv below
from sklearn.metrics import classification_report, confusion_matrix, RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

clf = joblib.load("models/churn_pipeline.joblib")
df = pd.read_csv("data/processed/telco_clean.csv")
y = df['Churn']; X = df.drop(columns=['Churn'])

# Re-create the same split as Notebook 2 (same test_size, random_state, and stratify)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)[:,1]

print(classification_report(y_test, y_pred, digits=3))
print("ROC AUC:", roc_auc_score(y_test, y_proba))

cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:\n", cm)

RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.title("ROC Curve"); plt.show()
```

---

## Notebook 4: Batch Inference (`04_inference_batch.ipynb`)

```python
import pandas as pd, joblib

clf = joblib.load("models/churn_pipeline.joblib")
new_data = pd.read_csv("data/scoring/new_customers.csv")  # your new records
pred_proba = clf.predict_proba(new_data)[:,1]
pred_label = (pred_proba >= 0.5).astype(int)

scored = new_data.copy()
scored['churn_proba'] = pred_proba
scored['churn_pred']  = pred_label
scored.to_csv("data/scoring/new_customers_scored.csv", index=False)
scored.head()
```
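The pipeline was fitted on the cleaned table, so new records generally need the same label-free cleaning before `predict_proba`. A minimal sketch, assuming the scoring file has the raw Telco layout (an ID column and a text `TotalCharges`); `prepare_for_scoring` and its `fill_value` parameter are hypothetical names, not part of the notebooks above:

```python
import pandas as pd

def prepare_for_scoring(df: pd.DataFrame, fill_value: float = 0.0) -> pd.DataFrame:
    """Apply the label-free cleaning steps to new records (hypothetical helper)."""
    out = df.copy()
    if "TotalCharges" in out.columns:
        # Same coercion as in training: blank strings become NaN
        out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")
        # Assumption: a fixed fill value; reuse the training median if you saved it
        out["TotalCharges"] = out["TotalCharges"].fillna(fill_value)
    # IDs and any stray label column are not model inputs
    return out.drop(columns=["customerID", "Churn"], errors="ignore")

raw = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "tenure": [0, 12],
    "TotalCharges": [" ", "350.5"],
})
clean = prepare_for_scoring(raw)
print(clean.dtypes)
```

In the notebook you would call it as `new_data = prepare_for_scoring(new_data)` right after `pd.read_csv`.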

That’s the full notebook-first flow. Next: extract it into a small, modular codebase.

---

# Extracting to a Modular Codebase

## Suggested structure

```
telco-churn/
├── data/
│   ├── raw/
│   ├── processed/
│   └── scoring/
├── models/
├── notebooks/
│   ├── 01_load_eda.ipynb
│   ├── 02_preprocess_train.ipynb
│   ├── 03_evaluate.ipynb
│   └── 04_inference_batch.ipynb
├── src/
│   └── churn/
│       ├── __init__.py
│       ├── data.py
│       ├── features.py
│       ├── model.py
│       ├── evaluate.py
│       └── io.py
├── scripts/
│   ├── train.py
│   └── predict.py
├── configs/
│   └── config.yaml
├── requirements.txt
└── README.md
```

## `configs/config.yaml` (single source of truth)

```yaml
paths:
  raw: data/raw/telco.csv
  processed: data/processed/telco_clean.csv
  model: models/churn_pipeline.joblib
  columns: models/columns.joblib

training:
  test_size: 0.2
  random_state: 42
  model:
    type: logistic_regression
    max_iter: 200

inference:
  threshold: 0.5
```
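Because every script reads this file, it can help to fail fast when a key is missing or a value is out of range. A minimal sketch, assuming the key layout shown above; `validate_config` is a hypothetical helper, and the dict below mirrors what `yaml.safe_load` would return for this file:

```python
def validate_config(cfg: dict) -> dict:
    """Check required keys and value ranges; return cfg unchanged (hypothetical helper)."""
    required = {
        "paths": ["raw", "processed", "model", "columns"],
        "training": ["test_size", "random_state", "model"],
        "inference": ["threshold"],
    }
    for section, keys in required.items():
        if section not in cfg:
            raise KeyError(f"config is missing section '{section}'")
        for key in keys:
            if key not in cfg[section]:
                raise KeyError(f"config is missing '{section}.{key}'")
    if not 0.0 < cfg["training"]["test_size"] < 1.0:
        raise ValueError("training.test_size must be between 0 and 1")
    if not 0.0 <= cfg["inference"]["threshold"] <= 1.0:
        raise ValueError("inference.threshold must be within [0, 1]")
    return cfg

# Same structure as configs/config.yaml after parsing
cfg = validate_config({
    "paths": {
        "raw": "data/raw/telco.csv",
        "processed": "data/processed/telco_clean.csv",
        "model": "models/churn_pipeline.joblib",
        "columns": "models/columns.joblib",
    },
    "training": {
        "test_size": 0.2,
        "random_state": 42,
        "model": {"type": "logistic_regression", "max_iter": 200},
    },
    "inference": {"threshold": 0.5},
})
print("config OK, threshold =", cfg["inference"]["threshold"])
```

Calling this once right after loading the YAML keeps typos in the config from surfacing halfway through training.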

## `src/churn/io.py`

```python
import pandas as pd
import joblib
from pathlib import Path

def read_csv(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

def write_csv(df: pd.DataFrame, path: str) -> None:
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)

def save_obj(obj, path: str) -> None:
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(obj, path)

def load_obj(path: str):
    return joblib.load(path)
```

## `src/churn/data.py`

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Label
    df['Churn'] = (df['Churn'].astype(str).str.strip().str.lower() == 'yes').astype(int)
    # Numerics
    if 'TotalCharges' in df.columns:
        df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    num_cols = df.select_dtypes(include='number').columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    # Drop IDs
    if 'customerID' in df.columns:
        df = df.drop(columns=['customerID'])
    return df

def split_xy(df: pd.DataFrame, target: str = 'Churn'):
    y = df[target]
    X = df.drop(columns=[target])
    return X, y
```

## `src/churn/features.py`

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_preprocess(X):
    num_cols = X.select_dtypes(include='number').columns.tolist()
    cat_cols = X.select_dtypes(exclude='number').columns.tolist()

    preprocess = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(with_mean=False), num_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), cat_cols),
        ],
        remainder="drop"
    )
    return preprocess, num_cols, cat_cols
```

## `src/churn/model.py`

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

def build_pipeline(preprocess, max_iter=200):
    return Pipeline([
        ("prep", preprocess),
        ("model", LogisticRegression(max_iter=max_iter))
    ])
```

## `src/churn/evaluate.py`

```python
from sklearn.metrics import classification_report, roc_auc_score

def evaluate(clf, X_test, y_test):
    y_pred = clf.predict(X_test)
    y_proba = clf.predict_proba(X_test)[:,1]
    report = classification_report(y_test, y_pred, digits=3, output_dict=False)
    auc = roc_auc_score(y_test, y_proba)
    return {"report": report, "roc_auc": auc}
```

## `scripts/train.py`

```python
import yaml
from sklearn.model_selection import train_test_split
from churn.io import read_csv, write_csv, save_obj
from churn.data import basic_clean, split_xy
from churn.features import build_preprocess
from churn.model import build_pipeline

def main():
    cfg = yaml.safe_load(open("configs/config.yaml"))
    df = read_csv(cfg["paths"]["raw"])
    df = basic_clean(df)
    write_csv(df, cfg["paths"]["processed"])

    X, y = split_xy(df, target="Churn")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=cfg["training"]["test_size"],
        random_state=cfg["training"]["random_state"],
        stratify=y
    )
    preprocess, num_cols, cat_cols = build_preprocess(X_train)
    clf = build_pipeline(preprocess, max_iter=cfg["training"]["model"]["max_iter"])
    clf.fit(X_train, y_train)

    save_obj(clf, cfg["paths"]["model"])
    save_obj({"num_cols": num_cols, "cat_cols": cat_cols}, cfg["paths"]["columns"])
    print("Model trained and saved.")

if __name__ == "__main__":
    main()
```

## `scripts/predict.py`

```python
import sys
from pathlib import Path

import pandas as pd
import yaml

from churn.io import load_obj


def main():
    if len(sys.argv) < 2:
        print("Usage: python scripts/predict.py <input_csv>")
        sys.exit(1)

    with open("configs/config.yaml") as f:
        cfg = yaml.safe_load(f)
    input_csv = Path(sys.argv[1])
    df_new = pd.read_csv(input_csv)

    clf = load_obj(cfg["paths"]["model"])
    proba = clf.predict_proba(df_new)[:, 1]
    pred = (proba >= cfg["inference"]["threshold"]).astype(int)
    df_new["churn_proba"] = proba
    df_new["churn_pred"] = pred
    # Write "<name>_scored.csv" next to the input, so the input file is never overwritten
    out = input_csv.with_name(f"{input_csv.stem}_scored.csv")
    df_new.to_csv(out, index=False)
    print(f"Scored: {out}")

if __name__ == "__main__":
    main()
```

---

## How to move from notebooks → modules (simple recipe)

1. **Identify stable chunks** in your notebooks (loading/cleaning, preprocessing, model build, evaluation).
2. **Copy each chunk** into a small function (see `data.py`, `features.py`, `model.py`, `evaluate.py`).
3. **Replace notebook constants** (paths, test_size, thresholds) with a **`config.yaml`** and read them in your scripts.
4. **Create tiny CLIs** (`scripts/train.py`, `scripts/predict.py`) that call those functions in order.
5. Keep notebooks for **exploration and demos**, but run production tasks via `python scripts/train.py` and `python scripts/predict.py data/scoring/new_customers.csv`.
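
One payoff of step 2 is that each extracted function can be checked in isolation. A minimal sketch, assuming a trimmed copy of `basic_clean` from `data.py` (inlined here so the snippet runs on its own) and a hypothetical pytest-style `test_basic_clean`:

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    # Trimmed copy of src/churn/data.py, inlined for a standalone example
    df = df.copy()
    df["Churn"] = (df["Churn"].astype(str).str.strip().str.lower() == "yes").astype(int)
    if "TotalCharges" in df.columns:
        df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())
    return df.drop(columns=["customerID"], errors="ignore")

def test_basic_clean():
    # Tiny made-up frame that exercises each cleaning rule
    raw = pd.DataFrame({
        "customerID": ["a", "b", "c"],
        "TotalCharges": ["10.0", " ", "30.0"],
        "Churn": ["Yes", "No", " yes "],
    })
    out = basic_clean(raw)
    assert "customerID" not in out.columns
    assert out["Churn"].tolist() == [1, 0, 1]
    # The blank TotalCharges is filled with the column median
    assert out["TotalCharges"].tolist() == [10.0, 20.0, 30.0]

test_basic_clean()
print("basic_clean behaves as expected")
```

In the real project the test would import `basic_clean` from `churn.data` instead of redefining it.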

---

## Run it end-to-end

```bash
# Run from the project root; make the `churn` package under src/ importable
export PYTHONPATH=src

# Train
python scripts/train.py

# Score new data
python scripts/predict.py data/scoring/new_customers.csv
```

---

### What you learned (Level-3 takeaways)

* A single **sklearn Pipeline** with `ColumnTransformer` is enough to ship a baseline.
* **Config-first**: paths & knobs live in YAML, not scattered through notebooks.
* **Separation of concerns**: `data.py` (clean), `features.py` (prep), `model.py` (build), `evaluate.py` (metrics), `io.py` (I/O), `scripts/` (entrypoints).

If you want, I can tailor the code to your exact column names (or plug in class-weighting, simple threshold tuning, or a RandomForest baseline), but this is a clean Skill-3 path you can run today.


Excellent catch. Leaving out visualization is actually a *strategic* design choice for **Skill Level 3**, not an omission.

Let me break it down clearly 👇

---

## 🎯 Goal of Skill Level 3

This level is about **understanding the workflow**, not the modeling subtleties.
You’re learning how to:

1. Move from **notebook exploration → structured pipeline**
2. Understand the **data → features → model → artifact → inference** flow
3. Apply clean **modularization patterns** (YAML configs, folder structure, scripts)

Visualization is incredibly useful, but at this level it tends to:

* Distract from the *pipeline extraction process* (too many matplotlib/seaborn steps clutter the logic)
* Encourage *manual exploration*, whereas we’re training the mindset of *reproducible processing*
* Add dependencies (plotly, seaborn) before you’ve nailed the reproducibility flow

So Skill-3 keeps it **minimal, deterministic, and linear**: every cell builds toward automation.

---

## 💡 When Viz *Should* Enter

Here’s the progression:

| Skill Level      | Visualization Focus                          | Purpose                                      |
| ---------------- | -------------------------------------------- | -------------------------------------------- |
| **1–2**          | Bar/pie plots in notebook                    | Learn basic EDA                              |
| **3 (this one)** | Light EDA only                               | Focus on reproducible modular pipeline       |
| **4–5**          | Matplotlib/Seaborn dashboards                | Compare features, churn rates, tenure trends |
| **6+**           | Plotly dashboards, SHAP, feature importances | Interpretability & explainability            |

---

## 🧭 That said…

If you *do* want to integrate visualization at Skill 3 (say, for a learning bonus), add a single optional notebook called:

### `05_visualize_results.ipynb`

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("data/processed/telco_clean.csv")

plt.figure(figsize=(6,4))
sns.barplot(x="Contract", y="Churn", data=df)
plt.title("Churn Rate by Contract Type")
plt.show()

plt.figure(figsize=(6,4))
sns.histplot(data=df, x="tenure", hue="Churn", multiple="stack", bins=30)
plt.title("Tenure Distribution by Churn")
plt.show()
```

It’s clean, safe, and doesn’t alter your codebase logic.

---

### TL;DR

I left out visualization intentionally because:

* This level focuses on **workflow discipline** (reproducibility + modularization)
* Visualization returns later as **interpretability** at Skill-4 and beyond


># Would you like me to show where to naturally plug visualizations into this pipeline (e.g., post-EDA and post-evaluation)?

Great, here’s exactly where to plug visualizations into the Skill-3 pipeline, with drop-in cells and (optional) modularized viz so it stays reproducible.

# Where to add viz

## 1) In the notebooks

### A) `01_load_eda.ipynb`: quick sanity EDA (after cleaning/saving)

```python
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("data/processed/telco_clean.csv")

# Target balance
df['Churn'].value_counts().sort_index().plot(kind="bar")
plt.title("Churn distribution (0=No, 1=Yes)"); plt.xlabel("Churn"); plt.ylabel("Count")
plt.tight_layout(); plt.show()

# Churn rate by Contract
if 'Contract' in df.columns:
    (df.groupby('Contract')['Churn'].mean()
      .sort_values()
      .plot(kind="bar"))
    plt.title("Mean churn rate by Contract"); plt.ylabel("Churn rate")
    plt.tight_layout(); plt.show()

# Tenure vs churn (binned)
if 'tenure' in df.columns:
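    # Upper edge is 73 so that tenure == 72 lands in the last bin (right=False excludes the edge)
    bins = [0, 6, 12, 24, 36, 48, 60, 73]
    cut = pd.cut(df['tenure'], bins=bins, right=False)
    (df.groupby(cut, observed=False)['Churn'].mean()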
      .plot(marker="o"))
    plt.title("Churn rate by Tenure bin"); plt.xlabel("Tenure bin (months)"); plt.ylabel("Churn rate")
    plt.tight_layout(); plt.show()
```

### B) `03_evaluate.ipynb`: evaluation plots (after computing preds)

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, confusion_matrix

# ROC (already present, keep it)
RocCurveDisplay.from_estimator(clf, X_test, y_test)
plt.title("ROC Curve"); plt.tight_layout(); plt.show()

# Precision–Recall
PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
plt.title("Precision–Recall Curve"); plt.tight_layout(); plt.show()

# Confusion matrix heatmap (simple matplotlib)
cm = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation='nearest')
ax.set_title("Confusion Matrix"); plt.colorbar(im, ax=ax)
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
ax.set_xticks([0,1]); ax.set_yticks([0,1])
for (i,j),v in np.ndenumerate(cm):
    ax.text(j, i, str(v), ha='center', va='center')
plt.tight_layout(); plt.show()
```

### C) `04_inference_batch.ipynb`: scoring distribution (after scoring)

```python
import matplotlib.pyplot as plt
scored['churn_proba'].plot(kind='hist', bins=30)
plt.title("Predicted churn probabilities"); plt.xlabel("P(churn)"); plt.ylabel("Count")
plt.tight_layout(); plt.show()
```

---

## 2) In the modular codebase (optional but clean)

Add a tiny viz module so you can generate static PNGs on demand (keeps notebooks tidy and enables CI later).

### `src/churn/visualize.py`

```python
from pathlib import Path
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn.metrics import RocCurveDisplay, PrecisionRecallDisplay, confusion_matrix

def ensure_dir(p): Path(p).parent.mkdir(parents=True, exist_ok=True)

def churn_distribution(df: pd.DataFrame, out_path: str):
    ensure_dir(out_path)
    ax = df['Churn'].value_counts().sort_index().plot(kind="bar")
    ax.set_title("Churn distribution (0=No, 1=Yes)"); ax.set_xlabel("Churn"); ax.set_ylabel("Count")
    plt.tight_layout(); plt.savefig(out_path); plt.close()

def churn_by_contract(df: pd.DataFrame, out_path: str):
    if 'Contract' not in df.columns: return
    ensure_dir(out_path)
    ax = (df.groupby('Contract')['Churn'].mean().sort_values()).plot(kind="bar")
    ax.set_title("Mean churn rate by Contract"); ax.set_ylabel("Churn rate")
    plt.tight_layout(); plt.savefig(out_path); plt.close()

def churn_by_tenure_bin(df: pd.DataFrame, out_path: str):
    if 'tenure' not in df.columns: return
    ensure_dir(out_path)
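    # Upper edge is 73 so that tenure == 72 lands in the last bin (right=False excludes the edge)
    bins = [0, 6, 12, 24, 36, 48, 60, 73]
    cut = pd.cut(df['tenure'], bins=bins, right=False)
    ax = (df.groupby(cut, observed=False)['Churn'].mean()).plot(marker="o")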
    ax.set_title("Churn rate by Tenure bin"); ax.set_xlabel("Tenure bin (months)"); ax.set_ylabel("Churn rate")
    plt.tight_layout(); plt.savefig(out_path); plt.close()

def eval_curves(clf, X_test, y_test, out_dir: str):
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    # ROC
    RocCurveDisplay.from_estimator(clf, X_test, y_test)
    plt.title("ROC Curve"); plt.tight_layout(); plt.savefig(f"{out_dir}/roc.png"); plt.close()

    # PR
    PrecisionRecallDisplay.from_estimator(clf, X_test, y_test)
    plt.title("Precision–Recall Curve"); plt.tight_layout(); plt.savefig(f"{out_dir}/pr.png"); plt.close()

def cm_plot(y_true, y_pred, out_path: str):
    ensure_dir(out_path)
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest')
    ax.set_title("Confusion Matrix"); plt.colorbar(im, ax=ax)
    ax.set_xlabel("Predicted"); ax.set_ylabel("Actual")
    ax.set_xticks([0,1]); ax.set_yticks([0,1])
    for (i,j),v in np.ndenumerate(cm):
        ax.text(j, i, str(v), ha='center', va='center')
    plt.tight_layout(); plt.savefig(out_path); plt.close()
```

### `scripts/visualize.py`

```python
import yaml
import pandas as pd
from sklearn.model_selection import train_test_split
from churn.io import load_obj, read_csv
from churn.visualize import (
    churn_distribution, churn_by_contract, churn_by_tenure_bin, eval_curves, cm_plot
)

def main():
    cfg = yaml.safe_load(open("configs/config.yaml"))
    df = read_csv(cfg["paths"]["processed"])

    # EDA figs
    churn_distribution(df, "reports/figures/churn_dist.png")
    churn_by_contract(df, "reports/figures/churn_by_contract.png")
    churn_by_tenure_bin(df, "reports/figures/churn_by_tenure.png")

    # Eval figs
    clf = load_obj(cfg["paths"]["model"])
    X = df.drop(columns=['Churn']); y = df['Churn']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=cfg["training"]["test_size"],
        random_state=cfg["training"]["random_state"], stratify=y
    )
    eval_curves(clf, X_test, y_test, "reports/figures")
    y_pred = clf.predict(X_test)
    cm_plot(y_test, y_pred, "reports/figures/confusion_matrix.png")

    print("Saved figures to reports/figures")

if __name__ == "__main__":
    main()
```

Make sure `requirements.txt` includes `matplotlib` (nothing to add if it is already listed):

```
matplotlib
```

Run:

```bash
python scripts/visualize.py
```

---

## What plugs in where (quick map)

* **After cleaning** (`01_load_eda`): target balance, churn by contract, tenure bins.
* **After training/eval** (`03_evaluate`): ROC, PR, confusion matrix.
* **After scoring** (`04_inference_batch`): histogram of predicted churn probabilities.
* **Modular route**: `scripts/visualize.py` can generate all plots headlessly for reports/CI.
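
For the headless route, one common approach is to select matplotlib's non-interactive `Agg` backend before `pyplot` is imported, so figures are written to disk without a display. A minimal sketch under that assumption; `save_bar` and the temporary output path are hypothetical and only for illustration:

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; set before importing pyplot
import matplotlib.pyplot as plt

def save_bar(values: dict, out_path: str) -> None:
    """Write a simple bar chart to disk without opening a window (hypothetical helper)."""
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    fig, ax = plt.subplots()
    ax.bar(list(values.keys()), list(values.values()))
    ax.set_ylabel("Count")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)  # free the figure so long-running jobs do not accumulate memory

# Made-up counts, written to a temporary folder for the demo
out_file = Path(tempfile.mkdtemp()) / "figures" / "demo.png"
save_bar({"0": 5, "1": 2}, str(out_file))
print("wrote", out_file)
```

The same two backend lines could sit at the top of `scripts/visualize.py`; notebooks can keep their default inline backend.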


># Want me to also drop in a minimal `reports/README.md` that explains each figure and how to interpret it (Skill-3 friendly)?

Perfect. Here’s a **Skill-3-friendly `reports/README.md`** you can drop right into your project so your visualizations are documented and interpretable.
It’s lightweight but professional, teaching you *how to read* what you’ve built.

---

### 📁 `reports/README.md`

````markdown
# 📊 Telco Churn Reports

This folder stores all automatically generated visual outputs from your Telco Churn pipeline.  
All figures are saved under `reports/figures/` by the script:

```bash
python scripts/visualize.py
````

---

## 🔍 1. churn_dist.png

**Purpose:**
Shows the distribution of churned vs retained customers in the dataset.

* **Balanced dataset** → model won’t need heavy class weighting.
* **Imbalanced dataset (e.g., 25% churners)** → consider using stratified sampling or rebalancing later.

**Interpretation:**
If the “1” (churned) bar is much shorter than “0”, your model will need to compensate for imbalance.

---

## 📄 2. churn_by_contract.png

**Purpose:**
Visualizes how churn rate varies by contract type (Month-to-month, One year, Two year).
**Interpretation:**

* Typically, **Month-to-month customers churn more often** (less commitment).
* **Longer contracts** usually mean lower churn risk.

**Usage:**
This helps identify key retention levers (e.g., offer longer contracts or loyalty incentives).

---

## 🕒 3. churn_by_tenure.png

**Purpose:**
Shows average churn rate by tenure bins (0–6 months, 6–12, etc.).
**Interpretation:**

* **Early churn spike** = customers leaving soon after joining → onboarding issues.
* **Flat low churn** at high tenure = loyal base.

**Usage:**
A retention team can target low-tenure customers with special offers or better onboarding experiences.

---

## 🧮 4. confusion_matrix.png

**Purpose:**
Displays actual vs predicted churn outcomes.

| Term           | Meaning                              |
| -------------- | ------------------------------------ |
| True Positive  | Predicted churn and actually churned |
| True Negative  | Predicted stay and stayed            |
| False Positive | Predicted churn but stayed           |
| False Negative | Predicted stay but churned           |

**Interpretation:**

* **False Negatives** (missed churners) are typically the most costly.
* You can tune your threshold (e.g., 0.4 instead of 0.5) to reduce them.

---

## 📈 5. roc.png

**Purpose:**
Receiver Operating Characteristic curve: measures the model’s ability to separate churners from non-churners.

**Interpretation:**

* The **closer to the top-left corner**, the better.
* **AUC (area under curve)** is 0.5 for random guessing and 1.0 for perfect ranking; as a rough rule of thumb, > 0.75 is a solid baseline.

**Usage:**
A quick way to compare models objectively (LogisticRegression vs RandomForest later).

---

## ⚖️ 6. pr.png

**Purpose:**
Precision-Recall curve, useful for imbalanced data.
Shows the trade-off between catching more churners (recall) and avoiding false alarms (precision).

**Interpretation:**

* A curve that **stays close to the top-right corner** indicates a good model; a no-skill model sits on a flat line at the churn rate.
* When churners are rare, PR curves tell you more than ROC.

---

## 🎯 7. churn_proba_hist.png (optional)

**Purpose:**
Histogram of predicted churn probabilities for new customers.

**Interpretation:**

* **Sharp separation** between 0–0.3 and 0.7–1.0 = confident model.
* **Overlapping middle** = uncertain cases (worth deeper analysis).

**Usage:**
Helps decide what probability threshold to use for marketing or retention alerts.

---

## 🧩 Workflow summary

```text
data/raw/ → data/processed/ → models/ → reports/figures/
```

Each report ties back to a stage:

| Notebook/Script      | Report                                         |
| -------------------- | ---------------------------------------------- |
| 01_load_eda          | churn_dist, churn_by_contract, churn_by_tenure |
| 03_evaluate          | confusion_matrix, roc, pr                      |
| 04_inference_batch   | churn_proba_hist                               |
| scripts/visualize.py | regenerates all                                |

---

## 🧠 Next steps

* Add `feature_importance.png` (Skill Level 4) to explain *why* customers churn.
* Integrate into a dashboard (Plotly/Dash or Streamlit) at Skill 5.
* Automate `python scripts/visualize.py` after each model retrain (Skill 6+ CI/CD).

---

**Author:** You, the Telco Data Analyst
**Level:** Skill 3/10 (Intro to modular pipelines + visual interpretation)


---


># Would you like me to generate the **`feature_importance.png` step** next (using coefficients from your logistic regression pipeline and saving it into `reports/figures/`)? That’s the perfect bridge to Skill-4.

Awesome, here’s the clean Skill-3 add-on to generate **feature_importance.png** (from your Logistic Regression pipeline), plus a CSV you can inspect.

---

# Option A: One cell in a notebook (quickest)

Drop this at the end of `03_evaluate.ipynb`:

```python
import joblib, numpy as np, pandas as pd, matplotlib.pyplot as plt

clf = joblib.load("models/churn_pipeline.joblib")

# 1) Get feature names from the ColumnTransformer
prep = clf.named_steps["prep"]
feature_names = prep.get_feature_names_out()  # e.g., "num__tenure", "cat__Contract_Month-to-month"

# 2) Get coefficients from LogisticRegression
lr = clf.named_steps["model"]
coefs = lr.coef_.ravel()  # shape: (n_features,)

# 3) Assemble DataFrame
imp = (pd.DataFrame({
    "feature": feature_names,
    "coef": coefs,
    "abs_coef": np.abs(coefs),
    "sign": np.where(coefs >= 0, "+", "-")
})
.sort_values("abs_coef", ascending=False)
.reset_index(drop=True))

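# 4) Save CSV and plot top-20 (create the output folders first)
from pathlib import Path
Path("reports/figures").mkdir(parents=True, exist_ok=True)
imp.to_csv("reports/feature_importance.csv", index=False)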

top_n = 20
plot_df = imp.head(top_n).iloc[::-1]  # reverse for nicer horizontal bars
plt.figure(figsize=(8, 6))
plt.barh(plot_df["feature"], plot_df["abs_coef"])
plt.title("Top Feature Importances (|LogReg Coefficients|)")
plt.xlabel("|coefficient| (magnitude)")
plt.tight_layout()
plt.savefig("reports/figures/feature_importance.png", dpi=150)
plt.show()

imp.head(10)
```

**Interpretation tip (Skill-3):**

* Positive coefficient → increases churn risk (holding others constant).
* Negative coefficient → decreases churn risk.
* Magnitude → strength of the association (numeric features are scaled, so one unit is one standard deviation).
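
Logistic regression is linear in the log-odds, so `exp(coef)` is the multiplicative change in churn odds for a one-unit increase in a feature. For the scaled numeric columns that unit is one standard deviation; for a one-hot column it is switching the category on. A minimal sketch; the coefficients below are made up for illustration and not taken from a fitted model:

```python
import math

def odds_ratio(coef: float) -> float:
    """Multiplicative change in churn odds for a one-unit increase in the feature."""
    return math.exp(coef)

# Hypothetical coefficients, for illustration only
example_coefs = {
    "cat__Contract_Month-to-month": 0.7,
    "num__tenure": -0.7,
}
for name, coef in example_coefs.items():
    print(f"{name}: coef={coef:+.2f}, odds ratio={odds_ratio(coef):.2f}")
```

A coefficient of +0.7 roughly doubles the odds (ratio about 2.01), while -0.7 roughly halves them (about 0.50). Odds are not probabilities, so the change in predicted churn probability depends on the customer's baseline.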

---

# Option B: Modular (keeps notebooks tidy)

## 1) `src/churn/importance.py`

```python
from __future__ import annotations
from typing import Tuple
import numpy as np
import pandas as pd

def extract_feature_importance_from_logreg_pipeline(clf) -> pd.DataFrame:
    """
    Returns a DataFrame with columns: feature, coef, abs_coef, sign
    Works for a sklearn Pipeline with steps: prep (ColumnTransformer) + model (LogisticRegression)
    """
    if "prep" not in clf.named_steps or "model" not in clf.named_steps:
        raise ValueError("Pipeline must have 'prep' and 'model' steps.")
    prep = clf.named_steps["prep"]
    model = clf.named_steps["model"]
    if not hasattr(prep, "get_feature_names_out"):
        raise ValueError("Preprocessor must implement get_feature_names_out().")
    if not hasattr(model, "coef_"):
        raise ValueError("Model must expose coef_ (LogisticRegression expected).")

    feature_names = prep.get_feature_names_out()
    coefs = model.coef_.ravel()
    if len(feature_names) != len(coefs):
        raise ValueError(f"Shape mismatch: {len(feature_names)} features vs {len(coefs)} coefs.")

    df = pd.DataFrame({
        "feature": feature_names,
        "coef": coefs,
    })
    df["abs_coef"] = df["coef"].abs()
    df["sign"] = np.where(df["coef"] >= 0, "+", "-")
    df = df.sort_values("abs_coef", ascending=False).reset_index(drop=True)
    return df
```

## 2) Extend `src/churn/visualize.py` (optional helper)

```python
import matplotlib.pyplot as plt
from pathlib import Path

def plot_top_feature_importance(df_imp, out_path: str, top_n: int = 20):
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    plot_df = df_imp.head(top_n).iloc[::-1]
    plt.figure(figsize=(8, 6))
    plt.barh(plot_df["feature"], plot_df["abs_coef"])
    plt.title("Top Feature Importances (|LogReg Coefficients|)")
    plt.xlabel("|coefficient| (magnitude)")
    plt.tight_layout()
    plt.savefig(out_path, dpi=150)
    plt.close()
```

## 3) `scripts/feature_importance.py`

```python
import yaml
from pathlib import Path
from churn.io import load_obj
from churn.importance import extract_feature_importance_from_logreg_pipeline
from churn.visualize import plot_top_feature_importance

def main():
    cfg = yaml.safe_load(open("configs/config.yaml"))
    clf = load_obj(cfg["paths"]["model"])

    imp = extract_feature_importance_from_logreg_pipeline(clf)

    Path("reports").mkdir(parents=True, exist_ok=True)
    Path("reports/figures").mkdir(parents=True, exist_ok=True)
    out_csv = "reports/feature_importance.csv"
    imp.to_csv(out_csv, index=False)

    plot_top_feature_importance(imp, "reports/figures/feature_importance.png", top_n=20)
    print(f"Saved {out_csv} and reports/figures/feature_importance.png")

if __name__ == "__main__":
    main()
```

## 4) Run it

```bash
python scripts/feature_importance.py
```

---

## Add a short section to `reports/README.md`

Append this:

```markdown
---

## 🧠 8. feature_importance.png / feature_importance.csv
**What it is:**  
Top features by absolute Logistic Regression coefficient (magnitude).  
- **Positive sign** → increases churn likelihood.  
- **Negative sign** → decreases churn likelihood.

**How to use it:**  
- Pair with `churn_by_contract.png`/`churn_by_tenure.png` to validate patterns you see.  
- Use high-magnitude positives to design **retention actions** (e.g., month-to-month + paperless billing + electronic checks might indicate higher churn risk, depending on your data).

**Caveat:**  
Coefficients reflect associations after one-hot encoding & scaling; correlation ≠ causation.
```

---


># If you want, I can also generate a **threshold-tuning helper** (Skill-3+): a tiny script that scans thresholds (0.1→0.9) and prints precision/recall/FN rate so you can pick a business-friendly operating point.

Perfect. Here’s your **Skill-3+ “threshold tuning helper”**: a small but powerful addition that teaches you how to *choose the best decision cutoff* (not always 0.5!) for business goals like minimizing missed churners.

---

# ⚙️  Threshold Tuning Helper

## 📁 New file: `scripts/tune_threshold.py`

```python
import yaml
import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from churn.io import load_obj, read_csv

def main():
    cfg = yaml.safe_load(open("configs/config.yaml"))

    # Load model and data
    clf = load_obj(cfg["paths"]["model"])
    df = read_csv(cfg["paths"]["processed"])

    y = df["Churn"]
    X = df.drop(columns=["Churn"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=cfg["training"]["test_size"],
        random_state=cfg["training"]["random_state"],
        stratify=y
    )

    # Predict probabilities; ROC AUC does not depend on the threshold, so compute it once
    y_proba = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)

    # Sweep thresholds 0.1 → 0.9
    results = []
    for t in np.arange(0.1, 0.91, 0.05):
        y_pred = (y_proba >= t).astype(int)
        prec = precision_score(y_test, y_pred, zero_division=0)
        rec = recall_score(y_test, y_pred, zero_division=0)
        f1 = f1_score(y_test, y_pred, zero_division=0)
        fn_rate = 1.0 - rec  # FN / (FN + TP): share of actual churners that were missed
        results.append({
            "threshold": round(t, 2),
            "precision": round(prec, 3),
            "recall": round(rec, 3),
            "f1": round(f1, 3),
            "roc_auc": round(auc, 3),
            "false_negative_rate": round(fn_rate, 3)
        })

    df_results = pd.DataFrame(results)
    df_results.to_csv("reports/threshold_tuning.csv", index=False)
    print(df_results)
    print("\nSaved results → reports/threshold_tuning.csv")

    # Recommend threshold by F1 (balanced metric)
    best_idx = df_results["f1"].idxmax()
    best_row = df_results.loc[best_idx]
    print("\n✅ Recommended threshold by F1-score:", best_row["threshold"])
    print(best_row.to_dict())

if __name__ == "__main__":
    main()
```

---

## 🧪 Run it

```bash
python scripts/tune_threshold.py
```

**Example output** (illustrative numbers; yours will differ):

```
    threshold  precision  recall    f1  roc_auc  false_negative_rate
0         0.1      0.56    0.95  0.71    0.83                 0.05
1         0.15     0.60    0.91  0.72    0.83                 0.09
...
✅ Recommended threshold by F1-score: 0.35
{'threshold': 0.35, 'precision': 0.74, 'recall': 0.81, 'f1': 0.77, ...}
```

---

## 📈 Optional Plot (if you want visual feedback)

Add this at the end of `main()` (after `print(best_row.to_dict())`), indented to match the function body, since `df_results` only exists inside that function:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8,5))
plt.plot(df_results["threshold"], df_results["precision"], label="Precision")
plt.plot(df_results["threshold"], df_results["recall"], label="Recall")
plt.plot(df_results["threshold"], df_results["f1"], label="F1-score")
plt.title("Precision / Recall / F1 vs Threshold")
plt.xlabel("Decision Threshold")
plt.ylabel("Score")
plt.legend()
plt.tight_layout()
plt.savefig("reports/figures/threshold_curve.png", dpi=150)
plt.show()
```

---

## 📁 Update `reports/README.md`

Add:

```markdown
---

## 🎚️ 9. threshold_tuning.csv / threshold_curve.png
**Purpose:**  
Shows how precision, recall, and F1 change as you vary the churn probability cutoff.

**Why it matters:**  
- A **lower threshold** (e.g., 0.3) catches more churners but risks more false alarms.  
- A **higher threshold** (e.g., 0.7) reduces false positives but misses more real churners.  

**Interpretation:**  
Pick the threshold that fits your business:
- **Retention teams** → prefer *high recall* (catch as many churners as possible).  
- **Finance teams** → prefer *high precision* (don’t waste retention budget).  
- **Balanced** → choose the F1-optimal threshold shown in the output.

**Usage:**  
Update `configs/config.yaml` under `inference.threshold` to your chosen value.
```
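
If F1 does not match your business trade-off, another option is to pick the threshold that minimizes the total cost of mistakes. A minimal sketch in plain Python; `expected_cost`, `pick_threshold`, the cost values, and the toy labels and scores are all hypothetical and would need to be replaced with your own estimates:

```python
def expected_cost(y_true, y_proba, threshold, cost_fn=5.0, cost_fp=1.0):
    """Total cost of mistakes at one threshold; the default costs are placeholders."""
    # Missed churner: actual 1, predicted 0
    fn = sum(1 for y, p in zip(y_true, y_proba) if y == 1 and p < threshold)
    # False alarm: actual 0, predicted 1
    fp = sum(1 for y, p in zip(y_true, y_proba) if y == 0 and p >= threshold)
    return cost_fn * fn + cost_fp * fp

def pick_threshold(y_true, y_proba, thresholds, **costs):
    """Return the candidate threshold with the lowest total cost."""
    return min(thresholds, key=lambda t: expected_cost(y_true, y_proba, t, **costs))

# Toy labels and scores, for illustration only
y_true = [1, 1, 0, 0, 1, 0]
y_proba = [0.9, 0.4, 0.35, 0.1, 0.6, 0.55]
candidates = [0.3, 0.5, 0.7]

best = pick_threshold(y_true, y_proba, candidates, cost_fn=5.0, cost_fp=1.0)
print("cheapest threshold when missed churners are costly:", best)
```

With missed churners weighted five times heavier than false alarms, the lowest candidate wins on this toy data; flipping the costs pushes the choice to the highest one.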

---

## ✅ Integration recap

You now have:

| Script                          | Purpose              | Output                                           |
| ------------------------------- | -------------------- | ------------------------------------------------ |
| `scripts/train.py`              | Train model          | models/churn_pipeline.joblib                     |
| `scripts/predict.py`            | Score new data       | new_customers_scored.csv                         |
| `scripts/visualize.py`          | Generate figures     | reports/figures/*.png                            |
| `scripts/feature_importance.py` | Explain model        | feature_importance.csv, feature_importance.png   |
| **`scripts/tune_threshold.py`** | Tune decision cutoff | threshold_tuning.csv, threshold_curve.png (optional) |

---


># **Automatically update your YAML config** with the best threshold after running this script (Skill-4 feature)?

Awesome, let’s make the tuner update your YAML automatically. Two options:

# Option A: Add a `--apply` flag to the existing tuner

Drop-in replacement for `scripts/tune_threshold.py`:

```python
import argparse
import numpy as np
import pandas as pd
from pathlib import Path

# Try ruamel.yaml for comment-preserving edits; fall back to PyYAML
try:
    from ruamel.yaml import YAML
    yaml = YAML()
    yaml.preserve_quotes = True
    _USE_RUAMEL = True
except ImportError:  # only fall back when ruamel.yaml is not installed
    import yaml as pyyaml
    _USE_RUAMEL = False

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from churn.io import load_obj, read_csv

def load_config(path: str):
    if _USE_RUAMEL:
        with open(path, "r") as f:
            data = yaml.load(f)
        return data
    else:
        import yaml as pyyaml
        with open(path, "r") as f:
            return pyyaml.safe_load(f)

def save_config(cfg, path: str):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    if _USE_RUAMEL:
        with open(path, "w") as f:
            yaml.dump(cfg, f)
    else:
        import yaml as pyyaml
        with open(path, "w") as f:
            pyyaml.safe_dump(cfg, f, sort_keys=False)

def main():
    parser = argparse.ArgumentParser(description="Tune threshold and (optionally) apply best to config.")
    parser.add_argument("--config", default="configs/config.yaml")
    parser.add_argument("--apply", action="store_true", help="Write the best threshold (by --metric) back to config.")
    parser.add_argument("--metric", choices=["f1","recall","precision"], default="f1",
                        help="Metric to optimize when selecting threshold.")
    parser.add_argument("--start", type=float, default=0.1)
    parser.add_argument("--stop", type=float, default=0.9)
    parser.add_argument("--step", type=float, default=0.05)
    args = parser.parse_args()

    cfg = load_config(args.config)
    model_path = cfg["paths"]["model"]
    processed_path = cfg["paths"]["processed"]

    clf = load_obj(model_path)
    df = read_csv(processed_path)
    y = df["Churn"]; X = df.drop(columns=["Churn"])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=cfg["training"]["test_size"],
        random_state=cfg["training"]["random_state"],
        stratify=y
    )

    y_proba = clf.predict_proba(X_test)[:, 1]
    # ROC AUC does not depend on the threshold, so compute it once
    roc_auc = round(roc_auc_score(y_test, y_proba), 4)

    thresholds = np.arange(args.start, args.stop + 1e-9, args.step)
    rows = []
    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        recall = recall_score(y_test, y_pred, zero_division=0)
        rows.append({
            "threshold": round(float(t), 2),
            "precision": round(precision_score(y_test, y_pred, zero_division=0), 4),
            "recall": round(recall, 4),
            "f1": round(f1_score(y_test, y_pred, zero_division=0), 4),
            "roc_auc": roc_auc,
            # FNR = FN / (FN + TP) = 1 - recall
            "false_negative_rate": round(1 - recall, 4),
        })

    res = pd.DataFrame(rows)
    Path("reports").mkdir(parents=True, exist_ok=True)
    res.to_csv("reports/threshold_tuning.csv", index=False)
    print(res)

    best_idx = res[args.metric].idxmax()
    best_row = res.loc[best_idx]
    best_t = float(best_row["threshold"])
    print(f"\n✅ Best threshold by {args.metric.upper()}: {best_t}")
    print(best_row.to_dict())

    # Optional: plot
    try:
        import matplotlib.pyplot as plt
        Path("reports/figures").mkdir(parents=True, exist_ok=True)
        plt.figure(figsize=(8,5))
        plt.plot(res["threshold"], res["precision"], label="Precision")
        plt.plot(res["threshold"], res["recall"], label="Recall")
        plt.plot(res["threshold"], res["f1"], label="F1")
        plt.xlabel("Threshold"); plt.ylabel("Score"); plt.title("Precision/Recall/F1 vs Threshold")
        plt.legend(); plt.tight_layout()
        plt.savefig("reports/figures/threshold_curve.png", dpi=150)
        # plt.show()  # optional
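        print("Saved plot → reports/figures/threshold_curve.png")
    except ImportError:
        # matplotlib is optional; other errors (e.g. an unwritable path) should surface
        print("matplotlib not installed; skipping plot.")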

    # Apply to YAML if requested
    if args.apply:
        old_t = cfg.get("inference", {}).get("threshold", None)
        cfg.setdefault("inference", {})
        cfg["inference"]["threshold"] = round(best_t, 2)
        backup = f"{args.config}.bak"
        Path(backup).write_text(Path(args.config).read_text())
        save_config(cfg, args.config)
        print(f"\n📝 Updated {args.config} (backup at {backup})")
        print(f"Old threshold: {old_t}  →  New threshold: {cfg['inference']['threshold']}")

if __name__ == "__main__":
    main()
```

Run it:

```bash
# just tune, don’t write
python scripts/tune_threshold.py

# optimize for F1 and write back to YAML
python scripts/tune_threshold.py --apply

# optimize for recall instead (catch more churners) and write
python scripts/tune_threshold.py --metric recall --apply
```
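
One caveat on `--metric recall`: recall never goes down as the threshold drops, so maximizing it alone tends to pick the lowest threshold in the grid. A common alternative is to maximize recall subject to a precision floor. The sketch below assumes a hypothetical helper `pick_recall_with_precision_floor` and made-up numbers shaped like `reports/threshold_tuning.csv`; it is not part of the tuner above.

```python
import pandas as pd

def pick_recall_with_precision_floor(res: pd.DataFrame, min_precision: float = 0.6):
    """Among rows meeting the precision floor, return the one with the highest recall."""
    eligible = res[res["precision"] >= min_precision]
    if eligible.empty:
        return None  # no threshold satisfies the floor
    return eligible.loc[eligible["recall"].idxmax()]

# Illustrative numbers only
res = pd.DataFrame({
    "threshold": [0.2, 0.3, 0.4, 0.5],
    "precision": [0.45, 0.58, 0.66, 0.74],
    "recall":    [0.95, 0.88, 0.80, 0.69],
})
best = pick_recall_with_precision_floor(res, min_precision=0.6)
print(best["threshold"])
```

The floor value itself is a business decision; 0.6 here is just a placeholder.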

> Notes
> • If `ruamel.yaml` is installed, comments and formatting are largely preserved. Otherwise PyYAML rewrites the file and drops comments (functionally fine).
> • A `.bak` backup is created before writing.

Add to `requirements.txt` (optional but recommended to preserve YAML formatting):

```
ruamel.yaml
```

---

# Option B: Separate “apply-best” script

If you prefer to keep tuning and applying separate:

`scripts/apply_best_threshold.py`

```python
from pathlib import Path

try:
    from ruamel.yaml import YAML
    yaml = YAML()
    yaml.preserve_quotes = True
    _USE_RUAMEL = True
except ImportError:  # only fall back when ruamel.yaml is not installed
    import yaml as pyyaml
    _USE_RUAMEL = False

import pandas as pd

CFG = "configs/config.yaml"
CSV = "reports/threshold_tuning.csv"

def load_config(path):
    if _USE_RUAMEL:
        with open(path) as f: return yaml.load(f)
    else:
        import yaml as pyyaml
        with open(path) as f: return pyyaml.safe_load(f)

def save_config(cfg, path):
    if _USE_RUAMEL:
        with open(path, "w") as f: yaml.dump(cfg, f)
    else:
        import yaml as pyyaml
        with open(path, "w") as f: pyyaml.safe_dump(cfg, f, sort_keys=False)

def main():
    df = pd.read_csv(CSV)
    best = df.loc[df['f1'].idxmax()]  # or choose another column
    best_t = float(best['threshold'])
    cfg = load_config(CFG)
    old = cfg.get("inference", {}).get("threshold")
    cfg.setdefault("inference", {})
    cfg["inference"]["threshold"] = round(best_t, 2)

    backup = f"{CFG}.bak"
    Path(backup).write_text(Path(CFG).read_text())
    save_config(cfg, CFG)
    print(f"Updated {CFG} threshold: {old} → {cfg['inference']['threshold']} (backup at {backup})")

if __name__ == "__main__":
    main()
```

Run:

```bash
python scripts/tune_threshold.py      # produces CSV
python scripts/apply_best_threshold.py
```

---

## Quick sanity check flow

```bash
# 1) Train (if not already)
python scripts/train.py

# 2) Tune thresholds and apply the best by F1
python scripts/tune_threshold.py --apply

# 3) Score with the new threshold
python scripts/predict.py data/scoring/new_customers.csv
```


># Wire `predict.py` to accept a `--threshold` override at runtime (so you can A/B different cutoffs without touching YAML)?

Awesome, here’s a drop-in upgrade for `scripts/predict.py` that lets you **override the YAML threshold at runtime**.

### `scripts/predict.py` (replace the whole file)

```python
import argparse
import sys
import yaml
import pandas as pd
from pathlib import Path
from churn.io import load_obj

def main():
    parser = argparse.ArgumentParser(description="Batch score Telco churn.")
    parser.add_argument("input_csv", help="Path to CSV of records to score")
    parser.add_argument("--output", help="Optional output path (CSV). If omitted, appends _scored.csv")
    parser.add_argument("--threshold", type=float, help="Override decision threshold (0–1). If omitted, uses config.")
    parser.add_argument("--config", default="configs/config.yaml", help="Path to YAML config")
    args = parser.parse_args()

    # Load config and model
    with open(args.config) as f:
        cfg = yaml.safe_load(f)

    model_path = cfg["paths"]["model"]
    clf = load_obj(model_path)

    # Load data
    df_new = pd.read_csv(args.input_csv)

    # If user accidentally included a label column, ignore it for scoring
    for col in ("Churn", "churn", "label", "target"):
        if col in df_new.columns:
            df_new = df_new.drop(columns=[col])

    # Predict
    proba = clf.predict_proba(df_new)[:, 1]

    # Decide threshold: CLI override > YAML > default 0.5
    cfg_thresh = cfg.get("inference", {}).get("threshold", 0.5)
    threshold = args.threshold if args.threshold is not None else cfg_thresh
    if not (0.0 <= float(threshold) <= 1.0):
        print(f"ERROR: threshold must be in [0,1], got {threshold}", file=sys.stderr)
        sys.exit(2)

    pred = (proba >= float(threshold)).astype(int)

    # Assemble output
    scored = df_new.copy()
    scored["churn_proba"] = proba
    scored["churn_pred"] = pred

    # Output path
    out_path = args.output or (str(Path(args.input_csv).with_suffix("")) + "_scored.csv")
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    scored.to_csv(out_path, index=False)

    print(f"Applied threshold: {threshold}")
    print(f"Scored → {out_path}")

if __name__ == "__main__":
    main()
```

### How to use it

```bash
# Use threshold from configs/config.yaml (inference.threshold)
python scripts/predict.py data/scoring/new_customers.csv

# Override at runtime to 0.35 (no YAML change)
python scripts/predict.py data/scoring/new_customers.csv --threshold 0.35

# Write to a specific output file
python scripts/predict.py data/scoring/new_customers.csv --threshold 0.4 --output data/scoring/new_scored_040.csv
```

**Precedence:** `--threshold` (CLI) → YAML `inference.threshold` → default `0.5`.
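
If you want that precedence rule in one testable place, it can be pulled into a small helper. This is a sketch under assumptions: `resolve_threshold` is a hypothetical name, and it raises `ValueError` instead of calling `sys.exit` so it can be unit-tested.

```python
from typing import Optional

def resolve_threshold(cli_value: Optional[float], cfg: dict, default: float = 0.5) -> float:
    """Pick the threshold: CLI override > YAML inference.threshold > default."""
    value = cli_value
    if value is None:
        # `inference:` may be absent or empty in the YAML
        value = (cfg.get("inference") or {}).get("threshold")
    if value is None:
        value = default
    value = float(value)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be in [0, 1], got {value}")
    return value

print(resolve_threshold(0.35, {"inference": {"threshold": 0.4}}))  # CLI wins
print(resolve_threshold(None, {"inference": {"threshold": 0.4}}))  # YAML value
print(resolve_threshold(None, {}))                                 # default
```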


Excellent question, and it’s a **very good sign** you’re thinking about how to make your dataset more structured and reproducible.

The short answer: **not all columns need to go into a dictionary**; only the *metadata* about them (types, groups, or roles) does.

Let me explain clearly 👇

---

## 🧩 Why you *don’t* just store every column as key:value

If you did something like:

```python
columns = {
    "customerID": "string",
    "gender": "category",
    "SeniorCitizen": "int",
    ...
}
```

you’d basically be replicating your schema, which is okay, but not very flexible and hard to maintain.

Instead, you want to use a **column-role dictionary** or **schema metadata map** that tells your pipeline *how to treat* each column, not just its name.

---

## ✅ Recommended pattern (Skill 3–4 friendly)

Create a `configs/columns.yaml` that **groups columns by role and type**:

```yaml
id_cols:
  - customerID

target_col: Churn

numeric_cols:
  - tenure
  - MonthlyCharges
  - TotalCharges
  - SeniorCitizen

categorical_cols:
  - gender
  - Partner
  - Dependents
  - PhoneService
  - MultipleLines
  - InternetService
  - OnlineSecurity
  - OnlineBackup
  - DeviceProtection
  - TechSupport
  - StreamingTV
  - StreamingMovies
  - Contract
  - PaperlessBilling
  - PaymentMethod
```

That’s it.
You’re telling your preprocessing step:

> “Here’s how to handle each group of columns.”

Then your code reads that and uses it dynamically.

---

## 🧱 Example: how you’d use it in code

```python
import yaml
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load column groups (the context manager closes the file handle)
with open("configs/columns.yaml") as f:
    cfg_cols = yaml.safe_load(f)

num_cols = cfg_cols["numeric_cols"]
cat_cols = cfg_cols["categorical_cols"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(with_mean=False), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), cat_cols)
])
```

This makes your pipeline **declarative**: no hardcoded column names inside notebooks or scripts.
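
Because the column lists now live outside the code, it helps to sanity-check them before building the transformer. The sketch below is a hypothetical helper (`check_column_groups` is not part of the project) and uses a plain dict in place of the loaded YAML so it runs on its own.

```python
def check_column_groups(cfg_cols: dict, df_columns) -> list:
    """Return a list of human-readable problems; an empty list means the groups look fine."""
    problems = []
    num = set(cfg_cols.get("numeric_cols", []))
    cat = set(cfg_cols.get("categorical_cols", []))
    overlap = num & cat
    if overlap:
        problems.append(f"Listed as both numeric and categorical: {sorted(overlap)}")
    missing = (num | cat) - set(df_columns)
    if missing:
        problems.append(f"In YAML but not in the data: {sorted(missing)}")
    return problems

# Plain dict standing in for yaml.safe_load(...) on configs/columns.yaml
cfg_cols = {
    "numeric_cols": ["tenure", "MonthlyCharges", "SeniorCitizen"],
    "categorical_cols": ["gender", "Contract", "SeniorCitizen"],
}
df_columns = ["tenure", "MonthlyCharges", "SeniorCitizen", "gender"]

for problem in check_column_groups(cfg_cols, df_columns):
    print(problem)
```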

---

## 🧠 Why it matters later

| Benefit             | Why it helps                                                            |
| ------------------- | ----------------------------------------------------------------------- |
| **Reproducibility** | You can change columns in YAML instead of code.                         |
| **Versioning**      | You can track schema drift over time (new columns, renamed features).   |
| **Portability**     | Works the same on local, CI/CD, and cloud.                              |
| **Validation**      | You can add checks to ensure new data has these columns before scoring. |

---

## ⚙️ Optional (Skill-4+)

If you want to go one step further later, create a **schema dictionary** inside a YAML or JSON for validation:

```yaml
schema:
  customerID: {type: string, nullable: false}
  gender: {type: category, allowed: [Male, Female]}
  tenure: {type: int, min: 0, max: 100}
  TotalCharges: {type: float, nullable: true}
  Churn: {type: category, allowed: [Yes, No]}
```

Then, before training or inference, you can validate incoming data programmatically (using `pandera`, `great_expectations`, or your own checks).
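
For the "your own checks" route, a hand-rolled validator can be quite small. The sketch below makes a few assumptions: the schema is written as a plain dict mirroring part of the YAML above, columns without a `nullable` key are treated as non-nullable, and `check_schema` is a hypothetical name. It only covers `nullable`, `allowed`, `min`, and `max`.

```python
import pandas as pd

# Mirrors part of the YAML schema above so the sketch runs without a file
SCHEMA = {
    "gender": {"type": "category", "allowed": ["Male", "Female"]},
    "tenure": {"type": "int", "min": 0, "max": 100},
    "TotalCharges": {"type": "float", "nullable": True},
}

def check_schema(df: pd.DataFrame, schema: dict) -> list:
    """Return a list of rule violations; an empty list means the frame passes."""
    errors = []
    for col, rule in schema.items():
        if col not in df.columns:
            errors.append(f"{col}: missing")
            continue
        values = df[col].dropna()
        # Assumption: a column is non-nullable unless the schema says otherwise
        if not rule.get("nullable", False) and df[col].isna().any():
            errors.append(f"{col}: contains nulls")
        if "allowed" in rule:
            bad = set(values.unique()) - set(rule["allowed"])
            if bad:
                errors.append(f"{col}: unexpected values {sorted(bad)}")
        if "min" in rule and (values < rule["min"]).any():
            errors.append(f"{col}: below min {rule['min']}")
        if "max" in rule and (values > rule["max"]).any():
            errors.append(f"{col}: above max {rule['max']}")
    return errors

# Toy frame with two deliberate problems
df = pd.DataFrame({
    "gender": ["Male", "Female", "F"],
    "tenure": [1, 24, 130],
    "TotalCharges": [29.85, None, 540.0],
})
for error in check_schema(df, SCHEMA):
    print(error)
```

A library such as `pandera` gives you richer checks and better error reports; this only shows the shape of the idea.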

---

### TL;DR

| Approach                                                  | When to use                      |
| --------------------------------------------------------- | -------------------------------- |
| Simple list of column groups (categorical/numeric/target) | ✅ Now (Skill 3)                  |
| Full schema with rules (types, ranges, allowed values)    | 🧠 Later (Skill 4–5)              |
| Full data dictionary for documentation                    | 📖 Optional, for human reference |

---


># Would you like me to show you a **`data_dictionary.yaml`** template that sits next to your config, documenting what each column *means* and *how it’s used in modeling*? It’s an ideal next step before you reach Skill Level 4.

Perfect, this is *exactly* what a professional pipeline does next.
At **Skill Level 3+**, your model runs.
At **Skill Level 4**, you start documenting *why* each feature exists and how to interpret it.

Here’s your starter **`configs/data_dictionary.yaml`**: lightweight, human-readable, and designed to grow with your project.

---

## 📘 `configs/data_dictionary.yaml`

```yaml
# Telco Customer Churn Data Dictionary
# Purpose: Describe every column’s meaning, role, and modeling behavior.

dataset:
  name: IBM Telco Customer Churn
  source: https://www.ibm.com/communities/analytics/watson-analytics-blog/guide-to-sample-datasets/
  description: >
    Customer demographic, service, and account data from a fictional telco provider.
    Target variable indicates whether the customer left (churned).

columns:

  customerID:
    description: Unique customer identifier
    dtype: string
    role: id
    used_in_model: false

  gender:
    description: Gender of the customer (Male/Female)
    dtype: category
    role: demographic
    used_in_model: true
    notes: Can reveal subtle behavioral differences, usually weak predictor.

  SeniorCitizen:
    description: Indicates if the customer is a senior (1) or not (0)
    dtype: int
    role: demographic
    used_in_model: true

  Partner:
    description: Whether the customer has a partner (Yes/No)
    dtype: category
    role: demographic
    used_in_model: true

  Dependents:
    description: Whether the customer has dependents (Yes/No)
    dtype: category
    role: demographic
    used_in_model: true

  tenure:
    description: Number of months the customer has stayed with the company
    dtype: int
    role: account
    used_in_model: true
    notes: Strongly predictive; longer tenure → lower churn.

  PhoneService:
    description: Whether the customer has a phone service
    dtype: category
    role: product
    used_in_model: true

  MultipleLines:
    description: Whether the customer has multiple lines
    dtype: category
    role: product
    used_in_model: true

  InternetService:
    description: Type of Internet service (DSL/Fiber optic/No)
    dtype: category
    role: product
    used_in_model: true

  OnlineSecurity:
    description: If customer has online security add-on
    dtype: category
    role: addon
    used_in_model: true

  OnlineBackup:
    description: If customer has online backup add-on
    dtype: category
    role: addon
    used_in_model: true

  DeviceProtection:
    description: If customer has device protection plan
    dtype: category
    role: addon
    used_in_model: true

  TechSupport:
    description: If customer has technical support add-on
    dtype: category
    role: addon
    used_in_model: true

  StreamingTV:
    description: If customer streams TV
    dtype: category
    role: addon
    used_in_model: true

  StreamingMovies:
    description: If customer streams movies
    dtype: category
    role: addon
    used_in_model: true

  Contract:
    description: Type of contract (Month-to-month/One year/Two year)
    dtype: category
    role: account
    used_in_model: true
    notes: One of the top predictors of churn.

  PaperlessBilling:
    description: Whether customer uses paperless billing
    dtype: category
    role: billing
    used_in_model: true

  PaymentMethod:
    description: Payment method used (Electronic check, Mailed check, etc.)
    dtype: category
    role: billing
    used_in_model: true

  MonthlyCharges:
    description: Monthly amount charged to customer
    dtype: float
    role: billing
    used_in_model: true

  TotalCharges:
    description: Total amount charged to date
    dtype: float
    role: billing
    used_in_model: true

  Churn:
    description: Target variable (Yes = customer left)
    dtype: category
    role: target
    used_in_model: true
    notes: Binary encoded 1 = Yes, 0 = No
```

---

## 🔧 How to use it

### Read it in your notebooks or scripts

```python
import yaml

with open("configs/data_dictionary.yaml") as f:
    data_dict = yaml.safe_load(f)

# List every modeled column with its role and dtype
for col, meta in data_dict["columns"].items():
    if meta.get("used_in_model"):
        print(f"{col:<20} → {meta['role']} ({meta['dtype']})")
```

### Cross-check your processed DataFrame

```python
missing = [c for c in data_dict["columns"].keys() if c not in df.columns]
if missing:
    print("⚠️ Missing columns:", missing)
```

### Export for documentation

You can even render it to Markdown:

```python
import pandas as pd

# DataFrame.to_markdown requires the optional `tabulate` package
df_doc = pd.DataFrame.from_dict(data_dict["columns"], orient="index")
df_doc.to_markdown("reports/data_dictionary.md")
```

---

## 💡 Why this matters

| Benefit               | Explanation                                                                |
| --------------------- | -------------------------------------------------------------------------- |
| **Transparency**      | Anyone (teammate, reviewer, interviewer) can see what each variable means. |
| **Governance**        | Easy to check if you’re using sensitive or personal data.                  |
| **Automation**        | Pipelines can validate dtype, inclusion, or handle roles programmatically. |
| **Feature selection** | You can mark features as `used_in_model: false` to exclude them cleanly.   |
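
To make the feature-selection row concrete, the feature list can be derived from the dictionary instead of being typed by hand. A minimal sketch, assuming the dictionary layout shown above and a hypothetical helper name `model_features`; the inline dict stands in for the loaded YAML.

```python
def model_features(data_dict: dict) -> list:
    """Columns flagged used_in_model, excluding the target column."""
    return [
        col
        for col, meta in data_dict["columns"].items()
        if meta.get("used_in_model") and meta.get("role") != "target"
    ]

# Trimmed-down stand-in for configs/data_dictionary.yaml
data_dict = {
    "columns": {
        "customerID": {"role": "id", "used_in_model": False},
        "tenure": {"role": "account", "used_in_model": True},
        "Contract": {"role": "account", "used_in_model": True},
        "Churn": {"role": "target", "used_in_model": True},
    }
}
print(model_features(data_dict))
```

Flipping one `used_in_model` flag in the YAML then changes the feature set everywhere this helper is used.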

---


># How to **auto-generate this dictionary** from your dataset (so you don’t have to type every column manually), then edit and save it as YAML? That’s a great bridge to Skill Level 4.

Excellent. This next step takes you from **manual documentation → automated metadata generation**, a key **Skill Level 4** move in any professional data engineering or ML workflow.

We’ll build a script that **auto-scans your dataset**, creates a structured data dictionary (with guesses for dtype, role, and inclusion), and saves it as `configs/data_dictionary.yaml`. Note that it overwrites any existing file at that path.
You can then review and edit it manually once for clarity.

---

# ⚙️ Step-by-Step: Auto-Generate Data Dictionary

### 📁 File: `scripts/generate_data_dictionary.py`

```python
import pandas as pd
import yaml
from pathlib import Path

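def guess_role(col: str) -> str:
    """Basic heuristic for role classification; review the result by hand."""
    name = col.lower()
    # Suffix match avoids false hits inside words such as "provider"
    if name.endswith("id"):
        return "id"
    if "charge" in name or "payment" in name or "bill" in name:
        return "billing"
    if "contract" in name or "tenure" in name:
        return "account"
    if name in ("gender", "partner", "dependents", "seniorcitizen"):
        return "demographic"
    if "stream" in name or "online" in name or "device" in name or "tech" in name:
        return "addon"
    # Core services; without this they fall into "other" and get used_in_model: false
    if "phone" in name or "lines" in name or "internet" in name:
        return "product"
    if "churn" in name:
        return "target"
    return "other"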

def dtype_role(df: pd.DataFrame):
    """Infer dtype categories: numeric vs categorical."""
    num_cols = df.select_dtypes(include=["number"]).columns
    return {c: ("float" if pd.api.types.is_float_dtype(df[c]) else "int") for c in num_cols}

def main():
    data_path = "data/processed/telco_clean.csv"
    out_path = "configs/data_dictionary.yaml"
    df = pd.read_csv(data_path)
    Path("configs").mkdir(exist_ok=True)

    # Base metadata
    dict_yaml = {
        "dataset": {
            "name": "Telco Customer Churn (Auto-Generated)",
            "source": str(data_path),
            "description": "Auto-generated metadata; please review descriptions and roles manually.",
        },
        "columns": {},
    }

    num_map = dtype_role(df)
    for col in df.columns:
        inferred = {}
        if col in num_map:
            inferred["dtype"] = num_map[col]
        else:
            inferred["dtype"] = "category"

        inferred["role"] = guess_role(col)
        inferred["used_in_model"] = False if inferred["role"] in ("id", "other") else True
        inferred["description"] = f"Auto-generated description for {col}"
        dict_yaml["columns"][col] = inferred

    # Write YAML
    with open(out_path, "w") as f:
        yaml.safe_dump(dict_yaml, f, sort_keys=False)

    print(f"✅ Saved data dictionary to {out_path}")
    print("Columns detected:", len(df.columns))
    print("Roles summary:")
    roles = pd.Series([v["role"] for v in dict_yaml["columns"].values()]).value_counts()
    print(roles)

if __name__ == "__main__":
    main()
```

---

### 🧪 Run it

```bash
python scripts/generate_data_dictionary.py
```

**Output example (illustrative; exact counts depend on your columns and the role heuristic):**

```
✅ Saved data dictionary to configs/data_dictionary.yaml
Columns detected: 21
Roles summary:
billing         5
demographic     4
addon           6
account         2
target          1
id              1
other           2
```

---

### 🧾 What you get

```yaml
dataset:
  name: Telco Customer Churn (Auto-Generated)
  source: data/processed/telco_clean.csv
  description: Auto-generated metadata; please review descriptions and roles manually.

columns:
  customerID:
    dtype: category
    role: id
    used_in_model: false
    description: Auto-generated description for customerID
  gender:
    dtype: category
    role: demographic
    used_in_model: true
    description: Auto-generated description for gender
  ...
```

---

### 🪄 Next (manual curation)

1. Open `configs/data_dictionary.yaml`.
2. Replace the “Auto-generated description for …” with clear, human-readable text (you can copy from the version I gave you earlier).
3. Adjust `role` or `used_in_model` where needed.

This hybrid approach saves time and keeps your schema **self-documenting**.
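
One thing to watch: the generator writes `configs/data_dictionary.yaml` in `"w"` mode, so re-running it replaces your curated descriptions. A possible way around that is to merge the fresh output with the existing file before saving. The sketch below works on plain dicts (the `columns` sections of both files) and `merge_dictionaries` is a hypothetical helper, not part of the script above.

```python
def merge_dictionaries(generated: dict, curated: dict) -> dict:
    """Start from freshly generated columns, but keep curated fields where they exist."""
    merged = {}
    for col, auto_meta in generated.items():
        # Curated values override auto-generated ones for the same key
        merged[col] = {**auto_meta, **curated.get(col, {})}
    return merged

generated = {
    "tenure": {"dtype": "int", "role": "account",
               "description": "Auto-generated description for tenure"},
    "FiberSpeed": {"dtype": "float", "role": "other",
                   "description": "Auto-generated description for FiberSpeed"},
}
curated = {
    "tenure": {"description": "Number of months the customer has stayed with the company"},
}

merged = merge_dictionaries(generated, curated)
print(merged["tenure"]["description"])      # curated text survives
print(merged["FiberSpeed"]["description"])  # new column keeps the placeholder
```

Columns that exist only in the curated file are dropped here on purpose, since they are no longer in the data; keep them instead if you want a record of removed features.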

---

### 💡 Optional extensions (Skill 4.5)

| Enhancement                     | Purpose                                                                          |
| ------------------------------- | -------------------------------------------------------------------------------- |
| **Validate data vs dictionary** | Ensure every required column is present and has the right dtype before training. |
| **Generate Markdown table**     | Auto-export to `reports/data_dictionary.md` for project docs.                    |
| **Integrate with `pandera`**    | Enforce schema in code for clean data contracts.                                 |

---

Would you like me to show how to **validate your dataset against this generated dictionary** (e.g., check missing columns, wrong dtypes, unexpected values) before training?


># WHY would I do this?

That’s a really sharp and important question, because it gets to the *why* behind doing all this extra metadata work.
Let’s unpack it clearly:

---

## 🎯 The Real Purpose of a Data Dictionary (and Auto-Generation)

You’re already at the point (Skill Level 3–4) where your **Telco churn pipeline runs reliably**.
The next step isn’t about making it “prettier”; it’s about making it **robust**, **explainable**, and **reusable**.

### 🧱 In short:

You do this to turn your notebook experiment into a **production-grade, living dataset** that anyone (or any script) can understand, validate, and extend without breaking your model.

---

## 🧩 1. For **data validation**

Your model *assumes* certain columns exist, have specific data types, and use known categories.
If the input data ever changes (new column names, nulls, or types), your model can silently break or give bad predictions.

**Example:**

* “TotalCharges” suddenly becomes “Total_Charges”
* “SeniorCitizen” starts coming in as `"Yes"/"No"` instead of `0/1`

✅ With a data dictionary, you can write a short validation step that checks:

```python
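for col, meta in data_dict["columns"].items():
    assert col in df.columns, f"Missing column: {col}"
    # Text columns load as `object`, so only compare dtype names for numerics
    if meta["dtype"] in ("int", "float"):
        assert str(df[col].dtype).startswith(meta["dtype"]), f"Wrong type for {col}"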
```

→ You catch bad data *before* it corrupts your model.

---

## 📊 2. For **interpretability & explainability**

When you later show feature importances or SHAP values, people will ask:

> “What exactly does `OnlineSecurity_No` mean?”

If you have your `data_dictionary.yaml`, you can instantly explain it.
It also helps you (and future teammates) remember *why* certain variables are in the model at all.

---

## üë©‚Äçüíª 3. For **automation & pipelines**

In bigger projects, multiple scripts need the same info:

* Which columns are numeric?
* Which are categorical?
* What‚Äôs the target variable?

Instead of repeating those lists everywhere, the dictionary acts as a **single source of truth**.
If the schema changes, you update one YAML and every part of the pipeline (train, score, viz) stays consistent.

---

## 🔍 4. For **schema drift detection**

Over time, the dataset might evolve (new columns, renamed ones).
With a saved dictionary, you can automatically compare the new schema vs old and alert:

```
⚠️  Detected new column: FiberSpeed
⚠️  Column dropped: DeviceProtection
```

That’s data engineering 101 in production environments.
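
The comparison itself is just two set differences. A minimal sketch, assuming you already have the column names from the saved dictionary and from the new file; `schema_diff` and the example column lists are hypothetical.

```python
def schema_diff(dictionary_cols, data_cols) -> dict:
    """Compare documented columns with the columns actually present in the data."""
    expected, actual = set(dictionary_cols), set(data_cols)
    return {
        "new": sorted(actual - expected),
        "dropped": sorted(expected - actual),
    }

# Example inputs: dictionary keys vs. df.columns of a newer extract
dictionary_cols = ["customerID", "tenure", "DeviceProtection", "Churn"]
data_cols = ["customerID", "tenure", "FiberSpeed", "Churn"]

diff = schema_diff(dictionary_cols, data_cols)
for col in diff["new"]:
    print(f"Detected new column: {col}")
for col in diff["dropped"]:
    print(f"Column dropped: {col}")
```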

---

## 🔐 5. For **compliance and governance**

Even in small projects, it’s good practice to label columns that might be sensitive (e.g., demographics).
Later, if you ever deploy your model publicly or collaborate, you’ll already have clear documentation of what data you use and why.

---

## 📘 In human terms:

You’re future-proofing your work.
Today it’s *your* dataset and you remember everything.
Three months (or one teammate) from now, you won’t.
The dictionary is your *memory*: for you, your model, and anyone else who touches it.

---

### TL;DR

You create and maintain a data dictionary to:

| Reason             | Benefit                                                  |
| ------------------ | -------------------------------------------------------- |
| ✅ Data validation  | Catch missing/malformed columns before they cause errors |
| 🧠 Explainability  | Understand and explain what features mean                |
| 🧩 Reusability     | Keep column roles/types consistent across scripts        |
| 🔍 Drift detection | Alert when schema changes unexpectedly                   |
| 📋 Governance      | Document what data you use and why                       |

---



># Would you like me to show a **short validation script** that checks your current Telco dataset against your dictionary (so you can see that benefit in action)?

Perfect, here’s your compact, no-nonsense **data-validation script** for your Telco churn project, followed by a clear explanation of what *bad data* looks like in this context.

---

## ✅ `scripts/validate_dataset.py`

```python
import yaml
import pandas as pd

# 1️⃣ Load the reference data dictionary and dataset
with open("configs/data_dictionary.yaml") as f:
    data_dict = yaml.safe_load(f)
df = pd.read_csv("data/processed/telco_clean.csv")

# 2️⃣ Loop through expected columns
for col, meta in data_dict["columns"].items():
    if col not in df.columns:
        print(f"❌ Missing column: {col}")
        continue
    # 3️⃣ Check dtype consistency
    expected = meta["dtype"]
    actual = str(df[col].dtype)
    if expected.startswith("int") and "int" not in actual:
        print(f"⚠️  {col}: expected int, found {actual}")
    elif expected.startswith("float") and "float" not in actual:
        print(f"⚠️  {col}: expected float, found {actual}")
    # ID columns are unique by design, so skip the cardinality check for them
    elif expected == "category" and meta.get("role") != "id" and df[col].nunique() > 50:
        print(f"⚠️  {col}: too many unique values ({df[col].nunique()}) for categorical")

# 4️⃣ Optional: Check for unexpected extra columns
extra = set(df.columns) - set(data_dict["columns"].keys())
if extra:
    print(f"⚠️  Unexpected extra columns: {extra}")
```

---

### 🧠 What this does

* **Step 1:** Loads your saved data dictionary and the cleaned Telco dataset.
* **Steps 2–3:**

  * Flags **missing columns** (`❌ Missing column: TotalCharges`)
  * Flags **type mismatches** (e.g., expected `float`, got `object`)
  * Warns if a “categorical” feature has too many unique values (usually a sign of dirty values or an ID-like column).
* **Step 4:** Warns about **unexpected new columns** not defined in the dictionary (potential schema drift).

---

## ⚠️ Examples of “bad data” in your Telco dataset

Here’s what bad data looks like here, and *why* it matters. The validator above catches the first five rows; label drift and NaN inflation need dedicated checks:

| Type of issue              | Example                                               | Why it’s bad                                                      |
| -------------------------- | ----------------------------------------------------- | ----------------------------------------------------------------- |
| **Missing column**         | `TotalCharges` missing entirely                       | Model can’t compute key billing features → predictions unreliable |
| **Type mismatch**          | `SeniorCitizen` read as “Yes/No” instead of 0/1       | Scaler receives text instead of numbers, breaking preprocessing   |
| **Unexpected new column**  | `FiberSpeed` appears unexpectedly                     | Indicates schema drift → model trained on different features      |
| **Categorical explosion**  | `PaymentMethod` suddenly has 120 unique values        | Likely dirty data (typos, inconsistent naming)                    |
| **Numeric stored as text** | `MonthlyCharges` looks like “$70.85”                  | Fails scaling step in preprocessing pipeline                      |
| **Target label drift**     | `Churn` values “Y”/“N” instead of “Yes”/“No”          | Target encoding fails or mislabels data                           |
| **NaN inflation**          | `TotalCharges` has 10% missing values (previously 0%) | Indicates upstream extraction or cleaning issue                   |
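
The last two rows of the table are not dtype problems, so they need their own checks. A possible sketch follows; `check_labels_and_nulls`, the allowed label set, and the 5% null limit are illustrative assumptions, not values taken from the project config.

```python
import pandas as pd

def check_labels_and_nulls(df, target="Churn", allowed=("Yes", "No"), null_limits=None):
    """Return warnings for unexpected target labels and for null rates above a limit."""
    warnings = []
    unexpected = set(df[target].dropna().unique()) - set(allowed)
    if unexpected:
        warnings.append(f"{target}: unexpected labels {sorted(unexpected)}")
    for col, limit in (null_limits or {}).items():
        rate = df[col].isna().mean()
        if rate > limit:
            warnings.append(f"{col}: null rate {rate:.0%} exceeds {limit:.0%}")
    return warnings

# Toy frame showing both problems
df = pd.DataFrame({
    "Churn": ["Yes", "No", "Y", "No"],
    "TotalCharges": [29.85, None, None, 108.15],
})
for warning in check_labels_and_nulls(df, null_limits={"TotalCharges": 0.05}):
    print(warning)
```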

---

### 🚦 What to do when validation fails

* **Missing columns:** Check your data extraction step or upstream schema.
* **Wrong dtype:** Fix with `pd.to_numeric()` or category encoding before retraining.
* **Unexpected extras:** Decide if you should update the dictionary (new feature) or drop the column.
* **Category blow-up:** Standardize category values (strip whitespace, unify case).
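
The dtype and category fixes above can be sketched as two small helpers. This is a minimal sketch, not part of the validator itself: the names `to_numeric_clean` and `standardize_categories` and the `canonical` argument are hypothetical, and it assumes that currency symbols, thousands separators, and stray whitespace are the only noise in the numeric text.

```python
import pandas as pd

def to_numeric_clean(s: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then coerce to numbers (unparseable -> NaN)."""
    cleaned = s.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

def standardize_categories(s: pd.Series, canonical: list) -> pd.Series:
    """Strip whitespace and map case-insensitive matches to a canonical spelling."""
    lookup = {c.casefold(): c for c in canonical}

    def fix(value):
        if not isinstance(value, str):
            return value  # leave missing values untouched
        value = value.strip()
        return lookup.get(value.casefold(), value)

    return s.map(fix)

charges = to_numeric_clean(pd.Series(["$70.85", " 20 ", "", None]))
contract = standardize_categories(pd.Series([" yes", "NO ", "Maybe", None]), ["Yes", "No"])
print(charges.tolist())
print(contract.tolist())
```

Values that cannot be mapped (like `"Maybe"` here) are left in place on purpose, so the validator can still flag them instead of the cleaning step hiding them.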

---

### TL;DR

This 10-line check is your *early-warning system*:

> It doesn’t stop bad data from existing; it stops bad data from reaching your model.



Would you like me to expand this validator into a short reusable **`validate_dataframe(df, data_dict)` function** (so you can call it before every train or predict step automatically)?

Perfect. This is exactly how you start moving toward **automated data quality gates**, a key Skill-4 milestone.
Here’s a clean, reusable function you can drop into your project to validate any dataset against your `data_dictionary.yaml`.

---

## ✅ File: `src/churn/validate.py`

```python
import pandas as pd
import yaml

def validate_dataframe(df: pd.DataFrame, dict_path: str = "configs/data_dictionary.yaml") -> bool:
    """
    Validate a DataFrame against a YAML data dictionary.
    Returns True if all checks pass, False otherwise.
    """
    # Context manager so the file handle is closed promptly
    with open(dict_path) as f:
        data_dict = yaml.safe_load(f)
    ok = True

    # 1️⃣ Check each expected column
    for col, meta in data_dict["columns"].items():
        if col not in df.columns:
            print(f"❌ Missing column: {col}")
            ok = False
            continue

        expected = meta.get("dtype", "")
        actual = str(df[col].dtype)

        # Basic dtype consistency check
        if expected.startswith("int") and "int" not in actual:
            print(f"⚠️  {col}: expected int, found {actual}")
            ok = False
        elif expected.startswith("float") and "float" not in actual:
            print(f"⚠️  {col}: expected float, found {actual}")
            ok = False
        elif expected == "category" and df[col].nunique() > 50:
            print(f"⚠️  {col}: too many unique values ({df[col].nunique()}) for categorical")
            ok = False

    # 2️⃣ Check for unexpected new columns
    extra = set(df.columns) - set(data_dict["columns"].keys())
    if extra:
        print(f"⚠️  Unexpected extra columns: {sorted(extra)}")
        ok = False

    # 3️⃣ Check for NaN spikes
    nan_ratio = df.isna().mean()
    high_nan = nan_ratio[nan_ratio > 0.05]  # >5% missing
    if not high_nan.empty:
        print("⚠️  Columns with >5% missing values:")
        print(high_nan)
        ok = False

    return ok
```

---

## 🧪 Example use

In any notebook or script (like `train.py` or `predict.py`):

```python
from churn.validate import validate_dataframe
import pandas as pd

df = pd.read_csv("data/processed/telco_clean.csv")

if validate_dataframe(df):
    print("✅ Data passed validation, safe to proceed.")
else:
    print("❌ Data failed validation; check messages above.")
```

---

## 🧠 Why this is valuable

* ✅ **Reusable:** You can call this before training, scoring, or visualization.
* 🔄 **Consistent:** Uses the same `data_dictionary.yaml` that the rest of your pipeline references.
* 🛑 **Fail-fast:** Alerts you before you fit a model on corrupted or shifted data.
* 🧩 **Extensible:** You can easily add extra checks (e.g., allowed category values, numeric ranges).
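
As an illustration of the “numeric ranges” extension, here is a minimal sketch. The helper name `check_numeric_ranges` and the `ranges` mapping are hypothetical, and the bounds are illustrative values rather than anything defined in your dictionary.

```python
import pandas as pd

def check_numeric_ranges(df: pd.DataFrame, ranges: dict) -> list:
    """Return one message per column that has values outside its (low, high) bounds."""
    problems = []
    for col, (low, high) in ranges.items():
        if col not in df.columns:
            continue  # missing columns are reported by the column check instead
        values = pd.to_numeric(df[col], errors="coerce")
        n_bad = int(((values < low) | (values > high)).sum())
        if n_bad:
            problems.append(f"{col}: {n_bad} value(s) outside [{low}, {high}]")
    return problems

sample = pd.DataFrame({"tenure": [1, 80, -3], "MonthlyCharges": [20.0, 50.0, 99.0]})
problems = check_numeric_ranges(sample, {"tenure": (0, 72), "MonthlyCharges": (0, 200)})
for message in problems:
    print("⚠️ ", message)
```

Returning messages instead of printing inside the helper keeps it easy to unit-test and lets the caller decide whether a range violation is a warning or a hard failure.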

---



## 🧨 What counts as “bad data” here

Running `validate_dataframe()` will flag the same issues as before:

* **Missing or renamed columns**
* **Numeric fields stored as strings**
* **Exploded category values** (e.g., 100 unique `PaymentMethod`s)
* **Excessive missing values**
* **Unexpected extra columns**

That means your model won’t silently train or predict on wrong or mismatched data again.

---


># show how to integrate this validator directly inside your **`train.py`** and **`predict.py`** scripts so every run auto-checks the data first?

Absolutely. Here are clean, drop-in upgrades that wire validation into both training and prediction. I’ve also refined the validator to support **mode-aware checks** (train vs predict), so scoring data isn’t forced to include the target or ID columns.

---

# 1) Replace `src/churn/validate.py` (mode-aware)

```python
# src/churn/validate.py
import pandas as pd
import yaml

def _load_dict(dict_path: str):
    # Context manager so the file handle is closed promptly
    with open(dict_path) as f:
        return yaml.safe_load(f)

def _expected_columns(data_dict, mode: str):
    cols_meta = data_dict["columns"]
    if mode == "train":
        # Expect everything listed in the dictionary
        return list(cols_meta.keys())
    elif mode == "predict":
        # Only columns actually used for modeling (no target, no IDs)
        used = []
        for c, meta in cols_meta.items():
            if not meta.get("used_in_model", False):
                continue
            if meta.get("role") in ("target", "id"):
                continue
            used.append(c)
        return used
    else:
        raise ValueError("mode must be 'train' or 'predict'")

def validate_dataframe(
    df: pd.DataFrame,
    dict_path: str = "configs/data_dictionary.yaml",
    mode: str = "train",
    categorical_cardinality_cap: int = 50,
    nan_warn_ratio: float = 0.05,
) -> bool:
    """
    Validate a DataFrame against a YAML data dictionary.

    mode='train'   → require all columns in dictionary.
    mode='predict' → require only features with used_in_model=True and role not in {target,id}.

    Returns True if all checks pass, False otherwise.
    """
    data_dict = _load_dict(dict_path)
    cols_meta = data_dict["columns"]
    ok = True

    expected_cols = set(_expected_columns(data_dict, mode))
    df_cols = set(df.columns)

    # Missing & unexpected columns
    missing = expected_cols - df_cols
    if missing:
        print(f"❌ Missing required columns ({mode}): {sorted(missing)}")
        ok = False

    unexpected = df_cols - set(cols_meta.keys())
    if unexpected and mode == "train":
        # Unexpected at train time is more dangerous
        print(f"⚠️  Unexpected extra columns present: {sorted(unexpected)}")
        ok = False

    # Dtype & simple categorical checks only for columns we both expect and have
    check_cols = expected_cols & df_cols
    for col in sorted(check_cols):
        meta = cols_meta.get(col, {})
        expected_dtype = meta.get("dtype", "")
        actual_dtype = str(df[col].dtype)

        # Basic dtype consistency
        if expected_dtype.startswith("int") and "int" not in actual_dtype:
            print(f"⚠️  {col}: expected int, found {actual_dtype}")
            ok = False
        elif expected_dtype.startswith("float") and "float" not in actual_dtype:
            print(f"⚠️  {col}: expected float, found {actual_dtype}")
            ok = False
        elif expected_dtype == "category":
            nunique = df[col].nunique(dropna=True)
            if nunique > categorical_cardinality_cap:
                print(f"⚠️  {col}: too many unique values for categorical ({nunique} > {categorical_cardinality_cap})")
                ok = False
            # Optional: enforce allowed categories if provided
            allowed = meta.get("allowed")
            if allowed is not None:
                bad = set(df[col].dropna().unique()) - set(allowed)
                if bad:
                    # Sort by str so mixed-type values cannot raise TypeError
                    print(f"⚠️  {col}: unexpected categories {sorted(bad, key=str)}")
                    ok = False

    # Missingness check (warn for > nan_warn_ratio)
    nan_ratio = df[list(check_cols)].isna().mean()
    high_nan = nan_ratio[nan_ratio > nan_warn_ratio]
    if not high_nan.empty:
        print("⚠️  Columns with high missingness:")
        for c, r in high_nan.sort_values(ascending=False).items():
            print(f"   - {c}: {r:.1%} missing")
        ok = False

    return ok
```
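
To make the mode-aware selection concrete, here is a minimal sketch of what a loaded dictionary might look like and which columns each mode would require. The keys `columns`, `dtype`, `role`, `used_in_model`, and `allowed` are the ones the validator reads; the specific column entries and the `"feature"` role value are assumptions for illustration, and the dictionary is written as a plain Python dict standing in for the result of `yaml.safe_load`.

```python
# Hypothetical content: what yaml.safe_load(...) could return for a small
# configs/data_dictionary.yaml. Column entries are illustrative only.
data_dict = {
    "columns": {
        "customerID": {"dtype": "object", "role": "id", "used_in_model": False},
        "tenure": {"dtype": "int64", "role": "feature", "used_in_model": True},
        "MonthlyCharges": {"dtype": "float64", "role": "feature", "used_in_model": True},
        "Contract": {
            "dtype": "category", "role": "feature", "used_in_model": True,
            "allowed": ["Month-to-month", "One year", "Two year"],
        },
        "Churn": {
            "dtype": "category", "role": "target", "used_in_model": True,
            "allowed": ["Yes", "No"],
        },
    }
}

def expected_columns(data_dict: dict, mode: str) -> list:
    """Same selection rule as _expected_columns in the validator."""
    cols_meta = data_dict["columns"]
    if mode == "train":
        return list(cols_meta)
    if mode == "predict":
        return [
            c for c, meta in cols_meta.items()
            if meta.get("used_in_model", False)
            and meta.get("role") not in ("target", "id")
        ]
    raise ValueError("mode must be 'train' or 'predict'")

train_cols = expected_columns(data_dict, "train")
predict_cols = expected_columns(data_dict, "predict")
print("train  :", train_cols)
print("predict:", predict_cols)
```

Note that the target is excluded at predict time by its `role`, even though it is marked `used_in_model`.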

---

# 2) Update `scripts/train.py` to validate before training

Replace your current `scripts/train.py` with this (only differences: import + early validation + graceful exit):

```python
# scripts/train.py
import sys
import yaml
from sklearn.model_selection import train_test_split
from churn.io import read_csv, write_csv, save_obj
from churn.data import basic_clean, split_xy
from churn.features import build_preprocess
from churn.model import build_pipeline
from churn.validate import validate_dataframe  # NEW

def main():
    with open("configs/config.yaml") as f:
        cfg = yaml.safe_load(f)
    df_raw = read_csv(cfg["paths"]["raw"])
    df = basic_clean(df_raw)

    # Validate cleaned dataframe against dictionary (TRAIN mode)
    if not validate_dataframe(df, dict_path="configs/data_dictionary.yaml", mode="train"):
        print("❌ Data validation failed. Aborting training.", file=sys.stderr)
        sys.exit(1)

    write_csv(df, cfg["paths"]["processed"])

    X, y = split_xy(df, target="Churn")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=cfg["training"]["test_size"],
        random_state=cfg["training"]["random_state"],
        stratify=y
    )

    preprocess, num_cols, cat_cols = build_preprocess(X_train)
    clf = build_pipeline(preprocess, max_iter=cfg["training"]["model"]["max_iter"])
    clf.fit(X_train, y_train)

    save_obj(clf, cfg["paths"]["model"])
    save_obj({"num_cols": num_cols, "cat_cols": cat_cols}, cfg["paths"]["columns"])
    print("✅ Model trained and saved.")

if __name__ == "__main__":
    main()
```

---

# 3) Update `scripts/predict.py` to validate before scoring (predict mode)

Replace your current `scripts/predict.py` (the override-threshold version) with this small enhancement:

```python
# scripts/predict.py
import argparse
import sys
import yaml
import pandas as pd
from pathlib import Path
from churn.io import load_obj
from churn.validate import validate_dataframe  # NEW

def main():
    parser = argparse.ArgumentParser(description="Batch score Telco churn.")
    parser.add_argument("input_csv", help="Path to CSV of records to score")
    parser.add_argument("--output", help="Optional output path (CSV). If omitted, appends _scored.csv")
    parser.add_argument("--threshold", type=float, help="Override decision threshold (0–1). If omitted, uses config.")
    parser.add_argument("--config", default="configs/config.yaml", help="Path to YAML config")
    args = parser.parse_args()

    # Load config and model
    with open(args.config) as f:
        cfg = yaml.safe_load(f)

    model_path = cfg["paths"]["model"]
    clf = load_obj(model_path)

    # Load data
    df_new = pd.read_csv(args.input_csv)

    # If user accidentally included a label column, ignore it for scoring
    for col in ("Churn", "churn", "label", "target"):
        if col in df_new.columns:
            df_new = df_new.drop(columns=[col])

    # Validate features (PREDICT mode → only requires model-used predictors)
    if not validate_dataframe(df_new, dict_path="configs/data_dictionary.yaml", mode="predict"):
        print("❌ Data validation failed. Aborting scoring.", file=sys.stderr)
        sys.exit(1)

    # Predict
    proba = clf.predict_proba(df_new)[:, 1]

    # Decide threshold: CLI override > YAML > default 0.5
    cfg_thresh = cfg.get("inference", {}).get("threshold", 0.5)
    threshold = args.threshold if args.threshold is not None else cfg_thresh
    if not (0.0 <= float(threshold) <= 1.0):
        print(f"ERROR: threshold must be in [0,1], got {threshold}", file=sys.stderr)
        sys.exit(2)

    pred = (proba >= float(threshold)).astype(int)

    # Assemble output
    scored = df_new.copy()
    scored["churn_proba"] = proba
    scored["churn_pred"] = pred

    # Output path
    out_path = args.output or (str(Path(args.input_csv).with_suffix("")) + "_scored.csv")
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    scored.to_csv(out_path, index=False)

    print(f"Applied threshold: {threshold}")
    print(f"✅ Scored → {out_path}")

if __name__ == "__main__":
    main()
```

---

## What this gives you (quick wins)

* **Fail-fast safety**: Training or scoring halts with clear messages if schema/type issues show up.
* **Mode awareness**: Scoring datasets don’t need the target or ID fields; training still enforces the full schema.
* **Stronger contracts**: Your YAML dictionary is now the single source of truth for your pipeline.


># If you want, I can also add a **`--strict` flag** to treat warnings (like high missingness) as hard errors, or wire this into a tiny CI step (pre-commit or GitHub Actions) so bad data never merges into `main`.

Awesome. Let’s wire in a **strict mode** and a tiny **CI / pre-commit safety net** so bad data can’t sneak into your pipeline.

---

# 1) Upgrade the validator (adds `strict` + tunable behavior)

### `src/churn/validate.py` (drop-in replacement)

```python
import pandas as pd
import yaml

def _load_dict(dict_path: str):
    # Context manager so the file handle is closed promptly
    with open(dict_path) as f:
        return yaml.safe_load(f)

def _expected_columns(data_dict, mode: str):
    cols_meta = data_dict["columns"]
    if mode == "train":
        return list(cols_meta.keys())
    elif mode == "predict":
        used = []
        for c, meta in cols_meta.items():
            if not meta.get("used_in_model", False):    # only model features
                continue
            if meta.get("role") in ("target", "id"):    # no target/ids at predict time
                continue
            used.append(c)
        return used
    else:
        raise ValueError("mode must be 'train' or 'predict'")

def validate_dataframe(
    df: pd.DataFrame,
    dict_path: str = "configs/data_dictionary.yaml",
    mode: str = "train",
    categorical_cardinality_cap: int = 50,
    nan_warn_ratio: float = 0.05,
    strict: bool = False,   # ← NEW: if True, ANY warning becomes a failure
) -> bool:
    """
    Validate df against data_dictionary.yaml.
    Returns True if checks pass per current 'strict' level, else False.
    """
    data_dict = _load_dict(dict_path)
    cols_meta = data_dict["columns"]
    ok = True

    expected = set(_expected_columns(data_dict, mode))
    have = set(df.columns)

    # Missing columns (always hard error)
    missing = expected - have
    if missing:
        print(f"❌ Missing required columns ({mode}): {sorted(missing)}")
        ok = False

    # Unexpected columns
    unexpected = have - set(cols_meta.keys())
    if unexpected:
        msg = f"⚠️  Unexpected columns present: {sorted(unexpected)}"
        print(msg)
        if strict or mode == "train":
            ok = False   # train is less tolerant; predict can be lenient unless strict

    # Dtype & categorical cardinality
    check_cols = expected & have
    for col in sorted(check_cols):
        meta = cols_meta.get(col, {})
        expected_dtype = meta.get("dtype", "")
        actual_dtype = str(df[col].dtype)
        local_fail = False

        if expected_dtype.startswith("int") and "int" not in actual_dtype:
            print(f"⚠️  {col}: expected int, found {actual_dtype}")
            local_fail = True
        elif expected_dtype.startswith("float") and "float" not in actual_dtype:
            print(f"⚠️  {col}: expected float, found {actual_dtype}")
            local_fail = True
        elif expected_dtype == "category":
            nunique = df[col].nunique(dropna=True)
            if nunique > categorical_cardinality_cap:
                print(f"⚠️  {col}: too many unique values for categorical ({nunique} > {categorical_cardinality_cap})")
                local_fail = True
            allowed = meta.get("allowed")
            if allowed is not None:
                bad = set(df[col].dropna().unique()) - set(allowed)
                if bad:
                    # Sort by str so mixed-type values cannot raise TypeError
                    print(f"⚠️  {col}: unexpected categories {sorted(bad, key=str)}")
                    local_fail = True

        if local_fail and (strict or mode == "train"):
            ok = False

    # Missingness
    if check_cols:
        nan_ratio = df[list(check_cols)].isna().mean()
        high_nan = nan_ratio[nan_ratio > nan_warn_ratio]
        if not high_nan.empty:
            print("⚠️  Columns with high missingness:")
            for c, r in high_nan.sort_values(ascending=False).items():
                print(f"   - {c}: {r:.1%} missing")
            if strict:
                ok = False

    return ok
```

---

# 2) Add CLI validation script (handy for pre-commit/CI)

### `scripts/validate_cli.py`

```python
import argparse
import sys

import pandas as pd
import yaml

from churn.validate import validate_dataframe

def main():
    ap = argparse.ArgumentParser(description="Validate a CSV against data_dictionary.yaml")
    ap.add_argument("--mode", choices=["train","predict"], default="train")
    ap.add_argument("--file", help="CSV to validate (defaults: processed for train; required for predict)")
    ap.add_argument("--dict", default="configs/data_dictionary.yaml")
    ap.add_argument("--strict", action="store_true", help="Treat warnings as errors")
    ap.add_argument("--nan-cap", type=float, default=0.05)
    ap.add_argument("--cat-cap", type=int, default=50)
    ap.add_argument("--config", default="configs/config.yaml")
    args = ap.parse_args()

    if args.file:
        csv_path = args.file
    else:
        if args.mode == "train":
            with open(args.config) as f:
                cfg = yaml.safe_load(f)
            csv_path = cfg["paths"]["processed"]
        else:
            print("For --mode predict you must pass --file <csv>", file=sys.stderr)
            sys.exit(2)

    df = pd.read_csv(csv_path)
    ok = validate_dataframe(
        df, dict_path=args.dict, mode=args.mode,
        categorical_cardinality_cap=args.cat_cap,
        nan_warn_ratio=args.nan_cap, strict=args.strict
    )
    if not ok:
        print("❌ Validation failed.", file=sys.stderr)
        sys.exit(1)
    print("✅ Validation passed.")

if __name__ == "__main__":
    main()
```

---

# 3) Train / Predict with `--strict` flags

### `scripts/train.py` (only the validation line changes)

```python
# ...
from churn.validate import validate_dataframe

# Validate cleaned dataframe (strict at train-time is recommended)
if not validate_dataframe(df, dict_path="configs/data_dictionary.yaml", mode="train", strict=True):
    print("❌ Data validation failed. Aborting training.", file=sys.stderr)
    sys.exit(1)
# ...
```

### `scripts/predict.py` (allow optional strict via CLI)

```python
# add to argparse:
parser.add_argument("--strict", action="store_true", help="Treat warnings as errors during validation")

# pass through:
if not validate_dataframe(df_new, dict_path="configs/data_dictionary.yaml", mode="predict", strict=args.strict):
    print("❌ Data validation failed. Aborting scoring.", file=sys.stderr)
    sys.exit(1)
```

Usage:

```bash
# strict train (recommended)
python scripts/train.py

# lenient predict
python scripts/predict.py data/scoring/new_customers.csv

# strict predict (e.g., production scoring)
python scripts/predict.py data/scoring/new_customers.csv --strict
```

---

# 4) Pre-commit hook (runs validation before every commit)

### `.pre-commit-config.yaml`

```yaml
repos:
  - repo: local
    hooks:
      - id: telco-validate-train
        name: Telco Validate (train strict)
        entry: bash -c 'if [ -f configs/config.yaml ]; then python scripts/validate_cli.py --mode train --strict; fi'
        language: system
        pass_filenames: false
      - id: telco-validate-sample-predict
        name: Telco Validate (predict lenient if sample exists)
        entry: bash -c 'if [ -f data/scoring/new_customers.csv ]; then python scripts/validate_cli.py --mode predict --file data/scoring/new_customers.csv; fi'
        language: system
        pass_filenames: false
```

Install:

```bash
pip install pre-commit
pre-commit install
```

Now every `git commit` will gate on validation.

---

# 5) Lightweight GitHub Actions CI (validates on push/PR)

### `.github/workflows/ci.yml`

```yaml
name: telco-ci
on:
  push:
  pull_request:
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - name: Install deps
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt || true
      - name: Validate train (strict)
        run: |
          if [ -f configs/config.yaml ]; then
            python scripts/validate_cli.py --mode train --strict || exit 1
          else
            echo "configs/config.yaml not found; skipping train validation."
          fi
      - name: Validate sample predict (lenient)
        run: |
          if [ -f data/scoring/new_customers.csv ]; then
            python scripts/validate_cli.py --mode predict --file data/scoring/new_customers.csv || exit 1
          else
            echo "No sample scoring file; skipping predict validation."
          fi
```

---

## What you get

* **Strict mode** makes warnings fatal when you want hard gates (CI, training).
* **Lenient mode** lets you explore data without blocking (ad-hoc scoring).
* **Pre-commit + CI**: bad schemas/types are caught *before* code lands in main, and before training happens.

If you want, I can also add a **“fix-suggestions” printout** (e.g., “try `pd.to_numeric(..., errors='coerce')` on `TotalCharges`”) when dtype checks fail.


# src/utils/paths.py
from pathlib import Path

# Define project paths
PROJECT_ROOT = Path(__file__).resolve().parents[2]
DATA_DIR = PROJECT_ROOT / "data"
MODELS_DIR = PROJECT_ROOT / "models"

Phase 1: Extract second function:
1.2 Initial Quality Assessment
Missing value detection
Duplicate row checking
Data type inconsistencies (TotalCharges as object)
Basic value range validation
Initial observations documentation
Create data_quality.py when utils.py hits 50+ lines
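
A minimal sketch of the first three checks above (missing values, duplicate rows, numeric columns stored as text). The function names `looks_numeric` and `quality_report`, the 0.75 threshold, and the sample rows are assumptions for illustration, not part of the plan.

```python
import pandas as pd

def looks_numeric(s: pd.Series, threshold: float = 0.75) -> bool:
    """True if a non-numeric column mostly parses as numbers (e.g. TotalCharges as text)."""
    if pd.api.types.is_numeric_dtype(s):
        return False
    non_null = s.dropna()
    if non_null.empty:
        return False
    parsed = pd.to_numeric(non_null, errors="coerce")
    return bool(parsed.notna().mean() >= threshold)

def quality_report(df: pd.DataFrame) -> dict:
    """Collect basic data-quality facts in one dictionary."""
    return {
        "n_rows": len(df),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "text_columns_that_look_numeric": [c for c in df.columns if looks_numeric(df[c])],
    }

sample = pd.DataFrame({
    "tenure": [1, 34, 34, 0, 45],
    "TotalCharges": ["29.85", "1889.5", "1889.5", " ", "1840.75"],
    "Contract": ["Month-to-month", "One year", "One year", "Two year", None],
})
report = quality_report(sample)
print(report)
```

The blank `" "` in `TotalCharges` is not counted as missing by `isna()`, which is exactly why the “looks numeric” check is useful as a separate signal.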
Phase 2: Data Cleaning & Extracting Functions (Week 2)
Cleaning with Modular Code (Week 2, Days 1-3)
As soon as you write code twice, extract it:
2.1 Create Cleaning Module
# src/cleaning.py (NEW FILE)
import pandas as pd

def fix_total_charges(df):
    """Convert TotalCharges to numeric; unparseable values become NaN."""
    df = df.copy()
    df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
    # ... rest of cleaning logic
    return df

2.1 Handling Data Issues
Convert TotalCharges to numeric
Handle missing/empty values
Fix inconsistent categories
Remove or impute problematic records
Document cleaning decisions
2.2 Data Validation
Verify cleaning effectiveness
Cross-check related fields (tenure vs TotalCharges)
Ensure business logic consistency
Create cleaned dataset checkpoint
Write cleaning functions
2.2 Validation Functions
Add validation to same module
Import and use in notebooks
Document cleaning decisions in notebook markdown
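
One way to sketch the tenure vs TotalCharges cross-check is shown below. It assumes that `TotalCharges` should be roughly `tenure * MonthlyCharges`, which only holds approximately because monthly rates can change over time, hence the loose tolerance. The function name `check_tenure_vs_total` and the `rel_tol` default are hypothetical.

```python
import pandas as pd

def check_tenure_vs_total(df: pd.DataFrame, rel_tol: float = 0.25) -> pd.DataFrame:
    """Return rows whose TotalCharges looks inconsistent with tenure and MonthlyCharges."""
    total = pd.to_numeric(df["TotalCharges"], errors="coerce")
    expected = df["tenure"] * df["MonthlyCharges"]

    # Brand-new customers (tenure == 0) should not have accumulated charges
    zero_tenure_with_charges = (df["tenure"] == 0) & (total.fillna(0) > 0)
    # Existing customers should have a total at all
    missing_total = (df["tenure"] > 0) & total.isna()
    # ...and it should be in the neighbourhood of tenure * monthly rate
    far_off = (df["tenure"] > 0) & ((total - expected).abs() > rel_tol * expected)

    return df[zero_tenure_with_charges | missing_total | far_off]

sample = pd.DataFrame({
    "tenure": [0, 10, 10, 5, 0],
    "MonthlyCharges": [50.0, 50.0, 50.0, 20.0, 30.0],
    "TotalCharges": [" ", "510.0", "5000", None, "30"],
})
flagged = check_tenure_vs_total(sample)
print(flagged)
```

Flagged rows are returned rather than dropped, so the decision (fix, impute, or remove) can be documented in the notebook.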


Create Visualization Module
# src/visualization.py (NEW FILE)
def plot_churn_distribution(df):
    """Create churn distribution plots."""
    # ... plotting code
    
def plot_numerical_distributions(df, columns):
    """Create distribution plots for numerical features."""
    # ... plotting code


Phase 4: Feature Engineering (Week 4)
Create Basic Package Structure
Once you have 5-10 functions:
project/
├── notebooks/
│   └── analysis.ipynb
├── src/
│   ├── data_utils.py      # 3-5 functions
│   └── cleaning.py        # 3-5 functions
└── data/
4.1 Business-Driven Features
Customer tenure groups (new/established/loyal)
Service bundle indicators
Payment behavior flags
Contract risk indicators
Value segmentation (high/medium/low)
4.2 Statistical Features
Ratio features (charges per tenure month)
Binary indicators from continuous
Interaction terms
Aggregated service counts
Normalized monetary values
Create Feature Module
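# src/features.py (NEW FILE)
import pandas as pd

def create_tenure_groups(df):
    """Create tenure-based customer segments."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    # include_lowest=True keeps tenure == 0 in the first bin instead of NaN
    df['TenureGroup'] = pd.cut(df['tenure'],
                                bins=[0, 12, 24, 48, 72],
                                labels=['New', 'Regular', 'Loyal', 'VIP'],
                                include_lowest=True)
    return df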

def create_value_segments(df):
    """Create customer value segments."""
    # ... implementation
    return df
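
A possible shape for `create_value_segments`, plus the “charges per tenure month” ratio from 4.2. This is a sketch under assumptions: segmenting on `MonthlyCharges` terciles, the label names, and the helper `add_avg_charge_per_month` with its output column name are all illustrative choices, not decisions made in the plan.

```python
import numpy as np
import pandas as pd

def create_value_segments(df: pd.DataFrame) -> pd.DataFrame:
    """Split customers into three equal-sized groups by MonthlyCharges."""
    df = df.copy()
    df["ValueSegment"] = pd.qcut(df["MonthlyCharges"], q=3, labels=["Low", "Medium", "High"])
    return df

def add_avg_charge_per_month(df: pd.DataFrame) -> pd.DataFrame:
    """Ratio feature: TotalCharges per month of tenure (NaN when tenure is 0)."""
    df = df.copy()
    tenure = df["tenure"].replace(0, np.nan)  # avoid division by zero
    df["AvgChargePerMonth"] = df["TotalCharges"] / tenure
    return df

sample = pd.DataFrame({
    "tenure": [0, 10, 12, 24, 48, 72],
    "MonthlyCharges": [20.0, 30.0, 40.0, 50.0, 60.0, 70.0],
    "TotalCharges": [0.0, 300.0, 480.0, 1200.0, 2880.0, 5040.0],
})
features = add_avg_charge_per_month(create_value_segments(sample))
print(features[["ValueSegment", "AvgChargePerMonth"]])
```

Both helpers copy the input, so they can be chained in a notebook without side effects on the original DataFrame.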
Refactor Existing Code
Move feature logic from notebooks to module
Update notebooks to import from features.py
Create package structure when you have 5+ modules:
src/
└── telco_analysis/  # Convert to package
    ├── __init__.py
    ├── data.py      # Renamed from utils.py
    ├── cleaning.py
    ├── features.py
    └── visualization.py
Phase 5: Statistical Analysis (Week 5)
5.1 Hypothesis Testing
Chi-square for categorical associations
T-tests for numerical differences
Correlation significance testing
Multiple testing corrections
Effect size calculations
5.2 Statistical Validation
Assumption checking for tests
Non-parametric alternatives when needed
Confidence interval construction
Statistical vs practical significance
Business interpretation of results
Statistics Module
# src/telco_analysis/statistics.py (NEW FILE)
def perform_churn_tests(df):
    """Run statistical tests for churn analysis."""
    results = {}
    # Chi-square for categoricals
    # T-tests for numericals
    return results
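
A minimal sketch of how `perform_churn_tests` could be filled in with SciPy. The extra parameters (`categorical`, `numerical`, `target`), the result layout, and the choice of Welch's t-test (unequal variances) are assumptions; the multiple-testing corrections and assumption checks listed in 5.1 and 5.2 are deliberately left out.

```python
import pandas as pd
from scipy import stats

def perform_churn_tests(df, categorical, numerical, target="Churn"):
    """Chi-square tests for categorical columns, Welch t-tests for numerical ones."""
    results = {}
    for col in categorical:
        table = pd.crosstab(df[col], df[target])
        res = stats.chi2_contingency(table)  # (statistic, p-value, dof, expected)
        results[col] = {"test": "chi2", "statistic": float(res[0]), "p_value": float(res[1])}

    churned = df[target] == "Yes"
    for col in numerical:
        res = stats.ttest_ind(df.loc[churned, col], df.loc[~churned, col], equal_var=False)
        results[col] = {"test": "welch_t", "statistic": float(res.statistic), "p_value": float(res.pvalue)}
    return results

sample = pd.DataFrame({
    "Contract": ["Month-to-month"] * 6 + ["Two year"] * 6,
    "tenure": [1, 2, 3, 4, 5, 40, 50, 55, 60, 65, 70, 6],
    "Churn": ["Yes"] * 5 + ["No"] + ["No"] * 5 + ["Yes"],
})
results = perform_churn_tests(sample, categorical=["Contract"], numerical=["tenure"])
print(results)
```

With many columns tested at once, p-values should be corrected (for example Bonferroni or Benjamini-Hochberg) before drawing conclusions.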

5.2 Progressive Organization
Import all modules in __init__.py
Make functions easily accessible
Start using package imports:
from telco_analysis import load_telco_data, create_tenure_groups
Phase 6: Data Preparation for Modeling (Week 6)
Expand Package for Modeling
Add modules as needed:
src/
├── __init__.py            # Make it a package
├── data_utils.py
├── preprocessing.py       # Expanded from cleaning
├── feature_engineering.py # New
└── visualization.py       # Extracted plotting code
6.1 Feature Preparation
Encoding categorical variables
Scaling numerical features
Feature selection (statistical)
Train/test/validation splits
Stratification strategy
6.2 Pipeline Development
Preprocessing pipeline creation
Transformation sequence definition
Data leakage prevention
Reproducibility setup
Pipeline testing


6.3 Preprocessing Module
# src/telco_analysis/preprocessing.py
def prepare_for_modeling(df):
    """Complete preprocessing pipeline."""
    # Encoding, scaling, splitting
    return X_train, X_test, y_train, y_test
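
One way `prepare_for_modeling` could look with scikit-learn, as a sketch. It assumes a `Churn` target with “Yes”/“No” labels, treats every non-numeric column as categorical, and returns the fitted preprocessor as a fifth value (a deviation from the stub above) so the same transformation can be reused at prediction time. The key point for leakage prevention is that the split happens first and the transformers are fitted on the training rows only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def prepare_for_modeling(df, target="Churn", test_size=0.25, random_state=42):
    """Split first, then fit encoding/scaling on the training rows only."""
    X = df.drop(columns=[target])
    y = (df[target] == "Yes").astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    num_cols = X.select_dtypes(include="number").columns.tolist()
    cat_cols = [c for c in X.columns if c not in num_cols]
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])

    X_train_t = preprocess.fit_transform(X_train)  # statistics learned from train only
    X_test_t = preprocess.transform(X_test)        # test rows are only transformed
    return X_train_t, X_test_t, y_train, y_test, preprocess

sample = pd.DataFrame({
    "tenure": list(range(1, 21)),
    "MonthlyCharges": [20.0 + i for i in range(20)],
    "Contract": ["Month-to-month", "One year"] * 10,
    "Churn": ["Yes", "No"] * 10,
})
X_train_t, X_test_t, y_train, y_test, preprocess = prepare_for_modeling(sample)
print(X_train_t.shape, X_test_t.shape)
```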

6.4 Reorganize Package Structure
src/telco_analysis/
├── __init__.py
├── data/
│   ├── __init__.py
│   ├── loader.py
│   └── cleaning.py
├── features/
│   ├── __init__.py
│   └── engineering.py
├── analysis/
│   ├── __init__.py
│   ├── statistics.py
│   └── visualization.py
└── preprocessing.py
Phase 7: Baseline Modeling (Week 7)
7.1 Simple Models First
Logistic regression baseline
Decision tree (single)
Naive Bayes classifier
Model training basics
Prediction generation
7.2 Model Evaluation
Confusion matrices
Accuracy, precision, recall
ROC curves and AUC
Classification reports
Business metric translation
7.1 Create Models Module
# src/telco_analysis/models/baseline.py
def train_logistic_baseline(X_train, y_train):
    """Train baseline logistic regression."""
    # Implementation
    return model

def evaluate_model(model, X_test, y_test):
    """Evaluate model performance."""
    # Implementation
    return metrics
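
A minimal sketch of the two baseline functions using scikit-learn. The `max_iter` default, the chosen metrics, and the synthetic data from `make_classification` (standing in for the preprocessed Telco features) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

def train_logistic_baseline(X_train, y_train, max_iter=1000):
    """Train baseline logistic regression."""
    model = LogisticRegression(max_iter=max_iter)
    model.fit(X_train, y_train)
    return model

def evaluate_model(model, X_test, y_test):
    """Evaluate model performance on held-out data."""
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "roc_auc": roc_auc_score(y_test, proba),
    }

# Synthetic stand-in data so the sketch runs on its own
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = train_logistic_baseline(X_train, y_train)
metrics = evaluate_model(model, X_test, y_test)
print(metrics)
```

For churn, recall on the positive class usually matters more than raw accuracy, since a missed churner tends to cost more than an unnecessary retention offer.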

7.2 Use Modules in Notebooks
Import preprocessing and modeling functions
Keep experimentation in notebooks
Extract stable code to modules




Phase 8-9: Full Package Structure
Complete organization:
src/telco_analysis/
├── __init__.py
├── data/
│   ├── loader.py
│   └── validator.py
├── features/
│   └── engineering.py
├── models/
│   ├── baseline.py
│   └── evaluation.py
└── utils/
    └── config.py
Phase 8: Advanced Models & Evaluation (Week 8)
8.1 Advanced Algorithms
Random Forest
Gradient Boosting (XGBoost)
Support Vector Machines
Model comparison framework
Ensemble strategies
8.2 Hyperparameter Tuning
Grid search basics
Cross-validation implementation
Parameter importance understanding
Overfitting detection
Optimal model selection
Expand Models Module
# src/telco_analysis/models/advanced.py
def train_random_forest(X_train, y_train, **params):
    """Train random forest with given parameters."""
    # Implementation
    
# src/telco_analysis/models/evaluation.py  
def plot_roc_curves(models, X_test, y_test):
    """Compare ROC curves for multiple models."""
    # Implementation
Phase 9: Model Interpretation (Week 9)
9.1 Feature Importance
Coefficient analysis (logistic)
Tree-based importances
Permutation importance
SHAP values introduction
Business insight extraction
9.2 Model Validation
Performance on holdout set
Temporal validation if possible
Bias detection
Calibration assessment
Limitation documentation
Model Optimization & Selection (Week 8)
9.1 Hyperparameter Module
# src/telco_analysis/models/tuning.py
def tune_hyperparameters(model, param_grid, X, y):
    """Grid search with cross-validation."""
    # Implementation
    return best_model, best_params
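
A sketch of `tune_hyperparameters` built on scikit-learn's `GridSearchCV`. The `cv` and `scoring` parameters and their defaults are additions for illustration, and the synthetic data stands in for the real training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(model, param_grid, X, y, cv=5, scoring="roc_auc"):
    """Grid search with cross-validation; returns the refit best model and its params."""
    search = GridSearchCV(model, param_grid, cv=cv, scoring=scoring)
    search.fit(X, y)
    return search.best_estimator_, search.best_params_

# Synthetic stand-in data so the sketch runs on its own
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
best_model, best_params = tune_hyperparameters(
    LogisticRegression(max_iter=1000),
    {"C": [0.1, 1.0, 10.0]},
    X, y,
)
print(best_params)
```

Run the search on the training split only and keep the holdout set untouched, so the final evaluation is not biased by the tuning.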

9.2 Pipeline Integration
Create end-to-end pipeline using all modules
Consolidate best practices into reusable functions
Phase 10: Productionization (Week 10)
10.1 Code Modularization
Function extraction from notebooks
Module organization
Package structure creation
Documentation standards
Testing implementation
10.2 Model Deployment Prep
Model serialization
Prediction pipeline
API wrapper basics
Batch scoring setup
Monitoring considerations
Deployment Preparation
Model serialization
Config management
CLI interface:
# src/telco_analysis/cli.py
def main():
    """Command-line interface for predictions."""
    # Implementation
10.1 Add Production Features
# src/telco_analysis/pipeline.py
class ChurnPredictionPipeline:
    """End-to-end prediction pipeline."""
    def __init__(self, config):
        self.config = config
        
    def fit(self, df):
        # Complete training pipeline
        
    def predict(self, df):
        # Complete prediction pipeline

10.2 Add Testing
# tests/test_features.py
def test_tenure_groups():
    """Test tenure group creation."""
    # Test implementation
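
A sketch of what the tenure-group test could check. To stay self-contained it defines a local stand-in for `create_tenure_groups` (in the real test file you would import it from the features module); the stand-in passes `include_lowest=True` so that `tenure == 0` lands in the first bin, which is an assumption about the intended behaviour.

```python
import pandas as pd

def create_tenure_groups(df):
    """Local stand-in for the function under test."""
    df = df.copy()
    df["TenureGroup"] = pd.cut(
        df["tenure"],
        bins=[0, 12, 24, 48, 72],
        labels=["New", "Regular", "Loyal", "VIP"],
        include_lowest=True,
    )
    return df

def test_tenure_groups():
    """Bin edges fall on the expected side and the input is not mutated."""
    df = pd.DataFrame({"tenure": [0, 12, 13, 24, 48, 72]})
    out = create_tenure_groups(df)
    assert out["TenureGroup"].astype(str).tolist() == [
        "New", "New", "Regular", "Regular", "Loyal", "VIP",
    ]
    assert "TenureGroup" not in df.columns

test_tenure_groups()  # pytest would discover and run this automatically
```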
Phase 11: Business Integration (Week 11)
Business Deliverables
Executive dashboard notebook
Technical report generation
ROI calculations using modules
11.1 Stakeholder Communication
Executive summary creation
Technical report writing
Visualization dashboard
ROI calculations
Recommendation formulation
11.2 Implementation Planning
Deployment strategy
A/B testing framework
Success metrics definition
Feedback loop design
Maintenance planning

Phase 12: Portfolio & Career (Week 12)
12.1 Project Documentation
GitHub repository polish
README completion
Code commenting
Jupyter notebook narratives
Results presentation
Add comprehensive docstrings
Create API documentation
Write user guide in README
12.2 Portfolio Polish
Clean up repository
Add comprehensive README
Create presentation notebooks
12.2 Career Artifacts
LinkedIn project post
Portfolio website entry
Interview talking points
Technical blog post
Lessons learned reflection


The Right Approach: Progressive Modularization
Week 1-2: Notebook-Heavy, Light Functions
Work primarily in notebooks
Create utils.py for repeated code
Import from utils into notebooks
Week 3-4: Function-First Development
Write new code as functions first
Test in notebook cells
Move stable functions to modules
Week 5-6: Package Emergence
Create proper package structure
Add __init__.py files
Start thinking about interfaces
Week 7+: Package Refinement
Organize into logical modules
Add documentation
Consider testing

Practical Function Extraction Triggers
Progressive Modularization Principles
Why This Progressive Approach Works Better
Natural Evolution: Code organization emerges from actual needs
Less Refactoring: You don't have to reorganize everything at once
Continuous Learning: Practice modularization throughout
Earlier Benefits: Reuse code sooner, reduce duplication immediately
Better Habits: Train yourself to recognize extraction opportunities
Create a function when:
You write similar code twice
A cell exceeds 10-15 lines
You need to reuse logic
Code has a clear single purpose
You copy-paste code (even once)
You need the same logic in another notebook
You're about to write a complex operation
Create a new module when:
utils.py exceeds 100 lines
You have 3-5+ related functions
A module exceeds 100-150 lines
Functions have a clear theme (data, features, models)
Create package structure when:
You have 3+ modules
You need sub-modules
Modules have interdependencies
You're sharing code with others
The key is: modularize incrementally as you go, not all at once at the end. This makes the learning smoother and the code better.
Daily Development Flow
Start in notebook - experiment freely
Identify patterns - notice repetition
Extract to function - create in appropriate module
Import and use - replace notebook code with import
Test and refine - ensure function works correctly
Document - add docstring and comments
Module Evolution Path
Day 1-3:   utils.py (5-10 functions)
Day 4-7:   utils.py + cleaning.py
Week 2:    Add features.py, visualization.py
Week 3:    Convert to package structure
Week 4+:   Add sub-modules as needed

Quality Checkpoints
Before committing code:
✅ Functions have single responsibility
✅ Docstrings explain purpose
✅ Notebooks import from modules
✅ No code duplication
✅ Clear module organization
This progressive approach ensures you build good habits from day one while maintaining flexibility for exploration and learning.


Cadence Guidelines
Daily Rhythm (2-3 hours/day)
Morning (30 min): Review previous work, set daily goal
Main Work (90 min): Implement new concepts/code
Practice (30 min): Debug, refine, document
Reflection (30 min): Note challenges, solutions, learnings
Weekly Checkpoints
Monday: Plan week's objectives
Wednesday: Mid-week progress check
Friday: Weekly review and documentation
Weekend: Catch-up or exploration time
Learning Principles
Complete each phase before moving forward
Document decisions and reasoning
Save checkpoints after major milestones
Practice explaining your work aloud
Connect to business value at every step

Quality Gates
Before moving to next phase, ensure:
✅ Code runs without errors
✅ Results are reproducible
✅ Business insights documented
✅ Can explain work to non-technical audience
✅ Version controlled and backed up

This cadence balances thorough learning with practical progress, ensuring you build strong foundations while maintaining momentum toward a production-ready solution.



I'll create a comprehensive data analysis project for the IBM Telco churn dataset. Let me start by building the complete notebook-based project, then explain modularization.
Great! I've created a complete notebook-based data analysis project for the IBM Telco churn dataset. The project is split into 4 parts:
Data Loading & EDA - Initial exploration and understanding
Preprocessing - Data cleaning and feature engineering
Visualization - Insights and pattern discovery
Model Building - Training and evaluation of multiple models
Now, let me explain the modularization process:

What is Modularization?
Modularization is the process of organizing code into separate, reusable modules (Python files) with clear responsibilities. Instead of having everything in notebooks, you create a structured project with functions, classes, and modules that can be tested, maintained, and reused.

Why Modularize?
Current Problems with Notebook-Only Approach:
No code reusability - Copy-paste between notebooks leads to duplication
Hard to test - Can't easily write unit tests for notebook cells
Version control issues - Notebooks are JSON files with embedded outputs, so Git diffs are noisy and hard to review
No reproducibility guarantees - Execution order matters; state can be messy
Difficult collaboration - Merge conflicts are nightmares
Can't easily deploy - Hard to put notebooks into production
Maintainability - Hard to find and fix bugs across multiple notebooks
Benefits of Modularization:
DRY Principle (Don't Repeat Yourself) - Write once, use everywhere
Testability - Unit tests ensure code correctness
Maintainability - Easier to update and debug
Collaboration - Multiple people can work on different modules
Deployment - Easy to package and deploy as APIs or applications
Documentation - Clear structure makes code self-documenting
Performance - Can optimize specific modules independently

How to Modularize the Project
Let me create a complete modularized version of the project:
Now let me create the actual modularized code files:
Perfect! I've created a complete data analysis project for the IBM Telco churn dataset. Here's what I've delivered:
Complete Notebook-Based Project (Parts 1-4)
Data Loading & EDA - Initial exploration
Preprocessing - Cleaning and encoding
Visualization - Insights and patterns
Model Building - Training and evaluation
Complete Modularized Project Structure
I've created a professional, production-ready structure with:
Core Modules:
data/loader.py - Data loading with validation
data/preprocessor.py - Cleaning and encoding
features/engineer.py - Feature creation
models/train.py - Model training
models/evaluate.py - Model evaluation
visualization/plots.py - Publication-quality plots
Supporting Files:
main.py - Complete pipeline orchestration
config/config.yaml - Centralized configuration
requirements.txt - Dependencies
tests/test_preprocessor.py - Unit test example
README.md - Full documentation
Comprehensive Guide:
Detailed modularization explanation covering what, why, and how
Before/after comparisons
Best practices and design patterns
Step-by-step transition strategy
Key Benefits of the Modularized Approach:
✅ Reusable - Write once, use everywhere
✅ Testable - Unit tests ensure correctness
✅ Maintainable - Easy to update and debug
✅ Collaborative - Multiple developers can work together
✅ Production-ready - Can be deployed as API or application
✅ Professional - Industry-standard structure
You can now run the entire pipeline with a single command:
python main.py --data_path data/raw/telco.csv --save_models
The modularized version transforms messy notebook code into clean, professional, maintainable software! 🚀



2.1 Review Previous Work
Objective: Identify repetitive code patterns from Levels 0-2 notebooks for extraction.
2.2 Consolidation Strategy
Before (Level 0-2 Approach):
# Repeated in every notebook
df = pd.read_csv('data/telco.csv')
print(f"Loaded {df.shape[0]} rows and {df.shape[1]} columns")
df.info()
df.describe()

After (Level 3 Function):
def load_data(filepath):
    """Load and provide basic info about dataset."""
    df = pd.read_csv(filepath)
    print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
    return df
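
If you want the Level 3 loader to catch problems early, here is a minimal sketch of one way to extend it. The `REQUIRED_COLUMNS` tuple and the `validate_columns` helper are assumptions for illustration, not part of the original notebooks; adjust the column list to whatever your later cells actually use.

```python
from pathlib import Path

import pandas as pd

# Hypothetical list of columns that later cells depend on.
REQUIRED_COLUMNS = ("tenure", "MonthlyCharges", "TotalCharges", "Churn")


def validate_columns(df, required_columns=REQUIRED_COLUMNS):
    """Raise ValueError if any expected column is absent; otherwise return df."""
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


def load_data(filepath, required_columns=REQUIRED_COLUMNS):
    """Load a CSV, fail fast on a missing file, and check the schema."""
    path = Path(filepath)
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    df = pd.read_csv(path)
    validate_columns(df, required_columns)
    print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
    return df
```

Keeping the schema check in its own function means you can reuse it on DataFrames that did not come from disk, for example in a unit test.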


Where does it fit?


## 🤔 **It Depends on the Type of Statistical Analysis**


### **✅ Belongs in `01_data_exploration.ipynb`:**
- **Descriptive statistics** (mean, median, std, percentiles)
- **Distribution analysis** (skewness, kurtosis, normality tests)
- **Basic hypothesis tests** (t-tests, chi-square for independence)
- **Correlation analysis** between features
- **Outlier detection** using statistical methods (IQR, z-scores)


### **🎯 Belongs in `02_data_cleaning.ipynb`:**
- **Statistical outlier removal** (based on z-scores, IQR)
- **Statistical imputation** methods (mean, median, mode)
- **Data quality statistics** (missing value patterns, duplicate analysis)


### **📊 Belongs in `03_visualization_insights.ipynb`:**
- **Statistical relationships** with the target variable
- **Feature importance** using statistical tests
- **Segmentation analysis** (churn rates by customer segments)
- **Statistical significance** of observed patterns


### **🤖 Belongs in `04_baseline_model.ipynb`:**
- **Model performance statistics** (confidence intervals, significance tests)
- **Feature selection** using statistical methods (ANOVA, chi-square)
- **Statistical validation** of model assumptions


## 📋 **Recommended Approach for Level 3**


**Keep it simple and logical:**


```
01_data_exploration.ipynb - Focus on understanding your data
- df.describe() with interpretation
- Distribution analysis of key variables
- Basic correlation matrix
- Univariate statistical summaries

03_visualization_insights.ipynb - Focus on business insights
- Statistical relationships with churn
- Significance tests for key findings
- Comparative statistics across segments
```


## 💡 **Example Split:**


**In `01_data_exploration.ipynb`:**
```python
# Basic statistical profiling
print("Numerical Variables Statistics:")
print(df[['tenure', 'MonthlyCharges', 'TotalCharges']].describe())


# Distribution analysis
from scipy import stats
print(f"Tenure skewness: {stats.skew(df['tenure'])}")
print(f"MonthlyCharges normality test: {stats.normaltest(df['MonthlyCharges'])}")
```


**In `03_visualization_insights.ipynb`:**
```python
# Business-focused statistical analysis
from scipy.stats import chi2_contingency


# Test if contract type is significantly related to churn
contingency_table = pd.crosstab(df['Contract'], df['Churn'])
chi2, p_value = chi2_contingency(contingency_table)[:2]
print(f"Contract vs Churn - Chi-square test p-value: {p_value}")
```


## 🎯 **For Level 3 Specifically:**


Since you're building your **baseline model**, I'd recommend:


1. **Keep descriptive statistics in `01_data_exploration`** - it's foundational data understanding
2. **Move inferential statistics to `03_visualization_insights`** - it supports your business insights
3. **Save advanced statistical modeling for Level 4+** when you focus on evaluation and thresholding


This keeps each notebook focused on its primary purpose and maintains the logical flow of your analysis progression. 
Phase 3: Creating a Baseline Model in Notebooks, Then Converting It to a Package
## 📓 Notebook Structure
## Building the Complete Package {#complete-package}


### Project Structure


```
Level_3/
├── src/
│   └── telco_analysis/
│       ├── __init__.py         # Makes this a package
│       ├── data_utils.py       # Data functions
│       ├── preprocessing.py    # Cleaning functions
│       └── modeling.py         # Model functions
├── notebooks/
│   └── 03_baseline_model.ipynb
└── data/
    └── raw/
        └── telco_data.csv
```
Phase 3.2 Create Baseline Model using Notebooks
Comprehensive Evaluation Framework
- Confusion matrices with business context
- ROC/PR curves for threshold optimization
- Feature importance analysis for interpretability
- Calibration plots for probability reliability




## Model Performance


| Metric | Score | Business Meaning |
|--------|-------|------------------|
| **ROC-AUC** | 0.842 | Excellent customer ranking ability |
| **Precision** | 0.653 | 65% of flagged customers actually churn |
| **Recall** | 0.547 | Catches 55% of all churners |
| **F1-Score** | 0.595 | Balanced precision-recall trade-off |
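
As a quick sanity check on the table, the F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. This small sketch uses only the numbers shown in the table.

```python
# Precision and recall as reported in the table above.
precision = 0.653
recall = 0.547

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")
```

The result rounds to the 0.595 listed in the table, so the three metrics are internally consistent.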


### Key Insights Discovered
1. **Contract type** is the strongest predictor (month-to-month = 2.3x churn risk)
2. **Payment method** significantly impacts retention (electronic check = 45% churn rate)
3. **Customer tenure** inversely correlates with churn risk (-0.65 coefficient)



Q:
Should my 02_statistical_analysis.ipynb notebook be written more tightly?
A:
Excellent question! Your notebook is definitely on the right track for Level 3, but yes, it could be more tightly structured. You're at the perfect stage to start thinking about professional notebook organization while maintaining the learning value.
🎯 Current Strengths vs. Areas for Tightening
What's Working Well:
Logical flow from data loading → statistical testing → business insights
Good use of modular imports
Clear statistical methodology
Business-focused conclusions
Areas to Tighten:
Level 3 Notebook Structure Guidelines
🎯 Professional Notebook Organization
Current Structure Issues to Address:
Debugging Code Mixed with Analysis
# This should be in a separate development/debugging notebook
import importlib.util
spec = importlib.util.spec_from_file_location("stats", stats_file)
stats_module = importlib.util.module_from_spec(spec)
# List all functions in the module
functions = [item for item in dir(stats_module) if not item.startswith('_')]

Configuration Loading Scattered
# Should be consolidated into one clean section
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

Data Exploration vs. Analysis Mixed
# This inspection belongs in exploratory phase, not statistical analysis
tenure_0_customers = df_clean[df_clean['tenure'] == 0]
from tabulate import tabulate
print(tabulate(tenure_0_customers, headers='keys', tablefmt='psql'))


📋 Recommended Notebook Structure
Cell 1: Executive Summary & Objectives
# Statistical Analysis: Telco Customer Churn
## Objective
Identify statistically significant drivers of customer churn and quantify business impact.

## Key Questions
1. Which customer characteristics are significantly associated with churn?
2. What is the magnitude of these effects?
3. Which segments represent the highest business risk?

## Methodology
- Statistical testing with effect size reporting
- Risk segmentation analysis
- Business impact quantification

Cell 2: Environment Setup (Clean)
# Core imports and configuration
import sys
from pathlib import Path
import yaml

# Add project path
HERE = Path().resolve()
sys.path.insert(0, str(HERE.parent / "src"))

# Data science stack
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Project modules
from utils.loader import DataLoader
from utils.preprocessor import clean_telco_data
from utils.stats import (
    test_numerical_vs_churn,
    test_categorical_vs_churn,
    identify_risk_segments
)

# Load configuration
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("✅ Environment setup complete")

Cell 3: Data Loading & Validation
# Load and prepare data using modular functions
loader = DataLoader(config)
df_raw, load_report = loader.load_data(config['data']['raw_path'])
df_clean = clean_telco_data(df_raw)

# Data quality summary
print(f"Dataset: {df_clean.shape[0]:,} customers, {df_clean.shape[1]} features")
print(f"Churn rate: {(df_clean['Churn'] == 'Yes').mean()*100:.1f}%")
print(f"Missing values: {df_clean.isnull().sum().sum()}")

# Save processed data
processed_path = Path(config['data']['processed_path'])
processed_path.parent.mkdir(parents=True, exist_ok=True)
df_clean.to_csv(processed_path, index=False)
print(f"✅ Clean data saved to {processed_path}")

Cell 4: Statistical Testing Framework
# Define features to test
numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_features = ['Contract', 'PaymentMethod', 'InternetService']

# Initialize results storage
statistical_results = {
    'numerical': {},
    'categorical': {}
}

print("🔬 Running Statistical Tests")
print("=" * 40)

Cell 5: Numerical Feature Analysis
# Test numerical features
for feature in numerical_features:
    result = test_numerical_vs_churn(df_clean, feature, 'Churn')
    statistical_results['numerical'][feature] = result
    
    print(f"\n{feature.upper()}:")
    print(f"  Test: {result['test_used']}")
    print(f"  P-value: {result['p_value']:.4e}")
    print(f"  Effect size: {result['cohens_d']:.3f} ({result['effect_size']})")
    print(f"  Significant: {'✅' if result['significant'] else '❌'}")
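
The `cohens_d` and `effect_size` fields printed above come from `test_numerical_vs_churn`, whose implementation isn't shown here. As a rough sketch of what such a helper might compute, the functions below implement Cohen's d with a pooled standard deviation and label it using Cohen's conventional 0.2 / 0.5 / 0.8 cut-offs. The function names and the "negligible" label are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def effect_size_label(d):
    """Map |d| to a label using the conventional 0.2 / 0.5 / 0.8 thresholds."""
    magnitude = abs(d)
    if magnitude < 0.2:
        return "negligible"
    if magnitude < 0.5:
        return "small"
    if magnitude < 0.8:
        return "medium"
    return "large"


# Toy example: both groups have variance 4, and the means differ by 1.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(f"d = {d:.2f} ({effect_size_label(d)})")
```

Reporting the effect size next to the p-value matters on a dataset this size, because even tiny differences can reach statistical significance.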

Cell 6: Categorical Feature Analysis
# Test categorical features
for feature in categorical_features:
    result = test_categorical_vs_churn(df_clean, feature, 'Churn')
    statistical_results['categorical'][feature] = result
    
    print(f"\n{feature.upper()}:")
    print(f"  Chi-square: {result['chi2_statistic']:.2f}")
    print(f"  P-value: {result['p_value']:.4e}")
    print(f"  Cramér's V: {result['cramers_v']:.3f}")
    print(f"  Highest risk: {result['highest_risk_category']}")
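
For reference, here's a minimal sketch of how the `cramers_v` value could be derived from a contingency table. It computes the chi-square statistic by hand (without a continuity correction) and normalizes it; the function name and list-of-rows input format are assumptions, and the real `test_categorical_vs_churn` may do this differently.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalize by sample size and the smaller table dimension minus one.
    k = min(len(row_totals), len(col_totals))
    return sqrt(chi2 / (n * (k - 1)))


# Perfect association gives 1.0; no association gives 0.0.
print(cramers_v([[10, 0], [0, 10]]))
print(cramers_v([[5, 5], [5, 5]]))
```

Cramér's V ranges from 0 to 1, which makes it easier to compare association strength across features with different numbers of categories.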

Cell 7: Key Findings Visualization
# Create focused visualizations for significant findings
significant_features = []

# Identify significant results
for category, results in statistical_results.items():
    for feature, result in results.items():
        if result['significant']:
            significant_features.append((feature, result))

print(f"📊 Visualizing {len(significant_features)} significant findings")

# Create subplot grid
n_features = len(significant_features)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for i, (feature, result) in enumerate(significant_features[:4]):
    # Your visualization code here
    pass

plt.tight_layout()
plt.show()

Cell 8: Risk Segmentation
# Business-focused risk analysis
risk_segments = identify_risk_segments(df_clean)

print("🎯 HIGH-RISK CUSTOMER SEGMENTS")
print("=" * 40)

# Sort by risk level and revenue impact
high_risk_segments = {k: v for k, v in risk_segments.items() 
                     if v['risk_level'] == 'HIGH'}

for segment_name, data in high_risk_segments.items():
    print(f"\n{segment_name.upper()}:")
    print(f"  Size: {data['size']:,} customers ({data['percentage_of_base']:.1f}%)")
    print(f"  Churn Rate: {data['churn_rate']:.1f}%")
    print(f"  Revenue at Risk: ${data.get('monthly_revenue_at_risk', 0):,.0f}/month")

Cell 9: Executive Summary & Recommendations
# Business intelligence summary
print("📋 EXECUTIVE SUMMARY")
print("=" * 50)

# Calculate total impact
total_revenue_at_risk = sum(
    segment.get('monthly_revenue_at_risk', 0) 
    for segment in risk_segments.values()
)

print(f"\n💰 BUSINESS IMPACT:")
print(f"   Total Monthly Revenue at Risk: ${total_revenue_at_risk:,.0f}")
print(f"   Annualized Impact: ${total_revenue_at_risk * 12:,.0f}")

print(f"\n🎯 TOP 3 RECOMMENDATIONS:")

# Generate recommendations from significant findings
recommendations = []
for feature, result in significant_features:
    if feature == 'Contract' and result['significant']:
        recommendations.append({
            'priority': 1,
            'action': 'Contract Incentive Program',
            'rationale': f"Month-to-month customers have {result['churn_rates_by_category']['Month-to-month']*100:.1f}% churn rate",
            'expected_impact': '20% reduction in contract-related churn'
        })

# Display top recommendations
for i, rec in enumerate(recommendations[:3], 1):
    print(f"\n   {i}. {rec['action']}")
    print(f"      Rationale: {rec['rationale']}")
    print(f"      Expected Impact: {rec['expected_impact']}")

Cell 10: Technical Appendix (Optional)
# Detailed statistical results for technical stakeholders
print("📊 DETAILED STATISTICAL RESULTS")
print("=" * 40)

# Export detailed results
results_export = {
    'summary': {
        'total_features_tested': len(numerical_features) + len(categorical_features),
        'significant_findings': len(significant_features),
        'alpha_level': 0.05
    },
    'detailed_results': statistical_results
}

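# Save results for reporting (create the output folder first so open() cannot fail on a missing directory)
import json
results_path = Path('../results/statistical_analysis_results.json')
results_path.parent.mkdir(parents=True, exist_ok=True)
with open(results_path, 'w') as f:
    json.dump(results_export, f, indent=2, default=str)

print("✅ Results exported for technical documentation")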


🎯 Key Improvements for Level 3
1. Separation of Concerns
# ❌ Mixed exploration and analysis
tenure_0_customers = df_clean[df_clean['tenure'] == 0]
print(tabulate(tenure_0_customers, headers='keys', tablefmt='psql'))

# ✅ Focused analysis only
result = test_numerical_vs_churn(df_clean, 'tenure', 'Churn')
print(f"Tenure analysis: p={result['p_value']:.4e}, d={result['cohens_d']:.3f}")

2. Professional Output Formatting
# ❌ Basic print statements
print("Tenure Analysis Results:")
print(f"  Test used: {tenure_results['test_used']}")

# ✅ Structured, scannable output
print("🔬 STATISTICAL TEST RESULTS")
print("=" * 30)
print(f"Feature: {feature}")
print(f"Test: {result['test_used']}")
print(f"Significance: {'✅ Significant' if result['significant'] else '❌ Not significant'}")

3. Result-Oriented Structure
# ❌ Process-focused
# "First let's load the data, then clean it, then test it..."

# ✅ Results-focused
# "Key Finding: Contract type significantly predicts churn (p<0.001)"

4. Notebook Metadata
<!-- Add at the top -->
---
title: "Statistical Analysis: Customer Churn Drivers"
author: "Your Name"
date: "2024-XX-XX"
purpose: "Identify statistically significant predictors of customer churn"
audience: "Business stakeholders and data science team"
runtime: "~5 minutes"
---

🚀 Transition Strategy
Create 02_statistical_analysis_CLEAN.ipynb with the tighter structure
Keep your current notebook as 02_statistical_analysis_EXPLORATORY.ipynb for reference
Use the exploratory version for debugging and investigation
Use the clean version for presentations and stakeholder communication
This approach lets you maintain both learning documentation and professional output - perfect for Level 3!
🎯 Direct Answer: YES, Definitely Tighten It
Your notebook should be significantly more focused for Level 3. Here's why:
Current Issues:
Debugging code mixed with analysis (the module inspection code)
Exploratory data inspection mixed with statistical testing
Configuration scattered across multiple cells
No clear narrative flow for stakeholders
Level 3 Standard:
Results-focused rather than process-focused
Professional presentation quality
Clear executive summary
Systematic methodology throughout
🔄 Quick Tightening Strategy
Move to Separate Notebooks:
02_statistical_analysis_CLEAN.ipynb ← This is your deliverable
02_exploratory_debugging.ipynb ← Keep your exploration here
Clean Version Should:
Start with key findings (executive summary)
Show systematic testing of all relevant features
Focus on business implications of every statistical result
End with clear recommendations and next steps
Development Process:
Exploratory Notebook → Debug/Investigate → Extract Insights → Clean Presentation
   (Keep private)                                             (Show stakeholders)
💡 Why This Matters for Level 3
Level 2: "Here's what I found while exploring..." 
Level 3: "Here are the statistically significant business insights and recommended actions..."
Your analysis quality is already Level 3+, but your presentation needs to catch up. The statistical rigor is excellent - now package it professionally.
Bottom Line: Create a tight, business-focused version for stakeholders while keeping your exploratory version for learning and development. This dual-notebook approach is exactly what professional data scientists do!

### **`03_baseline_model.ipynb`**
```python
# Cell 1: Setup
"""
Level 3: Baseline Model with Modular Code
=========================================
Moving from monolithic notebooks to reusable functions!
"""
import sys
sys.path.append('../src')  # Add src to path


# Import our custom package
from telco_analysis import (
   load_data, clean_data, create_features,
   split_data, train_baseline, evaluate_model, save_pipeline
)


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')


# Cell 2: Load and Prepare Data
"""
Using our extracted functions instead of inline code!
Before (Level 2): 50+ lines of data loading code
Now (Level 3): 3 function calls
"""
# Load
df = load_data()


# Clean
df_clean = clean_data(df)


# Engineer features
df_features = create_features(df_clean)


# Cell 3: Prepare for Modeling
"""
Convert data to modeling format
"""
from telco_analysis.preprocessing import prepare_for_modeling


df_model, numeric_cols, categorical_cols = prepare_for_modeling(df_features)


print(f"Numeric features ({len(numeric_cols)}): {numeric_cols[:5]}...")
print(f"Categorical features ({len(categorical_cols)}): {categorical_cols[:5]}...")


# Cell 4: Train-Test Split
"""
Split with stratification to maintain class balance
"""
X_train, X_test, y_train, y_test = split_data(
   df_model,
   target='Churn',
   test_size=0.2,
   random_state=42
)


# Cell 5: Train Baseline Model
"""
Train our first baseline - Logistic Regression
Why Logistic Regression?
- Simple and interpretable
- Fast to train
- Good baseline for binary classification
- Provides probability estimates
"""
pipeline = train_baseline(X_train, y_train, numeric_cols, categorical_cols)


# Cell 6: Evaluate Performance
"""
Get baseline metrics to beat in future iterations
"""
metrics = evaluate_model(pipeline, X_test, y_test)


# Detailed classification report
from sklearn.metrics import classification_report
y_pred = pipeline.predict(X_test)
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred,
                         target_names=['No Churn', 'Churn']))


# Cell 7: Save Model Artifacts
"""
Persist our trained model for future use
"""
save_pipeline(pipeline, metrics)


# Cell 8: Feature Importance Analysis
"""
Understand what drives predictions
"""
# Get feature names after preprocessing
feature_names = (
   numeric_cols +
   list(pipeline.named_steps['preprocessor']
        .named_transformers_['cat']
        .get_feature_names_out(categorical_cols))
)


# Get coefficients
coefficients = pipeline.named_steps['classifier'].coef_[0]


# Create importance dataframe
importance_df = pd.DataFrame({
   'feature': feature_names,
   'coefficient': coefficients,
   'abs_coefficient': abs(coefficients)
}).sort_values('abs_coefficient', ascending=False)


# Plot top 15 features
plt.figure(figsize=(10, 6))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['coefficient'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Coefficient')
plt.title('Top 15 Feature Importances (Logistic Regression)')
plt.tight_layout()
plt.show()


# Cell 9: Test Model Loading
"""
Verify we can reload and use the saved model
"""
from telco_analysis.modeling import load_pipeline


# Load saved model
loaded_pipeline = load_pipeline()


# Make predictions on test set
test_predictions = loaded_pipeline.predict(X_test[:5])
print("Sample predictions:", test_predictions)
print("Actual values:", y_test[:5].values)
```
3. Technical Implementation
3.1 Key Code Transformations
Before: Repetitive Data Loading (Levels 0-2)
# Repeated in every notebook
df = pd.read_csv('data/telco_customer_churn.csv')
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()}")
print(f"Memory: {df.memory_usage().sum() / 1024**2:.2f} MB")
# ... 15 more lines of validation

After: Modular Function (Level 3)
from telco_analysis import load_data
df = load_data()  # All validation included


3.2 Pipeline Development
Implemented complete sklearn pipeline combining preprocessing and modeling:
Pipeline([
    ('preprocessor', ColumnTransformer([
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ])),
    ('classifier', LogisticRegression(max_iter=1000))
])

Benefits achieved:
Single object for fit/predict operations
Prevents data leakage
Easily deployable
Reproducible preprocessing
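
To see the pipeline above run end to end, here is a minimal sketch on synthetic data. The column names mirror the Telco dataset, but the generated rows and the churn rule are made up purely for illustration; only the `Pipeline` definition comes from the notes above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the real Telco dataset).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "tenure": rng.integers(0, 72, n),
    "MonthlyCharges": rng.uniform(20, 120, n),
    "Contract": rng.choice(["Month-to-month", "One year", "Two year"], n),
})
# Made-up rule: month-to-month customers churn more often.
churn_prob = np.where(X["Contract"] == "Month-to-month", 0.5, 0.1)
y = (rng.random(n) < churn_prob).astype(int)

numeric_features = ["tenure", "MonthlyCharges"]
categorical_features = ["Contract"]

pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(drop="first"), categorical_features),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Stratify so train and test keep the same churn proportion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaler and encoder statistics are learned from the training rows only.
pipeline.fit(X_train, y_train)
proba = pipeline.predict_proba(X_test)
print(proba[:3])
```

Because the scaler and encoder are fitted inside `pipeline.fit`, the test rows never influence the preprocessing statistics, which is the data-leakage protection listed above.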
3.3 Feature Engineering
Created 6 business-driven features:
AvgChargesPerMonth: Revenue efficiency metric
IsNewCustomer: Risk segmentation (tenure ≤ 12 months)
IsElectronicCheck: Payment risk indicator
IsMonthToMonth: Contract risk indicator
HasInternet: Service adoption indicator
HasPhone: Service adoption indicator
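
Here's a minimal sketch of how these six features could be built with pandas. The exact formulas aren't given above, so treat them as assumptions: `AvgChargesPerMonth` is taken as `TotalCharges / tenure` (left as NaN when tenure is 0), and the category labels ("Electronic check", "Month-to-month", "No", "Yes") are assumed to match the values in your copy of the data.

```python
import pandas as pd


def create_features(df):
    """Return a copy of df with six business-driven indicator features added."""
    out = df.copy()

    # TotalCharges can arrive as text; coerce unparseable entries to NaN.
    total = pd.to_numeric(out["TotalCharges"], errors="coerce")
    # Avoid division by zero for brand-new customers (tenure == 0 becomes NaN).
    out["AvgChargesPerMonth"] = total / out["tenure"].where(out["tenure"] > 0)

    out["IsNewCustomer"] = (out["tenure"] <= 12).astype(int)
    out["IsElectronicCheck"] = (out["PaymentMethod"] == "Electronic check").astype(int)
    out["IsMonthToMonth"] = (out["Contract"] == "Month-to-month").astype(int)
    out["HasInternet"] = (out["InternetService"] != "No").astype(int)
    out["HasPhone"] = (out["PhoneService"] == "Yes").astype(int)
    return out


# Tiny illustrative frame (values invented for the example).
sample = pd.DataFrame({
    "tenure": [0, 10, 24],
    "TotalCharges": [" ", "500", "2400"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Electronic check"],
    "Contract": ["Month-to-month", "One year", "Two year"],
    "InternetService": ["DSL", "No", "Fiber optic"],
    "PhoneService": ["No", "Yes", "Yes"],
})
features = create_features(sample)
print(features[["AvgChargesPerMonth", "IsNewCustomer", "IsMonthToMonth"]])
```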
Rule of thumb
Extract a function the second time you paste it.
If a notebook cell has I/O, state, or parameters, move it into src/ and call it from the notebook.
Start modularizing incrementally once you have a working baseline (L3), and "harden" it at L9.
Common Repetitive Patterns Identified:
Data Loading: pd.read_csv() + shape reporting
Data Cleaning: TotalCharges dtype conversion + missing value handling
Validation: Schema checks + data quality assessments
Feature Engineering: Tenure grouping, service counting, value segmentation
Visualization: Distribution plots, correlation analysis, churn analysis







🔬 Phase 1: Advanced EDA with Statistical Foundation
1.1 Data Loading with Statistical Validation
Building on Level 2: Use your optimized loading functions but add statistical validation layers.
New Level 3 Concepts to Integrate:
Statistical Data Profiling:
python
# Your Level 2 function + statistical enhancements
def advanced_data_profiling(df):
    """Enhanced profiling with statistical measures"""
    # Your existing profiling logic
    # + Statistical distribution testing
    # + Outlier detection using IQR and z-score methods
    # + Normality testing for continuous variables
    # + Independence testing preparation
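
To make the outlier-detection step concrete, here's a minimal sketch of the IQR and z-score rules mentioned in the outline. The function names and the default cut-offs (1.5 × IQR and |z| > 3) are common conventions I'm assuming here, not values taken from your Level 2 code.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outlier_mask(values, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std(ddof=1)
    if std == 0:
        # All values identical: nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - values.mean()) / std) > threshold


charges = [10, 11, 12, 13, 14, 100]
print("IQR rule:    ", iqr_outlier_mask(charges))
print("z-score rule:", zscore_outlier_mask(charges))
```

On this toy sample the IQR rule flags the value 100 while the z-score rule at threshold 3 does not, because the extreme point inflates the standard deviation. That's one reason to run both methods rather than relying on a single one.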
Why Statistical Profiling Now:
Foundation for modeling: Understanding distributions informs algorithm choice
Assumption validation: Many models assume normal distributions or independence
Quality assurance: Statistical outlier detection is more robust than visual inspection
1.2 Hypothesis-Driven Exploration
Core Level 3 Advancement: Transition from exploratory to confirmatory analysis
Hypothesis Framework to Implement:
Business Hypothesis Formation: Based on Level 2 insights
Statistical Hypothesis Translation: Convert business questions to testable hypotheses
Test Selection & Execution: Choose appropriate statistical tests
Results Interpretation: Business-context statistical conclusions
Example Progression from Level 2:
Level 2 Insight: "Month-to-month contracts show 42.7% churn rate"
Level 3 Question: "Is this difference statistically significant vs longer contracts?"
Statistical Test: Chi-square test of independence
Business Validation: Confidence intervals for business planning
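
For the confidence-interval step, here's a minimal sketch using the Wilson score interval for a proportion. The counts (427 churners out of 1,000 customers) are illustrative placeholders chosen only to reproduce a 42.7% rate; substitute the real segment counts from your data.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, for illustration only.
low, high = wilson_interval(successes=427, n=1000)
print(f"Churn rate 42.7%, 95% CI: {low:.1%} to {high:.1%}")
```

Giving stakeholders a range rather than a single percentage makes it clearer how much the estimate could move with a different sample.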
1.3 Advanced Visualization with Statistical Elements
Enhanced Visualization Strategy:
Statistical Plots to Add:
Confidence intervals on your existing distribution plots
Correlation matrices with significance testing
Box plots with statistical annotations (quartiles, outliers)
Residual plots for relationship validation
Why These Visualizations:
Confidence intervals: Quantify uncertainty for business decisions
Significance testing: Distinguish real patterns from random variation
Residual analysis: Validate assumptions for future modeling
🧪 Phase 2: Statistical Testing Integration
2.1 Test Selection Framework
Level 3 Skill Development: Learn to choose appropriate tests based on data characteristics.
Test Categories to Master:
Categorical vs Categorical:
Chi-square test of independence
Fisher's exact test (small samples)
Application: Contract type vs Churn status
Continuous vs Categorical:
Independent t-test (2 groups)
ANOVA (multiple groups)
Mann-Whitney U (non-parametric)
Application: Monthly charges across churn groups
Continuous vs Continuous:
Pearson correlation (parametric)
Spearman correlation (non-parametric)
Application: Tenure vs Total charges relationship
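
The categories above can be turned into a small decision helper. This is a sketch only: the function name and arguments are hypothetical, and the Kruskal-Wallis branch is my addition as the usual non-parametric counterpart to ANOVA (it isn't in the list above).

```python
def select_test(kind_a, kind_b, n_groups=2, parametric=True, small_sample=False):
    """Suggest a statistical test name for a pair of variables.

    kind_a, kind_b: "categorical" or "continuous".
    n_groups: number of categories when one variable is categorical.
    parametric: whether parametric assumptions (e.g. normality) look reasonable.
    small_sample: whether expected cell counts are too small for chi-square.
    """
    kinds = {kind_a, kind_b}
    if not kinds <= {"categorical", "continuous"}:
        raise ValueError("kinds must be 'categorical' or 'continuous'")

    if kinds == {"categorical"}:
        return "Fisher's exact test" if small_sample else "Chi-square test of independence"
    if kinds == {"continuous"}:
        return "Pearson correlation" if parametric else "Spearman correlation"

    # One continuous and one categorical variable: compare groups.
    if n_groups == 2:
        return "Independent t-test" if parametric else "Mann-Whitney U"
    return "ANOVA" if parametric else "Kruskal-Wallis"


print(select_test("categorical", "categorical"))                 # Contract vs Churn
print(select_test("continuous", "categorical", parametric=False))  # MonthlyCharges vs Churn
print(select_test("continuous", "continuous"))                   # tenure vs TotalCharges
```

Writing the choice down as code also documents why a given test was used, which helps when you revisit the analysis later.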
2.2 Function Development for Statistical Testing
Building on Level 2 Function Skills:
Create a statistical testing module that mirrors your Level 2 systematic approach:
python
def statistical_relationship_analyzer(df, var1, var2, alpha=0.05):
    """
    Automatically select and execute appropriate statistical test
    based on variable types, with business interpretation
    """
    # Variable type detection
    # Test selection logic
    # Execution with assumption checking
    # Business-context interpretation
    # Visualization integration
Why This Approach:
Systematic consistency: Maintains your Level 2 modular philosophy
Reusability: Apply across all variable combinations
Decision documentation: Clear rationale for test choices
Business integration: Statistical significance → business significance
2.3 Multiple Testing Correction
Advanced Statistical Concept: When testing multiple hypotheses, adjust for false discovery rate.
Implementation Strategy:
Bonferroni correction for conservative approach
False Discovery Rate (FDR) for exploratory analysis
Document correction choice rationale
Business Context: Prevents over-interpretation of chance findings in comprehensive EDA.
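
Both corrections are short enough to write by hand. Here's a minimal sketch in plain Python; the function names are my own, the FDR method shown is the Benjamini-Hochberg procedure, and the p-values in the example are invented.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


p_vals = [0.001, 0.02, 0.03, 0.2]  # invented p-values for illustration
print("Bonferroni:", bonferroni(p_vals))
print("BH (FDR):  ", benjamini_hochberg(p_vals))
```

On these four p-values Bonferroni keeps only the first finding while Benjamini-Hochberg keeps three, which shows the conservative-versus-exploratory trade-off described above.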

🤖 Phase 3: Basic Modeling Introduction

### 3.1 Model Selection Framework

**Level 3 Modeling Philosophy:** Start simple, build systematically.

**Algorithm Progression for Churn Prediction:**

**Baseline Models (Start Here):**

* **Logistic Regression:** Interpretable; handles mixed data types once categorical features are encoded
* **Decision Tree:** Visual decision rules, handles non-linearity
* **Naive Bayes:** Handles categorical features well

**Why These Models First:**

* **Interpretability:** Business stakeholders can understand results
* **Diagnostic capability:** Easy to identify issues and patterns
* **Foundation building:** Core concepts for advanced models later
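
As a rough illustration of the baseline trio, the sketch below fits all three with scikit-learn on synthetic data (the Telco data is not loaded here, and the class balance is an arbitrary choice). Because the synthetic features are numeric, `GaussianNB` is used; scikit-learn also provides `CategoricalNB` for integer-coded categorical features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a churn dataset with an imbalanced target
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

baselines = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=42),
    "naive_bayes": GaussianNB(),
}

scores = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # plain accuracy on the held-out split
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

Accuracy is shown only because it is the default `score`; with an imbalanced churn target it should be read alongside precision and recall.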

### 3.2 Feature Engineering for Modeling

**Advanced Feature Engineering Building on Level 2:**

Your Level 2 Features (Customer Value, Lifecycle Stage, etc.) PLUS:

**Model-Specific Enhancements:**

* **Encoding strategies:** One-hot vs ordinal vs target encoding
* **Scaling approaches:** StandardScaler vs MinMaxScaler vs RobustScaler
* **Interaction features:** Contract type × Payment method combinations
* **Polynomial features:** Non-linear relationship capture

**Decision Framework to Develop:**

```python
def feature_engineering_pipeline(df, target_column, encoding_strategy='auto'):
    """
    Systematic feature preparation with business logic validation
    """
    # Your Level 2 engineered features
    # + Model-appropriate transformations
    # + Validation against business logic
    # + Documentation of choices
```
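
One way the encoding, scaling, and interaction steps might fit together, sketched with scikit-learn's `ColumnTransformer`. The four rows are invented, the column names follow the Telco dataset, and the helper column `Contract_x_Payment` is a hypothetical name; one-hot encoding plus `StandardScaler` is just one of the combinations listed above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented rows with Telco-style columns
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Credit card", "Mailed check"],
    "tenure": [1, 24, 60, 5],
    "MonthlyCharges": [70.35, 55.20, 99.90, 29.85],
})

# Interaction feature: Contract x PaymentMethod as a single combined category
df["Contract_x_Payment"] = df["Contract"] + " | " + df["PaymentMethod"]

categorical = ["Contract", "PaymentMethod", "Contract_x_Payment"]
numeric = ["tenure", "MonthlyCharges"]

preprocessor = ColumnTransformer([
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("scale", StandardScaler(), numeric),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

Fitting the transformer on training data only, then reusing it on validation data, keeps scaling statistics from leaking across the split.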

### 3.3 Model Evaluation Framework

**Level 3 Evaluation Strategy:** Beyond accuracy to business metrics.

**Evaluation Metrics Hierarchy:**

**Business Metrics (Primary):**

* Revenue impact per prediction type
* Customer lifetime value considerations
* Implementation cost analysis

**Statistical Metrics (Secondary):**

* Precision/Recall for churn class
* ROC-AUC for ranking capability
* Confusion matrix interpretation

**Model Diagnostic Metrics:**

* Feature importance analysis
* Prediction confidence assessment
* Bias detection across customer segments
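
To show how a confusion matrix can be turned into a business metric, here is a small sketch that values a retention campaign relative to doing nothing. Every number is hypothetical (`value_saved`, `offer_cost`, `save_rate`, and the confusion-matrix counts), and the cost model is deliberately simple: only customers flagged as churners receive an offer.

```python
def campaign_value(tp, fp, fn, tn, value_saved=500.0, offer_cost=50.0, save_rate=0.3):
    """Net value of acting on churn predictions, relative to running no campaign.

    All monetary inputs are hypothetical placeholders.
    """
    # True positives: offer sent to real churners, a fraction of whom are retained
    tp_value = tp * (save_rate * value_saved - offer_cost)
    # False positives: offer cost spent on customers who would have stayed anyway
    fp_cost = fp * offer_cost
    # False and true negatives receive no offer, so they add nothing versus the baseline
    return {
        "net_value": tp_value - fp_cost,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


result = campaign_value(tp=120, fp=80, fn=60, tn=740)
print(result)
```

Putting the monetary figure next to precision and recall makes it easier to see which kind of error is actually expensive for the business.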

## 📊 Phase 4: Pipeline Development Introduction

### 4.1 Basic Pipeline Architecture

**Transition Toward Level 4-5 Skills:** Introduce systematic workflow management.

**Pipeline Components to Implement:**

```python
# Basic pipeline structure (still in notebook)
def modeling_pipeline(raw_data, target_column):
    """
    End-to-end pipeline: raw data → predictions
    """
    # 1. Data loading & validation (your Level 2 functions)
    # 2. Statistical profiling & testing (Level 3 additions)
    # 3. Feature engineering (enhanced from Level 2)
    # 4. Model training & evaluation (new Level 3)
    # 5. Results interpretation & documentation
```

**Why Pipeline Thinking Now:**

* **Reproducibility:** Same analysis, consistent results
* **Scalability preparation:** Foundation for Level 4-5 automation
* **Quality assurance:** Systematic validation at each step
* **Documentation:** Clear audit trail of decisions
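
The audit-trail idea can be illustrated without any libraries. In this sketch, `run_pipeline` and the two step functions are hypothetical names, and the three records are invented; the point is only that each step is a plain function and the runner logs what happened at every stage.

```python
def run_pipeline(data, steps):
    """Apply each (name, function) step in order and record an audit trail."""
    audit = []
    for name, step in steps:
        rows_in = len(data)
        data = step(data)
        audit.append({"step": name, "rows_in": rows_in, "rows_out": len(data)})
    return data, audit


def drop_missing_charges(rows):
    # Validation step: discard records without a TotalCharges value
    return [r for r in rows if r["TotalCharges"] is not None]


def add_avg_monthly_spend(rows):
    # Feature step: average spend per month of tenure (guarding against tenure 0)
    return [
        {**r, "AvgMonthlySpend": r["TotalCharges"] / max(r["tenure"], 1)}
        for r in rows
    ]


# Invented records for illustration
raw = [
    {"tenure": 12, "TotalCharges": 840.0},
    {"tenure": 0, "TotalCharges": None},
    {"tenure": 3, "TotalCharges": 150.0},
]

clean, audit = run_pipeline(
    raw,
    [("validate", drop_missing_charges), ("features", add_avg_monthly_spend)],
)
for entry in audit:
    print(entry)
```

The same pattern carries over to DataFrames: swap the list comprehensions for pandas operations and keep the runner unchanged.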

### 4.2 Cross-Validation Implementation

**Statistical Rigor Enhancement:** Validate model stability across data subsets.

**Cross-Validation Strategy:**

* Stratified K-fold for balanced evaluation
* Time-aware splitting if temporal patterns exist
* Statistical significance testing of model performance differences

**Business Integration:** Confidence intervals on expected performance in production.
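
A minimal sketch of stratified 5-fold cross-validation with scikit-learn, again on synthetic data rather than the Telco file. The interval uses the t critical value for 4 degrees of freedom (2.776) and treats fold scores as independent, which they are not strictly, so it should be read as a rough guide rather than a guarantee.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for a churn dataset with an imbalanced target
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.73, 0.27], random_state=0,
)

# Stratification keeps the churn rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="roc_auc"
)

mean_auc = scores.mean()
# Approximate 95% interval from the fold-to-fold spread (t critical value, df = 4)
half_width = 2.776 * scores.std(ddof=1) / np.sqrt(len(scores))
print(f"ROC-AUC: {mean_auc:.3f} +/- {half_width:.3f}")
```

For data with temporal structure, scikit-learn's `TimeSeriesSplit` can replace `StratifiedKFold` in the same call.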
