<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
Is this infrastructure bootstrap worth turning into a project of its own?
</summary>

Yes, **but only if you treat it as a *library*, not a second pipeline.** Turning `01_INFRA` into “the infra package” is worth it because it contains the exact stuff that separates a notebook experiment from a hireable data product:

* deterministic **project root resolution**
* **config loading + validation**
* **run identity + run-scoped directories**
* **logging + artifact paths**
* environment “preflight” checks (deps, paths, versions)
* a stable “context” object you can pass around

What’s *not* worth it: keeping `01_INFRA` as its own standalone notebook that people are expected to run first.

### The hireable move

Extract the good bits of `01_INFRA` into:

* `src/dq_engine/bootstrap.py` (or `src/dq_engine/infra/…`)
* optionally `src/dq_engine/config/…`

Then make:

* `02_DQ_IF.ipynb` the one public “runner notebook”
* `01_INFRA.ipynb` moved to `notebooks/_archive/` as provenance

### When it’s definitely worth it

It’s worth it if you want any of these (and you do, if you’re aiming to be hireable):

* **CI runs** (GitHub Actions can’t “run a notebook preamble” reliably)
* **multi-dataset** support (Telco today, anything tomorrow)
* **warehouse targets** (DuckDB local + Snowflake prod)
* **dbt integration** (run dbt + store artifacts + score results)
* clean CLI entrypoint: `dq run --config ...`
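
A minimal sketch of what a `dq run --config ...` entrypoint could look like, using only `argparse`. The names `build_parser` and `main` are assumptions for illustration, and the `print` call stands in for the real bootstrap; nothing here exists in the notebook yet.

```python
import argparse
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build the (hypothetical) `dq` command-line parser."""
    parser = argparse.ArgumentParser(prog="dq", description="Data-quality runner")
    sub = parser.add_subparsers(dest="command", required=True)

    run = sub.add_parser("run", help="Run the DQ pipeline for one config file")
    run.add_argument("--config", type=Path, required=True, help="Path to the YAML config")
    return parser


def main(argv: Optional[list[str]] = None) -> int:
    """Parse arguments and dispatch; returns a process exit code."""
    args = build_parser().parse_args(argv)
    if args.command == "run":
        # Assumption: the extracted bootstrap would be called here with
        # args.config. Printing stands in for that call in this sketch.
        print(f"would run DQ pipeline with config: {args.config}")
    return 0
```

Keeping `main(argv)` free of `sys.argv` makes it easy to call from tests and from CI; it could later be exposed as a `dq` command through a `[project.scripts]` entry in `pyproject.toml`.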

### What “making a project out of it” should mean

Not a new project. A module.

Minimum extraction list (high ROI):

1. `discover_project_root()`
2. `load_config(path) -> cfg`
3. `make_run_context(cfg) -> ctx`

   * run_id, run_ts
   * run dirs (reports/artifacts/logs)
   * config snapshot
4. `preflight(ctx)`

   * ensure dirs, warn on missing optional deps, etc.

Everything else can stay in notebooks until you’re ready.
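
As a rough sketch of items 3 and 4 in the extraction list, the block below derives a run identity and run-scoped directories, then runs a small preflight. It is an assumption-heavy illustration: `RunInfo`, the `runs/<run_id>/` layout, and the `pyarrow` default are hypothetical, the function takes a base directory rather than `cfg` to stay self-contained, and the config snapshot is left out.

```python
import importlib.util
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class RunInfo:
    """Identity and run-scoped directories for one run (hypothetical shape)."""
    run_id: str
    run_ts: str
    reports_dir: Path
    artifacts_dir: Path
    logs_dir: Path


def make_run_context(base_dir: Path) -> RunInfo:
    """Derive a unique run_id and the directories that belong to this run."""
    now = datetime.now(timezone.utc)
    run_ts = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    # Timestamp prefix keeps runs sortable; the random suffix avoids collisions.
    run_id = f"{now:%Y%m%dT%H%M%SZ}_{uuid.uuid4().hex[:8]}"
    run_root = Path(base_dir) / "runs" / run_id
    return RunInfo(run_id, run_ts, run_root / "reports", run_root / "artifacts", run_root / "logs")


def preflight(ctx: RunInfo, optional_deps: tuple[str, ...] = ("pyarrow",)) -> list[str]:
    """Create the run directories and return the optional packages that are missing."""
    for d in (ctx.reports_dir, ctx.artifacts_dir, ctx.logs_dir):
        d.mkdir(parents=True, exist_ok=True)
    return [name for name in optional_deps if importlib.util.find_spec(name) is None]
```

Returning the missing packages instead of raising lets the caller decide whether a missing optional dependency is a warning or a hard stop.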

### A simple gut-check

If you delete `01_INFRA.ipynb` tomorrow, would you lose reusable engineering value?
Yes. So keep the value, **but move it into code** so it can be used by notebook, CLI, and CI.

That’s the sweet spot: you get the maturity signal without multiplying notebooks.


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
Help converting the bootstrap into functions
</summary>

Help me move it into code. Please take a deep dive into the file and explain how to refactor in depth.

Yep. I did a deep read of `01_INFRA.ipynb`, and the good news is: it already *contains* most of the “hireable” infrastructure patterns; they’re just trapped in notebook cells, and a couple of parts are doing too much.

Below is an in-depth refactor plan that maps your notebook sections into a clean Python package, with a minimal, high-signal interface: **`bootstrap_context()` → `load_local_dataset()` → `profile_schema()` → `write_setup_summary()`**.

---

## What `01_INFRA.ipynb` is doing (so we refactor the right stuff)

### Section 1.1 ‚Äî Root + directories

* **1.1.1** resolves `PROJECT_ROOT` via the heuristic “folder named Telco” (and hints at a better heuristic: “config exists above”).
* **1.1.2** creates Level-root subfolders and a `DIRS` dict.
* **1.1.3** does a “can I write here?” smoke test.

✅ Keep all of this, but **make it a reusable module** and stop hardcoding “Telco”.

### Section 1.2 ‚Äî Config loader + `C()` dotted getter

* Loads `project_config.yaml`
* Stores `CONFIG` + read-only `CFG`
* Adds `C("PATHS.RAW_DATA")` dotted lookup
* Writes `config_snapshot.json`

✅ This is solid. In code, it becomes a `Config` object + `get()` method + optional schema validation.
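
The “optional schema validation” can start very small. The sketch below only checks that required dotted keys are present; `missing_keys` and `validate_config` are hypothetical helper names, and a fuller version might use a schema library instead.

```python
from collections.abc import Mapping
from typing import Any


def missing_keys(data: Mapping[str, Any], required: list[str]) -> list[str]:
    """Return the dotted keys from `required` that are absent (or None) in `data`."""
    missing = []
    for dotted in required:
        node: Any = data
        for part in dotted.split("."):
            if isinstance(node, Mapping) and part in node:
                node = node[part]
            else:
                node = None
                break
        if node is None:
            missing.append(dotted)
    return missing


def validate_config(data: Mapping[str, Any], required: list[str]) -> None:
    """Fail fast with a single error that lists every missing key."""
    missing = missing_keys(data, required)
    if missing:
        raise KeyError(f"Config is missing required keys: {', '.join(missing)}")
```

Reporting all missing keys at once saves the fix-one-rerun-fix-another loop that a first-failure check produces.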

### Section 1.3 ‚Äî Resolve config-driven paths + seeds + protected columns

* Resolves RAW/PROCESSED dirs with fallback keys
* Creates dirs
* Reads target/id/ranges/data_quality/flags
* Sets RNG seeds
* Defines “protected columns”

⚠️ Here’s the boundary issue: some of this is **infra** (paths, seeds), and some is **dataset semantics** (target/id/protected). We’ll split those cleanly.

### Section 1.4 ‚Äî Load dataset + version registry

* Loads CSV/parquet using `READ_OPTS`
* Hashes file
* Maintains `dataset_version_registry.csv`
* Appends `dataset_load_log.csv`

✅ Very hireable. In code, this becomes a `DatasetVersioning` utility.
💡 In a warehouse-first world, you’ll later swap file hashing for **table snapshot metadata**, but keep the interface.
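
One way to “keep the interface” while swapping backends is a small protocol with a single method. This is a sketch under assumptions: `DatasetFingerprint`, `FileFingerprint`, `TableFingerprint`, and the choice of table name, row count, and last-altered timestamp as snapshot metadata are all hypothetical, not something the notebook defines.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


class DatasetFingerprint(Protocol):
    """Anything that can produce a stable identifier for a dataset's current state."""

    def fingerprint(self) -> str: ...


@dataclass(frozen=True)
class FileFingerprint:
    path: Path

    def fingerprint(self) -> str:
        # Stream the file in 1 MiB chunks so large files never sit fully in memory.
        h = hashlib.sha256()
        with self.path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()


@dataclass(frozen=True)
class TableFingerprint:
    # Assumption: the caller fetches these values from warehouse metadata.
    table: str
    row_count: int
    last_altered: str

    def fingerprint(self) -> str:
        key = f"{self.table}|{self.row_count}|{self.last_altered}"
        return hashlib.sha256(key.encode("utf-8")).hexdigest()


def version_key(source: DatasetFingerprint) -> str:
    """The registry only needs this one call, whatever the backend is."""
    return source.fingerprint()
```

With this shape, the version registry keeps storing one opaque string per dataset state and never needs to know whether it came from a file or a table.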

### Section 1.5 ‚Äî Profiling + target normalization

* Checks non-empty
* Builds schema summary (type grouping, null%, n_unique)
* Normalizes target labels and creates numeric flag
* Writes target summary CSV

⚠️ This is where you should start trimming: **in a dbt + warehouse version, target normalization should move to dbt** (staging/marts). But the *profiling + reporting* part stays.

### Section 1.6 ‚Äî Environment capture + setup summary

* Captures system/library versions
* Writes `setup_summary.json`
* Writes `core_lib_versions.json`

✅ Great “production vibe” and easy to code-ify.

---

## The end state you want (hireable shape)

**Notebook becomes a thin runner**. Your code becomes reusable from:

* notebooks
* CLI (`dq run --config ...`)
* CI (GitHub Actions)

A clean, readable call pattern:

```python
from dq_engine.bootstrap import bootstrap_context
from dq_engine.dataset import load_local_dataset
from dq_engine.profiling import profile_schema
from dq_engine.reporting import write_setup_summary

ctx = bootstrap_context("config/project_config.yaml")
df, ds_meta = load_local_dataset(ctx)   # later: load from warehouse
schema_df = profile_schema(df, ctx)
write_setup_summary(ctx, ds_meta, schema_df)
```

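That’s the “recruiter can understand it in 90 seconds” version. Note that the module sketches below use narrower signatures (explicit paths and payloads) than this target call pattern; widen them to accept `ctx` as the pieces come together.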

---

## Refactor plan (step-by-step) with a concrete module layout

Create this structure:

```
src/dq_engine/
  __init__.py
  bootstrap.py
  config.py
  paths.py
  env.py
  dataset.py
  versioning.py
  profiling.py
  reporting.py
  types.py
```

### 1) `config.py`: move 1.2.1–1.2.3 here

Your notebook’s YAML loader + `C()` becomes a small class.

```python
# src/dq_engine/config.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from types import MappingProxyType
from typing import Any, Optional
import json
import yaml

@dataclass(frozen=True)
class Config:
    path: Path
    data: MappingProxyType

    def get(self, dotted: str, default: Any = None) -> Any:
        node: Any = dict(self.data)
        for part in dotted.split("."):
            if isinstance(node, dict) and part in node:
                node = node[part]
            else:
                return default
        return node

def load_config(path: str | Path, snapshot_dir: Optional[Path] = None) -> Config:
    p = Path(path).resolve()
    if not p.exists():
        raise FileNotFoundError(f"Config file not found: {p}")
    obj = yaml.safe_load(p.read_text(encoding="utf-8")) or {}
    if not isinstance(obj, dict):
        raise TypeError(f"Top-level YAML in {p} must be a mapping, got {type(obj).__name__}")
    cfg = Config(path=p, data=MappingProxyType(obj))
    if snapshot_dir:
        snapshot_dir.mkdir(parents=True, exist_ok=True)
        # default=str keeps YAML dates/timestamps from breaking the JSON dump
        (snapshot_dir / "config_snapshot.json").write_text(
            json.dumps(obj, indent=2, default=str), encoding="utf-8"
        )
    return cfg
```

**Why this matters:** you eliminate global `CONFIG/CFG/C()` and make config explicit and testable.

---

### 2) `paths.py`: move 1.1.1–1.1.3 and 1.3.1 (path resolution) here

#### Replace the “Telco folder name” heuristic

Make `discover_project_root()` support:

* environment variable override
* “contains config file” heuristic
* fallback to “git root” if available

```python
# src/dq_engine/paths.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import os
import subprocess

@dataclass(frozen=True)
class ProjectPaths:
    project_root: Path
    level_root: Path
    config_dir: Path
    reports_dir: Path
    artifacts_dir: Path
    figures_dir: Path
    models_dir: Path
    outputs_dir: Path
    raw_data_dir: Path
    processed_dir: Path

def _git_root(cwd: Path) -> Path | None:
    try:
        out = subprocess.check_output(["git", "rev-parse", "--show-toplevel"], cwd=cwd)
        return Path(out.decode().strip()).resolve()
    except Exception:
        return None

def discover_project_root(start: Path, config_rel: str = "config/project_config.yaml") -> Path:
    env_root = os.getenv("DQ_PROJECT_ROOT")
    if env_root:
        return Path(env_root).expanduser().resolve()

    start = start.resolve()
    for parent in [start] + list(start.parents):
        if (parent / config_rel).exists():
            return parent
    gr = _git_root(start)
    if gr:
        return gr
    raise FileNotFoundError(f"Could not discover project root from {start}")

def build_paths(project_root: Path, level_name: str, cfg) -> ProjectPaths:
    level_root = project_root / level_name
    config_dir = level_root / "config"

    def require(*keys: str) -> str:
        # Return the first configured value among `keys`; a missing key raises a
        # clear KeyError instead of a TypeError from `project_root / None`.
        for key in keys:
            value = cfg.get(key)
            if value:
                return value
        raise KeyError(f"Missing config key: {' or '.join(keys)}")

    return ProjectPaths(
        project_root=project_root,
        level_root=level_root,
        config_dir=config_dir,
        reports_dir=project_root / require("PATHS.REPORTS"),
        artifacts_dir=project_root / require("PATHS.ARTIFACTS"),
        figures_dir=project_root / require("PATHS.FIGURES"),
        models_dir=project_root / require("PATHS.MODELS"),
        outputs_dir=project_root / require("PATHS.OUTPUTS"),
        # config-driven fallbacks (your 1.3.1 logic)
        raw_data_dir=project_root / require("PATHS.RAW_DATA", "PATHS.RAW_DATA_DIR"),
        processed_dir=project_root / require("PATHS.PROCESSED", "PATHS.PROCESSED_DIR"),
    )

def ensure_dirs(*paths: Path) -> None:
    for p in paths:
        p.mkdir(parents=True, exist_ok=True)

def assert_writable(*paths: Path) -> None:
    for p in paths:
        test = p / ".write_test.tmp"
        test.write_text("ok", encoding="utf-8")
        test.unlink(missing_ok=True)
```

This preserves your current behavior but makes it portable and “agnostic.”

---

### 3) `env.py`: move 1.6.1 and core versions snapshot here

```python
# src/dq_engine/env.py
from __future__ import annotations
from datetime import datetime, timezone
import platform, sys
import pandas as pd
import numpy as np

def capture_env_info() -> dict:
    return {
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "system": {
            "platform": platform.system(),
            "platform_release": platform.release(),
            "platform_version": platform.version(),
            "machine": platform.machine(),
            "python_version": sys.version,
        },
        "libraries": {
            "pandas": pd.__version__,
            "numpy": np.__version__,
        },
    }
```

Later you can optionally include seaborn/matplotlib if installed.
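
A small sketch of that optional capture, using `importlib.metadata` from the standard library so nothing has to be imported just to read its version. The helper name `optional_versions` is an assumption; the package list simply mirrors the two libraries mentioned above.

```python
from importlib import metadata
from typing import Optional


def optional_versions(packages: tuple[str, ...] = ("seaborn", "matplotlib")) -> dict[str, Optional[str]]:
    """Return installed versions by distribution name, or None when not installed."""
    versions: dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Missing optional packages are recorded, not treated as errors.
            versions[name] = None
    return versions
```

The result can be merged into the `"libraries"` dict returned by `capture_env_info()`, so the snapshot records which optional libraries were absent as well as which were present.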

---

### 4) `versioning.py`: move 1.4.2 + 1.4.3 here (hash registry + load log)

This is one of your best “hireable signals” because it shows **reproducibility**.

```python
# src/dq_engine/versioning.py
from __future__ import annotations
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
import hashlib, os
import pandas as pd

def compute_file_hash(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    # Stream the file in chunks so large datasets are never loaded fully into memory
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

@dataclass(frozen=True)
class DatasetMeta:
    dataset_path: str
    file_hash: str
    version_id: str
    n_rows: int
    n_cols: int
    timestamp_utc: str

def update_version_registry(registry_path: Path, dataset_path: Path, file_hash: str, n_rows: int, n_cols: int) -> str:
    # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
    now_utc = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if registry_path.exists():
        reg = pd.read_csv(registry_path)
    else:
        reg = pd.DataFrame(columns=["version_id","dataset_path","file_hash","first_seen_utc","last_seen_utc","n_rows","n_cols"])

    existing = reg.loc[reg["file_hash"] == file_hash]
    if not existing.empty:
        version_id = str(existing.iloc[0]["version_id"])
        reg.loc[reg["file_hash"] == file_hash, "last_seen_utc"] = now_utc
    else:
        next_id = 1 if reg.empty else int(pd.to_numeric(reg["version_id"], errors="coerce").fillna(0).max()) + 1
        version_id = str(next_id)
        reg = pd.concat([reg, pd.DataFrame([{
            "version_id": version_id,
            "dataset_path": str(dataset_path),
            "file_hash": file_hash,
            "first_seen_utc": now_utc,
            "last_seen_utc": now_utc,
            "n_rows": n_rows,
            "n_cols": n_cols,
        }])], ignore_index=True)

    tmp = registry_path.with_suffix(".tmp")
    reg.to_csv(tmp, index=False)
    os.replace(tmp, registry_path)
    return version_id

def append_load_log(load_log_path: Path, meta: DatasetMeta, read_opts: dict) -> None:
    row = pd.DataFrame([{
        "timestamp_utc": meta.timestamp_utc,
        "dataset_path": meta.dataset_path,
        "version_id": meta.version_id,
        "file_hash": meta.file_hash,
        "n_rows": meta.n_rows,
        "n_cols": meta.n_cols,
        "read_opts": str(read_opts),
    }])
    if load_log_path.exists():
        existing = pd.read_csv(load_log_path)
        all_cols = pd.Index(existing.columns).union(row.columns)
        out = pd.concat([existing.reindex(columns=all_cols), row.reindex(columns=all_cols)], ignore_index=True)
    else:
        out = row
    tmp = load_log_path.with_suffix(".tmp")
    out.to_csv(tmp, index=False)
    os.replace(tmp, load_log_path)
```
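
Both functions above finish with the same temp-file-then-`os.replace` step, so it may be worth pulling into one helper. The sketch below is a hypothetical `atomic_write_text`; it also appends `.tmp` to the full file name rather than swapping the suffix, so two files that share a stem never share a temp file.

```python
import os
from pathlib import Path


def atomic_write_text(path: Path, text: str, encoding: str = "utf-8") -> None:
    """Write `text` to a sibling temp file, then swap it into place with os.replace."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # "a.csv" -> "a.csv.tmp", so "a.csv" and "a.json" get different temp files.
    tmp = path.with_name(path.name + ".tmp")
    try:
        tmp.write_text(text, encoding=encoding)
        # The rename is atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp, path)
    finally:
        # No-op after a successful replace; cleans up if the write failed midway.
        tmp.unlink(missing_ok=True)
```

With pandas, the call could look like `atomic_write_text(registry_path, reg.to_csv(index=False))`, since `to_csv()` returns the CSV as a string when no path is given.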

---

### 5) `dataset.py`: move 1.4.1 here (local load)

This becomes replaceable later with a warehouse loader.

```python
# src/dq_engine/dataset.py
from __future__ import annotations
from pathlib import Path
import pandas as pd

def load_local_dataset(raw_path: Path, read_opts: dict | None = None) -> pd.DataFrame:
    read_opts = read_opts or {}
    suffix = raw_path.suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(raw_path, **read_opts)
    if suffix in {".parquet", ".pq"}:
        return pd.read_parquet(raw_path)
    raise ValueError(f"Unsupported RAW_DATA format: {suffix}")
```

---

### 6) `profiling.py`: move the schema summary from 1.5.2 here

Your schema summary is a perfect reusable “profiling” function.

```python
# src/dq_engine/profiling.py
from __future__ import annotations
from datetime import datetime, timezone
import pandas as pd
from pandas.api import types as ptypes

def profile_schema(df: pd.DataFrame, section_id: str = "1.5.2") -> pd.DataFrame:
    n_rows, n_cols = df.shape
    run_ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    def classify(dtype) -> str:
        # Use pandas' dtype predicates instead of substring matching on the dtype name
        # (substring matching would label e.g. an interval[int64] column as numeric).
        if isinstance(dtype, pd.CategoricalDtype): return "categorical"
        # Check bool before numeric: pandas counts bool as a numeric dtype.
        if ptypes.is_bool_dtype(dtype): return "boolean"
        if ptypes.is_numeric_dtype(dtype): return "numeric"
        if ptypes.is_datetime64_any_dtype(dtype): return "datetime"
        return "string_like"

    rows = []
    for col in df.columns:
        s = df[col]
        non_null = int(s.notna().sum())
        nulls = int(s.isna().sum())
        n_unique = int(s.nunique(dropna=True))
        null_pct = round(nulls / n_rows * 100, 3) if n_rows else 0.0
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "type_group": classify(s.dtype),
            "non_null": non_null,
            "nulls": nulls,
            "null_pct": null_pct,
            "n_unique": n_unique,
            "run_ts": run_ts,
            "n_rows": n_rows,
            "n_cols": n_cols,
            "section_id": section_id,
        })
    return pd.DataFrame(rows)
```

---

### 7) `reporting.py`: move 1.6.2 setup summary here

This is the artifact that future “agnostic engine” runs will always emit.

```python
# src/dq_engine/reporting.py
from __future__ import annotations
from pathlib import Path
import json

def write_setup_summary(env_dir: Path, payload: dict) -> Path:
    env_dir.mkdir(parents=True, exist_ok=True)
    out = env_dir / "setup_summary.json"
    # default=str lets Path objects and timestamps in the payload serialize cleanly
    out.write_text(json.dumps(payload, indent=2, default=str), encoding="utf-8")
    return out
```

---

## The glue: `bootstrap.py`, the single entrypoint that replaces your Section 1 infra cells

```python
# src/dq_engine/bootstrap.py
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
from dq_engine.config import load_config, Config
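from dq_engine.paths import (
    ProjectPaths, discover_project_root, build_paths, ensure_dirs, assert_writable,
)
from dq_engine.env import capture_env_info

@dataclass(frozen=True)
class RunContext:
    cfg: Config
    paths: ProjectPaths
    env: dict

def bootstrap_context(config_path: str | Path, *, level_name: str = "Level_3") -> RunContext:
    config_path = Path(config_path)
    if config_path.is_absolute():
        project_root = discover_project_root(Path.cwd())
    else:
        # Use the relative config path as the root marker, then anchor it to the
        # discovered root so the call also works when cwd is a subfolder (e.g. notebooks/).
        project_root = discover_project_root(Path.cwd(), config_rel=config_path.as_posix())
        config_path = project_root / config_path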
    level_root = project_root / level_name
    cfg = load_config(config_path, snapshot_dir=level_root / "config")

    paths = build_paths(project_root, level_name, cfg)
    ensure_dirs(
        paths.config_dir, paths.reports_dir, paths.artifacts_dir, paths.figures_dir,
        paths.models_dir, paths.outputs_dir, paths.raw_data_dir, paths.processed_dir
    )
    assert_writable(paths.reports_dir, paths.artifacts_dir)

    env = capture_env_info()
    return RunContext(cfg=cfg, paths=paths, env=env)
```

Now your notebook doesn’t need to define `PROJECT_ROOT`, `DIRS`, `CONFIG`, `C()`, etc. It calls one function.

---

## What to do with target normalization (1.5.3)

This is important:

* In **warehouse/dbt version**, target normalization belongs in dbt (`stg_*` or `mrt_*`), not in infra.
* In local CSV mode, you can keep a helper function, but I‚Äôd treat it as a **dataset pack** (Telco pack), not core infra.

So: move 1.5.3 into `src/dq_engine/packs/telco/standardize.py` later.

That’s how you become “agnostic” without deleting Telco usefulness.
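
To make the “dataset pack” idea concrete, here is a minimal sketch of what a label normalizer in `standardize.py` might contain. The function name and the label spellings are assumptions for illustration; the real mapping should come from the notebook's 1.5.3 logic or from config.

```python
from typing import Optional

# Assumption: illustrative label spellings; a real pack would read these from config.
POSITIVE = {"yes", "y", "true", "1"}
NEGATIVE = {"no", "n", "false", "0"}


def normalize_target(value: object) -> Optional[int]:
    """Map a raw target label to 1/0, or None when it is missing or unrecognized."""
    if value is None:
        return None
    label = str(value).strip().lower()
    if label in POSITIVE:
        return 1
    if label in NEGATIVE:
        return 0
    # Unknown labels are surfaced as None rather than silently coerced.
    return None
```

Because it works on single values, the same function can be applied to a pandas column with `Series.map(normalize_target)` in local CSV mode, while the warehouse version expresses the same rule in dbt.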

---

## Refactor sequence (no drama, no rewrite spiral)

1. **Create the module files** above (empty stubs first).
2. Copy code from notebook sections into the right files *as-is*.
3. Make `02_DQ_IF.ipynb` import and use:

   * `bootstrap_context()`
   * `load_local_dataset()`
   * `update_version_registry()` / `append_load_log()`
   * `profile_schema()`
   * `write_setup_summary()`
4. Delete/disable the old inline infra cells once outputs match.

---

## A tiny “success criteria” checklist

You’ll know you did it right when:

* the notebook is < ~30% the size it is now
* there is exactly **one** bootstrap call
* artifacts still show up in the same places
* you can run the same bootstrap from a CLI and from CI
* adding Snowflake/dbt later does **not** require changing the checks, only `load_dataset()` and result writers

---

If you want, next we can do the same deep extraction for `02_DQ_IF.ipynb` but in the *right order*: first make checks accept `(ctx, table_or_df)` and write results to a standardized `DQ_RESULTS` model. Then you’re warehouse-ready without pain.
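
As a rough idea of what a standardized result row could look like, the sketch below defines one record type and one check that returns it. Everything here is hypothetical: the `DQResult` fields, the `check_not_null` name, and the use of plain dict rows and a `run_id` string in place of `(ctx, table_or_df)` are assumptions that keep the example dependency-free.

```python
from collections.abc import Iterable, Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DQResult:
    """One row of a hypothetical standardized DQ_RESULTS table."""
    run_id: str
    check_name: str
    column: str
    passed: bool
    n_failed: int
    n_total: int


def check_not_null(run_id: str, rows: Iterable[Mapping[str, Any]], column: str) -> DQResult:
    """Count rows where `column` is absent or None and report the outcome."""
    n_total = 0
    n_failed = 0
    for row in rows:
        n_total += 1
        if row.get(column) is None:
            n_failed += 1
    return DQResult(run_id, "not_null", column, n_failed == 0, n_failed, n_total)
```

If every check returns the same record type, the writer does not care whether results land in a CSV or a warehouse table; `dataclasses.asdict(result)` gives a plain dict per row.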


# # AGGREGATE_PYTHON_CELLS_SCRIPT_NO_NOTES
# """
# Aggregate all Python code cells from a Jupyter notebook into a single code cell
# in a new notebook, skipping this script cell.

# No headers or notes are injected into the aggregated code.
# """

# import nbformat
# from pathlib import Path
# from datetime import datetime

# # ─────────────────────────────────────────
# # 1. CONFIG: paths
# # ─────────────────────────────────────────

# SOURCE_NOTEBOOK = Path("/Users/b/DATA/PROJECTS/Telco/Level_3/notebooks/01_EDA.ipynb")

# # Destination: one level up from "notebooks/" → "reports/"
# REPORTS_DIR = SOURCE_NOTEBOOK.parent.parent / "reports"
# REPORTS_DIR.mkdir(parents=True, exist_ok=True)

# # Marker used to identify this script cell so we can skip it
# SCRIPT_MARKER = "AGGREGATE_PYTHON_CELLS_SCRIPT_NO_NOTES"

# # ─────────────────────────────────────────
# # 2. Load source notebook
# # ─────────────────────────────────────────

# nb = nbformat.read(SOURCE_NOTEBOOK, as_version=4)

# code_snippets = []
# total_cells = len(nb.cells)
# skipped_cells = 0

# for cell in nb.cells:
#     if cell.cell_type != "code":
#         continue

#     src = cell.source if isinstance(cell.source, str) else "".join(cell.source)

#     # Skip the aggregator script cell (any cell containing the marker)
#     if SCRIPT_MARKER in src:
#         skipped_cells += 1
#         continue

#     if src.strip():
#         code_snippets.append(src.rstrip())

# # ─────────────────────────────────────────
# # 3. Build aggregated source (pure code)
# # ─────────────────────────────────────────

# # Join all code cells with two newlines between them; no extra comments.
# agg_source = "\n\n".join(code_snippets) + "\n"

# # ─────────────────────────────────────────
# # 4. Create new notebook with single code cell
# # ─────────────────────────────────────────

# timestamp = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
# output_name = f"AGG_ALL_{timestamp}.ipynb"
# OUTPUT_PATH = REPORTS_DIR / output_name

# new_nb = nbformat.v4.new_notebook(
#     metadata={
#         "kernelspec": nb.metadata.get("kernelspec", {}),
#         "language_info": nb.metadata.get("language_info", {}),
#         "aggregated_from": str(SOURCE_NOTEBOOK),
#         "aggregated_at_utc": timestamp,
#     }
# )

# # Single big code cell (no notes)
# new_nb.cells.append(nbformat.v4.new_code_cell(agg_source))

# nbformat.write(new_nb, OUTPUT_PATH)

# print("✅ Aggregation complete (no notes inserted into cells)")
# print(f"   Source notebook:      {SOURCE_NOTEBOOK}")
# print(f"   Total cells scanned:  {total_cells}")
# print(f"   Output notebook:      {OUTPUT_PATH}")
# print(f"   Script cells skipped: {skipped_cells}")


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #e78229ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
TODO
</summary>


# TODO:
# 1.2.1 or 1.2.2?
# print(f"\n📁 Project root:         {PROJECT_ROOT}")
# print(f"📄 Raw dataset path:     {RAW_DATA}")
# print(f"📂 Processed data dir:   {DATA_PROCESSED_DIR}")
# print("✅ Project directories verified/created.")

# # Verify dataset presence and load a peek
# if not TELCO_RAW.exists():
#     raise FileNotFoundError(f"❌ Missing dataset. Place it here:\n   {TELCO_RAW}")

# # (Optional) You could write this to directory_structure_report.txt if desired
######

# TODO: CHANGE PANDAS DTYPES FOR DOWNSTREAM MODELING
# TODO: CONVERT ALL PANDAS DTYPES. MAKE SURE TO DISPLAY DTYPES BEFORE ANY CONVERSIONS. MAKE SNAPSHOT BEFORE AND AFTER CHANGES.
# TODO: CONVERT ALL PANDAS DTYPES TO APPROPRIATE TYPES
# TODO: MAKE SURE df.info() is called before any conversions.
# TODO: (1.7) Would you like me to also provide the **mini implementation skeleton** for 2.0 (the Python stubs that create directories, initialize logs, and validate configs) in the same style as your 2.3/2.8 code blueprints?


---
---

That `# TODO: proper desc? Load registry` is basically asking you to describe this whole “load-or-init version registry” behavior.

Here’s a good, professional wording + a slightly tidied block you can drop in.

### Suggested comment for that TODO

Replace:

```python
# TODO: proper desc? Load registry
if registry_path.exists():
```

with something like:

```python
# Load existing dataset version registry if present; otherwise initialize an empty one
if registry_path.exists():
```

If you want to be even more explicit about semantics:

```python
# Load dataset version registry (file_hash → version_id); initialize empty registry if not present
if registry_path.exists():
```

### Slightly polished version of your whole block

Here’s your code with that comment dropped in and minor tighten-ups:

```python
# 1.5.2 Hash / Snapshot Validation
print("\n1.5.2) 🔐 Hash / snapshot validation")

def compute_hash(path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Streaming file hash to avoid loading whole file into memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

file_hash = compute_hash(RAW_DATA, "sha256")
now_utc = datetime.utcnow().isoformat(timespec="seconds") + "Z"

# Ensure artifacts directory exists so atomic writes (tmp + replace) don't fail
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

REGISTRY_DIR = ARTIFACTS_DIR / "registry"
REGISTRY_DIR.mkdir(parents=True, exist_ok=True)

registry_path = REGISTRY_DIR / "dataset_version_registry.csv"

# Load dataset version registry (file_hash → version_id); initialize empty registry if not present
if registry_path.exists():
    registry = pd.read_csv(registry_path)
else:
    registry = pd.DataFrame(
        columns=[
            "version_id",
            "dataset_path",
            "file_hash",
            "first_seen_utc",
            "last_seen_utc",
            "n_rows",
            "n_cols",
        ]
    )

existing = registry.loc[registry["file_hash"] == file_hash]

if not existing.empty:
    version_id = str(existing.iloc[0]["version_id"])
    registry.loc[registry["file_hash"] == file_hash, "last_seen_utc"] = now_utc
    print(f"ℹ️ Existing dataset version found → version_id={version_id}")
else:
    if registry.empty:
        next_id = 1
    else:
        max_existing = pd.to_numeric(registry["version_id"], errors="coerce").fillna(0).max()
        next_id = int(max_existing) + 1

    version_id = str(next_id)
    new_row = pd.DataFrame(
        {
            "version_id": [version_id],
            "dataset_path": [str(RAW_DATA)],
            "file_hash": [file_hash],
            "first_seen_utc": [now_utc],
            "last_seen_utc": [now_utc],
            "n_rows": [n_rows],
            "n_cols": [n_cols],
        }
    )
    registry = pd.concat([registry, new_row], ignore_index=True)
    print(f"🆕 New dataset version registered → version_id={version_id}")

tmp_reg = registry_path.with_suffix(".tmp")
registry.to_csv(tmp_reg, index=False)
os.replace(tmp_reg, registry_path)
print(f"✅ Version registry updated → {registry_path}")
```

One tiny note: `load_log_path` is now unused in this snippet; you can either:

* Remove it from this section, or
* Move its path + mkdir into the 1.5.3 “load log” block where you actually use it.

But for the TODO itself, “Load dataset version registry (file_hash → version_id); initialize empty registry if not present” is a very recruiter-friendly description of what’s going on.
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:7px solid #be4c2dff;
    padding:10px 12px;border-radius:10px;font-weight:700;">
NOTEBOOK SETUP - 11/15/2025
</summary>
Great question to ask *right now* instead of after Section 2 turns into a hydra 😄

Given your **Type 3 → Type 2 config pattern** and **Section 2 = Data Quality framework**, I’d organize notebooks around *pipeline stages*, not around “whatever I was playing with today”.

Below is a concrete layout that fits exactly what you’ve built so far.

---

## 0. Principles (for this Telco Level_3 setup)

With your current design:

* **Type 3 config** = `project_config.yaml` (source of truth).
* **Type 2 config** = `CONFIG` dict + `C()` helper (in memory).
* **Section 1** = environment + paths + config load + raw ingest.
* **Section 2** = data quality & integrity (your big dependency chain).
* **Section 3+** = feature engineering & modeling.

So the notebooks should:

1. **Never duplicate config.** Only one place loads YAML → CONFIG; everything else just uses `C()` or imports.
2. **Treat Section 2 as its own pipeline block.** One notebook (or two) = “DQ job”.
3. **Keep notebooks thin over time.** Heavy logic migrates into `/telco_churn` modules; notebooks mostly orchestrate + display.

---

## 1. Recommended notebook layout

### 01_SECTION1_env_config_raw_load.ipynb

**Scope:** everything up through the end of Section 1 (what you’ve already written).

**Content:**

* 1.1 Imports & visualization defaults
* 1.2 Project root & path setup
* 1.3 Load YAML → `CONFIG` + `C()` (Type 3 → Type 2)
* 1.4 Resolve config-driven paths, seeds, protected columns
* 1.5 Raw dataset load + version registry + load log
* 1.6 Basic structural validation, initial schema summary
* 1.7 Environment summary & setup log

**Key outcome:**
By the end, you’ve got:

* `df`, `df_raw`
* `CONFIG`, `C()`
* `PROJECT_ROOT`, `REPORTS_DIR`, `ARTIFACTS_DIR`, etc.
* `protected_columns`, `id_cols`, `target_name`, `raw_target`
* artifacts like `dataset_version_registry.csv`, `dataset_load_log.csv`, etc.

**Other notebooks *assume* this one ran successfully.**

---

### 02_SECTION2_data_quality_framework.ipynb

This is your **entire Section 2 dependency chain** in one place. All the stuff you’ve been carefully designing goes here.

**2.0 – Preflight & scaffold**

* 2.0.1 Environment preflight & Section 2 report bootstrap
* 2.0.2 Config & constants validation for Section 2
* 2.0.3 Logging & run metadata snapshot
* 2.0.4 Dataset snapshot & preview (reusing `df`)
* 2.0.5 Lightweight baseline summary
* 2.0.6 ID & protected columns snapshot
* 2.0.7 Dependency registry build (`section2_registry.json`)
* 2.0.8 Execution map printout / markdown summary

**2.1 – Base schema & consistency**

* Schema expectation checks vs CONFIG
* Column presence, dtype expectations, drift checks
* Missingness baseline (2.1.8)
* All outputs to `section2` reports/artifacts

**2.2 – Column type discovery (everything you just designed)**

* 2.2.1 Auto-detect data types → `type_detection_summary.csv`, `column_type_map.json`
* 2.2.2 Coercion attempt & logging → `coercion_log.csv` (behind a flag)
* 2.2.3 Binary field detection → `binary_field_report.csv`
* 2.2.4 SeniorCitizen retype (config-driven, optional)
* 2.2.5 Churn flag validation (type-level)
* 2.2.6 ID & protected columns registration → `protected_columns.json`
* 2.2.7 Feature group classification → `feature_group_registry.csv`
* 2.2.8 Type distribution visualization

**2.3+** (later):

* Numeric integrity & outliers
* Categorical domain / tokens
* Logic checks
* Apply phase QA

All DQ artifacts live under:

* `Level_3/resources/reports/section2/...`
* `Level_3/resources/artifacts/...`

**Important:**
This notebook **does not load data or config**. It just:

```python
from telco_churn.config.loader import C, CONFIG
from telco_churn.io.datasets import load_raw_telco  # later
# or simply assumes Section 1 ran in the same kernel
```

For now in your Level_3 notebook world, it’s fine that this runs in the same kernel as Section 1; later, you’ll refactor into scripts.

---

### 03_SECTION3_feature_engineering_and_baseline_model.ipynb

**Scope:** once Section 2 says "data is clean & typed," this is where you:

* pull in only the **model-ready features** using `feature_group_registry.csv`
* perform feature engineering
* build baseline models
* track model metrics

It will **consume** artifacts from Section 2:

* `feature_group_registry.csv`
* `protected_columns.json`
* maybe a "numeric-only" list for scaling, etc.

This is more for later, but the **important part** for organization right now is:
**don't mix Section 3 modeling with Section 2 DQ in the same notebook.**

---

### 99_sandbox.ipynb (optional)

Your trash lab / scratchpad:

* weird experiments
* one-off plots
* queries you might later formalize in Section 1/2/3

This keeps your main notebooks clean and portfolio-friendly.

---

## 2. Where Type 3 → Type 2 actually lives

To keep the story clean for recruiters:

* **Type 3 (YAML) lives in `/config/project_config.yaml`**
* **Type 2 (CONFIG dict + C()) is created once in Section 1**
  ← *and eventually moved into a module* like:

```python
# telco_churn/config/loader.py
from pathlib import Path
import yaml
from types import MappingProxyType

CONFIG_PATH = Path(__file__).resolve().parents[2] / "config" / "project_config.yaml"

with CONFIG_PATH.open("r", encoding="utf-8") as f:
    CONFIG = yaml.safe_load(f) or {}

CFG = MappingProxyType(CONFIG)

def C(path: str, default=None):
    ...
```

Then your notebooks do:

```python
from telco_churn.config.loader import CONFIG, C
```

So:

* **Section 1 notebook** = demonstration of using that loader + building env.
* **Section 2 notebook** = consumer of `C()` + env, not re-declaring config.
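
The body of `C()` is left open above. One plausible shape, sketched here as an assumption rather than the project's actual implementation, is a dotted-path lookup with a default. The config keys and values below are stand-ins for whatever `yaml.safe_load` would return.

```python
from collections.abc import Mapping
from types import MappingProxyType

# Stand-in for the parsed YAML; these keys and values are hypothetical.
CONFIG = {
    "PATHS": {"RAW_DATA": "resources/data/raw/telco.csv"},
    "RANGES": {"tenure": {"min": 0, "max": 72}},
}
CFG = MappingProxyType(CONFIG)  # read-only view of the top level only

def C(path: str, default=None):
    """Look up a dotted path such as 'RANGES.tenure.max'; return default if absent."""
    node = CFG
    for key in path.split("."):
        if not isinstance(node, Mapping) or key not in node:
            return default
        node = node[key]
    return node

print(C("RANGES.tenure.max"))
print(C("PATHS.REPORTS", default="<not configured>"))
```

Note that `MappingProxyType` only guards the top-level mapping; nested dicts stay mutable unless they are wrapped as well.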

---

## 3. TL;DR notebook organization

For your current design:

* `notebooks/01_SECTION1_env_config_and_raw_load.ipynb`
* `notebooks/02_SECTION2_data_quality_framework.ipynb`
* `notebooks/03_SECTION3_feature_engineering_and_modeling.ipynb`
* `notebooks/99_sandbox.ipynb` (optional)

And every time you feel the urge to copy a big chunk of logic between notebooks, that's your cue:

> "This belongs in `/telco_churn/...py`, not duplicated in two notebooks."

That's how your **Type 3 → Type 2 config chain + Section 2 as Data Quality** turns from "cool notebook" into "this person thinks like a pipeline engineer."

</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:700;">
Which notebook has the stats?
</summary>

Short answer: **the stats live mostly in 2.1.8, 2.2.x, 2.3, and 2.4.**
Everything before that is mostly *plumbing / contracts / config*.

Let me map it to your chain.

---

### Mostly *not* statistics (infra / contracts)

These are engineering-focused:

* **2.0.x – Preflight & scaffolding**

  * Env checks, paths, manifests, metadata, logging.
* **2.1.1–2.1.7 – Base schema & consistency**

  * Column presence, dtype expectations, config roots.
  * This is more "data contracts" than "statistics".

They may compute some counts, but the *purpose* is structural, not statistical insight.

---

### Where the **statistics** start showing up

Think: counts, proportions, distributions, outliers, frequency patterns, etc.

#### 1. 2.1.8 – Missingness Baseline

This is your **first clearly statistical block**:

* `% missing per column`
* maybe top columns by null %, etc.

That's descriptive stats about completeness.
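
As a rough illustration, a missingness baseline can be a few lines of pandas. The helper name `missingness_baseline` and the sample frame are made up for this sketch.

```python
import numpy as np
import pandas as pd

def missingness_baseline(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    out = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,
    })
    out.index.name = "column"
    return out.sort_values("pct_missing", ascending=False).reset_index()

sample = pd.DataFrame({
    "tenure": [1, 2, 3, 4],
    "TotalCharges": [29.85, np.nan, 108.15, np.nan],
    "Churn": ["No", "Yes", None, "No"],
})
missing_report = missingness_baseline(sample)
print(missing_report)
```

Keep in mind that `isna()` only sees real nulls; blank strings or placeholder tokens need to be normalized first if they should count as missing.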

#### 2. 2.2.x – Column Type Discovery (light stats baked in)

Even though 2.2 is "type discovery", it uses statistical ideas:

* **2.2.1 Auto-Detect Types**

  * `n_unique`, `unique_ratio`
  * `pct_numeric_like`, `pct_boolean_like`, `pct_datetime_like`
* **2.2.2 Coercion Attempt & Logging**

  * success/fail rates of coercion (implicit stats on parseability)
* **2.2.3 Binary Field Detection**

  * columns with exactly 2 unique values, value counts, proportions

This is still mostly *schema intelligence*, but it's driven by simple descriptive stats.
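
A minimal sketch of how those 2.2.1-style columns could be computed, assuming a small token vocabulary for "boolean-like" values. `BOOL_TOKENS` and `type_detection_summary` are illustrative names, not the project's actual code.

```python
import pandas as pd

# Assumed vocabulary for boolean-like tokens; adjust per dataset.
BOOL_TOKENS = {"yes", "no", "true", "false", "y", "n", "0", "1"}

def type_detection_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column uniqueness and 'looks like type X' percentages."""
    rows = []
    for col in df.columns:
        values = df[col].dropna()
        n = len(values)
        text = values.astype(str).str.strip()
        n_unique = int(values.nunique())
        numeric_share = pd.to_numeric(text, errors="coerce").notna().mean() if n else 0.0
        boolean_share = text.str.lower().isin(BOOL_TOKENS).mean() if n else 0.0
        rows.append({
            "column": col,
            "n_unique": n_unique,
            "unique_ratio": n_unique / n if n else 0.0,
            "pct_numeric_like": float(numeric_share) * 100,
            "pct_boolean_like": float(boolean_share) * 100,
        })
    return pd.DataFrame(rows).set_index("column")

sample = pd.DataFrame({
    "customerID": ["A1", "B2", "C3", "D4"],
    "TotalCharges": ["29.85", " ", "108.15", "1840.75"],  # one blank string on purpose
    "Partner": ["Yes", "No", "Yes", "No"],
})
type_summary = type_detection_summary(sample)
print(type_summary)
```

A column that is mostly, but not fully, numeric-like (such as `TotalCharges` here) is exactly the case the coercion log in 2.2.2 is meant to capture.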

#### 3. 2.3 – Numeric Integrity & Outliers

This is your **first "real" stats-heavy section**:

Typical contents:

* Summary stats: mean, median, std, min, max, quantiles
* Outlier rules: IQR, Z-score, Winsorization flags, etc.
* Distribution checks: skew, kurtosis (if you include them)

This is where you can brag "we do statistical verification of numeric fields."
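
To make the outlier rules concrete, here is a small sketch that flags values with both the IQR rule and a z-score rule. The function name and the thresholds (1.5 × IQR, |z| > 3) are common defaults assumed for illustration, not values taken from the project config.

```python
import pandas as pd

def outlier_flags(s: pd.Series, iqr_k: float = 1.5, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag each value under the IQR rule and under a z-score rule."""
    s = s.astype(float)
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - iqr_k * iqr, q3 + iqr_k * iqr
    std = s.std(ddof=0)  # population standard deviation
    z = (s - s.mean()) / std if std > 0 else s * 0.0
    return pd.DataFrame({
        "value": s,
        "iqr_outlier": (s < low) | (s > high),
        "z_outlier": z.abs() > z_thresh,
    })

charges = pd.Series([20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0, 500.0], name="MonthlyCharges")
flags = outlier_flags(charges)
print(flags)
```

On this tiny sample the IQR rule flags 500.0 while the z-score rule does not, because a single extreme value inflates the standard deviation. That is one reason to report both rules side by side instead of relying on one.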

#### 4. 2.4 – Categorical Integrity & Domain Diagnostics

This is statistical too, but on categories:

* Frequency tables (`count`, `% share`)
* Rare-category identification (`<1%`, `<0.5%`, etc.)
* Domain frequency audits, dominant vs fragmented labels
* Potentially association measures later (Cramér's V, Theil's U, etc.)

All of that is **discrete / categorical statistics**.
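
A sketch of a frequency table with a rare-category flag, assuming a 1% share threshold. The helper name, threshold, and sample labels are illustrative.

```python
import pandas as pd

def category_profile(s: pd.Series, rare_pct: float = 1.0) -> pd.DataFrame:
    """Frequency table with percentage share and a rare-category flag."""
    counts = s.value_counts(dropna=False)
    out = counts.rename_axis("value").reset_index(name="count")
    out["pct_share"] = out["count"] / out["count"].sum() * 100
    out["is_rare"] = out["pct_share"] < rare_pct
    return out

contract = pd.Series(
    ["Month-to-month"] * 120 + ["Two year"] * 79 + ["month to month"] * 1,
    name="Contract",
)
profile = category_profile(contract)
print(profile)
```

Rare labels are often near-duplicates of a dominant label (as with the lowercase variant here), so this table doubles as a token-cleanup worklist.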

---

### Mostly logical / rule-based (but standing on stats)

* **2.5 Logic Checks**

  * Business rules like `TotalCharges ≈ MonthlyCharges * tenure`,
    `tenure=0 ⇒ TotalCharges=0`, etc.
    These *may* use thresholds or tolerances that come from stats (e.g. "allow ±3σ"), but they're conceptually logic/contract-based.

* **2.6 Apply Phase QA**

  * Comparing "before vs after" distributions, counts of rows changed, etc.
    That's partly stats (distribution comparison) and partly pipeline QA.
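
The two business rules above could be checked roughly like this. The 15% relative tolerance is an assumption picked for the example, and `check_total_charges` is a hypothetical helper name.

```python
import pandas as pd

def check_total_charges(df: pd.DataFrame, rel_tol: float = 0.15) -> pd.DataFrame:
    """Flag rows that break the TotalCharges vs MonthlyCharges * tenure rules."""
    expected = df["MonthlyCharges"] * df["tenure"]
    abs_diff = (df["TotalCharges"] - expected).abs()
    # Rule A: tenure == 0 implies TotalCharges == 0 (missing totals are treated as 0 here).
    zero_tenure_bad = (df["tenure"] == 0) & (df["TotalCharges"].fillna(0) != 0)
    # Rule B: for tenure > 0, the total must be within rel_tol of the expected product.
    # Rows with a missing TotalCharges compare as False and are not flagged by this rule.
    ratio_bad = (df["tenure"] > 0) & (abs_diff > rel_tol * expected.abs())
    return df.assign(expected_total=expected, violates_rule=zero_tenure_bad | ratio_bad)

sample = pd.DataFrame({
    "tenure": [10, 0, 0, 12],
    "MonthlyCharges": [50.0, 30.0, 30.0, 100.0],
    "TotalCharges": [510.0, 0.0, 30.0, 600.0],
})
logic_report = check_total_charges(sample)
print(logic_report)
```

A relative tolerance is used because monthly prices change over a customer's lifetime, so an exact equality check would flag most legitimate rows.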

---

### TL;DR: if a recruiter asks "Where's the statistics part?"

You can honestly say:

* **Descriptive completeness stats**: 2.1.8
* **Type-discovery stats (uniques, proportions, binary detection)**: 2.2.x
* **Numeric distribution & outlier statistics**: 2.3
* **Categorical frequency & rare-category statistics**: 2.4
* **Optional multivariate / drift tests**: later 2.x if you add them (e.g., Cramér's V, PSI, KS tests)

Everything else in Section 2 is the **infrastructure that makes those statistics reproducible, traceable, and production-ready**.

</details>

---
<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border: 1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:900;">
📊 Report Summary
</summary>

<h1>📊 IBM Telco Customer Churn: Exploratory Data Analysis (EDA)</h1>

**Author:** Brandon Hardison  
**Role:** Analytics Engineering Student  
**Notebook:** `01_EDA.ipynb`  
**Version:** v1.0  
**Date Completed:** 2025-10-31  

---
<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border: 1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:900;">
Purpose 🎯
</summary>
<div style="margin: 20px; padding: 10px; background-color: #f8f9fa; border-radius: 10px;">

This notebook performs an in-depth **Exploratory Data Analysis (EDA)** on the IBM Telco Customer Churn dataset.  
It focuses on data quality diagnostics, missing value handling, type coercion, categorical normalization,  
and target preparation for downstream modeling and feature engineering.

Its objectives are to:
- Assess overall **data quality**, including missing values, type consistency, and categorical normalization.  
- **Understand the dataset's structure and feature distributions** through descriptive statistics and visualization.  
- **Identify statistically significant predictors of customer churn** for downstream modeling.  

The analysis is designed for both **business stakeholders** seeking actionable insights  
and the **data science team** responsible for model development and feature engineering.


---

### 📁 Dataset Summary
- **Source:** IBM Telco Customer Churn (public Kaggle / IBM sample dataset)  
- **Rows:** ~7,000 customer records  
- **Columns:** 21 features  
- **Target:** `Churn` (Yes/No) → numeric flag `Churn_flag`  

---

### 🧠 Report Scope
This notebook covers:
1. **Data Quality & Cleaning (Section 2)**  
   - Missing value scan  
   - Numeric validation & coercion  
   - Categorical cleaning  
   - Cross-field & business-rule consistency  
2. **Preliminary Target & Demographic Diagnostics (Section 2.12)**  
3. **Preparation for Modeling & Feature Engineering (next notebook)**

---

> _This report is designed for internal validation and reproducibility.  
> All outputs are atomic (timestamped) and feed directly into Level_3 reports and resources._
</div>
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border: 1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:900;">
Deliverables
</summary>
<div style="margin: 20px; padding: 10px; background-color: #f8f9fa; border-radius: 10px;">


✅ **Deliverables from EDA Notebook**

| Output Type       | Example File                  | Used In                 |
| ----------------- | ----------------------------- | ----------------------- |
| Clean EDA dataset | `telco_eda.parquet`           | Statistics & Modeling   |
| EDA report        | `eda_summary.csv`             | Insights notebook       |
| Visuals           | `figures/*.png`               | Insights presentation   |
| Notes             | Inline markdown or `.md` file | Documentation & handoff |

────────────────────────────────────────────
➡️ 3.0  Descriptive Statistics & EDA
</div>
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border: 1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:900;">
Outlines
</summary>
<div style="margin: 20px; padding: 10px; background-color: #f8f9fa; border-radius: 10px;">

</div>
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border: 1px solid #e5e7eb;
    padding:10px 12px;border-radius:10px;font-weight:900;">
Pros & Cons of full-inline pipeline
</summary>

Short version: doing it all inline is **awesome as a learning/spec pass**, and kind of terrible as a long-term implementation. 😄

I'll break it down.

---

## Benefits of a fully inline, no-`def` pipeline

### 1. Maximum transparency & debuggability

* You can **see every step** in the exact order it runs. No hunting for helper functions across files.
* When something blows up, the traceback points straight at **the exact cell / section** instead of some deep helper.
* It's easier to answer: *"What is this pipeline actually doing to the data?"* because the answer is on one page, in order, with concrete operations.

### 2. Perfect "executable spec"

* This style gives you a **single, linear, executable specification** of Section 2.x / 3.x:

  * What directories must exist
  * What artifacts are produced and where
  * What each check actually checks
* For future you (or reviewers / interviewers), this file is a **golden reference**: "This is the canonical behavior; all refactors must preserve this."

### 3. Great for learning & design

* You're forcing yourself to think about **data flow, state, and dependencies**:

  * "This needs `df` and `sec2_reports_dir` in globals."
  * "This script assumes `CONFIG` and `C()` exist, but can still run without them."
* No abstractions to hide behind = you really internalize:

  * How checks depend on each other
  * How config gets resolved
  * How Section 2/3 artifacts hang together

For your analytics-engineering brain, this is basically a **design doc + implementation in one**.

### 4. Easier to make dataset-agnostic *specs*

* Doing it inline makes it obvious which pieces are:

  * **Truly generic** (e.g., "write CSV with tmp + `os.replace`")
  * **Schema-aware but dataset-agnostic** (e.g., "ONEHOT.GROUPS", "TOTALS.RULES" from CONFIG)
  * **Telco-specific** (e.g., `tenure × MonthlyCharges` helpers)
* That clarity helps when you later split into:

  * `core/generic_dq.py`
  * `projects/telco/logic_rules.yaml`
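
The "write CSV with tmp + `os.replace`" pattern mentioned above is a good example of a truly generic piece. A minimal standard-library sketch follows; the function name `write_csv_atomic` and the demo path are illustrative.

```python
import csv
import os
import tempfile
from pathlib import Path

def write_csv_atomic(rows, header, dest) -> Path:
    """Write rows to a temp file next to dest, then swap it into place."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, prefix=dest.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(rows)
        os.replace(tmp_name, dest)  # readers see the old file or the new one, never a partial
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return dest

demo_dir = Path(tempfile.mkdtemp())
out_path = write_csv_atomic(
    rows=[("TotalCharges", 0.16)],
    header=["column", "pct_missing"],
    dest=demo_dir / "reports" / "missingness.csv",
)
print(out_path.read_text(encoding="utf-8"))
```

Nothing in this helper knows about Telco, which is what makes it a natural first candidate for a shared module.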

### 5. Frictionless tweaks during exploration

* Want to change a tolerance or add a column? You edit that one block and rerun. No worrying about function signatures, imports or modules being stale.
* For "Level_3" work, **speed of iteration** matters more than elegance.

### 6. Great for portfolio / code review

* A reviewer can:

  * Scroll once and **see the entire logic stack**: bootstrap → numeric checks → logic → anomaly context.
  * Understand your data-quality philosophy without jumping across files.
* Then you can say, "This notebook is the **spec level**. In Level_4+ I factor it into a package."

---

## Cons of staying fully inline long term

### 1. Zero reuse ‚Üí duplication everywhere

* You already see this: path resolution, `SECTION2_REPORT_PATH` merging, temp-file writes, `if "df" not in globals()` checks…
* Inline only means:

  * If you want the same checks for EDGAR, Telco, and some 3rd dataset, you‚Äôll copy / paste and lightly edit.
  * Bug fixes must be applied in multiple places, which is how pipelines drift out of sync and become untrustworthy.

### 2. Hard to test properly

* Unit tests want small, pure-ish functions: "Given this config and df fragment, does `catnum_alignment` behave?"
* With fully inline code:

  * You can only really test by running the **whole script**, which is slow and noisy.
  * It‚Äôs extremely hard to test edge cases without manually hacking global state in notebook cells.

### 3. Brittle global state & ordering

* Everything depends on:

  * Correct **cell order**
  * Correct **global variables** (`df`, `CONFIG`, `C`, `REPORTS_DIR`, etc.)
* If you:

  * Rerun cells out of order
  * Import this notebook as a module
  * Or try to call a "piece" of Section 2.5 from another notebook
    it becomes fragile quickly.
* Production systems (Airflow, Dagster, Prefect, etc.) **hate** this style because they need deterministic, parameterized tasks.

### 4. Harder to parameterize & orchestrate

* Want to:

  * Run Section 2 on multiple datasets?
  * Toggle checks on/off from the command line?
  * Run only 2.5.7–2.5.11 for a quick regression?
* With inline-only:

  * You end up adding more and more `if run_this:` globals and config flags inside the giant script.
  * That becomes just as complex as having functions, but without the structure.

### 5. Maintenance cost explodes with size

* Section 2.5 alone is already huge.
* Imagine:

  * Section 2 numeric + categorical
  * Section 3 features
  * Section 4 modeling
    all inline in one place.
* Keeping mental track of **dependencies, side effects, and artifacts** becomes a full-time job.
* Refactors (e.g., changing anomaly severity scheme) are risky because you‚Äôre editing big blocks instead of well-scoped helpers.

### 6. Integration with other tools is awkward

* Want:

  * A CLI: `telco-churn run section2 --level full`
  * A Python package other notebooks can `import telco_churn.dq as dq` from
  * A web UI that re-runs just the logic section
* Inline only means:

  * You basically have to re-implement the logic *again* in function form or script wrappers.
  * Or you end up calling "magic notebook cells" via `papermill`, which is not fun.

### 7. Cognitive load for teammates / future-you

* For you today, the inline spec is crisp because you grew it.
* For future you (or a teammate), 1500+ lines of inline code with repeated patterns is intimidating:

  * Harder to see "oh, this shape appears in 5 places, we should unify it."
  * Harder to tell which parts are safe to change without breaking everything.

---

## How to use this style *strategically*

The way you‚Äôre doing it actually fits a **nice maturity curve**:

1. **Level_3-ish: Full inline "spec mode" (where you are)**

   * Goal: get *everything* correct, explicit, and observable.
   * You prioritize **clarity and traceability** over elegance.
   * You discover the **natural function boundaries** by feeling the repetition and pain.

2. **Level_4+: Extract patterns, keep the spec**

   * You:

     * Identify repeated shapes (**path resolution**, **append section2 report**, **config access**, **CSV/parquet writes**, **anomaly aggregation patterns**).
     * Pull those into small, sharp helpers in a `telco_churn` package.
   * But you **keep this inline file** as:

     * Documentation
     * A regression reference
     * A teaching artifact / portfolio piece

3. **Later: pure package / CLI oriented**

   * At this point:

     * Notebooks become **thin orchestration / visualization**.
     * The real logic lives in reusable functions / classes.

So:

* **Pros**: unbeatable for learning, design clarity, auditing, and having an executable spec for the whole logic layer.
* **Cons**: terrible for reuse, testing, orchestration, and long-term maintenance if you *never* factor out helpers.

If you want, next step I can help you:

* Walk through your inline 2.5.x and highlight **exact candidates** for the first 5–10 helper functions you'd eventually extract, while still honoring your "no `def` until Level_3 is done" rule.

</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #b95d39ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
Section 1 🏗️ Environment, Imports & Project Setup: click to Jump to Section
</summary>

> ### Section 1 is your **foundation**: the layer that ensures every later stage (2.0 Data Quality, 3.0 Modeling, etc.) runs reproducibly and coherently.
> **🎯 Purpose:**
> - Establish a reproducible, production-grade analytics environment that guarantees consistency across machines, teammates, and time.
> - All later sections (2.x–4.x) depend on this foundation.


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.1 📦 Imports & Environment Settings
</summary>

```groovy
│
├── 1.1.1 Import Core Libraries
│        → pandas, numpy, matplotlib, seaborn, scipy, pathlib, warnings
│        → Establish reproducible numeric & plotting environment
│
├── 1.1.2 Visualization Defaults
│        → Apply consistent theme (whitegrid, font, figsize)
│        → Centralizes aesthetic consistency across notebooks
│
├── 1.1.3 Global Display & Print Settings
│        → Configure pandas display, float precision, warning filters
│        → Guarantees uniform console and report behavior
│
└── Output: environment_log.txt (optional)
    🟩 *Hiring value:* Demonstrates environment reproducibility and
        version-controlled dependency awareness, key in team analytics.
```
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:5px solid #70b1d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.2 📁 Path Setup & Directory Management
</summary>

```py
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
├── 1.2.1 Resolve Project Root
│        → Dynamically detect `/Telco` or repo root path.
│        → Supports portability across dev and prod.
│
├── 1.2.2 Define Core Subdirectories
│        → /data/raw/, /data/processed/, /src/, /reports/, /config/, /resources/
│        → Create missing directories automatically.
│
├── 1.2.3 Verify Writable Paths
│        → Assert read/write permissions before runtime.
│
└── Output: directory_structure_report.txt
    🟩 *Hiring value:* Signals infrastructure literacy (you treat notebooks like codebases).
```
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.3 🧰 Environment Configuration & Dependency Management
</summary>

```py
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
├── 1.3.1 Virtual Environment / Kernel Registration
│        → Create `venv` or `conda` env, register with Jupyter kernel.
│
├── 1.3.2 requirements.txt / setup_env.sh
│        → Pin library versions for deterministic builds.
│
├── 1.3.3 CONFIG Loader
│        → Load central YAML/JSON config (project constants, thresholds).
│
└── Output: env_requirements_report.txt
    🟩 *Hiring value:* Shows MLOps maturity: you know reproducibility is part of engineering.
```
</details>
<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #70b1d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.4 ⚙️ Global Constants & Config Mapping
</summary>

```groovy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
├── 1.4.1 Load CONFIG["PATHS"], CONFIG["RANGES"], CONFIG["DATA_CONTRACTS"]
│        → Provides shared context across all notebooks (2.x–3.x).
│
├── 1.4.2 Set Random Seeds
│        → Guarantees deterministic statistical results.
│
├── 1.4.3 Register Protected Columns
│        → ID and target columns (customerID, Churn_flag) locked from mutation.
│
└── Output: config_validation_report.csv
    🟩 *Hiring value:* Reflects enterprise-ready modularity: you centralize constants cleanly.
```
</details>
<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.5 🧩 Dataset Load & Version Verification
</summary>

```groovy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
│
├── 1.5.1 Load Raw Dataset (CSV or Parquet)
│        → Reads from CONFIG["PATHS"]["RAW_DATA"]
│
├── 1.5.2 Hash / Snapshot Validation
│        → Compare MD5/SHA hash of dataset to version registry.
│
├── 1.5.3 Schema Version Logging
│        → Record dataset version, load timestamp, row/column count.
│
└── Output: dataset_load_log.csv
    🟩 *Hiring value:* Proves you understand data lineage: recruiters see governance mindset.
```
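
For step 1.5.2, a hash check can be sketched with the standard library alone. The registry here is a plain dict keyed by file name, which is an assumption for the example; the real version registry could equally be a JSON or CSV file. SHA-256 is used as one concrete choice from the "MD5/SHA" options named above.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_snapshot(path, registry: dict) -> dict:
    """Compare a file's current hash with the one recorded in the registry."""
    path = Path(path)
    expected = registry.get(path.name)
    actual = sha256_of_file(path)
    return {"file": path.name, "expected": expected, "actual": actual,
            "match": expected == actual}

# Demo on a throwaway file; the file name is made up.
demo_file = Path(tempfile.mkdtemp()) / "telco_raw.csv"
demo_file.write_bytes(b"customerID,tenure\nA1,1\n")
registry = {"telco_raw.csv": sha256_of_file(demo_file)}

first_check = verify_snapshot(demo_file, registry)
demo_file.write_bytes(b"customerID,tenure\nA1,2\n")  # simulate a silent data change
second_check = verify_snapshot(demo_file, registry)
print(first_check["match"], second_check["match"])
```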
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.6 🔍 Quick Structural Validation (Pre-DQ Sanity)
</summary>

```groovy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.6 🔍 Quick Structural Validation (Pre-DQ Sanity)
│
├── 1.6.1 Verify Non-Empty Dataset
│        → Assert at least one row and one column
│          (df.shape[0] > 0 and df.shape[1] > 0; a tuple comparison
│          like df.shape > (0, 0) would pass with zero columns).
│
├── 1.6.2 Basic Column-Type Introspection
│        → preview df.dtypes, null counts, unique counts.
│
├── 1.6.3 Target Variable Registration
│        → Map "Yes"/"No" → 1/0; create `Churn_flag`.
│        → Ensures consistency for modeling (Section 3).
│
└── Output: initial_schema_summary.csv
    🟩 *Hiring value:* You validate structure *before* analysis, which shows quality-first discipline.
```
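
Steps 1.6.1 and 1.6.3 fit in one small helper. This is a sketch with a hypothetical function name; it fails loudly on unknown labels instead of mapping them silently to missing values.

```python
import pandas as pd

def register_target(df: pd.DataFrame, target: str = "Churn",
                    flag: str = "Churn_flag") -> pd.DataFrame:
    """Check the frame is non-empty, then add a 1/0 flag for a Yes/No target."""
    if df.shape[0] == 0 or df.shape[1] == 0:
        raise ValueError(f"Dataset is empty: shape={df.shape}")
    mapping = {"Yes": 1, "No": 0}
    labels = df[target].str.strip()
    unknown = set(labels.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"Unexpected {target} labels: {sorted(unknown)}")
    out = df.copy()
    # Nullable integer dtype keeps missing targets as <NA> instead of forcing floats.
    out[flag] = labels.map(mapping).astype("Int64")
    return out

sample = pd.DataFrame({"customerID": ["A1", "B2", "C3"], "Churn": ["Yes", "No", " Yes "]})
flagged = register_target(sample)
print(flagged)
```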
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.7 🧾 Environment Summary & Log Consolidation
</summary>

```groovy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1.7 🧾 Environment Summary & Log Consolidation
│
├── 1.7.1 Capture System & Library Versions
│        → Python, pandas, numpy, seaborn, matplotlib, etc.
│
├── 1.7.2 Generate Setup Summary Report
│        → Aggregates paths, env vars, library versions, config sources.
│
└── Output: setup_summary.json / markdown cell
    🟩 *Hiring value:* Gives recruiters a polished, verifiable record of engineering hygiene.
```
</details>

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:10px solid #8470d4ff;
padding:10px 12px;border-radius:10px;font-weight:700;"> 
1.8 ✅ Integration Points
</summary>

```groovy
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
✅ Integration Points
→ Feeds: Section 2 (Data Quality & Integrity Framework)
→ Updates: CONFIG, PROJECT_ROOT, ENV_LOGS
→ Ensures all later notebooks can run deterministically and portably.
```
</details>


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.0.0-1.0.2 | PART A 📦 Imports & Environment Settings
</summary>

>### **1.0.0 Import Core Libraries**
- pandas, numpy, matplotlib, seaborn, scipy, pathlib, warnings
- Establish reproducible numeric & plotting environment

>### **1.0.1 Visualization Defaults**
- Apply consistent theme (whitegrid, font, figsize)
- Centralizes aesthetic consistency across notebooks

>### **1.0.2 Global Display & Print Settings**
- Configure pandas display, float precision, warning filters
- Guarantees uniform console and report behavior

---

- **Output:** environment_log.txt (optional)

- 🟩 *Hiring value:* Demonstrates environment reproducibility and version-controlled dependency awareness, key in team analytics.

</details>

In [34]:
# 1.0.0 Imports & Environment Settings

# Standard library
import os
import sys
import json
import math
import random
import platform
import hashlib
import warnings
from pathlib import Path
from types import MappingProxyType
from datetime import datetime, date
import subprocess

now = datetime.now()
today = date.today()

# Third-party
import pandas as pd
import numpy as np
import yaml
from IPython.display import display
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns

# Local application (Create after first full build)
# from telco_churn.config.loader import C, CONFIG
# from telco_churn.io.datasets import load_raw_telco
# from telco_churn.dq.schema_checks import run_schema_checks


In [35]:
# 1.0.1 Clean visualization defaults
warnings.filterwarnings("ignore")
sns.set_theme(style="whitegrid")

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:,.4f}".format)
plt.rcParams.update(
    {
        "figure.figsize": (12, 6),
        "font.size": 11,
        "axes.grid": True,
    }
)


In [36]:
# 1.0.2 Core library version snapshot (match pins in requirements.txt)
versions = {
    "python":    sys.version.split()[0],
    "pandas":    pd.__version__,
    "numpy":     np.__version__,
    "seaborn":   sns.__version__,
    "matplotlib": mpl.__version__,
    "pyyaml":    getattr(yaml, "__version__", "unknown"),
}

print("✅ Env & visual defaults set!")
print("   " + " | ".join(f"{k}: {v}" for k, v in versions.items()))


✅ Env & visual defaults set!
   python: 3.12.6 | pandas: 2.3.3 | numpy: 2.3.5 | seaborn: 0.13.2 | matplotlib: 3.10.7 | pyyaml: 6.0.3


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.1.1-1.1.3 | PART B 📁 Path Setup & Directory Management
</summary>

**1.1.1-1.1.3 | PART B Path Setup & Directory Management**
- "Level_3" centralized via `LEVEL_NAME`
- Case-insensitive project root detection
- Explicit prints for `PROJECT_ROOT` and `LEVEL_ROOT`

---

>### **1.1.1 Resolve Project Root**
- → Dynamically detect `/Telco` (or repo root) via `Path.cwd().resolve()` walk-up.
- → Supports portability across dev, CI, and prod-like environments.

---

>### **1.1.2 Define Core Subdirectories**
- → `/resources/data/raw/`, `/resources/data/processed/`, `/Level_3/resources/reports/`, `/config/`, `/models/`, `/Level_3/resources/artifacts/`, etc.
- → Create missing directories automatically (`mkdir(parents=True, exist_ok=True)`).

---

>### **1.1.3 Verify Writable Paths**
- → Assert read/write permissions for key paths (data, reports, artifacts).

- **Output:** `directory_structure_report.txt`
- 🟩 *Hiring value:* Signals infrastructure literacy: you treat notebooks like codebases with a real project root and folder structure.

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:2px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
# TODO: | PART B Path Setup & Directory Management / 1113_924
</summary>

This looks really solid already. This is exactly the kind of "Part B: environment bootstrap" that makes a project feel real. There are just a few small things I'd tweak for clarity/maintainability and "future-you won't hate this" vibes.

I'll point them out and then show a tightened version.

---

## 1. PROJECT_ROOT resolution

This part is good:

```python
CURRENT_PATH = Path.cwd().resolve()
PROJECT_ROOT = None

for parent in [CURRENT_PATH] + list(CURRENT_PATH.parents):
    # Heuristic 1: repo folder name is "Telco"
    if parent.name == "Telco":
        PROJECT_ROOT = parent
        break
```

Two small ideas:

1. **Be case-insensitive**, just in case the folder gets renamed `telco` at some point:

   ```python
   if parent.name.lower() == "telco":
   ```

2. **Consider an override** (env var) later, but that can wait until you have the package; your comment about `config/project_config.yaml` is already a nice "future heuristic."
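
If you do add the override later, it could look something like this sketch. The environment variable name `TELCO_PROJECT_ROOT` and the function name are assumptions made up for the example, not something the project defines today.

```python
import os
from pathlib import Path

def resolve_project_root(start=None, folder_name: str = "telco",
                         env_var: str = "TELCO_PROJECT_ROOT") -> Path:
    """Return the project root: env-var override first, then a case-insensitive walk-up."""
    override = os.environ.get(env_var)
    if override:
        root = Path(override).expanduser().resolve()
        if not root.is_dir():
            raise FileNotFoundError(f"{env_var} points to a missing directory: {root}")
        return root
    current = Path(start or Path.cwd()).resolve()
    for parent in [current, *current.parents]:
        if parent.name.lower() == folder_name:
            return parent
    raise FileNotFoundError(
        f"Could not locate a '{folder_name}' folder at or above {current}"
    )
```

In a notebook you would call `PROJECT_ROOT = resolve_project_root()`; CI jobs or scripts that run from a different working directory can set the variable instead of relying on the folder name.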

---

## 2. Magic `"Level_3"` string everywhere

Right now:

```python
CONFIG_DIR = PROJECT_ROOT / "Level_3" / "config"

CONVENTION_DIRS = [
    CONFIG_DIR,
    PROJECT_ROOT / "resources" / "data" / "raw",
    PROJECT_ROOT / "resources" / "data" / "processed",
    PROJECT_ROOT / "Level_3" / "resources" / "reports",
    PROJECT_ROOT / "Level_3" / "resources" / "figures",
    PROJECT_ROOT / "Level_3" / "resources" / "models",
    PROJECT_ROOT / "Level_3" / "resources" / "outputs",
    PROJECT_ROOT / "Level_3" / "resources" / "artifacts",
]
```

You're going to copy this pattern for Level_4, Level_5, etc., so I'd pull `"Level_3"` into a tiny constant at the top of the notebook:

```python
LEVEL_NAME = "Level_3"
```

Then:

```python
LEVEL_ROOT = PROJECT_ROOT / LEVEL_NAME

CONFIG_DIR = LEVEL_ROOT / "config"

CONVENTION_DIRS = [
    CONFIG_DIR,
    PROJECT_ROOT / "resources" / "data" / "raw",
    PROJECT_ROOT / "resources" / "data" / "processed",
    LEVEL_ROOT / "resources" / "reports",
    LEVEL_ROOT / "resources" / "figures",
    LEVEL_ROOT / "resources" / "models",
    LEVEL_ROOT / "resources" / "outputs",
    LEVEL_ROOT / "resources" / "artifacts",
]
```

Same behavior, much easier to generalize later.

---

## 3. Directory creation + writable check

This bit is great:

```python
for d in CONVENTION_DIRS:
    d.mkdir(parents=True, exist_ok=True)
```

and:

```python
writable_checks = []
for d in CONVENTION_DIRS:
    try:
        test_file = d / ".write_test.tmp"
        test_file.write_text("ok", encoding="utf-8")
        test_file.unlink(missing_ok=True)
        writable_checks.append((str(d), True))
    except Exception:
        writable_checks.append((str(d), False))
```

Two micro-tweaks:

1. **Narrow the exception** if you want (e.g. `OSError`), but `Exception` is fine for a "health check" if you just log it.
2. Add a tiny comment that explains *why* you're doing the test write (you know now; future-you might not):

```python
# 1.2.3 Verify Writable Paths (simple can-I-write-here check)
```

You already have the printout at the end, which is 👌.

---

## 4. A slightly polished version

Here's your same logic with those tiny tweaks:

```python
# 1.2.1–1.2.3 | PART B Path Setup & Directory Management

# 1.2.1 Resolve Project Root
CURRENT_PATH = Path.cwd().resolve()
PROJECT_ROOT = None
LEVEL_NAME = "Level_3"  # current working level
LEVEL_ROOT = None

for parent in [CURRENT_PATH] + list(CURRENT_PATH.parents):
    # Heuristic 1: repo folder name is "Telco"
    if parent.name.lower() == "telco":
        PROJECT_ROOT = parent
        LEVEL_ROOT = PROJECT_ROOT / LEVEL_NAME
        break
    # Heuristic 2 (future): parent contains config/project_config.yaml
    # if (parent / "config" / "project_config.yaml").exists():
    #     PROJECT_ROOT = parent
    #     LEVEL_ROOT = PROJECT_ROOT / LEVEL_NAME
    #     break

if PROJECT_ROOT is None or LEVEL_ROOT is None:
    raise FileNotFoundError(
        f"❌ Could not locate Telco project root from {CURRENT_PATH}. "
        "Expected a 'Telco' folder or 'config/project_config.yaml' above."
    )

print(f"📁 1.2.1 Project root resolved   → {PROJECT_ROOT}")
print(f"📁 1.2.1 Level root resolved     → {LEVEL_ROOT}")

# 1.2.2 Define Core Subdirectories (convention-level, before config)
CONFIG_DIR = LEVEL_ROOT / "config"

CONVENTION_DIRS = [
    CONFIG_DIR,
    PROJECT_ROOT / "resources" / "data" / "raw",
    PROJECT_ROOT / "resources" / "data" / "processed",
    LEVEL_ROOT / "resources" / "reports",
    LEVEL_ROOT / "resources" / "figures",
    LEVEL_ROOT / "resources" / "models",
    LEVEL_ROOT / "resources" / "outputs",
    LEVEL_ROOT / "resources" / "artifacts",
]

for d in CONVENTION_DIRS:
    d.mkdir(parents=True, exist_ok=True)

# 1.2.3 Verify Writable Paths
writable_checks = []
for d in CONVENTION_DIRS:
    try:
        test_file = d / ".write_test.tmp"
        test_file.write_text("ok", encoding="utf-8")
        test_file.unlink(missing_ok=True)
        writable_checks.append((str(d), True))
    except Exception:
        writable_checks.append((str(d), False))

print("\n📂 1.2.2-1.2.3 Directory structure & write access:")
for path_str, ok in writable_checks:
    status = "✅ writable" if ok else "⚠️ not writable"
    print(f"   {status}: {path_str}")
```

Functionally identical to what you have, but:

* `"Level_3"` is centralized
* Case-insensitive root detection
* Slightly more explicit prints

---

If you want, next step I can help you turn this 1.2 block into a tiny `telco_churn.config.paths` module (e.g. `get_project_root()`, `get_level_root()`) so your notebooks can just call a function instead of carrying the for-loop logic inline.
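
As a rough sketch of what that module could look like: the two function names come from the suggestion above, while the `TELCO_PROJECT_ROOT` environment variable and the default arguments are assumptions made for illustration, not existing conventions in your repo.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def get_project_root(
    start: Optional[Path] = None,
    folder_name: str = "telco",
    env_var: str = "TELCO_PROJECT_ROOT",  # hypothetical override variable
) -> Path:
    """Return the project root: env override first, then walk upward by folder name."""
    override = os.environ.get(env_var)
    if override:
        return Path(override).resolve()
    current = (start or Path.cwd()).resolve()
    for parent in [current, *current.parents]:
        if parent.name.lower() == folder_name:
            return parent
    raise FileNotFoundError(f"Could not locate project root from {current}")


def get_level_root(level_name: str = "Level_3", start: Optional[Path] = None) -> Path:
    """Return <project root>/<level_name>."""
    return get_project_root(start) / level_name


# Usage against a throwaway directory tree that mimics the notebook layout.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "Telco" / "Level_3" / "notebooks"
    nested.mkdir(parents=True)
    root_name = get_project_root(nested).name
    level_name = get_level_root(start=nested).name
```

A notebook would then only need `PROJECT_ROOT = get_project_root()` and `LEVEL_ROOT = get_level_root()`.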


</details>

In [37]:
# 1.1.1  Resolve Project Root

CURRENT_PATH = Path.cwd().resolve()
PROJECT_ROOT = None
LEVEL_NAME = "Level_3"  # current working level
LEVEL_ROOT = None

for parent in [CURRENT_PATH] + list(CURRENT_PATH.parents):
    # Heuristic 1: repo folder name is "Telco"
    # TODO: consider an explicit override once this becomes a package; the
    #  config/project_config.yaml check below is already a reasonable future heuristic.
    #  Match case-insensitively in case the folder is renamed.
    if parent.name.lower() == "telco":
        PROJECT_ROOT = parent
        LEVEL_ROOT = PROJECT_ROOT / LEVEL_NAME
        break
    # Heuristic 2: parent contains config/project_config.yaml
    # Data lives up a level from the project root
    # if (parent / "config" / "project_config.yaml").exists():
    #     PROJECT_ROOT = parent
    #     break

if PROJECT_ROOT is None:
    raise FileNotFoundError(
        f"❌ Could not locate Telco project root from {CURRENT_PATH}. "
        "Expected a 'Telco' folder or 'config/project_config.yaml' above."
    )

print(f"📁 1.1.1 Project root resolved → {PROJECT_ROOT}")


📁 1.1.1 Project root resolved → /Users/b/DATA/PROJECTS/Telco


In [38]:
# 1.1.2 Define Core Subdirectories (convention-level, before config)

CONFIG_DIR = LEVEL_ROOT / "config"

DIRS = {
    "CONFIG_DIR":         CONFIG_DIR,
    "RAW_DATA_DIR":       PROJECT_ROOT / "resources" / "data" / "raw",
    "PROCESSED_DATA_DIR": PROJECT_ROOT / "resources" / "data" / "processed",
    "REPORTS_DIR":        LEVEL_ROOT / "resources" / "reports",
    "FIGURES_DIR":        LEVEL_ROOT / "resources" / "figures",
    "MODELS_DIR":         LEVEL_ROOT / "resources" / "models",
    "OUTPUTS_DIR":        LEVEL_ROOT / "resources" / "outputs",
    "ARTIFACTS_DIR":      LEVEL_ROOT / "resources" / "artifacts",
    "DASHBOARDS_DIR":     LEVEL_ROOT / "resources" / "_dash",
}

# Create directories (dirs only)
for path in DIRS.values():
    path.mkdir(parents=True, exist_ok=True)

# Optionally expose key paths as globals for later sections
REPORTS_DIR   = DIRS["REPORTS_DIR"]
ARTIFACTS_DIR = DIRS["ARTIFACTS_DIR"]
FIGURES_DIR   = DIRS["FIGURES_DIR"]
MODELS_DIR    = DIRS["MODELS_DIR"]
OUTPUTS_DIR   = DIRS["OUTPUTS_DIR"]
DASHBOARDS_DIR  = DIRS["DASHBOARDS_DIR"]

print(f"📁 1.1.2 Level root resolved     → {LEVEL_ROOT}")

📁 1.1.2 Level root resolved     → /Users/b/DATA/PROJECTS/Telco/Level_3


In [39]:
# 1.1.3 Verify writable paths (fail fast if directories are not usable)

# Simple "can-I-write-here?" smoke test so later sections fail fast
writable_checks = []

for name, path in DIRS.items():
    try:
        test_file = path / ".write_test.tmp"
        test_file.write_text("ok", encoding="utf-8")
        test_file.unlink(missing_ok=True)
        writable_checks.append((name, path, True))
    except Exception:
        writable_checks.append((name, path, False))

print("\n📂 1.1.3 Directory structure & write access:")
for name, path_obj, ok in writable_checks:
    status = "✅ writable" if ok else "⚠️ not writable"
    print(f"   {status}: {name} → {path_obj}")



📂 1.1.3 Directory structure & write access:
   ✅ writable: CONFIG_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/config
   ✅ writable: RAW_DATA_DIR → /Users/b/DATA/PROJECTS/Telco/resources/data/raw
   ✅ writable: PROCESSED_DATA_DIR → /Users/b/DATA/PROJECTS/Telco/resources/data/processed
   ✅ writable: REPORTS_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports
   ✅ writable: FIGURES_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/figures
   ✅ writable: MODELS_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/models
   ✅ writable: OUTPUTS_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/outputs
   ✅ writable: ARTIFACTS_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts
   ✅ writable: DASHBOARDS_DIR → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/_dash


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.2.1-1.2.3 | PART C 🧰 Env Config & Dependency Management | Hybrid Config Loader (Type 3 → Type 2)

</summary>

# 1.2.1-1.2.3 | PART C Load Environment Configuration & Hybrid Config Loader (Type 3 → Type 2)

>**Goal:** Load a **YAML config file** as the single source of truth (Type 3), then expose it as a safe, ergonomic in-memory mapping (Type 2-style API).

---

>### **1.2.1 Load External Config (`project_config.yaml`)**
- Read `config/project_config.yaml` using `yaml.safe_load`.
- Centralizes:
  - `PATHS` (RAW_DATA, PROCESSED, ARTIFACTS, REPORTS, FIGURES, MODELS, OUTPUTS)
  - `TARGET` (target column, raw target, positive/negative labels)
  - `ID_COLUMNS`
  - `RANGES` (tenure, MonthlyCharges, TotalCharges)
  - `DATA_QUALITY` thresholds (high-cardinality, rare categories, etc.).
- → YAML lives in version control and can be overridden per environment.
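
One way the per-environment override could work, sketched under assumptions: a base YAML and an optional override YAML are each loaded with `yaml.safe_load`, then merged so the override wins key by key. `deep_merge` and the file paths below are hypothetical; plain dicts stand in for the loaded YAML.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where `override` wins and nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Stand-ins for two yaml.safe_load(...) results (paths are made up for the demo).
base_cfg = {
    "PATHS": {"RAW_DATA": "data/_raw/telco.csv", "REPORTS": "Level_3/resources/reports"},
    "DATA_QUALITY": {"HIGH_CARD_THRESHOLD": 50},
}
ci_override = {"PATHS": {"RAW_DATA": "tests/fixtures/telco_sample.csv"}}

config_data = deep_merge(base_cfg, ci_override)
```

Only `PATHS.RAW_DATA` changes here; every other key keeps its base value, and `base_cfg` itself is left untouched.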

---

>### **1.2.2 Create In-Memory CONFIG + Read-Only View (`CFG`)**
- Store YAML contents in a Python `CONFIG` dict.
- Wrap it in `MappingProxyType` → `CFG` to prevent accidental mutation of top-level keys (nested dicts remain mutable).
- This becomes the **in-memory config object** used throughout the notebook.
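
`MappingProxyType` is a shallow wrapper, so nested sections can still be modified through it. The sketch below shows that behavior and one possible recursive alternative; `freeze` is a hypothetical helper, not part of this notebook.

```python
from types import MappingProxyType


def freeze(node):
    """Recursively wrap dicts in MappingProxyType and turn lists into tuples."""
    if isinstance(node, dict):
        return MappingProxyType({k: freeze(v) for k, v in node.items()})
    if isinstance(node, list):
        return tuple(freeze(v) for v in node)
    return node


CONFIG = {"RANGES": {"tenure": {"min": 0, "max": 120}}, "ID_COLUMNS": ["customerID"]}

shallow = MappingProxyType(CONFIG)
shallow["RANGES"]["tenure"]["max"] = 999   # succeeds: only the top level is read-only
leaked = CONFIG["RANGES"]["tenure"]["max"]

CONFIG["RANGES"]["tenure"]["max"] = 120    # restore the original value
frozen = freeze(CONFIG)
try:
    frozen["RANGES"]["tenure"]["max"] = 999
    blocked = False
except TypeError:                          # mappingproxy rejects item assignment
    blocked = True
```

If you adopt something like this, note that a dotted-path helper testing `isinstance(node, dict)` would need to test `collections.abc.Mapping` instead, and `json.dump` would still need the plain `CONFIG` dict.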

---

>### **1.2.3 Add Dotted-Path Helper `C()`**
- Implement `C("PATHS.RAW_DATA")` / `C("RANGES.tenure.max")` style lookup.
- → Keeps code clean and readable when accessing deeply nested config entries.

- └── **Output:** optional `config_snapshot.json` (for debugging / documentation)
- 🟩 *Hiring value:* Shows you combine file-based, versioned config with a clean runtime API, a common pattern in production pipelines.

</details>

In [40]:
# 1.2.1 Load external config file

CONFIG_DIR.mkdir(parents=True, exist_ok=True)
CONFIG_PATH = CONFIG_DIR / "project_config.yaml"

if not CONFIG_PATH.exists():
    # Error message includes a hint about the expected location.
    raise FileNotFoundError(
        f"❌ Config file not found at: {CONFIG_PATH}\n"
        "Expected a project_config.yaml under LEVEL_ROOT/config."
    )

# 💡 Use CONFIG_PATH.open(...) instead of a bare open() for consistency with pathlib.
with CONFIG_PATH.open("r", encoding="utf-8") as f:
    config_data = yaml.safe_load(f) or {}


In [41]:
# 1.2.2 Load in-memory config TODO: inspect yaml vars

CONFIG = config_data
CFG = MappingProxyType(CONFIG)  # 🔒 read-only view (top level only) to prevent accidental mutation

# Inspect the schema dtype maps added to project_config.yaml
strict_map   = CONFIG.get("SCHEMA_EXPECTED_DTYPES_STRICT", {})
semantic_map = CONFIG.get("SCHEMA_EXPECTED_DTYPES_SEMANTIC", {})


In [42]:
# 1.2.3 Add dotted helper - def

def C(path: str, default=None):
    """
    Dotted-path lookup into CONFIG.

    Example:
        C("PATHS.RAW_DATA")
        C("RANGES.tenure.max")
        C("TARGET.POSITIVE_CLASS", default="Yes")
    """
    node = CONFIG
    for part in path.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return default
    return node

# Optional snapshot
snapshot_path = CONFIG_DIR / "config_snapshot.json"
with snapshot_path.open("w", encoding="utf-8") as f:
    json.dump(CONFIG, f, indent=2)

print("✅ 1.2 Config loaded & wrapped as read-only mapping.")
print(f"   CONFIG_PATH:    {CONFIG_PATH}")
print(f"   Snapshot saved: {snapshot_path}")
print(f"   Project name:   {C('META.PROJECT_NAME')}")
print(f"   Target column:  {C('TARGET.COLUMN')} (raw: {C('TARGET.RAW_COLUMN')})")
print(f"   RAW_DATA path:  {C('PATHS.RAW_DATA')}")


✅ 1.2 Config loaded & wrapped as read-only mapping.
   CONFIG_PATH:    /Users/b/DATA/PROJECTS/Telco/Level_3/config/project_config.yaml
   Snapshot saved: /Users/b/DATA/PROJECTS/Telco/Level_3/config/config_snapshot.json
   Project name:   Telco Customer Churn
   Target column:  Churn_flag (raw: Churn)
   RAW_DATA path:  None


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.3.1-1.3.3 | PART D ⚙️ Global Constants & Config Mapping
</summary>

##### **Goal:** Map key entries from `CONFIG` into convenient runtime constants for the rest of the project.

---

>### **1.3.1 Resolve Config-Driven Paths & Columns**
- `RAW_DATA      = PROJECT_ROOT / C("PATHS.RAW_DATA")`
- `PROCESSED_DIR = PROJECT_ROOT / C("PATHS.PROCESSED")`
- `ARTIFACTS_DIR = PROJECT_ROOT / C("PATHS.ARTIFACTS")`
- `REPORTS_DIR`, `FIGURES_DIR`, `MODELS_DIR`, `OUTPUTS_DIR` similarly.
- `TARGET_NAME   = C("TARGET.COLUMN")`
- `RAW_TARGET    = C("TARGET.RAW_COLUMN")`
- `ID_COLS       = C("ID_COLUMNS", [])`
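
When a path can live under more than one key (for example `PATHS.RAW_DATA` or `PATHS.RAW_DATA_DIR`), the lookup-with-fallback can be folded into one helper. This is a sketch under assumptions: `lookup` mirrors the dotted-path idea of `C()`, and `resolve_path` plus the `/project` root are hypothetical names for the demo.

```python
from pathlib import Path
from typing import Any


def lookup(cfg: dict, dotted: str, default: Any = None) -> Any:
    """Dotted-path lookup into a nested dict, returning `default` when absent."""
    node: Any = cfg
    for part in dotted.split("."):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return default
    return node


def resolve_path(root: Path, cfg: dict, *keys: str) -> Path:
    """Return root / <first configured value among keys>, or raise KeyError."""
    for key in keys:
        rel = lookup(cfg, key)
        if rel is not None:
            return root / rel
    raise KeyError(f"Missing config path. Tried: {', '.join(keys)}")


cfg = {"PATHS": {"RAW_DATA_DIR": "data/_raw", "PROCESSED": "data/processed"}}
root = Path("/project")  # placeholder root for the demo

raw_data = resolve_path(root, cfg, "PATHS.RAW_DATA", "PATHS.RAW_DATA_DIR")
processed = resolve_path(root, cfg, "PATHS.PROCESSED", "PATHS.PROCESSED_DIR")
```

Raising `KeyError` with the list of attempted keys also covers the plain single-key paths, which would otherwise fail with a less helpful `TypeError` when the key is missing.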

---

>### **1.3.2 Set Random Seeds (Config-Driven)**
- Optionally read seed values from `CONFIG["SEEDS"]` or use default seeds.
- Set seeds for `numpy`, `random`, and any ML libraries you’ll use.
- Makes random draws repeatable across runs, provided every source of randomness is seeded and called in the same order.
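
A small sketch of the seeding step, assuming a single seed is enough for the demo. `seed_everything` is a hypothetical helper name; it seeds Python's `random` module and, only if NumPy is installed, NumPy's legacy global generator.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed Python's `random` and, if available, NumPy's global RNG."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        return  # NumPy not installed; nothing else to seed in this sketch
    np.random.seed(seed)


# Re-seeding with the same value reproduces the same sequence of draws.
seed_everything(42)
first = [random.random() for _ in range(3)]
seed_everything(42)
second = [random.random() for _ in range(3)]
```

For new NumPy code, a local generator created with `np.random.default_rng(seed)` and passed explicitly to the functions that need it is usually easier to reason about than global state.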

---

>### **1.3.3 Register Protected Columns & DQ Ranges**
- `PROTECTED_COLUMNS = set(ID_COLS) | {TARGET_NAME}`
- Load `RANGES  = C("RANGES")`
- Load `DQ_OPTS = C("DATA_QUALITY")` (thresholds, suspect tokens, etc.).
- These values govern Section 2’s checks and are all **config-driven**, not hard-coded.
- └── **Output:** `config_validation_report.csv`
- 🟩 *Hiring value:* Shows modular design: constants are centralized and derived from a versioned YAML config, not scattered magic numbers.
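
To illustrate how `RANGES` can drive a check without hard-coded bounds, here is a minimal sketch. `find_range_violations` is a hypothetical helper, the rows are made-up demo data, and plain dicts stand in for DataFrame rows to keep the example dependency-free.

```python
def find_range_violations(rows, ranges):
    """Yield (row_index, column, value) for values outside the configured [min, max]."""
    for i, row in enumerate(rows):
        for col, bounds in ranges.items():
            value = row.get(col)
            if value is None:
                continue  # missing values belong to a separate check
            lo = bounds.get("min", float("-inf"))
            hi = bounds.get("max", float("inf"))
            if not lo <= value <= hi:
                yield i, col, value


# Bounds copied from the RANGES section of the config shown in this notebook.
RANGES = {"tenure": {"min": 0, "max": 120}, "MonthlyCharges": {"min": 0, "max": 1000}}

# Made-up rows for illustration only.
rows = [
    {"customerID": "A", "tenure": 12, "MonthlyCharges": 70.5},
    {"customerID": "B", "tenure": -3, "MonthlyCharges": 20.0},
    {"customerID": "C", "tenure": 40, "MonthlyCharges": 1500.0},
]
violations = list(find_range_violations(rows, RANGES))
```

Changing a bound in the YAML changes the check's behavior with no code edit, which is the point of keeping these values config-driven.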

---

</details>


In [43]:
# 1.3.1 Resolve Config-Driven Paths & Columns
# TODO: def
# FIXME: (FIX_ID: 11132019): allow PROJECT_ROOT override via env/config instead of Telco-only heuristic
# FIXME: (FIX_ID: 11141430): allow C() override via env/config instead of Telco-only heuristic
# FIXME: (FIX_ID: 11141431): Keep or remove 1.3.4 / place all parts appropriately if keeping.
# FIXME(FIX_ID: 2025-11-13-01): RAW_DATA path is hard-coded to Telco layout.
#   Future: allow overriding PROJECT_ROOT via env var or CLI arg.

# RAW_DATA: support RAW_DATA then RAW_DATA_DIR
raw_data_rel = C("PATHS.RAW_DATA", None)
if raw_data_rel is None:
    raw_data_rel = C("PATHS.RAW_DATA_DIR", None)
if raw_data_rel is None:
    raise KeyError(
        "❌ Missing config for RAW_DATA path. "
        "Tried PATHS.RAW_DATA and PATHS.RAW_DATA_DIR."
    )
# Bound to RAW_DATA because the print below and every later cell use that name
RAW_DATA = PROJECT_ROOT / raw_data_rel

# PROCESSED_DIR: support PROCESSED then PROCESSED_DIR
processed_rel = C("PATHS.PROCESSED", None)
if processed_rel is None:
    processed_rel = C("PATHS.PROCESSED_DIR", None)
if processed_rel is None:
    raise KeyError(
        "❌ Missing config for PROCESSED path. "
        "Tried PATHS.PROCESSED and PATHS.PROCESSED_DIR."
    )
PROCESSED_DIR = PROJECT_ROOT / processed_rel

ARTIFACTS_DIR = PROJECT_ROOT / C("PATHS.ARTIFACTS")
REPORTS_DIR   = PROJECT_ROOT / C("PATHS.REPORTS")
FIGURES_DIR   = PROJECT_ROOT / C("PATHS.FIGURES")
MODELS_DIR    = PROJECT_ROOT / C("PATHS.MODELS")
OUTPUTS_DIR   = PROJECT_ROOT / C("PATHS.OUTPUTS")

for d in [PROCESSED_DIR, ARTIFACTS_DIR, REPORTS_DIR, FIGURES_DIR, MODELS_DIR, OUTPUTS_DIR]:
    d.mkdir(parents=True, exist_ok=True)

target_name = C("TARGET.COLUMN")
raw_target  = C("TARGET.RAW_COLUMN")
id_cols     = C("ID_COLUMNS", []) or []
ranges      = C("RANGES", {})
dq_opts     = C("DATA_QUALITY", {})
flags       = C("FLAGS", {})

print("📌 1.3.1 Resolved config-driven paths & columns:")
print(f"   RAW_DATA:      {RAW_DATA}")
print(f"   PROCESSED_DIR: {PROCESSED_DIR}")
print(f"   ARTIFACTS_DIR: {ARTIFACTS_DIR}")
print(f"   REPORTS_DIR:   {REPORTS_DIR}")
print(f"   FIGURES_DIR:   {FIGURES_DIR}")
print(f"   MODELS_DIR:    {MODELS_DIR}")
print(f"   OUTPUTS_DIR:   {OUTPUTS_DIR}")
print(f"   target_name:   {target_name} (raw: {raw_target})")
print(f"   id_cols:       {id_cols}")
print(f"   ranges:        {ranges}")
print(f"   data_quality:  {dq_opts}")


📌 1.3.1 Resolved config-driven paths & columns:
   RAW_DATA:      /Users/b/DATA/PROJECTS/Telco/data/_raw
   PROCESSED_DIR: /Users/b/DATA/PROJECTS/Telco/data/processed
   ARTIFACTS_DIR: /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts
   REPORTS_DIR:   /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports
   FIGURES_DIR:   /Users/b/DATA/PROJECTS/Telco/Level_3/resources/figures
   MODELS_DIR:    /Users/b/DATA/PROJECTS/Telco/Level_3/resources/models
   OUTPUTS_DIR:   /Users/b/DATA/PROJECTS/Telco/Level_3/outputs
   target_name:   Churn_flag (raw: Churn)
   id_cols:       ['customerID']
   ranges:        {'tenure': {'min': 0, 'max': 120}, 'MonthlyCharges': {'min': 0, 'max': 1000}, 'TotalCharges': {'min': 0, 'max': 100000}}
   data_quality:  {'NUMERIC_LIKE_THRESHOLD': 0.95, 'RARE_PCT_THRESHOLD': 0.5, 'RARE_MIN_COUNT': 5, 'HIGH_CARD_THRESHOLD': 50, 'NEARLY_CONST_THRESHOLD': 0.98, 'SUSPECT_TOKENS': ['?', 'unknown', 'unk', 'n/a', 'na', 'n.a.', 'null', 'none', 'missing', '-', '--', '

In [44]:
# 1.3.2 Set Random Seeds (Config-Driven)

seeds = C("SEEDS", {"numpy": 42, "python": 42})
np.random.seed(seeds.get("numpy", 42))
random.seed(seeds.get("python", 42))
print(f"\n🎲 1.3.2 Seeds set from config (if provided): {seeds}")



🎲 1.3.2 Seeds set from config (if provided): {'numpy': 42, 'python': 42}


In [45]:
# 1.3.3 Register Protected Columns & DQ Ranges

protected_columns = set(id_cols)
if target_name:
    protected_columns.add(target_name)
print(f"\n🛡️ 1.3.3 Protected columns (not to be mutated): {sorted(protected_columns)}")


# Config validation report

checks = []

def add_check(name, ok, value=None, category="general", required=True, note=""):
    checks.append(
        {
            "check": name,
            "category": category,
            "required": required,
            "ok": bool(ok),
            "value": str(value),
            "note": note,
        }
    )

add_check(
    "PATHS.RAW_DATA",
    RAW_DATA.exists(),
    value=RAW_DATA,
    category="paths",
    note="file should exist",
)
add_check(
    "PATHS.PROCESSED",
    PROCESSED_DIR.exists(),
    value=PROCESSED_DIR,
    category="paths",
    note="directory should exist (auto-created if missing)",
)
add_check(
    "PATHS.ARTIFACTS",
    ARTIFACTS_DIR.exists(),
    value=ARTIFACTS_DIR,
    category="paths",
    note="directory should exist (auto-created if missing)",
)
add_check(
    "PATHS.REPORTS",
    REPORTS_DIR.exists(),
    value=REPORTS_DIR,
    category="paths",
    note="directory should exist (auto-created if missing)",
)

add_check(
    "TARGET.COLUMN",
    target_name is not None,
    value=target_name,
    category="target",
    note="numeric flag used across Sections 2-3",
)
add_check(
    "TARGET.RAW_COLUMN",
    raw_target is not None,
    value=raw_target,
    category="target",
    note="raw label column before encoding",
)
add_check(
    "ID_COLUMNS",
    len(id_cols) > 0,
    value=id_cols,
    category="ids",
    note="at least one primary identifier expected",
)

add_check(
    "RANGES",
    bool(ranges),
    value=list(ranges.keys()),
    category="ranges",
    required=False,
    note="optional but recommended for Telco numeric fields",
)
add_check(
    "DATA_QUALITY",
    bool(dq_opts),
    value=list(dq_opts.keys()),
    category="data_quality",
    required=False,
    note="optional thresholds for Section 2 checks",
)

config_validation_df = pd.DataFrame(checks)
display(config_validation_df)

config_validation_path = REPORTS_DIR / "config_validation_report.csv"
config_validation_df.to_csv(config_validation_path, index=False)

print(f"\n✅ 1.3.3 Config validation report saved → {config_validation_path}")



🛡️ 1.3.3 Protected columns (not to be mutated): ['Churn_flag', 'customerID']


| | check | category | required | ok | value | note |
|---|---|---|---|---|---|---|
| 0 | PATHS.RAW_DATA | paths | True | True | /Users/b/DATA/PROJECTS/Telco/data/_raw | file should exist |
| 1 | PATHS.PROCESSED | paths | True | True | /Users/b/DATA/PROJECTS/Telco/data/processed | directory should exist (auto-created if missing) |
| 2 | PATHS.ARTIFACTS | paths | True | True | /Users/b/DATA/PROJECTS/Telco/Level_3/resources... | directory should exist (auto-created if missing) |
| 3 | PATHS.REPORTS | paths | True | True | /Users/b/DATA/PROJECTS/Telco/Level_3/resources... | directory should exist (auto-created if missing) |
| 4 | TARGET.COLUMN | target | True | True | Churn_flag | numeric flag used across Sections 2-3 |
| 5 | TARGET.RAW_COLUMN | target | True | True | Churn | raw label column before encoding |
| 6 | ID_COLUMNS | ids | True | True | ['customerID'] | at least one primary identifier expected |
| 7 | RANGES | ranges | False | True | ['tenure', 'MonthlyCharges', 'TotalCharges'] | optional but recommended for Telco numeric fields |
| 8 | DATA_QUALITY | data_quality | False | True | ['NUMERIC_LIKE_THRESHOLD', 'RARE_PCT_THRESHOLD... | optional thresholds for Section 2 checks |



✅ 1.3.3 Config validation report saved → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/config_validation_report.csv


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.4.1-1.4.3 | PART E 📊 Dataset Load & Version Verification
</summary>

> **GOAL:** Load the raw dataset, fingerprint the file, and log which version was loaded.

---

>### **1.4.1 Load Raw Dataset (CSV or Parquet)**
- Reads from `CONFIG["PATHS"]["RAW_DATA"]`.

>### **1.4.2 Hash / Snapshot Validation**
- Compare the dataset file's hash (SHA-256 by default) against the version registry.

>### **1.4.3 Schema Version Logging**
- Record dataset version, load timestamp, row/column count.
  - └── Output: `dataset_load_log.csv`
- 🟩 *Hiring value:* Shows you understand data lineage and think about governance.

---

**Options:**
1. Download the dataset manually from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
2. Use the Kaggle API via `kagglehub`:

```py
%pip install kagglehub
import os
from pathlib import Path
import kagglehub

# Point the KaggleHub cache at the current working directory
os.environ["KAGGLEHUB_CACHE"] = str(Path.cwd())

# Download the dataset; dataset_download returns the local path to the files
path = kagglehub.dataset_download("blastchar/telco-customer-churn")

print("📂 Dataset downloaded to:", path)
```

</details>

In [None]:
# 1.4.1 Load Raw Dataset (CSV: config-driven via READ_OPTS / Parquet: pandas defaults)
print("1.4 🧩 DATASET LOAD & VERSION VERIFICATION")
print(f"1.4.1) 📥 Loading raw dataset from: {RAW_DATA}")
#FIXME: 1113_1013

if not RAW_DATA.exists():
    raise FileNotFoundError(f"❌ RAW_DATA file not found at {RAW_DATA}")

# Pull CSV read options from config (READ_OPTS is optional)
READ_OPTS = C("READ_OPTS", {}) or {}

suffix = RAW_DATA.suffix.lower()
if suffix == ".csv":
    print(f"   Using READ_OPTS from config: {READ_OPTS}")
    df_raw = pd.read_csv(RAW_DATA, **READ_OPTS)
elif suffix in {".parquet", ".pq"}:
    df_raw = pd.read_parquet(RAW_DATA)
else:
    raise ValueError(f"❌ Unsupported file format for RAW_DATA: {suffix}")

n_rows, n_cols = df_raw.shape
print(f"✅ Loaded dataset: {n_rows:,} rows × {n_cols:,} columns")


1.4 🧩 DATASET LOAD & VERSION VERIFICATION
1.4.1) 📥 Loading raw dataset from: /Users/b/DATA/PROJECTS/Telco/data/_raw


ValueError: ❌ Unsupported file format for RAW_DATA: 
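
The `ValueError` above reports an empty suffix, which is what `Path.suffix` returns for a path like `.../data/_raw`; the configured `RAW_DATA` appears to point at a directory rather than a file. One possible guard, sketched under assumptions (`resolve_data_file` and the `telco.csv` file name are hypothetical):

```python
from pathlib import Path
import tempfile

SUPPORTED = {".csv", ".parquet", ".pq"}


def resolve_data_file(path: Path) -> Path:
    """Return `path` if it is a file; if it is a directory, its single supported data file."""
    if path.is_file():
        return path
    if path.is_dir():
        candidates = sorted(
            p for p in path.iterdir() if p.is_file() and p.suffix.lower() in SUPPORTED
        )
        if len(candidates) == 1:
            return candidates[0]
        names = [c.name for c in candidates]
        raise ValueError(f"Expected exactly one data file in {path}, found {len(names)}: {names}")
    raise FileNotFoundError(f"RAW_DATA not found at {path}")


# Usage against a throwaway directory holding one CSV and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "_raw"
    raw_dir.mkdir()
    (raw_dir / "telco.csv").write_text("customerID,Churn\n", encoding="utf-8")
    (raw_dir / "README.txt").write_text("notes", encoding="utf-8")
    resolved_name = resolve_data_file(raw_dir).name
```

The alternative is to make `PATHS.RAW_DATA` in the YAML point at the file itself, which keeps the loader simpler.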

In [None]:
# 1.4.2 Hash / Snapshot Validation
print("\n1.4.2) 🔐 Hash / snapshot validation")

def compute_hash(path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Streaming file hash to avoid loading whole file into memory."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()

file_hash = compute_hash(RAW_DATA, "sha256")
# Timezone-aware UTC timestamp; datetime.utcnow() is deprecated as of Python 3.12
from datetime import datetime, timezone
now_utc = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

load_log_path = ARTIFACTS_DIR / "dataset_load_log.csv"

# Ensure artifacts directory exists so atomic writes (tmp + replace) don't fail
ARTIFACTS_DIR.mkdir(parents=True, exist_ok=True)

REGISTRY_DIR = ARTIFACTS_DIR / "registry"
REGISTRY_DIR.mkdir(parents=True, exist_ok=True)

registry_path = REGISTRY_DIR / "dataset_version_registry.csv"

# Load the existing dataset version registry (file_hash → version_id),
# or start a new one if this is the first run
if registry_path.exists():
    registry = pd.read_csv(registry_path)
else:
    registry = pd.DataFrame(
        columns=[
            "version_id",
            "dataset_path",
            "file_hash",
            "first_seen_utc",
            "last_seen_utc",
            "n_rows",
            "n_cols",
        ]
    )

existing = registry.loc[registry["file_hash"] == file_hash]

if not existing.empty:
    version_id = str(existing.iloc[0]["version_id"])
    registry.loc[registry["file_hash"] == file_hash, "last_seen_utc"] = now_utc
    print(f"ℹ️ Existing dataset version found → version_id={version_id}")
else:
    if registry.empty:
        next_id = 1
    else:
        max_existing = pd.to_numeric(registry["version_id"], errors="coerce").fillna(0).max()
        next_id = int(max_existing) + 1

    version_id = str(next_id)
    new_row = pd.DataFrame(
        {
            "version_id": [version_id],
            "dataset_path": [str(RAW_DATA)],
            "file_hash": [file_hash],
            "first_seen_utc": [now_utc],
            "last_seen_utc": [now_utc],
            "n_rows": [n_rows],
            "n_cols": [n_cols],
        }
    )
    registry = pd.concat([registry, new_row], ignore_index=True)
    print(f"🆕 New dataset version registered → version_id={version_id}")

tmp_reg = registry_path.with_suffix(".tmp")
registry.to_csv(tmp_reg, index=False)
os.replace(tmp_reg, registry_path)
print(f"✅ Version registry updated → {registry_path}")



1.4.2) 🔐 Hash / snapshot validation
ℹ️ Existing dataset version found → version_id=1
✅ Version registry updated → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts/registry/dataset_version_registry.csv
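
As a side note on the hashing step: Python 3.11 added `hashlib.file_digest`, which does the chunked read for you. The sketch below is an assumption about how `compute_hash` could use it with a fallback for older interpreters; the sample file is made up so the result can be checked against an in-memory hash.

```python
import hashlib
import tempfile
from pathlib import Path


def compute_hash(path: Path, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Streaming file hash; uses hashlib.file_digest when the interpreter provides it."""
    with path.open("rb") as f:
        if hasattr(hashlib, "file_digest"):  # Python 3.11+
            return hashlib.file_digest(f, algo).hexdigest()
        h = hashlib.new(algo)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
        return h.hexdigest()


# Sanity check: the streamed hash matches hashing the same bytes in memory.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    payload = b"customerID,Churn\n0001,No\n"
    sample.write_bytes(payload)
    streamed = compute_hash(sample)
    in_memory = hashlib.sha256(payload).hexdigest()
```

Because the registry is keyed on the file hash, any byte-level change to the raw file (including line endings) registers as a new version.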


In [None]:
# 1.4.3 Schema Version Logging

print("\n1.4.3) 🧾 Appending load metadata to dataset_load_log.csv")

log_row = pd.DataFrame(
    {
        "timestamp_utc": [now_utc],
        "dataset_path": [str(RAW_DATA)],
        "version_id": [version_id],
        "file_hash": [file_hash],
        "n_rows": [n_rows],
        "n_cols": [n_cols],
        "read_opts": [str(READ_OPTS)],
    }
)

if load_log_path.exists():
    existing_log = pd.read_csv(load_log_path)
    all_cols = pd.Index(existing_log.columns).union(log_row.columns)
    out_log = pd.concat(
        [existing_log.reindex(columns=all_cols), log_row.reindex(columns=all_cols)],
        ignore_index=True,
    )
else:
    out_log = log_row

tmp_log = load_log_path.with_suffix(".tmp")
out_log.to_csv(tmp_log, index=False)
os.replace(tmp_log, load_log_path)

print(f"✅ Load log updated → {load_log_path}")
print(f"📊 Dataset hash: {file_hash[:12]}... | version_id: {version_id}")

# df_raw is your canonical raw DataFrame for 1.6+



1.4.3) 🧾 Appending load metadata to dataset_load_log.csv
✅ Load log updated → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts/dataset_load_log.csv
📊 Dataset hash: 88be4b93fbe0... | version_id: 1


In [None]:
# 1.4.4 Defensive Copy for Processing
df = df_raw.copy(deep=True)
print("\n✅ Created deep working copy 'df' from 'df_raw'")
print(f"   Memory usage: {df.memory_usage(deep=True).sum() / 1e6:.2f} MB")
print(f"   Shape: {df.shape[0]:,} rows × {df.shape[1]:,} columns")



✅ Created deep working copy 'df' from 'df_raw'
   Memory usage: 6.82 MB
   Shape: 7,043 rows × 21 columns


In [None]:
# 1.4.5 metadata snapshot

raw_info_path = ARTIFACTS_DIR / "df_raw_info.txt"
working_info_path = ARTIFACTS_DIR / "df_working_info.txt"

with raw_info_path.open("w", encoding="utf-8") as f:
    df_raw.info(buf=f)

with working_info_path.open("w", encoding="utf-8") as f:
    df.info(buf=f)

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.5.1-1.5.3 | PART A 🔍 Quick Structural Validation (Pre-DQ Sanity)
</summary>

**GOAL:** Confirm the loaded DataFrame is structurally sound before the data-quality checks run.


>### **1.5.1 Verify Non-Empty Dataset**
- Assert that `df` exists and that `df.shape` has:
    - `n_rows > 0`
    - `n_cols > 0`
- If either is zero, raise a clear error and stop Section 1.

Ensures downstream DQ logic never runs on an empty or malformed dataset.
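
A minimal sketch of that check, taking the shape tuple directly so it runs without pandas. `assert_non_empty` is a hypothetical helper name. It also shows why a single tuple comparison such as `df.shape > (0, 0)` is not a safe substitute: tuples compare lexicographically.

```python
def assert_non_empty(shape):
    """Raise ValueError unless both the row count and the column count are positive."""
    n_rows, n_cols = shape
    if n_rows <= 0 or n_cols <= 0:
        raise ValueError(f"Dataset is empty or malformed: {n_rows} rows x {n_cols} columns")
    return n_rows, n_cols


# Lexicographic tuple comparison lets a zero-column frame through:
tuple_check_passes = (5, 0) > (0, 0)  # True, although there are no columns

try:
    assert_non_empty((5, 0))
    explicit_check_passes = True
except ValueError:
    explicit_check_passes = False

# The shape reported for this dataset earlier in the notebook passes.
ok_shape = assert_non_empty((7043, 21))
```

In the notebook the call would be `assert_non_empty(df.shape)`.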

---

>### **1.5.2 Basic Column-Type Introspection**

**Preview:**
- `df.dtypes`
- non-null / null counts per column
- number of unique values per column

**Save a lightweight schema snapshot for auditability.**
- **Output:** `initial_schema_summary.csv` under `REPORTS_DIR`.
- This becomes the baseline view for later type/coercion decisions.

---

>### **1.5.3 Target Variable Registration**

**Normalize the raw churn label:**
- Normalize raw labels to canonical text (“Yes”/“No”, or config-driven labels), then map them to 1/0 in a numeric `Churn_flag`.

**Ensure:**
- `Churn_flag` exists
- dtype is an efficient integer type (e.g. `Int8`)
- the encoding is consistent for modeling (Section 3).

This guarantees the target is consistent and model-ready for Section 3.
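
A sketch of the label mapping in plain Python, assuming the positive and negative labels arrive as parameters (the config already suggests a `TARGET.POSITIVE_CLASS` key; the negative-label parameter here is hypothetical). The raw values are made-up demo data.

```python
def encode_target(values, positive="Yes", negative="No"):
    """Map raw labels to 1/0; anything unrecognized becomes None so it can be audited."""
    mapping = {positive.lower(): 1, negative.lower(): 0}
    encoded = []
    for v in values:
        key = v.strip().lower() if isinstance(v, str) else None
        encoded.append(mapping.get(key))
    return encoded


# Made-up raw labels, including messy casing, whitespace, and unknown values.
raw_churn = ["Yes", "No", " yes ", "NO", None, "maybe"]
churn_flag = encode_target(raw_churn)
unmapped = [v for v, flag in zip(raw_churn, churn_flag) if flag is None]
```

In the notebook, the pandas equivalent would likely normalize with `Series.str.strip().str.lower()`, apply the same dict with `Series.map`, and cast with `astype("Int8")` so missing labels stay as `<NA>` instead of forcing a float column.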

**Section Output:** `initial_schema_summary.csv` (plus target sanity prints)

🟩 Hiring value: validating structure before deeper analysis is a quality-first, production-minded habit.

---


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:2px solid #e74f29ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
TODO: Short answer: **B is the better 1.6.2** for your Telco “hiring-ready” story
</summary>

It’s section-aware, config-aware, has emojis/type_groups, exports CSVs, and carries metadata.

But **A has really useful per-column metrics** (non_null, nulls, null_pct, n_unique) and a nice simple `initial_schema_df` that you *don’t* want to lose.

So the move is:

> ✅ Keep **B as the canonical 1.6.2**,
> ✅ **Merge A’s metrics and `initial_schema_df` into B**,
> ❌ Delete the old A block and the duplicate “initial_schema summary” block at the bottom.

---

## What’s good in each

### A (simple 1.6.2)

**Strengths:**

* Straightforward per-column stats:

  * `non_null`, `nulls`, `null_pct`, `n_unique`
* A clean `initial_schema_df` that’s perfect for:

  * sanity checks
  * downstream comparisons (before/after dtype changes)
* Simple, readable `display(initial_schema_df.head(10))`

**Weaknesses:**

* No section metadata (`SECTION_ID`, `run_ts`, etc.).
* No “type_group” (numeric / string_like / etc.).
* No export to a canonical CSV path.
* No emoji/color groupings.

---

### B (Telco-ready 1.6.2)

**Strengths:**

* Section metadata: `SECTION_ID`, `SECTION_NAME`, `run_ts`, `n_rows`, `n_cols`.
* `type_group` (numeric/boolean/datetime/categorical/string_like).
* Emoji + background color per type.
* Wide “one row per dtype” view with columns horizontally.
* Writes a **schema CSV** for the pipeline.

**Weaknesses:**

* It **does not compute** `non_null`, `nulls`, `null_pct`, `n_unique`.
* It references `initial_schema_df` later, but never actually builds it (it still depends on A).
* It re-resolves `PROJECT_ROOT` even though you already have it from 1.2 (not wrong, just redundant for this notebook).
* You have **two different blocks** that try to persist `initial_schema_summary.csv` (one via `REPORTS_DIR`, one via `SECTION1_REPORT_DIR`).

---

## Suggested refactor: “B + A’s brains”

Here’s the structural plan:

1. **Use B as the base** for 1.6.2.
2. While building `schema_df`, **add A’s metrics**:

   * `non_null`, `nulls`, `null_pct`, `n_unique`
3. Build `initial_schema_df` from `schema_df`.
4. Use **one consistent report directory**:

   ```python
   SECTION1_REPORT_DIR = REPORTS_DIR / "section1"
   ```

   (since `REPORTS_DIR` is already config-driven).
5. Save:

   * `section1_6_2_schema.csv` (full, with type_group & metadata)
   * `initial_schema_summary.csv` (column-level metrics table)
6. Delete:

   * The old A block.
   * The duplicate `initial_schema_df` persistence block at the bottom.

---

## Example merged 1.6.2 (sketch)

You don’t have to paste this verbatim, but this is the shape I’d aim for:

```python
# 1.6.2 🧱 Basic column-type introspection (merged A + B)

print("\n1.6.2) 🧱 Basic column-type introspection")

SECTION_ID   = "1.6.2"
SECTION_NAME = "Basic column-type introspection"

# n_rows / n_cols from df
n_rows, n_cols = df.shape
run_ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Use config-driven REPORTS_DIR from earlier and create section1 subfolder
SECTION1_REPORT_DIR = REPORTS_DIR / "section1"
SECTION1_REPORT_DIR.mkdir(parents=True, exist_ok=True)

CSV_PATH = SECTION1_REPORT_DIR / "section1_6_2_schema.csv"

# 1) Helper (define once somewhere global if you prefer)
def classify_dtype(dtype_str: str) -> str:
    s = dtype_str.lower()
    if "int" in s or "float" in s:
        return "numeric"
    if "bool" in s:
        return "boolean"
    if "datetime" in s or "date" in s:
        return "datetime"
    if "category" in s:
        return "categorical"
    return "string_like"

TYPE_EMOJI = {
    "numeric":      "🔢",
    "boolean":      "🔘",
    "datetime":     "⏱️",
    "categorical":  "🏷️",
    "string_like":  "🔤",
}
TYPE_COLOR = {
    "numeric":      "#c7ffd4",
    "boolean":      "#e4ffe9",
    "datetime":     "#ffd6f6",
    "categorical":  "#fff7ae",
    "string_like":  "#ffe4d1",
}

# 2) Build long schema table with BOTH A + B info
rows = []
for col in df.columns:
    s = df[col]
    dtype_str = str(s.dtype)
    type_group = classify_dtype(dtype_str)

    non_null = int(s.notna().sum())
    nulls = int(s.isna().sum())
    n_unique = int(s.nunique(dropna=True))
    null_pct = round(nulls / n_rows * 100, 3) if n_rows else 0.0

    rows.append(
        {
            "column":      col,
            "dtype":       dtype_str,
            "type_group":  type_group,
            "emoji":       TYPE_EMOJI.get(type_group, "❓"),
            "bg_color":    TYPE_COLOR.get(type_group, "#f2f2f2"),
            "non_null":    non_null,
            "nulls":       nulls,
            "null_pct":    null_pct,
            "n_unique":    n_unique,
            "section_id":   SECTION_ID,
            "section_name": SECTION_NAME,
            "run_ts":       run_ts,
            "n_rows":       n_rows,
            "n_cols":       n_cols,
        }
    )

schema_df = (
    pd.DataFrame(rows)
    .sort_values(["type_group", "dtype", "column"])
    .reset_index(drop=True)
)

print("\n📋 dtypes:")
print(df.dtypes)

print("\n📊 Column-level summary (head):")
display(schema_df[["column", "dtype", "type_group", "non_null", "nulls", "null_pct", "n_unique"]].head(10))

# 3) Wide view: one row per dtype, columns horizontally (from B)
grouped = {}
for _, r in schema_df.iterrows():
    key = (r["dtype"], r["type_group"], r["emoji"], r["bg_color"])
    grouped.setdefault(key, []).append(r["column"])

rows_wide = []
for (dtype_str, type_group, emoji, bg_color), cols in grouped.items():
    row = {
        "dtype":      dtype_str,
        "type_group": type_group,
        "emoji":      emoji,
        "bg_color":   bg_color,
    }
    for i, colname in enumerate(cols, start=1):
        row[f"col_{i}"] = colname
    rows_wide.append(row)

wide_df = (
    pd.DataFrame(rows_wide)
    .sort_values(["type_group", "dtype"])
    .reset_index(drop=True)
    .fillna("")
)

def row_bg_colors(row):
    color = row["bg_color"]
    return [f"background-color: {color}" for _ in row]

styled_wide = (
    wide_df
    .style
    .apply(row_bg_colors, axis=1)
    .hide(axis="columns", subset=["bg_color"])
)
display(styled_wide)

# 4) Export for pipeline
export_cols = [
    "column",
    "dtype",
    "type_group",
    "emoji",
    "section_id",
    "section_name",
    "run_ts",
    "n_rows",
    "n_cols",
    "non_null",
    "nulls",
    "null_pct",
    "n_unique",
]

schema_df[export_cols].to_csv(CSV_PATH, index=False)
print(f"\n💾 Exported schema summary to:\n   {CSV_PATH}")
print(f"   (columns={n_cols}, rows={n_rows}, run_ts={run_ts}, section={SECTION_ID})")

# 5) initial_schema_df for downstream summaries (what A was doing)
initial_schema_df = schema_df[
    ["column", "dtype", "type_group", "non_null", "nulls", "null_pct", "n_unique"]
].copy()

initial_schema_df.insert(0, "n_rows", n_rows)
initial_schema_df.insert(1, "n_cols", n_cols)

initial_schema_path = SECTION1_REPORT_DIR / "initial_schema_summary.csv"
initial_schema_df.to_csv(initial_schema_path, index=False)

print(f"\n✅ 1.6 initial schema summary saved → {initial_schema_path}")
```

Then you can safely:

* **Delete** the earlier simple A version.
* **Delete** the second duplicate `initial_schema_df` block that uses `SECTION1_REPORT_DIR` separately.
* Keep 1.6.3+ as-is (they will now see a properly defined `initial_schema_df` if they care, and the pipeline will have both CSVs).

If you want, next we can go through **1.6.5** and make sure it leans on the *same* `classify_dtype` + type_group logic (single source of truth for type semantics across Section 1).
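
As a starting point for that single source of truth, here is a minimal sketch of a shared `classify_dtype` that asks pandas about the dtype instead of matching substrings of its name. Substring matching has edge cases (an `interval[...]` dtype contains `int`, so it would be grouped as numeric). Taking a dtype object rather than a string, the check order, and the `demo` frame are assumptions made for illustration; the five group names are the ones used above.

```python
import pandas as pd
from pandas.api import types as ptypes


def classify_dtype(dtype) -> str:
    """Map a pandas dtype to one of the five type groups used in Section 1."""
    # Categorical first, so the answer never depends on what the categories hold.
    if isinstance(dtype, pd.CategoricalDtype):
        return "categorical"
    # Bool before numeric: pandas treats bool as a numeric dtype.
    if ptypes.is_bool_dtype(dtype):
        return "boolean"
    if ptypes.is_numeric_dtype(dtype):
        return "numeric"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "datetime"
    # Everything else (object, string, timedelta, interval, ...) stays string_like.
    return "string_like"


# Hypothetical demo frame with one column per type group.
demo = pd.DataFrame(
    {
        "tenure": [1, 34, 2],
        "MonthlyCharges": [29.85, 56.95, 53.85],
        "is_active": [True, False, True],
        "signup": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
        "Contract": pd.Categorical(["Month-to-month", "One year", "Month-to-month"]),
        "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    }
)
type_groups = {col: classify_dtype(demo[col].dtype) for col in demo.columns}
print(type_groups)
```

If you adopt something like this, call it as `classify_dtype(s.dtype)` rather than `classify_dtype(str(s.dtype))`, and define it once so 1.6.2 and 1.6.5 cannot drift apart.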


In [None]:
# 1.5.1 Verify Non-Empty Dataset

RAISE_ON_CRITICAL = bool(C("FLAGS.RAISE_ON_CRITICAL", True))

print("\n1.5.1) ✅ Verify non-empty dataset")

if "df" not in globals():
    raise NameError("❌ Working DataFrame 'df' is not defined. Run Section 1.4/1.5 first.")

n_rows, n_cols = df.shape
print(f"   Shape: {n_rows:,} rows × {n_cols:,} columns")

if n_rows == 0 or n_cols == 0:
    msg = f"❌ Dataset is empty: shape={df.shape}"
    if RAISE_ON_CRITICAL:
        raise ValueError(msg)
    else:
        print("⚠️ " + msg)
else:
    print("✅ Dataset is non-empty and ready for structural checks.")



1.5.1) ✅ Verify non-empty dataset
   Shape: 7,043 rows × 21 columns
✅ Dataset is non-empty and ready for structural checks.


In [None]:
# 1.5.2 Basic column-type introspection (merged A + B)

print("\n1.5.2) 🧱 Basic column-type introspection")

SECTION_ID   = "1.5.2"
SECTION_NAME = "Basic column-type introspection"

# 0) quick dtypes preview
print("\n📋 dtypes:")
print(df.dtypes)

# 1) Env & metadata | n_rows / n_cols from df
n_rows, n_cols = df.shape
run_ts = datetime.now().strftime("%Y-%m-%d %H:%M:%S")

# Use config-driven REPORTS_DIR from earlier and create section1 subfolder
SECTION1_REPORT_DIR = REPORTS_DIR / "section1"
SECTION1_REPORT_DIR.mkdir(parents=True, exist_ok=True)

CSV_PATH = SECTION1_REPORT_DIR / "section1_5_2_schema.csv"

# 2) Helper (define once somewhere global if you prefer)
def classify_dtype(dtype_str: str) -> str:
    s = dtype_str.lower()
    if "int" in s or "float" in s:
        return "numeric"
    if "bool" in s:
        return "boolean"
    if "datetime" in s or "date" in s:
        return "datetime"
    if "category" in s:
        return "categorical"
    return "string_like"

TYPE_EMOJI = {
    "numeric":      "🔢",
    "boolean":      "🔘",
    "datetime":     "⏱️",
    "categorical":  "🏷️",
    "string_like":  "🔤",
}
TYPE_COLOR = {
    "numeric":      "#c7ffd4",
    "boolean":      "#e4ffe9",
    "datetime":     "#ffd6f6",
    "categorical":  "#fff7ae",
    "string_like":  "#ffe4d1",
}

# 3) Build long schema table (counts + type grouping, BOTH A + B info)
rows = []
for col in df.columns:
    s = df[col]
    dtype_str = str(s.dtype)
    type_group = classify_dtype(dtype_str)

    non_null = int(s.notna().sum())
    nulls = int(s.isna().sum())
    n_unique = int(s.nunique(dropna=True))
    null_pct = round(nulls / n_rows * 100, 3) if n_rows else 0.0

    rows.append(
        {
            "column":      col,
            "dtype":       dtype_str,
            "type_group":  type_group,
            "emoji":       TYPE_EMOJI.get(type_group, "❓"),
            "bg_color":    TYPE_COLOR.get(type_group, "#f2f2f2"),
            "non_null":    non_null,
            "nulls":       nulls,
            "null_pct":    null_pct,
            "n_unique":    n_unique,
            "section_id":   SECTION_ID,
            "section_name": SECTION_NAME,
            "run_ts":       run_ts,
            "n_rows":       n_rows,
            "n_cols":       n_cols,
        }
    )

schema_df = (
    pd.DataFrame(rows)
    .sort_values(["type_group", "dtype", "column"])
    .reset_index(drop=True)
)

print("\n📊 Column-level summary (head):")
display(schema_df[["column", "dtype", "type_group", "non_null", "nulls", "null_pct", "n_unique"]].head(20))

# 4) Wide view: one row per dtype, columns horizontally (from B)
grouped = {}
for _, r in schema_df.iterrows():
    key = (r["dtype"], r["type_group"], r["emoji"], r["bg_color"])
    grouped.setdefault(key, []).append(r["column"])

rows_wide = []
for (dtype_str, type_group, emoji, bg_color), cols in grouped.items():
    row = {
        "dtype":      dtype_str,
        "type_group": type_group,
        "emoji":      emoji,
        "bg_color":   bg_color,
    }
    for i, colname in enumerate(cols, start=1):
        row[f"col_{i}"] = colname
    rows_wide.append(row)

wide_df = (
    pd.DataFrame(rows_wide)
    .sort_values(["type_group", "dtype"])
    .reset_index(drop=True)
    .fillna("")
)

def row_bg_colors(row):
    color = row["bg_color"]
    return [f"background-color: {color}" for _ in row]

styled_wide = (
    wide_df
    .style
    .apply(row_bg_colors, axis=1)
    .hide(axis="columns", subset=["bg_color"])
)
display(styled_wide)

# 5) Export for pipeline
export_cols = [
    "column",
    "dtype",
    "type_group",
    "emoji",
    "section_id",
    "section_name",
    "run_ts",
    "n_rows",
    "n_cols",
    "non_null",
    "nulls",
    "null_pct",
    "n_unique",
]

schema_df[export_cols].to_csv(CSV_PATH, index=False)
print(f"\n💾 Exported schema summary to:\n   {CSV_PATH}")
print(f"   (columns={n_cols}, rows={n_rows}, run_ts={run_ts}, section={SECTION_ID})")

# 6) initial_schema_df for downstream summaries (what A was doing)
initial_schema_df = schema_df[
    ["column", "dtype", "type_group", "non_null", "nulls", "null_pct", "n_unique"]
].copy()

initial_schema_df.insert(0, "n_rows", n_rows)
initial_schema_df.insert(1, "n_cols", n_cols)

initial_schema_path = SECTION1_REPORT_DIR / "initial_schema_summary.csv"
initial_schema_df.to_csv(initial_schema_path, index=False)

print(f"\n✅ 1.5 initial schema summary saved → {initial_schema_path}")



1.5.2) 🧱 Basic column-type introspection

📋 dtypes:
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

📊 Column-level summary (head):

             column    dtype   type_group  non_null  nulls  null_pct  n_unique
0    MonthlyCharges  float64      numeric      7043      0       0.0      1585
1      TotalCharges  float64      numeric      7032     11     0.156      6530
2     SeniorCitizen    int64      numeric      7043      0       0.0         2
3            tenure    int64      numeric      7043      0       0.0        73
4             Churn   object  string_like      7043      0       0.0         2
5          Contract   object  string_like      7043      0       0.0         3
6        Dependents   object  string_like      7043      0       0.0         2
7  DeviceProtection   object  string_like      7043      0       0.0         3
8   InternetService   object  string_like      7043      0       0.0         3
9     MultipleLines   object  string_like      7043      0       0.0         3


Unnamed: 0,dtype,type_group,emoji,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,col_11,col_12,col_13,col_14,col_15,col_16,col_17
0,float64,numeric,🔢,MonthlyCharges,TotalCharges,,,,,,,,,,,,,,,
1,int64,numeric,🔢,SeniorCitizen,tenure,,,,,,,,,,,,,,,
2,object,string_like,🔤,Churn,Contract,Dependents,DeviceProtection,InternetService,MultipleLines,OnlineBackup,OnlineSecurity,PaperlessBilling,Partner,PaymentMethod,PhoneService,StreamingMovies,StreamingTV,TechSupport,customerID,gender



💾 Exported schema summary to:
   /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/section1/section1_5_2_schema.csv
   (columns=21, rows=7043, run_ts=2025-11-20 13:59:12, section=1.5.2)

✅ 1.5 initial schema summary saved → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/section1/initial_schema_summary.csv


In [None]:
# 1.5.3 Target Variable Registration
print("\n1.5.3) 🎯 Target variable registration & normalization")

# FIXME: where does this belong the best?
#  Ensure we have a section1 reports subfolder
SECTION1_REPORT_DIR = REPORTS_DIR / "section1"
SECTION1_REPORT_DIR.mkdir(parents=True, exist_ok=True)


TARGET_NAME = C("TARGET.COLUMN")
RAW_TARGET  = C("TARGET.RAW_COLUMN")
POS_CLASS   = C("TARGET.POSITIVE_CLASS", "Yes")
NEG_CLASS   = C("TARGET.NEGATIVE_CLASS", "No")

if RAW_TARGET is None:
    msg = "❌ TARGET.RAW_COLUMN not defined in config."
    if RAISE_ON_CRITICAL:
        raise KeyError(msg)
    else:
        print("⚠️ " + msg)
elif RAW_TARGET not in df.columns:
    msg = f"❌ Raw target column '{RAW_TARGET}' not found in df."
    if RAISE_ON_CRITICAL:
        raise KeyError(msg)
    else:
        print("⚠️ " + msg)
else:
    # normalize raw labels
    raw_series = df[RAW_TARGET].astype("string")
    norm = raw_series.str.strip().str.casefold()

    pos_norm = str(POS_CLASS).strip().casefold()
    neg_norm = str(NEG_CLASS).strip().casefold()

    # map normalized labels to canonical text
    text_map = {
        pos_norm: POS_CLASS,
        neg_norm: NEG_CLASS,
    }
    df[RAW_TARGET] = norm.map(text_map).fillna(raw_series)
    df[RAW_TARGET] = df[RAW_TARGET].astype("category")

    # create numeric flag (TARGET_NAME) = 1 for positive, 0 for negative
    if TARGET_NAME is None:
        TARGET_NAME = "Churn_flag"  # fallback name if config missing
        print(f"ℹ️ TARGET.COLUMN not set in config; using default '{TARGET_NAME}'")

    num_map = {
        neg_norm: 0,
        pos_norm: 1,
    }
    df[TARGET_NAME] = norm.map(num_map).astype("Int8")

    print(f"🎯 Target registered: RAW_TARGET='{RAW_TARGET}' → FLAG='{TARGET_NAME}'")
    print("\nChurn value counts (normalized labels):")
    print(df[RAW_TARGET].value_counts(dropna=False))

    print("\nTarget flag distribution (0=negative, 1=positive):")
    print(df[TARGET_NAME].value_counts(dropna=False))

    # quick sanity: unmapped rows
    unmapped_mask = df[TARGET_NAME].isna()
    if unmapped_mask.any():
        print("\n⚠️ Some rows did not map to 0/1. Normalized raw values (top 10):")
        print(norm[unmapped_mask].value_counts(dropna=False).head(10))


 # üìäüìä Build a tiny summary artifact for Section 1 / later 2.2.5 validation
    run_ts = datetime.utcnow().isoformat(timespec="seconds") + "Z"
    tgt_summary_rows = []

    vc_raw = df[RAW_TARGET].value_counts(dropna=False)
    vc_flag = df[TARGET_NAME].value_counts(dropna=False)

    for label, count in vc_raw.items():
        tgt_summary_rows.append(
            {
                "column":          RAW_TARGET,
                "kind":            "raw_label",
                "value":           str(label),
                "count":           int(count),
                "pct":             float(count) / len(df) if len(df) else 0.0,
                "target_name":     TARGET_NAME,
                "timestamp_utc":   run_ts,
            }
        )

    for flag_val, count in vc_flag.items():
        tgt_summary_rows.append(
            {
                "column":          TARGET_NAME,
                "kind":            "flag",
                "value":           str(flag_val),
                "count":           int(count),
                "pct":             float(count) / len(df) if len(df) else 0.0,
                "target_name":     TARGET_NAME,
                "timestamp_utc":   run_ts,
            }
        )

    target_summary_df = pd.DataFrame(tgt_summary_rows)

    # üíæüíæ Persist summary to Section 1 reports for downstream checks (2.2.5)
    target_summary_path = SECTION1_REPORT_DIR / "target_field_summary.csv"
    tmp_path = target_summary_path.with_suffix(".tmp.csv")

    try:
        # Atomic write: temp file first, then replace the final path
        target_summary_df.to_csv(tmp_path, index=False)
        os.replace(tmp_path, target_summary_path)
        print(f"\n🧾 Target field summary written → {target_summary_path}")
    except Exception as e:
        # Best-effort cleanup of the temp file
        if tmp_path.exists():
            try:
                tmp_path.unlink()
            except Exception:
                pass
        print(f"⚠️ Could not write target_field_summary.csv: {e}")


1.5.3) 🎯 Target variable registration & normalization
🎯 Target registered: RAW_TARGET='Churn' → FLAG='Churn_flag'

Churn value counts (normalized labels):
Churn
No     5174
Yes    1869
Name: count, dtype: int64

Target flag distribution (0=negative, 1=positive):
Churn_flag
0    5174
1    1869
Name: count, dtype: Int64

🧾 Target field summary written → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/section1/target_field_summary.csv


In [None]:
# 1.5.4 💾 Persist initial schema summary (single source of truth)

print("\n1.5.4) 💾 Persist initial schema summary")

# Ensure we have a section1 reports subfolder
SECTION1_REPORT_DIR = REPORTS_DIR / "section1"
SECTION1_REPORT_DIR.mkdir(parents=True, exist_ok=True)

# Add dataset-shape columns if not already present
if "n_rows" not in initial_schema_df.columns:
    initial_schema_df.insert(0, "n_rows", n_rows)
if "n_cols" not in initial_schema_df.columns:
    initial_schema_df.insert(1, "n_cols", n_cols)

initial_schema_path = SECTION1_REPORT_DIR / "initial_schema_summary.csv"
tmp_path = initial_schema_path.with_suffix(".tmp.csv")

try:
    # Atomic write: temp → replace
    initial_schema_df.to_csv(tmp_path, index=False)
    os.replace(tmp_path, initial_schema_path)
    print(f"✅ 1.5 initial schema summary saved → {initial_schema_path}")
except Exception as e:
    # Best-effort cleanup
    if tmp_path.exists():
        try:
            tmp_path.unlink()
        except Exception:
            pass
    print(f"⚠️ Could not write initial_schema_summary.csv: {e}")




1.5.4) 💾 Persist initial schema summary
✅ 1.5 initial schema summary saved → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/section1/initial_schema_summary.csv
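
The same temp-file-then-replace block appears in 1.5.3, 1.5.4, and 1.5.5, so it is a natural candidate for a small helper. The sketch below is one possible shape, not something that exists in the notebook: the name `atomic_to_csv`, the boolean return value, and the choice to catch only `OSError` (the cells above catch `Exception`) are all assumptions.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def atomic_to_csv(frame: pd.DataFrame, path: Path) -> bool:
    """Write `frame` to `path` through a temp file; return True on success."""
    path = Path(path)
    # Same naming as the cells above: foo.csv -> foo.tmp.csv in the same folder.
    tmp_path = path.with_suffix(".tmp.csv")
    try:
        frame.to_csv(tmp_path, index=False)
        # os.replace swaps the finished file in, so readers never see a partial CSV.
        os.replace(tmp_path, path)
        return True
    except OSError as exc:
        # Best-effort cleanup of the temp file, then report instead of raising.
        tmp_path.unlink(missing_ok=True)
        print(f"Could not write {path.name}: {exc}")
        return False


# Usage sketch in a throwaway directory (stand-in for SECTION1_REPORT_DIR).
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "initial_schema_summary.csv"
    sample = pd.DataFrame({"column": ["tenure", "Churn"], "dtype": ["int64", "object"]})
    ok = atomic_to_csv(sample, target)
    written = pd.read_csv(target)
    leftover_tmp = target.with_suffix(".tmp.csv").exists()
```

Keeping the temp file in the same directory as the target matters: `os.replace` is only atomic when both paths are on the same filesystem.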


In [None]:
# 1.5.5) 🧬 Dtype snapshot (read-only, pre–Data Quality)

print("\n1.5.5) 🧬 Dtype snapshot (pre–Data Quality, no coercion)")

DTYPE_DIR = ARTIFACTS_DIR / "dtypes"
DTYPE_DIR.mkdir(parents=True, exist_ok=True)

dtype_snapshot = (
    df.dtypes.astype(str)
      .rename("dtype")
      .reset_index()
      .rename(columns={"index": "column"})
)

dtype_snapshot["n_unique"] = [df[c].nunique(dropna=True) for c in df.columns]
dtype_snapshot["null_pct"] = [df[c].isna().mean() * 100 for c in df.columns]
dtype_snapshot["snapshot_ts_utc"] = datetime.utcnow().isoformat(timespec="seconds") + "Z"

before_path = DTYPE_DIR / "dtypes_pre_section2.csv"
tmp_path = before_path.with_suffix(".tmp.csv")

try:
    # Atomic write: temp → replace
    dtype_snapshot.to_csv(tmp_path, index=False)
    os.replace(tmp_path, before_path)
    print(f"   💾 Saved pre–Section 2 dtype snapshot → {before_path}")
except Exception as e:
    # Best-effort cleanup
    if tmp_path.exists():
        try:
            tmp_path.unlink()
        except Exception:
            pass
    print(f"   ⚠️ Could not write dtype snapshot: {e}")



1.5.5) 🧬 Dtype snapshot (pre–Data Quality, no coercion)
   💾 Saved pre–Section 2 dtype snapshot → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts/dtypes/dtypes_pre_section2.csv


In [None]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:5px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
NEW! 1.6.5 🏷️ Categorical Registration & Manifest
</summary>

This block assumes you’ve run everything you pasted above (including the `schema_df` / `initial_schema_df` / `dtype_before` logic from earlier 1.6.x).

> #### Drop this **right after 1.6.4 (dtypes snapshot)** and before you enter Section 2.

---

### 1.6.5) 🏷️ Categorical registration & manifest

### How this answers your earlier dtype questions

* **“object / string_like 🔤 Contract, Dependents, DeviceProtection, … Aren’t these categorical?”**
  Yep. The Telco-specific `DEFAULT_BINARY_CANDIDATES` set targets the “Yes / No / No service” style columns and forces them to `category`; the remaining low-cardinality columns (such as `Contract`) are picked up by the small-cardinality heuristic.

* **“Should I be converting all those columns to categorical?”**
  Generally: yes for things like `Partner`, `Dependents`, `PhoneService`, etc. They’re discrete labels and perfect for `category`. The block makes it **config-first**, but also has heuristics so you don’t have to hand-curate everything.

* **“Which dtype is best for these?”**
  For **binary / low-cardinality labels** in modeling pipelines:

  * `category` for the raw label column (`Yes`/`No`/`No internet service`)
  * Optionally **separate numeric flags** when needed (like your `Churn_flag` = 0/1).

* **“Why isn’t `category` in primitive/scalar data types?”**
  Because “primitive/scalar types” is usually a *language-level* concept (int, float, char, bool).
  `category` is a **pandas logical / semantic dtype**, built on top of underlying arrays, not a low-level primitive like `int32` or `float64`. It’s more of a storage + semantics optimization in the DataFrame world.
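
To make the "built on top of underlying arrays" point concrete, here is a small sketch that looks inside a `category` column. The label values mirror the Telco columns; the series itself and its repeat count are made up for illustration.

```python
import pandas as pd

# Hypothetical label column: three distinct values repeated many times.
labels = pd.Series(["Yes", "No", "No internet service", "No", "Yes"] * 1000)
as_cat = labels.astype("category")

# Each distinct label is stored once, in the categories index ...
categories = list(as_cat.cat.categories)
# ... and every row holds only a small integer code pointing into that index.
codes_dtype = as_cat.cat.codes.dtype

# Compare the footprint of the plain string column with the categorical one.
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_cat.memory_usage(deep=True)

print(categories)
print(codes_dtype)
print(string_bytes, category_bytes)
```

So `category` is a lookup table plus an integer array: the semantics ("this column has a fixed set of labels") and the storage saving come from the same structure.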

If you‚Äôd like, next pass we can wire up **Section 2** so it explicitly reads:

* `categorical_manifest_1_6_5.csv`
* `dtypes_before.csv` / `dtypes_after.csv`

and uses them to drive:

* which columns get frequency tables,
* which get association metrics (Cramér’s V / Theil’s U),
* which ones are “protected” from encoding until later.
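
As a rough idea of what the association step could look like, the sketch below computes plain (uncorrected) Cramér’s V for every column listed in a manifest against the target. It is an illustration under assumptions, not the Section 2 implementation: `cramers_v`, the in-memory `manifest` stand-in for the manifest CSV, and the toy `demo` frame are all hypothetical, and the chi-square statistic is computed by hand with NumPy rather than through a stats library.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Plain (uncorrected) Cramér's V between two categorical series."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    # Drop all-zero rows/columns (e.g. unused categories) to avoid dividing by zero.
    observed = observed[observed.sum(axis=1) > 0][:, observed.sum(axis=0) > 0]
    n = observed.sum()
    if n == 0 or min(observed.shape) < 2:
        return float("nan")
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))


# Stand-in for the manifest CSV: only the `column` field is used here.
manifest = pd.DataFrame({"column": ["Partner", "Dependents"]})

# Toy data: Partner determines Churn exactly, Dependents is unrelated to it.
demo = pd.DataFrame(
    {
        "Partner": ["Yes", "Yes", "No", "No"],
        "Dependents": ["Yes", "No", "Yes", "No"],
        "Churn": ["No", "No", "Yes", "Yes"],
    }
)

scores = {col: cramers_v(demo[col], demo["Churn"]) for col in manifest["column"]}
print(scores)
```

Driving the loop from the manifest rather than from `df.dtypes` is the point: Section 2 then scores exactly the columns Section 1 registered, and nothing else.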


In [None]:
# 1.5.6 🏷️ Categorical hints & light profiling (no coercion)
print("\n1.5.6) 🏷️ Categorical hints & light profiling")

SECTION_ID   = "1.5.6"
SECTION_NAME = "Categorical hints & light profiling"

# 1️⃣ Config-driven categorical hints
cfg_cat_cols  = set(C("CATEGORICAL_COLUMNS", []) or [])
cfg_bool_cols = set(C("BOOLEAN_COLUMNS", []) or [])
cfg_id_cols   = set(C("ID_COLUMNS", []) or [])

DEFAULT_BINARY_CANDIDATES = {
    "Partner",
    "Dependents",
    "PhoneService",
    "PaperlessBilling",
    "MultipleLines",
    "DeviceProtection",
    "OnlineBackup",
    "OnlineSecurity",
    "TechSupport",
    "StreamingMovies",
    "StreamingTV",
}

cfg_binary_candidates = set(C("BINARY_CANDIDATES", []) or [])
binary_candidates = DEFAULT_BINARY_CANDIDATES | cfg_binary_candidates

print("   Config categorical columns: ", sorted(cfg_cat_cols))
print("   Config boolean columns:     ", sorted(cfg_bool_cols))
print("   Default binary candidates:  ", sorted(binary_candidates))

# 2️⃣ Column-level profile used for later heuristics (Section 2.2)
n_rows, n_cols = df.shape

def classify_dtype(dtype_str: str) -> str:
    s = dtype_str.lower()
    if "int" in s or "float" in s:
        return "numeric"
    if "bool" in s:
        return "boolean"
    if "datetime" in s or "date" in s:
        return "datetime"
    if "category" in s:
        return "categorical"
    return "string_like"

col_meta_rows = []
for col in df.columns:
    s = df[col]
    dtype_str = str(s.dtype)
    type_group = classify_dtype(dtype_str)

    non_null = s.notna().sum()
    n_null   = s.isna().sum()
    n_unique = s.nunique(dropna=True)
    unique_ratio = float(n_unique) / n_rows if n_rows else 0.0

    col_meta_rows.append(
        {
            "column":       col,
            "dtype":        dtype_str,
            "type_group":   type_group,
            "n_unique":     int(n_unique),
            "unique_ratio": round(unique_ratio, 4),
            "null_pct":     round(n_null / n_rows * 100, 3) if n_rows else 0.0,
        }
    )

col_meta = (
    pd.DataFrame(col_meta_rows)
    .sort_values(["type_group", "dtype", "column"])
    .reset_index(drop=True)
)

# 3️⃣ Heuristic “small-cardinality string-like” candidates (for Section 2.2)
MAX_UNIQUE_FOR_CATEGORY = int(C("CATEGORICAL_HEURISTICS.MAX_UNIQUE", 50))
MAX_RATIO_FOR_CATEGORY  = float(C("CATEGORICAL_HEURISTICS.MAX_RATIO", 0.5))

heuristic_candidates = set(
    col_meta.loc[
        (col_meta["type_group"] == "string_like")
        & (col_meta["n_unique"] <= MAX_UNIQUE_FOR_CATEGORY)
        & (col_meta["unique_ratio"] <= MAX_RATIO_FOR_CATEGORY),
        "column",
    ].tolist()
)

print(
    f"   Heuristic candidates (≤{MAX_UNIQUE_FOR_CATEGORY} uniques, "
    f"≤{MAX_RATIO_FOR_CATEGORY:.2f} ratio): {sorted(heuristic_candidates)}"
)

# 4️⃣ Optional: persist a tiny artifact for Section 2.2 to reuse
section1_dir = REPORTS_DIR / "section1"
section1_dir.mkdir(parents=True, exist_ok=True)

cat_hints_path = section1_dir / "categorical_hints_1_5_6.csv"
col_meta.to_csv(cat_hints_path, index=False)
print(f"   💾 Categorical hints profile saved → {cat_hints_path}")

print("\n✅ 1.5.6 Categorical hints & light profiling completed (no dtype coercion).")



1.5.6) 🏷️ Categorical hints & light profiling
   Config categorical columns:  []
   Config boolean columns:      []
   Default binary candidates:   ['Dependents', 'DeviceProtection', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport']
   Heuristic candidates (≤50 uniques, ≤0.50 ratio): ['Contract', 'Dependents', 'DeviceProtection', 'InternetService', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod', 'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport', 'gender']
   💾 Categorical hints profile saved → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/reports/section1/categorical_hints_1_5_6.csv

✅ 1.5.6 Categorical hints & light profiling completed (no dtype coercion).
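
Note that `customerID` is string-like but does not show up in the heuristic list: it has one distinct value per row, so its unique ratio is 1.0 and the ratio threshold rejects it. The sketch below isolates that two-threshold rule in a function so the behaviour is easy to check on a toy frame. The function name `categorical_candidates`, its `exclude` parameter, and the demo data are assumptions for illustration; the default thresholds are the ones used in the cell above.

```python
import pandas as pd
from pandas.api import types as ptypes


def categorical_candidates(frame, max_unique=50, max_ratio=0.5, exclude=()):
    """Return string-like columns with few distinct values relative to row count."""
    n_rows = len(frame)
    found = []
    for col in frame.columns:
        s = frame[col]
        if col in exclude:
            continue
        # Only plain object/string columns qualify; category columns are already done.
        if not (ptypes.is_object_dtype(s) or ptypes.is_string_dtype(s)):
            continue
        n_unique = s.nunique(dropna=True)
        unique_ratio = n_unique / n_rows if n_rows else 0.0
        # Both conditions must hold: an absolute cap and a cap relative to n_rows.
        if n_unique <= max_unique and unique_ratio <= max_ratio:
            found.append(col)
    return sorted(found)


# Toy frame: an ID-like column, two label columns, one numeric column.
demo = pd.DataFrame(
    {
        "customerID": ["a1", "b2", "c3", "d4", "e5", "f6"],
        "gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
        "Contract": ["Month-to-month", "One year", "Two year",
                     "One year", "Month-to-month", "Two year"],
        "tenure": [1, 34, 2, 45, 2, 8],
    }
)
candidates = categorical_candidates(demo)
print(candidates)
```

On small samples the ratio cap can also reject genuine label columns, which is one reason to keep the config-driven lists as the first line of defence.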


In [None]:
# 1.5.6 Categorical Candidate Registration & Manifest
print("\n1.5.6) 🏷️ Categorical registration & manifest")

# Optionally: save a read-only artifact like section1/categorical_candidates.csv:
# columns: column, type_group, n_unique, unique_ratio, is_config_categorical, is_config_boolean, is_binary_candidate, is_heuristic_candidate
# No astype(...) calls, no mutation.
from collections import defaultdict

SECTION_ID   = "1.5.6"
SECTION_NAME = "Categorical registration & manifest"

# ──────────────────────────────────────────
# 1) Pull config-driven categorical hints
# ──────────────────────────────────────────
cfg_cat_cols   = set(C("CATEGORICAL_COLUMNS", []) or [])
cfg_bool_cols  = set(C("BOOLEAN_COLUMNS", []) or [])
cfg_id_cols    = set(C("ID_COLUMNS", []) or [])

DEFAULT_BINARY_CANDIDATES = {
    "Partner",
    "Dependents",
    "PhoneService",
    "PaperlessBilling",
    "MultipleLines",
    "DeviceProtection",
    "OnlineBackup",
    "OnlineSecurity",
    "TechSupport",
    "StreamingMovies",
    "StreamingTV",
}

cfg_binary_candidates = set(C("BINARY_CANDIDATES", []) or [])
binary_candidates = DEFAULT_BINARY_CANDIDATES | cfg_binary_candidates

print("   Config categorical columns: ", sorted(cfg_cat_cols))
print("   Config boolean columns:     ", sorted(cfg_bool_cols))
print("   Default binary candidates:  ", sorted(binary_candidates))

# ──────────────────────────────────────────
# 2) Inference helpers: type_group, uniqueness, tiny-cardinality strings
# ──────────────────────────────────────────
n_rows, n_cols = df.shape

def classify_dtype(dtype_str: str) -> str:
    s = dtype_str.lower()
    if "int" in s or "float" in s:
        return "numeric"
    if "bool" in s:
        return "boolean"
    if "datetime" in s or "date" in s:
        return "datetime"
    if "category" in s:
        return "categorical"
    return "string_like"

col_meta_rows = []
for col in df.columns:
    s = df[col]
    dtype_str = str(s.dtype)
    type_group = classify_dtype(dtype_str)

    non_null = s.notna().sum()
    n_null = s.isna().sum()
    n_unique = s.nunique(dropna=True)
    unique_ratio = float(n_unique) / n_rows if n_rows else 0.0

    col_meta_rows.append(
        {
            "column": col,
            "dtype": dtype_str,
            "type_group": type_group,
            "n_unique": int(n_unique),
            "unique_ratio": round(unique_ratio, 4),
            "null_pct": round(n_null / n_rows * 100, 3) if n_rows else 0.0,
        }
    )

col_meta = pd.DataFrame(col_meta_rows)

MAX_UNIQUE_FOR_CATEGORY = int(C("CATEGORICAL_HEURISTICS.MAX_UNIQUE", 50))
MAX_RATIO_FOR_CATEGORY  = float(C("CATEGORICAL_HEURISTICS.MAX_RATIO", 0.5))

heuristic_candidates = set(
    col_meta.loc[
        (col_meta["type_group"] == "string_like")
        & (col_meta["n_unique"] <= MAX_UNIQUE_FOR_CATEGORY)
        & (col_meta["unique_ratio"] <= MAX_RATIO_FOR_CATEGORY),
        "column",
    ].tolist()
)

print(
    f"   Heuristic category candidates (≤{MAX_UNIQUE_FOR_CATEGORY} uniques, "
    f"≤{MAX_RATIO_FOR_CATEGORY:.2f} ratio): {sorted(heuristic_candidates)}"
)

# ──────────────────────────────────────────
# 3) Apply categorical / boolean conversions
# ──────────────────────────────────────────
forced_categories = set()
forced_booleans = set()
conversion_errors = []

# (a) Boolean columns from config
for col in cfg_bool_cols:
    if col not in df.columns:
        conversion_errors.append((col, "bool", "missing column"))
        continue
    try:
        df[col] = df[col].map(
            {
                True: True,
                False: False,
                1: True,
                0: False,
                "Yes": True,
                "No": False,
                "yes": True,
                "no": False,
                "Y": True,
                "N": False,
            }
        ).astype("boolean")
        forced_booleans.add(col)
    except Exception as e:
        conversion_errors.append((col, "bool", str(e)))

# (b) Config categorical columns
for col in cfg_cat_cols:
    if col not in df.columns:
        conversion_errors.append((col, "category", "missing column"))
        continue
    try:
        df[col] = df[col].astype("category")
        forced_categories.add(col)
    except Exception as e:
        conversion_errors.append((col, "category", str(e)))

# (c) Default binary Yes/No-style candidates
for col in binary_candidates:
    if col not in df.columns:
        continue
    if str(df[col].dtype) == "category":
        continue
    try:
        df[col] = df[col].astype("category")
        forced_categories.add(col)
    except Exception as e:
        conversion_errors.append((col, "category", str(e)))

# (d) Heuristic small-cardinality string-like candidates (if not in ID columns)
auto_cat_candidates = heuristic_candidates - cfg_id_cols - forced_categories
for col in auto_cat_candidates:
    if col not in df.columns:
        continue
    if str(df[col].dtype) == "category":
        continue
    try:
        df[col] = df[col].astype("category")
        forced_categories.add(col)
    except Exception as e:
        conversion_errors.append((col, "category", str(e)))

print(f"\n   ✅ Forced to categorical: {sorted(forced_categories)}")
print(f"   ✅ Forced to boolean:    {sorted(forced_booleans)}")

if conversion_errors:
    print("\n   ⚠️ Conversion issues encountered:")
    for col, target, msg in conversion_errors[:10]:
        print(f"     • {col} → {target}: {msg}")
    if len(conversion_errors) > 10:
        print(f"     … and {len(conversion_errors) - 10} more")

# ──────────────────────────────────────────
# 4) Build categorical manifest
# ──────────────────────────────────────────
run_ts = datetime.utcnow().isoformat(timespec="seconds") + "Z"

cat_cols = [c for c in df.columns if "category" in str(df[c].dtype).lower()]

manifest_rows = []
for col in cat_cols:
    s = df[col]
    n_unique = s.nunique(dropna=True)
    value_counts = (
        s.value_counts(dropna=False)
         .head(10)
         .to_dict()
    )

    if col in cfg_cat_cols:
        source = "config:categorical"
    elif col in binary_candidates:
        source = "default_binary"
    elif col in heuristic_candidates:
        source = "heuristic_small_cardinality"
    else:
        source = "existing_category"

    manifest_rows.append(
        {
            "column": col,
            "is_config_categorical": col in cfg_cat_cols,
            "is_binary_candidate": col in binary_candidates,
            "source": source,
            "n_unique": int(n_unique),
            "sample_values": json.dumps(value_counts),
            "section_id": SECTION_ID,
            "section_name": SECTION_NAME,
            "run_ts_utc": run_ts,
        }
    )

categorical_manifest = (
    pd.DataFrame(manifest_rows)
      .sort_values("column")
      .reset_index(drop=True)
)

# Persist manifest
cat_dir = ARTIFACTS_DIR / "categorical"
cat_dir.mkdir(parents=True, exist_ok=True)

cat_manifest_path = cat_dir / "categorical_manifest_1_5_6.csv"
tmp_path = cat_manifest_path.with_suffix(".tmp.csv")

try:
    categorical_manifest.to_csv(tmp_path, index=False)
    os.replace(tmp_path, cat_manifest_path)
    print(f"\n   💾 Categorical manifest written → {cat_manifest_path}")
except Exception as e:
    if tmp_path.exists():
        try:
            tmp_path.unlink()
        except Exception:
            pass
    print(f"   ⚠️ Could not write categorical manifest: {e}")

# Markdown snapshot
section1_dir = REPORTS_DIR / "section1"
section1_dir.mkdir(parents=True, exist_ok=True)
cat_md_path = section1_dir / "1_5_6_categorical_overview.md"

with cat_md_path.open("w", encoding="utf-8") as f:
    f.write(f"# {SECTION_ID} {SECTION_NAME}\n\n")
    f.write(f"- Run timestamp: {run_ts}\n")
    f.write(f"- Total columns: {n_cols}\n")
    f.write(f"- Categorical columns: {len(cat_cols)}\n\n")
    f.write("## Categorical Columns\n\n")
    for _, row in categorical_manifest.iterrows():
        f.write(f"### {row['column']}\n")
        f.write(f"- Source: `{row['source']}`\n")
        f.write(f"- n_unique: {row['n_unique']}\n")
        f.write(f"- Config categorical: {row['is_config_categorical']}\n")
        f.write(f"- Binary candidate: {row['is_binary_candidate']}\n\n")

print(f"   📄 Markdown overview written → {cat_md_path}")
print("\n✅ 1.5.6 Categorical registration & manifest completed.")



1.5.6) 🏷️ Categorical registration & manifest
   Config categorical columns:  []
   Config boolean columns:      []
   Default binary candidates:   ['Dependents', 'DeviceProtection', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport']
   Heuristic category candidates (≤50 uniques, ≤0.50 ratio): ['Contract', 'Dependents', 'DeviceProtection', 'InternetService', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod', 'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport', 'gender']

   ✅ Forced to categorical: ['Contract', 'Dependents', 'DeviceProtection', 'InternetService', 'MultipleLines', 'OnlineBackup', 'OnlineSecurity', 'PaperlessBilling', 'Partner', 'PaymentMethod', 'PhoneService', 'StreamingMovies', 'StreamingTV', 'TechSupport', 'gender']
   ✅ Forced to boolean:    []

   💾 Categorical manifest written → /Users/b/DATA/
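
The manifest write in 1.5.6 uses a write-to-temp-then-`os.replace` pattern so a failed write never leaves a half-written CSV behind. If that pattern moves into the infra module, it could be sketched roughly like this. The helper name `atomic_write_text` is hypothetical and not part of the notebook; it uses only the standard library and assumes the temp file lives in the destination directory so the final rename stays on one filesystem.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(dest: Path, text: str, encoding: str = "utf-8") -> Path:
    """Write text to a temp file next to dest, then swap it into place.

    Hypothetical helper: readers of dest see either the old file or the
    complete new one, never a partial write.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)

    # Temp file in the same directory, so os.replace is a same-filesystem rename.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".tmp"
    )
    tmp_path = Path(tmp_name)
    try:
        # newline="" writes the text as given, without newline translation.
        with os.fdopen(fd, "w", encoding=encoding, newline="") as f:
            f.write(text)
        os.replace(tmp_path, dest)
    except Exception:
        # Clean up the temp file, then let the caller see the real error.
        tmp_path.unlink(missing_ok=True)
        raise
    return dest
```

With this in place, the manifest write would reduce to something like `atomic_write_text(cat_manifest_path, categorical_manifest.to_csv(index=False))`, since `DataFrame.to_csv` returns the CSV as a string when no path is given.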

<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:5px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
1.6.1 - 1.6.2 🧾 Environment Summary & Log Consolidation
</summary>

# 1.6.1-1.6.2 🧾 Environment Summary & Log Consolidation

1.6.1 Capture System & Library Versions
- Python, pandas, numpy, seaborn, matplotlib, etc.

1.6.2 Generate Setup Summary Report
- Aggregates paths, env vars, library versions, config sources.

- **Output:** setup_summary.json / markdown cell

- 🟩 *Hiring value:* Gives recruiters a polished, verifiable record of engineering hygiene.

</details>
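
The version capture described in 1.6.1 can also be done without importing each library, which helps when an optional one (seaborn, say) is not installed. A minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package list simply mirrors the libraries named above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record the gap instead of failing the whole environment summary.
            versions[name] = None
    return versions


libs = installed_versions(["pandas", "numpy", "seaborn", "matplotlib"])
for lib, ver in libs.items():
    print(f"{lib}: {ver if ver is not None else 'not installed'}")
```

The trade-off: this reports what is installed in the environment, not what is actually imported in the running kernel, so it complements the `__version__` approach rather than replacing it.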


In [None]:
# 1.6.1 Capture System & Library Versions
print("1.6 🧾 ENVIRONMENT SUMMARY & LOG CONSOLIDATION")
print("\n1.6.1) 🧬 Capturing system & library versions")

env_info = {
    "timestamp_utc": datetime.utcnow().isoformat(timespec="seconds") + "Z",
    "system": {
        "platform": platform.system(),
        "platform_release": platform.release(),
        "platform_version": platform.version(),
        "machine": platform.machine(),
        "python_version": sys.version,
    },
    "libraries": {
        "pandas": pd.__version__,
        "numpy": np.__version__,
        "seaborn": sns.__version__,
        "matplotlib": mpl.__version__,
    },
}

print("   System:")
print(f"     • Platform: {env_info['system']['platform']} "
      f"{env_info['system']['platform_release']}")
print(f"     • Python:   {env_info['system']['python_version'].split()[0]}")

print("   Libraries:")
for lib, ver in env_info["libraries"].items():
    print(f"     • {lib}: {ver}")

1.6 🧾 ENVIRONMENT SUMMARY & LOG CONSOLIDATION

1.6.1) 🧬 Capturing system & library versions
   System:
     • Platform: Darwin 21.6.0
     • Python:   3.12.6
   Libraries:
     • pandas: 2.3.2
     • numpy: 2.3.3
     • seaborn: 0.13.2
     • matplotlib: 3.10.6


In [None]:
# 1.6.2 🧾 Generating setup summary report
# 1.6 is now thorough enough that a separate 1.3.4 is not strictly needed.
# Treat 1.6.2 as a "just-in-time" version of 1.3.4, or downsize 1.6 instead.
print("\n1.6.2 \n\n🧾 Generating setup summary report..")



setup_summary = {
    "timestamp_utc": env_info["timestamp_utc"],
    "project": {
        "project_root":   str(PROJECT_ROOT),
        "project_name":   C("META.PROJECT_NAME", "Telco Churn Project"),
        "project_version": C("META.VERSION", "0.1.0"),
        "config_path":    str(CONFIG_PATH),
    },
    "paths": {
        "raw_data":      str(RAW_DATA),
        "processed_dir": str(PROCESSED_DIR),
        "artifacts_dir": str(ARTIFACTS_DIR),
        "reports_dir":   str(REPORTS_DIR),
        "figures_dir":   str(FIGURES_DIR),
        "models_dir":    str(MODELS_DIR),
        "outputs_dir":   str(OUTPUTS_DIR),
    },
    "target_and_ids": {
        "target_column":      C("TARGET.COLUMN"),
        "raw_target_column":  C("TARGET.RAW_COLUMN"),
        "positive_class":     C("TARGET.POSITIVE_CLASS"),
        "negative_class":     C("TARGET.NEGATIVE_CLASS"),
        "id_columns":         C("ID_COLUMNS", []),
    },
    "data_quality_config": {
        "ranges":                 C("RANGES", {}),
        "data_quality_thresholds": C("DATA_QUALITY", {}),
        "flags":                  C("FLAGS", {}),
    },
    "environment": env_info,
}

# Decide where to store env/setup info
env_dir = ARTIFACTS_DIR / "env"
env_dir.mkdir(parents=True, exist_ok=True)

setup_summary_path = env_dir / "setup_summary.json"

with setup_summary_path.open("w", encoding="utf-8") as f:
    json.dump(setup_summary, f, indent=2)

# Optional: small human-readable snapshot for the notebook
print("\n📌 Setup snapshot:")
print(
    f"   • Project: {setup_summary['project']['project_name']} "
    f"(v{setup_summary['project']['project_version']})"
)
print(f"   • Root:    {setup_summary['project']['project_root']}")
print(f"   • Raw:     {setup_summary['paths']['raw_data']}")
print(
    f"   • Target:  {setup_summary['target_and_ids']['target_column']} "
    f"(raw: {setup_summary['target_and_ids']['raw_target_column']})"
)
print(f"   • IDs:     {setup_summary['target_and_ids']['id_columns']}")

print(f"\n✅ Setup summary written → {setup_summary_path}")
print("✅ 1.6 Environment summary & setup log completed.")



1.6.2 

🧾 Generating setup summary report..

📌 Setup snapshot:
   • Project: Telco Customer Churn (v1.0)
   • Root:    /Users/b/DATA/PROJECTS/Telco
   • Raw:     /Users/b/DATA/PROJECTS/Telco/resources/data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv
   • Target:  Churn_flag (raw: Churn)
   • IDs:     ['customerID']

✅ Setup summary written → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts/env/setup_summary.json
✅ 1.6 Environment summary & setup log completed.
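
One thing to watch in 1.6.2: the summary pulls whole config sections (`RANGES`, `DATA_QUALITY`, `FLAGS`) straight into `json.dump`. If the config ever carries a value the `json` module cannot encode (a YAML date or a `Path`, for example; this is an assumption about future config content, not something the run above hit), the write fails with a `TypeError`. A minimal sketch of a more forgiving writer, where `dump_json_safely` is a hypothetical helper name:

```python
import json
from datetime import date
from pathlib import Path


def dump_json_safely(obj, path: Path) -> Path:
    """Write obj as JSON, stringifying anything json cannot encode natively."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        # default=str is called only for values json does not know how to
        # encode (Path, date, set, ...); normal types are written unchanged.
        json.dump(obj, f, indent=2, default=str, ensure_ascii=False)
    return path


# Illustrative payload only; the real summary comes from the notebook's config.
example = {"raw_dir": Path("data"), "cutoff": date(2024, 1, 2), "max_tenure": 72}
```

The cost of `default=str` is that the original type is lost on reload (a date comes back as a string), which is usually acceptable for a setup log that is read by people rather than re-parsed.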


In [None]:
# 1.6.3 Core library version snapshot (optional, but nice hiring signal)

# OLD-1.3.4 snapshot core lib version (match pins in requirements.txt)
# TODO: remove this if not needed

def write_core_env_snapshot(artifacts_dir: Path, versions: dict) -> Path:
    """Write core library versions + timestamp as a JSON artifact."""
    env_dir = artifacts_dir / "env"
    env_dir.mkdir(parents=True, exist_ok=True)

    env_snapshot_path = env_dir / "core_lib_versions.json"
    env_snapshot = {
        "timestamp_utc": datetime.utcnow().isoformat(timespec="seconds") + "Z",
        "versions": versions,
    }

    with env_snapshot_path.open("w", encoding="utf-8") as f:
        json.dump(env_snapshot, f, indent=2)

    return env_snapshot_path

env_snapshot_path = write_core_env_snapshot(ARTIFACTS_DIR, versions)
print(f"🧾 1.6.3 Core library versions snapshot saved → {env_snapshot_path}")


🧾 1.6.3 Core library versions snapshot saved → /Users/b/DATA/PROJECTS/Telco/Level_3/resources/artifacts/env/core_lib_versions.json


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:2px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
Section-1 / Section-2 boundary check!
</summary>

# Section-1 / Section-2 boundary check!

Important sanity check with your new DQ framework:

Section 1:
+ Loads config (Type-3 YAML → Type-2 C()).
+ Resolves paths & constants.
+ Loads df_raw, hashes it, registers version, logs load.
+ Creates df (working copy).
+ Does non-mutating structural checks and simple, domain-obvious normalization (target label to Churn_flag).

Emits:
+ config_validation_report.csv
+ dataset_version_registry.csv
+ dataset_load_log.csv
+ df_raw_info.txt / df_working_info.txt
+ section1_5_2_schema.csv
+ initial_schema_summary.csv
+ target_field_summary.csv
+ dtypes_pre_section2.csv
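
The boundary can be made checkable: before Section 2 starts, confirm that every file Section 1 is supposed to emit actually exists. A rough sketch follows. The function name `missing_section1_artifacts` is hypothetical, and because the list above does not say which directory each file lands in, the sketch simply searches whatever directories it is given (recursively) by file name.

```python
from pathlib import Path

# File names taken from the "Emits" list above.
SECTION1_ARTIFACTS = [
    "config_validation_report.csv",
    "dataset_version_registry.csv",
    "dataset_load_log.csv",
    "df_raw_info.txt",
    "df_working_info.txt",
    "section1_5_2_schema.csv",
    "initial_schema_summary.csv",
    "target_field_summary.csv",
    "dtypes_pre_section2.csv",
]


def missing_section1_artifacts(search_dirs, expected=SECTION1_ARTIFACTS):
    """Return the expected file names not found under any of search_dirs."""
    found = set()
    for d in search_dirs:
        d = Path(d)
        if d.is_dir():  # silently skip directories that do not exist yet
            found.update(p.name for p in d.rglob("*") if p.is_file())
    # Preserve the order of the expected list for readable reporting.
    return [name for name in expected if name not in found]
```

At the top of Section 2 this might be called as `missing_section1_artifacts([ARTIFACTS_DIR, REPORTS_DIR])`, raising or warning if the returned list is non-empty. Which directories to pass is an assumption; adjust to wherever Section 1 really writes.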

Section 2:
+ Owns typed schema intelligence + coercion
+ DQ (2.1–2.4 pipeline)