<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
df lineage recorder & transformation log
</summary>

Nice, this is a perfect moment to add a “black box recorder” for `df` / `df_clean` to the notebook.

You won’t get this for *free* (Python can’t magically infer “this line transforms df”), but you can make it pretty painless:

---

## 1️⃣ Add a tiny lineage log at the top of 02_DQ

Right after 2.0.0 (or even inside it, after `df` is created) add:

```python
# 🔁 Simple DF lineage recorder for Section 2 (one small helper; inline its body if you want no new defs)

import pandas as pd  # harmless if already imported in 2.0.0

if "df_lineage_records" not in globals():
    df_lineage_records = []

def _log_df_state(section, label, frame, note=None):
    """
    Tiny helper: record where df/df_clean is in the pipeline.
    (OK to keep this for Level_3; later you can move it to a library.)
    """
    rec = {
        "section": section,                # "2.0.0", "2.1.3", "2.6.7", etc.
        "label": label,                    # short human label
        "n_rows": int(frame.shape[0]),
        "n_cols": int(frame.shape[1]),
        "has_TotalCharges": "TotalCharges" in frame.columns,
        "TotalCharges_nulls": (
            int(frame["TotalCharges"].isna().sum())
            if "TotalCharges" in frame.columns else None
        ),
        "timestamp": pd.Timestamp.now(tz="UTC"),  # Timestamp.utcnow() is deprecated in newer pandas
        "note": note or "",
    }
    df_lineage_records.append(rec)
```

üí°üí° If you *really* want **no new functions** in the notebook, you can inline the body of `_log_df_state` at each step instead. But having one tiny helper here will massively reduce copy-paste and you can always move it into `telco_churn/utils/lineage.py` later.

---

## 2️⃣ Log the very first load in 2.0.0

At the end of your 2.0.0 cell (right after `df = df_raw.copy(...)` and the “df ready for Section 2” prints), add:

```python
_log_df_state(
    section="2.0.0",
    label="Initial load from RAW_DATA",
    frame=df,
    note="df loaded from WA_Fn-UseC_-Telco-Customer-Churn.csv"
)
```

Now you have a clear “origin point” for the lineage.

---

## 3️⃣ Log after each major transformation

After any cell where `df` or `df_clean` changes in a meaningful way, drop a one-liner like:

### Example: after basic type coercion (2.1.x)

```python
# ... your coercion code ...

_log_df_state(
    section="2.1.4",
    label="Post-type coercion",
    frame=df,
    note="Coerced TotalCharges to float; initial numeric integrity checks"
)
```
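
One reason this checkpoint is worth logging: null counts can change at coercion time. A minimal sketch, assuming blank strings stand in for missing values in a text-typed column (the toy values below are made up for illustration, not taken from your data):

```python
import pandas as pd

# Toy stand-in for a text-typed TotalCharges column (values are illustrative).
raw = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

nulls_before = int(raw.isna().sum())           # blank strings are not NaN yet
coerced = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN
nulls_after = int(coerced.isna().sum())

print(f"nulls before coercion: {nulls_before}, after: {nulls_after}")
```

If your load step already converts blanks to NaN, the two counts will simply match, and the log will show that too.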

### Example: after missing-value imputation (2.6.1–2.6.3)

If you switch to `df_clean` in 2.6A, log that too:

```python
# ... 2.6.1–2.6.3 missing-value + outlier handling, resulting in df_clean ...

_log_df_state(
    section="2.6.3",
    label="Post-missing/outlier handling",
    frame=df_clean,
    note="Applied MISSING_VALUES strategy; TotalCharges: zero override"
)
```

### Example: after 2.6.7 logic repairs (where TotalCharges fixes should happen)

Right after the 2.6.7 block you pasted, add:

```python
_log_df_state(
    section="2.6.7",
    label="Post-logic-driven repairs",
    frame=df_clean,
    note="Applied LOGIC_REPAIR rules to tenure/TotalCharges"
)
```

And after 2.6.8 / 2.6.9:

```python
_log_df_state(
    section="2.6.8",
    label="Post-derived feature regeneration",
    frame=df_clean,
    note="Regenerated derived features after repairs"
)

_log_df_state(
    section="2.6.9",
    label="Post-encoding prep",
    frame=df_clean,
    note="Ready for modeling export / 3.x"
)
```

You don’t have to log *every tiny mutation*, just the big waypoints where you care about:

* shape changes,
* `TotalCharges` nulls,
* switch from `df` → `df_clean`.

---

## 4️⃣ Turn the lineage into a visual table at the end

Add a final cell near the bottom of the notebook:

```python
# 2.9.x 📊 DF Lineage Overview for this run

import os  # needed for the atomic os.replace below

if "df_lineage_records" in globals() and df_lineage_records:
    df_lineage = pd.DataFrame(df_lineage_records)
    df_lineage = df_lineage.sort_values("timestamp").reset_index(drop=True)

    print("📊 DF lineage across Section 2:")
    display(df_lineage[
        ["section", "label", "n_rows", "n_cols", "has_TotalCharges", "TotalCharges_nulls", "note"]
    ])

    # Save as artifact for dashboards / HTML UX later
    # (write to a temp file first, then swap it into place atomically)
    lineage_path = SEC2_REPORTS_DIR / "df_lineage_section2.csv"
    tmp_lineage_path = lineage_path.with_suffix(".tmp.csv")
    df_lineage.to_csv(tmp_lineage_path, index=False)
    os.replace(tmp_lineage_path, lineage_path)

else:
    print("⚠️ No df_lineage_records found – did you call _log_df_state anywhere?")
```

Visually you’ll get a **timeline table** like:

| section | label                         | n_rows | n_cols | has_TotalCharges | TotalCharges_nulls | note                                          |
| ------- | ----------------------------- | ------ | ------ | ---------------- | ------------------ | --------------------------------------------- |
| 2.0.0   | Initial load from RAW_DATA    | 7,043  | 21     | True             | 11                 | df loaded from WA_Fn-UseC_-Telco...           |
| 2.6.3   | Post-missing/outlier handling | 7,043  | 21     | True             | 11                 | Applied MISSING_VALUES strategy               |
| 2.6.7   | Post-logic-driven repairs     | 7,043  | 21     | True             | 0                  | Applied LOGIC_REPAIR rules to tenure/Total... |

That last line is the **UX “aha!”**: you literally see `TotalCharges_nulls` drop to `0` at 2.6.7.
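
To pinpoint that step programmatically rather than by eye, here is a rough sketch using `Series.diff()` on the lineage table. The small frame is a hand-built stand-in that mirrors the example rows in the table, not real run output, and `nulls_delta` / `fix_steps` are illustrative names:

```python
import pandas as pd

# Hand-built stand-in mirroring the example table (not real run output).
lineage = pd.DataFrame(
    {
        "section": ["2.0.0", "2.6.3", "2.6.7"],
        "TotalCharges_nulls": [11, 11, 0],
    }
)

# Change in null count relative to the previous logged step.
lineage["nulls_delta"] = lineage["TotalCharges_nulls"].diff().fillna(0).astype(int)

# Steps where the null count went down are the candidate "fix" steps.
fix_steps = lineage.loc[lineage["nulls_delta"] < 0, "section"].tolist()
print(fix_steps)
```

If some logged step lacks the column (so `TotalCharges_nulls` is `None`), you would need to drop or fill those rows before taking the diff.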

---

## 5️⃣ (Optional) Make a tiny node-flow diagram

If you want an actual “flow picture”, you can quickly turn that table into a simple graph later. For now, I’d just keep the CSV and table. In a future pass you could do something like:

```python
import matplotlib.pyplot as plt

if "df_lineage" in globals():
    plt.figure(figsize=(8, 4))
    plt.plot(range(len(df_lineage)), df_lineage["TotalCharges_nulls"], marker="o")
    plt.xticks(
        ticks=range(len(df_lineage)),
        labels=df_lineage["section"] + "\n" + df_lineage["label"],
        rotation=45,
        ha="right"
    )
    plt.ylabel("TotalCharges nulls")
    plt.title("TotalCharges nulls across Section 2 lineage")
    plt.tight_layout()
    plt.show()
```

That gives you a quick **sparkline of where the fix actually kicked in**.

---

If you’d like, next step I can:

* tailor the `_log_df_state` to also track *which* column changed (e.g., diff in `TotalCharges_nulls` vs previous step), or
* wire this lineage into your HTML dashboard so the “Apply Phase” panel shows a mini timeline of `TotalCharges` health.


<details>
<summary style="
    cursor:pointer;background:#f7f7fb;border:3px solid #297be7ff;
padding:10px 12px;border-radius:10px;font-weight:700;">
more complex df lineage recorder & transformation log
</summary>


Perfect, then let’s go full “black box recorder” with a proper lineage tracker ✈️📊

Below is a **self-contained helper** you can drop into a module (e.g. `telco_churn/utils/lineage.py`) plus **exact usage** in `02_DQ.ipynb` that will:

* Record **where `df` is first loaded**
* Record **each major transformation** (`df` / `df_clean`)
* Track **TotalCharges nulls** (and any other columns you care about)
* Give you a **timeline table + optional chart** so you can *see* where the problem gets fixed.

---

## 1️⃣ Helper module: `telco_churn/utils/lineage.py`

Create this file:

```python
# telco_churn/utils/lineage.py

from __future__ import annotations

import os  # used by save_csv for the atomic os.replace
from dataclasses import dataclass, asdict, field
from pathlib import Path
from typing import Any, Dict, List, Optional, Sequence

import pandas as pd
import matplotlib.pyplot as plt


@dataclass
class DFSnapshot:
    """One point in the pipeline: what did the frame look like here?"""

    section: str
    label: str
    step_index: int

    n_rows: int
    n_cols: int

    # shape / diff vs previous
    rows_delta: Optional[int] = None
    cols_delta: Optional[int] = None

    # core "health" metrics per tracked column (flattened as columns)
    metrics: Dict[str, Any] = field(default_factory=dict)

    changed_columns: str = ""  # comma-separated names that changed vs previous
    note: str = ""
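    # Timestamp.utcnow() is deprecated in newer pandas; now(tz="UTC") is the replacement
    timestamp: pd.Timestamp = field(default_factory=lambda: pd.Timestamp.now(tz="UTC"))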

    def to_flat_dict(self) -> Dict[str, Any]:
        base = asdict(self)
        # metrics is nested; flatten to top-level columns
        metrics = base.pop("metrics", {}) or {}
        for k, v in metrics.items():
            base[k] = v
        return base


class DFLineageTracker:
    """
    Tracks how a DataFrame evolves across a notebook / pipeline.

    - You call .snapshot(df, section="2.0.0", label="Initial load", ...)
    - It records shape, per-column metrics, and diffs vs previous snapshot.
    - At the end, call .to_frame() / .save_csv() / .plot_metric(...)
    """

    def __init__(
        self,
        name: str,
        tracked_columns: Sequence[str] = ("TotalCharges",),
        save_path: Optional[Path] = None,
        compute_changed_columns: bool = True,
    ) -> None:
        self.name = name
        self.tracked_columns = list(tracked_columns)
        self.save_path = Path(save_path) if save_path is not None else None
        self.compute_changed_columns = compute_changed_columns

        self._snapshots: List[DFSnapshot] = []
        self._last_df: Optional[pd.DataFrame] = None

    # ------------------------------------------------------------------ #
    # Core API
    # ------------------------------------------------------------------ #
    def snapshot(
        self,
        df: pd.DataFrame,
        section: str,
        label: str,
        note: str = "",
    ) -> None:
        """
        Record the current state of df at a given pipeline step.
        """
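        # Deep copy: a shallow copy can share data with the caller's frame, so later
        # in-place edits could leak into this snapshot and hide changed columns.
        df = df.copy(deep=True)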
        step_index = len(self._snapshots)

        n_rows, n_cols = int(df.shape[0]), int(df.shape[1])
        rows_delta = None
        cols_delta = None
        changed_cols: List[str] = []

        if self._last_df is not None:
            prev_rows, prev_cols = self._last_df.shape
            rows_delta = n_rows - int(prev_rows)
            cols_delta = n_cols - int(prev_cols)

            if self.compute_changed_columns:
                common_cols = [c for c in df.columns if c in self._last_df.columns]
                for col in common_cols:
                    # cheap-ish equality check; for 7k x 21 this is fine
                    try:
                        if not df[col].equals(self._last_df[col]):
                            changed_cols.append(col)
                    except Exception:
                        # if comparison fails for some weird dtype,
                        # conservatively flag the column as changed
                        changed_cols.append(col)
        else:
            rows_delta = 0
            cols_delta = 0

        metrics = self._compute_metrics(df)

        snapshot = DFSnapshot(
            section=section,
            label=label,
            step_index=step_index,
            n_rows=n_rows,
            n_cols=n_cols,
            rows_delta=rows_delta,
            cols_delta=cols_delta,
            metrics=metrics,
            changed_columns=",".join(sorted(set(changed_cols))) if changed_cols else "",
            note=note,
        )

        self._snapshots.append(snapshot)
        self._last_df = df

    def to_frame(self) -> pd.DataFrame:
        """
        Return all snapshots as a pandas DataFrame.
        """
        if not self._snapshots:
            return pd.DataFrame()
        rows = [s.to_flat_dict() for s in self._snapshots]
        df = pd.DataFrame(rows).sort_values("step_index").reset_index(drop=True)
        return df

    def save_csv(self, path: Optional[Path] = None) -> Path:
        """
        Save lineage to CSV (atomic write). Returns final path.
        """
        if path is None:
            if self.save_path is None:
                raise ValueError("No save_path configured for DFLineageTracker.")
            path = self.save_path

        path = Path(path)
        path.parent.mkdir(parents=True, exist_ok=True)

        df = self.to_frame()
        tmp = path.with_suffix(".tmp.csv")
        df.to_csv(tmp, index=False)
        os.replace(tmp, path)
        return path

    # ------------------------------------------------------------------ #
    # Metrics + plotting
    # ------------------------------------------------------------------ #
    def _compute_metrics(self, df: pd.DataFrame) -> Dict[str, Any]:
        """
        Compute metrics for each tracked column, like:
            - {col}_present
            - {col}_nulls
            - {col}_non_nulls
            - {col}_min / _max / _mean (numeric only)
        """
        m: Dict[str, Any] = {}
        for col in self.tracked_columns:
            col_key = col.replace(" ", "_")
            present = col in df.columns
            m[f"{col_key}_present"] = bool(present)
            if not present:
                m[f"{col_key}_nulls"] = None
                m[f"{col_key}_non_nulls"] = None
                m[f"{col_key}_min"] = None
                m[f"{col_key}_max"] = None
                m[f"{col_key}_mean"] = None
                continue

            series = df[col]
            nulls = int(series.isna().sum())
            non_nulls = int(series.notna().sum())
            m[f"{col_key}_nulls"] = nulls
            m[f"{col_key}_non_nulls"] = non_nulls

            if pd.api.types.is_numeric_dtype(series):
                m[f"{col_key}_min"] = float(series.min(skipna=True)) if non_nulls > 0 else None
                m[f"{col_key}_max"] = float(series.max(skipna=True)) if non_nulls > 0 else None
                m[f"{col_key}_mean"] = float(series.mean(skipna=True)) if non_nulls > 0 else None
            else:
                m[f"{col_key}_min"] = None
                m[f"{col_key}_max"] = None
                m[f"{col_key}_mean"] = None

        return m

    def plot_metric(
        self,
        metric_col: str,
        title: Optional[str] = None,
        figsize: tuple = (8, 4),
    ) -> None:
        """
        Quick line plot of a numeric metric across snapshots.

        Example metric_col:
            'TotalCharges_nulls'
            'TotalCharges_non_nulls'
            'tenure_nulls'
        """
        df = self.to_frame()
        if df.empty:
            print("⚠️ No snapshots recorded; nothing to plot.")
            return

        if metric_col not in df.columns:
            print(f"⚠️ Metric '{metric_col}' not found in lineage DataFrame.")
            print("   Available metrics:", [c for c in df.columns if metric_col.split('_')[0] in c])
            return

        x = range(len(df))
        y = df[metric_col].values

        plt.figure(figsize=figsize)
        plt.plot(x, y, marker="o")
        plt.xticks(
            ticks=x,
            labels=[f"{s} | {l}" for s, l in zip(df["section"], df["label"])],
            rotation=45,
            ha="right",
        )
        plt.ylabel(metric_col)
        plt.title(title or f"{self.name}: {metric_col} across steps")
        plt.tight_layout()
        plt.show()
```

> 💡💡 You can add more metrics later (e.g. `churn_rate`, `mean_MonthlyCharges`) by modifying `_compute_metrics`.
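
As a hedged illustration of such an extra metric, the sketch below computes a churn rate as a standalone function. It assumes a `Churn` column holding "Yes"/"No" strings, which nothing above confirms, and `churn_rate_metric` is a hypothetical name. Its returned dict could be merged into the one `_compute_metrics` builds.

```python
from typing import Any, Dict

import pandas as pd


def churn_rate_metric(df: pd.DataFrame, col: str = "Churn") -> Dict[str, Any]:
    """Hypothetical extra metric: share of non-null rows where `col` equals "Yes".

    Assumes `col` holds "Yes"/"No" strings; returns None when the column
    is missing or has no non-null values.
    """
    if col not in df.columns:
        return {"churn_rate": None}
    values = df[col].dropna()
    if values.empty:
        return {"churn_rate": None}
    return {"churn_rate": float((values == "Yes").mean())}


# Toy frame for illustration only.
demo = pd.DataFrame({"Churn": ["Yes", "No", "No", "Yes"]})
print(churn_rate_metric(demo))
```

Keeping each extra metric in its own small function like this makes it easy to test before wiring it into the tracker.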

---

## 2️⃣ Wire it into `02_DQ.ipynb`

### 2.1 Import + initialize after 2.0.0 bootstrap

Right after your 2.0.0 cell (where `SEC2_REPORTS_DIR` and `df` are ready), add:

```python
from telco_churn.utils.lineage import DFLineageTracker
from pathlib import Path

# 2.L0 📊 Initialize DF lineage tracker for Section 2
LINEAGE_PATH_2 = SEC2_REPORTS_DIR / "df_lineage_section2.csv"

df_lineage = DFLineageTracker(
    name="Section 2 Data Quality",
    tracked_columns=["TotalCharges", "tenure", "MonthlyCharges"],
    save_path=LINEAGE_PATH_2,
)

# First snapshot: initial load of df
df_lineage.snapshot(
    df=df,
    section="2.0.0",
    label="Initial load from RAW_DATA",
    note="Raw Telco CSV as loaded in 2.0.0",
)
```

Now you’ve got the **origin point** recorded.

---

### 2.2 Drop snapshots after the key transformations

You don’t need to track every cell, just the **interesting ones**.

#### After type coercion (where `TotalCharges` becomes numeric)

```python
# ... 2.1.x type coercion work on df ...

df_lineage.snapshot(
    df=df,
    section="2.1.x",
    label="Post-type coercion",
    note="Coerced TotalCharges to float, other numeric conversions",
)
```

#### After missing-value handling / imputations in 2.6A (df → df_clean)

Where you apply `MISSING_VALUES.STRATEGIES.NUMERIC` and the overrides for `TotalCharges`:

```python
# ... 2.6.1–2.6.3 controlled imputation / outlier treatment ...
# df_clean = <resulting cleaned frame>

df_lineage.snapshot(
    df=df_clean,
    section="2.6.3",
    label="Post-missing & outlier handling",
    note="Applied MISSING_VALUES strategies; TotalCharges override to zero per config",
)
```

#### After 2.6.7 logic repairs (where the Telco rule might fix inconsistencies)

Immediately after the 2.6.7 block you pasted earlier:

```python
df_lineage.snapshot(
    df=df_clean,
    section="2.6.7",
    label="Post-logic-driven repairs",
    note="Applied LOGIC_REPAIR / tenure–TotalCharges rule where applicable",
)
```

#### After 2.6.8 / 2.6.9

```python
# after derived feature regeneration
df_lineage.snapshot(
    df=df_clean,
    section="2.6.8",
    label="Post-derived feature regeneration",
    note="All derived features recomputed from cleaned base columns",
)

# after encoding preparation / final model-ready df_clean
df_lineage.snapshot(
    df=df_clean,
    section="2.6.9",
    label="Post-encoding prep",
    note="Ready for 3.x feature engineering & modeling",
)
```

You can add more snapshots wherever you want extra visibility (e.g. after 2.3.x numeric diagnostics, after 2.4.x categorical cleanup).

---

## 3️⃣ Visual: table + chart at the end of the notebook

Add a final cell near the end of `02_DQ.ipynb`:

```python
# 2.9.x 📊 DF Lineage Overview for Section 2

lineage_df = df_lineage.to_frame()

if lineage_df.empty:
    print("⚠️ No DF lineage snapshots recorded.")
else:
    print("📊 DF lineage across Section 2 (sorted by step):")
    cols_to_show = [
        "step_index",
        "section",
        "label",
        "n_rows",
        "n_cols",
        "rows_delta",
        "cols_delta",
        "changed_columns",
        "TotalCharges_present",
        "TotalCharges_nulls",
        "TotalCharges_non_nulls",
        "note",
    ]
    cols_to_show = [c for c in cols_to_show if c in lineage_df.columns]
    display(lineage_df[cols_to_show])

    # Save as CSV for dashboards / HTML UX
    final_lineage_path = df_lineage.save_csv()
    print(f"🧾 Saved DF lineage to: {final_lineage_path}")
```

This will give you a clean table like:

| step_index | section | label                     | n_rows | n_cols | TotalCharges_nulls | changed_columns      |
| ---------: | ------: | ------------------------- | ------ | ------ | ------------------ | -------------------- |
|          0 |   2.0.0 | Initial load from RAW     | 7,043  | 21     | 11                 |                      |
|          1 |   2.1.x | Post-type coercion        | 7,043  | 21     | 11                 | TotalCharges, tenure |
|          2 |   2.6.3 | Post-missing & outliers   | 7,043  | 21     | 0                  | TotalCharges         |
|          3 |   2.6.7 | Post-logic-driven repairs | 7,043  | 21     | 0                  |                      |

You can literally watch the `TotalCharges_nulls` drop from `11` → `0` at the step that fixes it.

---

## 4️⃣ Optional: quick chart to *see* where the fix happens

Right after the table, add:

```python
# Simple visual: where did TotalCharges_nulls change?

df_lineage.plot_metric(
    metric_col="TotalCharges_nulls",
    title="TotalCharges nulls across Section 2 pipeline",
)
```

That will show a line: each point is a section label, y-axis is `TotalCharges_nulls`. The cliff where it hits **0** is your fix step.

---

## 5️⃣ How this answers your original need

> “How can I get a visual of where the df first gets loaded and each transformation it goes through during the entire file?”

With this setup:

* **Origin**: first snapshot in 2.0.0 shows where `df` was created.
* **Each major transformation**: you drop a `df_lineage.snapshot(...)` after any step that conceptually changes `df` or `df_clean`.
* **Visual**:

  * A **timeline table** (shape, null counts, changed columns, notes).
  * A **chart** for any metric (e.g. `TotalCharges_nulls`) so you can see exactly where the TotalCharges mess gets cleaned up.

If you want, next step I can also show:

* a decorator style (`@df_lineage.track_step("2.6.7", "Logic repairs")`) for when you promote these cells into functions in your Level_4+ pipeline.
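
A rough sketch of what that decorator style could look like, written as a free function rather than a tracker method so it stands alone. `track_step`, `_StubTracker`, and `repair_total_charges` are all hypothetical names; the only thing assumed about the real tracker is the `.snapshot(df=..., section=..., label=..., note=...)` call shape used earlier.

```python
import functools
from typing import Callable, List, Tuple

import pandas as pd


class _StubTracker:
    """Stand-in with the same .snapshot(...) call shape as DFLineageTracker."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, str, int]] = []

    def snapshot(self, df: pd.DataFrame, section: str, label: str, note: str = "") -> None:
        # Record just enough to show the call happened.
        self.calls.append((section, label, len(df)))


def track_step(tracker, section: str, label: str, note: str = "") -> Callable:
    """Decorator factory: snapshot the DataFrame that a step function returns."""

    def decorator(func: Callable[..., pd.DataFrame]) -> Callable[..., pd.DataFrame]:
        @functools.wraps(func)
        def wrapper(*args, **kwargs) -> pd.DataFrame:
            result = func(*args, **kwargs)
            tracker.snapshot(df=result, section=section, label=label, note=note)
            return result

        return wrapper

    return decorator


tracker = _StubTracker()


@track_step(tracker, "2.6.7", "Logic repairs")
def repair_total_charges(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative rule only: fill missing TotalCharges with 0.
    return df.assign(TotalCharges=df["TotalCharges"].fillna(0.0))


fixed = repair_total_charges(pd.DataFrame({"TotalCharges": [10.0, None]}))
print(tracker.calls)
```

This only works for steps that return a new frame; cells that mutate `df` in place would still need an explicit `snapshot(...)` call.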

</details>


