```
OUTLINE:

PROBLEM #1:
```

```
# PROBLEM 1:

# Import what we need
from data.loader import DataLoader
from data.preprocessor import clean_telco_data
from analysis.statistics import perform_statistical_analysis

# Load configuration
import yaml
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded!")
print(f"Project: {config['project']['name']}")
print(f"Data path: {config['data']['filepath']}")

output:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[9], line 8
      6 # Import what we need
      7 from data.loader import DataLoader
----> 8 from data.preprocessor import clean_telco_data
      9 from analysis.statistics import perform_statistical_analysis
     11 # Load configuration

File ~/DATA/PROJECTS/Telco/Level_3/src/data/preprocessor.py:128
    125     logger.info(f"Train churn rate: {y_train.mean():.2%}, Test churn rate: {y_test.mean():.2%}")
    127     return X_train, X_test, y_train, y_test
--> 128 src/telco_analysis/analysis/statistics.py
    129 """
    130 Statistical Analysis Module - What Separates Data Analysts from Data Scientists
    131 
   (...)    140 - "I report effect sizes alongside significance"
    141 """
    143 import pandas as pd

NameError: name 'src' is not defined
```

# #1 SOLUTION:
I see what's happening here. That **`NameError: name 'src' is not defined`** isn't actually about Python not knowing `src`; it's about an *extra line of text* that accidentally got left inside your `preprocessor.py` file.

Take a look at the error trace:

```python
File ~/DATA/PROJECTS/Telco/Level_3/src/data/preprocessor.py:128
...
--> 128 src/telco_analysis/analysis/statistics.py
```

That line (`src/telco_analysis/analysis/statistics.py`) looks like a **stray reference** (maybe from a copy-paste, docstring, or bad merge). Python is trying to *execute it* as code, which triggers the `NameError`.
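Why this surfaces as a `NameError` rather than a `SyntaxError`: a bare path happens to be valid Python, because the slashes parse as division and `.py` as an attribute access. A minimal standard-library sketch, using the stray line copied from the traceback (the variable names are illustrative only):

```python
import ast

# The stray line from the traceback, treated as Python source.
stray_line = "src/telco_analysis/analysis/statistics.py"

# It parses without a SyntaxError: the slashes are division operators.
tree = ast.parse(stray_line)
expr = tree.body[0].value
print(type(expr).__name__, type(expr.op).__name__)

# Executing it fails as soon as Python looks up the first name, `src`.
error_message = None
try:
    exec(stray_line, {})
except NameError as exc:
    error_message = str(exc)
print(error_message)
```

Because the line is syntactically valid, a syntax check alone will not flag it; the failure only appears when the module is imported and the line actually runs.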

---

## ✅ How to Fix

1. Open `src/data/preprocessor.py`.

2. Scroll to around **line 128**.

3. Remove or comment out the stray line:

   ```python
   # src/telco_analysis/analysis/statistics.py
   ```

   It should not be there.

4. Save the file and rerun your script.

---

## 🔍 Why it happened

It looks like you meant to **document** or **reference** another module (`analysis/statistics.py`), but instead of a comment or import, you pasted the raw path.

* If your goal was to *import it*, you should have:

  ```python
  from analysis.statistics import perform_statistical_analysis
  ```
* If your goal was just to *leave a note*, you should use a comment:

  ```python
  # src/telco_analysis/analysis/statistics.py
  ```

---

⚡ Once you delete that stray line, the import error should go away.

Do you want me to also help you **restructure your imports** so that modules under `src/` are always found (using `__init__.py` and `sys.path` setup), instead of hitting these issues repeatedly?


---

>### PROBLEM 1: "ModuleNotFoundError: No module named 'src'"
#### input:
```py
# notebooks/analysis.ipynb

from src.data.loader import DataLoader
from src.config import load_config

config = load_config('config/config.yaml')
loader = DataLoader(config)
df = loader.load_telco_data('../resources/data/raw/Telco-Customer-Churn.csv')
```

#### output:
```py
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[5], line 2
      1 # notebooks/analysis.ipynb
----> 2 from src.data.loader import DataLoader
      3 from src.config import load_config
      5 config = load_config('config/config.yaml')

ModuleNotFoundError: No module named 'src'
```

>## FIX:
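You're hitting that error because of how Python resolves imports. A notebook typically runs with its own folder (`notebooks/`) as the working directory, so Python searches that folder and the other entries on `sys.path` for a top-level package named `src` when it reaches:

```python
from src.data.loader import DataLoader
```

But your actual file is here:

```
/Users/b/DATA/PROJECTS/Telco/Level_3/src/data/loader.py
```

The `src` folder lives under `Level_3/`, which is not on `sys.path`, so Python can't find it.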

---

### Fix Options

#### 1. Import directly from `src`

If your project structure looks like this:

```
Level_3/
├── notebooks/
│   └── analysis.ipynb
├── src/
│   └── data/
│       └── loader.py
```

Then in `analysis.ipynb` you should import like:

```python
from src.data.loader import DataLoader
```

But for that to work, your notebook needs to know about the `Level_3` folder as the project root.

---

#### 2. Add project root to `sys.path`

At the top of your notebook:

```python
import sys, os
sys.path.append(os.path.abspath(".."))  # from notebooks/ to project root

from src.data.loader import DataLoader
```

If your notebook is at `Level_3/notebooks/analysis.ipynb`, then `..` takes you back to `Level_3`, so `src/` is visible.
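A hard-coded `".."` only works while the notebook sits exactly one level below the project root. A small sketch of a more tolerant variant, which walks upward until it finds a folder containing `src/`. The helper names `find_project_root` and `add_project_root_to_sys_path` are hypothetical, not part of the project:

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found."""
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}/' directory found at or above {start}")


def add_project_root_to_sys_path(start: Path) -> Path:
    """Put the project root on sys.path once, so `from src...` imports resolve."""
    root = find_project_root(start)
    if str(root) not in sys.path:
        sys.path.insert(0, str(root))
    return root


# In a notebook cell you would call:
#   add_project_root_to_sys_path(Path.cwd())
```

The membership check keeps `sys.path` from growing each time the cell is re-run.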

---

#### 3. Use a `.env` or Jupyter kernel setup

A cleaner approach for ongoing work is to configure `PYTHONPATH` so `src` is always discoverable:

```bash
export PYTHONPATH=/Users/b/DATA/PROJECTS/Telco/Level_3
```

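For editor autocompletion in VS Code, you can also add the following to `.vscode/settings.json`. This setting only affects static analysis in the editor; it does not change what Python can import at runtime: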

```json
{
  "python.analysis.extraPaths": ["./src"]
}
```

---

✅ Recommended: Start with option 2 (quick fix), then move to option 3 for a permanent setup.

Do you want me to show you how to make `src` always discoverable automatically in Jupyter (so you don't need to hack `sys.path` in every notebook)?


># PROBLEM #2:
> TypeError: 'NoneType' object is not subscriptable.
```py
# Practical Example
## Using Our Modules in Practice

# Import what we need
from data.loader import DataLoader
from data.preprocessor import clean_telco_data
from analysis.statistics import perform_statistical_analysis

# Load configuration
import yaml
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded!")
print(f"Project: {config['project']['name']}")
print(f"Data path: {config['data']['filepath']}")

output:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[12], line 17
     14     config = yaml.safe_load(f)
     16 print("Configuration loaded!")
---> 17 print(f"Project: {config['project']['name']}")
     18 print(f"Data path: {config['data']['filepath']}")

TypeError: 'NoneType' object is not subscriptable
```

>## PROBLEM #2 FIX:
`yaml.safe_load` returned **None**, which means your `../config/config.yaml` parsed as an **empty YAML document** (e.g., the file is empty, contains only comments, or has just `---` with nothing else). That's why `config['project']['name']` blows up: you're subscripting `None`.
## Quick checks
1. Open the file and confirm it has real keys/values:

   ```yaml
   project:
     name: Telco Churn
   data:
     filepath: data/WA_Fn-UseC_-Telco-Customer-Churn.csv
   ```
2. Sanity-print what you loaded:
   ```python
   import yaml, pprint
   with open('../config/config.yaml', 'r') as f:
       config = yaml.safe_load(f)
   print("Raw config:", config)
   ```
## Make the loader robust
Add a small guard so you get a helpful error instead of a cryptic traceback:
```python
import yaml
from pathlib import Path

cfg_path = Path('../config/config.yaml')
if not cfg_path.exists():
    raise FileNotFoundError(f"Config file not found: {cfg_path.resolve()}")

with cfg_path.open('r') as f:
    config = yaml.safe_load(f)

if not isinstance(config, dict) or not config:
    raise ValueError(
        f"Config file {cfg_path} is empty or invalid YAML. "
        "Expected a mapping with keys like 'project' and 'data'."
    )

required = [('project', 'name'), ('data', 'filepath')]
for section, key in required:
    if section not in config or key not in config[section]:
        raise KeyError(f"Missing config key: {section}.{key}")
```
## Minimal working `config.yaml`
Copy this in to unblock:

```yaml
project:
  name: Telco Churn
data:
  filepath: data/WA_Fn-UseC_-Telco-Customer-Churn.csv
```

## Why this happens

* **Empty file / only comments** → `safe_load` returns `None`.
* **Only `---`** (YAML doc start) with no content → `None`.
* **Multiple docs** (separated by `---`): `safe_load` expects a single document and raises a `ComposerError` when the stream contains more than one, so it will not quietly pick the document you want. If you truly have multiple documents, use `safe_load_all`:

  ```python
  with open('../config/config.yaml') as f:
      docs = list(yaml.safe_load_all(f))
  print(docs)  # pick the doc you need
  ```
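The empty-document cases above can be checked directly on in-memory strings, without touching the real config file. This is a minimal sketch assuming PyYAML is installed; the sample strings are made up for illustration:

```python
import yaml

# Hypothetical file contents, one per failure mode described above.
samples = {
    "empty file": "",
    "only comments": "# nothing configured yet\n",
    "only doc start": "---\n",
    "valid mapping": "project:\n  name: Telco Churn\n",
}

# safe_load accepts a string as well as an open file object.
results = {label: yaml.safe_load(text) for label, text in samples.items()}

for label, value in results.items():
    print(f"{label!r:20} -> {value!r}")
```

Only the last sample produces a dictionary; the first three load as `None`, which is the value that triggers the `TypeError` when it is subscripted.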

Once the file has valid YAML with those keys, your prints will work:

```text
Configuration loaded!
Project: Telco Churn
Data path: data/WA_Fn-UseC_-Telco-Customer-Churn.csv
```
># PROBLEM # 3: 
```py
# Cell 1: Statistical Analysis Deep Dive
"""
# Statistical Analysis with Business Context

This notebook demonstrates how to apply statistical thinking to business problems.
"""

# Setup
import sys
from pathlib import Path
sys.path.insert(0, str(Path.cwd().parent / 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Our modules
from data.loader import DataLoader
from data.preprocessor import clean_telco_data
from src.analysis.statistics import (
    test_numerical_vs_churn,
    test_categorical_vs_churn,
    identify_risk_segments
)

output:
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[7], line 22
     20 from data.loader import DataLoader
     21 from data.preprocessor import clean_telco_data
---> 22 from src.analysis.statistics import (
     23     test_numerical_vs_churn,
     24     test_categorical_vs_churn,
     25     identify_risk_segments
     26 )

ModuleNotFoundError: No module named 'src'
```
># SOLUTION # 3: 
```
The error occurs because you're importing from src.analysis... while your sys.path already points inside src. So Python is looking for a top-level package named src (which it can't find).

Since you did:

sys.path.insert(0, str(Path.cwd().parent / 'src'))  # => adds .../Level_3/src to sys.path


you should import packages directly under src/ like this:

# correct imports when .../Level_3/src is on sys.path
from data.loader import DataLoader
from data.preprocessor import clean_telco_data
from analysis.statistics import (
    test_numerical_vs_churn,
    test_categorical_vs_churn,
    identify_risk_segments
)

Quick checklist

Ensure your layout is:

Level_3/
├─ src/
│  ├─ data/
│  │  ├─ __init__.py
│  │  ├─ loader.py
│  │  └─ preprocessor.py
│  └─ analysis/
│     ├─ __init__.py
│     └─ statistics.py
└─ notebooks/


Make sure __init__.py exists in data/ and analysis/ (helps avoid namespace weirdness).

Keep sys.path pointing to .../Level_3/src (as you did), and do not prefix imports with src.

Alternative (if you insist on from src.analysis ...)

Insert the project root instead, and treat src as the package:

sys.path.insert(0, str(Path.cwd().parent))  # adds .../Level_3
from src.analysis.statistics import ...


In that case, add src/__init__.py (or rely on namespace packages). Generally not recommended: using the project name as the top-level package (e.g., telco.analysis) is cleaner.
```
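The rule "the import name starts below whatever folder is on `sys.path`" can be demonstrated with a throwaway package. This is a sketch only: the names `demo_src`, `demo_analysis`, and `demo_stats` are invented stand-ins for `src`, `analysis`, and `statistics`, chosen so they cannot collide with real modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(name: str) -> bool:
    """Return True if `name` imports, False on ModuleNotFoundError."""
    try:
        importlib.import_module(name)
        return True
    except ModuleNotFoundError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: <tmp>/demo_src/demo_analysis/demo_stats.py
    src_dir = Path(tmp) / "demo_src"
    package_dir = src_dir / "demo_analysis"
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "demo_stats.py").write_text("ANSWER = 42\n")

    sys.path.insert(0, str(src_dir))  # same idea as adding .../Level_3/src
    importlib.invalidate_caches()
    try:
        without_prefix = try_import("demo_analysis.demo_stats")
        with_prefix = try_import("demo_src.demo_analysis.demo_stats")
    finally:
        sys.path.remove(str(src_dir))

print("import without prefix works:", without_prefix)
print("import with prefix works:", with_prefix)
```

With the `demo_src` folder itself on `sys.path`, only the unprefixed import succeeds, which mirrors the `src.analysis` versus `analysis` situation in the notebook.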

># PROBLEM # 4: 
```py

# Cell 2: Load and Prepare Data
"""
## Load and Prepare Data

Using our modular functions makes this clean and reproducible
"""

# Load configuration
import yaml
with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Load data
loader = DataLoader(config)
df_raw, load_report = loader.load_data(config['data']['filepath'])

print("Data Load Report:")
for key, value in load_report.items():
    if key != 'dtypes':  # Skip dtypes for brevity
        print(f"  {key}: {value}")

# Clean data
df_clean = clean_telco_data(df_raw)
print(f"\nCleaned data shape: {df_clean.shape}")

output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[10], line 15
     13 # Load data
     14 loader = DataLoader(config)
---> 15 df_raw, load_report = loader.load_data(config['data']['filepath'])
     17 print("Data Load Report:")
     18 for key, value in load_report.items():

KeyError: 'filepath'
```
># SOLUTION # 4: 
```
The traceback makes perfect sense:

Your config currently looks like this (from what you shared earlier):

data:
  raw_path: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"


But your code is trying to do:

config['data']['filepath']


👉 There is no filepath key, so you get a KeyError.

✅ Two Ways to Fix
Option 1: Update Python code to use the keys you actually have
# Load data
df_raw, load_report = loader.load_data(config['data']['raw_path'])


If later you want to load the processed version:

df_clean, load_report = loader.load_data(config['data']['processed_path'])

Option 2: Change your YAML to match your code

If you prefer your Python to stay with filepath, update your config:

data:
  filepath: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"

🔑 Best Practice

I'd recommend Option 1 (keep raw_path and processed_path in YAML); it's clearer as your project grows. Your loader can even accept either key:

raw_file = config['data'].get('raw_path') or config['data'].get('filepath')
df_raw, load_report = loader.load_data(raw_file)
```
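A bare `KeyError: 'filepath'` does not say which keys the config actually contains. One possible sketch of a lookup helper that reports them; `get_nested` is a hypothetical name, not an existing function in the project:

```python
def get_nested(config: dict, *keys: str):
    """Follow `keys` into a nested dict, naming the missing key and its siblings on failure."""
    node = config
    for depth, key in enumerate(keys):
        if not isinstance(node, dict) or key not in node:
            dotted = ".".join(keys[: depth + 1])
            available = sorted(node) if isinstance(node, dict) else []
            raise KeyError(f"Missing config key '{dotted}'. Keys available here: {available}")
        node = node[key]
    return node


# Example config mirroring the YAML shown above.
example_config = {
    "data": {
        "raw_path": "data/WA_Fn-UseC_-Telco-Customer-Churn.csv",
        "processed_path": "data/processed/telco_clean.csv",
    }
}

raw_path = get_nested(example_config, "data", "raw_path")
print(raw_path)
```

Asking for `get_nested(example_config, "data", "filepath")` would raise a `KeyError` whose message lists `processed_path` and `raw_path`, which makes the mismatch between code and YAML visible immediately.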

> # PROBLEM # 5: 
```
My raw data actually lives here because I am doing multiple projects with the same data: /Users/b/DATA/PROJECTS/Telco/resources/data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv
```

> # SOLUTION # 5: 
Since you're re-using the same raw dataset across multiple projects, it's best to make that path explicit and keep your project configs consistent.

✅ Update your config file

Instead of this:

```yaml
data:
  raw_path: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"
```

use the absolute path for the shared raw data, and keep the processed data relative to each project:
```yaml
data:
  raw_path: "/Users/b/DATA/PROJECTS/Telco/resources/data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"
```

That way:

* `raw_path` → always points to your central shared dataset
* `processed_path` → lives inside each project's working directory (`Level_3/data/processed/…`)

✅ Update your notebook code
```py
# Load data
df_raw, load_report = loader.load_data(config['data']['raw_path'])

print("Data Load Report:")
for key, value in load_report.items():
    if key != 'dtypes':  # Skip dtypes for brevity
        print(f"  {key}: {value}")
```
🔑 Best Practice (Flexible paths)

If you'd like portability (e.g., moving between machines), you can parameterize a base data directory in your config and build paths dynamically:

```yaml
paths:
  base_data: "/Users/b/DATA/PROJECTS/Telco/resources/data"

data:
  raw_filename: "raw/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "processed/telco_clean.csv"
```

Then in code:
```py
from pathlib import Path

base = Path(config['paths']['base_data'])
raw_file = base / config['data']['raw_filename']
processed_file = Path("data") / config['data']['processed_path']
```

This way, only base_data changes if you move your datasets.
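To avoid editing the YAML on every machine, the base directory could also come from an environment variable, with the config value as the fallback. A sketch under that assumption; the variable name `TELCO_DATA_DIR` and the function `resolve_data_paths` are hypothetical:

```python
import os
from pathlib import Path


def resolve_data_paths(config: dict, env_var: str = "TELCO_DATA_DIR"):
    """Build raw/processed paths; an environment variable, if set, overrides base_data."""
    base = Path(os.environ.get(env_var) or config["paths"]["base_data"])
    raw_file = base / config["data"]["raw_filename"]
    processed_file = Path("data") / config["data"]["processed_path"]
    return raw_file, processed_file


# Same structure as the YAML above, written as a dict for the example.
example_config = {
    "paths": {"base_data": "/Users/b/DATA/PROJECTS/Telco/resources/data"},
    "data": {
        "raw_filename": "raw/WA_Fn-UseC_-Telco-Customer-Churn.csv",
        "processed_path": "processed/telco_clean.csv",
    },
}

raw_file, processed_file = resolve_data_paths(example_config)
print(raw_file)
print(processed_file)
```

The shared raw file follows the base directory, while the processed file stays relative to the project, matching the split described above.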

># PROBLEM # 6: 
Should I add code to save the cleaned data?

># SOLUTION # 6:
```
Yes ✅ Saving the cleaned dataset is a really good idea. It gives you a reproducible checkpoint so you don't have to re-run cleaning every time, and downstream steps (EDA, modeling) can just read the cleaned file.

🔑 Why save cleaned data

Reproducibility → same input for everyone using the project.

Speed → skip cleaning on every run.

Versioning → keep track of raw vs processed.

Flexibility → you can reload the cleaned version when experimenting.

✅ How to do it
1. Update your config.yaml

You already have:

data:
  raw_path: "/Users/b/DATA/PROJECTS/Telco/resources/data/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"


Make sure the processed_path is there.

2. Add saving code to your notebook
from pathlib import Path

# Ensure the directory exists
processed_path = Path(config['data']['processed_path'])
processed_path.parent.mkdir(parents=True, exist_ok=True)

# Save as CSV
df_clean.to_csv(processed_path, index=False)
print(f"Cleaned data saved to {processed_path}")

3. Optional improvements

Save as Parquet (faster I/O, preserves types; requires the pyarrow or fastparquet package):

df_clean.to_parquet(processed_path.with_suffix('.parquet'), index=False)


Add a timestamped version if you want to keep history:

import datetime
ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
versioned_file = processed_path.with_name(f"telco_clean_{ts}.csv")
df_clean.to_csv(versioned_file, index=False)


Wrap this into a helper function in your utils/io_ops.py (e.g., save_dataset(df, path)).
```
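The `save_dataset(df, path)` helper mentioned at the end could look roughly like this. It is a sketch, not the project's actual `utils/io_ops.py`; the behaviour of choosing the format from the file suffix is an assumption made for the example:

```python
from pathlib import Path

import pandas as pd


def save_dataset(df: pd.DataFrame, path, index: bool = False) -> Path:
    """Create the parent folder if needed, then write CSV or Parquet based on the suffix."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    if path.suffix == ".parquet":
        # Needs pyarrow or fastparquet installed.
        df.to_parquet(path, index=index)
    else:
        df.to_csv(path, index=index)
    return path


# Usage in the notebook would be along the lines of:
#   save_dataset(df_clean, config['data']['processed_path'])
```

Returning the path makes it easy to log where the file went.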

># PROBLEM # 7: 
>Why does this not clean the 11 entries with blanks in the TotalCharges?
```
"""
## Load and Prepare Data

Using our modular functions makes this clean and reproducible
"""

# Load configuration
import yaml
from pathlib import Path

with open('../config/config.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Load data
df_raw, load_report = loader.load_data(config['data']['raw_path'])

print("Data Load Report:")
for key, value in load_report.items():
    if key != 'dtypes':  # Skip dtypes for brevity
        print(f"  {key}: {value}")

# Clean data
df_clean = clean_telco_data(df_raw)
print(f"\nCleaned data shape: {df_clean.shape}")

# Ensure the directory exists
processed_path = Path(config['data']['processed_path'])
processed_path.parent.mkdir(parents=True, exist_ok=True)

# Save as CSV
df_clean.to_csv(processed_path, index=False)
print(f"Cleaned data saved to {processed_path}")

# Optional improvements

# 1. Save as Parquet (faster I/O, preserves types):
# df_clean.to_parquet(processed_path.with_suffix('.parquet'), index=False)

# 2. Add a timestamped version if you want to keep history:

# import datetime
# ts = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
# versioned_file = processed_path.with_name(f"telco_clean_{ts}.csv")
# df_clean.to_csv(versioned_file, index=False)
```
># SOLUTION # 7A:
```
Short answer: those 11 "blanks" in TotalCharges are actually space-only strings (" ") in the CSV. If your loader/cleaner doesn't explicitly convert space-only strings → NaN → numeric, they'll sneak through as strings and never get fixed.
```
Here's a robust way to handle it:
```py
1) Make sure you load TotalCharges as text first (so you can sanitize)

If your DataLoader sets dtypes, don't force TotalCharges to numeric on read. Either let pandas infer, or explicitly read it as str and clean later.

# inside DataLoader.load_data(...)
df = pd.read_csv(config['data']['raw_path'], dtype={"TotalCharges": "string"})

2) In clean_telco_data, normalize whitespace ‚Üí NaN ‚Üí numeric
import numpy as np
import pandas as pd

def clean_telco_data(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()

    # 1) Trim whitespace in object/string cols
    str_cols = out.select_dtypes(include=["object", "string"]).columns
    out[str_cols] = out[str_cols].apply(lambda s: s.str.strip())

    # 2) Convert space-only TotalCharges to NaN, then numeric
    out["TotalCharges"] = (
        out["TotalCharges"]
        .replace(r"^\s*$", np.nan, regex=True)     # empty/space-only → NaN
        .replace("", np.nan)                       # double-safety
    )
    out["TotalCharges"] = pd.to_numeric(out["TotalCharges"], errors="coerce")

    # 3) Business rule: when tenure == 0, set TotalCharges = 0 (the classic 11 rows)
    mask_tenure0 = out["tenure"].fillna(0).astype(float).eq(0)
    out.loc[mask_tenure0, "TotalCharges"] = out.loc[mask_tenure0, "TotalCharges"].fillna(0.0)

    # 4) Sanity check: after fix, there should be no NaN left in TotalCharges
    #    If any remain, surface them loudly (these would be true data issues).
    remaining = out["TotalCharges"].isna().sum()
    if remaining:
        raise ValueError(f"TotalCharges still has {remaining} NaN after cleaning (non-tenure==0 rows).")

    # (Optional) enforce types you expect downstream
    out["TotalCharges"] = out["TotalCharges"].astype(float)
    out["MonthlyCharges"] = pd.to_numeric(out["MonthlyCharges"], errors="coerce")

    return out

3) Quick diagnostic you can run before/after cleaning
# BEFORE
n_blank = df_raw["TotalCharges"].astype(str).str.fullmatch(r"\s*").sum()
print("Space-only TotalCharges rows (raw):", n_blank)

# AFTER
print("Any NaN in TotalCharges after cleaning?", df_clean["TotalCharges"].isna().any())
print("dtype:", df_clean["TotalCharges"].dtype)

Common gotchas that cause your symptom

Cleaning uses fillna(0) without first converting " " to NaN → blanks remain strings.

Converted to numeric with errors="ignore" (or not at all) → strings persist.

Reading the CSV with skipinitialspace=True and assuming it fixes everything: check the loaded column rather than relying on it to handle entirely space-only fields.

If you paste this clean_telco_data in place, those 11 rows (tenure==0) will have TotalCharges=0.0, TotalCharges will be float, and no blanks remain.
```
># SOLUTION # 7B: 
Short answer: your cleaning function isn't catching the **11 rows where `TotalCharges` is a blank string on rows with `tenure == 0`**. Common causes:

1. `TotalCharges` is read as **object/string** and the blanks are `" "` (spaces), not true NaN.
2. You convert to numeric **without stripping**, so `" "` stays non-NaN (or raises).
3. You fill NaNs but **don't gate by `tenure == 0`**.

Here‚Äôs a robust, drop-in fix you can add inside `clean_telco_data` (or run once to verify):
```python
import numpy as np
import pandas as pd

def fix_total_charges_blank_for_zero_tenure(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Ensure tenure is numeric
    df['tenure'] = pd.to_numeric(df['tenure'], errors='coerce')

    # Normalize TotalCharges: strip spaces, set empty -> NaN, then to numeric
    tc = (df['TotalCharges']
            .astype('string')          # safe string dtype
            .str.strip()               # remove whitespace
            .replace('', np.nan))      # blank -> NaN
    df['TotalCharges'] = pd.to_numeric(tc, errors='coerce')

    # Only set to 0 when tenure == 0 and TotalCharges is NaN (the 11 special rows)
    mask = (df['tenure'].fillna(0) == 0) & (df['TotalCharges'].isna())
    df.loc[mask, 'TotalCharges'] = 0.0

    # Optional: report to confirm we hit 11 rows
    print(f"Filled TotalCharges=0 for {mask.sum()} rows where tenure==0 and TotalCharges blank.")

    return df
```

If you prefer to do it at **load** time, make your loader more NA-aware:

```python
pd.read_csv(
    path,
    dtype={'TotalCharges': 'string'},      # keep as string first
    na_values=['', ' ', 'NA', 'NaN'],      # treat blanks as NaN
    keep_default_na=True,
    skipinitialspace=True
)
```

Then call the fix:

```python
df_raw = loader.load_data(config['data']['raw_path'])[0]
df_clean = clean_telco_data(df_raw)
df_clean = fix_total_charges_blank_for_zero_tenure(df_clean)
```

Finally, assert the dataset is in the expected state:

```python
# Expect at least the 11 corrected rows (>= tolerates genuine zero-charge rows)
assert ((df_clean['tenure'] == 0) & (df_clean['TotalCharges'] == 0)).sum() >= 11

# Guard against any remaining NaNs
assert df_clean['TotalCharges'].notna().all()
```

If this still doesn't print "11", check for **column name mismatches** (e.g., `total_charges` vs `TotalCharges`) or upstream transformations that renamed/typed the columns.
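The whole chain (space-only string → NaN → 0 for `tenure == 0`) can be tried on a three-row toy frame before running it on the real data. The values below are invented for illustration and only the two column names come from the dataset:

```python
import pandas as pd

# Toy stand-in for the raw data: two blank TotalCharges on tenure == 0 rows.
toy = pd.DataFrame({
    "tenure": [0, 12, 0],
    "TotalCharges": [" ", "350.5", " "],
})

# Before cleaning: count entries that are empty or whitespace only.
blank_before = int(toy["TotalCharges"].str.fullmatch(r"\s*").sum())

cleaned = toy.copy()
# Strip whitespace, then convert; the blanks become NaN.
cleaned["TotalCharges"] = pd.to_numeric(cleaned["TotalCharges"].str.strip(), errors="coerce")

# Apply the business rule only where tenure is 0 and the value is missing.
mask = cleaned["tenure"].eq(0) & cleaned["TotalCharges"].isna()
cleaned.loc[mask, "TotalCharges"] = 0.0

print("blank before:", blank_before)
print(cleaned)
```

If the same steps on the real frame fill zero rows, the blanks were probably removed or altered somewhere upstream of this code.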

># WHERE DO I PUT IT?

Good clarifier 👍

You **have two good options** for where this fix can live:

---

## 🔹 Option 1: Inside `loader.py`

If you want every load of the Telco dataset to *always* normalize `TotalCharges`, you can fold it into your `DataLoader`.
Example (`loader.py`):

```python
import pandas as pd
import numpy as np
from pathlib import Path

class DataLoader:
    def __init__(self, config: dict):
        self.config = config

    def load_data(self, path: str):
        p = Path(path)
        if not p.exists():
            raise FileNotFoundError(p)

        df = pd.read_csv(
            p,
            dtype={'TotalCharges': 'string'},
            na_values=['', ' ', 'NA', 'NaN'],
            keep_default_na=True,
            skipinitialspace=True
        )

        # Fix TotalCharges blanks for tenure==0
        df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
        mask = (df['tenure'] == 0) & (df['TotalCharges'].isna())
        df.loc[mask, 'TotalCharges'] = 0.0

        report = {
            "rows": len(df),
            "cols": df.shape[1],
            "nulls": df.isna().sum().to_dict(),
            "dtypes": df.dtypes.to_dict()
        }
        return df, report
```

Benefit: every time you use the loader, the fix happens automatically.
Downside: it mixes **raw ingestion** with **domain-specific cleaning**.

---

## 🔹 Option 2: Inside `preprocessor.py` (clean\_telco\_data)

If you want your loader to stay "pure" (just bring the raw CSV into memory), keep the fix in `clean_telco_data`.
Example (`preprocessor.py`):

```python
import pandas as pd
import numpy as np

def clean_telco_data(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Normalize TotalCharges
    tc = (df['TotalCharges']
            .astype('string')
            .str.strip()
            .replace('', np.nan))
    df['TotalCharges'] = pd.to_numeric(tc, errors='coerce')

    # Fix blanks where tenure == 0
    mask = (df['tenure'] == 0) & (df['TotalCharges'].isna())
    df.loc[mask, 'TotalCharges'] = 0.0

    # (other cleaning steps here...)
    return df
```

>**Benefit:** clean separation: loader does **I/O**, preprocessor does **data fixes**.
>**Downside:** you must remember to always run the preprocessor after loading.

># ✅ Recommendation for your project

Since this `TotalCharges` issue is **domain-specific to Telco**, it belongs in `preprocessor.py` (`clean_telco_data`), not `loader.py`.

That way:

* `loader.py` stays reusable across other datasets.
* `preprocessor.py` documents your Telco-specific cleaning rules in one place.

---
># PROBLEM # 9: AssertionError
```py
# Verify data cleaning of TotalCharges
# Count tenure==0 rows
n_tenure0 = (df_clean['tenure'] == 0).sum()
print("Rows with tenure==0:", n_tenure0)

# How many had TotalCharges fixed to 0?
n_fixed = ((df_clean['tenure'] == 0) & (df_clean['TotalCharges'] == 0)).sum()
print("Rows with tenure==0 AND TotalCharges==0:", n_fixed)

# Sanity: any NaNs left in TotalCharges?
print("Remaining NaNs in TotalCharges:", df_clean['TotalCharges'].isna().sum())

# Optional hard assertions (will raise if unexpected)
assert n_fixed == 11, f"Expected 11 fixed rows, found {n_fixed}"
assert df_clean['TotalCharges'].notna().all(), "TotalCharges still has NaNs!"

output:
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[20], line 14
     11 print("Remaining NaNs in TotalCharges:", df_clean['TotalCharges'].isna().sum())
     13 # Optional hard assertions (will raise if unexpected)
---> 14 assert n_fixed == 11, f"Expected 11 fixed rows, found {n_fixed}"
     15 assert df_clean['TotalCharges'].notna().all(), "TotalCharges still has NaNs!"

AssertionError: Expected 11 fixed rows, found 0
```

># SOLUTION # 9:

># PROBLEM # 10:
```py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Our modules
from utils.loader import DataLoader
from utils.preprocessor import clean_telco_data
from utils.statistics import (
    test_numerical_vs_churn,
    test_categorical_vs_churn,
    identify_risk_segments
)
```
># SOLUTION # 10:


Great question. Let's unpack exactly where Python is looking when you do:

```python
with open('../config/config.yaml', 'r') as f:
```

---

## 📍 How Relative Paths Work in Python

* Relative paths like `../config/config.yaml` are resolved **relative to the process's current working directory (CWD)**, *not* relative to the Python file's location.
* The CWD is usually the directory from which you launched Python or Jupyter.

You can always check what that is:

```python
import os
print(os.getcwd())
```

If `os.getcwd()` prints, say:

```
/Users/b/DATA/PROJECTS/Telco/Level_3/src
```

then Python looks for:

```
/Users/b/DATA/PROJECTS/Telco/Level_3/config/config.yaml
```

(because `../` goes up one level from `src/` to `Level_3/`).

---

## ✅ Common Scenarios

1. **Running from inside `src/`**

   * CWD = `.../Level_3/src`
   * `../config/config.yaml` → `.../Level_3/config/config.yaml`

2. **Running from project root (`Level_3/`)**

   * CWD = `.../Level_3/`
   * `../config/config.yaml` → `.../Telco/config/config.yaml` (probably wrong, one level too far up).

3. **Running in Jupyter Notebook**

   * CWD is often the folder where the notebook lives (but can be the project root depending on how Jupyter was launched).
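The path arithmetic in these scenarios can be checked without touching the filesystem. This sketch uses `posixpath` so the result is the same on any operating system; note that `normpath` is purely lexical and ignores symlinks. The function name `resolve_relative` is made up for the example:

```python
import posixpath


def resolve_relative(cwd: str, relative: str) -> str:
    """Join a relative path onto a working directory and collapse any '..' segments."""
    return posixpath.normpath(posixpath.join(cwd, relative))


level_3 = "/Users/b/DATA/PROJECTS/Telco/Level_3"
relative = "../config/config.yaml"

from_src = resolve_relative(level_3 + "/src", relative)
from_root = resolve_relative(level_3, relative)
from_notebooks = resolve_relative(level_3 + "/notebooks", relative)

print("from src/:      ", from_src)
print("from Level_3/:  ", from_root)
print("from notebooks/:", from_notebooks)
```

Running from `src/` or `notebooks/` lands on `Level_3/config/config.yaml`, while running from `Level_3/` itself points one level too high, at `Telco/config/config.yaml`.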

---

## 🛠️ How to Make it Robust

If you want the path **relative to the script's file**, not the CWD:

```python
from pathlib import Path
import yaml

# Path of the current script
BASE_DIR = Path(__file__).resolve().parent.parent  # go up from src/ to project root
cfg_path = BASE_DIR / "config" / "config.yaml"

with cfg_path.open('r') as f:
    config = yaml.safe_load(f)
```

This way it always resolves relative to your project's folder structure, no matter where you run Python from.

---
```
Q: would it be better to have my config folder and file at Level_3/config/config.yaml or Level_3/src/config/config.yaml
```

```
Short answer: put it at the project root (Level_3/config/config.yaml), not under src/.

Why root is better

Separation of concerns: code lives in src/, configuration lives in config/. (12-factor friendly)

Packaging: src/ is often your importable package; keeping config out means you don't accidentally ship environment-specific files.

Reuse: notebooks, scripts, and tests outside src/ can all read config/ without fiddly paths.

Overrides: easy to keep config/local.yaml in .gitignore while committing config/base.yaml.

When src/config/ might make sense

You're distributing a library that needs default, versioned package_data bundled (and you'll access it via importlib.resources). For app projects, this is rare.
```

Recommended layout
```
Level_3/
├─ src/
│  └─ your_package/...
├─ config/
│  ├─ base.yaml
│  ├─ dev.yaml
│  ├─ prod.yaml
│  └─ local.yaml   # .gitignore
├─ data/
├─ notebooks/
├─ tests/
└─ run_analysis.py
```
Robust loading (works from anywhere)
```py
from pathlib import Path
import yaml

# project root = directory that contains the "config" folder
ROOT = Path(__file__).resolve().parents[1]  # adjust if file is elsewhere
CFG = ROOT / "config" / "base.yaml"

with CFG.open() as f:
    config = yaml.safe_load(f)
```
Optional niceties

Env var override:
```py
import os
cfg_path = Path(os.getenv("APP_CONFIG", ROOT / "config" / "base.yaml"))
```
Layered configs (base + env-specific + local):
```py
def load_yaml(p):
    # An empty YAML file loads as None, so fall back to an empty dict
    with open(p) as f:
        return yaml.safe_load(f) or {}

cfg = load_yaml(ROOT / "config/base.yaml")
cfg |= load_yaml(ROOT / "config/dev.yaml")    # dict |= needs Python 3.9+
cfg |= load_yaml(ROOT / "config/local.yaml")
```
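
One caveat with the layered approach: `|=` is a shallow merge, so an override file that sets only `data.raw_path` replaces the entire `data` section from the base file. Below is a minimal recursive merge, assuming the configs are plain nested dicts. The `deep_merge` name and the sample paths are placeholders for illustration.

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where nested dicts in `override` are merged into `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides have a mapping here, so merge them key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys simply take the override's value
            merged[key] = value
    return merged


base = {"data": {"raw_path": "data/raw.csv", "processed_path": "data/clean.csv"}}
local = {"data": {"raw_path": "data/sample.csv"}}

print(deep_merge(base, local))
```

With this merge, `processed_path` from the base survives while `raw_path` takes the local value; with `|=` the `processed_path` key would be lost.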


If you want, I can drop in a tiny paths.py helper for centralizing ROOT, CONFIG_DIR, DATA_DIR, etc., so every module uses the same paths.
```
You said:
what about my utils folder?
ChatGPT said:

Great question. Put your utils inside your package under src/ so they're importable as part of your code, not at the project root.
```
✅ Recommended placement
```
Level_3/
├─ src/
│  └─ telco/                # your package name
│     ├─ __init__.py
│     ├─ utils/             # utils lives here
│     │  ├─ __init__.py
│     │  ├─ io_ops.py       # file I/O, paths
│     │  ├─ validate.py     # schema checks, assertions
│     │  ├─ transform.py    # cleaning & feature funcs
│     │  ├─ viz.py          # plotting helpers
│     │  └─ model.py        # baseline model helpers (optional)
│     ├─ data/              # domain modules, not raw data files
│     └─ analysis/
├─ config/
│  └─ base.yaml
├─ notebooks/
├─ tests/
└─ run_analysis.py
```

Why:

Importable: from telco.utils.validate import run_checks works everywhere.

Separation: keeps app code in src/ and non-code (configs, data, notebooks) outside.

Refactor-friendly: easier to split utils by domain (I/O, validation, viz) instead of one junk drawer.

🔧 Import patterns (clean & stable)

In your scripts/notebooks:
```py
import sys
from pathlib import Path

# ensure src is on sys.path for scripts run from the project root
# (__file__ is undefined in notebooks; there, use Path.cwd() as the base instead)
sys.path.append(str(Path(__file__).resolve().parents[0] / "src"))
from telco.utils.io_ops import load_dataset
from telco.utils.validate import run_checks
```
Inside package modules (prefer absolute imports):
```py
from telco.utils.io_ops import load_dataset
```

Avoid `from utils import *` and deep relative imports like `from ..utils import x`.

🧩 Make utils discoverable (optional)

Expose a friendly top-level API in `telco/utils/__init__.py`:
```py
from .io_ops import load_dataset, save_table
from .validate import run_checks
from .transform import clean_and_feature_engineer

__all__ = ["load_dataset", "save_table", "run_checks", "clean_and_feature_engineer"]
```
Now you can do:
```py
from telco import utils
df = utils.load_dataset(path)
```
🗂️ What goes in utils vs domain modules?

>utils: generic, reusable helpers (I/O wrappers, logging setup, plotting wrappers, schema checks)

>domain modules (telco/data, telco/analysis): business-specific logic (Telco cleaning rules, churn-specific features)

>If a function mentions Telco-specific columns or business rules, it belongs in a domain module, not utils.

🧪 Testing layout
```
tests/
├─ test_utils_io_ops.py
├─ test_utils_validate.py
└─ test_transform_telco.py
```
📝 Small starter for io_ops.py
```py
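from __future__ import annotations  # lets `str | Path` annotations work before Python 3.10

from pathlib import Path
import pandas as pd

def load_dataset(path: str | Path) -> pd.DataFrame: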
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(p)
    if p.suffix == ".csv":
        return pd.read_csv(p)
    if p.suffix == ".parquet":
        return pd.read_parquet(p)
    raise ValueError(f"Unsupported file type: {p.suffix}")

def ensure_dir(path: str | Path) -> Path:
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p
```
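
To connect the starter with the testing layout, here is a sketch of what `tests/test_utils_io_ops.py` might contain. It redefines `ensure_dir` inline only to stay self-contained; in the project you would import it from `telco.utils.io_ops`. It assumes pytest, whose built-in `tmp_path` fixture supplies a throwaway directory.

```python
from pathlib import Path


def ensure_dir(path) -> Path:
    """Same behavior as the starter: create the directory if needed, return it."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    return p


def test_ensure_dir_creates_nested_directories(tmp_path):
    target = tmp_path / "artifacts" / "figures"
    result = ensure_dir(target)
    assert result == target
    assert target.is_dir()


def test_ensure_dir_is_idempotent(tmp_path):
    target = tmp_path / "artifacts"
    ensure_dir(target)
    ensure_dir(target)  # second call must not raise
    assert target.is_dir()
```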
🧭 One place for common paths (nice-to-have)

Create telco/paths.py:
```py
from pathlib import Path

ROOT = Path(__file__).resolve().parents[2]   # up to Level_3/
CONFIG = ROOT / "config"
DATA = ROOT / "data"
ARTIFACTS = ROOT / "artifacts"
```
Then use:
```py
from telco.paths import CONFIG
```
>Bottom line: `src/<your_package>/utils/` is the sweet spot. Keep it modular (io/validate/transform/viz), keep domain rules outside utils, and expose a clean import surface.


># PROBLEM # 3:
```
Should I change my working directory? 

this is the current: /Users/b/DATA/PROJECTS/Telco/Level_3/notebooks

ChatGPT said: That's a smart thing to consider 👌
```
```
Right now your working directory (CWD) is:

/Users/b/DATA/PROJECTS/Telco/Level_3/notebooks


That means any relative path (like ../config/config.yaml) is being resolved from inside notebooks/, not from your project root. That's why you've been juggling ../ and hitting issues.
```
```
🎯 Best Practices for CWD in a Project

Project root as CWD

If you treat Level_3/ as your project root, running everything from there makes paths predictable:

Level_3/
├─ src/
├─ config/
├─ data/
├─ notebooks/
└─ run_analysis.py


Then your code can load config simply as config/config.yaml without ../.

Don't rely on "whatever CWD is"

Instead, compute paths relative to your package or project root:
```
```py
from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]  # points to Level_3/
CONFIG = ROOT / "config" / "config.yaml"
```
That way it doesn't matter if you launch code from notebooks/, src/, or even VS Code.

For Jupyter Notebooks

They often start in the notebooks/ folder, which is convenient for saving .ipynb files but awkward for relative imports.

Two fixes:
```py
# 1. Add src/ to sys.path (assumes the notebook's CWD is notebooks/)

import sys
from pathlib import Path
sys.path.append(str(Path.cwd().parents[0] / "src"))
```

**OR**

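```bash
# 2. Launch Jupyter with the project root as CWD
# (check os.getcwd() in the notebook afterwards; the kernel may still start in the notebook's own folder)

cd /Users/b/DATA/PROJECTS/Telco/Level_3
jupyter notebook notebooks/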
```

```
✅ My Recommendation for You
- Keep your CWD as the project root (Level_3/) when running scripts or notebooks.
- Put configs in Level_3/config/ and data in Level_3/data/.
- Use path utilities (like a paths.py) so your code doesn't depend on where it's run.
```
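
If the `sys.path` fix ends up in several notebooks, re-running a cell appends the same entry again. A small idempotent helper avoids that. This is a sketch: `add_src_to_path` is a hypothetical name, and the `src` folder follows the layout above.

```python
import sys
from pathlib import Path


def add_src_to_path(project_root) -> str:
    """Put <project_root>/src on sys.path once and return the entry that was used."""
    src = str(Path(project_root).resolve() / "src")
    if src not in sys.path:
        sys.path.insert(0, src)
    return src


# In a notebook that lives in notebooks/, the project root is one level up
entry = add_src_to_path(Path.cwd().parent)
add_src_to_path(Path.cwd().parent)  # safe to call again, no duplicate entry
print(sys.path.count(entry))
```

Note that this uses `insert(0, ...)` rather than `append`, so the project's modules take priority over installed packages with the same name. Switch back to `append` if you prefer the opposite.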
---
> # PROBLEM # 4:
```
from pathlib import Path
ROOT = Path(__file__).resolve().parents[1]  # points to Level_3/
CONFIG = ROOT / "config" / "config.yaml"

output: 
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[7], line 2
      1 from pathlib import Path
----> 2 ROOT = Path(__file__).resolve().parents[1]  # points to Level_3/
      3 CONFIG = ROOT / "config" / "config.yaml"

NameError: name '__file__' is not defined
```
> # SOLUTION # 4:
```
That error is happening because you're running this inside a Jupyter notebook (or IPython).

👉 In a notebook there is no __file__, because __file__ only exists for actual Python files.
```
✅ How to fix
```py
# 1. Use Path.cwd() in notebooks
# (if you launch Jupyter from your project root, Level_3/)

from pathlib import Path

ROOT = Path.cwd()              # current working directory
CONFIG = ROOT / "config" / "config.yaml"
print(CONFIG.resolve())
```
```py
# 2. If your notebook lives in notebooks/ but the project root is one level up
from pathlib import Path

ROOT = Path.cwd().parents[0]   # go up from notebooks/ -> Level_3/
CONFIG = ROOT / "config" / "config.yaml"
print(CONFIG.resolve())
```
```py
# 3. Portable solution: detect notebook vs script
from pathlib import Path

try:
    # Works in scripts (assumes the file sits one folder below the project root)
    ROOT = Path(__file__).resolve().parents[1]
except NameError:
    # Works in notebooks
    ROOT = Path.cwd().parents[0]  # assumes you're in notebooks/

CONFIG = ROOT / "config" / "config.yaml"
print(CONFIG.resolve())
```
```
🔑 Best Practice

Always treat Level_3/ as your project root.

In scripts → use __file__.

In notebooks → use Path.cwd().parents[0].

If you want to make this bulletproof, put it in a paths.py module under src/ so you import it consistently instead of repeating logic.
```
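
Here is a sketch of what such a `paths.py` could look like. Instead of hard-coding `parents[1]`, it walks upward from a starting directory until it finds a folder that contains `config/`, so the same function works from `notebooks/`, `src/`, or the root. The `find_project_root` helper and the choice of `config` as the marker are assumptions for illustration.

```python
from pathlib import Path


def find_project_root(start=None, marker: str = "config") -> Path:
    """Walk up from `start` (default: CWD) to the first directory containing `marker`."""
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}/' directory found at or above {current}")


try:
    start = Path(__file__).parent   # scripts and modules
except NameError:
    start = Path.cwd()              # notebooks and the REPL

try:
    ROOT = find_project_root(start)
    print(ROOT / "config" / "config.yaml")
except FileNotFoundError as err:
    print(err)
```

The trade-off is that the result depends on a marker directory existing; if another `config/` folder sits between the start and the real root, the walk stops there, so pick a marker that is unique to the project root.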

># PROBLEM 5
```
How do I set my cwd to /Users/b/DATA/PROJECTS/Telco/Level_3
ChatGPT said:

You've got a couple of clean ways to do this depending on where you're working.
```
```
🖥️ In Bash / Zsh before running Python or Jupyter
cd /Users/b/DATA/PROJECTS/Telco/Level_3
pwd   # confirms your cwd is set


Anything you start after that (Python REPL, Jupyter, VS Code debug session) will inherit this as its current working directory.
```
🐍 Inside a Python script or REPL
```py
import os

os.chdir("/Users/b/DATA/PROJECTS/Telco/Level_3")
print(os.getcwd())   # confirm
```

📓 Inside a Jupyter notebook

Two easy options:

Option A: Python
```py
import os
os.chdir("/Users/b/DATA/PROJECTS/Telco/Level_3")
print(os.getcwd())
```

Option B: Jupyter magic
```py
%cd /Users/b/DATA/PROJECTS/Telco/Level_3
```
⚡ Best Practice:
- If Level_3/ is your project root, it's often best to always launch Jupyter or scripts from there. That way all relative paths like config/config.yaml or data/ "just work."
- If you want Jupyter to always start in Level_3/ by default, you can configure it in Jupyter's settings (`c.NotebookApp.notebook_dir` in `jupyter_notebook_config.py`; newer Jupyter Server-based releases use `c.ServerApp.root_dir` instead).
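
`os.chdir` changes the CWD for the rest of the session, which can surprise later cells. If you only need a different directory for one block of code, a context manager that restores the previous directory is a gentler option. This is a sketch: `working_directory` is a made-up name (Python 3.11+ ships a similar `contextlib.chdir`), and the temp directory stands in for your project root.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily switch the CWD to `path`, restoring the old one afterwards."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Runs even if the body raises, so the CWD never stays changed
        os.chdir(previous)


before = Path.cwd()
with working_directory(tempfile.gettempdir()):
    print("inside:", Path.cwd())
print("restored:", Path.cwd() == before)
```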

># PROBLEM 6:
```
Thanks for sharing your actual config.yaml 🙌 Now it's clear why you're getting a NoneType error.

Your file does not have a project section.
But your Python code was trying to do:

print(f"Project: {config['project']['name']}")

Since project doesn't exist in the YAML, config['project'] raises a KeyError. The earlier NoneType error is a different failure: when the YAML file is empty, yaml.safe_load returns None, and indexing None raises a TypeError. My guess is the file was saved empty (or otherwise without a top-level mapping) at some point. Now that you have a valid YAML mapping, the current problem is just a wrong expectation about keys.

✅ Two ways to fix
Option 1: Update your Python code to match your config
print("Configuration loaded!")
print(f"Raw data: {config['data']['raw_path']}")
print(f"Processed data: {config['data']['processed_path']}")
print(f"Optimize memory? {config['options']['optimize_memory']}")

Option 2: Add a project section to your YAML

If you want to keep the Python code as-is:

project:
  name: Telco Churn

data:
  raw_path: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"

🔑 Best Practice

Decide on a schema for your config file (e.g., always have project, data, options, etc.).

Keep your YAML + Python expectations in sync; otherwise you'll keep hitting KeyError.

👉 Do you want me to draft a recommended config.yaml schema for your Telco project (with project, data, preprocessing, analysis, visualization) so your code and config stay aligned as you grow this into Levels 4–6?
```
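
One lightweight way to keep YAML and code in sync is to check for the keys the code needs right after loading, so a mismatch fails with one clear message instead of a `KeyError` deep in the notebook. Here is a minimal sketch using dotted key names. `missing_keys`, the `REQUIRED` list, and the sample values are hypothetical.

```python
def missing_keys(config, required):
    """Return the dotted keys from `required` that are absent from `config`."""
    missing = []
    for dotted in required:
        node = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing


REQUIRED = ["project.name", "data.raw_path", "data.processed_path"]

# A config without a project section, like the one discussed above
config = {"data": {"raw_path": "a.csv", "processed_path": "b.csv"}}
print(missing_keys(config, REQUIRED))   # ['project.name']

# An empty YAML file loads as None; every key is then reported as missing
print(missing_keys(None, REQUIRED))
```

Calling this right after `yaml.safe_load` and raising if the list is non-empty covers both failure modes seen here: the missing section and the empty file.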

># PROBLEM 7:
```
# Import what we need
from pathlib import Path
import yaml

ROOT = Path.cwd()  # /Users/b/DATA/PROJECTS/Telco/Level_3
CFG = ROOT / "config" / "config.yaml"

with CFG.open('r') as f:
    config = yaml.safe_load(f)

print("Configuration loaded!")
print(f"Project: {config['project']['name']}")
print(f"Data path: {config['data']['filepath']}")

output:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[18], line 18
     16 print("Configuration loaded!")
     17 print(f"Project: {config['project']['name']}")
---> 18 print(f"Data path: {config['data']['filepath']}")

KeyError: 'filepath'
```
```
Now it's clear 🎯

Your YAML file (from the message you shared) looks like this:

data:
  raw_path: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
  processed_path: "data/processed/telco_clean.csv"


But your Python is expecting:

config['data']['filepath']


There's no filepath key in your YAML → hence the KeyError.

⛳️ Two ways to fix

✅ Option 1: Update your Python to match YAML
print("Configuration loaded!")
print(f"Raw data: {config['data']['raw_path']}")
print(f"Processed data: {config['data']['processed_path']}")

❌ Option 2: Update your YAML to match Python

If you **REALLY** want a single filepath (instead of raw/processed), change your YAML:

data:
  filepath: "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"

🔑 Best Practice

Be consistent: either define raw_path + processed_path in YAML and use them in Python, or stick to a simple single filepath.

For a growing project, I'd recommend keeping both paths (raw_path, processed_path); this is more flexible as you add cleaning steps.
```
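
If you keep both paths, reading them in one place makes a stale key like `filepath` surface immediately and gives the rest of the code attribute access instead of string lookups. Below is a sketch using a dataclass. The `DataPaths` class is an illustration, not something from the project, and the dict stands in for the loaded YAML.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DataPaths:
    raw_path: Path
    processed_path: Path

    @classmethod
    def from_config(cls, config: dict) -> "DataPaths":
        # A missing key raises KeyError here, at load time, not later in the analysis
        section = config["data"]
        return cls(Path(section["raw_path"]), Path(section["processed_path"]))


# Stand-in for the result of yaml.safe_load on the config shown above
config = {
    "data": {
        "raw_path": "data/WA_Fn-UseC_-Telco-Customer-Churn.csv",
        "processed_path": "data/processed/telco_clean.csv",
    }
}

paths = DataPaths.from_config(config)
print(paths.raw_path.name)   # WA_Fn-UseC_-Telco-Customer-Churn.csv
```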