# Logistic Regression Classifier (From Scratch)
### New dataset: IBM Telco Customer Churn

This notebook is written in a *workshop style*: every code cell is preceded by a markdown cell that explains **what** and **why**.

**Goal:** Train a Logistic Regression model to predict whether a customer will churn (leave) and explain results with charts.

**You will learn:** data loading → cleaning → preprocessing → training → thresholding → evaluation → interpretation.


## 0) Setup (install packages)
Run this only once in your environment (terminal or notebook). If you're already using the venv from your other labs, you can skip this step.

The workshop repo you referenced also uses a requirements file approach; we do the same.


In [None]:
!python -m pip install -r requirements.txt

## 1) Import libraries
We import standard data + ML libraries, and metrics for evaluation.


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    precision_recall_curve, average_precision_score
)

import json
from pathlib import Path


## 2) Download the dataset (IBM Telco Customer Churn)
We download a CSV directly from IBM's public GitHub (easy + reproducible). If you're offline, download manually and place it in `data/raw/`.


In [None]:

DATA_DIR = Path("data/raw")
DATA_DIR.mkdir(parents=True, exist_ok=True)

url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
csv_path = DATA_DIR / "telco_churn.csv"

if not csv_path.exists():
    df = pd.read_csv(url)
    df.to_csv(csv_path, index=False)
else:
    df = pd.read_csv(csv_path)

df.head()


## 3) Quick data check
We confirm columns, types, missing values, and how the target looks.


In [None]:

print(df.shape)
df.info()
print(df["Churn"].value_counts(dropna=False))


## 4) Clean + define target
In this dataset, `Churn` is 'Yes'/'No'. Logistic Regression needs numeric labels.

- We map: Yes→1, No→0
- We also clean `TotalCharges` because it often comes as text with blanks.
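
As a minimal sketch of why `errors="coerce"` matters, the toy series below (hypothetical values, not read from the Telco file) mimics a numeric column stored as text with one blank entry:

```python
import pandas as pd

# Toy column mimicking TotalCharges: numbers stored as text, plus a blank entry.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# errors="coerce" turns anything unparseable into NaN instead of raising an error.
clean = pd.to_numeric(raw, errors="coerce")

print(clean.dtype)          # numeric dtype after conversion
print(clean.isna().sum())   # number of entries that could not be parsed
```

The blank becomes a missing value, which the median imputer in the preprocessing step can then fill.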


In [None]:

df = df.copy()

# Target
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

# TotalCharges cleanup (common issue: blank strings)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Optional: drop customerID (it's an identifier, not a predictive feature)
if "customerID" in df.columns:
    df = df.drop(columns=["customerID"])

df.isna().sum().sort_values(ascending=False).head(10)


## 5) Split into features (X) and target (y)
This matches the workshop pattern:
- X = inputs/features
- y = label/target


In [None]:

target_col = "Churn"
X = df.drop(columns=[target_col])
y = df[target_col]

X.head()


## 6) Train/Test Split
We keep a test set to evaluate on unseen data.

**Stratify** keeps the churn ratio similar in train/test.
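
To see what stratification does, here is a small sketch on made-up labels (100 rows with 20% positives; the names `X_toy` and `y_toy` are illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 100 rows, 20% positives (hypothetical data, not the Telco set).
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# With stratify, both parts keep the same positive rate as the full data.
print("train rate:", y_tr.mean(), "test rate:", y_te.mean())
```

Without `stratify`, a small test set could end up with noticeably more or fewer churners than the training set just by chance.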


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Churn rate train:", y_train.mean().round(3), "test:", y_test.mean().round(3))


## 7) Build preprocessing (numeric + categorical)
This is the main upgrade vs the simple workshop example (hours studied → pass/fail).

- Numeric columns: impute median + scale
- Categorical columns: impute most frequent + one-hot encode

We keep everything inside a **Pipeline** so it’s reproducible and avoids data leakage.
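
The leakage point can be illustrated with a tiny sketch (toy arrays, assumed for illustration): the pipeline learns its imputation and scaling statistics from the rows passed to `fit`, and later data is only transformed with those stored values.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy numeric column with one missing value, plus an extreme "test" row.
train = np.array([[1.0], [2.0], [np.nan], [3.0]])
test = np.array([[100.0]])

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pipe.fit(train)  # statistics are learned from the training rows only

# Median used for imputation and mean used for scaling both come from train.
print(pipe.named_steps["imputer"].statistics_)
print(pipe.named_steps["scaler"].mean_)

# Transforming test data reuses the stored statistics; it does not refit them.
test_scaled = pipe.transform(test)
```

If the scaler were fit on the full dataset before splitting, the extreme test value would shift the mean and the model would indirectly see test information.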


In [None]:

num_cols = X_train.select_dtypes(include=["number"]).columns.tolist()
cat_cols = [c for c in X_train.columns if c not in num_cols]

numeric_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, num_cols),
        ("cat", categorical_pipe, cat_cols),
    ],
    remainder="drop"
)

print("Numeric cols:", len(num_cols), "Categorical cols:", len(cat_cols))


## 8) Train Logistic Regression model
Logistic Regression computes a weighted sum of the features and passes it through the sigmoid function, which turns it into a **probability** between 0 and 1.

We set `max_iter` high to avoid convergence warnings.
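
As a hedged sketch of the sigmoid step, the toy example below (one made-up feature, illustrative names) checks that applying the sigmoid to the model's linear score reproduces `predict_proba` for a default binary `LogisticRegression`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Map any real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature, binary label (hypothetical values).
X_toy = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

toy_model = LogisticRegression().fit(X_toy, y_toy)

z = toy_model.decision_function(X_toy)          # linear score: w * x + b
p_manual = sigmoid(z)                           # sigmoid applied by hand
p_sklearn = toy_model.predict_proba(X_toy)[:, 1]

print(np.allclose(p_manual, p_sklearn))
```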


In [None]:

model = LogisticRegression(max_iter=2000)

clf = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", model)
])

clf.fit(X_train, y_train)
print("Model trained ✅")


## 9) Predict probabilities and labels
The workshop explains classification like:
- if p(X) >= 0.5 → Class 1
- else → Class 0

We'll start with 0.5, then we will tune it.


In [None]:

y_proba = clf.predict_proba(X_test)[:, 1]

threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

y_proba[:10], y_pred[:10]


## 10) Evaluate: Confusion matrix + classification report
These metrics translate directly to real-world outcomes:
- False positives: customers we think will churn but won't
- False negatives: customers who will churn but we miss
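
For reference, this small sketch on made-up labels shows how the four cells of a binary confusion matrix can be unpacked and turned into precision and recall by hand (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and predictions (hypothetical, 1 = churn).
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred_toy = np.array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision_toy = tp / (tp + fp)  # of the customers flagged, how many really churn
recall_toy = tp / (tp + fn)     # of the real churners, how many we flagged

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("precision:", precision_toy, "recall:", recall_toy)
```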


In [None]:

cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred, digits=3))


## 11) ROC Curve + AUC
ROC AUC measures how well the model ranks churners above non-churners across all thresholds.
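
The ranking interpretation can be checked directly. In this sketch (toy scores, assumed for illustration), AUC is computed as the fraction of churner/non-churner pairs in which the churner gets the higher score, and compared with `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores (hypothetical).
y_true_toy = np.array([0, 0, 1, 0, 1, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70])

pos = scores_toy[y_true_toy == 1]
neg = scores_toy[y_true_toy == 0]

# Compare every positive with every negative; ties count as half a win.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true_toy, scores_toy))
```

Here 8 of the 9 pairs are ordered correctly, so both numbers equal 8/9.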


In [None]:

auc = roc_auc_score(y_test, y_proba)
print("ROC AUC:", round(auc, 4))

fpr, tpr, _ = roc_curve(y_test, y_proba)

plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.show()


## 12) Precision–Recall curve (often better for churn)
If the positive class is rare, PR curves can be more informative.
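
One reason is the baseline: a model that ranks customers at random has a precision close to the share of positives, so the reference level for average precision falls as the positive class gets rarer, while the ROC AUC reference stays at 0.5. The sketch below (toy data, illustrative names) recomputes average precision from the PR curve as the recall-weighted sum of precisions:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy labels and scores (hypothetical): 3 positives out of 8.
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 0])
scores_toy = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.05, 0.30])

prec, rec, _ = precision_recall_curve(y_true_toy, scores_toy)

# AP = sum over curve points of (change in recall) * precision.
# Recall is returned in decreasing order, hence the minus sign.
ap_manual = -np.sum(np.diff(rec) * prec[:-1])
ap_sklearn = average_precision_score(y_true_toy, scores_toy)

prevalence = y_true_toy.mean()  # rough precision level of a random ranking
print(ap_manual, ap_sklearn, prevalence)
```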


In [None]:

ap = average_precision_score(y_test, y_proba)
print("Average Precision:", round(ap, 4))

precision, recall, _ = precision_recall_curve(y_test, y_proba)

plt.figure()
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()


## 13) Choose a better threshold (business decision)
Just like your predictive maintenance project used thresholds + persistence to control noise,
classification uses a probability threshold to control false alarms vs missed churners.

Here we pick a threshold that reaches at least a target recall (e.g., 0.75), which suits cases where catching churners matters more than avoiding false alarms.
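
As an alternative sketch of the same idea, `precision_recall_curve` already returns the recall at every candidate threshold, so the highest threshold that still meets the target can be read off directly. The data and names below are made up for illustration; in practice it is safer to pick the threshold on a validation split so the test score stays an honest estimate.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, recall_score

# Toy labels and scores (hypothetical): 4 positives out of 8.
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.05, 0.55])
target = 0.75

prec, rec, thr = precision_recall_curve(y_true_toy, scores_toy)

# rec has one more entry than thr; rec[i] is the recall when predicting
# class 1 for every score >= thr[i].
meets_target = rec[:-1] >= target
chosen = thr[meets_target].max()  # highest threshold that still hits the target

achieved = recall_score(y_true_toy, (scores_toy >= chosen).astype(int))
print("threshold:", chosen, "recall:", achieved)
```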


In [None]:

target_recall = 0.75

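from sklearn.metrics import recall_score

# Find the highest threshold that still achieves recall >= target.
# Recall can only fall as the threshold rises, so we sweep from the highest
# candidate down and stop at the first one that reaches the target.
# (Sweeping upward would stop at the lowest score, where everyone is flagged.)
candidates = np.unique(np.round(y_proba, 4))[::-1]
best_thr = 0.5

for thr in candidates:
    pred = (y_proba >= thr).astype(int)
    r = recall_score(y_test, pred)
    if r >= target_recall:
        best_thr = thr
        break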

print("Chosen threshold for recall >= ", target_recall, "is:", best_thr)

y_pred_tuned = (y_proba >= best_thr).astype(int)
print("\nTuned confusion matrix:\n", confusion_matrix(y_test, y_pred_tuned))
print("\nTuned report:\n", classification_report(y_test, y_pred_tuned, digits=3))


## 14) Interpretability: which features push churn up/down?
Logistic regression is interpretable: each feature has a coefficient, and a positive coefficient raises the log-odds of churn while a negative one lowers it.

We extract feature names from the pipeline and show the largest positive/negative coefficients.
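
Coefficients are often easier to explain as odds ratios. The sketch below uses synthetic data (an assumption for illustration, not the Telco features) where one feature pushes the outcome up and another pushes it down; `exp(coef)` is the factor by which the odds change for a one-unit increase in that feature. Because the numeric features in this notebook are standardized, one unit there corresponds to one standard deviation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: feature 0 raises the odds of class 1, feature 1 lowers them.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
logits = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]
y_toy = (rng.random(200) < 1 / (1 + np.exp(-logits))).astype(int)

toy_model = LogisticRegression().fit(X_toy, y_toy)

# exp(coefficient) = multiplicative change in the odds per one-unit increase.
odds_ratios = np.exp(toy_model.coef_[0])
for name, ratio in zip(["feature_0", "feature_1"], odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 means the feature increases the odds of the positive class; below 1 means it decreases them.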


In [None]:

# Get feature names after preprocessing
ohe = clf.named_steps["preprocess"].named_transformers_["cat"].named_steps["onehot"]
cat_feature_names = ohe.get_feature_names_out(cat_cols)

feature_names = np.concatenate([np.array(num_cols), cat_feature_names])

coefs = clf.named_steps["model"].coef_[0]
coef_df = pd.DataFrame({"feature": feature_names, "coef": coefs}).sort_values("coef")

print("Top features decreasing churn probability:")
display(coef_df.head(12))

print("Top features increasing churn probability:")
display(coef_df.tail(12))


## 15) Save outputs (metrics + charts)
This matches the MLOps discipline from your other project: keep artifacts for grading/auditing.


In [None]:

OUT_DIR = Path("outputs")
OUT_DIR.mkdir(exist_ok=True)

metrics = {
    "roc_auc": float(auc),
    "average_precision": float(ap),
    "default_threshold": float(threshold),
    "tuned_threshold": float(best_thr),
}

(OUT_DIR / "metrics.json").write_text(json.dumps(metrics, indent=2))
print("Saved outputs/metrics.json")

# Save charts
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.savefig(OUT_DIR / "roc_curve.png", dpi=160, bbox_inches="tight")
plt.close()

plt.figure()
plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.savefig(OUT_DIR / "pr_curve.png", dpi=160, bbox_inches="tight")
plt.close()

print("Saved outputs/roc_curve.png and outputs/pr_curve.png")


## 16) 3 Real‑World Talking Points (use these in your presentation)
1. **Prediction → Action:** We turn a churn probability into a decision (who to contact, what offer to send). The threshold controls cost vs savings.
2. **Charts → Trust:** Confusion matrix shows the *types* of mistakes; ROC/PR curves show trade‑offs. Stakeholders can choose a threshold aligned with business risk.
3. **Process → Reliability:** A reproducible pipeline (split, preprocessing, model, saved metrics) prevents accidental leakage and makes results auditable and repeatable.
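
Talking point 1 can be made concrete with a rough cost sketch. The cost figures `COST_FP` and `COST_FN` and the toy data below are purely hypothetical placeholders; real values would have to come from the business.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

COST_FP = 10    # hypothetical: retention offer sent to a customer who would stay
COST_FN = 100   # hypothetical: revenue lost when a churner is missed

# Toy labels and churn scores (made up for illustration).
y_true_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores_toy = np.array([0.10, 0.40, 0.35, 0.20, 0.80, 0.70, 0.05, 0.55])

def total_cost(y_true, scores, thr):
    """Total cost of the mistakes made when flagging every score >= thr."""
    pred = (scores >= thr).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, pred, labels=[0, 1]).ravel()
    return fp * COST_FP + fn * COST_FN

for thr in (0.5, 0.3):
    print("threshold", thr, "-> cost", total_cost(y_true_toy, scores_toy, thr))
```

With these assumed costs, the lower threshold is cheaper because one extra offer costs far less than one missed churner; with different costs the preferred threshold could move the other way.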


## 17) Comparison to the workshop notebook
The workshop uses a simple 1‑feature example (e.g., study hours → pass/fail) to teach the sigmoid + 0.5 cutoff concept.
We keep the same logic but scale it to a real dataset:

- **Workshop:** 1 feature, easy plot, focus on theory and the 0.5 rule.
- **This notebook:** many features (numeric + categorical), full preprocessing pipeline, ROC/PR charts, threshold tuning, and model interpretation.

Key link to workshop idea: classification rule `p(X) >= 0.5` (or tuned threshold) → class 1.