
# Week 5 – Classification Models (Colab Format)

This week you'll build a **classification model** with scikit-learn. You'll learn how to:
- Load a labeled dataset
- Split it into training and test sets
- Train a simple classifier
- Evaluate it with **accuracy, precision, recall, F1**, a confusion matrix, and **ROC AUC**
- Explain the results in plain English for stakeholders

**No heavy math** — just practical steps.

**How to use in Google Colab**
1. Download this notebook.
2. Open https://colab.research.google.com
3. File → Upload notebook → select this file.
4. Run cells top to bottom (Shift + Enter).

---

## 📚 Free Learning Resources
- Kaggle: [Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)
- scikit-learn: [Classification](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html)


## 0) Setup

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay,
    roc_auc_score, RocCurveDisplay
)

np.__version__, pd.__version__



## 1) Load Dataset

We'll use scikit-learn's built-in **Breast Cancer** dataset, a binary classification problem with 569 samples and 30 numeric features. The goal is to predict whether a tumor is malignant or benign.


In [None]:

data = load_breast_cancer(as_frame=True)
df = data.frame
df.head()



## 2) Features & Target


In [None]:

X = df.drop(columns=["target"])
y = df["target"]  # 0 = malignant, 1 = benign
X.shape, y.value_counts()



## 3) Train/Test Split


In [None]:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train.shape, X_test.shape
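
To see what `stratify=y` does, the sketch below repeats the same split and compares the class shares in the full data, the training set, and the test set. The variable names ending in `_share` and the `shares` table are illustrative additions, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each class (0 = malignant, 1 = benign) in each subset
full_share = y.value_counts(normalize=True).sort_index()
train_share = y_train.value_counts(normalize=True).sort_index()
test_share = y_test.value_counts(normalize=True).sort_index()

shares = pd.DataFrame(
    {"full": full_share, "train": train_share, "test": test_share}
).round(3)
print(shares)
```

With stratification, the three columns should be nearly identical; without it, a small test set can end up with a noticeably different class mix.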



## 4) Baseline Model: Logistic Regression

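We'll standardize the features (zero mean, unit variance), which is common with linear models and helps the solver converge. The scaler is fit on the training set only, so no information from the test set leaks into training. Then we fit the classifier.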


In [None]:

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

log_reg = LogisticRegression(max_iter=500, random_state=42)
log_reg.fit(X_train_scaled, y_train)

y_pred_lr = log_reg.predict(X_test_scaled)
y_proba_lr = log_reg.predict_proba(X_test_scaled)[:, 1]  # needed for ROC AUC
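
One possible alternative to scaling by hand is to bundle the scaler and the classifier with scikit-learn's `make_pipeline`. The sketch below assumes the same split as above; the names `pipe`, `y_pred_pipe`, and `y_proba_pipe` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit() learns the scaling statistics from the training data only;
# predict() reuses those statistics on whatever data it is given.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_train, y_train)

y_pred_pipe = pipe.predict(X_test)
y_proba_pipe = pipe.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

The pipeline makes it harder to forget the `transform` step on new data, which is the main reason to prefer it once the workflow grows.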



## 5) Evaluate: Accuracy, Precision, Recall, F1


In [None]:

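def eval_classification(y_true, y_pred, y_proba=None, positive_label=1):
    """Print accuracy, precision, recall, F1 and (optionally) ROC AUC.

    Precision, recall and F1 are reported for `positive_label`
    (in this dataset, 1 = benign).
    """
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, pos_label=positive_label)
    rec = recall_score(y_true, y_pred, pos_label=positive_label)
    f1 = f1_score(y_true, y_pred, pos_label=positive_label)
    print(f"Accuracy:  {acc:.3f}")
    print(f"Precision: {prec:.3f}")
    print(f"Recall:    {rec:.3f}")
    print(f"F1 score:  {f1:.3f}")
    if y_proba is not None:
        # y_proba must be the predicted probability of class 1:
        # for binary targets, roc_auc_score treats the larger label as positive
        auc = roc_auc_score(y_true, y_proba)
        print(f"ROC AUC:   {auc:.3f}")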

eval_classification(y_test, y_pred_lr, y_proba_lr)
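
To make the four metrics less of a black box, here is a small sketch that computes them by hand from the counts of true/false positives and negatives and checks them against scikit-learn. The labels are made-up toy values, not results from the notebook, and the `_toy` names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels for illustration only (1 = positive class)
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 1])
y_pred_toy = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1])

tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))  # predicted positive, truly positive
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))  # predicted positive, truly negative
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))  # predicted negative, truly positive
tn = np.sum((y_true_toy == 0) & (y_pred_toy == 0))  # predicted negative, truly negative

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```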



## 6) Confusion Matrix (Visual)


In [None]:

cm = confusion_matrix(y_test, y_pred_lr)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(values_format='d')
plt.title("Logistic Regression – Confusion Matrix")
plt.show()
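
When reading the plot, it helps to know that rows are the true classes and columns are the predicted classes, in label order (0 = malignant, 1 = benign). The sketch below pulls the four cells out of a confusion matrix built from made-up toy labels; the variable names are illustrative, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (0 = malignant, 1 = benign)
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 0, 1])

cm_toy = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# Rows = true class, columns = predicted class
malignant_caught = cm_toy[0, 0]  # true malignant, predicted malignant
malignant_missed = cm_toy[0, 1]  # true malignant, predicted benign
benign_flagged = cm_toy[1, 0]    # true benign, predicted malignant
benign_cleared = cm_toy[1, 1]    # true benign, predicted benign

print(cm_toy)
print(f"Malignant cases missed: {malignant_missed}")
print(f"Benign cases flagged as malignant: {benign_flagged}")
```

In a screening context, the top-right cell (malignant cases predicted as benign) is usually the number stakeholders care about most.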



## 7) ROC Curve (Visual)


In [None]:

RocCurveDisplay.from_predictions(y_test, y_proba_lr)
plt.title("Logistic Regression – ROC Curve")
plt.show()
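
The curve is drawn from the points returned by `roc_curve`, one per candidate threshold. A hedged way to build intuition for ROC AUC: it equals the share of (positive, negative) pairs in which the positive sample gets the higher score. The sketch below checks this on made-up toy scores with no ties; all `_toy` names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and scores for illustration only
y_true_toy = np.array([0, 0, 1, 1])
scores_toy = np.array([0.1, 0.4, 0.35, 0.8])

# False positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true_toy, scores_toy)
auc_toy = roc_auc_score(y_true_toy, scores_toy)

# Pairwise view: how often does a positive outrank a negative?
pos_scores = scores_toy[y_true_toy == 1]
neg_scores = scores_toy[y_true_toy == 0]
pairwise_share = np.mean(pos_scores[:, None] > neg_scores[None, :])

print(f"ROC AUC: {auc_toy:.2f}, pairwise share: {pairwise_share:.2f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking. Tied scores would need half credit in the pairwise count, which this sketch does not handle.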



## 8) Try a Different Model: Random Forest

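Tree-based models split on feature thresholds, so they are largely insensitive to feature scaling and often perform well without it. We therefore train on the unscaled features.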


In [None]:

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]

print("Random Forest metrics:")
eval_classification(y_test, y_pred_rf, y_proba_rf)

# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
ConfusionMatrixDisplay(confusion_matrix=cm_rf, display_labels=data.target_names).plot(values_format='d')
plt.title("Random Forest – Confusion Matrix")
plt.show()

# ROC curve
RocCurveDisplay.from_predictions(y_test, y_proba_rf)
plt.title("Random Forest – ROC Curve")
plt.show()
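
To compare the two models at a glance, one option is to collect the metrics into a single table. The sketch below retrains both models so that it runs on its own; it wraps logistic regression in a pipeline for brevity, and the names `models`, `rows`, and `comparison` are illustrative rather than part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=500)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=300, random_state=42),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]  # probability of class 1 (benign)
    rows[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred),
        "recall": recall_score(y_test, pred),
        "f1": f1_score(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

# One row per model, one column per metric
comparison = pd.DataFrame(rows).T.round(3)
print(comparison)
```

On a test set this small, differences of a point or two between models can come down to a single sample, so treat close results as a tie.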



## 9) Plain-English Interpretation (Client-Friendly)

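Use the printed metrics and plots to explain:
- What do **precision** and **recall** mean here? Note that the metrics above treat label 1 (**benign**) as the positive class, so recall is the share of benign tumors that were correctly identified.
- Which model would you choose, and why?
- Which trade-offs matter for stakeholders (e.g., missing a malignant case vs. raising false alarms)?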

*(Double-click this cell in Colab to write your notes.)*
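
Because stakeholders usually ask about malignant cases, it can help to report precision and recall with malignant (label 0) as the positive class. The sketch below shows how `pos_label` changes which class the metrics describe; the labels are made-up toy values and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels for illustration only (0 = malignant, 1 = benign)
y_true_toy = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Default view: benign (1) is the positive class
recall_benign = recall_score(y_true_toy, y_pred_toy, pos_label=1)

# Stakeholder view: malignant (0) is the positive class
recall_malignant = recall_score(y_true_toy, y_pred_toy, pos_label=0)
precision_malignant = precision_score(y_true_toy, y_pred_toy, pos_label=0)

print(f"Recall (benign):       {recall_benign:.3f}")
print(f"Recall (malignant):    {recall_malignant:.3f}")   # share of malignant cases caught
print(f"Precision (malignant): {precision_malignant:.3f}")  # share of malignant calls that are right
```

In the notebook, the same view is available by calling `eval_classification(..., positive_label=0)` without `y_proba`, since the ROC AUC line expects probabilities for class 1.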



## 10) Optional: Threshold Tuning

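By default, `predict` assigns class 1 when the predicted probability exceeds 0.5. Try changing the decision threshold to another value and see how precision and recall trade off. ROC AUC stays the same, because it is computed from the probabilities rather than from the thresholded labels.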


In [None]:

# Example: custom threshold for logistic regression
threshold = 0.4  # try 0.3, 0.6, etc.
y_pred_thresh = (y_proba_lr >= threshold).astype(int)
print(f"Using threshold={threshold}")
eval_classification(y_test, y_pred_thresh, y_proba_lr)
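
Rather than trying thresholds one at a time, a short loop can show the whole trade-off. The sketch below uses made-up toy probabilities so that it runs on its own; in the notebook you would substitute `y_test` and `y_proba_lr`. The names `results` and the `_toy` variables are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy labels and predicted probabilities of class 1, for illustration only
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_proba_toy = np.array([0.05, 0.20, 0.45, 0.60, 0.35, 0.55, 0.65, 0.80, 0.90, 0.95])

results = {}
for threshold in [0.3, 0.5, 0.7]:
    # Predict class 1 whenever the probability reaches the threshold
    y_pred_toy = (y_proba_toy >= threshold).astype(int)
    results[threshold] = {
        "precision": precision_score(y_true_toy, y_pred_toy),
        "recall": recall_score(y_true_toy, y_pred_toy),
    }
    print(
        f"threshold={threshold:.1f}  "
        f"precision={results[threshold]['precision']:.3f}  "
        f"recall={results[threshold]['recall']:.3f}"
    )
```

In this toy data, raising the threshold makes the model more cautious about predicting class 1, so precision rises while recall falls. Real data often follows the same pattern, though precision is not guaranteed to increase at every step.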



## ✅ Week 5 Deliverables
- Train **two classifiers** (Logistic Regression + Random Forest)
- Report Accuracy, Precision, Recall, F1, ROC AUC
- Include a Confusion Matrix and ROC Curve
- Write a short business explanation of the trade-offs
- (Bonus) Show how a different threshold changes metrics

**Next (Week 6):** Framing AI projects for business (use-case mapping & ROI thinking).



---

### 📤 Save Your Work to GitHub
1) File → Download → Download `.ipynb`  
2) In GitHub Desktop, **Show in Explorer** (**Show in Finder** on macOS) → copy the file into your `ai-journey` repo  
3) Commit: `Add Week 5 Colab notebook` → **Push origin**  
4) Add a new section in `README.md` with an **Open in Colab** badge pointing to:  
   `https://colab.research.google.com/github/YOUR_USERNAME/ai-journey/blob/main/Week_5_Classification_Models_Colab.ipynb`
