# Guided ML Walkthrough (Codespaces-ready)

> Created/updated: 2025-09-16 07:56 

This notebook walks you through **supervised regression**, **supervised classification**, **unsupervised clustering**, **model validation**, and **saving models** using scikit-learn. It’s designed for GitHub Codespaces and uses only widely available libraries.

**How to use this notebook**
1. Run the setup in the next section if this is a fresh Codespace.
2. Execute cells from top to bottom. **Read the explanations** above each code block—they tell you exactly what's happening and why.
3. Inspect printed outputs, plots and tables; answer the checkpoint questions to test your understanding.


## 0) One-time setup (Codespaces)
**What & Why:** We create and activate a Python virtual environment so our dependencies don’t clash with system Python. Then we install the libraries listed in `requirements.txt`, which this notebook depends on. Restarting the kernel lets Jupyter see the new packages.

Open a terminal in your Codespace and run:
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```
Then **restart the kernel** (Kernel → Restart) so the packages are available to this notebook. If imports still fail, check that the notebook's kernel is set to the `.venv` interpreter.

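One quick way to confirm that the kernel really runs inside `.venv` is to compare `sys.prefix` with `sys.base_prefix`, which differ only inside a virtual environment. The sketch below also checks that each library can be located without importing it. The package list is an assumption based on the imports used later in this notebook, since the contents of `requirements.txt` are not shown here.

```python
import importlib.util
import sys

# True when the interpreter runs inside a virtual environment (venv).
in_venv = sys.prefix != sys.base_prefix
print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_venv)

# Assumed package list (import names), taken from the imports used in this notebook.
required = ["numpy", "pandas", "matplotlib", "sklearn", "joblib"]
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("Missing packages:", missing or "none")
```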
## 1) Warm-up: verify the tools
**What this does:** Imports the core libraries and prints their versions to confirm that everything installed correctly. The `%matplotlib inline` directive ensures that plots render inside the notebook.


In [None]:
# Import core libs used throughout the notebook.
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, model_selection, metrics, preprocessing

# Print versions so you can record your environment (useful for debugging/reproducibility).
import matplotlib, sklearn
print("NumPy", np.__version__)
print("Pandas", pd.__version__)
print("Matplotlib", matplotlib.__version__)
print("scikit-learn", sklearn.__version__)

# Make plots appear inline in the notebook UI
%matplotlib inline

## 2) Regression — predict a health score (Diabetes dataset)
**Goal:** Predict a **continuous** target (a quantitative measure of disease progression one year after baseline) from 10 numeric features.

**Why this approach?**
- The dataset is built into scikit-learn → no downloads.
- Linear Regression is the most basic regression model → great for introducing concepts like **train/test split**, **fit**, **predict**, and **evaluation** with MAE and R².

**What the code below does (step-by-step):**
1. **Load** the dataset as a Pandas DataFrame (`as_frame=True`).
2. **Split** into training and testing sets to simulate unseen data (20% held out). We fix `random_state` for reproducibility.
3. **Fit** a `LinearRegression()` model on the training set (learns coefficients using least squares).
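4. **Predict** on the test set and compute **MAE** (the average absolute error, in the target's units) and **R²** (the fraction of target variance the model explains: 1.0 is perfect, 0 matches always predicting the mean, and it can go negative on held-out data).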
5. **Plot** the relationship between actual vs predicted to visually inspect systematic errors or scatter.
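6. **Inspect coefficients** to get a sense of which features push predictions up or down. Caution: coefficient magnitudes are only comparable when the features share a scale. scikit-learn's copy of this dataset ships with the features already mean-centred and scaled, so the comparison is reasonable here.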

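To make step 4 concrete, here is a minimal sketch that computes MAE and R² by hand with NumPy. The four values are toy numbers chosen for easy arithmetic, not taken from the Diabetes data.

```python
import numpy as np

# Toy values chosen for easy arithmetic (not taken from the Diabetes data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# MAE: mean of the absolute errors |y - ŷ|.
mae = np.mean(np.abs(y_true - y_pred))

# R²: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: {mae:.2f}, R²: {r2:.3f}")  # MAE: 1.00, R²: 0.700
```

`mean_absolute_error` and `r2_score` apply the same formulas to the model's test-set predictions.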

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt

# 1) Load data as (X, y). X has 10 features; y is a continuous target.
data = load_diabetes(as_frame=True)
X = data.data          # features (DataFrame)
y = data.target        # target (Series)

# 2) Split: we train on 80% and evaluate on 20% to measure generalisation performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 3) Define and fit the model: finds coefficients that minimise squared errors on training data.
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# 4) Evaluate on held-out test set (data never seen during training).
preds = linreg.predict(X_test)
mae = mean_absolute_error(y_test, preds)        # average absolute difference
r2  = r2_score(y_test, preds)                   # proportion of variance explained (1.0 is perfect)
print(f"MAE: {mae:.2f}, R²: {r2:.3f}")

# 5) Visual diagnostic: perfect predictions would lie exactly on the diagonal line.
plt.figure()
plt.scatter(y_test, preds)
plt.xlabel("Actual target (y_test)")
plt.ylabel("Predicted target (ŷ)")
plt.title("Linear Regression: Actual vs Predicted")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])
plt.show()

# 6) Coefficient table: larger magnitude = stronger linear effect on predictions (sign indicates direction).
coef_table = pd.DataFrame({
    "feature": X.columns,
    "coefficient": linreg.coef_
}).sort_values("coefficient", ascending=False)
coef_table

**Checkpoint (think & write):**
- What do **MAE** and **R²** each reveal about model performance?
- Which features have the strongest positive/negative coefficients? Does that make intuitive sense?
- If features are on different scales, how might scaling change coefficient comparisons?

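As a hint for the last question, the sketch below fits the same relationship twice, once with a feature in metres and once in centimetres. The data are synthetic and noise-free (an assumption for illustration), so the coefficients come out exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (an assumption for illustration): y = 50 * height_m + 10.
height_m = np.array([[1.5], [1.6], [1.7], [1.8], [1.9]])
y = 50.0 * height_m.ravel() + 10.0

# Same feature expressed in centimetres.
height_cm = height_m * 100.0

model_m = LinearRegression().fit(height_m, y)
model_cm = LinearRegression().fit(height_cm, y)

print("Coefficient (metres):     ", round(model_m.coef_[0], 3))
print("Coefficient (centimetres):", round(model_cm.coef_[0], 3))

# The predictions are unchanged; only the coefficient's units differ.
same_predictions = np.allclose(model_m.predict(height_m), model_cm.predict(height_cm))
print("Same predictions:", same_predictions)
```

The coefficient shrinks by a factor of 100 while the model's predictions stay the same, which is why raw coefficient magnitudes say little about importance unless the features share a scale.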

## 3) Classification — Iris dataset (KNN vs Logistic Regression)
**Goal:** Predict a **categorical** target: one of three iris species.

**Why two models?**
- **KNN** is a simple, distance-based method; scaling matters because a feature with a larger numeric range would otherwise dominate the distance calculation.
- **Logistic Regression** learns a linear decision boundary in feature space and outputs class probabilities.

**What the code below does:**
1. **Load** the iris dataset (150 rows, 4 features, 3 classes).
2. **Stratified split** so class ratios are similar in train/test.
3. Build **pipelines** that first **scale** features, then apply the model. Because the scaler is fit only on the data passed to `fit`, pipelines help prevent **data leakage** from the test set and keep preprocessing + model together.
4. **Fit** both models and **compare** accuracies.
5. Print a **classification report** (precision/recall/F1 for each class) and a **confusion matrix** for error analysis.

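Steps 2 and 3 can be checked directly. The sketch below reloads Iris so it runs on its own, counts the classes on each side of the stratified split, and confirms that the scaler inside the pipeline learned its means from the training rows only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# Stratification: each of the three classes keeps the same share in both subsets.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print("Train class counts:", train_counts)
print("Test class counts: ", test_counts)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5)),
])
pipe.fit(X_train, y_train)

# The scaler inside the pipeline learned its means from the training rows only.
scaler_means = pipe.named_steps["scaler"].mean_
print("Scaler fit on training data only:", np.allclose(scaler_means, X_train.mean(axis=0)))
```

When `predict` is called later, the pipeline reuses those training-set statistics to transform the test rows, so nothing about the test set influences preprocessing.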

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt

# 1) Load data
iris = load_iris(as_frame=True)
X = iris.data
y = iris.target

# 2) Stratified split preserves class proportions in train/test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# 3) KNN pipeline: scaling is critical for distance-based KNN.
knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5))
])
knn.fit(X_train, y_train)
knn_preds = knn.predict(X_test)

# 4) Logistic Regression pipeline: scaling generally helps convergence/stability.
logreg = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=200))
])
logreg.fit(X_train, y_train)
log_preds = logreg.predict(X_test)

# 5) Compare accuracies and inspect detailed metrics.
print("KNN accuracy:", accuracy_score(y_test, knn_preds))
print("LogReg accuracy:", accuracy_score(y_test, log_preds))

print("\nLogReg classification report (per-class precision/recall/F1):")
print(classification_report(y_test, log_preds, target_names=iris.target_names))

# Confusion matrix: rows = actual, cols = predicted
cm = confusion_matrix(y_test, log_preds)
fig, ax = plt.subplots()
im = ax.imshow(cm)
ax.set_xticks(range(3)); ax.set_yticks(range(3))
ax.set_xticklabels(iris.target_names); ax.set_yticklabels(iris.target_names)
ax.set_xlabel("Predicted"); ax.set_ylabel("Actual"); ax.set_title("Confusion Matrix (LogReg)")
for i in range(3):
    for j in range(3):
        ax.text(j, i, cm[i, j], ha="center", va="center")
fig.colorbar(im, ax=ax)
plt.show()

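Logistic Regression also exposes class probabilities through `predict_proba`. The sketch below rebuilds the same pipeline on its own copy of the data and shows that `predict` simply returns the most probable class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, stratify=iris.target, test_size=0.2, random_state=0
)

logreg = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=200)),
])
logreg.fit(X_train, y_train)

# One row per test sample, one column per class; each row sums to 1.
proba = logreg.predict_proba(X_test)
print("Shape:", proba.shape)
print("First sample:", dict(zip(iris.target_names, proba[0].round(3))))

# predict() returns the class with the highest probability.
matches_predict = np.array_equal(
    logreg.classes_[proba.argmax(axis=1)], logreg.predict(X_test)
)
print("argmax(proba) equals predict():", matches_predict)
```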
**Checkpoint:**
- Which model scored higher **accuracy** on your split? Why might that be?
- Using the confusion matrix, which classes get confused? What features might separate them better?
- How would performance change if we **didn’t** scale features for KNN?

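For the last question, a toy illustration may help. The three points below are hypothetical (income in dollars, age in years) and are not drawn from Iris; standardising only three points is purely for demonstration.

```python
import numpy as np

# Hypothetical people: column 0 = income in dollars, column 1 = age in years.
points = np.array([
    [50_000.0, 25.0],   # a
    [50_100.0, 60.0],   # b: similar income to a, very different age
    [51_000.0, 26.0],   # c: similar age to a, different income
])

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return float(np.linalg.norm(p - q))

raw_ab = euclidean(points[0], points[1])
raw_ac = euclidean(points[0], points[2])
print(f"Raw:    a-b = {raw_ab:.1f}, a-c = {raw_ac:.1f}")

# Standardise each column to zero mean and unit variance, as StandardScaler does.
scaled = (points - points.mean(axis=0)) / points.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])
print(f"Scaled: a-b = {scaled_ab:.2f}, a-c = {scaled_ac:.2f}")
```

In raw units, `b` looks far closer to `a` than `c` does, because income differences swamp the 35-year age gap. After scaling, both features contribute on a comparable footing and the two distances are nearly equal.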

## 4) Unsupervised — K-Means clustering on Iris
**Goal:** Discover structure in data **without labels** and then compare clusters to the true species (for learning purposes only—the model has no access to labels).

**Key ideas:**
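- K-Means partitions points into `k` clusters by minimising the within-cluster sum of squared distances to each cluster's centroid (scikit-learn calls this *inertia*).
- Results can depend on initialisation; `n_init=10` runs the algorithm from 10 different starting points and keeps the lowest-inertia result, and a fixed random seed makes the run reproducible.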
- Clusters won’t necessarily match the true classes—unsupervised learning optimises a different objective.

**What the code does:**
1. Fit K-Means with `k=3` (since Iris has 3 species) and get cluster assignments.
2. Use **PCA** to reduce the 4D features to 2D for visualisation only (PCA is unsupervised dimensionality reduction).
3. Cross-tab cluster IDs vs true labels to see alignment.

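The objective K-Means minimises can be recomputed by hand. This sketch fits the same model on a fresh copy of Iris and compares the fitted model's `inertia_` attribute with a manual sum of squared distances to each point's own centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)  # labels are discarded: clustering is unsupervised

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the objective by hand: squared distance from each point to its own centroid.
own_centroids = kmeans.cluster_centers_[kmeans.labels_]
manual_inertia = np.sum((X - own_centroids) ** 2)

print("Inertia reported by KMeans:", round(kmeans.inertia_, 3))
print("Inertia computed by hand:  ", round(manual_inertia, 3))
```

The two numbers should agree up to floating-point rounding.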

In [None]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt

# 1) Fit KMeans with 3 clusters (we know iris has 3 species).
# X is the Iris feature table from Section 3; the labels y are not used for fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# 2) Visualise in 2D using PCA (for display only; the model is trained on the original 4D space).
pca = PCA(n_components=2, random_state=0)
X_2d = pca.fit_transform(X)

plt.figure()
plt.scatter(X_2d[:,0], X_2d[:,1], c=clusters)
plt.title("K-Means clusters on Iris (PCA 2D)")
plt.xlabel("PC1"); plt.ylabel("PC2")
plt.show()

# 3) Compare clusters with true labels (for evaluation/learning only).
mapping_table = pd.crosstab(pd.Series(clusters, name="cluster"),
                             pd.Series(y, name="true_label"))
mapping_table.index.name = "kmeans_cluster"
mapping_table.columns = iris.target_names
mapping_table

**Checkpoint:**
- Do the clusters align with the species reasonably well? Which species is hardest to recover via K-Means?
- Try different `n_clusters` (e.g., 2 or 4). How does the cross-tab change? Why?

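One possible way to approach the second question is to loop over a few values of `n_clusters` and record a summary for each. The sketch uses the adjusted Rand index (ARI), a metric not used elsewhere in this notebook: it is 1.0 when two partitions agree up to relabelling and close to 0 for random assignments.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

results = {}
for k in (2, 3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    results[k] = {
        "inertia": model.inertia_,                     # the objective K-Means minimises
        "ari": adjusted_rand_score(y, model.labels_),  # agreement with the true species
    }
    print(f"k={k}: inertia={results[k]['inertia']:.1f}, ARI={results[k]['ari']:.3f}")
```

Inertia keeps falling as `k` grows, so it cannot pick `k` on its own; the labels are used here only to judge the result after the fact.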

## 5) Model validation — choose K via cross-validation
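**Goal:** Use **K-fold cross-validation (CV)** on the **training set only** to select the number of neighbours `k` for KNN **without peeking at the test set**. (This `k` is unrelated to the number of folds, which is 5 here.) Tuning on the test set would make the final test score optimistically biased.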

**What the code does:**
1. For `k` from 1 to 20, build a pipeline (**scaler + KNN**) and compute **5-fold CV accuracy** on the training data.
2. Choose the `k` with the highest mean CV accuracy.
3. Refit on the full training set using the best `k` and evaluate once on the untouched test set.

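To see what "5-fold" means in practice, the sketch below prints the index splits for a toy label vector (10 samples, two balanced classes, an assumption for illustration). For classifiers, passing an integer `cv` to `cross_val_score` uses stratified folds, so `StratifiedKFold` is used here.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 10 samples, two balanced classes.
y_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_toy = np.arange(10).reshape(-1, 1)

# For classifiers, an integer `cv` in cross_val_score means stratified K-fold.
cv = StratifiedKFold(n_splits=5)
folds = list(cv.split(X_toy, y_toy))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {i}: train={train_idx.tolist()} validate={val_idx.tolist()}")

# Every sample is used for validation exactly once across the folds.
all_val = np.sort(np.concatenate([val_idx for _, val_idx in folds]))
print("Validated samples:", all_val.tolist())
```

Each fold trains on 8 samples and validates on the remaining 2, one from each class.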

In [None]:
from sklearn.model_selection import cross_val_score
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

ks = range(1, 21)
cv_scores = []
for k in ks:
    # IMPORTANT: The scaler is inside the CV loop via Pipeline → avoids data leakage.
    pipe = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=k))
    ])
    # 5-fold CV on the training set only
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())

# np.argmax returns the first maximum, so ties go to the smallest k.
best_k = ks[int(np.argmax(cv_scores))]
print("Best k by CV:", best_k)

# Visualise the CV curve to see how k affects bias/variance.
plt.figure()
plt.plot(list(ks), cv_scores, marker="o")
plt.xlabel("k (neighbors)")
plt.ylabel("Mean CV accuracy (5-fold, train set)")
plt.title("KNN: choose k via cross-validation")
plt.show()

# Final evaluation on the untouched test set using the selected k.
best_knn = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=best_k))
]).fit(X_train, y_train)
print("Test accuracy with best k:", accuracy_score(y_test, best_knn.predict(X_test)))

**Checkpoint:**
- Does the CV-selected `k` improve test accuracy compared to `k=5`? If not, why might that be?
- How would using the **test set** to pick `k` lead to overfitting?

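The manual loop above can also be written with `GridSearchCV`, which is an alternative to the approach in this section rather than what the notebook uses. The sketch reloads Iris so it runs on its own; it is not guaranteed to select the same `k` as the loop when several values tie.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier()),
])

# "model__n_neighbors" targets the n_neighbors parameter of the step named "model".
search = GridSearchCV(pipe, param_grid={"model__n_neighbors": list(range(1, 21))}, cv=5)
search.fit(X_train, y_train)  # refits the best pipeline on all training data by default

best_k = search.best_params_["model__n_neighbors"]
test_accuracy = search.score(X_test, y_test)
print("Best k:", best_k)
print(f"Mean CV accuracy: {search.best_score_:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The test set is touched only once, in the final `score` call, which mirrors the rule stated in the goal of this section.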

## 6) End-to-end pipeline with SVM
**Goal:** Demonstrate a common production-style workflow: combine preprocessing and model in a single object that can be fit, evaluated, and saved.

**Key parameters:**
- `kernel="rbf"` allows non-linear decision boundaries.
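- `C` is the inverse of regularisation strength: a higher `C` penalises training errors more heavily, so the model fits the training data more tightly.
- `gamma="scale"` sets the RBF kernel width from the data as `1 / (n_features * X.var())`, which is scikit-learn's default.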


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

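# Build a single pipeline that encapsulates preprocessing + model.
# probability=True enables predict_proba (via extra internal cross-validation, so fitting is slower);
# it is not needed for the accuracy check below.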
svm_clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", SVC(kernel="rbf", C=1.0, gamma="scale", probability=True))
])

# Fit on training data and evaluate on test data.
svm_clf.fit(X_train, y_train)
print("SVM test accuracy:", accuracy_score(y_test, svm_clf.predict(X_test)))

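To get a feel for `C`, the sketch below refits the same kind of pipeline with three values and records train and test accuracy. The values are arbitrary illustration points, not tuned recommendations, and the data are reloaded so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0
)

# The C values are arbitrary illustration points, not tuned recommendations.
scores = {}
for C in (0.01, 1.0, 100.0):
    clf = Pipeline(steps=[
        ("scaler", StandardScaler()),
        ("model", SVC(kernel="rbf", C=C, gamma="scale")),
    ])
    clf.fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for this value of C.
    scores[C] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"C={C:<6} train acc={scores[C][0]:.3f}  test acc={scores[C][1]:.3f}")
```

A large gap between train and test accuracy at high `C` would suggest overfitting; on a small, clean dataset like Iris the gap may be modest.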
## 7) Save & load a trained model (deployment starter)
**Why save models?** To reuse a trained pipeline in another script or service (e.g., a CLI tool or a small web app) **without retraining**.

**What the code does:**
1. Save the trained SVM pipeline to a file with `joblib.dump`.
2. Load it back with `joblib.load`.
3. Verify the loaded model produces the same predictions/accuracy.


In [None]:
import joblib
from sklearn.metrics import accuracy_score

# 1) Save to disk
joblib.dump(svm_clf, "iris_svm_pipeline.joblib")
print("Saved to iris_svm_pipeline.joblib")

# 2) Load from disk
loaded = joblib.load("iris_svm_pipeline.joblib")

# 3) Sanity check: accuracy should match
print("Reloaded accuracy:", accuracy_score(y_test, loaded.predict(X_test)))

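Matching accuracy does not strictly prove that every prediction is the same. A stricter check, sketched below, compares the prediction arrays element by element. For brevity the model is fit on the full dataset, and the file name and temporary directory are hypothetical choices so the example leaves no files behind.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", SVC(kernel="rbf", C=1.0, gamma="scale")),
]).fit(X, y)

# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example_pipeline.joblib"  # hypothetical file name
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# Stronger than comparing accuracy: every single prediction must be identical.
identical = np.array_equal(model.predict(X), reloaded.predict(X))
print("Predictions identical after reload:", identical)
```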
## 8) Stretch goals (choose any)
- Use `DecisionTreeClassifier` and inspect `feature_importances_` to see which features drive splits.
- Compare models with and without `StandardScaler` to observe the impact on distance-based models.
- For regression, try `Ridge`, `Lasso`, or `RandomForestRegressor` and compare MAE/R².
- Plot pairwise feature scatter using `plt.scatter` to see class separation.
- Build a tiny CLI that loads `iris_svm_pipeline.joblib`, accepts four inputs, and prints the predicted species.

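As a starting point for the first stretch goal, here is a minimal sketch. The `max_depth=3` setting is an arbitrary choice to keep the tree small, and the model is fit on the full dataset only to inspect importances, not to estimate accuracy.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# max_depth=3 is an arbitrary choice to keep the tree small and readable.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Importances are non-negative and sum to 1; a higher value means the feature
# contributed more to reducing impurity across the tree's splits.
importances = pd.Series(tree.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Compare the ranking with what the pairwise scatter plots suggest about which measurements separate the species.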

## Troubleshooting
- If plots don't show, run `%matplotlib inline` in a cell.
- If a package import fails, re-run the setup commands in a terminal and **restart the kernel**.
- If accuracy varies, remember that train/test splits are random unless you fix `random_state` (we did).
