# Part 1.4: Classical Machine Learning

Before diving into deep learning, you need to understand the algorithms that dominated machine learning for decades — and still outperform neural networks on many real-world problems. Classical ML algorithms are the foundation:
- They're **interpretable** — you can explain *why* a prediction was made
- They work with **small datasets** where deep learning would overfit
- They're **fast** to train and deploy
- They're the **baseline** that every deep learning model must beat to justify its complexity

This notebook covers decision trees, ensemble methods, SVMs, clustering, and model evaluation — the toolkit every ML practitioner needs before moving to neural networks.

## Learning Objectives
- [ ] Build decision trees from scratch and understand information gain / Gini impurity
- [ ] Explain why ensembles (Random Forests, Gradient Boosting) beat single models
- [ ] Visualize SVM decision boundaries and understand the kernel trick
- [ ] Apply k-means and DBSCAN clustering to discover structure in data
- [ ] Evaluate models with cross-validation, confusion matrices, and ROC curves
- [ ] Know when classical ML is the right choice over deep learning

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans, DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc
from sklearn.datasets import make_moons, make_blobs, make_classification
from sklearn.preprocessing import StandardScaler

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

---

## 1. Decision Trees

### Intuitive Explanation

A decision tree works much like a game of 20 questions. To classify a data point, the tree asks a series of yes/no questions about its features, splitting the data at each step until it reaches a conclusion.

**Example**: Should I play tennis today?
1. Is it raining? **No** →
2. Is it windy? **No** →
3. Prediction: **Play tennis!**

The key question is: **which question should the tree ask first?** The answer: whichever question best separates the classes. We measure this with *information gain* or *Gini impurity*.

### Splitting Criteria

| Criterion | Formula | Intuition | When to Use |
|-----------|---------|-----------|-------------|
| **Gini Impurity** | $G = 1 - \sum_{k=1}^{K} p_k^2$ | Probability of mislabeling a randomly chosen sample when labels are drawn from the node's class distribution | Default in scikit-learn; slightly cheaper (no logarithm) |
| **Entropy** | $H = -\sum_{k=1}^{K} p_k \log_2(p_k)$ | Amount of "surprise" or disorder, in bits | Alternative criterion; usually yields very similar trees |
| **Information Gain** | $IG = H(parent) - \sum \frac{n_i}{n} H(child_i)$ | Reduction in uncertainty after splitting | Used to choose the best split |

**What this means:** A pure node (all same class) has Gini = 0 and Entropy = 0. With two classes, a maximally impure node (50/50 split) has Gini = 0.5 and Entropy = 1.0. The tree greedily picks the split that reduces impurity the most.
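To make the table concrete, here is a small worked example using only the standard library. The node counts (a 5/5 parent split into 4/1 and 1/4 children) are made-up illustration values, and the helper names `gini`, `entropy`, and `impurity_reduction` are hypothetical, not part of scikit-learn.

```python
import math

def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; classes with zero count contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def impurity_reduction(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical node: 5 samples of each class, split into [4, 1] and [1, 4].
parent = [5, 5]
children = [[4, 1], [1, 4]]

print(f"Parent: Gini = {gini(parent):.3f}, Entropy = {entropy(parent):.3f}")
print(f"Child:  Gini = {gini(children[0]):.3f}, Entropy = {entropy(children[0]):.3f}")
print(f"Gini reduction:   {impurity_reduction(parent, children, gini):.3f}")
print(f"Information gain: {impurity_reduction(parent, children, entropy):.3f}")
```

Both criteria agree that this split helps: the weighted child impurity is lower than the parent's under either measure, only the scale differs.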

In [None]:
# Visualize Gini Impurity and Entropy
p = np.linspace(0.001, 0.999, 200)
gini = 1 - p**2 - (1-p)**2
entropy = -p * np.log2(p) - (1-p) * np.log2(1-p)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Gini vs Entropy
axes[0].plot(p, gini, 'b-', linewidth=2, label='Gini Impurity')
axes[0].plot(p, entropy, 'r-', linewidth=2, label='Entropy')
axes[0].axvline(x=0.5, color='gray', linestyle='--', alpha=0.5, label='Maximum impurity (50/50)')
axes[0].set_xlabel('Proportion of Class 1 (p)')
axes[0].set_ylabel('Impurity')
axes[0].set_title('Gini vs Entropy: Measuring Node Impurity')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Right: impurity reduction for a symmetric split
# Parent: 50/50 split (max impurity, Gini = 0.5)
# Children: equal-sized and mirrored, e.g. one child 80/20 and the other 20/80
parent_gini = 0.5
splits = np.linspace(0.5, 1.0, 100)  # majority-class proportion in each child
child_gini = 1 - splits**2 - (1-splits)**2
# Mirrored children share the same Gini, so their weighted average equals child_gini
ig = parent_gini - child_gini  # Gini-based gain (impurity reduction)

axes[1].plot(splits, ig, 'g-', linewidth=2)
axes[1].fill_between(splits, 0, ig, alpha=0.2, color='green')
axes[1].set_xlabel('Purity of Children (proportion of majority class)')
axes[1].set_ylabel('Information Gain')
axes[1].set_title('Information Gain: Better Splits → More Gain')
axes[1].annotate('Perfect split\n(pure children)', xy=(1.0, 0.5), fontsize=10,
                  ha='right', va='top', color='green')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Key takeaway: Both Gini and Entropy measure the same thing — impurity.")
print("Gini is slightly cheaper to compute because it needs no logarithm.")
print("In practice, the two criteria usually produce very similar trees.")

### Deep Dive: Building a Decision Tree from Scratch

Let's implement the core logic of a decision tree to understand how splitting works. The algorithm is:

1. For each feature, try every possible split threshold
2. Calculate the information gain (reduction in impurity) for each split
3. Pick the split with the highest information gain
4. Recurse on the left and right children
5. Stop when a node is pure, reaches max depth, or has too few samples

#### Key Insight

Decision trees are **greedy**: they pick the locally best split at each step without considering future splits. This makes them fast but potentially suboptimal. Combined with their sensitivity to small changes in the training data, this is part of why ensembles (coming next) usually work better than a single tree.

In [None]:
def gini_impurity(y):
    """
    Calculate the Gini impurity of a set of labels.
    
    Args:
        y: Array of class labels
    
    Returns:
        Gini impurity score (0 = pure, 0.5 = maximally impure for binary)
    """
    if len(y) == 0:
        return 0
    classes, counts = np.unique(y, return_counts=True)
    proportions = counts / len(y)
    return 1 - np.sum(proportions**2)


def information_gain(y, left_mask):
    """
    Calculate information gain from a binary split.
    
    Args:
        y: Array of class labels
        left_mask: Boolean mask for left child
    
    Returns:
        Information gain (higher is better)
    """
    parent_impurity = gini_impurity(y)
    left_y = y[left_mask]
    right_y = y[~left_mask]
    
    n = len(y)
    n_left = len(left_y)
    n_right = len(right_y)
    
    if n_left == 0 or n_right == 0:
        return 0
    
    child_impurity = (n_left / n) * gini_impurity(left_y) + (n_right / n) * gini_impurity(right_y)
    return parent_impurity - child_impurity


def find_best_split(X, y):
    """
    Find the best feature and threshold to split on.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        y: Class labels
    
    Returns:
        best_feature, best_threshold, best_gain
    """
    best_gain = 0
    best_feature = None
    best_threshold = None
    
    for feature in range(X.shape[1]):
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            left_mask = X[:, feature] <= threshold
            gain = information_gain(y, left_mask)
            if gain > best_gain:
                best_gain = gain
                best_feature = feature
                best_threshold = threshold
    
    return best_feature, best_threshold, best_gain


# Demo: find the best split on simple data
X_demo = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 8], [8, 6]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

best_feat, best_thresh, best_gain = find_best_split(X_demo, y_demo)
print(f"Best split: Feature {best_feat} <= {best_thresh}")
print(f"Information gain: {best_gain:.4f}")
print(f"\nThis means: split on {'x' if best_feat == 0 else 'y'} <= {best_thresh}")
print(f"Left group (class 0): {X_demo[X_demo[:, best_feat] <= best_thresh]}")
print(f"Right group (class 1): {X_demo[X_demo[:, best_feat] > best_thresh]}")
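
The cell above covers steps 1 to 3 of the algorithm. Steps 4 and 5 (recursion and stopping) could be sketched as follows. This is a minimal, self-contained illustration rather than how scikit-learn builds trees: the dictionary-based node layout and the names `build_tree`, `predict_one`, and `majority` are assumptions made for this example.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array (0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Exhaustive search for the (feature, threshold) with the largest Gini reduction."""
    best = (None, None, 0.0)
    parent = gini(y)
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            left = X[:, feature] <= threshold
            if left.all() or not left.any():
                continue  # one side would be empty: not a real split
            weighted = left.mean() * gini(y[left]) + (~left).mean() * gini(y[~left])
            if parent - weighted > best[2]:
                best = (feature, threshold, parent - weighted)
    return best

def majority(y):
    """Most frequent label in y."""
    values, counts = np.unique(y, return_counts=True)
    return values[np.argmax(counts)]

def build_tree(X, y, depth=0, max_depth=3, min_samples=2):
    """Step 4 (recurse) and step 5 (stop) on top of the split search."""
    feature, threshold, _ = best_split(X, y)
    # Stop at the depth limit, with too few samples, or when no split
    # reduces impurity (which also covers pure nodes).
    if depth >= max_depth or len(y) < min_samples or feature is None:
        return {"leaf": True, "prediction": majority(y)}
    left = X[:, feature] <= threshold
    return {
        "leaf": False, "feature": feature, "threshold": threshold,
        "left": build_tree(X[left], y[left], depth + 1, max_depth, min_samples),
        "right": build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples),
    }

def predict_one(node, x):
    """Walk from the root to a leaf, following the threshold tests."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["prediction"]

# Same toy data as the demo above.
X_toy = np.array([[1, 2], [2, 3], [3, 1], [6, 5], [7, 8], [8, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
toy_tree = build_tree(X_toy, y_toy)
print([int(predict_one(toy_tree, x)) for x in X_toy])
```

On this toy data a single split already produces two pure children, so the recursion stops after one level.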

In [None]:
# Now use sklearn's DecisionTreeClassifier and visualize
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, max_depth in enumerate([1, 3, None]):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    tree.fit(X, y)
    
    # Create mesh for decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200),
                         np.linspace(y_min, y_max, 200))
    Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    axes[idx].scatter(X[y==0, 0], X[y==0, 1], c='blue', edgecolors='k', alpha=0.6, label='Class 0')
    axes[idx].scatter(X[y==1, 0], X[y==1, 1], c='red', edgecolors='k', alpha=0.6, label='Class 1')
    
    depth_label = max_depth if max_depth is not None else 'Unlimited'
    score = tree.score(X, y)  # accuracy on the training data itself, not on held-out data
    axes[idx].set_title(f'Depth = {depth_label} (Train accuracy: {score:.2f})')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Decision Tree Decision Boundaries: Depth Matters', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Notice: Deeper trees create more complex (jagged) boundaries.")
print("Too deep → overfitting. Too shallow → underfitting.")

In [None]:
# Visualize the actual tree structure
tree_shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_shallow.fit(X, y)

plt.figure(figsize=(16, 8))
plot_tree(tree_shallow, filled=True, feature_names=['Feature 1', 'Feature 2'],
          class_names=['Class 0', 'Class 1'], rounded=True, fontsize=9)
plt.title('Decision Tree Structure (depth=3)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Reading the tree:")
print("- Each node shows: the split condition, Gini impurity, sample count, class distribution")
print("- Orange nodes lean toward Class 0, blue nodes lean toward Class 1")
print("- Darker color = more confident (lower impurity)")

### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| Deeper trees are always better | Deeper trees overfit — they memorize the training data instead of learning patterns |
| Decision trees find the globally optimal splits | They use **greedy** search — each split is locally optimal but the overall tree may not be |
| Decision trees can't handle continuous features | They can! They find the best threshold to split a continuous feature into two groups |
| Decision trees are too simple for real problems | Single trees are weak, but ensembles of trees (Random Forests, XGBoost) win Kaggle competitions |

---

## 2. Ensemble Methods

### Intuitive Explanation

A single decision tree is like asking one person for directions: they might be wrong. An ensemble is like asking 100 people and going with the majority vote. Even if each person is only slightly better than random, the crowd's collective answer is surprisingly accurate, provided they do not all make the same mistakes.

This is the **wisdom of crowds** applied to machine learning.

| Method | Strategy | Intuition | Strength |
|--------|----------|-----------|----------|
| **Bagging** (Random Forest) | Train many trees on bootstrap samples with random feature subsets, then vote | Reduce variance by averaging | Robust; adding more trees rarely hurts |
| **Boosting** (Gradient Boosting) | Train trees sequentially, each fixing previous mistakes | Reduce bias by focusing on errors | Often higher accuracy, but can overfit |
| **XGBoost** | Optimized gradient boosting with regularization | Boosting plus explicit penalties on tree complexity | Strong, widely used choice for tabular data |

### Deep Dive: Why Ensembles Work

The magic of ensembles comes from the **bias-variance tradeoff**:

- **Bias**: How far off the model's average prediction is from the truth (systematic error)
- **Variance**: How much predictions change when you train on different data (instability)
- **Total Error = Bias² + Variance + Noise**

A single deep decision tree has **low bias** (it can fit complex patterns) but **high variance** (small changes in data lead to very different trees). Ensembles fix this:

- **Random Forests** (bagging): Average many high-variance trees → variance drops, bias stays low
- **Gradient Boosting**: Sequentially add shallow (high-bias, low-variance) trees, each fit to the errors that remain → bias drops with every round

#### Key Insight

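If each tree's errors were fully **independent**, averaging N trees would cut the variance to roughly 1/N of a single tree's. Real trees trained on the same data are correlated: with pairwise correlation $\rho$ and per-tree variance $\sigma^2$, the variance of the average is $\rho\sigma^2 + (1-\rho)\sigma^2/N$, so the correlated part never averages away. This is why Random Forests use random subsets of features at each split: it decorrelates the trees.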
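The effect of correlation can be checked with a small simulation. The sketch below does not train any trees; it models each "tree error" as a unit-variance Gaussian with a chosen pairwise correlation `rho` (a simplifying assumption) and compares the simulated variance of the average with the standard result `rho + (1 - rho) / N`. The function names are hypothetical.

```python
import numpy as np

def variance_of_average(n_models, rho, n_trials=40_000, seed=0):
    """Empirical variance of the mean of n_models unit-variance errors
    whose pairwise correlation is rho (0 <= rho <= 1)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_trials, 1))          # component common to all models
    private = rng.standard_normal((n_trials, n_models))  # component unique to each model
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private
    return errors.mean(axis=1).var()

def theoretical_variance(n_models, rho):
    """rho * sigma^2 + (1 - rho) * sigma^2 / N with sigma^2 = 1."""
    return rho + (1 - rho) / n_models

for rho in (0.0, 0.3, 0.9):
    for n in (1, 10, 50):
        print(f"rho={rho:.1f}  N={n:>2}  "
              f"simulated={variance_of_average(n, rho):.3f}  "
              f"theory={theoretical_variance(n, rho):.3f}")
```

With `rho = 0` the variance keeps shrinking as models are added; with `rho = 0.9` it levels off near 0.9 no matter how many are averaged, which is the motivation for decorrelating the trees.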

In [None]:
# Random Forest: Visualize single tree vs forest
X_moons, y_moons = make_moons(n_samples=300, noise=0.25, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Single tree
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_moons, y_moons)

x_min, x_max = X_moons[:, 0].min() - 0.5, X_moons[:, 0].max() + 0.5
y_min, y_max = X_moons[:, 1].min() - 0.5, X_moons[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

Z = tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[0].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
axes[0].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1], c='blue', edgecolors='k', alpha=0.6)
axes[0].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1], c='red', edgecolors='k', alpha=0.6)
axes[0].set_title(f'Single Decision Tree\nAccuracy: {tree.score(X_moons, y_moons):.2f}')
axes[0].grid(True, alpha=0.3)

# Random Forest (10 trees)
rf_small = RandomForestClassifier(n_estimators=10, random_state=42)
rf_small.fit(X_moons, y_moons)
Z = rf_small.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[1].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
axes[1].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1], c='blue', edgecolors='k', alpha=0.6)
axes[1].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1], c='red', edgecolors='k', alpha=0.6)
axes[1].set_title(f'Random Forest (10 trees)\nAccuracy: {rf_small.score(X_moons, y_moons):.2f}')
axes[1].grid(True, alpha=0.3)

# Random Forest (100 trees)
rf_large = RandomForestClassifier(n_estimators=100, random_state=42)
rf_large.fit(X_moons, y_moons)
Z = rf_large.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[2].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
axes[2].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1], c='blue', edgecolors='k', alpha=0.6)
axes[2].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1], c='red', edgecolors='k', alpha=0.6)
axes[2].set_title(f'Random Forest (100 trees)\nAccuracy: {rf_large.score(X_moons, y_moons):.2f}')
axes[2].grid(True, alpha=0.3)

plt.suptitle('Single Tree vs Random Forest: The Power of Ensembles', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

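print("Notice how the forest's boundary is smoother and more generalizable.")
print("The single tree overfits with jagged, irregular boundaries.")
print("The accuracies in the titles are measured on the training data, so they reward overfitting.")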

In [None]:
# Gradient Boosting: Watch the model improve iteratively
fig, axes = plt.subplots(1, 4, figsize=(20, 4))

for idx, n_estimators in enumerate([1, 5, 20, 100]):
    gb = GradientBoostingClassifier(n_estimators=n_estimators, max_depth=2,
                                    learning_rate=0.5, random_state=42)
    gb.fit(X_moons, y_moons)
    
    Z = gb.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    axes[idx].scatter(X_moons[y_moons==0, 0], X_moons[y_moons==0, 1], c='blue', edgecolors='k', alpha=0.5, s=15)
    axes[idx].scatter(X_moons[y_moons==1, 0], X_moons[y_moons==1, 1], c='red', edgecolors='k', alpha=0.5, s=15)
    axes[idx].set_title(f'{n_estimators} tree{"s" if n_estimators > 1 else ""}\nAcc: {gb.score(X_moons, y_moons):.2f}')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Gradient Boosting: Each Tree Fixes Previous Mistakes', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Key difference from Random Forest:")
print("- Random Forest: trees are built independently (parallel), then averaged")
print("- Gradient Boosting: trees are built sequentially, each one correcting residual errors")
print("- Gradient Boosting often achieves higher accuracy but is more prone to overfitting")
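
To see what "correcting residual errors" means mechanically, here is a hedged from-scratch sketch. It is deliberately simplified: it boosts depth-1 stumps on a 1-D regression problem with squared error, whereas `GradientBoostingClassifier` fits trees to the gradients of a classification loss. The synthetic sine data and the names `fit_stump`, `predict_stump`, and `gradient_boost` are assumptions made for this example.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold regressor on 1-D x: a constant on each side.
    Assumes x has at least two distinct values."""
    best = None
    for threshold in np.unique(x)[:-1]:   # skip the max so the right side is never empty
        left = x <= threshold
        left_value, right_value = residual[left].mean(), residual[~left].mean()
        sse = np.sum((residual - np.where(left, left_value, right_value)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=50, learning_rate=0.5):
    """Squared-error boosting: each stump is fit to the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residual = y - prediction          # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        mse_history.append(np.mean((y - prediction) ** 2))
    return base, stumps, mse_history

rng = np.random.default_rng(0)
x_toy = np.sort(rng.uniform(0, 6, 200))
y_toy = np.sin(x_toy) + rng.normal(0, 0.1, 200)   # synthetic 1-D regression target

base, stumps, mse_history = gradient_boost(x_toy, y_toy)
for n in (1, 5, 20, 50):
    print(f"{n:>2} stumps: training MSE = {mse_history[n - 1]:.4f}")
```

Each stump on its own is a very weak model, yet the training error keeps falling as more are added. That is the bias reduction described above, and it is also why too many rounds can end up fitting noise.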

In [None]:
# Compare all tree-based methods
X_train, X_test, y_train, y_test = train_test_split(X_moons, y_moons, 
                                                     test_size=0.3, random_state=42)

models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest (100)': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting (100)': GradientBoostingClassifier(n_estimators=100, random_state=42),
}

print(f"{'Model':<30} {'Train Accuracy':>15} {'Test Accuracy':>15} {'CV Score (5-fold)':>18}")
print('-' * 80)

for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    cv_scores = cross_val_score(model, X_moons, y_moons, cv=5)
    print(f"{name:<30} {train_acc:>15.4f} {test_acc:>15.4f} {cv_scores.mean():>12.4f} ± {cv_scores.std():.4f}")

print("\nNotice: The single tree likely overfits (high train, lower test).")
print("Ensembles generalize better, with Random Forest being more robust.")

---

## 3. Support Vector Machines (SVMs)

### Intuitive Explanation

Imagine you have two groups of points on a table and you want to separate them with a straight line. There are many possible lines, but the SVM finds the **best** one — the line that maximizes the **margin** (distance) between the line and the nearest points from each class.

These nearest points are called **support vectors** because they "support" (define) the decision boundary. Moving any other point would not change the boundary, as long as that point stays outside the margin.

**What this means:** SVMs are fundamentally about finding the widest possible "highway" between two classes. A wider margin means better generalization to new data.
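In standard notation, the "widest highway" idea is usually written as an optimization problem. The symbols $w$ (weight vector), $b$ (offset), and $\xi_i$ (slack) are not defined elsewhere in this notebook and are introduced here as assumptions, with labels encoded as $y_i \in \{-1, +1\}$:

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \quad \text{for all } i$$

The two edges of the margin are the lines $w^\top x + b = \pm 1$, so its width is $2 / \lVert w \rVert$, and minimizing $\lVert w \rVert$ maximizes the margin. When the classes overlap, the soft-margin version lets points violate the margin at a cost controlled by $C$:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

This $C$ plays the same role as the `C=1.0` argument passed to `SVC` in the cells below: a larger value penalizes margin violations more heavily.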

### The Kernel Trick

What if the data isn't linearly separable? The **kernel trick** maps data into a higher-dimensional space where a linear boundary *does* exist — without actually computing the transformation.

| Kernel | What It Does | When to Use |
|--------|-------------|-------------|
| **Linear** | No transformation (straight line) | Linearly separable data, high dimensions |
| **RBF (Gaussian)** | Maps to infinite dimensions via similarity | Most common default, handles nonlinear data |
| **Polynomial** | Maps to polynomial feature space | When you suspect polynomial decision boundaries |

**Key Insight**: The kernel trick is mathematically elegant — it lets you compute dot products in a high-dimensional space without ever going there. The RBF kernel effectively measures how "similar" two points are; nearby points get high similarity, distant points get low similarity.
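A small numerical check makes the "dot product without going there" claim concrete. The sketch uses the homogeneous degree-2 polynomial kernel $(a \cdot b)^2$ on 2-D points, for which the explicit feature map is known in closed form. This is a simplification: scikit-learn's `poly` kernel also includes `gamma` and `coef0` terms. The function names are hypothetical.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D point: R^2 -> R^3."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2, computed in the original 2-D space."""
    return np.dot(a, b) ** 2

def rbf_kernel(a, b, gamma=1.0):
    """RBF similarity: 1 for identical points, decaying toward 0 with distance."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print("Explicit map, then dot product:", np.dot(phi(a), phi(b)))
print("Kernel in the original space:  ", poly2_kernel(a, b))
print("RBF, nearby points: ", rbf_kernel(a, a + 0.1))
print("RBF, distant points:", rbf_kernel(a, a + 5.0))
```

The first two numbers agree up to floating-point rounding, even though the kernel never builds the 3-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so the kernel form is the only practical way to compute it.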

In [None]:
# SVM: Linear vs RBF kernel on moons data
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Scale data for SVM (SVMs are sensitive to feature scales)
scaler = StandardScaler()
X_moons_scaled = scaler.fit_transform(X_moons)

x_min, x_max = X_moons_scaled[:, 0].min() - 0.5, X_moons_scaled[:, 0].max() + 0.5
y_min, y_max = X_moons_scaled[:, 1].min() - 0.5, X_moons_scaled[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

kernels = ['linear', 'rbf', 'poly']
kernel_names = ['Linear Kernel', 'RBF Kernel (Gaussian)', 'Polynomial Kernel (degree=3)']

for idx, (kernel, name) in enumerate(zip(kernels, kernel_names)):
    svm = SVC(kernel=kernel, C=1.0, random_state=42)
    svm.fit(X_moons_scaled, y_moons)
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    axes[idx].scatter(X_moons_scaled[y_moons==0, 0], X_moons_scaled[y_moons==0, 1], 
                      c='blue', edgecolors='k', alpha=0.6, label='Class 0')
    axes[idx].scatter(X_moons_scaled[y_moons==1, 0], X_moons_scaled[y_moons==1, 1], 
                      c='red', edgecolors='k', alpha=0.6, label='Class 1')
    
    # Highlight support vectors
    axes[idx].scatter(X_moons_scaled[svm.support_, 0], X_moons_scaled[svm.support_, 1],
                      s=100, facecolors='none', edgecolors='green', linewidths=2, label='Support vectors')
    
    axes[idx].set_title(f'{name}\nAccuracy: {svm.score(X_moons_scaled, y_moons):.2f}, SVs: {len(svm.support_)}')
    axes[idx].set_xlabel('Feature 1 (scaled)')
    axes[idx].set_ylabel('Feature 2 (scaled)')
    axes[idx].legend(fontsize=8)
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('SVM Decision Boundaries: The Kernel Trick', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Green circles = support vectors (the points that define the boundary)")
print("Linear kernel fails on nonlinear data; RBF handles it beautifully.")
print("The kernel trick: map to higher dimensions where data IS linearly separable.")

In [None]:
# Visualize the margin concept with a simple 2D example
np.random.seed(42)
X_simple = np.vstack([
    np.random.randn(30, 2) + np.array([2, 2]),
    np.random.randn(30, 2) + np.array([-2, -2])
])
y_simple = np.array([0]*30 + [1]*30)

svm_linear = SVC(kernel='linear', C=1.0)
svm_linear.fit(X_simple, y_simple)

plt.figure(figsize=(10, 6))

# Decision boundary and margins
x_min, x_max = X_simple[:, 0].min() - 1, X_simple[:, 0].max() + 1
y_min, y_max = X_simple[:, 1].min() - 1, X_simple[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

# Get decision function values for margin visualization
Z = svm_linear.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contour(xx, yy, Z, levels=[-1, 0, 1], colors=['blue', 'black', 'red'],
            linestyles=['--', '-', '--'], linewidths=[1, 2, 1])
plt.contourf(xx, yy, Z, levels=[-1, 1], alpha=0.1, colors=['yellow'])

plt.scatter(X_simple[y_simple==0, 0], X_simple[y_simple==0, 1], c='blue', edgecolors='k', alpha=0.6, label='Class 0')
plt.scatter(X_simple[y_simple==1, 0], X_simple[y_simple==1, 1], c='red', edgecolors='k', alpha=0.6, label='Class 1')
plt.scatter(X_simple[svm_linear.support_, 0], X_simple[svm_linear.support_, 1],
            s=150, facecolors='none', edgecolors='green', linewidths=2, label='Support vectors')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('SVM Maximum Margin: The Widest "Highway" Between Classes', fontsize=13)
plt.legend()
plt.grid(True, alpha=0.3)

plt.annotate('Margin\n(maximize this)', xy=(0, 0), fontsize=11,
             ha='center', va='center', color='orange', fontweight='bold',
             bbox=dict(boxstyle='round,pad=0.3', facecolor='yellow', alpha=0.3))

plt.show()

print("Solid black line: decision boundary (where the SVM predicts the class changes)")
print("Dashed lines: margin boundaries (the 'edge of the highway')")
print("Yellow region: the margin — the SVM maximizes this distance")
print(f"Number of support vectors: {len(svm_linear.support_)} out of {len(y_simple)} total points")

### Deep Dive: Why SVMs Matter for Deep Learning

SVMs introduce concepts that recur throughout deep learning:

| SVM Concept | Deep Learning Connection |
|-------------|-------------------------|
| Maximum margin | Contrastive learning maximizes distance between embeddings |
| Kernel trick | Neural networks learn nonlinear feature mappings automatically |
| Support vectors | Hard examples in curriculum learning / hard negative mining |
| Hinge loss | Margin-based losses of the same form, such as triplet loss, are still used (e.g., face verification) |
| Feature scaling | Batch normalization, layer normalization |

**The kernel trick was revolutionary** because it showed you could work in infinite-dimensional spaces efficiently. Neural networks took a different approach: instead of a fixed kernel, they *learn* the feature mapping from data. But the core insight — transform the data until the problem becomes linear — is the same.

---

## 4. Clustering

### Intuitive Explanation

Clustering is **unsupervised learning** — there are no labels, and the algorithm must discover structure on its own. The goal: group similar data points together.

Think of it like sorting a pile of mixed laundry without labels. You'd naturally group items by color, fabric type, or size — clustering algorithms do the same with numerical features.

| Algorithm | How It Works | Shape of Clusters | Needs K? |
|-----------|-------------|-------------------|----------|
| **k-Means** | Iteratively assign points to nearest center, update centers | Spherical/convex | Yes |
| **DBSCAN** | Group points in dense regions, mark sparse points as noise | Arbitrary shapes | No (uses epsilon, min_samples) |
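
The k-means row of the table can be sketched in a few lines of NumPy. This is a bare-bones version of Lloyd's algorithm for illustration only, with none of scikit-learn's refinements such as k-means++ initialization or multiple restarts. The function name `kmeans`, the synthetic blobs, and the choice of starting points are assumptions made for this example.

```python
import numpy as np

def kmeans(X, initial_centers, n_iters=100):
    """Lloyd's algorithm: alternate nearest-center assignment and mean updates."""
    centers = np.array(initial_centers, dtype=float)
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every center.
        distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (a center with no points stays where it is).
        new_centers = centers.copy()
        for k in range(len(centers)):
            if np.any(labels == k):
                new_centers[k] = X[labels == k].mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # converged: the centers have stopped moving
        centers = new_centers
    return centers, labels

# Synthetic data: three well-separated Gaussian blobs (made-up example values).
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_toy = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in true_centers])

# Start from one data point per blob to keep the example deterministic.
centers, labels = kmeans(X_toy, initial_centers=X_toy[[0, 50, 100]])
print(np.round(centers, 2))
```

Starting from one point per blob sidesteps the main weakness of plain k-means, its sensitivity to initialization. With random starting points the same code can converge to a poor local optimum.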

In [None]:
# k-Means: Visualize the iterative process
X_blobs, y_blobs = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Re-fit scikit-learn's KMeans from the same starting centers with an increasing
# max_iter cap, so each panel shows the state after at most that many iterations

# Initialize with random centroids
np.random.seed(42)
initial_centers = X_blobs[np.random.choice(len(X_blobs), 4, replace=False)]

# Show iterations
colors = ['blue', 'red', 'green', 'orange']
iterations = [1, 2, 3, 5, 10, 20]

for idx, max_iter in enumerate(iterations):
    row, col = idx // 3, idx % 3
    
    km = KMeans(n_clusters=4, init=initial_centers, n_init=1, max_iter=max_iter, random_state=42)
    labels = km.fit_predict(X_blobs)
    centers = km.cluster_centers_
    
    for k in range(4):
        mask = labels == k
        axes[row, col].scatter(X_blobs[mask, 0], X_blobs[mask, 1], 
                               c=colors[k], alpha=0.5, s=20)
    
    axes[row, col].scatter(centers[:, 0], centers[:, 1], c='black', marker='X', 
                           s=200, edgecolors='white', linewidths=2, zorder=5)
    axes[row, col].set_title(f'Iteration {max_iter}')
    axes[row, col].grid(True, alpha=0.3)

plt.suptitle('k-Means Clustering: Iterative Convergence\n'
             '(X marks = cluster centers, colors = assignments)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("k-Means algorithm:")
print("1. Initialize K random centers")
print("2. Assign each point to its nearest center")
print("3. Move each center to the mean of its assigned points")
print("4. Repeat steps 2-3 until convergence")

In [None]:
# Elbow Method: How to choose K
inertias = []
K_range = range(1, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_blobs)
    inertias.append(km.inertia_)

plt.figure(figsize=(10, 6))
plt.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
plt.axvline(x=4, color='red', linestyle='--', alpha=0.7, label='Elbow at K=4 (true number of clusters)')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.title('The Elbow Method: Finding the Right Number of Clusters')
plt.legend()
plt.grid(True, alpha=0.3)

plt.annotate('"Elbow" — diminishing\nreturns after this K',
             xy=(4, inertias[3]), xytext=(6, inertias[1]),
             arrowprops=dict(arrowstyle='->', color='red'),
             fontsize=11, color='red')

plt.show()

print("The elbow method: plot inertia vs K and look for the 'bend'.")
print("Before the elbow: adding clusters gives big improvement.")
print("After the elbow: adding clusters gives diminishing returns.")
print(f"\nInertia values: {[f'{x:.0f}' for x in inertias]}")

In [None]:
# k-Means vs DBSCAN: Different strengths
# Create data with non-convex clusters: two interleaving half-moons
# (despite the variable names, these are moons, not circles)
X_circles, y_circles = make_moons(n_samples=300, noise=0.08, random_state=42)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# True labels
axes[0].scatter(X_circles[y_circles==0, 0], X_circles[y_circles==0, 1], c='blue', alpha=0.6, label='Class 0')
axes[0].scatter(X_circles[y_circles==1, 0], X_circles[y_circles==1, 1], c='red', alpha=0.6, label='Class 1')
axes[0].set_title('True Labels (moons data)')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# k-Means (fails on non-convex)
km = KMeans(n_clusters=2, random_state=42, n_init=10)
km_labels = km.fit_predict(X_circles)
axes[1].scatter(X_circles[km_labels==0, 0], X_circles[km_labels==0, 1], c='blue', alpha=0.6)
axes[1].scatter(X_circles[km_labels==1, 0], X_circles[km_labels==1, 1], c='red', alpha=0.6)
axes[1].scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1], c='black', marker='X', s=200, zorder=5)
axes[1].set_title('k-Means (K=2) — FAILS on non-convex shapes')
axes[1].grid(True, alpha=0.3)

# DBSCAN (handles non-convex)
db = DBSCAN(eps=0.2, min_samples=5)
db_labels = db.fit_predict(X_circles)
for label in np.unique(db_labels):
    if label == -1:
        axes[2].scatter(X_circles[db_labels==label, 0], X_circles[db_labels==label, 1], 
                       c='gray', marker='x', alpha=0.5, label='Noise')
    else:
        color = ['blue', 'red', 'green'][label % 3]
        axes[2].scatter(X_circles[db_labels==label, 0], X_circles[db_labels==label, 1], 
                       c=color, alpha=0.6, label=f'Cluster {label}')
axes[2].set_title('DBSCAN — Finds arbitrary cluster shapes')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.suptitle('k-Means vs DBSCAN: Choosing the Right Clustering Algorithm', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("k-Means assumes spherical clusters — it fails on crescent-shaped data.")
print("DBSCAN finds clusters of arbitrary shape by following dense regions.")
print("DBSCAN also detects noise points (gray X markers) that don't belong to any cluster.")

### Deep Dive: Clustering in the ML Pipeline

Clustering isn't just an end in itself — it's a tool used throughout ML:

| Application | How Clustering Is Used |
|-------------|----------------------|
| **Feature engineering** | Cluster IDs become features for downstream models |
| **Data exploration** | Discover natural groups before building supervised models |
| **Anomaly detection** | Points far from every cluster center (or labeled as noise by DBSCAN) can be flagged as anomalies |
| **Semi-supervised learning** | Propagate labels from labeled to unlabeled points in same cluster |
| **Embeddings** | k-means on word embeddings discovers topic clusters |
| **Vector quantization** | Compress continuous embeddings to discrete cluster IDs (used in VQ-VAE) |

---

## 5. k-Nearest Neighbors and Naive Bayes

### k-Nearest Neighbors (k-NN)

The simplest classifier: to predict a new point, find the K closest training points and take a majority vote.

**Strengths:** No training phase, works for any number of classes, intuitive.  
**Weaknesses:** Slow at prediction time (must compare to all training points), sensitive to feature scales, struggles in high dimensions ("curse of dimensionality").
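
The whole algorithm fits in a few lines, which also shows where the prediction cost comes from. The sketch below is a brute-force version that assumes Euclidean distance and non-negative integer labels; the function name `knn_predict` and the toy points are made up for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Brute-force k-NN: every query is compared with every training point."""
    predictions = []
    for x in X_query:
        distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # Euclidean distance to each training point
        nearest = np.argsort(distances)[:k]                     # indices of the k closest points
        votes = np.bincount(y_train[nearest])                   # label counts among the neighbors
        predictions.append(votes.argmax())                      # majority vote (ties go to the smaller label)
    return np.array(predictions)

# Hypothetical toy data: two small groups of 2-D points.
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_query = np.array([[1.2, 1.4], [6.4, 6.2], [4.5, 4.5]])

print(knn_predict(X_train, y_train, X_query, k=3))
```

There is no `fit` step at all: the "model" is the stored training set. Because the distance sums over raw feature values, a feature measured on a larger scale dominates the vote, which is the scale sensitivity mentioned above.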

### Naive Bayes

Uses Bayes' theorem with the "naive" assumption that features are conditionally independent given the class:

$$P(y \mid x_1, x_2, \ldots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

**Strengths:** Extremely fast, works well with high-dimensional sparse data (text), good with small training sets.  
**Weaknesses:** The independence assumption is almost never true, so probability estimates are often poorly calibrated.
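
The formula above translates almost directly into code. The sketch below is a minimal Gaussian variant, assuming each feature is normally distributed within each class and working in log space so the product becomes a sum. It is an illustration, not scikit-learn's `GaussianNB` implementation; the function names and synthetic data are made up for this example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and variance for each class."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9  # guard against zero variance
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y)."""
    classes, priors, means, variances = model
    # Shape (n_samples, n_classes, n_features): per-feature Gaussian log-densities.
    diff = X[:, None, :] - means[None, :, :]
    log_likelihood = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                             + diff ** 2 / variances[None, :, :])
    # Summing over features is the "naive" independence step.
    log_posterior = np.log(priors)[None, :] + log_likelihood.sum(axis=2)
    return classes[log_posterior.argmax(axis=1)]

# Synthetic two-class data (made-up example values).
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
                   rng.normal([4, 4], 1.0, (100, 2))])
y_toy = np.array([0] * 100 + [1] * 100)

model = fit_gaussian_nb(X_toy, y_toy)
accuracy = (predict_gaussian_nb(model, X_toy) == y_toy).mean()
print(f"Training accuracy: {accuracy:.2f}")
```

Training is a single pass that computes means and variances, which is why the method is so fast. Prediction only needs one log-density per feature and class.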

| Algorithm | Type | Training | Prediction | Best For |
|-----------|------|----------|------------|----------|
| **k-NN** | Distance-based | None (stores data) | Slow (O(n·d) per query with brute-force search) | Small datasets, few features |
| **Naive Bayes** | Probabilistic | Fast (O(n·d)) | Very fast (O(K·d) for K classes) | Text classification, spam detection |

In [None]:
# k-NN: Effect of K on decision boundary
X_knn, y_knn = make_moons(n_samples=200, noise=0.2, random_state=42)

fig, axes = plt.subplots(1, 4, figsize=(20, 4))

x_min, x_max = X_knn[:, 0].min() - 0.5, X_knn[:, 0].max() + 0.5
y_min, y_max = X_knn[:, 1].min() - 0.5, X_knn[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

for idx, k in enumerate([1, 5, 15, 50]):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_knn, y_knn)
    
    Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
    axes[idx].scatter(X_knn[y_knn==0, 0], X_knn[y_knn==0, 1], c='blue', edgecolors='k', alpha=0.5, s=15)
    axes[idx].scatter(X_knn[y_knn==1, 0], X_knn[y_knn==1, 1], c='red', edgecolors='k', alpha=0.5, s=15)
    # Score is computed on the training points themselves, so k=1 is always 1.00
    axes[idx].set_title(f'k = {k}\nTrain acc: {knn.score(X_knn, y_knn):.2f}')
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('k-NN: Effect of K on Decision Boundary\n'
             'Small K → overfits, Large K → underfits', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("k=1: Every training point is its own island (overfitting)")
print("k=50: Decision boundary is too smooth (underfitting)")
print("k=5 or k=15: Good balance between flexibility and smoothness")

In [None]:
# Naive Bayes vs k-NN comparison
X_compare, y_compare = make_classification(n_samples=500, n_features=2, n_redundant=0,
                                            n_informative=2, n_clusters_per_class=2, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

x_min, x_max = X_compare[:, 0].min() - 1, X_compare[:, 0].max() + 1
y_min, y_max = X_compare[:, 1].min() - 1, X_compare[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

# Naive Bayes
nb = GaussianNB()
nb.fit(X_compare, y_compare)
Z = nb.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[0].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
axes[0].scatter(X_compare[y_compare==0, 0], X_compare[y_compare==0, 1], c='blue', edgecolors='k', alpha=0.5, s=15)
axes[0].scatter(X_compare[y_compare==1, 0], X_compare[y_compare==1, 1], c='red', edgecolors='k', alpha=0.5, s=15)
axes[0].set_title(f'Naive Bayes (Gaussian)\nTrain accuracy: {nb.score(X_compare, y_compare):.2f}')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')
axes[0].grid(True, alpha=0.3)

# k-NN
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_compare, y_compare)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[1].contourf(xx, yy, Z, alpha=0.3, cmap='RdBu')
axes[1].scatter(X_compare[y_compare==0, 0], X_compare[y_compare==0, 1], c='blue', edgecolors='k', alpha=0.5, s=15)
axes[1].scatter(X_compare[y_compare==1, 0], X_compare[y_compare==1, 1], c='red', edgecolors='k', alpha=0.5, s=15)
axes[1].set_title(f'k-NN (k=7)\nTrain accuracy: {knn.score(X_compare, y_compare):.2f}')
axes[1].set_xlabel('Feature 1')
axes[1].set_ylabel('Feature 2')
axes[1].grid(True, alpha=0.3)

plt.suptitle('Naive Bayes vs k-NN: Probabilistic vs Distance-Based', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Naive Bayes: smooth, quadratic boundary (one axis-aligned Gaussian per class)")
print("k-NN: flexible, adapts to local data density")
print("\nBoth are useful baselines — always try simple models first!")

---

## 6. Model Evaluation

### Intuitive Explanation

Training accuracy is meaningless if the model memorizes the data. We need to evaluate on data the model has **never seen**. This section covers the essential toolkit for honest model evaluation.

### Cross-Validation

Instead of a single train/test split (which depends on which points ended up in which set), **k-fold cross-validation** rotates through K different splits and averages the results:

1. Split data into K equal folds
2. For each fold: train on K-1 folds, test on the held-out fold
3. Average the K test scores

**What this means:** Cross-validation gives a more reliable estimate of model performance by testing on every data point exactly once.
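
The three steps above can be written as an explicit loop. This is a minimal sketch on synthetic data; the `_cv` names are illustrative. It also shows that `cross_val_score` performs the same loop when given the same splitter. Note that when `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds by default, which is why the splitter is passed explicitly here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_cv, y_cv = make_classification(n_samples=200, n_features=10, n_informative=5,
                                 random_state=0)

# Step 1: split the data into K folds
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Step 2: train on K-1 folds, test on the held-out fold
fold_scores = []
for train_idx, test_idx in kf.split(X_cv):
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    fold_scores.append(model.score(X_cv[test_idx], y_cv[test_idx]))
fold_scores = np.array(fold_scores)

# Step 3: average the K test scores
print(f"Manual 5-fold CV:  {fold_scores.mean():.4f} +/- {fold_scores.std():.4f}")

# The same procedure via the library helper, using the identical splitter
auto_scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0),
                              X_cv, y_cv, cv=kf)
print(f"cross_val_score:   {auto_scores.mean():.4f} +/- {auto_scores.std():.4f}")
```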

### The Confusion Matrix

For classification, accuracy alone is not enough. The confusion matrix breaks down predictions into four categories:

|  | Predicted Positive | Predicted Negative |
|--|-------------------|-------------------|
| **Actually Positive** | True Positive (TP) | False Negative (FN) |
| **Actually Negative** | False Positive (FP) | True Negative (TN) |

From this matrix, we derive:

| Metric | Formula | Intuition | Use When |
|--------|---------|-----------|----------|
| **Accuracy** | (TP+TN) / Total | Overall correctness | Classes are balanced |
| **Precision** | TP / (TP+FP) | "Of predicted positives, how many are correct?" | Cost of false alarms is high |
| **Recall** | TP / (TP+FN) | "Of actual positives, how many did we catch?" | Cost of missing positives is high |
| **F1 Score** | 2 * (P*R) / (P+R) | Harmonic mean of precision and recall | Need to balance both |
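
As a worked example of the formulas in the table, the sketch below plugs in hypothetical counts (40 TP, 10 FN, 5 FP, 45 TN) and checks the results against scikit-learn. One detail to keep in mind: `confusion_matrix` orders labels ascending, so its layout is `[[TN, FP], [FN, TP]]`, the reverse of the positive-first table above.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

tp, fn, fp, tn = 40, 10, 5, 45   # hypothetical counts for illustration

accuracy = (tp + tn) / (tp + tn + fp + fn)            # 85 / 100
precision = tp / (tp + fp)                            # 40 / 45
recall = tp / (tp + fn)                               # 40 / 50
f1 = 2 * precision * recall / (precision + recall)    # equals 2TP / (2TP + FP + FN)

# Build label arrays that produce exactly those counts
y_true_toy = np.array([1] * (tp + fn) + [0] * (fp + tn))
y_pred_toy = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)

# sklearn sorts labels ascending, so ravel() yields tn, fp, fn, tp
tn_sk, fp_sk, fn_sk, tp_sk = confusion_matrix(y_true_toy, y_pred_toy).ravel()

print(f"Accuracy:  {accuracy:.4f}  (sklearn: {accuracy_score(y_true_toy, y_pred_toy):.4f})")
print(f"Precision: {precision:.4f}  (sklearn: {precision_score(y_true_toy, y_pred_toy):.4f})")
print(f"Recall:    {recall:.4f}  (sklearn: {recall_score(y_true_toy, y_pred_toy):.4f})")
print(f"F1:        {f1:.4f}  (sklearn: {f1_score(y_true_toy, y_pred_toy):.4f})")
```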

In [None]:
# Cross-validation comparison of all models
X_eval, y_eval = make_classification(n_samples=500, n_features=10, n_informative=5,
                                      n_redundant=2, random_state=42)

models_eval = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB(),
}

cv_results = {}
print(f"{'Model':<25} {'CV Mean':>10} {'CV Std':>10} {'Min':>10} {'Max':>10}")
print('-' * 70)

for name, model in models_eval.items():
    scores = cross_val_score(model, X_eval, y_eval, cv=5, scoring='accuracy')
    cv_results[name] = scores
    print(f"{name:<25} {scores.mean():>10.4f} {scores.std():>10.4f} {scores.min():>10.4f} {scores.max():>10.4f}")

# Visualize
plt.figure(figsize=(10, 6))
model_names = list(cv_results.keys())
bp = plt.boxplot(list(cv_results.values()), patch_artist=True)

colors_box = ['#3498db', '#2ecc71', '#e74c3c', '#9b59b6', '#f39c12', '#1abc9c']
for patch, color in zip(bp['boxes'], colors_box):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

plt.ylabel('Accuracy (5-fold CV)')
plt.title('Cross-Validation Comparison: All Classical ML Models')
# Boxes sit at x = 1..n. Labelling through xticks works across Matplotlib versions
# (the boxplot `labels` keyword was renamed `tick_labels` in Matplotlib 3.9).
plt.xticks(range(1, len(model_names) + 1), model_names, rotation=30, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

In [None]:
# Confusion Matrix and Classification Report
X_train_eval, X_test_eval, y_train_eval, y_test_eval = train_test_split(
    X_eval, y_eval, test_size=0.3, random_state=42)

# Train a Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_eval, y_train_eval)
y_pred = rf.predict(X_test_eval)

# Confusion matrix
cm = confusion_matrix(y_test_eval, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
im = axes[0].imshow(cm, interpolation='nearest', cmap='Blues')
axes[0].set_title('Confusion Matrix (Counts)', fontsize=13)
axes[0].set_xlabel('Predicted Label')
axes[0].set_ylabel('True Label')
axes[0].set_xticks([0, 1])
axes[0].set_yticks([0, 1])
axes[0].set_xticklabels(['Negative (0)', 'Positive (1)'])
axes[0].set_yticklabels(['Negative (0)', 'Positive (1)'])

# Add text annotations
for i in range(2):
    for j in range(2):
        text_color = 'white' if cm[i, j] > cm.max() / 2 else 'black'
        axes[0].text(j, i, f'{cm[i, j]}', ha='center', va='center', 
                    fontsize=20, color=text_color, fontweight='bold')

# Normalized (percentages)
cm_norm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
im2 = axes[1].imshow(cm_norm, interpolation='nearest', cmap='Blues', vmin=0, vmax=1)
axes[1].set_title('Confusion Matrix (Normalized)', fontsize=13)
axes[1].set_xlabel('Predicted Label')
axes[1].set_ylabel('True Label')
axes[1].set_xticks([0, 1])
axes[1].set_yticks([0, 1])
axes[1].set_xticklabels(['Negative (0)', 'Positive (1)'])
axes[1].set_yticklabels(['Negative (0)', 'Positive (1)'])

for i in range(2):
    for j in range(2):
        text_color = 'white' if cm_norm[i, j] > 0.5 else 'black'
        axes[1].text(j, i, f'{cm_norm[i, j]:.2%}', ha='center', va='center', 
                    fontsize=16, color=text_color, fontweight='bold')

plt.tight_layout()
plt.show()

# Classification report
print("\nClassification Report:")
print(classification_report(y_test_eval, y_pred, target_names=['Negative', 'Positive']))

In [None]:
# ROC Curves: Compare multiple models
# Need probability outputs, so we use models that support predict_proba
roc_models = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Naive Bayes': GaussianNB(),
    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
}

plt.figure(figsize=(10, 6))

colors_roc = ['blue', 'red', 'green', 'orange']
for (name, model), color in zip(roc_models.items(), colors_roc):
    model.fit(X_train_eval, y_train_eval)
    y_prob = model.predict_proba(X_test_eval)[:, 1]
    fpr, tpr, _ = roc_curve(y_test_eval, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, color=color, linewidth=2, label=f'{name} (AUC = {roc_auc:.3f})')

# Random baseline
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5, label='Random (AUC = 0.500)')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves: Comparing Model Performance')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()

print("Reading ROC curves:")
print("- Upper-left corner is perfect (all true positives, no false positives)")
print("- Diagonal line = random guessing")
print("- AUC (Area Under Curve): 1.0 = perfect, 0.5 = random")
print("- Higher AUC = better model at discriminating between classes")

### Deep Dive: When to Use What Metric

Choosing the right metric is one of the most important — and most overlooked — decisions in ML:

| Scenario | Best Metric | Why |
|----------|------------|-----|
| Balanced classes, equal costs | **Accuracy** | Simple and interpretable |
| Imbalanced classes | **F1 Score** or **AUC-ROC** | Accuracy is misleading (predicting majority class gets high accuracy) |
| Spam filter (cost of false alarm) | **Precision** | You don't want real emails in spam |
| Medical screening (cost of missing) | **Recall** | You don't want to miss a disease |
| Ranking/recommendation | **AUC-ROC** | Cares about ordering, not absolute threshold |
| Information retrieval | **Precision@K** | Only top K results matter |
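
The spam-filter and medical-screening rows pull in opposite directions, and in practice the lever is often the decision threshold applied to predicted probabilities. The sketch below is illustrative only: the data is synthetic and the thresholds 0.2, 0.5 and 0.8 are arbitrary choices. Raising the threshold can never increase recall; precision usually rises, but that is not guaranteed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_thr, y_thr = make_classification(n_samples=1000, n_features=10, n_informative=5,
                                   n_redundant=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_thr, y_thr, test_size=0.3, random_state=0)

clf_thr = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba_pos = clf_thr.predict_proba(X_te)[:, 1]   # estimated P(class = 1)

thresholds = [0.2, 0.5, 0.8]
precisions, recalls = [], []
for t in thresholds:
    pred_t = (proba_pos >= t).astype(int)       # predict positive only above the threshold
    precisions.append(precision_score(y_te, pred_t, zero_division=0))
    recalls.append(recall_score(y_te, pred_t))
    print(f"threshold={t:.1f}  precision={precisions[-1]:.3f}  recall={recalls[-1]:.3f}")
```

A screening-style application would lean toward the low threshold (high recall); a spam-filter-style application would lean toward the high one (high precision).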

#### Common Misconceptions

| Misconception | Reality |
|---------------|--------|
| High accuracy = good model | On imbalanced data, always predicting the majority class gives high accuracy |
| AUC-ROC is always best | AUC-ROC can be misleading with severe class imbalance; use Precision-Recall AUC instead |
| One metric tells the whole story | Always look at multiple metrics together |
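
The first two misconceptions are easy to reproduce. The sketch below builds a synthetic dataset with roughly 5% positives (an assumption chosen for illustration) and compares a classifier that always predicts the majority class with a Random Forest. `average_precision_score` is used here as the Precision-Recall summary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X_imb, y_imb = make_classification(n_samples=2000, n_features=10, n_informative=5,
                                   weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_imb, y_imb, test_size=0.3,
                                          stratify=y_imb, random_state=0)

models_imb = {
    'Always majority': DummyClassifier(strategy='most_frequent'),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

results_imb = {}
for name, model in models_imb.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    score = model.predict_proba(X_te)[:, 1]   # score for the positive (minority) class
    results_imb[name] = {
        'accuracy': accuracy_score(y_te, pred),
        'f1': f1_score(y_te, pred, zero_division=0),
        'roc_auc': roc_auc_score(y_te, score),
        'pr_auc': average_precision_score(y_te, score),
    }

print(f"{'Model':<18} {'Acc':>7} {'F1':>7} {'ROC-AUC':>9} {'PR-AUC':>8}")
for name, m in results_imb.items():
    print(f"{name:<18} {m['accuracy']:>7.3f} {m['f1']:>7.3f} "
          f"{m['roc_auc']:>9.3f} {m['pr_auc']:>8.3f}")
```

The majority-class baseline scores high on accuracy while its F1 is zero, which is the gap the table warns about.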

---

## 7. When NOT to Use Deep Learning

### Practical Decision Guide

Deep learning is powerful but not always the right tool. Classical ML wins in many real-world scenarios:

In [None]:
# Demonstration: Small data — where classical ML shines
from sklearn.neural_network import MLPClassifier

sample_sizes = [30, 50, 100, 200, 500, 1000]
results_rf = []
results_gb = []
results_nn = []

for n_samples in sample_sizes:
    # Generate data
    X_size, y_size = make_classification(n_samples=n_samples, n_features=10, 
                                         n_informative=5, n_redundant=2, random_state=42)
    
    # Random Forest
    rf_scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=42), 
                                X_size, y_size, cv=5)
    results_rf.append(rf_scores.mean())
    
    # Gradient Boosting
    gb_scores = cross_val_score(GradientBoostingClassifier(n_estimators=100, random_state=42),
                                X_size, y_size, cv=5)
    results_gb.append(gb_scores.mean())
    
    # Neural Network (small MLP)
    nn_scores = cross_val_score(MLPClassifier(hidden_layer_sizes=(50, 25), max_iter=500, 
                                              random_state=42),
                                X_size, y_size, cv=5)
    results_nn.append(nn_scores.mean())

plt.figure(figsize=(10, 6))
plt.plot(sample_sizes, results_rf, 'g-o', linewidth=2, markersize=8, label='Random Forest')
plt.plot(sample_sizes, results_gb, 'r-s', linewidth=2, markersize=8, label='Gradient Boosting')
plt.plot(sample_sizes, results_nn, 'b-^', linewidth=2, markersize=8, label='Neural Network (MLP)')

plt.xlabel('Number of Training Samples')
plt.ylabel('Cross-Validation Accuracy')
plt.title('Sample Size vs Model Performance:\nClassical ML Excels with Small Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xscale('log')

plt.annotate('Classical ML wins here', xy=(50, results_rf[1]), xytext=(100, results_rf[1] - 0.05),
             arrowprops=dict(arrowstyle='->', color='green'),
             fontsize=11, color='green')

plt.show()

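print("Key insight: with only a few hundred samples, Random Forests and Gradient Boosting")
print("are often as good as or better than a small neural network (check the curves above;")
print("results vary by dataset). Neural nets need more data to learn useful features,")
print("while tree ensembles make efficient use of informative, hand-crafted features.")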

### The Decision Framework

| Situation | Use Classical ML | Use Deep Learning |
|-----------|-----------------|------------------|
| **Data size** | < 10K samples | > 100K samples |
| **Data type** | Tabular (rows and columns) | Images, text, audio, video |
| **Features** | Hand-crafted, meaningful | Raw pixels, tokens, waveforms |
| **Interpretability** | Required (medical, legal, finance) | Not critical |
| **Training budget** | Minutes on a laptop | Hours/days on GPUs |
| **Deployment** | Edge devices, low latency | Server-side, batch processing |
| **Baseline** | ALWAYS start here | After classical ML baseline |

### The Real-World Truth About Tabular Data

As of 2024, **gradient boosted trees (XGBoost, LightGBM, CatBoost) still outperform deep learning on most tabular datasets**. This is a well-studied phenomenon:

1. Tabular data has **heterogeneous features** (mix of types, scales, meanings)
2. Tree-based models handle this naturally; neural nets need careful preprocessing
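3. Trees are **invariant to monotonic rescaling** of individual features and split on one column at a time, which suits columns that each carry their own meaning; standard neural nets are sensitive to feature scale and treat rotated (mixed) features the same as the originals, which is a poor match for such data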
4. Tabular datasets are typically smaller, favoring classical ML

**Rule of thumb:** If your data fits in a spreadsheet, try XGBoost before reaching for a neural network.
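
Point 2 above (trees cope with heterogeneous feature scales without preprocessing) can be checked directly. In this minimal sketch one column is multiplied by 1024, as if it had been recorded in different units. The data is synthetic and the helper `held_out_predictions` is hypothetical. The decision tree's predictions should be unaffected because splits depend only on the ordering of values within a column, while k-NN's distances, and therefore its predictions, can change.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X_raw, y_raw = make_classification(n_samples=400, n_features=5, n_informative=3,
                                   n_redundant=0, random_state=0)
X_rescaled = X_raw.copy()
X_rescaled[:, 0] *= 1024.0   # same information, different units for column 0

train, test = slice(0, 300), slice(300, 400)

def held_out_predictions(make_model, X):
    """Fit a fresh model on the training rows and predict the held-out rows."""
    model = make_model().fit(X[train], y_raw[train])
    return model.predict(X[test])

def make_tree():
    return DecisionTreeClassifier(max_depth=4, random_state=0)

def make_knn():
    return KNeighborsClassifier(n_neighbors=5)

tree_agreement = (held_out_predictions(make_tree, X_raw)
                  == held_out_predictions(make_tree, X_rescaled)).mean()
knn_agreement = (held_out_predictions(make_knn, X_raw)
                 == held_out_predictions(make_knn, X_rescaled)).mean()

print(f"Decision tree: {tree_agreement:.0%} of predictions unchanged after rescaling")
print(f"k-NN:          {knn_agreement:.0%} of predictions unchanged after rescaling")
```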

---

## Exercises

### Exercise 1: Build a Complete Classification Pipeline

Create a function that takes a dataset and compares multiple classifiers using cross-validation, returning the best model name and its score.

In [None]:
# EXERCISE 1: Complete classification pipeline
def find_best_classifier(X, y, cv=5):
    """
    Compare multiple classifiers using cross-validation and return the best one.
    
    Args:
        X: Feature matrix
        y: Labels
        cv: Number of cross-validation folds
    
    Returns:
        Tuple of (best_model_name, best_cv_score, results_dict)
        where results_dict maps model names to their mean CV scores
    """
    # TODO: Implement this!
    # 1. Scale the features using StandardScaler
    # 2. Define a dictionary of at least 4 models (Decision Tree, Random Forest,
    #    Gradient Boosting, SVM, k-NN, Naive Bayes)
    # 3. Run cross_val_score for each model
    # 4. Return the name and score of the best model, plus all results
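    # Hint: StandardScaler().fit_transform(X) is the simplest way to scale features.
    #    It uses statistics from the held-out folds too; to avoid that leakage, wrap each
    #    model instead: sklearn.pipeline.make_pipeline(StandardScaler(), model)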
    
    pass


# Test
X_ex1, y_ex1 = make_classification(n_samples=500, n_features=10, n_informative=5,
                                    n_redundant=2, random_state=42)

result = find_best_classifier(X_ex1, y_ex1)
if result is not None:
    best_name, best_score, all_results = result
    print(f"Best model: {best_name} (CV score: {best_score:.4f})")
    print(f"\nAll results:")
    for name, score in sorted(all_results.items(), key=lambda x: x[1], reverse=True):
        print(f"  {name}: {score:.4f}")
    
    # Verify
    assert best_score > 0.80, f"Best score should be > 0.80, got {best_score:.4f}"
    assert len(all_results) >= 4, f"Should compare at least 4 models, got {len(all_results)}"
    print(f"\nAll checks passed!")
else:
    print("TODO: Implement find_best_classifier")

### Exercise 2: Clustering Evaluation

Implement the **silhouette score** from scratch. The silhouette score measures how similar a point is to its own cluster vs the nearest neighboring cluster. It ranges from -1 (wrong cluster) to +1 (well-matched).

For each point $i$:
- $a(i)$ = average distance to all other points in the **same** cluster
- $b(i)$ = average distance to all points in the **nearest other** cluster (the other cluster for which this average is smallest)
- $s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$

In [None]:
# EXERCISE 2: Silhouette score from scratch
def silhouette_score_manual(X, labels):
    """
    Calculate the mean silhouette score for a clustering.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        labels: Cluster assignments for each point
    
    Returns:
        Mean silhouette score (float between -1 and 1)
    """
    # TODO: Implement this!
    # For each point i:
    #   1. Compute a(i): mean distance to the OTHER points in the same cluster (exclude i itself)
    #   2. Compute b(i): smallest mean distance to the points of any OTHER cluster
    #   3. Compute s(i) = (b(i) - a(i)) / max(a(i), b(i))
    # Return the mean of all s(i)
    # Hint: Use np.linalg.norm(X[i] - X[j]) for distance
    # Hint: Handle edge case where a cluster has only 1 point (s(i) = 0)
    
    pass


# Test
from sklearn.metrics import silhouette_score as sklearn_silhouette

X_ex2, y_ex2 = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)
km_ex2 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_ex2 = km_ex2.fit_predict(X_ex2)

manual_score = silhouette_score_manual(X_ex2, labels_ex2)
if manual_score is not None:
    sklearn_score = sklearn_silhouette(X_ex2, labels_ex2)
    print(f"Your silhouette score: {manual_score:.4f}")
    print(f"sklearn silhouette score: {sklearn_score:.4f}")
    print(f"Correct: {np.allclose(manual_score, sklearn_score, atol=1e-4)}")
else:
    print("TODO: Implement silhouette_score_manual")

### Exercise 3: Decision Boundary Explorer

Create a function that plots the decision boundary of any sklearn classifier on 2D data. Then use it to compare how different classifiers handle the `make_moons` dataset.

In [None]:
# EXERCISE 3: Decision boundary explorer
def plot_decision_boundary(clf, X, y, ax=None, title=''):
    """
    Plot the decision boundary of a fitted classifier.
    
    Args:
        clf: A fitted sklearn classifier
        X: Feature matrix of shape (n_samples, 2); exactly two features are required
        y: Labels
        ax: Matplotlib axes (creates new figure if None)
        title: Plot title
    """
    # TODO: Implement this!
    # 1. Create a mesh grid spanning the data range (with some padding)
    # 2. Predict on every point in the mesh
    # 3. Use contourf to color the regions
    # 4. Scatter plot the actual data points on top
    # 5. Add title with the model's accuracy on the (X, y) passed in
    # Hint: Use np.meshgrid and clf.predict(np.c_[xx.ravel(), yy.ravel()])
    
    pass


# Test: Compare 6 classifiers on moons data
X_ex3, y_ex3 = make_moons(n_samples=300, noise=0.2, random_state=42)
X_ex3_scaled = StandardScaler().fit_transform(X_ex3)

classifiers_ex3 = [
    ('Decision Tree', DecisionTreeClassifier(max_depth=5, random_state=42)),
    ('Random Forest', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('Gradient Boosting', GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ('SVM (RBF)', SVC(kernel='rbf', random_state=42)),
    ('k-NN (k=5)', KNeighborsClassifier(n_neighbors=5)),
    ('Naive Bayes', GaussianNB()),
]

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for idx, (name, clf) in enumerate(classifiers_ex3):
    row, col = idx // 3, idx % 3
    clf.fit(X_ex3_scaled, y_ex3)
    plot_decision_boundary(clf, X_ex3_scaled, y_ex3, ax=axes[row, col], title=name)

plt.suptitle('Decision Boundary Comparison: 6 Classifiers on Moons Data', 
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Compare the boundaries — which models handle the nonlinear shape best?")

### Exercise 4: Feature Importance Analysis

Train a Random Forest on a classification dataset and analyze which features matter most. This is one of the biggest advantages of tree-based models — built-in feature importance.

In [None]:
# EXERCISE 4: Feature importance
def analyze_feature_importance(X, y, feature_names=None):
    """
    Train a Random Forest and visualize feature importances.
    
    Args:
        X: Feature matrix
        y: Labels
        feature_names: List of feature names (optional)
    
    Returns:
        Sorted list of (feature_name, importance) tuples
    """
    # TODO: Implement this!
    # 1. Train a RandomForestClassifier on the data
    # 2. Extract .feature_importances_
    # 3. Sort features by importance (descending)
    # 4. Create a horizontal bar chart
    # 5. Return sorted (name, importance) tuples
    # Hint: If feature_names is None, use ["Feature 0", "Feature 1", ...]
    
    pass


# Test
X_ex4, y_ex4 = make_classification(n_samples=500, n_features=10, n_informative=3,
                                    n_redundant=2, random_state=42)

feature_names_ex4 = [f'Feature {i}' for i in range(10)]
result = analyze_feature_importance(X_ex4, y_ex4, feature_names_ex4)

if result is not None:
    print("Feature importance ranking:")
    for name, importance in result:
        bar = '#' * int(importance * 100)
        print(f"  {name:>12}: {importance:.4f} {bar}")
    
    # Verify: top 3 features should capture most importance
    top3_importance = sum(imp for _, imp in result[:3])
    print(f"\nTop 3 features capture {top3_importance:.1%} of total importance")
    assert top3_importance > 0.4, "Top 3 informative features should capture >40% importance"
    print("Check passed!")
else:
    print("TODO: Implement analyze_feature_importance")

---

## Summary

### Key Concepts

- **Decision Trees** split data using yes/no questions, choosing splits that maximize information gain. Simple but prone to overfitting.
- **Random Forests** (bagging) combine many trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split. Averaging them reduces variance while keeping bias low.
- **Gradient Boosting** builds trees sequentially, each correcting previous errors. Often the highest accuracy on tabular data.
- **SVMs** find the maximum-margin decision boundary. The kernel trick handles nonlinear data without explicit feature transformation.
- **k-Means** clustering iteratively assigns points to nearest centers. Needs K specified upfront, assumes spherical clusters.
- **DBSCAN** finds clusters by density, handles arbitrary shapes, and automatically detects noise points.
- **k-NN** classifies by majority vote of nearest neighbors. Simple but slow at prediction time.
- **Naive Bayes** uses Bayes' theorem with feature independence assumption. Extremely fast, great for text.
- **Model evaluation** requires cross-validation, confusion matrices, and appropriate metrics for the problem.

### Connection to Deep Learning

| Classical ML Concept | Deep Learning Connection |
|---------------------|-------------------------|
| Decision tree splits | Feature thresholding in neural network activations (ReLU) |
| Ensemble averaging | Dropout as implicit ensemble, model ensembles in production |
| Gradient boosting residuals | Residual connections in ResNets |
| SVM kernel trick | Neural networks as learned feature mappings |
| SVM maximum margin | Contrastive loss, triplet loss in embedding learning |
| k-Means centroids | Learned prototypes in prototype networks, VQ-VAE codebook |
| Cross-validation | Standard evaluation protocol for all ML models |
| Feature importance | Attention weights, gradient-based attribution (Grad-CAM) |
| Bias-variance tradeoff | Underfitting/overfitting in neural network training |
| Confusion matrix & ROC | Same evaluation tools used for deep learning classifiers |

### Algorithm Selection Cheat Sheet

| Your Situation | Best Starting Point | Why |
|---------------|--------------------|----- |
| Tabular data, any size | Gradient Boosting (XGBoost) | State-of-the-art on tabular data |
| Need interpretability | Decision Tree or Logistic Regression | Transparent decision process |
| Small dataset (< 1K) | Random Forest or SVM | Robust with limited data |
| High-dimensional sparse data | Naive Bayes or Linear SVM | Fast, handles many features |
| No labels (unsupervised) | k-Means or DBSCAN | Discover natural groupings |
| Quick baseline | k-NN or Naive Bayes | Minimal tuning required |
| Images, text, audio | Deep Learning | Feature learning is key |

### Checklist

- [ ] I can explain how a decision tree chooses splits using Gini impurity or information gain
- [ ] I understand why ensembles (bagging, boosting) outperform single models
- [ ] I can explain the SVM maximum margin concept and the kernel trick
- [ ] I can apply k-means and DBSCAN, and know when each is appropriate
- [ ] I know the difference between precision, recall, F1, and when to use each
- [ ] I can read a confusion matrix and ROC curve
- [ ] I use cross-validation instead of single train/test splits
- [ ] I know when to use classical ML vs deep learning
- [ ] I always start with a simple baseline before trying complex models

---

## Next Steps

In **Part 1.5: Optimization & Linear Programming**, we'll study the mathematical machinery that powers model training:
- Gradient descent and its variants (SGD, Adam, RMSprop)
- Convex vs non-convex optimization
- Constrained optimization and Lagrange multipliers
- Learning rate schedules and convergence

Understanding optimization is critical because **every ML model** — from the simplest linear regression to the largest transformer — is trained by minimizing a loss function. The optimization algorithms we'll study next are the engine that makes learning possible.