# Module 1: AI & Machine Learning Fundamentals

## Overview

The field of Artificial Intelligence has evolved dramatically over the past several decades:

- **Rule-based systems** (1950s-1980s): Hand-coded if/else logic, expert systems
- **Machine Learning** (1990s-2010s): Algorithms that learn patterns from data
- **Deep Learning** (2012-present): Neural networks with many layers, fueled by GPUs and big data
- **Generative AI** (2020-present): Models that generate text, images, code, and more (GPT, DALL-E, Stable Diffusion)

This notebook covers the foundational ML concepts you need before diving into embeddings, RAG, and agents in later modules.

### What you'll learn

1. The AI landscape and when to use ML vs traditional programming
2. Supervised, unsupervised, and reinforcement learning paradigms
3. Linear regression from scratch with gradient descent
4. How learning rate affects convergence
5. Logistic regression for classification
6. Evaluation metrics: accuracy, precision, recall, F1, confusion matrix
7. How train/test split ratios affect model performance

## 1. Setup

In [None]:
!pip install -q scikit-learn matplotlib numpy

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, ConfusionMatrixDisplay, classification_report
)
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

print("All imports successful!")
print(f"NumPy version: {np.__version__}")

## 2. The AI Landscape

### Evolution of AI

```
1950s          1980s          2000s          2012           2017           2020+
  |              |              |              |              |              |
  v              v              v              v              v              v

Rule-Based --> Expert -----> Classical ----> Deep --------> Transformers -> Generative
Systems        Systems        ML             Learning       (Attention)     AI
(if/else)      (knowledge     (SVM, Trees,   (CNNs, RNNs)   (BERT, GPT)    (ChatGPT,
               bases)         Regression)                                   DALL-E)
```

### When to use ML vs Traditional Programming

| Criteria | Traditional Programming | Machine Learning |
|----------|------------------------|------------------|
| Rules are clear and finite | Best choice | Overkill |
| Pattern is complex/unknown | Very difficult | Best choice |
| Data is abundant | Not needed | Required |
| Problem changes over time | Must rewrite rules | Model adapts with new data |
| Interpretability needed | Easy to explain | Can be a black box |
| Example | Tax calculator | Spam detection |

**Key insight:** Use ML when you cannot easily write explicit rules, but you have data that contains the patterns you want to capture.
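
A minimal sketch of that contrast, under stated assumptions: `tax_owed` uses a made-up flat rate and allowance (not any real tax code), and the spam features (link count, exclamation-mark count) and their labels are invented toy data. The first function encodes a known rule by hand; the second model infers a decision rule from labeled examples.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: the rule is known, so we write it down directly.
# The 20% rate and 10,000 allowance are hypothetical values for illustration.
def tax_owed(income, rate=0.20, allowance=10_000):
    taxable = max(income - allowance, 0)
    return taxable * rate

# Machine learning: we do not write the rule; we supply labeled examples.
# Columns: [number of links, number of exclamation marks] (toy features).
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 4], [7, 2], [6, 6]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # 0 = not spam, 1 = spam

spam_model = DecisionTreeClassifier(random_state=0)
spam_model.fit(X_toy, y_toy)

print("Tax owed on 30,000:", tax_owed(30_000))
print("Predicted labels:", spam_model.predict([[8, 5], [0, 0]]))
```

If the tax rule changes, the function has to be edited by hand; if spam patterns change, the model can be refit on fresh labeled examples.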

## 3. Supervised vs Unsupervised vs Reinforcement Learning

### The Three Main Paradigms

**Supervised Learning** learns from labeled examples (input-output pairs).
- *"Here are 10,000 emails labeled as spam or not spam. Learn the pattern."*
- Tasks: Classification (discrete labels), Regression (continuous values)
- Examples: Spam detection, house price prediction, image recognition

**Unsupervised Learning** finds hidden patterns in unlabeled data.
- *"Here are 10,000 customer profiles. Find natural groupings."*
- Tasks: Clustering, dimensionality reduction, anomaly detection
- Examples: Customer segmentation, topic modeling, data compression

**Reinforcement Learning** learns by trial and error with rewards/penalties.
- *"Play this game millions of times. Maximize your score."*
- Tasks: Sequential decision making, control, game playing
- Examples: AlphaGo, robotics, self-driving cars, RLHF for LLMs

| Aspect | Supervised | Unsupervised | Reinforcement |
|--------|-----------|-------------|---------------|
| Data | Labeled | Unlabeled | Reward signal |
| Goal | Predict output | Find structure | Maximize reward |
| Feedback | Direct (correct answer) | None | Delayed (reward) |
| Examples | Regression, Classification | Clustering, PCA | Game AI, Robotics |
| Analogy | Studying with answer key | Exploring on your own | Learning by doing |

In this module, we focus on **supervised learning** -- the most widely used paradigm and the foundation for understanding modern AI systems.
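
To make the difference between the first two paradigms concrete, here is a small sketch that feeds the same features to a supervised and an unsupervised algorithm. The choice of `KMeans` with three clusters and the variable names are assumptions made for illustration; they are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_iris(return_X_y=True)

# Supervised: the labels y_demo are part of the training signal.
clf = LogisticRegression(max_iter=500)
clf.fit(X_demo, y_demo)
print("Supervised training accuracy:", clf.score(X_demo, y_demo))

# Unsupervised: only the features are given; the algorithm proposes groupings.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = km.fit_predict(X_demo)
print("Cluster sizes:", np.bincount(cluster_ids))
```

Note that the cluster IDs are arbitrary: cluster 0 need not correspond to class 0, because the clustering never saw the labels.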

## 4. Linear Regression from Scratch

Linear regression models the relationship between input features and a continuous output:

$$\hat{y} = X \cdot w + b$$

Where:
- $X$ = input features (e.g., house size, number of bedrooms)
- $w$ = weights (learned parameters)
- $b$ = bias term
- $\hat{y}$ = predicted output (e.g., house price)

We minimize the **Mean Squared Error (MSE)** loss:

$$L = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
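
As a worked example of the two formulas above, the sketch below evaluates the prediction and the MSE for three toy houses. All numbers, including the candidate weights and bias, are made up for illustration.

```python
import numpy as np

# Three toy houses: columns are [size in 1000 sqft, bedrooms] (made-up values).
X_toy = np.array([[1.0, 2.0],
                  [2.0, 3.0],
                  [3.0, 4.0]])
y_toy = np.array([250.0, 400.0, 520.0])   # prices in $1000s (made-up)

w_toy = np.array([100.0, 20.0])  # candidate weights, chosen by hand
b_toy = 100.0                    # candidate bias, chosen by hand

y_hat = X_toy @ w_toy + b_toy          # predictions: X . w + b
mse = np.mean((y_toy - y_hat) ** 2)    # average squared residual

print("Predictions:", y_hat)
print("MSE:", mse)
```

The predictions are 240, 360, and 480, so the residuals are 10, 40, and 40, and the MSE is (100 + 1600 + 1600) / 3 = 1100. Training means searching for the weights and bias that make this number as small as possible.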

### Generate Synthetic Housing Data

In [None]:
# Generate synthetic housing data
np.random.seed(42)
n_samples = 200

# Features: house size (sqft) and number of bedrooms
size = np.random.uniform(600, 4000, n_samples)       # square feet
bedrooms = np.random.randint(1, 6, n_samples)         # 1-5 bedrooms

# True relationship: price = 150*size + 20000*bedrooms + 50000 + noise
noise = np.random.normal(0, 30000, n_samples)
price = 150 * size + 20000 * bedrooms + 50000 + noise

# Combine features into matrix X
X = np.column_stack([size, bedrooms])
y = price

print(f"Dataset shape: X={X.shape}, y={y.shape}")
print(f"\nFirst 5 samples:")
print(f"{'Size (sqft)':<15} {'Bedrooms':<12} {'Price ($)':>12}")
print("-" * 40)
for i in range(5):
    print(f"{X[i, 0]:<15.0f} {X[i, 1]:<12.0f} {y[i]:>12,.0f}")

In [None]:
# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Price vs Size
axes[0].scatter(X[:, 0], y, alpha=0.5, c='steelblue', edgecolors='k', linewidth=0.5)
axes[0].set_xlabel('House Size (sqft)', fontsize=12)
axes[0].set_ylabel('Price ($)', fontsize=12)
axes[0].set_title('House Price vs Size', fontsize=14)

# Price vs Bedrooms
axes[1].scatter(X[:, 1], y, alpha=0.5, c='coral', edgecolors='k', linewidth=0.5)
axes[1].set_xlabel('Number of Bedrooms', fontsize=12)
axes[1].set_ylabel('Price ($)', fontsize=12)
axes[1].set_title('House Price vs Bedrooms', fontsize=14)

plt.tight_layout()
plt.show()

### Gradient Descent

Gradient descent is an optimization algorithm that iteratively updates parameters to minimize the loss function.

The update rules are:

$$w = w - \alpha \cdot \frac{\partial L}{\partial w}$$

$$b = b - \alpha \cdot \frac{\partial L}{\partial b}$$

Where $\alpha$ is the **learning rate** -- a hyperparameter that controls the step size.

The gradients for MSE loss are:

$$\frac{\partial L}{\partial w} = -\frac{2}{n} X^T (y - \hat{y})$$

$$\frac{\partial L}{\partial b} = -\frac{2}{n} \sum (y - \hat{y})$$
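
To see where these expressions come from, here is a short derivation sketch. Write $\hat{y}_i = x_i^T w + b$, where $x_i$ is the $i$-th row of $X$ and $x_{ij}$ is its $j$-th entry (this index notation is introduced here only for the derivation). Applying the chain rule to the MSE loss gives, for each weight $w_j$:

$$\frac{\partial L}{\partial w_j} = \frac{1}{n} \sum_{i=1}^{n} 2 (y_i - \hat{y}_i) \cdot \left(-\frac{\partial \hat{y}_i}{\partial w_j}\right) = -\frac{2}{n} \sum_{i=1}^{n} x_{ij} (y_i - \hat{y}_i)$$

Stacking these components into a vector yields $-\frac{2}{n} X^T (y - \hat{y})$. The bias follows the same steps with $\partial \hat{y}_i / \partial b = 1$:

$$\frac{\partial L}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} 2 (y_i - \hat{y}_i) \cdot (-1) = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$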

### Exercise 1: Implement Gradient Descent for Linear Regression

Your task: implement the `gradient_descent` function that trains a linear regression model from scratch.

**Hints:**
- Forward pass: compute predictions using $\hat{y} = X \cdot w + b$
- Compute the MSE loss
- Compute gradients of the loss with respect to weights and bias
- Update weights and bias using the gradients and learning rate

In [None]:
def gradient_descent(X, y, lr=0.01, epochs=1000):
    """
    Implement linear regression using gradient descent.
    
    Parameters:
        X: numpy array of shape (n_samples, n_features) - input features
        y: numpy array of shape (n_samples,) - target values
        lr: float - learning rate
        epochs: int - number of iterations
    
    Returns:
        weights: numpy array of shape (n_features,) - learned weights
        bias: float - learned bias
        loss_history: list of float - MSE loss at each epoch
    """
    n_samples, n_features = X.shape
    
    # TODO: Initialize weights to zeros and bias to 0
    weights = None
    bias = None
    loss_history = []
    
    for epoch in range(epochs):
        # TODO: Forward pass - compute predictions
        y_pred = None
        
        # TODO: Compute MSE loss
        loss = None
        loss_history.append(loss)
        
        # TODO: Compute gradients
        dw = None  # gradient with respect to weights
        db = None  # gradient with respect to bias
        
        # TODO: Update parameters
        weights = None
        bias = None
    
    return weights, bias, loss_history

# Test your implementation: uncomment the lines below once the TODOs are complete
# We standardize features first (zero mean, unit variance) for stable gradient descent
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# weights, bias, loss_history = gradient_descent(X_scaled, y, lr=0.01, epochs=1000)
# print(f"Learned weights: {weights}")
# print(f"Learned bias: {bias:.2f}")
# print(f"Final loss: {loss_history[-1]:.2f}")

### Solution

In [None]:
def gradient_descent(X, y, lr=0.01, epochs=1000):
    """
    Implement linear regression using gradient descent.
    
    Parameters:
        X: numpy array of shape (n_samples, n_features) - input features
        y: numpy array of shape (n_samples,) - target values
        lr: float - learning rate
        epochs: int - number of iterations
    
    Returns:
        weights: numpy array of shape (n_features,) - learned weights
        bias: float - learned bias
        loss_history: list of float - MSE loss at each epoch
    """
    n_samples, n_features = X.shape
    
    # Initialize weights to zeros and bias to 0
    weights = np.zeros(n_features)
    bias = 0.0
    loss_history = []
    
    for epoch in range(epochs):
        # Forward pass - compute predictions
        y_pred = X.dot(weights) + bias
        
        # Compute MSE loss
        error = y - y_pred
        loss = np.mean(error ** 2)
        loss_history.append(loss)
        
        # Compute gradients
        dw = -(2 / n_samples) * X.T.dot(error)
        db = -(2 / n_samples) * np.sum(error)
        
        # Update parameters
        weights = weights - lr * dw
        bias = bias - lr * db
    
    return weights, bias, loss_history

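# Standardize features (zero mean, unit variance) for stable gradient descent;
# with raw sizes in the thousands, lr=0.01 would make the updates diverge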
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train the model
weights, bias, loss_history = gradient_descent(X_scaled, y, lr=0.01, epochs=1000)

print(f"Learned weights: {weights}")
print(f"Learned bias: {bias:,.2f}")
print(f"Final loss: {loss_history[-1]:,.2f}")
print(f"Loss reduction: {loss_history[0]:,.2f} -> {loss_history[-1]:,.2f}")

In [None]:
# Visualize the results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss curve over training iterations
axes[0].plot(loss_history, color='steelblue', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('MSE Loss', fontsize=12)
axes[0].set_title('Training Loss Curve', fontsize=14)
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Plot 2: Predicted vs actual prices (points on the dashed line are perfect predictions)
y_pred = X_scaled.dot(weights) + bias
axes[1].scatter(y, y_pred, alpha=0.5, c='steelblue', edgecolors='k', linewidth=0.5)
# Perfect prediction line
min_val = min(y.min(), y_pred.min())
max_val = max(y.max(), y_pred.max())
axes[1].plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual Price ($)', fontsize=12)
axes[1].set_ylabel('Predicted Price ($)', fontsize=12)
axes[1].set_title('Predictions vs Actual Values', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Compare with sklearn's LinearRegression

In [None]:
# Train sklearn's linear regression for comparison
sklearn_model = LinearRegression()
sklearn_model.fit(X_scaled, y)

print("Comparison of our implementation vs sklearn:")
print(f"{'':>25} {'Ours':>15} {'sklearn':>15}")
print("-" * 55)
print(f"{'Weight (size)':>25} {weights[0]:>15,.2f} {sklearn_model.coef_[0]:>15,.2f}")
print(f"{'Weight (bedrooms)':>25} {weights[1]:>15,.2f} {sklearn_model.coef_[1]:>15,.2f}")
print(f"{'Bias':>25} {bias:>15,.2f} {sklearn_model.intercept_:>15,.2f}")

# Compute R-squared for both
y_pred_ours = X_scaled.dot(weights) + bias
y_pred_sklearn = sklearn_model.predict(X_scaled)

ss_res_ours = np.sum((y - y_pred_ours) ** 2)
ss_res_sklearn = np.sum((y - y_pred_sklearn) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)

r2_ours = 1 - (ss_res_ours / ss_tot)
r2_sklearn = 1 - (ss_res_sklearn / ss_tot)

print(f"{'R-squared':>25} {r2_ours:>15.6f} {r2_sklearn:>15.6f}")
print("\nIf the parameters and R-squared match closely, gradient descent has converged to sklearn's solution.")

## 5. Gradient Descent Visualization

The **learning rate** is one of the most important hyperparameters in ML. Let's see how it affects training:

- **Too small**: Very slow convergence, may not reach optimum in time
- **Just right**: Smooth convergence to the optimum
- **Too large**: Oscillation or divergence -- the loss may explode
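
The three regimes can be reproduced on a deliberately tiny problem. The sketch below runs gradient descent on the toy loss $L(w) = w^2$, which is an assumption chosen for illustration and not the housing loss; the helper name `descend` is likewise hypothetical.

```python
def descend(lr, steps=20, w0=1.0):
    """Run gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2 * w   # equivalent to w = (1 - 2*lr) * w
    return w

for lr in [0.01, 0.4, 0.9, 1.1]:
    print(f"lr={lr:<5} -> w after 20 steps = {descend(lr): .6g}")
```

Each step multiplies `w` by `(1 - 2*lr)`, so the iterates shrink toward the minimum at 0 only when that factor has magnitude below 1. With `lr=0.01` progress is slow, with `lr=0.4` it is fast, with `lr=0.9` the sign flips every step while the magnitude still shrinks (oscillation), and with `lr=1.1` the magnitude grows (divergence). The exact threshold is specific to this toy loss; in general it depends on the curvature of the loss surface.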

In [None]:
# Compare different learning rates
learning_rates = [0.001, 0.01, 0.1]
colors = ['#e74c3c', '#2ecc71', '#3498db']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Loss curves for different learning rates
for lr, color in zip(learning_rates, colors):
    _, _, loss_hist = gradient_descent(X_scaled, y, lr=lr, epochs=500)
    axes[0].plot(loss_hist, label=f'lr={lr}', color=color, linewidth=2)

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('MSE Loss', fontsize=12)
axes[0].set_title('Effect of Learning Rate on Convergence', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].set_yscale('log')
axes[0].grid(True, alpha=0.3)

# Plot 2: Zoomed in on first 100 epochs
for lr, color in zip(learning_rates, colors):
    _, _, loss_hist = gradient_descent(X_scaled, y, lr=lr, epochs=100)
    axes[1].plot(loss_hist, label=f'lr={lr}', color=color, linewidth=2)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('MSE Loss', fontsize=12)
axes[1].set_title('First 100 Epochs (Zoomed In)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].set_yscale('log')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print final losses
print("Final MSE loss after 500 epochs:")
for lr in learning_rates:
    _, _, loss_hist = gradient_descent(X_scaled, y, lr=lr, epochs=500)
    print(f"  lr={lr:<8} -> loss = {loss_hist[-1]:>15,.2f}")

In [None]:
# Gradient descent steps on a 2D loss surface (contour plot)
# For visualization, we use a simple 1-feature linear regression
np.random.seed(42)
X_simple = np.random.uniform(0, 10, 50).reshape(-1, 1)
y_simple = 3 * X_simple.squeeze() + 7 + np.random.normal(0, 2, 50)

# Create a grid of weight and bias values to compute loss surface
w_range = np.linspace(-2, 8, 100)
b_range = np.linspace(-5, 20, 100)
W, B = np.meshgrid(w_range, b_range)

# Compute loss for each (w, b) pair
Loss_surface = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        y_pred = W[i, j] * X_simple.squeeze() + B[i, j]
        Loss_surface[i, j] = np.mean((y_simple - y_pred) ** 2)

# Run gradient descent and track the path
def gradient_descent_1d_track(X, y, lr=0.01, epochs=50):
    """Track the path of gradient descent for 1D linear regression."""
    w = 0.0
    b = 0.0
    path = [(w, b)]
    n = len(X)
    
    for _ in range(epochs):
        y_pred = w * X + b
        error = y - y_pred
        dw = -(2 / n) * np.sum(error * X)
        db = -(2 / n) * np.sum(error)
        w = w - lr * dw
        b = b - lr * db
        path.append((w, b))
    
    return np.array(path)

# Track paths for different learning rates
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, (lr, color) in enumerate(zip([0.001, 0.01, 0.05], colors)):
    path = gradient_descent_1d_track(X_simple.squeeze(), y_simple, lr=lr, epochs=50)
    
    # Contour plot
    cs = axes[idx].contour(W, B, Loss_surface, levels=30, cmap='viridis', alpha=0.7)
    axes[idx].clabel(cs, inline=True, fontsize=7)
    
    # Gradient descent path
    axes[idx].plot(path[:, 0], path[:, 1], 'o-', color=color, markersize=4, linewidth=2,
                   label=f'GD path ({len(path)-1} steps)')
    axes[idx].plot(path[0, 0], path[0, 1], 'ks', markersize=10, label='Start')
    axes[idx].plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
    
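    # Keep the view on the plotted loss surface even if the path diverges
    axes[idx].set_xlim(w_range[0], w_range[-1])
    axes[idx].set_ylim(b_range[0], b_range[-1])
    axes[idx].set_xlabel('Weight (w)', fontsize=12)
    axes[idx].set_ylabel('Bias (b)', fontsize=12)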
    axes[idx].set_title(f'Learning Rate = {lr}', fontsize=14)
    axes[idx].legend(fontsize=9, loc='upper right')

plt.suptitle('Gradient Descent Steps on Loss Surface', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

## 6. Logistic Regression for Classification

While linear regression predicts continuous values, **logistic regression** predicts class probabilities, which are then converted into discrete class labels.

### Classification vs Regression

| Aspect | Regression | Classification |
|--------|-----------|----------------|
| Output | Continuous (e.g., price) | Discrete (e.g., spam/not spam) |
| Loss function | MSE | Cross-entropy |
| Output layer | Linear | Sigmoid/Softmax |
| Example | Predict house price | Predict flower species |

For binary classification, logistic regression applies the **sigmoid function** to the output of a linear model to obtain a probability:

$$P(y=1|X) = \sigma(X \cdot w + b) = \frac{1}{1 + e^{-(X \cdot w + b)}}$$
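
A small sketch of the sigmoid and of the cross-entropy loss named in the table above. The scores and labels are made-up values, and `binary_cross_entropy` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, p, eps=1e-12):
    """Average negative log-likelihood of the true labels under probabilities p."""
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

scores = np.array([-2.0, 0.0, 3.0])   # made-up values of X.w + b
labels = np.array([0, 1, 1])          # made-up true classes

probs = sigmoid(scores)
preds = (probs >= 0.5).astype(int)    # threshold probabilities at 0.5
bce = binary_cross_entropy(labels, probs)

print("Probabilities:", probs.round(3))
print("Predicted classes:", preds)
print("Cross-entropy:", round(float(bce), 4))
```

A score of 0 maps to a probability of exactly 0.5, which is why the decision boundary of logistic regression is the set of points where $X \cdot w + b = 0$. The loss is small when confident predictions are correct and grows quickly when they are wrong.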

### The Iris Dataset

The Iris dataset is a classic ML benchmark with 150 samples of 3 flower species, each described by 4 features (sepal length/width, petal length/width).
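
The sigmoid formula above covers two classes, while Iris has three. A common generalization is the **softmax** function, which turns one linear score per class into probabilities that sum to 1. The sketch below uses made-up scores; whether scikit-learn's `LogisticRegression` uses exactly this multinomial formulation depends on the solver and library version, so treat that as an assumption to check against the documentation.

```python
import numpy as np

def softmax(scores):
    """Convert class scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)  # numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Made-up linear scores for one flower, one score per species.
class_scores = np.array([2.0, 1.0, 0.1])
class_probs = softmax(class_scores)

print("Class probabilities:", class_probs.round(3))
print("Predicted class index:", int(np.argmax(class_probs)))
```

The predicted class is the one with the highest probability, which is also the one with the highest raw score.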

In [None]:
# Load and explore the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

print(f"Dataset shape: {X_iris.shape}")
print(f"Number of classes: {len(iris.target_names)}")
print(f"Class names: {iris.target_names}")
print(f"Feature names: {iris.feature_names}")
print(f"\nSamples per class:")
for i, name in enumerate(iris.target_names):
    print(f"  {name}: {np.sum(y_iris == i)}")

# Visualize the data
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors_iris = ['#e74c3c', '#2ecc71', '#3498db']
for i, (name, color) in enumerate(zip(iris.target_names, colors_iris)):
    mask = y_iris == i
    axes[0].scatter(X_iris[mask, 0], X_iris[mask, 1], label=name, c=color, alpha=0.7, edgecolors='k', linewidth=0.5)
    axes[1].scatter(X_iris[mask, 2], X_iris[mask, 3], label=name, c=color, alpha=0.7, edgecolors='k', linewidth=0.5)

axes[0].set_xlabel(iris.feature_names[0], fontsize=12)
axes[0].set_ylabel(iris.feature_names[1], fontsize=12)
axes[0].set_title('Sepal Features', fontsize=14)
axes[0].legend(fontsize=11)

axes[1].set_xlabel(iris.feature_names[2], fontsize=12)
axes[1].set_ylabel(iris.feature_names[3], fontsize=12)
axes[1].set_title('Petal Features', fontsize=14)
axes[1].legend(fontsize=11)

plt.tight_layout()
plt.show()

### Exercise 2: Build a Logistic Regression Classifier for Iris

Your task: split the data, train a logistic regression model, and evaluate it.

**Steps:**
1. Split data into 80% train / 20% test using `train_test_split`
2. Scale the features using `StandardScaler`
3. Train a `LogisticRegression` model
4. Make predictions on the test set
5. Print the accuracy

In [None]:
# Exercise 2: Build a logistic regression classifier

# TODO: Split data into train/test (80/20), use random_state=42
X_train, X_test, y_train, y_test = None, None, None, None

# TODO: Scale features using StandardScaler (fit on train, transform both)
scaler_iris = None
X_train_scaled = None
X_test_scaled = None

# TODO: Create and train a LogisticRegression model (use max_iter=200)
log_reg = None

# TODO: Make predictions on the test set
y_pred = None

# TODO: Calculate and print accuracy
accuracy = None
# print(f"Test Accuracy: {accuracy:.4f}")

### Solution

In [None]:
# Solution: Build a logistic regression classifier

# Split data into train/test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42
)
print(f"Train set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Scale features
scaler_iris = StandardScaler()
X_train_scaled = scaler_iris.fit_transform(X_train)
X_test_scaled = scaler_iris.transform(X_test)

# Train logistic regression
log_reg = LogisticRegression(max_iter=200, random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make predictions
y_pred = log_reg.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.4f}")
print(f"Correctly classified: {np.sum(y_pred == y_test)} / {len(y_test)}")

# Show individual predictions
print(f"\nSample predictions:")
print(f"{'Actual':<15} {'Predicted':<15} {'Correct':>8}")
print("-" * 40)
for i in range(min(10, len(y_test))):
    actual = iris.target_names[y_test[i]]
    predicted = iris.target_names[y_pred[i]]
    correct = "Yes" if y_test[i] == y_pred[i] else "No"
    print(f"{actual:<15} {predicted:<15} {correct:>8}")

## 7. Evaluation Metrics

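Accuracy alone can be misleading, especially with imbalanced datasets. The key metrics are defined below in terms of true positives (TP), false positives (FP), and false negatives (FN). For multi-class problems such as Iris, they are computed per class and then averaged.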

### Metric Definitions

- **Accuracy** = (correct predictions) / (total predictions) -- Overall correctness
- **Precision** = TP / (TP + FP) -- "Of all positive predictions, how many were actually positive?"
- **Recall** = TP / (TP + FN) -- "Of all actual positives, how many did we find?"
- **F1 Score** = 2 * (Precision * Recall) / (Precision + Recall) -- Harmonic mean of precision and recall
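
The definitions above can be checked by hand on a small binary example. The labels below are made up; the manual counts are compared against scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels: 1 = positive, 0 = negative.
y_true_toy = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred_toy = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_true_toy == 1) & (y_pred_toy == 1))   # true positives
fp = np.sum((y_true_toy == 0) & (y_pred_toy == 1))   # false positives
fn = np.sum((y_true_toy == 1) & (y_pred_toy == 0))   # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, FP={fp}, FN={fn}")
print(f"Manual:  precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
print(f"sklearn: precision={precision_score(y_true_toy, y_pred_toy):.2f}, "
      f"recall={recall_score(y_true_toy, y_pred_toy):.2f}, "
      f"F1={f1_score(y_true_toy, y_pred_toy):.2f}")
```

Here TP = 3, FP = 1, and FN = 1, so precision, recall, and F1 all equal 0.75, while accuracy is 8/10 = 0.80.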

### When to Use Which Metric

| Scenario | Priority Metric | Why |
|----------|----------------|-----|
| Spam detection | Precision | Don't want to lose important emails |
| Cancer screening | Recall | Don't want to miss any cases |
| Balanced classes | Accuracy or F1 | All metrics are informative |
| Imbalanced classes | F1 or AUC | Accuracy can be misleading |
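
The last row of the table is easiest to appreciate with numbers. The sketch below builds a made-up dataset with 95 negatives and 5 positives and scores a trivial "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up imbalanced labels: 95 negatives, 5 positives.
y_true_imb = np.array([0] * 95 + [1] * 5)
y_pred_imb = np.zeros(100, dtype=int)   # always predicts the negative class

acc_imb = accuracy_score(y_true_imb, y_pred_imb)
rec_imb = recall_score(y_true_imb, y_pred_imb, zero_division=0)
f1_imb = f1_score(y_true_imb, y_pred_imb, zero_division=0)

print(f"Accuracy: {acc_imb:.2f}")
print(f"Recall:   {rec_imb:.2f}")
print(f"F1:       {f1_imb:.2f}")
```

Accuracy is 0.95 even though the model never finds a single positive case; recall and F1 are both 0 and expose the problem. The `zero_division=0` argument only tells scikit-learn to report 0 instead of warning when a metric's denominator is zero.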

In [None]:
# Compute all evaluation metrics for our Iris classifier
print("=" * 50)
print("Classification Report")
print("=" * 50)
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Individual metrics
print("\nDetailed Metrics:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred, average='weighted'):.4f} (weighted)")
print(f"  Recall:    {recall_score(y_test, y_pred, average='weighted'):.4f} (weighted)")
print(f"  F1 Score:  {f1_score(y_test, y_pred, average='weighted'):.4f} (weighted)")

In [None]:
# Confusion Matrix Visualization
cm = confusion_matrix(y_test, y_pred)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Raw counts
disp1 = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
disp1.plot(ax=axes[0], cmap='Blues', values_format='d')
axes[0].set_title('Confusion Matrix (Counts)', fontsize=14)

# Normalized (percentages)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
disp2 = ConfusionMatrixDisplay(confusion_matrix=cm_normalized, display_labels=iris.target_names)
disp2.plot(ax=axes[1], cmap='Blues', values_format='.2f')
axes[1].set_title('Confusion Matrix (Normalized)', fontsize=14)

plt.tight_layout()
plt.show()

print("How to read the confusion matrix:")
print("  - Rows = actual class, Columns = predicted class")
print("  - Diagonal = correct predictions")
print("  - Off-diagonal = misclassifications")

### Exercise 3: Compare Model Performance Across Different Train/Test Splits

The train/test split ratio affects both model performance and our ability to evaluate it:
- **More training data** = better model learning
- **More test data** = more reliable evaluation
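
The second point can be quantified with a rough approximation. If each test prediction is treated as an independent coin flip that is correct with probability $p$, the standard error of the measured accuracy is $\sqrt{p(1-p)/n}$ for $n$ test samples. The sketch below applies this to the Iris split sizes; the accuracy of 0.95 is a hypothetical value, and `accuracy_std_error` is a helper written for this illustration.

```python
import math

def accuracy_std_error(accuracy, n_test):
    """Approximate standard error of an accuracy estimate (binomial assumption)."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)

n_total = 150            # size of the Iris dataset
assumed_accuracy = 0.95  # hypothetical true accuracy, for illustration only

for test_fraction in [0.4, 0.3, 0.2, 0.1]:
    n_test = round(n_total * test_fraction)
    se = accuracy_std_error(assumed_accuracy, n_test)
    print(f"test fraction {test_fraction:.1f}: {n_test:>3} test samples, "
          f"one error changes accuracy by {1 / n_test:.3f}, std. error ~ {se:.3f}")
```

With only 15 test samples, a single misclassification moves the measured accuracy by almost 7 percentage points, so differences between splits of that size should be read with caution.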

Your task: train and evaluate a logistic regression model with different split ratios and compare the results.

**Steps:**
1. Loop over split ratios: 60/40, 70/30, 80/20, 90/10
2. For each ratio, split, scale, train, predict, and compute accuracy, precision, recall, F1
3. Store the results and plot them

In [None]:
# Exercise 3: Compare performance across different train/test splits

split_ratios = [0.4, 0.3, 0.2, 0.1]  # test set proportions
results = {"test_size": [], "accuracy": [], "precision": [], "recall": [], "f1": []}

for test_size in split_ratios:
    # TODO: Split the data using train_test_split with random_state=42
    X_tr, X_te, y_tr, y_te = None, None, None, None
    
    # TODO: Scale features (fit on train, transform both)
    scaler_ex3 = None
    X_tr_scaled = None
    X_te_scaled = None
    
    # TODO: Train LogisticRegression (max_iter=200, random_state=42)
    model_ex3 = None
    
    # TODO: Make predictions
    y_pred_ex3 = None
    
    # TODO: Compute metrics and append to results dict
    # results["test_size"].append(test_size)
    # results["accuracy"].append(...)
    # results["precision"].append(...)
    # results["recall"].append(...)
    # results["f1"].append(...)
    pass

# TODO: Print results table
# TODO: Plot metrics vs split ratio

### Solution

In [None]:
# Solution: Compare performance across different train/test splits

split_ratios = [0.4, 0.3, 0.2, 0.1]  # test set proportions
results = {"test_size": [], "accuracy": [], "precision": [], "recall": [], "f1": []}

for test_size in split_ratios:
    # Split the data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_iris, y_iris, test_size=test_size, random_state=42
    )
    
    # Scale features
    scaler_ex3 = StandardScaler()
    X_tr_scaled = scaler_ex3.fit_transform(X_tr)
    X_te_scaled = scaler_ex3.transform(X_te)
    
    # Train model
    model_ex3 = LogisticRegression(max_iter=200, random_state=42)
    model_ex3.fit(X_tr_scaled, y_tr)
    
    # Predict
    y_pred_ex3 = model_ex3.predict(X_te_scaled)
    
    # Compute metrics
    results["test_size"].append(test_size)
    results["accuracy"].append(accuracy_score(y_te, y_pred_ex3))
    results["precision"].append(precision_score(y_te, y_pred_ex3, average='weighted'))
    results["recall"].append(recall_score(y_te, y_pred_ex3, average='weighted'))
    results["f1"].append(f1_score(y_te, y_pred_ex3, average='weighted'))

# Print results table
# round() avoids truncation from floating-point error in ts * 100
train_pcts = [f"{round((1 - ts) * 100)}/{round(ts * 100)}" for ts in split_ratios]
print(f"{'Split (Train/Test)':<20} {'Accuracy':>10} {'Precision':>10} {'Recall':>10} {'F1':>10} {'Test Samples':>13}")
print("-" * 75)
for i, ts in enumerate(split_ratios):
    # Expected test-set size for this split
    n_test = round(len(y_iris) * ts)
    print(f"{train_pcts[i]:<20} {results['accuracy'][i]:>10.4f} {results['precision'][i]:>10.4f} "
          f"{results['recall'][i]:>10.4f} {results['f1'][i]:>10.4f} {n_test:>13}")

In [None]:
# Visualize the comparison
fig, ax = plt.subplots(figsize=(10, 6))

x_labels = [f"{round((1 - ts) * 100)}/{round(ts * 100)}" for ts in split_ratios]
x_pos = np.arange(len(split_ratios))
width = 0.2

bars1 = ax.bar(x_pos - 1.5*width, results['accuracy'], width, label='Accuracy', color='#3498db', alpha=0.85)
bars2 = ax.bar(x_pos - 0.5*width, results['precision'], width, label='Precision', color='#2ecc71', alpha=0.85)
bars3 = ax.bar(x_pos + 0.5*width, results['recall'], width, label='Recall', color='#e74c3c', alpha=0.85)
bars4 = ax.bar(x_pos + 1.5*width, results['f1'], width, label='F1 Score', color='#f39c12', alpha=0.85)

ax.set_xlabel('Train/Test Split Ratio', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Model Performance Across Different Train/Test Splits', fontsize=14)
ax.set_xticks(x_pos)
ax.set_xticklabels(x_labels)
ax.legend(fontsize=11)
ax.set_ylim(0.85, 1.02)
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Observations:")
print("  - More training data generally improves model performance")
print("  - Very small test sets (e.g., 10%) give less reliable estimates")
print("  - The 80/20 split is a commonly used default and a good balance")
print("  - For small datasets, consider cross-validation instead of a single split")

## 8. Summary & Key Takeaways

### What we covered

1. **AI Landscape**: The evolution from rule-based systems to generative AI, and when ML is the right approach.

2. **Learning Paradigms**: Supervised learning (labeled data, prediction), unsupervised learning (unlabeled data, structure discovery), and reinforcement learning (reward-driven).

3. **Linear Regression**: Predicting continuous values by minimizing MSE with gradient descent. We implemented it from scratch and verified against sklearn.

4. **Gradient Descent**: The core optimization algorithm in ML. Learning rate is critical -- too small means slow convergence, too large means divergence.

5. **Logistic Regression**: Extending linear models to classification using the sigmoid function.

6. **Evaluation Metrics**: Accuracy is not enough. Precision, recall, and F1 give a more complete picture, especially for imbalanced datasets.

7. **Train/Test Split**: The ratio matters. The 80/20 split is a sensible default, but cross-validation is preferred for small datasets.
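
The cross-validation mentioned in the last point can be sketched as follows. This is a minimal example that assumes 5 folds and wraps the scaler and the classifier in a scikit-learn pipeline; the variable names are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_iris(return_X_y=True)

# The pipeline refits the scaler inside each fold, so the held-out fold
# never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

cv_scores = cross_val_score(pipeline, X_cv, y_cv, cv=5, scoring="accuracy")

print("Accuracy per fold:", cv_scores.round(3))
print(f"Mean accuracy: {cv_scores.mean():.3f} (std {cv_scores.std():.3f})")
```

Every sample is used for testing exactly once, so the mean score depends less on the luck of a single split, and the spread across folds gives a rough sense of how stable the estimate is.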

### Looking Ahead

These fundamentals form the foundation for understanding:
- **Embeddings** (Module 2): How models represent text and images as vectors
- **Transformers** (Module 3): The architecture behind GPT, BERT, and modern LLMs
- **RAG** (Module 4): Combining retrieval with generation for grounded AI responses
- **Agents** (Module 5): Autonomous AI systems that use tools and make decisions

---

### References

- **Book**: Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, 3rd Edition (Chapters 1-4)
- **Course**: Andrew Ng, *Machine Learning Specialization* (Coursera, Course 1: Supervised Machine Learning)
- **Video**: 3Blue1Brown, *"But what is a neural network?"* (YouTube) -- excellent visual intuition for neural networks
- **Documentation**: [scikit-learn User Guide](https://scikit-learn.org/stable/user_guide.html) -- comprehensive reference for all algorithms used in this notebook