# 📐 Mathematics for Machine Learning

Welcome to the mathematical foundations of ML! This notebook covers essential mathematical concepts with interactive examples.

**Learning Goals:**
- Master linear algebra fundamentals
- Understand calculus for optimization
- Apply probability theory
- Visualize mathematical concepts

**Sources:**
- "Mathematics for Machine Learning" - Deisenroth, Faisal, Ong (2020)
- "Deep Learning" - Goodfellow, Bengio, Courville, Chapters 2-4
- "Pattern Recognition and Machine Learning" - Bishop

In [None]:
# Import essential libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from scipy import linalg
import pandas as pd

# Set up plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
np.set_printoptions(precision=3, suppress=True)

print("✅ Libraries loaded successfully!")
print(f"NumPy version: {np.__version__}")

## 🔢 Part 1: Linear Algebra - The Language of ML

Linear algebra is fundamental to ML. Datasets, model parameters, and many transformations are represented with vectors and matrices!

### Why Linear Algebra Matters:
- **Data representation**: Each data point is a vector
- **Model parameters**: Weights are matrices
- **Transformations**: Matrix operations transform data
- **Efficiency**: Vectorization speeds up computation

**Source:** "Mathematics for Machine Learning" Chapter 2
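
To make the vectorization point concrete, here is a minimal sketch that computes the same weighted sums twice: once with explicit Python loops and once with a single matrix-vector product. The array `X` and the weights `w` are made-up illustration values, and the loop version exists only for comparison.

```python
import numpy as np

# Hypothetical data: 4 samples with 3 features each (illustration only)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])
w = np.array([0.5, -1.0, 2.0])

# Loop version: one weighted sum per sample, one multiply-add per feature
loop_result = np.zeros(X.shape[0])
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        loop_result[i] += X[i, j] * w[j]

# Vectorized version: the same computation as one matrix-vector product
vectorized_result = X @ w

print("Loop:      ", loop_result)
print("Vectorized:", vectorized_result)
print("Same result:", np.allclose(loop_result, vectorized_result))
```

On large arrays the vectorized form is typically much faster, because the inner loops run in compiled code rather than in the Python interpreter.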

### 1.1 Vectors: Building Blocks of Data

In [None]:
# Vectors represent data points
# Example: A house with [size, bedrooms, age]
house_1 = np.array([1200, 3, 5])  # 1200 sqft, 3 bedrooms, 5 years old
house_2 = np.array([1800, 4, 2])  # 1800 sqft, 4 bedrooms, 2 years old

print("🏠 House 1 (vector):", house_1)
print("🏠 House 2 (vector):", house_2)
print("\n📏 Vector properties:")
print(f"  Dimension: {house_1.shape}")
print(f"  Length (L2 norm): {np.linalg.norm(house_1):.2f}")

# Vector operations
print("\n➕ Vector Addition (combining features):")
total = house_1 + house_2
print(f"  Sum: {total}")

print("\n✖️ Scalar Multiplication (scaling):")
scaled = 2 * house_1
print(f"  2 × house_1 = {scaled}")

print("\n📊 Dot Product (similarity measure):")
similarity = np.dot(house_1, house_2)
print(f"  house_1 · house_2 = {similarity}")
print("  → Caution: the raw dot product grows with vector magnitude; normalize both vectors (cosine similarity) for a scale-free measure")
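
The dot product mixes direction and magnitude, so a larger value does not by itself mean two houses are more alike. The sketch below, which reuses the two house vectors from the cell above and a hypothetical helper `cosine_sim`, shows that doubling one vector doubles the dot product while the cosine similarity stays the same.

```python
import numpy as np

house_1 = np.array([1200, 3, 5])
house_2 = np.array([1800, 4, 2])

def cosine_sim(a, b):
    """Hypothetical helper: dot product of a and b after scaling both to unit length."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Doubling a vector doubles the dot product...
dot_original = np.dot(house_1, house_2)
dot_doubled = np.dot(2 * house_1, house_2)

# ...but leaves the cosine similarity unchanged
cos_original = cosine_sim(house_1, house_2)
cos_doubled = cosine_sim(2 * house_1, house_2)

print(f"Dot product:       {dot_original} vs {dot_doubled}")
print(f"Cosine similarity: {cos_original:.6f} vs {cos_doubled:.6f}")
```

Note that the square footage dominates both measures here because it is on a much larger scale than the other two features; in practice features are usually standardized first.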

In [None]:
# Visualizing vectors in 2D
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Simple 2D vectors for visualization
v1 = np.array([3, 2])
v2 = np.array([1, 3])

# Plot 1: Individual vectors
axes[0].quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', width=0.008, label='v1')
axes[0].quiver(0, 0, v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.008, label='v2')
axes[0].set_xlim(-1, 5)
axes[0].set_ylim(-1, 5)
axes[0].set_aspect('equal')
axes[0].grid(True, alpha=0.3)
axes[0].legend()
axes[0].set_title('Original Vectors')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# Plot 2: Vector addition
v_sum = v1 + v2
axes[1].quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', width=0.008, alpha=0.5, label='v1')
axes[1].quiver(v1[0], v1[1], v2[0], v2[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.008, alpha=0.5, label='v2')
axes[1].quiver(0, 0, v_sum[0], v_sum[1], angles='xy', scale_units='xy', scale=1, color='green', width=0.01, label='v1 + v2')
axes[1].set_xlim(-1, 5)
axes[1].set_ylim(-1, 6)
axes[1].set_aspect('equal')
axes[1].grid(True, alpha=0.3)
axes[1].legend()
axes[1].set_title('Vector Addition')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

# Plot 3: Scalar multiplication
v_scaled = 1.5 * v1
axes[2].quiver(0, 0, v1[0], v1[1], angles='xy', scale_units='xy', scale=1, color='blue', width=0.008, alpha=0.5, label='v1')
axes[2].quiver(0, 0, v_scaled[0], v_scaled[1], angles='xy', scale_units='xy', scale=1, color='purple', width=0.01, label='1.5 × v1')
axes[2].set_xlim(-1, 6)
axes[2].set_ylim(-1, 5)
axes[2].set_aspect('equal')
axes[2].grid(True, alpha=0.3)
axes[2].legend()
axes[2].set_title('Scalar Multiplication')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')

plt.tight_layout()
plt.show()

print("💡 Key Insight: Addition chains vectors tip-to-tail; multiplying by a positive scalar changes length but not direction!")

### 1.2 Matrices: Organizing Data and Transformations

In [None]:
# Matrix: Collection of data points (each row = one data point)
houses = np.array([
    [1200, 3, 5],   # House 1
    [1800, 4, 2],   # House 2
    [1500, 3, 8],   # House 3
    [2200, 5, 1],   # House 4
    [1000, 2, 10]   # House 5
])

print("🏘️ Housing Dataset (Matrix):")
print(houses)
print(f"\n📐 Shape: {houses.shape} (5 houses, 3 features)")
print(f"\n📊 Column means (average per feature):")
print(f"  Avg size: {houses[:, 0].mean():.0f} sqft")
print(f"  Avg bedrooms: {houses[:, 1].mean():.1f}")
print(f"  Avg age: {houses[:, 2].mean():.1f} years")

In [None]:
# Matrix operations in ML
print("🔧 Common Matrix Operations in ML:")
print("="*60)

# 1. Transpose (swap rows and columns)
print("\n1️⃣ Transpose (features × samples):")
houses_T = houses.T
print(f"   Original shape: {houses.shape}")
print(f"   Transposed shape: {houses_T.shape}")
print(houses_T)

# 2. Matrix-vector multiplication (applying weights)
print("\n2️⃣ Matrix-Vector Multiplication (prediction):")
weights = np.array([0.1, 50, -10])  # $ per sqft, $ per bedroom, $ per year
prices = houses @ weights  # @ is matrix multiplication
print(f"   Weights: {weights}")
print(f"   Predicted prices: {prices}")
print("   → This is how linear regression makes predictions!")

# 3. Matrix-matrix multiplication
print("\n3️⃣ Matrix-Matrix Multiplication (neural network layer):")
W = np.random.randn(3, 2)  # Weight matrix (3 input features → 2 hidden units)
hidden = houses @ W
print(f"   Input shape: {houses.shape}")
print(f"   Weight shape: {W.shape}")
print(f"   Output shape: {hidden.shape}")
print("   → This is a single layer in a neural network!")

In [None]:
# Visualizing matrix transformations
print("🎨 Visualizing Matrix Transformations")
print("="*60)

# Create a simple 2D dataset (square)
square = np.array([
    [0, 0],
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0]  # Close the square
])

# Different transformation matrices
transformations = {
    'Scaling': np.array([[2, 0], [0, 2]]),
    'Rotation (45°)': np.array([[np.cos(np.pi/4), -np.sin(np.pi/4)],
                                [np.sin(np.pi/4), np.cos(np.pi/4)]]),
    'Shear': np.array([[1, 0.5], [0, 1]]),
    'Reflection': np.array([[-1, 0], [0, 1]])
}

fig, axes = plt.subplots(2, 2, figsize=(14, 14))
axes = axes.flatten()

for idx, (name, matrix) in enumerate(transformations.items()):
    # Apply transformation
    transformed = square @ matrix.T
    
    # Plot
    axes[idx].plot(square[:, 0], square[:, 1], 'b-o', linewidth=2, 
                   markersize=8, label='Original', alpha=0.5)
    axes[idx].plot(transformed[:, 0], transformed[:, 1], 'r-o', 
                   linewidth=2, markersize=8, label='Transformed')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_aspect('equal')
    axes[idx].legend()
    axes[idx].set_title(f'{name}\n{matrix[0]} \n{matrix[1]}')
    axes[idx].set_xlim(-2, 3)
    axes[idx].set_ylim(-2, 3)
    axes[idx].axhline(y=0, color='k', linewidth=0.5)
    axes[idx].axvline(x=0, color='k', linewidth=0.5)

plt.tight_layout()
plt.show()

print("\n💡 Many transformations in ML are matrix multiplications!")
print("   - PCA: Rotation to new axes")
print("   - Neural networks: Linear layers interleaved with nonlinear activations")
print("   - Image processing: Convolutions, which are linear maps")
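
Applying several linear transformations in a row is itself a single matrix multiplication. As a small sketch (the test point and the choice of rotation followed by scaling are arbitrary illustration values), the code below applies the two matrices one after the other and then compares the result with applying their product once.

```python
import numpy as np

theta = np.pi / 4
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
scaling = np.array([[2.0, 0.0],
                    [0.0, 2.0]])

point = np.array([1.0, 0.0])  # arbitrary test point

# Apply the two transformations one after the other: rotate, then scale
step_by_step = scaling @ (rotation @ point)

# Or multiply the matrices first and apply the combined matrix once
combined = scaling @ rotation
all_at_once = combined @ point

print("Step by step:", step_by_step)
print("All at once: ", all_at_once)
```

A stack of purely linear layers collapses into one matrix in the same way, which is one reason neural networks place a nonlinear activation between layers.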

### 1.3 Eigenvalues & Eigenvectors: Understanding Data Structure

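**Intuition:** Eigenvectors are the directions that a transformation only stretches or shrinks (without rotating them), and eigenvalues give the stretch factor along each of those directions. For a covariance matrix, the eigenvectors are the "principal directions" of the data and the eigenvalues are the variances along them.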

**ML Applications:**
- PCA (Principal Component Analysis)
- Covariance matrices
- Spectral clustering

**Source:** "Mathematics for Machine Learning" Chapter 4
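
The defining property is $A\mathbf{v} = \lambda\mathbf{v}$: multiplying an eigenvector by the matrix only rescales it. The following sketch checks this numerically for a small symmetric matrix chosen for illustration; `np.linalg.eigh` is used because the matrix is symmetric.

```python
import numpy as np

# A small symmetric matrix (illustration values, same form as a covariance matrix)
A = np.array([[3.0, 1.5],
              [1.5, 1.0]])

# eigh is NumPy's routine for symmetric matrices; eigenvalues come back in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]   # the i-th eigenvector is the i-th column
    lam = eigenvalues[i]
    print(f"λ = {lam:.3f}")
    print("  A @ v =", A @ v)
    print("  λ * v =", lam * v)
```

The two printed vectors should agree for each eigenvalue, up to floating-point rounding.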

In [None]:
# Generate correlated data
np.random.seed(42)
mean = [0, 0]
cov = [[3, 1.5], [1.5, 1]]  # Covariance matrix
data = np.random.multivariate_normal(mean, cov, 300)

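# Compute eigenvalues and eigenvectors of the covariance matrix.
# eigh is meant for symmetric matrices and returns eigenvalues in ascending
# order, so reorder them to put the largest-variance direction first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]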

print("🔍 Eigenanalysis of Data:")
print("="*60)
print(f"\nCovariance Matrix:")
print(cov)
print(f"\n📊 Eigenvalues: {eigenvalues}")
print(f"   → These show variance in principal directions")
print(f"\n🧭 Eigenvectors:")
print(eigenvectors)
print(f"   → These show the principal directions")

# Visualize
plt.figure(figsize=(10, 10))
plt.scatter(data[:, 0], data[:, 1], alpha=0.5, label='Data points')

# Plot eigenvectors scaled by two standard deviations (2 * sqrt of the eigenvalue)
for i in range(2):
    eigvec = eigenvectors[:, i] * np.sqrt(eigenvalues[i]) * 2
    plt.quiver(0, 0, eigvec[0], eigvec[1], 
               angles='xy', scale_units='xy', scale=1,
               color=['red', 'blue'][i], width=0.01,
               label=f'Eigenvector {i+1} (λ={eigenvalues[i]:.2f})')

plt.grid(True, alpha=0.3)
plt.axis('equal')
plt.legend()
plt.title('Principal Component Analysis (PCA)\nEigenvectors show directions of maximum variance')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()

print("\n💡 PCA finds these eigenvectors to reduce dimensionality!")
print("   The first eigenvector (red) captures most variance.")

## 📈 Part 2: Calculus - The Language of Optimization

Training an ML model is an optimization problem! We use calculus to find the model parameters that minimize a cost function.

**Key Concepts:**
- **Derivatives**: Rate of change (slopes)
- **Gradients**: Direction of steepest ascent
- **Optimization**: Finding minimum/maximum values
- **Gradient Descent**: The core ML training algorithm

**Source:** "Deep Learning" - Goodfellow et al., Chapter 4

### 2.1 Derivatives: Understanding Change

In [None]:
# Example: Cost function for a simple model
def cost_function(w):
    """Simple quadratic cost function: (w - 3)^2 + 5"""
    return (w - 3)**2 + 5

def derivative(w):
    """Derivative of cost function: 2(w - 3)"""
    return 2 * (w - 3)

# Visualize function and its derivative
w_values = np.linspace(-2, 8, 100)
cost_values = cost_function(w_values)
deriv_values = derivative(w_values)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Cost function
ax1.plot(w_values, cost_values, 'b-', linewidth=2, label='Cost function')
ax1.plot(3, 5, 'ro', markersize=15, label='Minimum (w=3)')
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=12)
ax1.set_xlabel('Weight (w)', fontsize=12)
ax1.set_ylabel('Cost', fontsize=12)
ax1.set_title('Cost Function: Goal is to find minimum', fontsize=14)

# Plot 2: Derivative (slope)
ax2.plot(w_values, deriv_values, 'g-', linewidth=2, label='Derivative (slope)')
ax2.axhline(y=0, color='r', linestyle='--', linewidth=2, label='Zero derivative = minimum')
ax2.plot(3, 0, 'ro', markersize=15)
ax2.grid(True, alpha=0.3)
ax2.legend(fontsize=12)
ax2.set_xlabel('Weight (w)', fontsize=12)
ax2.set_ylabel('Derivative', fontsize=12)
ax2.set_title('Derivative: Shows direction to move', fontsize=14)

plt.tight_layout()
plt.show()

print("📊 Key Insights:")
print("  • Derivative > 0 → Function increasing → Move left (decrease w)")
print("  • Derivative < 0 → Function decreasing → Move right (increase w)")
print("  • Derivative = 0 → Stationary point (here, the minimum at w=3) → Stop!")
print("\n💡 This is the foundation of gradient descent!")
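
A quick way to sanity-check a hand-derived derivative is to compare it with a finite-difference estimate. The sketch below redefines the cost function and derivative from the cell above so that it runs on its own; `numerical_derivative` and the step size `h` are hypothetical choices for illustration.

```python
import numpy as np

def cost_function(w):
    """Same quadratic cost as above: (w - 3)^2 + 5"""
    return (w - 3)**2 + 5

def derivative(w):
    """Analytic derivative: 2(w - 3)"""
    return 2 * (w - 3)

def numerical_derivative(f, w, h=1e-5):
    """Central difference estimate: (f(w + h) - f(w - h)) / (2h)"""
    return (f(w + h) - f(w - h)) / (2 * h)

test_points = np.array([-1.0, 0.0, 3.0, 7.0])
analytic = derivative(test_points)
numeric = numerical_derivative(cost_function, test_points)

print("Analytic:", analytic)
print("Numeric: ", numeric)
print("Match:", np.allclose(analytic, numeric, atol=1e-6))
```

The same comparison is commonly used to check gradients in larger models, where a mistake in the analytic derivative is harder to spot by eye.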

### 2.2 Gradient Descent: Training ML Models

In [None]:
# Implement gradient descent from scratch
def gradient_descent(starting_point, learning_rate, num_iterations):
    """
    Find minimum of cost function using gradient descent
    
    Parameters:
    - starting_point: Initial weight value
    - learning_rate: Step size (how far to move)
    - num_iterations: Number of steps to take
    """
    w = starting_point
    history = [w]
    
    for i in range(num_iterations):
        # Calculate gradient (derivative)
        grad = derivative(w)
        
        # Update weight: move opposite to gradient
        w = w - learning_rate * grad
        history.append(w)
        
        if i < 5 or i == num_iterations - 1:
            print(f"Iteration {i+1}: gradient={grad:.4f} → w={w:.4f}, cost={cost_function(w):.4f}")
    
    return w, history

print("🎯 Gradient Descent in Action!")
print("="*60)
print("\nStarting from w=7 (far from optimal w=3)...\n")

final_w, history = gradient_descent(starting_point=7, learning_rate=0.1, num_iterations=20)

print(f"\n✅ Converged to w={final_w:.4f} (optimal is w=3.0)")
print(f"   Final cost: {cost_function(final_w):.4f}")

In [None]:
# Visualize gradient descent path
fig = plt.figure(figsize=(16, 6))

# Plot 1: Path on cost function
ax1 = fig.add_subplot(121)
w_values = np.linspace(-1, 8, 100)
ax1.plot(w_values, cost_function(w_values), 'b-', linewidth=2, label='Cost function')

# Plot gradient descent steps
history_costs = [cost_function(w) for w in history]
ax1.plot(history, history_costs, 'ro-', linewidth=2, markersize=8, 
         label='Gradient descent path', alpha=0.7)
ax1.plot(history[0], history_costs[0], 'go', markersize=15, label='Start')
ax1.plot(history[-1], history_costs[-1], 'r*', markersize=20, label='End')

ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=11)
ax1.set_xlabel('Weight (w)', fontsize=12)
ax1.set_ylabel('Cost', fontsize=12)
ax1.set_title('Gradient Descent Path', fontsize=14)

# Plot 2: Cost over iterations
ax2 = fig.add_subplot(122)
ax2.plot(range(len(history_costs)), history_costs, 'b-o', linewidth=2, markersize=6)
ax2.grid(True, alpha=0.3)
ax2.set_xlabel('Iteration', fontsize=12)
ax2.set_ylabel('Cost', fontsize=12)
ax2.set_title('Cost Reduction Over Time', fontsize=14)

plt.tight_layout()
plt.show()

print("\n💡 Neural networks are trained with this same loop!")
print("   1. Compute gradient (derivative of cost w.r.t. weights)")
print("   2. Update weights in opposite direction of gradient")
print("   3. Repeat until convergence")
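
The learning rate decides whether this loop converges at all. As a rough sketch, the code below reruns gradient descent on the same quadratic cost with three step sizes picked for illustration; `run_gradient_descent` is a hypothetical helper that mirrors the function above without the printing.

```python
def derivative(w):
    """Derivative of the cost (w - 3)^2 + 5 used above."""
    return 2 * (w - 3)

def run_gradient_descent(start, learning_rate, num_iterations):
    """Hypothetical helper: plain gradient descent, returns the final weight."""
    w = start
    for _ in range(num_iterations):
        w = w - learning_rate * derivative(w)
    return w

results = {}
for lr in [0.01, 0.1, 1.1]:
    results[lr] = run_gradient_descent(start=7.0, learning_rate=lr, num_iterations=20)
    print(f"learning_rate={lr}: final w = {results[lr]:.4f}")
```

For this particular cost, each step multiplies the distance to the minimum by `(1 - 2 * learning_rate)`, so the iteration converges only for learning rates between 0 and 1: 0.01 is still far from w=3 after 20 steps, 0.1 gets close, and 1.1 moves further away on every step.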

### 2.3 Partial Derivatives & Gradients: Multiple Variables

In [None]:
# 3D cost function (2 weights)
def cost_3d(w1, w2):
    """Bowl-shaped cost function"""
    return (w1 - 2)**2 + (w2 - 1)**2 + 5

def gradient_3d(w1, w2):
    """Gradient vector [∂C/∂w1, ∂C/∂w2]"""
    dw1 = 2 * (w1 - 2)
    dw2 = 2 * (w2 - 1)
    return np.array([dw1, dw2])

# Visualize 3D cost surface
w1_range = np.linspace(-1, 5, 50)
w2_range = np.linspace(-2, 4, 50)
W1, W2 = np.meshgrid(w1_range, w2_range)
Z = cost_3d(W1, W2)

fig = plt.figure(figsize=(18, 6))

# 3D surface plot
ax1 = fig.add_subplot(131, projection='3d')
surf = ax1.plot_surface(W1, W2, Z, cmap='viridis', alpha=0.8)
ax1.set_xlabel('w1')
ax1.set_ylabel('w2')
ax1.set_zlabel('Cost')
ax1.set_title('3D Cost Surface')
fig.colorbar(surf, ax=ax1, shrink=0.5)

# Contour plot
ax2 = fig.add_subplot(132)
contour = ax2.contour(W1, W2, Z, levels=20, cmap='viridis')
ax2.plot(2, 1, 'r*', markersize=20, label='Minimum (2, 1)')
ax2.set_xlabel('w1')
ax2.set_ylabel('w2')
ax2.set_title('Contour Plot (Top View)')
ax2.legend()
ax2.grid(True, alpha=0.3)
fig.colorbar(contour, ax=ax2)

# Gradient vectors
ax3 = fig.add_subplot(133)
ax3.contour(W1, W2, Z, levels=20, cmap='viridis', alpha=0.3)

# Sample points and their gradients
sample_points = np.array([[4, 3], [0, 2], [3, -1], [1, 0]])
for point in sample_points:
    grad = gradient_3d(point[0], point[1])
    # Normalize for visualization
    grad_norm = grad / (np.linalg.norm(grad) + 1e-8) * 0.5
    ax3.arrow(point[0], point[1], -grad_norm[0], -grad_norm[1],
             head_width=0.2, head_length=0.15, fc='red', ec='red', linewidth=2)
    ax3.plot(point[0], point[1], 'ro', markersize=8)

ax3.plot(2, 1, 'g*', markersize=20, label='Minimum')
ax3.set_xlabel('w1')
ax3.set_ylabel('w2')
ax3.set_title('Negative Gradient Vectors\n(Red arrows point toward minimum)')
ax3.legend()
ax3.grid(True, alpha=0.3)
ax3.set_xlim(-1, 5)
ax3.set_ylim(-2, 4)

plt.tight_layout()
plt.show()

print("🧭 The Negative Gradient Points Downhill!")
print("  • Gradient = [∂C/∂w1, ∂C/∂w2] = vector of partial derivatives")
print("  • The gradient points in the direction of steepest ascent")
print("  • We move OPPOSITE to gradient (downhill) to minimize cost")
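
Moving opposite to the gradient works the same way with two weights as with one. Here is a minimal sketch of gradient descent on the bowl-shaped cost above; the starting point, learning rate, and iteration count are assumed values chosen for illustration.

```python
import numpy as np

def gradient_3d(w1, w2):
    """Gradient of (w1 - 2)^2 + (w2 - 1)^2 + 5"""
    return np.array([2 * (w1 - 2), 2 * (w2 - 1)])

w = np.array([4.0, 3.0])   # arbitrary starting point
learning_rate = 0.1        # assumed step size
path = [w.copy()]

for _ in range(50):
    # Step against the gradient, updating both weights at once
    w = w - learning_rate * gradient_3d(w[0], w[1])
    path.append(w.copy())

path = np.array(path)
print("Start:", path[0])
print("End:  ", path[-1])
```

The rows of `path` can be drawn on top of the contour plot above to see the trajectory head straight for the minimum at (2, 1).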

## 🎲 Part 3: Probability & Statistics

ML deals with uncertainty! Probability theory helps us model and reason about randomness.

**Key Concepts:**
- Probability distributions
- Expected values
- Variance and standard deviation
- Bayes' theorem

**Source:** "Pattern Recognition and Machine Learning" - Bishop, Chapter 1
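
Expected value and variance can be computed directly from a distribution's probabilities. As a small worked example (a fair six-sided die, chosen for illustration), the sketch below computes $E[X] = \sum_k k\,P(X=k)$ and $\mathrm{Var}(X) = \sum_k (k - E[X])^2\,P(X=k)$ and compares them with estimates from simulated rolls.

```python
import numpy as np

# A fair six-sided die: outcomes 1..6, each with probability 1/6
outcomes = np.arange(1, 7)
probabilities = np.full(6, 1 / 6)

# Exact values from the definitions
expected_value = np.sum(outcomes * probabilities)
variance = np.sum((outcomes - expected_value)**2 * probabilities)
std_dev = np.sqrt(variance)

# Estimates from simulated rolls (the upper bound of integers() is exclusive)
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)

print(f"Exact:     mean={expected_value:.4f}, variance={variance:.4f}, std={std_dev:.4f}")
print(f"Simulated: mean={rolls.mean():.4f}, variance={rolls.var():.4f}, std={rolls.std():.4f}")
```

The simulated values should land close to the exact ones and get closer as the number of rolls grows.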

In [None]:
# Common probability distributions in ML
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# 1. Normal/Gaussian Distribution (most important!)
x = np.linspace(-5, 5, 100)
for mu, sigma in [(0, 1), (0, 0.5), (0, 2)]:
    y = (1/(sigma * np.sqrt(2*np.pi))) * np.exp(-0.5*((x-mu)/sigma)**2)
    axes[0,0].plot(x, y, linewidth=2, label=f'μ={mu}, σ={sigma}')
axes[0,0].set_title('Normal Distribution\n(Most features, errors, noise)')
axes[0,0].set_xlabel('x')
axes[0,0].set_ylabel('Probability Density')
axes[0,0].legend()
axes[0,0].grid(True, alpha=0.3)

# 2. Bernoulli Distribution (binary outcomes)
p_values = [0.3, 0.5, 0.7]
x_pos = np.arange(2)
width = 0.25
for i, p in enumerate(p_values):
    axes[0,1].bar(x_pos + i*width, [1-p, p], width, label=f'p={p}', alpha=0.7)
axes[0,1].set_title('Bernoulli Distribution\n(Binary classification)')
axes[0,1].set_xlabel('Outcome')
axes[0,1].set_ylabel('Probability')
axes[0,1].set_xticks(x_pos + width)
axes[0,1].set_xticklabels(['0 (Failure)', '1 (Success)'])
axes[0,1].legend()
axes[0,1].grid(True, alpha=0.3, axis='y')

# 3. Uniform Distribution
a, b = -2, 2
x = np.linspace(-3, 3, 100)
y = np.where((x >= a) & (x <= b), 1/(b-a), 0)
axes[0,2].plot(x, y, linewidth=2)
axes[0,2].fill_between(x, y, alpha=0.3)
axes[0,2].set_title('Uniform Distribution\n(Random initialization)')
axes[0,2].set_xlabel('x')
axes[0,2].set_ylabel('Probability Density')
axes[0,2].grid(True, alpha=0.3)

# 4. Exponential Distribution
x = np.linspace(0, 5, 100)
for lam in [0.5, 1, 2]:
    y = lam * np.exp(-lam * x)
    axes[1,0].plot(x, y, linewidth=2, label=f'λ={lam}')
axes[1,0].set_title('Exponential Distribution\n(Time between events)')
axes[1,0].set_xlabel('x')
axes[1,0].set_ylabel('Probability Density')
axes[1,0].legend()
axes[1,0].grid(True, alpha=0.3)

# 5. Poisson Distribution
from math import factorial  # np.math is deprecated and unavailable in recent NumPy releases
x = np.arange(0, 15)
for lam in [1, 4, 8]:
    y = (lam**x * np.exp(-lam)) / np.array([factorial(int(k)) for k in x])
    axes[1,1].plot(x, y, 'o-', linewidth=2, label=f'λ={lam}')
axes[1,1].set_title('Poisson Distribution\n(Count data)')
axes[1,1].set_xlabel('k (number of events)')
axes[1,1].set_ylabel('Probability')
axes[1,1].legend()
axes[1,1].grid(True, alpha=0.3)

# 6. Beta Distribution
from scipy.stats import beta
x = np.linspace(0, 1, 100)
for a, b in [(0.5, 0.5), (2, 2), (5, 2)]:
    y = beta.pdf(x, a, b)
    axes[1,2].plot(x, y, linewidth=2, label=f'α={a}, β={b}')
axes[1,2].set_title('Beta Distribution\n(Distributions over probabilities)')
axes[1,2].set_xlabel('x')
axes[1,2].set_ylabel('Probability Density')
axes[1,2].legend()
axes[1,2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Distribution Use Cases in ML:")
print("  • Normal: Feature distributions, weight initialization, errors")
print("  • Bernoulli: Binary classification outputs")
print("  • Uniform: Random weight initialization")
print("  • Exponential: Waiting times, survival analysis")
print("  • Poisson: Count predictions (traffic, customers)")
print("  • Beta: Bayesian inference, probability priors")

### 3.1 Bayes' Theorem: The Foundation of Probabilistic ML

In [None]:
# Practical example: Medical diagnosis
print("🏥 Bayes' Theorem in Action: Medical Diagnosis")
print("="*60)

# Given information
P_disease = 0.01  # 1% of population has disease (prior)
P_positive_given_disease = 0.95  # Test is 95% sensitive (true positive rate)
P_positive_given_healthy = 0.10  # Test has 10% false positive rate

# Calculate P(positive) using law of total probability
P_positive = (P_positive_given_disease * P_disease + 
              P_positive_given_healthy * (1 - P_disease))

# Bayes' theorem: P(disease | positive test)
P_disease_given_positive = (P_positive_given_disease * P_disease) / P_positive

print(f"\n📊 Given Information:")
print(f"  • Disease prevalence: {P_disease:.1%}")
print(f"  • Test sensitivity: {P_positive_given_disease:.1%}")
print(f"  • False positive rate: {P_positive_given_healthy:.1%}")

print(f"\n🎯 Bayes' Theorem Result:")
print(f"  If you test positive, probability of having disease: {P_disease_given_positive:.1%}")

print(f"\n💡 Surprising insight:")
print(f"  Even with a positive test, only {P_disease_given_positive:.1%} chance of disease!")
print(f"  Why? The disease is rare, so false positives from the large healthy group outnumber true positives")

# Visualize Bayes' theorem
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Visualization 1: Tree diagram
ax1 = axes[0]
ax1.text(0.1, 0.5, 'Population', fontsize=14, fontweight='bold', ha='center')
ax1.arrow(0.15, 0.5, 0.15, 0.2, head_width=0.02, head_length=0.05, fc='blue', ec='blue')
ax1.arrow(0.15, 0.5, 0.15, -0.2, head_width=0.02, head_length=0.05, fc='green', ec='green')

ax1.text(0.35, 0.75, f'Disease\n{P_disease:.1%}', fontsize=12, ha='center', 
         bbox=dict(boxstyle='round', facecolor='lightblue'))
ax1.text(0.35, 0.25, f'Healthy\n{1-P_disease:.1%}', fontsize=12, ha='center',
         bbox=dict(boxstyle='round', facecolor='lightgreen'))

ax1.arrow(0.45, 0.75, 0.15, 0.05, head_width=0.02, head_length=0.03, fc='red', ec='red')
ax1.arrow(0.45, 0.25, 0.15, 0.05, head_width=0.02, head_length=0.03, fc='orange', ec='orange')

ax1.text(0.7, 0.82, f'Test +\n{P_positive_given_disease:.1%}', fontsize=11, ha='center',
         bbox=dict(boxstyle='round', facecolor='lightcoral'))
ax1.text(0.7, 0.32, f'Test +\n{P_positive_given_healthy:.1%}', fontsize=11, ha='center',
         bbox=dict(boxstyle='round', facecolor='lightyellow'))

ax1.set_xlim(0, 1)
ax1.set_ylim(0, 1)
ax1.axis('off')
ax1.set_title('Probability Tree', fontsize=14, fontweight='bold')

# Visualization 2: Bar chart
ax2 = axes[1]
categories = ['Prior\nP(Disease)', 'Likelihood\nP(+|Disease)', 'Posterior\nP(Disease|+)']
values = [P_disease, P_positive_given_disease, P_disease_given_positive]
colors = ['skyblue', 'lightcoral', 'lightgreen']

bars = ax2.bar(categories, values, color=colors, alpha=0.7, edgecolor='black', linewidth=2)
for bar, value in zip(bars, values):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height + 0.02,
             f'{value:.1%}', ha='center', va='bottom', fontsize=14, fontweight='bold')

ax2.set_ylabel('Probability', fontsize=12)
ax2.set_title('Bayes\' Theorem: Updating Beliefs', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')
ax2.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("\n🧠 ML Applications of Bayes' Theorem:")
print("  • Naive Bayes classifier")
print("  • Bayesian neural networks")
print("  • Spam filters")
print("  • Reinforcement learning (belief updates)")
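
"Updating beliefs" becomes clearer when the update is applied twice: the posterior after one positive test serves as the prior for a second one. The sketch below assumes, purely for illustration, that the two test results are independent given the patient's true status; `bayes_update` is a hypothetical helper, and the numbers are the ones used above.

```python
def bayes_update(prior, p_pos_given_disease, p_pos_given_healthy):
    """Hypothetical helper: posterior P(disease | positive test) for a given prior."""
    p_positive = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_positive

prior = 0.01  # disease prevalence

# First positive test: start from the population prior
after_first = bayes_update(prior, 0.95, 0.10)

# Second positive test: yesterday's posterior is today's prior
# (assumes the two results are conditionally independent)
after_second = bayes_update(after_first, 0.95, 0.10)

print(f"Prior:                     {prior:.1%}")
print(f"After one positive test:   {after_first:.1%}")
print(f"After two positive tests:  {after_second:.1%}")
```

Under that independence assumption, a second positive result raises the probability from under 10% to just under 50%, which illustrates how repeated evidence can overcome a low prior.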

## 🎮 Interactive Exercises

Now it's your turn! Try these exercises to solidify your understanding.

### Exercise 1: Vector Operations

In [None]:
# TODO: Complete this exercise

# Given two vectors representing houses:
house_a = np.array([2000, 4, 3])  # [sqft, bedrooms, age]
house_b = np.array([1500, 3, 7])

# Task 1: Calculate the Euclidean distance between houses
# Hint: Use np.linalg.norm() or compute sqrt(sum((a-b)^2))
distance = None  # YOUR CODE HERE

# Task 2: Normalize house_a to unit length
# Hint: Divide by the vector's norm
house_a_normalized = None  # YOUR CODE HERE

# Task 3: Calculate cosine similarity
# Hint: (a · b) / (||a|| ||b||)
cosine_similarity = None  # YOUR CODE HERE

# Check your answers
print("✅ Solutions:")
print(f"Distance: {distance}")
print(f"Normalized house_a: {house_a_normalized}")
print(f"Cosine similarity: {cosine_similarity}")

### Exercise 2: Matrix Multiplication

In [None]:
# TODO: Complete this exercise

# Dataset: 4 samples, 3 features
X = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

# Weight matrix: 3 inputs -> 2 outputs
W = np.array([
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6]
])

# Task 1: Multiply X and W to get predictions
# What should be the output shape?
predictions = None  # YOUR CODE HERE

# Task 2: Calculate the mean of each feature (column) in X
feature_means = None  # YOUR CODE HERE

# Task 3: Center the data by subtracting means
X_centered = None  # YOUR CODE HERE

print("✅ Solutions:")
print(f"Predictions shape: {predictions.shape if predictions is not None else 'Not computed'}")
print(f"Feature means: {feature_means}")
print(f"Centered data:\n{X_centered}")

### Exercise 3: Implement Gradient Descent

In [None]:
# TODO: Implement gradient descent for a different function

def new_cost_function(w):
    """Cost function: w^3 - 2w^2 + 5 (local minimum at w = 4/3; unbounded below as w → -∞)"""
    return w**3 - 2*w**2 + 5

def new_derivative(w):
    """Derivative: 3w^2 - 4w"""
    # YOUR CODE HERE
    return None

def gradient_descent_exercise(start, lr, iterations):
    """
    Implement gradient descent
    
    Parameters:
    - start: starting weight
    - lr: learning rate
    - iterations: number of steps
    
    Returns:
    - final weight
    - history of weights
    """
    w = start
    history = [w]
    
    for i in range(iterations):
        # YOUR CODE HERE
        # 1. Calculate gradient
        # 2. Update weight
        # 3. Append to history
        pass
    
    return w, history

# Test your implementation
final, hist = gradient_descent_exercise(start=2.0, lr=0.01, iterations=100)
print(f"Final weight: {final}")
print(f"Final cost: {new_cost_function(final)}")

# Visualize
if hist is not None and len(hist) > 1:
    plt.figure(figsize=(12, 5))
    
    # Plot cost function
    plt.subplot(121)
    w_vals = np.linspace(-1, 3, 100)
    plt.plot(w_vals, new_cost_function(w_vals), 'b-', linewidth=2)
    plt.plot(hist, [new_cost_function(w) for w in hist], 'ro-', alpha=0.6)
    plt.xlabel('w')
    plt.ylabel('Cost')
    plt.title('Gradient Descent Path')
    plt.grid(True, alpha=0.3)
    
    # Plot convergence
    plt.subplot(122)
    plt.plot([new_cost_function(w) for w in hist], 'b-o')
    plt.xlabel('Iteration')
    plt.ylabel('Cost')
    plt.title('Cost vs Iteration')
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

## 🎓 Summary & Next Steps

### ✅ What You've Learned:

**Linear Algebra:**
- Vectors represent data points
- Matrices organize datasets and transformations
- Eigenvalues/eigenvectors reveal data structure
- Matrix operations are the foundation of neural networks

**Calculus:**
- Derivatives measure rate of change
- Gradients point uphill (we go opposite direction)
- Gradient descent is the core ML training algorithm
- Chain rule enables backpropagation

**Probability:**
- Distributions model uncertainty
- Bayes' theorem updates beliefs with evidence
- Expected values guide decision making
- Variance measures spread/uncertainty

### 📚 Key Formulas:

1. **Dot Product**: $\mathbf{a} \cdot \mathbf{b} = \sum_{i} a_i b_i$

2. **Matrix Multiplication**: $(AB)_{ij} = \sum_{k} A_{ik} B_{kj}$

3. **Gradient Descent**: $w_{t+1} = w_t - \alpha \nabla f(w_t)$

4. **Bayes' Theorem**: $P(A|B) = \frac{P(B|A) P(A)}{P(B)}$

### 🚀 Next Steps:

1. **[Statistics Fundamentals](03_statistics.ipynb)** - Dive deeper into statistical inference
2. **[Data Processing](04_data_processing.ipynb)** - Learn to clean and transform data
3. **[Classical ML](05_classical_ml.ipynb)** - Apply math to real algorithms

### 📖 Recommended Reading:

- "Mathematics for Machine Learning" - Deisenroth, Faisal, Ong
- "Deep Learning" - Goodfellow, Bengio, Courville (Chapters 2-4)
- 3Blue1Brown videos on Linear Algebra and Calculus
- Khan Academy: Linear Algebra, Calculus, Probability

### 💪 Challenge Problems:

1. Implement PCA from scratch using eigendecomposition
2. Code a simple neural network using only NumPy
3. Derive and implement backpropagation
4. Build a Naive Bayes classifier

**Remember**: Mathematics is the language of ML. Master these foundations, and everything else becomes easier! 🎯