# Vector Calculus for Machine Learning

**Course:** Mathematics for Machine Learning  
**Instructor:** Mohammed Alnemari  
**Lecture 5**

---

## What You'll Learn

This notebook covers the core vector calculus concepts used in machine learning:

1. **Numerical Differentiation** - Finite differences and derivative approximation
2. **Partial Derivatives and Gradients** - Gradient fields and visualization
3. **Jacobians** - Derivatives of vector-valued functions
4. **Gradient Descent** - Optimization from scratch
5. **Chain Rule and Backpropagation** - The engine behind deep learning
6. **Hessian and Second-Order Information** - Newton's method

---

## Google Colab Ready!

This notebook runs in Google Colab without extra setup: NumPy, Matplotlib, and SciPy are pre-installed there.

In [None]:
# Import all required libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from mpl_toolkits.mplot3d import Axes3D

# Set plotting style
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 12

print("All libraries imported successfully!")
print("NumPy version:", np.__version__)

---

# Part 1: Numerical Differentiation

The derivative of a function $f(x)$ is defined as:

$$f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

When we cannot compute derivatives analytically, we approximate them using **finite differences**.

## 1.1 Forward and Central Differences

- **Forward difference:** $f'(x) \approx \frac{f(x+h) - f(x)}{h}$ (first-order accurate)
- **Central difference:** $f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$ (second-order accurate)

In [None]:
def forward_difference(f, x, h=1e-5):
    """Compute derivative using forward difference."""
    return (f(x + h) - f(x)) / h

def central_difference(f, x, h=1e-5):
    """Compute derivative using central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Define a test function: f(x) = sin(x)
def f(x):
    return np.sin(x)

# Analytical derivative: f'(x) = cos(x)
def f_prime_exact(x):
    return np.cos(x)

# Test at x = pi/4
x0 = np.pi / 4
exact = f_prime_exact(x0)
fwd = forward_difference(f, x0)
ctr = central_difference(f, x0)

print(f"Derivative of sin(x) at x = pi/4")
print(f"{'Method':<20} {'Value':<20} {'Error':<20}")
print(f"{'-'*60}")
print(f"{'Exact':<20} {exact:<20.12f} {'---':<20}")
print(f"{'Forward diff':<20} {fwd:<20.12f} {abs(fwd - exact):<20.2e}")
print(f"{'Central diff':<20} {ctr:<20.12f} {abs(ctr - exact):<20.2e}")
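
The "first-order" and "second-order" labels can be checked empirically: shrinking $h$ by a factor of 10 should shrink the forward-difference error by roughly 10 and the central-difference error by roughly 100. The following is a minimal sketch of that check. The helper name `observed_order` and the step sizes `1e-2` and `1e-3` are illustrative choices, not part of the lecture code.

```python
import numpy as np

def forward_diff(f, x, h):
    """Forward difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    """Central difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def observed_order(method, f, f_prime, x, h1=1e-2, h2=1e-3):
    """Estimate p in error ~ C * h^p from the errors at two step sizes."""
    e1 = abs(method(f, x, h1) - f_prime(x))
    e2 = abs(method(f, x, h2) - f_prime(x))
    return np.log(e1 / e2) / np.log(h1 / h2)

# Test function sin(x) with exact derivative cos(x), evaluated at x = 1
order_fwd = observed_order(forward_diff, np.sin, np.cos, 1.0)
order_ctr = observed_order(central_diff, np.sin, np.cos, 1.0)

print(f"Observed order, forward difference: {order_fwd:.3f}")
print(f"Observed order, central difference: {order_ctr:.3f}")
```

The step sizes are deliberately moderate so that truncation error dominates. With much smaller $h$, round-off error would distort the estimate.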

## 1.2 Effect of Step Size $h$

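Choosing the step size $h$ involves a trade-off. If $h$ is too large, truncation error dominates (the terms dropped from the Taylor expansion). If $h$ is too small, floating-point round-off dominates, because $f(x+h)$ and $f(x)$ agree in most of their digits and their difference loses precision.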

In [None]:
# Study how error changes with h
h_values = np.logspace(-15, 0, 100)
x0 = 1.0
exact = f_prime_exact(x0)

fwd_errors = [abs(forward_difference(f, x0, h) - exact) for h in h_values]
ctr_errors = [abs(central_difference(f, x0, h) - exact) for h in h_values]

plt.figure(figsize=(10, 6))
plt.loglog(h_values, fwd_errors, 'b-', label='Forward difference', linewidth=2)
plt.loglog(h_values, ctr_errors, 'r-', label='Central difference', linewidth=2)
plt.xlabel('Step size h')
plt.ylabel('Absolute error')
plt.title('Error vs Step Size for Numerical Differentiation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
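
The minimum of each error curve marks the balance between truncation and round-off error. A common rule of thumb, assuming $x$, $f$, and its derivatives are all of order one, is $h \approx \sqrt{\varepsilon}$ for forward differences and $h \approx \varepsilon^{1/3}$ for central differences, where $\varepsilon$ is machine epsilon. The sketch below evaluates both choices for $\sin(x)$ at $x = 1$. The variable names are illustrative.

```python
import numpy as np

eps = np.finfo(float).eps          # machine epsilon, about 2.2e-16 for float64

h_fwd = np.sqrt(eps)               # rule-of-thumb step for forward differences
h_ctr = np.cbrt(eps)               # rule-of-thumb step for central differences

x0 = 1.0
exact = np.cos(x0)                 # exact derivative of sin at x0

approx_fwd = (np.sin(x0 + h_fwd) - np.sin(x0)) / h_fwd
approx_ctr = (np.sin(x0 + h_ctr) - np.sin(x0 - h_ctr)) / (2 * h_ctr)

err_fwd = abs(approx_fwd - exact)
err_ctr = abs(approx_ctr - exact)

print(f"Forward: h = {h_fwd:.2e}, error = {err_fwd:.2e}")
print(f"Central: h = {h_ctr:.2e}, error = {err_ctr:.2e}")
```

These step sizes are only a starting point. When $x$ or $f$ is far from order one, scaling $h$ with $|x|$ is a common adjustment.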

## 1.3 Plotting Function and Its Derivative

In [None]:
# Define a more interesting function: f(x) = x^3 - 3x^2 + 2x
def g(x):
    return x**3 - 3*x**2 + 2*x

def g_prime_exact(x):
    return 3*x**2 - 6*x + 2

x = np.linspace(-1, 4, 300)
y = g(x)
dy_exact = g_prime_exact(x)
dy_numerical = central_difference(g, x)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot function
axes[0].plot(x, y, 'b-', linewidth=2)
axes[0].set_xlabel('x')
axes[0].set_ylabel('f(x)')
axes[0].set_title(r'$f(x) = x^3 - 3x^2 + 2x$')
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].grid(True, alpha=0.3)

# Plot derivatives
axes[1].plot(x, dy_exact, 'b-', linewidth=2, label='Analytical')
axes[1].plot(x[::10], dy_numerical[::10], 'ro', markersize=5, label='Numerical (central)')
axes[1].set_xlabel('x')
axes[1].set_ylabel("f'(x)")
axes[1].set_title('Derivative Comparison')
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

# Part 2: Partial Derivatives and Gradients

For a function $f(x, y)$, the **gradient** is the vector of partial derivatives:

$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$

The gradient points in the direction of **steepest ascent**.
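
The steepest-ascent claim can be tested numerically: among all unit directions $\mathbf{d}$, the directional derivative $\nabla f \cdot \mathbf{d}$ is largest when $\mathbf{d}$ is parallel to $\nabla f$, and its maximum value is $\|\nabla f\|$. The sketch below samples 360 directions for $f(x, y) = x^2 + y^2$ at the point $(3, 4)$. The function names and the sampling resolution are assumptions made for illustration.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

def grad_f(p):
    x, y = p
    return np.array([2 * x, 2 * y])

p = np.array([3.0, 4.0])
g = grad_f(p)
g_unit = g / np.linalg.norm(g)

# Sample unit directions around the circle in 1-degree steps
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# Directional derivative along each direction via central differences
h = 1e-6
slopes = np.array([(f(p + h * d) - f(p - h * d)) / (2 * h) for d in directions])

best = directions[np.argmax(slopes)]
alignment = best @ g_unit          # cosine of the angle between best and the gradient

print(f"Steepest sampled direction: {best}")
print(f"Unit gradient direction:    {g_unit}")
print(f"Largest slope: {slopes.max():.4f}, gradient norm: {np.linalg.norm(g):.4f}")
```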

## 2.1 Computing Partial Derivatives Numerically

In [None]:
def partial_derivative(f, point, var_index, h=1e-5):
    """Compute partial derivative of f with respect to variable var_index at given point."""
    point = np.array(point, dtype=float)
    point_forward = point.copy()
    point_backward = point.copy()
    point_forward[var_index] += h
    point_backward[var_index] -= h
    return (f(point_forward) - f(point_backward)) / (2 * h)

def numerical_gradient(f, point, h=1e-5):
    """Compute gradient of f at given point."""
    point = np.array(point, dtype=float)
    grad = np.zeros_like(point)
    for i in range(len(point)):
        grad[i] = partial_derivative(f, point, i, h)
    return grad

# Example: f(x, y) = x^2 + y^2
def f_simple(point):
    x, y = point
    return x**2 + y**2

# Analytical gradient: grad f = [2x, 2y]
test_point = [3.0, 4.0]
numerical_grad = numerical_gradient(f_simple, test_point)
analytical_grad = np.array([2 * test_point[0], 2 * test_point[1]])

print(f"f(x, y) = x^2 + y^2 at point ({test_point[0]}, {test_point[1]})")
print(f"Numerical gradient:  {numerical_grad}")
print(f"Analytical gradient: {analytical_grad}")
print(f"Difference: {np.linalg.norm(numerical_grad - analytical_grad):.2e}")

## 2.2 Gradient of $f(x, y) = x^2 + y^2$

In [None]:
# Evaluate gradient at several points
points = [[1, 0], [0, 1], [-1, 0], [0, -1], [1, 1], [-2, 3]]

print(f"{'Point':<15} {'Gradient':<20} {'Magnitude':<15}")
print(f"{'-'*50}")
for p in points:
    grad = numerical_gradient(f_simple, p)
    mag = np.linalg.norm(grad)
    print(f"({p[0]:>3}, {p[1]:>3})     [{grad[0]:>6.2f}, {grad[1]:>6.2f}]     {mag:.4f}")

## 2.3 Visualizing the Gradient Field with a Quiver Plot

In [None]:
# Create a grid
x_range = np.linspace(-3, 3, 15)
y_range = np.linspace(-3, 3, 15)
X, Y = np.meshgrid(x_range, y_range)

# Compute gradient at each grid point
# For f(x,y) = x^2 + y^2, grad = [2x, 2y]
U = 2 * X  # df/dx
V = 2 * Y  # df/dy

# Compute function values for contour
Z = X**2 + Y**2

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot with gradient arrows
contour = axes[0].contour(X, Y, Z, levels=15, cmap='viridis')
axes[0].clabel(contour, inline=True, fontsize=8)
axes[0].quiver(X, Y, U, V, color='red', alpha=0.7)
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title(r'Gradient field of $f(x,y) = x^2 + y^2$')
axes[0].set_aspect('equal')
axes[0].grid(True, alpha=0.3)

# Normalized gradient (direction only)
magnitude = np.sqrt(U**2 + V**2)
magnitude[magnitude == 0] = 1  # avoid division by zero
U_norm = U / magnitude
V_norm = V / magnitude

contour2 = axes[1].contourf(X, Y, Z, levels=20, cmap='viridis', alpha=0.7)
plt.colorbar(contour2, ax=axes[1], label='f(x, y)')
axes[1].quiver(X, Y, U_norm, V_norm, color='white', alpha=0.8)
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].set_title('Normalized gradient (direction only)')
axes[1].set_aspect('equal')

plt.tight_layout()
plt.show()

---

# Part 3: Jacobians

For a vector-valued function $\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m$, the **Jacobian** is the $m \times n$ matrix of partial derivatives:

$$J = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}$$

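The Jacobian generalizes the gradient to vector-valued functions: row $i$ of $J$ is the gradient of the output component $f_i$, written as a row vector. For $m = 1$, the Jacobian is simply the transposed gradient.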

## 3.1 Computing the Jacobian Numerically

In [None]:
def numerical_jacobian(f, point, h=1e-5):
    """Compute Jacobian matrix of vector-valued function f at given point."""
    point = np.array(point, dtype=float)
    f0 = np.atleast_1d(f(point))
    n = len(point)
    m = len(f0)
    J = np.zeros((m, n))
    
    for j in range(n):
        point_fwd = point.copy()
        point_bwd = point.copy()
        point_fwd[j] += h
        point_bwd[j] -= h
        J[:, j] = (f(point_fwd) - f(point_bwd)) / (2 * h)
    
    return J

# Example: f(x, y) = [x^2*y, 5x + sin(y)]
def vector_func(point):
    x, y = point
    return np.array([x**2 * y, 5*x + np.sin(y)])

# Analytical Jacobian:
# J = [[2xy,  x^2],
#      [5,    cos(y)]]
def analytical_jacobian(point):
    x, y = point
    return np.array([
        [2*x*y, x**2],
        [5.0,   np.cos(y)]
    ])

test_point = [1.0, 2.0]
J_num = numerical_jacobian(vector_func, test_point)
J_exact = analytical_jacobian(test_point)

print("f(x, y) = [x^2 * y,  5x + sin(y)]")
print(f"\nAt point ({test_point[0]}, {test_point[1]}):\n")
print("Numerical Jacobian:")
print(J_num)
print("\nAnalytical Jacobian:")
print(J_exact)
print(f"\nMax absolute error: {np.max(np.abs(J_num - J_exact)):.2e}")
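
One practical reading of the Jacobian is as the best linear approximation of $\mathbf{f}$ near a point: $\mathbf{f}(\mathbf{x} + \boldsymbol{\delta}) \approx \mathbf{f}(\mathbf{x}) + J(\mathbf{x})\,\boldsymbol{\delta}$, with an error that shrinks like $\|\boldsymbol{\delta}\|^2$. The sketch below checks this for the same example function. The perturbation `delta` and the helper `linearization_error` are illustrative choices.

```python
import numpy as np

def vector_func(p):
    x, y = p
    return np.array([x**2 * y, 5 * x + np.sin(y)])

def jacobian(p):
    x, y = p
    return np.array([[2 * x * y, x**2],
                     [5.0,       np.cos(y)]])

p = np.array([1.0, 2.0])
delta = np.array([1e-3, -2e-3])    # small perturbation (arbitrary choice)

def linearization_error(d):
    """Largest component-wise gap between f(p + d) and f(p) + J(p) d."""
    exact = vector_func(p + d)
    linear = vector_func(p) + jacobian(p) @ d
    return np.max(np.abs(exact - linear))

err_full = linearization_error(delta)
err_half = linearization_error(delta / 2)
ratio = err_full / err_half

print(f"Error with delta:     {err_full:.3e}")
print(f"Error with delta / 2: {err_half:.3e}")
print(f"Ratio: {ratio:.2f}")
```

Halving the perturbation should cut the error by about a factor of four, which is the signature of a second-order remainder.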

## 3.2 Verifying with SciPy

In [None]:
from scipy.optimize import approx_fprime

# approx_fprime estimates the gradient of a scalar function using forward differences,
# so we build the Jacobian one row (one output component) at a time.
# Expect small differences from J_num, which uses central differences.

def f1(point):
    x, y = point
    return x**2 * y

def f2(point):
    x, y = point
    return 5*x + np.sin(y)

test_point = np.array([1.0, 2.0])
eps = np.sqrt(np.finfo(float).eps)

# Compute each row of the Jacobian using SciPy
row1 = approx_fprime(test_point, f1, eps)
row2 = approx_fprime(test_point, f2, eps)
J_scipy = np.array([row1, row2])

print("SciPy Jacobian:")
print(J_scipy)
print("\nOur numerical Jacobian:")
print(J_num)
print(f"\nMax difference: {np.max(np.abs(J_scipy - J_num)):.2e}")

---

# Part 4: Gradient Descent

**Gradient descent** is the workhorse of machine learning optimization. The update rule is:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla f(\mathbf{x}_k)$$

where $\alpha$ is the **learning rate**.

## 4.1 Gradient Descent from Scratch

In [None]:
def gradient_descent(f, grad_f, x0, learning_rate=0.1, n_iters=100, tol=1e-8):
    """
    Perform gradient descent to minimize f.
    
    Parameters:
        f:             objective function
        grad_f:        gradient of f
        x0:            initial point
        learning_rate: step size
        n_iters:       maximum iterations
        tol:           convergence tolerance
    
    Returns:
        path: list of points visited
        values: list of function values
    """
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    values = [f(x)]
    
    for i in range(n_iters):
        grad = grad_f(x)
        
        # Check convergence before stepping, so no update is taken
        # once the gradient is already (numerically) zero
        if np.linalg.norm(grad) < tol:
            print(f"Converged after {i} iterations")
            break
        
        x = x - learning_rate * grad
        path.append(x.copy())
        values.append(f(x))
    
    return np.array(path), np.array(values)

print("Gradient descent function defined.")

## 4.2 Minimizing $f(x,y) = x^2 + y^2$ (Simple Quadratic)

In [None]:
# Simple quadratic
def quadratic(x):
    return x[0]**2 + x[1]**2

def grad_quadratic(x):
    return np.array([2*x[0], 2*x[1]])

# Run gradient descent
x0 = [4.0, 3.0]
path, values = gradient_descent(quadratic, grad_quadratic, x0, learning_rate=0.1, n_iters=50)

print(f"\nStarting point: ({x0[0]}, {x0[1]}), f = {quadratic(x0):.4f}")
print(f"Final point:    ({path[-1][0]:.6f}, {path[-1][1]:.6f}), f = {values[-1]:.8f}")
print(f"Steps taken:    {len(path) - 1}")

In [None]:
# Visualize descent path on contour plot
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot with path
x_range = np.linspace(-5, 5, 100)
y_range = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + Y**2

axes[0].contourf(X, Y, Z, levels=30, cmap='viridis', alpha=0.7)
axes[0].plot(path[:, 0], path[:, 1], 'ro-', markersize=4, linewidth=1.5, label='GD path')
axes[0].plot(path[0, 0], path[0, 1], 'rs', markersize=12, label='Start')
axes[0].plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title(r'Gradient Descent on $f(x,y) = x^2 + y^2$')
axes[0].legend()
axes[0].set_aspect('equal')

# Convergence plot
axes[1].semilogy(values, 'b-', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x, y) (log scale)')
axes[1].set_title('Convergence of Gradient Descent')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4.3 Minimizing the Rosenbrock Function (Harder)

The Rosenbrock function is a classic test for optimization:

$$f(x, y) = (1 - x)^2 + 100(y - x^2)^2$$

The global minimum is at $(1, 1)$, where $f(1, 1) = 0$. The function has a narrow, curved valley: reaching the valley is easy, but progress along its nearly flat floor toward the minimum is slow.

In [None]:
def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_rosenbrock(x):
    dfdx = -2*(1 - x[0]) + 100 * 2*(x[1] - x[0]**2) * (-2*x[0])
    dfdy = 100 * 2*(x[1] - x[0]**2)
    return np.array([dfdx, dfdy])

# Gradient descent with small learning rate (Rosenbrock needs careful tuning)
x0 = [-1.0, 1.0]
path_rosen, values_rosen = gradient_descent(
    rosenbrock, grad_rosenbrock, x0, 
    learning_rate=0.001, n_iters=10000, tol=1e-8
)

print(f"Starting point: ({x0[0]}, {x0[1]}), f = {rosenbrock(x0):.4f}")
print(f"Final point:    ({path_rosen[-1][0]:.6f}, {path_rosen[-1][1]:.6f})")
print(f"Final value:    f = {values_rosen[-1]:.8f}")
print(f"Steps taken:    {len(path_rosen) - 1}")
print(f"Known minimum:  (1.0, 1.0), f = 0.0")
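
Hand-derived gradients are easy to get wrong, so it is common to compare them against finite differences before trusting an optimizer run. The following is a minimal gradient check for the Rosenbrock gradient. The helper `central_gradient`, the random test points, and the seed are assumptions made for illustration.

```python
import numpy as np

def rosenbrock(x):
    return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2

def grad_rosenbrock(x):
    dfdx = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
    dfdy = 200 * (x[1] - x[0]**2)
    return np.array([dfdx, dfdy])

def central_gradient(f, x, h=1e-5):
    """Central-difference gradient, one coordinate at a time."""
    x = np.array(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Compare at a few random points in [-2, 2] x [-2, 2]
rng = np.random.default_rng(0)
test_points = rng.uniform(-2, 2, size=(5, 2))

max_abs_error = 0.0
for p in test_points:
    diff = grad_rosenbrock(p) - central_gradient(rosenbrock, p)
    max_abs_error = max(max_abs_error, np.max(np.abs(diff)))

print(f"Largest gradient mismatch over {len(test_points)} points: {max_abs_error:.2e}")
```

A mismatch far above the finite-difference error level (for example, of order one) would point to a sign or factor error in the analytical gradient.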

In [None]:
# Visualize Rosenbrock descent
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot
x_range = np.linspace(-2, 2, 200)
y_range = np.linspace(-1, 3, 200)
X, Y = np.meshgrid(x_range, y_range)
Z = (1 - X)**2 + 100*(Y - X**2)**2

axes[0].contourf(X, Y, np.log10(Z + 1), levels=30, cmap='viridis', alpha=0.7)
# Subsample the path to about 200 points to avoid clutter
step = max(1, len(path_rosen) // 200)
axes[0].plot(path_rosen[::step, 0], path_rosen[::step, 1], 'r.-', 
             markersize=2, linewidth=0.5, alpha=0.8, label='GD path')
axes[0].plot(path_rosen[0, 0], path_rosen[0, 1], 'rs', markersize=10, label='Start')
axes[0].plot(1, 1, 'g*', markersize=15, label='Global min (1,1)')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('Gradient Descent on Rosenbrock Function')
axes[0].legend()

# Convergence
axes[1].semilogy(values_rosen, 'b-', linewidth=1)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x, y) (log scale)')
axes[1].set_title('Convergence on Rosenbrock')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4.4 Effect of Learning Rate

In [None]:
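# Compare different learning rates on the quadratic
# Note: lr=0.5 lands exactly on the minimum in one step (x - 0.5 * 2x = 0),
# so f = 0 there, which a log axis cannot display
learning_rates = [0.01, 0.1, 0.5, 0.9]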
x0 = [4.0, 3.0]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour background
x_range = np.linspace(-5, 5, 100)
y_range = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = X**2 + Y**2
axes[0].contourf(X, Y, Z, levels=20, cmap='viridis', alpha=0.5)

colors = ['red', 'blue', 'green', 'orange']
for lr, color in zip(learning_rates, colors):
    path_lr, values_lr = gradient_descent(quadratic, grad_quadratic, x0, 
                                          learning_rate=lr, n_iters=30)
    axes[0].plot(path_lr[:, 0], path_lr[:, 1], 'o-', color=color, 
                 markersize=4, linewidth=1.5, label=f'lr={lr}')
    axes[1].semilogy(values_lr, '-', color=color, linewidth=2, label=f'lr={lr}')

axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('GD Paths for Different Learning Rates')
axes[0].legend()
axes[0].set_aspect('equal')

axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x, y) (log scale)')
axes[1].set_title('Convergence for Different Learning Rates')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
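
For this particular quadratic the behaviour can be predicted exactly. Since $\nabla f = 2\mathbf{x}$, each update is $\mathbf{x}_{k+1} = (1 - 2\alpha)\,\mathbf{x}_k$, so the iterates shrink when $|1 - 2\alpha| < 1$ (that is, $0 < \alpha < 1$), flip sign each step when $\alpha > 0.5$, and diverge when $\alpha > 1$. The sketch below compares the iterates with the closed form $(1 - 2\alpha)^k \mathbf{x}_0$. The extra rate `1.1` is added here only to show divergence.

```python
import numpy as np

x0 = np.array([4.0, 3.0])
n_steps = 30

final_norms = {}   # learning rate -> norm of the iterate after n_steps
matches = {}       # learning rate -> whether the closed form agrees

for lr in [0.01, 0.1, 0.5, 0.9, 1.1]:
    x = x0.copy()
    for _ in range(n_steps):
        x = x - lr * 2 * x                 # gradient of x^2 + y^2 is 2x
    factor = 1 - 2 * lr                    # per-step contraction factor
    predicted = factor**n_steps * x0
    final_norms[lr] = np.linalg.norm(x)
    matches[lr] = np.allclose(x, predicted)
    print(f"lr={lr:<5} factor={factor:+.2f}  |x_final|={final_norms[lr]:.3e}")
```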

---

# Part 5: Chain Rule and Backpropagation

The **chain rule** is the foundation of backpropagation in neural networks:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

In a computation graph, we propagate gradients backward from the loss to all parameters.

## 5.1 Simple Computation Graph Example

Consider the computation: $f(x, y, z) = (x + y) \cdot z$

Let $q = x + y$, then $f = q \cdot z$.

In [None]:
# Forward pass
x, y, z = 2.0, 3.0, -4.0

# Step 1: q = x + y
q = x + y
print(f"Forward pass:")
print(f"  q = x + y = {x} + {y} = {q}")

# Step 2: f = q * z
f_val = q * z
print(f"  f = q * z = {q} * {z} = {f_val}")

# Backward pass (backpropagation)
print(f"\nBackward pass:")

# df/df = 1 (seed gradient)
df_df = 1.0
print(f"  df/df = {df_df}")

# df/dq = z, df/dz = q
df_dq = z * df_df
df_dz = q * df_df
print(f"  df/dq = z = {df_dq}")
print(f"  df/dz = q = {df_dz}")

# dq/dx = 1, dq/dy = 1
# By chain rule: df/dx = df/dq * dq/dx
df_dx = df_dq * 1.0
df_dy = df_dq * 1.0
print(f"  df/dx = df/dq * dq/dx = {df_dq} * 1 = {df_dx}")
print(f"  df/dy = df/dq * dq/dy = {df_dq} * 1 = {df_dy}")

# Verify numerically
h = 1e-5
df_dx_num = ((x+h + y)*z - (x + y)*z) / h
df_dy_num = ((x + y+h)*z - (x + y)*z) / h
df_dz_num = ((x + y)*(z+h) - (x + y)*z) / h

print(f"\nNumerical verification:")
print(f"  df/dx = {df_dx_num:.6f} (analytical: {df_dx})")
print(f"  df/dy = {df_dy_num:.6f} (analytical: {df_dy})")
print(f"  df/dz = {df_dz_num:.6f} (analytical: {df_dz})")
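
The manual backward pass above can be automated. The following is a minimal sketch of reverse-mode automatic differentiation for scalars, supporting only `+` and `*`. The `Value` class is a hypothetical helper written for this illustration, not a library API. Each node stores its parents and the local derivatives with respect to them, and `backward` applies the chain rule in reverse topological order.

```python
class Value:
    """Scalar node in a computation graph (illustrative helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents            # nodes this value was computed from
        self.local_grads = local_grads    # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a topological order so each node is processed after its consumers
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0                   # seed gradient: d(self)/d(self) = 1
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad   # chain rule, summed over paths

x, y, z = Value(2.0), Value(3.0), Value(-4.0)
q = x + y
f = q * z
f.backward()

print(f"f = {f.data}")
print(f"df/dx = {x.grad}, df/dy = {y.grad}, df/dz = {z.grad}")
```

Accumulating with `+=` matters when a variable feeds into several nodes: the gradient contributions from every path are summed.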

## 5.2 Backpropagation for a 2-Layer Neural Network

We implement a simple 2-layer network for binary classification:

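- **Forward:** $z_1 = W_1 x + b_1$, $a_1 = \sigma(z_1)$, $z_2 = W_2 a_1 + b_2$, $\hat{y} = \sigma(z_2)$, where $\sigma(z) = 1 / (1 + e^{-z})$ is the sigmoid
- **Loss:** $L = -(y \log \hat{y} + (1-y)\log(1 - \hat{y}))$ (binary cross-entropy)
- **Output gradient:** combining the sigmoid with this loss gives $\partial L / \partial z_2 = \hat{y} - y$, which is why the backward pass starts from `a2 - y_true`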

In [None]:
def sigmoid(z):
    """Sigmoid activation function."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_derivative(z):
    """Derivative of sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

class TwoLayerNet:
    def __init__(self, input_dim, hidden_dim, output_dim):
        """Initialize a 2-layer neural network."""
        np.random.seed(42)
        self.W1 = np.random.randn(hidden_dim, input_dim) * 0.5
        self.b1 = np.zeros((hidden_dim, 1))
        self.W2 = np.random.randn(output_dim, hidden_dim) * 0.5
        self.b2 = np.zeros((output_dim, 1))
    
    def forward(self, X):
        """Forward pass. X shape: (input_dim, n_samples)."""
        self.X = X
        self.z1 = self.W1 @ X + self.b1
        self.a1 = sigmoid(self.z1)
        self.z2 = self.W2 @ self.a1 + self.b2
        self.a2 = sigmoid(self.z2)
        return self.a2
    
    def compute_loss(self, y_pred, y_true):
        """Binary cross-entropy loss, averaged over all entries."""
        eps = 1e-8  # keeps log() away from log(0) when predictions saturate
        loss = -np.mean(y_true * np.log(y_pred + eps) + 
                        (1 - y_true) * np.log(1 - y_pred + eps))
        return loss
    
    def backward(self, y_true):
        """Backward pass (backpropagation)."""
        m = y_true.shape[1]
        
        # Output layer gradients
        dz2 = self.a2 - y_true                     # (output_dim, m)
        self.dW2 = (1/m) * dz2 @ self.a1.T         # (output_dim, hidden_dim)
        self.db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)
        
        # Hidden layer gradients (chain rule)
        da1 = self.W2.T @ dz2                      # (hidden_dim, m)
        dz1 = da1 * sigmoid_derivative(self.z1)     # element-wise
        self.dW1 = (1/m) * dz1 @ self.X.T          # (hidden_dim, input_dim)
        self.db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)
    
    def update(self, learning_rate):
        """Update parameters using gradient descent."""
        self.W1 -= learning_rate * self.dW1
        self.b1 -= learning_rate * self.db1
        self.W2 -= learning_rate * self.dW2
        self.b2 -= learning_rate * self.db2

print("TwoLayerNet class defined.")
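
Before training, it is worth confirming that backpropagated gradients agree with finite differences. The sketch below does this for a tiny network with the same equations as above. It uses standalone functions (`loss_fn`, `backprop`, `numerical_grads`) rather than the `TwoLayerNet` class, and the layer sizes, seed, and random data are assumptions chosen to keep the check fast.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss_fn(params, X, y):
    """Binary cross-entropy of the 2-layer network, averaged over samples."""
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ X + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return -np.mean(y * np.log(a2) + (1 - y) * np.log(1 - a2))

def backprop(params, X, y):
    """Analytical gradients via the chain rule."""
    W1, b1, W2, b2 = params
    m = X.shape[1]
    a1 = sigmoid(W1 @ X + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    dz2 = a2 - y
    dW2 = dz2 @ a1.T / m
    db2 = dz2.sum(axis=1, keepdims=True) / m
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)     # sigmoid'(z1) = a1 * (1 - a1)
    dW1 = dz1 @ X.T / m
    db1 = dz1.sum(axis=1, keepdims=True) / m
    return [dW1, db1, dW2, db2]

def numerical_grads(params, X, y, h=1e-5):
    """Central-difference gradient for every parameter entry."""
    grads = []
    for P in params:
        G = np.zeros_like(P)
        for idx in np.ndindex(P.shape):
            old = P[idx]
            P[idx] = old + h
            plus = loss_fn(params, X, y)
            P[idx] = old - h
            minus = loss_fn(params, X, y)
            P[idx] = old                   # restore the original value
            G[idx] = (plus - minus) / (2 * h)
        grads.append(G)
    return grads

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 5))                          # 2 features, 5 samples
y = rng.integers(0, 2, size=(1, 5)).astype(float)    # binary labels
params = [rng.normal(size=(3, 2)) * 0.5, np.zeros((3, 1)),
          rng.normal(size=(1, 3)) * 0.5, np.zeros((1, 1))]

analytic = backprop(params, X, y)
numeric = numerical_grads(params, X, y)
max_err = max(np.max(np.abs(a - n)) for a, n in zip(analytic, numeric))

print(f"Largest mismatch between backprop and finite differences: {max_err:.2e}")
```

The numerical check costs two forward passes per parameter, so it is only practical on small networks and is normally used as a one-off test.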

In [None]:
# Generate a simple 2D dataset (two concentric circles)
np.random.seed(42)
n_samples = 200

# Class 0: inner circle
r0 = np.random.randn(n_samples // 2) * 0.3 + 1.0
theta0 = np.random.uniform(0, 2*np.pi, n_samples // 2)
X0 = np.column_stack([r0 * np.cos(theta0), r0 * np.sin(theta0)])

# Class 1: outer circle
r1 = np.random.randn(n_samples // 2) * 0.3 + 2.5
theta1 = np.random.uniform(0, 2*np.pi, n_samples // 2)
X1 = np.column_stack([r1 * np.cos(theta1), r1 * np.sin(theta1)])

X = np.vstack([X0, X1]).T  # shape: (2, n_samples)
y = np.hstack([np.zeros(n_samples // 2), np.ones(n_samples // 2)]).reshape(1, -1)

# Train the network
net = TwoLayerNet(input_dim=2, hidden_dim=8, output_dim=1)
losses = []
n_epochs = 2000

for epoch in range(n_epochs):
    # Forward
    y_pred = net.forward(X)
    loss = net.compute_loss(y_pred, y)
    losses.append(loss)
    
    # Backward
    net.backward(y)
    
    # Update
    net.update(learning_rate=1.0)
    
    if (epoch + 1) % 500 == 0:
        acc = np.mean((y_pred > 0.5).astype(float) == y) * 100
        print(f"Epoch {epoch+1:4d}: loss = {loss:.4f}, accuracy = {acc:.1f}%")

# Final accuracy
y_pred_final = net.forward(X)
accuracy = np.mean((y_pred_final > 0.5).astype(float) == y) * 100
print(f"\nFinal accuracy: {accuracy:.1f}%")

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Training data
axes[0].scatter(X[0, y[0]==0], X[1, y[0]==0], c='blue', label='Class 0', alpha=0.6)
axes[0].scatter(X[0, y[0]==1], X[1, y[0]==1], c='red', label='Class 1', alpha=0.6)
axes[0].set_xlabel('x1')
axes[0].set_ylabel('x2')
axes[0].set_title('Training Data')
axes[0].legend()
axes[0].set_aspect('equal')
axes[0].grid(True, alpha=0.3)

# Decision boundary
xx = np.linspace(-4, 4, 200)
yy = np.linspace(-4, 4, 200)
XX, YY = np.meshgrid(xx, yy)
grid_points = np.vstack([XX.ravel(), YY.ravel()])
ZZ = net.forward(grid_points).reshape(XX.shape)

axes[1].contourf(XX, YY, ZZ, levels=20, cmap='RdBu_r', alpha=0.7)
axes[1].contour(XX, YY, ZZ, levels=[0.5], colors='black', linewidths=2)
axes[1].scatter(X[0, y[0]==0], X[1, y[0]==0], c='blue', edgecolors='k', s=20)
axes[1].scatter(X[0, y[0]==1], X[1, y[0]==1], c='red', edgecolors='k', s=20)
axes[1].set_xlabel('x1')
axes[1].set_ylabel('x2')
axes[1].set_title('Decision Boundary')
axes[1].set_aspect('equal')

# Loss curve
axes[2].plot(losses, 'b-', linewidth=1.5)
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('Loss')
axes[2].set_title('Training Loss')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

# Part 6: Hessian and Second-Order Information

The **Hessian matrix** collects all second-order partial derivatives. For a function of two variables $f(x_1, x_2)$:

$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} \end{bmatrix}$$

The Hessian tells us about the **curvature** of the function and enables second-order optimization methods.

## 6.1 Computing the Hessian Numerically

In [None]:
def numerical_hessian(f, point, h=1e-5):
    """Compute the Hessian matrix numerically using central differences."""
    point = np.array(point, dtype=float)
    n = len(point)
    H = np.zeros((n, n))
    
    for i in range(n):
        for j in range(n):
            # Use central difference for second derivatives
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            
            H[i, j] = (
                f(point + ei + ej) 
                - f(point + ei - ej) 
                - f(point - ei + ej) 
                + f(point - ei - ej)
            ) / (4 * h * h)
    
    return H

# Example: f(x, y) = x^4 + x^2*y + y^2
def f_hess(point):
    x, y = point
    return x**4 + x**2 * y + y**2

# Analytical Hessian:
# H = [[12x^2 + 2y, 2x],
#      [2x,          2 ]]
def analytical_hessian(point):
    x, y = point
    return np.array([
        [12*x**2 + 2*y, 2*x],
        [2*x,           2.0]
    ])

test_point = [1.0, 2.0]
H_num = numerical_hessian(f_hess, test_point)
H_exact = analytical_hessian(test_point)

print(f"f(x, y) = x^4 + x^2*y + y^2")
print(f"At point ({test_point[0]}, {test_point[1]}):\n")
print("Numerical Hessian:")
print(H_num)
print("\nAnalytical Hessian:")
print(H_exact)
print(f"\nMax absolute error: {np.max(np.abs(H_num - H_exact)):.2e}")

# Eigenvalues tell us about curvature
eigenvalues = np.linalg.eigvals(H_exact)
print(f"\nEigenvalues of Hessian: {eigenvalues}")
print(f"All positive (convex at this point): {np.all(eigenvalues > 0)}")
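
At a point where the gradient vanishes, the signs of the Hessian eigenvalues classify the point: all positive means a local minimum, all negative a local maximum, mixed signs a saddle point, and any zero eigenvalue leaves the second-order test inconclusive. The following is a minimal sketch of that test. The function name `classify_critical_point` and the tolerance `tol` are assumptions, and the caller is assumed to have already verified that the gradient is zero.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Second-derivative test based on the eigenvalues of a symmetric Hessian."""
    eigenvalues = np.linalg.eigvalsh(H)    # eigvalsh assumes H is symmetric
    if np.all(eigenvalues > tol):
        return "local minimum"
    if np.all(eigenvalues < -tol):
        return "local maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle point"
    return "inconclusive"

# Hessians at the origin, which is a critical point of each function
examples = {
    "x^2 + y^2":    np.array([[2.0, 0.0], [0.0, 2.0]]),
    "-(x^2 + y^2)": np.array([[-2.0, 0.0], [0.0, -2.0]]),
    "x^2 - y^2":    np.array([[2.0, 0.0], [0.0, -2.0]]),
    "x^2 + y^4":    np.array([[2.0, 0.0], [0.0, 0.0]]),
}

results = {name: classify_critical_point(H) for name, H in examples.items()}
for name, label in results.items():
    print(f"f(x, y) = {name:<14} -> {label}")
```

The last example shows the limit of the test: $x^2 + y^4$ does have a minimum at the origin, but the Hessian alone cannot establish it.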

## 6.2 Newton's Method vs Gradient Descent

**Newton's method** uses the Hessian to take better steps:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - H^{-1} \nabla f(\mathbf{x}_k)$$

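Near a minimum where the Hessian is positive definite, it converges quadratically, which is much faster than gradient descent. The price is that each step requires forming the Hessian and solving a linear system, which becomes expensive in high dimensions.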

In [None]:
def newtons_method(f, grad_f, hess_f, x0, n_iters=50, tol=1e-10):
    """Newton's method for optimization."""
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    values = [f(x)]
    
    for i in range(n_iters):
        grad = grad_f(x)
        
        # Stop before stepping if the gradient is already (numerically) zero
        if np.linalg.norm(grad) < tol:
            print(f"Converged after {i} steps")
            break
        
        hess = hess_f(x)
        
        # Newton step: solve H * step = grad instead of forming H^{-1}
        try:
            step = np.linalg.solve(hess, grad)
        except np.linalg.LinAlgError:
            print(f"Hessian singular at iteration {i+1}")
            break
        
        x = x - step
        path.append(x.copy())
        values.append(f(x))
    
    return np.array(path), np.array(values)

print("Newton's method function defined.")
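
A useful sanity check is that a single Newton step solves any strictly convex quadratic exactly. For $f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^\top A \mathbf{x} - \mathbf{b}^\top \mathbf{x}$ with symmetric positive definite $A$, the gradient is $A\mathbf{x} - \mathbf{b}$ and the Hessian is $A$, so $\mathbf{x}_1 = \mathbf{x}_0 - A^{-1}(A\mathbf{x}_0 - \mathbf{b}) = A^{-1}\mathbf{b}$ regardless of $\mathbf{x}_0$. The sketch below confirms this. The matrix `A`, vector `b`, and starting points are arbitrary illustrative values.

```python
import numpy as np

A = np.array([[10.0, 2.0],
              [2.0,  2.0]])          # symmetric positive definite (illustrative)
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b                 # gradient of 0.5 * x^T A x - b^T x

x_star = np.linalg.solve(A, b)       # exact minimizer: A x = b

# One Newton step from two unrelated starting points
starts = [np.array([4.0, 4.0]), np.array([-100.0, 37.0])]
after_one_step = [x0 - np.linalg.solve(A, grad(x0)) for x0 in starts]

for x0, x1 in zip(starts, after_one_step):
    print(f"start {x0} -> after one step {x1}")
print(f"exact minimizer: {x_star}")
```

For non-quadratic functions the Hessian changes from point to point, so several steps are needed and the fast convergence only holds close to the minimum.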

In [None]:
# Compare on a quadratic with different curvatures: f(x,y) = 5x^2 + y^2
def f_elliptic(point):
    x, y = point
    return 5*x**2 + y**2

def grad_elliptic(point):
    x, y = point
    return np.array([10*x, 2*y])

def hess_elliptic(point):
    return np.array([[10.0, 0.0],
                     [0.0,  2.0]])

x0 = [4.0, 4.0]

# Gradient descent
path_gd, values_gd = gradient_descent(
    f_elliptic, grad_elliptic, x0, learning_rate=0.08, n_iters=50
)

# Newton's method
path_newton, values_newton = newtons_method(
    f_elliptic, grad_elliptic, hess_elliptic, x0, n_iters=50
)

print(f"\nGradient Descent:")
print(f"  Final point: ({path_gd[-1][0]:.8f}, {path_gd[-1][1]:.8f})")
print(f"  Final value: {values_gd[-1]:.10f}")
print(f"  Steps: {len(path_gd) - 1}")

print(f"\nNewton's Method:")
print(f"  Final point: ({path_newton[-1][0]:.8f}, {path_newton[-1][1]:.8f})")
print(f"  Final value: {values_newton[-1]:.10f}")
print(f"  Steps: {len(path_newton) - 1}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Contour plot with both paths
x_range = np.linspace(-5, 5, 100)
y_range = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = 5*X**2 + Y**2

axes[0].contourf(X, Y, Z, levels=30, cmap='viridis', alpha=0.6)
axes[0].contour(X, Y, Z, levels=15, colors='grey', alpha=0.3, linewidths=0.5)

axes[0].plot(path_gd[:, 0], path_gd[:, 1], 'ro-', markersize=5, linewidth=1.5, 
             label=f'Gradient Descent ({len(path_gd)-1} steps)')
axes[0].plot(path_newton[:, 0], path_newton[:, 1], 'bs-', markersize=8, linewidth=2, 
             label=f'Newton ({len(path_newton)-1} steps)')
axes[0].plot(0, 0, 'g*', markersize=15, label='Minimum')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title(r'GD vs Newton on $f(x,y) = 5x^2 + y^2$')
axes[0].legend()

# Convergence comparison
axes[1].semilogy(range(len(values_gd)), values_gd, 'r-', linewidth=2, label='Gradient Descent')
axes[1].semilogy(range(len(values_newton)), values_newton, 'b-', linewidth=2, label="Newton's Method")
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('f(x, y) (log scale)')
axes[1].set_title('Convergence Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6.3 When Does Newton's Method Shine?

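Newton's method is especially powerful when the Hessian has a large **condition number** (the ratio of its largest to smallest eigenvalue), meaning the curvature differs strongly between directions. Gradient descent struggles here: the learning rate must be small enough to stay stable along the steepest direction, which makes progress along the flattest direction slow, while a larger learning rate makes the iterates oscillate across the steep direction.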

In [None]:
# Ill-conditioned quadratic: f(x,y) = 50x^2 + y^2 (condition number = 50)
def f_ill(point):
    x, y = point
    return 50*x**2 + y**2

def grad_ill(point):
    x, y = point
    return np.array([100*x, 2*y])

def hess_ill(point):
    return np.array([[100.0, 0.0],
                     [0.0,   2.0]])

x0 = [3.0, 3.0]

# Gradient descent (needs a small learning rate due to large curvature:
# it is stable along x only for lr < 2/100 = 0.02)
path_gd_ill, values_gd_ill = gradient_descent(
    f_ill, grad_ill, x0, learning_rate=0.009, n_iters=200
)

# Newton's method
path_newton_ill, values_newton_ill = newtons_method(
    f_ill, grad_ill, hess_ill, x0, n_iters=50
)

print(f"Condition number of Hessian: {100/2:.0f}")
print(f"\nGradient Descent: {len(path_gd_ill)-1} steps, final f = {values_gd_ill[-1]:.6e}")
print(f"Newton's Method:  {len(path_newton_ill)-1} steps, final f = {values_newton_ill[-1]:.6e}")

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
x_range = np.linspace(-4, 4, 100)
y_range = np.linspace(-4, 4, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = 50*X**2 + Y**2

ax.contourf(X, Y, Z, levels=30, cmap='viridis', alpha=0.6)
ax.plot(path_gd_ill[:, 0], path_gd_ill[:, 1], 'r.-', markersize=3, linewidth=0.8, 
        alpha=0.8, label=f'GD ({len(path_gd_ill)-1} steps)')
ax.plot(path_newton_ill[:, 0], path_newton_ill[:, 1], 'bs-', markersize=10, linewidth=2, 
        label=f'Newton ({len(path_newton_ill)-1} steps)')
ax.plot(0, 0, 'g*', markersize=15, label='Minimum')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_title(r'Ill-conditioned problem: $f(x,y) = 50x^2 + y^2$ (condition = 50)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
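
The slow progress of gradient descent here can be quantified. On a quadratic with Hessian eigenvalues $\lambda_i$, a step with learning rate $\alpha$ multiplies the error along eigenvector $i$ by $1 - \alpha\lambda_i$. Stability therefore requires $\alpha < 2/\lambda_{\max}$, and the best fixed rate $\alpha = 2/(\lambda_{\min} + \lambda_{\max})$ still only contracts the error by $(\kappa - 1)/(\kappa + 1)$ per step, where $\kappa$ is the condition number. The sketch below computes these quantities for the Hessian used above. The helper `per_step_factor` is an illustrative name.

```python
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0,   2.0]])

eigs = np.linalg.eigvalsh(H)             # eigenvalues in ascending order
lam_min, lam_max = eigs[0], eigs[-1]
kappa = lam_max / lam_min                # condition number

lr_max = 2 / lam_max                     # stability limit for the learning rate
lr_best = 2 / (lam_min + lam_max)        # best fixed learning rate
rate_best = (kappa - 1) / (kappa + 1)    # worst-direction contraction at lr_best

def per_step_factor(lr):
    """Largest |1 - lr * lambda_i|: the slowest-shrinking error component."""
    return np.max(np.abs(1 - lr * eigs))

rate_used = per_step_factor(0.009)       # learning rate used in the cell above

print(f"Condition number:        {kappa:.0f}")
print(f"Stable learning rates:   lr < {lr_max:.3f}")
print(f"Best fixed lr:           {lr_best:.4f} (factor {rate_best:.4f} per step)")
print(f"Factor at lr = 0.009:    {rate_used:.4f} per step")
```

A per-step factor close to 1 means many iterations are needed, whereas Newton's method rescales each direction by its own curvature and is unaffected by the conditioning of a quadratic.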

---

# Practice Exercises

Try these on your own:

**Exercise 1: Numerical Differentiation**  
Compute the derivative of $f(x) = e^{-x^2}$ using both forward and central differences. Compare with the analytical derivative $f'(x) = -2x e^{-x^2}$ at $x = 1$. Plot the function and its derivative over $[-3, 3]$.

**Exercise 2: Gradient Visualization**  
Compute and visualize the gradient field of $f(x, y) = \sin(x) \cdot \cos(y)$ using a quiver plot on the domain $[-\pi, \pi] \times [-\pi, \pi]$. Overlay contour lines on the same plot.

**Exercise 3: Gradient Descent with Momentum**  
Implement gradient descent **with momentum** using the update rule:  
$v_{k+1} = \beta \, v_k + \alpha \, \nabla f(x_k)$  
$x_{k+1} = x_k - v_{k+1}$  
Test it on the Rosenbrock function with $\alpha = 0.001$ and $\beta = 0.9$. Compare convergence with standard gradient descent.

**Exercise 4: Hessian Analysis**  
For the function $f(x, y) = x^3 - 3xy^2$ (the "monkey saddle"), compute the Hessian at the origin $(0, 0)$. What do the eigenvalues tell you about the nature of this critical point? Visualize the surface using a 3D plot.
