# Matrix Analysis and Matrix Calculus
## The Mathematical Foundation of Neural Network Backpropagation

Welcome to the **mathematical engine** that powers deep learning! Matrix calculus is what makes it possible to train neural networks with millions of parameters efficiently.

### What You'll Master
By the end of this notebook, you'll understand:
1. **Matrix derivatives** - How to take derivatives of matrix expressions
2. **The chain rule** for matrices - The foundation of backpropagation
3. **Jacobian and Hessian matrices** - First- and second-order derivatives of multivariable functions
4. **Automatic differentiation** - How modern frameworks compute gradients
5. **Backpropagation algorithm** - Step-by-step neural network training

### Why This is Revolutionary
- **Neural networks** have millions of parameters organized in matrices
- **Matrix calculus** lets us write the gradients for an entire layer as a few matrix operations instead of one scalar derivative per parameter
- **Backpropagation** trains deep networks efficiently using the matrix chain rule
- **Understanding this** helps you debug and design better architectures

### Prerequisites
- Basic linear algebra (vectors, matrices, multiplication)
- Multivariate calculus (partial derivatives, chain rule)
- Understanding of neural network basics

Let's unlock the mathematics of deep learning! 🧠🔢

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.linalg import norm, svd
import sympy as sp
from sympy import symbols, Matrix, diff, simplify
import torch
import torch.nn as nn
import torch.autograd as autograd
from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Set style for beautiful plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

# Enable LaTeX pretty-printing of SymPy expressions
sp.init_printing(use_latex=True)

print("🧠 Matrix Calculus toolkit loaded!")
print("Ready to compute gradients like a neural network!")

## 1. Matrix Derivatives: Beyond Scalar Calculus

### The Challenge
In neural networks, we rarely differentiate simple scalar functions like `f(x) = x²`. Instead, the objects involved are vectors and matrices, and the functions act on them:
- Weight matrices: `W ∈ ℝᵐˣⁿ`
- Activations: `σ(Wx + b)`, with `σ` applied elementwise
- Loss functions: `L(θ) = ||y - ŷ||²`

### Matrix Derivative Notation

**Scalar by Vector**: `∂f/∂x` where f is scalar, x is vector
```
∂f/∂x = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
```

**Scalar by Matrix**: `∂f/∂W` where f is scalar, W is matrix
```
∂f/∂W = [∂f/∂w₁₁  ∂f/∂w₁₂  ...]
         [∂f/∂w₂₁  ∂f/∂w₂₂  ...]
         [   ...      ...   ...]
```

**Vector by Vector**: `∂y/∂x` where both are vectors → **Jacobian Matrix** (shape `m × n` for `y ∈ ℝᵐ`, `x ∈ ℝⁿ`)
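To make the three cases concrete, here is a small NumPy sketch of the shapes involved. The vectors `a`, `b`, the matrix `W`, and the test functions are illustrative choices, not part of the notebook; the identities used (`∂(bᵀx)/∂x = b`, `∂(aᵀWb)/∂W = abᵀ`, `∂(Wx)/∂x = W`) are standard results.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3)       # illustrative vector in R^3
b = rng.normal(size=2)       # illustrative vector in R^2
W = rng.normal(size=(3, 2))  # illustrative 3x2 matrix
x = rng.normal(size=2)

# Scalar by vector: f(x) = b^T x has gradient b, the same shape as x
grad_x = b

# Scalar by matrix: f(W) = a^T W b has gradient a b^T, the same shape as W
grad_W = np.outer(a, b)

# Vector by vector: y = W x has Jacobian W, with shape (len(y), len(x))
jac = W

print("∂f/∂x shape:", grad_x.shape)
print("∂f/∂W shape:", grad_W.shape)
print("∂y/∂x shape:", jac.shape)

# Sanity check for the matrix case: f is linear in W, so a perturbation D
# changes f by exactly the elementwise inner product <grad_W, D>.
D = rng.normal(size=W.shape)
change_in_f = a @ (W + D) @ b - a @ W @ b
predicted_change = np.sum(grad_W * D)
print("Matches first-order prediction:", np.isclose(change_in_f, predicted_change))
```

The rule of thumb this illustrates: the derivative of a scalar takes the shape of the variable it is differentiated with respect to, while a vector-by-vector derivative has one row per output and one column per input.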

### Why Matrix Derivatives Matter
1. **Efficiency**: Compute all parameter gradients simultaneously
2. **Neural Networks**: Each layer is a matrix transformation
3. **Optimization**: Update millions of parameters at once
4. **Automatic Differentiation**: Modern frameworks use these rules

In [None]:
# Let's start with simple matrix derivative examples

def demonstrate_matrix_derivatives():
    """Show basic matrix derivative computations"""
    
    print("🔢 Matrix Derivative Examples")
    print("=" * 40)
    
    # Example 1: Quadratic form
    print("\n1. Quadratic Form: f(x) = x^T A x")
    print("   Derivative: ∂f/∂x = (A + A^T)x")
    print("   If A is symmetric: ∂f/∂x = 2Ax")
    
    # Numerical example
    A = np.array([[2, 1], [1, 3]])
    x = np.array([1, 2])
    
    f_val = x.T @ A @ x
    gradient = (A + A.T) @ x
    
    print(f"   A = {A}")
    print(f"   x = {x}")
    print(f"   f(x) = {f_val}")
    print(f"   ∇f = {gradient}")
    
    # Example 2: Linear form (scalar-valued, so the gradient is a vector)
    print("\n2. Linear Form: f(x) = a^T x")
    print("   Derivative: ∂f/∂x = a")
    
    a = np.array([2, -1])
    f_linear = a.T @ x
    grad_linear = a
    
    print(f"   a = {a}")
    print(f"   f(x) = a^T x = {f_linear}")
    print(f"   ∇f = {grad_linear}")
    
    # Example 3: Matrix trace
    print("\n3. Matrix Trace: f(W) = tr(W^T A W)")
    print("   Derivative: ∂f/∂W = A W + A^T W")
    
    W = np.array([[1, 2], [3, 1]])
    A_mat = np.array([[1, 0], [0, 2]])
    
    f_trace = np.trace(W.T @ A_mat @ W)
    grad_trace = A_mat @ W + A_mat.T @ W
    
    print(f"   W = {W}")
    print(f"   A = {A_mat}")
    print(f"   f(W) = {f_trace}")
    print(f"   ∂f/∂W = \n{grad_trace}")

demonstrate_matrix_derivatives()
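A useful habit is to check analytic formulas like these against finite differences. The sketch below does that for the quadratic-form and trace examples, reusing the same `A`, `x`, and `W` values as the cell above. The helper `numerical_gradient` is a hypothetical name introduced here for illustration, not a library function.

```python
import numpy as np

def numerical_gradient(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f at X (vector or matrix)."""
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        step = np.zeros_like(X)
        step[idx] = eps
        grad[idx] = (f(X + step) - f(X - step)) / (2 * eps)
    return grad

# Quadratic form: f(x) = x^T A x, analytic gradient (A + A^T) x
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])
grad_quad_analytic = (A + A.T) @ x
grad_quad_numeric = numerical_gradient(lambda v: v @ A @ v, x)

# Trace: f(W) = tr(W^T A W), analytic gradient A W + A^T W
W = np.array([[1.0, 2.0], [3.0, 1.0]])
A_mat = np.array([[1.0, 0.0], [0.0, 2.0]])
grad_trace_analytic = A_mat @ W + A_mat.T @ W
grad_trace_numeric = numerical_gradient(lambda M: np.trace(M.T @ A_mat @ M), W)

print("Quadratic form agrees:", np.allclose(grad_quad_analytic, grad_quad_numeric, atol=1e-6))
print("Trace agrees:", np.allclose(grad_trace_analytic, grad_trace_numeric, atol=1e-6))
```

Central differences cost two function evaluations per entry, so this is only practical for small checks, but it catches sign and transpose mistakes quickly.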

## 2. The Chain Rule for Matrices: Heart of Backpropagation

### The Power of Matrix Chain Rule
In neural networks, we have **composite matrix functions**:
```
Input → Layer 1 → Layer 2 → ... → Output → Loss
  x   →   z₁   →   z₂   → ... →   ŷ    →  L
```

Each arrow represents a differentiable transformation, typically a matrix multiplication followed by an elementwise nonlinearity.

### Matrix Chain Rule Formula
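For a composite function `L(z₂(z₁(x)))`, the chain rule gives:
```
∂L/∂x = (∂L/∂z₂) × (∂z₂/∂z₁) × (∂z₁/∂x)
```

Here `∂L/∂z₂` is treated as a row vector (the transpose of the gradient), and each `∂z_i/∂z_{i-1}` is a **Jacobian matrix**. With gradients stored as column vectors, as in Section 1, the same rule reads `∇ₓL = (∂z₁/∂x)ᵀ (∂z₂/∂z₁)ᵀ ∇_{z₂}L`, which is the right-to-left order that backpropagation uses.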
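The chain rule can be checked numerically on a tiny two-stage function. The sketch below is an illustrative setup, not the notebook's network: it assumes `z₁ = tanh(W1 x)`, `z₂ = W2 z₁`, and `L = ½‖z₂‖²`, with random `W1` and `W2`. Because NumPy gradients are stored as plain arrays, the product is applied in its transposed (column-gradient) form, starting from the loss and moving back toward the input.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))  # illustrative weights
W2 = rng.normal(size=(2, 3))
x = rng.normal(size=2)

def loss_from_x(x):
    """L(z2(z1(x))) with z1 = tanh(W1 x), z2 = W2 z1, L = 0.5 * ||z2||^2."""
    z1 = np.tanh(W1 @ x)
    z2 = W2 @ z1
    return 0.5 * np.sum(z2 ** 2)

# Forward pass, keeping intermediates
z1 = np.tanh(W1 @ x)
z2 = W2 @ z1

# Local derivatives
dL_dz2 = z2                              # gradient of 0.5 * ||z2||^2, shape (2,)
J_z2_z1 = W2                             # Jacobian of z2 w.r.t. z1, shape (2, 3)
J_z1_x = (1 - z1 ** 2)[:, None] * W1     # diag(1 - tanh^2) @ W1, shape (3, 2)

# Chain rule: multiply transposed Jacobians from the loss back to the input
grad_x = J_z1_x.T @ (J_z2_z1.T @ dL_dz2)

# Central finite differences for comparison
eps = 1e-6
grad_fd = np.array([
    (loss_from_x(x + eps * e) - loss_from_x(x - eps * e)) / (2 * eps)
    for e in np.eye(x.size)
])

print("Chain rule matches finite differences:", np.allclose(grad_x, grad_fd, atol=1e-6))
```

Grouping the product as `J_z1_x.T @ (J_z2_z1.T @ dL_dz2)` keeps every intermediate a vector, so no Jacobian-times-Jacobian product is ever formed. That ordering is what makes reverse-mode differentiation cheap for scalar losses.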

### The Jacobian Matrix
For vector function `f: ℝⁿ → ℝᵐ`, the Jacobian is:
```
J = ∂f/∂x = [∂f₁/∂x₁  ∂f₁/∂x₂  ...  ∂f₁/∂xₙ]
             [∂f₂/∂x₁  ∂f₂/∂x₂  ...  ∂f₂/∂xₙ]
             [   ...      ...    ...    ... ]
             [∂fₘ/∂x₁  ∂fₘ/∂x₂  ...  ∂fₘ/∂xₙ]
```
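For a single sigmoid layer `a = σ(Wx + b)`, each entry of the Jacobian is `∂aᵢ/∂xⱼ = σ'(zᵢ) Wᵢⱼ`, which in matrix form is `diag(σ'(z)) W`. The sketch below builds that matrix and compares it column by column with finite differences. The weights and bias are random illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))   # illustrative layer: R^2 -> R^3
b = rng.normal(size=3)
x = np.array([1.0, 2.0])

def layer(x):
    return sigmoid(W @ x + b)

# Analytic Jacobian: row i is sigmoid'(z_i) times row i of W
z = W @ x + b
s = sigmoid(z)
J_analytic = (s * (1 - s))[:, None] * W   # shape (3, 2)

# Numeric Jacobian: column j is the change in the output per unit change in x_j
eps = 1e-6
J_numeric = np.column_stack([
    (layer(x + eps * e) - layer(x - eps * e)) / (2 * eps)
    for e in np.eye(x.size)
])

print("Jacobian shape:", J_analytic.shape)
print("Analytic matches numeric:", np.allclose(J_analytic, J_numeric, atol=1e-6))
```

The shape follows the definition above: one row per output component (`m = 3`) and one column per input component (`n = 2`).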

### Real-World Analogy: Assembly Line
Think of a manufacturing assembly line:
1. **Raw materials** (input x)
2. **Station 1** transforms materials (z₁ = f₁(x))
3. **Station 2** processes further (z₂ = f₂(z₁))
4. **Quality check** measures final product (L = loss function)

To improve quality, we need to know how each station affects the final product!

In [None]:
# Demonstrate matrix chain rule with a simple neural network

class SimpleNeuralNetwork:
    """A minimal neural network to demonstrate matrix calculus"""
    
    def __init__(self, input_size=2, hidden_size=3, output_size=1):
        # Initialize weights randomly
        self.W1 = np.random.randn(hidden_size, input_size) * 0.5
        self.b1 = np.zeros((hidden_size, 1))
        self.W2 = np.random.randn(output_size, hidden_size) * 0.5
        self.b2 = np.zeros((output_size, 1))
        
        # Store intermediate values for backprop
        self.z1 = None
        self.a1 = None
        self.z2 = None
        self.a2 = None
    
    def sigmoid(self, z):
        """Sigmoid activation function"""
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def sigmoid_derivative(self, z):
        """Derivative of sigmoid"""
        s = self.sigmoid(z)
        return s * (1 - s)
    
    def forward(self, X):
        """Forward pass through the network"""
        # Layer 1: z1 = W1 * x + b1
        self.z1 = self.W1 @ X + self.b1
        self.a1 = self.sigmoid(self.z1)
        
        # Layer 2: z2 = W2 * a1 + b2
        self.z2 = self.W2 @ self.a1 + self.b2
        self.a2 = self.sigmoid(self.z2)
        
        return self.a2
    
    def compute_loss(self, y_true, y_pred):
        """Half mean squared error: 0.5 * mean((y_true - y_pred)^2)"""
        return 0.5 * np.mean((y_true - y_pred)**2)
    
    def backward(self, X, y_true, y_pred):
        """Backpropagation using matrix chain rule"""
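        # np.mean in compute_loss averages over every element (outputs × samples),
        # so the gradient must be divided by the same count
        m = y_pred.size
        
        # Step 1: Gradient of loss w.r.t. output
        dL_da2 = -(y_true - y_pred) / m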
        
        # Step 2: Gradient w.r.t. z2 (before activation)
        dL_dz2 = dL_da2 * self.sigmoid_derivative(self.z2)
        
        # Step 3: Gradients for W2 and b2
        dL_dW2 = dL_dz2 @ self.a1.T
        dL_db2 = np.sum(dL_dz2, axis=1, keepdims=True)
        
        # Step 4: Gradient w.r.t. a1 (chain rule!)
        dL_da1 = self.W2.T @ dL_dz2
        
        # Step 5: Gradient w.r.t. z1
        dL_dz1 = dL_da1 * self.sigmoid_derivative(self.z1)
        
        # Step 6: Gradients for W1 and b1
        dL_dW1 = dL_dz1 @ X.T
        dL_db1 = np.sum(dL_dz1, axis=1, keepdims=True)
        
        return {
            'dL_dW2': dL_dW2, 'dL_db2': dL_db2,
            'dL_dW1': dL_dW1, 'dL_db1': dL_db1
        }
    
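    def compute_jacobians(self, X):
        """Compute Jacobian matrices for each layer.

        Relies on the values cached by the most recent forward() call and
        evaluates the Jacobians at the first sample (column 0) only.
        """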
        # Jacobian of layer 1 output w.r.t. input
        J1 = np.zeros((self.a1.shape[0], X.shape[0]))
        for i in range(self.a1.shape[0]):
            for j in range(X.shape[0]):
                # ∂a1_i/∂x_j = sigmoid'(z1_i) * W1_ij
                J1[i, j] = self.sigmoid_derivative(self.z1[i, 0]) * self.W1[i, j]
        
        # Jacobian of layer 2 output w.r.t. layer 1 output
        J2 = np.zeros((self.a2.shape[0], self.a1.shape[0]))
        for i in range(self.a2.shape[0]):
            for j in range(self.a1.shape[0]):
                # ∂a2_i/∂a1_j = sigmoid'(z2_i) * W2_ij
                J2[i, j] = self.sigmoid_derivative(self.z2[i, 0]) * self.W2[i, j]
        
        return J1, J2

# Demonstrate the network
print("🧠 Building a Simple Neural Network")
print("=" * 40)

# Create network
net = SimpleNeuralNetwork(input_size=2, hidden_size=3, output_size=1)

# Sample data
X = np.array([[1.0], [2.0]])  # 2D input
y_true = np.array([[0.8]])    # Target output

print(f"Input shape: {X.shape}")
print(f"Target shape: {y_true.shape}")
print(f"\nNetwork architecture:")
print(f"Input layer: {net.W1.shape[1]} neurons")
print(f"Hidden layer: {net.W1.shape[0]} neurons")
print(f"Output layer: {net.W2.shape[0]} neuron(s)")

# Forward pass
y_pred = net.forward(X)
loss = net.compute_loss(y_true, y_pred)

print(f"\nForward Pass:")
print(f"Predicted output: {y_pred.flatten()}")
print(f"True output: {y_true.flatten()}")
print(f"Loss: {loss:.6f}")

# Backward pass
gradients = net.backward(X, y_true, y_pred)

print(f"\nBackward Pass - Gradients:")
print(f"∂L/∂W2 shape: {gradients['dL_dW2'].shape}")
print(f"∂L/∂W1 shape: {gradients['dL_dW1'].shape}")
print(f"∂L/∂b2 shape: {gradients['dL_db2'].shape}")
print(f"∂L/∂b1 shape: {gradients['dL_db1'].shape}")

# Compute Jacobians
J1, J2 = net.compute_jacobians(X)
print(f"\nJacobian Matrices:")
print(f"J1 (∂a1/∂x) shape: {J1.shape}")
print(f"J2 (∂a2/∂a1) shape: {J2.shape}")
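The printed shapes confirm that each gradient matches its parameter, but not that the values are right. A finite-difference gradient check covers that. The sketch below is a standalone re-implementation of the same 2-3-1 sigmoid network and half-MSE loss using plain functions (`forward`, `loss_fn`, and `backward` are names chosen here, and the weights are freshly sampled rather than taken from `net`), so that it runs on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    a1 = sigmoid(W1 @ X + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return a1, a2

def loss_fn(params, X, y):
    _, a2 = forward(params, X)
    return 0.5 * np.mean((y - a2) ** 2)

def backward(params, X, y):
    """Backpropagation with the same steps as SimpleNeuralNetwork.backward."""
    W1, b1, W2, b2 = params
    a1, a2 = forward(params, X)
    dz2 = -(y - a2) / y.size * a2 * (1 - a2)   # sigmoid'(z) = a * (1 - a)
    dW2 = dz2 @ a1.T
    db2 = dz2.sum(axis=1, keepdims=True)
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)
    dW1 = dz1 @ X.T
    db1 = dz1.sum(axis=1, keepdims=True)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(42)
params = [rng.normal(size=(3, 2)) * 0.5, np.zeros((3, 1)),
          rng.normal(size=(1, 3)) * 0.5, np.zeros((1, 1))]
X = np.array([[1.0], [2.0]])
y = np.array([[0.8]])

analytic = backward(params, X, y)

# Perturb one parameter entry at a time and compare against backprop
eps = 1e-6
max_err = 0.0
for p, g in zip(params, analytic):
    for idx in np.ndindex(p.shape):
        original = p[idx]
        p[idx] = original + eps
        loss_plus = loss_fn(params, X, y)
        p[idx] = original - eps
        loss_minus = loss_fn(params, X, y)
        p[idx] = original                      # restore the parameter
        numeric = (loss_plus - loss_minus) / (2 * eps)
        max_err = max(max_err, abs(numeric - g[idx]))

print(f"Largest |backprop - finite difference|: {max_err:.2e}")
```

The check needs two forward passes per parameter, while backpropagation produces all gradients from one forward and one backward pass. That gap is the efficiency argument for the matrix chain rule.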