# Part 1.1: Linear Algebra for Deep Learning

Linear algebra is the foundation of deep learning. Neural networks are essentially compositions of linear transformations (matrix multiplications) and nonlinear activation functions.

## Learning Objectives
- [ ] Understand vector spaces and linear transformations
- [ ] Perform matrix operations fluently with NumPy
- [ ] Explain the geometric intuition behind eigendecomposition
- [ ] Apply SVD to dimensionality reduction

---

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrowPatch
from mpl_toolkits.mplot3d import proj3d

# For nice inline plots
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Set random seed for reproducibility
np.random.seed(42)

## 1. Vectors

A **vector** is an ordered list of numbers. In machine learning:
- A single data point (features) is a vector
- Model parameters (weights) are vectors
- Gradients are vectors

### Geometric Interpretation
A vector can be thought of as:
1. A point in space
2. An arrow from the origin to that point (direction + magnitude)

In [None]:
# Creating vectors in NumPy
v = np.array([3, 4])  # 2D vector
w = np.array([1, 2, 3])  # 3D vector

print(f"Vector v: {v}")
print(f"Shape of v: {v.shape}")
print(f"Dimension (number of elements): {v.shape[0]}")

In [None]:
# Visualize a 2D vector
def plot_vectors(vectors, colors, labels=None):
    """Plot 2D vectors from origin."""
    fig, ax = plt.subplots(figsize=(8, 8))
    
    for i, (vec, color) in enumerate(zip(vectors, colors)):
        label = labels[i] if labels else None
        ax.quiver(0, 0, vec[0], vec[1], angles='xy', scale_units='xy', scale=1, 
                  color=color, label=label, width=0.015)
    
    # Set axis limits
    all_coords = np.array(vectors)
    max_val = np.abs(all_coords).max() + 1
    ax.set_xlim(-max_val, max_val)
    ax.set_ylim(-max_val, max_val)
    ax.set_aspect('equal')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)
    if labels:
        ax.legend()
    return ax

# Plot vector v = [3, 4]
plot_vectors([v], ['blue'], ['v = [3, 4]'])
plt.title('A 2D Vector')
plt.show()

### Vector Operations

#### 1. Vector Addition
Vectors add element-wise. Geometrically, place the tail of the second vector at the head of the first.

In [None]:
a = np.array([2, 1])
b = np.array([1, 3])
c = a + b  # Element-wise addition

print(f"a = {a}")
print(f"b = {b}")
print(f"a + b = {c}")

# Visualize vector addition
fig, ax = plt.subplots(figsize=(8, 8))
ax.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, color='blue', label='a', width=0.015)
ax.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', label='b', width=0.015)
ax.quiver(0, 0, c[0], c[1], angles='xy', scale_units='xy', scale=1, color='green', label='a + b', width=0.015)
# Show b starting from the tip of a (tip-to-tail rule)
ax.quiver(a[0], a[1], b[0], b[1], angles='xy', scale_units='xy', scale=1, color='red', alpha=0.3, width=0.015)
ax.set_xlim(-1, 5)
ax.set_ylim(-1, 5)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.legend()
ax.set_title('Vector Addition: a + b')
plt.show()

#### 2. Scalar Multiplication
Multiplying a vector by a scalar scales its magnitude (and flips direction if negative).

In [None]:
v = np.array([2, 1])
scaled_v = 2 * v
negative_v = -1 * v

print(f"v = {v}")
print(f"2v = {scaled_v}")
print(f"-v = {negative_v}")

plot_vectors([v, scaled_v, negative_v], ['blue', 'green', 'red'], ['v', '2v', '-v'])
plt.title('Scalar Multiplication')
plt.show()

#### 3. Dot Product

The **dot product** (inner product) of two vectors is fundamental:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = |\mathbf{a}| |\mathbf{b}| \cos\theta$$

Where $\theta$ is the angle between the vectors.

**Key insights:**
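- If the dot product is 0, the vectors are **orthogonal** (perpendicular)
- If it is positive, the angle between the vectors is less than 90° (they point in broadly similar directions)
- If it is negative, the angle is greater than 90° (they point in broadly opposite directions)
- Used everywhere in neural networks: every weighted sum is a dot product!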

In [None]:
a = np.array([1, 0])
b = np.array([0, 1])
c = np.array([1, 1])

# Different ways to compute dot product
print(f"a · b = {np.dot(a, b)}")  # Orthogonal vectors
print(f"a · c = {np.dot(a, c)}")  # 45 degree angle
print(f"a · a = {np.dot(a, a)}")  # Same vector (gives squared magnitude)

# Using @ operator (preferred in modern NumPy)
print(f"a @ c = {a @ c}")

In [None]:
# Computing angle between vectors using dot product
def angle_between(v1, v2):
    """Returns angle in degrees between vectors v1 and v2."""
    cos_angle = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    # Clip to handle numerical errors
    cos_angle = np.clip(cos_angle, -1, 1)
    return np.degrees(np.arccos(cos_angle))

a = np.array([1, 0])
b = np.array([1, 1])
c = np.array([0, 1])
d = np.array([-1, 0])

print(f"Angle between a=[1,0] and b=[1,1]: {angle_between(a, b):.1f}°")
print(f"Angle between a=[1,0] and c=[0,1]: {angle_between(a, c):.1f}°")
print(f"Angle between a=[1,0] and d=[-1,0]: {angle_between(a, d):.1f}°")

### Deep Dive: Understanding the Dot Product Formula

There are **two equivalent ways** to define the dot product:

**Definition 1 - Algebraic (how we compute it):**
$$\mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + \ldots + a_n b_n$$

**Definition 2 - Geometric (what it means):**
$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| \cdot |\mathbf{b}| \cdot \cos(\theta)$$

The two definitions always give the same number; the standard proof uses the Law of Cosines.
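
A sketch of that argument, with $\theta$ taken as the angle between $\mathbf{a}$ and $\mathbf{b}$: apply the Law of Cosines to the triangle whose sides are $\mathbf{a}$, $\mathbf{b}$, and $\mathbf{a} - \mathbf{b}$, then expand the same squared length component by component.

$$|\mathbf{a} - \mathbf{b}|^2 = |\mathbf{a}|^2 + |\mathbf{b}|^2 - 2\,|\mathbf{a}|\,|\mathbf{b}|\cos\theta \quad \text{(Law of Cosines)}$$

$$|\mathbf{a} - \mathbf{b}|^2 = \sum_{i=1}^{n} (a_i - b_i)^2 = |\mathbf{a}|^2 + |\mathbf{b}|^2 - 2\sum_{i=1}^{n} a_i b_i \quad \text{(expanding components)}$$

Setting the two right-hand sides equal and cancelling the shared terms leaves:

$$\sum_{i=1}^{n} a_i b_i = |\mathbf{a}|\,|\mathbf{b}|\cos\theta$$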

#### Breaking down the geometric formula:

| Component | Meaning | Range |
|-----------|---------|-------|
| $\|\mathbf{a}\|$ | Length of vector a | 0 to ∞ |
| $\|\mathbf{b}\|$ | Length of vector b | 0 to ∞ |
| $\cos(\theta)$ | "Alignment factor" based on angle | -1 to 1 |

#### What does cos(θ) do?

| Angle θ | cos(θ) | Vectors are... | Dot product |
|---------|--------|----------------|-------------|
| 0° | 1 | Same direction | Maximum positive |
| 45° | 0.71 | Somewhat aligned | Positive |
| 90° | 0 | Perpendicular | Zero |
| 135° | -0.71 | Somewhat opposite | Negative |
| 180° | -1 | Opposite directions | Maximum negative |

In [None]:
# Visualization: How the dot product changes with angle
# Keep vector 'a' fixed, rotate vector 'b' around

a = np.array([2, 0])  # Fixed vector pointing right

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left plot: Show vectors at different angles
angles_deg = [0, 45, 90, 135, 180]
colors = ['green', 'blue', 'orange', 'red', 'purple']

axes[0].quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1, 
               color='black', width=0.03, label='a (fixed)')

for angle, color in zip(angles_deg, colors):
    theta = np.radians(angle)
    b = 1.5 * np.array([np.cos(theta), np.sin(theta)])  # |b| = 1.5
    dot = a @ b
    axes[0].quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1,
                   color=color, width=0.02, alpha=0.7, label=f'θ={angle}°, a·b={dot:.2f}')

axes[0].set_xlim(-3, 3)
axes[0].set_ylim(-2, 2)
axes[0].set_aspect('equal')
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].axvline(x=0, color='k', linewidth=0.5)
axes[0].legend(loc='upper left', fontsize=9)
axes[0].set_title('Vector b at different angles from a')
axes[0].grid(True, alpha=0.3)

# Right plot: Dot product as function of angle
angles = np.linspace(0, 360, 100)
dot_products = []
for angle in angles:
    theta = np.radians(angle)
    b = 1.5 * np.array([np.cos(theta), np.sin(theta)])
    dot_products.append(a @ b)

axes[1].plot(angles, dot_products, 'b-', linewidth=2)
axes[1].axhline(y=0, color='k', linewidth=1)
axes[1].set_xlabel('Angle θ (degrees)')
axes[1].set_ylabel('Dot product (a · b)')
axes[1].set_title('Dot product vs angle between vectors\n|a|=2, |b|=1.5, so max = 2×1.5 = 3')
axes[1].set_xticks([0, 45, 90, 135, 180, 225, 270, 315, 360])
axes[1].grid(True, alpha=0.3)

# Mark key points
for angle, color in zip(angles_deg, colors):
    theta = np.radians(angle)
    b = 1.5 * np.array([np.cos(theta), np.sin(theta)])
    dot = a @ b
    axes[1].scatter([angle], [dot], color=color, s=100, zorder=5)

plt.tight_layout()
plt.show()

print("Key insight: The dot product follows a cosine curve!")
print("This is because a·b = |a||b|cos(θ), and we're varying θ.")

### The Projection Interpretation

Another powerful way to understand dot product: **projection**.

The dot product $\mathbf{a} \cdot \mathbf{b}$ tells you: *"How much of b points in the direction of a?"*

More precisely:
$$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}| \times (\text{signed length of b's shadow onto a})$$

This "shadow" is called the **scalar projection** of b onto a. It is negative when the angle between the vectors exceeds 90°.
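
The projection idea can be sketched as a small helper. The function name `project` and the example vectors are illustrative choices, not part of the original notebook, and the helper assumes `a` is nonzero. It returns both the signed scalar projection and the vector projection, and the check at the end shows that what is left of `b` after removing its shadow is perpendicular to `a`.

```python
import numpy as np

def project(b, a):
    """Return (scalar projection, vector projection) of b onto a.

    Assumes a is a nonzero vector.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = np.linalg.norm(a)
    scalar = (a @ b) / a_norm        # signed length of the shadow
    vector = scalar * (a / a_norm)   # the shadow as a vector along a
    return scalar, vector

a = np.array([3.0, 0.0])
b_forward = np.array([2.0, 2.0])    # less than 90 degrees from a
b_backward = np.array([-2.0, 2.0])  # more than 90 degrees from a

s_fwd, p_fwd = project(b_forward, a)
s_bwd, p_bwd = project(b_backward, a)
print(f"forward:  scalar = {s_fwd}, vector = {p_fwd}")
print(f"backward: scalar = {s_bwd}, vector = {p_bwd}")

# What remains of b after removing its shadow is perpendicular to a
residual = b_forward - p_fwd
print(f"a · residual = {a @ residual}")
```

The sign of the scalar projection carries the same information as the sign of the dot product, since the two differ only by the positive factor $|\mathbf{a}|$.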

In [None]:
# Visualizing projection
a = np.array([3, 0])  # Horizontal vector
b = np.array([2, 2])  # Vector at 45 degrees

# Scalar projection of b onto a: (a·b) / |a|
scalar_proj = (a @ b) / np.linalg.norm(a)

# Vector projection of b onto a: scalar_proj * (a / |a|)
a_unit = a / np.linalg.norm(a)
vector_proj = scalar_proj * a_unit

fig, ax = plt.subplots(figsize=(10, 8))

# Draw vectors
ax.quiver(0, 0, a[0], a[1], angles='xy', scale_units='xy', scale=1,
          color='blue', width=0.02, label=f'a = {a}')
ax.quiver(0, 0, b[0], b[1], angles='xy', scale_units='xy', scale=1,
          color='red', width=0.02, label=f'b = {b}')

# Draw projection
ax.quiver(0, 0, vector_proj[0], vector_proj[1], angles='xy', scale_units='xy', scale=1,
          color='green', width=0.025, label='projection of b onto a')

# Draw dashed line from b to its projection (perpendicular)
ax.plot([b[0], vector_proj[0]], [b[1], vector_proj[1]], 'k--', linewidth=1.5, alpha=0.5)

# Annotations
ax.annotate('', xy=(vector_proj[0], -0.3), xytext=(0, -0.3),
            arrowprops=dict(arrowstyle='<->', color='green'))
ax.text(vector_proj[0]/2, -0.6, f'scalar proj = {scalar_proj:.2f}', ha='center', fontsize=11, color='green')

ax.set_xlim(-1, 4)
ax.set_ylim(-1, 3)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.legend(loc='upper left')
ax.set_title('Projection: "How much of b points in the direction of a?"')
ax.grid(True, alpha=0.3)
plt.show()

print(f"a · b = {a @ b}")
print(f"|a| = {np.linalg.norm(a)}")
print(f"Scalar projection of b onto a = (a·b)/|a| = {scalar_proj:.2f}")
print(f"\nCheck: |a| × scalar_proj = {np.linalg.norm(a)} × {scalar_proj:.2f} = {np.linalg.norm(a) * scalar_proj:.2f} = a·b ✓")

### Why Dot Product Matters in Machine Learning

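The dot product appears everywhere in ML because it gives a quick answer to: **"How similar are these two vectors?"** Keep in mind that it grows with the vectors' magnitudes as well as their alignment; cosine similarity (last row of the table) removes the magnitude.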

| ML Application | What the dot product computes |
|----------------|-------------------------------|
| **Neural network layer** | `w · x + b` = "How much does input x match what neuron w is looking for?" |
| **Word embeddings** | `word1 · word2` = "How semantically similar are these words?" |
| **Attention (Transformers)** | `query · key` = "How relevant is this key to this query?" |
| **Recommendation systems** | `user · item` = "How much would this user like this item?" |
| **Cosine similarity** | $(\mathbf{a} \cdot \mathbf{b}) / (\|\mathbf{a}\| \, \|\mathbf{b}\|)$ = Pure directional similarity (-1 to 1) |
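
Two rows of the table can be sketched in a few lines. The weights, bias, and input below are made-up numbers for a hypothetical single neuron, and `cosine_similarity` is an illustrative helper (it assumes both inputs are nonzero). The second half shows why cosine similarity is often preferred when only direction matters: scaling a vector changes the raw dot product but not the cosine.

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product after scaling both vectors to unit length (nonzero inputs assumed)."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical neuron: weights w, bias, and one input x (values are illustrative)
w = np.array([0.5, -1.0, 2.0])
bias = 0.1
x = np.array([1.0, 0.0, 1.0])
pre_activation = w @ x + bias  # weighted sum that would feed an activation function
print(f"w · x + b = {pre_activation}")

# Raw dot product depends on magnitude; cosine similarity does not
u = np.array([1.0, 2.0])
same_direction = 10 * u
print(f"u · (10u) = {u @ same_direction}")
print(f"cosine(u, 10u) = {cosine_similarity(u, same_direction)}")
```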

#### 4. Vector Norm (Magnitude/Length)

The **L2 norm** (Euclidean length) of a vector:

$$||\mathbf{v}||_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$

Other norms used in ML:
- **L1 norm**: $||\mathbf{v}||_1 = \sum |v_i|$ (Manhattan distance, used for sparsity)
- **L∞ norm**: $||\mathbf{v}||_\infty = \max |v_i|$

### Deep Dive: What is a Vector Norm?

A **norm** measures the "size" or "length" of a vector. Think of it as answering: *"How far is this point from the origin?"*

#### The L2 (Euclidean) Norm - Most Common

$$||\mathbf{v}||_2 = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}$$

This is just the **Pythagorean theorem** extended to n dimensions!

For `v = [3, 4]`: $||v|| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5$

(This is the classic 3-4-5 right triangle)

In [None]:
# Visualizing the L2 norm as distance from origin (Pythagorean theorem)
v = np.array([3, 4])

fig, ax = plt.subplots(figsize=(8, 8))

# Draw the vector
ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1,
          color='blue', width=0.02, label=f'v = {v}, ||v|| = {np.linalg.norm(v)}')

# Draw the right triangle
ax.plot([0, v[0]], [0, 0], 'g-', linewidth=2, label=f'horizontal = {v[0]}')
ax.plot([v[0], v[0]], [0, v[1]], 'r-', linewidth=2, label=f'vertical = {v[1]}')

# Right angle marker
ax.plot([v[0]-0.2, v[0]-0.2, v[0]], [0, 0.2, 0.2], 'k-', linewidth=1)

# Labels
ax.text(v[0]/2, -0.4, '3', ha='center', fontsize=14, color='green')
ax.text(v[0]+0.3, v[1]/2, '4', ha='center', fontsize=14, color='red')
ax.text(v[0]/2 - 0.5, v[1]/2 + 0.3, '5', ha='center', fontsize=14, color='blue')

ax.set_xlim(-1, 6)
ax.set_ylim(-1, 6)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.legend(loc='upper left')
ax.set_title('L2 Norm = Pythagorean Theorem\n||v|| = √(3² + 4²) = √25 = 5')
ax.grid(True, alpha=0.3)
plt.show()

#### Comparing Different Norms

Different norms measure "size" differently:

| Norm | Formula | Intuition | Use in ML |
|------|---------|-----------|-----------|
| **L2** | $\sqrt{\sum v_i^2}$ | Straight-line distance | Default distance, weight decay |
| **L1** | $\sum \|v_i\|$ | "Taxicab" distance (walk on grid) | Sparsity (Lasso): as a penalty it can drive some weights to exactly 0 |
| **L∞** | $\max \|v_i\|$ | Largest single component | Worst-case bounds |

In [None]:
# Visualize "unit balls": the outline is all points where ||v|| = 1, the shaded region is ||v|| <= 1
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

theta = np.linspace(0, 2*np.pi, 100)

# L2 norm: circle (x² + y² = 1)
x_l2 = np.cos(theta)
y_l2 = np.sin(theta)
axes[0].plot(x_l2, y_l2, 'b-', linewidth=2)
axes[0].fill(x_l2, y_l2, alpha=0.2)
axes[0].set_title('L2 Norm (Euclidean)\n||v||₂ = √(x² + y²) = 1\nCircle')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# L1 norm: diamond (|x| + |y| = 1)
x_l1 = [1, 0, -1, 0, 1]
y_l1 = [0, 1, 0, -1, 0]
axes[1].plot(x_l1, y_l1, 'r-', linewidth=2)
axes[1].fill(x_l1, y_l1, alpha=0.2, color='red')
axes[1].set_title('L1 Norm (Manhattan)\n||v||₁ = |x| + |y| = 1\nDiamond')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

# L∞ norm: square (max(|x|, |y|) = 1)
x_linf = [1, 1, -1, -1, 1]
y_linf = [1, -1, -1, 1, 1]
axes[2].plot(x_linf, y_linf, 'g-', linewidth=2)
axes[2].fill(x_linf, y_linf, alpha=0.2, color='green')
axes[2].set_title('L∞ Norm (Max)\n||v||∞ = max(|x|, |y|) = 1\nSquare')
axes[2].set_xlabel('x')
axes[2].set_ylabel('y')

for ax in axes:
    ax.set_xlim(-1.5, 1.5)
    ax.set_ylim(-1.5, 1.5)
    ax.set_aspect('equal')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Example with a specific vector
v = np.array([3, 4])
print(f"For v = {v}:")
print(f"  L2 norm: ||v||₂ = √(3² + 4²) = {np.linalg.norm(v, ord=2)}")
print(f"  L1 norm: ||v||₁ = |3| + |4| = {np.linalg.norm(v, ord=1)}")
print(f"  L∞ norm: ||v||∞ = max(|3|, |4|) = {np.linalg.norm(v, ord=np.inf)}")

#### Why Norms Matter in Machine Learning

| Use Case | How Norms are Used |
|----------|-------------------|
| **Normalization** | Divide by norm to get unit vector: $\mathbf{v} / \lVert\mathbf{v}\rVert$. Isolates direction from magnitude. |
| **Regularization** | Add $\lambda \lVert\mathbf{w}\rVert^2$ to the loss, where $\mathbf{w}$ holds the weights. Keeps weights small → helps reduce overfitting. |
| **Distance** | Distance between points: $\lVert\mathbf{a} - \mathbf{b}\rVert$. Used in k-NN, clustering. |
| **Gradient clipping** | If $\lVert\text{gradient}\rVert > \text{threshold}$, scale it down. Helps prevent exploding gradients. |
| **Embedding similarity** | Normalize embeddings so dot product = cosine similarity. |
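
A minimal sketch of three rows from the table, using plain NumPy. The helper name `clip_by_norm`, the threshold, the penalty strength `lam`, and all vectors are illustrative assumptions; deep learning frameworks ship their own versions of these operations.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Gradient clipping: a gradient with norm 50 is scaled down to norm 5
grad = np.array([30.0, 40.0])
clipped = clip_by_norm(grad, max_norm=5.0)
print(f"clipped = {clipped}, norm = {np.linalg.norm(clipped)}")

# L2 regularization term: lam * ||w||^2, computed as a dot product
weights = np.array([0.5, -1.5, 2.0])
lam = 0.01
l2_penalty = lam * (weights @ weights)
print(f"L2 penalty = {l2_penalty}")

# Embedding similarity: after normalizing, dot product equals cosine similarity
e1 = np.array([1.0, 2.0])
e2 = np.array([2.0, 1.0])
e1_unit = e1 / np.linalg.norm(e1)
e2_unit = e2 / np.linalg.norm(e2)
cosine = (e1 @ e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
print(f"unit dot = {e1_unit @ e2_unit}, cosine = {cosine}")
```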

#### Connecting Dot Product and Norm

The dot product of a vector with itself gives the **squared norm**:

$$\mathbf{v} \cdot \mathbf{v} = v_1^2 + v_2^2 + \ldots = ||\mathbf{v}||^2$$

So: $||\mathbf{v}|| = \sqrt{\mathbf{v} \cdot \mathbf{v}}$
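
A quick numerical check of this identity, using the same 3-4-5 vector as elsewhere in this section (the variable names are illustrative):

```python
import numpy as np

v = np.array([3.0, 4.0])

squared_norm = v @ v             # v · v = 3² + 4²
norm_from_dot = np.sqrt(v @ v)   # square root recovers the L2 norm

print(f"v · v = {squared_norm}")
print(f"sqrt(v · v) = {norm_from_dot}")
print(f"Matches np.linalg.norm: {np.isclose(norm_from_dot, np.linalg.norm(v))}")
```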

In [None]:
v = np.array([3, 4])

# L2 norm (default)
l2_norm = np.linalg.norm(v)
print(f"L2 norm of {v}: {l2_norm}")  # Should be 5 (3-4-5 triangle)

# L1 norm
l1_norm = np.linalg.norm(v, ord=1)
print(f"L1 norm of {v}: {l1_norm}")  # 3 + 4 = 7

# Unit vector (normalize)
v_unit = v / np.linalg.norm(v)
print(f"Unit vector: {v_unit}")
print(f"Magnitude of unit vector: {np.linalg.norm(v_unit)}")

---

## 2. Matrices

A **matrix** is a 2D array of numbers. In deep learning:
- Weight matrices connect layers
- Batches of data are matrices (rows = samples, columns = features)
- Attention scores form matrices

### Matrix as Linear Transformation

A matrix transforms vectors from one space to another. When you multiply a matrix by a vector, you're applying a linear transformation.
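
"Linear" has a precise meaning here: transforming a weighted sum of vectors gives the same result as taking the weighted sum of the transformed vectors. The sketch below checks this numerically for one arbitrary matrix; the specific values of `A`, `u`, `v`, `alpha`, and `beta` are illustrative.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
u = np.array([1.0, -1.0])
v = np.array([2.0, 0.5])
alpha, beta = 3.0, -2.0

# Transform the combination vs. combine the transformed vectors
lhs = A @ (alpha * u + beta * v)
rhs = alpha * (A @ u) + beta * (A @ v)

print(f"A(αu + βv)  = {lhs}")
print(f"αAu + βAv   = {rhs}")
print(f"Equal: {np.allclose(lhs, rhs)}")

# A consequence of linearity: the origin always stays fixed
print(f"A @ 0 = {A @ np.zeros(2)}")
```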

In [None]:
# Creating matrices
A = np.array([[1, 2],
              [3, 4]])

print(f"Matrix A:\n{A}")
print(f"Shape: {A.shape}")
print(f"Number of rows: {A.shape[0]}")
print(f"Number of columns: {A.shape[1]}")

### Matrix-Vector Multiplication

Matrix $\mathbf{A}$ (m×n) times vector $\mathbf{v}$ (n×1) produces vector (m×1):

$$\mathbf{Av} = \begin{bmatrix} \mathbf{a}_1 \cdot \mathbf{v} \\ \mathbf{a}_2 \cdot \mathbf{v} \\ \vdots \\ \mathbf{a}_m \cdot \mathbf{v} \end{bmatrix}$$

Each element is a dot product of a row of A with vector v.
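
The row-by-row description can be checked directly. The sketch below (with an illustrative 2×3 matrix) computes `A @ v` three ways: with the built-in operator, as one dot product per row, and as a weighted sum of the columns of `A`. The column view is an equivalent reading of the same product rather than something stated above.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
v = np.array([1, 0, -1])    # shape (3,)

builtin = A @ v

# Row view: entry i is the dot product of row i with v
row_view = np.array([row @ v for row in A])

# Column view: the result is a weighted sum of A's columns, with weights from v
col_view = sum(v[j] * A[:, j] for j in range(A.shape[1]))

print(f"A @ v       = {builtin}")
print(f"row view    = {row_view}")
print(f"column view = {col_view}")
```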

In [None]:
A = np.array([[2, 0],
              [0, 1]])
v = np.array([1, 1])

# Matrix-vector multiplication
result = A @ v  # or np.dot(A, v)
print(f"A @ v = {result}")

# This stretches the x-component by 2, keeps y the same
plot_vectors([v, result], ['blue', 'red'], ['original v', 'Av (transformed)'])
plt.title('Matrix as Transformation (Scaling)')
plt.show()

In [None]:
# Rotation matrix (90 degrees counter-clockwise)
theta = np.pi / 2  # 90 degrees
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1, 0])
rotated = R @ v

print(f"Rotation matrix R:\n{R.round(3)}")
print(f"Original: {v}")
print(f"Rotated: {rotated.round(3)}")

plot_vectors([v, rotated], ['blue', 'red'], ['original', 'rotated 90°'])
plt.title('Rotation Transformation')
plt.show()

### Visualizing Linear Transformations

Let's see how different matrices transform a grid of points.

In [None]:
def plot_transformation(A, title):
    """Visualize how matrix A transforms a grid covering the square [-1, 1] x [-1, 1]."""
    # Create a grid of points
    n = 10
    x = np.linspace(-1, 1, n)
    y = np.linspace(-1, 1, n)
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Original grid
    for xi in x:
        axes[0].plot([xi, xi], [-1, 1], 'b-', alpha=0.5)
    for yi in y:
        axes[0].plot([-1, 1], [yi, yi], 'b-', alpha=0.5)
    # Highlight basis vectors
    axes[0].quiver(0, 0, 1, 0, angles='xy', scale_units='xy', scale=1, color='red', width=0.02)
    axes[0].quiver(0, 0, 0, 1, angles='xy', scale_units='xy', scale=1, color='green', width=0.02)
    axes[0].set_xlim(-2, 2)
    axes[0].set_ylim(-2, 2)
    axes[0].set_aspect('equal')
    axes[0].set_title('Original Space')
    axes[0].axhline(y=0, color='k', linewidth=0.5)
    axes[0].axvline(x=0, color='k', linewidth=0.5)
    
    # Transformed grid
    for xi in x:
        points = np.array([[xi, yi] for yi in y])
        transformed = (A @ points.T).T
        axes[1].plot(transformed[:, 0], transformed[:, 1], 'b-', alpha=0.5)
    for yi in y:
        points = np.array([[xi, yi] for xi in x])
        transformed = (A @ points.T).T
        axes[1].plot(transformed[:, 0], transformed[:, 1], 'b-', alpha=0.5)
    
    # Transformed basis vectors
    e1_transformed = A @ np.array([1, 0])
    e2_transformed = A @ np.array([0, 1])
    axes[1].quiver(0, 0, e1_transformed[0], e1_transformed[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.02)
    axes[1].quiver(0, 0, e2_transformed[0], e2_transformed[1], angles='xy', scale_units='xy', scale=1, color='green', width=0.02)
    
    axes[1].set_xlim(-2, 2)
    axes[1].set_ylim(-2, 2)
    axes[1].set_aspect('equal')
    axes[1].set_title(f'After Transformation: {title}')
    axes[1].axhline(y=0, color='k', linewidth=0.5)
    axes[1].axvline(x=0, color='k', linewidth=0.5)
    
    plt.tight_layout()
    plt.show()
    
    print(f"Matrix:\n{A}")
    print(f"Red basis vector [1,0] -> {e1_transformed}")
    print(f"Green basis vector [0,1] -> {e2_transformed}")

In [None]:
# Scaling
A_scale = np.array([[1.5, 0],
                    [0, 0.5]])
plot_transformation(A_scale, "Scaling (1.5x, 0.5y)")

In [None]:
# Rotation
theta = np.pi / 6  # 30 degrees
A_rotate = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
plot_transformation(A_rotate, "Rotation (30°)")

In [None]:
# Shear
A_shear = np.array([[1, 0.5],
                    [0, 1]])
plot_transformation(A_shear, "Shear")

### Deep Dive: Understanding Matrices as Transformations

**Key Insight**: A matrix doesn't just "do math" - it describes a geometric transformation. Every matrix is a machine that takes vectors in and outputs transformed vectors.

#### What Do the Columns of a Matrix Mean?

Here's the most important insight about matrices:

> **The columns of a matrix tell you where the basis vectors land after transformation.**

For a 2D matrix $\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$:
- **Column 1** $\begin{bmatrix} a \\ c \end{bmatrix}$ = where the vector $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ (pointing right) lands
- **Column 2** $\begin{bmatrix} b \\ d \end{bmatrix}$ = where the vector $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ (pointing up) lands

This means: **to design a transformation, just decide where you want the basis vectors to go!**

In [None]:
# Demonstration: Columns of a matrix = where basis vectors land
# Let's verify this with an example

A = np.array([[2, -1],
              [1,  1]])

# Standard basis vectors
e1 = np.array([1, 0])  # Points right
e2 = np.array([0, 1])  # Points up

# Transform them
Ae1 = A @ e1
Ae2 = A @ e2

print("Matrix A:")
print(A)
print(f"\nColumn 1 of A: {A[:, 0]}")
print(f"A @ [1,0] = {Ae1}")
print(f"Same? {np.allclose(A[:, 0], Ae1)}")

print(f"\nColumn 2 of A: {A[:, 1]}")
print(f"A @ [0,1] = {Ae2}")
print(f"Same? {np.allclose(A[:, 1], Ae2)}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Before transformation
axes[0].quiver(0, 0, 1, 0, angles='xy', scale_units='xy', scale=1, color='red', width=0.02, label='e1 = [1,0]')
axes[0].quiver(0, 0, 0, 1, angles='xy', scale_units='xy', scale=1, color='green', width=0.02, label='e2 = [0,1]')
axes[0].set_xlim(-2, 3)
axes[0].set_ylim(-2, 3)
axes[0].set_aspect('equal')
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].axvline(x=0, color='k', linewidth=0.5)
axes[0].grid(True, alpha=0.3)
axes[0].legend()
axes[0].set_title('BEFORE: Standard Basis Vectors')

# After transformation
axes[1].quiver(0, 0, Ae1[0], Ae1[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.02, 
               label=f'A @ e1 = {Ae1} (Column 1)')
axes[1].quiver(0, 0, Ae2[0], Ae2[1], angles='xy', scale_units='xy', scale=1, color='green', width=0.02, 
               label=f'A @ e2 = {Ae2} (Column 2)')
axes[1].set_xlim(-2, 3)
axes[1].set_ylim(-2, 3)
axes[1].set_aspect('equal')
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].axvline(x=0, color='k', linewidth=0.5)
axes[1].grid(True, alpha=0.3)
axes[1].legend()
axes[1].set_title('AFTER: Basis Vectors = Columns of A')

plt.tight_layout()
plt.show()

print("\nKey insight: Reading the columns of A directly tells you the transformation!")

#### Common 2D Transformation Matrices

Once you understand "columns = where basis vectors go," you can read or construct any transformation:

| Transformation | Matrix | Column 1 (where [1,0] goes) | Column 2 (where [0,1] goes) |
|----------------|--------|----------------------------|----------------------------|
| **Identity** (do nothing) | $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ | [1, 0] stays at [1, 0] | [0, 1] stays at [0, 1] |
| **Scale by k** | $\begin{bmatrix} k & 0 \\ 0 & k \end{bmatrix}$ | [1, 0] -> [k, 0] | [0, 1] -> [0, k] |
| **Scale x by a, y by b** | $\begin{bmatrix} a & 0 \\ 0 & b \end{bmatrix}$ | [1, 0] -> [a, 0] | [0, 1] -> [0, b] |
| **Rotate by θ** | $\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$ | [1, 0] rotates to [cos θ, sin θ] | [0, 1] rotates to [-sin θ, cos θ] |
| **Reflect across x-axis** | $\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$ | [1, 0] stays | [0, 1] -> [0, -1] |
| **Reflect across y-axis** | $\begin{bmatrix} -1 & 0 \\ 0 & 1 \end{bmatrix}$ | [1, 0] -> [-1, 0] | [0, 1] stays |
| **Reflect across y=x** | $\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ | [1, 0] -> [0, 1] | [0, 1] -> [1, 0] |
| **Shear (horizontal)** | $\begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}$ | [1, 0] stays | [0, 1] -> [k, 1] |
| **Shear (vertical)** | $\begin{bmatrix} 1 & 0 \\ k & 1 \end{bmatrix}$ | [1, 0] -> [1, k] | [0, 1] stays |
| **Project onto x-axis** | $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$ | [1, 0] stays | [0, 1] -> [0, 0] (collapsed!) |
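
Rows of this table can be built in code by stacking the images of the basis vectors as columns. The helper name `matrix_from_basis_images` is an illustrative choice; `np.column_stack` is the NumPy call doing the work.

```python
import numpy as np

def matrix_from_basis_images(e1_image, e2_image):
    """Build the 2x2 matrix that sends [1, 0] to e1_image and [0, 1] to e2_image."""
    return np.column_stack([e1_image, e2_image])

# Reflect across y = x: the two basis vectors swap places
reflect_yx = matrix_from_basis_images([0, 1], [1, 0])

# Project onto the x-axis: [1, 0] stays, [0, 1] collapses to the origin
project_x = matrix_from_basis_images([1, 0], [0, 0])

v = np.array([2, 3])
print(f"Reflect across y=x:  {reflect_yx @ v}")
print(f"Project onto x-axis: {project_x @ v}")

# Projecting twice changes nothing further
print(f"Projection applied twice equals once: {np.array_equal(project_x @ project_x, project_x)}")
```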

#### Why Matrix Multiplication is Composition of Transformations

When you multiply matrices $\mathbf{AB}$, you're creating a new transformation that does **B first, then A**.

**Think of it this way:**
- To apply $\mathbf{AB}$ to vector $\mathbf{v}$: $(\mathbf{AB})\mathbf{v} = \mathbf{A}(\mathbf{B}\mathbf{v})$
- First B transforms v, then A transforms the result

**Why the "backwards" order?**

Because we read left-to-right but function application is right-to-left: $f(g(x))$ applies g first, then f.

This is exactly like composing functions: if `rotate()` and `scale()` are functions, then `rotate(scale(v))` scales first, rotates second.

In [None]:
# Demonstration: Matrix multiplication = composition of transformations
# Let's compose scaling (2x in x, 0.5x in y) followed by rotation (45 degrees)

# Define individual transformations
theta = np.pi / 4  # 45 degrees
Rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

Scale = np.array([[2.0, 0],
                  [0, 0.5]])

# Compose: Scale first, then Rotate (remember: right-to-left!)
# So we write: Rotate @ Scale
Composed = Rotate @ Scale

print("Rotation matrix (45 degrees):")
print(Rotate.round(3))
print("\nScaling matrix (2x, 0.5y):")
print(Scale)
print("\nComposed (Rotate @ Scale) - scales first, then rotates:")
print(Composed.round(3))

# Visualize each step: original, scaled, scaled then rotated, and the composed matrix
fig, axes = plt.subplots(1, 4, figsize=(20, 5))

# Test vector
v = np.array([1, 1])

# Original
axes[0].quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, color='blue', width=0.02)
axes[0].set_title('Original vector [1, 1]')

# After Scale only
v_scaled = Scale @ v
axes[1].quiver(0, 0, v_scaled[0], v_scaled[1], angles='xy', scale_units='xy', scale=1, color='green', width=0.02)
axes[1].set_title(f'After Scale: {v_scaled}')

# After Scale then Rotate (two steps)
v_scaled_rotated = Rotate @ v_scaled
axes[2].quiver(0, 0, v_scaled_rotated[0], v_scaled_rotated[1], angles='xy', scale_units='xy', scale=1, color='red', width=0.02)
axes[2].set_title(f'After Scale then Rotate:\n{v_scaled_rotated.round(3)}')

# Using composed matrix (single step)
v_composed = Composed @ v
axes[3].quiver(0, 0, v_composed[0], v_composed[1], angles='xy', scale_units='xy', scale=1, color='purple', width=0.02)
axes[3].set_title(f'Using Composed matrix:\n{v_composed.round(3)}')

for ax in axes:
    ax.set_xlim(-3, 3)
    ax.set_ylim(-2, 2)
    ax.set_aspect('equal')
    ax.axhline(y=0, color='k', linewidth=0.5)
    ax.axvline(x=0, color='k', linewidth=0.5)
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nTwo-step result: {v_scaled_rotated.round(6)}")
print(f"Composed result: {v_composed.round(6)}")
print(f"Same? {np.allclose(v_scaled_rotated, v_composed)}")
print("\nKey insight: (Rotate @ Scale) @ v = Rotate @ (Scale @ v)")

### Matrix-Matrix Multiplication

If $\mathbf{A}$ is (m×n) and $\mathbf{B}$ is (n×p), then $\mathbf{AB}$ is (m×p).

**Key insight**: Matrix multiplication = composition of transformations.

If A rotates and B scales, then AB does both (B first, then A).
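
Because the order of application matters, swapping the factors generally gives a different transformation. A small sketch with an illustrative 90° rotation and a stretch along x:

```python
import numpy as np

theta = np.pi / 2  # 90 degrees counter-clockwise
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
scale = np.array([[2.0, 0.0],
                  [0.0, 1.0]])  # stretch x by 2, leave y alone

v = np.array([1.0, 0.0])

scale_then_rotate = (rotate @ scale) @ v  # scale first, then rotate
rotate_then_scale = (scale @ rotate) @ v  # rotate first, then scale

print(f"Scale then rotate: {scale_then_rotate.round(3)}")
print(f"Rotate then scale: {rotate_then_scale.round(3)}")
print(f"Same matrix? {np.allclose(rotate @ scale, scale @ rotate)}")
```

In the first case the vector is stretched while it still lies along x and then rotated; in the second it is rotated onto the y-axis first, where the stretch no longer affects it.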

In [None]:
# Matrix multiplication
A = np.array([[1, 2],
              [3, 4]])

B = np.array([[5, 6],
              [7, 8]])

C = A @ B
print(f"A:\n{A}\n")
print(f"B:\n{B}\n")
print(f"A @ B:\n{C}")

In [None]:
# EXERCISE: Implement matrix multiplication from scratch
def matmul(A, B):
    """
    Multiply matrices A and B.
    A: (m, n) matrix
    B: (n, p) matrix
    Returns: (m, p) matrix
    """
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, f"Incompatible dimensions: {A.shape} and {B.shape}"
    
    # C[i,j] = sum over k of A[i,k] * B[k,j]
    C = np.zeros((m, p))
    
    # Reference solution: loop over rows of A (i), columns of B (j),
    # and the shared dimension (k)
    for i in range(m):
        for j in range(p):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]
    
    return C

# Test your implementation
result = matmul(A, B)
expected = A @ B
print(f"Your result:\n{result}")
print(f"Expected:\n{expected}")
print(f"Correct: {np.allclose(result, expected)}")

### Matrix Properties

#### Transpose
Swap rows and columns: $(\mathbf{A}^T)_{ij} = \mathbf{A}_{ji}$

In [None]:
A = np.array([[1, 2, 3],
              [4, 5, 6]])

print(f"A (2x3):\n{A}\n")
print(f"A^T (3x2):\n{A.T}")

#### Identity Matrix
The "do nothing" transformation. $\mathbf{IA} = \mathbf{AI} = \mathbf{A}$

In [None]:
I = np.eye(3)  # 3x3 identity matrix
print(f"Identity matrix:\n{I}")

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(f"\nA @ I = A: {np.allclose(A @ I, A)}")

#### Matrix Inverse

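The inverse $\mathbf{A}^{-1}$ "undoes" the transformation: $\mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$

Only square matrices can have an inverse, and not all of them do. A square matrix with no inverse is called **singular**; this happens exactly when its determinant is 0.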

In [None]:
A = np.array([[4, 7],
              [2, 6]])

A_inv = np.linalg.inv(A)
print(f"A:\n{A}\n")
print(f"A^(-1):\n{A_inv}\n")
print(f"A @ A^(-1):\n{(A @ A_inv).round(10)}")

In [None]:
# A singular matrix (no inverse)
singular = np.array([[1, 2],
                     [2, 4]])  # Row 2 = 2 * Row 1

print(f"Determinant: {np.linalg.det(singular)}")
# np.linalg.inv(singular)  # This would raise np.linalg.LinAlgError (singular matrix)
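
A common use of the inverse is solving $\mathbf{Ax} = \mathbf{b}$. The sketch below reuses the invertible matrix from above with an illustrative right-hand side `b`, and compares the explicit inverse with `np.linalg.solve`, which solves the system without forming the inverse. It also shows one way to handle the singular case: catch `np.linalg.LinAlgError` and inspect the rank.

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
b = np.array([1.0, 0.0])  # illustrative right-hand side

x_via_inverse = np.linalg.inv(A) @ b
x_via_solve = np.linalg.solve(A, b)  # solves Ax = b directly

print(f"x via inverse: {x_via_inverse}")
print(f"x via solve:   {x_via_solve}")
print(f"A @ x recovers b: {np.allclose(A @ x_via_solve, b)}")

# Singular case: guard the call instead of letting it crash
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    inverted = True
except np.linalg.LinAlgError as err:
    inverted = False
    print(f"Could not invert: {err}")

# Rank 1 < 2 means both rows carry the same information
rank = np.linalg.matrix_rank(singular)
print(f"Rank of singular matrix: {rank}")
```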

---

## 3. Tensors

**Tensors** generalize scalars, vectors, and matrices to arrays with any number of dimensions:
- Scalar: 0D tensor
- Vector: 1D tensor
- Matrix: 2D tensor
- 3D tensor: e.g., RGB image (height × width × channels)
- 4D tensor: batch of images (batch × height × width × channels)

In deep learning, we constantly work with tensors.
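
As a sketch of what that looks like in practice, the block below pushes a batch through a hypothetical dense layer and reduces a batch of images along different axes. All shapes and names (`batch`, `W`, `b`, `images`) are illustrative assumptions, and the data is random.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator so the sketch is repeatable

# Hypothetical dense layer: 32 samples, 4 input features, 3 output units
batch = rng.random((32, 4))
W = rng.random((4, 3))
b = rng.random(3)
outputs = batch @ W + b  # (32, 4) @ (4, 3) -> (32, 3); b is added to every row

# Batch of 8 RGB images in (batch, height, width, channels) layout
images = rng.random((8, 28, 28, 3))
grayscale = images.mean(axis=-1)               # average over channels -> (8, 28, 28)
per_image_mean = images.mean(axis=(1, 2, 3))   # one number per image -> (8,)

print(f"outputs:        {outputs.shape}")
print(f"grayscale:      {grayscale.shape}")
print(f"per_image_mean: {per_image_mean.shape}")
```

Each row of `outputs` is the same matrix-vector computation applied to one sample, which is why batching is just a matrix product.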

In [None]:
# Tensors in NumPy
scalar = np.array(5)          # 0D
vector = np.array([1, 2, 3])  # 1D
matrix = np.array([[1, 2], [3, 4]])  # 2D
tensor_3d = np.random.rand(3, 4, 5)  # 3D
tensor_4d = np.random.rand(32, 28, 28, 3)  # Batch of 32 color images

print(f"Scalar shape: {scalar.shape}, ndim: {scalar.ndim}")
print(f"Vector shape: {vector.shape}, ndim: {vector.ndim}")
print(f"Matrix shape: {matrix.shape}, ndim: {matrix.ndim}")
print(f"3D tensor shape: {tensor_3d.shape}, ndim: {tensor_3d.ndim}")
print(f"4D tensor shape: {tensor_4d.shape}, ndim: {tensor_4d.ndim}")

### Broadcasting

NumPy's broadcasting allows operations on arrays of different shapes. This is crucial for efficient ML code.

In [None]:
# Broadcasting examples
A = np.array([[1, 2, 3],
              [4, 5, 6]])

# Scalar broadcast
print(f"A + 10:\n{A + 10}\n")

# Row vector broadcast (add to each row)
row = np.array([10, 20, 30])
print(f"A + [10, 20, 30]:\n{A + row}\n")

# Column vector broadcast (add to each column)
col = np.array([[100], [200]])
print(f"A + [[100], [200]]:\n{A + col}")
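
A typical ML use of broadcasting is standardizing features. In the sketch below (with an illustrative 3×2 data matrix), the per-feature mean and standard deviation have shape `(2,)` and are stretched across all rows. Shapes are compared from the last axis backwards, so a length-3 vector does not line up with a `(3, 2)` array until it is reshaped into a column.

```python
import numpy as np

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])  # 3 samples, 2 features

mean = X.mean(axis=0)  # shape (2,): one mean per feature
std = X.std(axis=0)    # shape (2,): one std per feature
X_standardized = (X - mean) / std  # (3, 2) with (2,) broadcasts across rows

print(f"Column means after: {X_standardized.mean(axis=0).round(10)}")
print(f"Column stds after:  {X_standardized.std(axis=0).round(10)}")

# A per-sample offset has shape (3,), which does not match the last axis (2)
per_sample_offset = np.array([100.0, 200.0, 300.0])
try:
    X + per_sample_offset
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(f"Shape mismatch: {err}")

# Reshape to a column, shape (3, 1), so it broadcasts across features
shifted = X + per_sample_offset[:, np.newaxis]
print(f"Shifted:\n{shifted}")
```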

---

## 4. Eigenvalues and Eigenvectors

For a square matrix $\mathbf{A}$, an **eigenvector** $\mathbf{v}$ and **eigenvalue** $\lambda$ satisfy:

$$\mathbf{Av} = \lambda\mathbf{v}$$

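**Meaning**: When you apply transformation A to an eigenvector v, the result stays on the same line through the origin. The vector is only scaled by λ (and flipped if λ is negative); it is never rotated off that line.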

**Applications in ML**:
- PCA (Principal Component Analysis)
- Understanding neural network dynamics
- Spectral clustering

In [None]:
# Simple example
A = np.array([[3, 1],
              [0, 2]])

eigenvalues, eigenvectors = np.linalg.eig(A)

print(f"Matrix A:\n{A}\n")
print(f"Eigenvalues: {eigenvalues}")
print(f"Eigenvectors (as columns):\n{eigenvectors}")

In [None]:
# Verify: Av = λv
for i in range(len(eigenvalues)):
    λ = eigenvalues[i]
    v = eigenvectors[:, i]  # Column i is eigenvector i
    
    Av = A @ v
    λv = λ * v
    
    print(f"\nEigenvector {i+1}: {v}")
    print(f"Eigenvalue: {λ}")
    print(f"A @ v = {Av}")
    print(f"λ * v = {λv}")
    print(f"Equal: {np.allclose(Av, λv)}")

In [None]:
# Visualize eigenvectors: they don't change direction under transformation
A = np.array([[2, 1],
              [1, 2]])

eigenvalues, eigenvectors = np.linalg.eig(A)

fig, ax = plt.subplots(figsize=(8, 8))

# Plot many vectors and their transformations
for theta in np.linspace(0, 2*np.pi, 16, endpoint=False):
    v = np.array([np.cos(theta), np.sin(theta)])
    Av = A @ v
    ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, 
              color='blue', alpha=0.3, width=0.01)
    ax.quiver(0, 0, Av[0], Av[1], angles='xy', scale_units='xy', scale=1, 
              color='red', alpha=0.3, width=0.01)

# Highlight eigenvectors
for i in range(2):
    v = eigenvectors[:, i]
    Av = A @ v
    ax.quiver(0, 0, v[0], v[1], angles='xy', scale_units='xy', scale=1, 
              color='blue', width=0.02, label=f'eigenvector {i+1}' if i == 0 else '')
    ax.quiver(0, 0, Av[0], Av[1], angles='xy', scale_units='xy', scale=1, 
              color='red', width=0.02, label=f'transformed' if i == 0 else '')

ax.set_xlim(-4, 4)
ax.set_ylim(-4, 4)
ax.set_aspect('equal')
ax.axhline(y=0, color='k', linewidth=0.5)
ax.axvline(x=0, color='k', linewidth=0.5)
ax.set_title('Blue: Original, Red: Transformed\nEigenvectors only scale, not rotate')
ax.legend()
plt.show()

print(f"Eigenvalues: {eigenvalues}")
print("Notice: eigenvectors (thick lines) stay on the same line after transformation!")

### Deep Dive: The Intuition Behind Eigenvectors

**The Big Picture**: Eigenvectors are the "special directions" of a transformation - directions that only get stretched or shrunk, never rotated.

#### Why is this important?

Think of a transformation as "doing something" to space. Most directions get both stretched AND rotated. But eigenvectors reveal the **natural axes** of that transformation - the directions where the action is simplest.

> **Eigenvector intuition**: "I'm a direction that this matrix only scales, never rotates. Apply the matrix to me, and I just get longer or shorter."

#### Breaking Down the Equation

$$\mathbf{Av} = \lambda\mathbf{v}$$

| Component | Meaning |
|-----------|---------|
| $\mathbf{A}$ | The transformation matrix |
| $\mathbf{v}$ | An eigenvector (special direction) |
| $\lambda$ | The eigenvalue (how much $\mathbf{v}$ gets scaled) |
| $\mathbf{Av}$ | The result of transforming $\mathbf{v}$ |
| $\lambda\mathbf{v}$ | Same direction as $\mathbf{v}$, just scaled by $\lambda$ |

#### What the Eigenvalue Tells You

| Eigenvalue $\lambda$ | Geometric meaning |
|----------------------|-------------------|
| $\lambda > 1$ | Eigenvector gets stretched |
| $0 < \lambda < 1$ | Eigenvector gets shrunk |
| $\lambda = 1$ | Eigenvector unchanged (fixed direction) |
| $\lambda = 0$ | Eigenvector collapses to zero (null space) |
| $\lambda < 0$ | Eigenvector flips direction and scales |
| Complex $\lambda$ | Rotation is involved (no real direction is purely scaled) |
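A few of these cases can be checked numerically. The sketch below uses three standard 2×2 matrices (a 90° rotation, a projection onto the x-axis, and a reflection across the x-axis); the variable names are illustrative.

```python
import numpy as np

# 90-degree rotation: no real direction is preserved
rotation = np.array([[0.0, -1.0],
                     [1.0,  0.0]])
# Projection onto the x-axis: the y-direction collapses to zero
projection = np.array([[1.0, 0.0],
                       [0.0, 0.0]])
# Reflection across the x-axis: the y-direction flips
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])

rot_eigs = np.linalg.eigvals(rotation)
proj_eigs = np.linalg.eigvals(projection)
refl_eigs = np.linalg.eigvals(reflection)

print(f"Rotation eigenvalues:   {rot_eigs}")   # complex conjugate pair
print(f"Projection eigenvalues: {proj_eigs}")  # contains 0 and 1
print(f"Reflection eigenvalues: {refl_eigs}")  # contains 1 and -1
```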

#### Why Eigenvectors Matter in Machine Learning

| Application | How Eigenvectors are Used |
|------------|---------------------------|
| **PCA** | Eigenvectors of covariance matrix = directions of maximum variance. The top eigenvectors are the "principal components." |
| **Spectral Clustering** | Eigenvectors of graph Laplacian reveal cluster structure. Points are embedded using eigenvectors, then clustered. |
| **PageRank** | The dominant eigenvector of the link matrix gives page importance scores. |
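| **Neural Network Dynamics** | Eigenvalues of weight matrices affect gradient flow. When the same matrix is applied repeatedly (as in recurrent networks), eigenvalue magnitudes above 1 tend to cause exploding gradients, and magnitudes below 1 tend to cause vanishing gradients. |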
| **Covariance Analysis** | Eigenvectors show directions of correlation in data. Eigenvalues show how much variance in each direction. |
| **Matrix Powers** | If $A$ is diagonalizable, $A = V \Lambda V^{-1}$, then $A^n = V \Lambda^n V^{-1}$: just raise the eigenvalues to the power $n$. Useful for Markov chains. |
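The matrix-powers row can be verified in a few lines. This sketch assumes a diagonalizable matrix (here the symmetric matrix from the visualization above) and compares the eigendecomposition route with `np.linalg.matrix_power`.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
n = 5

eigenvalues, V = np.linalg.eig(A)

# A^n = V diag(lambda^n) V^{-1}, valid when A is diagonalizable
A_pow_eig = V @ np.diag(eigenvalues ** n) @ np.linalg.inv(V)
A_pow_direct = np.linalg.matrix_power(A, n)

print(A_pow_eig.round(6))
print(f"Matches matrix_power: {np.allclose(A_pow_eig, A_pow_direct)}")
```

Only the eigenvalues are raised to the power; the eigenvectors in `V` are reused unchanged for every `n`.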

#### The PCA Connection

**PCA finds eigenvectors of the covariance matrix.**

Why? The covariance matrix $\mathbf{C}$ tells you how features vary together. Its eigenvectors point in directions where data varies most (or least), and eigenvalues tell you how much variance is in each direction.

The eigenvector with the **largest eigenvalue** = direction of **maximum variance** = first principal component.
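As a minimal sketch of this idea, the code below builds hypothetical 2D data that varies mostly along the $(1, 1)$ direction and recovers that direction from the covariance matrix. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: strong variation along (1, 1), small noise elsewhere
t = rng.normal(size=500)
data = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

centered = data - data.mean(axis=0)
C = np.cov(centered, rowvar=False)  # 2x2 covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]  # sort descending
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

first_pc = eigenvectors[:, 0]
explained = eigenvalues / eigenvalues.sum()

print(f"First principal component: {first_pc}")
print(f"Fraction of variance explained: {explained}")
```

The sign of `first_pc` is arbitrary: both $\mathbf{v}$ and $-\mathbf{v}$ are valid eigenvectors.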

---

## 5. Singular Value Decomposition (SVD)

SVD decomposes ANY matrix (not just square) into:

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

Where:
- $\mathbf{U}$: Left singular vectors (orthonormal columns)
- $\mathbf{\Sigma}$: Diagonal matrix of singular values (non-negative, sorted descending)
- $\mathbf{V}^T$: Right singular vectors (orthonormal rows of $\mathbf{V}^T$, i.e., columns of $\mathbf{V}$)

**Applications in ML**:
- Dimensionality reduction (PCA uses SVD)
- Image compression
- Recommender systems
- Latent semantic analysis

In [None]:
# SVD example
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9],
              [10, 11, 12]])

U, s, Vt = np.linalg.svd(A)

print(f"Original A shape: {A.shape}")
print(f"U shape: {U.shape}")
print(f"Singular values: {s}")
print(f"V^T shape: {Vt.shape}")

In [None]:
# Reconstruct A from SVD
# Need to create the full Sigma matrix
Sigma = np.zeros((U.shape[0], Vt.shape[0]))
np.fill_diagonal(Sigma, s)

A_reconstructed = U @ Sigma @ Vt
print(f"Original A:\n{A}\n")
print(f"Reconstructed:\n{A_reconstructed.round(10)}\n")
print(f"Reconstruction accurate: {np.allclose(A, A_reconstructed)}")

### Low-Rank Approximation

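By keeping only the top k singular values (and the corresponding singular vectors), we get the best rank-k approximation of A in the Frobenius and spectral norms (the Eckart-Young theorem).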

This is the foundation of dimensionality reduction!

### Deep Dive: Understanding SVD Geometrically

SVD reveals the hidden structure of any matrix. Think of it as answering: *"What are the fundamental building blocks of this transformation?"*

$$\mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^T$$

#### What Each Component Represents

| Component | Shape | What it represents | Geometric meaning |
|-----------|-------|-------------------|-------------------|
| $\mathbf{V}^T$ | (n x n) | Input rotation | Rotate input to align with matrix's "natural" axes |
| $\mathbf{\Sigma}$ | (m x n) | Scaling | Scale along each axis (singular values on diagonal) |
| $\mathbf{U}$ | (m x m) | Output rotation | Rotate to final output orientation |

**The key insight**: ANY matrix transformation can be decomposed into: **rotate -> scale -> rotate**. Here each "rotation" is an orthogonal transformation, so it may also include a reflection.
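The three steps can be applied one at a time to a single vector. The sketch below uses a small made-up 3×2 matrix and the full SVD, so the factor shapes match the table above; `step1`, `step2`, and `step3` are illustrative names.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0],
              [0.0, 1.0]])  # maps 2D inputs to 3D outputs
x = np.array([1.0, -1.0])

U, s, Vt = np.linalg.svd(A)  # full matrices: U is 3x3, Vt is 2x2
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

step1 = Vt @ x         # rotate/reflect the input
step2 = Sigma @ step1  # scale along each axis (and change dimension)
step3 = U @ step2      # rotate/reflect into the output space

print(f"Same as A @ x: {np.allclose(step3, A @ x)}")
# Orthogonal factors preserve length
print(f"Length preserved by Vt: {np.isclose(np.linalg.norm(step1), np.linalg.norm(x))}")
print(f"U orthogonal: {np.allclose(U.T @ U, np.eye(3))}")
print(f"V orthogonal: {np.allclose(Vt @ Vt.T, np.eye(2))}")
```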

#### Why Singular Values are Sorted by Importance

The singular values in $\Sigma$ are, by convention, sorted in descending order: $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r \geq 0$, where $r = \min(m, n)$.

**Why does the order matter?** Each singular value measures how much the matrix "stretches" space along one direction:
- $\sigma_1$ = maximum stretch factor (most important direction)
- $\sigma_2$ = second most stretch (second most important)
- Small $\sigma_i$ = nearly no stretch = "noise" or "unimportant"

This ordering is why keeping only the top-k singular values gives the **best** rank-k approximation!
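Both claims can be checked numerically. The sketch below, on a random 10×8 matrix, confirms that $\sigma_1$ equals the spectral norm (the largest stretch of any unit vector) and that the Frobenius error of a rank-k truncation is determined entirely by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((10, 8))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# sigma_1 is the largest stretch factor, i.e. the spectral (2-) norm
print(np.isclose(s[0], np.linalg.norm(A, 2)))

# The first right singular vector is a unit input that is stretched by sigma_1
v1 = Vt[0]
print(np.isclose(np.linalg.norm(A @ v1), s[0]))

# Rank-k truncation error (Frobenius) = norm of the discarded singular values
k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
error = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(error, np.sqrt(np.sum(s[k:] ** 2))))
```

Because the discarded values are the smallest ones, no other choice of k components can leave a smaller error.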

#### The Connection to PCA

PCA and SVD are deeply connected:

| If you have... | PCA finds... | Which equals... |
|----------------|--------------|-----------------|
| Data matrix $\mathbf{X}$ (centered) | Eigenvectors of $\mathbf{X}^T\mathbf{X}$ | Right singular vectors $\mathbf{V}$ from SVD of $\mathbf{X}$ |
| Projected data (scores) | $\mathbf{X} \cdot \text{eigenvectors}$ | $\mathbf{U} \cdot \Sigma$ from SVD |
| Variance explained | Eigenvalues / total | $\sigma_i^2 / \sum \sigma_j^2$ |

**Bottom line**: PCA is just SVD on centered data! In practice, PCA is often computed using SVD because it's more numerically stable.
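The rows of the table can be verified side by side. This is a sketch on synthetic data (the mixing matrix is arbitrary); it compares the eigendecomposition of the covariance matrix with the SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic data: 100 samples, 3 correlated features
X = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [0.5, 1.0, 0.0],
                                          [0.0, 0.3, 0.2]])
Xc = X - X.mean(axis=0)
n_samples = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix
cov = Xc.T @ Xc / (n_samples - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order

# Route 2: SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance equal sigma^2 / (n - 1)
same_variances = np.allclose(eigvals, s**2 / (n_samples - 1))
# Directions agree up to sign
same_directions = np.allclose(np.abs(eigvecs.T @ Vt.T), np.eye(3), atol=1e-6)
# Projected data: X V equals U Sigma
same_scores = np.allclose(Xc @ Vt.T, U * s)

print(same_variances, same_directions, same_scores)
```

Dividing by $n - 1$ rescales the eigenvalues but not the eigenvectors, so the variance-explained ratios are identical in both routes.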

In [None]:
def low_rank_approx(A, k):
    """Return rank-k approximation of matrix A using SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Example with random matrix
np.random.seed(42)
A = np.random.rand(10, 8)

print(f"Original matrix shape: {A.shape}")
print(f"Rank of A: {np.linalg.matrix_rank(A)}")

for k in [1, 2, 4, 8]:
    A_k = low_rank_approx(A, k)
    error = np.linalg.norm(A - A_k, 'fro')  # Frobenius norm
    print(f"Rank-{k} approximation error: {error:.4f}")

### Practical Exercise: Image Compression with SVD

Let's compress an image using SVD!

In [None]:
# Create a sample grayscale image (or load one)
# We'll create a simple pattern
x = np.linspace(-3, 3, 200)
y = np.linspace(-3, 3, 200)
X, Y = np.meshgrid(x, y)
image = np.sin(X) * np.cos(Y) + 0.5 * np.sin(2*X) * np.cos(2*Y)
image = (image - image.min()) / (image.max() - image.min())  # Normalize to [0, 1]

plt.figure(figsize=(6, 6))
plt.imshow(image, cmap='gray')
plt.title(f'Original Image ({image.shape[0]}×{image.shape[1]})')
plt.colorbar()
plt.show()

In [None]:
# Compress with different ranks
U, s, Vt = np.linalg.svd(image, full_matrices=False)

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

ranks = [1, 5, 10, 20, 50, 100]
for ax, k in zip(axes.flat, ranks):
    compressed = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    
    # Calculate compression ratio
    original_size = image.shape[0] * image.shape[1]
    compressed_size = k * (image.shape[0] + image.shape[1] + 1)
    ratio = original_size / compressed_size
    
    ax.imshow(compressed, cmap='gray')
    ax.set_title(f'Rank {k}\nCompression: {ratio:.1f}x')
    ax.axis('off')

plt.tight_layout()
plt.show()

# Plot singular values
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(s, 'b-')
plt.xlabel('Index')
plt.ylabel('Singular Value')
plt.title('Singular Values')

plt.subplot(1, 2, 2)
plt.plot(np.cumsum(s**2) / np.sum(s**2), 'b-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Variance Explained')
plt.title('Cumulative Variance')
plt.axhline(y=0.95, color='r', linestyle='--', label='95%')
plt.legend()

plt.tight_layout()
plt.show()
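Instead of reading the 95% threshold off the plot, the required rank can be computed directly. `rank_for_energy` below is a hypothetical helper, not part of NumPy, shown on a toy set of singular values.

```python
import numpy as np

def rank_for_energy(singular_values, threshold=0.95):
    """Smallest k whose top-k squared singular values reach `threshold` of the total."""
    energy = np.cumsum(singular_values ** 2) / np.sum(singular_values ** 2)
    # searchsorted gives the first index where energy >= threshold
    k = int(np.searchsorted(energy, threshold)) + 1
    return min(k, len(singular_values))  # guard against rounding at threshold ~ 1

# Toy check: squared values are 9, 4, 1, 0.25 (total 14.25)
s_demo = np.array([3.0, 2.0, 1.0, 0.5])
k = rank_for_energy(s_demo, 0.95)
print(f"Components needed for 95%: {k}")
```

With the image cell above, the call would be `rank_for_energy(s)` using the singular values returned by `np.linalg.svd`.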

---

## Exercises

### Exercise 1: Implement Matrix Operations
Implement the following functions without using NumPy's built-in equivalents (`.T`, `np.dot`, `@`). Reference solutions are included below:

In [None]:
def transpose(A):
    """Return the transpose of matrix A."""
    m, n = A.shape
    result = np.zeros((n, m))
    # TODO: Implement
    for i in range(m):
        for j in range(n):
            result[j, i] = A[i, j]
    return result

def dot_product(a, b):
    """Return the dot product of vectors a and b."""
    assert len(a) == len(b)
    result = 0
    # TODO: Implement
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

def matrix_vector_mult(A, v):
    """Return A @ v."""
    m, n = A.shape
    assert n == len(v)
    result = np.zeros(m)
    # TODO: Implement
    for i in range(m):
        result[i] = dot_product(A[i], v)
    return result

# Test
A = np.array([[1, 2, 3], [4, 5, 6]])
v = np.array([1, 2, 3])

print(f"transpose(A) correct: {np.allclose(transpose(A), A.T)}")
print(f"dot_product([1,2,3], [4,5,6]) = {dot_product(np.array([1,2,3]), np.array([4,5,6]))}")
print(f"matrix_vector_mult correct: {np.allclose(matrix_vector_mult(A, v), A @ v)}")

### Exercise 2: Linear Transformation Explorer
Create different transformation matrices and visualize their effects.

In [None]:
# TODO: Create and visualize these transformations:
# 1. Reflection across the x-axis
# 2. Reflection across y=x line
# 3. Projection onto the x-axis
# 4. A combination: rotate 45° then scale by 2

# Example: Reflection across x-axis
reflect_x = np.array([[1, 0],
                      [0, -1]])
plot_transformation(reflect_x, "Reflection across x-axis")
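As a possible starting point for item 4, the sketch below composes a rotation and a scaling. The key detail is the order of multiplication: the transformation applied first sits closest to the vector. The names `rotate_45`, `scale_2`, and `rotate_then_scale` are illustrative.

```python
import numpy as np

theta = np.pi / 4  # 45 degrees
rotate_45 = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
scale_2 = 2 * np.eye(2)

# "Rotate, then scale": the first operation sits closest to the vector
rotate_then_scale = scale_2 @ rotate_45

e1 = np.array([1.0, 0.0])
print(rotate_then_scale @ e1)  # approximately [1.414, 1.414]
```

Uniform scaling happens to commute with rotation, so the order does not matter in this particular case. With a non-uniform scaling such as `np.diag([2, 1])`, the two orders give different matrices.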

### Exercise 3: Build a Simple Recommender System

Use SVD for matrix factorization to build a basic recommender system.

In [None]:
# User-Item rating matrix (users x items)
# 0 means not rated
ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 0, 4],
    [2, 1, 3, 4, 5]
])

print("User-Item Ratings (0 = not rated):")
print(ratings)

# TODO: 
# 1. Fill missing values with row means (simple imputation)
# 2. Apply SVD to get low-rank approximation
# 3. Use the approximation to predict missing ratings

# Fill missing with row means
ratings_filled = ratings.copy().astype(float)
for i in range(ratings.shape[0]):
    row = ratings[i]
    mean = row[row > 0].mean()
    ratings_filled[i, row == 0] = mean

print("\nFilled ratings:")
print(ratings_filled.round(2))

# Low-rank approximation
k = 2  # Use only 2 latent factors
U, s, Vt = np.linalg.svd(ratings_filled, full_matrices=False)
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(f"\nPredicted ratings (rank-{k}):")
print(predicted.round(2))

# Show predictions for originally missing entries
print("\nPredictions for missing entries:")
for i in range(ratings.shape[0]):
    for j in range(ratings.shape[1]):
        if ratings[i, j] == 0:
            print(f"  User {i}, Item {j}: {predicted[i, j]:.2f}")
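One way to turn these predictions into recommendations is to pick, for each user, the unrated item with the highest predicted rating. The sketch below repeats the pipeline in vectorized form so it runs on its own; the `observed` mask and the `candidates` array are illustrative names.

```python
import numpy as np

ratings = np.array([
    [5, 3, 0, 1, 4],
    [4, 0, 0, 1, 3],
    [1, 1, 0, 5, 4],
    [0, 1, 5, 4, 0],
    [0, 0, 4, 0, 4],
    [2, 1, 3, 4, 5]
])
observed = ratings > 0

# Row-mean imputation (zeros contribute nothing to the row sums)
row_means = (ratings.sum(axis=1) / observed.sum(axis=1))[:, None]
filled = np.where(observed, ratings.astype(float), row_means)

# Rank-2 approximation
k = 2
U, s, Vt = np.linalg.svd(filled, full_matrices=False)
predicted = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Hide already-rated items, then pick the best remaining item per user
candidates = np.where(observed, -np.inf, predicted)
for user in range(ratings.shape[0]):
    if observed[user].all():
        print(f"User {user}: nothing left to recommend")
        continue
    item = int(np.argmax(candidates[user]))
    print(f"User {user}: recommend item {item} (predicted {predicted[user, item]:.2f})")
```

Row-mean imputation is a deliberately simple baseline: the filled-in values influence the factorization, so predictions for missing entries tend to stay close to each user's average.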

---

## Summary

### Key Concepts

1. **Vectors** represent data points, weights, and gradients
2. **Dot product** is the core operation (weighted sums in neural networks)
3. **Matrices** are linear transformations
4. **Matrix multiplication** composes transformations
5. **Eigenvectors** reveal the "natural directions" of a transformation
6. **SVD** decomposes any matrix and enables dimensionality reduction

### Connection to Deep Learning

- **Forward pass**: Sequence of matrix multiplications + activations
- **Weights**: Learned transformation matrices
- **Backprop**: Uses chain rule on these matrix operations
- **Embeddings**: Learned low-dimensional representations, similar in spirit to the low-rank factors from SVD
- **Attention**: Computed via dot products between vectors
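Two of these connections fit in a few lines of NumPy. The sketch below shows one dense layer with a ReLU activation and a scaled dot-product attention weight matrix. All shapes and the random "weights" are made up for illustration; nothing here is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense layer: batch of 4 inputs with 3 features -> 2 hidden units
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))  # stands in for learned weights
b = np.zeros(2)              # bias, broadcast across the batch
hidden = np.maximum(0, X @ W + b)  # ReLU activation
print(hidden.shape)  # (4, 2)

# Attention: dot product of every query with every key
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
scores = Q @ K.T / np.sqrt(K.shape[1])  # scaled dot products, shape (5, 5)

# Softmax over keys (subtracting the row max for numerical stability)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
print(weights.sum(axis=1))  # each row sums to 1
```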

### Checklist
- [ ] I can perform vector operations (addition, dot product, norm)
- [ ] I understand matrices as linear transformations
- [ ] I can multiply matrices and understand shape compatibility
- [ ] I know what eigenvalues/eigenvectors represent
- [ ] I can use SVD for dimensionality reduction

---

## Next Steps

Continue to **Part 1.2: Calculus Refresher** where we'll cover:
- Derivatives and the chain rule
- Gradients and gradient descent
- The mathematical foundation of backpropagation