# Linear Algebra with NumPy

#### Concept 1: Vectors, Matrices, and the $(N, D)$ Shape

**The Why:**
In Machine Learning, we rarely process one data point at a time. We process "batches" of data for efficiency.
* A **Vector** (1D array) usually represents a single data point's features.
* A **Matrix** (2D array) represents a **batch** of data points.

We almost always denote this shape as $(N, D)$, where:
* $N$ = Batch Size (number of samples).
* $D$ = Input Dimensions (number of features per sample).
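
To make the $(N, D)$ convention concrete, here is a minimal sketch on made-up toy data. The names `X_demo`, `sample`, and `feature` are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

# Hypothetical toy batch: N = 5 samples, D = 4 features per sample.
N, D = 5, 4
X_demo = np.arange(N * D, dtype=float).reshape(N, D)

sample = X_demo[0]       # one row    = one data point's feature vector
feature = X_demo[:, 0]   # one column = one feature across the whole batch

print(X_demo.shape)   # (5, 4) -> (N, D)
print(sample.shape)   # (4,)   -> (D,)
print(feature.shape)  # (5,)   -> (N,)
```

Rows index samples and columns index features, so slicing along the first axis always gives you data points.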

**Micro-Task 1:**
Using `numpy`:
1.  Set a random seed (for reproducibility).
2.  Create a "batch" of data $X$ representing **5 samples**, where each sample has **4 features** (e.g., height, weight, age, income).
3.  Print the array and, crucially, print its `.shape` attribute to verify it matches $(N, D)$.

In [None]:
import numpy as np

# Computers aren't actually random; they follow a complex list of instructions to look random. 
# If we set a "seed" (like a starting point for that list), we can make sure we get the exact same "random" numbers every time we run the code. 
# This is crucial for debugging!

np.random.seed(67)

# Create a random batch X: 5 samples (N) x 4 features (D)
X = np.random.rand(5, 4)
print(X)
print(X.shape)  # (5, 4), i.e. (N, D)

#### Concept 2: Matrix Multiplication & Dot Products

**The Why:**
This is the single most important operation in Deep Learning. A "Dense Layer" in a neural network is just a dot product between your data ($X$) and a matrix of learnable weights ($W$).

If $X$ holds your data and $W$ holds the network's knowledge, multiplying them ($X \cdot W$) gives you the network's prediction.

**The Rule:**
The **inner dimensions** must match.
$$(N, \underbrace{D) \cdot (D}_{\text{match}}, M) \rightarrow (N, M)$$
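
The rule can be checked directly. The sketch below uses placeholder arrays (`A_demo`, `B_good`, `B_bad` are made-up names) to show one product whose inner dimensions match and one where they do not.

```python
import numpy as np

A_demo = np.ones((5, 4))   # (N, D)
B_good = np.ones((4, 3))   # (D, M): inner dimensions match (4 == 4)
B_bad = np.ones((3, 4))    # inner dimensions differ (4 != 3)

print(np.dot(A_demo, B_good).shape)  # (5, 3) -> (N, M)

try:
    np.dot(A_demo, B_bad)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("Shape mismatch:", err)

# Transposing B_bad gives shape (4, 3), so the product is defined again.
print(np.dot(A_demo, B_bad.T).shape)  # (5, 3)
```

When you hit this `ValueError` in practice, printing both `.shape` attributes is usually the fastest way to find the mismatch.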

**Micro-Task 2:**
We have our input `X` with shape `(5, 4)`.
1.  Create a random weight matrix `W`. We want this layer to have **3 neurons** (output features).
2.  Compute the dot product `Y` using `np.dot(X, W)`.

In [None]:
# Create a weight that can be multiplied with X
W = np.random.rand(4,3)

# Dot product
Y = np.dot(X, W)
np.shape(Y)

This `(N, M)` output is exactly what you'd feed into the next layer of a neural network.
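
As a rough illustration of feeding one layer's output into the next, the sketch below chains two weight matrices. The second layer's size of 2 neurons and the names `W_a`, `W_b`, `H`, and `out` are assumptions made up for this example. It also uses the `@` operator, which gives the same result as `np.dot` for 2D arrays.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
X_demo = rng.random((5, 4))      # (N, D)
W_a = rng.random((4, 3))         # layer 1: 4 features -> 3 neurons
W_b = rng.random((3, 2))         # layer 2: 3 features -> 2 neurons (assumed size)

H = X_demo @ W_a                 # (5, 3): output of layer 1
out = H @ W_b                    # (5, 2): output of layer 2

print(H.shape, out.shape)
print(np.allclose(H, np.dot(X_demo, W_a)))  # True: @ and np.dot agree here
```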

---

#### Concept 3: Broadcasting

**The Why:**
In pure Python (nested lists), adding a number or a vector such as a bias to a whole matrix takes an explicit loop. `numpy` handles this automatically via **Broadcasting**. It conceptually "stretches" the smaller array to match the larger one without actually copying data, which keeps the operation fast and memory-efficient.

In a Dense Layer, the formula is $Y = X \cdot W + b$.
* $X \cdot W$ is `(N, M)` (as we just saw).
* $b$ (bias) is usually `(M,)`—one bias value per neuron.

NumPy will automatically add that `(M,)` vector to *every single row* of the `(N, M)` matrix.
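
Here is a small sketch of that row-wise addition, compared against the explicit loop it replaces. The arrays `Y_demo` and `b_demo` are placeholders invented for this example, not the `Y` and `b` from the tasks.

```python
import numpy as np

Y_demo = np.zeros((5, 3))            # stands in for X . W, shape (N, M)
b_demo = np.array([1.0, 2.0, 3.0])   # one bias per neuron, shape (M,)

broadcast_sum = Y_demo + b_demo      # b_demo is added to every row

# The same result written as an explicit loop over rows.
loop_sum = np.empty_like(Y_demo)
for i in range(Y_demo.shape[0]):
    loop_sum[i] = Y_demo[i] + b_demo

print(broadcast_sum.shape)                      # (5, 3)
print(np.array_equal(broadcast_sum, loop_sum))  # True

# np.broadcast_to exposes the "stretched" view: the stride along the new
# axis is 0, so all 5 rows reuse the same 3 numbers in memory.
stretched = np.broadcast_to(b_demo, Y_demo.shape)
print(stretched.strides)  # (0, 8) for float64
```

The zero stride is what "without copying data" means in practice: the stretched array is a view, not five duplicated rows.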

**Micro-Task 3:**
1.  Create a bias vector `b` with shape `(3,)` (since we have 3 output neurons).
2.  Add `b` to your previous result `Y`.
3.  Print the shape of the result to prove it didn't change.

In [None]:
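# Y = X . W has shape (5, 3), so b needs one value per output neuron: shape (3,)
# Possible ways of writing:
#     b = np.random.rand(3)      # shape (3,): what the task asks for (np.random.rand(3, ) is identical)
#     b = np.random.rand(1, 3)   # shape (1, 3): a 2D row vector, which also broadcasts against (5, 3)
b = np.random.rand(3)
Y = np.dot(X, W) + b  # recomputed from X and W so re-running this cell does not add b twice
print(Y.shape)        # still (5, 3)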

#### Concept 4: ReLU Activation (Vectorization)

**The Why:**
Right now, our calculation $XW + b$ is purely **linear**. If we stack 100 linear layers on top of each other, mathematically, they compress down to just one single linear layer. We gain no power.
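
The collapse follows from expanding two stacked layers: $(X W_a + b_a) W_b + b_b = X (W_a W_b) + (b_a W_b + b_b)$, which is again of the form $XW + b$. A quick numerical check, using made-up layer sizes and names (`W_a`, `b_a`, `W_b`, `b_b`), might look like this:

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for this sketch
X_demo = rng.random((5, 4))
W_a, b_a = rng.random((4, 3)), rng.random(3)   # hypothetical layer 1
W_b, b_b = rng.random((3, 2)), rng.random(2)   # hypothetical layer 2

# Two linear layers applied one after the other.
two_layers = (X_demo @ W_a + b_a) @ W_b + b_b

# One linear layer with merged weights and bias.
W_merged = W_a @ W_b           # shape (4, 2)
b_merged = b_a @ W_b + b_b     # shape (2,)
one_layer = X_demo @ W_merged + b_merged

print(np.allclose(two_layers, one_layer))  # True
```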

To learn complex patterns (like curves, faces, or grammar), we need to inject **non-linearity**. We do this with an "Activation Function."
The most common one is **ReLU (Rectified Linear Unit)**. It's deceptively simple:
* If the number is positive, keep it.
* If the number is negative, turn it to 0.
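
To see why this counts as non-linear, recall that a linear function must satisfy $f(a + b) = f(a) + f(b)$. The sketch below, using a throwaway helper named `relu` and two hand-picked vectors, shows ReLU breaking that rule.

```python
import numpy as np

def relu(z):
    """Hypothetical helper: element-wise max(0, z)."""
    return np.maximum(0, z)

a = np.array([-1.0, 2.0])
b = np.array([3.0, -5.0])

lhs = relu(a + b)         # relu([2, -3])    -> [2., 0.]
rhs = relu(a) + relu(b)   # [0, 2] + [3, 0]  -> [3., 2.]

print(lhs, rhs)
print(np.array_equal(lhs, rhs))  # False: ReLU is not linear
```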

**Micro-Task 4:**
We need to apply ReLU to your matrix `Y`.
1.  Create a new variable `A` (for Activation).
2.  Set `A` equal to `Y`, but replace all negative values with `0`.
3.  **Constraint:** Do **not** use a `for` loop. Use `np.maximum(..., ...)` to do it all at once (Vectorization).
4.  Print `A`.

In [None]:
A = np.maximum(0, Y)
A

### Step 3: The Project
**Project Title: The "Manual" Dense Layer & Efficiency Benchmark**

We are going to prove *why* we use NumPy by building a forward pass and then pitting it against standard Python loops.

**Specifications:**
1.  **Data Setup:**
    * Create a random input batch $X$ of shape `(100, 500)` (Batch Size = 100, Features = 500).
    * Create a random weight matrix $W$ of shape `(500, 64)` (500 Input features $\to$ 64 Output neurons).
2.  **Task A: The NumPy Approach**
    * Compute $Y = X \cdot W$ using `np.dot`.
    * Time how long this takes (you can use the `time` module).
3.  **Task B: The Naive Loop Approach**
    * Write a function that computes the exact same matrix multiplication using standard Python `for` loops (iterating over rows and columns).
    * Time how long this takes.
4.  **Output:**
    * Print the shape of the result (should be `(100, 64)`).
    * Print the time taken for both approaches to show the difference.

In [None]:
import time

X1 = np.random.rand(100,500)
W1 = np.random.rand(500, 64)

# perf_counter is a high-resolution clock meant for timing short code sections
start_time1 = time.perf_counter()
Y1 = np.dot(X1, W1)
end_time1 = time.perf_counter()
total_time1 = end_time1 - start_time1

print(np.shape(Y1), f"\nThe numpy process took {total_time1:.4f} sec")

X2 = np.random.rand(100,500)
W2 = np.random.rand(500, 64)

n = np.shape(X2)[0]  # batch size N
d = np.shape(X2)[1]  # input features D (the shared inner dimension)
m = np.shape(W2)[1]  # output neurons M

Y2 = np.zeros((n, m))

start_time2 = time.perf_counter()

for i in range(n):          # each sample (row of X2)
    for j in range(m):      # each output neuron (column of W2)
        x_sum = 0
        for k in range(d):  # dot product of row i with column j
            x_sum += X2[i][k] * W2[k][j]
        Y2[i][j] = x_sum

end_time2 = time.perf_counter()
total_time2 = end_time2 - start_time2

print(np.shape(Y2), f"\nThe non-numpy process took {total_time2:.4f} sec")

This is a fantastic result. You have just empirically proven the "Why" of Deep Learning frameworks.

* **NumPy Time:** ~0.0010 sec
* **Loop Time:** ~2.7564 sec
* **Speedup:** Your NumPy code is roughly **2,750x faster**.

If training a real model took 1 week with NumPy, doing it with loops would take roughly **53 years** (2,750 weeks ÷ 52 weeks per year). This is why we avoid raw Python loops for heavy math.

---

### Step 4: The Code Review

Here is my critique of your implementation:

**1. The "Pro" Syntax (Pass)**
You correctly identified the triple loop structure required for matrix multiplication:
* Outer loops ($i, j$) iterate through every output cell.
* Inner loop ($k$) computes the dot product for that cell.
* This is mathematically perfect.

**2. Optimizations & Best Practices**
* **Indexing:** You used `X2[i][k]`. In NumPy, it is slightly faster and more standard to use comma-separated indexing: `X2[i, k]`. The `[][]` syntax first builds a temporary view of row `i` and then indexes into it, whereas `[i, k]` fetches the element in a single indexing operation.
* **Data Reuse:** You created new variables `X2` and `W2`. For a strict benchmark, it's better to reuse `X1` and `W1`. Both approaches then run on identical inputs, so you can confirm they produce the same result (for example with `np.allclose`) and be sure the timing difference comes from the *algorithm* alone.
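
One possible way to apply both suggestions is sketched below. The function `naive_matmul` and the small array sizes are assumptions chosen so the example runs quickly; they are not the project's `(100, 500)` setup, and the timings will differ on your machine.

```python
import time
import numpy as np

def naive_matmul(X, W):
    """Hypothetical helper: triple-loop matrix product using [i, k] indexing."""
    n, d = X.shape
    d_w, m = W.shape
    if d != d_w:
        raise ValueError(f"inner dimensions differ: {d} != {d_w}")
    Y = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            total = 0.0
            for k in range(d):
                total += X[i, k] * W[k, j]
            Y[i, j] = total
    return Y

rng = np.random.default_rng(0)     # arbitrary seed for this sketch
X_small = rng.random((20, 30))     # the same inputs feed both approaches
W_small = rng.random((30, 8))

start = time.perf_counter()
Y_np = np.dot(X_small, W_small)
numpy_time = time.perf_counter() - start

start = time.perf_counter()
Y_loop = naive_matmul(X_small, W_small)
loop_time = time.perf_counter() - start

print(Y_loop.shape)                # (20, 8)
print(np.allclose(Y_np, Y_loop))   # True: same inputs, same result
print(f"numpy: {numpy_time:.6f} sec, loops: {loop_time:.6f} sec")
```

Sharing the inputs is what makes the `np.allclose` check meaningful: it confirms the two code paths compute the same thing before you compare their speed.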

**3. The "Under the Hood" Magic**
Why was NumPy so fast?
1.  **C-Backend:** NumPy delegates the heavy lifting to C and Fortran code, avoiding Python's slow interpreter overhead.
2.  **SIMD (Single Instruction, Multiple Data):** Modern CPUs can multiply 4, 8, or 16 numbers with a single instruction. NumPy leverages this; Python loops do not.
3.  **BLAS:** NumPy links to highly optimized linear algebra libraries (like Intel MKL or OpenBLAS) that manage memory cache much better than a naive loop.

**Rating:** ⭐ **PASS** (Strong implementation).

---

### Step 5: Real-World Context

**Where does this fit in the Big Picture?**

You just built the engine of a **Fully Connected Network (FCN)** (also called a Multi-Layer Perceptron).

In production (e.g., at Netflix or Google):
1.  **The "Batch":** We don't predict what *one* user wants to watch. We process users in batches of 128, 256, or even 1024 at a time (just like your `N=100`).
2.  **The "Dense Layer":** This $X \cdot W$ operation happens billions of times a day.
3.  **Hardware:** While your CPU is fast, in Weeks 2 and 3 we will move this exact logic to a **GPU**. GPUs are designed to run the thousands of multiply-adds inside these $(N, D) \cdot (D, M)$ operations in parallel.



**Summary of Week 1:**
* **Vectors/Matrices:** The language of data.
* **Broadcasting:** The trick to avoid loops.
* **Dot Product:** The engine of prediction.
* **ReLU:** The spark of non-linearity.
* **Vectorization:** The key to speed.
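
Putting the pieces together, a dense layer's forward pass can be sketched in a few lines. The function name `dense_forward` and the demo arrays are made up for this recap, and the layer sizes simply reuse the `(5, 4)` to 3-neuron example from the micro-tasks.

```python
import numpy as np

def dense_forward(X, W, b):
    """Hypothetical helper: ReLU(X . W + b) for a batch X of shape (N, D)."""
    return np.maximum(0, X @ W + b)   # dot product, broadcast bias, ReLU

rng = np.random.default_rng(0)            # arbitrary seed for this sketch
X_demo = rng.standard_normal((5, 4))      # (N, D)
W_demo = rng.standard_normal((4, 3))      # (D, M)
b_demo = np.zeros(3)                      # (M,)

A_demo = dense_forward(X_demo, W_demo, b_demo)
print(A_demo.shape)          # (5, 3)
print((A_demo >= 0).all())   # True: ReLU removed every negative value
```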

---
