<a href="https://colab.research.google.com/github/financieras/math_for_ai/blob/main/the_mystery_of_the_1_in_machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Mystery of the "1" in Machine Learning

One of the most common head-scratchers for folks getting into **Machine Learning** is:  
> **Where does that "1" come from when we multiply the bias?**

Let’s clear it up with a super intuitive, real-world example: **calculating taxi fares**.

---

## Real-World Use Case: Taxi Pricing

A taxi charges according to this formula:

$$ \text{Fare} = 2.50 + 1.30 \times \text{km} $$

That’s just a straight line:

$$ y = b + m x $$

Or, in ML notation:

$$ y = w_0 + w_1 x $$

Where:
- **\$2.50** is the **flag drop** (fixed base fare, regardless of distance). This is the **intercept** or **bias**.
- **\$1.30/km** is the per-kilometer rate. This is the **slope** or **weight** of the input feature $x$.

### The Big Question

How do we express this in **matrix form** so we can compute **multiple fares at once**?

The base fare (\$2.50) doesn’t depend on any input variable — it’s constant. But in matrix multiplication, **every weight needs to be multiplied by something**.

**Solution:** Multiply the constant term by **1**.

$$ \text{Fare} = 2.50 \times \boxed{1} + 1.30 \times \text{km} $$

That **"1"** is what lets us fold the bias into clean, elegant **matrix notation**.

---

## Case 1: One Ride

Let’s compute the fare for a 10 km ride using **matrix math**.

```python
import numpy as np

# One ride: [1, km]
# The 1 is for the flag drop
x = np.array([1, 10])  # [bias, kilometers]

# Weights: [flag_drop, price_per_km]
w = np.array([2.5, 1.3])

# Compute fare
fare = x @ w  # Same as x.dot(w)

print(f"Ride: {x[1]} km")
print(f"Fare: ${fare:.2f}")
print(f"\nBreakdown:")
print(f"  Flag drop:     {w[0]} × {x[0]}  = ${w[0] * x[0]:.2f}")
print(f"  Per km:        {w[1]} × {x[1]} = ${w[1] * x[1]:.2f}")
print(f"  Total:                       ${fare:.2f}")
```

**Output:**
```
Ride: 10 km
Fare: $15.50

Breakdown:
  Flag drop:     2.5 × 1  = $2.50
  Per km:        1.3 × 10 = $13.00
  Total:                       $15.50
```

### Matrix Notation (Single Sample)

$$ \hat{y} = \mathbf{x}^T \mathbf{w} = \begin{bmatrix} 1 & 10 \end{bmatrix} \begin{bmatrix} 2.5 \\ 1.3 \end{bmatrix} = 15.5 $$

**Dimensions:** $\mathbf{x}^T_{1 \times 2} \cdot \mathbf{w}_{2 \times 1} = \hat{y}$ (scalar)

### General Formula (One Sample, $n$ Features)

$$ \hat{y} = \mathbf{x}^T \mathbf{w} = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} $$

**Dimensions:** $\mathbf{x}^T_{1 \times (n+1)} \cdot \mathbf{w}_{(n+1) \times 1} = \hat{y}$
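
To see why this augmented form matches the usual "weights times inputs, plus bias" formula, it can help to split the dot product into its first term and the rest. The symbols $\mathbf{x}_{\text{raw}}$ and $\mathbf{w}_{\text{raw}}$ are introduced here only for illustration: they stand for the input vector without the leading 1 and the weight vector without $w_0$.

$$ \mathbf{x}^T \mathbf{w} = \underbrace{1 \cdot w_0}_{\text{bias term}} + \underbrace{\sum_{j=1}^{n} x_j w_j}_{\mathbf{x}_{\text{raw}}^T \mathbf{w}_{\text{raw}}} = w_0 + \mathbf{x}_{\text{raw}}^T \mathbf{w}_{\text{raw}} $$

For the 10 km ride: $1 \cdot 2.5 + 10 \cdot 1.3 = 2.5 + 13.0 = 15.5$.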

---

## Case 2: Multiple Rides (Batch Processing)

In ML, we **rarely** process one sample at a time. We usually work in **batches**.

Let’s say we have 4 rides:

| Ride | km  |
|------|-----|
| 1    | 10  |
| 2    | 4   |
| 3    | 25  |
| 4    | 20  |

We want to compute all 4 fares **in one shot**.

```python
# Kilometers for 4 rides
kilometers = np.array([10, 4, 25, 20])

# Build X matrix: each ROW is a ride
# Add a column of ones for the flag drop
X = np.column_stack([np.ones(4), kilometers])

# Weights
w = np.array([2.5, 1.3])

# Compute all fares at once
fares = X @ w

print("Matrix X (each row = one ride):")
print(X)
print("\nWeight vector w:")
print(w)
print("\nFare for each ride:")
for i, fare in enumerate(fares, 1):
    print(f"  Ride {i} ({X[i-1, 1]:.0f} km): ${fare:.2f}")
print(f"\nTotal revenue:     ${fares.sum():.2f}")
```

**Output:**
```
Matrix X (each row = one ride):
[[ 1. 10.]
 [ 1.  4.]
 [ 1. 25.]
 [ 1. 20.]]

Weight vector w:
[2.5 1.3]

Fare for each ride:
  Ride 1 (10 km): $15.50
  Ride 2 (4 km): $7.70
  Ride 3 (25 km): $35.00
  Ride 4 (20 km): $28.50

Total revenue:     $86.70
```

### Batch Matrix Notation

$$ \hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & 10 \\
1 & 4 \\
1 & 25 \\
1 & 20
\end{bmatrix}
\begin{bmatrix}
2.5 \\
1.3
\end{bmatrix}
=
\begin{bmatrix}
15.5 \\
7.7 \\
35.0 \\
28.5
\end{bmatrix} $$

**Dimensions:** $\mathbf{X}_{4 \times 2} \cdot \mathbf{w}_{2 \times 1} = \hat{\mathbf{y}}_{4 \times 1}$

**Key Takeaways:**
- Each **row** of $\mathbf{X}$ = one ride
- First column = all **1s** (for bias)
- One matrix op → **4 predictions**
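
As a quick sanity check, the column of ones should give the same fares as adding the flag drop separately. The sketch below compares the two approaches; the names `fares_matrix` and `fares_explicit` are just illustrative.

```python
import numpy as np

kilometers = np.array([10, 4, 25, 20])

# Augmented form: the bias lives inside w, paired with a column of ones
X = np.column_stack([np.ones(len(kilometers)), kilometers])
w = np.array([2.5, 1.3])
fares_matrix = X @ w

# Explicit form: scale by the per-km rate, then add the flag drop
flag_drop, price_per_km = 2.5, 1.3
fares_explicit = flag_drop + price_per_km * kilometers

print("Matrix form:  ", fares_matrix)
print("Explicit form:", fares_explicit)
print("Same result:  ", np.allclose(fares_matrix, fares_explicit))
```

Both forms compute the same numbers. The matrix form simply keeps every parameter in one vector, which is convenient once there are many features.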

---

## Case 3: Multiple Features

Now add a second input: **wait time in minutes**, charged at \$0.50/min.

$$ \text{Fare} = 2.50 + 1.30 \times \text{km} + 0.50 \times \text{min} $$

This is now a **plane**:

$$ y = w_0 + w_1 x_1 + w_2 x_2 $$

### One Ride, Two Features

```python
# One ride: 10 km, 15 min wait
x = np.array([1, 10, 15])  # [bias, km, minutes]

# Weights: [flag_drop, $/km, $/min]
w = np.array([2.5, 1.3, 0.5])

# Compute fare
fare = x @ w

print(f"Ride: {x[1]} km, {x[2]} min")
print(f"Fare: ${fare:.2f}")
print(f"\nBreakdown:")
print(f"  Flag drop:     {w[0]} × {x[0]}  = ${w[0] * x[0]:.2f}")
print(f"  Per km:        {w[1]} × {x[1]} = ${w[1] * x[1]:.2f}")
print(f"  Wait time:     {w[2]} × {x[2]} = ${w[2] * x[2]:.2f}")
print(f"  Total:                       ${fare:.2f}")
```

**Output:**
```
Ride: 10 km, 15 min
Fare: $23.00

Breakdown:
  Flag drop:     2.5 × 1  = $2.50
  Per km:        1.3 × 10 = $13.00
  Wait time:     0.5 × 15 = $7.50
  Total:                       $23.00
```

### Batch of 4 Rides, 2 Features

| Ride | km  | min |
|------|-----|-----|
| 1    | 10  | 15  |
| 2    | 4   | 0   |
| 3    | 25  | 5   |
| 4    | 20  | 10  |

```python
# Ride data
kilometers = np.array([10, 4, 25, 20])
minutes = np.array([15, 0, 5, 10])

# Build X: [1, km, min] per row
X = np.column_stack([np.ones(4), kilometers, minutes])

# Weights
w = np.array([2.5, 1.3, 0.5])

# Compute all fares
fares = X @ w

print("Matrix X (each row = one ride):")
print("  [bias, km, min]")
print(X)
print("\nWeight vector w:")
print("  [flag, $/km, $/min]")
print(w)
print("\nComputed fares:")
for i in range(len(fares)):
    print(f"  Ride {i+1} ({X[i,1]:.0f} km, {X[i,2]:.0f} min): ${fares[i]:.2f}")
print(f"\nTotal revenue: ${fares.sum():.2f}")
```

**Output:**
```
Matrix X (each row = one ride):
  [bias, km, min]
[[ 1. 10. 15.]
 [ 1.  4.  0.]
 [ 1. 25.  5.]
 [ 1. 20. 10.]]

Weight vector w:
  [flag, $/km, $/min]
[2.5 1.3 0.5]

Computed fares:
  Ride 1 (10 km, 15 min): $23.00
  Ride 2 (4 km, 0 min): $7.70
  Ride 3 (25 km, 5 min): $37.50
  Ride 4 (20 km, 10 min): $33.50

Total revenue: $101.70
```

### Batch Matrix Form

$$ \hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & 10 & 15 \\
1 & 4 & 0 \\
1 & 25 & 5 \\
1 & 20 & 10
\end{bmatrix}
\begin{bmatrix}
2.5 \\
1.3 \\
0.5
\end{bmatrix}
=
\begin{bmatrix}
23.0 \\
7.7 \\
37.5 \\
33.5
\end{bmatrix} $$

**Dimensions:** $\mathbf{X}_{4 \times 3} \cdot \mathbf{w}_{3 \times 1} = \hat{\mathbf{y}}_{4 \times 1}$

#### General Form ($m$ samples, $n$ features)

$$ \hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1n} \\
1 & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
\begin{bmatrix}
w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n
\end{bmatrix} $$

**Dimensions:** $\mathbf{X}_{m \times (n+1)} \cdot \mathbf{w}_{(n+1) \times 1} = \hat{\mathbf{y}}_{m \times 1}$
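
The general form can be wrapped in a small helper that works for any number of samples and features. This is a minimal sketch, not a library API: `add_bias_column` and `predict` are hypothetical names chosen for this example.

```python
import numpy as np

def add_bias_column(features):
    """Prepend a column of ones: (m, n) features -> (m, n + 1) matrix."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        # Treat a 1-D input as m samples of a single feature
        features = features.reshape(-1, 1)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])

def predict(features, w):
    """Compute X @ w, where w = [w_0, w_1, ..., w_n] includes the bias."""
    X = add_bias_column(features)
    w = np.asarray(w, dtype=float)
    if X.shape[1] != w.shape[0]:
        raise ValueError(f"expected {X.shape[1]} weights, got {w.shape[0]}")
    return X @ w

rides = [[10, 15], [4, 0], [25, 5], [20, 10]]  # [km, min] per ride
fares = predict(rides, [2.5, 1.3, 0.5])
print(fares)
```

The shape check reflects the dimension rule above: a matrix with $n + 1$ columns needs exactly $n + 1$ weights.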

---

## Connection to Machine Learning

### Linear Regression

This is **exactly** how a linear regression model with $n$ features makes its predictions:

$$ \hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n $$

In matrix form:

$$ \hat{\mathbf{y}} = \mathbf{X} \mathbf{w} $$

Where:
- $\mathbf{X}$: feature matrix $(m \times (n+1))$
  - $m$ = number of samples
  - $n$ = number of features
  - **First column = ones** (for $w_0$ bias)
- $\mathbf{w}$: weight vector $((n+1) \times 1)$
- $\hat{\mathbf{y}}$: predictions $(m \times 1)$

### Example with scikit-learn

```python
from sklearn.linear_model import LinearRegression

# Training data (no bias column: sklearn fits the intercept itself, since fit_intercept=True by default)
X_train = np.array([[10, 15],
                    [4, 0],
                    [25, 5],
                    [20, 10]])
y_train = np.array([23.0, 7.7, 37.5, 33.5])

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

print("Parameters learned by sklearn:")
print(f"  Bias (w₀): {model.intercept_:.2f}")
print(f"  Weights (w₁, w₂): {model.coef_}")

print("\nCompare with our original weights:")
print(f"  w₀ (flag drop): {w[0]}")
print(f"  w₁ ($/km):      {w[1]}")
print(f"  w₂ ($/min):     {w[2]}")

# Predict new ride: 30 km, 8 min
new_ride = np.array([[30, 8]])
pred_sklearn = model.predict(new_ride)[0]
pred_manual = 2.5 + 1.3*30 + 0.5*8

print(f"\nNew ride (30 km, 8 min):")
print(f"  sklearn prediction: ${pred_sklearn:.2f}")
print(f"  Manual calc:        ${pred_manual:.2f}")
```

**Output:**
```
Parameters learned by sklearn:
  Bias (w₀): 2.50
  Weights (w₁, w₂): [1.3 0.5]

Compare with our original weights:
  w₀ (flag drop): 2.5
  w₁ ($/km):      1.3
  w₂ ($/min):     0.5

New ride (30 km, 8 min):
  sklearn prediction: $45.50
  Manual calc:        $45.50
```
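
The same parameters can also be recovered without scikit-learn by adding the ones column by hand and solving a least-squares problem. The sketch below uses `np.linalg.lstsq` on the same four rides; the names `X_aug` and `w_fit` are illustrative. It shows one way to fit the weights and does not describe how scikit-learn works internally.

```python
import numpy as np

# Same training data as the scikit-learn example
X_train = np.array([[10, 15],
                    [4, 0],
                    [25, 5],
                    [20, 10]], dtype=float)
y_train = np.array([23.0, 7.7, 37.5, 33.5])

# Add the column of ones so the bias is learned as w_0
X_aug = np.column_stack([np.ones(len(X_train)), X_train])

# Least-squares solution of X_aug @ w ≈ y_train
w_fit, residuals, rank, singular_values = np.linalg.lstsq(X_aug, y_train, rcond=None)

print("Learned [w0, w1, w2]:", np.round(w_fit, 4))
print("Rank of X_aug:", rank)
```

Because the fares were generated without noise, the fitted weights should come out very close to the flag drop, the per-km rate, and the per-minute rate used throughout.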

---

## Why Batch Processing Matters

Batch operations are **core** to ML performance.

### 1. **Computational Efficiency**

Processing 1 million rides in a loop vs. one matrix op:

```python
import time

# Generate 1M random rides
n_rides = 1_000_000
X_large = np.column_stack([
    np.ones(n_rides),
    np.random.uniform(1, 50, n_rides),  # km
    np.random.uniform(0, 30, n_rides)   # min
])

# Method 1: Loop (one by one)
start = time.perf_counter()  # perf_counter is better suited to timing than time.time
fares_loop = []
for i in range(n_rides):
    fare = X_large[i] @ w
    fares_loop.append(fare)
time_loop = time.perf_counter() - start

# Method 2: Vectorized (batch)
start = time.perf_counter()
fares_vec = X_large @ w
time_vec = time.perf_counter() - start

print(f"Processing {n_rides:,} rides:")
print(f"  Loop (one-by-one): {time_loop*1000:.2f} ms")
print(f"  Vectorized (batch): {time_vec*1000:.2f} ms")
print(f"  Speedup: {time_loop/time_vec:.1f}x faster")
```

**Output (example):**
```
Processing 1,000,000 rides:
  Loop (one-by-one): 6400.21 ms
  Vectorized (batch): 12.87 ms
  Speedup: 497.3x faster
```

### 2. **Hardware Optimization**

GPUs are built for **massively parallel matrix ops**, which is why frameworks like PyTorch, TensorFlow, and JAX are designed around batched tensor operations.

### 3. **Training Stability**

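In neural nets, a gradient averaged over a batch is a **less noisy** estimate than a single-sample gradient, so updates tend to be **more stable**.

Typical batch sizes: **32, 64, 128, 256**. Powers of two are a common convention; the best value depends on the model, the data, and the available GPU memory.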
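
The stability claim can be checked numerically. The sketch below is a toy experiment under stated assumptions: the rides are synthetic, the noise level (standard deviation of 2.0) is arbitrary, and the batch sizes 1 and 64 are chosen for contrast. It measures how much the squared-error gradient varies from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rides: fares follow the taxi formula plus noise (illustrative data)
n_rides = 10_000
km = rng.uniform(1, 50, n_rides)
X = np.column_stack([np.ones(n_rides), km])
true_w = np.array([2.5, 1.3])
y = X @ true_w + rng.normal(0, 2.0, n_rides)

# Evaluate gradients at a deliberately wrong starting point
w = np.array([0.0, 0.0])

# Per-sample gradient of the squared error (x·w - y)^2 with respect to w
errors = X @ w - y                           # shape (n_rides,)
per_sample_grads = 2 * errors[:, None] * X   # shape (n_rides, 2)

def batch_gradient_std(batch_size, n_batches=200):
    """Std of the batch-averaged gradient across randomly drawn batches."""
    grads = []
    for _ in range(n_batches):
        idx = rng.choice(n_rides, size=batch_size, replace=False)
        grads.append(per_sample_grads[idx].mean(axis=0))
    return np.std(grads, axis=0)

std_1 = batch_gradient_std(1)
std_64 = batch_gradient_std(64)
print("Gradient std, batch size 1: ", std_1)
print("Gradient std, batch size 64:", std_64)
```

With independent samples, the spread of the averaged gradient shrinks roughly with the square root of the batch size, so the batch of 64 should show a noticeably smaller spread than single samples.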

---

We’ve walked through why that **"1"** exists, how **matrix form** enables batching, and why **vectorized ops** matter so much for ML performance.

Hope this clears up the mystery and gives you a solid mental model for **bias, weights, and batching** in linear models and beyond.