<a href="https://colab.research.google.com/github/financieras/math_for_ai/blob/main/the_mystery_of_the_1_in_machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Mystery of the "1" in Machine Learning

One of the most common questions for people learning Machine Learning is: **where does that "1" come from when we multiply by the bias?**

Let's solve this mystery with an intuitive real-world example: calculating taxi fares.

---

## Practical Example: Taxi Pricing

A taxi charges according to this formula:

$$\text{Fare} = 2.5 + 1.3 \times \text{km}$$

This is simply a straight line:
$$y = b + m x$$
Or in Machine Learning notation:

$$y = w_0 + w_1 x$$

Where:
- **\$2.50** is the **base fare** (a fixed cost, independent of distance). In the equation, this is the intercept or bias term, $w_0$.
- **\$1.30/km** is the price per kilometer traveled. In the equation, this is the slope $w_1$, the coefficient of the independent variable $x$.

### The Key Question

How do we express this in matrix form to compute **multiple fares at once**?

The problem is that the base fare (\$2.50) **doesn't depend on any variable**: it's a constant term. But in a matrix product, every weight has to be multiplied by some entry of the input.

**Solution:** We multiply the constant term by 1.

$$\text{Fare} = 2.5 \times \boxed{1} + 1.3 \times \text{km}$$

This "1" is what allows us to include the intercept term in matrix notation.

---

## Case 1: Single Ride

Let's start by calculating the fare for a 10 km ride using matrix notation.

In [7]:
import numpy as np

# Single ride: [1, km]
# The 1 is for multiplying by the base fare
x = np.array([1, 10])  # [bias, kilometers]

# Weights: [base_fare, price_per_km]
w = np.array([2.5, 1.3])

# Calculate the fare
fare = x @ w  # Equivalent to: x.dot(w)

print(f"Ride: {x[1]} km")
print(f"Fare: ${fare:.2f}")
print(f"\nBreakdown:")
print(f"  Base fare: {w[0]} × {x[0]}  =  ${w[0] * x[0]:.2f}")
print(f"  Distance:  {w[1]} × {x[1]} = ${w[1] * x[1]:.2f}")
print(f"  Total:                ${fare:.2f}")

Ride: 10 km
Fare: $15.50

Breakdown:
  Base fare: 2.5 × 1  =  $2.50
  Distance:  1.3 × 10 = $13.00
  Total:                $15.50


### Matrix Notation

For a single observation:

$$\hat{y} = \mathbf{x}^T \mathbf{w} = \begin{bmatrix} 1 & 10 \end{bmatrix} \begin{bmatrix} 2.5 \\ 1.3 \end{bmatrix} = 15.5$$

**Dimensions:** $\mathbf{x}^T_{1 \times 2} \cdot \mathbf{w}_{2 \times 1} = \hat{y}$ (scalar)

### General Form
For $n$ features and one observation:

$$\hat{y} = \mathbf{x}^T \mathbf{w} = \begin{bmatrix} 1 & x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$$

**Dimensions:** $\mathbf{x}^T_{1 \times (n+1)} \cdot \mathbf{w}_{(n+1) \times 1} = \hat{y}$ (scalar)
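
The general form above can be sketched as a small helper. This is a minimal illustration rather than part of the original notebook: the name `predict_one` is hypothetical, and it assumes the weight vector stores the bias first, as $[w_0, w_1, \dots, w_n]$.

```python
import numpy as np

def predict_one(features, w):
    """Prediction for one observation; `features` holds x_1..x_n WITHOUT the leading 1."""
    features = np.asarray(features, dtype=float)
    w = np.asarray(w, dtype=float)
    if w.shape != (features.size + 1,):
        raise ValueError("w must have n + 1 entries: the bias plus one weight per feature")
    x = np.concatenate(([1.0], features))  # prepend the 1 that pairs with w_0
    return float(x @ w)

# Taxi example from Case 1: a single feature (km)
print(predict_one([10], [2.5, 1.3]))
```

The helper only automates the step of prepending the 1; the dot product itself is unchanged.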

---

## Case 2: Multiple Rides (Batch Processing)

In Machine Learning, **we rarely process data one sample at a time**. In practice, we almost always work with **batches**.

Suppose we have 4 different rides:

| Ride | Kilometers |
|------|------------|
| 1    | 10         |
| 2    | 4          |
| 3    | 25         |
| 4    | 20         |

We want to calculate all 4 fares **simultaneously**.

In [2]:
# Kilometers for 4 rides
kilometers = np.array([10, 4, 25, 20])

# Build matrix X: each ROW is one ride
# Add a column of ones for the base fare
X = np.column_stack([np.ones(4), kilometers])

# Weights
w = np.array([2.5, 1.3])

# Calculate all 4 fares at once
fares = X @ w

print("Matrix X (each row is one ride):")
print(X)
print("\nWeight vector w:")
print(w)
print("\nFare for each ride:")
for i, fare in enumerate(fares, 1):
    print(f"  Ride {i} ({X[i-1, 1]:.0f} km): ${fare:.2f}")
print(f"\nTotal revenue:     ${np.sum(fares):.2f}")

Matrix X (each row is one ride):
[[ 1. 10.]
 [ 1.  4.]
 [ 1. 25.]
 [ 1. 20.]]

Weight vector w:
[2.5 1.3]

Fare for each ride:
  Ride 1 (10 km): $15.50
  Ride 2 (4 km): $7.70
  Ride 3 (25 km): $35.00
  Ride 4 (20 km): $28.50

Total revenue:     $86.70


### Batch Matrix Notation

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & 10 \\
1 & 4 \\
1 & 25 \\
1 & 20
\end{bmatrix}
\begin{bmatrix}
2.5 \\
1.3
\end{bmatrix}
=
\begin{bmatrix}
15.5 \\
7.7 \\
35.0 \\
28.5
\end{bmatrix}$$

**Dimensions:** $\mathbf{X}_{4 \times 2} \cdot \mathbf{w}_{2 \times 1} = \hat{\mathbf{y}}_{4 \times 1}$

**Key takeaways:**
- Each **row** of $\mathbf{X}$ = one ride
- First column = all **1s** (for bias)
- One matrix operation → **4 predictions**
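
To see that the column of ones is only a bookkeeping device, the sketch below compares it with adding the bias as a separate step. The variable names (`X_with_ones`, `fares_ones`, `fares_separate`) are illustrative and do not come from the original notebook.

```python
import numpy as np

kilometers = np.array([10, 4, 25, 20], dtype=float)
base_fare, price_per_km = 2.5, 1.3

# Option A: fold the bias into the weights via a column of ones
X_with_ones = np.column_stack([np.ones_like(kilometers), kilometers])
w = np.array([base_fare, price_per_km])
fares_ones = X_with_ones @ w

# Option B: keep the bias separate and let NumPy broadcast the scalar
fares_separate = kilometers * price_per_km + base_fare

print(np.allclose(fares_ones, fares_separate))
```

Both options give the same fares. The ones column simply lets a single matrix product carry the bias along with the other weights.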

---

## Case 3: Multiple Features

Now let's add a second feature: **wait time in minutes**, charged at \$0.50/min.

$$\text{Fare} = 2.5 + 1.3 \times \text{km} + 0.5 \times \text{min}$$

This is now a **plane** (a linear model with two input features):

$$y = w_0 + w_1 x_1 + w_2 x_2$$

### Single Ride, Two Features

In [3]:
# One ride: 10 km, 15 min wait
x = np.array([1, 10, 15])  # [bias, km, min]

# Weights: [base_fare, $/km, $/min]
w = np.array([2.5, 1.3, 0.5])

# Calculate fare
fare = x @ w

print(f"Ride: {x[1]:.0f} km, {x[2]:.0f} min")
print(f"Fare: ${fare:.2f}")
print(f"\nBreakdown:")
print(f"  Base fare:  {w[0]} × {x[0]}  = ${w[0] * x[0]:.2f}")
print(f"  Distance:   {w[1]} × {x[1]} = ${w[1] * x[1]:.2f}")
print(f"  Wait time:  {w[2]} × {x[2]} = ${w[2] * x[2]:.2f}")
print(f"  Total:                      ${fare:.2f}")

Ride: 10 km, 15 min
Fare: $23.00

Breakdown:
  Base fare:  2.5 × 1  = $2.50
  Distance:   1.3 × 10 = $13.00
  Wait time:  0.5 × 15 = $7.50
  Total:                      $23.00


### Matrix Notation for One Sample

$$\hat{y} = \mathbf{x}^T \mathbf{w} = \begin{bmatrix} 1 & 10 & 15 \end{bmatrix} \begin{bmatrix} 2.5 \\ 1.3 \\ 0.5 \end{bmatrix} = 23.0$$

**Dimensions:** $\mathbf{x}^T_{1 \times 3} \cdot \mathbf{w}_{3 \times 1} = \hat{y}$ (scalar)

### Batch: Multiple Rides, Multiple Features

In [4]:
# 4 rides with [km, min] data
kilometers = np.array([10, 4, 25, 20])
minutes = np.array([15, 0, 5, 10])

# Build X matrix
X = np.column_stack([np.ones(4), kilometers, minutes])

# Weights
w = np.array([2.5, 1.3, 0.5])

# Calculate all fares
fares = X @ w

print("Matrix X (each row is one ride):")
print("  [bias, km, min]")
print(X)
print("\nWeight vector w:")
print("  [base, $/km, $/min]")
print(w)
print("\nComputed fares:")
for i in range(len(fares)):
    print(f"  Ride {i+1} ({X[i,1]:.0f} km, {X[i,2]:.0f} min): ${fares[i]:.2f}")
print(f"\nTotal revenue: ${fares.sum():.2f}")

Matrix X (each row is one ride):
  [bias, km, min]
[[ 1. 10. 15.]
 [ 1.  4.  0.]
 [ 1. 25.  5.]
 [ 1. 20. 10.]]

Weight vector w:
  [base, $/km, $/min]
[2.5 1.3 0.5]

Computed fares:
  Ride 1 (10 km, 15 min): $23.00
  Ride 2 (4 km, 0 min): $7.70
  Ride 3 (25 km, 5 min): $37.50
  Ride 4 (20 km, 10 min): $33.50

Total revenue: $101.70


### Batch Matrix Form

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & 10 & 15 \\
1 & 4 & 0 \\
1 & 25 & 5 \\
1 & 20 & 10
\end{bmatrix}
\begin{bmatrix}
2.5 \\
1.3 \\
0.5
\end{bmatrix}
=
\begin{bmatrix}
23.0 \\
7.7 \\
37.5 \\
33.5
\end{bmatrix}$$

**Dimensions:** $\mathbf{X}_{4 \times 3} \cdot \mathbf{w}_{3 \times 1} = \hat{\mathbf{y}}_{4 \times 1}$

#### General Form ($m$ samples, $n$ features)

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} = \begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1n} \\
1 & x_{21} & x_{22} & \cdots & x_{2n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{m1} & x_{m2} & \cdots & x_{mn}
\end{bmatrix}
\begin{bmatrix}
w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n
\end{bmatrix}$$

**Dimensions:** $\mathbf{X}_{m \times (n+1)} \cdot \mathbf{w}_{(n+1) \times 1} = \hat{\mathbf{y}}_{m \times 1}$
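
As a sketch of the general form, the hypothetical helper `add_bias_column` below builds the $m \times (n+1)$ matrix from raw features and checks the shapes. The helper name and the choice of `np.hstack` are assumptions made for illustration.

```python
import numpy as np

def add_bias_column(X_raw):
    """Return an (m, n + 1) design matrix whose first column is all ones."""
    X_raw = np.asarray(X_raw, dtype=float)
    if X_raw.ndim == 1:
        X_raw = X_raw.reshape(-1, 1)  # treat a 1-D input as m samples of one feature
    m = X_raw.shape[0]
    return np.hstack([np.ones((m, 1)), X_raw])

X_raw = np.array([[10, 15], [4, 0], [25, 5], [20, 10]])  # m = 4 rides, n = 2 features
w = np.array([2.5, 1.3, 0.5])                            # n + 1 = 3 weights

X = add_bias_column(X_raw)
y_hat = X @ w
print(X.shape, w.shape, y_hat.shape)  # (4, 3) (3,) (4,)
```

Note that NumPy stores `w` as a 1-D array of length $n+1$, so the result has shape `(m,)` rather than the `(m, 1)` column vector of the written notation.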

---

## Connection to Machine Learning

### Linear Regression

This is exactly how linear regression computes its predictions with $n$ features (geometrically, a hyperplane):

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$$

In matrix form:

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w}$$

Where:
- $\mathbf{X}$: feature matrix $(m \times (n+1))$
  - $m$ = number of samples
  - $n$ = number of features
  - The **first column** contains ones (for bias $w_0$)
- $\mathbf{w}$: weight vector $((n+1) \times 1)$
- $\hat{\mathbf{y}}$: prediction vector $(m \times 1)$
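
The same matrix also works for learning the weights, not only for prediction. As a sketch (the notebook itself does not do this), NumPy's `np.linalg.lstsq` can recover the taxi weights from the four rides of Case 3, treating the ones column like any other feature:

```python
import numpy as np

X = np.array([[1, 10, 15],
              [1,  4,  0],
              [1, 25,  5],
              [1, 20, 10]], dtype=float)   # first column: the ones paired with w_0
y = np.array([23.0, 7.7, 37.5, 33.5])      # fares computed in Case 3

# Least-squares solution of X w ≈ y
w_fit, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)

print(np.round(w_fit, 6))
```

Because these fares were generated exactly by the formula, the fitted weights should match $w_0 = 2.5$, $w_1 = 1.3$, $w_2 = 0.5$ up to floating-point error.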

### Example with scikit-learn

In [5]:
from sklearn.linear_model import LinearRegression

# Training data (no bias column needed: sklearn fits the intercept itself by default)
X_train = np.array([[10, 15],
                    [4, 0],
                    [25, 5],
                    [20, 10]])
y_train = np.array([23.0, 7.7, 37.5, 33.5])

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

print("Parameters learned by sklearn:")
print(f"  Bias (w₀): {model.intercept_:.12f}")
print(f"  Coefficients (w₁, w₂): {model.coef_}")

print("\nComparing with our original weights:")
print(f"  w₀ (base fare): {w[0]}")
print(f"  w₁ ($/km):      {w[1]}")
print(f"  w₂ ($/min):     {w[2]}")

# Predict new ride: 30 km, 8 minutes
new_ride = np.array([[30, 8]])
pred_sklearn = model.predict(new_ride)[0]
pred_manual = 2.5 + 1.3*30 + 0.5*8

print(f"\nNew ride (30 km, 8 min):")
print(f"  sklearn prediction: ${pred_sklearn:.12f}")
print(f"  Manual calculation: ${pred_manual:.2f}")

Parameters learned by sklearn:
  Bias (w₀): 2.500000000000
  Coefficients (w₁, w₂): [1.3 0.5]

Comparing with our original weights:
  w₀ (base fare): 2.5
  w₁ ($/km):      1.3
  w₂ ($/min):     0.5

New ride (30 km, 8 min):
  sklearn prediction: $45.500000000000
  Manual calculation: $45.50
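
To connect sklearn's automatic intercept back to the column of ones, the sketch below switches the intercept off with `fit_intercept=False` and supplies the ones column by hand. The variable names (`X_ones`, `model_manual`) are illustrative; under this setup the bias should show up as the first coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_raw = np.array([[10, 15], [4, 0], [25, 5], [20, 10]], dtype=float)
y = np.array([23.0, 7.7, 37.5, 33.5])

# Add the ones column ourselves and tell sklearn not to fit a separate intercept
X_ones = np.column_stack([np.ones(len(X_raw)), X_raw])
model_manual = LinearRegression(fit_intercept=False).fit(X_ones, y)

print("coef_ (w0, w1, w2):", model_manual.coef_)
print("intercept_:", model_manual.intercept_)
```

In other words, the default `fit_intercept=True` plays the same role as the column of ones: it gives the model a constant term to learn.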


---

## Why Batch Processing Matters

Batch processing (multiple samples at once) is **fundamental** in Machine Learning:

### 1. **Computational Efficiency**
Processing many samples in a single vectorized operation is typically far faster than looping over them one by one in Python, often by **orders of magnitude**, as the benchmark below shows.

In [6]:
import time

# Generate 1 million random rides
n_rides = 1_000_000
X_large = np.column_stack([
    np.ones(n_rides),
    np.random.uniform(1, 50, n_rides),  # km
    np.random.uniform(0, 30, n_rides)   # minutes
])

# Method 1: Loop (one by one)
start = time.perf_counter()  # perf_counter is a monotonic, high-resolution timer
fares_loop = []
for i in range(n_rides):
    fare = X_large[i] @ w
    fares_loop.append(fare)
time_loop = time.perf_counter() - start

# Method 2: Vectorized (all at once)
start = time.perf_counter()
fares_vectorized = X_large @ w
time_vectorized = time.perf_counter() - start

print(f"Processing {n_rides:,} rides:")
print(f"  Loop method (one by one): {time_loop*1000:.2f} ms")
print(f"  Vectorized method (batch): {time_vectorized*1000:.2f} ms")
print(f"  Speedup: {time_loop/time_vectorized:.1f}x faster")

Processing 1,000,000 rides:
  Loop method (one by one): 4648.28 ms
  Vectorized method (batch): 8.94 ms
  Speedup: 520.1x faster


### 2. **Hardware Optimization**

GPUs are optimized for large matrix operations. That's why deep learning frameworks (PyTorch, TensorFlow) are designed around batched inputs.

### 3. **Training Stability**

In neural networks, a gradient averaged over a batch is a less noisy estimate than a gradient computed from a single sample, which tends to make training more stable.

Typical batch sizes for training are powers of two (32, 64, 128, 256, ...), a convention that tends to fit GPU memory and matrix hardware well.
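
The stability claim can be illustrated with a small experiment. This is a sketch under stated assumptions: the data is synthetic (taxi fares plus Gaussian noise), the loss is squared error, and `batch_gradients` is a hypothetical helper. It compares how much gradient estimates vary for batch sizes 1 and 32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic rides: fares follow the taxi formula plus a little noise (assumed data)
n_rides = 10_000
km = rng.uniform(1, 50, n_rides)
X = np.column_stack([np.ones(n_rides), km])
y = X @ np.array([2.5, 1.3]) + rng.normal(0, 1.0, n_rides)

w = np.zeros(2)  # gradients are evaluated at this arbitrary starting point

# Per-sample gradient of the squared error (x·w - y)^2 with respect to w
per_sample_grads = 2 * (X @ w - y)[:, None] * X   # shape (n_rides, 2)

def batch_gradients(batch_size, n_batches=500):
    """Gradient estimates from randomly drawn mini-batches."""
    idx = rng.integers(0, n_rides, size=(n_batches, batch_size))
    return per_sample_grads[idx].mean(axis=1)      # shape (n_batches, 2)

spread_1 = batch_gradients(1).std(axis=0)
spread_32 = batch_gradients(32).std(axis=0)

print("Std of gradient estimates, batch size 1: ", spread_1)
print("Std of gradient estimates, batch size 32:", spread_32)
```

With independently drawn samples, the spread should shrink roughly with the square root of the batch size, so the batch-32 estimates should be noticeably tighter.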

---

## Wrapping Up

In this article, we've explored why working with matrices and batches of data is so powerful in machine learning. I hope this has clarified the origin of that mysterious **1** that appears in matrix formulations and the importance of matrix multiplication in ML.

**Key insights:**
- That **"1"** exists to give the bias term something to multiply by in matrix notation
- **Matrix form** enables efficient batch processing
- **Vectorized operations** are essential for performance in modern ML

Understanding this foundation will serve you well as you dive deeper into linear models, neural networks, and beyond.