Stochastic gradient descent (SGD) makes gradient descent practical and fast on large datasets: each step approximates the true gradient from only *part* of the data instead of using the full dataset every time.

### Why “plain” gradient descent is too slow

In ordinary (full-batch) gradient descent for linear regression:

- The loss (e.g., mean squared error) depends on **all** training examples.  
- The gradient at each step is computed by:
  - Making predictions for every row in the dataset.  
  - Comparing them with the true labels.  
  - Averaging the contribution of every point.

This is fine for 244 rows (tips dataset), but:

- Real problems (e.g., image classification) may have millions or billions of examples.  
- Computing the full gradient at every step becomes extremely expensive in time and memory.

So we want a cheaper way to get a *good enough* direction to move in, without scanning the whole dataset every time.

### Key idea: approximate the gradient using a subset of data

Instead of computing the gradient using **all** data points, SGD uses only a **subset**:

- Take a small set of indices (a *batch* or *mini‑batch*).  
- Pretend, for this step, that only those examples exist.  
- Compute the gradient of the loss using only those examples.  
- Use that approximate gradient to update θ.

Because you use fewer examples, each update is much cheaper, and you can update parameters more frequently.

Intuitively:

- Full gradient = “exact” slope, expensive.  
- Batch or mini-batch gradient = “noisy” slope, cheap.

That “noise” is a feature, not just a bug: it can help the algorithm escape shallow areas or local irregularities.

## From full-batch gradient to batch-only gradient

Recall the full-batch gradient for linear regression with two parameters (θ₀, θ₁):

- Model: $\hat{y}_i = \theta_0 x_{0,i} + \theta_1 x_{1,i}$.  
- Residual: $e_i = y_i - \hat{y}_i$.  
- Full-dataset gradient of the MSE loss $L = \frac{1}{n} \sum_i e_i^2$ (mean over all $i$):

  - $\frac{\partial L}{\partial \theta_0} = -\frac{2}{n} \sum_i e_i x_{0,i}$  
  - $\frac{\partial L}{\partial \theta_1} = -\frac{2}{n} \sum_i e_i x_{1,i}$

In code for the full dataset, you conceptually do:

- `x0 = X[:, 0]`  
- `x1 = X[:, 1]`  
- `errors = y_obs - (theta0 * x0 + theta1 * x1)`  
- compute mean over **all** rows.
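
As a minimal sketch of the full-dataset computation above, the following uses plain NumPy arrays and a small synthetic dataset. The data, the seed, and the function name `mse_gradient_full` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def mse_gradient_full(theta, X, y_obs):
    """Gradient of the MSE over all rows of X (a NumPy array with two columns)."""
    x0 = X[:, 0]
    x1 = X[:, 1]
    errors = y_obs - (theta[0] * x0 + theta[1] * x1)
    # Mean over every row: -(2/n) * sum_i e_i * x_{j,i}
    return np.array([np.mean(-2 * errors * x0), np.mean(-2 * errors * x1)])

# Synthetic stand-in data: a bias column plus one feature (assumed for illustration)
rng = np.random.default_rng(0)
X_demo = np.column_stack([np.ones(100), rng.uniform(5, 50, size=100)])
y_demo = 1.0 + 0.1 * X_demo[:, 1] + rng.normal(0, 0.5, size=100)

grad_demo = mse_gradient_full(np.array([0.0, 0.0]), X_demo, y_demo)
print(grad_demo)
```

Every call touches all 100 rows; with millions of rows, that cost is paid again at every single step.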

To get a **batch-only** gradient, you:

1. Accept a list/array of row indices, e.g. `[5]` or `[0,1,2,3]`.  
2. Use only those rows of X and y.  
3. Compute the same formula, but averaged over this subset only.
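
Written out, with $B$ denoting the chosen set of row indices and $|B|$ its size (notation introduced here for illustration), step 3 replaces the mean over all $n$ rows with a mean over the batch:

$$
\frac{\partial L_B}{\partial \theta_j} = -\frac{2}{|B|} \sum_{i \in B} e_i \, x_{j,i}, \qquad j \in \{0, 1\}
$$

When $B$ contains every row, $|B| = n$ and this reduces to the full-dataset gradient.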

So instead of:

- `x0 = X[:, 0]` (all rows)

you do:

- `x0 = X[indices, 0]` (selected rows only)  
- `x1 = X[indices, 1]`  
- `y_batch = y_obs[indices]`  

Then compute the residuals and mean exactly as before, but just on that slice.

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sympy as sy
from mpl_toolkits import mplot3d
from sklearn.linear_model import LinearRegression
import seaborn as sns

# Loading tips dataset
tips = sns.load_dataset("tips")

# Add bias column
tips["bias"] = 1.0

# Feature matrix X and target y
X = tips[["bias", "total_bill"]]  
y = tips["tip"]            

In [5]:
def mse_gradient_batch_only(theta, batch_indices, X, y_obs):
    """Returns the gradient of the MSE on only the given data (via batch_indices) 
       for the given theta"""    
    # Select rows by position (.iloc) so this also works when the index is not 0..n-1
    x0 = X.iloc[batch_indices, 0]
    x1 = X.iloc[batch_indices, 1]
    # Residuals e_i = y_i - y_hat_i, computed once and reused for both partial derivatives
    residuals = y_obs.iloc[batch_indices] - theta[0] * x0 - theta[1] * x1
    dth0 = np.mean(-2 * residuals * x0)
    dth1 = np.mean(-2 * residuals * x1)
    return np.array([dth0, dth1])

In [7]:
# If batch_indices covers all rows (0 to len(X)-1), you recover the full gradient
mse_gradient_batch_only(np.array([0, 0]), np.arange(0, len(X)), X, y)

array([  -5.99655738, -135.22631803])

In [8]:
# If batch_indices is [5], you get the gradient contributed by row 5 only
mse_gradient_batch_only(np.array([0, 0]),[5], X, y)

array([  -9.42  , -238.2318])

In [9]:
# If batch_indices is [0, 1, 2, 3], you get the average gradient over just those 4 examples
mse_gradient_batch_only(np.array([0, 0]), np.arange(0, 4), X, y)

array([ -4.74   , -93.12005])

This is the building block for SGD: a gradient function that can work on arbitrary subsets of the data.
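
One way to see why such subset gradients are reasonable stand-ins for the full gradient: when the rows are partitioned into equal-sized batches, the average of the batch gradients equals the full gradient exactly, even though each individual batch gradient differs from it. The sketch below checks this on synthetic data with plain NumPy arrays. The data and the name `mse_gradient_subset` are assumptions made for illustration.

```python
import numpy as np

def mse_gradient_subset(theta, batch_indices, X, y_obs):
    """MSE gradient using only the rows listed in batch_indices (NumPy arrays)."""
    Xb = X[batch_indices]
    residuals = y_obs[batch_indices] - Xb @ theta
    # Column-wise mean of -2 * e_i * x_{j,i} over the batch
    return -2 * (residuals[:, None] * Xb).mean(axis=0)

# Synthetic stand-in data (assumed for illustration)
rng = np.random.default_rng(1)
n = 12
X_demo = np.column_stack([np.ones(n), rng.uniform(5, 50, size=n)])
y_demo = 1.0 + 0.1 * X_demo[:, 1] + rng.normal(0, 0.5, size=n)
theta = np.array([0.0, 0.0])

full_grad = mse_gradient_subset(theta, np.arange(n), X_demo, y_demo)
batches = np.split(rng.permutation(n), 3)  # 3 batches of 4 rows each
batch_grads = np.array([mse_gradient_subset(theta, b, X_demo, y_demo) for b in batches])

print(batch_grads)               # individual estimates differ from batch to batch
print(batch_grads.mean(axis=0))  # their average matches full_grad
print(full_grad)
```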

In [10]:
def mse_gradient_batch_only_two_arg(theta, batch_indices):
    """Returns the gradient of the MSE on only the given data (via batch_indices) 
       for the given theta"""    
    X = tips[["bias", "total_bill"]]
    y_obs = tips["tip"]
    return mse_gradient_batch_only(theta, batch_indices, X, y_obs)

In [11]:
mse_gradient_batch_only_two_arg(np.array([0, 0]), [5, 6, 7, 8, 15, 32])

array([  -6.23666667, -126.59116667])

In [12]:
np.split(np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]), 3)

[array([1, 2, 3]), array([4, 5, 6]), array([7, 8, 9])]

To split the indices randomly into equal-sized batches, combine a random permutation with `np.split`:

1. Start with an array of indices (e.g., for 12 points: `[0,1,2,3,4,5,6,7,8,9,10,11]`).  
2. Randomly permute it to shuffle the order.  
3. Split this shuffled list into a fixed number of equal parts (batches).

Example with 12 data points and 3 batches:

- Original indices: `[0,1,2,3,4,5,6,7,8,9,10,11]`  
- After random permutation: `[11, 8, 10, 1, 6, 0, 5, 7, 9, 2, 4, 3]`  
- After splitting into 3 parts:  
  - Batch 1: `[11, 8, 10, 1]`  
  - Batch 2: `[6, 0, 5, 7]`  
  - Batch 3: `[9, 2, 4, 3]`

In [15]:
np.split(np.random.permutation(np.arange(12)), 4)

[array([2, 1, 5]), array([8, 0, 3]), array([10,  9,  6]), array([11,  7,  4])]
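
Note that `np.split` with an integer number of sections raises a `ValueError` when the array cannot be divided evenly. One possible way to handle that case, sketched below, is `np.array_split`, which allows batch sizes to differ by at most one. The helper name `make_batches` and the seeded generator are assumptions for illustration, not part of the original notebook.

```python
import numpy as np

def make_batches(num_points, num_batches, rng):
    """Shuffle the indices 0..num_points-1 and split them into num_batches parts."""
    shuffled = rng.permutation(num_points)
    # array_split tolerates uneven division; np.split would raise ValueError here
    return np.array_split(shuffled, num_batches)

rng = np.random.default_rng(42)
batches = make_batches(10, 4, rng)
print([len(b) for b in batches])  # [3, 3, 2, 2]
```

With unequal batches, the average of the batch gradients no longer matches the full gradient exactly, although the difference is small when batch sizes differ by one.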

### Stochastic Gradient Descent algorithm (procedural view)

Now the video ties everything together into a stochastic gradient descent loop. The high-level idea:

1. Choose:
   - Initial parameters θ (e.g., `[0, 0]`).  
   - Learning rate α (e.g., 0.001).  
   - Number of overall steps or epochs.  
   - Number of batches per epoch (e.g., 4).

2. For each outer iteration:
   - Generate a **random permutation** of all data indices.  
   - Split that permutation into `num_batches` equal parts (mini-batches).

3. For each batch in this split:
   - Compute the gradient using only this batch (using `mse_gradient_batch_only`).  
   - Update θ using that batch gradient.  
   - Record the new θ and current loss (using, for example, the full-data loss to monitor progress).

Because every epoch re‑shuffles the order, each example ends up in different batches over time, and the individual gradient estimates vary.

In [16]:
def mse_loss(theta, X, y_obs):
    y_hat = theta[0] * X.iloc[:, 0] + theta[1] * X.iloc[:, 1]
    return np.mean((y_hat - y_obs) ** 2)    

def mse_loss_single_arg(theta):  
    X = tips[["bias", "total_bill"]]
    y_obs = tips["tip"]
    return mse_loss(theta, X, y_obs)

In [17]:
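def stochastic_gradient_descent(df, initial_guess, alpha, n, num_dps, number_of_batches):
    """Runs mini-batch SGD until at least n guesses have been recorded.

    df(theta, batch_indices) returns the gradient on one batch; num_dps is the
    number of data points and must be divisible by number_of_batches."""
    guesses = [initial_guess]
    guess = initial_guess
    losses = [mse_loss_single_arg(guess)]
    while len(guesses) < n:
        # New pass: reshuffle all indices, then split them into equal batches
        dp_indices = np.random.permutation(np.arange(num_dps))
        for batch_indices in np.split(dp_indices, number_of_batches):            
            # One update per batch, using the gradient of that batch only
            guess = guess - alpha * df(guess, batch_indices)
            guesses.append(guess)
            losses.append(mse_loss_single_arg(guess))
    return np.array(guesses), np.array(losses)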

In [19]:
guesses, losses = stochastic_gradient_descent(mse_gradient_batch_only_two_arg, np.array([0, 0]), 0.001, 10000, len(tips), 4)

In [20]:
guesses[:-1]

array([[0.        , 0.        ],
       [0.00592197, 0.13048851],
       [0.00670676, 0.14123385],
       ...,
       [0.88962466, 0.10052997],
       [0.88978149, 0.10763332],
       [0.89001254, 0.10959086]])

In the tips dataset:

- There are 244 data points.  
- If we pick “4 batches”, each batch has 244 / 4 = 61 points.  
- So each pass (or “epoch”) consists of 4 gradient updates, one per batch of 61 examples.
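
The same arithmetic determines how long the loop in `stochastic_gradient_descent` runs. The `while` condition is only checked between passes, so the function always finishes the pass it is in and can record slightly more than `n` guesses. A small sketch of that bookkeeping follows; the helper `sgd_run_length` is hypothetical, written only to mirror that loop.

```python
import math

def sgd_run_length(n, number_of_batches):
    """Return (passes, recorded_guesses) for the loop `while len(guesses) < n`.

    guesses starts with 1 entry and gains number_of_batches entries per pass.
    """
    passes = max(0, math.ceil((n - 1) / number_of_batches))
    return passes, 1 + passes * number_of_batches

print(244 // 4)                  # 61 examples per batch
print(sgd_run_length(10000, 4))  # (2500, 10001)
```

So the call with `n = 10000` and 4 batches makes 2500 passes over the data and returns 10001 guesses.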

### Behaviour of SGD in the tips example

When the instructor runs this on the tips dataset:

- Start: θ₀ = 0, θ₁ = 0, with a high loss.  
- As the algorithm runs:
  - θ₀ and θ₁ change with each batch.  
  - Loss generally decreases over time but with fluctuations.

The **patterns** they emphasize:

1. **Compared to full-batch gradient descent:**
   - Full-batch: one update per pass, smooth monotonic loss decrease (for convex problems and good α).  
   - SGD: many small updates per pass, loss curve is jagged/noisy because each step uses only a subset and may “miss” some difficult points (e.g., outliers).

2. Despite the noise:
   - Over many steps, θ₀ and θ₁ converge close to the “true” optimum found by scikit-learn or by full-batch gradient descent (e.g., ≈ 0.92 offset and ≈ 10–10.5% tip).  
   - The final values in the experiment are roughly 0.89 (offset) and 0.10–0.11 (tip rate, which still fluctuates from step to step), very near the earlier results.

3. Each batch gradient is an **approximation** of the true gradient:
   - Some batches may not contain rare outliers.  
   - This can make certain steps look “too optimistic” or “too pessimistic” about the global loss.  
   - Over many batches and epochs, the estimates average out.
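
The contrast between the two loss curves can be reproduced in a self-contained way. The sketch below is not the notebook's experiment: it assumes synthetic data loosely shaped like the tips data, plain NumPy arrays, and a smaller learning rate chosen to keep the full-batch steps stable. It counts how often the full-data loss goes up from one update to the next.

```python
import numpy as np

# Synthetic stand-in data (assumed for illustration)
rng = np.random.default_rng(0)
n = 240
X = np.column_stack([np.ones(n), rng.uniform(3, 50, size=n)])
y = 0.9 + 0.105 * X[:, 1] + rng.normal(0, 1.0, size=n)

def loss(theta):
    """Full-data MSE, used only to monitor progress."""
    return np.mean((y - X @ theta) ** 2)

def grad(theta, idx):
    """MSE gradient computed on the rows in idx only."""
    r = y[idx] - X[idx] @ theta
    return -2 * (r[:, None] * X[idx]).mean(axis=0)

alpha, passes, num_batches = 0.0005, 200, 4

# Full-batch gradient descent: one update per pass
theta_gd = np.zeros(2)
gd_losses = [loss(theta_gd)]
for _ in range(passes):
    theta_gd = theta_gd - alpha * grad(theta_gd, np.arange(n))
    gd_losses.append(loss(theta_gd))

# Mini-batch SGD: num_batches updates per pass, reshuffled every pass
theta_sgd = np.zeros(2)
sgd_losses = [loss(theta_sgd)]
for _ in range(passes):
    for idx in np.split(rng.permutation(n), num_batches):
        theta_sgd = theta_sgd - alpha * grad(theta_sgd, idx)
        sgd_losses.append(loss(theta_sgd))

# Number of updates after which the monitored loss increased
gd_increases = int(np.sum(np.diff(gd_losses) > 0))
sgd_increases = int(np.sum(np.diff(sgd_losses) > 0))
print(gd_increases, sgd_increases)
print(gd_losses[-1], sgd_losses[-1])
```

Under these assumptions the full-batch count should be zero, while the mini-batch count is typically large, even though both runs end with a far lower loss than they started with.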

## Why SGD is useful (and tricky)

Key advantages (as hinted in the video):

- **Efficiency:**  
  - Each gradient computation uses only a small subset.  
  - On massive datasets (e.g., millions of images), this makes training feasible.  
- **Frequent updates:**  
  - Parameters are updated many times per pass through the data, which can speed up practical convergence.  
- **Escaping shallow regions:**  
  - The noise in gradient estimates can help avoid getting “stuck” in flattish or awkward regions.

But it is also:

- **Noisier:**  
  - Loss does not decrease smoothly; it bounces around.  
- **Sensitive to hyperparameters:**  
  - Learning rate α and batch size matter a lot.  
  - Poor choices can cause divergence, very slow progress, or oscillations.  
- **Harder to reason about:**  
  - The path of θ across the loss surface isn’t as nicely behaved as full-batch gradient descent.
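
The learning-rate sensitivity can be made concrete for the MSE loss. The sketch below uses full-batch steps on synthetic data, both assumptions chosen so that the stability threshold is exact; mini-batch steps add noise on top of this picture. For a quadratic loss with Hessian $\frac{2}{n} X^\top X$, gradient descent is stable only when α is below 2 divided by the largest eigenvalue of the Hessian.

```python
import numpy as np

# Synthetic stand-in data (assumed for illustration)
rng = np.random.default_rng(0)
n = 240
X = np.column_stack([np.ones(n), rng.uniform(3, 50, size=n)])
y = 0.9 + 0.105 * X[:, 1] + rng.normal(0, 1.0, size=n)

def loss(theta):
    return np.mean((y - X @ theta) ** 2)

def full_gradient(theta):
    return -2 * X.T @ (y - X @ theta) / n

def run_gd(alpha, steps=50):
    """Run full-batch gradient descent from theta = 0 and return the final loss."""
    theta = np.zeros(2)
    for _ in range(steps):
        theta = theta - alpha * full_gradient(theta)
    return loss(theta)

# Hessian of the MSE is (2/n) X^T X; its largest eigenvalue sets the step-size limit
lambda_max = np.linalg.eigvalsh(2 * X.T @ X / n).max()
alpha_limit = 2 / lambda_max

start_loss = loss(np.zeros(2))
print(alpha_limit)
print(run_gd(0.5 * alpha_limit) < start_loss)  # True: loss went down
print(run_gd(1.5 * alpha_limit) > start_loss)  # True: loss blew up
```

Because the feature values are large compared with the bias column, the limit is small, which is consistent with the notebook's choice of a learning rate as small as 0.001.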

## Big-picture intuition

Putting all the small details together:

- **Full-batch gradient descent**:  
  - Uses the exact gradient from all data.  
  - Fewer, more expensive, smoother updates.

- **Stochastic / mini-batch gradient descent** (as coded in the video):  
  - Uses approximate gradients from *random subsets*.  
  - More, cheaper updates; noisy but efficient.  
  - Over time, still steers θ towards the minimum of the loss.

Even though each individual step is based on an incomplete (and noisy) picture of the loss surface, the repeated use of many batches and repeated passes over randomly shuffled data makes the overall process converge in practice.