# Optimizer #2: Stochastic Gradient Descent (SGD)

## Intuition

A few problems arise from vanilla gradient descent and its applications:
1. **Slow with large datasets**: For each single training step, gradient descent uses **ALL** of the training data.
    - For example, suppose a more complicated model, such as a logistic regression model, uses 10,000 parameters and we have data from 1,000,000 samples. Each step then requires roughly 10,000 × 1,000,000 = 10,000,000,000 terms. Doing this over, say, 1,000 steps would require about 10,000,000,000,000 terms!
2. **Gets stuck in flat regions or local minima**: Because it always steps along the exact negative gradient of the full loss, its path is fully deterministic, so it may settle into shallow local minima or stall near saddle points and flat plateaus where the gradient is close to zero.
3. **Memory inefficiency**: Every step needs the gradient over the full dataset, so the data must either be held in memory or streamed through in its entirety each time, which does not scale well.
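
The cost estimate in the first point can be checked with a few lines of arithmetic. This is only a rough sketch: it counts one term per parameter per sample, uses the sizes from the example above, and the variable names are made up for illustration.

```python
# Sizes taken from the example above; the names are illustrative only.
n_params = 10_000
n_samples = 1_000_000
n_steps = 1_000

# One full-batch gradient touches every parameter for every sample.
terms_per_step = n_params * n_samples

# Repeating that for every training step.
total_terms = terms_per_step * n_steps

print(f"terms per step: {terms_per_step:,}")
print(f"total terms:    {total_terms:,}")
```

The count grows linearly in the dataset size, which is why the per-step cost becomes the bottleneck on large datasets.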

Unlike standard gradient descent, which computes the gradient using the entire dataset, **Stochastic Gradient Descent** updates the model using only a single data point (or a small batch) at a time. This addresses the problems listed above, because it provides:
- **Faster updates**: Each update needs only a single point or a small batch to compute the gradient, so individual steps are much cheaper and the model is updated far more often, especially on larger datasets.
- **Better exploration**: The noise inherent in single-sample and mini-batch gradients adds randomness to the updates, which can help the optimizer escape shallow local minima and saddle points.
- **Memory efficiency**: Only a small batch is needed at a time, which lowers the overall memory footprint.
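
To make the "noise" in the second point concrete, here is a minimal sketch that compares mini-batch gradients with the full gradient. Everything in it is an assumption chosen for illustration: the synthetic 1-D data, the model $y \approx w x$ with a mean squared error loss, the batch size of 10, and the helper name `grad`.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic 1-D data: y is roughly 3 * x plus a little noise (assumed setup).
N = 1000
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]

def grad(w, indices):
    """Gradient w.r.t. w of the mean squared error of y ~ w * x over `indices`."""
    return sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in indices) / len(indices)

w = 0.0

# Full-batch gradient: uses every sample.
full_grad = grad(w, range(N))

# Many mini-batch gradients, each from 10 randomly chosen samples.
batch_size = 10
batch_grads = [grad(w, random.sample(range(N), batch_size)) for _ in range(2000)]
mean_batch_grad = sum(batch_grads) / len(batch_grads)

print(f"full gradient:            {full_grad:.3f}")
print(f"mean mini-batch gradient: {mean_batch_grad:.3f}")
print(f"mini-batch range:         {min(batch_grads):.3f} to {max(batch_grads):.3f}")
```

Individual mini-batch gradients scatter around the full gradient, but their average stays close to it. Each one is a cheap, noisy estimate of the same direction, and that scatter is the randomness referred to above.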

Because of this key change, each update uses a gradient computed from a single sample or a small batch rather than from the full dataset.

## Update Rule

The core idea of SGD is to update your model’s parameters $\theta$ using the gradient of the loss function computed on just one sample (or a small batch) at a time.

General form, where $\eta$ is the learning rate and $(x_i, y_i)$ is a single training sample chosen at random:
$$
\theta := \theta - \eta \cdot \nabla_\theta \mathcal{L}(\theta; x_i, y_i)
$$
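
As a rough illustration of this rule, the sketch below applies it one sample at a time to a tiny linear model. It is not a reference implementation. The model $y \approx w x + b$, the squared-error loss, the synthetic data, the values `eta=0.05` and `epochs=20`, and the function names `sample_gradient` and `sgd` are all assumptions made for the example.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic data drawn from y = 2x + 1 plus a little noise (assumed setup).
data = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + 1.0 + random.gauss(0.0, 0.1)))

def sample_gradient(theta, x, y):
    """Gradient of the squared error (w*x + b - y)^2 w.r.t. theta = (w, b)."""
    w, b = theta
    error = (w * x + b) - y
    return (2.0 * error * x, 2.0 * error)

def sgd(data, eta=0.05, epochs=20):
    """Plain SGD: one parameter update per training sample."""
    samples = list(data)        # copy so the caller's list is not reordered
    theta = (0.0, 0.0)          # initial parameters (w, b)
    for _ in range(epochs):
        random.shuffle(samples)  # visit the samples in a new random order
        for x, y in samples:
            g = sample_gradient(theta, x, y)
            # theta := theta - eta * gradient, using only this one sample
            theta = (theta[0] - eta * g[0], theta[1] - eta * g[1])
    return theta

w, b = sgd(data)
print(f"w = {w:.3f}, b = {b:.3f}")
```

With these settings the learned `w` and `b` should end up near the values used to generate the data (2 and 1), though not exactly on them: with a constant learning rate, the noisy per-sample updates keep the parameters jittering around the minimum.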
