Gradient descent, stochastic gradient descent, and mini‑batch gradient descent are all ways to move parameters downhill on a loss surface. They differ in how much data each update uses, how accurate each update is, and how fast each update runs.

### Core difference: how much data each step uses

- **Gradient Descent (GD, “batch GD”)**  
  - Each update uses the **entire dataset** to compute the gradient.  
  - The direction is the *true* steepest descent direction for that loss and current parameters.  
  - Updates are accurate but expensive when the dataset is huge.

- **Stochastic Gradient Descent (SGD, batch size = 1)**  
  - Each update uses **just one training example** at a time.  
  - The gradient is a very noisy approximation of the true gradient.  
  - Each update is cheap, and you can do many updates per pass over the data.

- **Mini‑batch Gradient Descent (batch size between 1 and the full dataset)**  
  - Each update uses a **small subset** of the data, e.g. 32 examples.  
  - Gradient is less noisy than pure SGD, much cheaper than full GD.  
  - This is the most common choice in modern large‑scale training.

You can think of **batch size** as a knob:

- Batch size = dataset size → classic gradient descent.  
- Batch size = 1 → stochastic gradient descent.  
- Batch size in between → mini‑batch gradient descent.
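
As a rough illustration of the knob, here is a minimal sketch in plain Python. It assumes a toy linear model `y ≈ w*x + b` trained with mean squared error. The data, the learning rate, and the names `gradient` and `train` are made up for this example and do not come from any library. The only thing that changes between the three variants is the `batch_size` argument.

```python
import random

def gradient(w, b, batch):
    """Gradient of mean squared error for y ~ w*x + b over the given examples."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def train(data, batch_size, epochs=200, lr=0.1, seed=0):
    """Run gradient descent; batch_size alone selects GD, SGD, or mini-batch."""
    rng = random.Random(seed)
    data = list(data)  # copy so shuffling does not affect the caller
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One pass over the data, one parameter update per batch.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad_w, grad_b = gradient(w, b, batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data (an assumption for illustration): y = 3x + 1 plus a little noise.
rng = random.Random(42)
data = []
for _ in range(64):
    x = rng.uniform(-1, 1)
    data.append((x, 3 * x + 1 + rng.gauss(0, 0.1)))

w_full, b_full = train(data, batch_size=len(data))  # classic gradient descent
w_sgd, b_sgd = train(data, batch_size=1)            # stochastic gradient descent
w_mini, b_mini = train(data, batch_size=8)          # mini-batch gradient descent

print("full batch:", w_full, b_full)
print("SGD:       ", w_sgd, b_sgd)
print("mini-batch:", w_mini, b_mini)
```

All three runs should land near `w = 3`, `b = 1`. They differ in how many updates they make per pass over the data (1, 64, and 8 respectively) and in how noisy each update is.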

### Quality vs speed: the trade‑off

The video emphasizes that batch size controls a trade‑off:

- **Batch size = full dataset**  
  - Gradient quality: best (exact).  
  - Runtime per update: worst (must scan all points).  
  - Convergence path: smoothest, “true” steepest path.

- **Batch size = 1 (pure SGD)**  
  - Gradient quality: worst (maximally noisy).  
  - Runtime per update: best (one sample).  
  - Convergence path: extremely jagged; steps jump around.

- **Intermediate batch size (e.g., 32)**  
  - Gradient quality: decent; noise is reduced by averaging over several samples.  
  - Runtime per update: manageable, especially on GPUs and CPUs, which benefit from vectorization.  
  - Convergence path: noisy but not chaotic; a good practical balance.

A batch size of **32** is a common default in practice: it is large enough to average out much of the per‑example noise and keep vectorized hardware busy, yet small enough that each update stays cheap.
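
To make the "gradient quality" side of the trade‑off concrete, the sketch below measures how much gradient estimates scatter at one fixed parameter value for batch sizes 1, 32, and the full dataset. It assumes a one‑parameter toy model `y ≈ w*x` with synthetic data; the helper `grad_w` and all constants are hypothetical choices for illustration.

```python
import random
import statistics

def grad_w(w, batch):
    """Gradient of mean squared error for the one-parameter model y ~ w*x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# Synthetic dataset (an assumption for illustration): y = 2x plus noise.
rng = random.Random(0)
data = []
for _ in range(1024):
    x = rng.uniform(-1, 1)
    data.append((x, 2 * x + rng.gauss(0, 0.5)))

w = 0.0                 # fixed point at which all gradients are evaluated
full = grad_w(w, data)  # exact gradient over the whole dataset

spread = {}
for batch_size in (1, 32, len(data)):
    # Draw many random batches (without replacement) and record each estimate.
    estimates = [grad_w(w, rng.sample(data, batch_size)) for _ in range(500)]
    spread[batch_size] = statistics.pstdev(estimates)
    print(f"batch_size={batch_size:5d}  std of gradient estimate={spread[batch_size]:.4f}")
```

The spread should be largest for batch size 1, clearly smaller for 32, and essentially zero for the full dataset, since every full "batch" contains the same examples.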

### Intuition: “seeing the whole board” vs “seeing a few squares”

- **Batch gradient descent** is like playing chess while seeing the **entire board**.  
  - Every move is carefully chosen using the full situation.  
  - But each move is slow and computationally expensive.

- **Stochastic gradient descent** is like trying to win chess while seeing only **a few squares** at a time.  
  - Every move is based on incomplete and noisy information.  
  - Individual moves might look suboptimal if you knew the full board.  
  - Yet, if you repeat this process many times and eventually see all parts of the board, you can still reach a strong position.

### Why SGD works (at a high level)

- When you loop over the dataset many times (multiple epochs), each data point is used repeatedly in different steps.  
- On **average**, these noisy updates point the same way as the full gradient: the gradient of a randomly chosen example (or mini‑batch) equals the full gradient in expectation.  
- For convex problems, such as linear regression trained with mean squared error, repeating this process can bring you **very close** to the same minimum as full gradient descent.

So: each single step is “blind” and noisy, but across many passes, the noise partially cancels and the overall movement trends toward the true downhill direction.
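
The averaging argument can be checked numerically. The following sketch assumes a one‑parameter model `y ≈ w*x` with squared error on synthetic data (the name `example_grad` and the constants are illustrative, not from any library). It compares the full gradient with the mean of many randomly drawn single‑example gradients at a fixed `w`.

```python
import random

def example_grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 for a single example."""
    return 2 * (w * x - y) * x

# Synthetic dataset (an assumption for illustration): y = 2x plus noise.
rng = random.Random(1)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    data.append((x, 2 * x + rng.gauss(0, 0.3)))

w = 0.5
per_example = [example_grad(w, x, y) for x, y in data]

# The gradient of the mean squared error is the mean of the per-example gradients.
full_grad = sum(per_example) / len(per_example)

# Count single-example gradients whose sign disagrees with the full gradient.
opposite = sum(1 for g in per_example if g * full_grad < 0)

# Average many randomly drawn single-example gradients, as SGD effectively does.
draws = [rng.choice(per_example) for _ in range(20000)]
sampled_mean = sum(draws) / len(draws)

print("full gradient:               ", full_grad)
print("mean of sampled gradients:   ", sampled_mean)
print("examples pointing other way: ", opposite, "of", len(per_example))
```

Individual gradients vary a lot, and some have the opposite sign to the full gradient, yet their average sits close to it. That is the sense in which the noise cancels.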

### Visual intuition from 2D loss surfaces

**Gradient Descent (full batch):**

- Starting at the origin, it follows a **smooth, curving path** directly downhill toward the minimum.  
- Every step is aligned with the true steepest descent direction because it uses all data.  
- The path looks like a clean trajectory through a bowl.

**Stochastic / small-batch Gradient Descent:**

- Same surface, but each step uses only a subset of data, so each gradient estimate is a **crude approximation**.  
- The path:  
  - The first jump may not align with the true steepest direction, but it usually still reduces the loss.  
  - Subsequent jumps wander around in a **somewhat random zig‑zag**.  
  - Over time, the path winds toward the region of the minimum, but with lots of small detours.

For **convex** loss functions (such as mean squared error for a linear model), both methods eventually end up **very near** the true minimum, although SGD gets there along a noisy path and, with a fixed learning rate, keeps jittering around it.
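
The two paths can be compared without a plot. This sketch assumes a two‑parameter linear model `y ≈ w*x + b` on synthetic data, starts both methods at the origin, and tracks the full‑dataset loss after every step. The function names, step count, and learning rate are hypothetical choices, and the minimum is computed with the usual closed‑form formulas for simple linear regression.

```python
import random

def loss(params, data):
    """Mean squared error of y ~ w*x + b over the whole dataset."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def grad(params, batch):
    """Gradient of the mean squared error over the given batch."""
    w, b = params
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb

def descend(data, batch_size, steps=200, lr=0.1, seed=0):
    """Start at the origin; return final params and the full loss after each step."""
    rng = random.Random(seed)
    params = (0.0, 0.0)
    losses = [loss(params, data)]
    for _ in range(steps):
        batch = data if batch_size == len(data) else rng.sample(data, batch_size)
        gw, gb = grad(params, batch)
        params = (params[0] - lr * gw, params[1] - lr * gb)
        losses.append(loss(params, data))
    return params, losses

def uphill_steps(losses):
    """Number of steps on which the full-dataset loss increased."""
    return sum(1 for a, b in zip(losses, losses[1:]) if b > a + 1e-12)

# Synthetic dataset (an assumption for illustration): y = 3x + 1 plus noise.
rng = random.Random(7)
data = []
for _ in range(100):
    x = rng.uniform(-1, 1)
    data.append((x, 3 * x + 1 + rng.gauss(0, 0.2)))

# Closed-form least-squares minimum for simple linear regression.
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
w_star = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x, _ in data)
b_star = my - w_star * mx

def distance_to_min(params):
    return ((params[0] - w_star) ** 2 + (params[1] - b_star) ** 2) ** 0.5

gd_params, gd_losses = descend(data, batch_size=len(data))
sgd_params, sgd_losses = descend(data, batch_size=1)

print("GD : uphill steps", uphill_steps(gd_losses), " distance", distance_to_min(gd_params))
print("SGD: uphill steps", uphill_steps(sgd_losses), " distance", distance_to_min(sgd_params))
```

In this setup the full‑batch loss should decrease on every step, while the single‑example run takes some uphill steps and still finishes in the neighborhood of the same minimum.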

### Practical usage in libraries

- In many classic scikit‑learn models, optimization uses the **full dataset** at every step, either through gradient‑based solvers or through exact solvers (for example, `LinearRegression` solves a least‑squares problem directly).  
  - Dataset sizes in those examples are small enough that this is practical.  
  - Convergence is smooth and robust, and you don’t have to worry about batch sizes.

- In **large‑scale environments** (deep learning, huge datasets):  
  - It is far more common to use **mini‑batch gradient descent / SGD**, because:  
    - Computing full gradients is too slow and memory hungry.  
    - Mini‑batches allow fast, frequent updates.  
    - GPUs and accelerators are well-suited to mini‑batch operations.

So, in real-world machine learning (especially deep learning), “gradient descent” almost always means **mini‑batch SGD variants**, not pure full‑batch GD.

### Practical takeaway for you

- **Conceptual understanding:**  
  - Full GD: accurate but expensive; smooth path; uses all data each step.  
  - SGD / mini‑batch: noisy but cheap; jagged path; uses subsets; scales to huge data.

- **Batch size as a knob:**  
  - Smaller batch → more noise, faster each step.  
  - Larger batch → less noise, slower each step.

- **In practice:**  
  - These algorithms are rarely implemented by hand outside of learning exercises.  
  - Libraries (scikit‑learn, PyTorch, TensorFlow, etc.) implement them under the hood.  
  - Still, understanding these ideas helps you interpret model behavior, choose hyperparameters, and debug training issues.