Gradient descent is an iterative optimization algorithm that moves parameters in the direction of the negative gradient of a loss function to find (typically local) minima, and in ML it is used to learn model weights that minimize a chosen cost. Batch, stochastic, and mini-batch gradient descent differ mainly in how much data they use per parameter update, which affects speed, noise, and memory usage.

### Core idea of gradient descent

- Gradient descent updates parameters $\theta$ by $\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$, where $\eta$ is the learning rate and $J$ is the cost function.
- The negative gradient is the direction of steepest local decrease of the cost, so repeated updates with a suitably small $\eta$ move $\theta$ toward a (typically local) minimum of $J$.
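
To make the update rule concrete, here is a minimal sketch on a one-dimensional quadratic. The function name `gradient_descent`, the example cost $J(\theta) = (\theta - 3)^2$, and the values chosen for the learning rate and step count are illustrative assumptions, not part of any particular library.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Assumed example cost: J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=0.0)
```

For this cost, each step shrinks the distance to the minimizer $\theta = 3$ by a factor of $1 - 2\eta = 0.8$, which is why a modest number of steps is enough here.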

### Batch (vanilla) gradient descent

- Batch gradient descent computes the gradient over the entire training dataset before each update, so one epoch (a full pass over the data) produces exactly one parameter update.
- This yields a stable, low-variance gradient estimate but can be slow and memory‑intensive for large datasets because each update waits for all examples to be processed.
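
As a rough illustration of the full-batch update, the sketch below fits a linear model with a mean-squared-error cost. The synthetic data, the function name `batch_gradient_descent`, and the hyperparameter values are assumptions made for the example.

```python
import numpy as np


def batch_gradient_descent(X, y, lr=0.1, epochs=500):
    """Linear regression with MSE cost; one update per full pass over the data."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        residual = X @ w - y                 # uses every training example
        grad = (2.0 / n) * (X.T @ residual)  # exact gradient of the MSE
        w = w - lr * grad                    # a single update per epoch
    return w


# Assumed synthetic, noise-free data for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w
w_batch = batch_gradient_descent(X, y)
```

Note that the whole of `X` takes part in every gradient computation, which is the source of the memory and latency cost on large datasets.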

### Stochastic gradient descent (SGD)

- SGD updates parameters after each individual training example, computing the gradient using only that sample.
- Each update is cheap, so the parameters change far more often per epoch than in batch gradient descent. The gradient estimate is noisy, however, so the loss fluctuates rather than decreasing smoothly; this noise can sometimes help the parameters escape shallow local minima.
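
A per-example update loop might look like the following sketch, again for a linear model with squared error. The function name `sgd`, the per-epoch shuffling, the synthetic data, and the hyperparameters are assumptions for illustration.

```python
import numpy as np


def sgd(X, y, lr=0.01, epochs=20, seed=0):
    """Linear regression with squared error; one update per training example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):         # visit examples in random order
            residual = X[i] @ w - y[i]       # error on a single example
            w = w - lr * 2.0 * residual * X[i]
    return w


# Assumed synthetic data with a little label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=200)
w_sgd = sgd(X, y)
```

With a constant learning rate and noisy labels, the weights end up hovering near the least-squares solution rather than settling exactly on it, which is the fluctuation described above.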

### Mini-batch gradient descent

- Mini-batch gradient descent splits the training data into small batches (e.g., 32, 64, 128 examples) and updates parameters for each batch.
- It balances efficiency and noise: gradients are less noisy than in pure SGD, while updates are more frequent and need less memory than full-batch updates, which is why it is the standard choice for training deep learning models in practice.
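
The sketch below shows one way to write the mini-batch loop for the same kind of linear model. The function name `minibatch_gd`, the default batch size of 32, the synthetic data, and the other hyperparameters are assumptions for the example.

```python
import numpy as np


def minibatch_gd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Linear regression with MSE cost; one update per mini-batch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                    # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]     # last batch may be smaller
            residual = X[idx] @ w - y[idx]
            grad = (2.0 / len(idx)) * (X[idx].T @ residual)
            w = w - lr * grad
    return w


# Assumed synthetic data with a little label noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w + 0.1 * rng.normal(size=256)
w_mb = minibatch_gd(X, y)
```

In this formulation, `batch_size=1` corresponds to per-example SGD and `batch_size=len(X)` to full-batch gradient descent, although the learning rate would usually need retuning for each setting.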

### Practical implications and usage

- Mini-batch methods exploit hardware like GPUs efficiently through vectorized operations over batches, significantly speeding up training.
- In practice, gradient descent is often combined with enhancements such as momentum, learning-rate schedules, and adaptive methods (Adam, RMSProp) to improve convergence behavior and robustness.
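
As one example of such an enhancement, here is a minimal sketch of classical (heavy-ball) momentum, where a velocity term accumulates past gradients. The function name `momentum_descent`, the ill-conditioned quadratic used as the cost, and the values of `lr` and `beta` are assumptions; frameworks differ slightly in how they parameterize the same idea.

```python
import numpy as np


def momentum_descent(grad, theta0, lr=0.05, beta=0.9, steps=300):
    """Gradient descent with classical momentum (one common formulation)."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        # Decay the previous velocity, then add the current negative gradient step.
        velocity = beta * velocity - lr * grad(theta)
        theta = theta + velocity
    return theta


# Assumed cost: J(theta) = 0.5 * sum(scales * theta**2), with gradient scales * theta.
scales = np.array([1.0, 10.0])
theta_min = momentum_descent(lambda t: scales * t, theta0=[5.0, 5.0])
```

The two coordinates have very different curvature; the velocity term lets progress build up along the shallow direction without requiring a larger learning rate for the steep one.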