# Mini-Batch Gradient Descent — Detailed Theory, Intuition, and Implementation

---

## 1. Introduction
So far, we have covered:
1. **Batch Gradient Descent**: the most basic version
2. **Stochastic Gradient Descent (SGD)**

This part covers the third variant: **Mini-Batch Gradient Descent**.

This method is extremely important because **it is the most commonly used gradient descent variant in practice**.

---

## 2. Core Idea: What Actually Changes?

All gradient descent variants differ in **only one thing**:

> **How frequently parameters are updated**

The mathematics remains the same.

Only **how much data is used per update** changes, and that in turn fixes how many updates happen in one pass over the dataset.

---

## 3. Recap: Batch vs Stochastic

### Batch Gradient Descent

- Uses the **entire dataset** for each update
- Updates parameters **once per epoch**

If the dataset has $n$ rows:

$$
\text{Updates per epoch} = 1
$$

---

### Stochastic Gradient Descent (SGD)

- Uses **one data point**
- Updates parameters after **every row**

If the dataset has $n$ rows:

$$
\text{Updates per epoch} = n
$$

---

### Problems So Far

- **Batch GD** → stable, but each update is slow and memory-heavy
- **SGD** → cheap and frequent updates, but noisy and unstable near the minimum

So we need something **in between**.

---

## 4. What is Mini-Batch Gradient Descent?

Mini-Batch Gradient Descent is a **hybrid approach**.

Instead of:
- Using all rows (Batch)
- Using one row (SGD)

We use a **small group of rows**, called a **mini-batch**.

---

## 5. Mini-Batch Concept

Let:
- Total rows = $n$
- Batch size = $b$

Then the number of batches per epoch (when $b$ divides $n$ evenly) is:

$$
\text{Number of batches} = \frac{n}{b}
$$

Each batch produces **one parameter update**.

---

## 6. Example to Understand Updates

Assume:
- $n = 100$ rows

### Case 1: Batch GD

$$
b = 100
$$

$$
\text{Updates per epoch} = 1
$$

---

### Case 2: SGD

$$
b = 1
$$

$$
\text{Updates per epoch} = 100
$$

---

### Case 3: Mini-Batch GD

$$
b = 10
$$

$$
\text{Updates per epoch} = 10
$$

This gives:
- More updates than Batch
- Less noise than SGD
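
The three cases above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `updates_per_epoch` is hypothetical, and it assumes a final partial batch (if any) still triggers an update, which is why it rounds up.

```python
import math

def updates_per_epoch(n_rows: int, batch_size: int) -> int:
    """Parameter updates in one pass over the data, keeping a final partial batch."""
    return math.ceil(n_rows / batch_size)

n = 100
for name, b in [("Batch GD", 100), ("SGD", 1), ("Mini-Batch GD", 10)]:
    print(f"{name}: b = {b}, updates per epoch = {updates_per_epoch(n, b)}")
```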

---

## 7. Why Mini-Batch is a Good Trade-off

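Mini-Batch Gradient Descent:
- Reduces the randomness of SGD, because each gradient is computed from several rows
- Reduces the per-update computation cost of Batch GD
- Works well with vectorization
- Needs only one batch in memory at a time
- Usually converges in less wall-clock time than Batch GD

That’s why **almost all deep learning frameworks train with mini-batches by default**.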
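
The noise reduction can be checked empirically. The sketch below uses made-up one-feature data and a hypothetical helper `gradient_spread`; it draws many random batches of a given size and measures how much the batch-averaged gradient varies from draw to draw. Under these assumptions the spread should shrink as the batch size grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y is roughly 2 * x plus noise.
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

w = 0.0                                   # current value of the single weight
per_row_grad = -2.0 * (y - w * x) * x     # gradient of (y_i - w * x_i)^2 for each row

def gradient_spread(batch_size: int, trials: int = 2000) -> float:
    """Standard deviation of the batch-mean gradient across random batches."""
    estimates = [
        per_row_grad[rng.choice(n, size=batch_size, replace=False)].mean()
        for _ in range(trials)
    ]
    return float(np.std(estimates))

for b in (1, 10, 100):
    print(f"batch size {b:>3}: gradient spread = {gradient_spread(b):.4f}")
```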

---

## 8. Mathematical Formulation

Linear Regression model (in this section, $b$ denotes the intercept, not the batch size):

$$
\hat{y} = X\beta + b
$$

Loss function:

$$
L = \sum_{i \in \text{batch}} (y_i - \hat{y}_i)^2
$$

---

### Gradient for Intercept

$$
\frac{\partial L}{\partial b}
= -2 \sum_{i \in \text{batch}} (y_i - \hat{y}_i)
$$

---

### Gradient for Weights

$$
\frac{\partial L}{\partial \beta}
= -2 \sum_{i \in \text{batch}} (y_i - \hat{y}_i) x_i
$$

---

### Update Rules

$$
\beta := \beta - \alpha \frac{\partial L}{\partial \beta}
$$

$$
b := b - \alpha \frac{\partial L}{\partial b}
$$

The only difference from Batch GD:
- The summation runs over the rows of the current **mini-batch**, not the full dataset
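
The gradients and update rules above translate almost line by line into NumPy. The following is a sketch under stated assumptions: the function names `minibatch_gradients` and `minibatch_step` are hypothetical, the intercept is stored in a variable called `intercept` to avoid confusion with the batch size, and the toy batch is made up.

```python
import numpy as np

def minibatch_gradients(X_batch, y_batch, beta, intercept):
    """Gradients of the summed squared error over one mini-batch."""
    residual = y_batch - (X_batch @ beta + intercept)   # y_i - y_hat_i
    grad_intercept = -2.0 * residual.sum()              # dL/db
    grad_beta = -2.0 * X_batch.T @ residual             # dL/dbeta
    return grad_beta, grad_intercept

def minibatch_step(X_batch, y_batch, beta, intercept, alpha):
    """Apply one update: parameter := parameter - alpha * gradient."""
    grad_beta, grad_intercept = minibatch_gradients(X_batch, y_batch, beta, intercept)
    return beta - alpha * grad_beta, intercept - alpha * grad_intercept

def batch_loss(X_batch, y_batch, beta, intercept):
    return float(np.sum((y_batch - (X_batch @ beta + intercept)) ** 2))

# Toy mini-batch of 8 rows and 3 features (illustrative data only).
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(8, 3))
y_batch = X_batch @ np.array([1.0, -2.0, 0.5]) + 4.0

beta, intercept = np.zeros(3), 0.0
loss_before = batch_loss(X_batch, y_batch, beta, intercept)
beta, intercept = minibatch_step(X_batch, y_batch, beta, intercept, alpha=0.01)
loss_after = batch_loss(X_batch, y_batch, beta, intercept)
print(loss_before, loss_after)
```

With a small enough learning rate, a single step should lower the loss on the batch it was computed from.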

---

## 9. Why Mini-Batch Uses Random Sampling

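Each mini-batch is selected **randomly**.

Randomness tends to help with:
- Generalization
- Escaping shallow local minima (relevant for non-convex losses)
- Reduced bias in updates: a uniformly sampled batch gives an unbiased estimate of the full gradient (up to scaling), whereas a fixed row order can bias successive updates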

---

## 10. Implementation Strategy (High Level)

Algorithm steps:

1. Initialize parameters
2. For each epoch:
   - Divide dataset into batches
   - For each batch:
     - Compute prediction
     - Compute gradients
     - Update parameters
3. Repeat until convergence
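
The steps above can be sketched as a short training loop. This is one possible reading, not the only one: it shuffles the rows once per epoch and slices the shuffled order into batches, it uses the mean gradient per batch (an assumption made here so the step size does not depend on the batch size), and it runs a fixed number of epochs instead of testing for convergence. The names `iterate_minibatches` and `fit_linear_minibatch` and the synthetic data are hypothetical.

```python
import numpy as np

def iterate_minibatches(n_rows, batch_size, rng):
    """Yield index arrays that together cover a shuffled range(n_rows) exactly once."""
    order = rng.permutation(n_rows)
    for start in range(0, n_rows, batch_size):
        yield order[start:start + batch_size]

def fit_linear_minibatch(X, y, batch_size=16, alpha=0.05, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])          # 1. initialize parameters
    intercept = 0.0
    for _ in range(epochs):              # 2. for each epoch
        for idx in iterate_minibatches(len(X), batch_size, rng):
            residual = y[idx] - (X[idx] @ beta + intercept)    # prediction error
            grad_beta = -2.0 * X[idx].T @ residual / len(idx)  # mean gradient
            grad_intercept = -2.0 * residual.mean()
            beta = beta - alpha * grad_beta                    # update parameters
            intercept = intercept - alpha * grad_intercept
    return beta, intercept

# Noise-free synthetic data, so the true parameters are recoverable.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([3.0, -1.0]) + 2.0

beta, intercept = fit_linear_minibatch(X, y)
print(beta, intercept)
```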

---

## 11. Coding Mini-Batch Gradient Descent (From Scratch)

We add **one new hyperparameter**: `batch_size`.

This decides how many rows are used per update.

---

### Number of Batches

If:
- Dataset size = $n$
- Batch size = $b$

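Then the number of full batches per epoch is:

$$
\text{batches} = \left\lfloor \frac{n}{b} \right\rfloor
$$

When $b$ does not divide $n$, the remaining $n \bmod b$ rows do not form a full batch.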

---

## 12. Random Batch Selection

We randomly select indices:

$$
\text{indices} \sim \text{RandomSample}(0, n-1, b)
$$

This gives us:
- $b$ random rows
- Used for one update
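
As a sketch of this sampling step, NumPy's random generator can draw $b$ distinct row indices in one call. The array `X` below is placeholder data, and sampling without replacement within a batch is an assumption that matches drawing $b$ different rows.

```python
import numpy as np

rng = np.random.default_rng(42)

n, b = 100, 10
X = np.arange(n * 2).reshape(n, 2)              # placeholder feature matrix with n rows

# Draw b distinct row indices from 0 .. n-1.
indices = rng.choice(n, size=b, replace=False)

X_batch = X[indices]                            # the rows used for one update
print(indices)
print(X_batch.shape)
```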

---

## 13. Why Mini-Batch is Faster Than Batch GD

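With $p$ features, the cost of **one update** is:

Batch GD:
$$
O(n \times p)
$$

Mini-Batch GD:
$$
O(b \times p)
$$

Since:
$$
b \ll n
$$

each update is far cheaper. A full epoch still touches all $n$ rows, so it costs about $O(n \times p)$ either way, but Mini-Batch GD makes $n/b$ updates in that time instead of one.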

---

## 14. Comparison Summary

| Method | Batch Size | Updates / Epoch | Speed per Update | Stability |
|------|-----------|-----------------|-------|-----------|
| Batch GD | $n$ | 1 | Slow | Very stable |
| SGD | 1 | $n$ | Very fast | Noisy |
| Mini-Batch | $b$ | $n/b$ | Fast | Balanced |

---

## 15. Visualization Intuition

- Batch GD → smooth straight path
- SGD → zig-zag random path
- Mini-Batch → controlled zig-zag

Mini-Batch behaves **between Batch and SGD**.

---

## 16. Problem Near Minimum

Near the optimal solution:
- Mini-Batch still oscillates
- Especially with a constant learning rate

Solution: **Learning Rate Scheduling**

---

## 17. Learning Rate Scheduling Idea

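Instead of a constant $\alpha$, let the learning rate decay with the update count $t$:

$$
\alpha_t = \frac{\alpha_0}{1 + kt}
$$

Here $\alpha_0$ is the initial learning rate and $k > 0$ controls how quickly it decays.

As iterations increase:
- Learning rate decreases
- Updates become smaller
- Convergence stabilizes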
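
A minimal sketch of this schedule follows. The function name `inverse_time_decay` and the values chosen for $\alpha_0$ and $k$ are assumptions for illustration only.

```python
def inverse_time_decay(alpha0: float, k: float, t: int) -> float:
    """Learning rate at update t: alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)

alpha0, k = 0.1, 0.01
for t in (0, 100, 900):
    print(f"t = {t:>3}: alpha_t = {inverse_time_decay(alpha0, k, t):.4f}")
```

Inside a training loop, `t` would typically count parameter updates (or epochs), and the returned value would replace the constant learning rate in the update rule.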

---

## 18. Practical Notes

- Mini-Batch GD is the **default choice**
- Used in:
  - Neural Networks
  - Deep Learning
  - Large-scale ML
- Commonly used batch sizes:
  $16,\;32,\;64,\;128$

---

## 19. Relation to Libraries

Most ML libraries train with **Mini-Batch Gradient Descent** internally.

Even when an optimizer is called SGD, it is often applied to mini-batches rather than single rows.

---

## 20. Final Takeaways

- Mini-Batch GD is a **middle ground**
- Faster than Batch GD
- More stable than SGD
- Memory efficient
- The most practical of the three variants

---

## 21. Series Conclusion

We have now covered:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent

Understanding these three gives you a **strong foundation** for:
- Machine Learning
- Deep Learning
- Optimization algorithms

---


In [2]:
from sklearn.datasets import load_diabetes

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
X,y = load_diabetes(return_X_y=True)
print(X.shape)
print(y.shape)

(442, 10)
(442,)


In [3]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)
reg = LinearRegression()
reg.fit(X_train,y_train)

In [4]:
print(reg.coef_)
print(reg.intercept_)

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


In [5]:
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4399338661568968

In [6]:
import random

class MBGDRegressor:
    """Linear regression trained with mini-batch gradient descent."""

    def __init__(self,batch_size,learning_rate=0.01,epochs=100):

        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size

    def fit(self,X_train,y_train):
        # initialize the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):

            # floor(n / batch_size) updates per epoch
            for j in range(X_train.shape[0] // self.batch_size):

                # draw batch_size distinct row indices; each batch is sampled
                # independently, so a row can appear in several batches of one epoch
                idx = random.sample(range(X_train.shape[0]),self.batch_size)

                y_hat = np.dot(X_train[idx],self.coef_) + self.intercept_

                # intercept gradient, averaged over the batch (np.mean);
                # note that the coefficient gradient below is summed instead
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # coefficient gradient, summed over the batch as in Section 8
                coef_der = -2 * np.dot((y_train[idx] - y_hat),X_train[idx])
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_,self.coef_)

    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

In [7]:
# batch size of n/50 rows (rounded down), i.e. about 50 updates per epoch
mbr = MBGDRegressor(batch_size=int(X_train.shape[0]/50),learning_rate=0.01,epochs=100)
mbr.fit(X_train,y_train)

152.49829698031715 [  30.40397286 -138.37493728  445.19607064  314.09510392  -22.29094659
  -89.85454529 -194.13173872  116.2641905   406.70778904  117.85910415]


In [8]:
y_pred = mbr.predict(X_test)
r2_score(y_test,y_pred)

0.4520057232705571

In [9]:
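from sklearn.linear_model import SGDRegressor

# constant learning rate: every update uses step size eta0
sgd = SGDRegressor(learning_rate='constant',eta0=0.1)
batch_size = 35

for i in range(100):

    # draw a random batch; partial_fit runs one pass of SGD over just these rows
    idx = random.sample(range(X_train.shape[0]),batch_size)
    sgd.partial_fit(X_train[idx],y_train[idx])
sgd.coef_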

array([  62.41952613,  -49.84742974,  348.09735528,  215.6045523 ,
         11.2679345 ,  -21.2673479 , -182.44096163,  140.91165651,
        298.73023806,  126.46553277])

In [10]:
sgd.intercept_

array([148.43893739])

In [11]:
y_pred = sgd.predict(X_test)
r2_score(y_test,y_pred)

0.4288847382265526