## Mini-batch gradient descent

Mini-batch gradient descent is a variation of the gradient descent optimization algorithm used in machine learning for training models. It combines the advantages of both batch gradient descent (BGD) and stochastic gradient descent (SGD). Here's a simple explanation:

1. **Batch Gradient Descent (BGD)**: In BGD, you compute the gradient of the cost function with respect to the parameters using the entire dataset. Then, you update the parameters once based on this average gradient.

2. **Stochastic Gradient Descent (SGD)**: In SGD, you compute the gradient of the cost function with respect to the parameters using a single data point chosen at random from the dataset. Then, you update the parameters based on this one-sample gradient. (In practice, the name "SGD" is often used loosely for the mini-batch variant described next.)

3. **Mini-Batch Gradient Descent**: Mini-batch gradient descent strikes a balance between BGD and SGD. Instead of using the entire dataset (BGD) or just one data point (SGD), mini-batch gradient descent computes the gradient using a small random subset (mini-batch) of the dataset. Then, it updates the parameters based on this mini-batch gradient. This process is repeated for multiple mini-batches until convergence.

In summary, mini-batch gradient descent combines the efficiency of SGD, which updates parameters more frequently and requires less memory, with the stability of BGD, which provides a more accurate estimate of the gradient. This makes it a popular choice for training models, especially in scenarios where the dataset is large and computational resources are limited.
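To make the comparison concrete, here is a minimal sketch in which the three variants share one gradient function and differ only in how many rows are passed to it. The helper names (`mse_gradient`, `sample_batch`) and the synthetic data are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def mse_gradient(X, y, w, b):
    """Gradient of the mean squared error for a linear model y_hat = X @ w + b."""
    residual = y - (X @ w + b)
    grad_w = -2.0 * X.T @ residual / len(y)
    grad_b = -2.0 * residual.mean()
    return grad_w, grad_b

def sample_batch(X, y, batch_size, rng):
    """Draw `batch_size` distinct rows at random."""
    idx = rng.choice(len(y), size=batch_size, replace=False)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))              # synthetic features, illustration only
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0   # synthetic targets

w, b = np.zeros(3), 0.0

# Batch GD: one gradient from all 200 rows.
full_grad_w, full_grad_b = mse_gradient(X, y, w, b)

# SGD: one gradient from a single random row.
sgd_grad_w, sgd_grad_b = mse_gradient(*sample_batch(X, y, 1, rng), w, b)

# Mini-batch GD: one gradient from 32 random rows.
mb_grad_w, mb_grad_b = mse_gradient(*sample_batch(X, y, 32, rng), w, b)
```

All three calls return a gradient of the same shape; only the cost of computing it and the noise in the estimate change with the number of rows.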

In [1]:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [2]:
X,y = load_diabetes(return_X_y=True)

In [3]:
X.shape, y.shape

((442, 10), (442,))

In [4]:
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state =2)

In [5]:
reg = LinearRegression()

In [6]:
reg.fit(X_train, y_train)

In [7]:
y_pred = reg.predict(X_test)

In [8]:
r2_score(y_test, y_pred)

0.4399338661568968

In [9]:
reg.coef_

array([  -9.15865318, -205.45432163,  516.69374454,  340.61999905,
       -895.5520019 ,  561.22067904,  153.89310954,  126.73139688,
        861.12700152,   52.42112238])

In [10]:
reg.intercept_

151.88331005254167

In [11]:
X_train.shape[1]

10

### Building our own Mini Batch GD class
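The updates in the class below follow from the mean squared error over one mini-batch. As a short derivation, assume a mini-batch $B$ of size $b$, coefficients $w$, intercept $w_0$ and learning rate $\eta$ (this notation is introduced here and does not appear in the notebook):

$$
L_B(w, w_0) = \frac{1}{b}\sum_{i \in B}\left(y_i - \hat{y}_i\right)^2,
\qquad \hat{y}_i = x_i^\top w + w_0
$$

$$
\frac{\partial L_B}{\partial w_0} = -\frac{2}{b}\sum_{i \in B}\left(y_i - \hat{y}_i\right),
\qquad
\frac{\partial L_B}{\partial w} = -\frac{2}{b}\sum_{i \in B}\left(y_i - \hat{y}_i\right)x_i
$$

$$
w_0 \leftarrow w_0 - \eta\,\frac{\partial L_B}{\partial w_0},
\qquad
w \leftarrow w - \eta\,\frac{\partial L_B}{\partial w}
$$

The class computes the intercept derivative exactly as written here (via `np.mean`), while its coefficient derivative uses the sum without the $1/b$ factor.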

In [12]:
import random
import numpy as np  # Importing numpy library for numerical operations

class MBGDRegressor:
    
    def __init__(self, batch_size, learning_rate=0.01, epochs=100):
        """
        Constructor method to initialize the parameters of the model.
        
        Args:
        - batch_size: Size of the mini-batch for mini-batch gradient descent
        - learning_rate: Learning rate for the gradient descent updates
        - epochs: Number of epochs (iterations over the entire dataset) for training
        
        """
        # Initialize model parameters
        self.coef_ = None  # Coefficients of the linear regression model
        self.intercept_ = None  # Intercept term of the linear regression model
        self.lr = learning_rate  # Learning rate for gradient descent
        self.epochs = epochs  # Number of training epochs
        self.batch_size = batch_size  # Size of mini-batch
        
    def fit(self, X_train, y_train):
        """
        Method to train the linear regression model using mini-batch gradient descent.
        
        Args:
        - X_train: Training features (input data)
        - y_train: Training labels (output data)
        
        """
        # Initialize coefficients
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        # Iterate over epochs
        for i in range(self.epochs):
            
            # Iterate over mini-batches
            for j in range(int(X_train.shape[0] / self.batch_size)):
                
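                # Randomly sample batch_size distinct row indices. Each batch is drawn
                # independently, so an "epoch" here is n / batch_size random batches
                # rather than a pass that visits every row exactly once.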
                idx = random.sample(range(X_train.shape[0]), self.batch_size)
                
                # Compute predictions
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                
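                # Compute derivatives of the cost function wrt intercept and coefficients.
                # Note: the intercept gradient is averaged over the batch (np.mean), while
                # the coefficient gradient is summed (np.dot), so the coefficient step is
                # effectively batch_size times larger than the intercept step.
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                coef_der = -2 * np.dot((y_train[idx] - y_hat), X_train[idx])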
                
                # Update intercept and coefficients using gradient descent
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                self.coef_ = self.coef_ - (self.lr * coef_der)
        
        # Print the final values of intercept and coefficients after training
        print(self.intercept_, self.coef_)
    
    def predict(self, X_test):
        """
        Method to predict outputs for new input data.
        
        Args:
        - X_test: Test features (input data)
        
        Returns:
        - Predicted outputs
        
        """
        # Compute predictions using learned coefficients
        return np.dot(X_test, self.coef_) + self.intercept_

In [13]:
mbr = MBGDRegressor(batch_size=int(X_train.shape[0]/50),learning_rate=0.01,epochs=100)

In [14]:
mbr.fit(X_train,y_train)

150.37615116301237 [  24.70118157 -137.26292063  456.14461743  295.31504016  -25.442493
  -95.3148191  -193.36400333  118.95666971  410.82375837  118.26988043]


In [15]:
y_pred = mbr.predict(X_test)

In [16]:
r2_score(y_test,y_pred)

0.45355345420730653
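The class above draws every mini-batch independently with `random.sample`, so within one epoch some rows can be used several times and others not at all. A common alternative is to shuffle the row order once per epoch and slice it into consecutive batches, so each row is used exactly once per epoch. The sketch below shows that variant on synthetic data. It is not the notebook's method: the function names, the default hyperparameters, and the use of the batch-averaged gradient for both the coefficients and the intercept are all assumptions made for illustration.

```python
import numpy as np

def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that together cover every row exactly once."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

def fit_linear_minibatch(X, y, batch_size=8, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ X @ w + b with mini-batch GD on the mean squared error."""
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for idx in iterate_minibatches(len(y), batch_size, rng):
            residual = y[idx] - (X[idx] @ w + b)
            # Batch-averaged gradients for coefficients and intercept.
            grad_w = -2.0 * X[idx].T @ residual / len(idx)
            grad_b = -2.0 * residual.mean()
            w = w - learning_rate * grad_w
            b = b - learning_rate * grad_b
    return w, b

# Noise-free synthetic data, illustration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(120, 2))
y_demo = X_demo @ np.array([2.0, -1.0]) + 0.5

w_fit, b_fit = fit_linear_minibatch(X_demo, y_demo)
print(w_fit, b_fit)
```

Because the gradient is averaged here, a learning rate that works for this sketch is not directly comparable to the `learning_rate` passed to `MBGDRegressor`.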

![](https://github.com/campusx-official/100-days-of-machine-learning/blob/main/day52-types-of-gradient-descent/mini_batch_contour_plot.gif?raw=true)

### Sklearn

In [17]:
from sklearn.linear_model import SGDRegressor

In [18]:
sgd = SGDRegressor(learning_rate='constant',eta0=0.1)

In [19]:
# SGDRegressor has no batch_size parameter, so we draw each mini-batch ourselves and pass it to partial_fit.
batch_size = 35

for i in range(100):
    
    idx = random.sample(range(X_train.shape[0]),batch_size)
    sgd.partial_fit(X_train[idx],y_train[idx])

Let's break down what each part does:

1. `batch_size = 35`: This line sets the batch size to 35. In SGD, instead of using the entire dataset at once (as in batch gradient descent), we use only a subset of the dataset for each update step. This subset is called a mini-batch, and its size is specified by the `batch_size`.

2. `for i in range(100):`: This loop runs for 100 iterations. Each iteration draws one mini-batch and makes one `partial_fit` call, so these are update rounds rather than epochs (full passes over the training set).

3. `idx = random.sample(range(X_train.shape[0]), batch_size)`: This line randomly selects `batch_size` indices from the range of indices corresponding to the training data (`X_train`). This effectively creates a random mini-batch of training data.

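4. `sgd.partial_fit(X_train[idx], y_train[idx])`: This line updates the model using the current mini-batch, continuing from the current coefficients instead of starting over. The `partial_fit` method is commonly used in online learning scenarios, where the model is updated incrementally as new data becomes available. Here, `X_train[idx]` represents the features (input data) of the mini-batch, and `y_train[idx]` represents the corresponding target labels (output data). Note that scikit-learn's SGD makes one pass over the rows it receives and updates the weights one sample at a time, so this loop approximates mini-batch training rather than averaging the gradient over the 35 rows.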

Overall, this code performs 100 rounds of training. In each round, a random mini-batch of 35 data points is selected and the model is updated using these data points. That amounts to 3,500 sample presentations, roughly ten passes' worth for the 353 training rows, although some rows are seen more often than others because every batch is drawn independently.

In [20]:
sgd.coef_

array([  42.50833574,  -59.73770479,  348.25917255,  240.13042681,
         26.35450936,  -25.87807718, -158.42251649,  117.41868933,
        316.19495841,  144.40076166])

In [21]:
sgd.intercept_

array([146.90314164])

In [22]:
y_pred = sgd.predict(X_test)

In [23]:
r2_score(y_test,y_pred)

0.42580121632640955

### When to use Mini Batch gradient descent?

1. **Dealing with Large Datasets**: If your dataset is too large to fit into memory, mini-batch gradient descent allows you to efficiently process it by dividing it into smaller chunks (mini-batches).

2. **Balancing Speed and Stability**: Mini-batch gradient descent strikes a balance between the speed of stochastic gradient descent (SGD) and the stability of batch gradient descent (BGD). It updates the model parameters more frequently than BGD but with less noise compared to SGD.

3. **Improving Generalization**: The noise in mini-batch gradient estimates can act as a mild regularizer. For non-convex models in particular, it may help the optimizer avoid settling into poor solutions, which can lead to better generalization than full-batch training.

4. **Efficient Computation**: It's suitable for modern hardware architectures like GPUs, allowing for parallelization and vectorized operations, making computation more efficient.

5. **Control over Stochasticity**: You can control the amount of stochasticity in the optimization process by adjusting the batch size, balancing computational efficiency with the stability of updates.

6. **Online Learning**: Mini-batch gradient descent is great for scenarios where the model needs to be continuously updated with new data (online learning), as it allows for incremental updates without retraining on the entire dataset.
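The trade-off in points 2 and 5 can be checked numerically. The sketch below measures how far mini-batch gradients stray from the full-dataset gradient for several batch sizes. It uses synthetic data and hypothetical helper names (`batch_gradient`, `gradient_noise`) and evaluates the gradients at one fixed parameter vector, so treat it as an illustration rather than a benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                  # synthetic features
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=1000)   # noisy synthetic targets
w = np.zeros(5)                                                 # fixed point at which gradients are compared

def batch_gradient(idx):
    """Mean-squared-error gradient wrt w, computed on the rows in idx."""
    residual = y[idx] - X[idx] @ w
    return -2.0 * X[idx].T @ residual / len(idx)

full_gradient = batch_gradient(np.arange(len(y)))

def gradient_noise(batch_size, n_draws=500):
    """Average distance between a mini-batch gradient and the full gradient."""
    errors = []
    for _ in range(n_draws):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        errors.append(np.linalg.norm(batch_gradient(idx) - full_gradient))
    return float(np.mean(errors))

noise = {b: gradient_noise(b) for b in (1, 8, 64, 512)}
for b, err in noise.items():
    print(f"batch_size={b:4d}  mean gradient error={err:.3f}")
```

The error shrinks as the batch grows, which is the sense in which batch size controls the stochasticity of the updates.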

In essence, mini-batch gradient descent is a versatile choice for optimizing machine learning models, especially when the dataset is large and efficiency is a concern.