## Batch Gradient Descent

### Types of Gradient Descent

![types](https://editor.analyticsvidhya.com/uploads/58182variations_comparison.png)

### Differences Between the Variants

The main difference between batch gradient descent, stochastic gradient descent (SGD), and mini-batch gradient descent lies in how they process training data and update model parameters during optimization:

1. **Batch Gradient Descent**:
   - **Processing Data**: In batch gradient descent, the entire training dataset is used to compute the gradient of the loss function with respect to the model parameters in each iteration.
   - **Update Rule**: Model parameters are updated once per iteration by taking an average gradient over all training examples.
   - **Advantages**: It provides a precise estimate of the gradient and leads to stable convergence, especially when the loss function is convex.
   - **Disadvantages**: Computationally expensive for large datasets since it requires processing the entire dataset in each iteration. It may get stuck in local minima for non-convex loss functions.

2. **Stochastic Gradient Descent (SGD)**:
   - **Processing Data**: In SGD, a single randomly chosen training example is used to compute the gradient of the loss function in each iteration.
   - **Update Rule**: Model parameters are updated after computing the gradient for each individual example.
   - **Advantages**: Each update is cheap because it touches only one example, so on large datasets the model often makes progress faster than batch gradient descent for the same amount of computation. The noise in the updates can also help it escape shallow local minima.
   - **Disadvantages**: It may exhibit high variance in the updates, leading to noisy convergence. The convergence may be less stable compared to batch gradient descent.

3. **Mini-Batch Gradient Descent**:
   - **Processing Data**: Mini-batch gradient descent strikes a balance between batch gradient descent and SGD. It processes small batches of randomly chosen training examples.
   - **Update Rule**: Model parameters are updated once per batch by taking an average gradient over the examples in the batch.
   - **Advantages**: It combines the stability of batch gradient descent with the efficiency of SGD. It can make efficient use of hardware optimizations like parallelism when computing gradients.
   - **Disadvantages**: It introduces a hyperparameter (batch size) that needs to be tuned. The choice of batch size can affect convergence speed and efficiency.

In summary, the main differences lie in the amount of data used to compute gradients (entire dataset, single example, or small batches), the frequency of parameter updates, and the computational efficiency. Each variant has its trade-offs in terms of convergence speed, stability, and computational resources. The choice depends on the specific problem, dataset size, and computational constraints.
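
The three variants can be sketched as one loop in which only the batch size changes. The following is a minimal illustration for a linear model with MSE loss, not a reference implementation: the function name `gradient_descent`, its parameters, and the synthetic data are assumptions made for this example.

```python
import numpy as np

def gradient_descent(X, y, batch_size, learning_rate=0.1, epochs=100, seed=0):
    """Fit a linear model with MSE loss; batch_size selects the variant."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    coef = np.ones(n_features)
    intercept = 0.0
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle the rows every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_b, y_b = X[idx], y[idx]
            residual = y_b - (X_b @ coef + intercept)
            # Average gradient of the MSE over the current batch only
            intercept_grad = -2 * residual.mean()
            coef_grad = -2 * (residual @ X_b) / len(idx)
            intercept -= learning_rate * intercept_grad
            coef -= learning_rate * coef_grad
    return intercept, coef

# Hypothetical noise-free data, so every variant can recover the true parameters
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

# Batch: one update per epoch, computed on all 200 rows
batch_fit = gradient_descent(X_demo, y_demo, batch_size=len(X_demo))
# SGD: 200 updates per epoch, one row each (smaller step to tame the noise)
sgd_fit = gradient_descent(X_demo, y_demo, batch_size=1, learning_rate=0.01)
# Mini-batch: 7 updates per epoch, up to 32 rows each
mini_fit = gradient_descent(X_demo, y_demo, batch_size=32)

for name, (b0, b) in [("batch", batch_fit), ("sgd", sgd_fit), ("mini-batch", mini_fit)]:
    print(name, round(b0, 3), np.round(b, 3))
```

Setting `batch_size` to the number of rows gives batch gradient descent, `batch_size=1` gives SGD, and anything in between gives mini-batch gradient descent.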

In [1]:
from sklearn.datasets import load_diabetes

In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [3]:
X, y = load_diabetes(return_X_y=True)

In [4]:
X.shape, y.shape

((442, 10), (442,))

In [5]:
X_train,X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=2)

In [6]:
reg = LinearRegression()
reg.fit(X_train, y_train)

In [7]:
# We have 10 columns, so we get 10 coefficients (β1, β2, ..., β10)
reg.coef_

array([  -9.15865318, -205.45432163,  516.69374454,  340.61999905,
       -895.5520019 ,  561.22067904,  153.89310954,  126.73139688,
        861.12700152,   52.42112238])

In [8]:
# Intercept (β0)
reg.intercept_

151.88331005254167

In [9]:
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

### Making our own Gradient Descent Class

In [10]:
# This shows how many columns (features) the dataset has
# 10 columns = 10 coefficients
X_train[1].shape

(10,)

In [11]:
class amanGD:
    def __init__(self, learning_rate = 0.1, epochs = 100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        # initialize coef_ and intercept
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # Each epoch updates the intercept and all coefficients once,
            # using gradients computed over the entire training set.
            # 1. Compute predictions with the current parameters
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            # 2. Gradient of the MSE loss L with respect to the intercept:
            #    ∂L/∂β0 = (-2/N) * Σ(i=1 to N) (y_i - ŷ_i)
            intercept_der = -2 * np.mean(y_train - y_hat)
            # 3. Move the intercept against its gradient, scaled by the learning rate
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            # Updating all coefficients
            # 1. Gradient of the MSE loss with respect to each coefficient:
            #    ∂L/∂βm = (-2/N) * Σ(i=1 to N) (y_i - ŷ_i) * x_im, where m indexes the columns
            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            # 2. Move the coefficients against their gradient
            self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_,self.coef_)
    def predict(self, X_test):
        # ŷ = X·β + β0
        return np.dot(X_test,self.coef_) + self.intercept_
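
The two derivatives used in `fit` follow from differentiating the mean squared error. The derivation below is a sketch; the symbol η for the learning rate is notation introduced here and corresponds to `self.lr` in the class.

```latex
% Loss and prediction for N training rows and columns m = 1, ..., M
L(\beta_0, \beta) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,
\qquad
\hat{y}_i = \beta_0 + \sum_{m=1}^{M} \beta_m x_{im}

% Chain rule: d(y_i - \hat{y}_i)/d\beta_0 = -1 and d(y_i - \hat{y}_i)/d\beta_m = -x_{im}
\frac{\partial L}{\partial \beta_0} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i),
\qquad
\frac{\partial L}{\partial \beta_m} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)\, x_{im}

% Gradient descent update with learning rate \eta
\beta_0 \leftarrow \beta_0 - \eta \, \frac{\partial L}{\partial \beta_0},
\qquad
\beta_m \leftarrow \beta_m - \eta \, \frac{\partial L}{\partial \beta_m}
```

The leading minus sign is what `-2 * np.mean(...)` and `-2 * np.dot(...)` implement in the code.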

For the derivation of the coefficient update, refer to this video:

[![Alt text](https://img.youtube.com/vi/Jyo53pAyVAM/0.jpg)](https://www.youtube.com/watch?v=Jyo53pAyVAM?t=3284)
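
One way to gain confidence in the analytic gradients is to compare them against finite differences of the loss. The sketch below does this on random data; the helper names `mse`, `analytic_gradients`, and `numeric_gradients` are assumptions made for this example and are not part of the class above.

```python
import numpy as np

def mse(X, y, intercept, coef):
    """Mean squared error of a linear model."""
    return np.mean((y - (X @ coef + intercept)) ** 2)

def analytic_gradients(X, y, intercept, coef):
    """Same expressions as intercept_der and coef_der in the fit method."""
    residual = y - (X @ coef + intercept)
    intercept_der = -2 * np.mean(residual)
    coef_der = -2 * np.dot(residual, X) / X.shape[0]
    return intercept_der, coef_der

def numeric_gradients(X, y, intercept, coef, eps=1e-6):
    """Central finite differences of the MSE, one parameter at a time."""
    intercept_der = (mse(X, y, intercept + eps, coef)
                     - mse(X, y, intercept - eps, coef)) / (2 * eps)
    coef_der = np.zeros_like(coef)
    for m in range(coef.size):
        step = np.zeros_like(coef)
        step[m] = eps
        coef_der[m] = (mse(X, y, intercept, coef + step)
                       - mse(X, y, intercept, coef - step)) / (2 * eps)
    return intercept_der, coef_der

# Hypothetical random data; any X and y of matching shapes would do
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
y_demo = rng.normal(size=50)
coef0 = np.ones(4)

a_b0, a_b = analytic_gradients(X_demo, y_demo, 0.0, coef0)
n_b0, n_b = numeric_gradients(X_demo, y_demo, 0.0, coef0)
max_gap = max(abs(a_b0 - n_b0), np.max(np.abs(a_b - n_b)))
print(max_gap)  # should be tiny if the analytic formulas are right
```

If the sign or the 1/N factor in a derivative were wrong, the gap would be on the order of the gradient itself rather than close to zero.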


In [12]:
gd = amanGD(epochs=1000, learning_rate=0.5)

In [13]:
gd.fit(X_train, y_train)

152.01351687661833 [  14.38990585 -173.7235727   491.54898524  323.91524824  -39.32648042
 -116.01061213 -194.04077415  103.38135565  451.63448787   97.57218278]


In [14]:
y_pred = gd.predict(X_test)

In [15]:
r2_score(y_test, y_pred)

0.4534503034722803
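
Although the two R² scores are close, several coefficients printed by `gd.fit` differ markedly from `reg.coef_` (the fifth, for example), which suggests that 1000 epochs may not have been enough to reach the least-squares solution. A simple way to check is to record the training MSE at every epoch and see whether it has flattened out. The sketch below uses the same update rule as the class; the function name `fit_with_history` and the synthetic data are assumptions made for this example. In the notebook, `X_train` and `y_train` could be passed instead.

```python
import numpy as np

def fit_with_history(X, y, learning_rate=0.1, epochs=100):
    """Batch gradient descent that also returns the MSE before each update."""
    intercept = 0.0
    coef = np.ones(X.shape[1])
    history = []
    for _ in range(epochs):
        residual = y - (X @ coef + intercept)
        history.append(np.mean(residual ** 2))  # training MSE at this epoch
        intercept_der = -2 * np.mean(residual)
        coef_der = -2 * np.dot(residual, X) / X.shape[0]
        intercept -= learning_rate * intercept_der
        coef -= learning_rate * coef_der
    return intercept, coef, np.array(history)

# Hypothetical data: a linear signal plus a little Gaussian noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + 5.0 + rng.normal(scale=0.1, size=100)

intercept, coef, history = fit_with_history(X_demo, y_demo, learning_rate=0.1, epochs=200)
print(history[0], history[-1])          # loss at the start and at the end
print(abs(history[-1] - history[-2]))   # a near-zero change suggests convergence
```

If the change between the last epochs is still noticeable, more epochs or a different learning rate may be worth trying.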