<h1 align = 'center'>Batch Gradient Descent </h1>

Suppose a data set has __m__ observations. If we use __all m observations to calculate__ the cost function J (and therefore its gradient) at every update step, the method is known as <code>Batch Gradient Descent</code>.
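As a rough illustration of what "using all m observations" means, the sketch below computes a mean squared error cost J and its gradients for a linear model in one pass over the whole data set. The function name `batch_cost_and_gradient`, the choice of MSE as J, and the tiny demo arrays are assumptions made for this example only.

```python
import numpy as np

def batch_cost_and_gradient(X, y, w, b):
    """MSE cost J and its gradients, computed from all m observations at once."""
    m = X.shape[0]
    residual = y - (X @ w + b)             # one residual per observation, shape (m,)
    cost = np.mean(residual ** 2)          # J = (1/m) * sum(residual^2)
    grad_w = -2.0 / m * (X.T @ residual)   # dJ/dw, one entry per feature
    grad_b = -2.0 * np.mean(residual)      # dJ/db
    return cost, grad_w, grad_b

# Hypothetical data that follows y = 2*x + 1 exactly
X_demo = np.array([[0.0], [1.0], [2.0]])
y_demo = np.array([1.0, 3.0, 5.0])

# At the true parameters the cost and both gradients vanish
cost_opt, grad_w_opt, grad_b_opt = batch_cost_and_gradient(X_demo, y_demo, np.array([2.0]), 1.0)

# At w = 0, b = 0 every observation contributes to the gradient
cost_0, grad_w_0, grad_b_0 = batch_cost_and_gradient(X_demo, y_demo, np.array([0.0]), 0.0)
print(cost_0, grad_w_0, grad_b_0)
```

Every call touches all rows of `X`, which is the defining property of the batch variant.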

### Advantages

1. Batch Gradient Descent works well on convex or relatively smooth error surfaces, where it moves steadily toward the minimum.
2. Each step uses the exact gradient of the cost over the entire training dataset, so the updates are stable rather than noisy.
3. Batch GD scales well with the number of features.

### Disadvantages

1. Each step involves calculations over the full training set, so Batch Gradient Descent is __very slow__ on very large training data.
2. It is computationally expensive, since every update touches all m observations.
3. For a non-convex function, it can get stuck in a local minimum.
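The local-minimum problem can be seen on a one-dimensional toy function. The sketch below is an illustration only: the function `f(x) = x^4 - 3x^2 + x`, the learning rate, and the two starting points are assumptions chosen so that the effect is easy to see. Plain gradient descent ends up in a different minimum depending on where it starts.

```python
def f(x):
    # Non-convex toy function with two minima
    return x**4 - 3 * x**2 + x

def f_prime(x):
    # Derivative of f
    return 4 * x**3 - 6 * x + 1

def descend(x0, lr=0.01, steps=1000):
    """Plain gradient descent on f, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * f_prime(x)
    return x

x_from_right = descend(2.0)    # settles in the shallower (local) minimum
x_from_left = descend(-2.0)    # settles in the deeper (global) minimum

print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

Started from `x0 = 2.0`, the iterate stays in the right-hand basin even though a lower value of `f` exists on the left.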

### Import Libraries 

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes

In [2]:
X,y = load_diabetes(return_X_y=True)

In [3]:
X

array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990842, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06832974, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286377, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04687948,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452837, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00421986,  0.00306441]])

In [4]:
y

array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [5]:
X.shape, y.shape

((442, 10), (442,))

#### Train test split

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

#### Model Training

In [8]:
from sklearn.linear_model import LinearRegression

In [9]:
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

In [10]:
model.coef_

array([  -9.16088483, -205.46225988,  516.68462383,  340.62734108,
       -895.54360867,  561.21453306,  153.88478595,  126.73431596,
        861.12139955,   52.41982836])

In [11]:
model.intercept_

151.88334520854633

#### Performance metrics

In [12]:
from sklearn.metrics import r2_score

In [13]:
y_pred = model.predict(X_test)
print("r2_score = ", r2_score(y_test,y_pred))

r2_score =  0.4399387660024645
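For reference, the R² score compares the model's squared error with that of a baseline that always predicts the mean of the targets. The helper below is a minimal sketch of that formula for a single target; the name `r2_manual` and the demo arrays are made up for this example.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot for a single regression target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)           # error of the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # error of the mean baseline
    return 1.0 - ss_res / ss_tot

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r2_manual(y_true_demo, y_true_demo)                  # model matches every target
baseline = r2_manual(y_true_demo, np.full(4, y_true_demo.mean()))  # model predicts the mean
print(perfect, baseline)
```

A score of 1 means a perfect fit, and a score of 0 means the model does no better than predicting the mean.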


### Custom Class for Batch Gradient Descent

In [14]:
X.shape

(442, 10)

In [15]:
class GDRegressor:
    
    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
    def fit(self,X_train, y_train):
        # Initialize parameters: intercept b = 0, coefficients b1, b2, ..., bn = 1
        self.intercept_ = 0

        # The number of columns in X_train gives the number of coefficients
        self.coef_ = np.ones(X_train.shape[1])

        for i in range(self.epochs):
            # Predictions from the current parameters: y_hat = b + w.T x
            y_hat = np.dot(X_train, self.coef_) + self.intercept_

            # Intercept gradient, averaged over all rows: dL/db = -2 * mean(y - y_hat)
            intercept_derv = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_derv)  # b_new = b_old - lr * slope

            # Coefficient gradient, summed (not averaged) over all rows:
            # dL/dw = -2 * sum((y - y_hat) * x), i.e. n times the averaged gradient
            coef_derv = -2 * np.dot((y_train - y_hat), X_train)
            self.coef_ = self.coef_ - (self.lr * coef_derv)  # w_new = w_old - lr * slope
            
            
        #print(y_hat.shape)
        print(self.intercept_, self.coef_)
        
        
            
    def predict(self, X_test):
        return np.dot(X_test, self.coef_)+self.intercept_
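To judge whether a given number of epochs is enough, it can help to record the cost at every epoch. The sketch below is a standalone variant of the same loop, not the class above: it averages both gradients over the rows (the class sums the coefficient gradient), and the function name `fit_with_history`, the learning rate, and the synthetic data are assumptions for this example.

```python
import numpy as np

def fit_with_history(X, y, lr=0.1, epochs=200):
    """Batch gradient descent for linear regression that also records the MSE per epoch."""
    n, d = X.shape
    w = np.ones(d)   # same initialization idea as above: coefficients start at 1
    b = 0.0          # intercept starts at 0
    history = []
    for _ in range(epochs):
        residual = y - (X @ w + b)               # uses every row of X
        history.append(np.mean(residual ** 2))   # cost before this update
        b -= lr * (-2.0 * np.mean(residual))     # averaged intercept gradient
        w -= lr * (-2.0 / n * (X.T @ residual))  # averaged coefficient gradient
    return w, b, history

# Synthetic, noise-free data with known parameters (hypothetical values)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_syn = X_syn @ true_w + 4.0

w_fit, b_fit, history = fit_with_history(X_syn, y_syn)
print("first cost:", history[0], "last cost:", history[-1])
```

Plotting `history` against the epoch number gives a quick visual check: a curve that has flattened out suggests more epochs will change little.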

In [16]:
gd = GDRegressor(epochs=10)

In [17]:
gd.fit(X_train,y_train) 

27.589808027703263 [ 43.51355856  -7.69500784 111.61338053  96.51157875  30.66490076
  10.2854373  -58.15333967  56.6532078  124.34330544  65.11421485]


In [18]:
gd = GDRegressor(epochs=1000)
gd.fit(X_train,y_train)

151.90197654519747 [  -7.41032246 -200.16617472  533.25518552  338.23799406 -137.42989217
  -52.86615812 -170.28983465   54.54383886  571.24202059   53.64638953]


In [19]:
y_pred = gd.predict(X_test)

In [20]:
r2_score(y_test,y_pred)

0.443073473332973

### Time required by batch gradient descent

In [21]:
import time

In [22]:
start = time.time()
gd = GDRegressor(epochs=10)
gd.fit(X_train,y_train)
print("Time taken is ", time.time()-start)

27.589808027703263 [ 43.51355856  -7.69500784 111.61338053  96.51157875  30.66490076
  10.2854373  -58.15333967  56.6532078  124.34330544  65.11421485]
Time taken is  0.0029134750366210938
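The diabetes data has only 442 rows, so the timing above is tiny. To get a feel for how the cost of one full-batch gradient step grows with the number of observations m, one could time it on synthetic data of increasing size. This is a sketch under assumptions: the helper `time_gradient_step`, the sizes, and the random data are made up for illustration, and the measured times will vary by machine.

```python
import time
import numpy as np

def time_gradient_step(m, d=10, repeats=20):
    """Average wall-clock time of one full-batch gradient computation on m rows."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(m, d))
    y = rng.normal(size=m)
    w = np.ones(d)
    b = 0.0
    start = time.perf_counter()
    for _ in range(repeats):
        residual = y - (X @ w + b)             # touches all m rows
        grad_w = -2.0 / m * (X.T @ residual)   # coefficient gradient
        grad_b = -2.0 * np.mean(residual)      # intercept gradient
    return (time.perf_counter() - start) / repeats

timings = {m: time_gradient_step(m) for m in (1_000, 10_000, 100_000)}
for m, seconds in timings.items():
    print(f"m={m:>7}: {seconds:.6f} s per gradient step")
```

`time.perf_counter()` is used here instead of `time.time()` because it is intended for measuring short durations.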
