
## 🧠 Batch Gradient Descent — Quick Summary

- **Goal**: Minimize the loss function (usually MSE) by updating model parameters.
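- **Batch**: Uses all of the training data at each step to compute the gradients (the slopes of the loss with respect to each parameter).
- **Update Rule**: Move each parameter a small step, scaled by the learning rate, in the direction opposite to its gradient so that the error decreases.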

---

### ✅ Key Traits:
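- Stable direction: every step uses the exact gradient of the training loss, with no sampling noise.
- Slower on large datasets (needs a pass over the full dataset per update).
- Requires more memory (the whole dataset must be available for each gradient computation).

---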
## Example: Batch Gradient Descent for Linear Regression

### Input and Output  
Input features → `x1`, `x2`  
Output target → `y1`

### Dataset:

| x1 | x2 | y1 |
|----|----|----|
| 1  | 2  | 3  |
| 2  | 3  | 5  |

---

## Loss Function → Mean Squared Error (MSE)

$$
\text{Loss} = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{true}_i} - y_{\text{pred}_i})^2
$$

For the two-row dataset above, n = 2, so the factor in front is 1/2:

$$
\text{Loss} = \frac{1}{2} \left[ (y_{\text{true}_1} - y_{\text{pred}_1})^2 + (y_{\text{true}_2} - y_{\text{pred}_2})^2 \right]
$$

---

## Prediction Function

$$
y_{\text{pred}_i} = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2}
$$

$$
y_{\text{pred}_1} = \beta_0 + \beta_1 \cdot x_{11} + \beta_2 \cdot x_{12}
$$

$$
y_{\text{pred}_2} = \beta_0 + \beta_1 \cdot x_{21} + \beta_2 \cdot x_{22}
$$
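
As a quick illustration of the prediction function, the sketch below evaluates it for both rows of the toy dataset at once. The parameter values are assumptions picked for the example (intercept 0, both coefficients 1); they are not given in the notes above.

```python
import numpy as np

# Toy dataset from the table above: columns are x1, x2
X = np.array([[1.0, 2.0],
              [2.0, 3.0]])

# Assumed parameter values, for illustration only
beta_0 = 0.0
betas = np.array([1.0, 1.0])  # beta_1, beta_2

# y_pred_i = beta_0 + beta_1 * x_i1 + beta_2 * x_i2, evaluated for every row
y_pred = beta_0 + X @ betas
print(y_pred)  # [3. 5.]
```

With these particular values the predictions happen to match the targets in the table exactly, because each `y1` equals `x1 + x2`.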

---

## Loss Function in Terms of Parameters

$$
\text{Loss} = \frac{1}{2} \left[ (y_{\text{true}_1} - \beta_0 - \beta_1 x_{11} - \beta_2 x_{12})^2 + (y_{\text{true}_2} - \beta_0 - \beta_1 x_{21} - \beta_2 x_{22})^2 \right]
$$
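
The loss can also be evaluated numerically for any choice of parameters. The sketch below does that for the toy dataset; `mse_loss` is a hypothetical helper name, and the two parameter settings are assumptions chosen to give easy numbers.

```python
import numpy as np

# Toy dataset from the table above
X = np.array([[1.0, 2.0],
              [2.0, 3.0]])
y_true = np.array([3.0, 5.0])

def mse_loss(beta_0, betas, X, y_true):
    """Mean of the squared residuals of a linear model (hypothetical helper)."""
    residuals = y_true - (beta_0 + X @ betas)
    return np.mean(residuals ** 2)

# All parameters at zero: predictions are 0, so the loss is (3**2 + 5**2) / 2
loss_at_zero = mse_loss(0.0, np.zeros(2), X, y_true)
print(loss_at_zero)  # 17.0

# Intercept 0 and both coefficients 1 reproduce the targets exactly
loss_at_ones = mse_loss(0.0, np.ones(2), X, y_true)
print(loss_at_ones)  # 0.0
```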

---

## Gradient Descent Update Rules

$$
\beta_0 = \beta_0 - \text{learning\_rate} \cdot \text{slope}_{\beta_0}
$$

$$
\beta_1 = \beta_1 - \text{learning\_rate} \cdot \text{slope}_{\beta_1}
$$

$$
\beta_2 = \beta_2 - \text{learning\_rate} \cdot \text{slope}_{\beta_2}
$$
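
To make the update rule concrete, here is a minimal sketch of a single step on the toy dataset. It uses the slope formulas derived in the next section; the learning rate of 0.05 and the all-zero starting point are assumptions made only for this example.

```python
import numpy as np

# Toy dataset from the table above
X = np.array([[1.0, 2.0],
              [2.0, 3.0]])
y_true = np.array([3.0, 5.0])

learning_rate = 0.05  # assumed value, for illustration only
beta_0 = 0.0          # assumed starting point
betas = np.zeros(2)   # beta_1, beta_2

# Slopes of the MSE loss, computed from all rows at once
residuals = y_true - (beta_0 + X @ betas)      # [3., 5.]
slope_beta_0 = -2.0 * residuals.mean()         # -8.0
slope_betas = -2.0 * (residuals @ X) / len(X)  # [-13., -21.]

# One update: step against each slope
beta_0 = beta_0 - learning_rate * slope_beta_0
betas = betas - learning_rate * slope_betas

print(beta_0, betas)  # approximately 0.4 and [0.65, 1.05]
```

Repeating this step many times is all that batch gradient descent does; each repetition recomputes the slopes from the full dataset.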

---
## Partial Derivatives (Gradients)

---

### For β₀:

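Apply the chain rule to each squared term; the inner derivative of the residual with respect to β₀ is -1:

$$
\frac{\partial \text{Loss}}{\partial \beta_0} = \frac{1}{2} \cdot \left[ 2(y_{\text{true}_1} - y_{\text{pred}_1})(-1) + 2(y_{\text{true}_2} - y_{\text{pred}_2})(-1) \right]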
$$

Simplify:

$$
= (\frac{-2}{2}) \cdot \left[ (y_{\text{true}_1} - y_{\text{pred}_1}) + (y_{\text{true}_2} - y_{\text{pred}_2}) \right]
$$

General form for n samples (here n = 2):

$$
\text{slope}_{\beta_0} = \frac{\partial \text{Loss}}{\partial \beta_0} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i})
$$

---

### For β₁:

Start with:

$$
\frac{\partial \text{Loss}}{\partial \beta_1} = \frac{1}{2} \cdot \left[ 2(y_{\text{true}_1} - y_{\text{pred}_1})(-x_{11}) + 2(y_{\text{true}_2} - y_{\text{pred}_2})(-x_{21}) \right]
$$

Simplify:

$$
= (\frac{-2}{2}) \cdot \left[ (y_{\text{true}_1} - y_{\text{pred}_1})x_{11} + (y_{\text{true}_2} - y_{\text{pred}_2})x_{21} \right]
$$

General form:

$$
\text{slope}_{\beta_1} = \frac{\partial \text{Loss}}{\partial \beta_1} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i}) \cdot x_{i1}
$$

---

### For β₂:

Start with:

$$
\frac{\partial \text{Loss}}{\partial \beta_2} = \frac{1}{2} \cdot \left[ 2(y_{\text{true}_1} - y_{\text{pred}_1})(-x_{12}) + 2(y_{\text{true}_2} - y_{\text{pred}_2})(-x_{22}) \right]
$$

Simplify:

$$
= (\frac{-2}{2}) \cdot \left[ (y_{\text{true}_1} - y_{\text{pred}_1})x_{12} + (y_{\text{true}_2} - y_{\text{pred}_2})x_{22} \right]
$$

General form:

$$
\text{slope}_{\beta_2} = \frac{\partial \text{Loss}}{\partial \beta_2} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i}) \cdot x_{i2}
$$

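### For any parameter βⱼ (j ≥ 1):

$$
\text{slope}_{\beta_j} = \frac{\partial \text{Loss}}{\partial \beta_j} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i}) \cdot x_{ij}
$$

The β₀ slope is the same expression with the feature value replaced by 1 for every sample.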
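
One way to gain confidence in a derived gradient is to compare it with a numerical estimate. The sketch below does this on the toy dataset using central finite differences. The helper names `mse_loss` and `mse_gradients`, the test point, and the step size `eps` are all assumptions introduced for this check.

```python
import numpy as np

def mse_loss(beta_0, betas, X, y_true):
    """Mean squared error of a linear model (hypothetical helper)."""
    return np.mean((y_true - (beta_0 + X @ betas)) ** 2)

def mse_gradients(beta_0, betas, X, y_true):
    """Analytic slopes from the formulas above (hypothetical helper)."""
    n = X.shape[0]
    residuals = y_true - (beta_0 + X @ betas)
    slope_beta_0 = -2.0 / n * residuals.sum()
    slope_betas = -2.0 / n * (X.T @ residuals)
    return slope_beta_0, slope_betas

# Toy dataset from the earlier example and an arbitrary test point (assumed)
X = np.array([[1.0, 2.0], [2.0, 3.0]])
y_true = np.array([3.0, 5.0])
beta_0, betas = 0.5, np.array([0.1, -0.2])

analytic_b0, analytic_betas = mse_gradients(beta_0, betas, X, y_true)

# Central finite differences: nudge one parameter at a time by +/- eps
eps = 1e-6
numeric_b0 = (mse_loss(beta_0 + eps, betas, X, y_true)
              - mse_loss(beta_0 - eps, betas, X, y_true)) / (2 * eps)
numeric_betas = np.zeros_like(betas)
for j in range(len(betas)):
    step = np.zeros_like(betas)
    step[j] = eps
    numeric_betas[j] = (mse_loss(beta_0, betas + step, X, y_true)
                        - mse_loss(beta_0, betas - step, X, y_true)) / (2 * eps)

print(analytic_b0, numeric_b0)
print(analytic_betas, numeric_betas)
```

If the analytic and numeric values agree to several decimal places, the sign and the 2/n factor in the formulas are very likely right.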

In [8]:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

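# Load the diabetes regression dataset (10 features) and hold out 20% for testing
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

# Reference fit: ordinary least squares via scikit-learn
reg = LinearRegression()
reg.fit(X_train, y_train)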

In [9]:
print(reg.intercept_)
print(reg.coef_)

151.88331005254167
[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]


In [10]:
y_pred = reg.predict(X_test)
# R^2 measures how much of the variance in the target the model explains (1.0 is a perfect fit).
r2_score(y_test, y_pred)

0.4399338661568968
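
For reference, R² is defined as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the targets from their mean. The sketch below computes it by hand with NumPy; `r2_manual` is a hypothetical helper and the small arrays are made-up values, not the diabetes data.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 7.0, 9.5]

score = r2_manual(y_true_demo, y_pred_demo)
print(score)  # 1 - 0.75 / 20, about 0.9625
```

A model that always predicts the mean of the targets scores 0, and a model worse than that scores below 0.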

## 🌟 Main Gradient Formula (Used in Code)

> ✅ These are the **core formulas** the code below uses: the prediction, then the gradient for the intercept and for each coefficient.

$$
\boxed{
y_{\text{pred}_i} = \beta_0 + \beta_1 \cdot x_{i1} + \beta_2 \cdot x_{i2}
}
$$

$$
\boxed{
\text{slope}_{\beta_0} = \frac{\partial \text{Loss}}{\partial \beta_0} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i})
}
$$

$$
\boxed{
\text{slope}_{\beta_j} = \frac{\partial \text{Loss}}{\partial \beta_j} = \frac{-2}{n} \sum (y_{\text{true}_i} - y_{\text{pred}_i}) \cdot x_{ij}
}
$$




In [None]:
class BatchGradientDescent:

    def __init__(self, learning_rate=0.01, n_epochs=1000):
        self.coefficients_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.n_epochs = n_epochs

    def fit(self, X_train, y_train):
        n_samples, n_features = X_train.shape
        self.intercept_ = 0.0  # beta_0
        self.coefficients_ = np.ones(n_features)  # beta_1, beta_2, ...

        for _ in range(self.n_epochs):
            # Predictions for the whole training set (this is what makes it "batch")
            y_pred = np.dot(X_train, self.coefficients_) + self.intercept_
            residuals = y_train - y_pred

            # Gradients of the MSE loss with respect to beta_0 and beta_1...n
            slope_intercept = -2 * np.mean(residuals)
            slope_coefficients = -2 * np.dot(residuals, X_train) / n_samples

            # Updates
            self.intercept_ -= self.lr * slope_intercept
            self.coefficients_ -= self.lr * slope_coefficients

        print("Intercept (beta_0):", self.intercept_)
        print("Coefficients (beta_1...n):", self.coefficients_)

    def predict(self, X_test):
        return np.dot(X_test, self.coefficients_) + self.intercept_


In [12]:
batchGD = BatchGradientDescent(learning_rate=0.5, n_epochs=400)
batchGD.fit(X_train, y_train)

Intercept (beta_0): 152.11288551486123
Coefficients (beta_1...n): [  52.55710163  -79.44866568  371.8428852   260.53498791   10.5529389
  -39.77223864 -178.29521593  128.54177501  336.97752534  128.82726177]


In [13]:
y_pred = batchGD.predict(X_test)
r2_score(y_test, y_pred)

0.43934402285761986
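
One way to sanity-check an update loop like the one in `BatchGradientDescent` is to run it on synthetic data whose true parameters are known and to record the loss at each epoch. The sketch below does that. The function `batch_gradient_descent`, the synthetic data, the seed, and the hyperparameters are all assumptions made for this check and are not part of the notebook above.

```python
import numpy as np

def batch_gradient_descent(X, y, learning_rate=0.1, n_epochs=500):
    """Hypothetical helper: the same update loop as the class, plus a loss history."""
    n_samples, n_features = X.shape
    intercept, coefs = 0.0, np.ones(n_features)
    losses = []
    for _ in range(n_epochs):
        residuals = y - (X @ coefs + intercept)
        losses.append(np.mean(residuals ** 2))  # loss before this update
        intercept -= learning_rate * (-2.0 * residuals.mean())
        coefs -= learning_rate * (-2.0 * (residuals @ X) / n_samples)
    return intercept, coefs, losses

# Noise-free synthetic data with known parameters (assumed values)
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
true_intercept, true_coefs = 4.0, np.array([2.0, -1.0, 0.5])
y_syn = true_intercept + X_syn @ true_coefs

intercept, coefs, losses = batch_gradient_descent(X_syn, y_syn)
print(intercept, coefs)       # should be close to 4.0 and [2.0, -1.0, 0.5]
print(losses[0], losses[-1])  # the loss should shrink toward 0
```

Plotting `losses` against the epoch number is a simple way to see whether a chosen learning rate converges, stalls, or diverges.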