# Gradient Descent: Types, Mathematics, and Multi-Variable Case
---

## 1. Introduction

Gradient Descent is an **optimization algorithm** used in many Machine Learning algorithms to reach an optimal solution.

---

## 2. What is Gradient Descent?

Gradient Descent is an **iterative optimization algorithm** used to minimize a loss function.

It is used in:
- Linear Regression
- Logistic Regression
- Neural Networks
- Deep Learning

Its goal is simple:  
**minimize the loss and reach the best possible parameters**.

---

## 3. Linear Regression Recap

For simple linear regression:

$$
y = mx + b
$$

Where:
- $m$ = slope  
- $b$ = intercept  

We do not know $m$ and $b$ initially.

So we:
1. Start with random values
2. Update them repeatedly
3. Reach optimal values step by step

---

## 4. Loss Function

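We use the squared-error loss, written here as the sum of squared residuals. Dividing it by $n$ gives the Mean Squared Error (MSE), which has the same minimizer: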

$$
L(m,b) = \sum_{i=1}^{n} (y_i - (mx_i + b))^2
$$

This loss:
- Depends on the parameters $m$ and $b$
- Is convex in $m$ and $b$
- Has a single global minimum (as long as the $x_i$ are not all equal)
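As a minimal sketch, the loss formula above can be evaluated directly for any candidate line. The function name `squared_error_loss` and the toy data are illustrative assumptions, not part of the original notes.

```python
def squared_error_loss(m, b, xs, ys):
    """Sum of squared residuals for the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

loss_at_truth = squared_error_loss(2.0, 1.0, xs, ys)  # the line that generated the data
loss_at_zero = squared_error_loss(0.0, 0.0, xs, ys)   # a poor guess
print(loss_at_truth, loss_at_zero)  # prints: 0.0 84.0
```

The loss is zero only at the parameters that fit the data exactly and grows as the guess moves away from them.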

---

## 5. Gradient Descent Update Rules

For slope $m$:

$$
m := m - \alpha \frac{\partial L}{\partial m}
$$

For intercept $b$:

$$
b := b - \alpha \frac{\partial L}{\partial b}
$$

Where:
- $\alpha$ is the learning rate

This equation **is Gradient Descent**.
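To see the update rule in isolation, here is a small sketch on a one-parameter stand-in loss $f(w) = (w - 3)^2$. The stand-in loss, the starting point, and the learning rate are assumptions chosen only for illustration.

```python
def grad(w):
    # Derivative of the stand-in loss f(w) = (w - 3)**2
    return 2.0 * (w - 3.0)


alpha = 0.1  # learning rate (assumed value)
w = 0.0      # arbitrary starting point

for _ in range(100):
    # Same form as the update rules above: parameter := parameter - alpha * derivative
    w = w - alpha * grad(w)

print(w)  # very close to 3.0, the minimizer of f
```

Each step moves `w` against the sign of the derivative, so the distance to the minimum shrinks by a constant factor (0.8 here) per iteration.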

---

## 6. Key Observation (Why Types Exist)

👉 **Mathematics remains the same**  
👉 **Only the amount of data used per update changes**

This single difference creates **three types of Gradient Descent**.

---

## 7. Types of Gradient Descent

1. **Batch Gradient Descent**
2. **Stochastic Gradient Descent**
3. **Mini-Batch Gradient Descent**

---

## 8. Batch Gradient Descent

### How It Works

- Uses the **entire dataset** for every update
- Updates parameters **once per epoch** (one full pass over the data)

If the dataset has 300 rows:
- The gradient is computed using all 300 rows
- One update is performed

### Mathematical Form

$$
\frac{\partial L}{\partial m} = -2 \sum_{i=1}^{n} x_i (y_i - (mx_i + b))
$$

$$
\frac{\partial L}{\partial b} = -2 \sum_{i=1}^{n} (y_i - (mx_i + b))
$$
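A minimal sketch of batch gradient descent using the two derivatives above might look like this. The function names, toy data, learning rate, and epoch count are assumptions for illustration.

```python
def batch_gradients(m, b, xs, ys):
    """Gradients of the summed squared error over the full dataset."""
    dm = -2.0 * sum(x * (y - (m * x + b)) for x, y in zip(xs, ys))
    db = -2.0 * sum(y - (m * x + b) for x, y in zip(xs, ys))
    return dm, db


def batch_gradient_descent(xs, ys, alpha=0.01, epochs=2000):
    m, b = 0.0, 0.0  # starting values (zeros here instead of random, for reproducibility)
    for _ in range(epochs):
        dm, db = batch_gradients(m, b, xs, ys)  # one full pass over the data
        m -= alpha * dm                         # one update per pass
        b -= alpha * db
    return m, b


# Hypothetical toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = batch_gradient_descent(xs, ys)
print(m, b)  # approaches m = 2, b = 1
```

Because the gradients are sums over all rows, the learning rate has to shrink as the dataset grows; averaging the gradient over the rows is a common way to avoid that.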

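### Pros and Cons

- ✅ Stable: every update uses the exact gradient of the full loss
- ❌ Slow on large datasets: only one update per full pass
- ❌ High computation and memory cost per update

Best suited to **small** datasets.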

---

## 9. Stochastic Gradient Descent (SGD)

### How It Works

- Uses **one data point** per update
- Updates parameters after every row

If the dataset has 300 rows:
- 300 updates per epoch

### Mathematical Form (Single Sample)

$$
\frac{\partial L}{\partial m} = -2 x_i (y_i - (mx_i + b))
$$

$$
\frac{\partial L}{\partial b} = -2 (y_i - (mx_i + b))
$$
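The single-sample derivatives above can be sketched as the loop below. Shuffling the rows each epoch, the fixed seed, and all names and hyperparameters are assumptions made for this illustration.

```python
import random


def sgd(xs, ys, alpha=0.01, epochs=1000, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the rows in a new random order each epoch
        for i in indices:
            residual = ys[i] - (m * xs[i] + b)
            # Single-sample gradients: -2 * x_i * residual and -2 * residual
            m -= alpha * (-2.0 * xs[i] * residual)
            b -= alpha * (-2.0 * residual)
            updates += 1
    return m, b, updates


# Hypothetical noise-free toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b, updates = sgd(xs, ys)
print(m, b, updates)
```

With 4 rows and 1000 epochs the loop performs 4000 updates. On noise-free data like this the estimates settle at the exact line; on noisy data they would keep fluctuating around the minimum unless the learning rate is reduced over time.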

### Pros and Cons

- ✅ Very fast updates (one row each)
- ❌ Noisy gradient estimates
- ❌ Oscillates around the minimum

Suited to **very large datasets**.

---

## 10. Mini-Batch Gradient Descent

### Why It Exists

- Batch → slow  
- SGD → unstable  

Mini-batch balances both.

### How It Works

- The dataset is split into batches
- Typical batch sizes: 16, 32, 64, etc.

If the dataset has 300 rows and batch size = 30:
- 10 updates per epoch

### Mathematical Form

$$
\frac{\partial L}{\partial m} = -2 \sum_{i \in \text{batch}} x_i (y_i - (mx_i + b))
$$

$$
\frac{\partial L}{\partial b} = -2 \sum_{i \in \text{batch}} (y_i - (mx_i + b))
$$

This is the **most used** method in practice.
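A rough sketch of the mini-batch variant follows, summing the gradient over each batch as in the formulas above. The function name, batch size, seed, and toy data are illustrative assumptions.

```python
import random


def minibatch_gd(xs, ys, batch_size=2, alpha=0.01, epochs=1000, seed=0):
    rng = random.Random(seed)
    m, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients summed over the rows of this batch only
            dm = -2.0 * sum(xs[i] * (ys[i] - (m * xs[i] + b)) for i in batch)
            db = -2.0 * sum(ys[i] - (m * xs[i] + b) for i in batch)
            m -= alpha * dm
            b -= alpha * db
            updates += 1
    return m, b, updates


# Hypothetical noise-free toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b, updates = minibatch_gd(xs, ys)
print(m, b, updates)
```

With 4 rows and a batch size of 2 there are 2 updates per epoch, so 1000 epochs give 2000 updates. Setting `batch_size` to 1 or to `len(xs)` recovers the stochastic and batch variants respectively.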

---

## 11. Summary of Gradient Descent Types

| Type | Data per Update | Speed | Stability |
|----|----|----|----|
| Batch | Full dataset | Slow | High |
| SGD | One row | Fast | Low |
| Mini-Batch | Small batch | Fast | Balanced |
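The "data per update" column translates directly into the number of updates per epoch. The helper below is a hypothetical illustration using the 300-row example from the sections above; rounding up assumes a final partial batch still triggers an update.

```python
import math


def updates_per_epoch(n_rows, batch_size):
    # Round up so a final partial batch still counts as one update (assumption)
    return math.ceil(n_rows / batch_size)


n_rows = 300
batch_updates = updates_per_epoch(n_rows, n_rows)  # batch: whole dataset per update
sgd_updates = updates_per_epoch(n_rows, 1)         # SGD: one row per update
mini_updates = updates_per_epoch(n_rows, 30)       # mini-batch of 30 rows
print(batch_updates, sgd_updates, mini_updates)    # prints: 1 300 10
```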

---

## 12. Multiple Linear Regression Case

When the data has $n$ features:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

Number of parameters ($n$ coefficients plus the intercept $\beta_0$):

$$
\text{Total parameters} = n + 1
$$
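As a small sketch of the model above with two features, a prediction is the intercept plus a dot product. The parameter and feature values are made up for illustration.

```python
import numpy as np

# Hypothetical parameters: beta_0 (intercept), beta_1, beta_2
beta = np.array([1.0, 2.0, -1.0])
# One sample with n = 2 features
x = np.array([3.0, 4.0])

# y = beta_0 + beta_1 * x_1 + beta_2 * x_2
y_hat = beta[0] + np.dot(beta[1:], x)
print(y_hat)      # 1 + 2*3 - 1*4 = 3.0
print(beta.size)  # n + 1 = 3 parameters
```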

---

## 13. Loss Function (Multi-Variable)

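$$
L = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

Here $N$ is the number of samples (distinct from $n$, the number of features) and $\hat{y}_i$ is the prediction for sample $i$.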

---

## 14. Gradient Vector

$$
\nabla L =
\begin{bmatrix}
\frac{\partial L}{\partial \beta_0} \\
\frac{\partial L}{\partial \beta_1} \\
\vdots \\
\frac{\partial L}{\partial \beta_n}
\end{bmatrix}
$$

Each parameter has its **own derivative**.

---

## 15. General Gradient Formula

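For any parameter $\beta_j$:

$$
\frac{\partial L}{\partial \beta_j}
= -2 \sum_{i=1}^{N} (y_i - \hat{y}_i) x_{ij}
$$

Here $N$ is the number of samples, $x_{ij}$ is the value of feature $j$ for sample $i$, and $x_{i0} = 1$ so that the same formula covers the intercept $\beta_0$.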
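The per-parameter formula can be written as two explicit loops, one over parameters and one over samples. This is a deliberately slow sketch for clarity; the function name and toy data are assumptions, and `X` is assumed to carry a leading column of ones so that `beta[0]` plays the role of the intercept.

```python
import numpy as np


def gradient_loop(X, y, beta):
    """Gradient of the summed squared error, one parameter at a time."""
    n_samples, n_params = X.shape
    grad = np.zeros(n_params)
    for j in range(n_params):
        for i in range(n_samples):
            y_hat_i = np.dot(X[i], beta)  # prediction for sample i
            grad[j] += -2.0 * (y[i] - y_hat_i) * X[i, j]
    return grad


# Hypothetical toy data from y = 1 + 2x; the first column of X is all ones
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

grad_at_zero = gradient_loop(X, y, np.zeros(2))
grad_at_truth = gradient_loop(X, y, np.array([1.0, 2.0]))
print(grad_at_zero)   # [-18. -26.]
print(grad_at_truth)  # both entries are zero at the exact fit
```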

---

## 16. Matrix Form (Efficient Implementation)

Prediction (with a leading column of ones in $X$ so that $\beta_0$ acts as the intercept):

$$
\hat{y} = X\beta
$$

Gradient:

$$
\nabla L = -2 X^T (y - X\beta)
$$
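One way to sanity-check the matrix form is to compare it against a finite-difference estimate of the gradient. The sketch below does that on random data; the helper names, the seed, and the data shapes are assumptions for illustration.

```python
import numpy as np


def gradient_matrix(X, y, beta):
    # Vectorized gradient of sum((y - X @ beta)**2)
    return -2.0 * X.T @ (y - X @ beta)


def numerical_gradient(X, y, beta, eps=1e-6):
    """Central-difference estimate of the same gradient."""
    def loss(b):
        return np.sum((y - X @ b) ** 2)

    grad = np.zeros_like(beta)
    for j in range(beta.size):
        step = np.zeros_like(beta)
        step[j] = eps
        grad[j] = (loss(beta + step) - loss(beta - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
# 5 samples, 2 features, plus a leading column of ones for the intercept
X = np.hstack([np.ones((5, 1)), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
beta = rng.normal(size=3)

analytic = gradient_matrix(X, y, beta)
numeric = numerical_gradient(X, y, beta)
print(np.allclose(analytic, numeric, atol=1e-4))
```

The vectorized version computes all partial derivatives in a single matrix product instead of looping over parameters and samples.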

This allows **vectorized computation**.

---

## 17. Final Takeaways

- Same math, different execution
- Mini-batch is most practical
- Learning rate controls convergence
- Gradient Descent scales to any number of features
- Foundation of Machine Learning optimization

---


```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load the diabetes dataset as a feature matrix X and a target vector y
X, y = load_diabetes(return_X_y=True)
print(X.shape)
print(y.shape)
```

```text
(442, 10)
(442,)
```


```python
# Hold out 20% of the rows for testing, then fit an ordinary least squares baseline
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)
reg = LinearRegression()
reg.fit(X_train, y_train)
```


```python
# Coefficients and intercept found by scikit-learn
print(reg.coef_)
print(reg.intercept_)
```

```text
[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167
```


```python
# R^2 of the scikit-learn baseline on the held-out rows
y_pred = reg.predict(X_test)
print(r2_score(y_test, y_pred))
```

```text
0.4399338661568968
```

```python
class MYGDRegressor:
    """Linear regression trained with batch gradient descent on the MSE loss."""

    def __init__(self, learning_rate=0.01, epochs=90):
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        # Initialize the intercept to 0 and every coefficient to 1
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])

        for _ in range(self.epochs):
            # Predictions for the whole training set (one batch update per epoch)
            y_hat = np.dot(X_train, self.coef_) + self.intercept_

            # Gradient of the MSE with respect to the intercept
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)

            # Gradient of the MSE with respect to the coefficients (averaged over rows)
            coef_der = -2 * np.dot((y_train - y_hat), X_train) / X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_
```

```python
# Train the gradient descent regressor; fit() prints the learned intercept and coefficients
gdr = MYGDRegressor(epochs=1000, learning_rate=0.5)
gdr.fit(X_train, y_train)
```

```text
152.01351687661833 [  14.38990585 -173.7235727   491.54898524  323.91524824  -39.32648042
 -116.01061213 -194.04077415  103.38135565  451.63448787   97.57218278]
```


```python
# R^2 of the gradient descent regressor on the held-out rows
y_pred = gdr.predict(X_test)
print(r2_score(y_test, y_pred))
```

```text
0.4534503034722803
```