<h1 style="font-family: 'Arial', sans-serif; font-size: 2.5em; color: #2C3E50; text-align: center; text-shadow: 2px 2px 4px #BDC3C7; background-color:#ECDFCC">
    Mastering Stochastic Gradient Descent: An Optimization Journey 🚀
</h1>


#### <span style="color:yellow;font-size:30px">Stochastic Gradient Descent (SGD)</span>

**Stochastic Gradient Descent (SGD)** is a type of optimization algorithm used to minimize a function (like a cost or loss function) by iteratively updating the model's parameters. It is an efficient way to find the optimal parameters for machine learning models, especially for large datasets.

#### <span style="color:yellow;font-size:30px">How it Works:</span>
1. **Gradient Descent**: In gradient descent, we update the model's parameters (like weights) by calculating the gradient of the loss function. This helps us know in which direction to adjust the parameters to minimize the error.
2. **Stochastic**: Unlike normal gradient descent that uses the entire dataset at once (Batch Gradient Descent), **SGD** picks **one random data point** (or a small batch) from the dataset at each iteration to update the parameters.
3. **Plain SGD**: Each pass over the data (one epoch) performs as many updates as there are rows. For example, with 31 rows there are 31 updates per epoch.

#### Steps of SGD:
1. Randomly select one data point (or a small subset).
2. Calculate the gradient (slope) of the loss function for that data point.
3. Update the parameters based on that gradient.
4. Repeat for each iteration until the model converges to an optimal solution.
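
The four steps above can be sketched on a toy problem. This is a minimal illustration, not the notebook's model: the data (`y = 3x + 2` with no noise), the seed, the learning rate, and the variable names are all assumptions chosen so the result is easy to check.

```python
import numpy as np

# Toy data (an assumption for illustration): y = 3x + 2 with no noise.
x = np.linspace(0.0, 1.0, 20)
y = 3.0 * x + 2.0

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
w, b = 0.0, 0.0                 # slope and intercept start at zero
lr = 0.1                        # learning rate (step size)

# Step 4: repeat the single-point update many times (200 passes' worth).
for _ in range(200 * len(x)):
    i = rng.integers(len(x))         # step 1: pick one random data point
    error = y[i] - (w * x[i] + b)    # residual for that point only
    grad_w = -2.0 * error * x[i]     # step 2: gradient of the squared error
    grad_b = -2.0 * error
    w -= lr * grad_w                 # step 3: move against the gradient
    b -= lr * grad_b

print(round(w, 3), round(b, 3))
```

Because the toy data is noise-free, the learned slope and intercept should land very close to the true values 3 and 2.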

 <span style="color:yellow;font-size:30px">Benefits of Stochastic Gradient Descent</span>

1. **Fast and Efficient** 🚀
   - Each update looks at a single data point, so one update is far cheaper than in Batch Gradient Descent, which must process the whole dataset first. This matters most with large datasets.
   
2. **Handles Large Datasets Well** 📊
   - With very large datasets, SGD can be more memory-efficient because it doesn’t require the entire dataset to be loaded into memory.

3. **Escapes Local Minima** 🏔️
   - SGD has a bit of randomness (since it picks random data points), which helps it escape from local minima in complex error surfaces and find a better solution.

<span style="color:yellow;font-size:30px"> Disadvantages of Stochastic Gradient Descent</span>

1. **Noisy Updates** ⚡
   - Since each update is based on a single data point, the updates can be noisy and can lead to fluctuations in the loss function, making it harder to converge.

2. **May Overshoot the Optimal Point** 🎯
   - With high learning rates, SGD may overshoot the optimal solution because of the noisy updates, making it harder to reach the lowest error.

3. **Slower Convergence** ⏳
   - Although each iteration is cheap, SGD often needs more iterations to reach a good solution than batch gradient descent, which converges more smoothly.
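
The overshooting problem can be seen on a one-dimensional toy function. The sketch below uses plain (non-stochastic) gradient descent on `f(w) = w**2` as a simplifying assumption; the function name `gradient_descent` and the two learning rates are hypothetical choices for illustration. The same step-size effect applies to SGD, with noise added on top.

```python
def gradient_descent(lr, steps=50, w=1.0):
    """Minimise f(w) = w**2, whose gradient is 2*w, starting from w."""
    for _ in range(steps):
        w = w - lr * 2 * w
    return w

small_step = gradient_descent(lr=0.1)   # each step shrinks w by a factor of 0.8
large_step = gradient_descent(lr=1.1)   # each step flips the sign and grows |w| by 1.2

print(abs(small_step), abs(large_step))
```

With the small learning rate the value settles near the minimum at 0; with the large one every step jumps past the minimum and ends up further away than it started.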

### <span style="color:yellow;font-size:30px"> <b>Example:</b> Why Use SGD?</span>
Imagine you are training a machine learning model on millions of images. With **Batch Gradient Descent**, you must compute the loss over every one of those images before making a single update. This is slow and requires a lot of memory.

**With SGD**, you look at one image at a time, compute the loss, and update the model. Training becomes much faster and more efficient, even if the updates are a little noisy.
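
One way to see why a single data point is enough for a useful update: the average of the per-row gradients equals the batch gradient, so one random row gives a cheap, noisy estimate of it. The sketch below assumes a linear model with squared error and synthetic data; the helper `sample_gradient` is a hypothetical name used only here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # synthetic features (illustrative only)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)
w = np.zeros(3)                         # current weights

def sample_gradient(i):
    """Gradient of the squared error for row i alone."""
    return -2.0 * (y[i] - X[i] @ w) * X[i]

# Batch gradient: needs every row before a single update can happen.
batch_grad = -2.0 * X.T @ (y - X @ w) / len(X)

# Averaging the per-row gradients reproduces the batch gradient.
mean_of_samples = np.mean([sample_gradient(i) for i in range(len(X))], axis=0)

print(np.allclose(batch_grad, mean_of_samples))
```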

### <span style="color:yellow;font-size:30px"> Summary Table</span>

| **Benefits**              | **Disadvantages**             |
|---------------------------|-------------------------------|
| Fast and efficient         | Noisy updates                 |
| Good for large datasets    | Can overshoot the optimal point|
| Can escape local minima    | Slower convergence in some cases|

In simple terms, **SGD** is like learning by trial and error — updating your knowledge after each small lesson, rather than waiting to gather all lessons at once!

====================================



### **Why Gradient Descent in Linear Regression?**
In **linear regression**, the goal is to find the best-fitting line that minimizes the error between predicted and actual values. The error is usually measured using a loss function like **Mean Squared Error (MSE)**. Gradient descent helps us:
- **Minimize the loss function**: It adjusts the model's parameters (slope and intercept) iteratively, moving towards the direction where the error is the least.
- **Efficiency**: In large datasets, calculating exact solutions using methods like the normal equation becomes computationally expensive, especially if there are many features. Gradient descent is faster and scalable for such scenarios.
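
As a sketch of what gradient descent computes here, assume the MSE loss named above and a linear model $\hat{y}_i = \mathbf{w} \cdot \mathbf{x}_i + b$. The symbols $\mathbf{w}$, $b$, $n$ and the learning rate $\eta$ are notation chosen for this sketch:

$$
L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
$$

$$
\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right), \qquad
\frac{\partial L}{\partial \mathbf{w}} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right) \mathbf{x}_i
$$

$$
b \leftarrow b - \eta \, \frac{\partial L}{\partial b}, \qquad
\mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{\partial L}{\partial \mathbf{w}}
$$

Each update moves the parameters a small step against the gradient, which is the direction in which the error decreases fastest.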
  
### **Types of Gradient Descent**:
1. **Batch Gradient Descent**:
   - It uses the entire dataset to compute the gradient at every iteration.
   - **Advantage**: More stable convergence.
   - **Disadvantage**: Can be slow with large datasets since it needs to process the entire dataset in each iteration.

2. **Stochastic Gradient Descent (SGD)**:
   - It updates the parameters for each individual data point, rather than using the entire dataset.
   - **Advantage**: Faster updates and better for large datasets.
   - **Disadvantage**: It introduces more noise in each update, which can cause it to be less stable than batch gradient descent.

3. **Mini-Batch Gradient Descent**:
   - It is a combination of the above two. It updates the parameters using a small random subset (mini-batch) of the data.
   - **Advantage**: It balances the trade-offs: updates are more stable than in stochastic gradient descent and cheaper than in full batch gradient descent.
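
The mini-batch variant can be sketched as follows. The synthetic data, seed, learning rate, and batch size are assumptions made for illustration, and rows are reshuffled each epoch (one common choice, not the only one).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # synthetic features (illustrative only)
y = X @ np.array([2.0, -1.0]) + 0.5      # noise-free target, intercept 0.5

w, b = np.zeros(2), 0.0
lr, batch_size = 0.05, 16                # hypothetical settings for this sketch

for epoch in range(100):
    order = rng.permutation(len(X))      # reshuffle the rows every epoch
    for start in range(0, len(X), batch_size):
        rows = order[start:start + batch_size]
        error = y[rows] - (X[rows] @ w + b)
        # Average the gradient over the mini-batch only.
        w -= lr * (-2.0 * X[rows].T @ error / len(rows))
        b -= lr * (-2.0 * error.mean())

print(np.round(w, 3), round(b, 3))
```

In this sketch, setting `batch_size = 1` gives one update per row, as in SGD, and setting `batch_size = len(X)` gives one update per epoch, as in batch gradient descent.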

### **Why Not Use Gradient Descent in Other Models?**
In some other models, we don’t always need gradient descent because of different optimization requirements:

1. **Linear Regression with Normal Equation**:
   - For **small datasets**, linear regression can be solved directly using the **normal equation**, which finds the exact solution without iterative optimization. Gradient descent is unnecessary in such cases.

2. **Logistic Regression**:
   - Logistic regression is still optimized with gradient-based methods because it has no closed-form solution. The difference is the loss: it minimizes the **logarithmic loss (log loss)**, not the mean squared error.

3. **Tree-based Models (e.g., Decision Trees, Random Forests, XGBoost)**:
   - Decision trees and random forests don’t require gradient descent: they are built by recursively splitting the dataset on feature values according to decision rules, not by continuously adjusting parameters as linear models do.
   - Gradient-boosted trees such as XGBoost do use gradients of the loss to decide what each new tree should fit, but they do not run gradient descent over a fixed set of weights.

4. **Support Vector Machines (SVM)**:
   - SVMs are classically trained with **quadratic programming**, which does not need gradient descent. (Linear SVMs can also be trained with SGD on the hinge loss.)

5. **K-Nearest Neighbors (KNN)**:
   - KNN doesn’t involve model training, so there’s no optimization step, hence no need for gradient descent.
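
The normal equation from item 1 can be sketched in a few lines of NumPy. The data here is synthetic and noise-free (an assumption, so the exact coefficients are known in advance), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # synthetic features (illustrative only)
y = X @ np.array([1.5, -2.0, 0.7]) + 4.0          # noise-free target, intercept 4.0

# Prepend a column of ones so the intercept is learned as an extra weight.
X_b = np.column_stack([np.ones(len(X)), X])

# Normal equation: solve (X^T X) theta = X^T y in one step, no iterations.
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

intercept, coef = theta[0], theta[1:]
print(round(intercept, 3), np.round(coef, 3))
```

There is no learning rate and no loop: the solution comes from solving one linear system, which is why this route is attractive when the dataset and the number of features are small.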

### **Conclusion**:
- **Gradient Descent** is used when a model has parameters that can be continuously adjusted to minimize error, like in linear and logistic regression.
- In **tree-based** or **non-parametric models** like KNN, we don't need gradient descent because their optimization processes or algorithms are different.




## Acknowledgments

I would like to express my gratitude to [**Sir Nitish Singh**](https://www.linkedin.com/in/nitish-singh-03412789/?originalSubdomain=in) for his invaluable guidance and support throughout this project. Special thanks to the [Campus X](https://www.youtube.com/watch?v=ZftI2fEz0Fw&list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH) channel for providing excellent resources and tutorials that helped enhance my understanding of the concepts discussed in this notebook.



<h1 style="color:yellow;font-size:60px">Step 1: Import Libraries and Dataset</h1>

In [1]:
from sklearn.datasets import load_diabetes
# from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import time

In [26]:
X,y = load_diabetes(return_X_y=True)

In [3]:
print(X.shape)
print(y.shape)

(442, 10)
(442,)


<h1 style="color:yellow;font-size:50px">Step 2: Modeling</h1>

In [27]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2)

In [5]:
reg = LinearRegression()
reg.fit(X_train,y_train)

In [6]:
print(reg.coef_)
print(reg.intercept_)

[  -9.15865318 -205.45432163  516.69374454  340.61999905 -895.5520019
  561.22067904  153.89310954  126.73139688  861.12700152   52.42112238]
151.88331005254167


In [7]:
y_pred = reg.predict(X_test)
r2_score(y_test,y_pred)

0.4399338661568969

>__Note__: This section simply splits the data, fits a `LinearRegression` model, and evaluates it to get a baseline R² score.
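
For readers unsure what the R² score measures, here is a minimal sketch of the standard definition (1 minus the ratio of residual to total sum of squares). The helper name `r2_by_hand` and the four sample values are made up for illustration; in the notebook itself `r2_score` from sklearn does this work.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
print(r2_by_hand(y_true, y_true))                # perfect predictions give 1.0
print(r2_by_hand(y_true, [6.0, 6.0, 6.0, 6.0]))  # always predicting the mean gives 0.0
```

A score around 0.44 therefore means the model explains a bit under half of the variation in the test targets.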

<h1 style="color:yellow;font-size:60px">Step 3:Make SGD Regressor</h1>

### Code: Explained with GPT



 `__init__(self, learning_rate=0.01, epochs=100)`

### What is this?
- This is the **constructor** of the class. It initializes the class with two important parameters:
  - `learning_rate`: This controls **how big each step** of the gradient descent should be. In this case, it is set to `0.01` by default.
  - `epochs`: The number of **passes** through the entire dataset, set to `100` by default. Each pass performs one update per training row.
  
- We also define two variables:
  - `self.coef_`: This will hold the **coefficients** (weights) for each feature.
  - `self.intercept_`: This is the **bias** or y-intercept term.

 `fit(self, X_train, y_train)`

### What does it do?
- This function is used to **train the model**. It takes the training data `X_train` and corresponding labels `y_train` as inputs.

### Steps:
1. **Initialize the model:**
   - `self.intercept_ = 0`: The intercept (bias) starts at 0.
   - `self.coef_ = np.ones(X_train.shape[1])`: The coefficients (weights) are initialized to `1` for each feature in `X_train`. If `X_train` has `n` features, we get an array of size `n`.

2. **Training loop (Gradient Descent):**
   - The outer loop runs for `self.epochs` (100 by default), meaning the model will see the data multiple times.
   
3. **Stochastic Gradient Descent:**
   - Within each epoch, an inner loop runs once per training row. Each time, we pick a random example from `X_train` by generating a random index `idx` (sampling with replacement, so a row can be chosen more than once in an epoch).
   - **Prediction**: We calculate the predicted value (`y_hat`) for the chosen example:
     $$
     y\_hat = X\_train[idx] \cdot coef\_ + intercept\_
     $$
   - **Gradient Calculation**:
     - We calculate the error between the true value (`y_train[idx]`) and the predicted value (`y_hat`).
     - We compute the gradient for the intercept (`intercept_der`) and for the coefficients (`coef_der`):
       $$
       intercept\_der = -2 \cdot (y\_train[idx] - y\_hat)
       $$
       $$
       coef\_der = -2 \cdot (y\_train[idx] - y\_hat) \cdot X\_train[idx]
       $$
     - We **update** the intercept and coefficients using these gradients and the learning rate (`self.lr`):
       $$
       intercept\_ = intercept\_ - (lr \cdot intercept\_der)
       $$
       $$
       coef\_ = coef\_ - (lr \cdot coef\_der)
       $$

4. After all epochs are completed, the model has **learned** the intercept and coefficients that best fit the data.

5. Finally, it prints the trained `intercept_` and `coef_`.

 `predict(self, X_test)`

### What does it do?
- This function is used to **make predictions** on new data. It takes `X_test` (the test data) as input.
- It uses the trained intercept and coefficients to calculate the predicted value:
  $
  y\_pred = X\_test \cdot coef\_ + intercept\_
  $
- The function returns the predicted values (`y_pred`) for all the test data.

## Summary:
- **Stochastic Gradient Descent (SGD)** is a variation of gradient descent where we update the model weights for **one example** at a time (instead of the whole dataset).
- The model uses **random samples** from the data in each iteration to adjust the weights and the intercept until it finds the best fit.


In [8]:
class SGDRegressor:
    """Linear regression trained with stochastic gradient descent."""

    def __init__(self, learning_rate=0.01, epochs=100):
        self.coef_ = None        # feature weights, set in fit()
        self.intercept_ = None   # bias term, set in fit()
        self.lr = learning_rate
        self.epochs = epochs

    def fit(self, X_train, y_train):
        n_samples, n_features = X_train.shape
        self.intercept_ = 0
        self.coef_ = np.ones(n_features)

        for _ in range(self.epochs):
            # One epoch = n_samples single-row updates.
            for _ in range(n_samples):
                # Pick a random row (with replacement).
                idx = np.random.randint(0, n_samples)

                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_
                error = y_train[idx] - y_hat

                # Gradients of the squared error for this one row.
                intercept_der = -2 * error
                coef_der = -2 * error * X_train[idx]

                self.intercept_ = self.intercept_ - (self.lr * intercept_der)
                self.coef_ = self.coef_ - (self.lr * coef_der)

        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_


In [9]:
sgd = SGDRegressor(learning_rate=0.01,epochs=40)

In [10]:
start = time.time()
sgd.fit(X_train,y_train)
print("The time taken is",time.time() - start)

143.84872764987068 [  58.22525171  -42.13063755  318.88091247  228.53253389   28.80551884
   -8.1284088  -158.66814918  124.05288045  286.19514821  127.28950194]
The time taken is 0.5794434547424316


In [11]:
y_pred = sgd.predict(X_test)

In [12]:
r2_score(y_test,y_pred)

0.40687003530988186

>__Point to be noted__:

On the first try, using sklearn's `LinearRegression`, we achieved an `R² score of 0.44`. <br>
With our own stochastic gradient descent implementation, we came close, with an `R² score of 0.41`.

<h1 style="color:yellow;font-size:60px">Step 4:SGD Regressor from Sklearn</h1>

In [13]:
from sklearn.linear_model import SGDRegressor

In [28]:
reg = SGDRegressor(max_iter=30,learning_rate='constant',eta0=0.01)

In [29]:
reg.fit(X_train,y_train)



In [30]:
y_pred = reg.predict(X_test)

In [31]:
r2_score(y_test,y_pred)

0.32696038030570607

<span style="Color:Yellow; Font-size:30px; text-align:center"> FAQ :</span> 

You may be wondering why our custom `SGD regressor` achieves an `R² score of 0.41`, while the one from sklearn only reaches `0.33`. One difference is easy to miss: our custom SGD regressor runs for 40 epochs, while sklearn's regressor is limited to 30 (`max_iter=30`).


### Additional insight:
The **number of iterations** (or epochs) can significantly affect the model's performance. **More iterations** let the model keep refining the weights based on the gradient of the loss function. Here, the **40 epochs** of the custom SGD regressor likely brought it closer to an optimal solution than the **30 epochs** of the `sklearn` version.

However, more iterations can also lead to **overfitting** if not handled carefully, and the iteration count is probably not the whole story. `sklearn`'s SGD regressor differs from the custom one in other **hyperparameters** too: by default it applies an L2 penalty (`alpha=0.0001`), can stop early based on `tol`, and shuffles the data each epoch instead of sampling rows with replacement. Both runs are also random, so the scores change from run to run unless a seed is fixed.
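
To make the comparison fairer, the sklearn settings can be brought closer to the custom regressor. The sketch below is one possible configuration, not a tuned one: the seed is arbitrary, and the resulting score is not claimed to match either number above. One further difference remains even here: sklearn's squared-error loss carries a factor of 1/2, so the same `eta0` gives an effective step half the size of the custom `-2 * error` update.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

reg = SGDRegressor(
    max_iter=40,               # same number of epochs as the custom run
    tol=None,                  # disable early stopping so all 40 epochs run
    alpha=0.0,                 # switch off the default L2 regularisation
    learning_rate="constant",
    eta0=0.01,
    random_state=0,            # arbitrary seed, only for repeatable results
)
reg.fit(X_train, y_train)

score = r2_score(y_test, reg.predict(X_test))
print(score)
```

Changing one setting at a time (for example `max_iter`, then `alpha`, then `eta0`) is a simple way to see which of them accounts for most of the gap.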