### Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is a variant of the gradient descent optimization algorithm commonly used in machine learning for training models, particularly in scenarios where the dataset is large. 

Here's a simple explanation of how it works:

1. **Gradient Descent**: In traditional gradient descent (also known as batch gradient descent), you compute the gradient of the cost function with respect to the parameters for the entire dataset. Then, you update the parameters once based on this average gradient.

2. **Stochastic Gradient Descent**: In SGD, instead of computing the gradient over the entire dataset, you randomly pick a single data point (or a small subset, called a mini-batch) from the dataset. You compute the gradient of the cost function with respect to the parameters using only that single data point (or mini-batch), and then you update the parameters. This process is repeated for each data point (or mini-batch) in the dataset.
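
The difference between the two update rules can be sketched on a small synthetic linear regression problem. This is a minimal illustration, not the notebook's own code: the data, the helper names `batch_gradient` and `sample_gradient`, and the learning rate are all assumptions made for the example.

```python
import numpy as np

# Synthetic data (assumed for illustration): 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

def batch_gradient(w, X, y):
    """Gradient of the mean squared error over the whole dataset."""
    residual = y - X @ w
    return -2.0 * (X.T @ residual) / len(y)

def sample_gradient(w, x_i, y_i):
    """Gradient of the squared error for a single data point."""
    residual = y_i - x_i @ w
    return -2.0 * residual * x_i

w = np.zeros(3)
lr = 0.05  # assumed learning rate

# Batch gradient descent: one update that uses all 200 rows.
w_batch = w - lr * batch_gradient(w, X, y)

# Stochastic gradient descent: one update that uses one randomly chosen row.
i = rng.integers(0, len(y))
w_sgd = w - lr * sample_gradient(w, X[i], y[i])

# The batch gradient is the average of the per-sample gradients,
# so each stochastic step is a noisy estimate of the batch step.
per_sample = np.array([sample_gradient(w, X[k], y[k]) for k in range(len(y))])
print(np.allclose(per_sample.mean(axis=0), batch_gradient(w, X, y)))
```

The final check shows why SGD works at all: averaged over many random picks, the single-sample gradient points in the same direction as the full gradient.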

Advantages of Stochastic Gradient Descent over Batch Gradient Descent:

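1. **Efficiency**: SGD often makes faster progress per unit of computation because it updates the parameters after every data point (or mini-batch) instead of once per pass over the dataset. Each update follows a noisy estimate of the gradient, so a single step may not reduce the cost, but on average the steps move toward the minimum and can get close to it in fewer passes over the data.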

2. **Less Memory Requirement**: Since SGD only requires calculating the gradient for a single data point (or a small subset), it consumes much less memory compared to batch gradient descent, making it more suitable for large datasets that cannot fit into memory.

3. **Possibly Better Generalization**: SGD's frequent updates and exposure to individual data points (or mini-batches) can introduce more randomness into the optimization process, potentially helping the algorithm to escape local minima and find better solutions or generalize better to unseen data.

However, SGD can also have some drawbacks, such as more frequent fluctuations in the objective function and slower convergence towards the minimum when the cost surface is not smooth. To mitigate these issues, techniques like learning rate scheduling, momentum, and adaptive learning rates are often employed.
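
As one illustration of learning rate scheduling, the sketch below shrinks the step size as training proceeds, so early updates are large and later ones settle down. The schedule `inverse_time_decay`, its constants `t0` and `t1`, and the synthetic data are assumptions chosen for the example; other schedules are equally valid.

```python
import numpy as np

def inverse_time_decay(step, t0=1.0, t1=20.0):
    """Hypothetical schedule: the learning rate falls as t0 / (step + t1)."""
    return t0 / (step + t1)

# Synthetic data (assumed): y = 3*x1 - 2*x2 + 4 plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = X @ np.array([3.0, -2.0]) + 4.0 + 0.1 * rng.normal(size=300)

w, b = np.zeros(2), 0.0
step = 0
for epoch in range(20):
    for _ in range(len(y)):
        i = rng.integers(0, len(y))      # pick one row at random
        lr = inverse_time_decay(step)    # step size for this update
        error = y[i] - (X[i] @ w + b)
        # Per-sample gradients are -2*error*x_i and -2*error,
        # so subtracting lr * gradient means adding these terms.
        w += lr * 2.0 * error * X[i]
        b += lr * 2.0 * error
        step += 1

print(w, b)  # should land close to the generating values [3, -2] and 4
```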

In [1]:
from sklearn.datasets import load_diabetes

from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

In [2]:
X, y = load_diabetes(return_X_y=True)

In [3]:
X.shape, y.shape

((442, 10), (442,))

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [5]:
X_train

array([[-0.00188202, -0.04464164, -0.06979687, ..., -0.03949338,
        -0.06291688,  0.04034337],
       [-0.00914709, -0.04464164,  0.01103904, ..., -0.03949338,
         0.01703607, -0.0052198 ],
       [ 0.02354575,  0.05068012, -0.02021751, ..., -0.03949338,
        -0.09643495, -0.01764613],
       ...,
       [ 0.06350368,  0.05068012, -0.00405033, ..., -0.00259226,
         0.08449153, -0.01764613],
       [-0.05273755,  0.05068012, -0.01806189, ...,  0.1081111 ,
         0.03606033, -0.04249877],
       [ 0.00175052,  0.05068012,  0.05954058, ...,  0.1081111 ,
         0.06898589,  0.12732762]])

In [6]:
reg = LinearRegression()
reg.fit(X_train, y_train)

In [7]:
y_pred = reg.predict(X_test)
r2_score(y_test, y_pred)

0.4399338661568968

In [8]:
reg.coef_

array([  -9.15865318, -205.45432163,  516.69374454,  340.61999905,
       -895.5520019 ,  561.22067904,  153.89310954,  126.73139688,
        861.12700152,   52.42112238])

In [9]:
reg.intercept_

151.88331005254167

In [10]:
X_train.shape[1]

10

In [11]:
X_train.shape[0]

353

In [12]:
np.random.randint(0, X_train.shape[0])

346

In [13]:
class amanSGD:
    """Linear regression trained with stochastic gradient descent (one sample per update)."""

    def __init__(self, learning_rate=0.1, epochs=100):
        self.lr = learning_rate
        self.epochs = epochs
        self.coef_ = None
        self.intercept_ = None

    def fit(self, X_train, y_train):
        # Start from an intercept of 0 and all coefficients set to 1.
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        for i in range(self.epochs):
            # One epoch = as many updates as there are training rows.
            for j in range(X_train.shape[0]):
                # Pick one row at random (with replacement).
                idx = np.random.randint(0, X_train.shape[0])
                y_hat = np.dot(X_train[idx], self.coef_) + self.intercept_

                # Per-sample gradient of the squared error: -2 * (y_i - y_hat),
                # with no sum or average over the dataset.
                intercept_der = -2 * (y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # The coefficient gradient is the same term scaled by the feature vector.
                coef_der = -2 * (y_train[idx] - y_hat) * X_train[idx]
                self.coef_ = self.coef_ - (self.lr * coef_der)
        print(self.intercept_, self.coef_)

    def predict(self, X_test):
        return np.dot(X_test, self.coef_) + self.intercept_

In [14]:
sgd = amanSGD(learning_rate=0.01, epochs = 50)

In [15]:
sgd.fit(X_train, y_train)

145.94400660033554 [  56.16314182  -76.06313579  355.90336265  245.72488747   18.23406522
  -26.70496554 -169.42389724  126.89187121  311.14697342  130.73297417]


In [16]:
y_pred = sgd.predict(X_test)

In [17]:
r2_score(y_test, y_pred)

0.4283703691763032
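
The update rule used in `fit` follows from differentiating the squared error of a single sample. The notation here is introduced only for this derivation: $\mathbf{w}$ stands for `coef_`, $b$ for `intercept_`, $\eta$ for the learning rate `lr`, and $(\mathbf{x}_i, y_i)$ for the randomly chosen row.

$$
L_i = (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = \mathbf{x}_i \cdot \mathbf{w} + b
$$

$$
\frac{\partial L_i}{\partial b} = -2\,(y_i - \hat{y}_i), \qquad
\frac{\partial L_i}{\partial \mathbf{w}} = -2\,(y_i - \hat{y}_i)\,\mathbf{x}_i
$$

$$
b \leftarrow b - \eta\,\frac{\partial L_i}{\partial b}, \qquad
\mathbf{w} \leftarrow \mathbf{w} - \eta\,\frac{\partial L_i}{\partial \mathbf{w}}
$$

Batch gradient descent would instead average these derivatives over all $n$ rows, $\frac{1}{n}\sum_{i=1}^{n}$, before making a single update.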

### Disadvantages of Stochastic Gradient Descent (SGD)

1. **Noisy Updates**: Because SGD updates parameters based on the gradient computed from a single data point (or a small subset), the updates can be very noisy, leading to a more erratic convergence path compared to batch gradient descent. This noise can sometimes slow down convergence or make it harder to converge to an optimal solution.

2. **Variance in Convergence**: Due to the randomness introduced by sampling individual data points (or mini-batches), SGD may exhibit higher variance in convergence behavior compared to batch gradient descent. This variance can make it more challenging to determine the convergence criteria or to reproduce the same results across different runs.

3. **Sensitive to Learning Rate**: The learning rate in SGD needs to be carefully tuned. If the learning rate is too high, SGD may oscillate around the minimum or even diverge. If it's too low, convergence may be slow. Finding an appropriate learning rate can be more challenging in SGD compared to batch gradient descent.

4. **Potential for Plateaus**: In certain cases, especially when the cost surface is relatively flat or has long, shallow plateaus, SGD may struggle to make progress towards the minimum because the gradient from individual data points (or mini-batches) may not provide sufficient guidance.

5. **Difficulty in Diagnosing Convergence**: Because of the stochastic nature of SGD updates, diagnosing convergence issues can be more challenging compared to batch gradient descent. It may require additional monitoring techniques or multiple runs to ensure convergence to a satisfactory solution.

Despite these disadvantages, SGD remains widely used due to its efficiency, scalability to large datasets, and ability to handle non-convex optimization problems. Various techniques, such as momentum, adaptive learning rates, and mini-batch sampling strategies, are often employed to mitigate these issues and improve the performance of SGD in practice.
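
Momentum, one of the techniques mentioned above, can be sketched as follows. Each update moves along a running accumulation of past gradients rather than the latest noisy gradient alone, which tends to smooth the path. This is a minimal sketch of classical momentum on synthetic data; the hyperparameters `lr` and `beta` and the variable names are assumptions for illustration.

```python
import numpy as np

# Synthetic data (assumed): y = 1.5*x1 - 0.5*x2 + 2 plus a little noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = X @ np.array([1.5, -0.5]) + 2.0 + 0.1 * rng.normal(size=300)

w, b = np.zeros(2), 0.0
v_w, v_b = np.zeros(2), 0.0   # velocity: decaying sum of past gradients
lr, beta = 0.005, 0.9         # assumed step size and momentum coefficient

for epoch in range(30):
    for _ in range(len(y)):
        i = rng.integers(0, len(y))          # pick one row at random
        error = y[i] - (X[i] @ w + b)
        grad_w = -2.0 * error * X[i]         # per-sample gradients
        grad_b = -2.0 * error
        # Blend the new gradient into the velocity, then step along the velocity.
        v_w = beta * v_w + grad_w
        v_b = beta * v_b + grad_b
        w -= lr * v_w
        b -= lr * v_b

print(w, b)  # should land close to the generating values [1.5, -0.5] and 2
```

With `beta = 0` the velocity equals the current gradient and the loop reduces to plain SGD.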

### When to use Stochastic Gradient Descent (SGD) and Batch Gradient Descent

Choosing between Stochastic Gradient Descent (SGD) and Batch Gradient Descent depends on various factors such as the size of the dataset, computational resources, and optimization goals. Here's a guideline on when to use each:

1. **Batch Gradient Descent (BGD)**:
   - **Small to Medium Sized Datasets**: BGD is suitable when the dataset can comfortably fit into memory.
   - **Smooth Cost Functions**: BGD may perform well when dealing with smooth, well-behaved cost functions, as it computes the gradient over the entire dataset, leading to more stable updates.
   - **Convergence**: If a stable, predictable convergence path matters and computational resources are not a limitation, BGD might be preferred. With a sufficiently small learning rate on a smooth cost function, the cost decreases at every iteration, which SGD does not guarantee.

2. **Stochastic Gradient Descent (SGD)**:
   - **Large Datasets**: SGD is well-suited for large datasets that cannot fit into memory because it updates parameters based on individual data points (or mini-batches), requiring less memory.
   - **Efficiency**: When computational resources are limited, SGD is often preferred due to its computational efficiency. It allows for more frequent updates, potentially converging faster, especially in high-dimensional spaces.
   - **Non-convex Optimization**: SGD can be beneficial in non-convex optimization problems, as its stochastic updates can help escape local minima and explore the solution space more effectively.
   - **Online Learning**: For scenarios where new data arrives continuously, SGD is suitable for online learning settings, where the model is updated incrementally with each new observation.

3. **Mini-Batch Gradient Descent**:
   - **Trade-off between BGD and SGD**: Mini-batch gradient descent combines aspects of both BGD and SGD by updating parameters based on small random subsets (mini-batches) of the dataset. It strikes a balance between the efficiency of SGD and the stability of BGD, making it suitable for a wide range of scenarios.
   - **Parallelization**: Mini-batch gradient descent can also be parallelized across multiple processing units, making it useful in distributed computing environments.

In practice, the choice between BGD, SGD, or mini-batch gradient descent often depends on experimentation and empirical evaluation, considering factors such as dataset size, computational constraints, convergence behavior, and optimization objectives.
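
The three variants differ only in how many rows feed each update, which a single function can show. This is a sketch on synthetic data: the function `minibatch_gd`, its parameters, and their default values are hypothetical and chosen for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.05, epochs=50, seed=0):
    """Linear regression by mini-batch gradient descent.

    batch_size=len(y) behaves like batch gradient descent;
    batch_size=1 behaves like SGD with one shuffled pass per epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # rows in this mini-batch
            error = y[idx] - (X[idx] @ w + b)
            # Gradient of the mean squared error over the mini-batch.
            grad_w = -2.0 * (X[idx].T @ error) / len(idx)
            grad_b = -2.0 * error.mean()
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Synthetic data (assumed): y = x1 + 2*x2 - x3 + 0.5 plus a little noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 + 0.1 * rng.normal(size=400)

w_mini, b_mini = minibatch_gd(X, y, batch_size=32)
w_full, b_full = minibatch_gd(X, y, batch_size=len(y), epochs=200)
print(w_mini, b_mini)
print(w_full, b_full)
```

Both calls should recover coefficients near the generating values. The mini-batch run makes 13 updates per epoch, while the full-batch run makes one.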

### Convex and Non-Convex Functions

![](https://d3i71xaburhd42.cloudfront.net/b1df0cd796034ca9ba3bc018474e44ee60fd7855/21-Figure1.5-1.png)

### Convergence
![](https://miro.medium.com/v2/resize:fit:1400/1*QA0kOv7KA_0SNWaEEHCmZg.png)

### Sklearn SGD

In [18]:
from sklearn.linear_model import SGDRegressor

In [19]:
reg = SGDRegressor(max_iter=100, learning_rate='constant', eta0=0.01)

In [20]:
reg.fit(X_train, y_train)



In [21]:
y_pred = reg.predict(X_test)

In [22]:
r2_score(y_test, y_pred)

0.43066316900537893
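
Because `SGDRegressor` shuffles the training data between epochs, repeated fits can produce slightly different coefficients and scores. Passing `random_state` makes a run repeatable. The sketch below assumes the same dataset and split as above; the helper `fit_once` is a hypothetical name introduced for the example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

def fit_once(seed):
    """Fit the same model as above with a fixed shuffling seed; return its coefficients."""
    reg = SGDRegressor(max_iter=100, learning_rate='constant', eta0=0.01,
                       random_state=seed)
    reg.fit(X_train, y_train)
    return reg.coef_.copy()

# Two fits with the same seed see the data in the same order and agree.
same_seed_matches = np.allclose(fit_once(0), fit_once(0))
print(same_seed_matches)
```

Depending on the scikit-learn version, the fit may emit a `ConvergenceWarning` if the stopping tolerance is not reached within `max_iter` epochs. That is a warning, not an error.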