Stochastic Gradient Descent (SGD) with momentum extends the basic SGD algorithm. Basic SGD updates the parameters using only the current gradient; SGD with momentum also incorporates past gradients, which smooths the updates and often accelerates convergence.

## How SGD with Momentum Works

The idea of momentum is to accumulate an exponentially decaying moving average of past gradients and keep moving in that accumulated direction. Gradient components that point the same way from step to step add up, so the update builds speed in those directions; components that keep flipping sign largely cancel, which dampens oscillations.

![Screenshot%202024-06-02%20at%2010.58.44%E2%80%AFAM.png](attachment:Screenshot%202024-06-02%20at%2010.58.44%E2%80%AFAM.png)
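In symbols, one common way to write this update is shown below, using γ for the momentum coefficient and η for the learning rate. The remaining notation (θ for the parameters, J for the loss, t for the step index) is an assumption made here for illustration, and some sources scale the gradient term differently.

$$
\begin{aligned}
v_t &= \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1}) \\
\theta_t &= \theta_{t-1} - v_t
\end{aligned}
$$

Setting γ = 0 recovers plain SGD. Starting from a zero velocity and unrolling the first line gives $v_t = \eta \sum_{k=0}^{t-1} \gamma^{k}\, \nabla_\theta J(\theta_{t-1-k})$, which makes the exponentially decaying weighting of past gradients explicit.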

## Working Steps

**1. Initialization:** Initialize the parameters (weights and biases) and set the initial velocity $v_0 = 0$.

**2. Compute Gradient:** For each training example $(x^{(i)}, y^{(i)})$, compute the gradient of the loss function with respect to the parameters.

**3. Velocity Update:** Update the velocity vector using the current gradient and the previous velocity.

**4. Parameter Update:** Update the parameters using the velocity vector.

**5. Iteration:** Repeat steps 2-4 for all training examples and for multiple epochs until convergence.
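The five steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function name `sgd_momentum`, the toy data (points on the line y = 2x + 1), the squared-error loss, and the hyperparameter values are all assumptions chosen to keep the example small.

```python
import random

def sgd_momentum(data, lr=0.01, gamma=0.9, epochs=200, seed=0):
    """Fit y = w*x + b with per-example SGD plus momentum (illustrative)."""
    rng = random.Random(seed)
    data = list(data)
    # Step 1: initialise the parameters and set the velocities to zero.
    w, b = 0.0, 0.0
    v_w, v_b = 0.0, 0.0
    for _ in range(epochs):          # Step 5: repeat for several epochs
        rng.shuffle(data)            # visit the examples in random order
        for x, y in data:
            # Step 2: gradient of the loss 0.5 * (w*x + b - y)**2
            err = w * x + b - y
            g_w, g_b = err * x, err
            # Step 3: decay the previous velocity and add the new gradient.
            v_w = gamma * v_w + lr * g_w
            v_b = gamma * v_b + lr * g_b
            # Step 4: move the parameters along the velocity.
            w -= v_w
            b -= v_b
    return w, b

# Hypothetical toy data: noiseless points on y = 2x + 1 for x in [-1, 1].
points = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_momentum(points)
print(f"w = {w:.4f}, b = {b:.4f}")
```

On this noiseless data the fitted `w` and `b` should end up close to 2 and 1. With noisy data or a larger learning rate the result would fluctuate around the best fit instead.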



## Usage

SGD with momentum is widely used in training deep neural networks, including:
1. Convolutional Neural Networks (CNNs)
2. Recurrent Neural Networks (RNNs)
3. Fully Connected Networks

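It is particularly useful when the loss surface is hard to traverse, for example when it contains narrow ravines, flat regions, shallow local minima, or saddle points. In these cases the accumulated velocity can keep the parameters moving even where the current gradient is small or changes direction from step to step.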

### Advantages

**1. Faster Convergence:** By building up speed in directions with consistent gradients, momentum helps to accelerate convergence.

**2. Reduced Oscillations:** Momentum dampens oscillations in directions with inconsistent gradients, leading to more stable updates.

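**3. Effective in High-Dimensional Spaces:** In high-dimensional parameter spaces, gradient magnitudes can differ greatly from one direction to another. Momentum helps by speeding up progress along directions where the gradient is small but consistent.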
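As a rough illustration of the faster-convergence claim, the sketch below runs the same update with and without momentum on a small quadratic whose curvature differs by a factor of 50 between its two directions. The objective, starting point, learning rate, and step count are assumptions picked for the demonstration, and exact gradients are used (no sampling noise) so that the effect of momentum is isolated.

```python
def loss(x, y):
    # Elongated bowl: gentle curvature along x, steep curvature along y.
    return 0.5 * (x * x + 50.0 * y * y)

def grad(x, y):
    return x, 50.0 * y

def run(gamma, lr=0.03, steps=200):
    """Run gradient descent with momentum coefficient gamma (0 = no momentum)."""
    x, y = 10.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = gamma * vx + lr * gx
        vy = gamma * vy + lr * gy
        x -= vx
        y -= vy
    return loss(x, y)

plain_loss = run(gamma=0.0)
momentum_loss = run(gamma=0.9)
print(f"without momentum: {plain_loss:.3e}")
print(f"with momentum:    {momentum_loss:.3e}")
```

Without momentum, progress along the gently curved x direction is slow because the learning rate is capped by the steep y direction. With momentum, the velocity accumulates along x, so the final loss after the same number of steps is much lower in this setup.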

### Disadvantages

**1. Hyperparameter Tuning:** The momentum coefficient γ and the learning rate η interact, so they need to be tuned together for good performance. A value of γ around 0.9 is a common starting point.

**2. Complexity:** Slightly more complex to implement and understand than basic SGD, since a velocity value must be stored and updated for every parameter.
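One reason the two hyperparameters in the first point have to be tuned together can be seen with a small experiment. If the gradient stays constant at g, the velocity settles at η·g / (1 − γ), so raising γ also raises the effective step size. The sketch below assumes the velocity update v ← γ·v + η·g; the function name `terminal_velocity` and the values used are illustrative only.

```python
def terminal_velocity(lr, gamma, grad=1.0, steps=5000):
    """Iterate v <- gamma * v + lr * grad with a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = gamma * v + lr * grad
    return v

lr = 0.01
for gamma in (0.0, 0.5, 0.9, 0.99):
    v = terminal_velocity(lr, gamma)
    predicted = lr / (1.0 - gamma)   # geometric-series limit for grad = 1
    print(f"gamma={gamma:<4}  velocity={v:.4f}  lr/(1-gamma)={predicted:.4f}")
```

Under this assumption, γ = 0.9 makes each step roughly ten times larger than the learning rate alone suggests, so a learning rate that works for plain SGD may need to be reduced when momentum is added.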