# Neural Network Optimization

## Gradient Descent Algorithm


Gradient Descent is an optimization algorithm that minimizes a function by repeatedly stepping in the direction of steepest descent, which is the negative of the gradient. It is widely used in machine learning to update model parameters.

Mathematical Explanation:

Given a differentiable function $ f(\theta) $ where $ \theta $ represents the parameters, the goal is to find the $ \theta $ that minimizes $ f(\theta) $.

Update Rule:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_{\theta} f(\theta_{\text{old}})
$$

- $ \eta $ is the learning rate (a small positive number).
- $ \nabla_{\theta} f(\theta_{\text{old}}) $ is the gradient of the function at $ \theta_{\text{old}} $.
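To make the update rule concrete, here is a minimal sketch in plain Python. The objective $ f(\theta) = (\theta - 3)^2 $, the helper name `gradient_descent`, and the values of `eta` and `steps` are assumptions chosen for illustration; they do not come from the text above.

```python
def gradient_descent(grad, theta0, eta=0.1, steps=50):
    """Repeatedly apply theta <- theta - eta * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


# Illustrative objective (an assumption): f(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3) and whose minimum is at theta = 3.
def f_grad(theta):
    return 2.0 * (theta - 3.0)


theta_min = gradient_descent(f_grad, theta0=0.0)
print(f"theta after 50 steps: {theta_min:.4f}")
```

Starting from $ \theta = 0 $, each step shrinks the distance to the minimum by a factor of $ 1 - 2\eta = 0.8 $, so the result lands very close to 3.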

Visual Illustration:

Imagine you're standing high on a hillside and want to reach the bottom of the valley (the minimum of the function). At each step, you look around for the steepest downward slope (the negative gradient) and take a step in that direction. The learning rate $ \eta $ sets how long each step is.


## Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent is a variation of gradient descent in which the gradient is estimated from a single sample (or a mini-batch) rather than computed over the entire dataset. Each update is therefore much cheaper, which makes the method practical for large datasets.

Update Rule:

$$
\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_{\theta} f(\theta_{\text{old}}; x_i, y_i)
$$

- $ (x_i, y_i) $ is a single data point, typically drawn at random from the dataset.
- The gradient $ \nabla_{\theta} f(\theta_{\text{old}}; x_i, y_i) $ is computed using only this data point.
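The per-sample update can be sketched as follows. This is a hedged illustration rather than a reference implementation: the one-parameter model $ \hat{y} = w x $, the toy data, the fixed seed, and the names `sample_grad` and `sgd` are all assumptions made to keep the example short.

```python
import random

# Toy data (an assumption for illustration): points on the line y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]


def sample_grad(w, x_i, y_i):
    """Gradient of the per-sample loss (y_i - w * x_i)^2 with respect to w."""
    return -2.0 * x_i * (y_i - w * x_i)


def sgd(data, w0=0.0, eta=0.01, epochs=50, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    w = w0
    for _ in range(epochs):
        order = data[:]      # visit every point once per epoch...
        rng.shuffle(order)   # ...in a fresh random order
        for x_i, y_i in order:
            # One parameter update per data point
            w = w - eta * sample_grad(w, x_i, y_i)
    return w


w_hat = sgd(data)
print(f"w after SGD: {w_hat:.4f}")
```

Because the toy data lie exactly on a line, every single-sample step moves `w` toward 2, and the estimate settles there. With noisy data the iterates would keep fluctuating around the minimum instead.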

Benefits of SGD:

- Faster iterations due to less computation per update.
- Introduces noise that can help escape local minima.
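The "noise" in the second point can be seen directly: each single-sample gradient differs from the full-batch gradient, but their average equals it. The short sketch below assumes a linear model $ \hat{y} = w x + b $ with a squared-error loss and a small toy dataset, all chosen here purely for illustration.

```python
# Toy data (an assumption for illustration): points on the line y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, b = 0.0, 0.0  # arbitrary point at which to evaluate the gradients

# Gradient of each per-sample loss (y_i - (w * x_i + b))^2 with respect to w
per_sample = [-2.0 * x * (y - (w * x + b)) for x, y in zip(xs, ys)]

# The gradient of the mean loss is the average of the per-sample gradients
full_batch = sum(per_sample) / len(per_sample)

print(per_sample)  # [-4.0, -16.0, -36.0, -64.0]
print(full_batch)  # -30.0
```

An SGD step uses one of the four values in `per_sample`, so individual steps are noisy, yet on average they point the same way as the full-batch step.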

---

### Example: Linear Regression with Gradient Descent

Suppose we have a dataset:

| $ x_i $ | $ y_i $ |
|-----------|-----------|
|     1     |     2     |
|     2     |     4     |
|     3     |     6     |
|     4     |     8     |

We want to fit a linear model $ y = w x + b $ using gradient descent.

Loss Function (Mean Squared Error):

$$
L(w, b) = \frac{1}{N} \sum_{i=1}^{N} (y_i - (w x_i + b))^2
$$
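As a quick check of the formula, the loss can be evaluated in a few lines of plain Python. The function name `mse_loss` is an illustrative choice, not something defined elsewhere in this example.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


# The dataset from the table above
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

print(mse_loss(0.0, 0.0, xs, ys))  # 30.0
print(mse_loss(2.0, 0.0, xs, ys))  # 0.0
```

The loss is 30 at $ w = 0, b = 0 $ and drops to 0 at $ w = 2, b = 0 $, since every point in the table satisfies $ y = 2x $.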

Compute Gradients:

- Gradient with respect to $ w $:

  $$
  \frac{\partial L}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} x_i (y_i - (w x_i + b))
  $$

- Gradient with respect to $ b $:

  $$
  \frac{\partial L}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (y_i - (w x_i + b))
  $$
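One way to sanity-check these two formulas is to compare them with a finite-difference approximation of the loss. The sketch below does that in plain Python. The helper names, the evaluation point $ (w, b) = (0.5, -1) $, and the step size `h` are assumptions made for illustration.

```python
def mse_loss(w, b, xs, ys):
    """Mean squared error of the linear model y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grads(w, b, xs, ys):
    """Analytic gradients of the MSE with respect to w and b."""
    n = len(xs)
    residuals = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))
    db = -2.0 / n * sum(residuals)
    return dw, db


def numeric_grads(w, b, xs, ys, h=1e-6):
    """Central finite-difference approximation of the same gradients."""
    dw = (mse_loss(w + h, b, xs, ys) - mse_loss(w - h, b, xs, ys)) / (2 * h)
    db = (mse_loss(w, b + h, xs, ys) - mse_loss(w, b - h, xs, ys)) / (2 * h)
    return dw, db


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

analytic = mse_grads(0.5, -1.0, xs, ys)
numeric = numeric_grads(0.5, -1.0, xs, ys)
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two pairs should agree to several decimal places. A large mismatch would point to a sign or scaling error in the derivation.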

Update Rules:

$$
\begin{aligned}
w_{\text{new}} & = w_{\text{old}} - \eta \frac{\partial L}{\partial w} \\
b_{\text{new}} & = b_{\text{old}} - \eta \frac{\partial L}{\partial b}
\end{aligned}
$$

Step-by-Step Calculation:

Let's initialize $ w = 0 $, $ b = 0 $, and $ \eta = 0.01 $.

First Iteration:

1. Compute predictions:

   $$
   \hat{y}_i = w x_i + b = 0 \times x_i + 0 = 0
   $$

2. Substitute the predictions into the gradient formulas:

   $$
   \frac{\partial L}{\partial w} = -\frac{2}{4} \sum_{i=1}^{4} x_i (y_i - \hat{y}_i) = -\frac{1}{2} \sum_{i=1}^{4} x_i y_i
   $$

   $$
   \frac{\partial L}{\partial b} = -\frac{2}{4} \sum_{i=1}^{4} (y_i - \hat{y}_i) = -\frac{1}{2} \sum_{i=1}^{4} y_i
   $$

3. Calculate sums:

   $$
   \sum_{i=1}^{4} x_i y_i = 1 \times 2 + 2 \times 4 + 3 \times 6 + 4 \times 8 = 60
   $$

   $$
   \sum_{i=1}^{4} y_i = 2 + 4 + 6 + 8 = 20
   $$

4. Evaluate the gradients:

   $$
   \frac{\partial L}{\partial w} = -\frac{1}{2} \times 60 = -30
   $$

   $$
   \frac{\partial L}{\partial b} = -\frac{1}{2} \times 20 = -10
   $$

5. Update parameters:

   $$
   w_{\text{new}} = 0 - 0.01 \times (-30) = 0 + 0.3 = 0.3
   $$

   $$
   b_{\text{new}} = 0 - 0.01 \times (-10) = 0 + 0.1 = 0.1
   $$
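The arithmetic of this first iteration can be reproduced in plain Python. The sketch follows the five steps above. The variable names and the `mse` helper are illustrative choices.

```python
# Dataset and starting point from the example above
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w, b, eta = 0.0, 0.0, 0.01
n = len(xs)


def mse(w, b):
    """Mean squared error of y = w * x + b on the dataset."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


# Steps 1-4: predictions, then the two gradients
preds = [w * x + b for x in xs]
dw = -2.0 / n * sum(x * (y - p) for x, y, p in zip(xs, ys, preds))
db = -2.0 / n * sum(y - p for y, p in zip(ys, preds))

# Step 5: one gradient descent update
w_new = w - eta * dw
b_new = b - eta * db

print(f"dL/dw = {dw:.1f}, dL/db = {db:.1f}")
print(f"w_new = {w_new:.2f}, b_new = {b_new:.2f}")
print(f"loss before = {mse(w, b):.3f}, loss after = {mse(w_new, b_new):.3f}")
```

The gradients come out as -30 and -10, matching the hand calculation, and the single update lowers the loss from 30 to about 20.835.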

Final Thoughts:

- Gradient Descent uses the entire dataset for every update, so it is best suited to smaller datasets where the cost of a full pass per step is not a concern.
- Stochastic Gradient Descent is better for larger datasets, since each update touches only one sample (or a mini-batch), and its stochastic nature can help in escaping local minima.


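PyTorch Implementation (Gradient Descent):

```python
import torch

# Data: the four points from the table above
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
y = torch.tensor([2.0, 4.0, 6.0, 8.0])

# Parameters, initialized to zero as in the worked example
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

# Learning rate
eta = 0.01

# Number of epochs (one full-batch update per epoch)
epochs = 100

for epoch in range(epochs):
    # Forward pass: predictions for the whole dataset
    y_pred = w * x + b
    # Mean squared error over all samples
    loss = ((y_pred - y) ** 2).mean()
    # Backward pass: populates w.grad and b.grad
    loss.backward()
    # Gradient descent update, done outside autograd tracking
    with torch.no_grad():
        w -= eta * w.grad
        b -= eta * b.grad
    # Reset gradients so they do not accumulate across epochs
    w.grad.zero_()
    b.grad.zero_()
    # Report progress every 10 epochs (the loss shown was computed before this update)
    if (epoch + 1) % 10 == 0:
        print(f'Epoch {epoch+1}: w = {w.item():.4f}, b = {b.item():.4f}, Loss = {loss.item():.4f}')
```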

Output:

```
Epoch 10: w = 1.5104, b = 0.4936, Loss = 1.1751
Epoch 20: w = 1.7583, b = 0.5584, Loss = 0.0843
Epoch 30: w = 1.8030, b = 0.5547, Loss = 0.0529
Epoch 40: w = 1.8149, b = 0.5404, Loss = 0.0492
Epoch 50: w = 1.8213, b = 0.5248, Loss = 0.0463
Epoch 60: w = 1.8267, b = 0.5093, Loss = 0.0436
Epoch 70: w = 1.8319, b = 0.4943, Loss = 0.0410
Epoch 80: w = 1.8369, b = 0.4797, Loss = 0.0387
Epoch 90: w = 1.8417, b = 0.4655, Loss = 0.0364
Epoch 100: w = 1.8463, b = 0.4518, Loss = 0.0343
```
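Because this dataset is tiny, the exact least-squares solution can be computed directly and compared with the values printed above. The sketch below uses the standard closed-form formulas for simple linear regression. It is not part of the gradient descent procedure itself, and the variable names are illustrative.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
n = len(xs)

x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Slope = covariance(x, y) / variance(x); intercept follows from the means
s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
s_xx = sum((x - x_mean) ** 2 for x in xs)

w_star = s_xy / s_xx
b_star = y_mean - w_star * x_mean

print(w_star, b_star)  # 2.0 0.0
```

The exact minimizer is $ w = 2 $, $ b = 0 $, so after 100 epochs the gradient descent run ($ w \approx 1.8463 $, $ b \approx 0.4518 $) is still converging, and additional epochs would bring it closer.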


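PyTorch Implementation (Stochastic Gradient Descent):

```python
import torch

# Data as column vectors of shape (4, 1)
x = torch.tensor([[1.0], [2.0], [3.0], [4.0]])
y = torch.tensor([[2.0], [4.0], [6.0], [8.0]])

# Parameters, randomly initialized (results vary from run to run)
w = torch.randn(1, 1, requires_grad=True)
b = torch.randn(1, requires_grad=True)

# Learning rate
eta = 0.01

# Number of epochs (one pass over all samples per epoch)
epochs = 100

for epoch in range(epochs):
    # Visit the samples in a fresh random order each epoch
    permutation = torch.randperm(x.size(0))
    for i in permutation:
        xi = x[i]
        yi = y[i]
        # Forward pass on a single sample
        y_pred = xi @ w + b
        # Squared error for this sample
        loss = (y_pred - yi).pow(2).mean()
        # Backward pass: populates w.grad and b.grad
        loss.backward()
        # SGD update, done outside autograd tracking
        with torch.no_grad():
            w -= eta * w.grad
            b -= eta * b.grad
        # Reset gradients so they do not accumulate across samples
        w.grad.zero_()
        b.grad.zero_()
    # Report progress every 10 epochs
    if (epoch + 1) % 10 == 0:
        # Evaluate the loss on the full dataset without tracking gradients
        with torch.no_grad():
            y_pred = x @ w + b
            loss = (y_pred - y).pow(2).mean()
        print(f'Epoch {epoch+1}: w = {w.item():.4f}, b = {b.item():.4f}, Loss = {loss.item():.4f}')
```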

Output:

```
Epoch 10: w = 1.9437, b = 0.1718, Loss = 0.0049
Epoch 20: w = 1.9503, b = 0.1516, Loss = 0.0038
Epoch 30: w = 1.9554, b = 0.1333, Loss = 0.0030
Epoch 40: w = 1.9613, b = 0.1178, Loss = 0.0023
Epoch 50: w = 1.9648, b = 0.1036, Loss = 0.0018
Epoch 60: w = 1.9693, b = 0.0914, Loss = 0.0014
Epoch 70: w = 1.9733, b = 0.0807, Loss = 0.0011
Epoch 80: w = 1.9762, b = 0.0711, Loss = 0.0008
Epoch 90: w = 1.9792, b = 0.0628, Loss = 0.0007
Epoch 100: w = 1.9809, b = 0.0551, Loss = 0.0005
```
