# Optimization in Machine Learning using Adagrad

### Learning Objective

Understand how optimization algorithms like **Adagrad** are used to train machine learning models, focusing on a **linear regression task**.


---


### Problem Overview

**Goal**: Learn a simple linear relationship between hours studied (`x`) and student grades (`y`).

Model:

$$
\hat{y} = wx + b
$$

We aim to find the best parameters $w$ and $b$ that minimize the **Mean Squared Error (MSE)**:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
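As a quick check on the formula, the sketch below evaluates the MSE for two candidate lines on the four study-hours data points used in this notebook. The helper name `mse` and the candidate values `w = 1.5`, `b = 0.5` are illustrative assumptions, not part of the original code.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line y_hat = w * x + b on the data (x, y)."""
    y_hat = w * x + b
    return np.mean((y - y_hat) ** 2)

# Same four points as the notebook's dataset: hours studied and grades.
x = np.array([1, 2, 3, 4], dtype=np.float64)
y = np.array([2, 4, 5, 7], dtype=np.float64)

# With w = b = 0 every prediction is 0, so the MSE equals mean(y**2) = 23.5.
loss_at_zero = mse(0.0, 0.0, x, y)

# A hand-picked line that roughly follows the data gives a much smaller loss (0.125).
loss_better = mse(1.5, 0.5, x, y)

print(loss_at_zero, loss_better)
```

A lower MSE means the line fits the data more closely, which is what the optimizer tries to achieve by adjusting `w` and `b`.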

---

### Core Concepts (coming from calculus)

#### 1. **Gradient Descent Optimization**

We use gradients (partial derivatives of the loss function) to update parameters:

* Gradient with respect to the weight $w$:

$$
\frac{\partial \text{MSE}}{\partial w} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i) x_i
$$

* Gradient with respect to the bias $b$:

$$
\frac{\partial \text{MSE}}{\partial b} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)
$$
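One way to gain confidence in these two formulas is to compare them with a numerical approximation. The sketch below does so with central finite differences at `w = b = 0`; the helper names `mse` and `mse_gradients` and the step size `h` are assumptions made for this illustration.

```python
import numpy as np

def mse(w, b, x, y):
    """Mean squared error of the line w * x + b."""
    return np.mean((y - (w * x + b)) ** 2)

def mse_gradients(w, b, x, y):
    """Analytic gradients of the MSE with respect to w and b."""
    n = len(x)
    residual = y - (w * x + b)
    dw = (-2 / n) * np.sum(residual * x)
    db = (-2 / n) * np.sum(residual)
    return dw, db

x = np.array([1, 2, 3, 4], dtype=np.float64)
y = np.array([2, 4, 5, 7], dtype=np.float64)

w0, b0 = 0.0, 0.0
h = 1e-6  # finite-difference step (illustrative choice)

# Analytic gradients from the formulas above.
dw, db = mse_gradients(w0, b0, x, y)

# Central finite differences: perturb one parameter at a time.
dw_num = (mse(w0 + h, b0, x, y) - mse(w0 - h, b0, x, y)) / (2 * h)
db_num = (mse(w0, b0 + h, x, y) - mse(w0, b0 - h, x, y)) / (2 * h)

print(f"dw: analytic {dw:.4f}, numerical {dw_num:.4f}")
print(f"db: analytic {db:.4f}, numerical {db_num:.4f}")
```

At the starting point both gradients are negative (`dw = -26.5`, `db = -9`), so moving against the gradient increases `w` and `b`.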

#### 2. **Adagrad Optimizer**

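Plain gradient descent uses one fixed learning rate for every parameter. Adagrad instead scales the learning rate **individually for each parameter** according to that parameter's gradient history, which:

* Keeps steps relatively large for parameters tied to infrequent features, whose accumulated gradients stay small
* Shrinks steps for parameters that are updated frequently or with large gradients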

Update rule:

$$
\theta := \theta - \frac{\text{lr}}{\sqrt{G_\theta + \epsilon}} \cdot \frac{\partial L}{\partial \theta}
$$

Where:

* $G_\theta$ = running sum of the squared gradients of $\theta$ over all steps so far, including the current one
* $\text{lr}$ = base learning rate
* $\epsilon$ = small constant to avoid division by zero
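To make the update rule concrete, here is a minimal sketch of a single Adagrad step for one scalar parameter. The function name `adagrad_step` is hypothetical. The gradient value `-26.5` is the MSE gradient with respect to `w` at `w = b = 0` on this notebook's dataset.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, epsilon=1e-8):
    """Apply one Adagrad update to a scalar parameter.

    Returns the updated parameter and the updated accumulator G_theta.
    """
    accum = accum + grad ** 2                                # add the current squared gradient
    theta = theta - (lr / np.sqrt(accum + epsilon)) * grad   # rescaled gradient step
    return theta, accum

# First update of the weight, starting from w = 0 with an empty accumulator.
w, gw = adagrad_step(theta=0.0, grad=-26.5, accum=0.0)
print(w, gw)  # w is approximately 0.1, gw is 702.25
```

On the very first step the accumulator holds only the current squared gradient, so the update has magnitude close to `lr` whatever the size of the gradient. Later steps shrink as the accumulator grows.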

---

### Code Walkthrough Highlights

| Code Segment                                   | Concept Taught                          |
| ---------------------------------------------- | --------------------------------------- |
| `x`, `y`                                       | Feature and label vectors               |
| `w = 0.0`, `b = 0.0`                           | Parameter initialization                |
| `for epoch in range(epochs)`                   | Iterative training over multiple passes |
| `y_pred = w * x + b`                           | Model prediction                        |
| `dw`, `db`                                     | Gradient computation for MSE loss       |
| `gw_accum += dw**2`                            | Accumulate squared gradients for weight |
| `w -= (lr / np.sqrt(gw_accum + epsilon)) * dw` | Adagrad update for weight               |
| `print(...)`                                   | Final model parameters                  |

---

In [None]:
import numpy as np  # Import NumPy library for numerical operations

# Dataset: x is the number of hours studied, y is the corresponding grade
x = np.array([1, 2, 3, 4], dtype=np.float64)  # Input data (features)
y = np.array([2, 4, 5, 7], dtype=np.float64)  # Output data (labels)

In [None]:
# Initialize parameters
w = 0.0  # Initial weight
b = 0.0  # Initial bias

- lr (learning rate): Controls the step size during parameter updates. A smaller value leads to slower but potentially more stable learning.

In [4]:
lr = 0.1  # Base learning rate; Adagrad rescales it per parameter

In [None]:
epochs = 100  # Number of training iterations
epsilon = 1e-8  # Small value to prevent division by zero

- `gw_accum` and `gb_accum` store the accumulated squared gradients for the weight and the bias, respectively. They are the state that Adagrad carries from one update to the next.

- Accumulating squared gradients lets Adagrad adapt the learning rate of each parameter to that parameter's gradient history: the effective learning rate is `lr / sqrt(accumulated squared gradients + epsilon)`.
- Parameters with large accumulated gradients get a smaller effective learning rate, while parameters that are updated rarely (for example, weights of sparse features) keep a relatively larger one.

- This per-parameter scaling helps with sparse data and reduces the need to hand-pick a single global learning rate that may be too large for some parameters and too small for others. Because the accumulators only grow, the effective learning rate can only shrink as training proceeds.
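The effect of accumulation on frequently and rarely updated parameters can be sketched with two made-up gradient streams. Both streams are hypothetical and chosen only for illustration: a "dense" parameter receives a gradient of 1.0 at every step, and a "sparse" parameter receives one only at every tenth step.

```python
import numpy as np

lr, epsilon = 0.1, 1e-8
steps = 100

# Hypothetical gradient histories (not taken from the regression problem).
dense_grads = np.ones(steps)       # nonzero gradient at every step
sparse_grads = np.zeros(steps)
sparse_grads[::10] = 1.0           # nonzero gradient at steps 0, 10, ..., 90

def effective_lr(grads, lr, epsilon):
    """Effective Adagrad learning rate after each step of a gradient history."""
    accum = np.cumsum(grads ** 2)  # running sum of squared gradients
    return lr / np.sqrt(accum + epsilon)

dense_lr = effective_lr(dense_grads, lr, epsilon)
sparse_lr = effective_lr(sparse_grads, lr, epsilon)

print(f"dense:  {dense_lr[-1]:.4f}")   # lr / sqrt(100), about 0.0100
print(f"sparse: {sparse_lr[-1]:.4f}")  # lr / sqrt(10),  about 0.0316
```

After 100 steps the rarely updated parameter still has roughly three times the effective learning rate of the frequently updated one, and neither rate ever increases.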

In [None]:
# Accumulated squared gradients for Adagrad
gw_accum = 0.0  # Accumulated squared gradient for weight
gb_accum = 0.0  # Accumulated squared gradient for bias

n = len(x)  # Number of data points

In [3]:
for epoch in range(epochs):  # Loop over specified number of epochs
    # Predictions
    y_pred = w * x + b  # Calculate predicted output using current parameters

    # Compute gradients (MSE loss)
    dw = (-2 / n) * np.sum((y - y_pred) * x)  # Gradient of weight
    db = (-2 / n) * np.sum(y - y_pred)  # Gradient of bias

    # Accumulate squared gradients
    gw_accum += dw ** 2  # Accumulate squared gradient for weight
    gb_accum += db ** 2  # Accumulate squared gradient for bias
    # The two lines above maintain Adagrad's per-parameter state: a running
    # sum of squared gradients. Squaring makes every contribution
    # non-negative, so each sum can only grow and the effective learning rate
    # lr / sqrt(sum + epsilon) can only shrink. Parameters with a large
    # gradient history therefore take smaller steps, while parameters that
    # are updated rarely (e.g. sparse features) keep relatively larger ones.

    # Update parameters using Adagrad
    w -= (lr / np.sqrt(gw_accum + epsilon)) * dw  # Update weight using Adagrad
    b -= (lr / np.sqrt(gb_accum + epsilon)) * db  # Update bias using Adagrad

# The two update lines at the end of the loop apply the Adagrad rule:
#   w -= (lr / sqrt(gw_accum + epsilon)) * dw
#   b -= (lr / sqrt(gb_accum + epsilon)) * db
# where
# - w and b are the model's weight and bias,
# - lr is the base learning rate, which sets the overall step size,
# - gw_accum and gb_accum are the accumulated squared gradients for w and b,
# - epsilon is a small constant that keeps the denominator away from zero.
# Each parameter thus gets its own effective learning rate, scaled by its
# gradient history.

# Output the final parameters
print(f"Final weight (w): {w:.4f}")  # Print final weight with 4 decimal places
print(f"Final bias (b): {b:.4f}")  # Print final bias with 4 decimal places

Final weight (w): 1.5788
Final bias (b): 0.5605
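Because this is ordinary linear regression, the parameters that minimize the MSE can also be computed directly, which gives a reference point for the Adagrad result. The sketch below uses `np.polyfit` for that purpose. It is a cross-check added for illustration and is not part of the original training code.

```python
import numpy as np

x = np.array([1, 2, 3, 4], dtype=np.float64)
y = np.array([2, 4, 5, 7], dtype=np.float64)

# Degree-1 least-squares fit; coefficients are returned highest power first,
# so the slope comes before the intercept.
w_ls, b_ls = np.polyfit(x, y, deg=1)

print(f"Least-squares weight: {w_ls:.4f}")  # 1.6000
print(f"Least-squares bias:   {b_ls:.4f}")  # 0.5000
```

The Adagrad run reported above (`w = 1.5788`, `b = 0.5605`) is close to this minimizer but has not fully reached it after 100 epochs. Training for more epochs would likely narrow the gap.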
