# Problem Statement: **BONUS EXERCISE**

Imports and CUDA

Problem 1: **Gradient Descent for Demand Forecasting at AtliQ**

AtliQ wants to optimize the prediction of regional product demands using gradient descent.

Assume the loss function is

$$L(w)=(w-4)^2$$

where **w** is a weight parameter initialized at 0.

**Write code to:**

* Perform 10 iterations of gradient descent using a learning rate of 0.1.

* Print the weight **w** at each step.



In [1]:
import torch
import numpy as np
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Check if CUDA (GPU) is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

Using device: cpu


### Solution 1:
### 1. Deriving the Gradient of the Loss Function

The loss function is:

$$L(w) = (w - 4)^2$$

Apply the power rule

$$\frac{d}{dx}(x^2) = 2x$$

with $x = w - 4$. By the chain rule, the inner term $w - 4$ contributes a factor of 1, so the derivative of $(w - 4)^2$ becomes:

$$\frac{d}{dw} (w - 4)^2 = 2(w - 4)$$

Thus, the gradient of the loss function is:

$$\frac{dL}{dw} = 2(w - 4)$$
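
As a quick sanity check on this derivative, the sketch below compares it with a central finite-difference estimate. The helper names `loss`, `analytic_grad`, and `numeric_grad`, the step size `h`, and the test points are illustrative choices, not part of the exercise.

```python
def loss(w):
    # L(w) = (w - 4)^2, the loss from the problem statement
    return (w - 4) ** 2


def analytic_grad(w):
    # dL/dw = 2(w - 4), as derived above
    return 2 * (w - 4)


def numeric_grad(f, w, h=1e-5):
    # Central difference approximation of df/dw; h is an assumed step size
    return (f(w + h) - f(w - h)) / (2 * h)


for w in [0.0, 2.5, 4.0, 7.0]:
    print(f"w = {w}: analytic = {analytic_grad(w):.4f}, "
          f"numeric = {numeric_grad(loss, w):.4f}")
```

For a quadratic loss the two columns should agree up to floating-point rounding.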

### 2. Gradient Descent Weight Update Formula

Gradient descent updates the weight using:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{dL}{dw}$$

Here, $\eta$ represents the **learning rate**, which controls how big a step we take during each update.

Substituting the gradient expression:

$$w_{\text{new}} = w_{\text{old}} - \eta \cdot 2(w_{\text{old}} - 4)$$



In [9]:
learning_rate = 0.1
w = 0.0

for i in range(10):
    gradient = 2 * (w - 4)  # dL/dw
    w = w - learning_rate * gradient  # Update rule
    print(f"Step {i+1}: w = {w:.4f}")


Step 1: w = 0.8000
Step 2: w = 1.4400
Step 3: w = 1.9520
Step 4: w = 2.3616
Step 5: w = 2.6893
Step 6: w = 2.9514
Step 7: w = 3.1611
Step 8: w = 3.3289
Step 9: w = 3.4631
Step 10: w = 3.5705
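
The printed weights can be cross-checked against a closed form. Subtracting 4 from both sides of the update rule gives $w_{\text{new}} - 4 = (1 - 2\eta)(w_{\text{old}} - 4)$, so with $w_0 = 0$ and $\eta = 0.1$ the weight after $t$ steps is $w_t = 4 - 4 \cdot 0.8^t$. A minimal sketch of that check, where `closed_form` is a name introduced here for illustration:

```python
learning_rate = 0.1
w = 0.0

for t in range(1, 11):
    # Iterative update, same rule as in the solution
    w = w - learning_rate * 2 * (w - 4)
    # Closed form: the distance to 4 shrinks by (1 - 2 * learning_rate) each step
    closed_form = 4 - 4 * (1 - 2 * learning_rate) ** t
    print(f"Step {t}: iterative = {w:.4f}, closed form = {closed_form:.4f}")
```

The gap to the minimum at $w = 4$ shrinks by a factor of 0.8 per step, which is why ten iterations reach only about 3.57.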




---



Problem 2: **Momentum for Contour Navigation in AtliQ's Supply Chain**

AtliQ's supply chain optimization problem is represented by a contour map of a quadratic function:

$$f(x,y)=x^2 +3y^2$$

Write code to implement gradient descent with momentum (5 iterations) to minimize this function.

Use:
* Initial point (x, y) = (2, 2)
* Learning rate (η) = 0.1
* Momentum coefficient (β) = 0.9

### Solution 2: 
### 1. Deriving the Gradient of the Function

We want to minimize the function:

$$
f(x, y) = x^2 + 3y^2
$$

Since this is a function of two variables, we compute the **partial derivatives** with respect to each variable.  
We apply the power rule. For example, $\frac{\partial}{\partial x}(x^2) = 2x$ and $\frac{\partial}{\partial y}(y^2) = 2y$, treating the other variable as a constant.

- Partial derivative with respect to $x$:

$$
\frac{\partial f}{\partial x} = 2x
$$

- Partial derivative with respect to $y$:

$$
\frac{\partial f}{\partial y} = 3 \cdot 2y = 6y
$$

Thus, the gradient of $f(x, y)$ is $\nabla f = (2x,\ 6y)$.
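
The idea of "treating the other variable as a constant" can be illustrated numerically: nudge one coordinate while holding the other fixed. The sketch below does this with central differences at the starting point (2, 2). The function names and the step size `h` are assumptions made for illustration.

```python
def f(x, y):
    # f(x, y) = x^2 + 3y^2 from the problem statement
    return x**2 + 3 * y**2


def analytic_gradient(x, y):
    # Partial derivatives derived above
    return 2 * x, 6 * y


def numeric_gradient(x, y, h=1e-5):
    # Vary x with y held fixed, then vary y with x held fixed
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy


ax, ay = analytic_gradient(2.0, 2.0)
nx, ny = numeric_gradient(2.0, 2.0)
print(f"analytic: ({ax:.4f}, {ay:.4f})")
print(f"numeric:  ({nx:.4f}, {ny:.4f})")
```

At (2, 2) the slope along $y$ is three times the slope along $x$, so the $y$ coordinate moves faster in the early iterations.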



### 2. Momentum Update Formula

Momentum helps accelerate gradient descent by accumulating a velocity term.  
The velocity updates are computed as:

$$
v_x = \beta \cdot v_x + (1 - \beta)\cdot\,\frac{\partial f}{\partial x}
$$

$$
v_y = \beta \cdot v_y + (1 - \beta)\cdot\,\frac{\partial f}{\partial y}
$$

where:

- $\beta$ is the **momentum coefficient**  
- $\eta$ is the **learning rate**.

After computing the velocities, the parameters are updated using:

$$
x_{\text{new}} = x_{\text{old}} - \eta \cdot v_x
$$

$$
y_{\text{new}} = y_{\text{old}} - \eta \cdot v_y
$$
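
With the velocity initialized to zero, unrolling the recursion shows that the velocity after $t$ steps is an exponentially weighted sum of all past gradients: $v_x^{(t)} = (1 - \beta)\sum_{k=1}^{t} \beta^{\,t-k} g_k$. The sketch below checks this numerically for the $x$ coordinate under the problem's settings. The names `grad_history` and `unrolled` are introduced here for illustration.

```python
beta = 0.9   # momentum coefficient
eta = 0.1    # learning rate

x, vx = 2.0, 0.0
grad_history = []

for t in range(1, 6):
    g = 2 * x  # df/dx at the current point
    grad_history.append(g)

    # Recursive velocity update
    vx = beta * vx + (1 - beta) * g

    # Same quantity written as a weighted sum over all past gradients
    unrolled = (1 - beta) * sum(
        beta ** (t - k) * g_k for k, g_k in enumerate(grad_history, start=1)
    )

    x -= eta * vx
    print(f"Step {t}: vx = {vx:.6f}, unrolled = {unrolled:.6f}, x = {x:.4f}")
```

Older gradients are discounted by a factor of $\beta$ per step, so recent gradients dominate while the direction of earlier ones still carries over.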


In [8]:
def gradient(x, y):
    return 2*x, 6*y  # Gradients of f(x, y)

x, y = 2.0, 2.0
learning_rate = 0.1
momentum = 0.9

vx, vy = 0.0, 0.0  # initialized velocity

for i in range(5):
    dx, dy = gradient(x, y)
    # Velocity: exponentially weighted average of past gradients
    vx = momentum * vx + (1 - momentum) * dx
    vy = momentum * vy + (1 - momentum) * dy
    # Step against the velocity, scaled by the learning rate.
    # (In the variant v = momentum * v - learning_rate * grad, the step is x += v instead.)
    x -= learning_rate * vx
    y -= learning_rate * vy
    print(f"Step {i+1}: x = {x:.4f}, y = {y:.4f}")


Step 1: x = 1.9600, y = 1.8800
Step 2: x = 1.8848, y = 1.6592
Step 3: x = 1.7794, y = 1.3609
Step 4: x = 1.6490, y = 1.0108
Step 5: x = 1.4986, y = 0.6351




---



Problem 3: **RMSProp for AtliQ's Dynamic Pricing Optimization**

AtliQ's AI model adjusts product prices dynamically. Implement the RMSProp optimizer for minimizing the function:

$$f(w) = w^2 + 5$$

Use:

* Initial weight (w) = 5.0
* Learning rate (η) = 0.01
* Decay rate (β) = 0.9


Run the optimization for 15 iterations and print the weight updates.

### Solution 3:

### 1. Deriving the Gradient of the Function

We want to minimize the function:

$$
f(w) = w^2 + 5
$$

To compute the gradient, we differentiate using the power rule  $\frac{d}{dx}(x^2) = 2x$:

$$
\frac{df}{dw} = 2w
$$

Thus, the gradient of $f(w)$ is simply $2w$.



### 2. RMSProp Update Rule

RMSProp helps stabilize gradient descent by scaling the learning rate using a **moving average of squared gradients**.

The moving average of squared gradients is computed as:

$$
s_{new} = \beta \cdot s_{old} + (1 - \beta)\,\cdot g^2
$$

where:

- $g$ is the gradient, so here $g = 2w$
- $\beta$ is the **decay rate**
- $s$ stores the **running average of squared gradients**
- $\epsilon$ is a small constant to avoid division by zero. A common choice is $10^{-8}$

The RMSProp update rule is:

$$
w_{new}
= w_{old} - \frac{\eta}{\sqrt{s_{new} + \epsilon}}\,\cdot g
$$

where:

- $\eta$ is the **learning rate**
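
One consequence of this rule is worth seeing numerically. If $s$ starts at 0, the first update has $s_{new} = (1 - \beta)\, g^2$, so the first step is roughly $\eta / \sqrt{1 - \beta}$ regardless of the gradient's size, as long as $g^2$ is much larger than $\epsilon$. The sketch below assumes that zero initialization. The function `rmsprop_first_step` and the starting weights are hypothetical choices for illustration.

```python
def rmsprop_first_step(w, learning_rate=0.01, beta=0.9, epsilon=1e-8):
    # Size of the first RMSProp step for f(w) = w^2 + 5, assuming s_old = 0
    grad = 2 * w
    s = (1 - beta) * grad**2
    return learning_rate * grad / (s + epsilon) ** 0.5


for w0 in [0.5, 5.0, 500.0]:
    print(f"w0 = {w0}: first step = {rmsprop_first_step(w0):.6f}")

# Reference value eta / sqrt(1 - beta) for eta = 0.01, beta = 0.9
reference = 0.01 / (1 - 0.9) ** 0.5
print(f"eta / sqrt(1 - beta) = {reference:.6f}")
```

With $\eta = 0.01$ and $\beta = 0.9$ the reference value is about 0.0316, so the first step has nearly the same size whether the gradient is 1 or 1000.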



In [10]:
def gradient(w):
    return 2 * w   # Gradient of f(w)

w = 5.0                     
learning_rate = 0.01        
beta = 0.9                  
epsilon = 1e-8
squared_gradient_average = 0.0   # initialized squared gradient average

for i in range(15):
    grad = gradient(w)
    # RMSProp squared gradient average update
    squared_gradient_average = beta * squared_gradient_average + (1 - beta) * (grad ** 2)
    # RMSProp update rule
    w = w - (learning_rate / ((squared_gradient_average + epsilon) ** 0.5)) * grad
    print(f"Step {i+1}: w = {w:.4f}")

Step 1: w = 4.9684
Step 2: w = 4.9455
Step 3: w = 4.9264
Step 4: w = 4.9094
Step 5: w = 4.8939
Step 6: w = 4.8794
Step 7: w = 4.8657
Step 8: w = 4.8526
Step 9: w = 4.8400
Step 10: w = 4.8277
Step 11: w = 4.8158
Step 12: w = 4.8041
Step 13: w = 4.7927
Step 14: w = 4.7814
Step 15: w = 4.7704




---



Problem 4: **Adam Optimizer for AtliQ AI Models**

AtliQ is training an AI model to recommend warehouse restocking schedules. Use the Adam optimizer to minimize the function:

$$f(x) = x^4 - 3x^3 + 2$$

Write code to:

* Initialize x = 3.0

Run the optimization for 19 iterations (with the step counter t starting from 1), using:
* Learning rate (η) = 0.01
* Momentum coefficients: β1 = 0.9, β2 = 0.09


### Solution 4:

### 1. Deriving the Gradient of the Function

We want to minimize the function:

$$
f(x) = x^4 - 3x^3 + 2
$$

To compute the gradient, we differentiate each term using the power rule. The derivative becomes:

$$
\frac{df}{dx} = 4x^3 - 9x^2
$$

Thus, the gradient of $f(x)$ is simply $4x^3 - 9x^2$.
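
Factoring the gradient as $x^2(4x - 9)$ shows that it vanishes at $x = 0$ and $x = 9/4 = 2.25$. The sketch below evaluates the function, the gradient, and the second derivative at those two points. `second_derivative` is a helper added here for illustration and is not needed by the optimizer.

```python
def f(x):
    # f(x) = x^4 - 3x^3 + 2 from the problem statement
    return x**4 - 3 * x**3 + 2


def gradient(x):
    # First derivative, as derived above
    return 4 * x**3 - 9 * x**2


def second_derivative(x):
    # Derivative of the gradient: 12x^2 - 18x
    return 12 * x**2 - 18 * x


for x_star in [0.0, 2.25]:
    print(f"x = {x_star}: f = {f(x_star):.4f}, "
          f"f' = {gradient(x_star):.4f}, f'' = {second_derivative(x_star):.4f}")
```

The second derivative is positive at $x = 2.25$, so that point is a local minimum, and it is the point the optimizer moves toward from $x = 3$. At $x = 0$ the second derivative is zero, so this test alone is inconclusive there.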

### 2. Adam Optimizer Update Rule

Adam combines **Momentum** and **RMSProp** by maintaining:

- a running average of gradients (first moment)
- a running average of squared gradients (second moment)

#### **Biased moment estimates**

The first and second moment updates are:

$$
m_{new} = \beta_1 \cdot  m_{old} + (1 - \beta_1) \cdot  g
$$

$$
v_{new} = \beta_2 \cdot  v_{old} + (1 - \beta_2) \cdot  g^2
$$

where:

- $g$ is the gradient, so here $g = 4x^3 - 9x^2$ 
- $\beta_1$ controls the decay of the first moment  
- $\beta_2$ controls the decay of the second moment  
- $m$ and $v$ store accumulated averages

#### **Bias corrections**

Because moments start at zero, Adam applies bias correction:

$$
\hat{m} = \frac{m}{1 - \beta_1^t}
$$

$$
\hat{v} = \frac{v}{1 - \beta_2^t}
$$

#### **Adam update rule**

The parameter update is:

$$
x_{new}= x_{old} - \frac{\eta}{\sqrt{\hat{v}} + \epsilon} \,\cdot \hat{m}
$$

where:

- $\eta$ is the **learning rate**  
- $\epsilon$ is a small constant added to prevent division by zero  
  (typically chosen as $10^{-8}$ in the Adam optimizer)
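
To see what bias correction does, consider the first step ($t = 1$) with both moments starting at zero, which is an assumption of this sketch. The corrected moments then recover $g$ and $g^2$, so the first update has magnitude close to $\eta$. The gradient is evaluated at the starting point $x = 3$ from the problem statement. Variable names are illustrative.

```python
learning_rate = 0.01
beta1, beta2 = 0.9, 0.09
epsilon = 1e-8

x = 3.0
g = 4 * x**3 - 9 * x**2      # gradient at x = 3, equal to 27

t = 1
m = (1 - beta1) * g          # biased first moment (m_old = 0)
v = (1 - beta2) * g**2       # biased second moment (v_old = 0)
m_hat = m / (1 - beta1**t)   # bias-corrected first moment, recovers g
v_hat = v / (1 - beta2**t)   # bias-corrected second moment, recovers g^2

step = learning_rate * m_hat / (v_hat**0.5 + epsilon)
print(f"m = {m:.4f}, m_hat = {m_hat:.4f}")
print(f"v = {v:.4f}, v_hat = {v_hat:.4f}")
print(f"first step = {step:.6f}, new x = {x - step:.4f}")
```

Without the correction, the first step would use $m = 0.1\,g$ in place of $g$ and would be much smaller than the learning rate suggests.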


In [26]:
def gradient(x):
    # Gradient of f(x) = x^4 - 3x^3 + 2
    return 4 * x**3 - 9 * x**2

x = 3.0
learning_rate = 0.01
beta1, beta2 = 0.9, 0.09
epsilon = 1e-8
first_moment, second_moment = 0.0, 0.0  # initialized first and second moment

for t in range(1, 20):  # t = 1, ..., 19
    grad = gradient(x)
    m = beta1 * first_moment + (1 - beta1) * grad  # update biased first moment
    v = beta2 * second_moment + (1 - beta2) * (grad ** 2)  # update biased second moment
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    x = x - learning_rate * m_hat / ((v_hat ** 0.5) + epsilon)  # update rule
    first_moment, second_moment = m, v
    print(f"Step {t}: x = {x:.4f}")


Step 1: x = 2.9900
Step 2: x = 2.9799
Step 3: x = 2.9697
Step 4: x = 2.9595
Step 5: x = 2.9491
Step 6: x = 2.9387
Step 7: x = 2.9281
Step 8: x = 2.9174
Step 9: x = 2.9067
Step 10: x = 2.8958
Step 11: x = 2.8849
Step 12: x = 2.8739
Step 13: x = 2.8627
Step 14: x = 2.8515
Step 15: x = 2.8401
Step 16: x = 2.8287
Step 17: x = 2.8171
Step 18: x = 2.8054
Step 19: x = 2.7937




---

