# PyTorch Optimizers

PyTorch optimizers are responsible for updating a model's weights on each iteration of the training loop. How the weights are updated depends on the optimization algorithm (e.g. stochastic gradient descent). In most cases an optimizer is preferable to hand-written gradient updates: it packages the update rule behind a single interface, which is less error-prone, and adaptive algorithms such as Adam may train the model in fewer iterations.
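To make the comparison with manual updates concrete, here is a minimal sketch that minimizes y = x² twice: once by updating `x` by hand and once with `optim.SGD`. The helper names `manual_descent` and `optimizer_descent` are illustrative and not part of PyTorch. For plain SGD without momentum, the two approaches should agree up to floating-point rounding.

```python
import torch
from torch import optim


def manual_descent(start: float, lr: float, steps: int) -> float:
    """Plain gradient descent on y = x**2, updating x by hand."""
    x = torch.tensor([start], dtype=torch.float32, requires_grad=True)
    for _ in range(steps):
        y = (x ** 2).sum()
        y.backward()
        with torch.no_grad():   # the update itself must not be tracked by autograd
            x -= lr * x.grad
        x.grad.zero_()          # gradients accumulate unless cleared
    return x.item()


def optimizer_descent(start: float, lr: float, steps: int) -> float:
    """The same descent, with the update rule delegated to optim.SGD."""
    x = torch.tensor([start], dtype=torch.float32, requires_grad=True)
    optimizer = optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        y = (x ** 2).sum()
        y.backward()
        optimizer.step()
    return x.item()


manual_x = manual_descent(15.0, 0.05, 20)
optim_x = optimizer_descent(15.0, 0.05, 20)
print(manual_x, optim_x)
```

The manual version has two easy-to-forget details (`torch.no_grad()` and clearing the gradient), which is part of why the optimizer interface is usually the safer choice.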

PyTorch ships with many optimizers in `torch.optim`, including:
- SGD (stochastic gradient descent)
- Adam (adaptive moment estimation)
- RMSprop
- Adagrad
- AdamW

In [1]:
import torch
from torch import optim

x = torch.tensor([15], dtype=torch.float32, requires_grad=True)

optimizer = optim.SGD([x], lr=0.05) #stochastic gradient descent with learning rate = 0.05

x_iters = [x.item()]
y_hist = [x.item() ** 2]

iterations = 100
for i in range(iterations):
    optimizer.zero_grad() #clear accumulated gradient calculations

    y = x ** 2 # forward pass to calculate function output

    y.backward() # calculate gradient value at x

    if (i + 1) % 20 == 0:
        print(f"Iteration {i+1}: pre-update x = {x.item():.4f}")

    optimizer.step() # update x in place, moving it one step toward the minimizer

    if (i + 1) % 20 == 0:
        x_iters.append(x.item())
        y_hist.append(x.item() ** 2)
        # note: y was computed before the step, so the value printed here corresponds to the pre-update x
        print(f"Iteration {i+1}: post-update x = {x.item():.4f}, \nresulting y = {y.item():.4f}")

print(x_iters)
print(y_hist)

Iteration 20: pre-update x = 2.0263
Iteration 20: post-update x = 1.8236, 
resulting y = 4.1058
Iteration 40: pre-update x = 0.2463
Iteration 40: post-update x = 0.2217, 
resulting y = 0.0607
Iteration 60: pre-update x = 0.0300
Iteration 60: post-update x = 0.0270, 
resulting y = 0.0009
Iteration 80: pre-update x = 0.0036
Iteration 80: post-update x = 0.0033, 
resulting y = 0.0000
Iteration 100: pre-update x = 0.0004
Iteration 100: post-update x = 0.0004, 
resulting y = 0.0000
[15.0, 1.823649525642395, 0.22171321511268616, 0.026955142617225647, 0.0032771159894764423, 0.0003984206705354154]
[225.0, 3.3256975923757324, 0.049156749755604245, 0.0007265797135149743, 1.0739489208482162e-05, 1.5873903070989003e-07]
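The printed values can be checked by hand. For y = x² the gradient is 2x, so each plain SGD step with learning rate 0.05 maps x to x - 0.05 · 2x = 0.9x, and after n steps x is 15 · 0.9ⁿ. A small sketch of that closed form (the name `closed_form_x` is illustrative):

```python
def closed_form_x(n: int, x0: float = 15.0, lr: float = 0.05) -> float:
    """x after n plain SGD steps on y = x**2: each step multiplies x by (1 - 2*lr)."""
    return x0 * (1.0 - 2.0 * lr) ** n


for n in (20, 40, 60, 80, 100):
    print(n, closed_form_x(n))
```

Up to float32 rounding, these should match the entries of `x_iters` after the initial 15.0.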


To summarize, the steps for using a built-in PyTorch optimizer are:

1. Initialize the optimizer object with the variables to optimize (e.g. the weights that a loss function depends on) and its hyperparameters (e.g. learning rate, momentum)
    - Note that different optimization algorithms take different hyperparameters.
2. Start the optimization loop
3. Zero out the existing gradients with `optimizer.zero_grad()`
4. Forward pass (i.e. calculate the function output from the current inputs)
5. Backward pass (i.e. calculate the gradient evaluated at the current inputs): `y.backward()`
6. Update the variables to their next values with `optimizer.step()`
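The steps above can be sketched as a small reusable helper. The function name `minimize` and its parameters are hypothetical. It uses `optim.Adam` rather than SGD only to show that switching algorithms changes step 1 and leaves the rest of the loop untouched.

```python
from typing import Callable

import torch
from torch import optim


def minimize(
    fn: Callable[[torch.Tensor], torch.Tensor],
    start: float,
    steps: int = 500,
    lr: float = 0.1,
) -> float:
    """Minimize a scalar function of one variable, following steps 1-6."""
    x = torch.tensor([start], dtype=torch.float32, requires_grad=True)
    optimizer = optim.Adam([x], lr=lr)   # step 1: Adam has no momentum argument, unlike SGD
    for _ in range(steps):               # step 2
        optimizer.zero_grad()            # step 3
        y = fn(x).sum()                  # step 4
        y.backward()                     # step 5
        optimizer.step()                 # step 6
    return x.item()


# (x - 3)^2 has its minimum at x = 3
x_min = minimize(lambda x: (x - 3.0) ** 2, start=0.0)
print(x_min)
```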

In [2]:
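# try in an actual learning case: build a synthetic linear-regression dataset

import numpy as np

SEED = 9999
np.random.seed(SEED) # seed NumPy before any random draw so the data is reproducible

weight = np.array([0.77, -0.56]) # true weights the model should recover
bias = np.random.normal(0, 12) # true bias, drawn once (mean 0, std 12)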

X = np.random.rand(100, 2) * 10
y = X @ weight + bias

print(X.size) # NumPy's .size is the total number of elements (100 * 2)
print(y.size)

X = torch.from_numpy(X)
print(X.dtype)
y = torch.from_numpy(y)

print(X.size())
print(y.size())

200
100
torch.float64
torch.Size([100, 2])
torch.Size([100])
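The `torch.float64` in the output is worth noting: NumPy's default float is 64-bit and `torch.from_numpy` keeps it, whereas PyTorch creates float32 tensors by default. Data and parameters need the same dtype, so either the parameters are created as float64 or the data is cast to float32. A minimal sketch of the second option (variable names are illustrative):

```python
import numpy as np
import torch

data = np.random.rand(4, 2)               # NumPy defaults to float64
as_f64 = torch.from_numpy(data)           # shares memory with `data`, keeps float64
as_f32 = torch.from_numpy(data).float()   # copies into a float32 tensor

print(as_f64.dtype, as_f32.dtype)

w32 = torch.randn(2)                      # float32 unless the default dtype was changed
# `as_f64 @ w32` would be expected to fail with a dtype mismatch, so match dtypes first
pred = as_f32 @ w32
print(pred.shape)
```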


In [None]:
# simple linear regression learning

from torch import nn

class OLS(nn.Module):
    def __init__(self):
        super().__init__()

        #initialize weights with a random vector
        self.weights = nn.Parameter(
            torch.randn(
                2,
                requires_grad=True, #PyTorch will track gradients of this param
                dtype=torch.float64
            )
        )

        #initialize bias with a random scalar
        self.bias = nn.Parameter(
            torch.randn(
                1,
                requires_grad=True, #PyTorch will track gradients of this param
                dtype=torch.float64
            )
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.weights + self.bias
    
model = OLS() #initialize class
print(list(model.parameters())) #checks current state of model (prior to training)
print(model.state_dict())

X_train, X_validate, X_test = X[:80], X[80:90], X[90:]
y_train, y_validate, y_test = y[:80], y[80:90], y[90:]

iterations = 500 # number of epochs: each iteration is one full pass over the training set
learning_rate = 0.00005

optimizer = optim.SGD(model.parameters(), lr = learning_rate, momentum = 0.9) #step 1

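model.train() # training mode (no effect on this model, but set once before the loop)

for epoch in range(iterations): #step 2
    optimizer.zero_grad() #step 3

    y_pred = model(X_train) #step 4

    loss = torch.sum((y_pred - y_train) ** 2) # sum of squared errors
    loss.backward() #step 5

    optimizer.step() #step 6

    if epoch % 100 == 0:
        # weights are post-update; the loss was computed before this epoch's update
        print(f"Epoch: {epoch+1} \n Weights: {model.weights} \n Training Loss: {loss.item():.4f}")

model.eval() # switch to evaluation mode once training is done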

print(f"final weights: {model.weights}")


[Parameter containing:
tensor([-1.7755,  1.4201], dtype=torch.float64, requires_grad=True), Parameter containing:
tensor([1.0382], dtype=torch.float64, requires_grad=True)]
OrderedDict([('weights', tensor([-1.7755,  1.4201], dtype=torch.float64)), ('bias', tensor([1.0382], dtype=torch.float64))])
Epoch: 1 
 Weights: Parameter containing:
tensor([-1.2778,  1.5626], dtype=torch.float64, requires_grad=True) 
 Training Loss: 13091.3530
Epoch: 101 
 Weights: Parameter containing:
tensor([ 0.9495, -0.3987], dtype=torch.float64, requires_grad=True) 
 Training Loss: 34.4812
Epoch: 201 
 Weights: Parameter containing:
tensor([ 0.8271, -0.5016], dtype=torch.float64, requires_grad=True) 
 Training Loss: 3.8853
Epoch: 301 
 Weights: Parameter containing:
tensor([ 0.7892, -0.5404], dtype=torch.float64, requires_grad=True) 
 Training Loss: 0.4403
Epoch: 401 
 Weights: Parameter containing:
tensor([ 0.7765, -0.5534], dtype=torch.float64, requires_grad=True) 
 Training Loss: 0.0499
final weights: Para