# Gradient Descent in PyTorch
* Notebook by Adam Lang
* Date: 9/30/2024

# Gradient Descent - Classification
* Gradient descent is classified by the amount of data to compute the gradient of loss function.
1. Stochastic
  * m iterations per epoch.
  * SGD Update = 50*10 = 500 times
  * Recommendation: shuffle dataset before each epoch!
  * SGD will improve model performance by frequently updating weights and biases during each epoch.
    * Drawback is that the weights are updated based on just 1 sample which can cause fluctuations in loss values.
    * This drawback makes it difficult and rare to use on "real-world" data.
2. Batch
  * Aggregates weights and biases.
  * Batch gradient descent update = 10 times for 10 epochs
  * **Advantages**:
    * computational efficiency
    * stable error gradient
    * faster convergence
  * **Drawbacks**:
    * would need several epochs for training
    * requires full dataset in memory limiting scalability
3. **Mini-Batch**
  * **MOST optimal gradient descent method as it combines SGD with Batch.**
    * First divides training dataset into subsets.
    * During each epoch only update weights ONCE.
    * Forward propagation on subset --> loss calculated --> update weights --> next subset, etc..
    * Example:
      * 50 samples /10 = 5 batch
      * num epochs = 5 (weights updated 5 times
    * Example 2:
      * 800 samples / 80 batch -> 10 subsets
      * 10 subsets * 10 epochs = 100 times updated weights

* Batch-sizes of 2^n is preferred:
   * e.g. 16, 32, 64, 128, 256, 512, 1024, etc...

## Implementing Gradient Descent on university data

In [1]:
## load data
import pandas as pd
import numpy as np

In [2]:
## data path
data_path = '/content/drive/MyDrive/Colab Notebooks/Deep Learning Notebooks/Prodigy University Dataset.csv'
## load data
data = pd.read_csv(data_path)
data.head()

Unnamed: 0,sat_sum,hs_gpa,fy_gpa
0,508,3.4,3.18
1,488,4.0,3.33
2,464,3.75,3.25
3,380,3.75,2.42
4,428,4.0,2.63


## Data pre-processing

In [3]:
# convert vars to numpy - 2D array
X = data[['sat_sum', 'hs_gpa']].values

## reshape fy_gpa into 2D array
y = data['fy_gpa'].values.reshape(-1,1)


print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

Shape of X: (1000, 2)
Shape of y: (1000, 1)


In [4]:
## create train test split
from sklearn.model_selection import train_test_split

## split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
## standard scaler
from sklearn.preprocessing import StandardScaler

# normalize feature so it is easier to train data
## setup scaler
scaler = StandardScaler()

## fit_transform X and y train data
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [6]:
## shape X_train
print(X_train.shape)

(800, 2)


In [7]:
import torch
## convert numpy to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)

X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)

In [8]:
## tensor shape
X_train_tensor.shape

torch.Size([800, 2])

## Build Linear Regression model

In [9]:
import torch.nn as nn

In [10]:
## build a model with 2 neurons
## Sequential --> forward propagation
model = nn.Sequential(
    nn.Linear(2, 2), ##2 inputs, 2 outputs
    nn.Sigmoid(), ## non-linear logistic hidden layer
    nn.Linear(2, 1) ##2 inputs, 1 output --> target pred
)

In [11]:
## forward prop
preds = model(X_train_tensor)

In [12]:
## first 5 predictions
preds[:5]

tensor([[-0.4380],
        [-0.4455],
        [-0.5019],
        [-0.4481],
        [-0.5002]], grad_fn=<SliceBackward0>)

In [13]:
## MSE mean squared error
from torch.nn import MSELoss

In [14]:
## compute MSE loss
criterion = MSELoss()
loss = criterion(preds, y_train_tensor)
print(loss)

tensor(9.0811, grad_fn=<MseLossBackward0>)


In [15]:
## compare reds on X_train vs target
preds[:5]

tensor([[-0.4380],
        [-0.4455],
        [-0.5019],
        [-0.4481],
        [-0.5002]], grad_fn=<SliceBackward0>)

In [16]:
## y_train
y_train_tensor[:5]

tensor([[2.0000],
        [3.1100],
        [1.6300],
        [3.0200],
        [1.5500]])

Get Model Weights

In [17]:
## model weights
model[0].weight

Parameter containing:
tensor([[ 0.3655, -0.0605],
        [-0.0774, -0.6032]], requires_grad=True)

In [18]:
## weights
model[2].weight

Parameter containing:
tensor([[-0.0879, -0.2757]], requires_grad=True)

## Gradient Descent
* We will update the model weights using gradient descent to improve our previous linear regression model.

In [19]:
import torch.optim as optim

## init optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [20]:
## back propagation
loss.backward()

In [21]:
## apply update weights to our model
optimizer.step()

In [22]:
## lets see updated weights
model[0].weight

Parameter containing:
tensor([[ 0.3655, -0.0605],
        [-0.0774, -0.6032]], requires_grad=True)

In [23]:
## weigths
model[2].weight

Parameter containing:
tensor([[-0.0852, -0.2736]], requires_grad=True)

Summary:
* We can see that SGD updated the weights compared the forward pass.

## Manually coding is not efficient to update weights
* This is why we can take advantage of the torch.utils.data library to update the weights so we dont have to continually hardcode this.

In [24]:
from torch.utils.data import TensorDataset, DataLoader

In [25]:
##setup train data
train_data = TensorDataset(X_train_tensor, y_train_tensor)

In [26]:
## setup our linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2, 1)
)
## optimizer to update weights
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [28]:
## performance on train and test sets before training
## criterion is loss function (MSELoss())
train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
print(f"Without training:\nTrain Loss: {train_loss:.4f}, Test Loss: {test_loss:.4f}")

Without training:
Train Loss: 7.0744, Test Loss: 7.3450


In [29]:
## look at predictions
model(X_train_tensor)[:5]

tensor([[-0.2516],
        [-0.2260],
        [-0.0442],
        [-0.1580],
        [-0.1007]], grad_fn=<SliceBackward0>)

Summary:
* We can see the loss is VERY HIGH and the predictions are all negative.
* Let's try varying types of gradient descent to optimize this.

## Stochastic Gradient Descent

In [32]:
## shape of train_data
len(train_data)

800

We can see there are 800 samples in the dataset.
* So based on what we saw above, if we want 10 subsets of 80 --> 800 samples.
  * This means we have to update the weights 10 times for 10 epochs = 100 times

In [33]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=1,
                          shuffle=True) ## shuffle the data!

## Training loop for 10 epochs
for epoch in range(10):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
  test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
  print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 1: | Train Loss: 0.5903, | Test Loss: 0.6639
Epoch 2: | Train Loss: 0.4793, | Test Loss: 0.5372
Epoch 3: | Train Loss: 0.4295, | Test Loss: 0.4874
Epoch 4: | Train Loss: 0.3990, | Test Loss: 0.4564
Epoch 5: | Train Loss: 0.3801, | Test Loss: 0.4364
Epoch 6: | Train Loss: 0.3681, | Test Loss: 0.4283
Epoch 7: | Train Loss: 0.3609, | Test Loss: 0.4173
Epoch 8: | Train Loss: 0.3555, | Test Loss: 0.4129
Epoch 9: | Train Loss: 0.3526, | Test Loss: 0.4099
Epoch 10: | Train Loss: 0.3499, | Test Loss: 0.4106


In [34]:
## now get predictions
model(X_train_tensor)[:5]

tensor([[2.1802],
        [2.1901],
        [2.1239],
        [2.4677],
        [1.9195]], grad_fn=<SliceBackward0>)

## Batch Gradient Descent

In [35]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

Note below:
* We are changing batch_size to 800 to apply entire dataset.

In [37]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=800,
                          shuffle=True) ## shuffle the data!

## Training loop for 1000 epochs -- more effective training on 800 samples
for epoch in range(1000):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 100 == 0: ## print after every 100 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 100: | Train Loss: 0.5743, | Test Loss: 0.6425
Epoch 200: | Train Loss: 0.5641, | Test Loss: 0.6296
Epoch 300: | Train Loss: 0.5564, | Test Loss: 0.6198
Epoch 400: | Train Loss: 0.5502, | Test Loss: 0.6120
Epoch 500: | Train Loss: 0.5448, | Test Loss: 0.6054
Epoch 600: | Train Loss: 0.5400, | Test Loss: 0.5995
Epoch 700: | Train Loss: 0.5354, | Test Loss: 0.5941
Epoch 800: | Train Loss: 0.5310, | Test Loss: 0.5891
Epoch 900: | Train Loss: 0.5268, | Test Loss: 0.5844
Epoch 1000: | Train Loss: 0.5228, | Test Loss: 0.5798


## Mini-Batch Gradient Descent
* Now we can take the best of both methods above (SGD + Batch) and get a more optimal outcome.

In [38]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer
optimizer = optim.SGD(model.parameters(), lr=0.001)

Below the batch_size will be 64 a multiple of 2n

In [41]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 100 == 0: ## print after every 100 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 100: | Train Loss: 0.3519, | Test Loss: 0.4066
Epoch 200: | Train Loss: 0.3514, | Test Loss: 0.4064
Epoch 300: | Train Loss: 0.3509, | Test Loss: 0.4062
Epoch 400: | Train Loss: 0.3505, | Test Loss: 0.4059
Epoch 500: | Train Loss: 0.3502, | Test Loss: 0.4057


Summary:
* Collectively this model was actually not as good as SGD alone, however it was much better than Batch gradient descent alone.

# Common Optimization Techniques
* Here are some that we will look at:
1. GD with momentum
  * slightly modified version of SGD.
  * Can prevent the SGD from stopping at the local minima and instead go to the global minima.
  * This is most useful for situations where we have both local and global minima.
  * **Faster Approach**: moving average of gradients
    * Accelerated SGD
    * Dampens turbulence
  * Uses Exponential Average
  * Beta usually kept at 0.9 or 90% weight of previous gradients and 10% weight to current gradient.
    * Equivalent to taking average of last 10 gradients.
2. Nesterov momentum
  * looks ahead at gradient of future steps.
3. AdaGrad
  * Adaptive Gradient Descent uses different learning rates for each iteration.
  * Concept
    * params with infrequent updates --> BIGGER updates to weights
    * params with frequent updates --> SMALLER updates to weights
  * **Very useful for sparse or unstructured datasets.**
  * **Problem**: may reduce learning rates aggressively!!
    * similar to using a very low learning rate.
  * To overcome this problem we can use **Learning Rate Decay**
    * used to **reduce the learning rate over time.**
    * It can be fixed or scheduled or dynamically adjusted.
  * **Default learning rate decay in PyTorch is 0.1**
  * **Higher decay rate --> lower new learning rate**
4. RMSProp
  * Useful for all kinds of datasets.
  * "Root Mean Squared Propagation" accelerates the optimization process by reducing the number of updates needed to reach the minima.
  * Example:
    * Build a classification model to classify variety of fishes.
    * If the PRIMARY FACTOR is COLOR.
    * RMSPROP penalizes the parameter "Color" to rely on other features.
      * Prevents the model from adapting too quickly to changes in the parameter "Color".
5. **Adam**
  * Adaptive Moment Estimation or Adam.
  * Combination of RMSProp and Momentum.
  * **MOST WIDELY USED OPTIMIZATION TECHNIQUE IN DEEP LEARNING**
  * There are 2 moving averages in Adam.
    * 1. first moment (mean) estimate
    * 2. second moment (uncentered variance) estimate
    * 3. Updated rule for Adam optimizer

## Optimization using Gradient Descent with Momentum
* The code difference is adding **momentum** to the optimizer below.
* As we mentioned we want it set to 0.9 or 90% to look back at the last 10.

In [44]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer --> momentum set at 0.9 now
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

Training Loop

In [45]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 100 == 0: ## print after every 100 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 100: | Train Loss: 0.3464, | Test Loss: 0.4028
Epoch 200: | Train Loss: 0.3451, | Test Loss: 0.4023
Epoch 300: | Train Loss: 0.3442, | Test Loss: 0.4022
Epoch 400: | Train Loss: 0.3434, | Test Loss: 0.4019
Epoch 500: | Train Loss: 0.3428, | Test Loss: 0.4017


Summary:
* We can see the loss is consistently lower now.

## Optimization using Nesterov Momentum
* Here we add **nesterov=True** to the optimizer

In [47]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer --> momentum set at 0.9 now
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9,
                      nesterov=True)

Training Loop

In [48]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 100 == 0: ## print after every 100 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 100: | Train Loss: 0.3476, | Test Loss: 0.4032
Epoch 200: | Train Loss: 0.3460, | Test Loss: 0.4027
Epoch 300: | Train Loss: 0.3449, | Test Loss: 0.4018
Epoch 400: | Train Loss: 0.3441, | Test Loss: 0.4016
Epoch 500: | Train Loss: 0.3434, | Test Loss: 0.4014


Summary:
* Very similar output to GD with momentum but not exactly the same, yet still consistently lower than standard gradient descent techniques.
* Lets look at some other optimization techniques.

## Optimization using AdaGrad
* params with infrequent updates --> BIGGER updates to weights
* params with frequent updates --> SMALLER updates to weights

In [49]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer --> AdaGrad
optimizer = optim.Adagrad(model.parameters())

Training Loop
* Batch size 64

In [50]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 100 == 0: ## print after every 100 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 100: | Train Loss: 2.2841, | Test Loss: 2.4530
Epoch 200: | Train Loss: 1.2514, | Test Loss: 1.3775
Epoch 300: | Train Loss: 0.7755, | Test Loss: 0.8731
Epoch 400: | Train Loss: 0.5594, | Test Loss: 0.6395
Epoch 500: | Train Loss: 0.4588, | Test Loss: 0.5277


Summary:
* We can see the training loss is getting lower!

## Optimization using RMSProp

In [51]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer --> RMSprop
optimizer = optim.RMSprop(model.parameters())

Training Loop

In [52]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 50 == 0: ## print after every 50 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 50: | Train Loss: 0.3414, | Test Loss: 0.4021
Epoch 100: | Train Loss: 0.3402, | Test Loss: 0.3985
Epoch 150: | Train Loss: 0.3386, | Test Loss: 0.3990
Epoch 200: | Train Loss: 0.3380, | Test Loss: 0.3990
Epoch 250: | Train Loss: 0.3379, | Test Loss: 0.3984
Epoch 300: | Train Loss: 0.3379, | Test Loss: 0.4016
Epoch 350: | Train Loss: 0.3370, | Test Loss: 0.3992
Epoch 400: | Train Loss: 0.3395, | Test Loss: 0.3982
Epoch 450: | Train Loss: 0.3380, | Test Loss: 0.3976
Epoch 500: | Train Loss: 0.3379, | Test Loss: 0.4030


Summary:
* We have achieved our lowest loss so far at 0.3379 and 0.4030.

## Optimization using Adam

In [53]:
## linear regression model
model = nn.Sequential(
    nn.Linear(2,2),
    nn.Sigmoid(),
    nn.Linear(2,1)
)
## optimizer --> Adam
optimizer = optim.Adam(model.parameters())

Training Loop

In [54]:
## load data
train_loader = DataLoader(train_data,
                          batch_size=64, ## 800 samples in train set
                          shuffle=True) ## shuffle the data!

## Training loop for 500 epochs -- more effective training on 800 samples
for epoch in range(500):
  for X_batch, y_batch in train_loader:
    # 1. forward pass --> X_batch to model --> get pred
    pred = model(X_batch)
    # 2. Loss calculate
    loss = criterion(pred, y_batch)

    # backward pass and optimization
    # 3. optimizer zero grad - zero out gradients
    optimizer.zero_grad()
    # 4. loss backwards -- back propagation
    loss.backward()
    # 5. optimizer step -- add updated weights to modmel
    optimizer.step()

## apply loss function -- (criterion - MSELoss) -- get train and test loss
  if (epoch+1) % 50 == 0: ## print after every 50 epochs
    train_loss = criterion(model(X_train_tensor), y_train_tensor).item()
  # print(epoch,': ',train_loss)
    test_loss = criterion(model(X_test_tensor), y_test_tensor).item()
    print(f'Epoch {epoch+1}: | Train Loss: {train_loss:.4f}, | Test Loss: {test_loss:.4f}')

Epoch 50: | Train Loss: 2.2722, | Test Loss: 2.4410
Epoch 100: | Train Loss: 0.5070, | Test Loss: 0.5901
Epoch 150: | Train Loss: 0.3666, | Test Loss: 0.4221
Epoch 200: | Train Loss: 0.3616, | Test Loss: 0.4137
Epoch 250: | Train Loss: 0.3592, | Test Loss: 0.4117
Epoch 300: | Train Loss: 0.3565, | Test Loss: 0.4096
Epoch 350: | Train Loss: 0.3536, | Test Loss: 0.4072
Epoch 400: | Train Loss: 0.3507, | Test Loss: 0.4059
Epoch 450: | Train Loss: 0.3483, | Test Loss: 0.4046
Epoch 500: | Train Loss: 0.3462, | Test Loss: 0.4030


Summary:
* We can see that RMS prop was the best optimizer here but that Adam was a close second and Adam was able to reduce the loss immediately after the first 50 epochs.