This notebook aims to give a very brief introduction to deep learning as implemented in PyTorch. Note that knowledge of PyTorch is *not* required for CS 229 (although you may use it for your final project); its use here is to demonstrate how to work with a popular modern deep learning library.

First, run the cell below to import PyTorch and load the data we'll be using in this tutorial.

In [None]:
import torch
import torch.nn as nn

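# the rest of the notebook places data and the model on the GPU
assert torch.cuda.is_available(), 'CUDA GPU not found; this notebook expects a GPU runtime'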
device = torch.device('cuda')

import pandas as pd
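def load_housing_data(path, y_key='median_house_value'):
  # split the table into input features and the target column named by y_key
  df = pd.read_csv(path)
  X = torch.from_numpy(df.drop(columns=y_key).to_numpy()).float().to(device)
  y = torch.from_numpy(df[y_key].to_numpy()).float().to(device)
  return X, y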
X_train, y_train = load_housing_data('sample_data/california_housing_train.csv')
X_test, y_test = load_housing_data('sample_data/california_housing_test.csv')

# normalize inputs
x_mean, x_std = X_train.mean(0), X_train.std(0)
X_train = (X_train - x_mean) / x_std
X_test = (X_test - x_mean) / x_std
# rescale outputs
y_train = y_train / 1000
y_test = y_test / 1000
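
As a quick sanity check of the normalization step above, the sketch below standardizes a small synthetic feature matrix and confirms that each column ends up with mean close to 0 and standard deviation close to 1. The data and the `*_demo_*` names are made up for illustration and are not the housing table; as in the cell above, the test split reuses the training statistics.

```python
import torch

torch.manual_seed(0)
# Synthetic stand-in for a feature matrix: three columns with very different scales.
scale = torch.tensor([1.0, 50.0, 1000.0])
shift = torch.tensor([0.0, 10.0, -500.0])
X_demo_train = torch.randn(1000, 3) * scale + shift
X_demo_test = torch.randn(200, 3) * scale + shift

# Statistics are computed on the training split only.
demo_mean, demo_std = X_demo_train.mean(0), X_demo_train.std(0)
X_demo_train_norm = (X_demo_train - demo_mean) / demo_std
X_demo_test_norm = (X_demo_test - demo_mean) / demo_std

print(X_demo_train_norm.mean(0))  # each entry close to 0
print(X_demo_train_norm.std(0))   # each entry close to 1
```

The normalized test split will generally not have mean exactly 0 or standard deviation exactly 1, because its statistics differ slightly from those of the training split.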

The basic objects in PyTorch are Tensors, which are multidimensional arrays similar to NumPy arrays. In fact, many PyTorch function calls and indexing operations work the same way as in NumPy.

In [None]:
torch.tensor(0.).shape, torch.tensor([0., 1.]).shape, torch.tensor([[0., 1.], [2., 3.], [4., 5.]]).shape

(torch.Size([]), torch.Size([2]), torch.Size([3, 2]))

In [None]:
x = torch.arange(1, 11)
print(x)
x[2:5] = 0
print(x)
print(2*x+1)

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
tensor([ 1,  2,  0,  0,  0,  6,  7,  8,  9, 10])
tensor([ 3,  5,  1,  1,  1, 13, 15, 17, 19, 21])
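
To make the NumPy comparison concrete, here is a small side-by-side sketch. The array names are arbitrary, and everything runs on the CPU. It shows slicing, a reduction along one axis, a boolean mask, and conversion between the two libraries.

```python
import numpy as np
import torch

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a_pt = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Slicing and boolean masks look the same in both libraries.
print(a_np[:, 1], a_pt[:, 1])
print(a_np[a_np > 2], a_pt[a_pt > 2])

# Reductions take an axis; PyTorch calls the argument `dim`.
print(a_np.sum(axis=0), a_pt.sum(dim=0))

# Converting between the two. For CPU data, torch.from_numpy shares memory
# with the source array, so a write to the array is visible in the tensor.
b_pt = torch.from_numpy(a_np)
a_np[0, 0] = 100.0
print(b_pt[0, 0])
back_np = a_pt.numpy()
```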


One difference from NumPy arrays is that tensors have an attribute, `requires_grad`, that tells PyTorch to record the operations applied to the tensor so that gradients can be computed later. By default, this flag is `False` for ordinary tensors that you construct. A special kind of Tensor, called a Parameter, has `requires_grad=True` by default. Also, if you compute a function of a Tensor that requires grad, the resulting tensor will also require grad.

In [None]:
theta = nn.Parameter(torch.rand(5))
b = nn.Parameter(torch.rand(1))
x = torch.rand(5)
out = torch.inner(theta, x) + b
print(theta.requires_grad, b.requires_grad, x.requires_grad, out.requires_grad)
print(out)

True True False True
tensor([1.8037], grad_fn=<AddBackward0>)
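
The recorded operations are what make backpropagation possible. As a minimal sketch, the example below rebuilds the same expression with a fixed seed (so the numbers differ from the output above), calls `.backward()`, and reads the gradients from the `.grad` attribute. Since the output is `theta · x + b`, the gradient with respect to `theta` should equal `x` and the gradient with respect to `b` should equal 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
theta = nn.Parameter(torch.rand(5))
b = nn.Parameter(torch.rand(1))
x = torch.rand(5)

out = torch.inner(theta, x) + b
out.sum().backward()  # reduce to a scalar, then backpropagate

print(theta.grad)  # same values as x
print(b.grad)      # tensor([1.])
print(x.grad)      # None, because x does not require grad
```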


The code below constructs a simple multi-layer perceptron (MLP):

In [None]:
input_dim = X_train.shape[1]
hidden_units = 256
model = nn.Sequential(
    nn.Linear(input_dim, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, 1)  # scalar output
)
model.to(device)  # move to GPU

Sequential(
  (0): Linear(in_features=8, out_features=256, bias=True)
  (1): ReLU()
  (2): Linear(in_features=256, out_features=256, bias=True)
  (3): ReLU()
  (4): Linear(in_features=256, out_features=256, bias=True)
  (5): ReLU()
  (6): Linear(in_features=256, out_features=1, bias=True)
)
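
One way to sanity-check a model like this is to count its parameters and push a dummy batch through it. The sketch below rebuilds the same architecture on the CPU, assuming `input_dim = 8` as in the printed summary above; the dummy batch is random data used only to check shapes.

```python
import torch
import torch.nn as nn

input_dim = 8       # assumption: matches in_features=8 in the printed model
hidden_units = 256
demo_model = nn.Sequential(
    nn.Linear(input_dim, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, hidden_units),
    nn.ReLU(),
    nn.Linear(hidden_units, 1),
)

# Each Linear layer holds in_features * out_features weights plus out_features biases.
num_params = sum(p.numel() for p in demo_model.parameters())
print(num_params)  # 134145

# A batch of 4 examples maps to 4 scalar predictions, with shape [4, 1].
dummy_batch = torch.randn(4, input_dim)
print(demo_model(dummy_batch).shape)  # torch.Size([4, 1])
```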

With the model created, we now construct an optimizer object that will handle updating the model's parameters:

In [None]:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
optimizer = torch.optim.Adam(model.parameters())
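
To give a sense of what an optimizer does, the sketch below compares one step of `torch.optim.SGD` (the commented-out alternative above) with a hand-written update `p <- p - lr * grad` on a copy of the same tiny layer. The layer, data, and learning rate are made up for illustration. Adam follows the same interface but additionally keeps running estimates of the gradient's first and second moments, so its update is not reproduced here.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
lr = 0.1
layer_a = nn.Linear(3, 1)
layer_b = copy.deepcopy(layer_a)  # identical starting parameters

x = torch.randn(16, 3)
y = torch.randn(16)

# Library update on layer_a.
opt = torch.optim.SGD(layer_a.parameters(), lr=lr)
loss_a = nn.functional.mse_loss(layer_a(x).squeeze(-1), y)
opt.zero_grad()
loss_a.backward()
opt.step()

# Hand-written update on layer_b.
loss_b = nn.functional.mse_loss(layer_b(x).squeeze(-1), y)
loss_b.backward()
with torch.no_grad():  # the update itself should not be tracked by autograd
    for p in layer_b.parameters():
        p -= lr * p.grad

for p_a, p_b in zip(layer_a.parameters(), layer_b.parameters()):
    print(torch.allclose(p_a, p_b))  # True for both weight and bias
```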

Now we are ready to train the model! The main loop involves the following steps:
* Sample a batch of data
* Compute the loss on the batch of data
* Call the optimizer's `.zero_grad()` method to clear any existing gradient information stored in the parameters.
* Call the loss's `.backward()` method to backpropagate gradients to the parameters
* Call the optimizer's `.step()` method to update the parameters using the new gradient information
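
The `.zero_grad()` step matters because `.backward()` adds to any gradient already stored on a parameter rather than overwriting it. The following minimal sketch, using a made-up two-element parameter `w`, shows the accumulation and the effect of clearing it.

```python
import torch
import torch.nn as nn

w = nn.Parameter(torch.tensor([1.0, 2.0]))
x = torch.tensor([3.0, 4.0])
demo_optimizer = torch.optim.SGD([w], lr=0.1)

torch.inner(w, x).backward()
first = w.grad.clone()        # gradient of w . x with respect to w is x

torch.inner(w, x).backward()  # no zero_grad() in between
accumulated = w.grad.clone()  # the second gradient was added to the first

demo_optimizer.zero_grad()    # clear the stored gradient
torch.inner(w, x).backward()
after_reset = w.grad.clone()

print(first, accumulated, after_reset)
# tensor([3., 4.]) tensor([6., 8.]) tensor([3., 4.])
```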



In [None]:
n = X_train.shape[0]
num_steps = 50000
batch_size = 256
loss_ema = None
for step_index in range(num_steps):
  batch_indices = torch.randint(high=n, size=[batch_size])
  X_batch = X_train[batch_indices]
  y_batch = y_train[batch_indices]
  predictions = model(X_batch).squeeze(-1)  # squeeze reshapes [B, 1] -> [B]
  loss = nn.functional.mse_loss(predictions, y_batch)
  loss_ema = loss.item() if loss_ema is None else (0.1 * loss.item() + 0.9 * loss_ema)
  if step_index % 1000 == 0:
    print(f'Loss EMA at step {step_index}: {loss_ema}')

  optimizer.zero_grad()
  loss.backward()   # backpropagate
  optimizer.step()  # update

Loss EMA at step 0: 56162.14453125
Loss EMA at step 1000: 3297.1064133197115
Loss EMA at step 2000: 2813.036880667342
Loss EMA at step 3000: 2948.2840153701827
Loss EMA at step 4000: 2493.5022401937804
Loss EMA at step 5000: 2527.959669862538
Loss EMA at step 6000: 2393.5583275584286
Loss EMA at step 7000: 2383.5043021309884
Loss EMA at step 8000: 2197.845718176544
Loss EMA at step 9000: 2245.4929384119832
Loss EMA at step 10000: 2401.8460390299865
Loss EMA at step 11000: 2124.3930005829197
Loss EMA at step 12000: 2082.2648568245804
Loss EMA at step 13000: 2109.5585373673043
Loss EMA at step 14000: 2062.9334114769695
