# COMP4211 Tutorial 5: PyTorch (feedforward neural networks)

<p style="text-align: left;">
TA: James<br />
thcheungae@cse.ust.hk<br />
March 23, 2019 (Mon)<br />
</p>
<https://hkust.zoom.us/j/570617804>

PyTorch
==========

PyTorch is a deep learning research platform. It is significantly more complicated than sklearn but provides maximum flexibility and speed.

Neural Network
==========

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``
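
As a minimal sketch of that update rule, assuming a hypothetical one-parameter problem (minimising ``(w - 3)^2``, which is not part of this tutorial's network), the loop below applies the rule by hand with autograd:

```python
import torch

# Hypothetical toy problem: find the w that minimises (w - 3)^2.
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for _ in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()                    # stores d(loss)/dw in w.grad
    with torch.no_grad():              # the update itself must not be tracked
        w -= learning_rate * w.grad    # weight = weight - learning_rate * gradient
    w.grad.zero_()                     # clear the gradient before the next step

final_w = w.item()
print(final_w)  # very close to 3.0
```

The optimizers in ``torch.optim`` wrap this same pattern, so in practice the update is rarely written by hand.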
  
## Define a Network in PyTorch

In [4]:
# install using conda:
# conda install pytorch torchvision -c pytorch

# or using pip:
# pip install torch torchvision

# or other installation, visit https://pytorch.org/

import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_out = 4, 2


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(d_in, 8)
        self.fc2 = nn.Linear(8, 4)
        self.fc3 = nn.Linear(4, d_out)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        y = F.relu(self.fc2(x))
        z = self.fc3(y)
        self.fc1_out = x # for illustration
        self.fc2_out = y
        self.fc3_out = z
        return z


net = Net()
print(net)

print('Number of learnable parameters:')
print(sum(p.numel() for p in net.parameters() if p.requires_grad))

Net(
  (fc1): Linear(in_features=4, out_features=8, bias=True)
  (fc2): Linear(in_features=8, out_features=4, bias=True)
  (fc3): Linear(in_features=4, out_features=2, bias=True)
)
Number of learnable parameters:
86
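
The count of 86 can be checked by hand: each ``nn.Linear`` layer holds a weight matrix with one entry per (input, output) pair plus one bias per output. A short sketch of that arithmetic, using the layer sizes from ``Net`` above:

```python
# (inputs, outputs) of fc1, fc2 and fc3 as defined in Net above
layer_shapes = [(4, 8), (8, 4), (4, 2)]

# weights (inputs * outputs) plus biases (outputs) for each layer
per_layer = [n_in * n_out + n_out for n_in, n_out in layer_shapes]
total = sum(per_layer)

print(per_layer)  # [40, 36, 10]
print(total)      # 86
```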


<div class="alert alert-info"><p>

**Alternative**

A faster way to build a simple neural network like the one above is ``nn.Sequential``:

<pre><code>
    net = nn.Sequential(
        nn.Linear(d_in, 8),
        nn.ReLU(),
        nn.Linear(8, 4),
        nn.ReLU(),
        nn.Linear(4, d_out)
    )
</code></pre>
</p></div>

Let's try a random input. Note that PyTorch operates on **Tensors**.

In [5]:
input = torch.randn(1, d_in)
out = net(input)
print(out)

tensor([[-0.3078, -0.0892]], grad_fn=<AddmmBackward>)
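
The first dimension of the input Tensor is the batch dimension, so the same network can process several samples at once. A small sketch, assuming a stand-in ``nn.Sequential`` network with the same input and output sizes as ``Net`` (its weights are random and unrelated to the run above):

```python
import torch
import torch.nn as nn

d_in, d_out = 4, 2

# stand-in network with the same input/output sizes as Net
net = nn.Sequential(nn.Linear(d_in, 8), nn.ReLU(), nn.Linear(8, d_out))

batch = torch.randn(5, d_in)   # 5 samples, d_in features each
out = net(batch)

print(batch.shape)  # torch.Size([5, 4])
print(out.shape)    # torch.Size([5, 2])
```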


In [6]:
output = net(input)
target = torch.tensor([[1.0, 0.0]])  # a dummy target, for example

criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.8592, grad_fn=<MseLossBackward>)
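
With its default settings, ``nn.MSELoss`` averages the squared differences over all elements. The sketch below checks this against a manual computation, using made-up output values rather than the ones printed above:

```python
import torch
import torch.nn as nn

# made-up values, chosen so the result is easy to verify by hand
output = torch.tensor([[0.5, -0.5]])
target = torch.tensor([[1.0, 0.0]])

builtin = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # (0.25 + 0.25) / 2

print(builtin.item(), manual.item())  # 0.25 0.25
```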


![Image](./img/computation_graph.png)

When we call ``loss.backward()``, the loss is differentiated w.r.t. every Tensor in the graph,
and all leaf Tensors that have ``requires_grad=True`` (such as the network's parameters)
will have their ``.grad`` Tensor accumulated with the gradient. If you follow ``loss`` in the backward direction, using its ``.grad_fn`` attribute, you will see a graph of computations that looks like this:

input -> linear -> relu (fc1_out) -> linear -> relu (fc2_out) -> linear (fc3_out) -> MSELoss (loss)
      
For illustration, let us follow a few steps backward:

In [7]:
for i, name in zip([loss, net.fc3_out, net.fc2_out, net.fc1_out], ['loss', 'fc3', 'fc2', 'fc1']):
    print(f"\n{name}\nrequires_grad: {i.requires_grad}\ngrad_fn: {i.grad_fn}")


loss
requires_grad: True
grad_fn: <MseLossBackward object at 0x104c78828>

fc3
requires_grad: True
grad_fn: <AddmmBackward object at 0x104c785f8>

fc2
requires_grad: True
grad_fn: <ReluBackward0 object at 0x104c785f8>

fc1
requires_grad: True
grad_fn: <ReluBackward0 object at 0x104c785f8>
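
The graph can also be walked programmatically through ``grad_fn.next_functions``. The sketch below assumes a hypothetical single-layer model to keep the chain short; the exact node names (for example a trailing ``0``) vary between PyTorch versions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(4, 2)                      # hypothetical one-layer model
output = layer(torch.randn(1, 4))
loss = nn.MSELoss()(output, torch.tensor([[1.0, 0.0]]))

# Walk backward from the loss, always following the first input of each node.
names = []
fn = loss.grad_fn
while fn is not None:
    names.append(type(fn).__name__)
    fn = fn.next_functions[0][0] if fn.next_functions else None

print(" -> ".join(names))
```

Each node lists the functions that produced its inputs, so following the other entries of ``next_functions`` reaches the remaining branches of the graph.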


Backprop
--------
To backpropagate the error, all we have to do is call ``loss.backward()``.
You need to clear the existing gradients first, though; otherwise the new gradients
will be accumulated into the existing ones.


Now we shall call ``loss.backward()``, and have a look at fc2's weight
gradients before and after the backward pass.
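
To see why clearing matters, here is a small sketch with a hypothetical two-element parameter: calling ``backward()`` twice without zeroing doubles the stored gradient.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()         # gradient of sum(w^2) is 2w -> [2., 4.]

(w ** 2).sum().backward()      # no zeroing in between
accumulated = w.grad.clone()   # [4., 8.]: the new gradient was added to the old one

w.grad.zero_()                 # clear the buffer, then recompute
(w ** 2).sum().backward()
after_reset = w.grad.clone()   # back to [2., 4.]

print(first, accumulated, after_reset)
```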



In [8]:
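net.zero_grad()
out.backward(torch.randn(1, d_out)) # illustration purpose: populate the .grad buffers

# zeroes the gradient buffers of all parameters
# (PyTorch >= 2.0 resets them to None by default, so the first print below shows None)
net.zero_grad()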

print('fc2.weight.grad before backward')
print(net.fc2.weight.grad)

loss.backward() # propagate loss

print('fc2.weight.grad after backward')
print(net.fc2.weight.grad)

fc2.weight.grad before backward
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.]])
fc2.weight.grad after backward
tensor([[0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.2226, 0.0018, 0.2093, 0.0000, 0.1040, 0.0479, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
        [0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]])
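
One likely reason most rows of ``fc2.weight.grad`` are zero is that ReLU passes no gradient to units whose pre-activation is negative, so the weights feeding those units receive none either. The sketch below illustrates this on a hypothetical stand-alone layer whose biases are set by hand to force two units off:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc = nn.Linear(3, 4)
with torch.no_grad():
    # large negative biases switch units 0 and 2 off; units 1 and 3 stay on
    fc.bias.copy_(torch.tensor([-10.0, 10.0, -10.0, 10.0]))

x = torch.rand(1, 3)
pre_activation = fc(x)
F.relu(pre_activation).sum().backward()

inactive = (pre_activation <= 0).squeeze(0)
print(inactive)        # tensor([ True, False,  True, False])
print(fc.weight.grad)  # rows 0 and 2 are all zero; rows 1 and 3 equal x
```

By the same argument, a zero column inside a non-zero row corresponds to an input that was itself zero, for example a switched-off unit in the previous layer.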


Update the weights
------------------

In [9]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.1)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward() # propagate loss

print(net.fc2.weight)
optimizer.step()    # does the update w = w - eta * gradient
print(net.fc2.weight)

Parameter containing:
tensor([[ 0.2594,  0.0925, -0.0070, -0.0130, -0.2776,  0.0488, -0.1821,  0.1503],
        [-0.0967, -0.0249, -0.1245, -0.2778,  0.2521,  0.3157, -0.0173,  0.1020],
        [ 0.1447, -0.0987,  0.2112,  0.0083, -0.2679, -0.0283, -0.0457,  0.0715],
        [-0.3447,  0.0540,  0.0016,  0.1446, -0.0068, -0.2499, -0.0185, -0.2585]],
       requires_grad=True)
Parameter containing:
tensor([[ 0.2594,  0.0925, -0.0070, -0.0130, -0.2776,  0.0488, -0.1821,  0.1503],
        [-0.1189, -0.0251, -0.1454, -0.2778,  0.2417,  0.3109, -0.0173,  0.1020],
        [ 0.1447, -0.0987,  0.2112,  0.0083, -0.2679, -0.0283, -0.0457,  0.0715],
        [-0.3447,  0.0540,  0.0016,  0.1446, -0.0068, -0.2499, -0.0185, -0.2585]],
       requires_grad=True)
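
Plain SGD (without momentum or weight decay) performs exactly the update ``w = w - lr * gradient``. A small check on a hypothetical stand-alone layer, separate from the ``net`` used above:

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
layer = nn.Linear(3, 2)                       # hypothetical stand-alone layer
optimizer = optim.SGD(layer.parameters(), lr=0.1)

x = torch.randn(5, 3)
loss = layer(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()

before = layer.weight.detach().clone()
expected = before - 0.1 * layer.weight.grad   # manual update with the same lr
optimizer.step()

matches = torch.allclose(layer.weight, expected)
print(matches)  # True
```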


## Putting everything together

In [11]:
# Create random Tensors to hold inputs and outputs
x = torch.randn(100, d_in)
y = torch.randn(100, d_out)

# Construct our model by instantiating the class defined above
model = Net()
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(20):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(f'Epoch {t}: loss: {loss.item()}')

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Epoch 0: loss: 216.07546997070312
Epoch 1: loss: 215.9506072998047
Epoch 2: loss: 215.8314666748047
Epoch 3: loss: 215.71768188476562
Epoch 4: loss: 215.60934448242188
Epoch 5: loss: 215.50621032714844
Epoch 6: loss: 215.40774536132812
Epoch 7: loss: 215.3138885498047
Epoch 8: loss: 215.22413635253906
Epoch 9: loss: 215.13836669921875
Epoch 10: loss: 215.05642700195312
Epoch 11: loss: 214.97811889648438
Epoch 12: loss: 214.90362548828125
Epoch 13: loss: 214.8328399658203
Epoch 14: loss: 214.76515197753906
Epoch 15: loss: 214.70040893554688
Epoch 16: loss: 214.63845825195312
Epoch 17: loss: 214.57925415039062
Epoch 18: loss: 214.5225372314453
Epoch 19: loss: 214.4682159423828
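
The loss values are large partly because ``reduction='sum'`` adds the squared errors over all 100 x 2 outputs instead of averaging them. A short sketch of the relationship between the two reductions, on random data of the same shape:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
y_pred = torch.randn(100, 2)
y = torch.randn(100, 2)

sum_loss = nn.MSELoss(reduction='sum')(y_pred, y)
mean_loss = nn.MSELoss(reduction='mean')(y_pred, y)   # the default reduction

# the summed loss is the mean loss times the number of elements (200 here)
print(torch.allclose(sum_loss, mean_loss * y.numel()))  # True
```

Because the gradients scale by the same factor, a learning rate that suits one reduction usually needs adjusting for the other.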


## Dataset Class

Datasets are not always given in .csv format. You have to build your own PyTorch ``utils.data.Dataset`` and then feed the data to the model using ``utils.data.DataLoader``.
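
A map-style ``Dataset`` only needs ``__len__`` and ``__getitem__``. As a minimal sketch, the hypothetical ``SquaresDataset`` below (not part of this tutorial's data) serves pairs ``(i, i^2)`` from memory:

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i^2)."""

    def __init__(self, n):
        self.x = torch.arange(n, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        return len(self.x)

    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]


toy = SquaresDataset(5)
feature, label = toy[3]

print(len(toy))                      # 5
print(feature.item(), label.item())  # 3.0 9.0
```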

In [15]:
import os
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder

In [16]:
'''
You may download the dataset from:
https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
'''

data_path = './HR.csv'

df = pd.read_csv(data_path)
df.head()

|   | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | 41 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | No | 49 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | Yes | 37 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | No | 33 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | No | 27 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [17]:
class HrDataset(Dataset):

    def __init__(self, file_path):
        self.file_path = file_path
        self.classes = [0, 1]
        df = pd.read_csv(file_path)

        # Some preprocessing
        for col in df.columns:
            if df.dtypes[col] == "object":
                df[col] = df[col].fillna("NA")
                df[col] = df[col].astype('category')
                if len(df[col].cat.categories) > 2:
                    df = pd.get_dummies(df, columns=[col])
                else:
                    df[col] = LabelEncoder().fit_transform(df[col])
            else:
                df[col] = df[col].fillna(0)

        self.df = df

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        X = np.array(self.df.iloc[idx, 1:]).astype(np.float32)
        y = self.df.iloc[idx, 0]

        return X, y

    def __len__(self):
        return len(self.df)
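
The preprocessing in ``__init__`` turns two-category columns into a single 0/1 column and columns with more categories into one indicator column each. The sketch below shows both steps on a hypothetical three-row frame; it uses pandas category codes in place of ``LabelEncoder``, on the assumption that both assign codes in sorted label order:

```python
import pandas as pd

# hypothetical three-row frame with the same kinds of columns as HR.csv
df = pd.DataFrame({
    'Attrition': ['Yes', 'No', 'Yes'],
    'Age': [41, 49, 37],
    'Department': ['Sales', 'R&D', 'HR'],
})

# two categories -> one 0/1 column (sorted order: 'No' -> 0, 'Yes' -> 1)
df['Attrition'] = df['Attrition'].astype('category').cat.codes

# more than two categories -> one indicator column per category
df = pd.get_dummies(df, columns=['Department'])

columns = list(df.columns)
labels = df['Attrition'].tolist()
print(columns)  # ['Attrition', 'Age', 'Department_HR', 'Department_R&D', 'Department_Sales']
print(labels)   # [1, 0, 1]
```

The indicator columns are appended at the end, which is why the label stays in column 0 and ``__getitem__`` can slice the features with ``iloc[idx, 1:]``.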

In [18]:
data = HrDataset(file_path=data_path)

classes = data.classes
num_classes = len(classes)

for i in range(len(data)):
    features, label = data[i]
    print(f'Data {i}, Shape: {features.shape}, Label: {label}')
    print(features)
    if i == 1:
        break

Data 0, Shape: (53,), Label: 1
[4.1000e+01 1.1020e+03 1.0000e+00 2.0000e+00 1.0000e+00 1.0000e+00
 2.0000e+00 0.0000e+00 9.4000e+01 3.0000e+00 2.0000e+00 4.0000e+00
 5.9930e+03 1.9479e+04 8.0000e+00 0.0000e+00 1.0000e+00 1.1000e+01
 3.0000e+00 1.0000e+00 8.0000e+01 0.0000e+00 8.0000e+00 0.0000e+00
 1.0000e+00 6.0000e+00 4.0000e+00 0.0000e+00 5.0000e+00 0.0000e+00
 0.0000e+00 1.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00 0.0000e+00
 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00
 1.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 1.0000e+00]
Data 1, Shape: (53,), Label: 0
[4.9000e+01 2.7900e+02 8.0000e+00 1.0000e+00 1.0000e+00 2.0000e+00
 3.0000e+00 1.0000e+00 6.1000e+01 2.0000e+00 2.0000e+00 2.0000e+00
 5.1300e+03 2.4907e+04 1.0000e+00 0.0000e+00 0.0000e+00 2.3000e+01
 4.0000e+00 4.0000e+00 8.0000e+01 1.0000e+00 1.0000e+01 3.0000e+00
 3.0000e+00 1.0000e+01 7.0000e+00 1.0000e+00 7.0000e+00 0.0000e+00
 1.0000e+0

In [19]:
from torch.utils.data import random_split

full_data = HrDataset(file_path=data_path)

train_size = int(0.8 * len(full_data))
test_size = len(full_data) - train_size

full_train, test = random_split(full_data, [train_size, test_size])

train_size = int(0.8 * len(full_train))
valid_size = len(full_train) - train_size

train, valid = random_split(full_train, [train_size, valid_size])

print(f'Full:\t\t{len(full_data)}')
print(f'|Train:\t\t{len(train)}')
print(f'|Validation:\t{len(valid)}')
print(f'|Test:\t\t{len(test)}')

Full:		1470
|Train:		940
|Validation:	236
|Test:		294
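
``random_split`` draws a new permutation on every run, so the three subsets change between runs. If reproducible splits are wanted, one option is to pass a seeded ``generator``; the sketch below assumes a hypothetical ten-element dataset and an arbitrary seed of 42:

```python
import torch
from torch.utils.data import TensorDataset, random_split

dataset = TensorDataset(torch.arange(10))   # hypothetical ten-element dataset


def split(seed):
    # a dedicated generator keeps the split independent of the global RNG state
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [8, 2], generator=generator)


train_a, test_a = split(42)
train_b, test_b = split(42)

print(len(train_a), len(test_a))           # 8 2
print(train_a.indices == train_b.indices)  # True: same seed, same split
```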


## Dataloader Class

However, we are losing a lot of features by using a simple for loop to iterate over the data. In particular, we are missing **batching** the data, **shuffling** the data, and **loading** the data in parallel using multiprocessing workers.

``torch.utils.data.DataLoader`` is an iterable which provides all these features. Parameters used below should be clear. One parameter of interest is ``collate_fn``, which lets you specify exactly how the samples are combined into a batch. However, the default collate function should work fine for most use cases.
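
As an illustration of when a custom ``collate_fn`` is needed, the sketch below batches variable-length samples by zero-padding them. The samples and the ``pad_collate`` helper are hypothetical; the HR dataset has fixed-length rows and does not need this:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# hypothetical samples: (variable-length feature vector, label)
samples = [
    (torch.tensor([1.0, 2.0, 3.0]), 0),
    (torch.tensor([4.0]), 1),
    (torch.tensor([5.0, 6.0]), 0),
    (torch.tensor([7.0, 8.0, 9.0, 10.0]), 1),
]


def pad_collate(batch):
    features, labels = zip(*batch)
    # zero-pad every sequence to the longest one in this batch
    padded = pad_sequence(features, batch_first=True)
    return padded, torch.tensor(labels)


loader = DataLoader(samples, batch_size=2, collate_fn=pad_collate)
batches = list(loader)

print(batches[0][0])        # tensor([[1., 2., 3.], [4., 0., 0.]])
print(batches[1][0].shape)  # torch.Size([2, 4])
```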

In [22]:
batch_size = 32

train_loader = DataLoader(train, batch_size=batch_size,
                          shuffle=True, num_workers=1)

valid_loader = DataLoader(valid, batch_size=batch_size,
                          shuffle=True, num_workers=1)

test_loader = DataLoader(test, batch_size=1,
                         shuffle=True, num_workers=1)

dataiter = iter(train_loader)
inputs, labels = next(dataiter)

print(f'Batched features:\n {inputs}, \
      \n Batched labels:\n {labels}')

Batched features:
 tensor([[5.1000e+01, 4.3200e+02, 9.0000e+00,  ..., 0.0000e+00, 1.0000e+00,
         0.0000e+00],
        [2.8000e+01, 4.4000e+02, 2.1000e+01,  ..., 0.0000e+00, 1.0000e+00,
         0.0000e+00],
        [2.2000e+01, 5.9400e+02, 2.0000e+00,  ..., 0.0000e+00, 1.0000e+00,
         0.0000e+00],
        ...,
        [2.6000e+01, 4.2600e+02, 1.7000e+01,  ..., 1.0000e+00, 0.0000e+00,
         0.0000e+00],
        [3.0000e+01, 1.3120e+03, 2.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         1.0000e+00],
        [3.1000e+01, 6.6700e+02, 1.0000e+00,  ..., 0.0000e+00, 0.0000e+00,
         1.0000e+00]]),       
 Batched labels:
 tensor([0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 1, 0, 1])
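
With 940 training samples and a batch size of 32, the loader yields 30 batches per epoch and the last one is smaller than the rest. A quick check, using a dummy dataset of the same size as a stand-in for ``train``:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset

train_size, batch_size = 940, 32   # sizes taken from the split and loader above
dummy = TensorDataset(torch.zeros(train_size, 1))
loader = DataLoader(dummy, batch_size=batch_size)

num_batches = len(loader)
last_batch_size = train_size - (num_batches - 1) * batch_size

print(num_batches, math.ceil(train_size / batch_size))  # 30 30
print(last_batch_size)                                  # 12
```

Passing ``drop_last=True`` to ``DataLoader`` discards that final short batch if equal batch sizes are required.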


## Putting Everything together

In [20]:
import torch
import torch.nn as nn
import torch.nn.functional as F

# torchsummary is not available in conda install, 
# you need to install using pip
!pip install torchsummary
from torchsummary import summary

d_in, d_out = 53, 2


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(d_in, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, d_out)
        self.bn1 = nn.BatchNorm1d(d_in)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(64)
        self.drops = nn.Dropout(0.3)

    def forward(self, x):
        x = self.bn1(x)
        x = F.relu(self.fc1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.fc2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.fc3(x)
        return x


## gpu or cpu
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
net = Net().to(device)

summary(net, input_size=(53,))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
       BatchNorm1d-1                   [-1, 53]             106
            Linear-2                  [-1, 128]           6,912
           Dropout-3                  [-1, 128]               0
       BatchNorm1d-4                  [-1, 128]             256
            Linear-5                   [-1, 64]           8,256
           Dropout-6                   [-1, 64]               0
       BatchNorm1d-7                   [-1, 64]             128
            Linear-8                    [-1, 2]             130
Total params: 15,788
Trainable params: 15,788
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.06
Estimated Total Size (MB): 0.07
----------------------------------------------------------------
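
The network above uses ``nn.Dropout`` and ``nn.BatchNorm1d``, which behave differently in training and evaluation mode; this is why the later cells call ``net.train()`` and ``net.eval()``. A minimal sketch of the dropout half of that, on a stand-alone layer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.3)
x = torch.ones(1000)

drop.train()
y_train = drop(x)   # each element is either 0 or scaled up to 1 / (1 - 0.3)

drop.eval()
y_eval = drop(x)    # dropout is a no-op in evaluation mode

kept_value = torch.tensor(1 / 0.7)
only_two_values = bool(torch.all((y_train == 0) | torch.isclose(y_train, kept_value)))

print(only_two_values)          # True
print(torch.equal(y_eval, x))   # True
```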


## TensorBoard
TensorBoard is a powerful platform to visualize the training performance of your model.

In [23]:
from torch.utils.tensorboard import SummaryWriter

# default `log_dir` is "runs" - we'll be more specific here
writer = SummaryWriter('runs/hr_experiment')

# add graph (the example batch must be on the same device as the model)
writer.add_graph(net, inputs.to(device))
writer.close()

# run tensorboard --logdir='./runs'

## Model Training and Validation - the model.fit( )

In [25]:
running_loss = 0.0
global_step = 0
eval_every = 20
num_epochs = 10
lr = 0.001
total_step = len(train_loader)*num_epochs

net = Net().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=lr)

for epoch in range(num_epochs):  # loop over the dataset multiple times

    for i, (inputs, labels) in enumerate(train_loader):
        net.train()
        inputs = inputs.to(device)
        labels = labels.to(device)

        '''Training of the model'''
        # Forward pass
        outputs = net(inputs)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        global_step += 1

        running_loss += loss.item()

        '''Evaluating the model every x steps'''
        if global_step % eval_every == 0:
            with torch.no_grad():
                net.eval()
                val_running_loss = 0.0
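                for val_inputs, val_labels in valid_loader:
                    # move the validation batch to the model's device as well
                    val_inputs = val_inputs.to(device)
                    val_labels = val_labels.to(device)
                    val_outputs = net(val_inputs)
                    val_loss = criterion(val_outputs, val_labels)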
                    val_running_loss += val_loss.item()

                average_train_loss = running_loss / eval_every
                average_val_loss = val_running_loss / len(valid_loader)

                # ...log the running loss
                writer.add_scalar(
                    f'training loss {num_epochs}', average_train_loss, global_step)

                # ...log the running loss
                writer.add_scalar(
                    f'validation loss {num_epochs}', average_val_loss, global_step)

                print('Epoch [{}/{}], Step [{}/{}], Train Loss: {:.4f}, Valid Loss: {:.4f}'
                      .format(epoch+1, num_epochs, global_step, total_step, average_train_loss, average_val_loss))

                running_loss = 0.0

print('Finished Training')

Epoch [1/50], Step [20/1500], Train Loss: 0.6942, Valid Loss: 0.4005
Epoch [2/50], Step [40/1500], Train Loss: 0.6162, Valid Loss: 0.3702
Epoch [2/50], Step [60/1500], Train Loss: 0.5634, Valid Loss: 0.4647
Epoch [3/50], Step [80/1500], Train Loss: 0.4985, Valid Loss: 0.4567
Epoch [4/50], Step [100/1500], Train Loss: 0.4550, Valid Loss: 0.4200
Epoch [4/50], Step [120/1500], Train Loss: 0.4633, Valid Loss: 0.3843
Epoch [5/50], Step [140/1500], Train Loss: 0.4029, Valid Loss: 0.3580
Epoch [6/50], Step [160/1500], Train Loss: 0.3955, Valid Loss: 0.3401
Epoch [6/50], Step [180/1500], Train Loss: 0.3455, Valid Loss: 0.2992
Epoch [7/50], Step [200/1500], Train Loss: 0.3253, Valid Loss: 0.2655
Epoch [8/50], Step [220/1500], Train Loss: 0.3242, Valid Loss: 0.2861
Epoch [8/50], Step [240/1500], Train Loss: 0.3214, Valid Loss: 0.2564
Epoch [9/50], Step [260/1500], Train Loss: 0.3013, Valid Loss: 0.2645
Epoch [10/50], Step [280/1500], Train Loss: 0.3083, Valid Loss: 0.2882
Epoch [10/50], Step [30
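
The training loop uses ``nn.CrossEntropyLoss``, which expects raw scores (logits) of shape (batch, classes) and integer class labels, so the network needs no softmax layer of its own. A sketch of the equivalent manual computation, on made-up logits:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# made-up logits for two samples and two classes
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2]])
labels = torch.tensor([0, 1])

builtin = nn.CrossEntropyLoss()(logits, labels)

# log-softmax over classes, pick the log-probability of the true class, average
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(torch.isclose(builtin, manual))  # tensor(True)
```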

In [26]:
from sklearn.metrics import classification_report


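def eval(model, test_loader):  # note: this name shadows Python's built-in eval()
    y_test = []
    y_pred = []

    model.eval()  # switch BatchNorm/Dropout to evaluation behaviour once, up front
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs = inputs.to(device)  # keep the batch on the same device as the model
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)  # index of the largest score = predicted class
            y_pred.extend(predicted.cpu().tolist())  # works for any batch size
            y_test.extend(labels.tolist())

    print(classification_report(y_test, y_pred))


eval(net, test_loader)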

              precision    recall  f1-score   support

           0       0.90      0.93      0.91       245
           1       0.56      0.47      0.51        49

    accuracy                           0.85       294
   macro avg       0.73      0.70      0.71       294
weighted avg       0.84      0.85      0.84       294
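
The classes are imbalanced, so accuracy alone can flatter the model. As a rough check using only the support counts printed above, a hypothetical classifier that always predicts class 0 would already reach about 0.83 accuracy, which is why the recall and F1 score for class 1 are the more informative numbers here:

```python
# per-class test counts, taken from the "support" column of the report above
support = {0: 245, 1: 49}

total = sum(support.values())
majority_accuracy = max(support.values()) / total   # always predict the larger class

print(total)                        # 294
print(round(majority_accuracy, 2))  # 0.83
```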



## Checkpointing your model

In [27]:
def save_checkpoint(ckpt_dir, model, optimizer, epoch):
    save_path = f'{ckpt_dir}_{epoch}.pt'
    state_dict = {'model_state_dict': model.state_dict(),
                  'optimizer_state_dict': optimizer.state_dict(),
                  'epoch': epoch}

    torch.save(state_dict, save_path)

    print(f'Model saved to ==> {save_path}')
    return save_path


def load_checkpoint(save_path, model, optimizer):
    state_dict = torch.load(save_path)
    model.load_state_dict(state_dict['model_state_dict'])
    optimizer.load_state_dict(state_dict['optimizer_state_dict'])

    print(f'Model loaded from <== {save_path}')

    return state_dict['epoch']

In [28]:
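save_path = save_checkpoint('./hr_model', net, optimizer, 10)

# restore into a fresh model and an optimizer bound to that model's parameters
new_net = Net().to(device)
new_optimizer = optim.Adam(new_net.parameters(), lr=lr)
last_epoch = load_checkpoint(save_path, new_net, new_optimizer)
eval(new_net, test_loader)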

Model saved to ==> ./hr_model_10.pt
Model loaded from <== ./hr_model_10.pt
              precision    recall  f1-score   support

           0       0.90      0.93      0.91       245
           1       0.56      0.47      0.51        49

    accuracy                           0.85       294
   macro avg       0.73      0.70      0.71       294
weighted avg       0.84      0.85      0.84       294
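
The identical report before and after loading suggests the weights were restored exactly. The same round trip can be checked directly; the sketch below assumes a hypothetical small layer and an in-memory buffer in place of a file on disk:

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)   # hypothetical stand-in for the trained network

# save the checkpoint dictionary to an in-memory buffer instead of a file
buffer = io.BytesIO()
torch.save({'model_state_dict': model.state_dict(), 'epoch': 3}, buffer)
buffer.seek(0)

checkpoint = torch.load(buffer)
restored = nn.Linear(4, 2)   # fresh layer with different random weights
restored.load_state_dict(checkpoint['model_state_dict'])

x = torch.randn(3, 4)
same_outputs = torch.equal(model(x), restored(x))

print(checkpoint['epoch'])  # 3
print(same_outputs)         # True
```

To resume training rather than only evaluate, the stored ``epoch`` would typically become the starting index of the epoch loop.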



<div class="alert alert-warning"><p>
    
**Exercise:**
Try to improve the performance by changing:
- Number of layers
- Number of hidden units
- Batch size
- Learning rate
- Optimizer
- Number of epochs

</p></div>

**References**
- <https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset>
- <https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html>
- <http://cs231n.stanford.edu/slides/2018/cs231n_2018_ds02.pdf>