# Week 5. Training Issues
By now you have seen how to train a model from scratch. In this part, we first review the whole training process by building a classification network on the MNIST dataset, and then highlight some useful tricks for improving model performance.

If you have any questions or suggestions about this part, please feel free to contact the teaching assistants Wanying Tao and Jianfei Xing on WeChat.
  

In [None]:
%load_ext autoreload
%autoreload 2

## 1. Common Setup

In [None]:
import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# from torchvision.datasets import MNIST
import torchvision
from torchvision import transforms
from torch.optim import lr_scheduler
from collections import OrderedDict
import matplotlib.pyplot as plt

In [None]:
# cuda = torch.cuda.is_available() 
# torch.cuda.set_device(device) 

# Limited by GPU resources, we recommend computing on CPU
cuda = torch.device('cpu') 

## 2. Classification Model

In [None]:
class FeedForwardNeuralNetwork(nn.Module):
    """
    Inputs                Linear/Function        Output
    [128, 1, 28, 28]   -> Linear(28*28, 100) -> [128, 100]  # the first hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function, or Sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # the second hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function, or Sigmoid
                       -> Linear(100, 100)   -> [128, 100]  # the third hidden layer
                       -> ReLU               -> [128, 100]  # ReLU activation function, or Sigmoid
                       -> Linear(100, 10)    -> [128, 10]   # classification layer
    """
    def __init__(self, input_size, hidden_size, output_size, activation_function='RELU'):
        super(FeedForwardNeuralNetwork, self).__init__()
        self.use_dropout = False
        self.use_bn = False
        self.hidden1 = nn.Linear(input_size, hidden_size)  # Linear function 1: 784 --> 100 
        self.hidden2 = nn.Linear(hidden_size, hidden_size) # Linear function 2: 100 --> 100
        self.hidden3 = nn.Linear(hidden_size, hidden_size) # Linear function 3: 100 --> 100
        # Linear function 4 (readout): 100 --> 10
        self.classification_layer = nn.Linear(hidden_size, output_size)
        self.dropout = nn.Dropout(p=0.5) # Drop out with prob = 0.5
        self.hidden1_bn = nn.BatchNorm1d(hidden_size) # Batch Normalization 
        self.hidden2_bn = nn.BatchNorm1d(hidden_size)
        self.hidden3_bn = nn.BatchNorm1d(hidden_size)
        
        # Non-linearity
        if activation_function == 'SIGMOID':
            self.activation_function1 = nn.Sigmoid()
            self.activation_function2 = nn.Sigmoid()
            self.activation_function3 = nn.Sigmoid()
        elif activation_function == 'RELU':
            self.activation_function1 = nn.ReLU()
            self.activation_function2 = nn.ReLU()
            self.activation_function3 = nn.ReLU()
        else:
            # fail early instead of hitting an AttributeError later in forward()
            raise ValueError("activation_function must be 'RELU' or 'SIGMOID'")
        
    def forward(self, x):
        """
        Args:
            x: [batch_size, channel, height, width], network input
        Returns:
            out: [batch_size, n_classes], network output
        """
        
        x = x.view(x.size(0), -1) # flatten x to [batch_size, 784]
        out = self.hidden1(x)
        out = self.activation_function1(out) # Non-linearity 1
        if self.use_bn:
            out = self.hidden1_bn(out) # BN applied after the activation
        out = self.hidden2(out)
        out = self.activation_function2(out) # Non-linearity 2
        if self.use_bn:
            out = self.hidden2_bn(out) # BN applied after the activation
        out = self.hidden3(out)
        if self.use_bn:
            out = self.hidden3_bn(out) # note: in this layer BN comes before the activation
        out = self.activation_function3(out) # Non-linearity 3
        if self.use_dropout:
            out = self.dropout(out)
        out = self.classification_layer(out)
        return out
    
    def set_use_dropout(self, use_dropout):
        """Whether to use dropout. Auxiliary function for our exp, not necessary.
        Args:
            use_dropout: True, False
        """
        self.use_dropout = use_dropout
        
    def set_use_bn(self, use_bn):
        """Whether to use batch normalization. Auxiliary function for our exp, not necessary.
        Args:
            use_bn: True, False
        """
        self.use_bn = use_bn
        
    def get_grad(self):
        """Return the mean absolute gradient of the hidden2 and hidden3 weights.
        Auxiliary function for our experiments, not required.
        """
        hidden2_average_grad = np.mean(np.abs(self.hidden2.weight.grad.detach().numpy()))
        hidden3_average_grad = np.mean(np.abs(self.hidden3.weight.grad.detach().numpy()))
        return hidden2_average_grad, hidden3_average_grad

## 3. Training  

### 3.1 Pre-set hyper-parameters
* learning rate: common starting values are 1e-1, 1e-2, or 1e-3; the value is usually decreased gradually as training proceeds.
* n_epochs: the number of training epochs must be large enough for the model to converge.
* batch_size: a larger batch size makes better use of the GPU and reduces the time per epoch; by convention it is a power of 2, e.g., 2, 4, 8, 16, 32, 64, 128, 256.
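
The advice to decrease the learning rate gradually can be sketched with a scheduler. The example below is a minimal illustration and not part of the course code: it assumes a toy linear model (`toy_model`) and picks `StepLR` with `step_size=2` and `gamma=0.1` arbitrarily, so the learning rate is divided by 10 every two epochs.

```python
import torch.nn as nn
from torch.optim import SGD, lr_scheduler

# Toy model and optimizer, used only to illustrate the schedule (assumed names).
toy_model = nn.Linear(4, 2)
toy_optimizer = SGD(toy_model.parameters(), lr=0.1)

# StepLR multiplies the learning rate by gamma every step_size epochs.
scheduler = lr_scheduler.StepLR(toy_optimizer, step_size=2, gamma=0.1)

lrs = []
for epoch in range(6):
    lrs.append(toy_optimizer.param_groups[0]['lr'])  # learning rate used in this epoch
    # ... one epoch of training would run here ...
    toy_optimizer.step()   # placeholder for the real parameter updates
    scheduler.step()       # advance the schedule once per epoch

print(lrs)  # approximately 0.1, 0.1, 0.01, 0.01, 0.001, 0.001
```

In a real run, `scheduler.step()` would be called once per epoch, after the training loop for that epoch has finished.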

In [None]:
### Hyper parameters

batch_size = 128 # batch size is 128
n_epochs = 5 # train 5 epochs
learning_rate = 0.01 # learning rate is 0.01
input_size = 28*28 # each image is 28x28 pixels
hidden_size = 100 # 100 hidden neurons in each layer
output_size = 10 # classes of prediction
l2_norm = 0 # not to use l2 penalty
dropout = False # not to use dropout
get_grad = False # not to obtain grad

In [None]:
# create a model object
model = FeedForwardNeuralNetwork(input_size=input_size, hidden_size=hidden_size, output_size=output_size)
# loss function
loss_fn = torch.nn.CrossEntropyLoss()
# the L2 penalty is applied through the weight_decay argument of SGD
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, weight_decay=l2_norm) 

### 3.2 Initialize model parameters
PyTorch provides a default initialization (**uniform initialization**) for linear layers, and there are also other useful initialization methods that you may use in the homework.

Click on this [link](https://pytorch.org/docs/stable/_modules/torch/nn/init.html) for more details.
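
As a quick check of the default behaviour, the sketch below inspects a freshly created `nn.Linear` layer. It relies on the default weight initialization drawing from a uniform distribution bounded by $1/\sqrt{fan\_in}$. The layer size matches hidden1, and the variable names are illustrative only.

```python
import math
import torch.nn as nn

# A layer with the same shape as hidden1 (784 inputs, 100 outputs).
layer = nn.Linear(784, 100)

# Default weights are drawn uniformly from [-1/sqrt(fan_in), 1/sqrt(fan_in)].
bound = 1.0 / math.sqrt(layer.in_features)
max_abs_w = layer.weight.abs().max().item()

print('bound = {:.4f}, largest |weight| = {:.4f}'.format(bound, max_abs_w))
```

The largest absolute weight should sit just below the bound, which matches the flat histograms produced by the plotting helper for the default initialization.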

In [None]:
def show_weight_bias(model):
    """Show weights and bias distribution. 
    """
    # Create a figure and a set of subplots
    fig, axs = plt.subplots(2,3, sharey=False, tight_layout=True)
    
    # weight and bias for every hidden layer
    h1_w = model.hidden1.weight.detach().numpy().flatten()
    h1_b = model.hidden1.bias.detach().numpy().flatten()
    h2_w = model.hidden2.weight.detach().numpy().flatten()
    h2_b = model.hidden2.bias.detach().numpy().flatten()
    h3_w = model.hidden3.weight.detach().numpy().flatten()
    h3_b = model.hidden3.bias.detach().numpy().flatten()
    
    axs[0,0].hist(h1_w)
    axs[0,1].hist(h2_w)
    axs[0,2].hist(h3_w)
    axs[1,0].hist(h1_b)
    axs[1,1].hist(h2_b)
    axs[1,2].hist(h3_b)
    
    # set title for every sub plots
    axs[0,0].set_title('hidden1_weight')
    axs[0,1].set_title('hidden2_weight')
    axs[0,2].set_title('hidden3_weight')
    axs[1,0].set_title('hidden1_bias')
    axs[1,1].set_title('hidden2_bias')
    axs[1,2].set_title('hidden3_bias')

In [None]:
show_weight_bias(model)

In [None]:
def weight_bias_reset(model):
    """Custom initialization. You can follow this pattern to apply other initialization methods in the homework.
    """
    for m in model.modules():
        if isinstance(m, nn.Linear):
            mean, std = 0, 0.1 
            
            torch.nn.init.normal_(m.weight, mean, std)
            torch.nn.init.normal_(m.bias, mean, std)
            
#             m.weight.data.normal_(mean, std)
#             m.bias.data.normal_(mean, std)

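## Homework 1
Initialize the parameters with the constant, Xavier, and Kaiming methods, and show the parameter distributions of the model's hidden layers. You do not need to initialize the biases.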

In [None]:
# TODO

def weight_bias_reset_constant(model):
    """Constant initialization
    """
    # remove pass and code here
    pass
        
# Reset parameters and show the distribution
# code here

In [None]:
# TODO

def weight_bias_reset_xavier_uniform(model):
    """xavier_uniform, gain=1
    """
    # remove pass and code here
    pass
        
# Reset parameters and show the distribution
# code here

In [None]:
# TODO

def weight_bias_reset_kaiming_uniform(model):
    """kaiming_uniform, a=0, mode='fan_in', nonlinearity='relu'
    """
    # remove pass and code here
    pass
        
# Reset parameters and show the distribution
# code here        

### 3.3 Repeat over certain numbers of epoch

#### 3.3.1 data loading 
Please pay attention to data augmentation. 

Click on this [link](https://pytorch.org/docs/stable/torchvision/transforms.html) for more details.

```
torchvision.transforms.RandomVerticalFlip
torchvision.transforms.RandomHorizontalFlip
...
```


In [None]:
train_transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Normalize((0.1307,), (0.3081,))
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

In [None]:
train_dataset = torchvision.datasets.MNIST(root='./data', 
                            train=True, 
                            transform=train_transform,
                            download=True)

test_dataset = torchvision.datasets.MNIST(root='./data', 
                           train=False, 
                           transform=test_transform,
                           download=False)

In [None]:
# evaluating train_dataset prints a summary of the dataset; the transform is applied lazily, each time a sample is accessed
train_dataset

In [None]:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, 
                                           batch_size=batch_size, 
                                           shuffle=False)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset, 
                                          batch_size=batch_size, 
                                          shuffle=False)

#### 3.3.2 model training

In [None]:
def train(train_loader, model, loss_fn, optimizer, get_grad=False):
    """
    Args:
        train_loader: training data
        model: prediction model
        loss_fn: loss function to calculate the distance between target and outputs
        optimizer: optimize the loss function
        get_grad: True, False
    Returns:
        average_loss: average loss per batch in this epoch
        average_grad2: average grad for hidden 2 in this epoch
        average_grad3: average grad for hidden 3 in this epoch
    """
    model.train()
    
    total_loss = 0
    grad_2 = 0.0 # store sum(grad) for hidden 2 layer
    grad_3 = 0.0 # store sum(grad) for hidden 3 layer
    
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad() 
        outputs = model(data)
        loss = loss_fn(outputs, target)  
        total_loss += loss.item() 
        loss.backward() 
        
        if get_grad:
            g2, g3 = model.get_grad() # get gradients of the hidden 2 and hidden 3 layers in this batch
            grad_2 += g2
            grad_3 += g3 
            
        optimizer.step() 
            
    n_batches = len(train_loader) # batch_idx is zero-based, so dividing by it would be off by one
    average_loss = total_loss / n_batches
    average_grad2 = grad_2 / n_batches
    average_grad3 = grad_3 / n_batches
    
    return average_loss, average_grad2, average_grad3

#### 3.3.3 model evaluation

In [None]:
def evaluate(val_loader, model, loss_fn):
    """
    Args:
        val_loader: data for evaluation
        model: prediction model
        loss_fn: loss function to calculate the distance between target and outputs
    Returns:
        average_loss: average loss per batch
        accuracy: model prediction accuracy, in percent
    """
    model.eval()
    correct = 0
    total_loss = 0

    with torch.no_grad():
        for data, target in val_loader:
            outputs = model(data)
            _, predicted = torch.max(outputs, 1) # index of the largest logit is the predicted class
            correct += (predicted == target).sum().item()
            loss = loss_fn(outputs, target)
            total_loss += loss.item()

    # report the average over batches, matching the "Average loss" message printed by fit
    average_loss = total_loss / len(val_loader)
    accuracy = correct * 100.0 / len(val_loader.dataset)

    return average_loss, accuracy

In [None]:
def fit(train_loader, val_loader, model, loss_fn, optimizer, n_epochs, get_grad=False):
    """
    Args: 
        train_loader: training data
        val_loader: validation data
        model: prediction model
        loss_fn: loss function to calculate the distance between target and outputs
        optimizer: optimize the loss function
        n_epochs: training epochs
        get_grad: whether to get gradients of hidden2 layer and hidden3 layer or not
    Returns:
        train_accs: accuracy of training n_epochs, a list
        train_losses: loss of n_epochs, a list
    """
    
    grad_2 = [] 
    grad_3 = []
    
    train_accs = [] 
    train_losses = []
    
    for epoch in range(n_epochs): 
        
        train_loss, average_grad2, average_grad3 = train(train_loader, model, loss_fn, optimizer, get_grad)
        
        _, train_accuracy = evaluate(train_loader, model, loss_fn)
        message = 'Epoch: {}/{}. Train set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, train_loss, train_accuracy)
        print(message)
    
        # save loss, accuracy, grad
        train_accs.append(train_accuracy)
        train_losses.append(train_loss)
        grad_2.append(average_grad2)
        grad_3.append(average_grad3)
    
        # evaluate model performance on val dataset
        val_loss, val_accuracy = evaluate(val_loader, model, loss_fn)
        message = 'Epoch: {}/{}. Validation set: Average loss: {:.4f}, Accuracy: {:.4f}'.format(epoch+1, \
                                                                n_epochs, val_loss, val_accuracy)
        print(message)
        
    if get_grad:
        fig, ax = plt.subplots() 
        ax.plot(grad_2, label='Gradients of Hidden 2 Layer') 
        ax.plot(grad_3, label='Gradients of Hidden 3 Layer') 
        plt.ylim(top=0.004)
        # place a legend on axes
        legend = ax.legend(loc='best', shadow=True, fontsize='x-large')
    
    return train_accs, train_losses

In [None]:
def show_curve(ys, title):
    """
    Args:
        ys: loss or acc list
        title: Loss or Accuracy
    """
    x = np.array(range(len(ys)))
    y = np.array(ys)
    plt.plot(x, y, c='b')
    plt.axis()
    plt.title('{} Curve:'.format(title))
    plt.xlabel('Epoch')
    plt.ylabel('{} Value'.format(title))
    plt.show()

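## Homework 2
Set n_epochs to 5 and then to 10, plot the loss and accuracy curves during training, and observe how well the model fits the training set in each case.

Hint: because Jupyter keeps variables alive across cells, the model and optimizer must be declared again before each run.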

In [None]:
# TODO

# set hyper parameters and declare model, optimizer
# code here


In [None]:
# TODO

# train and evaluate model
# code here


In [None]:
# TODO

# show curve
# code here


## Homework 3
Adjust the other hyperparameters so that the model overfits the training set within 5 epochs, and plot the loss and accuracy curves during training.

In [None]:
# TODO

# set hyper parameters
# code here


In [None]:
# TODO

# train and evaluate model
# code here


In [None]:
# TODO

# show curve
# code here


### 3.4 save model 
PyTorch provides two ways to save a model: saving the entire model object, or saving only its parameters (the state_dict). We recommend saving only the parameters, because it is more flexible.

A common PyTorch convention is to save models using either a .pt or .pth file extension.

Click on this [link](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for more details.
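
When training needs to be resumed later, the same parameters-only approach can be extended to a checkpoint that also stores the optimizer state. The sketch below is an illustration under assumptions: it uses a toy linear model, hypothetical dictionary keys (`epoch`, `model_state_dict`, `optimizer_state_dict`), and an in-memory buffer in place of a file path such as './checkpoint.pt'.

```python
import io
import torch
import torch.nn as nn

# Toy model and optimizer standing in for the real ones (assumed names).
toy_model = nn.Linear(4, 2)
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.01)

# Bundle everything needed to resume training into one dictionary.
checkpoint = {
    'epoch': 5,
    'model_state_dict': toy_model.state_dict(),
    'optimizer_state_dict': toy_optimizer.state_dict(),
}

buffer = io.BytesIO()        # in-memory stand-in for a checkpoint file
torch.save(checkpoint, buffer)
buffer.seek(0)

# Restore into freshly constructed objects of the same shape.
loaded = torch.load(buffer)
restored_model = nn.Linear(4, 2)
restored_model.load_state_dict(loaded['model_state_dict'])
restored_optimizer = torch.optim.SGD(restored_model.parameters(), lr=0.01)
restored_optimizer.load_state_dict(loaded['optimizer_state_dict'])

print(loaded['epoch'])  # 5
```

Because only state dictionaries are stored, the model class can be refactored freely as long as the parameter names and shapes stay the same.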

In [None]:
# show parameters in model
print("Model's state_dict:")
for param_tensor in model.state_dict():
    print(param_tensor, "\t", model.state_dict()[param_tensor].size())

print("\nOptimizer's state_dict:")
for var_name in optimizer.state_dict():
    print(var_name, "\t", optimizer.state_dict()[var_name])

In [None]:
# save model
save_path = './model.pt'
torch.save(model.state_dict(), save_path)

In [None]:
# load parameters from files
saved_parameters = torch.load(save_path)
print(saved_parameters)

In [None]:
# initialize model with saved parameters
new_model = FeedForwardNeuralNetwork(input_size, hidden_size, output_size)
new_model.load_state_dict(saved_parameters)

## 4. Training Tricks

### 4.1 l2_norm

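## Homework 4
Think about the role of the regularization term. Set l2_norm to 0.01 and to 1 in turn, train the model, and compare the results with those obtained without l2_norm.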

Hint: the L2 regularization term can be applied through the $weight\_decay$ argument of the **SGD optimizer**.
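
To see what weight_decay does, the sketch below takes one SGD step on a hypothetical two-element parameter `w` whose loss gradient is zero. With weight decay $\lambda$, plain SGD updates $w \leftarrow w - lr\,(\nabla L + \lambda w)$, so with a zero loss gradient the weights simply shrink by the factor $1 - lr \cdot \lambda$. The values `lr=0.1` and `weight_decay=0.5` are chosen only to make the arithmetic easy to follow.

```python
import torch

# A hypothetical parameter, not part of the course model.
w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
opt = torch.optim.SGD([w], lr=0.1, weight_decay=0.5)

# Pretend the loss gradient is exactly zero, so only the decay term acts.
w.grad = torch.zeros_like(w)
opt.step()

# Each weight is multiplied by 1 - 0.1 * 0.5 = 0.95.
print(w.detach())  # approximately [0.95, -1.9]
```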


In [None]:
# TODO

# set hyper parameters and declare model, optimizer
# code here


In [None]:
# TODO

# train and evaluate model
# code here


In [None]:
# TODO

# set hyper parameters
# code here


In [None]:
# TODO

# train and evaluate model
# code here


### 4.2 dropout
Overfitting is a serious problem in large networks. Large networks are also slow to use, which makes it difficult to counter overfitting by combining the predictions of many different large networks at test time. Dropout addresses this problem: the key idea is to randomly drop units (along with their connections) from the network during training.

Click on this [link](http://jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf) for more details.
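
The behaviour of the dropout layer differs between training and evaluation, which the short sketch below illustrates on a vector of ones (the names `drop`, `x`, `y_train`, and `y_eval` are illustrative). In training mode, `nn.Dropout(p=0.5)` zeroes each element with probability 0.5 and scales the survivors by $1/(1-p) = 2$; in evaluation mode it passes the input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()            # training mode: units are dropped at random
y_train = drop(x)

drop.eval()             # evaluation mode: dropout is disabled
y_eval = drop(x)

# In training mode every element is either 0 (dropped) or 2 (kept and rescaled).
print(y_train.unique())
print('fraction dropped: {:.3f}'.format((y_train == 0).float().mean().item()))
print(torch.equal(y_eval, x))  # True
```

This is why the evaluation function calls `model.eval()` before measuring accuracy.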

## Homework 5
Think about the role of dropout. Train the model with dropout and compare the results with those obtained without dropout.

In [None]:
# TODO

# set hyper parameters and declare model, optimizer
# code here


In [None]:
# TODO

# Set dropout to True and probability = 0.5
# code here

In [None]:
# TODO

# train and evaluate model
# code here


### 4.3 batch normalization
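Batch normalization is a technique for improving the performance and stability of artificial neural networks. It normalizes each feature over the current mini-batch and then applies a learnable scale and shift: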

\begin{equation}
    y=\frac{x-E[x]}{\sqrt{Var[x]+\epsilon}} * \gamma + \beta, 
\end{equation}

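where $E[x]$ and $Var[x]$ are the mean and variance computed over the mini-batch, $\epsilon$ is a small constant added for numerical stability, and $\gamma$ and $\beta$ are learnable parameters.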

Click on this [link](https://arxiv.org/abs/1502.03167) for more details.
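
The formula above can be checked against `nn.BatchNorm1d` directly. The sketch below is an illustration on a random batch with assumed sizes (8 samples, 3 features): in training mode the layer normalizes with the per-feature batch mean and the biased batch variance, so a manual computation should reproduce its output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3)          # a batch of 8 samples with 3 features each

bn = nn.BatchNorm1d(3)
bn.train()                     # use batch statistics, as during training
y = bn(x)

# Manual version of y = (x - E[x]) / sqrt(Var[x] + eps) * gamma + beta
mean = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)   # biased variance over the batch
manual = (x - mean) / torch.sqrt(var + bn.eps) * bn.weight + bn.bias

print(torch.allclose(y, manual, atol=1e-5))  # True
```

In evaluation mode the layer switches to its running estimates of the mean and variance, so the two results would no longer match on a single batch.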

## Homework 6
Think about the role of batch normalization. Train the model with batch normalization and compare the results with those obtained without it.

In [None]:
# TODO

# set hyper parameters and declare model, optimizer
# code here


In [None]:
# TODO

# Set batch normalization to True 
# code here

In [None]:
# TODO

# train and evaluate model
# code here


### 4.4 data augmentation
PyTorch provides many transformation methods for data augmentation.

Click on this [link](https://pytorch.org/docs/stable/torchvision/transforms.html) for more details.

## Homework 7
Think about the role of data augmentation. Train the model with data augmentation (any transform of your choice) and compare the results with those obtained without it.

In [None]:
# TODO

# use data augmentation
# reload train_loader with transform
# code here


In [None]:
# TODO

# set hyper parameters and declare model, optimizer
# code here


In [None]:
# TODO

# train and evaluate model
# code here


## 5. Gradient explosion and vanishing

To plot how the gradients change during training, set **get_grad=True** when calling the **fit** function.
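
The plotted quantity is the mean absolute gradient of a layer's weights, read after the backward pass. The sketch below illustrates this measurement on a hypothetical single linear layer rather than the course model: for the loss `layer(x).sum()`, the gradient with respect to the weights equals the input itself, so the result can be verified by hand.

```python
import torch
import torch.nn as nn

# A tiny layer and input chosen so the gradient is easy to verify by hand.
layer = nn.Linear(2, 1)
x = torch.tensor([[1.0, -3.0]])

loss = layer(x).sum()
loss.backward()                      # fills layer.weight.grad

# d(loss)/d(weight) = x = [[1, -3]], so the mean absolute gradient is (1 + 3) / 2.
mean_abs_grad = layer.weight.grad.abs().mean().item()
print(mean_abs_grad)  # 2.0
```

Values that shrink toward zero across epochs suggest vanishing gradients, while values that grow very large suggest exploding gradients.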

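## Homework 8
Adjust the hyperparameters or change the initialization method to produce gradient explosion and gradient vanishing, and compare their gradient curves with the curve from a normal run.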

In [None]:
# TODO

# normal gradient
# code here


In [None]:
# TODO

# gradient explosion
# code here


In [None]:
# TODO

# gradient vanishing
# code here
