More PyTorch tutorials can be found at [https://pytorch.org/tutorials/](https://pytorch.org/tutorials/)

# Import PyTorch libraries

In [None]:
import math
import numpy as np

import torch
from torch import nn, optim, cuda
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.optim.lr_scheduler import StepLR
import torch.nn.functional as F

import torchvision
from torchvision import datasets, transforms


use_gpu=False
if cuda.is_available():
    # check if GPU is available
    print(cuda.get_device_properties(0))

# for visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Work on Fashion MNIST Dataset

https://github.com/zalandoresearch/fashion-mnist



### Download Datasets

In [None]:
# preparing transforming functions
data_tf=transforms.Compose([transforms.ToTensor(), transforms.Normalize([0.5],[0.5])])
# data_tf=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])

# download datasets
# train_dataset = datasets.MNIST(root='../data', train=True, transform=data_tf, download=True)
# test_dataset = datasets.MNIST(root='../data', train=False, transform=data_tf, download=True)
train_dataset = datasets.FashionMNIST(root='../data', train=True, transform=data_tf, download=True)
test_dataset = datasets.FashionMNIST(root='../data', train=False, transform=data_tf, download=True)



### Load Datasets

In [None]:
# data loader
batch_size=64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

## Display sample pictures from the dataset

<!-- row_num=8 # if batch_size=64, plot matrix as 8x8
column_num=int(batch_size/row_num)

#---create a grid figure----
fig, axes = plt.subplots(row_num, column_num, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

#---iterate and plot figures---
for i, ax in enumerate(axes.flat):
    ax.imshow(test_dataset.data[i], cmap='Greys', interpolation='nearest')
    ax.text(0.1, 0.1, test_dataset.targets[i].numpy(), transform=ax.transAxes, color='green')
 -->


In [None]:
sample_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)
data=iter(sample_loader)
samples,sample_labels=next(data)

row_num=8 # if batch_size=64, plot matrix as 8x8
column_num=int(batch_size/row_num)


fig, axes = plt.subplots(row_num, column_num, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

for i, ax in enumerate(axes.flat):
    ax.imshow(samples[i][0], cmap='Greys', interpolation='nearest')
    ax.text(0.1, 0.1, sample_labels[i].numpy(), transform=ax.transAxes, color='green')


### Labels
Each training and test example is assigned to one of the following labels:

| Label | Description |
| --- | --- |
| 0 | T-shirt/top |
| 1 | Trouser |
| 2 | Pullover |
| 3 | Dress |
| 4 | Coat |
| 5 | Sandal |
| 6 | Shirt |
| 7 | Sneaker |
| 8 | Bag |
| 9 | Ankle boot |
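
For more readable output, the numeric labels can be mapped to the names in the table. This is a minimal sketch; the names `CLASS_NAMES` and `label_name` are illustrative and not part of the original notebook.

```python
# Class names in label order, copied from the table above.
CLASS_NAMES = [
    'T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
    'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot',
]

def label_name(label):
    """Return the class name for an integer label (0-9).

    int() also accepts a zero-dimensional tensor such as sample_labels[i].
    """
    return CLASS_NAMES[int(label)]

print(label_name(9))  # Ankle boot
```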


# Step 2: Define a Deep Neural Network

<!-- ![nn](images/neural_network_example.png)

The neural network architectures in Pytorch can be defined in a class which inherits the properties from the base class from **nn** package called Module. This inheritance from the nn.Module class allows us to implement, access, and call a number of methods easily. We can define all the layers inside the constructor of the class, and the forward propagation steps inside the forward function.

We will define a simple Multilayer Perceptron with the following architecture:

* Input layer
```Python
nn.Linear(28 * 28, 512)
```
    * Layer type: nn.Linear(), which refers to a fully connection layer
    * Input size: 28*28, corresponding to the size of input data.
    * Output size: 512, the number of "neurons".
    
* Hidden layer
```
nn.Linear(512, 256)
```

    * Layer type: nn.Linear()
    * Input size: 512, output size of the previous layer(input layer).
    * Output size: 256, the number of "neurons" in this layer.
    
* Output layer
```
nn.Linear(256, 10)
```

    * Layer type: nn.Linear()
    * Input size: 256, output size of the previous layer(hidden layer).
    * Output size: 10, the number of classes we need to predict.

* Activation functions
Each linear layer's output needs to go through an activation function to "activate" it. We will get started with **F.sigmoid()** but can try F.relu() or others later.

The best practice is to name each layer and initialize them in the **__init__()** function as named building blocks and put the building blocks together in the **forward()** function which defines how the data actually flows in the network. In our case, each layer simply takes the output of the previous layer and perform transformations the generate outputs in sequence. -->

![LeNet](images/lenet.png)
LeNet. Original image published in [LeCun et al., 1998]

[LeCun et al., 1998]: https://ieeexplore.ieee.org/document/726791/

A **convolutional neural network (CNN)** is built around the **convolution** operation: each output value is a weighted sum of a pixel and its local neighbors, with the weights given by a small matrix called a kernel (or filter). Convolution helps extract local features of an image, such as edges, sharpness, and blur. A major difference between a convolution layer and a **fully connected (FC)** layer is that each element (neuron) in a convolution layer is connected only to a small neighborhood of the previous layer, whereas each element in an FC layer is connected to every element of the previous layer.

![Convolution](images/convolution-example.png)

Convolution example. Picture from https://towardsdatascience.com/simple-introduction-to-convolutional-neural-networks-cdf8d3077bac.
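
To make the idea concrete, here is a small sketch that applies one hand-picked kernel to a tiny image with `F.conv2d`. The image and kernel values are made up for illustration; in a real CNN the kernel weights are learned during training.

```python
import torch
import torch.nn.functional as F

# A 4x6 "image": dark (0) on the left half, bright (1) on the right half.
# conv2d expects the shape (batch, channels, height, width).
image = torch.tensor([[0., 0., 0., 1., 1., 1.]] * 4).reshape(1, 1, 4, 6)

# A 3x3 kernel that responds to dark-to-bright vertical edges (illustrative values).
kernel = torch.tensor([[-1., 0., 1.],
                       [-1., 0., 1.],
                       [-1., 0., 1.]]).reshape(1, 1, 3, 3)

# Each output value is the weighted sum of one 3x3 neighborhood of the input.
# (PyTorch applies the kernel without flipping it, i.e. cross-correlation.)
edges = F.conv2d(image, kernel)

print(edges.shape)   # torch.Size([1, 1, 2, 4])
print(edges[0, 0])   # each row is [0., 3., 3., 0.]: non-zero only around the edge
```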

Another important operation is **pooling**, which reduces the dimensions of the data (tensors) flowing through the neural network. A frequently used pooling method is "max pooling", which first divides each tensor into smaller subsets and then keeps only the maximum value of each subset to form a new, smaller tensor.

![Maxpool](images/maxpool-example.png)

Max pool example. Picture from https://analyticsindiamag.com/max-pooling-in-convolutional-neural-network-and-its-features/
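
A quick sketch of max pooling on a hand-made 4x4 tensor (the values are arbitrary). With a 2x2 window, each 2x2 block of the input is replaced by its largest value:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 3., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).reshape(1, 1, 4, 4)

# 2x2 window; the stride defaults to the window size, so blocks do not overlap.
pooled = F.max_pool2d(x, 2)

print(pooled[0, 0])
# tensor([[6., 8.],
#         [3., 4.]])
```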

A **dropout** layer randomly "drops" network elements during training by setting their outputs to zero. The purpose is to avoid an overly "complicated" network, which tends to overfit the data. Overfitting reduces the generalization ability of the neural network, meaning the model will perform poorly on unseen validation or test data.
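
The sketch below illustrates that dropout is only active in training mode. It uses `nn.Dropout` on a vector of ones; the seed is set only to make the illustration repeatable.

```python
import torch
from torch import nn

torch.manual_seed(0)       # only to make the illustration repeatable
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()               # training mode: each element is zeroed with probability p
y_train = drop(x)          # surviving elements are scaled by 1/(1-p) = 2

drop.eval()                # evaluation mode: dropout does nothing
y_eval = drop(x)

print(y_train)             # a mix of 0. and 2.
print(y_eval)              # all ones, identical to x
```

This is why the notebook calls `model.train()` before training and `model.eval()` before testing.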

A **deep neural network (DNN)**, such as a CNN, is a composition of multiple types of layers: convolution layers, pooling layers, dropout layers, fully connected layers, etc. The final output layer can be a `softmax` layer, which turns the outputs into class probabilities; the predicted label is the class with the highest probability.
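
As a small illustration of the output layer, the sketch below turns made-up raw scores for three hypothetical classes into probabilities and picks the most probable class:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw scores for 3 hypothetical classes
probs = F.softmax(logits, dim=1)            # non-negative values that sum to 1
pred = probs.argmax(dim=1)                  # index of the most probable class

print(probs)
print(pred)   # tensor([0])
```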


In [None]:
class CNNExample(nn.Module):
    """
    from: https://pytorch.org/tutorials/recipes/recipes/defining_a_neural_network.html
    """
    def __init__(self):
        super(CNNExample, self).__init__()
        
        # First 2D convolutional layer, taking in 1 input channel (grayscale image),
        # outputting 32 convolutional features, with a square kernel size of 3
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        # Second 2D convolutional layer, taking in the 32 input channels,
        # outputting 64 convolutional features, with a square kernel size of 3
        self.conv2 = nn.Conv2d(32, 64, 3, 1)

        # Designed to ensure that adjacent pixels are either all 0s or all active
        # with an input probability
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)

        # First fully connected layer
        self.fc1 = nn.Linear(9216, 128)
        # Second fully connected layer that outputs our 10 labels
        self.fc2 = nn.Linear(128, 10)
    
    def forward(self, x):
        x = self.conv1(x) # conv layer 1
        x = F.relu(x) # activation function
        x = self.conv2(x) # conv layer 2
        x = F.max_pool2d(x, 2) # pool layer
        x = self.dropout1(x) # dropout layer 1
        x = torch.flatten(x, 1) # flatten layer 
        x = self.fc1(x) # fully connected layer 1
        x = F.relu(x) # activation function
        x = self.dropout2(x) # dropout layer 2
        x = self.fc2(x) # fully connected layer 2
        output = F.log_softmax(x, dim=1) # log-softmax over the 10 classes
        return output
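
The `9216` inputs of `fc1` follow from the layer shapes: a 3x3 convolution without padding shrinks each side by 2 (28 → 26 → 24), the 2x2 max pool halves it (24 → 12), and 64 channels × 12 × 12 = 9216. A quick way to check this is sketched below with standalone layers that mirror the constructor arguments used in the class:

```python
import torch
from torch import nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 32, 3, 1)
conv2 = nn.Conv2d(32, 64, 3, 1)

x = torch.zeros(1, 1, 28, 28)    # one dummy 28x28 grayscale image
x = conv1(x)                     # -> (1, 32, 26, 26)
x = conv2(x)                     # -> (1, 64, 24, 24)
x = F.max_pool2d(x, 2)           # -> (1, 64, 12, 12)
flat = torch.flatten(x, 1)       # -> (1, 9216)

print(flat.shape)                # torch.Size([1, 9216])
```

If the kernel sizes, the pooling, or the input image size change, the first argument of `fc1` has to change accordingly.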

## Define the model for training
The following code creates an instance of the network defined above. It will be used in the later steps.

```Python
# define model
model = CNNExample()
if cuda.is_available(): 
    #if GPU is available
    model=model.cuda()
```

# Step 3: Define optimizer and criterion

An optimizer performs the gradient descent process. Several optimizers are available, such as SGD (Stochastic Gradient Descent), Adam, and Adagrad. The tricky part is choosing the learning rate (`learning_rate`), which can have a huge impact on the final result. The learning rate is the step size used when updating the weights during training. A larger step size trains more quickly but may overshoot ("miss") the optimal solution. As a starting point, let's simply use 0.1.
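
The effect of the step size can be seen on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. The function `gradient_descent` is illustrative and unrelated to the PyTorch optimizers.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - lr * grad     # step against the gradient, scaled by the learning rate
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

With `lr=0.01` the result moves toward 0 only slowly, with `lr=0.1` it gets much closer in the same number of steps, and with `lr=1.1` every step overshoots the minimum and the value grows instead of shrinking.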

The criterion calculates the cost (or loss), which is then used in back propagation to update the weights we want to train. In our case, we will use `nn.CrossEntropyLoss()` since we are working on a multiclass classification problem.

```Python
#---optimizer for training the neural network---
learning_rate=1e-1

# use Adam optimizer
optimizer=optim.Adam(model.parameters(), lr=learning_rate)

#---criterion is the cost or error function---
criterion=nn.CrossEntropyLoss()
```
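
Note that `CNNExample` already ends with `log_softmax`, while `nn.CrossEntropyLoss()` applies a log-softmax internally. The sketch below, using random scores for illustration, suggests that the two combinations give the same loss, because applying log-softmax twice gives the same result as applying it once:

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)              # 4 samples, 10 classes (random, for illustration)
target = torch.tensor([1, 0, 9, 3])
log_probs = F.log_softmax(logits, dim=1)

ce_on_logits = nn.CrossEntropyLoss()(logits, target)
nll_on_log_probs = nn.NLLLoss()(log_probs, target)
ce_on_log_probs = nn.CrossEntropyLoss()(log_probs, target)

print(torch.allclose(ce_on_logits, nll_on_log_probs))  # True
print(torch.allclose(ce_on_logits, ce_on_log_probs))   # True
```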

# Step 4: Train the model (back propagation)

Training the model is an iterative process that runs for many epochs. In each epoch we repeatedly load a batch of data, perform forward propagation, calculate the cost, perform back propagation, and update the weights with the optimizer.

**Epoch**: one epoch is one full pass of all the training data through the neural network; `num_epochs` is how many such passes to run. After each epoch, the cost (error) tends to decrease. A larger number of epochs means more rounds of training and may further lower the training error, but takes longer. However, you may notice in the experiments that a very large number of epochs is unnecessary if the training error is already "acceptable" after fewer rounds.
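
As a rough sense of scale, the number of weight updates follows from the dataset size, the batch size, and the number of epochs. The sketch below uses the 60,000 Fashion-MNIST training images, the `batch_size` of 64 from the data loader, and 20 epochs as an example value:

```python
import math

num_train = 60000    # Fashion-MNIST training images
batch_size = 64
num_epochs = 20      # example value

# The last batch of each epoch is smaller, hence the ceiling.
batches_per_epoch = math.ceil(num_train / batch_size)
total_updates = batches_per_epoch * num_epochs

print(batches_per_epoch, total_updates)  # 938 18760
```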

In [None]:
##########training############
def train_model(model, train_loader, criterion, optimizer, num_epochs):
    # pick the device from the use_gpu switch (Variable wrappers are no longer needed)
    device = torch.device('cuda' if use_gpu and cuda.is_available() else 'cpu')
    log_every = max(1, num_epochs // 10)  # print about 10 progress lines
    model.train()
    for epoch in range(num_epochs):
        train_loss = 0.0
        for img, label in train_loader:
            img, label = img.to(device), label.to(device)
            # forward propagation
            optimizer.zero_grad()
            out = model(img)
            loss = criterion(out, label)
            # criterion returns the batch mean, so weight it by the batch size
            train_loss += loss.item() * label.size(0)
            # backward propagation
            loss.backward()
            optimizer.step()
        #scheduler.step()
        if (epoch + 1) % log_every == 0:
            # average over the training set (not the test set)
            print(f'Epoch {epoch+1}/{num_epochs}, train_loss={train_loss/len(train_loader.dataset)}')
    return model


# Step 5: Test the model

We will use the trained model to make predictions on the test dataset and compare the predictions against the actual targets. A `DataLoader` is used to iterate over the test dataset as well.
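
The accuracy computation used in the test function can be illustrated on a tiny hand-made batch. The scores and labels below are made up for the example.

```python
import torch

# Model outputs for 4 samples and 3 classes (made-up scores).
out = torch.tensor([[0.1, 2.0, -1.0],
                    [1.5, 0.2, 0.3],
                    [0.0, 0.1, 3.0],
                    [2.0, 1.0, 0.0]])
label = torch.tensor([1, 0, 1, 0])

# torch.max along dim 1 returns (largest value, its index); the index is the predicted class.
_, pred = torch.max(out, 1)
num_correct = (pred == label).sum().item()
accuracy = num_correct / label.size(0)

print(pred)       # tensor([1, 0, 2, 0])
print(accuracy)   # 0.75
```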

In [None]:
##########testing##########
def test_model(model, criterion, test_loader):
    device = torch.device('cuda' if use_gpu and cuda.is_available() else 'cpu')
    eval_loss = 0.0
    num_correct = 0
    # switch to evaluation mode (disables dropout)
    model.eval()
    with torch.no_grad():  # gradients are not needed for evaluation
        for img, label in test_loader:
            img, label = img.to(device), label.to(device)
            out = model(img)
            loss = criterion(out, label)
            # criterion returns the batch mean, so weight it by the batch size
            eval_loss += loss.item() * label.size(0)
            _, pred = torch.max(out, 1)  # index of the largest output = predicted class
            num_correct += (pred == label).sum().item()
    num_samples = len(test_loader.dataset)
    loss = eval_loss / num_samples
    accuracy = num_correct / num_samples
    print(f'Test Loss: {loss}, Acc: {accuracy}')
    return loss, accuracy


# Step 6: Putting It All Together

In [None]:
%%time

#gpu switch on/off
use_gpu=True

# # define model
# model=CNN()
model=CNNExample()
# model = CNN2()
# print(model)

if use_gpu and cuda.is_available():
    print(f'using GPU: {cuda.get_device_properties(0)}')
    model=model.cuda()
else:
    print('using CPU')
    model=model.cpu()
    
# learning rate
#learning_rate=1e-1
learning_rate=1
    
#---optimizer for training the neural network---
# use Adadelta optimizer: https://pytorch.org/docs/stable/generated/torch.optim.Adadelta.html
optimizer=optim.Adadelta(model.parameters(), lr=learning_rate)
# optimizer=optim.Adadelta(model.parameters(), rho=0.9, eps=1e-07, lr=learning_rate)
##---uncomment the following lines to try out different optimizers---
# optimizer=optim.Adagrad(model.parameters(), lr=learning_rate)
# optimizer=optim.Adam(model.parameters(), lr=learning_rate)
# optimizer=optim.SGD(model.parameters(), lr=learning_rate)


#---criterion is the cost or error function---
criterion=nn.CrossEntropyLoss()
#criterion=nn.NLLLoss()

# #---scheduler---
# scheduler = StepLR(optimizer, step_size=1, gamma=0.7) # need by NLLLoss

# set number of epochs
num_epochs=20

# train model
model = train_model(model, train_loader, criterion, optimizer, num_epochs)
loss, accuracy=test_model(model, criterion, test_loader)


---

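### Sample results:

In the logs below, the `eval_loss` printed during training is the accumulated training loss, not a validation loss, so it is not directly comparable with the final `Test Loss`.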


* Macbook Pro M1 CPU:

learning rate|number of epochs|criterion|optimizer
---|---|---|---
0.1|20|cross entropy|Adadelta


```
using CPU
Epoch 2/20, eval_loss=2.506694793701172
Epoch 4/20, eval_loss=2.0360848903656006
Epoch 6/20, eval_loss=1.7916204929351807
Epoch 8/20, eval_loss=1.635208249092102
Epoch 10/20, eval_loss=1.4720598459243774
Epoch 12/20, eval_loss=1.3724684715270996
Epoch 14/20, eval_loss=1.2660068273544312
Epoch 16/20, eval_loss=1.1700396537780762
Epoch 18/20, eval_loss=1.0889387130737305
Epoch 20/20, eval_loss=1.0286983251571655
Test Loss: 0.2344028353691101, Acc: 0.919700026512146
CPU times: user 18min 28s, sys: 4min 32s, total: 23min 1s
Wall time: 7min 29s
```

---

* STEM Cloud GPU-PC3 CPU:


learning rate|number of epochs|criterion|optimizer
---|---|---|---
1|20|cross entropy|Adadelta

```
using CPU
Epoch 2/20, eval_loss=2.059727668762207
Epoch 4/20, eval_loss=1.5988885164260864
Epoch 6/20, eval_loss=1.3719987869262695
Epoch 8/20, eval_loss=1.217139482498169
Epoch 10/20, eval_loss=1.105204463005066
Epoch 12/20, eval_loss=1.0436577796936035
Epoch 14/20, eval_loss=0.9719496965408325
Epoch 16/20, eval_loss=0.9385913014411926
Epoch 18/20, eval_loss=0.9174977540969849
Epoch 20/20, eval_loss=0.8974134922027588
Test Loss: 0.37537112832069397, Acc: 0.9154000282287598
CPU times: user 30min 17s, sys: 1min 4s, total: 31min 21s
Wall time: 8min 53s
```

learning rate|number of epochs|criterion|optimizer
---|---|---|---
0.1|40|cross entropy|Adadelta

```
```

---

* STEM Cloud GPU-PC3 GPU:

learning rate|number of epochs|criterion|optimizer
---|---|---|---
1|20|cross entropy|Adadelta

```
using GPU: _CudaDeviceProperties(name='Quadro RTX 4000', major=7, minor=5, total_memory=7979MB, multi_processor_count=36)
Epoch 2/20, eval_loss=2.0272912979125977
Epoch 4/20, eval_loss=1.5873997211456299
Epoch 6/20, eval_loss=1.3690463304519653
Epoch 8/20, eval_loss=1.2476575374603271
Epoch 10/20, eval_loss=1.1200876235961914
Epoch 12/20, eval_loss=1.0211328268051147
Epoch 14/20, eval_loss=1.0031551122665405
Epoch 16/20, eval_loss=0.9409778118133545
Epoch 18/20, eval_loss=0.9311295747756958
Epoch 20/20, eval_loss=0.9014635682106018
Test Loss: 0.34685808420181274, Acc: 0.914900004863739
CPU times: user 2min 41s, sys: 697 ms, total: 2min 41s
Wall time: 2min 41s
```


learning rate|number of epochs|criterion|optimizer
---|---|---|---
1|40|cross entropy|Adadelta

```
using GPU: _CudaDeviceProperties(name='Quadro RTX 4000', major=7, minor=5, total_memory=7979MB, multi_processor_count=36)
Epoch 4/40, eval_loss=1.603761911392212
Epoch 8/40, eval_loss=1.2219514846801758
Epoch 12/40, eval_loss=1.0351918935775757
Epoch 16/40, eval_loss=0.9253424406051636
Epoch 20/40, eval_loss=0.8904345631599426
Epoch 24/40, eval_loss=0.8199371099472046
Epoch 28/40, eval_loss=0.7929683327674866
Epoch 32/40, eval_loss=0.793138325214386
Epoch 36/40, eval_loss=0.7336294054985046
Epoch 40/40, eval_loss=0.7716048359870911
Test Loss: 0.6219379305839539, Acc: 0.9109999537467957
CPU times: user 5min 19s, sys: 645 ms, total: 5min 19s
Wall time: 5min 20s
```

<!-- * Google Colab CPU:
    * learning rate: 0.1; number of epocs: 20; criterion: cross entropy; optimizer: Adam
 -->


# Step 7: Make predictions

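This step is similar to the test step, except that in real use there is no ground truth to compare the predictions against. Here we draw a random batch from the test set, so each image can still show its true label (bottom left, green) next to the prediction (bottom right, green if correct, red if wrong).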
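
When only the predicted labels are needed, the logic can be wrapped in a small helper. This is a sketch under assumptions: the name `predict` and the stand-in linear model are illustrative, and in the notebook you would pass the trained `model` instead.

```python
import torch
from torch import nn

def predict(model, images):
    """Return the predicted class index for each image in the batch."""
    model.eval()                      # disable dropout
    with torch.no_grad():             # no gradients needed for prediction
        return model(images).argmax(dim=1)

# Untrained stand-in model, used only so that the example runs on its own.
stand_in = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
images = torch.zeros(5, 1, 28, 28)    # 5 dummy grayscale images

preds = predict(stand_in, images)
print(preds.shape)                    # torch.Size([5])
```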

In [None]:
sample_size=batch_size
# let shuffle be True to load random images each time
sample_loader = DataLoader(test_dataset, batch_size=sample_size, shuffle=True) 
data=iter(sample_loader)
samples,sample_labels=next(data)

# if use_gpu and cuda.is_available(): #if use_gpu switch is on
#     samples=Variable(samples).cuda()
#     sample_labels=Variable(sample_labels).cuda()
# else:
#     samples=Variable(samples)
#     sample_labels=Variable(sample_labels)

model.to('cpu') #move model to cpu for evaluation
model.eval()
output=model(samples)
_, pred = torch.max(output, 1)

# print(sample_labels)
# print(pred)

row_num=8 # if batch_size=64, plot matrix as 8x8
column_num=int(batch_size/row_num)
fig, axes = plt.subplots(row_num, column_num, figsize=(8, 8),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))

error = 0 # prediction error
for i, ax in enumerate(axes.flat):
    ax.imshow(samples[i][0], cmap='Greys', interpolation='nearest')
    ax.text(0.1, 0.1, sample_labels[i].numpy(), transform=ax.transAxes, color='green')
    if sample_labels[i].numpy()==pred[i].numpy():
        color_pred='green'
    else:
        color_pred='red'
        error+=1
    ax.text(0.8, 0.1, pred[i].numpy(), transform=ax.transAxes, color=color_pred)

accuracy_predict=(1-error*1.0/sample_size)*100
print(f'Prediction errors: {error} of {sample_size} | Prediction accuracy: {accuracy_predict}%')


<font style="font-size:30px" color="red">Your Turn</font>

1. Try different values for num_epochs.

1. Try different values for batch_size such as 64, 128, 256, 512.

1. Try different learning rates (0.0001, 0.001, 0.1). What do you observe about the loss printed after each epoch during training?

1. `relu()` was used as the activation function for the CNN model we implemented. Try other activation functions such as `sigmoid()` and `tanh()`. What is your observation in training? (You may want to refer to the [PyTorch Functional](https://pytorch.org/docs/stable/nn.functional.html) documentation for more details.)
    
1. Try different optimizers such as `SGD` and `Adagrad`.  What is your observation in training? (Refer to [PyTorch Optimizer](https://pytorch.org/docs/stable/optim.html))

1. Try adding or removing hidden layers, as well as changing the number of neurons, and report your validation results. What is your observation in training?

1. What can you think of if we want to improve the current model?


<!-- # Assignments: -->


<!-- 1. Can you implment the following functions for model training, evaluation and prediction so we can reuse them when we need to test different things afterwards without having to replicate the codes every time.

```python

def train_model(nn_model, train_loader, optimizer, criterion, n_epoch):
    # YOUR IMPLEMENTATION
    return nn_model

def eval_model(nn_model, val_loader):
    # YOUR IMPLEMENTATION
    return calculated_accuracy


def predict(nn_model, test_loader):
    # YOUR IMPLEMENTATION
    return predictions

```


2. F.sigmoid() was used as the activation function for the MLP model we implemented. Can you try other activation functions such as F.relu() and F.tanh()? You may want to refer to the [PyTorch Functional](https://pytorch.org/docs/stable/nn.functional.html) for more details. Use the train_model(), eval_model() functions you implmented so you don't have to repeat the same codes.

3. Can you add a dropout layer between input and the hidden layer, and another one between the hidden layer and the output layer?

4. Try to add/ remove hiden layers, as well as different number of neurons and report your validation results.

5. Try different learning rate (0.0001, 0.001, 0.01, 0.1). What is your observation?

6. Try different values for batch_size(64, 128, 256, 512).

7. Try different values for n_epochs.

8. What can you think of if we want to improve the current model?
 -->

