# Introduction to Deep Learning with PyTorch

Neural networks have been at the forefront of Artificial Intelligence research during the last few years, and have provided solutions to many difficult problems like image classification, language translation or AlphaGo. PyTorch is one of the leading deep learning frameworks, being both powerful and easy to use. In this course you will use PyTorch to first learn about the basic concepts of neural networks, before building your first neural network to predict digits from the MNIST dataset. You will then learn about convolutional neural networks, and use them to build much more powerful models which give more accurate results. You will evaluate the results and use different techniques to improve them. Following the course, you will be able to delve deeper into neural networks and start your career in this fascinating field.

**Instructor:** Ismail Elezi, PhD researcher at Ca' Foscari University of Venice

## $\star$ Chapter 1: Introduction to PyTorch
In this first chapter, we introduce the basic concepts of neural networks and deep learning using the PyTorch library.

* **Why PyTorch?**
    * Simplicity
    * "Pythonic": easy to use
    * Strong GPU support - models run fast
    * Many algorithms are already implemented
    * Automatic differentiation
    * Strong OOP
    * Natural choice for many companies like Facebook and SalesForce
    * One of the most used deep learning libraries in academic research
    * Similar to NumPy, making the switch pretty painless
    
* Calculating derivatives and gradients is a very important aspect of deep learning algorithms 
* Luckily, PyTorch is very good at doing it for us

#### PyTorch compared to NumPy
* PyTorch's equivalent to NumPy's `ndarray` is the tensor, created with `torch.tensor`
* Imagine a tensor as an array with an arbitrary number of dimensions

In [None]:
#pip install torchvision

In [47]:
import torch
import torch.nn as nn
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
import numpy as np
import pandas as pd

In [2]:
torch.tensor([[2, 3, 5], [1, 2, 9]])

tensor([[2, 3, 5],
        [1, 2, 9]])

In [4]:
torch.rand(2, 2)

tensor([[0.8223, 0.1763],
        [0.8878, 0.3068]])

In [5]:
a = torch.rand((3, 5))

In [6]:
print(a)

tensor([[0.8539, 0.2192, 0.4255, 0.3891, 0.9884],
        [0.7929, 0.6227, 0.6128, 0.2260, 0.3861],
        [0.2820, 0.8878, 0.0554, 0.6852, 0.8450]])


In [7]:
a.shape

torch.Size([3, 5])

In [10]:
b = torch.rand(5, 3)

In [11]:
torch.matmul(a, b)

tensor([[1.8819, 1.2884, 1.5931],
        [1.6487, 1.3359, 1.2814],
        [1.5481, 1.1524, 1.1297]])

In [13]:
c = torch.rand(3, 5)

In [14]:
a * c

tensor([[0.2440, 0.0965, 0.2734, 0.2897, 0.5755],
        [0.0186, 0.6216, 0.6122, 0.2168, 0.2859],
        [0.0304, 0.1187, 0.0416, 0.4657, 0.7632]])

In [15]:
torch.zeros(2, 2)

tensor([[0., 0.],
        [0., 0.]])

In [16]:
torch.ones(2, 2)

tensor([[1., 1.],
        [1., 1.]])

In [17]:
torch.eye(2, 2)

tensor([[1., 0.],
        [0., 1.]])

#### from NumPy to PyTorch
* `d_torch = torch.from_numpy(c_numpy)`

#### from PyTorch to NumPy
* `c_torch.numpy()`
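
A minimal sketch of the round trip, assuming a small example array named `c_numpy` (the name follows the snippet above; the values are made up for illustration). One detail worth knowing is that `torch.from_numpy` shares memory with the source array, so an in-place change to one is visible through the other.

```python
import numpy as np
import torch

# Hypothetical example array; the values are illustrative only
c_numpy = np.array([[1.0, 2.0], [3.0, 4.0]])

# NumPy -> PyTorch (the tensor shares memory with c_numpy)
d_torch = torch.from_numpy(c_numpy)

# PyTorch -> NumPy (also shares memory for CPU tensors)
back_to_numpy = d_torch.numpy()

# An in-place change to the array is visible through the tensor
c_numpy[0, 0] = 10.0
print(d_torch[0, 0].item())
print(type(back_to_numpy))
```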

<img src='data/basic_torch_functions.png' width="600" height="300" align="center"/>

#### Forward Propagation
* Also known as the "forward pass"
* Intuitively, a **computational graph** is a network of nodes that represent numbers, scalars, or tensors and are connected via edges that represent functions or operations

#### PyTorch Implementation 

<img src='data/graph1.png' width="400" height="200" align="center"/>

In [22]:
# First initialize tensors a, b, c, and d
a = torch.Tensor([2])
b = torch.Tensor([-4])
c = torch.Tensor([-2])
d = torch.Tensor([2])

In [23]:
e = a + b
f = c * d

In [24]:
g = e * f

In [25]:
print(e, f, g)

tensor([-2.]) tensor([-4.]) tensor([8.])


* Neural networks (and most other classifiers) can be understood as **computational graphs**
    * In fact, your code gets converted to a computational graph
    * An additional benefit of computational graphs is that they make the automatic computation of derivatives (or gradients) much easier.
    
#### Exercises: Forward pass

In [26]:
# Initialize tensors x, y and z
x = torch.rand(1000, 1000)
y = torch.rand(1000, 1000)
z = torch.rand(1000, 1000)

# Multiply x with y
q = torch.matmul(x, y)

# Multiply elementwise z with q
f = z * q

mean_f = torch.mean(f)
print(mean_f)

tensor(124.9853)


### Backpropagation by auto-differentiation
* The main algorithm in neural networks: the **backpropagation algorithm**

#### Derivatives
* Derivatives are one of the central concepts in calculus
* In layman's terms, the derivatives represent the rate of change in a function
    * Where the function is **rapidly changing**, the absolute value of the **derivative is high**.
    * Where the function **is not changing**, the derivative is **close to 0**.
    * They could also be interpreted as describing the steepness of a function
    
<img src='data/derivatives.png' width="400" height="200" align="center"/>

* In this graph, points `A` and `C` have large derivatives, while point `B` has a very small derivative
* Khan Academy Derivatives course comes highly recommended

<img src='data/derivative_rules.png' width="400" height="200" align="center"/>

* The **Addition** or **Sum** rule says that for two functions, $f$ and $g$, the derivative of their sum is the sum of their individual derivatives
* The **Multiplication** rule says that the derivative of their product is $f$ times the derivative of $g$ plus $g$ times the derivative of $f$
* The derivative of a number times a function is the number times the derivative of the function
    * For example, the derivative of $3x$ is $3$
* The derivative of a number itself is always zero
* The derivative of something with respect to itself is always 1 
* **Chain rule** deals with the composition of functions
* A closely related term with derivatives is the gradient
* The **gradient** is a multi-variable generalization of the derivative
    * Considering that neural networks have many variables, we will typically use the term gradient instead of derivative when working with NNs
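
Written out as equations, these rules are the standard calculus identities below, where $c$ stands for a constant (a symbol introduced here for illustration):

$$
\begin{aligned}
(f + g)' &= f' + g' \\
(f \cdot g)' &= f \cdot g' + g \cdot f' \\
(c \cdot f)' &= c \cdot f', \qquad \text{e.g. } (3x)' = 3 \\
(c)' &= 0, \qquad \frac{dx}{dx} = 1 \\
\big(f(g(x))\big)' &= f'(g(x)) \cdot g'(x)
\end{aligned}
$$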
    
#### Backpropagation in PyTorch
* The derivatives are calculated in PyTorch using the reverse mode of auto-differentiation, so you will rarely need to write code to calculate derivatives

In [27]:
x = torch.tensor(-3., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-2., requires_grad=True)

q = x + y
f = q * z

f.backward()

In [28]:
print("Gradient of z is : " + str(z.grad))
print("Gradient of y is : " + str(y.grad))
print("Gradient of x is : " + str(x.grad))

Gradient of z is : tensor(2.)
Gradient of y is : tensor(-2.)
Gradient of x is : tensor(-2.)


* **Note** that we need to set the `requires_grad` parameter to `True` in order to tell PyTorch that we need their derivatives
* `f.backward()` tells PyTorch to compute the derivatives
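
To see where these numbers come from, the chain rule can be applied by hand: with $q = x + y$ and $f = q \cdot z$, we get $\partial f / \partial z = q$ and $\partial f / \partial x = \partial f / \partial y = z$. The sketch below compares the hand-derived values with autograd; the `manual_*` names are introduced here purely for illustration.

```python
import torch

x = torch.tensor(-3., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-2., requires_grad=True)

q = x + y
f = q * z
f.backward()

# Hand-derived gradients from the chain rule
manual_dz = (x + y).item()   # df/dz = q
manual_dx = z.item()         # df/dx = z * dq/dx = z
manual_dy = z.item()         # df/dy = z * dq/dy = z

print(manual_dz == z.grad.item())
print(manual_dx == x.grad.item())
print(manual_dy == y.grad.item())
```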

### Introduction to Neural Networks
* The simplest form of modern neural networks is: fully-connected neural networks (Dense)

#### Fully connected neural networks with PyTorch

In [29]:
input_layer = torch.rand(10)

In [30]:
w1 = torch.rand(10, 20)
w2 = torch.rand(20, 20)
w3 = torch.rand(20, 4)

* In order to get the values of the first hidden layer `h1`, we multiply the vector of features with the first matrix of weights `w1`
* Look at the matrix of weights. The first dimension should always correspond to the preceding layer, and the second dimension to the following layer

In [31]:
h1 = torch.matmul(input_layer, w1)

* Similarly, we continue for the second hidden layer, `h2`, which is the product of the first hidden layer `h1` and the second matrix of weights `w2`.
* Finally, we get the results of the `output_layer`, which has 4 classes, by multiplying the second hidden layer `h2` with the third matrix of weights `w3`

In [32]:
h2 = torch.matmul(h1, w2)

In [33]:
output_layer = torch.matmul(h2, w3)

In [34]:
print(output_layer)

tensor([199.4510, 194.9810, 236.2898, 185.9965])


### Building a neural network: PyTorch style

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 20)
        self.output = nn.Linear(20, 4)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.output(x)
        return x

input_layer = torch.rand(10)
net = Net()
result = net(input_layer)
```

* In the `__init__` method, we define our parameters, the tensors of weights.
* Fully connected layers are created with `nn.Linear`:
    * The first parameter is the number of units in the current layer
    * The second parameter is the number of units in the next layer
* In the `forward` method, we apply all those weights to our input
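
A quick way to check these sizes is to inspect a layer directly. The sketch below uses only `torch.nn`; note that `nn.Linear` stores its weight with shape `(out_features, in_features)`, the transpose of the hand-written `w1` used earlier, and that it adds a bias vector by default.

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(10, 20)   # 10 input units, 20 output units

print(fc1.weight.shape)   # stored as (out_features, in_features)
print(fc1.bias.shape)     # one bias per output unit

# Applying the layer to a vector of 10 features yields 20 values
out = fc1(torch.rand(10))
print(out.shape)
```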
 
#### Exercises: Your first neural network

```
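# Placeholder input with a single feature (an assumption made here so that the
# snippet runs on its own and matches the (1, 1) weights below)
input_layer = torch.rand(1, 1)

# Initialize the weights of the neural network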
weight_1 = torch.rand(1, 1)
weight_2 = torch.rand(1, 1)

# Multiply input_layer with weight_1
hidden_1 = torch.matmul(input_layer, weight_1)

# Multiply hidden_1 with weight_2
output_layer = torch.matmul(hidden_1, weight_2)
print(output_layer)
```
***
```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # Instantiate all 2 linear layers  
        self.fc1 = nn.Linear(784, 200)
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
      
        # Use the instantiated layers and return x
        x = self.fc1(x)
        x = self.fc2(x)
        return x
```

## Activation functions

In [35]:
input_layer = torch.tensor([2., 1.])
weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
hidden_layer = torch.matmul(input_layer, weight_1)
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])
output_layer = torch.matmul(hidden_layer, weight_2)
print(output_layer)

tensor([0.9696, 0.7527])


#### Matrix multiplication is a linear transformation
* Now, let's try to do something different
* Let's first multiply the matrices with `torch.matmul` and then we'll multiply the input with the product of these matrices
* When we print the results, we see something interesting: **the result of the output layer is exactly the same as before.**

In [38]:
input_layer = torch.tensor([2., 1.])
weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])
weight = torch.matmul(weight_1, weight_2)
output_layer = torch.matmul(input_layer, weight)
print(output_layer)
print(weight)

tensor([0.9696, 0.7527])
tensor([[0.4208, 0.2372],
        [0.1280, 0.2783]])


* This means that we can achieve exactly the same result with a single-layer neural network, using this particular set of weights.
* Linear algebra tells us that matrix multiplication is a linear transformation and that a composition of linear transformations is itself linear, meaning that any stack of purely linear layers can be simplified into a single-layer neural network
* But this comes with an irritating consequence: such neural nets are not that powerful; using linear layers *alone* only allows us to separate linearly separable datasets (for which there are a host of more intuitive ML algorithms).
* To separate datasets that are not linearly separable, we use **activation functions.**

<img src='data/activation_functions.png' width="600" height="300" align="center"/>

* **Activation functions** are non-linear functions which are inserted in each layer of the neural network, making neural networks nonlinear and allowing them to deal with highly non-linear datasets, thus making them much more powerful.

In [41]:
relu = nn.ReLU()

tensor_1 = torch.tensor([2., -4.])
print(relu(tensor_1))

tensor_2 = torch.tensor([[2., -4.], [1.2, 0.]])
print(relu(tensor_2))

tensor([2., 0.])
tensor([[2.0000, 0.0000],
        [1.2000, 0.0000]])
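
A short sketch of why the non-linearity matters, reusing the weights from the matrix multiplication example: once a ReLU sits between the two layers, the network no longer collapses into the single matrix `weight_1 @ weight_2`. The input `[-2., 1.]` is an assumption chosen here so that the hidden units are negative and get clipped.

```python
import torch
import torch.nn as nn

weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])
relu = nn.ReLU()

# Illustrative input chosen so that the hidden units are negative
input_layer = torch.tensor([-2., 1.])

# Two purely linear layers collapse into one matrix ...
collapsed = torch.matmul(input_layer, torch.matmul(weight_1, weight_2))
without_relu = torch.matmul(torch.matmul(input_layer, weight_1), weight_2)

# ... but with a ReLU in between they do not
with_relu = torch.matmul(relu(torch.matmul(input_layer, weight_1)), weight_2)

print(torch.allclose(without_relu, collapsed, atol=1e-6))
print(torch.allclose(with_relu, collapsed, atol=1e-6))
```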


### Loss Functions
* So far, all neural networks in this course have had random weights (and so they weren't particularly useful)
* The recipe for training neural networks is the following:
    * Initialize neural networks with random weights
    * Do a forward pass
    * Calculate loss function (1 number)
    * Calculate the gradients using backpropagation
    * Change the weights based on gradients
* Loss (cost) function for **regression: least squares (mean squared error) loss**
* Loss (cost) function for **classification: softmax or (categorical) cross-entropy loss**
* For more complicated problems (like object detection), more complicated losses
* Loss functions should be **differentiable**; otherwise we won't be able to compute gradients
* For this reason, instead of using accuracy (which is not differentiable), we need to use some proxy loss functions (in neural nets, a softmax function followed by a cross-entropy function performs really well).
* **Softmax** is a function that turns numbers into probabilities

<img src='data/softmax_cross_entropy2.png' width="600" height="300" align="center"/>

### CE loss in PyTorch
* `logits` = scores for each class
* `ground_truth` = index of the correct class (here `0`, the cat)
* `criterion` = loss function
* Below we choose `nn.CrossEntropyLoss()` which combines **softmax** with **cross-entropy**
* Note that we get the same result from the code below as we do in the illustration above.

In [42]:
logits = torch.tensor([[3.2, 5.1, -1.7]])
ground_truth = torch.tensor([0])
criterion = nn.CrossEntropyLoss()

loss = criterion(logits, ground_truth)
print(loss)

tensor(2.0404)
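
To make the "softmax followed by cross-entropy" combination concrete, the same number can be reproduced by hand. This is a sketch for illustration only (the names `probabilities` and `manual_loss` are introduced here); in practice `nn.CrossEntropyLoss` should be preferred because it is numerically more stable.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[3.2, 5.1, -1.7]])
ground_truth = torch.tensor([0])

# Softmax: exponentiate and normalize so the scores sum to 1
probabilities = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)

# Cross-entropy: negative log of the probability given to the true class
manual_loss = -torch.log(probabilities[0, ground_truth[0]])

builtin_loss = nn.CrossEntropyLoss()(logits, ground_truth)
print(manual_loss, builtin_loss)
```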


What if the cat class prediction had been much higher?

In [43]:
logits = torch.tensor([[10.2, 5.1, -1.7]])
loss = criterion(logits, ground_truth)
print(loss)

tensor(0.0061)


What if the cat class prediction had been much lower?

In [44]:
logits = torch.tensor([[-10, 5.1, -1.7]])
loss = criterion(logits, ground_truth)
print(loss)

tensor(15.1011)


The rule of thumb is that **the more accurate the network is, the smaller the loss (and vice versa).**

#### Exercises: Calculating loss function in PyTorch

In [45]:
# Initialize the scores and ground truth
logits = torch.tensor([[-1.2, 0.12, 4.8]])
ground_truth = torch.tensor([2])

# Instantiate cross entropy loss
criterion = nn.CrossEntropyLoss()

# Compute and print the loss
loss = criterion(logits, ground_truth)
print(loss)

tensor(0.0117)


#### Exercises: Loss function of random scores
If the neural network predicts random scores, what would its loss be? Let's find out in PyTorch. The neural network is going to have 1000 classes, each having a random score. For the ground truth, it will have class 111. Calculate the loss.

In [46]:
# Import torch and torch.nn
import torch
import torch.nn as nn

# Initialize logits and ground truth
logits = torch.rand(1,1000)
ground_truth = torch.tensor([111])

# Instantiate cross-entropy loss
criterion = nn.CrossEntropyLoss()

# Calculate and print the loss
loss = criterion(logits, ground_truth)
print(loss)

tensor(7.3071)
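
As a reference point, a network that gives every class the same score assigns probability 1/1000 to the true class, so its loss is $\ln(1000) \approx 6.91$, the same range as the value obtained from random scores. A minimal check, using all-zero logits as the assumed "uninformed" prediction:

```python
import math
import torch
import torch.nn as nn

# Identical scores for all 1000 classes (an illustrative assumption)
uniform_logits = torch.zeros(1, 1000)
ground_truth = torch.tensor([111])

uniform_loss = nn.CrossEntropyLoss()(uniform_logits, ground_truth)
print(uniform_loss.item(), math.log(1000))
```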


### Preparing a dataset in PyTorch
* In order to be able to use datasets in PyTorch, they need to be in some PyTorch friendly format that the framework will be able to understand
* **`torchvision`:** a package which deals with datasets and pretrained neural nets
* **Below we define a transformation of images to torch tensors, using `transforms`.**

In [48]:
transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.4914, 0.48216, 0.44653),
                                  (0.24703, 0.24349, 0.26159))])

* We choose where the dataset will be stored with the `root` parameter; in the case below, the `data` folder.
* We also set the `download` flag to `True`, which tells PyTorch to download the dataset and put it there if it is not already in the specified folder.
* Finally, we set `transform` to the `transform` object defined in the code block above, which converts the images to torch tensors and normalizes them.
* We build `trainloader` and `testloader`, getting the data ready for PyTorch

```
# Get datasets
trainset = torchvision.datasets.CIFAR10(root='./data', train = True, 
                                        download=True, transform = transform)
testset = torchvision.datasets.CIFAR10(root='./data', train = False,
                                       download=True, transform = transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32,
                                          shuffle=True, num_workers=4)
testloader = torch.utils.data.DataLoader(testset, batch_size = 32,
                                         shuffle = True, num_workers=4)
```    

#### Inspecting the dataloader
* It's possible to inspect the dataloader. For example, we can look at the **shape of the testing or training datasets**, the **minibatch size**, or the **type of the random sampler**

```
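# Recent torchvision versions expose the images as `.data`
# (older versions used `.test_data` / `.train_data`)
print(testloader.dataset.data.shape, trainloader.dataset.data.shape)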

print(testloader.batch_size)

print(trainloader.sampler)
```

### Training neural networks
* Here we will go more in depth on how to train neural networks in PyTorch
    * 1) **Prepare the dataloaders** for the dataset we want the neural network to train on
    * 2) **Build a neural network**
        * By default, all parameters of a neural network are initialized with random numbers
        * There are other initialization strategies, however
* Loop over:
    * 3) **Do a forward pass** (using a minibatch)
    * 4) **Calculate the loss function** (1 number which tries to measure how good the neural network is in the training set)
    * 5) **Calculate the gradients** using backpropagation
    * 6) **Change the weights** based on gradients; SGD= **`weight -= weight_gradient * learning_rate`**
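
Step 6 can be sketched by hand on a single weight. This is only an illustration of the update rule `weight -= weight_gradient * learning_rate`; the toy loss, starting value, and learning rate are assumptions chosen for the example, and in practice an optimizer from `torch.optim` performs this update for every parameter.

```python
import torch

weight = torch.tensor(1.0, requires_grad=True)
learning_rate = 0.1

# Toy loss with its minimum at weight = 3
loss = (weight - 3.0) ** 2
loss.backward()                  # d(loss)/d(weight) = 2 * (1 - 3) = -4

# Plain SGD update, done outside of autograd tracking
with torch.no_grad():
    weight -= weight.grad * learning_rate

print(weight)                    # moved from 1.0 towards the minimum at 3.0
```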
    
#### Neural Network: Recap

```
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(32 * 32 * 3, 500)
        self.fc2 = nn.Linear(500, 10)
    
    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```
* Note that CIFAR10 has images of shape (32, 32, 3) so as input layer we have 32 * 32 * 3 units
* We decide to have 500 units in the hidden layer, a decision made by us (hyperparameter)
* With the dataset having 10 classes, we put 10 units in the output layer

### Training the Neural Network 

```
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=3e-4)

for epoch in range(10): # loop over the dataset multiple times
    for i, data in enumerate(trainloader, 0):

        # Get the inputs
        inputs, labels = data
        inputs = inputs.view(-1, 32 * 32 * 3)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
```

* The line `inputs = inputs.view(-1, 32 * 32 * 3)` simply flattens each image in the minibatch into a single vector of 32 * 32 * 3 = 3072 entries.
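
The effect of `view` on the shape can be checked in isolation. The sketch below uses a random stand-in minibatch (an assumption; no dataset is loaded) with the same layout a CIFAR-10 dataloader produces.

```python
import torch

# Stand-in minibatch of 32 CIFAR-10-sized images: (batch, channels, height, width)
inputs = torch.rand(32, 3, 32, 32)

# -1 lets PyTorch infer the batch dimension
flat = inputs.view(-1, 32 * 32 * 3)
print(inputs.shape, flat.shape)
```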

#### Using the net to get predictions

```
correct, total = 0, 0
predictions = []
net.eval()
for i, data in enumerate(testloader, 0):
    inputs, labels = data
    inputs = inputs.view(-1, 32*32*3)
    outputs = net(inputs)
    _, predicted = torch.max(outputs.data, 1)
    predictions.append(outputs)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()
    
print('The testing set accuracy of network is: %d %%' % (100 * correct / total))
```
* **Note** that we first set the net in test (evaluation) mode using **`net.eval()`**

### Convolution operator
* 1) Units should be connected with only a few units from the previous layer
* 2) Units share weights

* The size of a convolutional filter **must** be smaller than the image
* A convolutional layer is a layer that contains multiple activation maps.

<img src='data/conv1.png' width="400" height="200" align="center"/>

#### Using the `torch.nn` package

```
image = torch.rand(16, 3, 32, 32)
conv_filter = torch.nn.Conv2d(in_channels=3,
                              out_channels=1,
                              kernel_size= 5,
                              stride=1,
                              padding=0)
output_feature = conv_filter(image)

print(output_feature.shape)
```

#### Using the torch functional package

```
image = torch.rand(16, 3, 32, 32)
filter = torch.rand(1, 3, 5, 5)
out_feat_F = F.conv2d(image, filter, stride = 1, padding = 0)

print(out_feat_F.shape)
```
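
The printed shapes follow from the usual output-size relation for a convolution, `(size + 2 * padding - kernel_size) // stride + 1`. The helper `conv_output_size` below is a hypothetical name introduced for this sketch, and it assumes a dilation of 1.

```python
import torch

def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (dilation of 1 assumed)."""
    return (size + 2 * padding - kernel_size) // stride + 1

image = torch.rand(16, 3, 32, 32)
conv_filter = torch.nn.Conv2d(in_channels=3, out_channels=1,
                              kernel_size=5, stride=1, padding=0)
output_feature = conv_filter(image)

# Compare the formula with what PyTorch actually produces
expected = conv_output_size(32, kernel_size=5, stride=1, padding=0)
print(expected, output_feature.shape)
```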

#### Exercises: Convolution operator - OOP way

```
# Create 10 random images of shape (1, 28, 28)
images = torch.rand(10, 1, 28, 28)

# Build 6 conv. filters
conv_filters = torch.nn.Conv2d(in_channels=1, out_channels=6, kernel_size=3, stride=1, padding=1)

# Convolve the image with the filters 
output_feature = conv_filters(images)
print(output_feature.shape)
```

#### Exercises: Convolution operator - Functional way

```
# Create 10 random images
image = torch.rand(10, 1, 28, 28)

# Create 6 filters
filters = torch.rand(6, 1, 3, 3)

# Convolve the image with the filters
output_feature = F.conv2d(image, filters, stride=1, padding=1)
print(output_feature.shape)
```

### Pooling operators
* The convolutional operator is the main building block in CNNs
* Another very important layer in CNNs is the pooling operator, which serves two purposes:
    * While convolutions are used to extract features from the image, **pooling is the way of feature selection**, choosing the most dominant features from the image, or combining different features
    * Additionally, pooling lowers the resolution of the images, **making the computations more efficient.**
* **Pooling** is simply lowering the spatial dimension

<img src='data/pooling1.png' width="400" height="200" align="center"/>

* The two most important pooling operators are max-pooling and average-pooling

#### Max-Pooling
* Max pooling takes the maximum number in a given region, as shown below
* Note that typically for pooling, we consider filters with size 2x2 and strides of size 2
* By considering only the largest values in patches of the image, we make learning invariant to small shifting/translation.

<img src='data/maxpooling1.png' width="400" height="200" align="center"/>

#### Average-Pooling
* Typically used in later stages of deep networks

<img src='data/avgpooling1.png' width="400" height="200" align="center"/>

### Max-Pooling in PyTorch: OOP
* Multiple brackets are needed because the image needs to have 4 dimensions (for minibatch size, depth, height, and width)

```
im = torch.Tensor([[[[3, 1, 3, 5], [6, 0, 7, 9], 
                     [3, 2, 1, 4], [0, 2, 4, 3]]]])
max_pooling = torch.nn.MaxPool2d(2)   
output_feature = max_pooling(im)
print(output_feature)
```

### Max-Pooling in PyTorch: Functional

```
im = torch.Tensor([[[[3, 1, 3, 5], [6, 0, 7, 9],
                     [3, 2, 1, 4], [0, 2, 4, 3]]]])
output_feature_F = F.max_pool2d(im, 2)
print(output_feature_F)
```

* **In order to apply Average-Pooling we do exactly the same thing, only replace `MaxPool2d()` with `AvgPool2d()`.**

#### Exercises: Max-pooling operator - Both ways

```
# Build a pooling operator with size `2`.
max_pooling = torch.nn.MaxPool2d(2)

# Apply the pooling operator
output_feature = max_pooling(im)

# Use pooling operator in the image
output_feature_F = F.max_pool2d(im, 2)

# print the results of both cases
print(output_feature)
print(output_feature_F)
```

#### Exercises: Average-Pooling operator - Both ways

```
# Build a pooling operator with size `2`.
avg_pooling = torch.nn.AvgPool2d(2)

# Apply the pooling operator
output_feature = avg_pooling(im)

# Use pooling operator in the image
output_feature_F = F.avg_pool2d(im, 2)

# print the results of both cases
print(output_feature)
print(output_feature_F)
```

### Convolutional Neural Networks 
* While CNNs have existed for decades, their resurgence happened in 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton published the so-called **AlexNet** paper and smashed every record in image classification.
* Until that time, people were aware of the existence of CNNs, but they didn't take them seriously

<img src='data/alexnet.png' width="700" height="350" align="center"/>

* **Almost everything in computer vision is empowered by CNNs.** (If not, they at least play a large part in it)
* Coding AlexNet in PyTorch is surprisingly easy

In [49]:
class AlexNet(nn.Module):
    
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size = 11, stride = 4, padding = 2)
        self.relu = nn.ReLU(inplace = True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(64, 192, kernel_size = 5, padding = 2)
        self.conv3 = nn.Conv2d(192, 384, kernel_size = 3, padding = 1)
        self.conv4 = nn.Conv2d(384, 256, kernel_size = 3, padding = 1)
        self.conv5 = nn.Conv2d(256, 256, kernel_size = 3, padding = 1)
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)

**Now all that remains is implementing the forward method.**

In [50]:
def forward(self, x):
    x = self.relu(self.conv1(x))
    x = self.maxpool(x)
    x = self.relu(self.conv2(x))
    x = self.maxpool(x)
    x = self.relu(self.conv3(x))
    x = self.relu(self.conv4(x))
    x = self.relu(self.conv5(x))
    x = self.maxpool(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), 256 * 6 * 6)
    x = self.relu(self.fc1(x))
    x = self.relu(self.fc2(x))
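    return self.fc3(x)

# forward was defined in a separate cell, so attach it to the AlexNet class as a method
AlexNet.forward = forward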

In [51]:
net = AlexNet()

Building the net is simply a matter of creating an object from this class.

**Of course, in order for AlexNet to make a correct prediction, it needs to be trained first.**

### Training CNNs
* The number of channels for convolutional filters is arbitrary
* It is very common to progressively increase the number of channels in the convolutional layers of a CNN

#### Instantiate model, define loss and optimizer

```
net = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=3e-4)
```

#### Training

```
for epoch in range(10):
    for i, data in enumerate(trainloader, 0):
        # Get the inputs
        inputs, labels = data
        
        # Zero the parameter gradients
        optimizer.zero_grad()
        
        # Forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
print('Finished Training')
```

#### Evaluating the results

```
correct, total = 0, 0
predictions = []
net.eval()
for i, data in enumerate(testloader, 0):
    inputs, labels = data
    outputs = net(inputs)
    _, predicted = torch.max(outputs.data, 1)
    predictions.append(outputs)
    total += labels.size(0)
    correct += (predicted == labels).sum().item()
    
print('The testing set accuracy of the network is: %d %%' % (100 * correct / total))
```

## $\star$ Chapter 4: Using Convolutional Neural Networks
In this last chapter, we learn how to make neural networks work well in practice, using concepts like regularization, batch-normalization and transfer learning.

### The sequential module
* Here we will examine some more advanced techniques
* While the effect of these techniques is small in simple neural networks, perhaps making it hard to appreciate them, they are a must when working with big neural networks, and knowing them will make a *big* difference.

<img src='data/alexnet_seq.png' width="700" height="350" align="center"/>

<img src='data/seq_forward.png' width="700" height="350" align="center"/>

* The **sequential module** is very useful for feedforward networks (where the flow goes in one direction)
* By using this module, you can divide your network into parts which logically make sense
* You can also reuse the modules to create similar blocks in the neural network
* As you can see above, we define all the convolutions, poolings, fully-connected layers, etc, (same as before), but now the order of operators also matters in the declaration.
* Additionally, we **encapsulate them with an `nn.Sequential()`**
* In the case above we are using one sequential module for the feature extraction part (convolutions and poolings), and one for the classification part (fully connected layers)
* This is a clean OOP way of doing things and allows you to change parts of the network independently of each other.
* With sequential modules, the `forward` method no longer applies each operation one by one; it only needs to apply each sequential module

In [52]:
def forward(self, x):
    x = self.features(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), 256 * 6 * 6)
    x = self.classifier(x)
    return x
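
For reference, here is a sketch of how the AlexNet layers defined earlier could be grouped into `features` and `classifier` sequential modules. It is a reconstruction from the layer sizes used in this document, not necessarily identical to the code shown in the figures; the class name `AlexNetSequential` and the 224x224 test input are assumptions.

```python
import torch
import torch.nn as nn

class AlexNetSequential(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Feature extraction: convolutions and poolings, in execution order
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        # Classification: fully connected layers
        self.classifier = nn.Sequential(
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.avgpool(x)
        x = x.view(x.size(0), 256 * 6 * 6)
        return self.classifier(x)

# Quick shape check on a random stand-in image
net = AlexNetSequential(num_classes=10)
out = net(torch.rand(1, 3, 224, 224))
print(out.shape)
```

Because the order of the layers inside each `nn.Sequential` is the order of execution, swapping the classifier for a different one only requires replacing `self.classifier`.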

### The problem of overfitting
* Arguably the biggest problem in ML is overfitting:

<img src='data/overfitting1.png' width="400" height="200" align="center"/>

* We detect overfitting by plotting the accuracy of the algorithm on both the training and testing sets.
* If there is a large gap in accuracy between training and testing, we have a case of **overfitting**, also called **high variance**

<img src='data/overfitting2.png' width="400" height="200" align="center"/>

#### PyTorch validation sets

<img src='data/pytorch_val.png' width="700" height="350" align="center"/>

```
import numpy as np
import torch
from torchvision import datasets, transforms

# Shuffle the indices
indices = np.arange(60000)
np.random.shuffle(indices)

# Build the train loader
train_loader = torch.utils.data.DataLoader(datasets.MNIST('mnist', download=True, train=True,
                     transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])),
                     batch_size=64, shuffle=False, sampler=torch.utils.data.SubsetRandomSampler(indices[:55000]))

# Build the validation loader
val_loader = torch.utils.data.DataLoader(datasets.MNIST('mnist', download=True, train=True,
                   transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])),
                   batch_size=64, shuffle=False, sampler=torch.utils.data.SubsetRandomSampler(indices[55000:]))
```

### Regularization techniques

#### L2 Regularization
* **`optimizer = optim.Adam(net.parameters(), lr=3e-4, weight_decay=0.0001)`**
* All you need to do is add the `weight_decay` argument

#### Dropout
* Typically dropout is used in fully-connected layers, while it is rarely used in convolutional layers

<img src='data/alexnet_dropout.png' width="600" height="300" align="center"/>

#### Batch normalization 
* Very important technique used nowadays in practically every neural network 
* In layman's terms, it computes the mean and the variance of the minibatch for each feature, and then it normalizes the features based on those stats
* Nowadays it's "unthinkable" to train large neural networks without batch normalization (and it is highly recommended for small networks as well)
* **`self.bn = nn.BatchNorm2d(num_features=64, eps=1e-05, momentum=0.9)`**
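
The "mean and variance of the minibatch for each feature" can be checked by hand. The sketch below compares `nn.BatchNorm2d` in training mode with a manual normalization; it relies on the layer's default initialization (scale 1, shift 0), and the tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.rand(8, 3, 4, 4)   # (batch, channels, height, width)

bn = nn.BatchNorm2d(num_features=3)
bn.train()                          # training mode uses minibatch statistics
normalized = bn(features)

# Manual version: per-channel mean and biased variance over batch and space
mean = features.mean(dim=(0, 2, 3), keepdim=True)
var = features.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
manual = (features - mean) / torch.sqrt(var + bn.eps)

print(torch.allclose(normalized, manual, atol=1e-5))
```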

#### Early-stopping
* Simply checks the accuracy of the network in the validation set at the end of each epoch and if, after $n$ epochs, the performance of the net hasn't increased (or if it's decreased), then training is terminated

<img src='data/early_stopping.png' width="400" height="200" align="center"/>
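
The stopping rule itself is independent of PyTorch and can be sketched on a plain list of validation accuracies. The function name `early_stopping_epoch`, the `patience` parameter (playing the role of $n$), and the accuracy values are all assumptions made for this illustration.

```python
def early_stopping_epoch(val_accuracies, patience):
    """Return the epoch index at which training would stop, or None.

    Training stops once `patience` consecutive epochs pass without
    improving on the best validation accuracy seen so far.
    """
    best_accuracy = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Made-up validation accuracies, one per epoch
history = [0.60, 0.70, 0.75, 0.74, 0.75, 0.73, 0.80]
print(early_stopping_epoch(history, patience=3))
```

In a real training loop, the same counter would be updated after evaluating on the validation loader at the end of each epoch.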

**Some of the techniques mentioned above (like dropout and batch-norm) behave differently when the net is getting trained and when the net is getting evaluated.**
* We need to manually tell PyTorch if we are training or evaluating the net with:
    * `model.train()`
    * `model.eval()`
* **It is very important to set the net in the correct mode, otherwise the training and evaluation will be broken.**
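
The difference between the two modes is easy to see on a dropout layer alone. A minimal sketch, using an all-ones input chosen for illustration:

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()
train_out = dropout(x)   # entries are zeroed at random, survivors scaled by 1 / (1 - p) = 2

dropout.eval()
eval_out = dropout(x)    # identity: the input passes through unchanged

print((train_out == 0).float().mean().item())   # fraction of dropped entries
print(torch.equal(eval_out, x))
```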

```
# Instantiate the network
model = Net()

# Instantiate the cross-entropy loss
criterion = nn.CrossEntropyLoss()

# Instantiate the Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.001)
```
***

```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Define all the parameters of the net
        self.classifier = nn.Sequential(
            nn.Linear(28*28, 200),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(200, 500),
            nn.ReLU(inplace=True),
            nn.Linear(500, 10))

    def forward(self, x):

        # Do the forward pass
        return self.classifier(x)
```
***

```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # Implement the sequential module for feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True), nn.BatchNorm2d(10),
            nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True), nn.BatchNorm2d(20))
        
        # Implement the fully connected layer for classification
        self.fc = nn.Linear(in_features=20 * 7 * 7, out_features=10)
```

### Transfer learning
* An interesting discovery in CNN research was that the deeper you progress in the network, the more abstract the features become.
* A nice consequence of this is that the low-level features are very general and to a large degree dataset independent

<img src='data/transfer_learning.png' width="600" height="300" align="center"/>

* So far we have trained all nets from scratch, initializing them with random weights.
* However, in practice, this isn't usually how things are done.
* **Instead of training the net from scratch, we download a net trained on another dataset (typically a big dataset like ImageNet containing 1-2 million images) and then we retrain the net in our dataset.**
* **This allows us not only to achieve significantly better results in less training time, but also to train networks on very small datasets (containing only hundreds of images).**
* **With this technique, you can train large neural networks on very small datasets.**

<img src='data/transfer_learning2.png' width="600" height="300" align="center"/>

* In the literature, this "retraining" is typically called **finetuning**, but the essence is the same 

#### Finetuning
* There are two ways of finetuning neural networks:
    * **Freeze most of the layers** (do not update them during backpropagation) and finetune only the last few layers (or only the very last one)
    * **Finetune everything**
* Typically if your dataset is extremely small, it is a good idea to freeze most of the layers, in order to avoid overfitting

#### Finetuning in PyTorch
* Let's say we have trained a net on CIFAR-10, which we have saved as `cifar10_net.pth`
* **Torchvision is a PyTorch library with many pretrained networks, ready to be used for your dataset.**

```
# 1) Training from scratch
# Create a new model
model = Net()

# Change the number of output units
model.fc = nn.Linear(7 * 7 * 512, 26)

# Train and evaluate the model
# (train_net, optimizer and criterion are assumed to be defined elsewhere)
model.train()
train_net(model, optimizer, criterion)
# Note: in plain PyTorch, model.eval() only switches to evaluation mode and
# returns the model itself; it does not compute an accuracy
print("Accuracy of the net is: " + str(model.eval()))

# 2) Finetuning from saved parameters
# Create a new model
model = Net()

# Load the parameters from the old model
model.load_state_dict(torch.load('my_net.pth'))

# Change the number of output units
model.fc = nn.Linear(7 * 7 * 512, 26)

# Train and evaluate the model
model.train()
train_net(model, optimizer, criterion)
print("Accuracy of the net is: " + str(model.eval()))

# 3) Using a pretrained torchvision model
# Import the module
import torchvision

# Download resnet18
model = torchvision.models.resnet18(pretrained=True)

# Freeze all the layers bar the last one
for param in model.parameters():
    param.requires_grad = False

# Change the number of output units
# (a newly created layer is trainable by default)
model.fc = nn.Linear(512, 7)
```
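
To confirm which layers will actually be updated after freezing, the `requires_grad` flags can be inspected. The sketch below uses a small stand-in network instead of a real pretrained model (an assumption, so that nothing needs to be downloaded); the same check applies to `resnet18`.

```python
import torch.nn as nn
import torch.optim as optim

# Small stand-in for a pretrained network (hypothetical; no weights are downloaded)
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 1000))

# Freeze every existing parameter
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer; newly created layers are trainable by default
model[2] = nn.Linear(512, 7)

# List the parameters that will still receive gradient updates
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)

# Hand only the trainable parameters to the optimizer
optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=3e-4)
```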

<img src='data/finetuning.png' width="600" height="300" align="center"/>

<img src='data/freezing.png' width="600" height="300" align="center"/>

<img src='data/torchvision_lib.png' width="600" height="300" align="center"/>