# Introduction to Deep Learning with PyTorch

Neural networks have been at the forefront of Artificial Intelligence research during the last few years, and have provided solutions to many difficult problems like image classification, language translation and playing Go (AlphaGo). PyTorch is one of the leading deep learning frameworks, being both powerful and easy to use. In this course you will use PyTorch to first learn about the basic concepts of neural networks, before building your first neural network to predict digits from the MNIST dataset. You will then learn about convolutional neural networks, and use them to build much more powerful models which give more accurate results. You will evaluate the results and use different techniques to improve them. Following the course, you will be able to delve deeper into neural networks and start your career in this fascinating field.

**Instructor:** Ismail Elezi, PhD researcher at Ca' Foscari University of Venice

## $\star$ Chapter 1: Introduction to PyTorch
In this first chapter, we introduce the basic concepts of neural networks and deep learning using the PyTorch library.

* **Why PyTorch?**
    * Simplicity
    * "Pythonic" - easy to use
    * Strong GPU support - models run fast
    * Many algorithms are already implemented
    * Automatic differentiation
    * Strong OOP
    * Natural choice for many companies like Facebook and SalesForce
    * One of the most used deep learning libraries in academic research
    * Similar to NumPy, making the switch pretty painless
    
* Calculating derivatives and gradients is a very important aspect of deep learning algorithms 
* Luckily, PyTorch is very good at doing it for us

#### PyTorch compared to NumPy
* PyTorch's equivalent to NumPy's `ndarray` is the tensor (`torch.Tensor`), created with functions such as `torch.tensor`
* Imagine a tensor as an array with an arbitrary number of dimensions

In [None]:
#pip install torchvision

In [47]:
import torch
import torch.nn as nn
import torchvision
import torch.utils.data
import torchvision.transforms as transforms
import numpy as np
import pandas as pd

In [2]:
torch.tensor([[2, 3, 5], [1, 2, 9]])

tensor([[2, 3, 5],
        [1, 2, 9]])

In [4]:
torch.rand(2, 2)

tensor([[0.8223, 0.1763],
        [0.8878, 0.3068]])

In [5]:
a = torch.rand((3, 5))

In [6]:
print(a)

tensor([[0.8539, 0.2192, 0.4255, 0.3891, 0.9884],
        [0.7929, 0.6227, 0.6128, 0.2260, 0.3861],
        [0.2820, 0.8878, 0.0554, 0.6852, 0.8450]])


In [7]:
a.shape

torch.Size([3, 5])

In [10]:
b = torch.rand(5, 3)

In [11]:
torch.matmul(a, b)

tensor([[1.8819, 1.2884, 1.5931],
        [1.6487, 1.3359, 1.2814],
        [1.5481, 1.1524, 1.1297]])

In [13]:
c = torch.rand(3, 5)

In [14]:
a * c

tensor([[0.2440, 0.0965, 0.2734, 0.2897, 0.5755],
        [0.0186, 0.6216, 0.6122, 0.2168, 0.2859],
        [0.0304, 0.1187, 0.0416, 0.4657, 0.7632]])

In [15]:
torch.zeros(2, 2)

tensor([[0., 0.],
        [0., 0.]])

In [16]:
torch.ones(2, 2)

tensor([[1., 1.],
        [1., 1.]])

In [17]:
torch.eye(2, 2)

tensor([[1., 0.],
        [0., 1.]])

#### from NumPy to PyTorch
* `d_torch = torch.from_numpy(c_numpy)`

#### from PyTorch to NumPy
* `c_torch.numpy()`
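
The round trip can be sketched as follows. This is a minimal illustration with made-up values; the name `back_to_numpy` is introduced only for this example. For CPU tensors, both `torch.from_numpy` and `.numpy()` share memory with their source, so an in-place change on one side is visible on the other.

```python
import numpy as np
import torch

# Illustrative NumPy array (values chosen arbitrarily)
c_numpy = np.array([[1.0, 2.0], [3.0, 4.0]])

# NumPy -> PyTorch: the tensor shares memory with the array
d_torch = torch.from_numpy(c_numpy)

# PyTorch -> NumPy: again a view on the same memory (CPU tensors only)
back_to_numpy = d_torch.numpy()

# An in-place change to the array shows up in the tensor as well
c_numpy[0, 0] = 10.0
print(d_torch[0, 0].item())  # 10.0
print(d_torch.dtype)         # torch.float64, inherited from the NumPy array
```

If an independent copy is needed, calling `.clone()` on the tensor (or `.copy()` on the array) breaks the link.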

<img src='data/basic_torch_functions.png' width="600" height="300" align="center"/>

#### Forward Propagation
* Also known as the "forward pass"
* Intuitively, a **computational graph** is a network of nodes that represent numbers, scalars, or tensors and are connected via edges that represent functions or operations

#### PyTorch Implementation 

<img src='data/graph1.png' width="400" height="200" align="center"/>

In [22]:
# First initialize tensors a, b, c, and d
a = torch.tensor([2.])
b = torch.tensor([-4.])
c = torch.tensor([-2.])
d = torch.tensor([2.])

In [23]:
e = a + b
f = c * d

In [24]:
g = e * f

In [25]:
print(e, f, g)

tensor([-2.]) tensor([-4.]) tensor([8.])


* Neural networks (and most other classifiers) can be understood as **computational graphs**
    * In fact, your code gets converted to a computational graph
    * An additional benefit of computational graphs is that they make the automatic computation of derivatives (or gradients) much easier.
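
One way to see the graph PyTorch records is to inspect the `grad_fn` attribute of a result. The sketch below rebuilds the small graph from the example above with gradient tracking enabled. The exact node class names it prints are PyTorch internals and may differ between versions, so treat them as illustrative.

```python
import torch

# Same graph as above, but with gradient tracking switched on
a = torch.tensor([2.], requires_grad=True)
b = torch.tensor([-4.], requires_grad=True)
c = torch.tensor([-2.], requires_grad=True)
d = torch.tensor([2.], requires_grad=True)

e = a + b
f = c * d
g = e * f

# Each result remembers the operation that produced it
print(type(e.grad_fn).__name__)
print(type(g.grad_fn).__name__)

# The parents of g in the graph are the nodes that produced e and f
for parent, _ in g.grad_fn.next_functions:
    print(type(parent).__name__)
```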
    
#### Exercises: Forward pass

In [26]:
# Initialize tensors x, y and z
x = torch.rand(1000, 1000)
y = torch.rand(1000, 1000)
z = torch.rand(1000, 1000)

# Multiply x with y
q = torch.matmul(x, y)

# Multiply elementwise z with q
f = z * q

mean_f = torch.mean(f)
print(mean_f)

tensor(124.9853)


### Backpropagation by auto-differentiation
* The main algorithm in neural networks: the **backpropagation algorithm**

#### Derivatives
* Derivatives are one of the central concepts in calculus
* In layman's terms, the derivatives represent the rate of change in a function
    * Where the function is **rapidly changing**, the absolute value of the **derivative is high**.
    * Where the function **is not changing**, the derivative is **close to 0**.
    * They could also be interpreted as describing the steepness of a function
    
<img src='data/derivatives.png' width="400" height="200" align="center"/>

* In this graph, points `A` and `C` have large derivatives, while point `B` has a very small derivative
* Khan Academy Derivatives course comes highly recommended

<img src='data/derivative_rules.png' width="400" height="200" align="center"/>

* The **Addition** or **Sum** rule says that for two functions, $f$ and $g$, the derivative of their sum is the sum of their individual derivatives
* The **Multiplication** (product) rule says that the derivative of their product is $f$ times the derivative of $g$ plus $g$ times the derivative of $f$
* The derivative of a number times a function is the number times the derivative of the function
    * For example, the derivative of $3x$ is $3$
* The derivative of a constant is always zero
* The derivative of something with respect to itself is always 1 
* The **chain rule** deals with the composition of functions
* A term closely related to the derivative is the gradient
* The **gradient** is a multi-variable generalization of the derivative
    * Considering that neural networks have many variables, we will typically use the term gradient instead of derivative when working with NNs
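
The product rule can be checked numerically. In this sketch the functions $f(x) = x^2$ and $g(x) = \sin x$ and the point $x = 1.5$ are arbitrary choices made for illustration; PyTorch's autograd result is compared against the rule written out by hand.

```python
import torch

x = torch.tensor(1.5, requires_grad=True)

f = x ** 2          # f(x) = x^2,   f'(x) = 2x
g = torch.sin(x)    # g(x) = sin x, g'(x) = cos x

# Let autograd differentiate the product f * g
(f * g).backward()

# Product rule written out by hand: f * g' + g * f'
with torch.no_grad():
    by_hand = x ** 2 * torch.cos(x) + torch.sin(x) * 2 * x

print(x.grad, by_hand)
```

Both printed values should agree up to floating-point rounding.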
    
#### Backpropagation in PyTorch
* The derivatives are calculated in PyTorch using the reverse mode of auto-differentiation, so you will rarely need to write code to calculate derivatives

In [27]:
x = torch.tensor(-3., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-2., requires_grad=True)

q = x + y
f = q * z

f.backward()

In [28]:
print("Gradient of z is : " + str(z.grad))
print("Gradient of y is : " + str(y.grad))
print("Gradient of x is : " + str(x.grad))

Gradient of z is : tensor(2.)
Gradient of y is : tensor(-2.)
Gradient of x is : tensor(-2.)


* **Note** that we need to set the `requires_grad` parameter to `True` to tell PyTorch that we need the derivatives with respect to those tensors
* `f.backward()` tells PyTorch to compute the derivatives
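
The printed gradients can be reproduced by hand with the chain rule. The sketch below repeats the example and compares each gradient with the value the rules predict.

```python
import torch

x = torch.tensor(-3., requires_grad=True)
y = torch.tensor(5., requires_grad=True)
z = torch.tensor(-2., requires_grad=True)

q = x + y
f = q * z
f.backward()

# By hand: f = q * z, so df/dz = q and df/dq = z.
# q = x + y, so dq/dx = dq/dy = 1, and the chain rule gives df/dx = df/dy = z.
print(z.grad.item() == q.item())  # True: df/dz = q = 2
print(x.grad.item() == z.item())  # True: df/dx = z = -2
print(y.grad.item() == z.item())  # True: df/dy = z = -2
```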

### Introduction to Neural Networks
* The simplest form of modern neural network is the fully connected (dense) neural network

#### Fully connected neural networks with PyTorch

In [29]:
input_layer = torch.rand(10)

In [30]:
w1 = torch.rand(10, 20)
w2 = torch.rand(20, 20)
w3 = torch.rand(20, 4)

* In order to get the values of the first hidden layer `h1`, we multiply the vector of features with the first matrix of weights `w1`
* Look at the matrix of weights. The first dimension should always correspond to the preceding layer, and the second dimension to the following layer

In [31]:
h1 = torch.matmul(input_layer, w1)

* Similarly, we continue for the second hidden layer, `h2`, which is the product of the first hidden layer `h1` and the second matrix of weights `w2`.
* Finally, we get the results of the `output_layer`, which has 4 classes, by multiplying the second hidden layer `h2` with the third matrix of weights `w3`

In [32]:
h2 = torch.matmul(h1, w2)

In [33]:
output_layer = torch.matmul(h2, w3)

In [34]:
print(output_layer)

tensor([199.4510, 194.9810, 236.2898, 185.9965])
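
The same weight matrices also work on a whole batch of inputs at once. The sketch below assumes a batch of 5 samples (an arbitrary choice) and tracks the shape at each layer, which shows why the first dimension of each weight matrix must match the preceding layer.

```python
import torch

# A batch of 5 inputs with 10 features each (batch size chosen arbitrarily)
batch = torch.rand(5, 10)

w1 = torch.rand(10, 20)
w2 = torch.rand(20, 20)
w3 = torch.rand(20, 4)

h1 = torch.matmul(batch, w1)         # (5, 10) @ (10, 20) -> (5, 20)
h2 = torch.matmul(h1, w2)            # (5, 20) @ (20, 20) -> (5, 20)
output_layer = torch.matmul(h2, w3)  # (5, 20) @ (20, 4)  -> (5, 4)

print(h1.shape, h2.shape, output_layer.shape)
```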


### Building a neural network - PyTorch style

```
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 20)
        self.output = nn.Linear(20, 4)
        
    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)
        x = self.output(x)
        return x

input_layer = torch.rand(10)
net = Net()
result = net(input_layer)
```

* In the `__init__` method, we define our parameters, the tensors of weights.
* For fully connected layers, they are called `nn.Linear`
    * The first parameter is the number of units in the current layer
    * The second parameter is the number of units in the next layer
* In the `forward` method, we apply all those weights to our input
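
To make the link with the hand-written matrix products explicit, the sketch below looks inside a single `nn.Linear` layer. Unlike the manual `w1` of shape `(10, 20)`, the layer stores its weight as `(out_features, in_features)` and also adds a bias term; the name `by_hand` is introduced only for this illustration.

```python
import torch
import torch.nn as nn

fc1 = nn.Linear(10, 20)

# The weight is stored as (out_features, in_features), plus one bias per output unit
print(fc1.weight.shape)  # torch.Size([20, 10])
print(fc1.bias.shape)    # torch.Size([20])

x = torch.rand(10)

# Calling the layer is equivalent to x @ W^T + b
by_hand = torch.matmul(x, fc1.weight.T) + fc1.bias
print(torch.allclose(fc1(x), by_hand, atol=1e-6))
```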
 
#### Exercises: Your first neural network

```
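# One flattened 28x28 input; these shapes are assumed so that the
# multiplications line up with the 784-200-10 network defined below
input_layer = torch.rand(1, 784)

# Initialize the weights of the neural network
weight_1 = torch.rand(784, 200)
weight_2 = torch.rand(200, 10)

# Multiply input_layer with weight_1
hidden_1 = torch.matmul(input_layer, weight_1)

# Multiply hidden_1 with weight_2
output_layer = torch.matmul(hidden_1, weight_2)
print(output_layer)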
```
***
```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        
        # Instantiate all 2 linear layers  
        self.fc1 = nn.Linear(784, 200)
        self.fc2 = nn.Linear(200, 10)

    def forward(self, x):
      
        # Use the instantiated layers and return x
        x = self.fc1(x)
        x = self.fc2(x)
        return x
```

### Activation functions

In [35]:
input_layer = torch.tensor([2., 1.])
weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
hidden_layer = torch.matmul(input_layer, weight_1)
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])
output_layer = torch.matmul(hidden_layer, weight_2)
print(output_layer)

tensor([0.9696, 0.7527])


#### Matrix multiplication is a linear transformation
* Now, let's try to do something different
* Let's first multiply the matrices with `torch.matmul` and then we'll multiply the input with the product of these matrices
* When we print the results, we see something interesting: **the result of the output layer is exactly the same as before.**

In [38]:
input_layer = torch.tensor([2., 1.])
weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])
weight = torch.matmul(weight_1, weight_2)
output_layer = torch.matmul(input_layer, weight)
print(output_layer)
print(weight)

tensor([0.9696, 0.7527])
tensor([[0.4208, 0.2372],
        [0.1280, 0.2783]])


* This means that we can achieve exactly the same result with a single-layer neural network that uses this particular set of weights.
* Linear algebra tells us that matrix multiplication is a linear transformation, and that a composition of linear transformations is itself linear, so any stack of purely linear layers can be collapsed into a single-layer network
* This has an irritating consequence: such networks are not very powerful; used *alone*, linear layers can only separate linearly separable datasets (for which there are a host of more intuitive ML algorithms).
* To handle datasets that are not linearly separable, we use **activation functions.**

<img src='data/activation_functions.png' width="600" height="300" align="center"/>

* **Activation functions** are non-linear functions which are inserted in each layer of the neural network, making neural networks nonlinear and allowing them to deal with highly non-linear datasets, thus making them much more powerful.
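
A small sketch of why the activation matters, reusing the weights from the example above. The input `[1., -2.]` is chosen here (it is not from the original example) so that one hidden unit becomes negative; with a ReLU between the layers, the network no longer matches the collapsed single-layer version.

```python
import torch
import torch.nn as nn

weight_1 = torch.tensor([[0.45, 0.32], [-0.12, 0.29]])
weight_2 = torch.tensor([[0.48, -0.12], [0.64, 0.91]])

# An input chosen so that one hidden unit becomes negative
input_layer = torch.tensor([1., -2.])

relu = nn.ReLU()

# Two layers with a ReLU in between
hidden_layer = relu(torch.matmul(input_layer, weight_1))
with_activation = torch.matmul(hidden_layer, weight_2)

# The collapsed single-layer version
collapsed = torch.matmul(input_layer, torch.matmul(weight_1, weight_2))

print(with_activation)
print(collapsed)
```

The two printed tensors differ, so no single weight matrix reproduces the network with the activation for all inputs.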

In [41]:
relu = nn.ReLU()

tensor_1 = torch.tensor([2., -4.])
print(relu(tensor_1))

tensor_2 = torch.tensor([[2., -4.], [1.2, 0.]])
print(relu(tensor_2))

tensor([2., 0.])
tensor([[2.0000, 0.0000],
        [1.2000, 0.0000]])


### Loss Functions
* So far, all neural networks in this course have had random weights (and so they weren't particularly useful)
* The recipe for training neural networks is the following:
    * Initialize neural networks with random weights
    * Do a forward pass
    * Calculate loss function (1 number)
    * Calculate the gradients using backpropagation
    * Change the weights based on gradients
* Loss (cost) function for **regression: least squares (mean squared error) loss**
* Loss (cost) function for **classification: softmax or (categorical) cross-entropy loss**
* For more complicated problems (like object detection), more complicated losses are used
* Loss functions should be **differentiable**; otherwise we won't be able to compute gradients
* For this reason, instead of using accuracy (which is not differentiable), we need to use some proxy loss functions (in neural nets, a softmax function followed by a cross-entropy function performs really well).
* **Softmax** is a function that turns numbers into probabilities
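
The training recipe above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the toy data is random, the model is a single `nn.Linear` layer, the learning rate is `0.1`, and the weight update is plain gradient descent written by hand rather than any particular optimizer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: 64 samples, 10 features, 4 classes
inputs = torch.rand(64, 10)
targets = torch.randint(0, 4, (64,))

# Step 1: a model initialized with random weights
model = nn.Linear(10, 4)
criterion = nn.CrossEntropyLoss()
learning_rate = 0.1  # assumed value

losses = []
for step in range(20):
    # Step 2: forward pass
    logits = model(inputs)
    # Step 3: the loss is a single number
    loss = criterion(logits, targets)
    losses.append(loss.item())
    # Step 4: gradients via backpropagation (clear old gradients first)
    model.zero_grad()
    loss.backward()
    # Step 5: move each weight a small step against its gradient
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

print(losses[0], losses[-1])
```

The loss at the last step should be lower than at the first, even though random labels cannot be learned well.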

<img src='data/softmax_cross_entropy2.png' width="600" height="300" align="center"/>

### CE loss in PyTorch
* `logits` = scores for each class
* `ground_truth` = index of the correct class (here `0`, the cat class)
* `criterion` = loss function
* Below we choose `nn.CrossEntropyLoss()` which combines **softmax** with **cross-entropy**
* Note that we get the same result from the code below as we do in the illustration above.

In [42]:
logits = torch.tensor([[3.2, 5.1, -1.7]])
ground_truth = torch.tensor([0])
criterion = nn.CrossEntropyLoss()

loss = criterion(logits, ground_truth)
print(loss)

tensor(2.0404)
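
To see what `nn.CrossEntropyLoss()` combines, the same number can be computed in two explicit steps. This is a sketch for illustration; the names `probabilities`, `by_hand` and `built_in` are introduced here.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[3.2, 5.1, -1.7]])
ground_truth = torch.tensor([0])

# Softmax turns the scores into probabilities that sum to 1
probabilities = torch.softmax(logits, dim=1)

# Cross-entropy is the negative log-probability of the correct class
by_hand = -torch.log(probabilities[0, ground_truth[0]])

built_in = nn.CrossEntropyLoss()(logits, ground_truth)
print(by_hand, built_in)
```

Both values match the loss printed above up to rounding.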


What if the cat class prediction had been much higher?

In [43]:
logits = torch.tensor([[10.2, 5.1, -1.7]])
loss = criterion(logits, ground_truth)
print(loss)

tensor(0.0061)


What if the cat class prediction had been much lower?

In [44]:
logits = torch.tensor([[-10, 5.1, -1.7]])
loss = criterion(logits, ground_truth)
print(loss)

tensor(15.1011)


The rule of thumb is that **the more accurate the network is, the smaller the loss (and vice versa).**

#### Exercises: Calculating loss function in PyTorch

In [45]:
# Initialize the scores and ground truth
logits = torch.tensor([[-1.2, 0.12, 4.8]])
ground_truth = torch.tensor([2])

# Instantiate cross entropy loss
criterion = nn.CrossEntropyLoss()

# Compute and print the loss
loss = criterion(logits, ground_truth)
print(loss)

tensor(0.0117)


#### Exercises: Loss function of random scores
If the neural network predicts random scores, what would its loss be? Let's find out in PyTorch. The neural network has 1000 classes, each with a random score, and the ground truth is class 111. Calculate the loss.

In [46]:
# Import torch and torch.nn
import torch
import torch.nn as nn

# Initialize logits and ground truth
logits = torch.rand(1,1000)
ground_truth = torch.tensor([111])

# Instantiate cross-entropy loss
criterion = nn.CrossEntropyLoss()

# Calculate and print the loss
loss = criterion(logits, ground_truth)
print(loss)

tensor(7.3071)
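
A useful reference point for this result: if every class receives the same score, softmax assigns each of the 1000 classes probability 1/1000, and the loss is $\ln(1000) \approx 6.91$. Random scores drawn from $[0, 1)$ land close to that value. The sketch below checks the uniform case; the name `uniform_logits` is introduced for this illustration.

```python
import math
import torch
import torch.nn as nn

num_classes = 1000
criterion = nn.CrossEntropyLoss()
ground_truth = torch.tensor([111])

# Identical scores for every class: softmax gives each class probability 1/1000
uniform_logits = torch.zeros(1, num_classes)
loss = criterion(uniform_logits, ground_truth)

print(loss.item(), math.log(num_classes))
```

A trained network should do clearly better than this baseline.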


### Preparing a dataset in PyTorch
* To use datasets in PyTorch, they need to be in a PyTorch-friendly format that the framework can understand
* **`torchvision`:** a package which deals with datasets and pretrained neural nets
* Below we define a transformation of images to torch tensors, using `transforms`.

In [48]:
transform = transforms.Compose(
            [transforms.ToTensor(),
             transforms.Normalize((0.4914, 0.48216, 0.44653),
                                  (0.24703, 0.24349, 0.26159))])

<img src='data/visualize_parameters.png' width="600" height="300" align="center"/>
<img src='data/whats_for_dinner.png' width="400" height="200" align="center"/>