## Introduction to PyTorch

<img src="https://miro.medium.com/max/1200/1*jcZLpgh3gppeFFgcpFSP0w.jpeg" alt="PyTorch" title="PyTorch" style="width: 200px;"/>

PyTorch is a Python-based library for scientific computing that provides three main features:
- An n-dimensional Tensor, which is similar to a NumPy array but can run on GPUs
- Easy construction of large computational graphs for deep learning
- Automatic differentiation for computing the gradients of neural networks

You can install PyTorch from: https://pytorch.org/

In [1]:
import torch

## PyTorch's Tensor

A Tensor is an n-dimensional array, a generalization of a matrix that can be indexed along more than 2 dimensions. A Tensor is similar to a NumPy array in that most operations on a NumPy array can also be performed on a tensor. However, tensors benefit from strong GPU acceleration while NumPy arrays do not. All computations in deep learning are performed on tensors. Tensors also store optional information such as gradients and bookkeeping for the computational graph.


###  Tensor Creation

Construct a randomly initialized tensor of size 5x3:

In [2]:
x = torch.rand(5, 3) # a tensor filled with random numbers from a uniform distribution on the interval [0,1)
print(x) 

tensor([[0.6839, 0.3716, 0.3839],
        [0.2629, 0.0449, 0.2231],
        [0.1178, 0.9471, 0.7022],
        [0.6992, 0.4817, 0.5664],
        [0.9368, 0.7465, 0.2518]])


Construct a tensor filled with zeros and of data type (dtype) long:

In [3]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])


Construct a tensor directly from data:

In [4]:
x = torch.tensor([[5.5, 3], [3.2, 13], [6.9, 23]])
print(x)

tensor([[ 5.5000,  3.0000],
        [ 3.2000, 13.0000],
        [ 6.9000, 23.0000]])


Construct a tensor based on an existing tensor (properties of the input tensor, such as dtype, are reused by default):

In [5]:
y = torch.randn_like(x) 
print(y)

tensor([[ 0.5326, -1.8590],
        [-0.0550, -0.1841],
        [-0.6633, -0.8143]])


Get the size of the tensor:

In [6]:
print(y.size())  # you can also use the attribute y.shape (no parentheses)
print(y.size(1)) # get the number of columns

torch.Size([3, 2])
2
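
As a small sketch of the size-related accessors (the tensor `t` below is a hypothetical example, not one of the notebook's variables): `.shape` is an attribute that carries the same information as `.size()`, and the result unpacks like a tuple.

```python
import torch

t = torch.zeros(3, 2)        # hypothetical tensor used only for illustration
print(t.shape)               # attribute form, same information as t.size()
print(t.shape == t.size())   # True
print(t.dim(), t.numel())    # number of dimensions and total element count
rows, cols = t.shape         # torch.Size unpacks like a tuple
print(rows, cols)
```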


Construct a 3D tensor:

In [7]:
z = torch.tensor([ [[5.5, 3], [3.2, 13], [6.9, 23]], [[2.1, 3.3], [1.8, 2], [5.2, 20]] ])
print(z)

tensor([[[ 5.5000,  3.0000],
         [ 3.2000, 13.0000],
         [ 6.9000, 23.0000]],

        [[ 2.1000,  3.3000],
         [ 1.8000,  2.0000],
         [ 5.2000, 20.0000]]])


What is a 3D tensor anyway? Think about it like this. If you have a vector, indexing into the vector gives you a scalar. If you have a matrix, indexing into the matrix gives you a vector. If you have a 3D tensor, then indexing into the tensor gives you a matrix!

The size of the 3D tensor will be 2x3x2 (in this example, the 3D tensor is a collection of two 3x2 matrices). Let's print the size:

In [8]:
print(z.size())

torch.Size([2, 3, 2])


In [9]:
print(z[1]) # indexes the first dimension: selects the second matrix, whose size is 3x2

tensor([[ 2.1000,  3.3000],
        [ 1.8000,  2.0000],
        [ 5.2000, 20.0000]])


In [10]:
print(z[1][1]) # selects the second row of the second matrix: a vector of size 2

tensor([1.8000, 2.0000])


In [11]:
print(z[1][1][1]) # selects the second element of that row: a scalar (0-dimensional tensor)

tensor(2.)


In [12]:
print(z.dtype) # prints data type of tensor

torch.float32


The default data type of a floating-point tensor is ``torch.float32``. If you want an integer tensor, you can pass a ``dtype`` explicitly:

In [13]:
it = torch.tensor([3, 4], dtype=torch.int)
print(it)

tensor([3, 4], dtype=torch.int32)
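
The following is a minimal sketch of how the data type is inferred when no `dtype` is given, assuming the default floating-point type has not been changed. The variable names are illustrative only.

```python
import torch

ints = torch.tensor([3, 4])         # Python ints are inferred as 64-bit integers
floats = torch.tensor([3.0, 4.0])   # Python floats use the default float dtype
print(ints.dtype)                   # torch.int64
print(floats.dtype)                 # torch.float32 unless the default was changed

# converting returns a new tensor; the original keeps its dtype
as_float = ints.to(torch.float32)
print(as_float.dtype, ints.dtype)
```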


###  Operations on a Tensor

There are multiple syntaxes for operations. Let us take a look at the addition operation.

In [14]:
# let's create two tensors
x = torch.rand(5, 3)
y = torch.rand(5, 3)
# let's print those two tensors
print(x)
print(y)

tensor([[4.3471e-01, 2.5028e-04, 7.6348e-01],
        [9.7835e-01, 4.7258e-01, 9.5781e-02],
        [3.0613e-01, 4.6486e-01, 1.7439e-01],
        [4.1998e-01, 1.7716e-03, 7.4684e-01],
        [5.6333e-01, 6.7918e-01, 7.9335e-01]])
tensor([[0.2797, 0.3715, 0.3162],
        [0.3177, 0.1429, 0.6317],
        [0.9134, 0.8784, 0.2730],
        [0.8035, 0.0474, 0.7807],
        [0.5321, 0.1873, 0.2090]])


In [15]:
# let's add them in two ways
print(x + y) # method 1
print(torch.add(x, y)) # method 2

tensor([[0.7144, 0.3717, 1.0797],
        [1.2960, 0.6155, 0.7274],
        [1.2195, 1.3433, 0.4473],
        [1.2235, 0.0491, 1.5276],
        [1.0954, 0.8665, 1.0024]])
tensor([[0.7144, 0.3717, 1.0797],
        [1.2960, 0.6155, 0.7274],
        [1.2195, 1.3433, 0.4473],
        [1.2235, 0.0491, 1.5276],
        [1.0954, 0.8665, 1.0024]])


In [16]:
y.add_(x) # adds x to y (in-place) y := y + x  (method 3) or alternatively we can do y = torch.add(x,y)

tensor([[0.7144, 0.3717, 1.0797],
        [1.2960, 0.6155, 0.7274],
        [1.2195, 1.3433, 0.4473],
        [1.2235, 0.0491, 1.5276],
        [1.0954, 0.8665, 1.0024]])

Any operation that mutates a tensor in-place is suffixed with an ``_``. For example, ``x.copy_(y)`` and ``x.t_()`` (transpose) will change ``x``.
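
To make the difference concrete, here is a small sketch contrasting the out-of-place `add` with the in-place `add_`. The tensors are freshly created for illustration and are not the notebook's `x` and `y`.

```python
import torch

x = torch.ones(2, 2)
y = torch.ones(2, 2)

out_of_place = y.add(x)   # returns a new tensor; y is unchanged
print(y)                  # still all ones

in_place = y.add_(x)      # modifies y itself and returns it
print(y)                  # now all twos

# the in-place result refers to the same memory as y
print(in_place.data_ptr() == y.data_ptr())
```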

We can use standard NumPy-like indexing:

In [17]:
print(x) # prints x whose size is 5x3
print(x[:, 1]) # prints the second column of the tensor, a vector of size 5

tensor([[4.3471e-01, 2.5028e-04, 7.6348e-01],
        [9.7835e-01, 4.7258e-01, 9.5781e-02],
        [3.0613e-01, 4.6486e-01, 1.7439e-01],
        [4.1998e-01, 1.7716e-03, 7.4684e-01],
        [5.6333e-01, 6.7918e-01, 7.9335e-01]])
tensor([2.5028e-04, 4.7258e-01, 4.6486e-01, 1.7716e-03, 6.7918e-01])
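
A few more indexing patterns, sketched on a hypothetical tensor `m` with known values so the shapes are easy to follow. An integer index drops a dimension, while a slice keeps it.

```python
import torch

m = torch.arange(15).view(5, 3)   # hypothetical 5x3 tensor holding 0..14
col = m[:, 1]                     # integer index drops the dimension: size 5
col_2d = m[:, 1:2]                # a slice keeps the dimension: size 5x1
first_rows = m[:2]                # first two rows: size 2x3
print(col)
print(col.shape, col_2d.shape, first_rows.shape)
```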


Let us resize/reshape a tensor:

In [18]:
# let's print x's size
print(x.size()) #  5x3

torch.Size([5, 3])


In [19]:
# let's reshape x to a flat array
print(x.view(15)) # print reshaped tensor
print(x.view(15).size()) # print size of the reshaped tensor

tensor([4.3471e-01, 2.5028e-04, 7.6348e-01, 9.7835e-01, 4.7258e-01, 9.5781e-02,
        3.0613e-01, 4.6486e-01, 1.7439e-01, 4.1998e-01, 1.7716e-03, 7.4684e-01,
        5.6333e-01, 6.7918e-01, 7.9335e-01])
torch.Size([15])
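
Two properties of `view` are worth a quick sketch (on a small illustrative tensor): a dimension given as `-1` is inferred from the others, and the view shares memory with the original tensor.

```python
import torch

t = torch.arange(6.0)   # tensor([0., 1., 2., 3., 4., 5.])
m = t.view(2, -1)       # -1 lets PyTorch infer the missing dimension (3 here)
print(m.shape)          # torch.Size([2, 3])

m[0, 0] = 100.0         # a view shares storage with the original tensor
print(t[0])             # the change is visible through t as well
```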


Let us multiply two tensors:

In [20]:
# let's create two tensors
a = torch.randn(4, 1) 
b = torch.randn(1, 4)

In [21]:
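# let's multiply them: torch.mul is element-wise and broadcasts a and b
torch.mul(a, b) # (4 x 1) * (1 x 4) broadcasts to (4 x 4); for these shapes it equals the matrix product torch.matmul(a, b)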

tensor([[-0.3922,  0.0585,  0.4208,  0.0853],
        [ 3.5755, -0.5333, -3.8356, -0.7776],
        [ 0.1160, -0.0173, -0.1245, -0.0252],
        [ 0.8985, -0.1340, -0.9639, -0.1954]])
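
Because `torch.mul` multiplies element-wise, it coincides with the matrix product only in special cases such as a column times a row. The sketch below, on small hand-picked tensors, contrasts it with `torch.matmul`.

```python
import torch

a = torch.tensor([[1.0], [2.0], [3.0]])   # 3x1
b = torch.tensor([[10.0, 20.0]])          # 1x2

elementwise = torch.mul(a, b)      # broadcasting expands both operands to 3x2
matrix_prod = torch.matmul(a, b)   # true matrix product, also 3x2
# equal here only because the shared inner dimension is 1
print(torch.equal(elementwise, matrix_prod))

c = torch.ones(2, 3)
d = torch.ones(3, 2)
print(torch.matmul(c, d).shape)    # (2x3) @ (3x2) gives 2x2
# torch.mul(c, d) would raise an error: sizes 2x3 and 3x2 cannot be broadcast
```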

Let us compute the mean of a tensor in one particular dimension:

In [22]:
# let's create a tensor
a = torch.randn(3, 5)
print(a)

tensor([[-0.7573,  0.2316,  0.3929, -0.4683, -1.4706],
        [ 0.8319,  0.1011,  0.1805,  1.8710,  1.6045],
        [-0.3907,  1.5070,  0.2966,  0.4689, -2.4953]])


In [23]:
# let's take the mean of each row (reducing over the column dimension)
print(torch.mean(a, 1)) # reducing over dimension 1 (columns) gives one mean per row: size 3

tensor([-0.4144,  0.9178, -0.1227])


In [24]:
# let's take the mean of each column (reducing over the row dimension)
print(torch.mean(a, 0)) # reducing over dimension 0 (rows) gives one mean per column: size 5

tensor([-0.1054,  0.6132,  0.2900,  0.6239, -0.7871])
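
With hand-picked values the effect of the `dim` argument is easy to check by eye. This sketch also shows the `keepdim=True` option, which keeps the reduced dimension with size 1.

```python
import torch

a = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])   # 2x3

row_means = torch.mean(a, dim=1)                     # one mean per row: size 2
col_means = torch.mean(a, dim=0)                     # one mean per column: size 3
row_means_2d = torch.mean(a, dim=1, keepdim=True)    # reduced dim kept: size 2x1

print(row_means)            # tensor([2., 5.])
print(col_means)            # tensor([2.5000, 3.5000, 4.5000])
print(row_means_2d.shape)   # torch.Size([2, 1])
```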


Let us compute the max of a tensor in one particular dimension:

In [25]:
# let's create a tensor
a = torch.randn(4, 3)
print(a)

tensor([[ 0.0470,  1.2275, -0.9094],
        [ 0.3157, -1.0225,  0.4018],
        [ 0.0237, -1.3291, -0.2868],
        [-0.0337,  1.7147,  1.0138]])


In [26]:
# let's identify maximum in each row
values, indices = torch.max(a, 1) 
print(values) # values holds the maximum value of each row of the input tensor (size 4)
print(indices) # indices holds the column index of each maximum value (argmax) (size 4)

tensor([1.2275, 0.4018, 0.0237, 1.7147])
tensor([1, 2, 0, 1])
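
A short sketch of related calls on a small illustrative tensor: `torch.argmax` returns only the indices, and `torch.max` without a dimension returns the single largest element.

```python
import torch

a = torch.tensor([[0.1, 0.9, 0.3],
                  [0.7, 0.2, 0.5]])

values, indices = torch.max(a, dim=1)   # per-row maximum and its column index
print(values)
print(indices)                          # tensor([1, 0])

row_argmax = torch.argmax(a, dim=1)     # same indices, without the values
overall_max = torch.max(a)              # no dim: maximum over all elements
print(row_argmax, overall_max)
```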


You can take a look at the list of supported Tensor functions here: https://pytorch.org/docs/stable/tensors.html

### NumPy Bridge 
We can easily convert a Torch tensor to a NumPy array and vice versa.

Let us convert a Torch Tensor to a NumPy array:

In [27]:
# let's create a tensor
a = torch.ones(5)
print(a)

tensor([1., 1., 1., 1., 1.])


In [28]:
# let's convert that tensor to numpy
b = a.numpy() # converts Tensor to NumPy with a,b pointing to same memory locations
print(b)

[1. 1. 1. 1. 1.]


In [29]:
# manipulate 'a' (changes will reflect in 'b')
a.add_(1)
print(a)
print(b)

tensor([2., 2., 2., 2., 2.])
[2. 2. 2. 2. 2.]


Let us convert a NumPy array to a Torch Tensor:

In [30]:
# let's create a numpy array
import numpy as np
a = np.ones(5)
print(a)

[1. 1. 1. 1. 1.]


In [31]:
# let's convert to tensor
b = torch.from_numpy(a)
print(b)

tensor([1., 1., 1., 1., 1.], dtype=torch.float64)


In [32]:
# manipulate 'a' (changes will reflect in 'b')
np.add(a, 1, out=a)
print(a)
print(b)

[2. 2. 2. 2. 2.]
tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
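
When the shared memory is not wanted, one option is to copy the data instead. The sketch below contrasts `torch.from_numpy`, which shares memory, with `torch.tensor`, which copies. The variable names are illustrative.

```python
import numpy as np
import torch

arr = np.ones(3)
shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # copies the data into new memory

arr += 1                         # in-place change to the NumPy array
print(shared)                    # reflects the change
print(copied)                    # unaffected
```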


### CUDA Tensors

We can use the ``.to`` method to move Tensors onto any device. Generally, we move tensors to GPUs to accelerate computation.

In [33]:
# let us run this cell only if CUDA is available
# We will use ``torch.device`` objects to move tensors in and out of GPU
if torch.cuda.is_available():
  device = torch.device("cuda")          # a CUDA device object
  y = torch.ones_like(x, device=device)  # directly create a tensor on GPU
  x = x.to(device)                       # or just use strings ``.to("cuda")``
  z = x + y
  print(z)
  print(z.to("cpu", torch.double))       # ``.to`` can also change dtype together!
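
A common way to write code that runs with or without a GPU is to choose the device once and pass it around. This is a minimal sketch of that pattern and not the only possible arrangement.

```python
import torch

# pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(5, 3, device=device)   # created directly on the chosen device
y = torch.ones_like(x)                # inherits x's device and dtype
z = (x + y).to("cpu")                 # move the result back to the CPU
print(z.device)
```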

## Computational Graph and Automatic Differentiation

A computation graph is an essential concept for efficient deep learning programming because it frees you from writing the backpropagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output. Since the graph fully specifies which parameters were involved in which operations, it contains enough information to compute derivatives.

The ``autograd`` package provides automatic differentiation for all operations on Tensors. It is a define-by-run framework, which means that your backprop is defined by how your code is run, and that every single iteration can be different.
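
As a small sketch of what define-by-run means in practice, the hypothetical `forward` function below uses ordinary Python control flow, so the operations recorded for backprop depend on the input values.

```python
import torch

def forward(x):
    # ordinary Python control flow decides which operations are recorded
    if x.sum() > 0:
        return (x * 2).sum()
    return (x ** 2).sum()

x_pos = torch.tensor([1.0, 2.0], requires_grad=True)
forward(x_pos).backward()
print(x_pos.grad)   # tensor([2., 2.]) from the x * 2 branch

x_neg = torch.tensor([-1.0, -2.0], requires_grad=True)
forward(x_neg).backward()
print(x_neg.grad)   # tensor([-2., -4.]) from the x ** 2 branch
```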

Let us see this in more simple terms with some examples.

Besides keeping track of its size, data type, and other properties, a Tensor can also keep track of how it was created. If you set its ``.requires_grad`` attribute to ``True``, PyTorch starts tracking all operations on it.

Let us look at an example:

In [34]:
# create two tensors
a = torch.tensor([1., 2., 3], requires_grad=True) 
b = torch.tensor([5., 1., 4], requires_grad=True) 
print(a) # size 3
print(b) # size 3

tensor([1., 2., 3.], requires_grad=True)
tensor([5., 1., 4.], requires_grad=True)


In [35]:
# add those tensors
c = a + b # size 3
print(c.data) # size 3

tensor([6., 3., 7.])


In [36]:
# but c knows something extra.
print(c.grad_fn) # c knows that it was a result of addition of two tensors

<AddBackward0 object at 0x10bb80c18>


In [37]:
# sum all entries in c
d = c.sum() # scalar (0-dimensional tensor)
print(d)
print(d.grad_fn) # d knows that it was a sum of all elements in a single tensor.

tensor(16., grad_fn=<SumBackward0>)
<SumBackward0 object at 0x10bb86358>


``c`` knows that it was a result of addition of two tensors while ``d`` knows that it was a sum of all elements in a single tensor. Thus, a computation graph is simply a specification of how your data is combined to give you the output. You can imagine this computational graph as below:
<img src="images/sl1_pytorch_cgraph_sum.jpg" alt="computational graph example" title="Example Computational Graph" />

Once we define the computational graph, we can call ``.backward()`` and have all the gradients computed automatically. The gradient for a tensor will be accumulated into its ``.grad`` attribute.

In [38]:
d.backward()
print(a.grad) # 1x3

tensor([1., 1., 1.])


These gradients are easy to verify by hand: since $d = \sum_i (a_i + b_i)$, the derivative of $d$ with respect to each $a_i$ is 1. PyTorch lets you define an arbitrary computation graph made up of tensors (``torch.Tensor``) and modules (off-the-shelf layers from ``torch.nn`` and custom layers/models).

Important Note: If you run the above block multiple times, the gradient will keep growing. That is because PyTorch accumulates the gradient into the ``.grad`` property, since for many models this is very convenient.
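
The accumulation behaviour can be sketched on a slightly less trivial graph. Here $d = \sum_i a_i b_i$, so the gradient with respect to $a$ is $b$. A second backward pass adds to the stored gradient, and `zero_()` resets it. The tensors are recreated here so the sketch is self-contained.

```python
import torch

a = torch.tensor([1., 2., 3.], requires_grad=True)
b = torch.tensor([5., 1., 4.], requires_grad=True)

(a * b).sum().backward()   # d = sum_i a_i * b_i, so the gradient w.r.t. a is b
first = a.grad.clone()
print(first)               # tensor([5., 1., 4.])

(a * b).sum().backward()   # a second backward pass adds to the stored gradient
second = a.grad.clone()
print(second)              # tensor([10.,  2.,  8.])

a.grad.zero_()             # reset the accumulated gradient in place
print(a.grad)              # tensor([0., 0., 0.])
```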

## Linearities, Nonlinearities and Loss functions

A deep learning model is typically composed of linearities (affine transformations) and nonlinearities combined in a clever way. The nonlinearities make deep learning models powerful. The last node of the computational graph is typically a loss function (or objective function), which measures how far the model prediction is from the actual target. PyTorch has most of the commonly used linearities, nonlinearities and loss functions built into the library. Adding them to your computational graph is straightforward, as we will see now.

### Linearities
An affine transformation is the most commonly used linearity: a function $f(x)$ where $f(x) = A x + b$
for a matrix $A$ and vectors $x, b$. The parameters to be learned here are $A$ and $b$. Often, $b$ is referred to as the bias term.

Let us transform some sample data using an affine transformation:

In [39]:
# let's define a linear layer based on an affine transformation
linear_layer = torch.nn.Linear(5, 3) # maps from R^5 to R^3, parameters A, b
print(linear_layer)

Linear(in_features=5, out_features=3, bias=True)


Let us print the parameters: A matrix and b vector

In [40]:
print(linear_layer.weight.data) # prints A, size 3x5
print(linear_layer.bias.data) # prints b, size 3

tensor([[ 0.4148,  0.0942, -0.3873, -0.2727,  0.1664],
        [ 0.0055, -0.3496,  0.2675,  0.3780, -0.4020],
        [ 0.1789,  0.3826, -0.1834,  0.3528, -0.4270]])
tensor([0.4166, 0.3994, 0.2847])


Let us create some data and pass it to the linear layer.

In [41]:
# let's create data 'x'
x = torch.randn(1, 5) # data is 1x5. 

In [42]:
# let's compute f(x) by passing x to the linear layer
transformed_output = linear_layer(x)
print(transformed_output)
# print(linear_layer(torch.randn(1, 6))) # error: size mismatch, m1: [1 x 6], m2: [5 x 3] 

tensor([[ 0.4444,  0.7414, -0.2102]], grad_fn=<AddmmBackward>)
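
To connect the layer back to the formula, the sketch below recomputes the output by hand. Since the input is stored as a row vector, $A x + b$ becomes `x @ A.t() + b`. The names `layer`, `A`, and `manual` are illustrative.

```python
import torch

layer = torch.nn.Linear(5, 3)
x = torch.randn(1, 5)

A = layer.weight   # size 3x5
b = layer.bias     # size 3

manual = x @ A.t() + b   # x is a row vector, so A is transposed
print(torch.allclose(layer(x), manual, atol=1e-6))
```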


You can look into the other linear layers here: https://pytorch.org/docs/stable/nn.html#linear-layers

### Nonlinearities
Nonlinearities let you build powerful deep learning models. For example, the sigmoid nonlinearity squashes its input to lie between 0 and 1. It is defined by $\sigma(x) = \frac{1}{1+\exp(-x)}$.

Let us see an example for using sigmoid nonlinearity on sample input:

In [43]:
# let's define sigmoid layer
sigmoid_layer = torch.nn.Sigmoid()

In [44]:
# let's define x
x = torch.randn(1, 5) # data is 1x5.
print(x)

tensor([[-1.8064,  0.1351,  0.6048,  0.6664,  0.7107]])


In [45]:
# let's pass x to sigmoid layer
sigmoid_out = sigmoid_layer(x) # applies sigmoid element-wise; the result is 1x5 (the shape does not change)
print(sigmoid_out)

tensor([[0.1411, 0.5337, 0.6468, 0.6607, 0.6706]])
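
As a quick check of the definition, this sketch applies $\frac{1}{1+\exp(-x)}$ directly to a hand-picked input and compares the result with the built-in layer.

```python
import torch

x = torch.tensor([[-2.0, 0.0, 2.0]])
manual = 1 / (1 + torch.exp(-x))   # the definition applied element-wise
builtin = torch.nn.Sigmoid()(x)

print(manual)
print(torch.allclose(manual, builtin))
```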


Another commonly used nonlinearity is the softmax function, which rescales an n-dimensional input Tensor so that the elements of the n-dimensional output Tensor lie in the range $[0,1]$ and sum to 1.

The softmax function is defined as $\mathrm{softmax}(x_i) = \frac{\exp{(x_i)}}{\sum_j{\exp{(x_j)}}}$

Let us see an example for using softmax nonlinearity on sample input:

In [46]:
# let's create a softmax layer
softmax_layer = torch.nn.Softmax(dim=1) # row-wise softmax

In [47]:
# let's create x
x = torch.randn(1, 5) # data is 1x5.
print(x)

tensor([[-0.6029, -0.6874, -1.2841, -1.0747,  0.9904]])


In [48]:
# let's pass x to the softmax_layer
softmax_out = softmax_layer(x) 

In [49]:
# let's print the softmax output
print(softmax_out) # each entry is between 0 and 1

tensor([[0.1255, 0.1153, 0.0635, 0.0783, 0.6174]])


In [50]:
# softmax output sums to 1.0
print(softmax_out[0].sum()) 

tensor(1.0000)
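
The softmax definition can also be reproduced by hand. This sketch exponentiates a small illustrative input, normalizes each row, and compares the result with `torch.nn.Softmax`.

```python
import torch

x = torch.tensor([[1.0, 2.0, 3.0]])
exps = torch.exp(x)
manual = exps / exps.sum(dim=1, keepdim=True)   # normalize each row to sum to 1
builtin = torch.nn.Softmax(dim=1)(x)

print(manual)
print(torch.allclose(manual, builtin))
```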


You can look at the other nonlinear layers here: https://pytorch.org/docs/stable/nn.html#non-linear-activations-other

### Loss function
A loss function takes the (model prediction, target) pair of inputs, and computes a value that estimates how far away the model prediction is from the target.

A simple loss is ``nn.MSELoss``, which computes the mean squared error between the model prediction ($\hat{y}_i$) and the target ($y_i$). For $n$ elements, the mean squared error is $\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^{2}$.

Let us see an example that applies MSELoss to the output of a Linear+Softmax model and a randomly sampled target (targets are usually annotated by humans, but we create one synthetically in this tutorial for simplicity):

In [51]:
# create data (input, target)
data_input = torch.randn(1,3) # 1 example, 3 input features
data_output = torch.randn(1,3) # 1 example, 3 target values
print(data_output) # 1x3

tensor([[-0.0261, -0.0251,  0.2468]])


In [52]:
# define linear and softmax layer
linear_layer = torch.nn.Linear(3, 3)
softmax_layer = torch.nn.Softmax(dim=1)

In [53]:
# forward pass the data_input through the model (computational graph)
transformed_output = linear_layer(data_input) # maps from R^3 to R^3 (parameters A, b), i.e. maps 1x3 to 1x3
model_output = softmax_layer(transformed_output)
print(model_output)

tensor([[0.4277, 0.1236, 0.4487]], grad_fn=<SoftmaxBackward>)


In [54]:
# compute the MSELoss
criterion = torch.nn.MSELoss()
loss = criterion(model_output, data_output) 
print(loss) # the MSE loss of 1 individual example  

tensor(0.0896, grad_fn=<MseLossBackward>)
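
To see what the loss value means, the sketch below computes the mean of the squared differences by hand on small illustrative tensors and compares it with `torch.nn.MSELoss` using its default settings.

```python
import torch

prediction = torch.tensor([[0.5, 0.2, 0.3]])
target = torch.tensor([[1.0, 0.0, 0.0]])

# (0.25 + 0.04 + 0.09) / 3, averaged over all elements
manual = ((prediction - target) ** 2).mean()
builtin = torch.nn.MSELoss()(prediction, target)
print(manual, builtin)
```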


You can imagine the above computational graph to be like this:

<img src="images/sl1_pytorch_softmax.jpg" alt="computational graph example" title="Example Computational Graph" />

You can look at the other loss layers here: https://pytorch.org/docs/stable/nn.html#loss-functions

## Optimization

So far, we know:
- how to define an arbitrary computation graph (model) with linearities and nonlinearities
- how to add a loss function to our model to measure the quality of the model's predictions
- how to use the ``backward`` function to compute the gradients

The only remaining piece of your PyTorch model is how to update (or learn) the weights (e.g., the parameters $A, b$ of the linear layer). The most commonly used optimization algorithm for neural networks is gradient descent (GD), which first initializes the weights randomly and then repeatedly changes them according to the update rule $\theta^{(t+1)} = \theta^{(t)} - \frac{1}{n} \eta \nabla_\theta{L(\theta)}$, where $\eta$, $\theta$ and $L(\theta)$ are the learning rate (or step size), the parameters of the model (e.g., $A$ and $b$ put together) and the loss function over the parameters, respectively. $\nabla_\theta{L(\theta)}$ is the gradient, which is computed and stored in the ``.grad`` attribute of each parameter tensor when you call the ``backward`` function. In words, the weights at the next step equal the weights at the current step minus the gradient (w.r.t. the current weights) multiplied by the learning rate and divided by the number of examples ($n$) used for this update.

Let us see an example which updates the weight of our previous Linear+Softmax model with MSELoss:

In [55]:
# get a reference to the weights (or parameters)
model_weights = linear_layer.weight.data # note other layers in the previous graph do not have parameters
print(model_weights) # prints model weights before GD update

tensor([[-0.1264,  0.0615,  0.1955],
        [ 0.0292, -0.1180,  0.5301],
        [-0.4985, -0.4142,  0.0642]])


In [56]:
# compute the gradients ($\nabla_\theta{L(\theta)}$)
# the whole graph is differentiated w.r.t. the loss,
# and all Tensors in the graph that have requires_grad=True
# will have their .grad Tensor accumulated with the gradient.
loss.backward()
print(linear_layer.weight.grad) # prints the gradient w.r.t each parameter for the given data input

tensor([[-0.0540,  0.0602, -0.0344],
        [ 0.0160, -0.0178,  0.0102],
        [ 0.0380, -0.0424,  0.0242]])


In [57]:
# make the GD update in place, so that the layer's own weights change (a plain reassignment would only rebind the local name)
model_weights.sub_((1/1.0) * 0.1 * linear_layer.weight.grad) # 0.1 is the learning rate, 1.0 is the no. of examples for this update
print(model_weights) # prints model weights after GD update which is 3x3

tensor([[-0.1210,  0.0555,  0.1989],
        [ 0.0276, -0.1162,  0.5290],
        [-0.5023, -0.4100,  0.0617]])
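
In practice the update is usually delegated to an optimizer. The sketch below uses `torch.optim.SGD` on a freshly created layer (not the notebook's `linear_layer`) and checks that one step matches the manual rule with learning rate 0.1 and a single example.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 3)
x = torch.randn(1, 3)
target = torch.randn(1, 3)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

loss = torch.nn.MSELoss()(torch.softmax(layer(x), dim=1), target)
optimizer.zero_grad()   # clear any previously accumulated gradients
loss.backward()

# manual update rule, computed before the optimizer changes the weights
expected = layer.weight.detach() - 0.1 * layer.weight.grad
optimizer.step()        # applies the same rule to every parameter in place
print(torch.allclose(layer.weight.detach(), expected))
```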


Generally, a GD update that uses only 1 training example is termed stochastic gradient descent (SGD); this is what our update code above does. If each GD update uses multiple training examples, the optimization algorithm is termed mini-batch gradient descent. A mini-batch gradient descent algorithm typically runs for several passes (or epochs) over your training data, and during each pass it repeatedly grabs a mini-batch of training examples to perform an update. The size of the mini-batch and the learning rate are hyperparameters to be selected based on the performance of the model on examples held out from the training data (the validation set).
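
A mini-batch training loop can be sketched as follows. The data is synthetic, and the target rule, batch size, number of epochs, and learning rate are arbitrary assumptions made for illustration; no validation set is used here.

```python
import torch

torch.manual_seed(0)
# synthetic data: 20 examples with 3 features, targets from a made-up linear rule
inputs = torch.randn(20, 3)
targets = inputs @ torch.tensor([[1.0], [-2.0], [0.5]])

model = torch.nn.Linear(3, 1)
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batch_size = 5

initial_loss = criterion(model(inputs), targets).item()
for epoch in range(20):                              # several passes over the data
    permutation = torch.randperm(inputs.size(0))     # reshuffle every epoch
    for start in range(0, inputs.size(0), batch_size):
        batch = permutation[start:start + batch_size]
        optimizer.zero_grad()                        # clear accumulated gradients
        loss = criterion(model(inputs[batch]), targets[batch])
        loss.backward()                              # gradients for this mini-batch
        optimizer.step()                             # one GD update

final_loss = criterion(model(inputs), targets).item()
print(initial_loss, final_loss)
```

Note that `MSELoss` already averages over the mini-batch, which plays the role of the $\frac{1}{n}$ factor in the update rule.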

You can look at the overview of GD optimization algorithms here: https://ruder.io/optimizing-gradient-descent/index.html