# Introduction to PyTorch
---

[PyTorch](https://pytorch.org/docs/stable/index.html) is a framework for building trainable (automatically differentiable) directed acyclic graphs (DAGs) dynamically, in contrast to frameworks that build static DAGs, e.g. TensorFlow 1.x.

PyTorch's main building blocks are tensors (and their high-level abstractions, e.g. `torch.nn` layers) and operations on those tensors. With PyTorch we can define minimization problems and solve them using the optimizers in `torch.optim`.
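
As a minimal sketch of such a minimization problem, the snippet below minimizes a toy objective with `torch.optim.SGD`. The objective $f(x) = (x - 3)^2$, the starting point, the learning rate and the number of steps are all assumptions chosen for illustration.

```python
import torch

# Hypothetical toy problem (chosen for illustration): minimize f(x) = (x - 3)^2.
x = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.SGD([x], lr=0.1)

for _ in range(100):
    optimizer.zero_grad()      # clear the gradient left over from the previous step
    loss = (x - 3.0) ** 2      # scalar objective
    loss.backward()            # compute d(loss)/dx via autograd
    optimizer.step()           # x <- x - lr * gradient

print(x.item())  # close to the minimizer 3.0
```

The same `zero_grad` / `backward` / `step` pattern is what a neural network training loop uses, only with more parameters.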

**Overview of the PyTorch package**
 - `torch.nn`  High-level abstractions for designing neural network architectures, including various layer types, loss functions and containers for more complex models.
 - `torch.nn.functional`  Similar to `torch.nn`, but exposed as functions rather than classes.
 - `torch.nn.init` Functions for initializing tensors.
 - `torch.optim` Optimizers and learning rate schedulers for training neural networks.
 - `torch.utils.data` Collection of classes for data manipulation.
 - `torch.autograd`  Reverse-mode automatic differentiation system that computes gradients automatically using the chain rule.
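
To illustrate the difference between the class style of `torch.nn` and the functional style of `torch.nn.functional`, here is a small sketch. ReLU is used only as an example operation, and the input values are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([[-1.0, 0.0, 2.0]])

# Class style: the layer is a torch.nn.Module object that is created once and then called.
relu_layer = nn.ReLU()
out_module = relu_layer(x)

# Functional style: a plain function call with no object holding state.
out_functional = F.relu(x)

print(torch.equal(out_module, out_functional))  # True
```

The class style fits naturally into containers such as `torch.nn.Sequential`, while the functional style is convenient inside a hand-written `forward` method.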

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

## PyTorch Tensors

### Analogy with Numpy
We can initialize and manipulate tensors with methods similar to those in NumPy.

In [2]:
import torch
import numpy as np

In [3]:
np.zeros([3, 3])

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [4]:
torch.zeros([3, 3], dtype=torch.long, device=torch.device('cpu'))

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])

In [5]:
np.random.rand(3, 3)

array([[0.60470409, 0.29962302, 0.01461841],
       [0.21101405, 0.96114071, 0.70313904],
       [0.59613189, 0.80732797, 0.88849557]])

In [6]:
torch.rand(3, 3)

tensor([[0.9865, 0.5884, 0.3125],
        [0.5655, 0.4635, 0.6931],
        [0.3914, 0.5660, 0.1078]])

In [7]:
# np.float was removed in NumPy 1.24; np.float64 is the equivalent explicit dtype.
numpy_tensor = np.array([[1, 2], [3, 4]], dtype=np.float64)
numpy_tensor

array([[1., 2.],
       [3., 4.]])

In [8]:
torch_tensor = torch.tensor([[1, 2], [3, 4]], dtype=torch.float)
torch_tensor

tensor([[1., 2.],
        [3., 4.]])

In [9]:
numpy_tensor.shape

(2, 2)

In [10]:
torch_tensor.shape

torch.Size([2, 2])

In [12]:
torch_tensor.numpy()

array([[1., 2.],
       [3., 4.]], dtype=float32)

In [13]:
torch.tensor(numpy_tensor)

tensor([[1., 2.],
        [3., 4.]], dtype=torch.float64)
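
One detail worth knowing when converting from NumPy: `torch.tensor` copies the data, whereas `torch.from_numpy` returns a tensor that shares memory with the source array. The sketch below shows the difference; the array values are arbitrary.

```python
import numpy as np
import torch

source = np.array([[1.0, 2.0], [3.0, 4.0]])

copied = torch.tensor(source)      # independent copy of the data
shared = torch.from_numpy(source)  # shares memory with the NumPy array

# Modify the NumPy array in place after both tensors were created.
source[0, 0] = 100.0

print(copied[0, 0].item())  # 1.0, the copy is unaffected
print(shared[0, 0].item())  # 100.0, the shared tensor sees the change
```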

### Basic operations with tensors

In [14]:
torch_tensor = torch.tensor([[1, 2], [3, 4]], dtype=torch.float)
torch_tensor

tensor([[1., 2.],
        [3., 4.]])

In [15]:
torch_tensor + torch_tensor

tensor([[2., 4.],
        [6., 8.]])

In [16]:
torch_tensor + 2

tensor([[3., 4.],
        [5., 6.]])

In [17]:
torch_tensor * torch_tensor

tensor([[ 1.,  4.],
        [ 9., 16.]])

In [18]:
torch_tensor.mm(torch_tensor)

tensor([[ 7., 10.],
        [15., 22.]])

In [21]:
torch.nn.init.normal_(torch_tensor)
torch_tensor

tensor([[ 1.3710, -1.6887],
        [-1.2106, -0.2434]])

### Working with shapes

In [20]:
torch_tensor = torch.tensor([[1, 2], [3, 4]], dtype=torch.float)
torch_tensor

tensor([[1., 2.],
        [3., 4.]])

In [21]:
torch_tensor.view(-1)

tensor([1., 2., 3., 4.])

In [22]:
torch_tensor[1, :]

tensor([3., 4.])

In [23]:
torch.cat([torch_tensor, torch_tensor], dim=1)

tensor([[1., 2., 1., 2.],
        [3., 4., 3., 4.]])

In [24]:
torch.unsqueeze(torch_tensor, 0)

tensor([[[1., 2.],
         [3., 4.]]])

In [25]:
torch.transpose(torch_tensor, 1, 0)

tensor([[1., 3.],
        [2., 4.]])

### Special tensor properties
All of these attributes relate to gradient-based optimization over tensors.

 - `.requires_grad`  Indicates that we want to compute the gradient for this tensor. PyTorch will then track all operations on it.
 - `.grad` After calling `y.backward()`, `x.grad` holds the gradient $\frac{dy}{dx}$ (provided `x.requires_grad` is `True`).
 - `.grad_fn` Reference to the function that created the tensor.

In [28]:
tt = torch.tensor([[1, 2], [3, 4]], dtype=torch.float, requires_grad=True)
tt

tensor([[1., 2.],
        [3., 4.]], requires_grad=True)

In [29]:
tt_m = tt * tt
tt_m

tensor([[ 1.,  4.],
        [ 9., 16.]], grad_fn=<MulBackward0>)

In [30]:
tt_m = tt_m.mean()
tt_m

tensor(7.5000, grad_fn=<MeanBackward1>)

In [33]:
tt_m.grad_fn

<MeanBackward1 at 0x7f82783cc5f8>

In [24]:
tt_m.requires_grad

True

In [26]:
tt.grad is None

True

Let's compute the gradient of `tt_m` with respect to all `torch.Tensor`s that have `requires_grad=True`.
To calculate the gradients, we need to call `tt_m.backward()`.  
This computes the gradient of `tt_m` with respect to each element $tt_x$ of `tt`, where $n$ is the number of elements in `tt`:

$$
\frac{\partial\, tt\_m}{\partial\, tt_x} = \frac{\partial}{\partial\, tt_x}\left[\frac{1}{n}\sum_{i=1}^{n} tt_i^2\right] = \frac{2}{n}\, tt_x
$$

In [27]:
tt_m.backward()
tt.grad

tensor([[0.5000, 1.0000],
        [1.5000, 2.0000]])
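
As a quick sanity check, the gradient computed by autograd can be compared against the closed-form expression $\frac{2}{n} tt_x$. This is a minimal, self-contained sketch that rebuilds `tt` from scratch; the name `expected` is introduced here only for illustration.

```python
import torch

tt = torch.tensor([[1, 2], [3, 4]], dtype=torch.float, requires_grad=True)
tt_m = (tt * tt).mean()
tt_m.backward()

# Closed-form gradient of mean(tt^2): 2 * tt / n, with n the number of elements.
n = tt.numel()
expected = 2 * tt.detach() / n

print(torch.allclose(tt.grad, expected))  # True
```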

This is how to stop collecting gradient information:

In [31]:
with torch.no_grad():
    print((tt * tt).requires_grad)

False
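
The sketch below expands on this slightly: tracking is only suspended inside the `with` block and resumes afterwards. It also shows `Tensor.detach()`, which is not used elsewhere in this notebook, as one way to get an untracked view of a single tensor.

```python
import torch

tt = torch.tensor([[1, 2], [3, 4]], dtype=torch.float, requires_grad=True)

with torch.no_grad():
    inside = tt * tt       # computed without recording the operation

outside = tt * tt          # tracking is active again outside the block
detached = outside.detach()  # same values, cut off from the autograd graph

print(inside.requires_grad, inside.grad_fn)   # False None
print(outside.requires_grad)                  # True
print(detached.requires_grad)                 # False
```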


## Neural Network Definition
PyTorch lets us define neural networks at several levels of abstraction. Let's explore them.

### Data

In [32]:
input_batch = torch.tensor([[0.20, 0.15],
                            [0.30, 0.20],
                            [0.86, 0.99],
                            [0.91, 0.88]])

label_batch = torch.tensor([[1.],
                            [1.],
                            [-1.],
                            [-1.]])

### Low level approach
Using just `torch.Tensor` and `torch.autograd`.

In [37]:
learning_rate = 1e-3
training_iterations = 55000

In [38]:
w1 = torch.randn(2, 1, dtype=torch.float, requires_grad=True, device=torch.device("cpu"))
w2 = torch.randn(1, 1, dtype=torch.float, requires_grad=True, device=torch.device("cpu"))
w1, w2

(tensor([[1.8436],
         [0.6168]], requires_grad=True),
 tensor([[0.5524]], requires_grad=True))

In [39]:
# After each iteration, we adjust the w1 and w2 parameters.
for training_iteration in range(training_iterations):
    # Forward pass through a simple network with 2 layers defined by w1 and w2.
    prediction = input_batch.mm(w1)
    prediction = torch.tanh(prediction)
    prediction = prediction.mm(w2)
    prediction = torch.tanh(prediction)
    
    # Use the mean squared error as the loss; backward() needs a single scalar.
    loss = (prediction - label_batch).pow(2).mean()
    if training_iteration % 5000 == 0:
        print(training_iteration, loss.item())

    # Compute the gradients of the loss with respect to w1 and w2.
    loss.backward()
    
    # We don't want to collect gradient information for optimization steps.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        # Clear the gradients for the next iteration; otherwise they would accumulate.
        w1.grad.zero_()
        w2.grad.zero_()

0 1.3778938055038452
5000 0.8985980153083801
10000 0.8597437143325806
15000 0.8263626098632812
20000 0.8027422428131104
25000 0.7764884829521179
30000 0.7401544451713562
35000 0.6867016553878784
40000 0.6095514893531799
45000 0.5110574960708618
50000 0.4074777364730835
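
The reason the gradients are zeroed in every iteration is that `backward()` adds to `.grad` rather than overwriting it. The following sketch demonstrates this on a hypothetical two-element tensor `w` that is not part of the network above.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2 * w.
(w * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
(w * w).sum().backward()
second = w.grad.clone()

# After zeroing, a backward pass gives the plain gradient again.
w.grad.zero_()
(w * w).sum().backward()
third = w.grad.clone()

print(first)   # tensor([2., 4.])
print(second)  # tensor([4., 8.])
print(third)   # tensor([2., 4.])
```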


In [40]:
# Check predictions.
prediction = input_batch.mm(w1)
prediction = torch.tanh(prediction)
prediction = prediction.mm(w2)
prediction = torch.tanh(prediction)
prediction

tensor([[ 0.1628],
        [ 0.3971],
        [-0.9179],
        [-0.5615]], grad_fn=<TanhBackward>)

In [41]:
torch.save({'w1': w1, 'w2': w2}, './ckpt.pth')

In [42]:
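state_dict = torch.load('./ckpt.pth')
# Copy the saved values into the existing tensors without recording the copy in autograd.
with torch.no_grad():
    w1.copy_(state_dict['w1'])
    w2.copy_(state_dict['w2'])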

### Container approach
Integrating `torch.nn.Module` container.

In [46]:
learning_rate = 1e-3
training_iterations = 55000

In [47]:
class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        # Plain tensors must be wrapped in torch.nn.Parameter to be registered as trainable parameters of this Module.
        self.w1 = torch.nn.Parameter(torch.randn(2, 1, dtype=torch.float, requires_grad=True, device=torch.device("cpu")))
        self.w2 = torch.nn.Parameter(torch.randn(1, 1, dtype=torch.float, requires_grad=True, device=torch.device("cpu")))
        
    def forward(self, input_batch):
        prediction = input_batch.mm(self.w1)
        prediction = torch.tanh(prediction)
        prediction = prediction.mm(self.w2)
        prediction = torch.tanh(prediction)
        return prediction

simple_nn = SimpleNN()

In [48]:
list(simple_nn.parameters())

[Parameter containing:
 tensor([[-1.0158],
         [ 0.2822]], requires_grad=True), Parameter containing:
 tensor([[0.2751]], requires_grad=True)]

In [49]:
for training_iteration in range(training_iterations):
    prediction = simple_nn(input_batch)
    
    loss = (prediction - label_batch).pow(2).mean()
    if training_iteration % 5000 == 0:
        print(training_iteration, loss.item())

    simple_nn.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in simple_nn.parameters():
            p -= p.grad * learning_rate


0 0.9155999422073364
5000 0.8290987610816956
10000 0.8114342093467712
15000 0.7887741923332214
20000 0.7575080394744873
25000 0.7120348215103149
30000 0.6451467871665955
35000 0.5541778206825256
40000 0.45018261671066284
45000 0.3520523011684418
50000 0.27135515213012695


In [50]:
simple_nn(input_batch)

tensor([[ 0.2693],
        [ 0.5811],
        [-0.9699],
        [-0.6380]], grad_fn=<TanhBackward>)

### Container approach with torch.nn and torch.optim

In [51]:
from torch.optim import SGD
from torch.nn import Linear, MSELoss, Tanh

In [52]:
learning_rate = 1e-3
training_iterations = 55000

In [53]:
class SimpleNN(torch.nn.Module):
    def __init__(self):
        super(SimpleNN, self).__init__()
        self.layer_1 = Linear(2, 1)
        self.layer_2 = Linear(1, 1)
        
    def forward(self, input_batch):
        prediction = self.layer_1(input_batch)
        prediction = torch.tanh(prediction)
        prediction = self.layer_2(prediction)
        prediction = torch.tanh(prediction)
        return prediction

simple_nn = SimpleNN()

In [54]:
list(simple_nn.parameters())

[Parameter containing:
 tensor([[-0.5188, -0.4814]], requires_grad=True), Parameter containing:
 tensor([-0.3630], requires_grad=True), Parameter containing:
 tensor([[-0.2864]], requires_grad=True), Parameter containing:
 tensor([0.6903], requires_grad=True)]

In [56]:
loss_fce = MSELoss(reduction='sum')

In [57]:
optimizer = SGD(simple_nn.parameters(), lr=learning_rate, momentum=0.9)
optimizer

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.001
    momentum: 0.9
    nesterov: False
    weight_decay: 0
)

In [58]:
for training_iteration in range(training_iterations):
    prediction = simple_nn(input_batch)
    
    loss = loss_fce(prediction, label_batch)
    if training_iteration % 5000 == 0:
        print(training_iteration, loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 6.203975677490234
5000 0.0017448768485337496
10000 0.0008221292519010603
15000 0.0005331849679350853
20000 0.0003930902748834342
25000 0.0003106834483332932
30000 0.00025651470059528947
35000 0.00021826289594173431
40000 0.00018983696645591408
45000 0.00016788342327345163
50000 0.00015043055464047939


In [59]:
simple_nn(input_batch)

tensor([[ 0.9965],
        [ 0.9927],
        [-0.9948],
        [-0.9934]], grad_fn=<TanhBackward>)

In [None]:
simple_nn.load_state_dict(simple_nn.state_dict())
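
In practice, a state dict is usually written to disk and loaded into a freshly constructed model. The sketch below shows one possible round trip, assuming a temporary directory and a hypothetical helper `build_model` that recreates the same architecture; neither is part of the notebook above.

```python
import os
import tempfile

import torch
from torch.nn import Linear, Sequential, Tanh


def build_model():
    # Hypothetical helper: same architecture as the two-layer network in this section.
    return Sequential(Linear(2, 1), Tanh(), Linear(1, 1), Tanh())


trained = build_model()
restored = build_model()  # starts with different random weights

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "simple_nn.pth")
    torch.save(trained.state_dict(), path)       # write only the parameters
    restored.load_state_dict(torch.load(path))   # load them into the new model

x = torch.tensor([[0.20, 0.15], [0.86, 0.99]])
print(torch.equal(trained(x), restored(x)))  # True
```

Saving the state dict rather than the whole model object keeps the checkpoint independent of how the class is defined in code.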

### Container approach with torch.nn.Sequential

In [62]:
learning_rate = 1e-3
training_iterations = 55000

In [61]:
simple_nn_seq = torch.nn.Sequential(
    Linear(2, 1),
    Tanh(),
    Linear(1, 1),
    Tanh()
)

In [63]:
loss_fce = MSELoss(reduction='sum')
optimizer = SGD(simple_nn_seq.parameters(), lr=learning_rate, momentum=0.9)

In [64]:
for training_iteration in range(training_iterations):
    prediction = simple_nn_seq(input_batch)
    
    loss = loss_fce(prediction, label_batch)
    if training_iteration % 5000 == 0:
        print(training_iteration, loss.item())

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

0 4.6270904541015625
5000 0.0017423416720703244
10000 0.0008255556458607316
15000 0.0005363304517231882
20000 0.0003957200678996742
25000 0.0003129152173642069
30000 0.00025844344054348767
35000 0.00021993886912241578
40000 0.0001912871957756579
45000 0.00016916720778681338
50000 0.00015160514158196747


In [65]:
simple_nn_seq(input_batch)

tensor([[ 0.9963],
        [ 0.9927],
        [-0.9948],
        [-0.9934]], grad_fn=<TanhBackward>)

## Custom layers

In [78]:
class CustomReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input

# apply is a class-level method of torch.autograd.Function; no instance is needed.
custom_relu = CustomReLU.apply

In [79]:
custom_relu(torch.tensor([-1, 0, 1]))

tensor([0, 0, 1])
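
To check that the hand-written `backward` behaves as intended, its gradient can be compared with that of the built-in `torch.relu`. This is a sketch under the assumption that the inputs stay away from 0, where ReLU is not differentiable and the two implementations may pick different subgradients. The class is repeated so that the snippet is self-contained.

```python
import torch


class CustomReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0  # no gradient flows where the input was negative
        return grad_input


# Floating-point inputs are required for gradients; values avoid 0 on purpose.
x_custom = torch.tensor([-1.5, 0.5, 2.0], requires_grad=True)
x_builtin = x_custom.detach().clone().requires_grad_(True)

CustomReLU.apply(x_custom).sum().backward()
torch.relu(x_builtin).sum().backward()

print(x_custom.grad)                               # tensor([0., 1., 1.])
print(torch.equal(x_custom.grad, x_builtin.grad))  # True
```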