# PyTorch Demo

This notebook provides a working demo of PyTorch. It is based on the official tutorials listed in the References section.

There are three main sections. The first reviews the basics, the second gives an overview of skorch, and the third explains the details of a broadcast-based distributed example.

## Basics Review

In [6]:
import torch
import numpy as np

print( 'initialize from Python list of list' )
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)
print(x_data)


print( '\ninitialize from NumPy array')
np_array = np.array(data)
x_np = torch.from_numpy(np_array)
print(x_np)

print( '\ninitialize using builtin functions')
shape = (2,3,)
rand_tensor = torch.rand(shape)
print(rand_tensor)
ones_tensor = torch.ones(shape)
print(ones_tensor)
zeros_tensor = torch.zeros(shape)
print(zeros_tensor)

initialize from Python list of list
tensor([[1, 2],
        [3, 4]])

initialize from NumPy array
tensor([[1, 2],
        [3, 4]])

initialize using builtin functions
tensor([[0.5408, 0.1525, 0.6799],
        [0.5660, 0.9491, 0.9543]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[0., 0., 0.],
        [0., 0., 0.]])
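
One detail the cell above does not show is that `torch.from_numpy` shares memory with the source array, whereas `torch.tensor` makes a copy. The following is a minimal sketch of that difference; the variable names `shared` and `copied` are illustrative only.

```python
import numpy as np
import torch

np_array = np.array([[1, 2], [3, 4]])
shared = torch.from_numpy(np_array)  # shares memory with np_array
copied = torch.tensor(np_array)      # independent copy of the data

np_array[0, 0] = 100                 # modify the NumPy array in place

print(shared[0, 0].item())           # 100: the change is visible through the tensor
print(copied[0, 0].item())           # 1: the copy is unaffected
```

If the tensor must not change when the array does, `torch.tensor` (or `.clone()`) is the safer choice.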


In [9]:
print('tensor attributes')
tensor = torch.rand(2,3)
print(tensor.shape)
print(tensor.dtype)
print(tensor.device)

if torch.cuda.is_available():
  tensor = tensor.to('cuda')

tensor attributes
torch.Size([2, 3])
torch.float32
cpu
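
A common way to write the device check so that the same code runs with or without a GPU is sketched below. Note that `.to` returns a new tensor rather than moving the original in place, and that both operands of an operation must live on the same device. The names `device`, `other` and `result` are illustrative.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tensor = torch.rand(2, 3, device=device)  # created directly on the chosen device
other = torch.ones(2, 3).to(device)       # created on the CPU, then copied over

result = tensor + other                   # both operands are on the same device
print(result.device)
```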


In [10]:
print('tensor indexing and slicing')
tensor = torch.ones(2,3)
tensor[:,1] = 0
print(tensor)

tensor indexing and slicing
tensor([[1., 0., 1.],
        [1., 0., 1.]])


In [12]:
print('element-wise product')
t1 = torch.ones(2,2)*2
t2 = torch.ones(2,2)*3
print(t1.mul(t2))
print(t1*t2)

element-wise product
tensor([[6., 6.],
        [6., 6.]])
tensor([[6., 6.],
        [6., 6.]])


In [14]:
print('matrix multiplication')
print(t1.matmul(t2))
print(t1 @ t2)

matrix multiplication
tensor([[12., 12.],
        [12., 12.]])
tensor([[12., 12.],
        [12., 12.]])
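
Because `t1` and `t2` are constant matrices, the outputs above do not make the difference between the two products very visible. A small sketch with non-uniform values (chosen here purely for illustration) shows it more clearly:

```python
import torch

a = torch.tensor([[1., 2.],
                  [3., 4.]])
b = torch.tensor([[0., 1.],
                  [1., 0.]])  # multiplying by b from the right swaps the columns of a

elementwise = a * b  # multiplies matching entries
matmul = a @ b       # rows of a times columns of b

print(elementwise)   # tensor([[0., 2.], [3., 0.]])
print(matmul)        # tensor([[2., 1.], [4., 3.]])
```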


## Autograd

Many PyTorch operations, including autograd, are parallelized. Let's look at one of the basic operations of neural networks: differentiation with autograd.

In [31]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Create a new tensor Q that depends on a and b:

$$ Q = 3a^3 - b^2 $$

In [32]:
Q = 3*a**3 - b**2
print(Q)

tensor([-12.,  65.], grad_fn=<SubBackward0>)


Calculate the partial derivatives with respect to a and b:

$$ \frac{\partial Q}{\partial a} = 9a^2$$

$$ \frac{\partial Q}{\partial b} = -2b$$

The backward function of tensor Q calculates these partial derivatives (gradients) automatically and stores them in the .grad attribute of the respective tensors.

To calculate the gradients, backward multiplies the gradient argument it receives by the Jacobian matrix of Q with respect to the input tensors (a vector-Jacobian product). The argument may be omitted only when the output is a scalar, in which case it defaults to 1. Q is a two-element vector, so calling backward without an argument raises an error. We therefore pass a vector of ones with the same shape as Q.

In [33]:
Q.backward(gradient=torch.ones_like(Q))

In [42]:
torch.allclose( a.grad, 9*a**2 )

True

In [44]:
torch.allclose( b.grad, -2*b )

True
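
The role of the vector passed to backward can be checked with a short experiment. The sketch below assumes the same `a`, `b` and `Q` as above. It first shows that summing Q to a scalar gives the same gradients as passing a vector of ones, and then that a different weight vector (the values `[1., 0.5]` are arbitrary) scales each element's gradient.

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

# A scalar output needs no gradient argument.
Q = 3 * a**3 - b**2
Q.sum().backward()
grad_a_from_sum = a.grad.clone()
print(torch.allclose(grad_a_from_sum, 9 * a.detach()**2))  # True

# Gradients accumulate, so reset them before the second pass.
a.grad.zero_()
b.grad.zero_()

# Rebuild the graph and weight the two elements of Q differently.
Q = 3 * a**3 - b**2
weights = torch.tensor([1., 0.5])
Q.backward(gradient=weights)
print(a.grad)  # tensor([36.0000, 40.5000]) = weights * 9a^2
print(b.grad)  # tensor([-12.,  -4.])       = weights * (-2b)
```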

## Simple Neural Network

### Define the network

In [45]:
import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


In [46]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 3, 3])
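
The count of 10 comes from five layers with one weight tensor and one bias tensor each. A minimal sketch that lists them by name is given below; to stay self-contained it rebuilds the same five layers in an `nn.ModuleDict` (a stand-in for `Net`, not the class defined above).

```python
import torch.nn as nn

# Same layers as Net, collected in a ModuleDict so the names match.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 3),
    "conv2": nn.Conv2d(6, 16, 3),
    "fc1": nn.Linear(16 * 6 * 6, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer contributes a weight tensor and a bias tensor.
for name, p in layers.named_parameters():
    print(name, tuple(p.shape))

n_tensors = len(list(layers.parameters()))
n_values = sum(p.numel() for p in layers.parameters())
print(n_tensors, n_values)  # 10 81194
```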


In [47]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[ 0.0515, -0.0502, -0.1390, -0.0416,  0.1407, -0.0035,  0.0550,  0.0544,
          0.0698, -0.0672]], grad_fn=<AddmmBackward>)
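
The `16 * 6 * 6` input size of `fc1` follows from the 32x32 input used here. The sketch below traces the side length through the two convolution and pooling steps and then cross-checks the result against real layers. The helper names `conv_out` and `pool_out` are introduced for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output side length of max pooling with stride equal to the kernel size."""
    return (size - kernel) // kernel + 1


side = 32                   # input images are 32x32
side = conv_out(side, 3)    # conv1: 32 -> 30
side = pool_out(side, 2)    # pool:  30 -> 15
side = conv_out(side, 3)    # conv2: 15 -> 13
side = pool_out(side, 2)    # pool:  13 -> 6 (the odd last row/column is dropped)
flat_features = 16 * side * side
print(side, flat_features)  # 6 576

# Cross-check with real layers (weights are random; only the shapes matter).
x = torch.randn(1, 1, 32, 32)
x = F.max_pool2d(nn.Conv2d(1, 6, 3)(x), 2)
x = F.max_pool2d(nn.Conv2d(6, 16, 3)(x), 2)
print(tuple(x.shape))       # (1, 16, 6, 6)
```

An input of a different size would change `flat_features`, and `fc1` would then need a matching `in_features`.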


In [48]:
net.zero_grad()
out.backward(torch.randn(1, 10))

### Calculate loss function

In [50]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(1.2178, grad_fn=<MseLossBackward>)
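
With its default settings, `nn.MSELoss` returns the mean of the squared element-wise differences. A small sketch with hand-picked values (not taken from the network above) makes this easy to verify:

```python
import torch
import torch.nn as nn

output = torch.tensor([[1., 2., 3.]])
target = torch.tensor([[0., 2., 5.]])

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()  # (1 + 0 + 4) / 3

print(loss.item(), manual.item())
print(torch.allclose(loss, manual))       # True
```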


### Backpropagation

In [52]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0050,  0.0222, -0.0110,  0.0027, -0.0116, -0.0051])


### Update weights

In [57]:
learning_rate = 0.01
# update parameters in place without recording the update in the autograd graph
with torch.no_grad():
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)
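
The manual update above is plain stochastic gradient descent. In practice the same step is usually delegated to an optimizer from `torch.optim`. The sketch below uses a small `nn.Linear` model and random data as stand-ins (both are assumptions made for brevity) and checks that one `optim.SGD` step matches the manual rule.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_manual = nn.Linear(4, 2)
model_optim = copy.deepcopy(model_manual)  # identical starting weights
x, y = torch.randn(8, 4), torch.randn(8, 2)
criterion = nn.MSELoss()
learning_rate = 0.01

# Manual update: weight = weight - learning_rate * gradient
model_manual.zero_grad()
criterion(model_manual(x), y).backward()
with torch.no_grad():
    for p in model_manual.parameters():
        p -= learning_rate * p.grad

# The same step with an optimizer
optimizer = optim.SGD(model_optim.parameters(), lr=learning_rate)
optimizer.zero_grad()
criterion(model_optim(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(p, q)
    for p, q in zip(model_manual.parameters(), model_optim.parameters())
)
print(same)  # True
```

Using an optimizer also makes it easy to switch to other update rules, such as `optim.Adam`, without touching the training loop.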

## Skorch Overview

### Scikit-learn style NN classifier 

In [72]:
import numpy as np
from sklearn.datasets import make_classification
from torch import nn

from skorch import NeuralNetClassifier


X, y = make_classification(1000, 20, n_informative=10, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)

class MyModule(nn.Module):
    def __init__(self, num_units=10, nonlin=nn.ReLU()):
        super(MyModule, self).__init__()

        self.dense0 = nn.Linear(20, num_units)
        self.nonlin = nonlin
        self.dropout = nn.Dropout(0.5)
        self.dense1 = nn.Linear(num_units, num_units)
        self.output = nn.Linear(num_units, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, X, **kwargs):
        X = self.nonlin(self.dense0(X))
        X = self.dropout(X)
        X = self.nonlin(self.dense1(X))
        X = self.softmax(self.output(X))
        return X


net = NeuralNetClassifier(
    MyModule,
    max_epochs=10,
    lr=0.1,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)

net.fit(X, y)
y_proba = net.predict_proba(X)

predict_y = net.predict(X)
print('prediction performance %.0f%%' % (sum(predict_y == y) / X.shape[0] * 100 ))

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6946       0.5050        0.6852  0.0550
      2        0.6816       0.5450        0.6788  0.0379
      3        0.6682       0.5850        0.6723  0.0393
      4        0.6606       0.5900        0.6669  0.0403
      5        0.6529       0.5950        0.6573  0.0414
      6        0.6400       0.6000        0.6470  0.0366
      7        0.6350       0.6200        0.6415  0.0381
      8        0.6217       0.6450        0.6330  0.0400
      9        0.6083       0.6650        0.6278  0.0351
     10        0.5967       0.6850        0.6175  0.0363
prediction performance 71%


### Grid search 

In [73]:
from sklearn.model_selection import GridSearchCV

# deactivate skorch-internal train-valid split and verbose logging
net.set_params(train_split=False, verbose=0)
params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20],
    'module__num_units': [10, 20],
}
gs = GridSearchCV(net, params, refit=False, cv=3, scoring='accuracy', verbose=2)

gs.fit(X, y)
print("best score: {:.3f}, best params: {}".format(gs.best_score_, gs.best_params_))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] max_epochs=10, lr=0.01, module__num_units=10 ....................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..... max_epochs=10, lr=0.01, module__num_units=10, total=   0.2s
[CV] max_epochs=10, lr=0.01, module__num_units=10 ....................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s


[CV] ..... max_epochs=10, lr=0.01, module__num_units=10, total=   0.2s
[CV] max_epochs=10, lr=0.01, module__num_units=10 ....................
[CV] ..... max_epochs=10, lr=0.01, module__num_units=10, total=   0.3s
[CV] max_epochs=10, lr=0.01, module__num_units=20 ....................
[CV] ..... max_epochs=10, lr=0.01, module__num_units=20, total=   0.3s
[CV] max_epochs=10, lr=0.01, module__num_units=20 ....................
[CV] ..... max_epochs=10, lr=0.01, module__num_units=20, total=   0.2s
[CV] max_epochs=10, lr=0.01, module__num_units=20 ....................
[CV] ..... max_epochs=10, lr=0.01, module__num_units=20, total=   0.3s
[CV] max_epochs=20, lr=0.01, module__num_units=10 ....................
[CV] ..... max_epochs=20, lr=0.01, module__num_units=10, total=   0.5s
[CV] max_epochs=20, lr=0.01, module__num_units=10 ....................
[CV] ..... max_epochs=20, lr=0.01, module__num_units=10, total=   0.4s
[CV] max_epochs=20, lr=0.01, module__num_units=10 ....................
[CV] .

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    9.1s finished
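
The "8 candidates, totalling 24 fits" in the log follows directly from the grid: two values for each of three parameters, evaluated on three folds. The sketch below enumerates the grid with scikit-learn's `ParameterGrid`. Parameter names with a double underscore, such as `module__num_units`, are forwarded by skorch to the module's constructor.

```python
from sklearn.model_selection import ParameterGrid

params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20],
    'module__num_units': [10, 20],
}

grid = ParameterGrid(params)
for candidate in grid:
    print(candidate)            # one dict per parameter combination

n_candidates = len(grid)        # 2 * 2 * 2 = 8
n_fits = n_candidates * 3       # cv=3 folds per candidate
print(n_candidates, n_fits)     # 8 24
```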


## Broadcast Distributed Example

### Broadcast and gather

The process with rank 0 broadcasts data to all other processes.
![title](img/broadcast.png)

The process with rank 0 gathers the output from all other processes.
![title](img/gather.png)

[demo](https://github.com/cankav/pytorcher/blob/master/run_distributed_broadcast_gather.py)
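
The two collectives can be sketched with `torch.distributed` as follows. This is a minimal illustration under stated assumptions, not the code of the linked demo: the function name `broadcast_and_gather`, the toy per-rank computation, and the choice of the CPU `gloo` backend are all made up for this example.

```python
import torch
import torch.distributed as dist


def broadcast_and_gather(rank, world_size, init_method):
    """Run one broadcast/gather round; returns the gathered rows on rank 0."""
    # "gloo" is the CPU backend; init_method tells the processes how to find each other.
    dist.init_process_group("gloo", init_method=init_method,
                            rank=rank, world_size=world_size)
    try:
        # Broadcast: rank 0 owns the data, every other rank receives a copy in place.
        data = torch.arange(3.0) if rank == 0 else torch.zeros(3)
        dist.broadcast(data, src=0)

        # Each rank does its own (toy) work on the shared data.
        local_result = data * (rank + 1)

        # Gather: only the destination rank supplies receive buffers.
        buffers = [torch.zeros(3) for _ in range(world_size)] if rank == 0 else None
        dist.gather(local_result, gather_list=buffers, dst=0)
        return [b.tolist() for b in buffers] if rank == 0 else None
    finally:
        dist.destroy_process_group()
```

Every participating process calls the function with its own `rank` and the same `world_size` and `init_method` (for example a `file://` URL pointing to a shared, not yet existing file). With two processes, rank 0 would receive `[[0.0, 1.0, 2.0], [0.0, 2.0, 4.0]]` and rank 1 would receive `None`.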


### Embarrassingly parallel processing

![title](img/pytorch_distributed.png)


## References

* https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py
* https://github.com/skorch-dev/skorch
* https://github.com/cankav/pytorcher
* https://github.com/RyersonU-DataScienceLab/pomdp_solvers/blob/master/dist_utils/distributed_utils.py
* https://pytorch.org/tutorials/intermediate/dist_tuto.html