In [40]:
from __future__ import print_function
import torch

## Tensors

Tensors are the ``PyTorch`` equivalent of NumPy's ``ndarray``, with the addition that they can also be used on a GPU.

In [41]:
x = torch.empty(5,3)
print(x)

tensor([[2.5226e-18, 6.4825e-10, 1.0186e-11],
        [3.0880e-09, 1.6902e-04, 4.0975e-11],
        [4.1028e-08, 2.9574e-18, 6.7333e+22],
        [1.7591e+22, 1.7184e+25, 4.3222e+27],
        [6.1972e-04, 7.2443e+22, 1.7728e+28]])


In [42]:
torch.rand(5,3)

tensor([[0.2819, 0.5279, 0.3102],
        [0.3723, 0.9303, 0.2323],
        [0.5933, 0.9119, 0.7436],
        [0.9982, 0.1374, 0.9567],
        [0.9522, 0.8285, 0.8325]])

Every tensor has a ``dtype``. The integer types include ``torch.long`` (64-bit) and ``torch.short`` (16-bit):

In [43]:
x = torch.zeros(5, 3, dtype=torch.long)
print(x)

tensor([[0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0],
        [0, 0, 0]])
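
As a small sketch of how dtypes behave (the variable names are illustrative and not from the cells above), the ``.dtype`` attribute reports a tensor's type and ``.to(...)`` returns a converted copy:

```python
import torch

# torch.long is an alias for 64-bit integers, torch.short for 16-bit integers
a = torch.zeros(2, 3, dtype=torch.long)
b = torch.zeros(2, 3, dtype=torch.short)

# .to(...) returns a new tensor with the requested dtype; `a` is unchanged
c = a.to(torch.float32)

print(a.dtype, b.dtype, c.dtype)
```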


To create a tensor from data:

In [44]:
x = torch.tensor([5.5, 3])
print(x)

tensor([5.5000, 3.0000])


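Or, to create a new tensor based on an existing one (these methods reuse properties of the input, such as its ``dtype``, unless new values are given):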

In [45]:
#create tensor filled with ones:
x = x.new_ones(5, 3, dtype=torch.double) #64 bit floating
print(x)
x = torch.randn_like(x, dtype=torch.float) #random sampling
print(x)

tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]], dtype=torch.float64)
tensor([[ 0.0258, -0.7566,  0.9422],
        [-0.7360, -0.2328,  1.2412],
        [-1.1186, -1.7612,  0.4196],
        [ 0.8616,  0.3666, -1.0637],
        [-0.8979,  0.5322,  0.6360]])


``.size()`` returns the shape of the tensor as a ``torch.Size``, which is a subclass of **tuple**:

In [46]:
print(x.size())

torch.Size([5, 3])
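
Because the result behaves like a tuple, it can be unpacked or compared directly. A minimal sketch (``rows`` and ``cols`` are names chosen here for illustration):

```python
import torch

x = torch.zeros(5, 3)
size = x.size()

# torch.Size subclasses tuple, so unpacking works as usual
rows, cols = size
print(isinstance(size, tuple), rows, cols)

# .shape is an attribute that holds the same information
print(x.shape == size)
```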


### Tensor Operations
``PyTorch`` supports multiple syntaxes for most operations. For addition:

In [47]:
y = torch.rand(5,3)
print(x+y)

print(torch.add(x,y))

tensor([[ 0.3526,  0.2073,  1.1615],
        [ 0.0977, -0.1242,  2.0381],
        [-0.7623, -1.2992,  0.6476],
        [ 0.8701,  0.8910, -0.5168],
        [-0.8436,  1.4339,  0.8724]])
tensor([[ 0.3526,  0.2073,  1.1615],
        [ 0.0977, -0.1242,  2.0381],
        [-0.7623, -1.2992,  0.6476],
        [ 0.8701,  0.8910, -0.5168],
        [-0.8436,  1.4339,  0.8724]])


In [48]:
result = torch.empty(5,3)
torch.add(x, y, out=result)
print(result)

tensor([[ 0.3526,  0.2073,  1.1615],
        [ 0.0977, -0.1242,  2.0381],
        [-0.7623, -1.2992,  0.6476],
        [ 0.8701,  0.8910, -0.5168],
        [-0.8436,  1.4339,  0.8724]])


In [49]:
y.add_(x)
print(y)

tensor([[ 0.3526,  0.2073,  1.1615],
        [ 0.0977, -0.1242,  2.0381],
        [-0.7623, -1.2992,  0.6476],
        [ 0.8701,  0.8910, -0.5168],
        [-0.8436,  1.4339,  0.8724]])


The method above is an **in-place** operation. Any method that mutates a tensor in place has a trailing ``_``; for example, ``x.copy_(y)`` and ``x.t_()`` both change ``x``.
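
To make the difference concrete, here is a short sketch contrasting ``add`` with ``add_`` on small made-up tensors, followed by ``copy_`` and ``t_``:

```python
import torch

x = torch.ones(2, 3)
y = torch.full((2, 3), 2.0)

# Out-of-place: returns a new tensor and leaves x untouched
z = x.add(y)
print(x[0, 0].item(), z[0, 0].item())

# In-place: the trailing underscore means x itself is modified
x.add_(y)
print(x[0, 0].item())

# copy_ overwrites x with the values of y; t_ transposes x in place
x.copy_(y)
x.t_()
print(x.shape)
```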

## Automatic Differentiation
The ``autograd`` package is central to all neural networks in PyTorch. It provides automatic differentiation for all operations on tensors.

For ``autograd`` to track a tensor, its ``.requires_grad`` attribute must be set to ``True``. All operations on that tensor are then recorded. After the computation, calling ``.backward()`` computes all gradients automatically, and the gradient for each such tensor is accumulated into its ``.grad`` attribute.

If you don't want to track history (and spend memory on it), you can wrap the block of code in ``with torch.no_grad():``.
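
A minimal sketch of the effect (``tracked`` and ``untracked`` are illustrative names): the same multiplication is recorded outside the block and not recorded inside it.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

# Outside no_grad: the operation is recorded in the graph
tracked = x * 2

# Inside no_grad: the same operation is not recorded
with torch.no_grad():
    untracked = x * 2

print(tracked.requires_grad, untracked.requires_grad)
print(untracked.grad_fn)
```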

## Functions
``Tensors`` and ``functions`` are interconnected and build up the graph that encodes a complete history of computations. Every ``tensor`` has a ``.grad_fn`` attribute that references the ``function`` that created it, except for tensors created by the user, whose ``.grad_fn`` is ``None``.

To compute the derivatives, call ``.backward()`` on a ``tensor``. If the ``tensor`` is a scalar, ``.backward()`` needs no arguments. If it has more elements, a ``gradient`` argument of matching shape has to be specified.

Examples:

In [50]:
#create tensor with gradient
x = torch.ones(2,2, requires_grad=True)
print(x)

tensor([[1., 1.],
        [1., 1.]], requires_grad=True)


In [51]:
y = x + 2
print(y)

tensor([[3., 3.],
        [3., 3.]], grad_fn=<AddBackward0>)


As ``y`` was created through an operation, it has ``grad_fn``.

In [52]:
print(y.grad_fn)

<AddBackward0 object at 0x1251de4d0>


If we do more operations:

In [53]:
z = y * y * 3
out = z.mean()

print(z, out)

tensor([[27., 27.],
        [27., 27.]], grad_fn=<MulBackward0>) tensor(27., grad_fn=<MeanBackward0>)


``.requires_grad_(...)`` changes a tensor's ``requires_grad`` flag in place. For a tensor created without the flag, it defaults to ``False``.

In [54]:
a = torch.randn(2, 2)
a = ((a * 3) / (a - 1))
print(a.requires_grad)
a.requires_grad_(True)
print(a.requires_grad)
b = (a * a).sum()
print(b.grad_fn)

False
True
<SumBackward0 object at 0x1251de910>


## Gradients
``out`` is a single scalar, so we use ``out.backward()`` as an equivalent to ``out.backward(torch.tensor(1.))``.

In [55]:
out.backward()

In [56]:
print(x.grad)

tensor([[4.5000, 4.5000],
        [4.5000, 4.5000]])


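Our function is:
$$out = \frac{1}{4}\sum_i 3(x_i + 2)^2$$

This function combines all the operations conducted on our tensor up to this point: adding 2, squaring, multiplying by 3, and taking the mean over the 4 elements.

Therefore, the derivative is $\frac{\partial out}{\partial x_i} = \frac{3}{2}(x_i + 2)$, which equals **4.5** when each element of our ``tensor`` is equal to 1, as shown above.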
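
As a quick check of the value 4.5, the sketch below repeats the computation and compares ``x.grad`` with a hand-derived gradient. The mean over 4 elements contributes a factor of 1/4, so the derivative with respect to each element is (3/2)(x_i + 2). The name ``analytic`` is introduced here for illustration.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)

# Same computation as in the cells above: out = mean(3 * (x + 2)^2)
out = (3 * (x + 2) ** 2).mean()
out.backward()

# Hand-derived gradient: d(out)/dx_i = (3/2) * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(x.grad)
print(torch.allclose(x.grad, analytic))
```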

More generally, ``torch.autograd`` is an engine for computing vector-Jacobian products. The actual math can be viewed in the documentation; here is an example:

In [57]:
torch.randn(3)

tensor([-0.1798,  1.2019,  2.2834])

In [58]:
x = torch.randn(3, requires_grad=True) #random 3 numbers

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

tensor([ 795.3996, 1075.8384, -813.3011], grad_fn=<MulBackward0>)


``y`` is no longer a scalar, so ``torch.autograd`` cannot compute the full Jacobian directly. If we only want the vector-Jacobian product, we can pass the vector to ``backward`` as an argument:

In [59]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(v)

print(x.grad)

tensor([1.0240e+02, 1.0240e+03, 1.0240e-01])
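
In the cell above, ``y`` ends up as ``x`` multiplied by 1024, so ``x.grad`` is simply ``1024 * v``. The sketch below shows the same idea on a slightly less trivial function, chosen here for illustration: for ``y = x ** 2`` the Jacobian is diagonal with entries ``2 * x``, so the vector-Jacobian product is ``v * 2 * x``.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2  # element-wise, so the Jacobian is diag(2 * x)

v = torch.tensor([0.1, 1.0, 0.0001])
y.backward(v)  # accumulates the vector-Jacobian product into x.grad

# For a diagonal Jacobian, the product is simply v * 2 * x
expected = v * 2 * x.detach()
print(x.grad, expected)
```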


If you don't want gradients tracked for a tensor, you can use ``.detach()`` to get a new tensor with the same content that does not require gradients.

In [60]:
print(x.requires_grad)
y = x.detach()
print(y.requires_grad)
print(x.eq(y).all())

True
False
tensor(True)


## Neural Networks
Now, we get to put everything together. Neural networks can be constructed using the ``torch.nn`` package, which depends on ``autograd`` to define models and differentiate them.

An ``nn.Module`` contains layers and a method ``forward(input)`` that returns the ``output``.

The training procedure for neural networks is:
1. Define the network that has some learnable parameters (weights)
2. Iterate over dataset of inputs
3. Process input through network
4. Compute the loss function (how far output is from being correct)
5. Propagate gradients back into the network parameters
6. Update the weights
    - Typically done by using update rule: 
    - ``weight = weight - learning_rate * gradient``
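
The steps above can be sketched on a toy problem. Everything in this block is an assumption made for illustration (a made-up dataset following y = 2x + 1, a single ``nn.Linear`` layer, and a learning rate of 0.1); it is not the network used in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dataset (made up for illustration): y = 2x + 1
inputs = torch.linspace(-1, 1, 20).unsqueeze(1)
targets = 2 * inputs + 1

model = nn.Linear(1, 1)                # 1. network with learnable weights
criterion = nn.MSELoss()
learning_rate = 0.1

losses = []
for epoch in range(200):               # 2. iterate over the dataset (full batch here)
    output = model(inputs)             # 3. process input through the network
    loss = criterion(output, targets)  # 4. compute the loss
    model.zero_grad()                  # clear gradients from the previous pass
    loss.backward()                    # 5. propagate gradients back
    with torch.no_grad():              # 6. weight = weight - learning_rate * gradient
        for p in model.parameters():
            p -= learning_rate * p.grad
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The loss recorded in the last pass should be far smaller than the one in the first pass, since the model can represent the target line exactly.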

Now, for an example:

In [61]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square you can only specify a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
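
To see where ``in_features=576`` comes from, the sketch below traces tensor shapes through the two convolution and pooling stages. It assumes a single 32x32 one-channel input and rebuilds only the two convolution layers with the same arguments; the ``shapes`` list is a helper introduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

x = torch.randn(1, 1, 32, 32)
shapes = [tuple(x.shape)]

# 3x3 convolution without padding: 32 -> 30; 2x2 max pooling: 30 -> 15
x = F.max_pool2d(F.relu(conv1(x)), 2)
shapes.append(tuple(x.shape))

# 3x3 convolution: 15 -> 13; 2x2 max pooling rounds down: 13 -> 6
x = F.max_pool2d(F.relu(conv2(x)), 2)
shapes.append(tuple(x.shape))

# Flatten everything except the batch dimension: 16 * 6 * 6 = 576
flat = x.view(x.size(0), -1)
shapes.append(tuple(flat.shape))

print(shapes)
```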


In [62]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 3, 3])


In [63]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[-0.0037,  0.0943,  0.0899, -0.1341, -0.1210, -0.1063, -0.0518,  0.1004,
          0.0416,  0.0547]], grad_fn=<AddmmBackward>)


In [64]:
net.zero_grad()
out.backward(torch.randn(1, 10))

## Loss Functions
The ``loss function`` takes an (output, target) pair as input and computes a value that estimates how far away the output is from the target.

``nn`` provides many different loss functions; the simplest is ``nn.MSELoss``, which computes the **mean-squared error** between the output and the target.

In [65]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.9336, grad_fn=<MseLossBackward>)
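
As a sanity check on what ``nn.MSELoss`` computes, the sketch below compares it with the formula written out by hand. The random ``output`` and ``target`` tensors are stand-ins, not the network output from the cell above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)  # stand-in for a network output
target = torch.randn(1, 10)  # dummy target with the same shape

criterion = nn.MSELoss()  # the default reduction averages over all elements
loss = criterion(output, target)

# The same quantity written out by hand
manual = ((output - target) ** 2).mean()

print(loss.item(), manual.item())
```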


If we follow ``loss`` backward using its ``.grad_fn`` attribute, we see a graph of computations like:

input -> conv2d -> relu -> maxpool2d -> conv2d -> relu -> maxpool2d
      -> view -> linear -> relu -> linear -> relu -> linear
      -> MSELoss
      -> loss
      
To illustrate this, we will print some of the steps:

In [66]:
print(loss.grad_fn)
print(loss.grad_fn.next_functions[0][0])
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])

<MseLossBackward object at 0x1251b1690>
<AddmmBackward object at 0x1251b13d0>
<AccumulateGrad object at 0x1251b1690>


## Backprop
Backpropagating the error uses the ``loss.backward()`` command. Existing gradients need to be cleared first; otherwise, the new gradients will be accumulated into them.

In [67]:
net.zero_grad()

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0061,  0.0031, -0.0021, -0.0027, -0.0028, -0.0013])
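
The accumulation behaviour is easy to see on a tiny tensor. In this sketch (``w``, ``first``, and ``second`` are names chosen for illustration), calling ``backward`` twice without clearing doubles the stored gradient:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: d/dw sum(w^2) = 2w
(w * w).sum().backward()
first = w.grad.clone()

# Second backward pass without clearing: the new gradient is added on top
(w * w).sum().backward()
second = w.grad.clone()

# Clearing resets the accumulator before the next pass
w.grad.zero_()
print(first, second, w.grad)
```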


## Updating Weights
Stochastic Gradient Descent (**SGD**) is the simplest update rule:

$$\text{weight} = \text{weight} - \text{learning rate} \times \text{gradient}$$

In [68]:
learning_rate = 0.01
with torch.no_grad():  # the update itself should not be tracked by autograd
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)

However, for neural networks, we may need to use different update rules. The package ``torch.optim`` implements these methods:

In [69]:
import torch.optim as optim

#create optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01) #SGD update rule

#in training loop
optimizer.zero_grad() #zeros gradients
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() #updates
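
To connect the optimizer with the manual rule, the sketch below takes one step with ``optim.SGD`` (no momentum or weight decay) and one hand-written step on a copy of the same model, then compares the resulting weights. The small ``nn.Linear`` model and random data are assumptions made for illustration.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(3, 1)
model_b = copy.deepcopy(model_a)  # identical starting weights
x = torch.randn(8, 3)
t = torch.randn(8, 1)
criterion = nn.MSELoss()
lr = 0.01

# Model A: one step with the optimizer
optimizer = optim.SGD(model_a.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_a(x), t).backward()
optimizer.step()

# Model B: the same step written by hand
model_b.zero_grad()
criterion(model_b(x), t).backward()
with torch.no_grad():
    for p in model_b.parameters():
        p -= lr * p.grad

same = all(torch.allclose(pa, pb)
           for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same)
```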

## Training Classifiers

Now, *let's talk about data*. 

The process:
- Get data
- Convert to a ``np`` array
- Convert the array into a ``torch.*Tensor``
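
A minimal sketch of the last two steps, using made-up numbers in place of a real dataset. ``torch.from_numpy`` shares memory with the array, while ``.float()`` produces a separate ``float32`` copy, which matches the default dtype of ``nn`` layers.

```python
import numpy as np
import torch

# Made-up data standing in for a real dataset
data = [[1.0, 2.0], [3.0, 4.0]]

array = np.array(data)            # float64 by default
tensor = torch.from_numpy(array)  # shares memory with `array`
tensor32 = tensor.float()         # separate float32 copy

print(tensor.dtype, tensor32.dtype)

# Because memory is shared, a change to the array shows up in the tensor
array[0, 0] = 10.0
print(tensor[0, 0].item(), tensor32[0, 0].item())
```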

The example in the PyTorch tutorial is for image analysis, so it is not relevant to our project. Instead, I will adapt the example done in class (5 March 2020) to PyTorch.