# PyTorch Fundamentals, Quick Tutorial

This is just a transcription of the following tutorial, to keep everything in one place.

https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html

In [1]:
# !pip install torchvision

In [2]:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)

cuda:0


## Tensors

https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html

Tensors are multidimensional arrays that can be stored on the GPU.

Tensors also have a heck of a lot of methods for all sorts of purposes.
Autograd relies on many of these methods and attributes to run backprop automatically.

See https://pytorch.org/docs/stable/tensors.html#tensor-class-reference for a complete reference of tensor properties.

Creating tensors

In [3]:
import numpy as np
x = torch.tensor([[1,2], [3,4]])
print(x)
x_np = np.array([[1,2], [3,4]])
print(torch.from_numpy(x_np))
print(x.dtype)
print(x_np.dtype)

# The default integer dtype of NumPy arrays is platform dependent (int32 in this run, int64 on most Linux/macOS setups); in PyTorch it is int64.

tensor([[1, 2],
        [3, 4]])
tensor([[1, 2],
        [3, 4]], dtype=torch.int32)
torch.int64
int32
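
Because the default integer width can differ between NumPy and PyTorch, it can help to state the dtype explicitly. A minimal sketch follows; the names `x_np64`, `x_t` and `x_f` are illustrative and not part of the tutorial.

```python
import numpy as np
import torch

# Request 64-bit integers explicitly so the result does not depend on the platform default.
x_np64 = np.array([[1, 2], [3, 4]], dtype=np.int64)
x_t = torch.from_numpy(x_np64)   # from_numpy keeps the NumPy dtype
print(x_t.dtype)                 # torch.int64

# Convert an existing tensor to another dtype with .to()
x_f = x_t.to(torch.float32)
print(x_f.dtype)                 # torch.float32
```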


In [4]:
print(torch.rand((2,2)))
print(torch.ones((2,2)))
print(torch.zeros((2,2)))

tensor([[0.0356, 0.1533],
        [0.2645, 0.1802]])
tensor([[1., 1.],
        [1., 1.]])
tensor([[0., 0.],
        [0., 0.]])


Tensor attributes

In [5]:
tensor = torch.rand(3, 4)

print(f"Shape of tensor: {tensor.shape}",tensor.size() ,f"({tensor.shape[0], tensor.shape[1]})",
f"({tensor.size()[0], tensor.size()[1]})", f"({tensor.size(0), tensor.size(1)})")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4]) torch.Size([3, 4]) ((3, 4)) ((3, 4)) ((3, 4))
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


Tensor operations

In [6]:
if torch.cuda.is_available():
  tensor = tensor.to('cuda')
  print(f"Device tensor is stored on: {tensor.device}")

Device tensor is stored on: cuda:0


In [7]:
# TENSORS ON DIFFERENT DEVICES CANNOT BE COMBINED IN ONE OPERATION

cuda_tensor = torch.rand((3, 4), device='cuda')
cpu_tensor = torch.rand((3, 4))

print(cuda_tensor)
print(cpu_tensor)



cuda_tensor + cpu_tensor

tensor([[0.9129, 0.5463, 0.2383, 0.6018],
        [0.4221, 0.3849, 0.2910, 0.6693],
        [0.5347, 0.1561, 0.2432, 0.4349]], device='cuda:0')
tensor([[0.0768, 0.5267, 0.2505, 0.2638],
        [0.2097, 0.7665, 0.7592, 0.1751],
        [0.0771, 0.6114, 0.5066, 0.4600]])


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
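
One way to avoid this error is to move one operand to the other's device before combining them. The sketch below falls back to the CPU when no GPU is present; the names `a`, `b` and `total` are illustrative.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.rand((3, 4), device=device)   # on the GPU when one is available
b = torch.rand((3, 4))                  # always created on the CPU

# Move b to a's device first; .to() returns b itself if it is already there.
total = a + b.to(a.device)
print(total.device)
```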

In [None]:
tensor = torch.ones(4, 4)
tensor[:,1] = 0
t1 = torch.cat([tensor, tensor], dim=1)
print(t1)
t2 = torch.stack([tensor, tensor], dim=0)
print(t2)

tensor([[1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1.],
        [1., 0., 1., 1., 1., 0., 1., 1.]])
tensor([[[1., 0., 1., 1.],
         [1., 0., 1., 1.],
         [1., 0., 1., 1.],
         [1., 0., 1., 1.]],

        [[1., 0., 1., 1.],
         [1., 0., 1., 1.],
         [1., 0., 1., 1.],
         [1., 0., 1., 1.]]])
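
The difference between `torch.cat` and `torch.stack` is easiest to see in the resulting shapes. A small sketch (the variable names are illustrative):

```python
import torch

t = torch.ones(4, 4)

# cat joins along an existing dimension: 4 columns + 4 columns = 8 columns
cat_shape = torch.cat([t, t], dim=1).shape
print(cat_shape)     # torch.Size([4, 8])

# stack creates a new dimension whose size is the number of stacked tensors
stack_shape = torch.stack([t, t], dim=0).shape
print(stack_shape)   # torch.Size([2, 4, 4])
```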


Multiplication

In [None]:
# This computes the element-wise product
print(f"tensor.mul(tensor) \n {tensor.mul(tensor)} \n")
# Alternative syntax:
print(f"tensor * tensor \n {tensor * tensor}")

tensor.mul(tensor) 
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]]) 

tensor * tensor 
 tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])


In [None]:
print(f"tensor.matmul(tensor.T) \n {tensor.matmul(tensor.T)} \n")
# Alternative syntax:
print(f"tensor @ tensor.T \n {tensor @ tensor.T}")

tensor.matmul(tensor.T) 
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]]) 

tensor @ tensor.T 
 tensor([[3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.],
        [3., 3., 3., 3.]])


In-place operations: operations that have a _ suffix are in-place and modify the original tensor. For example, x.copy_(y) and x.t_() will change x.

In [None]:
print(tensor, "\n")
tensor.add_(5)
print(tensor)

tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]]) 

tensor([[6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.],
        [6., 5., 6., 6.]])
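
For contrast, here is a short sketch of the out-of-place `add` next to the in-place `add_`; the tensors `x` and `y` are illustrative.

```python
import torch

x = torch.ones(2, 2)
y = x.add(5)      # out-of-place: returns a new tensor, x is unchanged
print(x)          # still all ones

x.add_(5)         # in-place: x itself now holds 6s
print(torch.equal(x, y))   # True
```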


Careful: a tensor created from a NumPy array with torch.from_numpy, or a NumPy array obtained from a tensor with .numpy(), shares the same memory as its source (as long as the tensor is not on the GPU), so changing one changes the other.

In [None]:
n = np.ones(5)
t = torch.from_numpy(n)

np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")


t: tensor([2., 2., 2., 2., 2.], dtype=torch.float64)
n: [2. 2. 2. 2. 2.]


In [None]:
n = np.ones(5)
t = torch.from_numpy(n).to('cuda') # Moving the tensor to the GPU copies it into GPU memory, so it no longer shares memory with n

np.add(n, 1, out=n)
print(f"t: {t}")
print(f"n: {n}")

t: tensor([1., 1., 1., 1., 1.], device='cuda:0', dtype=torch.float64)
n: [2. 2. 2. 2. 2.]
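
If the sharing is not wanted on the CPU either, one option is to copy the data explicitly. This sketch uses `clone()`; the names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

n = np.ones(5)
shared = torch.from_numpy(n)           # shares memory with n
copied = torch.from_numpy(n).clone()   # clone() allocates new memory

np.add(n, 1, out=n)                    # modify the NumPy array in place
print(shared)   # follows n
print(copied)   # keeps the original values
```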


## Torch.Autograd

https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html

This is where the *heavy lifting* of the backprop step happens. This library lets you compute the gradients of activation and loss functions. More generally, it is an efficient automatic differentiation library for vector-valued computations.

Let’s take a look at a single training step. For this example, we load a pretrained resnet18 model from torchvision. We create a random data tensor to represent a single image with 3 channels and a height and width of 64, and its corresponding label initialized to some random values. Labels in pretrained models have shape (1, 1000).

*This tutorial works only on the CPU and will not work on the GPU (even if the tensors are moved to CUDA).*

In [None]:
import torch
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
data = torch.rand(1, 3, 64, 64)
labels = torch.rand(1, 1000)

Next, we run the input data through the model through each of its layers to make a prediction. This is the forward pass.

In [None]:
prediction = model(data) # forward pass

In [None]:
print(prediction) # Note that the tensor carries its gradient function as an attribute (grad_fn)

tensor([[-9.2230e-01, -4.3726e-01, -5.9561e-01, -1.6869e+00, -1.0063e+00,
          4.0592e-02, -4.4441e-01,  6.9468e-01,  4.4974e-01, -6.5165e-01,
         -1.0699e+00, -8.2337e-01, -3.2915e-01, -1.1731e+00, -1.0379e+00,
         -7.7191e-01, -7.1098e-01, -4.0526e-01, -3.8269e-01, -4.6440e-01,
         -1.4476e+00, -1.0471e+00, -1.5277e+00,  2.3398e-02, -8.1703e-01,
         -8.6946e-01, -7.1315e-01, -1.0303e+00, -6.6329e-01, -1.5779e-01,
         -8.8289e-01, -6.3415e-01, -5.2718e-01, -3.0054e-01, -4.1187e-01,
         -2.6593e-01,  9.4039e-01, -6.0331e-01, -5.4432e-01,  2.7224e-01,
         -5.0672e-01, -9.2970e-01, -8.1918e-01, -2.0005e-01, -6.5748e-01,
         -2.4849e-01, -4.6830e-01, -4.8448e-01, -1.2872e+00, -1.1275e+00,
         -1.5276e-01,  5.0956e-01, -1.5372e-01, -4.5040e-01, -1.3581e-01,
         -9.4228e-01, -3.4223e-01, -1.5179e+00, -4.8728e-01, -2.3587e-01,
          8.3719e-01,  7.0344e-02, -3.7713e-01, -2.2337e-02, -3.9677e-01,
         -5.4714e-02, -4.8682e-01, -2.

In [None]:
print(prediction.grad_fn)

<AddmmBackward0 object at 0x000001913DE06CD0>


We use the model’s prediction and the corresponding label to calculate the error (loss). The next step is to backpropagate this error through the network. Backward propagation is kicked off when we call .backward() on the error tensor. Autograd then calculates and stores the gradients for each model parameter in the parameter’s .grad attribute.

In [None]:
loss = (prediction - labels).sum()
loss.backward() # backward pass

In [None]:
print((prediction - labels).grad_fn)
print((prediction - labels).sum().grad_fn) # Note that each tensor records the last function that produced it.

# If a tensor X holds the parameters of a model layer, calling loss.backward() on the loss tensor makes autograd
# store the gradient of the loss with respect to X in X.grad, and likewise for every earlier layer
# (backprop in a single line of code).

<SubBackward0 object at 0x000001913DA6A7C0>
<SumBackward0 object at 0x000001913DE06F70>


Next, we load an optimizer, in this case SGD with a learning rate of 0.01 and momentum of 0.9. We register all the parameters of the model in the optimizer.

In [None]:
optim = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# The optimizer uses the gradients previously computed by loss.backward() to update the parameters.
# The exact update rule depends on the optimizer and its hyperparameters, but essentially all of them are
# variations of gradient descent.

Finally, we call .step() to initiate gradient descent. The optimizer adjusts each parameter by its gradient stored in .grad.

In [None]:
optim.step()
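
The whole training step can also be written in one cell. The sketch below uses a small `nn.Linear` as a hypothetical stand-in for resnet18 so that it runs without downloading weights. The toy loss is the one used above, and the shapes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)                 # hypothetical stand-in for resnet18
data = torch.rand(8, 4)
labels = torch.rand(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

before = model.weight.detach().clone()  # snapshot to compare after the step

optimizer.zero_grad()                   # clear gradients from any earlier step
prediction = model(data)                # forward pass
loss = (prediction - labels).sum()      # toy loss
loss.backward()                         # backward pass fills .grad
optimizer.step()                        # update the parameters using .grad

print(torch.equal(before, model.weight.detach()))   # False: the weights moved
```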

### Differentiation in Autograd

Let’s take a look at how autograd collects gradients. We create two tensors a and b with requires_grad=True. This signals to autograd that every operation on them should be tracked.

In [None]:
a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

We create another tensor Q from a and b.

$$Q = 3a^3 - b^2$$

In [None]:
Q = 3*a**3 - b**2

Let’s assume a and b to be parameters of an NN, and Q to be the error. In NN training, we want gradients of the error w.r.t. parameters, i.e.

$$\frac{\partial Q}{\partial a} = 9a^2$$
$$\frac{\partial Q}{\partial b} = -2b$$

When we call .backward() on Q, autograd calculates these gradients and stores them in the respective tensors’ .grad attribute.

We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself, i.e

$$\frac{\partial Q}{\partial Q} = 1$$

Equivalently, we can also aggregate Q into a scalar and call backward implicitly, like Q.sum().backward()

In [None]:
external_grad = torch.tensor([1., 1.])
Q.backward(gradient=external_grad)

In [None]:
# Gradients are now deposited in a.grad and b.grad

print(9*a**2 == a.grad)
print(-2*b == b.grad)
print(a.grad, b.grad)

tensor([True, True])
tensor([True, True])
tensor([36., 81.]) tensor([-12.,  -8.])
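
The scalar route mentioned above, `Q.sum().backward()`, should produce the same gradients. A self-contained sketch that recreates `a` and `b` so that no gradients are left over from the previous cells:

```python
import torch

a = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([6., 4.], requires_grad=True)

Q = 3 * a**3 - b**2
Q.sum().backward()   # same effect as Q.backward(gradient=torch.ones_like(Q))

print(a.grad)        # equals 9 * a**2
print(b.grad)        # equals -2 * b
```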


In [None]:
print(Q.grad_fn)
print((a**3).grad_fn)
print((torch.dot(a,b).grad_fn))

<SubBackward0 object at 0x000001914401ABB0>
<PowBackward0 object at 0x00000191051AA220>
<DotBackward0 object at 0x000001910539E520>


### Computational Graph

Conceptually, autograd keeps a record of data (tensors) & all executed operations (along with the resulting new tensors) in a directed acyclic graph (DAG) consisting of Function objects. In this DAG, leaves are the input tensors, roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

- run the requested operation to compute a resulting tensor, and

- maintain the operation’s gradient function in the DAG.

The backward pass kicks off when .backward() is called on the DAG root. autograd then:

- computes the gradients from each .grad_fn,

- accumulates them in the respective tensor’s .grad attribute, and using the chain rule, propagates all the way to the leaf tensors.

### Exclusion from DAG

torch.autograd tracks operations on all tensors which have their requires_grad flag set to True. For tensors that don’t require gradients, setting this attribute to False excludes it from the gradient computation DAG.

The output tensor of an operation will require gradients even if only a single input tensor has requires_grad=True.

In [None]:
x = torch.rand(5, 5)
y = torch.rand(5, 5)
z = torch.rand((5, 5), requires_grad=True)

a = x + y
print(f"Does `a` require gradients? : {a.requires_grad}")
b = x + z
print(f"Does `b` require gradients?: {b.requires_grad}")

Does `a` require gradients? : False
Does `b` require gradients?: True
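
Besides the `requires_grad` flag, two other common ways to keep a computation out of the graph are the `torch.no_grad()` context manager and `Tensor.detach()`. A short sketch; the tensor names are illustrative.

```python
import torch

z = torch.rand((5, 5), requires_grad=True)

with torch.no_grad():
    c = z * 2                   # not recorded in the graph
print(c.requires_grad)          # False

d = z.detach()                  # shares data with z but is cut off from the graph
print(d.requires_grad)          # False

e = z * 2                       # outside no_grad, tracking is on again
print(e.requires_grad)          # True
```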


In a NN, parameters that don’t compute gradients are usually called frozen parameters. It is useful to “freeze” part of your model if you know in advance that you won’t need the gradients of those parameters (this offers some performance benefits by reducing autograd computations).

Another common use case where exclusion from the DAG is important is finetuning a pretrained network (https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html).

In finetuning, we freeze most of the model and typically only modify the classifier layers to make predictions on new labels. Let’s walk through a small example to demonstrate this. As before, we load a pretrained resnet18 model, and freeze all the parameters.

In [None]:
from torch import nn, optim

model = resnet18(weights=ResNet18_Weights.DEFAULT)

# Freeze all the parameters in the network
for param in model.parameters():
    param.requires_grad = False

In [81]:
# Let’s say we want to finetune the model on a new dataset with 10 labels. 
# In resnet, the classifier is the last linear layer model.fc. 
# We can simply replace it with a new linear layer (unfrozen by default) that acts as our classifier.

model.fc = nn.Linear(512, 10)

# Now all parameters in the model, except the parameters of model.fc, are frozen. 
# The only parameters that compute gradients are the weights and bias of model.fc.

# Optimize only the classifier
optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Notice although we register all the parameters in the optimizer, 
# the only parameters that are computing gradients (and hence updated in gradient descent) 
# are the weights and bias of the classifier.
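
To check which parameters are still trainable after freezing, you can filter on `requires_grad`. The sketch below uses a small `nn.Sequential` as a hypothetical stand-in for resnet18 so that no download is needed. Passing only the trainable parameters to the optimizer is optional.

```python
import torch
from torch import nn, optim

# Hypothetical small model standing in for resnet18.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

for param in model.parameters():
    param.requires_grad = False        # freeze everything

model[2] = nn.Linear(16, 10)           # new classifier head, unfrozen by default

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)                       # ['2.weight', '2.bias']

# Optional: register only the trainable parameters in the optimizer.
optimizer = optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-2, momentum=0.9
)
```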


## Simple Neural Network

https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html

Neural networks can be constructed using the torch.nn package.

Now that you have had a glimpse of autograd: nn depends on autograd to define models and differentiate them. An nn.Module contains layers and a method forward(input) that returns the output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or weights)

- Iterate over a dataset of inputs

- Process input through the network

- Compute the loss (how far is the output from being correct)

- Propagate gradients back into the network’s parameters

- Update the weights of the network, typically using a simple update rule: weight = weight - learning_rate * gradient
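
The steps above can be sketched on a toy problem. Everything here is an assumption made for illustration: the synthetic data (`y = 3x + 1`, no noise), the learning rate and the number of iterations.

```python
import torch

# Synthetic dataset (assumed for illustration): y = 3x + 1, no noise.
inputs = torch.linspace(-1, 1, 20).unsqueeze(1)
targets = 3 * inputs + 1

# 1. Define the learnable parameters
weight = torch.zeros(1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
learning_rate = 0.1

losses = []
for step in range(100):                        # 2. iterate over the dataset
    output = inputs * weight + bias            # 3. process input
    loss = ((output - targets) ** 2).mean()    # 4. compute the loss
    loss.backward()                            # 5. propagate gradients back
    with torch.no_grad():                      # 6. weight = weight - lr * gradient
        weight -= learning_rate * weight.grad
        bias -= learning_rate * bias.grad
    weight.grad.zero_()                        # reset before the next iteration
    bias.grad.zero_()
    losses.append(loss.item())

print(losses[0], losses[-1])                   # the loss should shrink
print(weight.item(), bias.item())              # should approach 3 and 1
```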


In [11]:
import torch
import torch.nn as nn 
import torch.nn.functional as F

class Net(nn.Module): # Subclassing nn.Module gives the class access to useful torch methods.

    # Layers needed for training are defined as attributes of the instance.
    def __init__(self):
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)  # 5*5 from image dimension
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the size is a square, you can specify with a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = torch.flatten(x, 1) # flatten all dimensions except the batch dimension
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# You just have to define the forward function, and the backward function (where gradients are computed) 
# is automatically defined for you using autograd. You can use any of the Tensor operations in the forward function.
# The learnable parameters of a model are returned by net.parameters()

net = Net()
print(net)
print(net.parameters)  # prints the bound method itself; call net.parameters() to get the parameters

Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
<bound method Module.parameters of Net(
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)>


Let’s try a random 32x32 input. Note: expected input size of this net (LeNet) is 32x32. To use this net on the MNIST dataset, please resize the images from the dataset to 32x32.

In [12]:
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out)

tensor([[-0.0895,  0.0399,  0.0213,  0.1287, -0.0391,  0.1222,  0.0357, -0.0255,
          0.0209,  0.1036]], grad_fn=<AddmmBackward0>)


Zero the gradient buffers of all parameters and backpropagate with random gradients:

In [13]:
net.zero_grad()
out.backward(torch.randn(1, 10))

**Note**:

torch.nn only supports mini-batches. The entire torch.nn package only supports inputs that are a mini-batch of samples, and not a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension
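
A short sketch of adding the batch dimension with `unsqueeze(0)`; the layer and tensor names are illustrative. Some recent PyTorch releases also accept unbatched input for certain layers, but adding the dimension explicitly should work across versions.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 6, 5)

single = torch.randn(1, 32, 32)   # one sample: nChannels x Height x Width
batch = single.unsqueeze(0)       # add a fake batch dimension in front
print(batch.shape)                # torch.Size([1, 1, 32, 32])

out = conv(batch)
print(out.shape)                  # torch.Size([1, 6, 28, 28])
```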

Before proceeding further, let’s recap all the classes you’ve seen so far.

Recap:
- torch.Tensor - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.

- nn.Module - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.

- nn.Parameter - A kind of Tensor, that is automatically registered as a parameter when assigned as an attribute to a Module.

- autograd.Function - Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to functions that created a Tensor and encodes its history.

At this point, we covered:

- Defining a neural network

- Processing inputs and calling backward

Still left:

- Computing the loss

- Updating the weights of the network

### Loss Function

A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.

There are several different loss functions (https://pytorch.org/docs/nn.html#loss-functions) under the nn package. A simple one is nn.MSELoss, which computes the mean-squared error between the output and the target.

For example:

In [14]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example

# https://stackoverflow.com/questions/49643225/whats-the-difference-between-reshape-and-view-in-pytorch
# https://stackoverflow.com/questions/26998223/what-is-the-difference-between-contiguous-and-non-contiguous-arrays/26999092#26999092
target = target.view(1, -1)  # make it the same shape as output 
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(0.4964, grad_fn=<MseLossBackward0>)
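
With its default settings, `nn.MSELoss` averages the squared differences over all elements, which can be checked by hand. The sketch uses random stand-in tensors instead of the network output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)   # stand-in for net(input)
target = torch.randn(1, 10)

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean of the squared differences
print(torch.allclose(loss, manual))        # True
```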


So, when we call loss.backward(), the whole graph is differentiated w.r.t. the neural net parameters, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.

For illustration, let us follow a few steps backward

In [15]:
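print(loss.grad_fn)  # MSELoss
print(loss.grad_fn.next_functions[0][0])  # Linear (fc3, recorded as addmm)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # AccumulateGrad of a leaf parameter (the fc3 bias, addmm's first input), not ReLU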

<MseLossBackward0 object at 0x0000023B8005C940>
<AddmmBackward0 object at 0x0000023B89A95D60>
<AccumulateGrad object at 0x0000023B89A9A460>


#### Gradient Accumulation

In this subsection you will see why we need to zero out the gradients before each optimizer step.

In [3]:
a = torch.tensor([[1,2], [1,2]], requires_grad=True, dtype=torch.float32)
b = torch.tensor([[1,1], [1,1]], requires_grad=True, dtype=torch.float32)
l = torch.norm(a @ b)
print(l)
l.backward()
# This should be the gradients of the loss function with respect to a and b
print(a.grad)
print(b.grad)

tensor(6., grad_fn=<CopyBackwards>)
tensor([[1., 1.],
        [1., 1.]])
tensor([[1., 1.],
        [2., 2.]])


In [4]:
# If we compute the gradients again for the same loss, they should be the same, right? Not by default.
l = torch.norm(a @ b)
print(l)
l.backward()
# These are no longer the gradients of the loss: each value is twice the previous result
print(a.grad)
print(b.grad)

tensor(6., grad_fn=<CopyBackwards>)
tensor([[2., 2.],
        [2., 2.]])
tensor([[2., 2.],
        [4., 4.]])


In [6]:
# New gradients are added to the ones already stored in .grad, so the previous gradients must be set to zero first.
l = torch.norm(a @ b)
a.grad.zero_(); b.grad.zero_() # optimizer.zero_grad() does this for every parameter registered in the optimizer
l.backward()
# Now we have the correct gradients
print(a.grad)
print(b.grad)

tensor([[1., 1.],
        [1., 1.]])
tensor([[1., 1.],
        [2., 2.]])
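
The same reset can be done through an optimizer. In this sketch `a` and `b` are registered in a `torch.optim.SGD` instance (the learning rate is arbitrary, since `step()` is never called), and `zero_grad()` is called before every backward pass, so the gradients do not pile up.

```python
import torch

a = torch.tensor([[1., 2.], [1., 2.]], requires_grad=True)
b = torch.tensor([[1., 1.], [1., 1.]], requires_grad=True)
optimizer = torch.optim.SGD([a, b], lr=0.1)

for _ in range(3):
    optimizer.zero_grad()      # reset the gradients of a and b
    l = torch.norm(a @ b)
    l.backward()
    # optimizer.step() is not called, so a and b (and their gradients) stay the same

print(a.grad)   # same values as after a single backward pass
print(b.grad)
```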


### Backprop and Updating Weights

To backpropagate the error, all we have to do is call loss.backward(). You need to clear the existing gradients first, though; otherwise the new gradients are accumulated into the existing ones.

Now we shall call loss.backward(), and have a look at conv1’s bias gradients before and after the backward.

In [16]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([ 0.0109, -0.0071, -0.0042,  0.0141, -0.0094, -0.0060])


The simplest update rule used in practice is the Stochastic Gradient Descent (SGD):

weight = weight - learning_rate * gradient

We can implement this using simple Python code:

In [None]:
learning_rate = 0.01
for f in net.parameters():
    # .data bypasses autograd tracking; newer code usually wraps the update in `with torch.no_grad():`
    f.data.sub_(f.grad.data * learning_rate)

In [21]:
print(next(net.parameters()))

Parameter containing:
tensor([[[[-0.1566,  0.0341, -0.1621,  0.1380, -0.0677],
          [-0.0213, -0.0179, -0.1068, -0.1274, -0.0610],
          [-0.1949,  0.1491, -0.0670,  0.0372, -0.1481],
          [ 0.0995, -0.1996, -0.0411, -0.1645,  0.1799],
          [-0.1841, -0.1797,  0.1347, -0.1499, -0.0646]]],


        [[[ 0.0135,  0.0778, -0.1069,  0.1800, -0.0954],
          [ 0.0735, -0.1567,  0.1787, -0.0599,  0.1315],
          [-0.0134, -0.1391,  0.1753, -0.1073,  0.1062],
          [ 0.1628, -0.0809,  0.0987, -0.0721, -0.0882],
          [-0.1521, -0.1999, -0.0265, -0.1386,  0.0986]]],


        [[[ 0.1482, -0.1414, -0.0380, -0.1731, -0.1218],
          [ 0.0227, -0.1114, -0.0913,  0.0702, -0.1882],
          [-0.0413, -0.1908, -0.1205,  0.0140,  0.1962],
          [-0.0059, -0.0548,  0.1488, -0.0507, -0.0149],
          [-0.1564,  0.1860, -0.0853,  0.1160, -0.0699]]],


        [[[ 0.1244,  0.1516, -0.0403, -0.0849,  0.0339],
          [-0.1446,  0.0289, -0.0119,  0.0756, -0.0778

In [24]:
print(type(next(net.parameters()).data))
print(type(next(net.parameters()).data.sub_))

<class 'torch.Tensor'>
<class 'builtin_function_or_method'>


However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package: torch.optim that implements all these methods. Using it is very simple:

In [25]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update
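
As a sanity check, plain `optim.SGD` (no momentum, no weight decay) should give the same result as the manual rule `weight = weight - learning_rate * gradient`. The sketch compares the two on a small `nn.Linear`, a hypothetical stand-in for `net` with random data.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
net_a = nn.Linear(3, 2)           # hypothetical stand-in for the network
net_b = copy.deepcopy(net_a)      # identical starting weights
x = torch.randn(4, 3)
y = torch.randn(4, 2)
criterion = nn.MSELoss()
lr = 0.01

# Manual update: weight = weight - learning_rate * gradient
criterion(net_a(x), y).backward()
with torch.no_grad():
    for p in net_a.parameters():
        p -= lr * p.grad

# The same step through torch.optim
optimizer = optim.SGD(net_b.parameters(), lr=lr)
optimizer.zero_grad()
criterion(net_b(x), y).backward()
optimizer.step()

print(torch.allclose(net_a.weight, net_b.weight))   # True
print(torch.allclose(net_a.bias, net_b.bias))       # True
```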