# PyTorch tutorial - TP3 - IFT725

As stated in the assignment, you must copy the code blocks from the following tutorial

https://pytorch.org/tutorials/beginner/pytorch_with_examples.html

and give, for each block, a Markdown description of its contents.

In [0]:
# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 353.8751049329006
199 0.6403114503457334
299 0.001799462263410745
399 6.110526757931891e-06
499 2.2928506484629015e-08


This cell implements a fully connected neural network with a single hidden layer using only the NumPy library, so everything runs on the CPU. The input x has shape (64, 1000) and the target labels are stored in y, which has shape (64, 10).
The code runs 500 iterations (each one is an epoch, since the whole dataset forms a single batch). Each iteration performs a forward pass to compute y_pred, then a backward pass that computes the gradients of the loss with respect to the parameters. These gradients are used to update the weights w1 (shape (1000, 100)) and w2 (shape (100, 10)).
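
A hand-written backward pass is easy to get wrong, so it can help to compare it with a numerical estimate. The sketch below is not part of the tutorial: it wraps the same forward and backward formulas in a hypothetical helper `loss_and_grads`, uses small made-up dimensions, and checks the analytic gradients against central finite differences computed by a second hypothetical helper, `numerical_grad`.

```python
import numpy as np


def loss_and_grads(x, y, w1, w2):
    """Forward and backward pass of the two-layer ReLU network from the cell above."""
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)
    loss = np.square(y_pred - y).sum()

    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)
    return loss, grad_w1, grad_w2


def numerical_grad(f, w, eps=1e-6):
    """Central finite differences of the scalar f() with respect to each entry of w."""
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = f()
        w[idx] = old - eps
        f_minus = f()
        w[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad


# Small, hypothetical sizes so that the finite-difference loop stays cheap.
rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

_, grad_w1, grad_w2 = loss_and_grads(x, y, w1, w2)
num_w1 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w1)
num_w2 = numerical_grad(lambda: loss_and_grads(x, y, w1, w2)[0], w2)

max_err_w1 = np.abs(grad_w1 - num_w1).max()
max_err_w2 = np.abs(grad_w2 - num_w2).max()
print(max_err_w1, max_err_w2)
```

Both errors should be tiny compared with the gradient values. A large mismatch usually points to a wrong transpose or a missing ReLU mask in the backward pass.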

In [0]:
# -*- coding: utf-8 -*-

import torch

dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0")  # Runs on GPU; use the line above instead to run on CPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device, dtype=dtype)
w2 = torch.randn(H, D_out, device=device, dtype=dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)
    # Compute and print loss
    loss = (y_pred - y).pow(2).sum().item()
    if t % 100 == 99:
        print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

99 219.80177307128906
199 0.5563068389892578
299 0.003096770029515028
399 0.00011128463665954769
499 2.581826993264258e-05


This cell implements a fully connected neural network with a single hidden layer using the PyTorch library. We choose whether the code runs on the CPU or on a GPU through the variable device. The input x is now a tensor of shape (64, 1000), and the target labels are stored in y, which is also a tensor, of shape (64, 10).
The code runs 500 iterations (each one is an epoch, since the whole dataset forms a single batch). Each iteration performs a forward pass to compute the tensor y_pred, then a backward pass that computes the gradients of the loss with respect to the parameters. These gradients are used to update the weights stored in the tensors w1 (shape (1000, 100)) and w2 (shape (100, 10)).
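
The cell above hard-codes `cuda:0`, which fails on a machine without a CUDA-capable GPU. A common pattern, shown here as a minimal sketch rather than as the tutorial's own code, is to pick the device at run time with `torch.cuda.is_available()`.

```python
import torch

# Use the first GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tensors created with device=device live on the selected device.
x = torch.randn(64, 1000, device=device, dtype=torch.float)
w1 = torch.randn(1000, 100, device=device, dtype=torch.float)

# The result of an operation stays on the device of its operands.
h = x.mm(w1)
print(h.shape, h.device.type)
```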

In [0]:
# -*- coding: utf-8 -*-
import torch

dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0")  # Runs on GPU; use the line above instead to run on CPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Tensors during the backward pass.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Tensors; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Tensors.
    # Now loss is a 0-dimensional Tensor (its shape is ()).
    # loss.item() gets the scalar value held in the loss.
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Manually update weights using gradient descent. Wrap in torch.no_grad()
    # because weights have requires_grad=True, but we don't need to track this
    # in autograd.
    # An alternative way is to operate on weight.data and weight.grad.data.
    # Recall that tensor.data gives a tensor that shares the storage with
    # tensor, but doesn't track history.
    # You can also use torch.optim.SGD to achieve this.
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 500.32025146484375
199 2.181394100189209
299 0.01746232621371746
399 0.00039250816917046905
499 6.059363659005612e-05


The difference between this cell and the previous one is that the gradients of the parameters w1 and w2 are now computed by the autograd package. When loss.backward() is called, a gradient is computed for every tensor that contributed to the value of loss and has requires_grad=True, using the computation graph that was built while loss was being computed. Each gradient is stored in the .grad attribute of the corresponding tensor. After the call to loss.backward(), the gradients of w1 and w2 are therefore available, and all that remains is to update the weights and to reset the gradients to zero, because backward() accumulates into .grad instead of overwriting it.
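
The behaviour of `.grad` can be seen on a much smaller problem. The following sketch uses a made-up three-element tensor `w` and the loss `sum(w**2)`, whose gradient is `2 * w`. It also shows why the training loop zeroes the gradients: a second call to `backward()` adds to `.grad` instead of replacing it.

```python
import torch

# A tiny, hypothetical parameter tensor tracked by autograd.
w = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)

# First backward pass: d(sum(w^2))/dw = 2 * w.
loss = (w ** 2).sum()
loss.backward()
first_grad = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to w.grad.
loss = (w ** 2).sum()
loss.backward()
accumulated_grad = w.grad.clone()

# Reset the gradient in place, as the training loop does after each update.
w.grad.zero_()

print(first_grad, accumulated_grad, w.grad)
```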

In [0]:
# -*- coding: utf-8 -*-
import torch


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    @staticmethod
    def forward(ctx, input):
        """
        In the forward pass we receive a Tensor containing the input and return
        a Tensor containing the output. ctx is a context object that can be used
        to stash information for backward computation. You can cache arbitrary
        objects for use in the backward pass using the ctx.save_for_backward method.
        """
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.float
#device = torch.device("cpu")
device = torch.device("cuda:0")  # Runs on GPU; use the line above instead to run on CPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs.
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # To apply our Function, we use Function.apply method. We alias this as 'relu'.
    relu = MyReLU.apply

    # Forward pass: compute predicted y using operations; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    if t % 100 == 99:
        print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after updating weights
        w1.grad.zero_()
        w2.grad.zero_()

99 233.4364013671875
199 0.5397834777832031
299 0.0025395341217517853
399 0.0001115476552513428
499 2.7791535103460774e-05


This cell defines how the gradient of the ReLU function is computed, by creating the class MyReLU, which inherits from torch.autograd.Function. The class implements forward, which describes how the output is computed from the input and saves the input in the context object ctx so that it can be reused in the backward pass. It also implements backward, which describes how the gradient of the loss with respect to the input is computed from the gradient of the loss with respect to the output. To use the new function, we call it through the apply method inherited from torch.autograd.Function, which is aliased here to the variable relu.
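
One way to gain confidence in a custom autograd Function is to compare its gradient with that of the built-in operation. The sketch below repeats the MyReLU class without its docstrings and compares it with `torch.relu` on a few hand-picked values. The values are chosen away from 0, where ReLU is not differentiable and the two implementations are free to disagree.

```python
import torch


class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input):
        ctx.save_for_backward(input)
        return input.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


# Hypothetical test values, none of them equal to 0.
values = [-2.0, -0.5, 0.5, 3.0]
a = torch.tensor(values, requires_grad=True)
b = torch.tensor(values, requires_grad=True)

# Weight each output differently so that the upstream gradient is not all ones.
weights = torch.arange(1.0, 5.0)  # tensor([1., 2., 3., 4.])
(MyReLU.apply(a) * weights).sum().backward()
(torch.relu(b) * weights).sum().backward()

print(a.grad, b.grad)
```

The gradient is zero wherever the input is negative and equal to the upstream weight elsewhere, for both implementations.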

In [0]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the
    # loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can access its gradients like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad

99 3.2127058506011963
199 0.053353678435087204
299 0.0017388329142704606
399 7.844710489735007e-05
499 4.248996447131503e-06


This cell uses PyTorch's nn package, which provides an abstraction in terms of layers: each layer takes a tensor as input and produces a tensor as output, while holding its own learnable parameters. The variable model holds the definition of our network. It is an instance of the class torch.nn.Sequential made of three layers: a fully connected layer that produces the hidden layer, a ReLU layer that applies the ReLU nonlinearity, and a fully connected output layer that produces the prediction y_pred.
The forward pass is written y_pred = model(x), where y_pred holds the result of the forward pass through the whole network. To compute the error, the nn package provides many commonly used loss functions, so we only need to instantiate one, here loss_fn = torch.nn.MSELoss(reduction='sum'), and then apply it: loss = loss_fn(y_pred, y). For backpropagation we still call loss.backward() to compute the gradients of the learnable parameters. All that remains is to update each parameter.
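
To see which learnable parameters the layers hold, one can iterate over `model.named_parameters()`. This is a small sketch using the same layer sizes as the cell above. In a Sequential container the parameter names are prefixed with the index of the layer, and the ReLU layer at index 1 has no parameters.

```python
import torch

D_in, H, D_out = 1000, 100, 10

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# Map each parameter name to its shape. nn.Linear stores its weight as
# (out_features, in_features), the transpose of w1 and w2 in the earlier cells.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
n_params = sum(p.numel() for p in model.parameters())

for name, shape in shapes.items():
    print(name, shape)
print("total parameters:", n_params)
```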

In [0]:
# -*- coding: utf-8 -*-
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable
    # weights of the model). This is because by default, gradients are
    # accumulated in buffers (i.e., not overwritten) whenever .backward()
    # is called. Check out the docs of torch.autograd.backward for more details.
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

99 54.0734748840332
199 0.854391872882843
299 0.002743469551205635
399 1.43419219966745e-06
499 3.8414599279334993e-10


Until now, the network parameters were updated manually. In this cell the update is done with the torch.optim package. The code uses the Adam method, by instantiating the class torch.optim.Adam as follows: optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate). The first argument passed to the constructor is the set of parameters to update. Once the gradient of each parameter has been computed by calling loss.backward(), we only need to call the optimizer to update the parameters, which is done with optimizer.step().
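
To make the link with the manual updates of the earlier cells explicit, the sketch below applies a single optimizer step to a made-up two-element parameter. It deliberately uses torch.optim.SGD without momentum instead of Adam, because the plain SGD rule (parameter minus learning rate times gradient) is simple enough to check by hand. Adam's update involves running averages of the gradients and would not match this formula.

```python
import torch

# A tiny, hypothetical parameter and a plain SGD optimizer (no momentum).
w = torch.tensor([1.0, 2.0], requires_grad=True)
lr = 0.1
optimizer = torch.optim.SGD([w], lr=lr)

# Same three calls as in the training loop: zero_grad, backward, step.
loss = (w ** 2).sum()
optimizer.zero_grad()
loss.backward()

# Manual update computed before the step, for comparison: w - lr * grad.
expected = w.detach() - lr * w.grad

optimizer.step()
print(w.detach(), expected)
```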

In [0]:
# -*- coding: utf-8 -*-
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 2.512120485305786
199 0.04653482511639595
299 0.0018639536574482918
399 0.00010205141734331846
499 6.352063792292029e-06


In this cell we define the model class ourselves. The class, named TwoLayerNet, inherits from torch.nn.Module. In it we must declare the layers that the model will use and define how the forward pass is carried out. The rest of the code is practically identical to the previous cell.
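
Assigning an nn.Linear to an attribute in the constructor is enough for torch.nn.Module to register its parameters, which is why model.parameters() can be passed directly to the optimizer. The sketch below repeats the class in a compact form and lists the registered names, which follow the attribute names.

```python
import torch


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        # Submodules assigned as attributes are registered automatically.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        h_relu = self.linear1(x).clamp(min=0)
        return self.linear2(h_relu)


model = TwoLayerNet(1000, 100, 10)
param_names = [name for name, _ in model.named_parameters()]
y_pred = model(torch.randn(64, 1000))

print(param_names)
print(y_pred.shape)
```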

In [0]:
# -*- coding: utf-8 -*-
import random
import torch


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    if t % 100 == 99:
        print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

99 10.154526710510254
199 15.821758270263672
299 0.5771813988685608
399 0.7820126414299011
499 0.36820468306541443


This cell shows that PyTorch supports dynamic computation graphs. This is implemented in the forward function of the class that defines our model. In this example, the forward pass reapplies the middle layer a random number of times (from 0 to 3, since random.randint(0, 3) includes both bounds). This relies on the concept of weight sharing.

With this kind of dynamic model, one can implement recurrent neural networks such as LSTM, GRU, and so on.
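
Weight sharing can be checked on a toy case. In the sketch below, a hypothetical one-weight linear layer with weight 2 is applied three times to the input 1, so the output is 2^3 = 8. Because the same weight w is used at every application, autograd sums the contributions of the three uses, giving d(w^3)/dw = 3 * w^2 = 12.

```python
import torch

# A single shared layer with one weight and no bias, set to w = 2.
layer = torch.nn.Linear(1, 1, bias=False)
with torch.no_grad():
    layer.weight.fill_(2.0)


def repeated_output(k):
    """Apply the same layer k times to the input 1, as DynamicNet does with middle_linear."""
    h = torch.ones(1, 1)
    for _ in range(k):
        h = layer(h)
    return h


out = repeated_output(3)  # 1 * w * w * w
out.sum().backward()
grad = layer.weight.grad.item()

print(out.item(), grad)
```

The graph is rebuilt on every call, so a different value of k simply produces a different graph over the same parameter.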

In [0]:
import torch
import numpy as np
# x = np.random.randn(4, 3, 3)
# y = np.random.randn(2, 3, 3)
# x + y
dtype = torch.float
device = torch.device("cuda:0")  # Requires a CUDA-capable GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
# N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(64, 286, 286, device=device, dtype=dtype)
y = torch.randn(128, 286, 286, device=device, dtype=dtype)

In [0]:
x.shape[0]

64