# TP 2: Linear Algebra and Feedforward neural network
Master LiTL - 2021-2022

## Requirements
In this TP, we will go through some code to learn how to manipulate matrices and tensors, and we will look at PyTorch code that defines, trains and evaluates a simple neural network. 
The modules used are the same as in the previous session, *NumPy* and *scikit-learn*, with the addition of *PyTorch*. They are all already available in Colab. 

# Part 1: Linear Algebra

In this section, we will go through some Python code to deal with matrices and with tensors, the data structures used in PyTorch.

Sources:    
* Linear Algebra explained in the context of deep learning: https://towardsdatascience.com/linear-algebra-explained-in-the-context-of-deep-learning-8fcb8fca1494
* PyTorch tutorial: https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py
* PyTorch doc on tensors: https://pytorch.org/docs/stable/torch.html


## 1.1 NumPy arrays

NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type.
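To make "all of the same type" concrete, here is a minimal sketch (the variable names are chosen for illustration only): when the elements have different types, NumPy converts them all to a common type.

```python
import numpy as np

# Mixing ints and a float: NumPy upcasts every element to a common type.
mixed = np.array([1, 2, 3.5])
print(mixed, mixed.dtype)            # all elements are stored as float64

# The type can also be requested explicitly at creation time.
as_float32 = np.array([1, 2, 3], dtype=np.float32)
print(as_float32, as_float32.dtype)

# "Multidimensional": a 2 x 3 array has two axes.
m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.shape, m.ndim)
```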


In [None]:
import numpy as np

x = np.array([1,2])
print("Our input vector with 2 elements:\n", x)
print( "x shape:", x.shape)                

print( "x data type", x.dtype)
# Give a list of elements
# a = np.array(1,2,3,4)    # WRONG
# a = np.array([1,2,3,4])  # RIGHT

# Generate a random matrix (with a generator, for reproducible results)
rng = np.random.default_rng(seed=42)
W = rng.random((3, 2))
print("\n Our weight matrix, of shape 3x2:\n", W)
print( "W shape:", W.shape)
print( "W data type", W.dtype)

# Bias, a scalar
b = 1

# Now, try to multiply
h = W.dot(x) + b
print("\n Our h layer:\n", h)
print( "h shape:", h.shape)
print( "h data type", h.dtype)

Our input vector with 2 elements:
 [1 2]
x shape: (2,)
x data type int64

 Our weight matrix, of shape 3x2:
 [[0.77395605 0.43887844]
 [0.85859792 0.69736803]
 [0.09417735 0.97562235]]
W shape: (3, 2)
W data type float64

 Our h layer:
 [2.65171293 3.25333398 3.04542205]
h shape: (3,)
h data type float64


In [None]:
# Useful transformations
h = h.reshape((3,1))
print("\n h reshape:\n", h)
print( "h shape:", h.shape)

h1 = np.transpose(h)
print("\n h transpose:\n", h1)
print( "h shape:", h1.shape)

h2 = h.T
print("\n h transpose:\n", h2)
print( "h shape:", h2.shape)

Wt = W.T
print("\nW:\n", W)
print("\nW.T:\n", Wt)


 h reshape:
 [[2.65171293]
 [3.25333398]
 [3.04542205]]
h shape: (3, 1)

 h transpose:
 [[2.65171293 3.25333398 3.04542205]]
h shape: (1, 3)

 h transpose:
 [[2.65171293 3.25333398 3.04542205]]
h shape: (1, 3)

W:
 [[0.77395605 0.43887844]
 [0.85859792 0.69736803]
 [0.09417735 0.97562235]]

W.T:
 [[0.77395605 0.85859792 0.09417735]
 [0.43887844 0.69736803 0.97562235]]


In [None]:
## numpy code to create identity matrix
import numpy as np
a = np.eye(4)
print(a)

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


## 1.2 Tensors

For neural network implementations in PyTorch, we use tensors: 
* a specialized data structure that is very similar to arrays and matrices
* used to encode the inputs and outputs of a model, as well as the model’s parameters
* similar to NumPy’s ndarrays, except that tensors can run on GPUs or other specialized hardware to accelerate computing
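The closeness to NumPy can be seen in how the two libraries exchange data. The sketch below (variable names are illustrative) suggests that *torch.from_numpy* shares memory with the source array, whereas *torch.tensor* makes a copy.

```python
import numpy as np
import torch

arr = np.zeros(3)
shared = torch.from_numpy(arr)   # shares memory with arr
copied = torch.tensor(arr)       # copies the data

arr[0] = 7.0                     # modify the NumPy array in place
print(shared)                    # the change is visible in the tensor
print(copied)                    # the copy is unaffected

back = shared.numpy()            # back to NumPy (CPU tensors only)
print(type(back), back.dtype)
```

Sharing memory avoids a copy, but it also means that an in-place change on one side silently changes the other.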

### 1.2.1 Tensor initialization

▶ **Look at the documentation to create tensors from existing data structures.**

▶ **For each tensor created, print the tensor and its data type.**

In [None]:
import torch
import numpy as np

# Tensor initialization

## from any data. The data type is automatically inferred.
data = [[1, 2], [3, 4]]
# ...
x_data = torch.tensor(data)
print( "x_data", x_data)
print( "data type x_data=", x_data.dtype)

## from a numpy array specifically
np_array = np.array(data)
# ...
x_np = torch.from_numpy(np_array)
print("\nx_np", x_np)
print( "data type, np_array=", np_array.dtype, "x_np=", x_np.dtype)

x_data tensor([[1, 2],
        [3, 4]])
data type x_data= torch.int64

x_np tensor([[1, 2],
        [3, 4]])
data type, np_array= int64 x_np= torch.int64


You can also create tensors filled with specific values, e.g.:

▶ **Create a tensor with only 1s, another one with 0s with the specified *shape*.**

▶ **Create a tensor filled with random values with the specified *shape*.**

In [None]:
shape = (2, 3,) # shape is a tuple of tensor dimensions

ones_tensor = torch.ones(shape)
print(f"Ones Tensor: \n {ones_tensor} \n")

zeros_tensor = torch.zeros(shape)
print(f"Zeros Tensor: \n {zeros_tensor}")

rand_tensor = torch.rand(shape)
print(f"Random Tensor: \n {rand_tensor} \n")

Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]])
Random Tensor: 
 tensor([[0.3017, 0.3514, 0.5171],
        [0.3135, 0.5903, 0.6895]]) 



You can also create a new tensor with the same shape as an existing tensor, as below: filled with 1s, or with random values drawn uniformly from [0, 1).

In [None]:
## from another tensor
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"\nOnes Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")


Ones Tensor: 
 tensor([[1, 1],
        [1, 1]]) 

Random Tensor: 
 tensor([[0.9397, 0.5721],
        [0.7637, 0.9425]]) 



### 1.2.2 Tensor attributes

Summary of the main attributes of a tensor:
* shape
* data type
* device

In [None]:
# Tensor attributes
tensor = torch.rand(3, 4)

print(f"Shape of tensor: {tensor.shape}")
print(f"Datatype of tensor: {tensor.dtype}")
print(f"Device tensor is stored on: {tensor.device}")

Shape of tensor: torch.Size([3, 4])
Datatype of tensor: torch.float32
Device tensor is stored on: cpu


### 1.2.3 Move to GPU

For now, our tensors should be on CPU. We want to move them to GPU.

In [None]:
# We move our tensor to the GPU if available
if torch.cuda.is_available():
  tensor = tensor.to('cuda')
  print(f"Device tensor is stored on: {tensor.device}")
else:
  print("no gpu")

print(tensor)

Device tensor is stored on: cuda:0
tensor([[7.1274e-01, 6.0618e-05, 3.5706e-01, 3.7572e-03],
        [9.8634e-01, 8.7997e-01, 5.6103e-01, 4.0806e-01],
        [1.7913e-01, 2.8863e-01, 6.9963e-01, 7.3479e-01]], device='cuda:0')


**If you’re using Colab, allocate a GPU by going to Edit > Notebook Settings.**

▶▶ **Move to GPU, and re-run the last cells.**

In [None]:
import torch 

# We move our tensor to the GPU if available
if torch.cuda.is_available():
  tensor = tensor.to('cuda')
  print(f"Device tensor is stored on: {tensor.device}")
else:
  print("no gpu")

print(tensor)

Device tensor is stored on: cuda:0
tensor([[7.1274e-01, 6.0618e-05, 3.5706e-01, 3.7572e-03],
        [9.8634e-01, 8.7997e-01, 5.6103e-01, 4.0806e-01],
        [1.7913e-01, 2.8863e-01, 6.9963e-01, 7.3479e-01]], device='cuda:0')


### 1.2.4 Tensor operations

Doc: https://pytorch.org/docs/stable/torch.html

Slicing operations:

▶ **Check that the results of the code below correspond to what you expected.**

In [None]:
# Tensor operations: similar to numpy arrays

tensor = torch.ones(4, 4)
print(tensor)

# ---------------------------------------------------------
# TODO: What do you expect?
# ---------------------------------------------------------
## Slicing
print("\nSlicing")
tensor[:,1] = 0 
print(tensor)

tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]])

Slicing
tensor([[1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.],
        [1., 0., 1., 1.]])


▶ **Now modify the values in the first column.**

In [None]:
# ---------------------------------------------------------
# TODO: Replace the first column with the values in l
# ---------------------------------------------------------
l =[1.,2.,3.,4.] 
l = torch.tensor( l )
tensor[:, 0] = l
print(tensor)

tensor([[1., 0., 1., 1.],
        [2., 0., 1., 1.],
        [3., 0., 1., 1.],
        [4., 0., 1., 1.]])


Below are other important operations on tensors:
* concatenation
* multiplication

In [None]:
## Concatenation
print("\nConcatenate")
t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)

## Multiplication: element_wise
print("\nMultiply")
# This computes the element-wise product
t2 = tensor.mul(tensor)
print(f"tensor.mul(tensor) \n {t2} \n")
# Alternative syntax:
t3 = tensor * tensor
print(f"tensor * tensor \n {t3}")

## Matrix multiplication
t4 = tensor.matmul(tensor.T)
print(f"tensor.matmul(tensor.T) \n {t4} \n")
# Alternative syntax:
t5 = tensor @ tensor.T
print(f"tensor @ tensor.T \n {t5}")


Concatenate
tensor([[1., 0., 1., 1., 1., 0., 1., 1., 1., 0., 1., 1.],
        [2., 0., 1., 1., 2., 0., 1., 1., 2., 0., 1., 1.],
        [3., 0., 1., 1., 3., 0., 1., 1., 3., 0., 1., 1.],
        [4., 0., 1., 1., 4., 0., 1., 1., 4., 0., 1., 1.]])

Multiply
tensor.mul(tensor) 
 tensor([[ 1.,  0.,  1.,  1.],
        [ 4.,  0.,  1.,  1.],
        [ 9.,  0.,  1.,  1.],
        [16.,  0.,  1.,  1.]]) 

tensor * tensor 
 tensor([[ 1.,  0.,  1.,  1.],
        [ 4.,  0.,  1.,  1.],
        [ 9.,  0.,  1.,  1.],
        [16.,  0.,  1.,  1.]])
tensor.matmul(tensor.T) 
 tensor([[ 3.,  4.,  5.,  6.],
        [ 4.,  6.,  8., 10.],
        [ 5.,  8., 11., 14.],
        [ 6., 10., 14., 18.]]) 

tensor @ tensor.T 
 tensor([[ 3.,  4.,  5.,  6.],
        [ 4.,  6.,  8., 10.],
        [ 5.,  8., 11., 14.],
        [ 6., 10., 14., 18.]])


In [None]:
print( t1.device )

tensor = torch.ones(4, 4, device='cuda')
print(tensor)

print("\nConcatenate")
t1 = torch.cat([tensor, tensor, tensor], dim=1)

print( t1.device )


cpu
tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], device='cuda:0')

Concatenate
cuda:0


A tensor is stored on CPU by default.

▶▶ **Now initialize *tensor* using *device='cuda'*: where are t1, ..., t5 stored?**

### 1.2.5 Exercise

▶▶ **Compute the tensor *h = W.x + b*, using the same data for x and W as at the beginning of this TP.**

```
x = np.array([1,2])
rng = np.random.default_rng(seed=42)
W = rng.random((3, 2))
```

In [None]:
# --------------------------------------------------------
# TODO: Write the code to compute h = W.x+b
# --------------------------------------------------------

# h = W.x + b
x = torch.tensor([1,2])
x = x.to( torch.float64) # be careful: using just 'float' here gives float32
## OR
#x = torch.tensor([1,2], dtype=float)
print("Our input vector with 2 elements:\n", x)
print( "x shape:", x.shape)
print( "x type:", x.dtype )

# Generate a random matrix (with a generator, for reproducible results)
rng = np.random.default_rng(seed=42)
W = rng.random((3, 2))
W_t = torch.from_numpy(W)
print("\n Our weight matrix, of shape 3x2:\n", W)
print( "W shape:", W_t.shape)
print( "W type:", W.dtype)

# Bias, a scalar
b = 1.0

# Now, try to multiply
h_t = W_t.matmul(x) + b
print("\n Our h layer:\n", h_t)
print( "h shape:", h_t.shape)

Our input vector with 2 elements:
 tensor([1., 2.], dtype=torch.float64)
x shape: torch.Size([2])
x type: torch.float64

 Our weight matrix, of shape 3x2:
 [[0.77395605 0.43887844]
 [0.85859792 0.69736803]
 [0.09417735 0.97562235]]
W shape: torch.Size([3, 2])
W type: float64

 Our h layer:
 tensor([2.6517, 3.2533, 3.0454], dtype=torch.float64)
h shape: torch.Size([3])


**Note:** when multiplying matrices in PyTorch, both operands need to have the same data type, e.g. not **x** with *int* and **W** with *float*.
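A minimal sketch of what this note means in practice, with made-up values: the product of a *float64* matrix and an *int64* vector is expected to raise a *RuntimeError*, and converting the vector to the matrix's type fixes it.

```python
import torch

W = torch.ones(3, 2, dtype=torch.float64)
x_int = torch.tensor([1, 2])              # inferred dtype: int64

try:
    W.matmul(x_int)
    mismatch_raised = False
except RuntimeError as err:
    mismatch_raised = True
    print("matmul failed:", err)

# Converting x to W's dtype makes the product well defined.
h = W.matmul(x_int.to(W.dtype)) + 1.0
print(h, h.dtype)
```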

# Part 2: Feedforward Neural Network

In this section, we will explore a simple neural network architecture for NLP applications; specifically, we will train a feedforward neural network for sentiment analysis, using the same dataset of reviews as in the previous session. We will also keep the bag-of-words representation. 


Sources:
* This TP is inspired by a TP by Tim van de Cruys
* https://www.deeplearningwizard.com/deep_learning/practical_pytorch/pytorch_feedforward_neuralnetwork/
* https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
* https://medium.com/swlh/sentiment-classification-using-feed-forward-neural-network-in-pytorch-655811a0913f 

## 2.1 Read and load the data

First, we need to understand how to use text data. Here we will keep the bag-of-words representation, as in the previous session. 
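As a reminder of what a bag-of-words vector looks like, here is a small sketch on a made-up two-review corpus (the reviews and the names *toy_reviews* and *toy_vectorizer* are invented for illustration, not taken from the dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_reviews = ["le film est bon", "le film est mauvais mauvais"]

toy_vectorizer = CountVectorizer(analyzer="word")
counts = toy_vectorizer.fit_transform(toy_reviews).toarray()

# vocabulary_ maps each word to its column index
columns = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(columns)
print(counts)   # one row per review, one column per word, values are counts
```

Each review becomes a vector of word counts, so the word order is lost but repeated words (here "mauvais") are counted.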

You can find different ways of dealing with the input data. The simplest solution is to use the DataLoader from PyTorch:    
* the doc here https://pytorch.org/docs/stable/data.html and here https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
* an example of use, with numpy array: https://www.kaggle.com/arunmohan003/sentiment-analysis-using-lstm-pytorch






You can also find many datasets for text ready to load in pytorch on: https://pytorch.org/text/stable/datasets.html

### 2.1.1 Build BoW vectors

The code below uses scikit-learn methods you already know to generate the bag-of-words representation.

In [None]:
import pandas as pd
import numpy as np
import re
import sklearn

from sklearn.feature_extraction.text import CountVectorizer

# This will be the size of the vectors representing the input
MAX_FEATURES = 5000 

# Load train, dev and test sets
train = pd.read_csv("allocine_train.tsv", header=0,
                    delimiter="\t", quoting=3)
dev = pd.read_csv("allocine_dev.tsv", header=0,
                    delimiter="\t", quoting=3)
test = pd.read_csv("allocine_test.tsv", header=0,
                   delimiter="\t", quoting=3)

print("Creating features from bag of words...")

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(
    analyzer = "word",
    max_features = MAX_FEATURES
) 

# fit_transform() performs two operations; first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(train["review"])

# output from vectorizer is a sparse array; our classifier needs a
# dense array
x_train = train_data_features.toarray()

# the labels are kept as a 1-D array of class indices (0 = negative,
# 1 = positive), the format expected by nn.CrossEntropyLoss
y_train = np.asarray(train["sentiment"])

print( "TRAIN:", x_train.shape )
count_train = x_train.shape[0]

Creating features from bag of words...
TRAIN: (5027, 5000)


### 2.1.2 Transform to tensors

Now we need to transform our data to tensors, to provide them as input to PyTorch.

* **torch.utils.data.TensorDataset(*tensors)**: a dataset wrapping tensors. It takes tensors as input, obtained here via **torch.from_numpy(a NumPy array)**. Note: don't forget to convert the feature tensor to float with **to(torch.float)**; otherwise the integer counts keep their *long* type and the model fails with a rather cryptic error about types.
* **DataLoader**: iterates over a dataset in batches. Its main parameters, with their default values:

```
DataLoader(
    dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0,
    collate_fn=None,
    pin_memory=False,
)
```
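To see what these two classes do together, here is a sketch on made-up data (the tensors *features* and *labels* are invented for illustration): the loader yields pairs of feature and label batches, and the last batch may be smaller.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Toy data (made up for illustration): 10 "documents" with 4 count features.
features = torch.arange(40).reshape(10, 4)        # int64, like raw counts
labels = torch.tensor([0, 1] * 5)                 # int64 class indices

toy_data = TensorDataset(features.to(torch.float), labels)
toy_loader = DataLoader(toy_data, batch_size=4, shuffle=True)

for batch_x, batch_y in toy_loader:
    print(batch_x.shape, batch_x.dtype, batch_y.shape, batch_y.dtype)
```

With 10 samples and a batch size of 4, the loader produces three batches, of sizes 4, 4 and 2.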


In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

# dataloaders
batch_size = 1 #no batch, or batch = 1

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

## 2.2 Neural Network

Now we can build our learning model.

We want here to build a simple feedforward neural network, with one hidden layer.

This network takes bag-of-words vectors as input, exactly like our 'classic' models: each review is represented by a vector whose size is the number of tokens in the vocabulary, holding the count of each word in the review ('0' for the words that do not appear). 

▶▶ **What is the input dimension?** 

▶▶ **What is the output dimension?** 

▶▶ **Now write the code to define the neural network:**

In the __init__(...) function, you need to:
- define a linear function that maps the input to the hidden dimensions (e.g. self.fc1)
- define an activation function, using the non-linear function sigmoid (e.g. self.sigmoid)
- define a second linear function, that takes the output of the hidden layer and maps to the output dimensions (e.g. self.fc2)

In the forward(self, x) function, you need to:
- pass the input *x* through the first linear function
- pass the output of this linear application through the activation function
- pass the final output through the second linear function and return its output


In [None]:
import torch
import torch.nn as nn

torch.manual_seed(0) # For reproducibility: https://pytorch.org/docs/stable/notes/randomness.html

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function ==> W1
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        # Non-linearity ==> g
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout) ==> W2
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, x):
        '''
        y = g(x.W1 + b1).W2 + b2
        '''
        # Linear function  # LINEAR ==> x.W1 + b1
        out = self.fc1(x)

        # Non-linearity  # NON-LINEAR ==> h1 = g(x.W1 + b1)
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR ==> y = h1.W2 + b2
        out = self.fc2(out)
        return out

▶▶ **What is the input dimension?** --> MAX_FEATURES = 5000

▶▶ **What is the output dimension?** --> number of classes = 2
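A quick way to check these dimensions is to push a fake input through the layers and look at the shapes. The sketch below uses *nn.Sequential* as a stand-in with the same three layers (an assumption made for brevity, not the class defined in this TP), with the hidden size set to 4 for illustration.

```python
import torch
import torch.nn as nn

input_dim, hidden_dim, output_dim = 5000, 4, 2

# Stand-in with the same layers as the model described in this section.
net = nn.Sequential(
    nn.Linear(input_dim, hidden_dim),
    nn.Sigmoid(),
    nn.Linear(hidden_dim, output_dim),
)

fake_review = torch.zeros(1, input_dim)   # a batch of one all-zero BoW vector
scores = net(fake_review)
print(scores.shape)                       # one score per class

n_params = sum(p.numel() for p in net.parameters())
print(n_params)
```

The parameter count is (5000 × 4 + 4) + (4 × 2 + 2) = 20014: almost all the parameters sit in the first layer, because of the large input dimension.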

We need to set up the values for the hyper-parameters, and define the form of the loss and the optimization methods.

Note that we don't apply a SoftMax to the output of the final layer to obtain class probabilities: this is because the SoftMax is already applied inside the chosen loss function (*nn.CrossEntropyLoss()*). Be careful, this is not the case for all the loss functions available in PyTorch.
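This can be checked numerically. The sketch below, on made-up scores, compares *nn.CrossEntropyLoss* applied to raw outputs with *nn.NLLLoss* applied to log-SoftMax outputs; the two values should match.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])   # made-up scores for 2 samples
targets = torch.tensor([0, 1])

ce = nn.CrossEntropyLoss()(logits, targets)
manual = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print(ce.item(), manual.item())

# Probabilities are only needed for inspection, not for training.
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))
```

*nn.NLLLoss* is an example of a loss that expects the (log-)SoftMax to have been applied already.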

▶▶ **What is the hidden dimension?** 

▶▶ **Note the hyper-parameters that would need to be optimized; we'll see that later.** 

In [None]:
# Many choices here!
VOCAB_SIZE = MAX_FEATURES
input_dim = VOCAB_SIZE 
hidden_dim = 4
output_dim = 2

learning_rate = 0.1
num_epochs = 5

criterion = nn.CrossEntropyLoss()

▶▶ **What is the hidden dimension?**  --> 4

### Training the network

▶▶ **Now, we're going to train a model:**
* Initialize a model using the class defined above
* Define an optimizer, i.e. define the method we'll use to optimize / find the best parameters of our model: check the doc https://pytorch.org/docs/stable/optim.html and use the **SGD** optimizer. 
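To get an intuition of what the **SGD** optimizer does, here is a sketch on a single made-up parameter vector and a toy loss (no model involved): one call to *step()* moves each parameter by the learning rate times its gradient, in the opposite direction.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()     # toy loss, its gradient is 2 * w
loss.backward()
print(w.grad)             # tensor([2., 4.])

opt.step()                # w <- w - lr * grad
print(w)
```

After the step, *w* equals [1 - 0.1 × 2, 2 - 0.1 × 4] = [0.8, 1.6].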


In [None]:
# Initialization of the model
model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

Below is the code to train your model. 

A good indicator that your model is doing what it is supposed to is the loss: it should decrease during training. 
At the same time, the accuracy on the training set should increase.

▶▶ **In the code below, we compute the loss and accuracy at the end of each training step (i.e. for each sample):**
* Print the loss after each epoch during training
* Print the accuracy after each epoch during training
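The training loop clears the gradients at every step because PyTorch accumulates them across calls to *backward()*. A minimal standalone sketch of that behaviour, on a made-up one-parameter example:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3 * w).sum().backward()
first = w.grad.clone()        # gradient of 3*w is 3

(3 * w).sum().backward()      # no reset in between
second = w.grad.clone()       # 3 + 3: the new gradient was added to the old one

w.grad.zero_()                # reset, the role played by optimizer.zero_grad()
(3 * w).sum().backward()
third = w.grad.clone()
print(first, second, third)
```

Without the reset, each update would use the sum of the gradients of all the samples seen so far rather than the gradient of the current one.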


In [None]:
# Start training
for epoch in range(num_epochs):
    train_loss, total_acc, total_count = 0, 0, 0
    for input, label in train_loader:

        # Clearing the accumulated gradients
        # torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits 
        # = apply all our functions: y = g(x.W1+b).W2
        outputs = model( input )

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, label)

        # Getting gradients w.r.t. parameters
        # Here is the way to find how to modify the parameters in
        # order to lower the loss
        loss.backward()

        # Updating parameters
        optimizer.step()

        # -- a useful print
        # Accumulating the loss over time
        train_loss += loss.item()
        total_acc += (outputs.argmax(1) == label).sum().item()
        total_count += label.size(0)

    # Compute accuracy on train set at each epoch
    # ...
    print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/count_train, total_acc/count_train))
        
    total_acc, total_count = 0, 0
    train_loss = 0

Epoch: 0. Loss: 0.5295901019196577. ACC 0.7232942112592003 
Epoch: 1. Loss: 0.38159644401620024. ACC 0.8325044758305152 
Epoch: 2. Loss: 0.3170426160692677. ACC 0.8669186393475233 
Epoch: 3. Loss: 0.2811994798197802. ACC 0.8854187388104238 
Epoch: 4. Loss: 0.2807111731653653. ACC 0.8838273324050129 


### Evaluate the model 

▶▶ **Process the dev data.** 

In [None]:
# ---------------------------------------------
# TODO: Process the dev data
# ---------------------------------------------

#test = pd.read_csv("allocine_test.tsv", header=0,
#                   delimiter="\t", quoting=3)

dev_data_features = vectorizer.transform(dev["review"])
x_dev = dev_data_features.toarray()
y_dev = np.asarray(dev["sentiment"])

print( "DEV:", x_dev.shape )

# create Tensor datasets
valid_data = TensorDataset(torch.from_numpy(x_dev).to(torch.float), torch.from_numpy(y_dev))
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size)

DEV: (549, 5000)


Below is the code to compute the predictions of the model and compare to the gold labels. As before, we print a classification report. 

▶▶ **Run the code. Are the scores no better than a simple Naive Bayes? Then we need to tune the hyper-parameters! But a simple model may also turn out to be better than a neural one... Always consider running a simple algorithm before moving to more complex ones.**

In [None]:
from sklearn.metrics import classification_report
predictions = []
gold = []

# Disabling gradient calculation is useful for inference, 
# when you are sure that you will not call Tensor.backward(). 
with torch.no_grad():
    for input, label in valid_loader:
        probs = model(input)
        predictions.append( torch.argmax(probs, dim=1).cpu().numpy()[0] )
        gold.append(int(label))

print(classification_report(gold, predictions))

              precision    recall  f1-score   support

           0       0.80      0.81      0.81       230
           1       0.86      0.85      0.86       319

    accuracy                           0.84       549
   macro avg       0.83      0.83      0.83       549
weighted avg       0.84      0.84      0.84       549



## 2.3 Move to GPU

One important issue with neural networks is their computational cost: here, the code runs fast, but we often need to use a GPU instead of a CPU to run neural models. See below what changes have to be made when initializing the model and when training it.


In [None]:
## 1- Define the device to be used

# CUDA for PyTorch
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print(device)

cuda


The method to move a tensor to GPU is *to()*: https://pytorch.org/docs/stable/generated/torch.Tensor.to.html#torch.Tensor.to.
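The usual pattern can be sketched on a tiny made-up layer (not the model of this TP): the code picks the device once, then moves both the module and the data to it, so the same script runs with or without a GPU.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(3, 2).to(device)   # nn.Module.to() moves the parameters in place
x = torch.ones(1, 3)                 # created on the CPU by default
x = x.to(device)                     # Tensor.to() returns a tensor on the target device

out = layer(x)                       # works because both live on the same device
print(out.device)
```

Note the difference: for a tensor, the result of *to()* must be assigned back to a variable, whereas a module is moved in place (assigning it back is harmless).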

▶▶ **Once initialized, move the model to the GPU using:**
```
model = model.to(device)
```


In [None]:
## 3- Move your model to the GPU

# Initialization of the model
model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

model = model.to(device)

▶▶ **Now move the data, i.e. input and label, using the same method, e.g.:**
```
input = input.to(device)
```


In [None]:
## 4- Move your data to GPU

# Start training
for epoch in range(num_epochs):
    train_loss, total_acc, total_count = 0, 0, 0
    for input, label in train_loader:
        ## ------------ CHANGE HERE -----------------
        input = input.to(device)
        label = label.to(device)

        # Clear gradients w.r.t. parameters
        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = model( input )

        # Calculate Loss: softmax --> cross entropy loss
        loss = criterion(outputs, label)

        # Getting gradients w.r.t. parameters
        loss.backward()

        # Updating parameters
        optimizer.step()

        # Accumulating the loss over time
        train_loss += loss.item()
        total_acc += (outputs.argmax(1) == label).sum().item()
        total_count += label.size(0)

    # Compute accuracy on train set at each epoch
    print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/count_train, total_acc/count_train))
        
    total_acc, total_count = 0, 0
    train_loss = 0

Epoch: 0. Loss: 0.5201658111256726. ACC 0.7336383528943704 
Epoch: 1. Loss: 0.36919306652012396. ACC 0.8400636562562165 
Epoch: 2. Loss: 0.29851252123677274. ACC 0.8768649293813408 
Epoch: 3. Loss: 0.26750923704391744. ACC 0.889596180624627 
Epoch: 4. Loss: 0.2565383207324902. ACC 0.8961607320469465 


We also need to move the data when predicting labels, as you can see below.

In [None]:
# -- 5- Again, move your data to GPU
predictions, gold = [], []

with torch.no_grad():
    for input, label in valid_loader:
        ## ------------ CHANGE HERE -----------------
        input = input.to(device)

        probs = model(input)
        #Here, we need CPU: else, it will generate the following error
        # can't convert cuda:0 device type tensor to numpy. 
        # Use Tensor.cpu() to copy the tensor to host memory first.
        # (if we need a numpy array)
        predictions.append( torch.argmax(probs, dim=1).cpu().numpy()[0] )
        #print( probs )
        #print( torch.argmax(probs, dim=1) ) # Return the index of the max value
        #print( torch.argmax(probs, dim=1).cpu().numpy()[0] )
        gold.append(int(label))

print(classification_report(gold, predictions))

              precision    recall  f1-score   support

           0       0.88      0.64      0.74       230
           1       0.78      0.94      0.85       319

    accuracy                           0.81       549
   macro avg       0.83      0.79      0.80       549
weighted avg       0.83      0.81      0.81       549



# Part 3: Using continuous representations

In this part, we will use continuous representations of words, namely a Continuous Bag of Words built from randomly initialized embeddings (we'll try pretrained embeddings later).

In order to create the representation of a document, we will take all the embeddings of the words that appear in the document, and sum them together or take their average.
So instead of having an input vector of size 5000, we now have an input vector of size e.g. 50, that represents the ‘average’, combined meaning of all the words in the document taken together. 

Crucially, the neural network will also learn the embeddings during training: the embeddings of the network are also parameters that are optimized according to the loss function.
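The idea can be sketched with a plain *nn.Embedding* layer and made-up word indices (the document *doc* is invented, and the sizes 5000 and 50 are only example values):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim = 5000, 50           # example sizes

embedding = nn.Embedding(vocab_size, embed_dim)   # randomly initialized, trainable
doc = torch.tensor([12, 7, 12, 431])       # made-up word indices for one document

word_vectors = embedding(doc)              # shape: (4, 50), one row per token
doc_mean = word_vectors.mean(dim=0)        # shape: (50,)
doc_sum = word_vectors.sum(dim=0)          # alternative: sum instead of average
print(word_vectors.shape, doc_mean.shape)
print(embedding.weight.requires_grad)      # the embeddings are parameters to optimize
```

Whatever the length of the document, its representation has the size of one embedding.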

The dataset remains the French set of reviews labeled with sentiment.

We will compare our model to the scores obtained previously with bag-of-words representations.



## 3.1 Load the data



### Read the data

Here, we're not using the vectorizer from scikit-learn; the code below extracts the texts and labels.

In [None]:
train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# Load train set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
train_iter = []
for i in train_df.index:
    train_iter.append( tuple( [train_df["sentiment"][i], train_df["review"][i]] ) )

print( '\n'.join( [ str(train_iter[i][0])+'\t'+train_iter[i][1] for i in range(0,10) ] ) )

dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
dev_iter = []
for i in dev_df.index:
    dev_iter.append( tuple( [dev_df["sentiment"][i], dev_df["review"][i]] ) )

test_iter = []
test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)
for i in test_df.index:
    test_iter.append( tuple( [test_df["sentiment"][i], test_df["review"][i]] ) )

0	Stephen King doit bien ricaner en constatant cette navrante histoire de disparus, les scénaristes semblent s'être inspirés de ses oeuvres mais ont bien moins son talent que celui du business. Quel perte de temps que de regarder ces personnages perdus au centre d'une histoire sans fin et sans intérêt, où 2 ou 3 épisodes suffisent pour décrocher, à l'inverse d'une série comme Desperate housewives dont les dialogues, les scénarii et les personnages contribuent sans cesse à relancer l'intérêt et le plaisir au fil des épisodes. Pourtant mes goûts initiaux m'auraient porté davantage du côté de la série fantastique. Il ne faut préjuger de rien! A bon entendeur...
1	Excellentissime! Une série à l'apparence toute calme et lisse, qui se révèle être un véritable noeud de problèmes, de secrets, de mensonges... Les actrices sont vraiment toutes très bonnes dans leurs rôles, avec une petite préférence pour Bree, qui pète complètement un câble à la fin de la saison 2!
0	Voir de pareilles évaluation


We need to tokenize our data, and build the corresponding vocabulary (on the train set).

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# splits the string sentence by space.
tokenizer = get_tokenizer( None ) 

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

#### Vocabulary

Here the vocabulary is a specific object from torchtext (the PyTorch text library); check the functions available to use it here: https://pytorch.org/text/stable/vocab.html
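As a rough mental model, such a vocabulary behaves like a dictionary from tokens to integers, with a default index for unknown tokens. The class *ToyVocab* below is a hypothetical plain-Python stand-in written for illustration; it is not the torchtext implementation, and it numbers tokens by order of first appearance for simplicity.

```python
class ToyVocab:
    """Hypothetical stand-in for a vocabulary object (not the torchtext API)."""

    def __init__(self, token_lists, specials=("<unk>",)):
        self.itos = list(specials)                 # index -> token
        for tokens in token_lists:
            for tok in tokens:
                if tok not in self.itos:
                    self.itos.append(tok)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}  # token -> index
        self.default_index = self.stoi["<unk>"]

    def __call__(self, tokens):
        # Unknown tokens fall back to the index of "<unk>"
        return [self.stoi.get(tok, self.default_index) for tok in tokens]


toy_vocab = ToyVocab([["cette", "série", "est", "bien"], ["cette", "fin"]])
print(toy_vocab(["cette", "fin", "inconnu"]))   # [1, 5, 0]
```

The token "inconnu" was never seen when building the vocabulary, so it is mapped to index 0, the index of the special token.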

For example, the vocabulary directly converts a list of tokens into integers.

In [None]:
vocab(['Avant', 'cette', 'série', ','])

[2910, 18, 7, 144]

▶▶ **Now, try to retrieve the index of the word 'mauvais'.** 


In [None]:
print( vocab.lookup_indices( ['mauvais'] ))

[246]


#### Text and label pipelines

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 

The label pipeline converts the label into integers. 

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) #simple mapping to self

In [None]:
text_pipeline('Avant cette série, je ne connaissais que Urgence')

[2910, 18, 89, 16, 17, 6120, 8, 10529]

In [None]:
label_pipeline('0')

0

#### Generate data batches and iterator

We also use *torch.utils.data.DataLoader* with an iterable dataset, here a simple list of labels and text reviews, as saved in *train_iter*.

Before sending to the model, we apply a function, *collate_fn*, to our input data:
* The input to *collate_fn* is a batch of data with the batch size in *DataLoader*, 
* *collate_fn* processes them according to the data processing pipelines declared previously. 

In 'collate_batch', we define how we want to pre-process our data.

The function is directly called within *DataLoader*:
```
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
```

Below: 
* the text entries in the original data batch input are packed into a list and concatenated as a single tensor. 
* the offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor
* Label is a tensor saving the labels of individual text entries.

The offsets are used to retrieve the individual sequences in each batch (the sequences are concatenated).
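The offset computation can be sketched on three made-up documents, already converted to word indices (the values are invented for illustration):

```python
import torch

# Three made-up documents, already converted to word indices.
docs = [torch.tensor([4, 8, 15]), torch.tensor([16, 23]), torch.tensor([42, 4, 8, 15])]

text = torch.cat(docs)                                   # one flat tensor of 9 indices
lengths = torch.tensor([0] + [len(d) for d in docs])     # leading 0 for the first document
offsets = lengths[:-1].cumsum(dim=0)                     # start position of each document
print(offsets)                                           # tensor([0, 3, 5])

# Recover document i as the slice between two consecutive start positions
# (up to the end of the flat tensor for the last document).
bounds = offsets.tolist() + [len(text)]
recovered = [text[bounds[i]:bounds[i + 1]] for i in range(len(docs))]
print(recovered)
```

The cumulative sum turns the document lengths 3, 2, 4 into the start positions 0, 3, 5, which is all that is needed to split the flat tensor again.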

In [None]:
import torch

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    # offsets starts with 0: the first sequence begins at index 0
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Store the length of each sequence
        offsets.append(processed_text.size(0))
    label = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length, then cumulative sum: start index of each sequence
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all sequences into a single 1-D tensor
    text_list = torch.cat(text_list)
    return label.to(device), text_list.to(device), offsets.to(device)

dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

## 2.2 Define the models

This time the FFNN will have an embedding layer that transforms our input words into vectors of size 'embed_dim' and combines these vectors into a single representation for each document (default: mean).

More specifically, we'll use the *nn.EmbeddingBag* layer: https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

* mode (string, optional) – "sum", "mean" or "max". Default=mean.
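
To see what the "mean" mode amounts to, here is a NumPy sketch of the computation on a flat index array with offsets. It is an illustration under stated assumptions, not the PyTorch implementation: the embedding table `E`, the indices and the offsets are all made up.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Hypothetical embedding table: 10 words, vectors of size 4.
E = rng.random((10, 4))

text = np.array([5, 2, 9, 7, 1])  # two concatenated documents
offsets = np.array([0, 3])        # start index of each document

# Split the flat index array at the document boundaries (skip the leading 0).
bags = np.split(text, offsets[1:])

# One vector per document: the mean of its word embeddings.
doc_vectors = np.stack([E[bag].mean(axis=0) for bag in bags])
print(doc_vectors.shape)  # (2, 4)
```

Replacing `mean` with `sum` or `max` in the last step gives the analogue of the other two modes.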

**Exercise: __init__():**

▶▶ **Define the embedding layer in the __init__() function below.**

▶▶ **Define a FFNN as previously, using one linear function, an activation function, and a linear projection to the output.**

The code of the *forward* function is given. This time, it has a second argument: since the documents of a batch are concatenated, the 'offsets' are used to retrieve the individual documents.

In [None]:
import torch.nn as nn

class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel2, self).__init__()

        # Embedding layer 
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
        # Linear function ==> W1
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity ==> g
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout) ==> W2
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        # Linear function  # LINEAR ==> x.W1+b
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR ==> h1 = g(x.W1+b)
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR ==> y = h1.W2
        out = self.fc2(out)
        return out

## 2.3 Training and evaluation

To use the offsets, we need to slightly modify the train and evaluation procedures.

In [None]:
def train_woffset( model, train_loader, optimizer, num_epochs=5 ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for label, input, offsets in train_loader:
            input = input.to(device)
            label = label.to(device)
            # Step1. Clearing the accumulated gradients
            optimizer.zero_grad()
            # Step 2. Forward pass to get output/logits
            outputs = model( input, offsets )
            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()
            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)
        # Report the mean loss per batch and the accuracy on the train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/len(train_loader), total_acc/total_count))
        total_acc, total_count = 0, 0
        train_loss = 0

def evaluate_woffset( model, dev_loader ):
    predictions = []
    gold = []
    with torch.no_grad():
        for label, input, offsets in dev_loader:
            input = input.to(device)
            label = label.to(device)
            probs = model(input, offsets)
            # -- MODIFIED to deal with batches
            predictions.extend( torch.argmax(probs, dim=1).cpu().numpy() ) # <-----
            gold.extend([int(l) for l in label])
    print(classification_report(gold, predictions))
    return gold, predictions

## 2.4 Run experiments

The code below uses the FFNN with continuous representations.

In [None]:
# Load data
batch_size = 1
train_loader = DataLoader(train_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)
dev_loader = DataLoader(dev_iter, batch_size=batch_size, shuffle=True, collate_fn=collate_batch)

In [None]:
# Set the values of the hyperparameters
vocab_size = len(vocab)
emb_dim = 300
hidden_dim = 4
output_dim = 2
learning_rate = 0.1
num_epochs = 5
criterion = nn.CrossEntropyLoss()
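
As a reminder of what this criterion computes, here is a NumPy sketch of cross-entropy applied to raw scores (logits): a log-softmax followed by the negative log-probability of the gold class, averaged over the batch. The function name `cross_entropy` and the example values are hypothetical; the sketch only mirrors the default behaviour (mean reduction) and is not the PyTorch code.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log-softmax: log of the normalized exponentials.
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the gold class, averaged over the batch.
    return -log_probs[np.arange(len(labels)), labels].mean()

logits = np.array([[2.0, 0.0], [0.0, 0.0]])  # hypothetical scores for 2 examples
labels = np.array([0, 1])                    # gold classes
loss = cross_entropy(logits, labels)
print(loss)
```

This is why the model's *forward* function returns raw scores and contains no softmax layer: the normalization happens inside the loss.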

In [None]:
# Initialize the model
model_ffnn2 = FeedforwardNeuralNetModel2(vocab_size, emb_dim, hidden_dim, output_dim)
optimizer = torch.optim.SGD(model_ffnn2.parameters(), lr=learning_rate)
model_ffnn2 = model_ffnn2.to(device)
# Train the model
train_woffset( model_ffnn2, train_loader, optimizer, num_epochs=num_epochs )
# Evaluate on dev
gold, pred = evaluate_woffset( model_ffnn2, dev_loader )

Epoch: 0. Loss: 0.6422858452885715. ACC 0.6299980107419932 
Epoch: 1. Loss: 0.5580713001348676. ACC 0.7163318082355281 
Epoch: 2. Loss: 0.497581498553172. ACC 0.7529341555599761 
Epoch: 3. Loss: 0.4519847773106927. ACC 0.7869504674756316 
Epoch: 4. Loss: 0.4103179189151622. ACC 0.8088323055500298 
              precision    recall  f1-score   support

           0       0.59      0.90      0.71       230
           1       0.89      0.55      0.68       319

    accuracy                           0.70       549
   macro avg       0.74      0.72      0.69       549
weighted avg       0.76      0.70      0.69       549



▶▶ **Plot the loss during training (i.e. the loss as a function of the iterations).**

▶▶ **The performance is low: why? What could we do to improve these scores?**