# TP 4: Training a Feedforward neural network (Part 2)
Master LiTL - 2021-2022

## Requirements
In this part, we will use continuous representations of words: first a continuous bag of words built from randomly initialized embeddings, and then pretrained embeddings.
In order to create the representation of a document, we will take all the embeddings of the words that appear in the document, and sum them together (or take their average).
So instead of having an input vector of size 5000, we now have an input vector of size e.g. 50, that represents the ‘average’, combined meaning of all the words in the document taken together. 
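
As a minimal sketch of this idea, a document vector can be obtained by looking up the embedding of each word and summing or averaging the rows. The sizes and the word indices below are arbitrary values chosen for illustration; they are not taken from the dataset.

```python
import torch

torch.manual_seed(0)

vocab_size, emb_dim = 5000, 50                  # illustrative sizes
embeddings = torch.randn(vocab_size, emb_dim)   # randomly initialized embedding table

doc_ids = torch.tensor([12, 7, 431, 7])         # hypothetical word indices of one document
word_vectors = embeddings[doc_ids]              # shape: (4, 50), one row per token

doc_sum = word_vectors.sum(dim=0)               # shape: (50,)
doc_mean = word_vectors.mean(dim=0)             # shape: (50,)

print(doc_sum.shape, doc_mean.shape)
```

Whatever the length of the document, the resulting vector always has the size of the embeddings.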

Crucially, the neural network will also learn the embeddings during training: the embeddings of the network are also parameters that are optimized according to the loss function.

The dataset remains the French set of reviews labeled with sentiment.

We will compare our model to the scores obtained previously with bag-of-words representations.

Once you've read and understood the code below to read the data, you can run your model and make some experiments, varying the hyper-parameters: 
* size of the embeddings, 
* size of the hidden layer, 
* activation function, 
* number of iterations.

Try to save your results and produce some visualizations: 
+ Plot the cost function during training with different values for the learning rate
+ Plot the accuracy with respect to the size of the embedding layer
+ Plot the accuracy with respect to the number of training examples 

In [None]:
import torch

# If you’re using Colab, allocate a GPU by going to Edit > Notebook Settings.
# Check whether a GPU is available; tensors are moved to it later on
if torch.cuda.is_available():
  print("GPU ok")
else:
  print("no gpu")


## 1. Read the data


In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# Load train set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
train_iter = []
for i in train_df.index:
    #print( train_df["sentiment"][i], train_df["review"][i])
    train_iter.append( tuple( [train_df["sentiment"][i], train_df["review"][i]] ) )

print( '\n'.join( [ str(train_iter[i][0])+'\t'+train_iter[i][1] for i in range(0,10) ] ) )


dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)
dev_iter, test_iter = [], []
for i in dev_df.index:
    dev_iter.append( tuple( [dev_df["sentiment"][i], dev_df["review"][i]] ) )

for i in test_df.index:
    test_iter.append( tuple( [test_df["sentiment"][i], test_df["review"][i]] ) )

This time, we don't directly use the bag-of-words representation built by scikit-learn.

We need to tokenize our data, and build the corresponding vocabulary (on the train set).

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# With None, the tokenizer simply splits the string on whitespace.
tokenizer = get_tokenizer( None ) 

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

#### Vocabulary

Here the vocabulary is a specific torchtext object; check the functions available to use it here: https://pytorch.org/text/stable/vocab.html

For example, the vocabulary directly converts a list of tokens into integers.

In [None]:
vocab(['Avant', 'cette', 'série', ','])
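
To make this behaviour concrete without depending on torchtext, here is a plain-Python sketch of what such a vocabulary does: it assigns an integer to each token and falls back to the index of `<unk>` for unseen tokens. The class `TinyVocab` and the toy sentences are hypothetical; they only mimic the idea and are not the actual torchtext implementation.

```python
from collections import Counter

class TinyVocab:
    """Hypothetical stand-in illustrating token -> index lookup."""

    def __init__(self, token_lists, unk="<unk>"):
        counts = Counter(tok for toks in token_lists for tok in toks)
        # Special token first, then tokens by decreasing frequency
        self.itos = [unk] + [tok for tok, _ in counts.most_common()]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index = self.stoi[unk]

    def __call__(self, tokens):
        # Unknown tokens receive the default index
        return [self.stoi.get(tok, self.default_index) for tok in tokens]

    def __len__(self):
        return len(self.itos)

toy_corpus = ["un bon film".split(), "un mauvais film".split()]
toy_vocab = TinyVocab(toy_corpus)
ids = toy_vocab(["un", "film", "inconnu"])
print(ids)  # the unseen token "inconnu" maps to the index of <unk>
```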

▶▶ **Now, try to retrieve the index of the word 'mauvais'.** 

#### Text and label pipelines

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 

The label pipeline converts the label into integers. 

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) #simple mapping to self

In [None]:
text_pipeline('Avant cette série, je ne connaissais que Urgence')

In [None]:
label_pipeline('0')

#### Generate data batch and iterator

Here we again use *torch.utils.data.DataLoader*. The dataset is a simple list of (label, review) tuples, as saved in *train_iter*.

Before sending to the model, we apply a function, *collate_fn*, to our input data:
* The input to *collate_fn* is a batch of data whose size is the batch size given to *DataLoader*, 
* *collate_fn* processes them according to the data processing pipelines declared previously. 

In *collate_batch*, we define how we want to pre-process our data.

The function is directly called within *DataLoader*:
```
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
```

Below: 
* the text entries in the original data batch are converted to tensors of indices, collected in a list and concatenated into a single tensor, 
* the offsets tensor gives the start index of each individual sequence in the text tensor, 
* the label tensor stores the labels of the individual text entries.
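
The offsets computation can be checked on a toy batch. In this sketch the three documents and their word indices are made up for illustration; the logic mirrors the concatenation and cumulative sum used in *collate_batch*.

```python
import torch

# Three hypothetical documents, already converted to word indices
docs = [torch.tensor([4, 9, 2]), torch.tensor([7, 1]), torch.tensor([3, 3, 8, 5])]

text = torch.cat(docs)                               # all tokens in one flat tensor
lengths = [0] + [len(d) for d in docs]               # [0, 3, 2, 4]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)   # start index of each document

print(text)      # flat tensor of 9 word indices
print(offsets)   # start positions of the three documents

# The second document can be recovered from the flat tensor with the offsets
second_doc = text[offsets[1]:offsets[2]]
```

The length of the last document is dropped before the cumulative sum because it would only mark the end of the batch, not the start of a sequence.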

In [None]:
import torch

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
         label_list.append(label_pipeline(_label))
         processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
         text_list.append(processed_text)
         offsets.append(processed_text.size(0))
    label = torch.tensor(label_list, dtype=torch.int64)
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    text_list = torch.cat(text_list)
    return label.to(device), text_list.to(device), offsets.to(device)

dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

## 2. Define the model

### Bag of embeddings

The model is composed of the nn.EmbeddingBag layer: https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

* mode (string, optional) – "sum", "mean" or "max". Default=mean.
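
As a small sketch of how the layer consumes the flat text tensor and the offsets, the example below pools two toy documents and compares the result with a manual average. The sizes and indices are arbitrary values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

text = torch.tensor([4, 9, 2, 7, 1])   # two documents, flattened
offsets = torch.tensor([0, 3])          # document 1 = text[0:3], document 2 = text[3:]

pooled = bag(text, offsets)             # shape: (2, 4), one vector per document

# Same result computed by hand from the embedding table
manual_first = bag.weight[text[0:3]].mean(dim=0)
manual_second = bag.weight[text[3:]].mean(dim=0)
print(torch.allclose(pooled[0], manual_first), torch.allclose(pooled[1], manual_second))
```

Passing `mode="sum"` instead would add the word vectors without dividing by the document length.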

### Weight initialization

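By default, PyTorch initializes each layer with its own scheme (for instance, the weights of *nn.Linear* are drawn from a uniform distribution, while embedding weights are drawn from a normal distribution). You can also specify the initialization yourself, as done below in *init_weights*, and choose among various options: https://pytorch.org/docs/stable/nn.init.html. Further information can be found at https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch or at https://discuss.pytorch.org/t/clarity-on-default-initialization-in-pytorch/84696/3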
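
As an alternative to writing into the weights directly, the functions of *torch.nn.init* can be applied layer by layer. The sketch below is only one possible choice (Xavier uniform weights and zero biases); the helper `init_linear` and the toy model are hypothetical and not part of the model defined in this notebook.

```python
import torch
import torch.nn as nn

def init_linear(module):
    # Called on every sub-module by `apply`; only Linear layers are modified
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

toy_model = nn.Sequential(nn.Linear(50, 64), nn.Sigmoid(), nn.Linear(64, 2))
toy_model.apply(init_linear)

print(toy_model[0].bias.abs().sum().item())   # biases are all zero after initialization
```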

In [None]:
import torch
import torch.nn as nn

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(FeedforwardNeuralNetModel, self).__init__()

        # Define the parameters that you will need. 
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True) # <----

        # Linear function
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def init_weights(self):
        # Optional custom initialization: it is not called in __init__,
        # so call model.init_weights() explicitly to use it.
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc1.weight.data.uniform_(-initrange, initrange)
        self.fc1.bias.data.fill_(0.01)
        self.fc2.weight.data.uniform_(-initrange, initrange)
        self.fc2.bias.data.zero_()

    def forward(self, text, offsets):

        embedded = self.embedding(text, offsets)

        # Linear function  # LINEAR
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

### Training and evaluation functions

The training and evaluation functions below do roughly the same as those of the previous part.

In [None]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count


## 3. Running an experiment 

### Adjusting the learning rate

The code below uses a *scheduler*: *torch.optim.lr_scheduler* provides several methods to adjust the learning rate based on the number of epochs.

Learning rate scheduling should be applied after optimizer’s update.

* torch.optim.lr_scheduler.StepLR: Decays the learning rate of each parameter group by gamma every step_size epochs.

https://pytorch.org/docs/stable/optim.html
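
To see the effect of *StepLR* in isolation, the sketch below steps a scheduler at every epoch on a dummy parameter and records the learning rate. This is a simplified illustration: the dummy parameter is an assumption, and the training loop in this notebook only calls the scheduler when the validation accuracy stops improving.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))   # dummy parameter, no real model
opt = torch.optim.SGD([param], lr=5.0)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.1)

lrs = []
for epoch in range(3):
    lrs.append(opt.param_groups[0]["lr"])    # learning rate used during this epoch
    opt.step()      # optimizer update first (no gradient here, so nothing changes)
    sched.step()    # then the scheduler multiplies the learning rate by gamma

print(lrs)
```

Each call to the scheduler divides the learning rate by 10 here, since `gamma=0.1` and `step_size=1`.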

In [None]:
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emb_dim = 50
hid_dim = 64
model = FeedforwardNeuralNetModel(vocab_size, emb_dim, hid_dim, num_class).to(device)

In [None]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # <----
total_accu = None
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(dev_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val: # <----
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

In [None]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

▶▶ **Make some additional experiments varying**: 
* the size of the embeddings, e.g. 300, 3000
* the size of the hidden layer, 
* the activation function, 
* the number of iterations.
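
One possible way to make the activation function a hyper-parameter is to select the module by name. The sketch below only covers the part of the network that follows the embedding bag; the dictionary `ACTIVATIONS` and the helper `make_classifier` are hypothetical names introduced for illustration.

```python
import torch
import torch.nn as nn

ACTIVATIONS = {"sigmoid": nn.Sigmoid, "tanh": nn.Tanh, "relu": nn.ReLU}

def make_classifier(embed_dim, hidden_dim, output_dim, activation="sigmoid"):
    # Linear -> chosen non-linearity -> linear readout
    return nn.Sequential(
        nn.Linear(embed_dim, hidden_dim),
        ACTIVATIONS[activation](),
        nn.Linear(hidden_dim, output_dim),
    )

for name in ACTIVATIONS:
    net = make_classifier(50, 64, 2, activation=name)
    out = net(torch.zeros(8, 50))   # batch of 8 dummy document vectors
    print(name, tuple(out.shape))
```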

## 4. Using pretrained embeddings

Upload the file *cc.fr.300.10000.vec*: the first 10000 lines of the FastText embeddings for French, https://fasttext.cc/docs/en/crawl-vectors.html.

### Load the vectors

▶▶ **Write a function that loads the vectors**, i.e. a dictionary mapping a word to its vector, as defined in the fasttext file. 

▶▶ **Print the vocabulary and the size of the embeddings.**

### Build the weight matrix

Now we build a matrix over the dataset associating each word to its vector. For each word in the dataset’s vocabulary, we check if it is in FastText’s vocabulary:
* if yes: load its pre-trained word vector. 
* else: we initialize a random vector.

In [None]:
emb_dim = 300
matrix_len = len(vocab)
weights_matrix = np.zeros((matrix_len, emb_dim))
words_found = 0

for i in range(0, len(vocab)):
    word = vocab.lookup_token(i)
    try: 
        weights_matrix[i] = vectors[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))
weights_matrix = torch.from_numpy(weights_matrix)
print( weights_matrix.size())

### Embedding layer

The code below defines a function that builds the embedding layer using the pretrained embeddings.

In [None]:
def create_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.size()
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    emb_layer.load_state_dict({'weight': weights_matrix}) # <----
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim
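
PyTorch also provides *nn.Embedding.from_pretrained*, which builds the layer from a weight tensor and controls whether it is trainable through its `freeze` argument. The sketch below uses a small random tensor standing in for *weights_matrix*; the sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small random tensor standing in for the real weights_matrix (illustrative sizes)
toy_weights = torch.randn(6, 3)

frozen = nn.Embedding.from_pretrained(toy_weights, freeze=True)      # not updated in training
trainable = nn.Embedding.from_pretrained(toy_weights, freeze=False)  # fine-tuned in training

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
```

Freezing keeps the pretrained vectors intact, while fine-tuning lets the network adapt them to the sentiment task.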

### Model definition

Now we can modify our model to add this embedding layer. 

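Note that the embedding bag now takes pre-initialized weights. With the settings used below (*non_trainable* set to True in *create_emb_layer*, and the default *freeze=True* of *EmbeddingBag.from_pretrained*), these weights are not updated during training.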

In [None]:
class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self, embed_dim, hidden_dim, output_dim, weights_matrix):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(FeedforwardNeuralNetModel2, self).__init__()

        # Define the parameters that you will need. 
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix, True)

        self.embedding_sum = nn.EmbeddingBag.from_pretrained(  self.embedding.weight, mode='sum')

        # Linear function
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text, offsets):

        embedded = self.embedding_sum(text, offsets)

        # Linear function  # LINEAR
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

In [None]:
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training
emb_dim = 300
hidden_dim = 128

num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
model = FeedforwardNeuralNetModel2(emb_dim, hidden_dim, num_class, weights_matrix).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # <----
total_accu = None
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(dev_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val: # <----
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

In [None]:
print(model.embedding.weight[vocab.lookup_indices(['mauvais'])])

Try to save your results and produce some visualizations:

* Plot the cost function during training with different values for, e.g., the learning rate 
* Plot the accuracy with respect to the size of the embedding layer
* Plot the accuracy with respect to the number of training examples
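
To keep track of the experiments before plotting, one simple option is to store one row per run in a CSV file. The sketch below writes to an in-memory buffer; the column names are hypothetical and the accuracy values are placeholders, not real scores.

```python
import csv
import io

# Hypothetical results, one row per experiment (placeholder accuracies, not real scores)
results = [
    {"emb_dim": 50, "hid_dim": 64, "lr": 5.0, "valid_acc": 0.0},
    {"emb_dim": 300, "hid_dim": 64, "lr": 5.0, "valid_acc": 0.0},
]

buffer = io.StringIO()   # replace with open("results.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=["emb_dim", "hid_dim", "lr", "valid_acc"])
writer.writeheader()
writer.writerows(results)

print(buffer.getvalue())
```

The resulting file can then be loaded with pandas to draw the plots listed above.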