# TP 4: Training a Feedforward neural network (Part 2)
Master LiTL - 2021-2022

## Requirements
In this part, we will use continuous representations of words: a continuous bag of words built first from randomly initialized embeddings, and then from pretrained embeddings.
In order to create the representation of a document, we will take all the embeddings of the words that appear in the document, and sum them together (or take their average).
So instead of having an input vector of size 5000, we now have an input vector of size e.g. 50, that represents the ‘average’, combined meaning of all the words in the document taken together. 

Crucially, the neural network will also learn the embeddings during training: the embeddings of the network are also parameters that are optimized according to the loss function.
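
To make the document representation concrete, here is a minimal plain-Python sketch of summing or averaging word embeddings. The embedding table, its three words and its 3-dimensional vectors are made up for illustration; in the notebook the vectors live in a PyTorch layer and are learned during training.

```python
# Toy embedding table: words and vectors are hypothetical,
# they do not come from the review dataset.
embeddings = {
    "film":    [0.2, 0.4, -0.1],
    "très":    [0.0, 0.1, 0.3],
    "mauvais": [-0.8, 0.5, 0.1],
}

def document_vector(tokens, embeddings, mode="mean"):
    """Combine the embeddings of the known tokens into a single vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no known token in the document")
    dim = len(vectors[0])
    # Sum the vectors dimension by dimension
    summed = [sum(v[d] for v in vectors) for d in range(dim)]
    if mode == "sum":
        return summed
    # Otherwise divide by the number of vectors to get the average
    return [s / len(vectors) for s in summed]

doc = ["film", "très", "mauvais"]
print(document_vector(doc, embeddings, mode="sum"))
print(document_vector(doc, embeddings, mode="mean"))
```

Whatever the length of the document, the result has the dimension of one embedding, which is why the input size of the network drops from the vocabulary size to the embedding size.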

The dataset remains the French set of reviews labeled with sentiment.

We will compare our model to the scores obtained previously with bag-of-words representations.

Once you've read and understood the code below to read the data, you can run your model and make some experiments, varying the hyper-parameters: 
* size of the embeddings, 
* size of the hidden layer, 
* activation function, 
* number of iterations.

Try to save your results and plot some visualizations: 
+ Plot the cost function during training with different values for the learning rate
+ Plot the accuracy with respect to the size of the embedding layer
+ Plot the accuracy with respect to the number of training examples 

In [None]:
import torch

# If you’re using Colab, allocate a GPU by going to Edit > Notebook Settings.
# We move our tensor to the GPU if available
if torch.cuda.is_available():
  print(f"GPU ok")
else:
  print("no gpu")

GPU ok



## 1. Read the data


In [None]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# Load train set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
train_iter = []
for i in train_df.index:
    #print( train_df["sentiment"][i], train_df["review"][i])
    train_iter.append( tuple( [train_df["sentiment"][i], train_df["review"][i]] ) )

print( '\n'.join( [ str(train_iter[i][0])+'\t'+train_iter[i][1] for i in range(0,10) ] ) )


dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)
dev_iter, test_iter = [], []
for i in dev_df.index:
    dev_iter.append( tuple( [dev_df["sentiment"][i], dev_df["review"][i]] ) )

for i in test_df.index:
    test_iter.append( tuple( [test_df["sentiment"][i], test_df["review"][i]] ) )

0	Stephen King doit bien ricaner en constatant cette navrante histoire de disparus, les scénaristes semblent s'être inspirés de ses oeuvres mais ont bien moins son talent que celui du business. Quel perte de temps que de regarder ces personnages perdus au centre d'une histoire sans fin et sans intérêt, où 2 ou 3 épisodes suffisent pour décrocher, à l'inverse d'une série comme Desperate housewives dont les dialogues, les scénarii et les personnages contribuent sans cesse à relancer l'intérêt et le plaisir au fil des épisodes. Pourtant mes goûts initiaux m'auraient porté davantage du côté de la série fantastique. Il ne faut préjuger de rien! A bon entendeur...
1	Excellentissime! Une série à l'apparence toute calme et lisse, qui se révèle être un véritable noeud de problèmes, de secrets, de mensonges... Les actrices sont vraiment toutes très bonnes dans leurs rôles, avec une petite préférence pour Bree, qui pète complètement un câble à la fin de la saison 2!
0	Voir de pareilles évaluation

This time, we don't directly use the bag-of-words representation built by scikit-learn.

We need to tokenize our data, and build the corresponding vocabulary (on the train set).

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

# splits the string sentence by space.
tokenizer = get_tokenizer( None ) 

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])

#### Vocabulary

Here the vocabulary is a dedicated torchtext object; check the available methods here: https://pytorch.org/text/stable/vocab.html

For example, the vocabulary directly converts a list of tokens into integers.

In [None]:
vocab(['Avant', 'cette', 'série', ','])

[2910, 18, 7, 144]

▶▶ **Now, try to retrieve the index of the word 'mauvais'.** 

In [None]:
print( vocab.lookup_indices( ['mauvais'] ))

[246]
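
As a rough illustration of what such a vocabulary does, the plain-Python sketch below maps tokens to integers by decreasing frequency and falls back on `<unk>` for unseen words. The helper names `build_toy_vocab` and `lookup` and the three example sentences are hypothetical; this is not the torchtext implementation, and torchtext may order ties differently.

```python
from collections import Counter

def build_toy_vocab(tokenized_texts, specials=("<unk>",)):
    """Map each token to an integer, special tokens first, then by frequency."""
    counts = Counter(tok for tokens in tokenized_texts for tok in tokens)
    # Sort by decreasing frequency, then alphabetically to break ties
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + ordered
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, unk="<unk>"):
    """Convert tokens to indices; unknown tokens get the index of <unk>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

texts = ["un film mauvais", "un très bon film", "un navet"]
stoi = build_toy_vocab(t.split() for t in texts)
print(lookup(stoi, "un film génial".split()))  # [1, 2, 0]
```

The word 'génial' never appears in the toy training texts, so it receives index 0, the index of `<unk>`, which mirrors the role of `vocab.set_default_index` above.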


#### Text and label pipelines

The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 

The label pipeline converts the label into integers. 

In [None]:
text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) #simple mapping to self

In [None]:
text_pipeline('Avant cette série, je ne connaissais que Urgence')

[2910, 18, 89, 16, 17, 6120, 8, 10529]

In [None]:
label_pipeline('0')

0

#### Generate data batch and iterator

Here we also use *torch.utils.data.DataLoader*, this time with a simple list of (label, review text) pairs, as stored in *train_iter*.

Before sending to the model, we apply a function, *collate_fn*, to our input data:
* The input to *collate_fn* is a batch of examples, whose size is the *batch_size* given to *DataLoader*, 
* *collate_fn* processes them according to the data processing pipelines declared previously. 

In *collate_batch*, we define how we want to pre-process our data.

The function is directly called within *DataLoader*:
```
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
```

Below: 
* the text entries in the original data batch input are packed into a list and concatenated as a single tensor. 
* the offset is a tensor of delimiters to represent the beginning index of the individual sequence in the text tensor
* Label is a tensor saving the labels of individual text entries.
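
The offsets computation can be sketched in plain Python, assuming a toy batch of three reviews already converted to indices (the index values are reused from the examples above, the grouping into three reviews is made up). The helper name `pack_batch` is hypothetical.

```python
from itertools import accumulate

def pack_batch(sequences):
    """Concatenate variable-length index lists and record where each one starts."""
    flat = [idx for seq in sequences for idx in seq]
    lengths = [len(seq) for seq in sequences]
    # Sequence i starts at the total length of sequences 0..i-1,
    # so the last length is not needed
    offsets = [0] + list(accumulate(lengths[:-1]))
    return flat, offsets

batch = [[2910, 18, 7], [89, 16], [17, 6120, 8, 10529]]
flat, offsets = pack_batch(batch)
print(flat)     # [2910, 18, 7, 89, 16, 17, 6120, 8, 10529]
print(offsets)  # [0, 3, 5]
```

Review `i` can be recovered as `flat[offsets[i]:offsets[i+1]]`, with the end of the flat list closing the last review.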

In [None]:
import torch

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

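def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        # Convert the review into a tensor of vocabulary indices
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Store the length of each review; the cumulative sum below
        # turns these lengths into start positions
        offsets.append(processed_text.size(0))
    label = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length: offsets[i] is the index where review i starts
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all the reviews of the batch into a single 1-D tensor
    text_list = torch.cat(text_list)
    return label.to(device), text_list.to(device), offsets.to(device)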

#train_iter = train_list 
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)

## 2. Define the model

### Bag of embeddings

The model is composed of the nn.EmbeddingBag layer: https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

* mode (string, optional) – "sum", "mean" or "max". Default=mean.
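
The three pooling modes can be sketched in plain Python on a flat list of indices plus offsets. The function name `embedding_bag`, the 3-row table and the index values are hypothetical and only illustrate the arithmetic; the real layer works on tensors, keeps the table as a trainable parameter, and this sketch assumes every bag is non-empty.

```python
def embedding_bag(table, flat_indices, offsets, mode="mean"):
    """Pool the rows of `table` selected by each bag of indices."""
    # Each bag runs from its offset to the next offset (or the end of the list)
    bounds = list(offsets) + [len(flat_indices)]
    pooled = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        rows = [table[i] for i in flat_indices[start:end]]
        columns = list(zip(*rows))  # one tuple per embedding dimension
        if mode == "sum":
            pooled.append([sum(c) for c in columns])
        elif mode == "max":
            pooled.append([max(c) for c in columns])
        else:  # "mean"
            pooled.append([sum(c) / len(c) for c in columns])
    return pooled

table = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
flat_indices = [0, 1, 2, 1]   # two bags: [0, 1, 2] and [1]
offsets = [0, 3]

for mode in ("sum", "mean", "max"):
    print(mode, embedding_bag(table, flat_indices, offsets, mode))
```

With "sum", long reviews produce vectors with larger norms than short ones, whereas "mean" normalizes by the review length; this is one reason the choice of mode can affect training.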

### Weight initialization

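PyTorch initializes each layer with a default scheme (for example, the weights of nn.Linear are drawn from a uniform distribution, and embedding weights from a standard normal distribution). You can also specify the initialization yourself, as done below in *init_weights*, and choose among varied options: https://pytorch.org/docs/stable/nn.init.html. Note that *init_weights* only takes effect if it is called explicitly, which the code below does not do. Further information here https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch or here https://discuss.pytorch.org/t/clarity-on-default-initialization-in-pytorch/84696/3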

In [None]:
import torch
import torch.nn as nn

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(FeedforwardNeuralNetModel, self).__init__()

        # Define the parameters that you will need. 
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True) # <----

        # Linear function
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

        print( 'embed_dim', embed_dim, 'hidden_dim', hidden_dim)

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc1.weight.data.uniform_(-initrange, initrange)
        self.fc1.bias.data.fill_(0.01)
        self.fc2.weight.data.uniform_(-initrange, initrange)
        self.fc2.bias.data.zero_()

    def forward(self, text, offsets):

        embedded = self.embedding(text, offsets)

        # Linear function  # LINEAR
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

### Training and evaluation functions

The training and evaluation functions below are written differently from the previous ones, but they do roughly the same thing.

In [None]:
import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count


## 3. Running an experiment 

### Adjusting the learning rate

The code below uses a *scheduler*: *torch.optim.lr_scheduler* provides several methods to adjust the learning rate based on the number of epochs.

Learning rate scheduling should be applied after the optimizer’s update.

* torch.optim.lr_scheduler.StepLR: Decays the learning rate of each parameter group by gamma every step_size epochs.

https://pytorch.org/docs/stable/optim.html
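
In the training loop of this notebook, `scheduler.step()` is only called when the validation accuracy falls below the best value seen so far, so with `step_size=1` each such call multiplies the learning rate by `gamma`. The plain-Python sketch below simulates that logic without PyTorch; the function name `simulate_schedule` is hypothetical, and the accuracies are the first five validation scores printed in the run below.

```python
def simulate_schedule(valid_accuracies, lr=5.0, gamma=0.1):
    """Return the learning rate in effect after each epoch."""
    best = None
    history = []
    for acc in valid_accuracies:
        if best is not None and best > acc:
            # Validation accuracy dropped: decay, as scheduler.step() would
            lr *= gamma
        else:
            # New best (or first epoch): remember it, keep the learning rate
            best = acc
        history.append(lr)
    return history

print(simulate_schedule([0.628, 0.665, 0.789, 0.794, 0.738]))
```

Under these assumptions the learning rate stays at 5 for the first four epochs and is divided by 10 after the fifth, when the validation accuracy drops from 0.794 to 0.738.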

In [None]:
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emb_dim = 3000
hid_dim = 64
model = FeedforwardNeuralNetModel(vocab_size, emb_dim, hid_dim, num_class).to(device)

embed_dim 3000 hidden_dim 64


In [None]:
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # <----
total_accu = None
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(dev_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val: # <----
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

-----------------------------------------------------------
| end of epoch   1 | time:  0.52s | valid accuracy    0.628 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   2 | time:  0.49s | valid accuracy    0.665 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   3 | time:  0.48s | valid accuracy    0.789 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   4 | time:  0.48s | valid accuracy    0.794 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   5 | time:  0.47s | valid accuracy    0.738 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   6 | time:  0.50s |

In [None]:
print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.680


## 4. Using pretrained embeddings

Upload the file *cc.fr.300.10000.vec*: the first 10000 lines of the FastText embeddings for French, https://fasttext.cc/docs/en/crawl-vectors.html.

### Load the vectors

▶▶ **Write a function that loads the vectors**, i.e. a dictionary mapping a word to its vector, as defined in the fasttext file. 

▶▶ **Print the vocabulary and the size of the embeddings.**

In [None]:
import io

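def load_vectors(fname):
    data = {}
    # The context manager closes the file once all the vectors are read
    with io.open(fname, 'r', encoding='utf-8', newline='\n', errors='ignore') as fin:
        # Header line: number of words and embedding dimension of the full model
        n, d = map(int, fin.readline().split())
        print(n,d) #here in fact only 10000 words
        for line in fin:
            # Each line holds a word followed by the values of its vector
            tokens = line.rstrip().split(' ')
            data[tokens[0]] = [float(t) for t in tokens[1:]]
    return data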

embed_file='cc.fr.300.10000.vec'
vectors = load_vectors( embed_file )
print(vectors.keys() )
print( vectors['de'] )

2000000 300
dict_keys([',', 'de', '.', '</s>', 'la', 'et', ':', 'à', 'le', '"', 'en', '’', 'les', 'des', ')', '(', 'du', 'est', 'un', "l'", "d'", 'une', 'pour', '/', '|', 'dans', 'sur', 'que', 'par', 'au', 'a', 'l', 'qui', '-', 'd', 'il', 'pas', '!', 'avec', '_', 'plus', "'", 'Le', 'ce', 'ou', 'La', 'ne', 'se', '»', '...', '?', 'vous', 'sont', 'son', '«', 'je', 'Les', 'Il', 'aux', '1', ';', 'mais', "qu'", 'on', "n'", 'comme', '2', 'sa', 'cette', 'y', 'nous', 'été', 'tout', 'fait', 'En', "s'", 'bien', 'ses', 'très', 'ont', 's', 'être', 'votre', 'ai', 'elle', 'n', '3', 'même', "L'", 'deux', 'faire', "c'", 'aussi', '>', 'leur', '%', 'si', 'entre', 'qu', '€', '&', '4', 'sans', 'Je', "j'", 'était', '10', 'autres', 'tous', 'peut', 'France', 'ces', '…', '5', 'lui', 'me', ']', '[', 'où', 'ans', '6', '#', 'après', '+', 'ils', 'dont', 'Pour', '°', '–', 'temps', '*', 'sous', 'Un', 'avoir', 'L', 'A', '}', 'site', 'peu', 'mon', 'encore', '12', 'depuis', '0', 'ça', 'fois', '2017', 'ainsi', 'alors', 

### Build the weight matrix

Now we build a matrix over the dataset associating each word to its vector. For each word in the dataset’s vocabulary, we check if it is in FastText’s vocabulary:
* if yes: load its pre-trained word vector. 
* else: we initialize a random vector.

In [None]:
emb_dim = 300
matrix_len = len(vocab)
weights_matrix = np.zeros((matrix_len, emb_dim))
words_found = 0

for i in range(0, len(vocab)):
    word = vocab.lookup_token(i)
    try: 
        weights_matrix[i] = vectors[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))
weights_matrix = torch.from_numpy(weights_matrix)
print( weights_matrix.size())

torch.Size([43072, 300])


### Embedding layer

The code below defines a function that builds the embedding layer using the pretrained embeddings.

In [None]:
def create_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.size()
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    emb_layer.load_state_dict({'weight': weights_matrix}) # <----
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim

### Model definition

Now we can modify our model to add this embedding layer. 

Note that the embedding bag now takes pre-initialized weights. 

In [None]:
class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self, embed_dim, hidden_dim, output_dim, weights_matrix):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(FeedforwardNeuralNetModel2, self).__init__()

        # Define the parameters that you will need. 
        self.embedding, num_embeddings, embedding_dim = create_emb_layer(weights_matrix, True)

        self.embedding_sum = nn.EmbeddingBag.from_pretrained(  self.embedding.weight, mode='sum')

        # Linear function
        self.fc1 = nn.Linear(embed_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, text, offsets):

        embedded = self.embedding_sum(text, offsets)

        # Linear function  # LINEAR
        out = self.fc1(embedded)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

In [None]:
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training
emb_dim = 300
hidden_dim = 64

num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
model = FeedforwardNeuralNetModel2(emb_dim, hidden_dim, num_class, weights_matrix).to(device)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1) # <----
total_accu = None
train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(dev_iter, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_iter, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val: # <----
      scheduler.step()
    else:
       total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

-----------------------------------------------------------
| end of epoch   1 | time:  0.26s | valid accuracy    0.448 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   2 | time:  0.26s | valid accuracy    0.645 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   3 | time:  0.26s | valid accuracy    0.663 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   4 | time:  0.27s | valid accuracy    0.572 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   5 | time:  0.27s | valid accuracy    0.661 
-----------------------------------------------------------
-----------------------------------------------------------
| end of epoch   6 | time:  0.26s |

In [None]:
print(model.embedding.weight[vocab.lookup_indices(['mauvais'])])

tensor([[ 1.3040,  1.0493,  0.4355,  ..., -0.3605,  0.5034, -1.3554]],
       grad_fn=<IndexBackward0>)


* We can also explore the embeddings that are created by the architecture. Run the script in interactive mode, and issue the following commands at the Python prompt:
```
m = model.layers[0].get_weights()[0]
tp3_utils.calcSim('mauvais',  w2i, i2w, m)
```
The first line extracts the embedding matrix from the model, and the second line computes the most similar embeddings for the word 'mauvais', using cosine similarity. Do the results make sense? Try another word with a positive connotation. Note that `model.layers[0].get_weights()` is Keras-style syntax; with the PyTorch models defined above, the embedding matrix is available as `model.embedding.weight`.
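
A nearest-neighbour search by cosine similarity can be sketched in plain Python as follows. The functions `cosine` and `most_similar` and the 2-dimensional toy vectors are hypothetical stand-ins; in the notebook, the vectors would be the rows of `model.embedding.weight`, with `vocab.lookup_token` giving the word for each row.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if one of them is null)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(word, vectors, k=3):
    """Return the k words whose vectors are closest to the vector of `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up vectors, chosen so that 'nul' points in almost the same
# direction as 'mauvais' and 'excellent' in the opposite direction
toy_vectors = {
    "mauvais":   [1.0, 0.0],
    "nul":       [0.9, 0.1],
    "excellent": [-1.0, 0.2],
    "film":      [0.1, 1.0],
}
print(most_similar("mauvais", toy_vectors, k=2))
```

With these toy vectors the closest neighbour of 'mauvais' is 'nul', followed by 'film'; 'excellent' comes last because its vector points the opposite way.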


In [None]:
import pandas as pd
import numpy as np
import re
import sklearn
import torch
from torch.utils.data import TensorDataset, DataLoader

from sklearn.feature_extraction.text import CountVectorizer

train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# This will be the size of the vectors representing the input
MAX_FEATURES = 5000 

# Load train set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
    
# -- VECTORIZE
print("Creating features from bag of words...")  
vectorizer = CountVectorizer( analyzer = "word", max_features = MAX_FEATURES ) 
train_data_features = vectorizer.fit_transform(train_df["review"])
# -- TO DENSE
x_train = train_data_features.toarray()
y_train = np.asarray(train_df["sentiment"])
print( "TRAIN:", x_train.shape )

dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
dev_data_features = vectorizer.transform(dev_df["review"])
x_dev = dev_data_features.toarray()
y_dev = np.asarray(dev_df["sentiment"])
print( "DEV:", x_dev.shape )

test_df = pd.read_csv(test_path, header=0, delimiter="\t", quoting=3)
test_data_features = vectorizer.transform(test_df["review"])
x_test = test_data_features.toarray()
y_test = np.asarray(test_df["sentiment"])
print( "TEST:", x_test.shape )

count_train = x_train.shape[0]

Creating features from bag of words...
TRAIN: (5027, 5000)
DEV: (549, 5000)
TEST: (544, 5000)


### 1.2 Load the data

Note that batch size is chosen here.

In [None]:
# Load data into TENSORS

def load_data( x_train, y_train, x_dev, y_dev, x_test, y_test, batch_size=1 ):
    #batch_size = 1 # == no batch
    # create Tensor dataset
    train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

    # dataloaders
    # make sure to SHUFFLE your data
    train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    
    dev_data = TensorDataset(torch.from_numpy(x_dev).to(torch.float), torch.from_numpy(y_dev))
    dev_loader = DataLoader(dev_data, shuffle=True, batch_size=batch_size)

    test_data = TensorDataset(torch.from_numpy(x_test).to(torch.float), torch.from_numpy(y_test))
    test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

    return train_loader, dev_loader, test_loader

### 1.3 Neural Network Definition

Now we can build our learning model.

▶▶**What are the elements that can be changed here?**

### SOLUTION
Note that here you can change: hidden_dim, number of hidden layers, activation function.

In [None]:
import torch
import torch.nn as nn

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(FeedforwardNeuralNetModel, self).__init__()

        # Define the parameters that you will need. 
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, x):
        # Linear function  # LINEAR
        out = self.fc1(x)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

### 1.4 Training function

▶▶**What are the elements that can be changed here?**

### SOLUTION

Here you can change: the total number of epochs, the criterion / loss, the optimizer which also includes the learning rate.
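
For reference, the criterion used in this notebook, `nn.CrossEntropyLoss`, applies a softmax to the logits and returns the negative log-probability of the gold class. For a single example this can be sketched in plain Python as below; the function name `cross_entropy` and the example logits are illustrative, and the PyTorch criterion additionally averages over the batch by default.

```python
import math

def cross_entropy(logits, gold):
    """Negative log-probability of the gold class under softmax(logits)."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[gold]

# Uniform logits: each of the 2 classes has probability 0.5, so the loss is ln 2
print(cross_entropy([0.0, 0.0], gold=1))
# A confident and correct prediction gives a small loss...
print(cross_entropy([2.0, -1.0], gold=0))
# ...while the same logits with the other gold label give a large one
print(cross_entropy([2.0, -1.0], gold=1))
```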

In [None]:
# TRAINING
def train( model, train_loader, optimizer, num_epochs=5 ):
    for epoch in range(num_epochs):
        train_loss, total_acc, total_count = 0, 0, 0
        for input, label in train_loader:
            # Step1. Clearing the accumulated gradients
            optimizer.zero_grad()

            # Step 2. Forward pass to get output/logits
            outputs = model( input )

            # Step 3. Compute the loss, gradients, and update the parameters by
            # calling optimizer.step()
            # - Calculate Loss: softmax --> cross entropy loss
            loss = criterion(outputs, label)
            # - Getting gradients w.r.t. parameters
            loss.backward()
            # - Updating parameters
            optimizer.step()

            # Accumulating the loss over time
            train_loss += loss.item()
            total_acc += (outputs.argmax(1) == label).sum().item()
            total_count += label.size(0)

        # Compute accuracy on train set at each epoch
        print('Epoch: {}. Loss: {}. ACC {} '.format(epoch, train_loss/count_train, total_acc/count_train))
        
        total_acc, total_count = 0, 0
        train_loss = 0

### 1.5 Evaluation

In [None]:
from sklearn.metrics import classification_report


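def evaluate( model, dev_loader ):
    predictions = []
    gold = []

    with torch.no_grad():
        for input, label in dev_loader:
            # The model returns unnormalized scores (logits), one row per example
            logits = model(input)
            # Keep the prediction of every example in the batch, so that
            # the function also works when batch_size > 1
            predictions.extend( torch.argmax(logits, dim=1).cpu().tolist() )
            gold.extend( label.cpu().tolist() )

    print(classification_report(gold, predictions))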



## 2. Running an experiment

Run the model as last week and save the score on dev for future comparison.

### TEST #1

▶▶**Describe the setting of the 'default' experiment**

### SOLUTION

**BoW, 5000 features, Batch size: 1, 1 hidden layer, hidden size: 4, activation: sigmoid, learning rate: 0.1, optimizer: SGD, max epochs: 5**

In [None]:
# Load data
train_loader, dev_loader, test_loader = load_data( x_train, y_train, x_dev, y_dev, x_test, y_test, batch_size=1 )

In [None]:
# Many choices here!
VOCAB_SIZE = MAX_FEATURES
input_dim = VOCAB_SIZE 
hidden_dim = 4
output_dim = 2

learning_rate = 0.1
num_epochs = 5

criterion = nn.CrossEntropyLoss()

# Initialization of the model
model_bow = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD( model_bow.parameters(), lr=learning_rate )

# Train and evaluate

train( model_bow, train_loader, optimizer )

evaluate( model_bow, dev_loader )

Epoch: 0. Loss: 0.525575307463908. ACC 0.7344340560970758 
Epoch: 1. Loss: 0.3772354137683169. ACC 0.8370797692460712 
Epoch: 2. Loss: 0.30906129108996105. ACC 0.8687089715536105 
Epoch: 3. Loss: 0.26841267688326875. ACC 0.8899940322259797 
Epoch: 4. Loss: 0.23668165647298592. ACC 0.9047145414760295 
              precision    recall  f1-score   support

           0       0.76      0.86      0.81       230
           1       0.89      0.81      0.85       319

    accuracy                           0.83       549
   macro avg       0.83      0.83      0.83       549
weighted avg       0.84      0.83      0.83       549



## 3. Exercises

Now, try to change:
* Batch size 1 --> 50, 100, 1000
* Increase max number of epochs (with best batch size)
* Change the size of the hidden layer
* Try with one additional layer 
* Change the number of input features

How does this affect the loss and the performance of the model?

In [None]:
# -----------------------------------------------------
# Change Batch size
# -----------------------------------------------------

import pandas as pd
import numpy as np
import re
import sklearn
import torch
from torch.utils.data import TensorDataset, DataLoader

from sklearn.feature_extraction.text import CountVectorizer


batch_size = 50


train_path = "allocine_train.tsv"
dev_path = "allocine_dev.tsv"
test_path = "allocine_test.tsv"

# This will be the size of the vectors representing the input
MAX_FEATURES = 5000 

# Load train set
train_df = pd.read_csv(train_path, header=0, delimiter="\t", quoting=3)
    
# -- VECTORIZE
print("Creating features from bag of words...")  
vectorizer = CountVectorizer( analyzer = "word", max_features = MAX_FEATURES ) 
train_data_features = vectorizer.fit_transform(train_df["review"])
# -- TO DENSE
x_train = train_data_features.toarray()
y_train = np.asarray(train_df["sentiment"])
print( "TRAIN:", x_train.shape )

dev_df = pd.read_csv(dev_path, header=0, delimiter="\t", quoting=3)
dev_data_features = vectorizer.transform(dev_df["review"])
x_dev = dev_data_features.toarray()
y_dev = np.asarray(dev_df["sentiment"])
print( "DEV:", x_dev.shape )

count_train = x_train.shape[0]


# -- TO TENSORS
# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

# dataloaders

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)
    
dev_data = TensorDataset(torch.from_numpy(x_dev).to(torch.float), torch.from_numpy(y_dev))
dev_loader = DataLoader(dev_data, shuffle=True, batch_size=batch_size)

test_data = TensorDataset(torch.from_numpy(x_test).to(torch.float), torch.from_numpy(y_test))
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size)

# Many choices here!
VOCAB_SIZE = MAX_FEATURES
input_dim = VOCAB_SIZE 
hidden_dim = 4
output_dim = 2

learning_rate = 0.1
num_epochs = 5

criterion = nn.CrossEntropyLoss()

# Initialization of the model
model = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# Train and evaluate

train( model, train_loader, optimizer )

evaluate( model, dev_loader )

Creating features from bag of words...
TRAIN: (5027, 5000)
DEV: (549, 5000)
TEST: (544, 5000)
Epoch: 0. Loss: 0.013512300513054852. ACC 0.6230356077183211 
Epoch: 1. Loss: 0.012526845573928405. ACC 0.7189178436443207 
Epoch: 2. Loss: 0.011339280912172778. ACC 0.7642729261985279 
Epoch: 3. Loss: 0.010151959115334385. ACC 0.8058484185398846 
Epoch: 4. Loss: 0.009153070276955475. ACC 0.8273324050129301 


ValueError: ignored

### Answers: Batch size

Increasing the batch size leads to faster training, but also to a degradation in terms of performance.

see e.g. (Keskar et al. 2016) https://arxiv.org/abs/1609.04836 :
* *It has been observed in practice that when using a larger batch there is a degradation in the quality of the model, as measured by its ability to generalize ... large-batch methods tend to converge to sharp minimizers of the training and testing functions-and as is well known, sharp minima lead to poorer generalization*

Trade-off between faster training and generalization ability.

People often test typical values of 32, 64, 128, 256, 512 and 1024 (but it also depends on the size of the training data; here we have a very small dataset).
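
One way to see the speed side of this trade-off is to count the parameter updates per epoch, which is the number of batches. The sketch below uses the 5027 training reviews reported above; the helper name `updates_per_epoch` is hypothetical.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of batches, hence of optimizer steps, in one pass over the data."""
    # The last batch may be smaller, so round up
    return math.ceil(n_examples / batch_size)

n_train = 5027  # size of the training set reported above
for batch_size in (1, 50, 100, 1000):
    print(f"batch_size={batch_size:5d} -> "
          f"{updates_per_epoch(n_train, batch_size):5d} updates per epoch")
```

With a batch size of 100 the model receives about 100 times fewer updates per epoch than with a batch size of 1, which probably explains why more epochs are needed to reach a similar accuracy. Note also that the training function above adds one averaged loss per batch and divides the total by the number of examples, so the printed loss values shrink with the batch size and are not directly comparable across these runs.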



In [None]:
# -----------------------------------------------------
# Increase Batch size AND epochs
# -----------------------------------------------------

import torch
from torch.utils.data import TensorDataset, DataLoader

# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

# dataloaders
batch_size = 100  # mini-batches of 100 examples

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

# Initialization of the model
model_bow_b8 = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model_bow_b8.parameters(), lr=learning_rate)

train( model_bow_b8, train_loader, num_epochs=10 )

evaluate( model_bow_b8, dev_loader )

Epoch: 0. Loss: 0.007128170216228181. ACC 0.502884424109807 
Epoch: 1. Loss: 0.006857696894696344. ACC 0.5637557191167695 
Epoch: 2. Loss: 0.006673726334542433. ACC 0.6033419534513627 
Epoch: 3. Loss: 0.0064349481948972435. ACC 0.6415357071812214 
Epoch: 4. Loss: 0.006109899072588285. ACC 0.6803262383131092 
Epoch: 5. Loss: 0.005702022247435868. ACC 0.7396061269146609 
Epoch: 6. Loss: 0.005377006935082067. ACC 0.7708374776208474 
Epoch: 7. Loss: 0.005075840197573226. ACC 0.7885418738810424 
Epoch: 8. Loss: 0.004838178296295008. ACC 0.7943107221006565 
Epoch: 9. Loss: 0.004533350378755122. ACC 0.8183807439824945 
              precision    recall  f1-score   support

           0       0.78      0.75      0.76       230
           1       0.82      0.85      0.83       319

    accuracy                           0.81       549
   macro avg       0.80      0.80      0.80       549
weighted avg       0.80      0.81      0.80       549



In [None]:
# -----------------------------------------------------
# Change Hidden layer size
# -----------------------------------------------------

hidden_dim = 3000

# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

# dataloaders
batch_size = 1 #no batch, or batch = 1

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

# Initialization of the model
model_bow_b8_h50 = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model_bow_b8_h50.parameters(), lr=learning_rate)

train( model_bow_b8_h50, train_loader )

evaluate( model_bow_b8_h50, dev_loader )

Epoch: 0. Loss: 2.9019488983119452. ACC 0.6333797493534912 
Epoch: 1. Loss: 0.893522029110294. ACC 0.7777998806445195 
Epoch: 2. Loss: 0.666859273842346. ACC 0.8265367018102248 
Epoch: 3. Loss: 0.4748491511739731. ACC 0.8557787945096479 
Epoch: 4. Loss: 0.4348698714605433. ACC 0.8744778197732246 
              precision    recall  f1-score   support

           0       0.77      0.84      0.80       230
           1       0.88      0.82      0.85       319

    accuracy                           0.83       549
   macro avg       0.82      0.83      0.83       549
weighted avg       0.83      0.83      0.83       549



## Answer: Hidden layer size

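Increasing the hidden layer size may improve performance in terms of macro F1 (0.83 on the dev set with 3000 hidden units), at the cost of a longer training time.

There is no magic number: the size of the hidden layer is a hyper-parameter that needs to be optimized. In general, a few typical values are tested. Some guidance and rules of thumb (see Sources):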
* *Using too few neurons in the hidden layers will result in something called underfitting. Underfitting occurs when there are too few neurons in the hidden layers to adequately detect the signals in a complicated data set.* 
* *too many neurons in the hidden layers may result in overfitting. Overfitting occurs when the neural network has so much information processing capacity that the limited amount of information contained in the training set is not enough to train all of the neurons in the hidden layers. A second problem [is increasing] the time it takes to train the network.*
* "Rules"
  * *The number of hidden neurons should be between the size of the input layer and the size of the output layer.* 
  * *The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.* 
  * *The number of hidden neurons should be less than twice the size of the input layer.*
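
These rules of thumb are easy to compare with the sizes tried above. The sketch below counts the trainable parameters of a network with one hidden layer (two `nn.Linear` layers, weights plus biases) for `input_dim = 5000` and `output_dim = 2`. The helper `count_parameters` is a hypothetical name used only for this illustration.

```python
def count_parameters(input_dim, hidden_dim, output_dim):
    # First linear layer: weight matrix + bias vector
    fc1 = input_dim * hidden_dim + hidden_dim
    # Second linear layer (readout): weight matrix + bias vector
    fc2 = hidden_dim * output_dim + output_dim
    return fc1 + fc2

input_dim, output_dim = 5000, 2
# Hidden size suggested by the "2/3 of the input layer, plus the output layer" rule
rule_two_thirds = round(2 * input_dim / 3) + output_dim

for hidden_dim in (4, 50, 3000, rule_two_thirds):
    print(hidden_dim, count_parameters(input_dim, hidden_dim, output_dim))
```

With only 5027 training examples, a hidden layer of 3000 units already gives about 15 million parameters, so overfitting is a real risk and the dev score should be monitored.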

## Change the size of the input layer

In [None]:
import pandas as pd
import numpy as np
import re
import sklearn
import torch
from torch.utils.data import TensorDataset, DataLoader

from sklearn.feature_extraction.text import CountVectorizer

# This will be the size of the vectors representing the input
MAX_FEATURES = 500

# Load train set
train_df = pd.read_csv("allocine_train.tsv", header=0,
                    delimiter="\t", quoting=3)

print("Creating features from bag of words...")  
vectorizer = CountVectorizer(
    analyzer = "word",
    max_features = MAX_FEATURES
) 
train_data_features = vectorizer.fit_transform(train_df["review"])

x_train = train_data_features.toarray()
y_train = np.asarray(train_df["sentiment"])
print( "TRAIN:", x_train.shape )
count_train = x_train.shape[0]


# Transform to tensors
# create Tensor dataset
train_data = TensorDataset(torch.from_numpy(x_train).to(torch.float), torch.from_numpy(y_train))

# dataloaders
batch_size = 1 #no batch, or batch = 1

# make sure to SHUFFLE your data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size)

# Process dev data

dev = pd.read_csv("allocine_dev.tsv", header=0,
                   delimiter="\t", quoting=3)
dev_data_features = vectorizer.transform(dev["review"])
x_dev = dev_data_features.toarray()
y_dev = np.asarray(dev["sentiment"])
print( "DEV:", x_dev.shape )

dev_data = TensorDataset(torch.from_numpy(x_dev).to(torch.float), torch.from_numpy(y_dev))
dev_loader = DataLoader(dev_data, shuffle=True, batch_size=batch_size)

Creating features from bag of words...
TRAIN: (5027, 500)
DEV: (549, 500)


In [None]:
# -----------------------------------------------------
# Change input layer size (500 features)
# -----------------------------------------------------

hidden_dim = 50
input_dim = 500

# Initialization of the model
model_bow_500 = FeedforwardNeuralNetModel(input_dim, hidden_dim, output_dim)

optimizer = torch.optim.SGD(model_bow_500.parameters(), lr=learning_rate)

train( model_bow_500, train_loader )

evaluate( model_bow_500, dev_loader )

Epoch: 0. Loss: 0.5523177005176193. ACC 0.7115575890192958 
Epoch: 1. Loss: 0.4214836642410772. ACC 0.8012731251243286 
Epoch: 2. Loss: 0.37971238855068423. ACC 0.8257409986075194 
Epoch: 3. Loss: 0.34775492108362466. ACC 0.8436443206683907 
Epoch: 4. Loss: 0.3202604617960704. ACC 0.8567734235130297 
              precision    recall  f1-score   support

           0       0.73      0.76      0.75       230
           1       0.82      0.80      0.81       319

    accuracy                           0.78       549
   macro avg       0.78      0.78      0.78       549
weighted avg       0.78      0.78      0.78       549



## 4. Continuous bag of words

The feature representation we use above – in which a word is either ‘on’ or ‘off’ – works well, but it does not allow the network to learn similarity between words. 
As we’ll see during the lectures, we can also make use of embeddings (a rich vector representation for each word, that implicitly captures the similarity between words). 
This is the approach we’ll take here: each individual word is represented by an embedding vector of 50 dimensions. 

In order to create the representation of a document, we will take all the embeddings of the words that appear in the document, and sum them together (or take their average).
So instead of having an input vector of size 5000, we now have an input vector of size 50, that represents the ‘average’, combined meaning of all the words in the document taken together. 

Crucially, the neural network will also learn the embeddings during training: the embeddings of the network are also parameters that are optimized according to the loss function.
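
A minimal sketch of this idea in PyTorch, assuming documents are given as lists of word indices. It is not the content of *sentiment_cbow.py*: the class name `CBOWClassifier` and the toy word indices are made up for illustration. `nn.EmbeddingBag` with `mode="mean"` looks up the embedding of each word and averages them per document.

```python
import torch
import torch.nn as nn

class CBOWClassifier(nn.Module):
    def __init__(self, vocab_size, embedding_dim, output_dim):
        super().__init__()
        # Randomly initialized embeddings, averaged per document
        self.embedding = nn.EmbeddingBag(vocab_size, embedding_dim, mode="mean")
        # No hidden layer: the document vector goes straight to the output layer
        self.fc = nn.Linear(embedding_dim, output_dim)

    def forward(self, word_ids, offsets):
        doc_vec = self.embedding(word_ids, offsets)  # (n_docs, embedding_dim)
        return self.fc(doc_vec)                      # (n_docs, output_dim)

torch.manual_seed(0)
model = CBOWClassifier(vocab_size=5000, embedding_dim=50, output_dim=2)

# Two toy documents, concatenated: word ids [3, 17, 42] and [8, 99]
word_ids = torch.tensor([3, 17, 42, 8, 99])
offsets = torch.tensor([0, 3])  # start position of each document
logits = model(word_ids, offsets)
print(logits.shape)  # torch.Size([2, 2])
```

The embedding matrix appears in `model.parameters()`, so an optimizer built from it updates the embeddings together with the output layer.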


### Exercise
* Open the file *sentiment_cbow.py*, and read through the comments. Make sure you understand what each part of the script is doing.
* Run the code. Examine loss and accuracy. Is the result better or worse than the previous network architecture? Why?
* The architecture in the script does not contain a hidden layer: it directly feeds the input layer (the average document embedding) to the output layer. Add a hidden layer of 16 neurons and a ReLU activation function. How does this change the results?
* We can also explore the embeddings that are created by the architecture. Run the script in interactive mode, and issue the following commands at the Python prompt:
```
m = model.layers[0].get_weights()[0]
tp3_utils.calcSim('mauvais', w2i, i2w, m)
```
The first line extracts the embedding matrix from the model, and the second line computes the most similar embeddings for the word 'mauvais', using cosine similarity. Do the results make sense? Try another word with a positive connotation.
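
The implementation of `tp3_utils.calcSim` is not shown here, so the following is only a sketch of what such a nearest-neighbour lookup could look like. The function name `most_similar`, the toy vocabulary and the 2-dimensional embedding matrix are all assumptions made for the example; `i2w` is taken to be a list indexed by word id.

```python
import torch
import torch.nn.functional as F

def most_similar(word, w2i, i2w, m, topn=3):
    # Hypothetical helper: rank all words by cosine similarity to `word`
    m = torch.as_tensor(m, dtype=torch.float)
    normed = F.normalize(m, dim=1)         # unit-length rows
    sims = normed @ normed[w2i[word]]      # cosine similarity with every word
    sims[w2i[word]] = float("-inf")        # exclude the query word itself
    best = torch.topk(sims, k=min(topn, len(i2w) - 1))
    return [(i2w[i], s) for s, i in zip(best.values.tolist(), best.indices.tolist())]

# Toy vocabulary and embedding matrix (one row per word)
i2w = ["mauvais", "nul", "excellent", "film"]
w2i = {w: i for i, w in enumerate(i2w)}
m = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.2], [0.0, 1.0]]

neighbours = most_similar("mauvais", w2i, i2w, m)
print(neighbours)
```

In this toy matrix 'nul' points in almost the same direction as 'mauvais', so it comes first, while 'excellent' points the opposite way and comes last.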

In [None]:
import torch
import torch.nn as nn

EMBEDDING_DIM = 50

class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(FeedforwardNeuralNetModel, self).__init__()
        # Linear function
        self.fc1 = nn.Linear(input_dim, hidden_dim)

        # Non-linearity
        self.sigmoid = nn.Sigmoid()

        # Linear function (readout)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  

    def forward(self, x):
        # Linear function  # LINEAR
        out = self.fc1(x)

        # Non-linearity  # NON-LINEAR
        out = self.sigmoid(out) 

        # Linear function (readout)  # LINEAR
        out = self.fc2(out)
        return out

# Sources
* https://www.scholars.northwestern.edu/en/publications/on-large-batch-training-for-deep-learning-generalization-gap-and-
* https://arxiv.org/abs/1303.2314
* https://mydeeplearningnb.wordpress.com/2019/02/23/convnet-for-classification-of-cifar-10/
* Introduction to Neural Networks for Java (second edition) by Jeff Heaton: https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
 


# Recall

Before starting the practical session, let's take another look at some PyTorch code.

## PyTorch tutorial: BoW classifier
* From: https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html

This is a simple logistic regression: we map BoW representations to log probabilities over labels by passing the input through an affine map and then applying log softmax.

We compute: log Softmax(Ax+b)
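
As a small worked example of this formula, the sketch below computes log Softmax(Ax+b) by hand in plain Python for a toy case with 3 vocabulary words and 2 labels. The values of `A`, `b` and `x` are arbitrary and chosen only for illustration.

```python
import math

def log_softmax(scores):
    # Subtract the max before exponentiating for numerical stability
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

A = [[1.0, 0.0, 2.0],   # one row of weights per label
     [0.0, 1.0, 0.0]]
b = [0.5, -0.5]         # one bias per label
x = [1.0, 1.0, 0.0]     # toy BoW counts

# Affine map Ax + b: one score per label
scores = [sum(a * xi for a, xi in zip(row, x)) + bi for row, bi in zip(A, b)]
log_probs = log_softmax(scores)
print(scores, log_probs)
```

Exponentiating the log probabilities gives values that sum to 1, and the label with the highest score keeps the highest log probability.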

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

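VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2
# Map each label to an integer class index (used by make_target below)
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}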


class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    log_probs = model(bow_vector)
    print(log_probs)

In [None]:
# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])