
---

# Natural Language Processing project

---


### Tristan Basler - Clément Boulay <br>
CentraleSupélec

### Setting up a Multi-Layer Perceptron (MLP) for Dialog-Act prediction

### Requirements

We suggest that you run this notebook in <a href="https://colab.research.google.com">Google Colab</a>. 

In [None]:
!pip install torchtext
!pip install datasets
!pip install torchinfo

In [50]:
from datasets import load_dataset
from nltk.tokenize import TreebankWordTokenizer, TweetTokenizer
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter
from torchinfo import summary
from torchtext.vocab import GloVe, vocab, FastText

In [51]:
# init TensorBoard writer
writer = SummaryWriter()

### Fetching the dataset from HuggingFace

Please refer to the notebook `utils.ipynb` for an extensive dataset exploration.

In [3]:
# download the DailyDialog Act Corpus (dyda_da) of the SILICONE dataset from HuggingFace
dataset = load_dataset("silicone", "dyda_da")

Generating train split:   0%|          | 0/87170 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8069 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7740 [00:00<?, ? examples/s]

Dataset silicone downloaded and prepared to /root/.cache/huggingface/datasets/silicone/dyda_da/1.0.0/af617406c94e3f78da85f7ea74ebfbd3f297a9665cb54adbae305b03bc4442a5. Subsequent calls will reuse this data.



### Defining the `SiliconeDataset` Class

In [4]:
class SiliconeDataset(Dataset):
    """
    Class used for convenient handling of the dyda_da corpus from the Silicone Dataset.
    """
    def __init__(self, data):
      self.data = data

    def __len__(self):
      return len(self.data)
    
    def __getitem__(self, index):
      return self.data[index]

### Data preprocessing

Several preprocessing steps must be applied to the dataset utterances before we can define and train a deep learning model. <br>
First, we need to pad or clip utterances to a fixed length, as the neural networks we are going to implement do not handle variable-length sequences. <br>
Second, we need to tokenize utterances, at a level that will be discussed hereafter, using a predefined vocabulary. In our case, we will use pretrained word embeddings (FastText by default, GloVe as an alternative). <br>

#### Sequence padding/clipping, tokenizing and embedding

In [5]:
class UnsupportedTokenizingMethodError(Exception):
  """
  Exception raised when an unsupported tokenizing method is queried in the preprocessing pipeline. 
  """
  def __init__(self):
    super().__init__()
    pass

In [6]:
class UnsupportedEmbeddingMethodError(Exception):
  """
  Exception raised when an unsupported embedding method is queried in the preprocessing pipeline. 
  """
  def __init__(self):
    super().__init__()
    pass

In [7]:
class PreprocessingPipeline():
  """
  Class that implements the full preprocessing pipeline.
  """
  def __init__(self,  tokenizer_used: str = "treebank", embedding_method: str = "fasttext", max_length: int = 20):

    self.max_length = max_length
    self.supported_tokenizers = ["tweet", "treebank"]
    self.supported_embeddings = ["fasttext", "glove"]

    if tokenizer_used not in self.supported_tokenizers:
      raise UnsupportedTokenizingMethodError
    if tokenizer_used == "tweet":
      self.tokenizer = TweetTokenizer()
    elif tokenizer_used == "treebank":
      self.tokenizer = TreebankWordTokenizer()

    if embedding_method not in self.supported_embeddings:
      raise UnsupportedEmbeddingMethodError
    else:
      self.embedding_method = embedding_method
    if self.embedding_method == "fasttext":
      self.pretrained_vectors = FastText(language='en')
    elif self.embedding_method == "glove":
      # note: these vectors are 50-dimensional, whereas word2vec_dim is set to 300 further down
      self.pretrained_vectors = GloVe(name="6B", dim=50)
    
    # get the vocabulary that corresponds to the used embeddings
    self.pretrained_vocab = vocab(self.pretrained_vectors.stoi)

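    # add the <unk> and <pad> tokens to the vocabulary
    # NOTE: inserting tokens at the front shifts the indices of the pretrained tokens,
    # so the rows of self.pretrained_vectors.vectors are then offset with respect to this vocabulary
    unk_token = "<unk>"
    unk_index = 0
    pad_token = "<pad>"
    pad_index = 1
    self.pretrained_vocab.insert_token(unk_token, unk_index)
    self.pretrained_vocab.insert_token(pad_token, pad_index)
    self.pretrained_vocab.set_default_index(unk_index)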

    self.vocab_stoi = self.pretrained_vocab.get_stoi()

  def clip_or_pad_sequence(self, tokenized_input_sequence: list):
    """
    Perform a padding or clipping operation on the tokenized input sequence, depending on its length. 
    """
    if len(tokenized_input_sequence) == self.max_length:
      return tokenized_input_sequence
    elif len(tokenized_input_sequence) < self.max_length:
      # need to pad the tokenized sequence up to max_length
      padded_sequence = tokenized_input_sequence + [self.vocab_stoi['<pad>'] for i in range(len(tokenized_input_sequence), self.max_length)]
      return padded_sequence
    else:
      clipped_sequence = tokenized_input_sequence[:self.max_length]
      return clipped_sequence

  def tokenize_input_sequence(self, input_sequence: str):
    """
    Lowercase and tokenize the input sequence with the configured tokenizer, then map each token to its vocabulary index.
    """
    tokenized_sequence = [self.vocab_stoi[token] if token in self.vocab_stoi else self.vocab_stoi['<unk>'] for token in self.tokenizer.tokenize(input_sequence.lower())]
    return tokenized_sequence

  def preprocess_input(self, input_sequence: str):
    """
    Run the whole preprocessing pipeline: tokenization, then padding or clipping.
    """
    tokenized_input_sequence = self.tokenize_input_sequence(input_sequence)
    clipped_sequence = self.clip_or_pad_sequence(tokenized_input_sequence)
    return clipped_sequence

In [None]:
preprocessing_pipeline = PreprocessingPipeline()

In [9]:
# testing the preprocessing pipeline
test_seq = "I love natural language processing."
preprocessed_seq = preprocessing_pipeline.preprocess_input(test_seq)

In [10]:
preprocessed_seq

[29, 568, 915, 321, 3786, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

We see that the original sentence has been padded to a length of 20. The chosen tokenizer (`TreebankWordTokenizer`) has also kept the final period of the utterance as a separate token. The `<pad>` token has index 1 in the vocabulary, as expected, since we inserted it at that position beforehand.
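
As a minimal sketch of the two steps discussed above, the snippet below maps tokens to indices with a fallback to `<unk>`, then pads or clips to a fixed length. The toy vocabulary `toy_stoi` and the helper `encode` are hypothetical and only serve the illustration; the real indices come from the pretrained vocabulary.

```python
# Toy vocabulary (hypothetical): index 0 is <unk> and index 1 is <pad>, as in the pipeline.
toy_stoi = {"<unk>": 0, "<pad>": 1, ".": 2, "i": 3, "love": 4, "language": 5}

def encode(tokens, stoi, max_length):
    """Map tokens to indices, then pad with <pad> or clip to max_length."""
    ids = [stoi.get(token, stoi["<unk>"]) for token in tokens]
    ids = ids[:max_length]
    return ids + [stoi["<pad>"]] * (max_length - len(ids))

# "natural" is not in the toy vocabulary, so it maps to <unk> (index 0)
short = encode(["i", "love", "natural", "language", "."], toy_stoi, max_length=8)
# a 12-token input is clipped to its first 8 tokens
long = encode(["i", "love", "language", "."] * 3, toy_stoi, max_length=8)

print(short)  # [3, 4, 0, 5, 2, 1, 1, 1]
print(long)   # [3, 4, 5, 2, 3, 4, 5, 2]
```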

In [None]:
# testing the preprocessing pipeline
test_seq_2 = "France has got the best cheese in the world, which is why so many tourists come to France for their annual vacation."
preprocessed_seq_2 = preprocessing_pipeline.preprocess_input(test_seq_2)

In [None]:
preprocessed_seq_2

[480,
 40,
 919,
 3,
 220,
 7313,
 7,
 3,
 98,
 0,
 38,
 14,
 540,
 97,
 131,
 7217,
 725,
 12,
 480,
 18]

In [None]:
print("Length of the original utterance was {} but the length of the preprocessed sequence is {}".format(len(test_seq_2.split(" ")), len(preprocessed_seq_2)))

Length of the original utterance was 22 but the length of the preprocessed sequence is 20


### Preprocessing the SILICONE dataset

In [11]:
def batch_preprocessing(entries):
  """ 
  Apply the preprocessing pipeline to a batch of input utterances.
  """
  preprocessed_batch = {}
  preprocessed_batch["Utterance"] = [preprocessing_pipeline.preprocess_input(entry) for entry in entries['Utterance']]
  preprocessed_batch['Label'] = entries['Label']
  return preprocessed_batch

In [12]:
dataset['train'] = dataset['train'].map(lambda utt: batch_preprocessing(utt), batched=True)

Map:   0%|          | 0/87170 [00:00<?, ? examples/s]

In [13]:
dataset['validation'] = dataset['validation'].map(lambda utt: batch_preprocessing(utt), batched=True)

Map:   0%|          | 0/8069 [00:00<?, ? examples/s]

In [14]:
dataset['test'] = dataset['test'].map(lambda utt: batch_preprocessing(utt), batched=True)

Map:   0%|          | 0/7740 [00:00<?, ? examples/s]

In [None]:
len(dataset['train'])

87170

### Creating DataLoaders

In [16]:
# instantiate the 3 DataLoaders
train_loader = DataLoader(SiliconeDataset(dataset['train']), batch_size=4, num_workers=1, shuffle=False, drop_last=False)
val_loader   = DataLoader(SiliconeDataset(dataset['validation']), batch_size=4, num_workers=1, shuffle=False, drop_last=False)
test_loader  = DataLoader(SiliconeDataset(dataset['test']), batch_size=4, num_workers=1, shuffle=False, drop_last=False)
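
Because each preprocessed utterance is stored as a Python list of `max_length` integers, PyTorch's default collate function groups the values position by position: a batch comes out as a list of `max_length` entries, each holding one position for every sample. The pure-Python sketch below mimics that layout with nested lists and shows the transposition needed to recover a `(batch_size, max_length)` layout. The helper names `collate_like_default` and `to_batch_first` are hypothetical.

```python
def collate_like_default(samples):
    """Mimic how list-valued fields are collated: group values position by position."""
    return [list(position) for position in zip(*samples)]

def to_batch_first(columns):
    """Transpose back so that each row is one utterance."""
    return [list(row) for row in zip(*columns)]

# Two utterances of length 3 (toy values)
samples = [[10, 11, 12], [20, 21, 22]]
columns = collate_like_default(samples)

print(columns)                  # [[10, 20], [11, 21], [12, 22]]
print(to_batch_first(columns))  # [[10, 11, 12], [20, 21, 22]]
```

With tensors, the same effect is obtained by stacking the entries and swapping the first two dimensions.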

### Implementing a simple Multi-Layer Perceptron model

In [17]:
# get the number of classes in the DYDA_DA corpus
classes = list(set(dataset["train"]["Label"]))
nb_classes = len(classes)

In [18]:
# maximal length of input tokenized utterances
max_length = 20 
batch_size = 4
word2vec_dim  = 300

In [19]:
class MultiLayerPerceptron(torch.nn.Module):
  """
  Implement a simple Multi-Layer Perceptron with one hidden layer followed by an output layer.
  For documentation on the Linear layer please refer to: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear
  """
  def __init__(self, pretrained_vectors):
        super(MultiLayerPerceptron, self).__init__()

        self.name = "Multi-Layer Perceptron model"
        self.hidden_dim = word2vec_dim
        self.output_dim = nb_classes
        self.use_gpu = torch.cuda.is_available()
        if self.use_gpu:
          self.device = 'cuda'
        else:
          self.device = 'cpu'

        self.ebd = torch.nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
        self.dropout = nn.Dropout(p=0.2)
        self.linear1 = torch.nn.Linear(self.hidden_dim, self.hidden_dim, bias=True)
        self.linear2 = torch.nn.Linear(self.hidden_dim, self.output_dim, bias=True)
        self.softmax = nn.Softmax(dim=1)

  def forward(self, input_tensor):
        """
        The forward method accepts an input Tensor of token indices and returns an output Tensor of class probabilities.
        For documentation on the ReLU activation please refer to: https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU
        """
        # the input tensor has size (batch_size, max_length)
        ebd = self.ebd(input_tensor)
        # the embedding has size (batch_size, max_length, embedding_dim)
        # take the mean embedding for each input in the batch: (batch_size, embedding_dim)
        ebd = ebd.mean(1)
        ebd_regularized = self.dropout(ebd)
        l1_output = torch.relu(self.linear1(ebd_regularized))
        l2_output = torch.relu(self.linear2(l1_output))
        # NOTE: nn.CrossEntropyLoss applies log-softmax internally and expects unnormalized scores,
        # so the values returned here are probabilities rather than logits in the strict sense
        logits = self.softmax(l2_output)
        return logits

  def __str__(self):
      """
      Implement the class behavior when used within a print statement.
      Return the torchinfo summary as a string (verbose=0 keeps summary() from printing it by itself).
      For more info on model summary please refer to: https://github.com/TylerYep/torchinfo
      """
      return str(summary(self, verbose=0))
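
The forward pass averages the embeddings over all `max_length` positions, padding included, so short utterances are pulled towards whatever vector sits at the padding index. One possible alternative, not used in this notebook, is to average only over non-padding positions. A pure-Python sketch with toy 2-dimensional vectors follows; `masked_mean` and the toy values are hypothetical.

```python
def masked_mean(vectors, token_ids, pad_id=1):
    """Average the vectors whose token id is not the padding id."""
    kept = [vec for vec, tok in zip(vectors, token_ids) if tok != pad_id]
    if not kept:  # utterance made only of padding
        return [0.0] * len(vectors[0])
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

token_ids = [7, 9, 1, 1]  # two real tokens followed by two <pad>
vectors = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0], [9.0, 9.0]]

# plain mean over all positions, as in the forward pass
plain_mean = [sum(v[d] for v in vectors) / len(vectors) for d in range(2)]
masked = masked_mean(vectors, token_ids)

print(plain_mean)  # [5.5, 6.0]
print(masked)      # [2.0, 3.0]
```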

In [21]:
mlp = MultiLayerPerceptron(preprocessing_pipeline.pretrained_vectors.vectors)
mlp.to(mlp.device)

MultiLayerPerceptron(
  (ebd): Embedding(2519370, 300)
  (dropout): Dropout(p=0.2, inplace=False)
  (linear1): Linear(in_features=300, out_features=300, bias=True)
  (linear2): Linear(in_features=300, out_features=4, bias=True)
  (softmax): Softmax(dim=1)
)

In [22]:
print(mlp)

Layer (type:depth-idx)                   Param #
MultiLayerPerceptron                     --
├─Embedding: 1-1                         (755,811,000)
├─Dropout: 1-2                           --
├─Linear: 1-3                            90,300
├─Linear: 1-4                            1,204
├─Softmax: 1-5                           --
Total params: 755,902,504
Trainable params: 91,504
Non-trainable params: 755,811,000
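
The parameter counts in this summary can be checked by hand: a linear layer with `n_in` inputs and `n_out` outputs holds `n_in * n_out` weights plus `n_out` biases, and the embedding table holds one 300-dimensional vector per entry. The helper `linear_params` is only illustrative.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

embedding = 2_519_370 * 300  # frozen, hence non-trainable
linear1 = linear_params(300, 300)
linear2 = linear_params(300, 4)

print(embedding)                      # 755811000
print(linear1, linear2)               # 90300 1204
print(linear1 + linear2)              # 91504
print(embedding + linear1 + linear2)  # 755902504
```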



### Defining training and validation

In [52]:
def run_one_epoch(model, optimizer, epoch_id):
    """
    Function used to train the model for one full epoch.
    For documentation on the loss function used, please refer to: https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss
    """
    # set the model into training mode: layers such as dropout are active and the weights will be updated by the optimizer
    model.train()
    # initialize empty lists for losses and accuracies
    loss_history = []
    accuracy_history = []
    # start the loop over all the training batches (one full epoch)
    for it, batch in enumerate(train_loader):
      if it%100 == 0:
        print(f"\nBatch: {it}/{len(train_loader)}")
      # print(batch)
      batch_tensors = batch["Utterance"]
      batch_labels = batch["Label"]
      stack_batch =  torch.vstack(batch_tensors)
      stack_batch = torch.transpose(stack_batch, 0, 1)

      batch = {'Utterance': stack_batch.to(model.device), 'Label': batch_labels.to(model.device)}

      # reset the gradients before processing the batch; this prevents gradient accumulation across batches
      optimizer.zero_grad()
      # apply the model on the batch
      logits = model(batch['Utterance'])

      # to deal with unbalanced data in the batch, we calculate the weights according to their inverse frequency
      # b_counter = Counter(batch['Label'].detach().cpu().tolist())
      # b_weights = torch.tensor( [ sum(batch['Label'].detach().cpu().tolist()) / b_counter[label] if b_counter[label] > 0 else 0 for label in list(range(args['num_class'])) ] )
      # b_weights = b_weights.to(device)

      # we choose the CrossEntropyLoss, suitable for multiclass classification
      #loss_function = nn.CrossEntropyLoss(weight=b_weights)
      loss_function = nn.CrossEntropyLoss()
 
      loss = loss_function(logits, batch['Label'])
      loss.backward()
      optimizer.step()

      # append the value of the loss for the current iteration (it); .item() returns it as a Python float
      loss_history.append(loss.item())
      # log against a global step so that every batch gets its own point in TensorBoard
      writer.add_scalar("Training loss", loss.item(), epoch_id * len(train_loader) + it)
      # get the predicted tags using the maximum probability from the softmax
      _, tag_seq  = torch.max(logits, 1)
    
      # these three lines compute the accuracy and append it in the same way as the loss above
      correct = (tag_seq.flatten() == batch['Label'].flatten()).float().sum()
      acc = correct / batch['Label'].flatten().size(0)
      accuracy_history.append(acc.item())

      if it%100 == 0:
         print(f"Training loss: {round(sum(loss_history)/len(loss_history), 3)} \nTraining accuracy: {round(sum(accuracy_history)/len(accuracy_history), 3)}")

    # simple averages of losses and accuracies for this epoch
    loss_it_avg = sum(loss_history)/len(loss_history)
    acc_it_avg = sum(accuracy_history)/len(accuracy_history)
  
    # print useful information about the training progress and scores on this training set's full pass (i.e. 1 epoch)
    print(f"Training loss: {loss_it_avg} \nTraining accuracy: {acc_it_avg}")
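
`nn.CrossEntropyLoss` applies a log-softmax to its input before computing the negative log-likelihood, so it expects unnormalized scores. When a model already outputs softmax probabilities, the loss only sees values in [0, 1] and cannot fall below a floor. The pure-Python sketch below reproduces the computation for one sample and 4 classes; the helper names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    """Log-softmax followed by negative log-likelihood, for one sample."""
    return -math.log(softmax(scores)[target])

raw_scores = [10.0, 0.0, 0.0, 0.0]  # confident and correct prediction
loss_on_scores = cross_entropy(raw_scores, target=0)
loss_on_probs = cross_entropy(softmax(raw_scores), target=0)
# best possible case when probabilities are fed to the loss
floor = cross_entropy([1.0, 0.0, 0.0, 0.0], target=0)

print(round(loss_on_scores, 4))  # 0.0001
print(round(floor, 4))           # 0.7437
```

With 4 classes, a loss computed on probabilities therefore stays above roughly 0.74 even for perfect predictions, which is worth keeping in mind when reading the training loss.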

In [26]:
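def inference(target, loader, model):
  """
  Evaluate the model on every batch of the loader, without updating its weights.

    Args:
      target (str): label for the evaluated split, usually either 'validation' or 'test'
      loader (DataLoader): batches to evaluate on
      model (torch.nn.Module): model to evaluate
  """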

  # set the model into evaluation mode: layers such as dropout are disabled
  model.eval()

  # initialize empty lists to populate later on
  loss_it, acc_it = list(), list()
  # preds = predicted values; trues = true values
  preds, trues = list(), list()

  # loop over the loader batches
  for it, batch in enumerate(loader):
    batch_tensors = batch["Utterance"]
    batch_labels = batch["Label"]
    stack_batch =  torch.vstack(batch_tensors)
    stack_batch = torch.transpose(stack_batch, 0, 1)

    
    with torch.no_grad():
      # do not compute gradients at validation time; this saves computation and memory

      # put the batch to the correct device
      batch = {'Utterance': stack_batch.to(model.device), 'Label': batch_labels.to(model.device)}

      # apply the model
      logits = model(batch['Utterance'])

      # to deal with unbalanced data in the batch, we calculate the weights according to their inverse frequency
      # b_counter = Counter(batch['label'].detach().cpu().tolist())
      # b_weights = torch.tensor( [ sum(batch['label'].detach().cpu().tolist()) / b_counter[label] if b_counter[label] > 0 else 0 for label in list(range(20)) ] )
      # b_weights = b_weights.to(device)

      # loss_function = nn.CrossEntropyLoss(weight=b_weights)
      loss_function = nn.CrossEntropyLoss()
      loss = loss_function(logits, batch['Label'])

      # no need to backward() and other training stuff. Directly store the loss in the list
      loss_it.append(loss.item())

      # get the predicted tags using the maximum probability from the softmax
      _, tag_seq  = torch.max(logits, 1)
      
      # compute the accuracy and store it
      correct = (tag_seq.flatten() == batch['Label'].flatten()).float().sum()
      acc = correct / batch['Label'].flatten().size(0)
      acc_it.append(acc.item())
      
      # extend the predictions and true labels lists so we can compare them later on
      # note how we first ensure the tensor are on cpu (.cpu()), then we detach() the gradient from the tensor, before transforming it to a simple python list (.tolist())
      preds.extend(tag_seq.cpu().detach().tolist())
      trues.extend(batch['Label'].cpu().detach().tolist())

  # compute the average loss and accuracy across the iterations (batches)
  loss_it_avg = sum(loss_it)/len(loss_it)
  acc_it_avg = sum(acc_it)/len(acc_it)
  
  # print the averages, labelled with the evaluated split. Important during training as we want to know the performance over the validation set after each epoch
  print(f"{target.capitalize()} loss: {loss_it_avg} \n{target.capitalize()} accuracy: {acc_it_avg}")

  # return the true and predicted values with the losses and accuracies
  return trues, preds, loss_it_avg, acc_it_avg, loss_it, acc_it
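
The commented-out lines in the loops above compute class weights per batch. A possible variant, sketched here under our own assumptions, computes inverse-frequency weights once over a whole list of labels; the resulting list could then be converted to a tensor and passed as the `weight` argument of `nn.CrossEntropyLoss`. The function name and the toy labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Weight each class by total_count / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return [
        total / (num_classes * counts[c]) if counts[c] > 0 else 0.0
        for c in range(num_classes)
    ]

toy_labels = [0, 0, 0, 0, 1, 1, 2, 3]  # class 0 is over-represented
weights = inverse_frequency_weights(toy_labels, num_classes=4)

print(weights)  # [0.5, 1.0, 2.0, 2.0]
```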

In [53]:
def run_n_epochs(model, learning_rate: float = 0.001, nb_epochs: int = 10):
  """
  Train the model for nb_epochs epochs, evaluating it on the validation set after each epoch.
  """
  # we set the optimizer as Adam with the learning rate (lr) set in the arguments
  # you can look at the different optimizer available here: https://pytorch.org/docs/stable/optim.html
  optimizer = optim.Adam(model.parameters(), lr = learning_rate)

  # define an empty list to store validation losses for each epoch
  validation_loss_history = []
  # iterate over the number of max epochs set in the arguments
  for epoch in range(nb_epochs):
    print(f"\nEpoch: {epoch + 1}/{nb_epochs}")
    # run a training epoch with the model
    run_one_epoch(model, optimizer, epoch)
    # run inference with the trained model and evaluate its validation performance
    trues, preds, val_loss_it_avg, val_acc_it_avg, val_loss_it, val_acc_it = inference("validation", val_loader, model)
    # append the validation losses (good losses should normally go down)
    validation_loss_history.append(val_loss_it_avg)

  # return the list of epoch validation losses in order to use it later to create a plot
  return validation_loss_history

### Running training (and validation)

In [None]:
run_n_epochs(mlp)


Epoch: 1/10

Batch: 0/21793
Training loss: 1.492 
Training accuracy: 0.25

Batch: 100/21793
Training loss: 0.987 
Training accuracy: 0.752

Batch: 200/21793
Training loss: 0.955 
Training accuracy: 0.785

Batch: 300/21793
Training loss: 0.972 
Training accuracy: 0.768

Batch: 400/21793
Training loss: 0.975 
Training accuracy: 0.764

Batch: 500/21793
Training loss: 0.973 
Training accuracy: 0.767

Batch: 600/21793
Training loss: 0.97 
Training accuracy: 0.769

Batch: 700/21793
Training loss: 0.973 
Training accuracy: 0.766


### Running inference

### Conclusion

To conclude, we can say that...

In [None]:
loss_list_val = run_epochs(dialog_act_model, args)

Epoch 0::  34%|███▍      | 5988/17434 [02:00<05:00, 38.06it/s]Exception ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f9adb693430>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1466, in __del__
    self._shutdown_workers()
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/data/dataloader.py", line 1449, in _shutdown_workers
    if w.is_alive():
  File "/usr/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
Epoch 0:: 100%|██████████| 17434/17434 [05:53<00:00, 49.32it/s]

Epoch 0/10 : Training : (loss 1.094846753525983) (acc 0.7019616965180118)



validation:: 100%|██████████| 1614/1614 [00:30<00:00, 52.28it/s]

validation : (loss 1.0860104187122803) (acc 0.654770768837267)



Epoch 1:: 100%|██████████| 17434/17434 [05:51<00:00, 49.67it/s]

Epoch 1/10 : Training : (loss 1.0716707083122357) (acc 0.7181828733419516)



validation:: 100%|██████████| 1614/1614 [00:30<00:00, 52.54it/s]

validation : (loss 1.0850437638780295) (acc 0.6554523055641329)



Epoch 2:: 100%|██████████| 17434/17434 [06:13<00:00, 46.67it/s]

Epoch 2/10 : Training : (loss 1.0671217704504417) (acc 0.7216932552064164)



validation:: 100%|██████████| 1614/1614 [00:30<00:00, 52.42it/s]

validation : (loss 1.0661444233769672) (acc 0.6741016235530303)



Epoch 3:: 100%|██████████| 17434/17434 [06:49<00:00, 42.62it/s]

Epoch 3/10 : Training : (loss nan) (acc 0.2536537840482058)



validation:: 100%|██████████| 1614/1614 [00:31<00:00, 51.13it/s]

validation : (loss nan) (acc 0.1146220588846573)



Epoch 4:: 100%|██████████| 17434/17434 [07:13<00:00, 40.19it/s]

Epoch 4/10 : Training : (loss nan) (acc 0.09270391338367562)



validation:: 100%|██████████| 1614/1614 [00:31<00:00, 51.74it/s]

validation : (loss nan) (acc 0.11462205883849509)



Epoch 5::   2%|▏         | 351/17434 [00:08<06:36, 43.12it/s]


KeyboardInterrupt: ignored