                                DEEP LEARNING PROJECT - STORY TELLING BOT - WORDS TO WORDS - V2
---
Over the past few years, loneliness has become more prevalent in our society because of the lockdowns induced by the COVID-19 pandemic. Many of us have therefore found comfort in reading books. However, this comfort was often short-lived, as many endings left us on cliff-hangers or unsatisfied with the love story of the main characters. A solution to this problem is to create a storytelling bot (**SBOT** for short) that extrapolates the rest of a story, or even feeds our imagination with an unlikely romance between the main characters and some background characters. After all, who has never wanted to know more about the forbidden love between Harry Potter and Dobby the house-elf? Or to explore the day-to-day life of Ron Weasley and Hermione Granger as a married couple?



> Authors : **Axelle Schyns**, **Lucie Navez** and **Victor Mangeleer**
---

                                                          Before you start !



Everything you need for the project is freely available using this [link](https://drive.google.com/drive/u/0/folders/1v0QSu0MAqM63yFb7Rcs7w1LJpndfW34_). There, you will be able to access:

- The [theoretical model](https://drive.google.com/drive/folders/1ttg5-byTLQjP8-XNAkF7v-QsGxZKgN4L?usp=sharing) of our SBOT.

- The different [models](https://drive.google.com/drive/folders/1F0jvLfWxms3RpwknnOUBGWsIculgOMfW?usp=sharing) obtained throughout our training sessions.

- The [books](https://drive.google.com/drive/folders/12l2gGdNfpkBaIwETFEtRM4-WmNH_18pG?usp=sharing) used to find useful insights regarding RNNs and how to code them.

- The [Harry Potter books](https://drive.google.com/drive/folders/1IxTdwvTO3mk7fuehgV_DVLQzOzwvgImw?usp=sharing) in a .txt format used for the training of the neural network.

- The building blocks of our [embedder](https://drive.google.com/drive/u/0/folders/1xFYFTdkK7W9NtaYRdCdziG_VCf0Favbi) named GLOVE. If you'd like to download the complete setup for GLOVE, it is possible using this [download link](http://nlp.stanford.edu/data/wordvecs/glove.6B.zip) (1.1gb).


In [None]:
# Define the path to the user's folder to save all the results
user_path = 'results'

# Define the platform used for the training
gradient = True

                                                       Parameters of the SBOT




In [None]:
#-------------------------------------------------------------------------------
#                              TUNING PARAMETERS
#-------------------------------------------------------------------------------
# Define if the SBOT uses attention
use_attention = True

# Define if the encoder's architecture is unidirectional or bidirectional.
bidirectional = True

# Size of the recurrent states (h should be somewhere in [5, 64])
hidden_size = 2048

# Define the probability that the exact word is used as input for the decoder
teacher_forcing_ratio = 0.5

# Number of epochs
nb_epoch = 200

# Define the checkpoints that will be used to save the model at different states
checkpoints = [10, 25, 50, 75, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, nb_epoch]

#-------------------------------------------------------------------------------
#                                      DATA
#-------------------------------------------------------------------------------
# Define the books that will be used for training
Books_training = [1, 4, 5, 6, 7, 8, 9]

# Define the number of words used as inputs (to capture the context)
input_size  = 8

# Define the minimum sentence length in the dataset
min_sentence_length = 10

# Define the maximum sentence length in the dataset
max_sentence_length = 30

# Define the number of sentences to concatenate in the dataset
Paragraph_size = 1

# Define the percentage of the book that will be used as a train set
train_size  = 90

# Define the percentage of the book that will be used as a validation set
valid_size  = 10

#-------------------------------------------------------------------------------
#                               SBOT and TRAINING
#-------------------------------------------------------------------------------
# Size of the batch in the dataloader
batch_size = 64

# Define the learning rate of ADAM
learning_rate = 0.001

# Number of layers in the encoder
num_layers_encoder = 2

# Number of layers in the decoder
num_layers_decoder = 2

# Define the dropout probability: elements of each GRU layer's output (except the last layer) are zeroed with this probability during training
dropout = 0.1
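
The `teacher_forcing_ratio` parameter can be read as a coin flip made at each decoding step. The snippet below is only a minimal sketch of that idea: the helper `choose_decoder_input` and the toy words are hypothetical and are not taken from the training loop of the SBOT.

```python
import random

def choose_decoder_input(target_word, predicted_word, teacher_forcing_ratio, rng = random):
    """Hypothetical helper: returns the word fed to the decoder at the next step."""
    # With probability teacher_forcing_ratio, the ground-truth word is used,
    # otherwise the decoder receives its own previous prediction
    if rng.random() < teacher_forcing_ratio:
        return target_word
    return predicted_word

# Toy experiment with a seeded generator (the words are illustrative only)
rng = random.Random(0)
choices = [choose_decoder_input("wand", "owl", 0.5, rng) for _ in range(1000)]
share_of_truth = choices.count("wand") / len(choices)
print(f"Ground-truth word used in {share_of_truth:.0%} of the steps")
```

A ratio of 1 always feeds the exact word, while a ratio of 0 always feeds the prediction, which is the situation the model faces when generating text.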

The different **Harry Potter books** used for training are indexed as follows:


Index | Name of the book | Total number of words | Size of the vocabulary |
:-------------------|:---------------|:---------------:|:---------------:|
**1**       | Harry Potter and the Philosopher's Stone | 78519 | 6112 |
**2**       | Harry Potter and the Philosopher's Stone (**Short version**) | 6520 | 1458 |
**3**       | Harry Potter and the Philosopher's Stone (**Really short version**) | 877 | 376 |
**4**       | Harry Potter and the Chamber of Secrets | 86402 | 7313 |
**5**       | Harry Potter and the Prisoner of Azkaban | 108516 | 8097 |
**6**       | Harry Potter and the Goblet of Fire | 192331 | 11290 |
**7**       | Harry Potter and the Order of the Phoenix | 259635 | 13231 |
**8**       | Harry Potter and the Half-Blood Prince | 170682 | 11022 |
**9**       | Harry Potter and the Deathly Hallows | 199561 | 11930 |


                                                      Into the heart of the SBOT

##  SBOT - Initialization
In this section, the goal is to first **install** and **import** all the **libraries** needed to create our storytelling bot! Then, the connection with your **Google Drive** is established in order to import all the books and data needed for training!

### Libraries - Install

In [None]:
! pip install seaborn
! pip install tensorflow
! pip install keras
! pip install contractions

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


### Libraries - Import


In [None]:
# Others
import re
import os
import time
import glob
import nltk
import pickle
import random
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import contractions

# Pytorch
import torch
import torch.nn as nn
import torch.optim as optim

#import torchvision
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

# Keras
import keras

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Setting up Google Drive or Gradient


In [None]:
if not gradient: 

  # Access to your Google Drive
  from google.colab import drive

  # Define your path to the root of the folder
  path = "/content/drive/MyDrive/Deep Learning - SBOT/"

  # Mounting the drive
  drive.mount('/content/drive')

else:

  # Define your path to the root of the folder
  path = '/notebooks/root/'

### Access to the GPUs


In [None]:
if torch.cuda.is_available():
    device = torch.device("cuda") 
else:
    device = torch.device("cpu")

### Parameters check

In [None]:
if input_size >= min_sentence_length:
  input_size = min_sentence_length - 1

## SBOT - Functions
In this section, all the functions used throughout the notebook are grouped into categories!

### Books - Loading and cleaning

In [None]:
def loadBook(book_path, Dic_contraction):

  # Contains the fully prepared book
  book_final = []
  
  # Load the book as a single string (the file is closed automatically)
  with open(book_path, 'r', encoding = 'utf-8') as book_file:
    book_Raw = book_file.read()

  # Characters to be removed (do not remove "." so we can detect sentences)
  to_be_removed = [",","-",";",":","&","\"","”","“", "\n"]

  # Removing punctuation
  for char in to_be_removed:
      book_Raw = book_Raw.replace(char, "")

  # Expanding contractions in a single pass over the whole text
  book_Raw = contractions.fix(book_Raw)

  # Splitting the string into sentences
  book = nltk.tokenize.sent_tokenize(book_Raw)

  # Splitting the sentences into words and adding tokens
  for s in book:
    s = s.replace(".", "")
    s = s.replace("!", "")
    s = s.replace("?", "")

    # Adding sentence and tokens to final book
    book_final.append(s.lower().split())

  return book_final


def bookToDataset(book, inputSize, min_sentence, max_sentence, size_paragraph):

  # Contain the transformed book (names chosen to avoid shadowing the built-in input)
  inputs  = []
  outputs = []

  # Looping over all the sentences
  for i in range(len(book) - size_paragraph):

    # To concatenate a set of S sentences
    sentences = ["<SOS>"]

    # Concatenates the current sentence and the S-1 following sentences
    for offset in range(size_paragraph):
      sentences = sentences + book[i + offset] + ["<END>"]

    # The start token is not counted since it is not part of the real sentence
    real_size = len(sentences) - 1

    # Skipping too short and too long sentences
    if min_sentence < real_size < max_sentence:

      # Completing the end of the sentence with "nothing to do" tokens
      nb_NTD = max_sentence - len(sentences)
      inputs.append(sentences[:inputSize])
      outputs.append(sentences[inputSize:] + ["<NTD>"] * nb_NTD)

  return inputs, outputs


def shufflePairs(X_train, y_train):

    # Links the dataset together using zip
    linked_set = list(zip(X_train, y_train))

    # Shuffle all the elements
    random.shuffle(linked_set)

    # Reconstruction of the datasets
    X_train, y_train = zip(*linked_set)
    X_train, y_train = list(X_train), list(y_train)

    return X_train, y_train
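
To make the windowing performed by `bookToDataset` more concrete, here is a minimal sketch applied to a single sentence. The helper `split_sentence` and the example sentence are hypothetical; they only mimic the token layout (start token, end token, padding) under the assumption that one sentence is used per example.

```python
def split_sentence(words, input_size, max_sentence):
    """Hypothetical single-sentence version of the windowing done by bookToDataset."""
    tokens = ["<SOS>"] + words + ["<END>"]

    # Padding so that every example contains exactly max_sentence tokens
    padding = ["<NTD>"] * (max_sentence - len(tokens))

    # The first input_size tokens feed the encoder, the rest must be predicted
    return tokens[:input_size], tokens[input_size:] + padding

words = "harry looked at the owl on the window".split()
x, y = split_sentence(words, input_size = 4, max_sentence = 12)
print(x)
print(y)
```

Every output sequence therefore has the same length, `max_sentence - input_size`, which is what allows the examples to be stacked into a single tensor.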

### Books - Conversion from list to tensor

In [None]:
def inputToTensor(inputs, embedder, embedder_size = 50):
  #--------------
  # Documentation
  #--------------
  # This function converts the input for the encoder (a list of 
  # sentences) into a readable input for the SBOT (tensor of tensors)
  #
  # inputs        : list of sentences
  #
  # embedder      : our embedder containing the harry potter vocabulary
  #
  # embedder_size : the size of the vector produced by the embedder
  #
  # Information regarding the input
  batch_size = len(inputs)
  seq_len    = len(inputs[0])

  # Stores the converted input
  tensor_input = torch.zeros(batch_size, seq_len, embedder_size)

  # Looping over the column first
  for b in range(batch_size):
    
    # Current sentence
    word_list = inputs[b]

    # Converting each word of the sentence
    for w in range(len(word_list)):

      # Known word by the embedder
      if word_list[w] in embedder:
        tensor_input[b][w] = torch.tensor(embedder[word_list[w]])

      # Unknown word = Unknown token
      else:
        tensor_input[b][w] = torch.tensor(embedder["<UKN>"])

  return tensor_input


def getMaskFromList(inputs):

  # List of padding tokens to mask
  masked = ["<NTD>", "<UKN>"]

  # Information regarding the input
  batch_size = len(inputs)
  seq_len    = len(inputs[0])

  # Stores the converted input
  mask = torch.zeros(batch_size, seq_len, dtype = bool)

  # Looping over the column first
  for b in range(batch_size):
    
    # Current sentence
    word_list = inputs[b]

    # Marking each word of the sentence
    for w in range(len(word_list)):

      # Padding and unknown tokens are masked out
      if word_list[w] in masked:
        mask[b][w] = False

      # Every other word is kept
      else:
        mask[b][w] = True

  return mask


### Tensor - Probabilities


In [None]:
def WordToId(word, embedder_WordToId):
    #--------------
    # Documentation
    #--------------
    # Retrieves from the dictionary embedder_WordToId the id of the corresponding word
    #
    return embedder_WordToId.get(word)

def IdToWord(Id, embedder_IdToWord):
    #--------------
    # Documentation
    #--------------
    # Retrieves from the dictionary embedder_IdToWord the word corresponding to the given id
    #
    return embedder_IdToWord[Id.item()]

def outputToId(outputs, embedder_WordToId):
  #--------------
  # Documentation
  #--------------
  # This function converts a list of sentences into a tensor of word IDs (used as targets for CE).
  #
  # outputs           : list of sentences
  #
  # embedder_WordToId : the dictionary that converts a word to its ID
  #
  # NOTE : If the word is unknown, the ID of the "<UKN>" token is used.
  #
  # Information regarding the input
  batch_size = len(outputs)
  seq_len    = len(outputs[0])

  # Stores the converted input
  tensor_input = torch.zeros(batch_size, seq_len)

  # Looping over the column first
  for b in range(batch_size):
    
    # Current sentence
    word_list = outputs[b]

    # Converting each word of the sentence
    for w in range(seq_len):

      # Index of the word
      word_id = WordToId(word_list[w], embedder_WordToId)
      
      # If the word is unknown, use the ID of the unknown token
      if word_id is None:
        word_id = WordToId("<UKN>", embedder_WordToId)

      # Placing index
      tensor_input[b][w] = word_id

  tensor_input = tensor_input.type(torch.LongTensor) 
  
  return tensor_input

def idToProba(outputs_id, embedder_WordToId):
  #--------------
  # Documentation
  #--------------
  # This function converts a tensor of word IDs into a tensor of one-hot probability vectors (used for CE).
  #
  # outputs_id        : tensor of word IDs (batch, sequence)
  #
  # embedder_WordToId : the dictionary that converts a word to its ID (its length gives the vocabulary size)
  #
  # NOTE : Each vector contains a single 1, placed at the index of the word.
  #
  # Information regarding the input
  batch_size = len(outputs_id)
  seq_len    = len(outputs_id[0])

  # Stores the converted input
  tensor_input = torch.zeros(batch_size, seq_len, len(embedder_WordToId))

  # Looping over the column first
  for b in range(batch_size):
    
    # Current sentence
    word_list = outputs_id[b]

    # Converting each word of the sentence
    for w in range(seq_len):

      # Index of the word
      word_id = outputs_id[b][w]
      
      # Placing index
      tensor_input[b][w][int(word_id.item())] = 1

  return tensor_input


def getPredictedWord(predictions, embedder_IdToWord):
    #--------------
    # Documentation
    #--------------
    # This function converts the predictions made by the SBOT into words
    #
    # predictions       : output of the SBOT
    #
    # embedder_IdToWord : embedder dictionary that gives ID -> Word
    #
    # Contains ALL the words that have been predicted
    words_predicted = []

    # Information
    batch_size  = len(predictions)
    output_size = len(predictions[0])

    # Looping over the columns
    for b in range(batch_size):

      # Contains the words predicted for an element of the batch
      words_current = []

      # Converting each word of the current sentence
      for w in range(output_size):

          # Current word
          word = predictions[b][w]

          # Index of highest probability
          word_index = torch.argmax(torch.softmax(word, 0))

          # Adding the converted word
          words_current.append(IdToWord(word_index, embedder_IdToWord))

      # Storing the sentence
      words_predicted.append(words_current)

    return words_predicted
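
The selection rule used in `getPredictedWord` (softmax followed by argmax) can be sketched on plain Python lists. The vocabulary and the scores below are toy values chosen for illustration, and `softmax` and `greedy_word` are hypothetical helpers, not functions of the notebook.

```python
import math

def softmax(scores):
    # Subtracting the maximum keeps the exponentials numerically stable
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_word(scores, id_to_word):
    # Picks the word whose probability is the highest
    probabilities = softmax(scores)
    best_id = max(range(len(probabilities)), key = probabilities.__getitem__)
    return id_to_word[best_id]

id_to_word = {0: "harry", 1: "ron", 2: "<END>"}  # toy vocabulary (assumption)
scores = [0.2, 2.5, -1.0]                        # toy decoder scores (assumption)
word = greedy_word(scores, id_to_word)
print(word)
```

Since the softmax preserves the ordering of the scores, taking the argmax of the raw scores selects the same word; the softmax only matters when the probabilities themselves are needed.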

### Testing the model

In [None]:
def trainingInfo(num_layers_encoder, num_layers_decoder, hidden_size, total_params, teacher_forcing_ratio, bidirectional, nb_epoch, batch_size, attention, nb_input_words):
  print("---------------------------------------------------------------------")
  print("                       SBOT - Training session                       ")
  print("---------------------------------------------------------------------")
  print("----------")
  print("Parameters")
  print("----------")
  print("Number of layers (Encoder) : " + str(num_layers_encoder))
  print("")
  print("Number of layers (Decoder) : " + str(num_layers_decoder))
  print("")
  print("Size of recurrent states   : " + str(hidden_size))
  print(" ")
  print("Number of parameters       : " + str(total_params))
  print(" ")
  print("Attention                  : " + str(attention))
  print(" ")
  print("Nb. Inputs Words           : " + str(nb_input_words))
  print(" ")
  print("Teacher forcing ratio      : " + str(teacher_forcing_ratio))
  print(" ")
  print("Bidirectional              : " + str(bidirectional))
  print(" ")
  print("Number of epochs           : " + str(nb_epoch))
  print(" ")
  print("Size of the batch          : " + str(batch_size))
  print(" ")
  print("--------")
  print("Training")
  print("--------")


def progressBar(loss_training, loss_validation, estimated_time, percent, width = 40):

    # Setting up the useful information
    left  = width * percent // 100
    right = width - left
    tags = "#" * int(left)
    spaces = " " * int(right)
    percents = f"{percent:.2f} %"
    loss_training = f"{loss_training * 1:.6f}"
    loss_validation = f"{loss_validation * 1:.6f}"
    estimated_time = f"{estimated_time:.2f} s"

    # Displaying a really cool progress bar !
    print("\r[", tags, spaces, "] - ", percents, " | Loss (Training) = ", loss_training, " | Loss (Validation) = ", 
          loss_validation,  " | Time left : ", estimated_time ,sep="", end="", flush = True)


def saveTrainingInfo(path, loss_train, loss_valid, parameters):
  
    # Creation of the dictionary storing all the results
    results = {
        "Loss Train"       : loss_train,
        "Loss Valid"       : loss_valid,
        "Layers Encoder"   : parameters[0],
        "Layers Decoder"   : parameters[1],
        "Hidden Size"      : parameters[2],
        "Total Parameters" : parameters[3],
        "Attention"        : parameters[4],
        "Input Words"      : parameters[5],
        "Teacher Forcing"  : parameters[6],
        "Bidirectional"    : parameters[7],
        "Number of epochs" : parameters[8],
        "Batch Size"       : parameters[9],
      }

    # Creation of the results' name (indexed by the number of models already saved in path)
    number_models = glob.glob(path + "*.pth")
    results_name  = "SBOTV2_" + str(len(number_models)) + "_Parameters.pkl"

    # Writing the python object (dict) to a pickle file (closed automatically)
    with open(path + results_name, "wb") as new_dic:
        pickle.dump(results, new_dic)


def loadTrainingInfo(path, model_index):

    # Opening the dictionary stored in pickle format (closed automatically)
    with open(path + "SBOTV2_" + str(model_index) + "_Parameters.pkl", "rb") as results_pkl:

        # Loading the dictionary
        results = pickle.load(results_pkl)

    return results

def loadTrainingParameters(results):

    parameters = []
    parameters.append(results["Loss Train"])
    parameters.append(results["Loss Valid"])
    parameters.append(results["Layers Encoder"])
    parameters.append(results["Layers Decoder"])
    parameters.append(results["Hidden Size"])
    parameters.append(results["Total Parameters"])
    parameters.append(results["Attention"])
    parameters.append(results["Input Words"])
    parameters.append(results["Teacher Forcing"])
    parameters.append(results["Bidirectional"])
    parameters.append(results["Number of epochs"])
    parameters.append(results["Batch Size"])

    return parameters

def saveModel(path, model, optimizer):

    # Creation of the model's name (indexed by the number of models already saved in path)
    model_number = glob.glob(path + "*.pth")
    model_name   = "SBOTV2_" + str(len(model_number)) + ".pth"

    # Saves all the state information
    checkpoint = {
    'state_dict': model.state_dict(),
    'optimizer' : optimizer.state_dict()
    }

    # Saving everything
    torch.save(checkpoint, path + model_name)


def loadModel(path, model, optimizer, model_index):

    # Loading the corresponding checkpoint
    checkpoint = torch.load(path + "SBOTV2_" + str(model_index) + ".pth")

    # Transfering the parameters
    model.load_state_dict(checkpoint['state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer'])

    return model, optimizer


def showModel(model_index, parameters_model, loss_train_model, loss_valid_model):
    print("----------------------------------------------------")
    print("                 Model's information                ")
    print("----------------------------------------------------\n")

    # Displaying the parameters
    print("Model Name       : SBOTV2_" + str(model_index) + "\n\n")

    for p in parameters_model:
      print(p)

    nb_epoch_vector = [i for i in range(len(loss_train_model))]

    # Changing plot style
    sns.set_style("darkgrid")

    # Training loss (x and y are passed as keywords, as required by recent seaborn versions)
    plt.figure()
    plt_training = sns.lineplot(x = nb_epoch_vector, y = loss_train_model)
    plt_training.set_xlabel("Number of epochs", fontsize = 20)
    plt_training.set_ylabel("Training loss", fontsize = 20)

    # Validation loss
    plt.figure()
    plt_validation = sns.lineplot(x = nb_epoch_vector, y = loss_valid_model)
    plt_validation.set_xlabel("Number of epochs", fontsize = 20)
    plt_validation.set_ylabel("Validation loss", fontsize = 20)


def testModel(model, text, sentence_length, device):
  print("----------------------------------------------------")
  print("                 Generated Text                     ")
  print("----------------------------------------------------\n")

  # Preparation of the text
  text = text.lower()

  # Characters to be removed (do not remove "." so we can detect the end of sentences)
  to_be_removed = [",","-","!",";",":","?","&","\"","”","“", "\n"]

  # Removing punctuation (each replacement builds on the previous one)
  text_corrected = text
  for char in to_be_removed:
      text_corrected = text_corrected.replace(char, "")

  text_corrected = text_corrected.replace(".", " <END>")
  text_corrected = text_corrected.split()
  text_corrected = ["<SOS>"] + text_corrected
  text_corrected = [text_corrected]
  text_input     = inputToTensor(text_corrected, embedder).to(device)

  # Making predictions
  pred = model(text_input, sentence_length, embedder, embedder_IdToWord, training_mode = False)

  # Creation of the full text paragraph
  index = 0
  sentence = ""

  for p in pred:

      current_word = p[0][0]

      # Checks the current word expression
      if current_word == "<END>":
        current_word = "."
  
      # Padding tokens are skipped
      if current_word == "<NTD>":
        continue

      # Append to the final sentence
      if index < 10:
        sentence = sentence + current_word + " "
        index = index + 1
      else:
        sentence = sentence + current_word + "\n"
        index = 0

  return sentence

### Others

In [None]:
def Harry_Logo():
  print("                                         _ __")
  print("        ___                             | '  \ ")
  print("   ___  \ /  ___         ,'\_           | .-. \        /|")
  print("   \ /  | |,'__ \  ,'\_  |   \          | | | |      ,' |_   /|")
  print(" _ | |  | |\/  \ \ |   \ | |\_|    _    | |_| |   _ '-. .-',' |_   _")
  print("// | |  | |____| | | |\_|| |__    //    |     | ,'_`. | | '-. .-',' `. ,'\_")
  print(" \\_| |_,' .-, _  | | |   | |\ \  //    .| |\_/ | / \ || |   | | / |\  \|   \ ")
  print(" `-. .-'| |/ / | | | |   | | \ \//     |  |    | | | || |   | | | |_\ || |\_|")
  print("   | |  | || \_| | | |   /_\  \ /      | |`    | | | || |   | | | .---'| |")
  print("   | |  | |\___,_\ /_\ _      //       | |     | \_/ || |   | | | |  /\| |")
  print("   /_\  | |           //_____//       .||`  _   `._,' | |   | | \ `-' /| |")
  print("        /_\           `------'        \ |             `.\   | |  `._,' /_\ ")
  print(" ")
  print(" ")

## SBOT - Books and GLOVE
In this section, all the books and the GLOVE embedder are loaded into the notebook. Therefore, this section only needs to be **run once**!

### Loading the books

In [None]:
# Loading the dictionary used to replace contractions (the file is closed automatically)
with open(path + "Glove/Glove_HarryPotter/HP_Contraction.pkl", "rb") as HP_dictionnary:
  HP_contraction = pickle.load(HP_dictionnary)

# Path to the books
book_11 = path + "Data/HP1.txt"
book_12 = path + "Data/HP1_Dummy.txt"
book_13 = path + "Data/HP1_Really_Dummy.txt"
book_2  = path + "Data/HP2.txt"
book_3  = path + "Data/HP3.txt"
book_4  = path + "Data/HP4.txt"
book_5  = path + "Data/HP5.txt"
book_6  = path + "Data/HP6.txt"
book_7  = path + "Data/HP7.txt"

# Loading the books
book_11 = loadBook(book_11, HP_contraction)
book_12 = loadBook(book_12, HP_contraction)
book_13 = loadBook(book_13, HP_contraction)
book_2  = loadBook(book_2,  HP_contraction)
book_3  = loadBook(book_3,  HP_contraction)
book_4  = loadBook(book_4,  HP_contraction)
book_5  = loadBook(book_5,  HP_contraction)
book_6  = loadBook(book_6,  HP_contraction)
book_7  = loadBook(book_7,  HP_contraction)

# Grouping all the information regarding the books
HP_books  = [book_11,  book_12, book_13, 
             book_2 ,  book_3 ,  book_4, 
             book_5 ,  book_6 ,  book_7]

### Loading GLOVE


In [None]:
#-------------------------------------------------------------------------------
#
#                               EMBEDDER TOKENS
#
#-------------------------------------------------------------------------------
# Start token (used as first input for the decoder)
sos_value = 1

# Unknown value (used if the word is unknown to the embedder)
ukn_value = 0.75

# Nothing to do (used to complete output if not enough words)
ntd_value = 0.25

# End token (used as last output of the decoder)
eos_value = 0

#-------------------------------------------------------------------------------
#
#                              LOADING THE EMBEDDER
#
#-------------------------------------------------------------------------------
# Opening the file stored in the project folder and loading the dictionary
with open(path + "Glove/Glove_HarryPotter/HP_Dictionnary.pkl", "rb") as HP_dictionnary:
  embedder = pickle.load(HP_dictionnary)

# Adding missing tokens
embedder["<SOS>"] = np.ones(50) * sos_value
embedder["<END>"] = np.ones(50) * eos_value
embedder["<NTD>"] = np.ones(50) * ntd_value
embedder["<UKN>"] = np.ones(50) * ukn_value

#-------------------------------------------------------------------------------
#
#                     CREATION OF REVERSE EMBEDDER (WORD to ID)
#
#-------------------------------------------------------------------------------
# Creation of the Word -> ID dictionary
embedder_WordToId = dict()
cpt = 0

# Retrieves each possible word
for d in embedder:

  # Encodes the pair (d, cpt) in the dictionary
  embedder_WordToId[d] = cpt
  cpt += 1

#-------------------------------------------------------------------------------
#
#                   CREATION OF REVERSE EMBEDDER 2 (ID to WORD)
#
#-------------------------------------------------------------------------------
# Creation of the ID -> Word dictionary
embedder_IdToWord = dict()
cpt = 0

# Retrieves each possible word
for d in embedder:

  # Encodes the pair (cpt, d) in the dictionary
  embedder_IdToWord[cpt] = d
  cpt += 1
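
The two lookup dictionaries can also be built in a more compact way with `enumerate`. The snippet below is a minimal sketch on a toy embedder: the three entries and the helper `lookup` are assumptions made for the example, not the content of the real GLOVE dictionary.

```python
# Toy embedder: word -> vector (values are illustrative only)
embedder_toy = {"harry": [0.1, 0.2], "owl": [0.3, 0.4], "<UKN>": [0.75, 0.75]}

# Word -> ID, following the insertion order of the embedder
word_to_id = {word: index for index, word in enumerate(embedder_toy)}

# ID -> Word, obtained by inverting the first dictionary
id_to_word = {index: word for word, index in word_to_id.items()}

def lookup(word):
    """Hypothetical helper: unknown words fall back to the ID of <UKN>."""
    return word_to_id.get(word, word_to_id["<UKN>"])

print(word_to_id)
print(lookup("dobby"))
```

Building the second dictionary from the first one guarantees that both mappings stay consistent, since Python dictionaries preserve insertion order.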

## SBOT - Creation of the dataset
In this section, the purpose is to create the dataloaders for the training and validation sets. Therefore, if **some parameters are changed**, one must always **re-run this section**!

In [None]:
# Used to easily store our datasets
class HPDataset(Dataset):
  def __init__(self, x, y, y_p, y_m):
      self.x = x
      self.y = y
      self.y_proba = y_p
      self.y_mask  = y_m
      
  def __getitem__(self, index):
      return (self.x[index, :, :], self.y[index, :, :], self.y_proba[index, :],  self.y_mask[index, :])
  
  def __len__(self):
      return self.x.shape[0]

In [None]:
#-------------------------------------------------------------------------------
#
#                     Creation of the dataset as lists
#
#-------------------------------------------------------------------------------
# Contains the dataset but still in string format
X = []
X_train = [] 
X_valid = []

y = []
y_train = []
y_valid = []

for i in range(len(Books_training)):

    # Current book (the indices in Books_training start at 1)
    book = HP_books[Books_training[i] - 1]

    # Creation of input/output
    X_curr, y_curr = bookToDataset(book, input_size, min_sentence_length, max_sentence_length, Paragraph_size)
    X = X + X_curr
    y = y + y_curr

# Shuffling everything ! (Better for generalization)
X, y = shufflePairs(X, y)

# Dimensions of the sets
index_train = int(len(X) * (train_size/100))

# Creation of the sets
X_train = X_train + X[:index_train]
X_valid = X_valid + X[index_train:]

y_train = y_train + y[:index_train]
y_valid = y_valid + y[index_train:]

#-------------------------------------------------------------------------------
#                           Conversion of the masks
#-------------------------------------------------------------------------------
mask_train = getMaskFromList(y_train)
mask_valid = getMaskFromList(y_valid)

#-------------------------------------------------------------------------------
#                     Conversion of lists to torch vectors
#-------------------------------------------------------------------------------
X_train_torch = inputToTensor(X_train, embedder)
y_train_torch = inputToTensor(y_train, embedder)

X_valid_torch = inputToTensor(X_valid, embedder)
y_valid_torch = inputToTensor(y_valid, embedder)

#-------------------------------------------------------------------------------
#               Conversion of lists to word IDs (targets for CE)
#-------------------------------------------------------------------------------
y_train_prob = outputToId(y_train, embedder_WordToId)
y_valid_prob = outputToId(y_valid, embedder_WordToId)

#-------------------------------------------------------------------------------
#                           Creation of the dataloader
#-------------------------------------------------------------------------------
dataset_train = HPDataset(X_train_torch, y_train_torch, y_train_prob, mask_train)
dataset_valid = HPDataset(X_valid_torch, y_valid_torch, y_valid_prob, mask_valid)

HP_train  = DataLoader(dataset_train, batch_size = batch_size)
HP_valid  = DataLoader(dataset_valid, batch_size = batch_size)
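
The shuffle-then-split logic used above can be checked on a toy dataset. This is a minimal sketch: the helper `shuffle_and_split`, the fixed seed and the ten dummy examples are assumptions made for the illustration, not part of the notebook.

```python
import random

def shuffle_and_split(inputs, outputs, train_percent, seed = 0):
    """Hypothetical helper: shuffles (input, output) pairs together, then splits them."""
    # Zipping keeps each input attached to its own output during the shuffle
    pairs = list(zip(inputs, outputs))
    random.Random(seed).shuffle(pairs)

    # Index separating the training set from the validation set
    cut = int(len(pairs) * train_percent / 100)
    return pairs[:cut], pairs[cut:]

inputs  = [[f"in{i}"] for i in range(10)]
outputs = [[f"out{i}"] for i in range(10)]
train_pairs, valid_pairs = shuffle_and_split(inputs, outputs, 90)
print(len(train_pairs), len(valid_pairs))
```

Because the sentences of all the selected books are shuffled before the split, the validation sentences come from the same books as the training ones.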

## SBOT - Architecture

In this section, we are going to first build the architecture of our SBOT by defining the **encoder-decoder** and the **attention mechanisms** !
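
Before looking at the PyTorch classes, the idea behind a dot-product attention can be sketched with plain Python lists for a single example. This is only an illustration under simplifying assumptions (no batch dimension, toy vectors); the function `dot_product_attention` is hypothetical and is not the `Attention_SBOT` class itself.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dot_product_attention(encoder_outputs, decoder_hidden):
    # One score per encoder time step (dot product with the decoder state)
    scores = [dot(h, decoder_hidden) for h in encoder_outputs]

    # Softmax over the time steps gives the attention weights
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The context is the weighted sum of the encoder outputs
    size = len(decoder_hidden)
    context = [sum(w * h[k] for w, h in zip(weights, encoder_outputs)) for k in range(size)]
    return weights, context

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time steps, toy values
decoder_hidden  = [2.0, 0.0]                            # toy decoder state
weights, context = dot_product_attention(encoder_outputs, decoder_hidden)
print(weights)
print(context)
```

The time steps whose encoder output is most aligned with the decoder state receive the largest weights, so the context vector mostly summarizes the input words that are relevant for the next prediction.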

In [None]:
#-------------------------------------------------------------------------------
#                                   ENCODER
#-------------------------------------------------------------------------------
class Encoder_SBOT(nn.Module):

    def __init__(self, input_size, hid_dimensions, num_layers, dropout = 0, bidirectional = False):
        super(Encoder_SBOT, self).__init__()

        self.gru = nn.GRU(input_size, hid_dimensions, num_layers = num_layers, dropout = dropout, bidirectional = bidirectional, batch_first = True)
        self.bidirectional = bidirectional 
        
    def forward(self, x): 
         
        outputs, hidden = self.gru(x)

        if self.bidirectional == True: 
          h1 = hidden[-2]
          h2 = hidden[-1]
          hidden = torch.cat((h1, h2), dim = 1)
          hidden = torch.unsqueeze(hidden, 0)
        else:
          hidden = hidden[-1]
          hidden = torch.unsqueeze(hidden, 0)

        return outputs, hidden

#-------------------------------------------------------------------------------
#                                    DECODER
#-------------------------------------------------------------------------------
class Decoder_SBOT(nn.Module):

    def __init__(self, input_size, hid_dimensions, num_layers, output_size, dropout):
        super(Decoder_SBOT, self).__init__()

        self.gru = nn.GRU(input_size, hid_dimensions, num_layers = num_layers, dropout = dropout, batch_first = True)
        self.mlp = nn.Linear(hid_dimensions, output_size)

    def forward(self, x, hidden):
        
        output, hidden = self.gru(x, hidden)
        output = self.mlp(output)

        return output, hidden

#-------------------------------------------------------------------------------
#                                    ATTENTION
#-------------------------------------------------------------------------------
class Attention_SBOT(nn.Module):

    def __init__(self, input_size):
        super(Attention_SBOT, self).__init__()

        # Used to compute weights
        self.SM = nn.Softmax(dim = 1)

    def forward(self, output_encoder, hidden_decoder):

        # Transformation of hidden (Batch first, vector product)
        hidden_decoder = torch.transpose(hidden_decoder, 0, 1)
        hidden_decoder = torch.transpose(hidden_decoder, 1, 2)
          
        # Computing the score (e_{i,j}, vector product)
        score = torch.bmm(output_encoder, hidden_decoder)
        
        # Final preparation of the weights
        weights = torch.transpose(self.SM(score), 1, 2)

        # Computing the attention-weighted recurrent state
        hidden_attention = torch.transpose(torch.bmm(weights, output_encoder), 0, 1)

        return hidden_attention.contiguous()

#-------------------------------------------------------------------------------
#                                    SBOT
#-------------------------------------------------------------------------------
class SBOT_V2(nn.Module):

    def __init__(self, input_size, hidden_size, output_size, num_layers_encoder, 
                 num_layers_decoder, device, dropout, bidirectional, use_attention, nb_input_words):
        super(SBOT_V2, self).__init__()

        # Initialization of the main components
        self.encoder = Encoder_SBOT(input_size, hidden_size, num_layers_encoder, dropout, bidirectional).to(device)

        if bidirectional == True:
          self.decoder = Decoder_SBOT(input_size, hidden_size * 2, num_layers_decoder, output_size, dropout).to(device)
        else:
          self.decoder = Decoder_SBOT(input_size, hidden_size, num_layers_decoder, output_size, dropout).to(device)

        self.attention = Attention_SBOT(nb_input_words).to(device)

        # Other parameters
        self.device = device
        self.use_att = use_attention
        self.num_layers_decoder = num_layers_decoder

    def forward(self, input, output_size, embedder, embedder_IdToWord, outputs = [], teacher_forcing_ratio = 0, training_mode = False):

        # Stores all the predictions made by the SBOT
        predictions = torch.zeros(len(input), output_size, len(embedder_IdToWord)).to(self.device)

        #-----------------------------------------------------------------------
        #                               ENCODING
        #-----------------------------------------------------------------------
        out_encoder, hidden = self.encoder(input)

        #-----------------------------------------------------------------------
        #                               ATTENTION
        #-----------------------------------------------------------------------
        if self.use_att == True:
          hidden = self.attention(out_encoder, hidden)

        # Preparation of hidden vector if decoder's layers > 1
        hidden = hidden.repeat(self.num_layers_decoder, 1, 1)

        #-----------------------------------------------------------------------
        #
        #                             Training mode
        #
        #-----------------------------------------------------------------------
        if training_mode == True:

          # The initial input for the decoder is ALWAYS a SOS token !
          input = [["<SOS>"] for el in range(len(outputs))]
          input = inputToTensor(input, embedder).to(self.device)

          # Making the predictions
          for p in range(output_size):

              #-----------------------------------------------------------------
              #                             DECODER
              #-----------------------------------------------------------------
              # Make a prediction
              current_prediction, current_hidden = self.decoder(input, hidden)

              # Saves the current prediction(s)
              for predic in range(len(current_prediction)):
                predictions[predic, p, :] = current_prediction[predic, :]

              #-----------------------------------------------------------------
              #                            ATTENTION
              #-----------------------------------------------------------------
              # Computes attention if needed
              if self.use_att == True:
                current_hidden = self.attention(out_encoder, current_hidden)

              # Update of the recurrent state for future predictions
              hidden = current_hidden

              #-----------------------------------------------------------------
              #                 PREPARATION OF THE NEXT INPUT
              #-----------------------------------------------------------------
              # CASE 1 - No teacher forcing (not training, or simply not used)
              if teacher_forcing_ratio == 0:
                
                    # The next input is the prediction made by the decoder
                    input = getPredictedWord(current_prediction, embedder_IdToWord)
                    input = inputToTensor(input, embedder).to(self.device)

              # CASE 2 - Teacher forcing is activated
              else:
                    # Decides if teacher_force has to be used or not
                    teacher_force = random.random() < teacher_forcing_ratio

                    # The input for the next prediction is the exact word
                    if teacher_force:
                      
                      # The next input is the exact word
                      input = outputs[:, p, :].unsqueeze(1)

                    else:
                      # The next input is the prediction made by the decoder
                      input = getPredictedWord(current_prediction, embedder_IdToWord)
                      input = inputToTensor(input, embedder).to(self.device)

          return predictions

        #-----------------------------------------------------------------------
        #
        #                                 Test mode
        #
        #-----------------------------------------------------------------------
        if training_mode == False:

          # Stores the predicted sentence
          sentence = []

          # The initial input for the decoder is ALWAYS a SOS token !
          input = [["<SOS>"]]
          input = inputToTensor(input, embedder).to(self.device)

          # Making the predictions
          for p in range(output_size):

              #-----------------------------------------------------------------
              #                             DECODER
              #-----------------------------------------------------------------
              # Make a prediction
              current_prediction, current_hidden = self.decoder(input, hidden)

              # Saves the current prediction(s)
              for predic in range(len(current_prediction)):
                predictions[predic, p, :] = current_prediction[predic, :]

              #-----------------------------------------------------------------
              #                            ATTENTION
              #-----------------------------------------------------------------
              # Computes attention if needed
              if self.use_att == True:
                current_hidden = self.attention(out_encoder, current_hidden)

              # Update of the recurrent state for future predictions
              hidden = current_hidden

              #-----------------------------------------------------------------
              #                 PREPARATION OF THE NEXT INPUT
              #-----------------------------------------------------------------
              # The next input is the prediction made by the decoder
              input = getPredictedWord(current_prediction, embedder_IdToWord)
              sentence.append(input)
              input = inputToTensor(input, embedder).to(self.device)

          return sentence

### Documentation
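To make the shapes used by the model easier to check, here is a small plain-Python sketch that computes the tensor shapes expected at each stage. The helper name `sbot_shapes` and the example sizes are assumptions made for illustration. The shape rules follow the `nn.GRU` conventions for `batch_first = True` (which the notebook uses), the concatenation done in `Encoder_SBOT.forward`, and the `repeat` applied before decoding.

```python
def sbot_shapes(batch_size, seq_len, emb_dim, hid_dim,
                enc_layers, dec_layers, bidirectional):
    """Expected tensor shapes for a GRU encoder-decoder with batch_first=True."""
    n_dir = 2 if bidirectional else 1
    return {
        # Embedded input sentences fed to the encoder
        "encoder input": (batch_size, seq_len, emb_dim),
        # One output vector per source word (both directions concatenated)
        "encoder outputs": (batch_size, seq_len, hid_dim * n_dir),
        # Hidden state as returned by nn.GRU (never batch first)
        "raw encoder hidden": (enc_layers * n_dir, batch_size, hid_dim),
        # Last layer only, directions concatenated, plus a leading axis
        "returned encoder hidden": (1, batch_size, hid_dim * n_dir),
        # Repeated once per decoder layer before decoding starts
        "decoder initial hidden": (dec_layers, batch_size, hid_dim * n_dir),
    }

# Hypothetical sizes: 50-dimensional embeddings, bidirectional encoder
shapes = sbot_shapes(batch_size=32, seq_len=10, emb_dim=50, hid_dim=128,
                     enc_layers=2, dec_layers=3, bidirectional=True)
for name, shape in shapes.items():
    print(f"{name:25s} {shape}")
```

With a bidirectional encoder the decoder must be built with a hidden size of `hid_dim * 2`, which is what `SBOT_V2.__init__` does.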

In [None]:
#-------------------------------------------------------------------------------
#
#                                   ENCODER
#
#-------------------------------------------------------------------------------
#
# input_size     = size of the vectors from the embedder
#
# hid_dimensions = size of the recurrent states
#
# num_layers     = number of GRUs stacked on one another
#
#------------
# Shape Guide
#------------
#
# embedded = [batch size, sequence length, embedder dimension]
#
# outputs  = [batch size, src len, hid dim * n directions]   (batch_first = True)
#
# hidden   = [n layers * n directions, batch_size, hid dim]
#
#-------------------------------------------------------------------------------
#
#                                   DECODER
#
#-------------------------------------------------------------------------------
#
# input_size     = size of the vectors from the embedder
#
# hid_dimensions = size of the recurrent states (same as for encoder)
#
# output_size    = size of the output vector from the MLP
#
# num_layers     = number of GRUs stacked on one another in the DECODER ! It is important for shaping !
#
# NOTE : The predictions y made by the decoder are y = sigma(h), so the dimension
#        of the prediction is equal to hid_dimens
#
#------------
# Shape Guide
#------------
#
# input  = [batch size, seq_len, emb dim]
#
# output = [batch size, seq len, hid dim * n directions]
#
# hidden = [n layers * n directions, batch size, hid dim]
#
# After MLP
#
# output = [batch size, seq len, output dim]
#
#-------------------------------------------------------------------------------
#
#                                     SBOT
#
#-------------------------------------------------------------------------------
#--------------
# Documentation
#--------------
# input                 = input sentences, already embedded
#                         ([batch_size, seq_len, emb dim])
#
# output_size           = number of words to be predicted
#
# embedder              = GloVe embedder
#
# embedder_IdToWord     = dictionary : ID -> word
#
# outputs               = correct words to be predicted (used for
#                         teacher forcing, ONLY DURING TRAINING)
#
# teacher_forcing_ratio = probability of using teacher forcing
#
# training_mode         = True to return the raw predictions,
#                         False to return the generated sentence
#
# Stores all the predictions made by the SBOT ! The shape of it is :
#
#            [batch_size, output_size, size_HP_vocabulary]
#
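The effect of the teacher forcing probability documented above can be illustrated with a short plain-Python sketch. At each decoding step a random number is drawn: if it falls below the ratio, the exact word is fed to the decoder at the next step, otherwise the model's own prediction is used. The function `choose_next_inputs` and the toy sentences are hypothetical and only mimic the decision logic; in the notebook the same draw is shared by all sentences of a batch.

```python
import random

def choose_next_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """For each decoding step, pick the word fed to the decoder at the next step."""
    next_inputs = []
    for target, predicted in zip(targets, predictions):
        # A ratio of 0 disables teacher forcing entirely (no random draw)
        if teacher_forcing_ratio > 0 and rng.random() < teacher_forcing_ratio:
            next_inputs.append(target)       # teacher forcing: feed the exact word
        else:
            next_inputs.append(predicted)    # otherwise feed the model's own guess
    return next_inputs

rng = random.Random(0)  # seeded generator so the run is reproducible
targets = ["harry", "raised", "his", "wand"]
predictions = ["harry", "lifted", "the", "wand"]

never = choose_next_inputs(targets, predictions, 0.0, rng)
always = choose_next_inputs(targets, predictions, 1.0, rng)
mixed = choose_next_inputs(targets, predictions, 0.5, rng)
print(never)   # only predictions
print(always)  # only targets
print(mixed)   # a mix that depends on the random draws
```

A ratio of 1 makes training easier because the decoder never sees its own mistakes, while a ratio of 0 matches what happens at generation time; intermediate values trade off the two.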

#--------------------------------------------------------------------------------------------------------------------------------------------- 
                                                        Training of the SBOT
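The training loop in this section computes the loss only on the positions selected by a boolean mask (`pred[y_mask]`, `y_iD[y_mask]`). The plain-Python sketch below shows what that amounts to: a cross-entropy averaged over the masked-in positions only. The function `masked_cross_entropy` and the toy numbers are assumptions made for this example, and treating the masked-out positions as padding is also an assumption, since the construction of the mask is not shown here.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over the positions where mask is True.

    logits:  [batch][step] -> list of vocabulary scores
    targets: [batch][step] -> index of the correct word
    mask:    [batch][step] -> True if the position counts towards the loss
    Assumes at least one position is selected by the mask.
    """
    losses = []
    for b in range(len(logits)):
        for t in range(len(logits[b])):
            if not mask[b][t]:
                continue  # ignored position (e.g. padding)
            scores = logits[b][t]
            # log(sum(exp(scores))), computed in a numerically stable way
            top = max(scores)
            log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
            # Negative log-probability of the correct word
            losses.append(log_norm - scores[targets[b][t]])
    return sum(losses) / len(losses)

# One sentence, two decoding steps, vocabulary of 3 words (made-up scores)
logits = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
targets = [[0, 1]]

loss_masked = masked_cross_entropy(logits, targets, [[True, False]])
loss_all = masked_cross_entropy(logits, targets, [[True, True]])
print(round(loss_masked, 4), round(loss_all, 4))
```

Masking the second step removes its uninformative uniform prediction from the average, so the masked loss is lower than the unmasked one in this toy case.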

In [None]:
#-------------------------------------------------------------------------------
#                              Training parameters
#-------------------------------------------------------------------------------
# Defines whether a new SBOT is trained (True) or an existing one is loaded (False)
new_sbot = True

# Number of the model to load 
sbot_index = 0

# Path to the correct folder to save the model and results
folder_path = path + "Models/" + user_path + "/"


In [None]:
#-------------------------------------------------------------------------------
#                           Initialization of the SBOT
#-------------------------------------------------------------------------------
if new_sbot == True:

  sbot = SBOT_V2(50, hidden_size, len(embedder_IdToWord), num_layers_encoder, num_layers_decoder, device, dropout, bidirectional, use_attention, input_size)

else:
  
  results_dic = loadTrainingInfo(path + "Models/" + user_path + "/", sbot_index)
  results     = loadTrainingParameters(results_dic)
  sbot = SBOT_V2(50, results[4], len(embedder_IdToWord), results[2], results[3], device, dropout, results[9], results[6], results[7])

total_params = sum(p.numel() for p in sbot.parameters() if p.requires_grad)

criterion = nn.CrossEntropyLoss()

optimizer = optim.Adam(sbot.parameters(), lr = learning_rate)

scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode = 'min', factor = 0.5, patience = 5, threshold = 1e-2)

#-------------------------------------------------------------------------------
#                                Loading the SBOT
#-------------------------------------------------------------------------------
if new_sbot == False:
  sbot, optimizer = loadModel(path + "Models/" + user_path + "/", sbot, optimizer, sbot_index)
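The scheduler created above halves the learning rate once the monitored loss has stopped improving for more than `patience` epochs. The sketch below is a simplified plain-Python re-implementation of that rule, ignoring cooldown, minimum learning rate and the other options of `ReduceLROnPlateau`. The function `simulate_plateau_schedule`, the starting learning rate and the loss values are made up for illustration.

```python
def simulate_plateau_schedule(losses, lr, factor=0.5, patience=5, threshold=1e-2):
    """Return the learning rate in effect after each epoch (mode 'min', relative threshold)."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        # An epoch only counts as an improvement if it beats the best loss by 1 %
        if loss < best * (1 - threshold):
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
        # More than `patience` epochs without improvement: reduce the learning rate
        if bad_epochs > patience:
            lr = lr * factor
            bad_epochs = 0
        history.append(lr)
    return history

# A loss that stalls at 1.0 right from the first epoch (made-up values)
history = simulate_plateau_schedule([1.0] * 8, lr=1e-3)
print(history)
```

With `patience = 5`, the first epoch sets the best loss, the next five stalled epochs are tolerated, and the reduction happens on the sixth one.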


In [None]:
#-------------------------------------------------------------------------------
#                                  Training
#-------------------------------------------------------------------------------
# First training information
losses_train_total = []
losses_valid_total = []
offset_epoch = 0

# Adjusting training information
if new_sbot == False:
  losses_train_total = results[0]
  losses_valid_total = results[1]
  offset_epoch = results[10]

# Displaying all the parameters of the current model undergoing training
if new_sbot == True:
  trainingInfo(num_layers_encoder, num_layers_decoder, hidden_size, total_params, teacher_forcing_ratio, bidirectional, nb_epoch, batch_size, use_attention, input_size)
else:
  trainingInfo(results[2], results[3], results[4], results[5], results[8], results[9], results[10] + offset_epoch, results[11], results[6], results[7])

# Number of words to be produced by the decoder
output_size = max_sentence_length - input_size

# Other parameters
t_size = len(X_train)
v_size = len(X_valid)
estimated_time = 0

# Going through epochs
for epoch in range(nb_epoch):

    # Used to compute the average loss value over the whole epoch
    train_losses = []
    valid_losses = []

    # Display useful information
    print("\nEpoch : ", epoch + 1 + offset_epoch, "/", nb_epoch + offset_epoch, "\n")
    index = batch_size
    start = time.time()

    #---------------------------------------------------------------------------
    #                                   Training
    #---------------------------------------------------------------------------
    for x, y, y_iD, y_mask in HP_train:

        # Passing the data to the GPU
        x = x.to(device)
        y = y.to(device)
        y_iD = y_iD.to(device)

        # Computing SBOT prediction
        pred = sbot(x, output_size, embedder, embedder_IdToWord, y, teacher_forcing_ratio, training_mode = True)
        
        # Computing the loss
        loss = criterion(pred[y_mask], y_iD[y_mask]) 

        # Adding the loss
        train_losses.append(loss.detach().item())

        # Resetting the gradients
        optimizer.zero_grad()

        # Backward pass
        loss.backward()

        # Optimizing the parameters
        optimizer.step()

        # Removing data from the GPU (.to() returns a copy, so the names are reassigned)
        x = x.to('cpu')
        y = y.to('cpu')
        y_iD = y_iD.to('cpu')

        # Update the progress bar
        time_left = estimated_time - (time.time() - start)
        progressBar(loss, 0, time_left, (index/t_size)*100)
        index = index + batch_size

    # Computing mean loss
    mean_loss = sum(train_losses)/len(train_losses)
    losses_train_total.append(mean_loss)

    # Display useful information
    estimated_time = time.time() - start
    progressBar(mean_loss, 0, 0, 100)

    #---------------------------------------------------------------------------
    #                                 Validation
    #---------------------------------------------------------------------------
    index_validation = batch_size

    with torch.no_grad():  
        
        for x, y, y_iD, y_mask in HP_valid:

          # Passing the data to the GPU
          x = x.to(device)
          y = y.to(device)
          y_iD = y_iD.to(device)

          # Computing SBOT prediction
          pred = sbot(x, output_size, embedder, embedder_IdToWord, y, teacher_forcing_ratio, training_mode = True)
          
          # Computing the loss
          loss_2 = criterion(pred[y_mask], y_iD[y_mask]) 

          # Adding the loss
          valid_losses.append(loss_2.detach().item())
          
          # Removing data from the GPU (.to() returns a copy, so the names are reassigned)
          x = x.to('cpu')
          y = y.to('cpu')
          y_iD = y_iD.to('cpu')

          # Update the progress bar
          progressBar(mean_loss, loss_2, time_left, (index_validation/v_size)*100)
          index_validation = index_validation + batch_size

    # Update of the scheduler
    scheduler.step(mean_loss)
    
    # Computing mean loss
    mean_loss_2 = sum(valid_losses)/len(valid_losses)

    # Adding the final loss of the epoch
    losses_valid_total.append(mean_loss_2)

    # Display useful information
    estimated_time = time.time() - start
    progressBar(mean_loss, mean_loss_2, 0, 100)
    print("\n")

    if (epoch + 1) in checkpoints:
      
      # Saving the results
      saveTrainingInfo(folder_path, losses_train_total, losses_valid_total, 
                      [num_layers_encoder, num_layers_decoder, hidden_size, total_params, use_attention, input_size, teacher_forcing_ratio, bidirectional, nb_epoch + offset_epoch, batch_size])

      saveModel(folder_path, sbot, optimizer)

# Information over terminal
print("---------------------------------------------------------------------")
print("                 Training finished and model saved                   ")
print("---------------------------------------------------------------------")