Our first approach is to write, by hand, a simple collection of sentences organised by intent, with their associated answers.
We will then process that data to make it suitable for NLP applications, encode it using a bag-of-words representation and train a neural network to predict user intent from an utterance.

Then, we will use pre-trained word embeddings.

Then, we will generate training data using the OpenAI GPT-3 API and train on it, using both bag-of-words and word embeddings.
We can also try TF-IDF to compare it with the NN.

A chatbot needs to understand intents in users' utterances. For this purpose, we train a classifier.

I use the IMDB review dataset for basic testing.

# Import / Generate data

In this section, we import our dataset, made of hand-crafted sentences and the corresponding intent.

In [1]:
import json

import numpy as np
import pandas as pd
from sklearn import tree, svm, naive_bayes
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
import sklearn.metrics
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from pprint import pprint
from time import time
from collections import defaultdict
import spacy
from spacy.tokens import DocBin
nlp = spacy.load("en_core_web_lg")

In [2]:
# Use a context manager so that the file is closed after reading
with open('intents.json') as data_file:
    intents = json.load(data_file)


data = []
for intent in intents['intents']:
    for pattern in intent['patterns']:
        data.append([pattern, intent['tag']])

df = pd.DataFrame(data, columns=['text','intent'])
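
The loop above implies that `intents.json` holds a top-level `intents` list whose entries each carry a `tag` and a list of `patterns`. Below is a minimal sketch of that layout; the tags and sentences are made up for illustration, and the real file may contain more fields (the associated answers, for example).

```python
import json

# Hypothetical content, shaped the way the loading loop expects it.
sample_json = """
{
  "intents": [
    {"tag": "greeting", "patterns": ["Hello", "Good morning"]},
    {"tag": "print_document", "patterns": ["I want to print a document"]}
  ]
}
"""

sample_intents = json.loads(sample_json)

# Same flattening as in the notebook: one [text, intent] row per pattern.
rows = [
    [pattern, intent["tag"]]
    for intent in sample_intents["intents"]
    for pattern in intent["patterns"]
]

for text, tag in rows:
    print(f"{tag:15s} <- {text}")
```

Each row can then be passed to `pd.DataFrame(rows, columns=['text', 'intent'])` as in the cell above.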

In [3]:
df_imdb = pd.read_csv("IMDB Dataset.csv", nrows=1000)
print(df_imdb)

                                                review sentiment
0    One of the other reviewers has mentioned that ...  positive
1    A wonderful little production. <br /><br />The...  positive
2    I thought this was a wonderful way to spend ti...  positive
3    Basically there's a family where a little boy ...  negative
4    Petter Mattei's "Love in the Time of Money" is...  positive
..                                                 ...       ...
995  Nothing is sacred. Just ask Ernie Fosselius. T...  positive
996  I hated it. I hate self-aware pretentious inan...  negative
997  I usually try to be professional and construct...  negative
998  If you like me is going to see this in a film ...  negative
999  This is like a zoology textbook, given that it...  negative

[1000 rows x 2 columns]


The intents that we want the chatbot to recognize are :

In [None]:
df["intent"].unique()

# Data pre-processing

In this section, we define functions to preprocess our text (parse it using a SpaCy pipeline) and to process it (extract tokens, lemmas or embeddings depending on the application).
We save the preprocessed data to disk to avoid repeating this computationally expensive task.

In [None]:
# fine-tune preprocessing for spaCy word embeddings using this method : https://www.kaggle.com/code/christofhenkel/how-to-preprocessing-when-using-embeddings

In [14]:
# Helper functions

def lemmatize_text(text, preprocessed=True):
    return process_text(text, "lemmatize", preprocessed)

def tokenize_text(text, preprocessed=True):
    return process_text(text, "tokenize", preprocessed)

def process_text(text, mode: str, preprocessed=True):
    if not preprocessed:
        text = nlp(text)
    if mode == "tokenize":
        processed_text = [token.text for token in text] # token and embed must have the same processing + SpaCy provides embeddings for punctuation
    elif mode == "embed":
        processed_text = [token.vector for token in text] # token and embed must have the same processing
    elif mode == "lemmatize":
        processed_text = [token.lemma_ for token in text
                               if not token.is_punct and not token.is_space and not token.like_url and not token.like_email]
    else:
        raise ValueError("Mode not supported")
    return processed_text

def save_preprocessed(raw_text, save_path):
    doc_bin = DocBin(attrs=["LEMMA", "ENT_IOB", "ENT_TYPE"], store_user_data=True)
    for doc in nlp.pipe(raw_text):
        doc_bin.add(doc)
    # save DocBin to a file on disc
    doc_bin.to_disk(save_path)

In [5]:
file_name_spacy = 'preprocessed_imdb.spacy'
#save_preprocessed(raw_text=df_imdb["review"], save_path=file_name_spacy)

# Load DocBin at later time or on different system from disc or bytes object
doc_bin = DocBin().from_disk(file_name_spacy)
df_imdb["doc"] = list(doc_bin.get_docs(nlp.vocab))

In [6]:
print(df_imdb)

                                                review sentiment  \
0    One of the other reviewers has mentioned that ...  positive   
1    A wonderful little production. <br /><br />The...  positive   
2    I thought this was a wonderful way to spend ti...  positive   
3    Basically there's a family where a little boy ...  negative   
4    Petter Mattei's "Love in the Time of Money" is...  positive   
..                                                 ...       ...   
995  Nothing is sacred. Just ask Ernie Fosselius. T...  positive   
996  I hated it. I hate self-aware pretentious inan...  negative   
997  I usually try to be professional and construct...  negative   
998  If you like me is going to see this in a film ...  negative   
999  This is like a zoology textbook, given that it...  negative   

                                                   doc  
0    (One, of, the, other, reviewers, has, mentione...  
1    (A, wonderful, little, production, ., <, br, /...  
2    (I, tho

# Data preparation

In this section, we create different training datasets, processing them using spaCy and our helper functions :

- `X_train` is a pandas Series made of all preprocessed sentences
- `X_train_embedded` is a pandas Series where each sentence is a list of embeddings
- `X_train_embedded_avg` is a pandas DataFrame (one row per sentence) where each sentence is the average of its words' embeddings (using the sum would give embeddings of different magnitudes depending on the sentence's length)
- `X_train_embedded_avg_tfidf` is the previous average, weighted using TF-IDF coefficients (computed on n-grams of 1 token)
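
The remark on sum versus average can be checked on toy data. The sketch below uses random 4-dimensional vectors as stand-ins for the 300-dimensional spaCy embeddings, and repeats the same word to isolate the effect of sentence length; both choices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a word vector: 4 dimensions instead of 300.
word_vector = rng.normal(size=4)
short_sentence = np.stack([word_vector] * 3)    # 3 tokens
long_sentence = np.stack([word_vector] * 30)    # 30 tokens

# Summing scales the norm with the number of tokens...
sum_short = np.linalg.norm(short_sentence.sum(axis=0))
sum_long = np.linalg.norm(long_sentence.sum(axis=0))

# ...while averaging keeps it independent of the sentence length.
mean_short = np.linalg.norm(short_sentence.mean(axis=0))
mean_long = np.linalg.norm(long_sentence.mean(axis=0))

print("sum norms :", sum_short, sum_long)
print("mean norms:", mean_short, mean_long)
```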

In [7]:
train, test = train_test_split(df_imdb, test_size=0.3)

X_train = train["doc"].reset_index(drop=True)
y_train = train["sentiment"].reset_index(drop=True)

X_test = test["doc"].reset_index(drop=True)
y_test = test["sentiment"].reset_index(drop=True)

X_train_embedded = train["doc"].apply(process_text, args=("embed", True,))
X_train_embedded_avg = X_train_embedded.apply(np.mean, axis=0).apply(pd.Series)

X_test_embedded = test["doc"].apply(process_text, args=("embed", True,))
X_test_embedded_avg = X_test_embedded.apply(np.mean, axis=0).apply(pd.Series)

In [10]:
# The following code block constructs a sentence representation as the average of the embeddings of its words, weighted by their TF-IDF scores
# This is not practical in our chatbot : using word embeddings is one way of mitigating the small dataset size, as words close in meaning should have similar embeddings
# Weighting by TF-IDF score would "delete" unknown words from the vocabulary, which we do not want

vectorizer = TfidfVectorizer(ngram_range=(1, 1), lowercase=False, tokenizer=tokenize_text, max_features=10000)
# The vocabulary is fitted on X_train only, to avoid leaking test data into the vectorizer
# If this representation is selected, the final model can be retrained on the full dataset
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

def tfidf_weighted_average(docs, tfidf_matrix, embedded):
    weighted_averages = []
    for idx_row, sentence in docs.items():
        sum_embeddings = np.zeros(nlp.vocab.vectors_length)
        for idx_word, word in enumerate(sentence):
            # vocabulary_ is keyed by token text, not by spaCy Token objects
            tfidf_idx = vectorizer.vocabulary_.get(word.text)
            if tfidf_idx is None:
                continue
            # Index the sparse matrix directly instead of calling toarray() for every word
            sum_embeddings += tfidf_matrix[idx_row, tfidf_idx] * embedded.iloc[idx_row][idx_word]
        weighted_averages.append(sum_embeddings / len(sentence))
    return pd.DataFrame(weighted_averages)

X_train_embedded_avg_tfidf = tfidf_weighted_average(X_train, X_train_tfidf, X_train_embedded)
X_test_embedded_avg_tfidf = tfidf_weighted_average(X_test, X_test_tfidf, X_test_embedded)

KeyboardInterrupt: 

TODO : Explain alternatives (sense2vec, Doc2vec)

# Classic ML

Our first approach to create our classifier is to use traditional ML algorithms.

We will use several algorithms and three different approaches to represent our training data :

- A classic TF-IDF representation (with or without IDF; without it, this amounts to a normalised bag-of-words approach)
- A "sentence2vec" (or s2v) approach, where a sentence is the average of its words' embeddings
- A TF-IDF weighted average of word embeddings, s2v_tfidf
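
The third representation can be sketched on toy data. In the example below, the sentences, the 3-dimensional random embeddings and the helper name `s2v_tfidf` are all assumptions made for illustration; the notebook itself works on spaCy `Doc` objects and 300-dimensional vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ["print my document", "scan my document", "print two pages"]

# Hypothetical 3-dimensional embeddings; the notebook uses spaCy vectors instead.
rng = np.random.default_rng(0)
vocabulary = sorted({word for s in sentences for word in s.split()})
embeddings = {word: rng.normal(size=3) for word in vocabulary}

vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None, lowercase=False)
tfidf = vectorizer.fit_transform(sentences)  # sparse matrix, one row per sentence


def s2v_tfidf(row_idx, sentence):
    """Average of the word vectors of `sentence`, each scaled by its TF-IDF weight."""
    tokens = sentence.split()
    total = np.zeros(3)
    for token in tokens:
        col = vectorizer.vocabulary_.get(token)
        if col is None:  # token unknown to the vectorizer: it contributes nothing
            continue
        total += tfidf[row_idx, col] * embeddings[token]
    return total / len(tokens)


sentence_vectors = np.vstack([s2v_tfidf(i, s) for i, s in enumerate(sentences)])
print(sentence_vectors.shape)
```

Tokens missing from the TF-IDF vocabulary are skipped here, which is the "deleted unknown words" effect that makes this representation less attractive for a small chatbot dataset.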

## Models preparation

We use the sklearn implementation of GridSearchCV, which optimises the parameters of an estimator (here, our classifiers) by cross-validated grid-search over a parameter grid.
We select different algorithms, define a pipeline and a set of parameters for each of those.
The use of the pipeline allows us to select the best parameters for the TF-IDF vectorization.

The size of our dataset does not constrain us in the choice of the algorithm, as training time is not a concern (no need to swap SVC for LinearSVC or SGDClassifier, for example).

GridSearchCV uses K-fold as the cross-validation method. Here, we use stratified 3-fold cross-validation (`cv=3`, which scikit-learn stratifies automatically for classifiers).
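
As an illustration of how a pipeline and a parameter grid fit together, here is a minimal sketch on a made-up two-class dataset. The sentences, the step name `clf` and the explicit `StratifiedKFold` with shuffling and a fixed seed are assumptions made for this example; the notebook passes `cv=3` directly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Made-up two-class dataset, six sentences per class.
texts = [
    "great movie", "loved this film", "wonderful acting", "really great story",
    "a wonderful film", "loved the acting",
    "terrible movie", "hated this film", "awful acting", "really terrible story",
    "an awful film", "hated the acting",
]
labels = ["positive"] * 6 + ["negative"] * 6

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# "<step name>__<parameter>" addresses a parameter of one pipeline step.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "vect__use_idf": [True, False],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the vectorizer is a step of the pipeline, it is refitted on the training part of each fold, so the vectorization parameters are tuned without seeing the validation part.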

TODO : List the parameters for the classifiers and the vectorizer

Below are defined the models and their corresponding hyperparameters to tune for the TF-IDF approach

In [14]:
vect = TfidfVectorizer(lowercase=False, tokenizer=lemmatize_text, max_features=3000)

gs_dict_tfidf = defaultdict(dict)

dectree = tree.DecisionTreeClassifier() # CART
svm_clf = svm.SVC()
multi_nb = naive_bayes.MultinomialNB() # Not suitable for negative values (thus not suitable for word embeddings)
log_reg = LogisticRegression()
random_forest = RandomForestClassifier()
skboost = GradientBoostingClassifier()

gs_dict_tfidf['dectree']['pipeline'] = Pipeline([
    ('vect', vect),
    ('dectree', dectree)])
gs_dict_tfidf['svm_clf']['pipeline'] = Pipeline([
    ('vect', vect),
    ('svm_clf', svm_clf)])
gs_dict_tfidf['multi_nb']['pipeline'] = Pipeline([
    ('vect', vect),
    ('multi_nb', multi_nb)])
gs_dict_tfidf['log_reg']['pipeline'] = Pipeline([
    ('vect', vect),
    ('log_reg', log_reg)])
gs_dict_tfidf['random_forest']['pipeline'] = Pipeline([
    ('vect', vect),
    ('random_forest', random_forest)])
gs_dict_tfidf['skboost']['pipeline'] = Pipeline([
    ('vect', vect),
    ('skboost', skboost)])

gs_dict_tfidf['dectree']['params'] = {
    "dectree__max_depth": [4, 40],
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
gs_dict_tfidf['svm_clf']['params'] = {
    "svm_clf__kernel": ["linear", "rbf"],
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
gs_dict_tfidf['multi_nb']['params'] = {
    "multi_nb__alpha": [0.00001, 0.0001, 0.001, 0.1, 1, 10, 100,1000],
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
gs_dict_tfidf['log_reg']['params'] = {
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
gs_dict_tfidf['random_forest']['params'] = {
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
gs_dict_tfidf['skboost']['params'] = {
    "vect__ngram_range": ((1, 1), (1, 2), (1,3), (1,4)),
    "vect__use_idf": (True, False),
    "vect__binary": (True, False),
}
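
The comment on `multi_nb` above notes that it is not suitable for negative values. A quick check on random data, used here as a stand-in for averaged embeddings (an assumption made for illustration), shows the behaviour:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)

# Random stand-in for averaged embeddings: real-valued, so some entries are negative.
X_embedded = rng.normal(size=(10, 5))
y = np.array([0, 1] * 5)

try:
    MultinomialNB().fit(X_embedded, y)
    fit_failed = False
except ValueError as error:
    fit_failed = True
    print("MultinomialNB rejected the input:", error)

# Non-negative features, such as counts or TF-IDF weights, are accepted.
MultinomialNB().fit(np.abs(X_embedded), y)
```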

Below are defined the models to be used with the two embeddings approaches

In [19]:
gs_dict_embeddings = defaultdict(dict)
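# classifiers to use (fresh instances, so that this cell does not depend on the TF-IDF cell)
dectree = tree.DecisionTreeClassifier()
svm_clf = svm.SVC()
log_reg = LogisticRegression()
random_forest = RandomForestClassifier()
skboost = GradientBoostingClassifier()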

gs_dict_embeddings['dectree']['pipeline'] = Pipeline([
    ('dectree', dectree)])
gs_dict_embeddings['svm_clf']['pipeline'] = Pipeline([
    ('svm_clf', svm_clf)])
gs_dict_embeddings['log_reg']['pipeline'] = Pipeline([
    ('log_reg', log_reg)])
gs_dict_embeddings['random_forest']['pipeline'] = Pipeline([
    ('random_forest', random_forest)])
gs_dict_embeddings['skboost']['pipeline'] = Pipeline([
    ('skboost', skboost)])

gs_dict_embeddings['dectree']['params'] = {
    "dectree__max_depth": [4, 10],
}
gs_dict_embeddings['svm_clf']['params'] = {
    "svm_clf__kernel": ["linear", "rbf"],
}
gs_dict_embeddings['log_reg']['params'] = {

}
gs_dict_embeddings['random_forest']['params'] = {

}
gs_dict_embeddings['skboost']['params'] = {

}

## Model Selection

In [16]:
# Helper functions

def perform_grid_search(X_train, y_train, pipeline, parameters, scoring):
    gs_clf = GridSearchCV(pipeline, parameters, n_jobs=1, verbose=1, cv=3, scoring=scoring) # Issue when n_jobs = -1 OR > 1
    # I believe this may be because we use a custom tokenizer in TfidfVectorizer(), can't find how to solve it
    print("\n------------------------------------------------------------------------\n")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)

    t0 = time()

    gs_clf.fit(X_train, y_train)

    print("\nDone in %0.3fs" % (time() - t0))
    print()

    print("Best score: %0.3f" % gs_clf.best_score_)
    print("Best parameters set:")
    best_parameters = gs_clf.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print(f"\t'{param_name}': '{best_parameters[param_name]}'")
    return gs_clf

def best_estimator_per_clf(X_train, y_train, gs_dict: defaultdict, scoring):
    for clf in dict(gs_dict):
        gs_dict[clf]['gs'] = perform_grid_search(
            X_train,
            y_train,
            gs_dict[clf]['pipeline'],
            gs_dict[clf]['params'],
            scoring
        )

In [17]:
best_estimator_per_clf(X_train, y_train, gs_dict_tfidf, scoring="accuracy")
best_estimator_per_clf(X_train_embedded_avg, y_train, gs_dict_embeddings, scoring="accuracy")

# TODO : Use random state in gsCV and XGBoost for reproducibility

# TODO : implement multiple scoring : https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py

# TODO : extract metrics https://scikit-learn.org/stable/auto_examples/model_selection/plot_multi_metric_evaluation.html#sphx-glr-auto-examples-model-selection-plot-multi-metric-evaluation-py


------------------------------------------------------------------------

pipeline: ['vect', 'dectree']
parameters:
{'dectree__max_depth': [4, 40],
 'vect__binary': (True, False),
 'vect__ngram_range': ((1, 1), (1, 2), (1, 3), (1, 4)),
 'vect__use_idf': (True, False)}
Fitting 3 folds for each of 32 candidates, totalling 96 fits

Done in 48.211s

Best score: 0.691
Best parameters set:
	'dectree__max_depth': '40'
	'vect__binary': 'False'
	'vect__ngram_range': '(1, 1)'
	'vect__use_idf': 'True'

------------------------------------------------------------------------

pipeline: ['vect', 'svm_clf']
parameters:
{'svm_clf__kernel': ['linear', 'rbf'],
 'vect__binary': (True, False),
 'vect__ngram_range': ((1, 1), (1, 2), (1, 3), (1, 4)),
 'vect__use_idf': (True, False)}
Fitting 3 folds for each of 32 candidates, totalling 96 fits

Done in 75.772s

Best score: 0.829
Best parameters set:
	'svm_clf__kernel': 'linear'
	'vect__binary': 'False'
	'vect__ngram_range': '(1, 4)'
	'vect__use_idf': 'True

ValueError: 
All the 3 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/pipeline.py", line 378, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/pipeline.py", line 336, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/pipeline.py", line 870, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 2079, in fit_transform
    X = super().fit_transform(raw_documents)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1338, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary_)
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1209, in _count_vocab
    for feature in analyze(doc):
  File "/home/matthieu/miniconda3/envs/chatbot-sdia/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 113, in _analyze
    doc = tokenizer(doc)
  File "/tmp/ipykernel_17605/2133088996.py", line 4, in lemmatize_text
    return process_text(text, "lemmatize", preprocessed)
  File "/tmp/ipykernel_17605/2133088996.py", line 17, in process_text
    processed_text = [token.lemma_ for token in text
TypeError: 'int' object is not iterable


In [21]:
best_estimator_per_clf(X_train_embedded_avg, y_train, gs_dict_embeddings, scoring="accuracy")


------------------------------------------------------------------------

pipeline: ['dectree']
parameters:
{'dectree__max_depth': [4, 10]}
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Done in 0.412s

Best score: 0.671
Best parameters set:
	'dectree__max_depth': '4'

------------------------------------------------------------------------

pipeline: ['svm_clf']
parameters:
{'svm_clf__kernel': ['linear', 'rbf']}
Fitting 3 folds for each of 2 candidates, totalling 6 fits

Done in 0.225s

Best score: 0.829
Best parameters set:
	'svm_clf__kernel': 'linear'

------------------------------------------------------------------------

pipeline: ['log_reg']
parameters:
{}
Fitting 3 folds for each of 1 candidates, totalling 3 fits

Done in 0.064s

Best score: 0.816
Best parameters set:

------------------------------------------------------------------------

pipeline: ['random_forest']
parameters:
{}
Fitting 3 folds for each of 1 candidates, totalling 3 fits

Done in 1.477s

Best

In [None]:
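# The classifier was trained on averaged embeddings, so the query is averaged (not summed) as well
sentence_vector = np.mean(process_text(nlp("I want to print 76 page of a document"), mode="embed"), axis=0)

In [None]:
model = gs_dict_embeddings['svm_clf']['gs'].best_estimator_
model.predict([sentence_vector])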
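
The TODO notes in the model selection cell mention fixing the random state and reporting several metrics. One possible way to do both with `GridSearchCV` is sketched below on synthetic data; the estimator, the grid and the choice of `f1_macro` as second metric are assumptions, not the notebook's final setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the averaged embeddings.
X, y = make_classification(n_samples=120, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),  # fixed seed for the estimator
    param_grid={"max_depth": [2, 4]},
    scoring={"accuracy": "accuracy", "f1": "f1_macro"},
    refit="accuracy",  # metric used to select best_estimator_
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),  # fixed folds
)
search.fit(X, y)

# With several scorers, cv_results_ holds one set of columns per metric name.
for name in ("accuracy", "f1"):
    print(name, search.cv_results_[f"mean_test_{name}"])
```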

# Neural networks

## Model preparation

## Model selection

Idea : skew text classification if named entities are found (either with a multi-channel NN or by adding a feature to the data passed)

In [None]:
# LSTM : https://www.tensorflow.org/text/tutorials/text_classification_rnn
len(nlp.vocab.vectors.keys())

In [None]:
import tensorflow as tf

max_words = 30 # Max number of words in a sentence

raw_inputs = X_train_embedded
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    X_train_embedded,
    maxlen=max_words,
    padding="pre",
    truncating="pre",
    dtype="float32",
)

In [None]:
padded_inputs.shape
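
To make the effect of `padding="pre"` and `truncating="pre"` concrete, here is a NumPy-only sketch of the same idea on toy sizes. The helper `pad_pre` is hypothetical and is not the Keras implementation.

```python
import numpy as np


def pad_pre(sequence, maxlen, dim):
    """Keep the last `maxlen` vectors and left-pad with zero vectors."""
    sequence = np.asarray(sequence, dtype="float32").reshape(-1, dim)
    sequence = sequence[-maxlen:]  # same idea as truncating="pre"
    padding = np.zeros((maxlen - len(sequence), dim), dtype="float32")
    return np.concatenate([padding, sequence])  # same idea as padding="pre"


dim, maxlen = 4, 5  # toy sizes; the notebook uses 300 and 30
short_seq = np.ones((2, dim))  # 2 tokens
long_seq = np.ones((8, dim))   # 8 tokens

batch = np.stack([pad_pre(s, maxlen, dim) for s in (short_seq, long_seq)])
print(batch.shape)
print(batch[0, :, 0])  # leading zeros mark the padded positions
```

The all-zero rows are what a `Masking(mask_value=0)` layer later skips, so that padded positions do not influence the LSTM.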

In [None]:
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Input, Masking
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_encoded = le.fit_transform(y_train)
number_classes = len(y_train.unique())

model=Sequential()
#model.add(Embedding(vocab_size,300,input_length=max_words))
model.add(Masking(mask_value=0, input_shape=(None, 300)))
model.add(LSTM(units=128,
               return_sequences=False,
               input_shape=(None, 300)
               ))
model.add(Dense(number_classes, activation='softmax'))

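model.summary()
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# validation_split (0.2 is an arbitrary choice) holds out part of the data so that val_* metrics are recorded in history
history = model.fit(tf.convert_to_tensor(padded_inputs), y_encoded, epochs=20, validation_split=0.2)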

In [None]:
test = process_text("tell me your name", mode="embed", preprocessed=False)
predict = model.predict(np.asarray([test]))
predicted_class = np.argmax(predict)
predicted_class = le.inverse_transform([predicted_class])
predicted_class
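
The decoding step above (argmax, then `inverse_transform`) can be followed on a tiny example. The probabilities below are hypothetical values written by hand, not an actual model output.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "negative", "positive"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)  # classes are sorted: negative -> 0, positive -> 1

# Hypothetical softmax output for one sentence, one probability per class.
probabilities = np.array([[0.2, 0.8]])

predicted_index = np.argmax(probabilities, axis=1)
predicted_label = encoder.inverse_transform(predicted_index)
print(predicted_label)
```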

In [None]:
# Makes no sense to train LSTM / CNN on Tf-Idf : They preserve spatial / temporal information, but that information is lost with tfidf
# Does not play to their strength, not more relevant than a classic classifier
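
A small sketch of why word order is lost, using plain unigram counts as a stand-in for TF-IDF (the two sentences are made up for illustration):

```python
from collections import Counter

sentence_a = "the dog bites the man"
sentence_b = "the man bites the dog"

bag_a = Counter(sentence_a.split())
bag_b = Counter(sentence_b.split())

# Same counts for both sentences: the representation cannot tell them apart.
print(bag_a == bag_b)

# The token sequences, which an LSTM or a CNN would consume, do differ.
print(sentence_a.split() == sentence_b.split())
```

Larger n-grams keep some local order, but the position of each n-gram in the sentence is still discarded.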

In [None]:
# Helper function

import matplotlib.pyplot as plt

def plot_graphs(history, metric):
  plt.plot(history.history[metric])
  plt.plot(history.history['val_'+metric], '')
  plt.xlabel("Epochs")
  plt.ylabel(metric)
  plt.legend([metric, 'val_'+metric])

plot_graphs(history, "accuracy")

In [None]:


def model_to_optimize(num_filters, kernel_size):
    model = Sequential([
    tf.keras.layers.Conv1D(num_filters, kernel_size, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    Dense(10, activation='relu'),
    Dense(number_classes, activation='softmax')])
    model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    return model

params = {
    "num_filters":[32, 64, 128],
    "kernel_size":[3, 5, 7],
}

model = tf.keras.wrappers.scikit_learn.KerasClassifier(build_fn=model_to_optimize,
                            epochs=20,
                           batch_size=10,
                            verbose=False)

from sklearn.model_selection import GridSearchCV
search = GridSearchCV(estimator=model, param_grid=params,
                              cv=2, verbose=1)
search_result = search.fit(padded_inputs, y_encoded)

In [None]:
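# Only the last expression of a cell is displayed, so print the best parameters explicitly
print(search_result.best_params_)
pd.DataFrame(search_result.cv_results_)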

In [None]:
from sklearn import model_selection
import tensorflow as tf
from keras.layers import TextVectorization
from keras.layers import Embedding
from keras import layers

max_words = 30 # Max number of words in a sentence

raw_inputs = X_train_embedded
padded_inputs = tf.keras.preprocessing.sequence.pad_sequences(
    X_train_embedded,
    maxlen=max_words,
    padding="pre",
    truncating="pre",
    dtype="float32",
)

# Display the scores for each fold, then their average
# acc_per_fold and loss_per_fold are expected to be filled by a K-fold training loop beforehand
print('---------------------------------------------------------------------')
print('Scores per fold')
for i in range(0, len(acc_per_fold)):
  print('---------------------------------------------------------------------')
  print(f'> Fold {i+1} - Loss: {loss_per_fold[i]:.2f}',
        f'- Accuracy: {acc_per_fold[i]:.2f}%')
print('---------------------------------------------------------------------')
print('Average scores over all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold):.2f}',
      f'(+- {np.std(acc_per_fold):.2f})')
print(f'> Loss: {np.mean(loss_per_fold):.2f}')
print('---------------------------------------------------------------------')
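
The reporting code above reads `acc_per_fold` and `loss_per_fold`, which are not defined in this notebook. Below is a minimal sketch of how such lists could be filled. It assumes 5 stratified folds, synthetic data, and a scikit-learn `LogisticRegression` as a lightweight stand-in for the Keras model; none of these choices come from the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=150, n_features=10, random_state=0)

acc_per_fold = []
loss_per_fold = []

folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in folds.split(X, y):
    # Stand-in classifier; a fresh Keras model would be built and fitted here instead.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])

    probabilities = clf.predict_proba(X[val_idx])
    loss_per_fold.append(log_loss(y[val_idx], probabilities))
    acc_per_fold.append(100 * accuracy_score(y[val_idx], clf.predict(X[val_idx])))

print(f"Accuracy: {np.mean(acc_per_fold):.2f}% (+- {np.std(acc_per_fold):.2f})")
print(f"Loss: {np.mean(loss_per_fold):.2f}")
```

A new model is created inside the loop so that no fold starts from weights already fitted on another fold.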

In [None]:
import seaborn as sns

# histories is expected to hold one Keras History object per fold
accuracy_data = []
loss_data = []
for i, h in enumerate(histories):
  acc = h.history['acc']
  val_acc = h.history['val_acc']
  loss = h.history['loss']
  val_loss = h.history['val_loss']
  for j in range(len(acc)):
    accuracy_data.append([i+1, j+1, acc[j], 'Training'])
    accuracy_data.append([i+1, j+1, val_acc[j], 'Validation'])
    loss_data.append([i+1, j+1, loss[j], 'Training'])
    loss_data.append([i+1, j+1, val_loss[j], 'Validation'])

acc_df = pd.DataFrame(accuracy_data,
                      columns=['Fold', 'Epoch', 'Accuracy', 'Data'])
sns.relplot(data=acc_df, x='Epoch', y='Accuracy', hue='Fold', style='Data',
            kind='line')

loss_df = pd.DataFrame(loss_data, columns=['Fold', 'Epoch', 'Loss', 'Data'])
sns.relplot(data=loss_df, x='Epoch', y='Loss', hue='Fold', style='Data',
            kind='line')

In [None]:
# Use part 1 to implement a CNN

In [None]:
# Test with an embedding matrix based solely on my vocabulary or on the whole spaCy vocabulary

TODO : Once data has been generated, apply the visualization techniques found in part 1 to it !