<a href="https://colab.research.google.com/github/ValentinCord/HandsOnAI_2/blob/main/Transformer_Training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <span> NLP: Training and Saving the Model </span>
<hr style="border-bottom: solid;background-color:light;color:black;">

* [Installation](#section-1)
* [Imports](#section-2)
* [Parameter selection](#section-3)
* [Reading the data](#section-4)
* [Preprocessing](#section-5)
* [Model creation](#section-6)
* [Model training](#section-7)
* [Prediction](#section-8)
* [Saving the model](#section-9)
* [Testing the model](#section-10)

<a name="section-1"></a>
# <span>1. Package installation</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
!/opt/bin/nvidia-smi
!rm -rf sample_data

!pip3 install transformers
!pip3 install datasets
!pip install sentencepiece

Wed Dec 28 10:51:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    31W /  70W |   6898MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

<a name="section-2"></a>
# <span>2. Imports </span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
# basics 
import os
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn import metrics

# transformers 
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer
from transformers import CamembertModel, CamembertTokenizer
from transformers import DataCollatorWithPadding
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# plot 
import matplotlib.pyplot as plt 
import seaborn as sns 

# torch 
import torch
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn.functional as F

# nltk 
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


<a name="section-3"></a>
# <span>3. Parameter selection</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

<p align="justify">In this section we set the parameters for the input data and for the model. The model parameters are the batch size, the number of epochs and the learning rate.</p>

<p align="justify">For data preparation, note that Transformers only accept inputs up to a fixed maximum length, which is 512 tokens for the Transformer used here. Some news articles may be longer than that. Rather than truncating them and risking a loss of information, we chose to split each article into pieces. We can therefore set the maximum size of each piece (in words) and the overlap between consecutive pieces.</p>
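
As a rough illustration of the splitting scheme described above, the sketch below computes which word ranges each piece would cover. It assumes a piece length of 150 words and an overlap of 50 words; `chunk_bounds` is a hypothetical helper introduced here for illustration and is not part of the notebook.

```python
def chunk_bounds(n_words, len_text=150, overlap=50):
    """Return (start, end) word indices of each piece for a text of n_words words."""
    stride = len_text - overlap          # how far the window advances each time
    n_chunks = max(1, n_words // stride) # floor division, but always at least one piece
    return [(i * stride, min(i * stride + len_text, n_words)) for i in range(n_chunks)]

bounds_long = chunk_bounds(320)   # three overlapping windows
bounds_short = chunk_bounds(80)   # a short text stays in one piece
print(bounds_long)
print(bounds_short)
```

Consecutive windows share `overlap` words, so a sentence cut at a boundary still appears whole in one of the two pieces. Because the number of pieces comes from a floor division, the last few words of a text can fall outside every window in this sketch.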

In [None]:
MAX_LEN = 512
TRAIN_BATCH_SIZE = 10
VALID_BATCH_SIZE = 10
EPOCHS = 5
LEARNING_RATE = 1e-05

LEN_TEXT = 150
OVERLAP = 50

DONNEE_AJOUTEES = 500

TRANSFORMER_NAME = "cmarkea/distilcamembert-base"
TRAIN_SIZE = 0.8

<a id="section-4"></a>
# <span>4. Reading the data</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
train_path = '/content/drive/MyDrive/HandOnAI_2_NLP/fake_train.csv'
added_path = '/content/drive/MyDrive/HandOnAI_2_NLP/added_train.csv'
test_path = '/content/drive/MyDrive/HandOnAI_2_NLP/fake_test.csv'

df = pd.read_csv(train_path)
df_added = pd.read_csv(added_path)
df_test = pd.read_csv(test_path)

# drop unused columns
df = df.drop(['Unnamed: 0', 'target_name'], axis = 1)
df_added.rename(columns = {'french':'data'}, inplace = True)
df_added = df_added.drop(['Unnamed: 0'], axis = 1)
df_test = df_test.drop(['Unnamed: 0', 'target_name'], axis = 1)

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df = pd.concat([df, df_added[:DONNEE_AJOUTEES]], ignore_index=True)

<a id="section-5"></a>
# <span>5. Preprocessing</span>
<hr style="border-bottom: solid;background-color:light;color:black;">
<p align="justify">Preprocessing is done in several steps. We first clean the data, then convert the cleaned data into the format the model expects.</p>



## <span>5.1 Data cleaning</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

1.   Stopword removal: stopwords are words that add no information to the text. Removing them reduces the length of the news articles.
2.   Special-character removal: a look at several articles showed that they contain a number of special characters. Since these carry no additional information, they can be removed.
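
The two cleaning steps above can be sketched on a single sentence. This is a minimal illustration only: `SAMPLE_STOPWORDS` is a tiny hand-picked subset assumed for the example (the notebook uses the full NLTK French list), and `clean_sample` is a hypothetical name.

```python
import re

# Punctuation that is replaced by a space before filtering
PUNCT_RE = re.compile(r'[/(){}\[\]\|@,;]')
# Illustrative subset only; the real stopword list is much longer
SAMPLE_STOPWORDS = {"le", "la", "de", "et", "un", "une", "des"}

def clean_sample(text):
    text = PUNCT_RE.sub(' ', text.lower())
    # Keep only the words that are not stopwords
    return ' '.join(w for w in text.split() if w not in SAMPLE_STOPWORDS)

example = clean_sample("Le président (LREM) et la ministre")
print(example)
```

Accented letters are left untouched here, which matters for French text.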

In [None]:
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')  # defined but not applied below (it would also strip accented letters)
STOPWORDS = set(stopwords.words('french'))

def clean_text(text):
    text = text.lower() # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace the listed punctuation with spaces
    #text = BAD_SYMBOLS_RE.sub('', text)
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # remove stopwords from text
    return text

In [None]:
df['data'] = df['data'].apply(clean_text)
df_test['data'] = df_test['data'].apply(clean_text)

df = df.drop(df.index[1430])
df = df.drop(df.index[1429])
df = df.drop(df.index[1180])
df = df.drop(df.index[1136])
df = df.reset_index()

## <span>5.2 Splitting the data</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

<p align="justify">As explained in the parameter selection section, we now cut the news articles into several pieces using the get_split function. The function is applied to both the training set and the test set.</p>

In [None]:
def get_split(text1):
    # Cut a text into overlapping windows of LEN_TEXT words,
    # advancing by LEN_TEXT - OVERLAP words between windows.
    words = text1.split()
    stride = LEN_TEXT - OVERLAP
    n = max(1, len(words) // stride)  # always at least one piece, even for short texts
    l_total = []
    for w in range(n):
        l_parcial = words[w*stride:w*stride + LEN_TEXT]
        l_total.append(" ".join(l_parcial))
    return l_total

In [None]:
df['text_split'] = df['data'].apply(get_split)
df['len_split'] = df['text_split'].apply(lambda x: len(x))

df_test['text_split'] = df_test['data'].apply(get_split)
df_test['len_split'] = df_test['text_split'].apply(lambda x: len(x))

In [None]:
for index, row in df.iterrows():
  if len(row['text_split']) > 1: 
    print(index)
    break

print(df['text_split'][31][0].split()[(LEN_TEXT - OVERLAP):])
print(df['text_split'][31][1].split()[:OVERLAP])

1
['walter', 'luis', 'alvarez', 'avancé', "l'", 'extinction', 'massive', 'fin', 'crétacé', 'spectaculairement', 'manifestée', 'disparition', 'dinosaures', 'provenait', 'chute', "d'un", 'astéroïde', '.', 'selon', 'cette', 'théorie', "l'impact", "d'un", 'petit', 'corps', 'céleste', "d'environ", '10', 'km', 'diamètre', 'perturbé', 'biosphère', 'multiples', 'façons', 'notamment', 'éjectant', 'tellement', 'matériaux', "l'", 'atmosphère', "l'ensoleillement", 'chuté', 'considérablement', 'provoquant', 'mort', 'végétaux', 'nombreuses', 'espèces', 'animales', 'entraînés']
['walter', 'luis', 'alvarez', 'avancé', "l'", 'extinction', 'massive', 'fin', 'crétacé', 'spectaculairement', 'manifestée', 'disparition', 'dinosaures', 'provenait', 'chute', "d'un", 'astéroïde', '.', 'selon', 'cette', 'théorie', "l'impact", "d'un", 'petit', 'corps', 'céleste', "d'environ", '10', 'km', 'diamètre', 'perturbé', 'biosphère', 'multiples', 'façons', 'notamment', 'éjectant', 'tellement', 'matériaux', "l'", 'atmosphè

## <span>5.3 Label reformatting</span>
<hr style="border-bottom: solid;background-color:light;color:black;">
<p align="justify">In this part, the label is converted to one-hot form.</p>
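
For reference, one-hot encoding turns a class index into a vector with a single 1 at that index. The sketch below assumes two classes labelled 0 and 1; `one_hot` is a hypothetical helper used only for illustration.

```python
def one_hot(label, n_classes=2):
    """Return a list with 1 at position `label` and 0 elsewhere."""
    return [1 if label == i else 0 for i in range(n_classes)]

encoded_zero = one_hot(0)
encoded_one = one_hot(1)
print(encoded_zero, encoded_one)
```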

In [None]:
def create_df(df): 
  train_l = []
  label_l = []
  for idx,row in df.iterrows():
      for l in row['text_split']:
          train_l.append(l)
          label_l.append([1 if row['label'] == i else 0 for i in range(2)])

  return pd.DataFrame({'data':train_l, 'label':label_l})

In [None]:
cleaned_df = create_df(df)
cleaned_df_test = create_df(df_test)

## <span>5.4 Creating the dataset</span>
<hr style="border-bottom: solid;background-color:light;color:black;">
<p align="justify">The CustomDataset class converts the data into the format the model expects. Since we use a Transformer, we need to:</p>

1.   Use a tokenizer
2.   Retrieve the token ids and the attention mask
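
To make the relationship between token ids and the attention mask concrete, here is a toy sketch of padding to a fixed length. It does not call the real tokenizer: the input ids are made-up numbers, `pad_and_mask` is a hypothetical helper, and the padding id of 1 is an assumption for the example. Real tokenizers also handle special tokens during truncation, which this sketch ignores.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or pad token_ids to max_len and build the matching attention mask."""
    ids = list(token_ids[:max_len])        # truncate if too long
    mask = [1] * len(ids)                  # 1 marks a real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad                # fill the rest with the padding id
    mask += [0] * n_pad                    # 0 tells the model to ignore padding
    return ids, mask

ids, mask = pad_and_mask([5, 20, 6], max_len=5)
print(ids)
print(mask)
```

The mask has the same length as the ids, which is what lets sequences of different lengths be stacked into one batch.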

In [None]:
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len, is_target = True):
        self.tokenizer = tokenizer
        self.df = dataframe
        self.text = dataframe.data
        self.max_len = max_len
        if is_target: 
          self.targets = self.df.label
        else: 
          self.targets = None

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',  # replaces the deprecated pad_to_max_length=True
            return_token_type_ids=True,
            truncation=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        if self.targets is None: 
          return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(mask, dtype=torch.long),
              'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long)
          }
        else: 
          return {
              'ids': torch.tensor(ids, dtype=torch.long),
              'mask': torch.tensor(mask, dtype=torch.long),
              'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
              'targets': torch.tensor(self.targets[index], dtype=torch.float)
          }

In [None]:
tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_NAME)

train_dataset = cleaned_df.sample(frac=TRAIN_SIZE,random_state=200)
test_dataset = cleaned_df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

print("FULL Dataset: {}".format(cleaned_df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

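# Note: training uses the full cleaned_df; the train_dataset/test_dataset split above is only printed.
# Evaluation is done on the separate test file (cleaned_df_test).
training_set = CustomDataset(cleaned_df, tokenizer, MAX_LEN)
testing_set = CustomDataset(cleaned_df_test, tokenizer, MAX_LEN)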

FULL Dataset: (4685, 2)
TRAIN Dataset: (3748, 2)
TEST Dataset: (937, 2)


## <span>5.5 Creating the dataloaders</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': False,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id="section-6"></a>
# <span>6. Model creation</span>
<hr style="border-bottom: solid;background-color:light;color:black;">
<p align="justify">To keep as much control as possible over the model, we define it as a class in which we can specify the individual layers. Once the model is defined, we define the loss function and the training function.</p>
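
One detail worth checking when writing the loss is what the model hands to it. PyTorch's `CrossEntropyLoss` applies log-softmax to its input, so it is normally fed raw scores (logits). The pure-Python sketch below, which uses made-up scores and hypothetical helper names, compares the loss computed on logits with the loss computed on values that were already passed through a softmax.

```python
import math

def softmax(values):
    m = max(values)                              # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target):
    # Negative log of the softmax probability assigned to the target class
    return -math.log(softmax(scores)[target])

logits = [4.0, -4.0]                                 # a confident, correct prediction for class 0
loss_from_logits = cross_entropy(logits, 0)          # close to 0
loss_from_probs = cross_entropy(softmax(logits), 0)  # softmax applied twice
floor_on_probs = math.log(1 + math.exp(-1))          # about 0.313, lower bound for two classes
print(loss_from_logits, loss_from_probs, floor_on_probs)
```

When the input is already a probability vector, the two values differ by at most 1, so the two-class loss cannot fall below `log(1 + e^-1)`, however confident the model is. The predictions themselves (the argmax) are unaffected.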

In [None]:
class BERTClass(torch.nn.Module):
    def __init__(self):
      super(BERTClass, self).__init__()
      self.l1 = CamembertModel.from_pretrained(TRANSFORMER_NAME)  # pretrained encoder
      self.l3 = torch.nn.Linear(768, 2) # 2 outputs = binary classification
    
    def forward(self, ids, mask, token_type_ids):
      output_1= self.l1(ids, attention_mask = mask, token_type_ids = token_type_ids)
      output = self.l3(output_1['pooler_output'])

      # Returns class probabilities. Note that torch.nn.CrossEntropyLoss (used for training)
      # applies log-softmax itself, so the loss is computed on already-normalised values.
      return F.softmax(output, dim=1)

model = BERTClass()
model.to(device)

Some weights of the model checkpoint at cmarkea/distilcamembert-base were not used when initializing CamembertModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertModel were not initialized from the model checkpoint at cmarkea/distilcamembert-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream

BERTClass(
  (l1): CamembertModel(
    (embeddings): CamembertEmbeddings(
      (word_embeddings): Embedding(32005, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): CamembertEncoder(
      (layer): ModuleList(
        (0): CamembertLayer(
          (attention): CamembertAttention(
            (self): CamembertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): CamembertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNor

In [None]:
def loss_fn(outputs, targets):
    return torch.nn.CrossEntropyLoss()(outputs, targets)

optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [None]:
def train(epoch):
    model.train()
    for step, data in enumerate(training_loader, 0):
        ids = data['ids'].to(device)
        mask = data['mask'].to(device)
        token_type_ids = data['token_type_ids'].to(device)
        targets = data['targets'].to(device)
        outputs = model(ids, mask, token_type_ids)

        loss = loss_fn(outputs, targets)
        if step % 10 == 0:  # log every 10 batches
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        optimizer.zero_grad()  # clear gradients from the previous step
        loss.backward()
        optimizer.step()

<a id="section-7"></a>
# <span>7. Model training</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
for epoch in range(EPOCHS):
    train(epoch)



Epoch: 0, Loss:  0.6840534210205078
Epoch: 0, Loss:  0.6824912428855896
Epoch: 0, Loss:  0.6125276684761047
Epoch: 0, Loss:  0.6017783880233765
Epoch: 0, Loss:  0.6104455590248108
Epoch: 0, Loss:  0.5176016688346863
Epoch: 0, Loss:  0.6925471425056458
Epoch: 0, Loss:  0.600463330745697
Epoch: 0, Loss:  0.6949084401130676
Epoch: 0, Loss:  0.4859329164028168
Epoch: 0, Loss:  0.5126347541809082
Epoch: 0, Loss:  0.45529890060424805
Epoch: 0, Loss:  0.5158266425132751
Epoch: 0, Loss:  0.42135557532310486
Epoch: 0, Loss:  0.39204248785972595
Epoch: 0, Loss:  0.516261875629425
Epoch: 0, Loss:  0.3835049271583557
Epoch: 0, Loss:  0.4217202663421631
Epoch: 0, Loss:  0.47700071334838867
Epoch: 0, Loss:  0.6411043405532837
Epoch: 0, Loss:  0.6244808435440063
Epoch: 0, Loss:  0.3397986888885498
Epoch: 0, Loss:  0.4502750039100647
Epoch: 0, Loss:  0.6843103766441345
Epoch: 0, Loss:  0.46425411105155945
Epoch: 0, Loss:  0.4534526765346527
Epoch: 0, Loss:  0.31999117136001587
Epoch: 0, Loss:  0.64429

<a id="section-8"></a>
# <span>8. Prediction</span>
<hr style="border-bottom: solid;background-color:light;color:black;">
<p align="justify">Now that the model has been trained, we can make predictions on the test data. We first predict every chunk of the test set. We then combine the chunk-level predictions to decide the class of each news article. Since real news articles tend to be longer, when an article receives as many fake as real predictions we label it as real.</p>
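
The combination step can be sketched as follows: the per-chunk probability vectors of one article are summed class by class, and the class with the largest total wins. This is a minimal pure-Python illustration with made-up probabilities; `aggregate` is a hypothetical helper, and ties go to the lower class index, which is also how `np.argmax` behaves.

```python
def aggregate(chunk_probs):
    """Sum per-chunk class probabilities and return the winning class index."""
    n_classes = len(chunk_probs[0])
    totals = [sum(p[c] for p in chunk_probs) for c in range(n_classes)]
    return totals.index(max(totals))   # first maximum, so ties resolve to the lower index

article_pred = aggregate([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
tie_pred = aggregate([[0.25, 0.75], [0.75, 0.25]])
print(article_pred, tie_pred)
```

Summing probabilities rather than counting votes lets a very confident chunk outweigh several uncertain ones.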

In [None]:
def validation():
    ce_loss = torch.nn.CrossEntropyLoss()
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    correct_predictions = 0
    total_instances = 0
    total_loss = 0

    with torch.no_grad():
      for count, data in enumerate(testing_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.float)
        outputs = model(ids, mask, token_type_ids)
        fin_targets.extend(targets.cpu().detach().numpy().tolist())

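        # accuracy 
        classifications = torch.argmax(outputs, dim=1)
        labels = torch.argmax(targets, dim=1)  # one-hot targets back to class indices
        correct_predictions += (classifications == labels).sum().item()
        total_instances += len(outputs)

        # loss (ce_loss returns the mean over the batch)
        total_loss += ce_loss(outputs, labels).item()

        fin_outputs.extend(outputs.cpu().detach().numpy().tolist())

    accuracy = correct_predictions/total_instances
    # total_loss sums per-batch means, so this is roughly the mean batch loss divided by the batch size
    loss = total_loss/total_instances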

    print(f"Accuracy Score = {accuracy}")
    print(f"Loss Score = {loss}")

    return fin_outputs, fin_targets

In [None]:
outputs, targets = validation()

Accuracy Score = 0.96
Loss Score = 0.03531963378190994


In [None]:
cleaned_df_test['pred'] = outputs
cleaned_df_test.head()

| | data | label | pred |
|---|---|---|---|
| 0 | président groupe lrem a pris toutes pincettes.... | [1, 0] | [0.9999217987060547, 7.823276246199384e-05] |
| 1 | négociateur». patron groupe majoritaire connaî... | [1, 0] | [0.9999257326126099, 7.42080228519626e-05] |
| 2 | titre personnel» souhaite gouvernement «mette ... | [1, 0] | [0.9999220371246338, 7.797416765242815e-05] |
| 3 | tir. «c’est micro-aile gauche fait mousse» pré... | [1, 0] | [0.9998860359191895, 0.00011391906446078792] |
| 4 | villes françaises qualité l'air meilleure moin... | [1, 0] | [0.9999215602874756, 7.837941666366532e-05] |


In [None]:
pos = 0
df_test['pred'] = [list() for x in range(len(df_test.index))]
for idx,row in df_test.iterrows():
  for i in range(row['len_split']): 
    row['pred'].append(cleaned_df_test.loc[pos]['pred'])
    pos += 1

In [None]:
df_test['prediction'] = df_test['pred'].apply(lambda x: [1, 0] if np.argmax(np.sum(x, axis = 0)) == 0 else [0, 1])
df_test['label_pred'] = df_test['pred'].apply(lambda x: np.argmax(np.sum(x, axis = 0)))

In [None]:
accuracy = metrics.accuracy_score(df_test['label'], df_test['label_pred'])  # (y_true, y_pred)
print(f"Accuracy Score = {accuracy}")

Accuracy Score = 0.948559670781893


<a id="section-9"></a>
# <span>9. Saving the model</span>
<hr style="border-bottom: solid;background-color:light;color:black;">

In [None]:
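# 'model' holds a freshly constructed BERTClass (architecture only);
# the trained weights are stored under 'state_dict'.
checkpoint = {'model': BERTClass(),
              'state_dict': model.state_dict(),
              'optimizer' : optimizer.state_dict()}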

torch.save(checkpoint, '/content/drive/MyDrive/HandOnAI_2_NLP/transformer_model.pth')

Some weights of the model checkpoint at cmarkea/distilcamembert-base were not used when initializing CamembertModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.decoder.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing CamembertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing CamembertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of CamembertModel were not initialized from the model checkpoint at cmarkea/distilcamembert-base and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream