# CamemBERT model

> CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture pretrained on the French subcorpus of the newly available multilingual corpus OSCAR.
> https://camembert-model.fr

### Importing and functions
This cell imports the required libraries and defines the helper functions used with the CamemBERT model.

In [None]:
import pandas as pd
from transformers import AutoTokenizer
from datasets import load_dataset
import seaborn as sns
import torch  # GPU optim. + gradient opt.
from torch.utils.data import DataLoader
import functools
from LightningModel import LightningModel
import pytorch_lightning as pl
from sklearn.metrics import confusion_matrix
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report

# Collate function: takes a list of samples and tokenizes their text into one batch.
def tokenize_batch(samples, tokenizer):
    text = [sample["sentence"] for sample in samples]
    labels = torch.tensor([sample["label"] for sample in samples])
    str_labels = [sample["difficulty"] for sample in samples]
    # The tokenizer handles:
    # - Tokenization (splitting the text into subword tokens)
    # - Padding (adding padding tokens so that every example in the batch has the same length)
    # - Truncation (cutting samples that exceed the model's maximum length; enabled with truncation=True)
    # - Special tokens (in CamemBERT, each sentence starts with <s> and ends with </s>)
    # - Attention mask (a binary vector that tells the model which tokens to attend to; padding tokens are masked out)
    tokens = tokenizer(text, padding="longest", truncation=True, return_tensors="pt")

    return {"input_ids": tokens.input_ids, "attention_mask": tokens.attention_mask, "labels": labels, "str_labels": str_labels, "sentences": text}

# Once the model is trained, this function plots the confusion matrix:
# cell colors show row-normalized rates, annotations show raw counts.
def plot_confusion_matrix(labels, preds, label_names):
    confusion_norm = confusion_matrix(labels, preds.tolist(), labels=list(range(len(label_names))), normalize="true")
    confusion = confusion_matrix(labels, preds.tolist(), labels=list(range(len(label_names))))

    plt.figure(figsize=(16, 14))
    sns.heatmap(
        confusion_norm,
        annot=confusion,
        cbar=False,
        fmt="d",
        xticklabels=label_names,
        yticklabels=label_names,
        cmap="viridis"
    )

### Tokenize and import data
**STEP 1 - Create the tokenizer for the data**
The AutoTokenizer splits the text into subword tokens. It also preprocesses the data into the input format that the CamemBERT model needs.

**STEP 2 - Load the data**
We need a total of 3 sets:
1. a training set, to train the model.
2. a validation set, to monitor performance across epochs and select the best model.
3. a test set, to evaluate the model once all epochs are completed.
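Two of the preprocessing steps the tokenizer performs, padding and the attention mask, can be illustrated with a toy sketch in plain Python. This is not how the Hugging Face tokenizer is implemented: the helper name `pad_batch`, the token IDs, and the padding ID of 1 are all assumptions made up for the example.

```python
def pad_batch(sequences, pad_id=1):
    """Pad variable-length ID lists to the longest one and build attention masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        # Real tokens are kept as-is, then padding IDs fill the remaining positions.
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token IDs for two sentences of different lengths.
ids, mask = pad_batch([[5, 121, 33, 6], [5, 48, 6]])
print(ids)
print(mask)
```

The shorter sentence receives one padding ID, and its attention mask ends with a 0 at that position.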

In [None]:
# Load the pretrained CamemBERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('camembert-base')

# Load the dataset, stored on the Hugging Face Hub as a dictionary containing all 3 splits.
dataset = load_dataset('Makxxx/french_CEFR')

# Convert each split to a pandas DataFrame for inspection and plotting
pd_dataset = {split_name: split_data.to_pandas() for split_name, split_data in dataset.items()}

# Access each split by name rather than relying on the order of the dictionary
train_dataset, test_dataset, val_dataset = dataset["train"], dataset["test"], dataset["validation"]

# Number of classes in the dataset
num_labels = len(pd_dataset["train"]["label"].unique())

### Visualize data

The plots and printed statistics below help to better understand the data.

In [None]:
# This plot shows the labels and their frequencies

sns.set_theme()

nb_labels = len(pd_dataset["train"]["label"].unique())
print(f"The dataset contains {nb_labels} labels.")

ax = pd_dataset["train"]["label"].hist(density=True, bins=nb_labels)
ax.set_xlabel("Label ID")
ax.set_ylabel("Frequency")
ax.set_title("Label distribution in the dataset (train split)")
ax.figure.show()

In [None]:
# This graph shows the length of the sentences (number of characters).

pd_dataset["train"]["len_sen"] = pd_dataset["train"]["sentence"].apply(len)
ax = pd_dataset["train"]["len_sen"].hist(density=True, bins=50)
ax.set_xlabel("Length")
ax.set_ylabel("Frequency")
ax.set_title("Number of characters per sentence")
ax.figure.show()

In [None]:
# Some additional information

print("Max length of a sentence: ", pd_dataset["train"]["len_sen"].max())
print("Number of rows in the training set: ", train_dataset.shape[0])
print("Number of rows in the testing set: ", test_dataset.shape[0])
print("Number of rows in the validation set: ", val_dataset.shape[0])

# TODO: document how we added more data to improve the model.

### Load the data in batches and define parameters
The DataLoader takes 4 parameters here:
- dataset: the dataset split to load.
- batch_size: batches are packs of samples fed to the model together during training, so the whole dataset never has to fit on the processor and GPU at once. A larger batch size generally makes training faster, but the hardware must have enough memory to hold each batch.
- shuffle: reorders the samples at every epoch, so that the model does not learn from the order of the data and generalizes better to other datasets.
- collate_fn: the function that assembles individual samples into a batch.
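How these four parameters interact can be sketched with a simplified batching loop in plain Python. This is only an illustration of the roles they play, not PyTorch's DataLoader implementation; the names `iterate_batches`, `collate`, and `toy_dataset`, as well as the `seed` parameter, are hypothetical.

```python
import random


def iterate_batches(dataset, batch_size, shuffle, collate_fn, seed=0):
    """Yield collated batches from a list-like dataset."""
    indices = list(range(len(dataset)))
    if shuffle:
        # Reorder the samples so that batches differ from one pass to the next.
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        samples = [dataset[i] for i in indices[start:start + batch_size]]
        # collate_fn decides how individual samples are merged into one batch.
        yield collate_fn(samples)


def collate(samples):
    """Turn a list of per-sample dicts into one dict of lists."""
    return {
        "sentences": [s["sentence"] for s in samples],
        "labels": [s["label"] for s in samples],
    }


toy_dataset = [{"sentence": f"phrase {i}", "label": i % 6} for i in range(5)]
batches = list(iterate_batches(toy_dataset, batch_size=2, shuffle=False, collate_fn=collate))
print(len(batches))
print(batches[0])
```

With 5 samples and a batch size of 2, the last batch holds a single sample. In the notebook, `tokenize_batch` plays the role of `collate` and additionally tokenizes the sentences.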

In [None]:
# Wrap the training and validation splits in DataLoaders. We set the batch size and shuffling here.
train_dataloader = DataLoader(
    dataset["train"],
    batch_size=16,
    shuffle=True,
    collate_fn=functools.partial(tokenize_batch, tokenizer=tokenizer) #uses the function and tokenizer declared above.
)
val_dataloader = DataLoader(
    dataset["validation"],
    batch_size=16,
    shuffle=False,
    collate_fn=functools.partial(tokenize_batch, tokenizer=tokenizer)
)

### Training the model
We create a LightningModel instance, imported from the LightningModel module. This is the model that will be trained on the dataset. It takes the following arguments:
- Model name (camembert-base).
- Number of labels (num_labels).
- lr: the learning rate, i.e. the step size of each gradient update made by the optimizer.
- weight_decay: a regularization term that penalizes large weights, which reduces overfitting to the training set and improves generalization.

Trainer parameters:
- max_epochs: the maximum number of passes over the training set.
- gpus: the number of GPUs used to run the model.
- callbacks: hooks called during training. Here, EarlyStopping stops training when the validation accuracy has not improved for 4 consecutive validation checks, and ModelCheckpoint saves the model with the best validation accuracy.
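The early-stopping behaviour configured through the callbacks can be sketched in plain Python. This is a simplified version of the idea, assuming one validation score per epoch and "higher is better"; it does not model every option of pytorch_lightning's EarlyStopping (for instance `min_delta`). The function name `early_stopping_epoch` and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(scores, patience):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation scores."""
    best = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores):
        if score > best:
            # A strict improvement resets the patience counter.
            best, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Patience never ran out: training goes through every epoch.
    return len(scores) - 1, best_epoch


# Hypothetical validation accuracies, one per epoch.
val_acc = [0.40, 0.48, 0.52, 0.51, 0.52, 0.50, 0.49, 0.47]
stop_epoch, best_epoch = early_stopping_epoch(val_acc, patience=4)
print(stop_epoch, best_epoch)
```

Training stops a few epochs after the best score, which is why the notebook reloads the checkpoint of the best model rather than keeping the final weights.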


In [None]:
lightning_model = LightningModel("camembert-base", num_labels, lr=3e-5, weight_decay=2)

model_checkpoint = pl.callbacks.ModelCheckpoint(monitor="valid/acc", mode="max")

camembert_trainer = pl.Trainer(
    max_epochs=25,  # maximum number of passes over the training set
    gpus=1,  # in pytorch_lightning >= 2.0, use accelerator="gpu", devices=1 instead
    callbacks=[
        pl.callbacks.EarlyStopping(monitor="valid/acc", patience=4, mode="max"),
        model_checkpoint,
    ]
)

# Fit the model
camembert_trainer.fit(lightning_model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader)

# Reload the best model found during training (usually reached between epochs 5 and 10).
lightning_model = LightningModel.load_from_checkpoint(checkpoint_path=model_checkpoint.best_model_path)


### Analysing training results
So far, the model has only been evaluated on the validation set during training. In this section, we look at its predictions on the validation set in more detail:
- The confusion matrix.
- Examples of wrong classifications.

First, we encode the classification classes as a mapping from label ID to class name. This is useful later, when we look at which sentences are wrongly classified.
We also add the class names to a list so that they are easy to reuse.

Second, we compute the model's predictions on the validation set.

In [None]:
ID_TO_LABEL = dict(zip(range(6), ('A1', 'A2', 'B1', 'B2', 'C1', 'C2',)))
label_names = list(ID_TO_LABEL.values())

camembert_preds = camembert_trainer.predict(lightning_model, dataloaders=val_dataloader)
camembert_preds = torch.cat(camembert_preds, -1)

Creating the confusion matrix.
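As a reminder of what the plot shows, here is a small sketch of how raw counts and row-normalized rates are derived from labels and predictions. It is written in plain Python for illustration, on made-up labels; the notebook itself relies on scikit-learn's `confusion_matrix`, and the helper names `confusion_counts` and `normalize_rows` are hypothetical.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total, so every row sums to 1 (or stays 0 if empty)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Made-up labels and predictions for 3 classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
counts = confusion_counts(y_true, y_pred, n_classes=3)
rates = normalize_rows(counts)
print(counts)
print(rates)
```

Row normalization makes classes of different sizes comparable: the diagonal of `rates` is the fraction of each true class that was predicted correctly.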

In [None]:
plot_confusion_matrix(dataset["validation"]["label"], camembert_preds, label_names)

Looking at the precision, recall, and F1-score for each class and for the model overall.
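For reference, the three per-class metrics can be computed by hand as in the sketch below. It uses plain Python and made-up labels, and it only illustrates the definitions; the helper name `per_class_metrics` is hypothetical and the notebook uses scikit-learn's `classification_report` instead.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # correctly predicted as label
    fp = sum(1 for t, p in pairs if t != label and p == label)  # wrongly predicted as label
    fn = sum(1 for t, p in pairs if t == label and p != label)  # label missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels and predictions for 3 classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
for label in range(3):
    print(label, per_class_metrics(y_true, y_pred, label))
```

Precision answers "when the model predicts this class, how often is it right?", recall answers "how many sentences of this class does the model find?", and the F1-score is their harmonic mean.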

In [None]:
# classification_report (imported from sklearn.metrics above) prints precision, recall and F1-score per class.
print(classification_report(dataset["validation"]["label"], camembert_preds.tolist(), target_names=label_names))

Printing some examples of wrongly classified sentences.

In [None]:
# Boolean mask of the validation samples whose prediction differs from the true label
val_df = dataset["validation"].to_pandas()
pred_ids = camembert_preds.numpy()
wrong_preds = pred_ids != val_df["label"].to_numpy()

# Keep the misclassified sentences with their true and predicted classes
wrong = val_df.loc[wrong_preds, ["sentence", "difficulty"]].copy()
wrong["preds"] = [ID_TO_LABEL[i] for i in pred_ids[wrong_preds]]
wrong.columns = ["sentence", "true", "predicted"]
wrong

### Generating final dataframe for submission.

Using the same logic as above, we load the test dataset and predict its labels with the trained model.

In [None]:
# Build the dataloader for the test set
test_dataloader = DataLoader(
    dataset["test"],
    batch_size=16,
    shuffle=False,
    collate_fn=functools.partial(tokenize_batch, tokenizer=tokenizer)
)

# Predict with the trained model and concatenate the per-batch predictions
preds = camembert_trainer.predict(lightning_model, dataloaders=test_dataloader)
preds = torch.cat(preds, -1)

# Format the data for submission
test_df = dataset["test"].to_pandas()
test_df["label"] = preds.numpy()
test_df["difficulty"] = test_df["label"].apply(lambda x: label_names[x])
test_df.index.name = 'id'
test_df = test_df.drop(columns=["sentence", "label"])

# Generate the CSV file
test_df.to_csv('preds.csv')