# Bidirectional Encoder Representations from Transformers (BERT)

## 1. Introduction

The BERT model (Bidirectional Encoder Representations from Transformers) was designed by Google AI in 2018. It is pre-trained without labelled data (unsupervised, or more precisely self-supervised), and its goal is to provide a vector representation of language. BERT is a neural network architecture based on bidirectional [transformers](python-nlp-transformers.ipynb), which use an [attention mechanism](python-nlp-mecanisme-attention.ipynb) to build contextualized embeddings.
<img src="images/bert/bert_model.png" >

The original paper presents two models:
* Base: the base BERT model
  * 12 layers
  * hidden size of 768
  * 12 attention heads
  * 110M parameters
* Large:
  * 24 layers
  * hidden size of 1024
  * 16 attention heads
  * 340M parameters
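
The parameter counts above can be roughly reproduced from the layer sizes. The sketch below assumes the configuration of the public uncased checkpoints (a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, a feed-forward size of 4 times the hidden size) and counts the pooler layer; `count_bert_parameters` is an illustrative helper, not a library function.

```python
def count_bert_parameters(num_layers, hidden, vocab_size=30522,
                          max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (assumed layout)."""
    ffn = 4 * hidden  # assumed feed-forward (intermediate) size
    # Token, position and segment embeddings, plus one LayerNorm (weight + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per transformer layer.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms
    # Dense pooler applied to the [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base_count = count_bert_parameters(num_layers=12, hidden=768)
large_count = count_bert_parameters(num_layers=24, hidden=1024)
print(f"Base : {base_count / 1e6:.1f}M parameters")
print(f"Large: {large_count / 1e6:.1f}M parameters")
```

Under these assumptions the totals land close to the 110M and 340M figures quoted in the paper. The number of attention heads does not appear in the formula because the heads only split the hidden size between them.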

<img src="images/bert/bert_layers.png" >

## 2. Training BERT

BERT was pre-trained without labelled data on two tasks:

* **Masked Language Model (MLM)**: BERT uses MLM to learn a bidirectional representation of a sequence. The model randomly masks 15% of the tokens in the input sequence, then uses the remaining tokens to predict the masked ones.

* **Next Sentence Prediction (NSP)**: BERT relies on NSP to learn the relationship between sequences. This lets it tell whether two sequences follow each other logically or whether their pairing is simply random.
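
A minimal sketch of how training examples for these two tasks could be built from raw text. It is a simplification: the original paper additionally replaces some of the selected tokens with a random token or leaves them unchanged, which is omitted here. The helper names (`mask_tokens`, `make_nsp_pair`) and the toy sentences are illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask about `mask_prob` of the tokens; return (masked_tokens, targets)."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}  # position -> original token the model has to predict
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets


def make_nsp_pair(sentences, index, seed=0):
    """Return (sentence_a, sentence_b, is_next) for the sentence at `index`.

    Assumes `sentences` holds at least three distinct sentences.
    """
    rng = random.Random(seed)
    sentence_a = sentences[index]
    # Half of the time, keep the sentence that really follows.
    if rng.random() < 0.5 and index + 1 < len(sentences):
        return sentence_a, sentences[index + 1], 1
    # Otherwise pick a random sentence that is not the true continuation.
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), 0


toy_tokens = "the cat sat on the mat and looked at the dog".split()
masked_tokens, mlm_targets = mask_tokens(toy_tokens)
print(masked_tokens)
print(mlm_targets)

toy_sentences = ["I went to the cinema.", "The film was great.", "Paris is in France."]
print(make_nsp_pair(toy_sentences, index=0))
```

The MLM targets are the original tokens at the masked positions, and the NSP label is 1 when the second sentence is the true continuation and 0 otherwise.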


## 3. How to use BERT

Pre-trained BERT models can be reused through transfer learning:

* Fine-tuning: add a prediction layer on top of the BERT model. The model is then trained further to perform a new supervised learning task.
* Feature-based: extract features. The model provides a vector representation of the inputs.
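
The practical difference between the two approaches is which weights get updated. The sketch below shows this in PyTorch with a tiny stand-in encoder; `encoder` and `classifier` are hypothetical toy modules, not a real BERT, and the sizes are arbitrary.

```python
import torch
from torch import nn

# Hypothetical stand-in for a pre-trained encoder (a real BERT is far larger).
encoder = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16), nn.Tanh())
# New prediction layer added on top for a 2-class task.
classifier = nn.Linear(16, 2)


def trainable_parameters(*modules):
    """Count the parameters that an optimizer would actually update."""
    return sum(p.numel() for m in modules for p in m.parameters() if p.requires_grad)


# Fine-tuning: the encoder and the new head are both updated.
fine_tuning_count = trainable_parameters(encoder, classifier)

# Feature-based: freeze the encoder and train only the head on its outputs.
for param in encoder.parameters():
    param.requires_grad = False
feature_based_count = trainable_parameters(encoder, classifier)

token_ids = torch.tensor([[1, 5, 7, 0]])
with torch.no_grad():
    # Use the vector at the first position as a fixed sentence feature.
    features = encoder(token_ids)[:, 0]
logits = classifier(features)

print("Trainable parameters when fine-tuning:", fine_tuning_count)
print("Trainable parameters when feature-based:", feature_based_count)
print("Feature shape:", tuple(features.shape), "logits shape:", tuple(logits.shape))
```

Fine-tuning usually gives better task accuracy at a higher training cost, while the feature-based route lets the encoder outputs be computed once and reused.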

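BERT can be used for a range of NLP tasks:

* Text generation
* Classification
* Question answering
* Named entity recognition
* Translation

Classification, question answering and named entity recognition map directly onto BERT's encoder outputs. Text generation and translation are less direct, since BERT has no decoder of its own.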


## 4. Some BERT variants

Several models have been built around BERT:

* RoBERTa: an optimized version of BERT proposed by Facebook
* FlauBERT: a BERT model pre-trained on French text, developed by the universities of Grenoble and Paris Diderot and the CNRS
* CamemBERT: a RoBERTa model pre-trained on French text, proposed by Inria and Facebook

**References:**

[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805.pdf)   
[Blog Post by Google AI](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)     
[Colab Notebook: Predicting Movie Review Sentiment with BERT on TF Hub](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=xiYrZKaHwV81)   

## Application: Sentiment analysis

This work is inspired by Aniruddha Choudhury's post on [Medium](https://medium.com/@aniruddha.choudhury94/part-2-bert-fine-tuning-tutorial-with-pytorch-for-text-classification-on-the-corpus-of-linguistic-18057ce330e1) and by Google Research's [Git repository](https://github.com/google-research/bert).


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import tensorflow as tf
import warnings

warnings.filterwarnings('ignore')

In [None]:
from tensorflow import keras
import os
import re

# Load all files from a directory in a DataFrame.
def load_directory_data(directory):
    data = {}
    data["sentence"] = []
    data["sentiment"] = []
    for file_path in os.listdir(directory):
        with tf.io.gfile.GFile(os.path.join(directory, file_path), "r") as f:
            data["sentence"].append(f.read())
            # Raw string so that \d and \. are regex escapes, not string escapes.
            data["sentiment"].append(re.match(r"\d+_(\d+)\.txt", file_path).group(1))
    return pd.DataFrame.from_dict(data)

# Merge positive and negative examples, add a polarity column and shuffle.
def load_dataset(directory):
    pos_df = load_directory_data(os.path.join(directory, "pos"))
    neg_df = load_directory_data(os.path.join(directory, "neg"))
    pos_df["polarity"] = 1
    neg_df["polarity"] = 0
    return pd.concat([pos_df, neg_df]).sample(frac=1).reset_index(drop=True)

# Download and process the dataset files.
def download_and_load_datasets(force_download=False):
    dataset = tf.keras.utils.get_file(
      fname="aclImdb.tar.gz", 
      origin="http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz", 
      extract=True)

    train_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                       "aclImdb", "train"))
    test_df = load_dataset(os.path.join(os.path.dirname(dataset), 
                                      "aclImdb", "test"))

    return train_df, test_df

In [None]:
train, test = download_and_load_datasets()

In [None]:
train.head()

In [None]:
train = train.sample(10000)
test = test.sample(10000)

**Importing the BERT tokenizer**

In [None]:
from transformers import BertTokenizer, BertModel
import torch

device = torch.device("cpu")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
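
BERT's tokenizer splits words into WordPiece sub-words using a greedy longest-match-first search over its vocabulary. The sketch below illustrates the idea on a tiny, made-up vocabulary (`toy_vocab`) with illustrative helpers (`wordpiece_tokenize`, `encode_toy`). It is not the library's implementation, and the real `bert-base-uncased` vocabulary may split these words differently.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into sub-words."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No sub-word matches: the whole word becomes unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


def encode_toy(sentence, vocab):
    """Lowercase, split on spaces, apply WordPiece, add [CLS] and [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece_tokenize(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Made-up vocabulary, for illustration only.
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "the", "movie", "was", "un", "##watch", "##able"}
toy_tokens = encode_toy("The movie was unwatchable", toy_vocab)
print(toy_tokens)
```

Splitting rare words into known pieces keeps the vocabulary small while avoiding most out-of-vocabulary tokens.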

**Preprocessing**

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from sklearn.model_selection import train_test_split
from transformers import get_linear_schedule_with_warmup
import numpy as np
import time
import datetime

def encode_sentences(sentences, max_length=512):
    """
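    Encode sentences into padded token IDs and attention masks.

    For each sentence, `tokenizer.encode` will:
        (1) Tokenize the sentence.
        (2) Prepend the `[CLS]` token to the start.
        (3) Append the `[SEP]` token to the end.
        (4) Map tokens to their IDs.
    The ID lists are then padded or truncated to `max_length`.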
    """
    input_ids = []
    for sent in sentences:
        encoded_sent = tokenizer.encode(
                            sent,                      # Sentence to encode.
                            add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                            max_length = max_length,          # Truncate all sentences.
                            #return_tensors = 'pt',     # Return pytorch tensors.
                       )
        input_ids.append(encoded_sent)
    input_ids = pad_sequences(input_ids, maxlen=max_length, dtype="long", 
                              value=0, truncating="post", padding="post")
    
    # Create attention masks
    attention_masks = []
    # For each sentence...
    for sent in input_ids:

        # Create the attention mask.
        #   - If a token ID is 0, then it's padding, set the mask to 0.
        #   - If a token ID is > 0, then it's a real token, set the mask to 1.
        att_mask = [int(token_id > 0) for token_id in sent]

        # Store the attention mask for this sentence.
        attention_masks.append(att_mask)
    
    return input_ids, attention_masks


def prepare_data(inputs, labels, masks=None, training=True, batch_size=16):
    # Convert all inputs and labels into torch tensors, the required datatype for our model.

    inputs = torch.tensor(inputs)
    labels = torch.tensor(labels)
    masks = torch.tensor(masks)
    data = TensorDataset(inputs,
                         masks, 
                         labels)
    if training:
        # Shuffle the examples at every epoch for training.
        sampler = RandomSampler(data)
    else:
        # Keep the original order for validation and prediction.
        sampler = SequentialSampler(data)

        
    dataloader = DataLoader(data,
                            sampler=sampler,
                            batch_size=batch_size)
    return dataloader

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)


def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
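
To make the padding and masking steps concrete, here is a small NumPy-only sketch of what happens after tokenization. `pad_and_mask` is an illustrative helper that mirrors the post-padding and post-truncation used above, and the token IDs are illustrative values rather than the output of a real tokenizer call.

```python
import numpy as np


def pad_and_mask(sequences, max_length, pad_id=0):
    """Pad or truncate ID lists at the end and build the attention masks."""
    input_ids = np.full((len(sequences), max_length), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:max_length]  # truncate at the end ("post")
        input_ids[row, :len(kept)] = kept  # pad at the end ("post")
    # 1 for real tokens, 0 for padding.
    attention_masks = (input_ids != pad_id).astype(np.int64)
    return input_ids, attention_masks


# Illustrative ID lists: one shorter and one longer than max_length.
toy_sequences = [[101, 2023, 102], [101, 2023, 3185, 2001, 102]]
toy_ids, toy_masks = pad_and_mask(toy_sequences, max_length=4)
print(toy_ids)
print(toy_masks)
```

Note that truncating at the end drops the final `[SEP]` ID of sequences that are too long, which is also what post-truncation in `pad_sequences` would do to an over-long input.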

In [None]:
sentences = train.sentence.values
labels = train.polarity.values
random_state = 12345
batch_size = 16
max_length = 512

In [None]:
# Use train_test_split to split our data into train and validation sets for
# training
# Use 90% for training and 10% for validation.

input_ids, attention_masks = encode_sentences(sentences, max_length=max_length)

(train_inputs, validation_inputs,
train_labels, validation_labels) = train_test_split(input_ids,
                                                    labels, 
                                                    random_state=random_state,
                                                    test_size=0.1)
# Do the same for the masks.
train_masks, validation_masks, _, _ = train_test_split(attention_masks,
                                                       labels,
                                                       random_state=random_state,
                                                       test_size=0.1)

train_dataloader = prepare_data(inputs=train_inputs, 
                                labels=train_labels, 
                                masks=train_masks, 
                                training=True,
                                batch_size=batch_size)

validation_dataloader = prepare_data(inputs=validation_inputs,
                                     labels=validation_labels,
                                     masks=validation_masks,
                                     training=False,
                                     batch_size=batch_size)
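
As a possible simplification, `train_test_split` accepts any number of arrays and splits them all with the same shuffled indices, so IDs, masks and labels can be split in a single call. The sketch below checks on toy arrays (the `toy_*` names and values are illustrative) that the pieces stay aligned.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of toy_ids is [2i, 2i + 1], its label is i % 2.
toy_ids = np.arange(40).reshape(20, 2)
toy_masks = np.ones_like(toy_ids)
toy_labels = np.arange(20) % 2

(tr_ids, va_ids,
 tr_masks, va_masks,
 tr_labels, va_labels) = train_test_split(toy_ids, toy_masks, toy_labels,
                                          test_size=0.1, random_state=12345)

# Recover each row's original index and check the labels still match.
va_index = va_ids[:, 0] // 2
tr_index = tr_ids[:, 0] // 2
print("validation rows:", va_index, "labels:", va_labels)
print("labels aligned:", bool(np.all(va_labels == va_index % 2)))
```

Splitting everything in one call avoids relying on two separate calls sharing the same `random_state`.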

**Importing the BERT model**

In [None]:
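from transformers import BertForSequenceClassification
# The AdamW class shipped with transformers is deprecated; use the PyTorch one.
# Note: torch.optim.AdamW defaults to weight_decay=0.01 (the transformers one used 0.0).
from torch.optim import AdamW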

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels--2 for binary classification.
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False # Whether the model returns all hidden-states.
)

In [None]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())
print('The BERT model has {:} different named parameters.\n'.format(len(params)))
print('==== Embedding Layer ====\n')
for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== First Transformer ====\n')
for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))
print('\n==== Output Layer ====\n')
for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

In [None]:
# Number of training epochs (authors recommend between 2 and 4)
epochs = 4
optimizer = AdamW(model.parameters(), lr = 5e-5,  eps = 1e-7)
total_steps = len(train_dataloader) * epochs
# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
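
The schedule multiplies the base learning rate by a factor that rises linearly during warmup and then decays linearly to zero. A plain-Python sketch of that factor follows; `linear_schedule_factor` is an illustrative re-implementation, and the step count assumes 9,000 training reviews in batches of 16 over 4 epochs.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for a linear warmup followed by a linear decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


toy_base_lr = 5e-5
toy_total_steps = 563 * 4  # assumed: 563 batches per epoch, 4 epochs

for toy_step in (0, toy_total_steps // 2, toy_total_steps):
    factor = linear_schedule_factor(toy_step, 0, toy_total_steps)
    print(f"step {toy_step:>4}: lr = {toy_base_lr * factor:.2e}")
```

With `num_warmup_steps = 0` the learning rate starts at its full value and simply decays to zero by the last step.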

In [None]:
%%time
import random
# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

seed_val = 12345
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)
# Store the average loss after each epoch so we can plot them.
loss_values = []
# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')
    # Measure how long the training epoch takes.
    t0 = time.time()
    # Reset the total loss for this epoch.
    total_loss = 0
    model.train()
    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):
        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)            
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        b_input_ids = batch[0].to(device).long()
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        model.zero_grad()        
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
    avg_train_loss = total_loss / len(train_dataloader)            
    loss_values.append(avg_train_loss)
    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")
    t0 = time.time()
    model.eval()
    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Move the batch to the selected device (the CPU here)
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        b_input_ids = b_input_ids.long()
        with torch.no_grad():        
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask)
        logits = outputs[0]
        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy
        # Track the number of batches
        nb_eval_steps += 1
    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(format_time(time.time() - t0)))
print("")
print("Training complete!")
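
One iteration of the training loop above boils down to: reset gradients, forward pass, backward pass, clip the gradient norm, then step the optimizer and the scheduler. The sketch below runs these steps once on a toy linear classifier standing in for BERT; all `toy_*` names and sizes are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

toy_model = nn.Linear(4, 2)  # stand-in for the BERT classifier
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=5e-5)
# Linear decay to zero over 10 steps, without warmup.
toy_scheduler = torch.optim.lr_scheduler.LambdaLR(
    toy_optimizer, lambda step: max(0.0, 1 - step / 10))

# Large inputs on purpose, so that the gradients are likely to need clipping.
toy_x = torch.randn(8, 4) * 100
toy_y = torch.randint(0, 2, (8,))

toy_model.train()
toy_model.zero_grad()                      # clear gradients from the previous step
toy_loss = nn.functional.cross_entropy(toy_model(toy_x), toy_y)
toy_loss.backward()                        # compute gradients
# clip_grad_norm_ returns the total norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(toy_model.parameters(), 1.0)
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in toy_model.parameters()))
toy_optimizer.step()                       # update the weights
toy_scheduler.step()                       # update the learning rate

print("gradient norm before clipping:", float(norm_before))
print("gradient norm after clipping :", float(norm_after))
print("learning rate after one step :", toy_scheduler.get_last_lr()[0])
```

Clipping caps the overall gradient norm at 1.0, which limits the size of any single update when a batch produces unusually large gradients.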

In [None]:
import plotly.express as px
f = pd.DataFrame(loss_values)
f.columns=['Loss']
fig = px.line(f, x=f.index, y=f.Loss)
fig.update_layout(title='Training loss of the Model',
                   xaxis_title='Epoch',
                   yaxis_title='Loss')
fig.show()

In [None]:
sentences = test.sentence.values
labels = test.polarity.values

input_ids, attention_masks = encode_sentences(sentences, max_length=max_length)

prediction_dataloader = prepare_data(inputs=input_ids, 
                                     labels=labels, 
                                     masks=attention_masks, 
                                     training=False,
                                     batch_size=batch_size)

In [None]:
%%time

#Prediction on test set
print('Predicting labels for {:,} test sentences...'.format(len(sentences)))
# Put model in evaluation mode
model.eval()
# Tracking variables 
predictions , true_labels = [], []
# Predict 
for batch in prediction_dataloader:
    # Move the batch to the selected device (the CPU here)
    batch = tuple(t.to(device) for t in batch)

    b_input_ids, b_input_mask, b_labels = batch
    b_input_ids = b_input_ids.long()

    with torch.no_grad():
        # Forward pass, calculate logit predictions
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask)
    logits = outputs[0]
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

print('Finished.')

## Model evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, matthews_corrcoef
import matplotlib.pyplot as plt

import seaborn as sns
from matplotlib import rcParams
sns.set()
rcParams['figure.figsize'] = 7, 7

In [None]:
flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]
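
The same flattening can be expressed with `np.concatenate`, which stacks the per-batch arrays along the first axis. A small sketch on hand-written logits (the `toy_*` values are illustrative, not model output):

```python
import numpy as np

# Two "batches" of logits (2 rows, then 1 row) and their labels.
toy_predictions = [np.array([[2.0, -1.0], [0.1, 0.3]]), np.array([[-0.5, 1.5]])]
toy_true_labels = [np.array([0, 1]), np.array([0])]

toy_flat_logits = np.concatenate(toy_predictions, axis=0)  # one row per example
toy_flat_preds = toy_flat_logits.argmax(axis=1)            # index of the largest logit
toy_flat_labels = np.concatenate(toy_true_labels)

print(toy_flat_logits.shape)
print(toy_flat_preds, toy_flat_labels)
```

Taking the `argmax` of the logits is enough to get the predicted class; applying a softmax first would not change which index is largest.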

In [None]:
acc = accuracy_score(flat_true_labels, flat_predictions)
print('Accuracy : %.3f' % acc)


f1_s = f1_score(flat_true_labels, flat_predictions)
print('F1-Score : %.3f' % f1_s)

mcc = matthews_corrcoef(flat_true_labels, flat_predictions)
print('MCC: %.3f' % mcc)
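
For reference, the three scores can be written directly in terms of the confusion counts of a binary classifier. The sketch below uses made-up counts and an illustrative helper `binary_metrics`; it follows the standard definitions of accuracy, F1 and the Matthews correlation coefficient.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, F1 and MCC from true/false positive/negative counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC ranges from -1 to 1; report 0 when the denominator is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, f1, mcc


# Made-up counts for 100 reviews.
toy_acc, toy_f1, toy_mcc = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print("Accuracy : %.3f" % toy_acc)
print("F1-Score : %.3f" % toy_f1)
print("MCC: %.3f" % toy_mcc)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the two classes are imbalanced.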

In [None]:
class_names = ["0", "1"]
conf_matrix = confusion_matrix(flat_true_labels, flat_predictions)
conf_matrix = 100 * conf_matrix.astype('float') / conf_matrix.sum(axis=1)[:, np.newaxis]
sns.heatmap(conf_matrix, xticklabels=class_names, yticklabels=class_names, annot=True, fmt=".2f", cbar=False);
plt.title("Confusion matrix (% of each true class)")
plt.ylabel('True labels')
plt.xlabel('Predicted labels')
plt.show()
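
The normalization step divides each row of the confusion matrix by its row total, so every cell reads as a percentage of the corresponding true class. A short sketch on made-up counts (`toy_conf` is illustrative):

```python
import numpy as np

# Rows are true classes, columns are predicted classes.
toy_conf = np.array([[45, 5],
                     [10, 40]])
# keepdims=True keeps the row totals as a column so the division broadcasts per row.
toy_conf_pct = 100 * toy_conf / toy_conf.sum(axis=1, keepdims=True)
print(toy_conf_pct)
```

After normalization each row sums to 100, and the diagonal holds the per-class recall in percent.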