<a href="https://colab.research.google.com/github/fabnancyuhp/DEEP-LEARNING/blob/main/NOTEBOOKS/introduction-to-hugging-face-transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ref : https://huggingface.co/transformers/training.html<br>
ref : https://huggingface.co/docs/datasets/loading_datasets.html<br>
ref : https://blog.tensorflow.org/2019/11/hugging-face-state-of-art-natural.html<br>
In this notebook,
* we implement text classification on the IMDB dataset using transfer learning.
* we show how to fine-tune a pretrained model from the Transformers library.
* we fine-tune BERT on the IMDB dataset.

BERT stands for Bidirectional Encoder Representations from Transformers. In TensorFlow, models can be directly trained using Keras and the fit method.<br><br>

**Hugging Face** is an NLP-focused startup with a large open-source community, in particular around the **Transformers library.** 🤗/Transformers is a Python-based library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT, which obtain state-of-the-art results on a variety of NLP tasks like text classification, information extraction, question answering, and text generation. These architectures come pre-trained with several sets of weights.<br><br>

The **Transformers library** grew very quickly in PyTorch and was later ported to TensorFlow 2.0, offering an API that works with the Keras fit API, TensorFlow Extended, and TPUs.<br><br>

**Hugging Face transformers** is a collection of state-of-the-art NLP (Natural Language Processing) models. They offer a wide variety of architectures to choose from (BERT, GPT-2, RoBERTa etc) as well as a hub of pre-trained models uploaded by users and organisations. <br><br>
**Hugging Face Datasets** is a wrapper library that provides some tools to load and process data in many commonly used formats (CSV, JSON etc).

# Hugging Face Datasets
Before we can fine-tune a model, we need a dataset. In this tutorial, we fine-tune BERT on the IMDB dataset: the task is to classify movie reviews as positive or negative. Below, we load the IMDB dataset from the Hugging Face hub. raw_datasets is a DatasetDict object.

In [None]:
from datasets import load_dataset

raw_datasets = load_dataset("imdb")

Here, we show how a Hugging Face DatasetDict is structured. When we run the cell below, we see that a DatasetDict is a data structure close to a Python dictionary. The raw_datasets DatasetDict has 3 sub-datasets: train, test and unsupervised. The unsupervised split holds additional unlabeled reviews (its label is -1); it is not used in this notebook.<br>
To access the movie reviews of the training dataset, we run raw_datasets['train']['text']. To get the labels of the training dataset, we run raw_datasets['train']['label'].
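
To make the "close to a Python dictionary" remark concrete, here is a minimal sketch that mimics the same access pattern with plain dictionaries. The toy_datasets name and its contents are made up for illustration; the real DatasetDict maps split names to Dataset objects, not to plain dicts.

```python
from collections import Counter

# Hypothetical miniature stand-in for raw_datasets: split name -> columns.
# The real DatasetDict maps split names to Dataset objects, not plain dicts.
toy_datasets = {
    "train": {
        "text": ["A wonderful film.", "Dull and far too long.", "I loved it."],
        "label": [1, 0, 1],
    },
    "test": {
        "text": ["Not my cup of tea."],
        "label": [0],
    },
}

# Same access pattern as raw_datasets['train']['text'][0:2]
first_two_reviews = toy_datasets["train"]["text"][0:2]
first_two_labels = toy_datasets["train"]["label"][0:2]

# Count how many training reviews carry each label
label_counts = Counter(toy_datasets["train"]["label"])

print(first_two_reviews)
print(first_two_labels)
print(dict(label_counts))
```

Counting the labels this way is a quick check that the two classes are balanced before training.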

In [None]:
raw_datasets

In [None]:
raw_datasets['train']['text'][0:2]

In [None]:
raw_datasets['train']['label'][0:2]

In [None]:
#number of characters in each training review
long_text = [len(x) for x in raw_datasets['train']['text']]

# Preprocess/Tokenize a hugging-face dataset
Recall that the goal of this notebook is to fine-tune a cased BERT model. To preprocess our data, we need a tokenizer. A tokenizer splits the text into tokens (these can be characters, words, or parts of words). For this task, we use the tokenizer that belongs to the pre-trained model we selected (bert-base-cased).
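
BERT tokenizers split words into "parts of words" with the WordPiece scheme. The sketch below is a simplified illustration of its greedy longest-match idea, not the library's implementation: the toy_vocab set is hypothetical, and the real tokenizer also handles punctuation, special tokens and a vocabulary of many thousands of pieces.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only
toy_vocab = {"film", "un", "watch", "##watch", "##able", "##ing"}

print(wordpiece_tokenize("unwatchable", toy_vocab))  # ['un', '##watch', '##able']
print(wordpiece_tokenize("watching", toy_vocab))     # ['watch', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))          # ['[UNK]']
```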

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

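The most important features produced by the tokenization step are 'attention_mask', 'input_ids' and 'token_type_ids'; the 'label' column is carried over from the raw dataset. From this point on, the model does not need the text feature, so the text column can be removed from the dataset. The cell below shows how; it is left commented out because the text column is used again at the end of the TensorFlow section.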
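
To see what padding="max_length" and truncation=True do to 'input_ids' and 'attention_mask', here is a minimal sketch written by hand. The pad_and_truncate helper and the token ids are illustrative assumptions. The real tokenizer also keeps the closing [SEP] token when it truncates, which this sketch does not.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    input_ids = token_ids[:max_length]             # truncation: cut what does not fit
    attention_mask = [1] * len(input_ids)          # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids = input_ids + [pad_id] * padding     # padding: fill up with pad_id
    attention_mask = attention_mask + [0] * padding  # 0 marks padding to be ignored
    return input_ids, attention_mask


# Illustrative token ids for one short review
short_review_ids = [101, 7, 8, 9, 102]

padded_ids, padded_mask = pad_and_truncate(short_review_ids, max_length=8)
print(padded_ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(padded_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]

cut_ids, cut_mask = pad_and_truncate(short_review_ids, max_length=3)
print(cut_ids)      # [101, 7, 8]
print(cut_mask)     # [1, 1, 1]
```

For a single-sentence input such as a movie review, 'token_type_ids' is simply a list of zeros of the same length.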

In [None]:
#tokenized_datasets = tokenized_datasets.remove_columns(["text"])
#tokenized_datasets

# Fine-tuning the Hugging Face bert-base-cased model with TensorFlow

## Making the dataset usable by a TensorFlow model
* First step: we define the content of the training set and the test set.
* Second step: we convert these 2 datasets to the TensorFlow format.
* Third step: we wrap the training and test datasets in batched tf.data.Dataset objects that a TensorFlow model can consume.

In [None]:
full_train_dataset = tokenized_datasets["train"]
full_test_dataset = tokenized_datasets["test"]

In [None]:
#Second step: we convert these 2 datasets to the TensorFlow format
tf_train_dataset = full_train_dataset.with_format("tensorflow")
tf_test_dataset = full_test_dataset.with_format("tensorflow")

In [None]:
#Third step: we build batched tf.data.Dataset objects for the training and test sets
import tensorflow as tf
train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

test_features = {x: tf_test_dataset[x].to_tensor() for x in tokenizer.model_input_names}
test_tf_dataset = tf.data.Dataset.from_tensor_slices((test_features, tf_test_dataset["label"]))
test_tf_dataset = test_tf_dataset.batch(8)
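
The shuffle-then-batch pattern of the training pipeline can be pictured with plain Python. This is only a sketch of the idea, not how tf.data works internally; the shuffled_batches helper is hypothetical. It assumes the 25,000 reviews of the IMDB training split.

```python
import math
import random


def shuffled_batches(n_examples, batch_size, seed=0):
    """Yield lists of example indices: shuffle once, then cut into batches."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]


# Small demonstration: 20 examples in batches of 8
batches = list(shuffled_batches(n_examples=20, batch_size=8))
batch_sizes = [len(b) for b in batches]  # the last batch may be smaller
print(batch_sizes)  # [8, 8, 4]

# With 25,000 training reviews and a batch size of 8
steps_per_epoch = math.ceil(25000 / 8)
print(steps_per_epoch)  # 3125
```

Using a shuffle buffer as large as the dataset, as in shuffle(len(tf_train_dataset)), gives a full shuffle of the training examples. The test set is only batched, since its order does not matter for evaluation.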

## We import the pretrained TensorFlow bert-base-cased model from the Hugging Face hub
We import the pretrained bert-base-cased model from the Hugging Face hub in the cell below. This model is paired with the tokenizer we loaded earlier in this notebook. We set num_labels=2 because the label column of our dataset has 2 distinct values. In the cell below, model is a TensorFlow model.

In [None]:
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

## We fine-tune the pretrained model
We compile the TensorFlow model as we usually do with Keras. Here we use:
* the Adam optimizer with learning_rate=5e-5
* the SparseCategoricalCrossentropy loss function (with from_logits=True, because the model outputs raw logits)
* the SparseCategoricalAccuracy() metric

To fine-tune a TensorFlow Hugging Face model with the Adam optimizer, a small learning rate such as 5e-5 is recommended.
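
To see what the loss computes for a single review, here is a small worked sketch of sparse categorical cross-entropy from logits in plain Python. The function name and the logits values are illustrative assumptions; Keras computes the same quantity in batches and then averages it.

```python
import math


def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy of one example, given raw logits and an integer label."""
    highest = max(logits)
    shifted = [z - highest for z in logits]           # shift for numerical stability
    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))
    log_probs = [z - log_sum_exp for z in shifted]    # log-softmax of the logits
    return -log_probs[label]                          # penalty on the true class only


example_logits = [-1.0, 2.0]  # hypothetical model output: class 1 is favoured

loss_if_positive = sparse_categorical_crossentropy(example_logits, label=1)
loss_if_negative = sparse_categorical_crossentropy(example_logits, label=0)

print(round(loss_if_positive, 4))  # small loss: the prediction agrees with the label
print(round(loss_if_negative, 4))  # large loss: the prediction contradicts the label
```

"Sparse" means that the label is an integer class index (0 or 1 here) rather than a one-hot vector.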

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy(),
)

model.fit(train_tf_dataset, validation_data=test_tf_dataset, epochs=2)

After the fit step, model is a fine-tuned transformer model. We can now apply it to a new movie review.

In [None]:
new_movie_review = ["I was extraordinarily impressed by this film. It's one of the best sports films \
                    I've every seen. The visuals in this film are outstanding. I love the sequences \
                    in which the camer tracks the ball as it flies through the air or into the cup. \
                    The film moves well, offering both excitement and drama. The cinematography was fantastic.\
                    <br /><br />The acting performances are great. I was surprised by young Shia LaBeouf.\
                    He does well in this role. Stephen Dillane is also good as the brooding Harry Vardon. \
                    Peter Firth, Justin Ashforth, and Elias Koteas offer able support. \
                    The film is gripping and entertaining and for the first time in my \
                    life actually made me want to watch a golf tournament."]

In order to apply the model to the new review, we have to:
* vectorize the new movie review
* put the vectorized text into a tf.data.Dataset
* apply the .batch(1) method to make review_data suitable for the fine-tuned model.

In [None]:
#we vectorize
token_new_review = tokenizer(new_movie_review, padding="max_length", truncation=True)

#put the vectorized text into a tf.data.Dataset
import tensorflow as tf
review_data = tf.data.Dataset.from_tensor_slices((dict(token_new_review)))

#make review_data suitable for the fine-tuned model.
review_data_batch = review_data.batch(1)

Now, we can apply the fine-tuned model to review_data_batch. Then, we apply a softmax function to the result to get a probability vector.
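
As a minimal sketch of this last step, the softmax and the choice of the most probable class can be written in plain Python. The logits values below are made up for illustration; they are not an output of the fine-tuned model.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


review_sentiment = ["negative review", "positive review"]

example_logits = [-1.2, 2.3]  # hypothetical logits for one review
probs = softmax(example_logits)

# Index of the highest probability, i.e. what np.argmax would return
predicted = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(review_sentiment[predicted])  # positive review
```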

In [None]:
import numpy as np
import tensorflow as tf

review_sentiment = ['negative review','positive review']

#we apply the fine-tuned model to review_data_batch
pred = model.predict(review_data_batch).logits
#we apply a softmax function to get a probability vector
prob_vector = tf.nn.softmax(pred, axis=1).numpy()
prob_vector

Then, we display the predicted class of the new review

In [None]:
import numpy as np
review_sentiment = ['negative review','positive review']
review_sentiment[np.argmax(prob_vector)]

In [None]:
import pandas as pd
data = pd.DataFrame({'text':tokenized_datasets["test"]['text'],'sentiment':tokenized_datasets["test"]['label']})
#we display the text of one positive review from the test set
data[data['sentiment']==1].iloc[12495,0]

# Fine-tuning in PyTorch with the Trainer API
**In this section, we work with a pretrained PyTorch Hugging Face transformer. You might need to restart your notebook at this stage to free some memory.**<br>
The Trainer API can report to Weights & Biases (wandb), an experiment-tracking service. We set the WANDB_DISABLED environment variable to "true" to disable wandb entirely.

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

We download the huggingface IMDB dataset:

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("imdb")

**We vectorize the movie reviews with the bert-base-cased tokenizer. Then we create the training dataset and the test dataset.**<br> In the cell below, full_train_dataset and full_test_dataset are 2 Hugging Face datasets.

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

full_train_dataset = tokenized_datasets["train"]
full_test_dataset = tokenized_datasets["test"]

Now, in order to fine-tune a BERT Hugging Face model with the Trainer API,
* we load the transformer model we want to use,
* we set the TrainingArguments,
* we set the Trainer.

The model we load is paired with the tokenizer we loaded earlier in this example. For a pretrained PyTorch model, we use AutoModelForSequenceClassification instead of TFAutoModelForSequenceClassification.

**We load the transformer model we want to use**<br>
In the cell below, pytorch_model is a PyTorch model.

In [None]:
import torch
torch.cuda.empty_cache()
from transformers import AutoModelForSequenceClassification

pytorch_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

**We set the TrainingArguments**<br>
The Trainer API reads its settings from a TrainingArguments object. You can use the default configuration with TrainingArguments("test_trainer"). Otherwise, you can set arguments such as:
* output_dir='./results': the output directory
* num_train_epochs: the total number of training epochs
* per_device_train_batch_size: the batch size per device during training
* per_device_eval_batch_size=20: the batch size for evaluation
* warmup_steps=500: the number of warmup steps of the learning rate scheduler
* weight_decay=0.01: the strength of weight decay
* logging_dir='./logs': the directory for storing logs

In [None]:
#We set some arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)
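
To give an idea of what warmup_steps=500 means, here is a sketch of a learning rate that warms up linearly and then decays linearly to 0. It rests on several assumptions: a base learning rate of 5e-5, a linear schedule, a single device, no gradient accumulation, and the 25,000 reviews of the IMDB training split. The helper name is hypothetical.

```python
import math


def linear_warmup_decay_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 25,000 training reviews, batch size 16, 3 epochs (assumed: one device)
steps_per_epoch = math.ceil(25000 / 16)
total_steps = steps_per_epoch * 3
print(steps_per_epoch, total_steps)  # 1563 4689

for step in (0, 250, 500, 2500, total_steps):
    lr = linear_warmup_decay_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=total_steps)
    print(step, lr)
```

The learning rate starts at 0, reaches its peak after the 500 warmup steps, and returns to 0 at the last step. Starting small avoids large early updates that could damage the pretrained weights.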

In [None]:
#Alternatively, we use the default configuration (this replaces the training_args defined above)
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")

**We set the Trainer and begin the training stage with Trainer.train().**<br>
The Trainer object takes arguments such as:
* model : a Hugging Face pretrained PyTorch model
* args : the TrainingArguments defined earlier
* train_dataset : a tokenized Hugging Face dataset
* eval_dataset : a tokenized Hugging Face dataset (optional; the cell below does not pass one)

The target column of a Hugging Face dataset used by the Trainer should be named label (or labels); otherwise the Trainer cannot find the targets to compute the loss.
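
When an evaluation dataset is provided, a natural metric for this task is accuracy. The sketch below computes it from logits and labels in plain Python. The function name and the toy values are assumptions for illustration. Something similar could be adapted for the Trainer's compute_metrics argument, which this notebook does not use.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])  # argmax of the row
        correct += int(predicted == label)
    return correct / len(labels)


# Hypothetical logits for 4 reviews (columns: negative, positive) and their true labels
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.5, 1.4]]
toy_labels = [0, 1, 0, 0]

toy_accuracy = accuracy_from_logits(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```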

In [None]:
import torch
torch.cuda.empty_cache()

from transformers import Trainer

trainer = Trainer(model=pytorch_model, args=training_args, train_dataset=full_train_dataset)

trainer.train()

After the Trainer step, pytorch_model is a fine-tuned transformer model. We can now apply pytorch_model to a new movie review. We have to
* vectorize the new movie review
* apply pytorch_model to the vectorized text
* use the softmax function to get a probability vector as a PyTorch tensor
* convert this PyTorch tensor into a numpy array

In [None]:
new_movie_review = ["I was extraordinarily impressed by this film. It's one of the best sports films \
                    I've every seen. The visuals in this film are outstanding. I love the sequences \
                    in which the camer tracks the ball as it flies through the air or into the cup. \
                    The film moves well, offering both excitement and drama. The cinematography was fantastic.\
                    <br /><br />The acting performances are great. I was surprised by young Shia LaBeouf.\
                    He does well in this role. Stephen Dillane is also good as the brooding Harry Vardon. \
                    Peter Firth, Justin Ashforth, and Elias Koteas offer able support. \
                    The film is gripping and entertaining and for the first time in my \
                    life actually made me want to watch a golf tournament."]

**CUDA** (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing. **Calling .to("cuda") on the tokenizer output puts the vectorized text in GPU memory.** This requires a CUDA-capable GPU, and the model must be on the same device as its inputs.

In [None]:
#we vectorize the new movie review
review_token = tokenizer(new_movie_review, padding="max_length", truncation=True, return_tensors="pt").to("cuda")

#get a PyTorch tensor probability vector
prob_pytorch_tensor = pytorch_model(**review_token)[0].softmax(1)

#convert the previous PyTorch tensor into a numpy array
prob_numpy_array = prob_pytorch_tensor.cpu().detach().numpy()

Then, we display the predicted class of the new review

In [None]:
import numpy as np
review_sentiment = ['negative review','positive review']
review_sentiment[np.argmax(prob_numpy_array)]