<a href="https://colab.research.google.com/github/fabnancyuhp/DEEP-LEARNING/blob/main/NOTEBOOKS/fine-tuning-in-pytorch-with-the-trainer-api.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning in PyTorch with the Trainer API
This notebook is inspired by https://huggingface.co/blog/sentiment-analysis-python. We fine-tune a pretrained Hugging Face transformer in PyTorch: more precisely, a DistilBERT model on the IMDB movie-review dataset.<br> 
DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT's performance as measured on the GLUE language understanding benchmark.

We download the Hugging Face IMDB dataset and keep a shuffled subset of 3000 training reviews and 300 test reviews. full_train_dataset and full_test_dataset are Hugging Face Dataset objects.

In [3]:
#!pip install datasets
from datasets import load_dataset
raw_datasets = load_dataset("imdb")

full_train_dataset = raw_datasets["train"].shuffle(seed=42).select(range(3000))
full_test_dataset = raw_datasets["test"].shuffle(seed=42).select(range(300))

import gc
del raw_datasets
gc.collect()

#full_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(5000))
#full_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(3000))

# We vectorize the movie reviews with a distilbert-base-uncased tokenizer. Then we create the training dataset and the test dataset.
In the cell below, full_train_dataset and full_test_dataset are two Hugging Face datasets.

In [5]:
#!pip install transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

full_train_dataset = full_train_dataset.map(tokenize_function, batched=True)
full_test_dataset = full_test_dataset.map(tokenize_function, batched=True)
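
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch. It is not the tokenizer's implementation: `pad_or_truncate`, the toy token ids, `max_length=8` and `pad_id=0` are hypothetical values chosen for illustration. The real tokenizer pads or truncates to the model's maximum length and also handles special tokens.

```python
def pad_or_truncate(token_ids, max_length=8, pad_id=0):
    """Return fixed-length input_ids and the matching attention_mask."""
    kept = token_ids[:max_length]        # truncation: drop ids past max_length
    n_pad = max_length - len(kept)       # padding: number of ids to append
    return {
        "input_ids": kept + [pad_id] * n_pad,
        "attention_mask": [1] * len(kept) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short_review = [101, 2307, 3185, 102]                   # hypothetical token ids
long_review = [101] + list(range(2000, 2010)) + [102]   # 12 ids, longer than max_length

short_encoded = pad_or_truncate(short_review)
long_encoded = pad_or_truncate(long_review)
print(short_encoded)
print(long_encoded)
```

Every encoded review ends up with the same length, which is what lets the reviews be stacked into fixed-size batches.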

Now, to fine-tune a Hugging Face BERT-style model with the Trainer API:
* We load the transformer model we want to use.
* We set the TrainingArguments.
* We set up the Trainer.

The model we load is paired with the tokenizer we loaded earlier in this example. The model in the cell below is a PyTorch model. For a pretrained PyTorch model we use AutoModelForSequenceClassification instead of TFAutoModelForSequenceClassification, its TensorFlow counterpart.

# We load the distilbert-base-uncased transformer we want to use
In the cell below, pytorch_model is a PyTorch Hugging Face model.

In [None]:
import torch
torch.cuda.empty_cache()
from transformers import AutoModelForSequenceClassification

#pytorch_model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
pytorch_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias', 'pre_classifier

**We set the TrainingArguments**<br>
The Trainer API is configured through a TrainingArguments object. You can use the default configuration with TrainingArguments("test_trainer"). Otherwise, you can set arguments such as:
* output_dir='./results': the output directory
* num_train_epochs: the total number of training epochs
* per_device_train_batch_size: the batch size per device during training
* per_device_eval_batch_size=20: the batch size for evaluation
* warmup_steps: the number of warmup steps for the learning rate scheduler
* weight_decay=0.01: the strength of weight decay
* logging_dir='./logs': the directory for storing logs

In [None]:
#We set some arguments
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
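
The `warmup_steps` argument interacts with the total number of optimization steps. The sketch below assumes the default linear scheduler and the default learning rate of 5e-5, neither of which is set explicitly in the cell above. Under those assumptions the learning rate rises linearly during warmup and then decays linearly to zero. `linear_warmup_lr` is a hypothetical helper written for illustration, not a transformers function, and `total_steps=564` is the step count for 3000 examples with batch size 16 over 3 epochs.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=500, total_steps=564):
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # warmup phase: grow from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # decay phase: shrink from base_lr to 0 at total_steps
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

for step in (0, 250, 500, 532, 564):
    print(step, linear_warmup_lr(step))
```

With only 564 steps in total, a 500-step warmup means most of the training runs below the peak learning rate, so a smaller warmup value may be worth trying on a dataset of this size.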


In [None]:
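#Alternatively, we use the default configuration and only change the number of epochs to 1.
#Note: the training log shown later (3 epochs, batch size 16, ./results) matches the explicit configuration above, not this one.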
from transformers import TrainingArguments
training_args = TrainingArguments("test_trainer")
training_args.num_train_epochs = 1
#training_args.per_device_eval_batch_size = 2
#training_args.per_device_train_batch_size = 2
#training_args

Using the `WAND_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


**We set the Trainer and begin the training stage with Trainer.train().**<br>
The Trainer object takes arguments such as:
* model : a Hugging Face pretrained PyTorch model
* args : the TrainingArguments defined earlier
* train_dataset : a Hugging Face dataset produced by the tokenization step
* eval_dataset : a Hugging Face dataset produced by the tokenization step (optional; the cell below passes only train_dataset)

The target column of a Hugging Face dataset used by the Trainer must be named label (or labels); otherwise the Trainer API cannot compute the loss.

In [None]:
import torch
torch.cuda.empty_cache()


from transformers import Trainer

trainer = Trainer(model=pytorch_model, args=training_args, train_dataset=full_train_dataset)

trainer.train()

The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 3000
  Num Epochs = 3
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 564


Step,Training Loss
500,0.355


Saving model checkpoint to ./results/checkpoint-500
Configuration saved in ./results/checkpoint-500/config.json
Model weights saved in ./results/checkpoint-500/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=564, training_loss=0.3428031130039946, metrics={'train_runtime': 243.078, 'train_samples_per_second': 37.025, 'train_steps_per_second': 2.32, 'total_flos': 1192206587904000.0, 'train_loss': 0.3428031130039946, 'epoch': 3.0})
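
The numbers in the training log can be checked by hand. The short sketch below reproduces the total number of optimization steps and the throughput figures from the values printed above, assuming a single device and no gradient accumulation, as the log indicates. The variable names are illustrative.

```python
import math

num_examples = 3000
batch_size = 16
num_epochs = 3
train_runtime = 243.078  # seconds, taken from the TrainOutput above

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 3), round(steps_per_second, 2))
```

The results match the `Total optimization steps`, `train_samples_per_second` and `train_steps_per_second` values reported in the log.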

After the Trainer step, pytorch_model is a fine-tuned transformer model, so we can apply it to a new movie review. We have to:
* vectorize the new movie review
* apply pytorch_model to the vectorized text to get the logits
* use the softmax function to get a PyTorch tensor of class probabilities
* convert this PyTorch tensor into a numpy array
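
As a rough illustration of the softmax step, the pure-Python sketch below turns a pair of logits into class probabilities and picks the most likely class. The logit values are hypothetical and `softmax` is a hand-written helper, not the PyTorch method used in the notebook. Index 0 stands for the negative class and index 1 for the positive class, as in the IMDB labels.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]                       # hypothetical model output: [negative, positive]
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # equivalent to an argmax
print(probs, predicted_class)
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether the argmax is taken before or after it. The probabilities are useful when a confidence score is also needed.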

In [None]:
new_movie_review = ["I was extraordinarily impressed by this film. It's one of the best sports films \
                    I've every seen. The visuals in this film are outstanding. I love the sequences \
                    in which the camer tracks the ball as it flies through the air or into the cup. \
                    The film moves well, offering both excitement and drama. The cinematography was fantastic.\
                    <br /><br />The acting performances are great. I was surprised by young Shia LaBeouf.\
                    He does well in this role. Stephen Dillane is also good as the brooding Harry Vardon. \
                    Peter Firth, Justin Ashforth, and Elias Koteas offer able support. \
                    The film is gripping and entertaining and for the first time in my \
                    life actually made me want to watch a golf tournament."]

**CUDA** (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing. **Calling .to("cuda") on the tokenizer output moves the vectorized text to GPU memory.** The Trainer places the model on the GPU when one is available, so the inputs must be on the same device.

In [None]:
#we vectorize the new movie review and move it to the GPU
review_token = tokenizer(new_movie_review, padding="max_length", truncation=True, return_tensors="pt").to("cuda")

#get a PyTorch tensor probability vector
#eval() disables dropout, which is still active after training; no_grad() skips gradient tracking
pytorch_model.eval()
with torch.no_grad():
    prob_pytorch_tensor = pytorch_model(**review_token)[0].softmax(1)

#convert the previous PyTorch tensor into a numpy array
prob_numpy_array = prob_pytorch_tensor.cpu().detach().numpy()

Then, we display the predicted class of the new review

In [None]:
import numpy as np
#index 0 is the negative class and index 1 the positive class, as in the IMDB labels
review_sentiment = ['negative review', 'positive review']
review_sentiment[np.argmax(prob_numpy_array)]

'positive review'