<a href="https://colab.research.google.com/github/fabnancyuhp/DEEP-LEARNING/blob/main/NOTEBOOKS/pytorch-huggingface-transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
Reference: https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python<br>
In this notebook we use the **Transformers** library, developed by Hugging Face, an NLP-focused startup with a large open-source community. **Transformers** is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 or DistilBERT.<br><br>
In this chapter we use a Hugging Face BERT model to classify emails as spam or not spam. BERT is a bidirectional transformer pretrained with a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.<br><br>
**In this notebook, we show how to fine-tune a Hugging Face transformer with PyTorch.**

# Functions for preprocessing text data

In this section, we give some functions for cleaning text data:
* One function to remove HTML tags
* One function to remove URLs
* One function to remove emails

In principle, cleaning the text is not necessarily beneficial when doing text mining with Hugging Face transformers, because the tokenizers are designed to handle raw text. In this notebook, we nevertheless apply some light cleaning to the text data.

In [None]:
import re

def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r'',data)
    return data

def _remove_urls(x):
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

def _remove_emails(x):
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)
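
As a quick check of what these helpers do, the sketch below applies them in sequence to a made-up message. The sample string and the final whitespace normalization are illustrative assumptions, not part of the original pipeline. The three functions are repeated so that the block runs on its own.

```python
import re

def remove_html(data):
    # drop anything that looks like an HTML tag
    return re.compile(r'<.*?>').sub(r'', data)

def _remove_urls(x):
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)

def _remove_emails(x):
    # note: this pattern only matches lowercase addresses
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', "", x)

# hypothetical sample message, for illustration only
sample = "Hello <b>world</b> visit http://example.com/page now or mail bob@example.com today"

cleaned = _remove_emails(_remove_urls(remove_html(sample)))
# the removals leave double spaces behind, so we collapse whitespace
cleaned = " ".join(cleaned.split())
print(cleaned)
```

The order matters little here, but removing HTML tags first avoids leaving tag fragments around URLs.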

# Fine-tuning in PyTorch with the Trainer API
In this section, we fine-tune a **BERT** model from Hugging Face with **PyTorch**. We use a BERT cased model to detect spam emails. First, we import the dataset. The text of each email is stored in the email column. The label column equals 0 for valid emails and 1 for spam. Run the cells below to create the email dataset:

## Dataset creation
* First, we import the data as a pandas DataFrame
* Second, we convert the pandas DataFrame into a Hugging Face dataset
* Last, we split the Hugging Face dataset into a training set and a test set

In [None]:
import pandas as pd
spam_ham = pd.read_csv("https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/DATA/spam_ornot_spam.csv")
spam_ham['email'] = spam_ham['email'].apply(lambda x:remove_html(str(x)))
spam_ham['email'] = spam_ham['email'].apply(lambda x:_remove_urls(str(x)))
spam_ham['email'] = spam_ham['email'].apply(lambda x:_remove_emails(str(x)))
spam_ham.head()

spam_ham['email'] = spam_ham['email'].apply(lambda x:' '.join(x.split()[0:512]))

Weights & Biases (wandb) is an experiment-tracking service that the Hugging Face Trainer API can report to. To run the Trainer without a wandb account, we set the environment variable WANDB_DISABLED to "true", which disables the wandb integration entirely.

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

In [None]:
from datasets import Dataset

# Secondly, we convert the pandas dataset into a hugging-face dataset
dataset = Dataset.from_pandas(spam_ham)

# Last, we split the hugging-face dataset into a training set and a test set
dataset_train_test = dataset.train_test_split(test_size=0.15)
dataset_train_test
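
To make the split step concrete, here is a minimal pure-Python sketch of a shuffled 85/15 split. It is only a stand-in that illustrates the idea, not how the datasets library implements train_test_split; the function name toy_train_test_split, the seed and the toy rows are all hypothetical.

```python
import random

def toy_train_test_split(rows, test_size=0.15, seed=0):
    # shuffle the row indices, then cut them into a test part and a train part
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(test_size * len(rows))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return {
        "train": [rows[i] for i in train_idx],
        "test": [rows[i] for i in test_idx],
    }

# hypothetical rows with the same two columns as the email dataset
rows = [{"email": f"message {i}", "label": i % 2} for i in range(100)]
splits = toy_train_test_split(rows)
print(len(splits["train"]), len(splits["test"]))
```

Every row ends up in exactly one of the two parts, which is the property that makes the test set a fair check of the fine-tuned model.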

## Tokenization and vectorization of the text
Recall that our goal is to fine-tune a BERT cased model from the Hugging Face Hub. A Hugging Face model has two components:
* the tokenizer, which is responsible for the text vectorization
* the model, a pre-trained transformer that takes the tokenizer output as input

The tokenizer and the model should always be paired. In the cell below we:
* import the tokenizer designed for the bert-base-cased model
* vectorize the texts of the emails

In [None]:
# We import the tokenizer designed for the bert-cased model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# We vectorize the texts from the emails
def tokenize_function(examples):
    return tokenizer(examples["email"], padding="max_length", truncation=True)

tokenized_datasets = dataset_train_test.map(tokenize_function, batched=True)

In [None]:
tokenized_datasets

attention_mask, input_ids and token_type_ids are the vector features built by the bert-base-cased tokenizer from the email feature. The model uses these features, together with the label feature, during the training stage implemented later. We now extract the training set and the test set from the Hugging Face dataset tokenized_datasets.

In [None]:
train_set = tokenized_datasets['train']
test_set = tokenized_datasets['test']
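
The roles of input_ids and attention_mask, and the effect of padding="max_length" and truncation=True, can be illustrated with a toy encoder. This is a sketch under strong simplifications: it splits on whitespace, uses a tiny hypothetical vocabulary and omits the special tokens that the real BERT tokenizer adds. The name toy_encode and the ids are assumptions made for the example.

```python
def toy_encode(text, vocab, max_length, pad_id=0, unk_id=1):
    # map each whitespace-separated word to an integer id
    ids = [vocab.get(word, unk_id) for word in text.split()]
    # truncation: keep at most max_length ids
    ids = ids[:max_length]
    # padding: fill up to max_length, and mark real tokens with 1, padding with 0
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }

# hypothetical vocabulary, for illustration only
vocab = {"free": 2, "dvds": 3, "join": 4, "now": 5}

short = toy_encode("free dvds join now", vocab, max_length=6)
long_ = toy_encode("free dvds join now today", vocab, max_length=3)
print(short)
print(long_)
```

The attention mask tells the model which positions hold real tokens, so the padded positions do not influence the prediction.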

## Training stage with PyTorch Trainer
* We load the transformer model we want to use.
* We set the TrainingArguments.
* We set up the Trainer.

**We load the transformer model we want to use**<br>
The model we load is paired with the tokenizer loaded earlier in this example. The model in the cell below is a PyTorch model: for a pretrained PyTorch model we use AutoModelForSequenceClassification rather than its TensorFlow counterpart TFAutoModelForSequenceClassification.

In [None]:
# We load the transformer model we want to use. The model is a PyTorch model.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

In [None]:
#We have to set the TrainingArguments
from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer")
          #num_train_epochs=3,              # total number of training epochs
          #per_device_train_batch_size=16,  # 16 batch size per device during training
          #per_device_eval_batch_size=20,   #20
          #output_dir='./results',
          #weight_decay=0.01)

**We set up the Trainer and start the training stage with Trainer.train().**<br>
The Trainer object takes, among others, the following arguments:
* model : a pretrained Hugging Face PyTorch model
* args : the TrainingArguments object defined earlier
* train_dataset : a Hugging Face dataset produced by the tokenization step

The target column of a Hugging Face dataset used by the Trainer should be named label; otherwise the Trainer API does not work as expected.

In [None]:
#We set the Trainer and begin the training stage with Trainer.train().
#import torch
#torch.cuda.empty_cache()

from transformers import Trainer

trainer = Trainer(model=model,args=training_args,train_dataset=train_set)
trainer.train()

Now we save the tokenizer and the fine-tuned PyTorch model.

In [None]:
model.save_pretrained("./")
tokenizer.save_pretrained("./")

## Apply the PyTorch fine-tuned model on a new text
Now, we try the model on the following email:

In [None]:
email_test = """martin a posted tassos papadopoulos the greek sculptor behind 
                the plan judged that the limestone of mount kerdylio NUMBER 
                miles east of salonika and not far from the mount athos monastic 
                community was ideal for the patriotic sculpture as well as alexander s 
                granite features NUMBER ft high and NUMBER ft wide a museum a restored 
                amphitheatre and car park for admiring crowds are planned so is 
                this mountain limestone or granite if it s limestone it ll weather 
                pretty fast yahoo groups sponsor NUMBER dvds free s p join now URL 
                to unsubscribe from this group send an email to forteana unsubscribe 
                URL your use of yahoo groups is subject to URL"""

After the Trainer step, model is a fine-tuned transformer model that we can apply to a new email. We have to:
* vectorize the new email
* apply the PyTorch model to the vectorized text
* use the softmax function to get a probability vector as a PyTorch tensor
* convert this PyTorch tensor into a numpy array

In [None]:
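import torch

model.eval()  # switch off dropout for inference
# send the inputs to the device the model lives on (GPU if available, else CPU)
token_input = tokenizer(email_test, padding="max_length", truncation=True, return_tensors="pt").to(model.device)
with torch.no_grad():  # no gradients are needed for prediction
    # model(...)[0] holds the logits; softmax over dim 1 turns them into probabilities
    probs = model(**token_input)[0].softmax(1).cpu().numpy()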
print("proba="+ str(probs))
target_names = ['email valid','spam']
target_names[probs.argmax()]
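
The softmax and argmax steps used above can be written out by hand. The following is a minimal pure-Python sketch with made-up logits (the values are hypothetical, not outputs of the fine-tuned model); it only illustrates how two raw scores become a probability vector and then a class name.

```python
import math

def softmax(logits):
    # subtract the maximum for numerical stability, then normalize the exponentials
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, -1.0]  # hypothetical scores for the classes 0 and 1
probs = softmax(logits)

target_names = ['email valid', 'spam']
predicted = target_names[probs.index(max(probs))]  # argmax by hand
print(probs, predicted)
```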

## Prediction on the test set with the fine-tuned model
A Hugging Face PyTorch model consumes a lot of GPU memory. When we predict on a whole dataset with it, we free memory as we go with the gc (garbage collection) package.<br><br>
In the cell below, we loop over the whole test set. At each step of the loop we call the tokenizer and the model, and we convert the resulting PyTorch tensor into a numpy array to release some GPU memory.<br><br>
**CUDA** (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing.<br><br>

Y_test_proba in the cell below is a list of numpy probability vectors.

In [None]:
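import gc
import torch

Y_true = test_set['label']
Y_test_email = test_set['email']

Y_test_proba = []

model.eval()  # switch off dropout for inference
for email in Y_test_email:
    # send the inputs to the device the model lives on (GPU if available, else CPU)
    token_output = tokenizer(email, padding="max_length", truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():  # do not build the autograd graph, which saves memory
        Y_test_proba.append(model(**token_output)[0].softmax(1).cpu().numpy())
    del token_output
    gc.collect()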

Now we use the np.argmax function to compute the predicted classes of the test set:
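
A minimal sketch of this step, using a small hand-made stand-in for Y_test_proba and Y_true (the values below are illustrative, not model outputs, and the _demo names are hypothetical). Each entry has shape (1, 2), like the arrays appended in the loop above. The same two lines can be applied to the real Y_test_proba and Y_true.

```python
import numpy as np

# hypothetical probability vectors, one array of shape (1, 2) per email
Y_test_proba_demo = [
    np.array([[0.9, 0.1]]),
    np.array([[0.2, 0.8]]),
    np.array([[0.6, 0.4]]),
]
Y_true_demo = [0, 1, 1]  # hypothetical true labels

# np.argmax returns the index of the largest probability, i.e. the predicted class
y_pred_demo = [int(np.argmax(vect_proba)) for vect_proba in Y_test_proba_demo]

# share of emails whose predicted class matches the true label
accuracy_demo = float(np.mean(np.array(y_pred_demo) == np.array(Y_true_demo)))
print(y_pred_demo, accuracy_demo)
```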

# EXERCISE: PyTorch Hugging Face transformer on fetch_20newsgroups
In this exercise, we fine-tune a Hugging Face BERT model on the fetch_20newsgroups dataset. As a first step, we need to understand the fetch_20newsgroups dataset. Run the cell below to create the dataset. The cell execution may take a while.

In [None]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles',
              'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space',
              'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

dataset = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),categories=categories)

import numpy as np
import pandas as pd
len(dataset['target_names'])
target_values_to_target_labels = dict(zip(range(0,len(dataset['target_names'])),dataset['target_names']))

target_values = dataset.target.tolist()
target_labels = [target_values_to_target_labels[o] for o in target_values]

dataset_fetch_20newsgroups = pd.DataFrame({'text':dataset.data,'target_labels':target_labels})

target_class_to_encoder = dict(zip(dataset_fetch_20newsgroups['target_labels'].unique().tolist(),list(range(0,len(dataset_fetch_20newsgroups['target_labels'].unique())))))

dataset_fetch_20newsgroups['class_encoded'] = dataset_fetch_20newsgroups['target_labels'].apply(lambda x:target_class_to_encoder[x])
dataset_fetch_20newsgroups.head()

Here we show some texts labeled rec.autos:

In [None]:
dataset_fetch_20newsgroups.loc[dataset_fetch_20newsgroups['target_labels']=='rec.autos']['text'].reset_index().iloc[3,1]

In [None]:
dataset_fetch_20newsgroups.loc[dataset_fetch_20newsgroups['target_labels']=='rec.autos']['text'].reset_index().iloc[10,1]

1) Remove the URLs, the email addresses and the HTML tags from the 'text' column of the dataset_fetch_20newsgroups dataset. You can use the functions defined below.

In [None]:
import re

def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r'',data)
    return data

def _remove_urls(x):
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

def _remove_emails(x):
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)

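def remove_url(data):
    # remove http(s) links and bare www. addresses
    url_clean = re.compile(r"https?://\S+|www\.\S+")
    data = url_clean.sub(r'', data)
    return data

def remove_skip_line(x):
    # replace line breaks and backslashes with a space
    return re.sub(r'\r?\n|\\|\r', " ", x)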

In [None]:
dataset_fetch_20newsgroups['text'] = dataset_fetch_20newsgroups['text'].apply(lambda x:remove_html(str(x)))
#your code here

2) In this exercise, we show how to fine-tune a BERT model with the Trainer API. Behind the scenes, the Trainer API works with PyTorch.<br>
The Trainer API takes a PyTorch Hugging Face model as input, along with arguments such as
* the number of epochs,
* the weight decay,
* whether to load the best model at the end of training
* ...

Weights & Biases (wandb) is an experiment-tracking service that the Hugging Face Trainer API can report to. To run the Trainer without a wandb account, we set the environment variable WANDB_DISABLED to "true", which disables the wandb integration entirely. Run the cell below.

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

The Trainer API also takes a **Hugging Face dataset** as input. As a consequence, we have to convert the pandas DataFrame dataset_fetch_20newsgroups into a Hugging Face dataset.<br>
The target column of a Hugging Face dataset should be named label; otherwise the Trainer API does not work as expected.
* Rename the 'class_encoded' column of dataset_fetch_20newsgroups to label.
* Then convert dataset_fetch_20newsgroups into a Hugging Face dataset with the Dataset.from_pandas method. Call this Hugging Face dataset dataset_fetch_20_for_Trainer_API.

In [None]:
#your code here
# rename the 'class_encoded' column to 'label'
dataset_fetch_20newsgroups = 


from datasets import Dataset
#convert dataset_fetch_20newsgroups into a hugging-face dataset
dataset_fetch_20_for_Trainer_API = 

3) Split dataset_fetch_20_for_Trainer_API into a train set and a test set with test_size=0.15. Use the train_test_split method of the Hugging Face dataset, not the train_test_split function from sklearn.

In [None]:
#we create the training and the test dataset. the test_size is 15% of the whole dataset
train_test_dataset = dataset_fetch_20_for_Trainer_API.train_test_split(test_size=0.15)

4) The goal of this exercise is to fine-tune a BERT cased model with the Trainer API. To preprocess our data, we need a tokenizer. A tokenizer splits the text into tokens (these can be characters, words or parts of words). For this task, we use the tokenizer of the pre-trained model we selected (bert-base-cased).<br>
* Import the tokenizer designed for the bert-base-cased model from the Hugging Face Hub. Call this tokenizer tokenizer.
* Use the tokenizer you have just imported to preprocess train_test_dataset, the split dataset from question 3. Store the preprocessed dataset in train_test_dataset_tokenize.
* Extract the training set from train_test_dataset_tokenize into train_dataset.
* Extract the test set from train_test_dataset_tokenize into test_dataset.

In [None]:
# we load the tokenizer from the Hugging Face Hub, then tokenize the dataset
from transformers import AutoTokenizer
# complete the code
tokenizer = 


The most important features created during the tokenization step are 'attention_mask', 'input_ids', 'label' and 'token_type_ids'. We no longer need the text feature for training, so you can remove the text column from train_dataset and test_dataset later (question 9 still reads test_dataset['text'], so keep a copy of the texts if you remove it).
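
To give an intuition of how a tokenizer can split a word into parts of words, here is a toy greedy longest-match-first split in the spirit of WordPiece, the subword scheme used by BERT tokenizers. It is a sketch only: the vocabulary is tiny and hypothetical, the function name toy_wordpiece is an assumption, and the real bert-base-cased tokenizer has its own vocabulary and extra normalization steps.

```python
def toy_wordpiece(word, vocab, unk="[UNK]"):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# hypothetical vocabulary, for illustration only
vocab = {"un", "##install", "##ing", "install", "mac", "##book"}

print(toy_wordpiece("uninstalling", vocab))
print(toy_wordpiece("macbook", vocab))
print(toy_wordpiece("xcode", vocab))
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover text it has never seen as whole words.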

5) Load the transformer model paired with the tokenizer. We have to set num_labels=20 since the label has 20 classes. The model must be a PyTorch model, so we use AutoModelForSequenceClassification. Store it in a variable named model.

In [None]:
# we call the transformer model paired with the tokenizer. We have to set num_labels to 20
from transformers import AutoModelForSequenceClassification

# complete the code
model = 

6) The Trainer API reads its training arguments from a TrainingArguments object. You can use the default configuration with TrainingArguments("test_trainer"). Otherwise, you can set the following arguments:
* num_train_epochs
* weight_decay=0.01
* load_best_model_at_end

In [None]:
#we set up the TrainingArguments
from transformers import TrainingArguments

training_args = 

7) Call the Trainer API with your model, the arguments you defined in question 6 and train_dataset from question 4. Train your model and test it with test_dataset from question 4.

In [None]:
from transformers import Trainer

# complete the code
trainer = 

#code to train the model


8) In this question we test our fine-tuned model on a new text. To do so, we tokenize the new text with the tokenizer defined in question 4. We then apply our model to the preprocessed text to get a probability vector stored in a torch.Tensor object.

In [None]:
text = """
The first thing is first. 
If you purchase a Macbook, you should not encounter performance issues that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you will need to make some adjustments. 
Having too many background apps running in the background is one of the most common causes. 
The same can be said about a lack of drive storage. 
For that, it helps if you uninstall xcode and other unnecessary applications, as well as temporary system junk like caches and old backups.
"""

In [None]:
#tokenize this new text 
# send the inputs to the device the model lives on (GPU if available, else CPU)
token_input = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt").to(model.device)

In the cell below we apply the model to the tokenized text and get a probability vector stored in a torch.Tensor object.

In [None]:
probs = model(**token_input)[0].softmax(1)
probs,type(probs)

Here, we convert the torch.Tensor object into a numpy object. When the tensor probs lives on a CUDA device (that is, in GPU memory), we first move it to CPU memory and then convert it into a numpy object.

In [None]:
probs_numpy = probs.cpu().detach().numpy()
probs_numpy,type(probs_numpy)

Here, we create a python dictionary to convert an encoded class into a class name.

In [None]:
encoder_to_class_name = dict(zip(list(range(0,len(dataset_fetch_20newsgroups['target_labels'].unique()))),dataset_fetch_20newsgroups['target_labels'].unique().tolist()))
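
The two dictionaries used in this exercise (class name to integer code, and back) can be illustrated on a handful of labels. This is a small sketch with a hypothetical label list; dict.fromkeys keeps the order of first appearance, which mirrors what pandas unique() does in the cell above.

```python
# hypothetical labels, for illustration only
labels = ['rec.autos', 'sci.med', 'rec.autos', 'sci.space']

# distinct labels in order of first appearance
unique_labels = list(dict.fromkeys(labels))

# class name -> integer code, and the inverse mapping
class_to_code = {name: code for code, name in enumerate(unique_labels)}
code_to_class = {code: name for name, code in class_to_code.items()}

encoded = [class_to_code[name] for name in labels]
print(encoded, code_to_class[2])
```

Because both mappings are built from the same ordered list, decoding a predicted code returns the class name that was encoded with it.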

In a last step, we display the predicted class of the new text. We use the numpy.argmax function.

In [None]:
encoder_to_class_name[np.argmax(probs_numpy)]

9) In this part we explain how to predict the classes of a whole test dataset. A Hugging Face PyTorch model consumes a lot of GPU memory. When we predict on a whole dataset with it, we free memory as we go with the gc (garbage collection) package.<br><br>
In the cell below, we loop over the whole test set. At each step of the loop we call the tokenizer and the model, and we convert the resulting PyTorch tensor into a numpy array to release some GPU memory.<br><br>
**CUDA** (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) that allows software to use certain types of graphics processing units (GPUs) for general-purpose processing.<br><br>

Y_test_proba in the cell below is a list of numpy probability vectors.

In [None]:
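import gc
import torch

Y_true = test_dataset['label']
Y_test_email = test_dataset['text']

Y_test_proba = []

model.eval()  # switch off dropout for inference
for text in Y_test_email:
    # send the inputs to the device the model lives on (GPU if available, else CPU)
    token_output = tokenizer(text, padding="max_length", truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():  # do not build the autograd graph, which saves memory
        Y_test_proba.append(model(**token_output)[0].softmax(1).cpu().numpy())
    del token_output
    gc.collect()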

In [None]:
import numpy as np
y_pred = [np.argmax(vect_proba) for vect_proba in Y_test_proba]

We compute the accuracy score:

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_true,y_pred)