# Introduction
In this notebook we use the **Transformers library**, developed by Hugging Face, an NLP-focused startup with a large open-source community. **Transformers** is a Python library that exposes an API for many well-known transformer architectures, such as BERT, RoBERTa, GPT-2 and DistilBERT.<br><br>
In this chapter we use a Hugging Face BERT model to classify emails as spam or not spam. BERT is a bidirectional transformer pretrained with a combination of a masked language modeling objective and next sentence prediction on a large corpus comprising the Toronto Book Corpus and Wikipedia.<br><br>
**In this notebook, we show how to fine-tune a TensorFlow Hugging Face transformer.**

# Functions for preprocessing text data

In this section, we give some functions for cleaning text data:
* One function to remove HTML tags
* One function to remove URLs
* One function to remove emails

In principle, cleaning the text is not necessarily beneficial when doing text mining with Hugging Face transformers. In this notebook, however, I chose to clean the text data lightly.

In [1]:
import re

def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r'',data)
    return data

def _remove_urls(x):
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

def _remove_emails(x):
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)
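
To see what the three cleaning functions do, here is a minimal sketch that applies them in sequence to a made-up string. The sample text and the final whitespace normalization are assumptions added for illustration; the regular expressions are the ones defined above.

```python
import re

def remove_html(data):
    # Remove anything that looks like an HTML tag, e.g. <b> or </p>
    return re.sub(r'<.*?>', '', data)

def _remove_urls(x):
    # Remove http, https, ftp and ssh URLs
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', x)

def _remove_emails(x):
    # Remove lowercase email addresses
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', "", x)

# Hypothetical sample text, not taken from the dataset
sample = "Hello <b>world</b>, visit https://example.com/page?x=1 or mail me at john.doe@example.com now"

cleaned = _remove_emails(_remove_urls(remove_html(sample)))
# Collapse the double spaces left behind by the removals
cleaned = " ".join(cleaned.split())
print(cleaned)
```

Note that the email pattern only matches lowercase letters, so addresses containing uppercase characters may be left in place. This is not an issue if the text has already been lowercased.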

# Fine-tuning a pretrained model using a Hugging Face transformer with TensorFlow
In this section, we fine-tune a **BERT** model from Hugging Face with **TensorFlow**. We use a BERT cased model to detect spam emails. First, we import the dataset. The email texts are stored in the email column. The label column is equal to 0 for valid emails and to 1 for spam. Run the cells below to create the email dataset:

## Dataset creation
* First, we import the data as a pandas DataFrame
* Second, we convert the pandas DataFrame into a Hugging Face dataset
* Last, we split the Hugging Face dataset into a training set and a test set

In [2]:
# We first import the data as a pandas dataset
import pandas as pd
spam_ham = pd.read_csv("https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/DATA/spam_ornot_spam.csv")
spam_ham['email'] = spam_ham['email'].apply(lambda x:remove_html(str(x)))
spam_ham['email'] = spam_ham['email'].apply(lambda x:_remove_urls(str(x)))
spam_ham['email'] = spam_ham['email'].apply(lambda x:_remove_emails(str(x)))
spam_ham.head()

spam_ham['email'] = spam_ham['email'].apply(lambda x:' '.join(x.split()[0:512]))

In [4]:
!pip install datasets  # Only needed in Colab; datasets is already installed in Kaggle
from datasets import load_dataset
from datasets import Dataset
from datasets import DatasetDict

# Secondly, we convert the pandas dataset into a hugging-face dataset
dataset = Dataset.from_pandas(spam_ham)

# Last, we split the hugging-face dataset into a training set and a test set
dataset_train_test = dataset.train_test_split(test_size=0.15)
dataset_train_test

DatasetDict({
    train: Dataset({
        features: ['email', 'label'],
        num_rows: 2550
    })
    test: Dataset({
        features: ['email', 'label'],
        num_rows: 450
    })
})

In the above cell, dataset_train_test is a Hugging Face dataset. In the following cells, we show how to
* access the train part of dataset_train_test
* access the email feature of the train part
* access the label feature of the train part

In [5]:
#access the train part of dataset_train_test
dataset_train_test['train']

Dataset({
    features: ['email', 'label'],
    num_rows: 2550
})

In [6]:
#access the email feature of the train part
dataset_train_test['train']['email'][0:2]

['how about this a bored forker said yawn i believe tom had it right signal not noise i ll start los angeles times september NUMBER NUMBER monday copyright NUMBER los angeles times los angeles times september NUMBER NUMBER monday home edition section main news main news part NUMBER page NUMBER national desk length NUMBER words byline dana calvo times staff writer dateline seattle body the idea came at the end of a long frustrating brown bag session at a public policy think tank here the challenge was to save the city s child care programs staring into his empty coffee cup the meeting coordinator s mind landed on an unlikely solution put a tax just a benign dime a shot on espresso that led to a petition signed by more than NUMBER NUMBER seattle residents and next year voters will decide whether the tax becomes law one that taps right into seattle s legendary addiction to coffee this is after all the town where starbucks was born and where the NUMBER pound of beans became a staple there 

In [7]:
#access the label feature of the train part
dataset_train_test['train']['label'][0:2]

[0, 0]

## Tokenization and vectorization of the text
Recall that our goal is to fine-tune a BERT cased model from the Hugging Face hub. A Hugging Face model has two components:
* the tokenizer component is responsible for the text vectorization
* the model component is a pre-trained transformer that takes the tokenizer output as an input

The tokenizer and model should always be paired. In the cell below, we:
* import the tokenizer designed for the bert-base-cased model
* vectorize the texts from the emails

In [9]:
# We import the tokenizer designed for the bert-cased model

#Install the transformers library (only needed in Colab)
!pip install transformers
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# We vectorize the texts from the emails
def tokenize_function(examples):
    return tokenizer(examples["email"], padding="max_length", truncation=True)

tokenized_datasets = dataset_train_test.map(tokenize_function, batched=True)

In [12]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'email', 'input_ids', 'label', 'token_type_ids'],
        num_rows: 2550
    })
    test: Dataset({
        features: ['attention_mask', 'email', 'input_ids', 'label', 'token_type_ids'],
        num_rows: 450
    })
})

attention_mask, input_ids and token_type_ids are vector features built by the bert-base-cased tokenizer from the email feature. The model uses these features, together with the label feature, during the training stage we implement later.
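
To make these three features concrete, here is a minimal sketch based on a hypothetical toy vocabulary and a hypothetical helper `toy_encode`. It is not the real WordPiece tokenizer (the ids and the word-level splitting are assumptions). It only mimics the effect of `padding="max_length"` and `truncation=True` on a single sequence.

```python
# Hypothetical special-token ids and vocabulary, for illustration only
PAD, UNK, CLS, SEP = 0, 1, 2, 3
vocab = {"free": 4, "dvds": 5, "join": 6, "now": 7}

def toy_encode(text, max_length=8):
    # Map each word to an id and add the start/end special tokens
    ids = [CLS] + [vocab.get(word, UNK) for word in text.split()] + [SEP]
    # Truncation: cut to max_length while keeping the final separator
    if len(ids) > max_length:
        ids = ids[:max_length - 1] + [SEP]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        # Token ids, padded with PAD up to max_length
        "input_ids": ids + [PAD] * n_pad,
        # 1 for real tokens, 0 for padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
        # Segment ids: all 0 for a single sequence
        "token_type_ids": [0] * max_length,
    }

encoded = toy_encode("free dvds join now")
for name, values in encoded.items():
    print(name, values)
```

With the real tokenizer the principle is the same, but the sequences are 512 tokens long and the words are split into subword pieces.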

## Convert the Hugging Face dataset into a standard tf.data.Dataset
Since we will use a Hugging Face TensorFlow model, we have to convert tokenized_datasets (a Hugging Face dataset) into a standard tf.data.Dataset. We first extract the training dataset and the test dataset from tokenized_datasets in the cell below:

In [10]:
train_dataset = tokenized_datasets['train']
test_dataset = tokenized_datasets['test']  

Since the training stage only needs the features attention_mask, input_ids, token_type_ids and label, we drop the email feature. We also call with_format("tensorflow") so that the columns are returned as TensorFlow tensors:

In [11]:
tf_train_dataset = train_dataset.remove_columns(["email"]).with_format("tensorflow")
tf_test_dataset = test_dataset.remove_columns(["email"]).with_format("tensorflow")

Then we convert everything into large tensors and use the tf.data.Dataset.from_tensor_slices method. In a Kaggle notebook, you have to change
* train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names} into train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names}
* test_features = {x: tf_test_dataset[x] for x in tokenizer.model_input_names} into test_features = {x: tf_test_dataset[x].to_tensor() for x in tokenizer.model_input_names}

In [12]:
import tensorflow as tf
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
#train_features = {x: tf_train_dataset[x].to_tensor() for x in tokenizer.model_input_names} for kaggle notebook
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset["label"]))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(8)

test_features = {x: tf_test_dataset[x] for x in tokenizer.model_input_names}
#test_features = {x: tf_test_dataset[x].to_tensor() for x in tokenizer.model_input_names} for kaggle notebook
test_tf_dataset = tf.data.Dataset.from_tensor_slices((test_features, tf_test_dataset["label"]))
test_tf_dataset = test_tf_dataset.batch(8)

In [13]:
test_tf_dataset

<BatchDataset shapes: ({input_ids: (None, 512), token_type_ids: (None, 512), attention_mask: (None, 512)}, (None,)), types: ({input_ids: tf.int64, token_type_ids: tf.int64, attention_mask: tf.int64}, tf.int64)>

## Fine-tuning the Hugging Face bert-base-cased model with TensorFlow
To fine-tune a Hugging Face model, we have to:
* Import the bert-base-cased pre-trained TensorFlow model from the Hugging Face hub using the transformers library
* Compile this TensorFlow model
* Fit this TensorFlow model

In the cell below, we import the bert-base-cased model from Hugging Face. Since we have 2 classes (spam and not spam), we have to set num_labels=2.

In [14]:
import tensorflow as tf
#!pip install transformers
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We compile the TensorFlow model as we always do with Keras. Here we use:
* the Adam optimizer with learning_rate=5e-5
* the SparseCategoricalCrossentropy loss function
* the SparseCategoricalAccuracy() metric

To fine-tune a TensorFlow Hugging Face model properly with an Adam optimizer, it is recommended to use a small learning rate such as 5e-5.
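
The compile cell in this section passes `from_logits=True` to the loss, because the model returns raw scores (logits) rather than probabilities. As a rough illustration of what that loss computes, here is a NumPy sketch on hypothetical logits and labels. The function name and the data are assumptions, and the sketch ignores Keras details such as sample weights.

```python
import numpy as np

def sparse_categorical_crossentropy_from_logits(logits, labels):
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Numerically stable log-softmax: subtract the row maximum first
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Select the log-probability of the true class for each example
    picked = log_probs[np.arange(len(labels)), labels]
    # Average the negative log-likelihood over the batch
    return -picked.mean()

# Hypothetical batch: two emails, two classes (0 = valid, 1 = spam)
logits = [[2.0, -1.0], [0.0, 0.0]]
labels = [0, 1]
loss = sparse_categorical_crossentropy_from_logits(logits, labels)
print(loss)
```

The labels are plain integers (hence "sparse"), which matches the 0/1 label column of the dataset; no one-hot encoding is needed.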

In [15]:
#tf.keras.metrics.Precision()
import tensorflow as tf
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.metrics.SparseCategoricalAccuracy()]
)

#model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=3)
model.fit(train_tf_dataset, epochs=4)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f6d9023d4d0>

We evaluate the model on the test set. The output contains the loss followed by the SparseCategoricalAccuracy metric.

In [16]:
model.evaluate(test_tf_dataset)



[0.04701230302453041, 0.9866666793823242]

## Prediction on the test set 
We predict the probability vectors of the test set test_tf_dataset.
* First, we call model.predict on test_tf_dataset and read the logits attribute of the result.
* Second, we apply a softmax function to turn the logits into an array of probabilities, which we convert to a NumPy array.
* Third, we get the predicted classes with the NumPy function np.argmax.

In [17]:
import tensorflow as tf
import numpy as np

#First, we call model.predict on test_tf_dataset and read the logits
pred_test_set = model.predict(test_tf_dataset).logits

#Second, we apply a softmax to get an array of probabilities and convert it to a NumPy array
proba_test_set = tf.nn.softmax(pred_test_set, axis=1).numpy()

#Third, we get the predicted classes with the NumPy function np.argmax
classe_test_set = np.argmax(proba_test_set,axis=1)
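
The softmax and argmax steps can be reproduced with plain NumPy. The sketch below uses hypothetical logits for three emails, so the values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical logits for three emails and two classes (0 = valid, 1 = spam)
logits = np.array([[3.0, -2.0],
                   [-1.0, 1.5],
                   [0.2, 0.1]])

# Softmax along the class axis (subtracting the row maximum for stability)
shifted = logits - logits.max(axis=1, keepdims=True)
exp = np.exp(shifted)
probs = exp / exp.sum(axis=1, keepdims=True)

# The predicted class is the index of the largest probability in each row
classes = probs.argmax(axis=1)

print(probs)
print(classes)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits directly gives the same classes. The softmax step matters when the probabilities themselves are needed, for instance to compute an AUC.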

We compute the precision. The precision is the ratio tp / (tp + fp), where tp is the number of true positives and fp the number of false positives.

In [18]:
m_precision = tf.keras.metrics.Precision()
m_precision.update_state(test_dataset['label'],classe_test_set.tolist())
m_precision.result().numpy()

0.9411765

We compute the recall. The recall is the ratio tp / (tp + fn), where fn is the number of false negatives.

In [19]:
m_recall = tf.keras.metrics.Recall()
m_recall.update_state(test_dataset['label'],classe_test_set.tolist())
m_recall.result().numpy()

0.9876543
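
The two ratios can be checked by hand on a small example. The labels and predictions below are hypothetical and only serve to show how tp, fp and fn are counted, with spam (label 1) as the positive class.

```python
# Hypothetical true labels and predicted classes (1 = spam, 0 = valid email)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # valid email flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, fp, fn)
print(precision, recall)
```

A high precision means few valid emails are wrongly flagged as spam, while a high recall means few spam emails slip through.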

We compute the AUC score.

In [20]:
m_auc = tf.keras.metrics.AUC()
m_auc.update_state(test_dataset['label'], proba_test_set[:,1])
m_auc.result().numpy()

0.99826014
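
One way to read the AUC: it is the probability that a randomly chosen spam email receives a higher spam score than a randomly chosen valid email. The sketch below computes this pairwise definition exactly on hypothetical scores. The helper name `pairwise_auc` is an assumption. Keras approximates the area with a fixed number of thresholds (200 by default), so its value can differ slightly from the exact one.

```python
def pairwise_auc(y_true, scores):
    # Split the scores by true class
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count for half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and spam probabilities (the role played by proba_test_set[:,1])
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```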

## How to use the fine-tuned model on a new text
In this part, we try the fine-tuned model on a new email. We first have to tokenize this new email, and then make the tokenizer output compatible with TensorFlow.


In [22]:
email_test = ["""martin a posted tassos papadopoulos the greek sculptor behind 
                the plan judged that the limestone of mount kerdylio NUMBER 
                miles east of salonika and not far from the mount athos monastic 
                community was ideal for the patriotic sculpture as well as alexander s 
                granite features NUMBER ft high and NUMBER ft wide a museum a restored 
                amphitheatre and car park for admiring crowds are planned so is 
                this mountain limestone or granite if it s limestone it ll weather 
                pretty fast yahoo groups sponsor NUMBER dvds free s p join now URL 
                to unsubscribe from this group send an email to forteana unsubscribe 
                URL your use of yahoo groups is subject to URL"""]

#We tokenize the new email with the tokenizer we defined earlier in this example.
email_encoding = tokenizer(email_test, padding="max_length", truncation=True)
#token_email

We convert the tokenizer output into a tf.data.Dataset object. Then we apply the .batch(1) method to make email_data suitable for the fine-tuned model.

In [23]:
import tensorflow as tf

email_data = tf.data.Dataset.from_tensor_slices((dict(email_encoding)))
email_data_batch = email_data.batch(1)

Then, we can apply the fine-tuned model to the TensorFlow dataset email_data_batch. Next, we convert the resulting logits into a probability vector with a softmax function. The result is converted into a NumPy array.

In [24]:
#predict
pred = model.predict(email_data_batch).logits

#transform to array with probabilities
res = tf.nn.softmax(pred, axis=1).numpy()
res

array([[0.99655104, 0.00344894]], dtype=float32)

Now, we display the prediction of the new text.

In [25]:
import numpy as np
classe_email = ['valid email','spam']
classe_email[np.argmax(res)]

'valid email'

# EXERCISE: TensorFlow Hugging Face transformer on fetch_20newsgroups
Before doing this exercise, I recommend that you carefully read the notebook called Transformers application with hugging face. In this exercise we fine-tune a Hugging Face BERT model on the fetch_20newsgroups dataset. As a first step, we have to understand the fetch_20newsgroups dataset. Run the cell below to create the dataset. The cell execution could take a while.

In [None]:
from sklearn.datasets import fetch_20newsgroups

# The 20 newsgroup categories, each listed once
categories = ['alt.atheism','comp.graphics','comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware','comp.windows.x','misc.forsale','rec.autos','rec.motorcycles',
              'rec.sport.baseball','rec.sport.hockey','sci.crypt','sci.electronics','sci.med','sci.space',
              'soc.religion.christian','talk.politics.guns','talk.politics.mideast','talk.politics.misc','talk.religion.misc']

dataset = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'),categories=categories)

import numpy as np
import pandas as pd
len(dataset['target_names'])
target_values_to_target_labels = dict(zip(range(0,len(dataset['target_names'])),dataset['target_names']))

target_values = dataset.target.tolist()
target_labels = [target_values_to_target_labels[o] for o in target_values]

dataset_fetch_20newsgroups = pd.DataFrame({'text':dataset.data,'target_labels':target_labels})
dataset_fetch_20newsgroups.head()

1) Now, we work with the dataset_fetch_20newsgroups DataFrame. How many unique labels does dataset_fetch_20newsgroups have?

In [None]:
#your code here

2) How many rows of dataset_fetch_20newsgroups have the 'target_labels' column equal to comp.sys.mac.hardware? Print a row with a comp.sys.mac.hardware label.

In [None]:
#your code here

3) Use the sklearn LabelEncoder object to encode the target_labels column into a new column called target_encoded.

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#your code here 
dataset_fetch_20newsgroups['target_encoded'] =

4) Remove the URLs, the email addresses and the HTML tags from the column 'text' of the dataset_fetch_20newsgroups dataset. You can use the functions defined below.

In [None]:
import re

def remove_html(data):
    html_tag=re.compile(r'<.*?>')
    data=html_tag.sub(r'',data)
    return data

def _remove_urls(x):
    return re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , x)

def _remove_emails(x):
    return re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", x)

def remove_url(data):
    url_clean= re.compile(r"https://\S+|www\.\S+")
    data=url_clean.sub(r'',data)
    return data

def remove_skip_line(x):
    # Replace line breaks (\r\n, \n, \r) and backslashes with a space
    return re.sub(r'\r?\n|\\|\r'," ", x)

5) The datasets library provided by Hugging Face allows us to preprocess data efficiently and to make the data compatible with a pretrained Hugging Face model. The dataset dataset_fetch_20newsgroups is a pandas object.
* Load dataset_fetch_20newsgroups as a Hugging Face dataset object. For this purpose, use Dataset.from_pandas.
* Store the result in dataset_fetch20. dataset_fetch20 is a Hugging Face dataset object.
* Remove the column 'target_labels'

In [None]:
#your code here
from datasets import Dataset

6) Our ultimate goal is to fine-tune a bert-base-uncased model from the Hugging Face hub. When we use a Hugging Face model, the tokenizer must match the model.
* Import a tokenizer compatible with a bert-base-uncased model.
* Tokenize the column 'text' from the dataset_fetch20 dataset. You have to use something like tokenized_datasets = dataset_fetch20.map(tokenize_function, batched=True).
* Create a training and test set using the train_test_split method from the datasets package. Store the result in a Hugging Face dataset called train_test_dataset. You should set the argument test_size=0.15.

In [None]:
#your code here
from transformers import AutoTokenizer

7) The train_test_dataset you created in the previous question has a text column we don't need anymore. The BERT model uses only 'attention_mask', 'input_ids', 'target_encoded' and 'token_type_ids', where 'target_encoded' holds the target values. Remove the 'text' column from train_test_dataset and make it compatible with TensorFlow using with_format("tensorflow").

In [None]:
#complete the code here
train_test_dataset = train_test_dataset.remove_columns

8) Our goal is to fine-tune a bert-base-uncased model using tensorflow.keras. We have to convert train_test_dataset['train'] and train_test_dataset['test'] into large tensors and wrap them with the tf.data.Dataset.from_tensor_slices method.
* You have to use the tokenizer you defined in question 6.
* The dataset built from train_test_dataset['train'] should be stored in train_tf_dataset
* The dataset built from train_test_dataset['test'] should be stored in test_tf_dataset

In [None]:
tf_train_dataset = train_test_dataset['train']
tf_test_dataset = train_test_dataset['test']

In [None]:
#your code here. Use tf_train_dataset and tf_test_dataset from the previous cell


9) Import the bert-base-uncased model from the Hugging Face hub. Use the result of question 1 to set the value of the num_labels parameter. For this purpose, use the TFAutoModelForSequenceClassification object from the transformers library. Call your pre-trained model model_text.

In [None]:
from transformers import TFAutoModelForSequenceClassification
#your code here

model_text = 

10) Compile model_text such that
* the optimizer argument is set to tf.keras.optimizers.Adam(learning_rate=5e-5)
* the metric is either the acc metric or the tf.metrics.SparseCategoricalAccuracy() metric.

Train your model with the train_tf_dataset you made in question 8. Evaluate your model on test_tf_dataset.

In [None]:
import tensorflow as tf


11) This part is not a question. We just show the code that applies model_text to a new text to classify.

In [None]:
text = ["""
The first thing is first. 
If you purchase a Macbook, you should not encounter performance issues that will prevent you from learning to code efficiently.
However, in the off chance that you have to deal with a slow computer, you will need to make some adjustments. 
Having too many background apps running in the background is one of the most common causes. 
The same can be said about a lack of drive storage. 
For that, it helps if you uninstall xcode and other unnecessary applications, as well as temporary system junk like caches and old backups.
"""]

In [None]:
import tensorflow as tf
import numpy as np

#tokenize the text
encodings = tokenizer(text, padding="max_length", truncation=True)

#transform to tf.Dataset
text_data_tf = tf.data.Dataset.from_tensor_slices((dict(encodings)))

#predict
preds = model_text.predict(text_data_tf.batch(1)).logits

#transform to array with probabilities
res = tf.nn.softmax(preds, axis=1).numpy()

#We show the predicted class
le.inverse_transform([np.argmax(res)])

12) This part is not a question. We predict the probability vectors of the test set test_tf_dataset.
* First, we call model_text.predict on test_tf_dataset and read the logits attribute of the result.
* Second, we apply a softmax function to turn the logits into an array of probabilities, which we convert to a NumPy array.
* Third, we get the predicted classes with the NumPy function np.argmax.

In [None]:
import tensorflow as tf
import numpy as np

pred_test_set = model_text.predict(test_tf_dataset).logits

#Second, we apply a softmax to get an array of probabilities and convert it to a NumPy array
proba_test_set = tf.nn.softmax(pred_test_set, axis=1).numpy()

#Third, we get the predicted classes with the NumPy function np.argmax
classe_test_set = np.argmax(proba_test_set,axis=1)

We compute the accuracy on the predictions we've just done:

In [None]:
m_accuracy = tf.keras.metrics.Accuracy()
m_accuracy.update_state(tf_test_dataset['target_encoded'],classe_test_set.tolist())
m_accuracy.result().numpy()

Here, we give the class names of the predictions using the LabelEncoder:

In [None]:
# inverse_transform accepts the whole array of predicted class indices at once
classe_test_set_class_text = le.inverse_transform(classe_test_set).tolist()
classe_test_set_class_text[0:3]