# Text classification with BERT

BERT (Bidirectional Encoder Representations from **Transformers**) is an NLP model released by Google in 2018. It comes pre-trained on a large unlabeled text corpus: about 2,500M words from English Wikipedia, plus about 800M words from BooksCorpus.

![bert](https://www.advisa.fr/wp-content/uploads/2019/10/google-bert-algorithm.jpg)

To accomplish a particular NLP task, the pre-trained BERT model is used as a base and refined by adding an additional layer; the model can then be trained on a labeled data set dedicated to the NLP task to be performed. This is the very principle of **transfer learning**. It is important to note that BERT is a very large model with 12 layers, 12 attention heads and 110 million parameters (BERT base).

BERT and the models derived from it have been applied to tasks such as:

* translation
* text generation
* classification
* question-answering
* syntax analysis (tagging, parsing) 

**Why BERT?**

Just look at the benchmark leaderboards: many of the top-ranked models are variants of BERT.

https://gluebenchmark.com/leaderboard

## Let's go!

To use BERT you need either PyTorch or TensorFlow installed in your environment. A GPU is strongly recommended, because fine-tuning BERT on a CPU is very slow. If you don't have a GPU, use Google Colab.

**Exercise :** Use tensorflow or pytorch to check if you have a GPU.
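
One possible way to run this check is sketched below. It tries PyTorch first and falls back to TensorFlow. The helper name `gpu_available` is hypothetical, and the snippet assumes a standard install of at least one of the two frameworks.

```python
def gpu_available():
    """Return True if PyTorch or TensorFlow can see at least one GPU."""
    try:
        import torch

        if torch.cuda.is_available():
            return True
    except ImportError:
        pass  # PyTorch is not installed; try TensorFlow next

    try:
        import tensorflow as tf

        return bool(tf.config.list_physical_devices("GPU"))
    except ImportError:
        return False  # neither framework is installed


print("GPU available:", gpu_available())
```

On Google Colab, remember to select a GPU runtime before running the check.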





Next, let's install the [transformers](https://github.com/huggingface/transformers) package from Hugging Face. It provides pre-trained models such as BERT, with implementations for PyTorch and TensorFlow.


``!pip install transformers``



## Load the data

The dataset comes from Odile. She's a bot that tries to answer general questions on a few BeCode Discord servers. The sentences all come from conversations between learners and Odile on Discord.

**Exercise :** Import ``'./dataset/odile_data.csv'`` file into a dataframe.

## Analyze the dataset ! 

It's time to take a quick look at our data. 

**Exercise :** You must answer the following questions: 
* How many observations does the dataset contain?
* How many different labels does the dataset contain?
* Which labels contain the most observations?
* Which labels contain the fewest observations?
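
The four questions above can be answered with a few pandas calls. The sketch below is only one way to do it: the helper `summarize_labels` is hypothetical, and the column name `label` is an assumption, so adapt it to the actual columns of the CSV.

```python
import pandas as pd


def summarize_labels(df, label_col="label", top=3):
    """Answer the four questions above for a dataframe with a label column."""
    counts = df[label_col].value_counts()  # sorted from most to least frequent
    return {
        "n_observations": len(df),
        "n_labels": int(counts.size),
        "most_frequent": counts.head(top).to_dict(),
        "least_frequent": counts.tail(top).to_dict(),
    }


# Usage, assuming the label column is called "label" (hypothetical name):
# df = pd.read_csv("./dataset/odile_data.csv")
# print(summarize_labels(df))
```

Labels with very few observations are worth noting now, because they will be hard for the model to learn.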

## It's time to clean up !

Not all NLP tasks require the same preprocessing. In this case, we have to ask ourselves some questions: 

- Are there unwanted characters in the dataset? For example, do you want to keep the smileys or not?  
  - If, for example, you want to create labels to analyze feelings, it might be relevant to keep the smileys.
- Is it relevant to keep capital letters in sentences?
  - In this case, capital letters don't really matter. On the one hand, not everyone starts their sentences with a capital letter when chatting. On the other hand, the sentences are quite short and addressed directly to Odile.
- Is it necessary to limit the length of a sentence?
  - Again, in this case it may be preferable to limit the number of words. The questions asked to Odile are supposed to be short, and overly long sentences could interfere with the classification if they contain too much information.

There is no universal answer. Everything will depend on the expected result. 

**Exercise :** Clean the dataset.
- Remove all unnecessary characters. You can choose to keep the smileys or not.
- Put all sentences in lower case.
- Limit text to 256 words.
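
A minimal cleaning sketch for the three steps above, using only the standard library. The character whitelist is an assumption: it keeps unaccented letters, digits and basic punctuation and drops everything else, smileys included, so adjust the pattern if you want to keep them or if the data contains accented characters. The name `clean_text` is hypothetical.

```python
import re

# Anything that is not a lowercase letter, digit, whitespace or basic
# punctuation is treated as unwanted (this also removes emoji).
_UNWANTED = re.compile(r"[^a-z0-9\s.,!?'-]")


def clean_text(text, max_words=256):
    """Lowercase the text, strip unwanted characters and cap the word count."""
    text = text.lower()
    text = _UNWANTED.sub(" ", text)
    words = text.split()  # splitting also collapses repeated whitespace
    return " ".join(words[:max_words])
```

With a dataframe, the function can be applied to the text column with `.apply(clean_text)`.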

## Label encoding
As you know, the machine needs words converted into numbers before it can process them. The same goes for labels. So we are going to create a dictionary that maps every label to a number. 

**Exercise :** Create a dictionary that contains all the labels and assigns a unique id to each one. (Of course, there should be no duplicates.)



**Exercise :** Create a column `id_label` in your dataframe and fill it with the ids of the labels.
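
As an illustration, the two exercises above could be solved along these lines. The helper names `build_label_dict` and `encode_labels` are hypothetical. Sorting the labels is a design choice that makes the ids reproducible from one run to the next.

```python
def build_label_dict(labels):
    """Map each distinct label to an integer id, in alphabetical order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}


def encode_labels(labels, label_dict):
    """Replace every label by its id."""
    return [label_dict[label] for label in labels]


label_dict = build_label_dict(["greeting", "bye", "greeting", "help"])
print(label_dict)
print(encode_labels(["help", "bye"], label_dict))
```

With pandas, and assuming the label column is called `label`, the new column can be created with `df["id_label"] = df["label"].map(label_dict)`.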

When we make our predictions, the model will return the label id as a prediction. So it may be useful to save your label dictionary to be able to reinterpret the label for a human later on. 

**Exercise:** Save your label dictionary with pickle (or other). 
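
A small sketch of saving and reloading the dictionary with `pickle`. The function names and the file path are placeholders. The inverse dictionary is included because predictions come back as ids and must be turned back into label names.

```python
import pickle


def save_label_dict(label_dict, path):
    """Write the label dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(label_dict, f)


def load_label_dict(path):
    """Read the label dictionary back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def invert_label_dict(label_dict):
    """Build the id -> label mapping used to read predictions."""
    return {idx: label for label, idx in label_dict.items()}


# Usage (hypothetical file name):
# save_label_dict(label_dict, "label_dict.pkl")
# id_to_label = invert_label_dict(load_label_dict("label_dict.pkl"))
```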

## Split your dataset !
After all this time, I dare to hope that it is not necessary to explain this step anymore!

**Exercise :** Create the variables X_train, X_test, y_train and y_test. 
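
For reference, a sketch of the split with scikit-learn, which is an assumption since the notebook does not name a library. The wrapper `split_dataset`, the 80/20 ratio and the seed are placeholders.

```python
from sklearn.model_selection import train_test_split


def split_dataset(texts, label_ids, test_size=0.2, seed=42):
    """Return X_train, X_test, y_train, y_test with a fixed random seed."""
    return train_test_split(
        texts,
        label_ids,
        test_size=test_size,
        random_state=seed,  # makes the split reproducible
        shuffle=True,
    )


# Usage, assuming columns named "text" and "id_label" (hypothetical names):
# X_train, X_test, y_train, y_test = split_dataset(df["text"], df["id_label"])
```

Passing `stratify=label_ids` to `train_test_split` keeps the label proportions identical in both sets, but scikit-learn raises an error if a label has only one observation.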


## Tokenization 
If you need a refresher on tokenization, look [here](../1.preprocessing/1.tokenization.ipynb).

We will use the tokenizer that ships with BERT. Its vocabulary was built when the model was pre-trained, so we must reuse it rather than build our own.

**Exercise :** Create a ``tokenizer`` variable with ``BertTokenizer.from_pretrained()`` from ``transformers``. You have to load the ``bert-base-uncased`` model. (Uncased means case-insensitive: the text is lowercased before tokenization.)

[Documentation](https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer)



Good! We have instantiated our tokenizer, but we have not yet encoded our sentences as vectors.
To do this we will use the method ``tokenizer.batch_encode_plus()``. This method converts each sentence into a vector of token ids and creates the attention mask.



**Exercise :** Create an ``encoded_data_train`` variable by calling `tokenizer.batch_encode_plus()`. First you have to specify the data, so pass the variable `X_train` (converted to a list of strings if it is a pandas Series).

You need to know 4 parameters. 

- **padding :** pads the shorter sequences so that all vectors have the same length. You can set it to True. The attention mask then marks which positions are padding.

- **return_attention_mask :** returns the attention mask alongside the token ids. Set it to True. The model uses this mask to ignore the padding tokens.
- **max_length :** Maximum length of the sequence, in tokens. You can set it to 256. Pass `truncation=True` as well so that longer sequences are explicitly cut to this length.
 
- **return_tensors :** Here depending on the framework you are using (Pytorch VS Tensorflow) you have to specify the type of tensors you want to return. 

  - For pytorch you have to specify "pt".
  - For tensorflow you have to specify "tf".
  - For a numpy array, you must indicate "np".


You must do the same for the test data set. 

**Exercise :** Create a `encoded_data_test` variable and do the same thing as above. 
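
The encoding step for both sets can be wrapped in one helper, sketched below. `encode_texts` is a hypothetical name, and `truncation=True` is an addition to the four parameters listed above so that `max_length` is applied explicitly. The helper takes the tokenizer as an argument and does not load it itself.

```python
def encode_texts(tokenizer, texts, max_length=256, return_tensors="pt"):
    """Encode sentences with the parameters described above."""
    return tokenizer.batch_encode_plus(
        list(texts),  # works for a list, a tuple or a pandas Series
        padding=True,  # pad shorter sentences to a common length
        truncation=True,  # explicitly cut sentences longer than max_length
        max_length=max_length,
        return_attention_mask=True,
        return_tensors=return_tensors,  # "pt", "tf" or "np"
    )


# Usage, assuming transformers is installed:
# from transformers import BertTokenizer
# tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# encoded_data_train = encode_texts(tokenizer, X_train)
# encoded_data_test = encode_texts(tokenizer, X_test)
```

In recent versions of transformers, calling the tokenizer directly, `tokenizer(texts, ...)`, accepts the same arguments and is the recommended form.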

If you do `print(encoded_data_train)`, you will see we have a dictionary with the following keys: `'input_ids'`, `'token_type_ids'` and `'attention_mask'`.

* **input_ids :** The sentence represented as a vector. The input_ids are the vocabulary indices corresponding to each token in our sentence.

* **attention_mask :** It indicates which tokens the model should pay attention to (1) and which ones are padding to be ignored (0).

* **token_type_ids :** Distinguishes the two sequences when the input is a pair of sentences (for example a question and a context). We will not use it in this case.  
 But you can find more information by following this [link](https://huggingface.co/transformers/glossary.html#token-type-ids)
 

**Exercise :** print ``encoded_data_train['input_ids']`` and ``encoded_data_train['attention_mask']``

## Prepare the dataset
Whether you use PyTorch or TensorFlow, we have to prepare the datasets (put simply, convert our data to tensors). 

We need to convert `y_train` and `y_test` into tensors. For PyTorch use ``torch.tensor()``, and for TensorFlow use ``tf.constant()`` (``tf.tensor()`` does not exist).

**Exercise :** Create a variable `labels_train` and create a tensor with `y_train`.


**Exercise :** Create a variable `labels_test` and create a tensor with `y_test`.

Define the batch size.  

**Exercise:** Create a `batch_size` variable. The right value depends on several factors, such as the memory of your graphics card. If your graphics card is not very powerful, I advise a small batch size such as 8. 

Now we need to wrap our encoded data and labels into a dataset object.

**Exercise :** Create the ``dataset_train`` and ``dataset_test`` variables from ``encoded_data_train`` and ``encoded_data_test``, together with the matching labels.

**PYTORCH  :** [Use torch.utils.data.Dataset class](https://classyvision.ai/tutorials/classy_dataset)  
**Tensorflow :** [Use tf.data.Dataset.from_tensor_slices](https://medium.com/when-i-work-data/converting-a-pandas-dataframe-into-a-tensorflow-dataset-752f3783c168)
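
For PyTorch users, a map-style dataset could look like the sketch below. The class name `EncodedDataset` is hypothetical. Each item is a dictionary whose `labels` key is what `BertForSequenceClassification` expects in order to compute the loss. The import fallback is only there so the sketch can be read and run without PyTorch installed.

```python
try:
    from torch.utils.data import Dataset  # the usual base class
except ImportError:
    Dataset = object  # fallback so the sketch still runs without PyTorch


class EncodedDataset(Dataset):
    """Pair the tokenizer output with the labels, one dict per example."""

    def __init__(self, encodings, labels):
        self.encodings = encodings  # e.g. encoded_data_train
        self.labels = labels  # e.g. labels_train

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Pick the idx-th row of input_ids, attention_mask, ...
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item
```

Assuming `y_train` is a pandas Series of integer ids, usage would be `labels_train = torch.tensor(y_train.values)` followed by `dataset_train = EncodedDataset(encoded_data_train, labels_train)`, and the same for the test set.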





## Load BERT model
Depending on what you use (PyTorch or TensorFlow), you will have to use one of the following classes: 

PyTorch = ``BertForSequenceClassification``  
TensorFlow = ``TFBertForSequenceClassification``

⚠️ You must use the same model as the one used for tokenization. So in our case  ``bert-base-uncased``. 


[doc pytorch](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification)   
[doc tensorflow](https://huggingface.co/transformers/model_doc/bert.html#tfbertforsequenceclassification)

**Exercise:** Create a `model` variable with `BertForSequenceClassification.from_pretrained()` (or `TFBertForSequenceClassification.from_pretrained()`). Besides the model name, pass the number of labels with the `num_labels` parameter (normally 95).



**🔦 Pytorch only :** Assign the model to "cuda" device   
``model.to("cuda")``
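
A PyTorch sketch of loading the model and moving it to the right device. The wrapper `load_model` is hypothetical. It falls back to the CPU when no GPU is visible, since `model.to("cuda")` fails on a machine without CUDA. The imports sit inside the function so that the heavy libraries are only loaded when the model is actually built.

```python
def load_model(num_labels, model_name="bert-base-uncased"):
    """Load BERT with a classification head and move it to GPU if possible."""
    import torch
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained(
        model_name,
        num_labels=num_labels,  # size of the new classification layer
    )
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return model.to(device)


# Usage: the number of labels is the size of the label dictionary.
# model = load_model(num_labels=len(label_dict))
```

The warning about newly initialized weights that appears on loading is expected: the classification layer has not been trained yet.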


## Train your model

It's time to start training the model!
For this, the Hugging Face package simplifies our life by providing a ``Trainer()`` class.

To use this class, we must first configure the training run with the ``TrainingArguments()`` class. This class lets us set the batch size, the number of epochs, the learning rate, and so on.

⚠️ For TensorFlow you have to use `TFTrainer()` and `TFTrainingArguments()`! Note that later releases of transformers deprecated `TFTrainer` in favor of Keras `model.fit()`.

**Exercise :** Import `Trainer` and `TrainingArguments` from transformers.

**Exercise :** Create the ``training_args`` variable and instantiate the class `TrainingArguments`. You need to specify several parameters : 
* `output_dir` : directory path for saving your model.
* `num_train_epochs` : number of epochs. Will depend on your machine, batch size, etc.
* `per_device_train_batch_size` : batch size per GPU for training. Here again the number will depend on your machine. If you have a weak GPU, I advise 8 or 16.
* `per_device_eval_batch_size` : batch size per GPU for **testing**. During evaluation, no gradients are computed and no backpropagation is run, so you can set a larger batch size.
* `learning_rate` : by default it is `5e-5`, but you will most likely have to change it. Again, only your own tests can identify a good learning rate.
* `logging_dir` : directory path for storing logs.
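
The parameters above can be gathered as follows. Every value is a placeholder to tune on your own machine, not a recommendation, and the paths and the helper `build_training_args` are hypothetical.

```python
# Starting values only: adjust them to your GPU and your results.
training_kwargs = {
    "output_dir": "./results",  # where checkpoints are saved
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 32,  # no gradients at eval time, so larger
    "learning_rate": 5e-5,  # the library default
    "logging_dir": "./logs",
}


def build_training_args(**overrides):
    """Create TrainingArguments from the defaults above plus any overrides."""
    from transformers import TrainingArguments

    return TrainingArguments(**{**training_kwargs, **overrides})


# Usage:
# training_args = build_training_args(num_train_epochs=5)
```

Keeping the values in a dictionary makes it easy to log which hyper-parameters produced which score while you experiment.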





We also want to report more informative metrics, notably the F1 score.   
[Copy and paste the compute_metrics found in this documentation.](https://huggingface.co/transformers/training.html#codecell14)
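
One point to watch when reusing a `compute_metrics` function written for a binary task: scikit-learn rejects `average="binary"` when the targets contain more than two classes, which is the case here. The sketch below is an adaptation for a multi-class problem, assuming scikit-learn and NumPy are installed. The choice of `average="weighted"` is an assumption, and `"macro"` is a reasonable alternative if rare labels should count as much as frequent ones.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Compute accuracy and weighted precision, recall and F1."""
    labels = eval_pred.label_ids
    # The model returns one score per label; keep the highest-scoring label id.
    preds = np.argmax(eval_pred.predictions, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```

The `Trainer` calls this function with an object exposing `predictions` and `label_ids`, and adds the returned values to the evaluation results.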

**Exercise :** Create the ``trainer`` variable and instantiate the ``Trainer()`` or ``TFTrainer()`` class. You need to specify several parameters :
* `model` : the `model` variable.
* `args` : the `training_args` variable.
* `compute_metrics` : the `compute_metrics` function.
* `train_dataset` : the `dataset_train` variable.
* `eval_dataset` : the `dataset_test` variable.

**Exercise :** Train your model with `trainer.train()` method.

## Evaluate your model

**Exercise :** Evaluate your model with `trainer.evaluate()` method.

If your F1 score is below 0.8, your model could be improved. If your score is very low or stagnant, change the learning rate and adjust the batch size. You can also increase the number of epochs. Unfortunately, there is no magic parameter: it all depends on your data and your environment. You will have to run some tests to find the right hyper-parameters.

**Exercise :** Test your model by making a prediction on the phrase "Hello how are you?".
You should get the label "smalltalk_greetings_how_are_you".
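
A possible PyTorch sketch for this last exercise. It assumes transformers version 4 or later (where the model output exposes `.logits`), a fine-tuned `model`, the same `tokenizer` as before, and an `id_to_label` dictionary that is the inverse of the saved label dictionary. The function name `predict_label` is hypothetical.

```python
def predict_label(text, tokenizer, model, id_to_label, max_length=256):
    """Return the predicted label name for a single sentence."""
    import torch

    inputs = tokenizer(
        text, truncation=True, max_length=max_length, return_tensors="pt"
    )
    # The inputs must live on the same device as the model.
    inputs = {key: value.to(model.device) for key, value in inputs.items()}

    model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients needed for a prediction
        logits = model(**inputs).logits

    predicted_id = int(logits.argmax(dim=-1).item())
    return id_to_label[predicted_id]


# Usage:
# print(predict_label("Hello how are you?", tokenizer, model, id_to_label))
```

Apply the same cleaning to the sentence as you did to the training data before predicting, otherwise the model sees text it was not trained on.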