# Fine-tuning pre-trained language models for text classification



## Overview
Fine-tuning pre-trained language models built on transformers has improved the state of the art in multiple NLP evaluation tasks (see [SuperGLUE leaderboard](https://super.gluebenchmark.com/leaderboard)). Learning a language model is an unsupervised task in which the model learns to predict the next word in a sequence given the previous words. Neural language models have been implemented as feed-forward networks, LSTMs (ELMo, ULMFiT), and transformers, either encoders (BERT) or decoders (OpenAI GPT).

In this notebook we fine-tune a pre-trained BERT language model to carry out a binary classification task in which tweets are labelled as generated by bots or humans.

<!-- The notebook is structured as follows:

- Motivation
- Setup
  - Libraries required
  - Dataset
- A glimpse on BERT tokenization 
- Fine tune the model
- Evaluate the BERT classifier 
-->

## Motivation
**While word embeddings are learnt from large corpora, their use in neural models to solve specific tasks is limited to the input layer.** So in practice a task-specific neural model is built almost from scratch: most of the model parameters are initialized randomly and need to be optimized for the task at hand, which requires large sets of data to produce a high-performance model.

**Recent advances in neural language models** (BERT or OpenAI GPT) have shown evidence that task-specific architectures are no longer necessary, and that transferring some internal representations (attention blocks) along with shallow feed-forward networks is enough.

**In (Garcia-Silva et al., 2019) we presented an experimental study** on the use of word embeddings as input to CNN and Bi-LSTM architectures to tackle the bot detection task, and compared these results with fine-tuning pre-trained language models.

**Evaluation results, presented in the figure below, show that fine-tuning language models yields overall better results than training specific neural architectures** that are fed with a mixture of: i) pre-trained contextualized word embeddings (ELMo), ii) pre-trained context-independent word embeddings learnt from Common Crawl (FastText), Twitter (GloVe), and Urban Dictionary (word2vec), plus embeddings optimized by the neural network in the learning process.

![Bot detection classification task](https://drive.google.com/uc?id=1rSzM544MK2QOezpvUKHfrxATbkEiyBHX)



**References**

Garcia-Silva, Andres, et al. "An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection." Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019.

To cite this paper use the following BibTeX entry:

```
@inproceedings{garcia-silva-etal-2019-empirical,
    title = "An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection",
    author = "Garcia-Silva, Andres  and
      Berrio, Cristian  and
      G{\'o}mez-P{\'e}rez, Jos{\'e} Manuel",
    booktitle = "Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)",
    month = aug,
    year = "2019",
    address = "Florence, Italy",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/W19-4317",
    doi = "10.18653/v1/W19-4317",
    pages = "148--155",
}
```



# A glimpse of BERT


## Input representation

<img src="https://drive.google.com/uc?id=1tZS7sszhNtT3m25EZJkjC9PjRZDEPKGy" alt="Token embeddings, segment embeddings, and positional embeddings" width="500"/>

Image source: "Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805."

## Pre-training learning objectives

### Language Model

<img src="https://drive.google.com/uc?id=1-17pLKqo6BqXbu7GerX1y_e9kJyghjNG" alt="Language modeling objective" width="500"/>

Image source: https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
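
As a complement to the figure, the next-word prediction objective described in the overview can be written down explicitly. This is the standard autoregressive formulation, given here as a sketch; the symbols $w_t$ (the token at position $t$) and $n$ (the sequence length) are notation introduced for this note only.

```latex
P(w_1, \dots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \dots, w_{t-1})
\qquad
\mathcal{L}_{\mathrm{LM}} = -\sum_{t=1}^{n} \log P(w_t \mid w_1, \dots, w_{t-1})
```

Each token is predicted from the tokens to its left only, which is why this kind of model is called unidirectional.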


### BERT Learning Objective: Masked language model

<img src="https://drive.google.com/uc?id=1q6C-6ont7lsuoyw4c1N4-GRFhcko0c4m" alt="Masked LM" width="500"/>

Image source: https://mlexplained.com/2019/06/30/paper-dissected-xlnet-generalized-autoregressive-pretraining-for-language-understanding-explained/
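
For comparison with the left-to-right objective, the masked language model loss can be sketched as follows. The notation is introduced for this note only: $M$ is the set of masked positions and $\tilde{w}_1, \dots, \tilde{w}_n$ is the input sequence after masking. In BERT, about 15% of the input tokens are selected for prediction.

```latex
\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log P(w_i \mid \tilde{w}_1, \dots, \tilde{w}_n)
```

Because the prediction at position $i$ conditions on the whole corrupted sequence, the model can use context on both sides of the masked token.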

### BERT: Transformer encoder

<img src="https://drive.google.com/uc?id=11GJiHlDeoKsShOwSJvMTq3fc8E8GxV4V" alt="Transformer encoder" width="500"/>

Image Source: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

### Next Sentence Prediction

<img src="https://drive.google.com/uc?id=1x0ckvgMwb5j3SfVNId-EyQgyDlt3C42h" alt="next sentence prediction" width="700"/>

Image source: http://jalammar.github.io/illustrated-bert/


## Contextualized word embeddings

<img src="https://drive.google.com/uc?id=1LozTCltkxXbkrE2M72r0lDUWEDXLEWZU" alt="Contextualized embeddings" width="600"/>

Image source: http://jalammar.github.io/illustrated-bert/


## Attention mechanism

<img src="https://drive.google.com/uc?id=1-8VYAUkC30yaQ65yr-moLmJmZmK1D3Xg" alt="Attention Mechanism" width="500"/>
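
Assuming the figure refers to the scaled dot-product attention used inside transformer blocks, the computation is softmax(Q·Kᵀ / sqrt(d_k))·V. The following is a minimal pure-Python sketch on toy matrices. The function names and the input numbers are made up for illustration; real implementations use batched tensor operations, several attention heads, and learned projections for queries, keys, and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Return one output vector per query: softmax(q.K^T / sqrt(d_k)) . V"""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)  # attention weights sum to 1
        # Weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy example (made-up numbers): 3 tokens with 2-dimensional vectors.
# In self-attention, queries, keys and values all come from the same tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
for vec in attended:
    print([round(x, 3) for x in vec])
```

Each output vector is a weighted average of all value vectors, so every token representation can draw on information from every other token in the sequence.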





## Fine-tuning: Supervised Training on a specific task

Fine-tuning BERT for a task just requires adding one output layer, so only a minimal number of parameters need to be learned from scratch.

In the figure below, E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol that separates non-consecutive token sequences.

<img src="https://drive.google.com/uc?id=1wZPwbMNtHwf8g-7phWxwtJCxnTfPj-Ux" width="600"/>

Image source: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

To **fine-tune BERT for a sequence classification task**, the transformer output for the [CLS] token is used as the sequence representation. This output is connected to a one-layer feed-forward network that predicts the classification labels. All the BERT parameters and the feed-forward network are fine-tuned jointly to maximize the log-probability of the correct label.
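
The classification head described here can be sketched in a few lines of pure Python. This is a toy illustration only: the [CLS] vector, the weight matrix `W`, and the bias `b` are made-up values, and the function names are hypothetical. In the real model the [CLS] vector has 768 dimensions for BERT-base and the parameters are learned by gradient descent.

```python
import math

def classify_from_cls(h_cls, weights, bias):
    """One-layer feed-forward head: logits = W h + b, followed by softmax."""
    logits = [sum(w * h for w, h in zip(row, h_cls)) + b for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def negative_log_likelihood(probs, gold_label):
    """Loss minimized during fine-tuning; same as maximizing log p(gold label)."""
    return -math.log(probs[gold_label])

# Made-up 4-dimensional [CLS] vector (BERT-base uses 768 dimensions)
h_cls = [0.5, -1.0, 0.25, 2.0]
# Hypothetical head parameters: one row per class (0 = human, 1 = bot)
W = [[0.1, 0.2, -0.3, 0.0],
     [-0.2, 0.1, 0.4, 0.3]]
b = [0.0, 0.1]

probs = classify_from_cls(h_cls, W, b)
loss = negative_log_likelihood(probs, gold_label=1)
print(probs, loss)
```

During fine-tuning, the gradient of this loss flows back through the head and through all the transformer layers, which is what "fine-tuned jointly" means in practice.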






# Experimental setup


## Transformers library

We use the Transformers library from Hugging Face: https://github.com/huggingface/transformers

"Transformers provides state-of-the-art general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet, CTRL...) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch."

In [None]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/22/97/7db72a0beef1825f82188a4b923e62a146271ac2ced7928baa4d47ef2467/transformers-2.9.1-py3-none-any.whl (641kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.7.0
  Downloading https://files.pythonhosted.org/packages/14/e5/a26eb4716523808bb0a799fcfdceb6ebf77a18169d9591b2f46a9adb87d9/tokenizers-0.7.0-cp36-cp36m-manylinux1_x86_64.whl (3.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)

## Dataset 

**We use the bot detection dataset generated in (Garcia-Silva et al., 2019)**, which was built starting from an existing list of Twitter accounts that were manually labelled as bots or humans. We then used the Twitter API to extract tweets from these accounts. In total the dataset contains around 600k tweets, approximately half of them generated by bots and the other half by humans.

In this notebook we provide a complete version of the dataset (large) and a reduced one (small) so that the notebook can be run within the time frame, since **fine-tuning BERT on the large version takes more than 5h**.

- Large: 500k train and 100k test labeled tweets, in the path "/content/gdrive/My Drive/09_BERT/Large_Dataset/"
- Small: 1k train and 100 test labeled tweets, in the path "/content/gdrive/My Drive/09_BERT/Small_Dataset/"

### Downloading from Google Drive

Let's download the datasets and the models from Google Drive, and then decompress the file.

In [None]:
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1kq0NxztYDBBN_yWCnvGq_xtBo-bAResU' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1kq0NxztYDBBN_yWCnvGq_xtBo-bAResU" -O BERT.tar && rm -rf /tmp/cookies.txt

--2020-05-21 14:47:40--  https://docs.google.com/uc?export=download&confirm=AzeJ&id=1kq0NxztYDBBN_yWCnvGq_xtBo-bAResU
Resolving docs.google.com (docs.google.com)... 172.217.193.100, 172.217.193.101, 172.217.193.102, ...
Connecting to docs.google.com (docs.google.com)|172.217.193.100|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-00-cc-docs.googleusercontent.com/docs/securesc/03gbe2qq1cal4slrk0s9cmqn0po8d2gu/ugs0ebc8d7jlralgrqmlf9b4ucuipldp/1590072450000/16197418968100245121/16455356080524975533Z/1kq0NxztYDBBN_yWCnvGq_xtBo-bAResU?e=download [following]
--2020-05-21 14:47:40--  https://doc-00-cc-docs.googleusercontent.com/docs/securesc/03gbe2qq1cal4slrk0s9cmqn0po8d2gu/ugs0ebc8d7jlralgrqmlf9b4ucuipldp/1590072450000/16197418968100245121/16455356080524975533Z/1kq0NxztYDBBN_yWCnvGq_xtBo-bAResU?e=download
Resolving doc-00-cc-docs.googleusercontent.com (doc-00-cc-docs.googleusercontent.com)... 108.177.13.132, 2607:f8b0:400c:c09::84
Connec

In [None]:
!tar xzvf ./BERT.tar -C .

09_BERT/
09_BERT/Bert_Classifier_Large/
09_BERT/Bert_Classifier_Large/added_tokens.json
09_BERT/Bert_Classifier_Large/config.json
09_BERT/Bert_Classifier_Large/eval_results.txt
09_BERT/Bert_Classifier_Large/predictions.txt
09_BERT/Bert_Classifier_Large/pytorch_model.bin
09_BERT/Bert_Classifier_Large/special_tokens_map.json
09_BERT/Bert_Classifier_Large/tokenizer_config.json
09_BERT/Bert_Classifier_Large/training_args.bin
09_BERT/Bert_Classifier_Large/vocab.txt
09_BERT/Large_Dataset/
09_BERT/Large_Dataset/cached_dev_Bert_Classifier_Large_128_cola
09_BERT/Large_Dataset/dev.tsv
09_BERT/Large_Dataset/train.tsv
09_BERT/run_glue.py
09_BERT/Small_Dataset/
09_BERT/Small_Dataset/Bert_Classifier/
09_BERT/Small_Dataset/Bert_Classifier/added_tokens.json
09_BERT/Small_Dataset/Bert_Classifier/config.json
09_BERT/Small_Dataset/Bert_Classifier/eval_results.txt
09_BERT/Small_Dataset/Bert_Classifier/predictions.txt
09_BERT/Small_Dataset/Bert_Classifier/pytorch_model.bin
09_BERT/Small_Dataset/Bert_Classi

### Set the dataset version
The environment variable DATA_DIR holds the path to the dataset.

In [None]:
%env DATA_DIR=./09_BERT/Small_Dataset/

#Uncomment the following line to use the large version of the dataset
#%env DATA_DIR=./09_BERT/Large_Dataset/

env: DATA_DIR=./09_BERT/Small_Dataset/


### Inspect the dataset

The dataset is in the TSV format expected by the Transformers library.

In [None]:
import os
import pandas as pd

# dev.tsv has no header row, so we name the four columns ourselves
data = pd.read_csv(os.path.join(os.environ["DATA_DIR"], "dev.tsv"), header=None, sep='\t')
data.columns = ["index", "label", "mark", "tweet"]
data

| | index | label | mark | tweet |
|---|---|---|---|---|
| 0 | 0 | 1 | a | Now Playing: ♬ Dick Curless - Evil Hearted Me ... |
| 1 | 1 | 0 | a | Not only are you comfortably swaddled in secur... |
| 2 | 2 | 1 | a | Follow @iAmMySign !!! Follow @iAmMySign our o... |
| 3 | 3 | 0 | a | These strawberry sandwich cookies are so easy ... |
| 4 | 4 | 0 | a | Do These Two Lines Match Up On Your Hands Here... |
| ... | ... | ... | ... | ... |
| 95 | 95 | 0 | a | I’m sorry you hurt your first-grade teacher’s ... |
| 96 | 96 | 1 | a | #HometimeReading: If you’ve enjoyed #KewOrchid... |
| 97 | 97 | 1 | a | Miss_5_Thousand : All my afternoon plans just ... |
| 98 | 98 | 0 | a | A bunch of associates, that I hardly associate... |


# Hands-on

## Tokenization 

Recent neural language models use subword representations: ELMo relies on characters, OpenAI GPT on byte pair encoding, and BERT on the WordPiece algorithm. These **subword representations are combined when words unseen during training need to be processed, hence avoiding the out-of-vocabulary (OOV) problem**.

BERT uses a WordPiece vocabulary of about 30k tokens.

Let us see how the BERT tokenizer works:

In [None]:
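from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = input("Enter a word or a sentence: ")
# Split the text into WordPiece tokens; '##' marks a piece that continues the previous one
print(tokenizer.tokenize(text))
# Map tokens to vocabulary ids; encode() also adds [CLS] (101) and [SEP] (102)
print(tokenizer.encode(text))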




Enter a word or a sentence: ELMO relies on characters, Open AI GPT on byte pair encoding, and BERT on the word pieces algorithms
['elm', '##o', 'relies', 'on', 'characters', ',', 'open', 'ai', 'gp', '##t', 'on', 'byte', 'pair', 'encoding', ',', 'and', 'bert', 'on', 'the', 'word', 'pieces', 'algorithms']
[101, 17709, 2080, 16803, 2006, 3494, 1010, 2330, 9932, 14246, 2102, 2006, 24880, 3940, 17181, 1010, 1998, 14324, 2006, 1996, 2773, 4109, 13792, 102]
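
The splits in the output above ('elm' + '##o', 'gp' + '##t') come from a greedy longest-match-first lookup in the vocabulary. The following is a minimal sketch of that lookup for a single word. The function name and the tiny vocabulary are assumptions made for illustration; the real tokenizer also lowercases the text, splits on punctuation, and uses the full vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (an assumption for this sketch; the real one has about 30k entries)
toy_vocab = {"elm", "##o", "gp", "##t", "relies", "byte"}

for w in ["elmo", "gpt", "relies", "xyz"]:
    print(w, "->", wordpiece_tokenize(w, toy_vocab))
```

With this toy vocabulary, "elmo" and "gpt" are split in the same way as in the tokenizer output, while a word with no matching pieces falls back to the unknown token.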


## Fine-Tuning the model

The next step would be to fine-tune the model.

Running the following script fine-tunes the model and then evaluates it. During evaluation, the predicted classes for the tweets in the test set are saved in the predictions.txt file, which we will use later.

The most relevant parameters of the script are:
  - model type: the model that we are going to use, in this case BERT.
  - model name or path: the name of a pre-trained model or the path to a directory storing a specific model.
  - task name: the task that we want to perform, in this case CoLA, a binary single-sentence classification task whose data format matches our dataset and whose evaluation metric is MCC.
  - output dir: the directory in which the fine-tuned model is stored.
  
You can try to change the parameters and see how it affects performance. 
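
To get a feeling for how the dataset size and the batch size in the command below translate into training time, it helps to count the number of parameter updates. This is a back-of-the-envelope sketch with a hypothetical helper function; it assumes a single GPU, no gradient accumulation, and exactly 1,000 and 500,000 training examples (the notebook only gives the rounded figures "1k" and "500k").

```python
import math

def optimization_steps(num_examples, batch_size, num_epochs=1.0, num_gpus=1):
    """Number of parameter updates, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / (batch_size * num_gpus))
    return int(steps_per_epoch * num_epochs)

# Dataset sizes as stated in this notebook (approximate: "1k" and "500k")
small_steps = optimization_steps(1_000, batch_size=8)
large_steps = optimization_steps(500_000, batch_size=8)
print(small_steps)  # 125
print(large_steps)  # 62500
```

Under these assumptions, one epoch on the large dataset takes 62,500 updates, which is the same number as the save_steps value in the command. This suggests that the value was chosen so that a checkpoint is written only at the end of the epoch.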

Even with the reduced dataset this step is not instantaneous: expect it to take around 1 minute.

In [None]:
!python ./09_BERT/run_glue.py \
    --model_type bert \
    --model_name_or_path bert-base-uncased \
    --task_name CoLA \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir "$DATA_DIR" \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --save_steps 62500 \
    --overwrite_output_dir \
    --output_dir  ./Bert_Classifier/

2020-05-21 14:49:05.408665: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/21/2020 14:49:07 - INFO - filelock -   Lock 140659343022064 acquired on /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca884517.lock
05/21/2020 14:49:07 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /root/.cache/torch/transformers/tmpknnw21be
Downloading: 100% 433/433 [00:00<00:00, 369kB/s]
05/21/2020 14:49:07 - INFO - transformers.file_utils -   storing https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json in cache at /root/.cache/torch/transformers/4dad0251492946e18ac39290fcfe91b89d370fee250efe9521476438fe8ca185.7156163d5fdc189c3016baca0775ffce230789d7fa2a42ef516483e4ca8

If you trained with the small dataset you should see a final result such as mcc = 0.24.

If you trained with the large one, mcc increases to 0.70.

The Matthews correlation coefficient (MCC) takes all four cells of the confusion matrix into account, so it measures how well the classifier performs on both positive and negative predictions.

For this reason it gives more information than the accuracy or the f1 score.

It ranges from -1 to +1: +1 is the best value, 0 corresponds to random predictions, and -1 is the worst value.
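
As a worked example, MCC can be computed directly from the four confusion matrix counts. The counts below are an assumption: they are reconstructed from the classification report of the small dataset shown in the Further Evaluation section (support of 54 and 46 tweets, recall of 0.74 and 0.50), taking "bot" (label 1) as the positive class. The function name is hypothetical and the code is a sketch that does not depend on scikit-learn.

```python
import math

def matthews_corrcoef_from_counts(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # conventional value when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

# Assumed counts for the small dataset, with "bot" (label 1) as the positive class
tp, tn, fp, fn = 23, 40, 14, 23

mcc = matthews_corrcoef_from_counts(tp, tn, fp, fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(mcc, 4))  # 0.2485
print(accuracy)       # 0.63
```

The example shows why MCC is the stricter metric: an accuracy of 0.63 sounds moderate, but an MCC of about 0.25 makes clear that the predictions are only weakly correlated with the true labels.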

## Further Evaluation

Let's compute the metrics of our fine-tuned model to see how well it performs on the test set

In [None]:
import numpy as np
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import matthews_corrcoef

preds = np.loadtxt("./Bert_Classifier/predictions.txt")
test = pd.read_csv(os.environ["DATA_DIR"] + "dev.tsv", header=None, sep = '\t')

print(classification_report(np.asarray(test[1]), preds))
print("Accuracy: ", accuracy_score(np.asarray(test[1]), preds))
print("MCC: ", matthews_corrcoef(test[1], preds))

              precision    recall  f1-score   support

           0       0.63      0.74      0.68        54
           1       0.62      0.50      0.55        46

    accuracy                           0.63       100
   macro avg       0.63      0.62      0.62       100
weighted avg       0.63      0.63      0.62       100

Accuracy:  0.63
MCC:  0.24851594087962242


You should see an accuracy of 0.63 and a macro-averaged f1-score of 0.62, which is a good result considering the small size of the dataset.

The full model, fine-tuned on the 500k tweets, achieves the following metrics:

- Accuracy = 0.85
- Precision = 0.86
- Recall = 0.85



## Perform inference

Now let's take the first four examples from our test set:

In [None]:
os.makedirs("./Test_Dataset/", exist_ok=True) # We are going to store the test dataset in this folder

test_evaluate = test[:4] # First four rows of the test set
print(test_evaluate)
test_evaluate.to_csv("./Test_Dataset/dev.tsv", sep='\t', index=False, header=False)

   0  1  2                                                  3
0  0  1  a  Now Playing: ♬ Dick Curless - Evil Hearted Me ...
1  1  0  a  Not only are you comfortably swaddled in secur...
2  2  1  a  Follow @iAmMySign !!!  Follow @iAmMySign our o...
3  3  0  a  These strawberry sandwich cookies are so easy ...


If you want to perform inference with the model trained on the large dataset, we provide an already fine-tuned version. You only have to point MODEL_PATH, which is passed to the script as model_name_or_path, to ./09_BERT/Bert_Classifier_Large/ instead of ./Bert_Classifier/.

In [None]:
#Uncomment the following line to use the model fine-tuned above on the small dataset
#%env MODEL_PATH=./Bert_Classifier/

#Model provided with the notebook, fine-tuned on the large dataset
%env MODEL_PATH=./09_BERT/Bert_Classifier_Large/

env: MODEL_PATH=./09_BERT/Bert_Classifier_Large/


In [None]:
!python ./09_BERT/run_glue.py \
    --model_type bert \
    --model_name_or_path "$MODEL_PATH" \
    --task_name CoLA \
    --do_eval \
    --do_lower_case \
    --data_dir ./Test_Dataset/ \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=8   \
    --per_gpu_train_batch_size=8   \
    --learning_rate 2e-5 \
    --num_train_epochs 1.0 \
    --save_steps 62500 \
    --output_dir  "$MODEL_PATH"

2020-05-21 14:52:49.026841: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
05/21/2020 14:52:50 - INFO - transformers.configuration_utils -   loading configuration file ./09_BERT/Bert_Classifier_Large/config.json
05/21/2020 14:52:50 - INFO - transformers.configuration_utils -   Model config BertConfig {
  "attention_probs_dropout_prob": 0.1,
  "finetuning_task": "cola",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

05/21/2020 14:52:50 - INFO - transformers.tokenization_utils -   Model name './09_BERT/Bert_Classifier_Large/' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, 


Let's see if the model has correctly classified the examples:

In [None]:
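import os
import numpy as np

# predictions.txt is written by the evaluation run above, one prediction per test row
results = np.loadtxt(os.environ['MODEL_PATH'] + "predictions.txt")
for i, t in enumerate(test_evaluate[3]):
    # Column 3 holds the tweet text; a prediction of 1 means bot
    print(t + " --> ", "BOT" if results[i] > 0.5 else "NOT A BOT")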

Now Playing: ♬ Dick Curless - Evil Hearted Me ♬ https://t.co/fzgP9IRt2h -->  BOT
Not only are you comfortably swaddled in security today, it’s ... More for Capricorn https://t.co/MVCHEli4g1 -->  NOT A BOT
Follow @iAmMySign !!!  Follow @iAmMySign our official page for the whole Zodiac.  Follow @iAmMySign !!!  Follow @iAmMySign !!! -->  BOT
These strawberry sandwich cookies are so easy to make and so tasty! Perfect for #MothersDay https://t.co/Uq7cooR2y7 via @iamthemaven -->  NOT A BOT
