<a href="https://colab.research.google.com/github/abiolaTresor/NLP-Bert-Discovery/blob/master/sentence_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# BERT Lab 2: Sentence classification

In this second part of the BERT lab, you'll use BERT with the Hugging Face PyTorch library to quickly and efficiently fine-tune the model and get near state-of-the-art performance in sentence classification. This is also a practical application of transfer learning in NLP, which lets us create high-performance models with minimal effort on a range of NLP tasks.

The code in this notebook is a simplified version of the [run_glue.py](https://github.com/huggingface/transformers/blob/master/examples/run_glue.py) example script from huggingface.

## Setup


Google Colab offers free GPUs and TPUs! Since we'll be training a large neural network it's best to take advantage of this, otherwise training will take a very long time.

As stated in the README, a GPU can be added by going to the menu and selecting:

`Edit > Notebook Settings > Hardware accelerator (GPU)`

Then we can run the following cell to confirm that the GPU is detected.

In [None]:
import torch

if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    assert False, "Please select GPU in the Colab"

Next, let's install the [transformers](https://github.com/huggingface/transformers) library as in the first lab.

In [None]:
!pip install transformers

#### Downloading & Loading CoLA Dataset

We'll use [The Corpus of Linguistic Acceptability (CoLA)](https://nyu-mll.github.io/CoLA/) dataset for single sentence classification. It's a set of sentences labeled as grammatically correct or incorrect, which makes this a simple binary classification task. It was first published in May of 2018, and is one of the tests included in the "GLUE Benchmark" on which models like BERT are competing. The dataset is hosted here: https://nyu-mll.github.io/CoLA/

To download the dataset directly to our Colab workspace, we'll use the `wget` package and then unzip the dataset to the file system.

In [None]:
!pip install wget

In [None]:
import wget
import os

# Download the dataset if we haven't already
url = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'

if not os.path.exists('./cola_public_1.1.zip'):
    print('Downloading dataset...')
    wget.download(url, './cola_public_1.1.zip')
    
# Unzip the dataset (if we haven't already)
if not os.path.exists('./cola_public/'):
    print('Extracting the dataset files...')
    !unzip cola_public_1.1.zip

We can browse the file system of the Colab instance in the sidebar on the left as in the first lab to make sure the data is there.

#### Parsing the data

We can see from the file names that both `tokenized` and `raw` versions of the data are available. We can't use the pre-tokenized version because, in order to apply the pre-trained BERT, we **must** use the tokenizer provided by the model. This is because (1) the model has a specific, fixed vocabulary with learned embeddings and (2) the BERT tokenizer has a particular way of handling out-of-vocabulary words.
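
To make point (2) concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the strategy WordPiece-style tokenizers use for words that are missing from the vocabulary. The `toy_vocab` set and the `wordpiece_split` helper are hypothetical and made up for illustration; the real BERT vocabulary has roughly 30,000 entries, and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are prefixed with "##".
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched: the whole word becomes the unknown token.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, not the real BERT vocabulary.
toy_vocab = {"em", "##bed", "##ding", "##s", "play", "##ing"}

print(wordpiece_split("embeddings", toy_vocab))
print(wordpiece_split("playing", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Because the pieces and their IDs are tied to one vocabulary, text tokenized any other way would map to the wrong embeddings.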

We'll use pandas to parse the "in-domain" training set and look at a few of its properties and data points.

In [None]:
import pandas as pd

# Load the dataset into a pandas dataframe.
df = pd.read_csv("./cola_public/raw/in_domain_train.tsv", delimiter='\t', header=None, names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display 10 random rows from the data.
df.sample(10)

The two properties we actually care about are the `sentence` and its `label`, which is referred to as the "acceptability judgment" (0=unacceptable, 1=acceptable).

Here are five sentences which are labeled as not grammatically acceptable.

In [None]:
df.loc[df.label == 0].sample(5)[['sentence', 'label']]



Let's extract the sentences and labels of our training set as numpy ndarrays.

In [None]:
sentences = df.sentence.values
labels = df.label.values

## Tokenization & Input Formatting

In this section, we'll transform our dataset into the format that BERT can be trained on.

#### BERT Tokenizer

To feed our text to BERT, it must be split into tokens, and then these tokens must be mapped to their index in the tokenizer vocabulary. The tokenization must be performed by the tokenizer included with BERT. We'll be using the "uncased" version here, since the model we'll load later was trained on lowercased text.

In [None]:
from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

Let's apply the tokenizer to one sentence just to see the output.


In [None]:
# Print the original sentence.
print(' Original: ', sentences[0])

# Print the sentence split into tokens.
print('Tokenized: ', tokenizer.tokenize(sentences[0]))

# Print the encoded sentence, ie., mapped to token ids.
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))

When we actually convert all of our sentences, we'll use the `tokenizer.encode` function to handle both steps, rather than calling `tokenize` and `convert_tokens_to_ids` separately.

Before we can do that, though, we need to talk about some of BERT's formatting requirements.

### Required Formatting

Before feeding text to BERT, we need to follow some requirements, as discussed in the introduction to BERT. In this case, we are required to:
1. Add special tokens to the start and end of each sentence.
2. Pad & truncate all sentences to a single constant length.
3. Explicitly differentiate real tokens from padding tokens with the "attention mask".

#### Special Tokens



**`[SEP]`**

At the end of every sentence, we need to append the special `[SEP]` token.  This token is an artifact of two-sentence tasks, where BERT is given two separate sentences and asked to determine if their order is correct. But in the classification task, this token will be used to indicate the end of the input sentence (followed by padding).

**`[CLS]`**

For classification tasks, we must prepend the special `[CLS]` token to the beginning of every sentence. This token has special significance, since its output is what the classifier uses. Thanks to BERT's architecture and its self-attention mechanism, the final hidden state of this token contains information about the whole input sequence, so it is reasonable to use it for classification; with fine-tuning, the model learns to encode what the task needs into this token. Here is what the authors said:

>  "The first token of every sequence is always a special classification token (`[CLS]`). The final hidden state
corresponding to this token is used as the aggregate sequence representation for classification
tasks." (from the [BERT paper](https://arxiv.org/pdf/1810.04805.pdf))

#### Sentence Length & Attention Mask

The sentences in our dataset obviously have varying lengths, so we need to pad them all to the same length, but without surpassing the maximum sentence length which is 512 tokens. But in this case, we don't need to worry about the max length, since our inputs are quite small.

Padding is done with a special `[PAD]` token, which is at index 0 in the BERT vocabulary. The below illustration demonstrates padding out to a "MAX_LEN" of 8 tokens.

<img src="https://i.imgur.com/4tB3vTD.png" width="700">

The "Attention Mask" is simply an array of 1s and 0s indicating which tokens are padding and which aren't. This mask tells the "Self-Attention" mechanism in BERT not to incorporate these PAD tokens into its interpretation of the sentence.

Note that when choosing the maximum length, we need to take into account its impact on training and evaluation speed.
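
The three formatting requirements can be sketched in plain Python on a list of already-split tokens. This is only an illustration under simple assumptions: `format_for_bert` is a hypothetical helper, it works on token strings rather than IDs, and it truncates by simply cutting off the tail. The tokenizer does all of this for us in the next section.

```python
def format_for_bert(tokens, max_len, cls="[CLS]", sep="[SEP]", pad="[PAD]"):
    """Add special tokens, pad/truncate to max_len, and build the attention mask."""
    # Reserve two positions for [CLS] and [SEP], truncating if needed.
    tokens = tokens[: max_len - 2]
    formatted = [cls] + tokens + [sep]
    n_real = len(formatted)
    n_pad = max_len - n_real
    # Pad on the right up to the fixed length.
    formatted = formatted + [pad] * n_pad
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * n_real + [0] * n_pad
    return formatted, attention_mask


formatted, attention_mask = format_for_bert(["i", "love", "nlp"], max_len=8)
print(formatted)
print(attention_mask)
```

Every sequence comes out with exactly `max_len` positions, which is what lets us stack many sentences into one tensor.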

### Tokenize Dataset

The transformers library provides a helpful `encode` function which will handle most of the parsing and data prep steps for us. Before we are ready to encode our text, though, we need to decide on a **maximum sentence length** for padding.

So first, let's apply one tokenization pass of the dataset in order to measure the maximum sentence length.

In [None]:
max_len = 0

for sent in sentences:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

So let's set the maximum length to 64.

Now we're ready to perform the real tokenization.

The `tokenizer.encode_plus` function combines multiple steps for us:

1. Split the sentence into tokens.
2. Add the special `[CLS]` and `[SEP]` tokens with `add_special_tokens = True`.
3. Map the tokens to their IDs.
4. Pad all sentences to the same length with `pad_to_max_length = True`.
5. Create the attention masks which explicitly differentiate real tokens from `[PAD]` tokens with `return_attention_mask = True`.
6. Return PyTorch tensors directly with `return_tensors = 'pt'`.

For more details, see the docs [here](https://huggingface.co/transformers/main_classes/tokenizer.html?highlight=encode_plus#transformers.PreTrainedTokenizer.encode_plus).

In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

for sent in sentences:
    encoded_dict = tokenizer.encode_plus(sent, add_special_tokens = True, max_length = 64, truncation=True,
                        pad_to_max_length = True, return_attention_mask = True, return_tensors = 'pt')
    
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

### Training & Validation Split


An important pre-processing step is the creation of train/val datasets: we divide our training set to use 90% for training and 10% for validation. The training split is used to fine-tune the model, and the validation split is used to check how well the model generalizes to sentences it was not trained on.

In [None]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('Training samples {}'.format(train_size))
print('Validation samples {}'.format(val_size))
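
Conceptually, a random split shuffles the example indices and cuts the shuffled list in two. The sketch below mirrors that idea with the standard library only; `split_indices` and its `seed` argument are hypothetical names used for illustration, not part of PyTorch.

```python
import random


def split_indices(n, train_fraction=0.9, seed=0):
    """Return (train_indices, val_indices) for a dataset of n examples."""
    indices = list(range(n))
    # Shuffle with a local generator so the split is reproducible.
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(10)
print(len(train_idx), len(val_idx))
```

Every example lands in exactly one of the two splits, so the validation sentences are never seen during training.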

Instead of looping over one example at a time, we create a torch `DataLoader`, which yields batches (i.e., many inputs at a time) and handles loading the data efficiently. We need to specify the `batch_size`, which is how many sentences we feed to BERT in a single iteration. We also pass samplers. For training, we want the examples to arrive in a different order in each epoch, so we shuffle them with a `RandomSampler`; this keeps the model from being influenced by a fixed ordering of the data. For validation, shuffling is unnecessary because we don't train the model, so we go through the examples sequentially with a `SequentialSampler`.

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

batch_size = 32

# Create the DataLoaders for our training and validation sets.
train_dataloader = DataLoader(train_dataset, sampler = RandomSampler(train_dataset),
            batch_size = batch_size)

validation_dataloader = DataLoader(val_dataset, sampler = SequentialSampler(val_dataset),
            batch_size = batch_size)
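
To see what the two samplers change, here is a small standard-library sketch of batching over example indices. The `iterate_batches` generator is a hypothetical stand-in for illustration; the real `DataLoader` also collates tensors, which is not modeled here.

```python
import random


def iterate_batches(n_examples, batch_size, shuffle, seed=None):
    """Yield lists of example indices, one list per batch."""
    order = list(range(n_examples))
    if shuffle:
        # Random order, as with a random sampler.
        random.Random(seed).shuffle(order)
    # Otherwise keep the original order, as with a sequential sampler.
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]


train_batches = list(iterate_batches(10, batch_size=4, shuffle=True, seed=1))
val_batches = list(iterate_batches(10, batch_size=4, shuffle=False))
print(train_batches)
print(val_batches)
```

In both cases every example is visited exactly once per pass, and the last batch is smaller when the dataset size is not a multiple of the batch size.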

### Training

Now that our input data is properly formatted, it's time to fine tune the BERT model. 

#### BertForSequenceClassification

For this task, we first want to modify the pre-trained BERT model to give outputs for classification, and then continue training it on our dataset until the entire model, end-to-end, is well suited to our task. Thanks to the transformers library, this is very straightforward: it provides a set of interfaces designed for a variety of NLP tasks. Though these interfaces are all built on top of a trained BERT model, each has different top layers and output types designed to accommodate its specific NLP task. Here is the current list of classes provided for fine-tuning:
* BertModel
* BertForPreTraining
* BertForMaskedLM
* BertForNextSentencePrediction
* **BertForSequenceClassification**
* BertForTokenClassification
* BertForQuestionAnswering

The documentation for these can be found [here](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html).

We'll be using [BertForSequenceClassification](https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#bertforsequenceclassification). This is the normal BERT model with an added single linear layer on top for classification that we will use as a sentence classifier. As we feed input data, the entire pre-trained BERT model and the additional untrained classification layer are trained on our specific task. All we need to do is load the model using the method `from_pretrained` with the correct number of labels (in our case, with a binary classification, we have 2 labels).

In [None]:
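from transformers import BertForSequenceClassification, BertConfig
# transformers.AdamW is deprecated in favor of the PyTorch implementation.
# Note that torch.optim.AdamW defaults to weight_decay=0.01.
from torch.optim import AdamW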
from transformers import get_linear_schedule_with_warmup

model = BertForSequenceClassification.from_pretrained( "bert-base-uncased", num_labels = 2,
    output_attentions = False, output_hidden_states = False)

# Send model to GPU
model.cuda()

#### Optimizer & Learning Rate Scheduler

Now that we have our model loaded, we need to create the optimizer over the model's parameters and choose the training hyperparameters. For the purposes of fine-tuning, the authors recommend choosing from the following values (the ones we use are in bold):

- Batch size: 16, **32**
- Learning rate (Adam): 5e-5, 3e-5, **2e-5**
- Number of epochs: 2, 3, **4**

We also create a learning rate scheduler, which specifies how the learning rate changes as training progresses. In this case, the learning rate decreases linearly from its initial value down to zero at the end of training.

In [None]:
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
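
The shape of this schedule is easy to reproduce by hand. The sketch below computes the learning rate at a given step for a linear schedule with optional warmup; `linear_schedule_lr` is a hypothetical helper written for illustration, and the 100 total steps are a made-up number rather than the real step count of this notebook.

```python
def linear_schedule_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly so the rate reaches 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run: 100 training steps, no warmup, initial rate 2e-5.
for step in (0, 25, 50, 75, 100):
    print(step, linear_schedule_lr(step, base_lr=2e-5, total_steps=100))
```

With `num_warmup_steps = 0`, as in the cell above, the rate starts at its full value and is halved by the midpoint of training.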

#### Training & Evaluation loops

*The training loop consists of the following steps:*
- Fetch inputs and labels
- Load data onto the GPU
- Clear out the gradients calculated in the previous pass
- Forward pass (pass input data through the network)
- Compute loss using labels and outputs (done directly by the library in the forward pass)
- Backward pass (compute the gradients of the parameters with respect to the loss)
- Update parameters with optimizer.step()
- Track variables for monitoring progress

*Evaluation:*
- Fetch inputs and labels
- Load data onto the GPU
- Forward pass (feed input data through the network)
- Compute loss on our validation data and track variables for monitoring progress
- Compute some metric at the end (like accuracy)

For more details, PyTorch also has some [tutorials](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py)

Define helper functions for calculating accuracy and formatting elapsed time.

In [None]:
import numpy as np
import time
import datetime

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

def format_time(elapsed):
    return str(datetime.timedelta(seconds=int(round((elapsed)))))
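
As a sanity check on what `flat_accuracy` computes, here is the same idea in plain Python on made-up logits: pick the index of the larger score in each row and compare it with the label. The `argmax_accuracy` helper and the toy values are hypothetical, for illustration only.

```python
def argmax_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    # Index of the largest score in each row is the predicted class.
    preds = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)


# Made-up scores for 4 sentences and 2 classes.
toy_logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]
toy_labels = [0, 1, 0, 0]

print(argmax_accuracy(toy_logits, toy_labels))
```

The logits do not need to be turned into probabilities first, because the largest logit is also the largest probability.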

In [None]:
import random
import numpy as np

# Set the seed for reproducibility
seed_val = 42
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# To store training stats
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

for epoch_i in range(0, epochs):
    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode.
    model.train()

    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack the training batch and send it to the GPU
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Clear Grads
        model.zero_grad()        

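        # Perform a forward pass. Indexing the output works whether the
        # model returns a plain tuple or a ModelOutput object.
        outputs = model(b_input_ids, 
                        token_type_ids=None, 
                        attention_mask=b_input_mask, 
                        labels=b_labels)
        loss, logits = outputs[0], outputs[1]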

        # Accumulate the training loss
        total_train_loss += loss.item()

        # Perform a backward pass
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # Validation

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:
        
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # No need to accumulate the grads
        with torch.no_grad():        

            outputs = model(b_input_ids, 
                            token_type_ids=None, 
                            attention_mask=b_input_mask,
                            labels=b_labels)
            loss, logits = outputs[0], outputs[1]
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final stats
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    validation_time = format_time(time.time() - t0)

    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Save stats
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("Training complete!")
print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))
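
The training loop clips the gradient norm to 1.0 before each optimizer step. The sketch below shows the idea on a flat list of numbers: compute the overall norm, and if it exceeds the limit, scale every gradient by the same factor. `clip_by_global_norm` is a hypothetical plain-Python helper, and the small `eps` term is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale grads so their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0])
print(norm_before, clipped)

unchanged, _ = clip_by_global_norm([0.3, 0.4])
print(unchanged)
```

Scaling all gradients by one factor keeps the update direction and only limits its size, which helps guard against occasional very large updates.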

Let's view the summary of the training process.

In [None]:
import pandas as pd

# Use two decimal places.
pd.set_option('display.precision', 2)
# Create a DataFrame from our training statistics.
df_stats = pd.DataFrame(data=training_stats)
# Use the 'epoch' as the row index.
df_stats = df_stats.set_index('epoch')
df_stats

Notice that, while the training loss is going down with each epoch, the validation loss is increasing! This suggests that we are training our model too long, and it's over-fitting on the training data.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns

# Use plot styling from seaborn.
sns.set(style='darkgrid')

# Increase the plot size and font size.
sns.set(font_scale=1.5)
plt.rcParams["figure.figsize"] = (12,6)

# Plot the learning curve.
plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")

# Label the plot.
plt.title("Training & Validation Loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.xticks([1, 2, 3, 4])

plt.show()

### Performance On Test Set

Now we'll load the holdout dataset and prepare inputs just as we did with the training set. Then we'll evaluate predictions using the [Matthews correlation coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html) because this is the metric used by the wider NLP community to evaluate performance on CoLA. With this metric, +1 is the best score, and -1 is the worst score. This way, we can see how well we perform against the state-of-the-art models for this specific task.
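
To build some intuition for the metric, here is a small plain-Python sketch of the binary Matthews correlation coefficient computed from the confusion-matrix counts. The `binary_mcc` helper and the toy labels are hypothetical; returning 0 when the denominator is zero is a convention assumed here. In the notebook itself we rely on scikit-learn's `matthews_corrcoef`.

```python
import math


def binary_mcc(y_true, y_pred):
    """Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: score 0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up labels: 7 acceptable sentences, 3 unacceptable ones.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
always_acceptable = [1] * 10
inverted = [1 - t for t in y_true]

print(binary_mcc(y_true, y_true))
print(binary_mcc(y_true, always_acceptable))
print(binary_mcc(y_true, inverted))
```

A classifier that always answers "acceptable" is right on 7 of these 10 toy sentences, yet its MCC is 0, which is why this metric is more informative than accuracy when the classes are unevenly represented.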

#### Data Preparation



We'll need to apply all of the same steps that we did for the training data to prepare our test data set.

##  <span style="color:red">Your turn. </span>
Just like the preprocessing we did for training and validation data, we'll need to do the following:

1. **Load the test data from `out_of_domain_dev.tsv`.**
2. **Tokenize it and create the dataset and the dataloader.**
3. **Evaluate the model on the test set.**

In [None]:
# Load the dataset into a pandas dataframe.

# Tokenize all of the sentences and map the tokens to their word IDs.

# Create the Datasets & DataLoader.

With the test set prepared, you'll need to use the fine-tuned model to generate predictions on the test set.

In [None]:
# Prediction on test set

# Tracking variables, append the labels and prediction using these two lists
predictions , true_labels = [], []

In [None]:
print('Positive samples: %d of %d (%.2f%%)' % (df.label.sum(), len(df.label), (df.label.sum() / len(df.label) * 100.0)))

As the output above shows, the CoLA dataset is not balanced: the two labels are not equally frequent, so a model that always predicts the majority class would already score better than 50% accuracy. For this reason, performance on the CoLA benchmark is measured using the "[Matthews correlation coefficient](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html)" (MCC), which is better suited to imbalanced classes.

In [None]:
from sklearn.metrics import matthews_corrcoef

matthews_set = []

# Combine the results across all batches. 
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0 or 1) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0)

# Calculate the MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)

print('Total MCC: %.3f' % mcc)

<span style="color:red">**Expected -> Total MCC: 0.498** </span>

Cool! In about half an hour and without doing any hyperparameter tuning (adjusting the learning rate, epochs, batch size, ADAM properties, etc.) we are able to get a good score.

The library documents the expected score for this benchmark [here](https://huggingface.co/transformers/examples.html#glue) as `49.23`. You can also look at the official leaderboard [here](https://gluebenchmark.com/leaderboard/submission/zlssuBTm5XRs0aSKbFYGVIVdvbj1/-LhijX9VVmvJcvzKymxy).




Let's take a look at the scores on the individual batches to get a sense of the variability in the metric between batches. Each batch has 32 sentences in it, except the last batch which has only (516 % 32) = 4 test sentences in it.

In [None]:
print('Calculating Matthews Corr. Coef. for each batch...')

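for i in range(len(true_labels)):
    # Predicted label for each sample in batch i.
    pred_labels_i = np.argmax(predictions[i], axis=1).flatten()

    # MCC for this batch.
    matthews = matthews_corrcoef(true_labels[i], pred_labels_i)
    matthews_set.append(matthews)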

ax = sns.barplot(x=list(range(len(matthews_set))), y=matthews_set, ci=None)

plt.title('MCC Score per Batch')
plt.ylabel('MCC Score (-1 to +1)')
plt.xlabel('Batch #')
 
plt.show()

### Optional

To maximize the score, we should remove the "validation set" (which we used to help determine how many epochs to train for) and train on the entire training set. This should give better performance.