# Fine-Tuning BERT

In this notebook we try to replicate the results in [this Colab Notebook](https://colab.research.google.com/drive/1Nwvg3QaVW-OAFoi7cV-0sEnxFGsLbYqs). We start by loading the necessary libraries.

In [2]:
import os
import sys
import numpy as np
import pandas as pd
import torch
from pathlib import Path
import zipfile
from urllib import request
import time
import datetime
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import get_linear_schedule_with_warmup
from transformers import BertForSequenceClassification, AdamW, BertConfig

sys.path.insert(0, os.path.abspath('../src'))
import utils

In [26]:
EPOCHS = 4
BATCH_SIZE = 32
MAXLEN = 64

device = torch.device('cuda:1')

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
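
The cell above hard-codes `cuda:1`, which fails on machines with fewer than two GPUs. As a minimal sketch of a more portable choice (the `pick_device` helper and its `preferred_index` parameter are hypothetical, not part of the original notebook), the device could be selected with a fallback:

```python
import torch


def pick_device(preferred_index=1):
    """Return the preferred GPU if it exists, else GPU 0, else the CPU."""
    if torch.cuda.is_available():
        # Fall back to the first GPU when the preferred index is not present.
        if preferred_index < torch.cuda.device_count():
            return torch.device(f"cuda:{preferred_index}")
        return torch.device("cuda:0")
    return torch.device("cpu")


device = pick_device()
print(device)
```

Training on the CPU will be far slower than the timings shown later in this notebook.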


In [2]:
ZIPFILE = 'cola_public_1.1.zip'
DATAFOLDER = Path('../data')
DESTFILE = DATAFOLDER/ZIPFILE
RAWFOLDER = DATAFOLDER/'cola_public/raw'
TRAINFILE = RAWFOLDER/'in_domain_train.tsv'
TESTFILE = RAWFOLDER/'in_domain_dev.tsv'

In [3]:
URL = 'https://nyu-mll.github.io/CoLA/cola_public_1.1.zip'
if not os.path.exists(DESTFILE):
    request.urlretrieve(URL, DESTFILE)

In [4]:
if not os.path.exists(DATAFOLDER/'cola_public'):
    with zipfile.ZipFile(DESTFILE, 'r') as zh:
        zh.extractall(DATAFOLDER)

In [5]:
df = pd.read_table(TRAINFILE, delimiter='\t', header=None,
                  names=['sentence_source', 'label', 'label_notes', 'sentence'])
print(df.shape)
df.head()

(8551, 4)


| | sentence_source | label | label_notes | sentence |
|---|---|---|---|---|
| 0 | gj04 | 1 | | Our friends won't buy this analysis, let alone... |
| 1 | gj04 | 1 | | One more pseudo generalization and I'm giving up. |
| 2 | gj04 | 1 | | One more pseudo generalization or I'm giving up. |
| 3 | gj04 | 1 | | The more we study verbs, the crazier they get. |
| 4 | gj04 | 1 | | Day by day the facts are getting murkier. |


In [6]:
df['label'].value_counts()

1    6023
0    2528
Name: label, dtype: int64
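
The counts above show that the classes are imbalanced, which gives a useful reference point for any accuracy figure. As a minimal sketch (the variable names are illustrative, not from the original notebook), the accuracy of always predicting the majority class can be computed directly from these counts:

```python
# Label counts copied from the value_counts() output above.
counts = {1: 6023, 0: 2528}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(f"{total} sentences, majority label = {majority_label}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

This baseline is computed on the full training file; the proportion in a held-out split will differ slightly.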

Are there repeated sentences?

In [7]:
df = df[['label', 'sentence']]
df['sentence'].duplicated().sum()

19

In [8]:
df = df.drop_duplicates()
df.shape

(8543, 2)
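
The two cells above count different things: `duplicated()` on the `sentence` column flags every repeated sentence regardless of its label, while `drop_duplicates()` without arguments only drops rows in which both `label` and `sentence` repeat. That is why 19 repeats are found but only 8 rows are removed (8551 to 8543). A minimal sketch on a made-up toy frame (the data and the name `toy` are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "label":    [1, 1, 0, 1],
    "sentence": ["a b c", "a b c", "a b c", "d e f"],
})

# Rows whose sentence already appeared earlier, whatever the label.
repeated_sentences = int(toy["sentence"].duplicated().sum())

# Drops only rows where label AND sentence both repeat.
full_row_dedup = toy.drop_duplicates()

# Drops every repeated sentence, keeping the first occurrence.
sentence_dedup = toy.drop_duplicates(subset="sentence")

print(repeated_sentences, len(full_row_dedup), len(sentence_dedup))
```

Whether to deduplicate on the sentence alone depends on how sentences with conflicting labels should be handled; the notebook as written keeps them.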

In [9]:
sentences, labels = df.sentence.values, df.label.values
sentences

array(["Our friends won't buy this analysis, let alone the next one we propose.",
       "One more pseudo generalization and I'm giving up.",
       "One more pseudo generalization or I'm giving up.", ...,
       'It is easy to slay the Gorgon.',
       'I had the strangest feeling that I knew you.',
       'What all did you get for Christmas?'], dtype=object)

We need to repeat the operations above for the test set.

In [10]:
df_test = pd.read_table(TESTFILE, delimiter='\t', header=None,
                  names=['sentence_source', 'label', 'label_notes', 'sentence'])
print(df_test.shape)
df_test = df_test.drop_duplicates()
print(df_test.shape)

(527, 4)
(527, 4)


In [11]:
test_sentences, test_labels = df_test.sentence.values, df_test.label.values

In [12]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
len(tokenizer.vocab)

30522

The tokenizer provides a `tokenize` method that does *not* add the special tokens. The `tokenizer.encode` method, by contrast, does the following:

1. Split the sentence into tokens.
2. Add the special tokens.
3. Map the tokens to integer indexes.
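
The three steps listed above can be illustrated with a toy stand-in. This is only a sketch: the vocabulary, its ids, and the helper names `toy_tokenize` and `toy_encode` are made up for illustration, and the whitespace split replaces the WordPiece tokenization that BERT really uses.

```python
# Toy vocabulary: these ids are invented and do NOT match BERT's vocabulary.
TOY_VOCAB = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
    "day": 4, "by": 5, "the": 6, "facts": 7,
}


def toy_tokenize(sentence):
    """Step 1: lowercase and split on whitespace (no special tokens added)."""
    return sentence.lower().split()


def toy_encode(sentence):
    """Steps 1-3: tokenize, add special tokens, map tokens to integer ids."""
    tokens = ["[CLS]"] + toy_tokenize(sentence) + ["[SEP]"]
    return [TOY_VOCAB.get(tok, TOY_VOCAB["[UNK]"]) for tok in tokens]


print(toy_tokenize("Day by day the facts"))
print(toy_encode("Day by day the facts"))
```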

In [13]:
tokenizer.tokenize(sentences[0])

['our',
 'friends',
 'won',
 "'",
 't',
 'buy',
 'this',
 'analysis',
 ',',
 'let',
 'alone',
 'the',
 'next',
 'one',
 'we',
 'propose',
 '.']

The `tokenizer.encode` function adds the special tokens `[CLS]` and `[SEP]`. The `utils.process_sentences` function used below also takes care of padding the encoded sequences with zeros, and it can truncate sentences to a length of `MAXLEN`.

In [14]:
inputs = utils.process_sentences(sentences, maxlen=MAXLEN)
masks = utils.create_attention_mask(inputs)
print(inputs.shape, inputs.dtype, masks.shape, masks.dtype)

(8543, 64) int64 (8543, 64) int64
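
The `utils` module is not shown in this notebook, so the following is only a guess at the kind of padding, truncation and masking it performs. The names `pad_and_truncate` and `attention_mask`, the `pad_id` parameter and the token ids are all hypothetical.

```python
import numpy as np


def pad_and_truncate(encoded, maxlen, pad_id=0):
    """Return an int64 array of shape (len(encoded), maxlen)."""
    out = np.full((len(encoded), maxlen), pad_id, dtype=np.int64)
    for row, ids in enumerate(encoded):
        ids = ids[:maxlen]  # plain truncation; this may cut off the final [SEP]
        out[row, :len(ids)] = ids
    return out


def attention_mask(inputs, pad_id=0):
    """1 for real tokens, 0 for padding positions."""
    return (inputs != pad_id).astype(np.int64)


# Placeholder ids standing in for already-encoded sentences.
encoded = [[11, 25, 12], [11, 31, 32, 33, 34, 12]]
toy_inputs = pad_and_truncate(encoded, maxlen=5)
toy_masks = attention_mask(toy_inputs)
print(toy_inputs)
print(toy_masks)
```

The mask lets the model ignore the zero padding, which is why it is built from the padded inputs rather than from the raw sentences.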


Same for the test set.

In [15]:
test_inputs = utils.process_sentences(test_sentences, maxlen=MAXLEN)
test_masks = utils.create_attention_mask(test_inputs)
print(test_inputs.shape, test_masks.shape)

(527, 64) (527, 64)


## Training and validation splits

Let's create a training and a validation set using Scikit-Learn functionality.

In [16]:
from sklearn.model_selection import train_test_split

train_inputs, val_inputs, train_labels, val_labels = train_test_split(
    inputs, labels, random_state=42, test_size=0.1)

train_masks, val_masks, _, _ = train_test_split(
    masks, labels, random_state=42, test_size=0.1)

print(train_inputs.shape, val_inputs.shape)
print(train_labels.shape, val_labels.shape)

(7688, 64) (855, 64)
(7688,) (855,)
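
The two calls above stay aligned only because they share the same `random_state` and the arrays have the same length. As an alternative sketch on made-up arrays (the `toy_*` names and values are illustrative), `train_test_split` accepts several arrays at once and applies one shuffle to all of them:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 20
toy_inputs = np.repeat(np.arange(n)[:, None], 4, axis=1)  # row i is [i, i, i, i]
toy_masks = toy_inputs + 100                              # row i is [100 + i, ...]
toy_labels = np.arange(n)                                 # label i belongs to row i

# One call returns a (train, validation) pair for each array, in order.
(tr_inputs, va_inputs,
 tr_masks, va_masks,
 tr_labels, va_labels) = train_test_split(
    toy_inputs, toy_masks, toy_labels, random_state=42, test_size=0.1)

# Every row still matches its own mask and label after shuffling.
print((tr_inputs[:, 0] == tr_labels).all(), (tr_masks[:, 0] == tr_labels + 100).all())
print(tr_inputs.shape, va_inputs.shape)
```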


Convert these arrays into tensors.

In [17]:
train_inputs = torch.tensor(train_inputs)
train_labels = torch.tensor(train_labels)
train_masks = torch.tensor(train_masks)

val_inputs = torch.tensor(val_inputs)
val_labels = torch.tensor(val_labels)
val_masks = torch.tensor(val_masks)

test_inputs = torch.tensor(test_inputs)
test_labels = torch.tensor(test_labels)
test_masks = torch.tensor(test_masks)

## Create DataLoaders

In [18]:
train_data = TensorDataset(train_inputs, train_masks, train_labels)
val_data = TensorDataset(val_inputs, val_masks, val_labels)
test_data = TensorDataset(test_inputs, test_masks, test_labels)

In [19]:
train_loader = DataLoader(train_data, sampler=RandomSampler(train_data), batch_size=BATCH_SIZE)
val_loader = DataLoader(val_data, sampler=SequentialSampler(val_data), batch_size=BATCH_SIZE)
test_loader = DataLoader(test_data, sampler=SequentialSampler(test_data), batch_size=BATCH_SIZE)

## Run BERT

In [20]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device);

In [21]:
optimizer = AdamW(model.parameters(), lr = 2e-5, eps = 1e-8)
total_steps = len(train_loader) * EPOCHS
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps = 0, num_training_steps = total_steps)
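
To make the schedule concrete, here is a plain-Python sketch of the multiplier that a linear warmup and decay schedule applies to the base learning rate. The function name `linear_lr_multiplier` is hypothetical and the code is a reimplementation for illustration, not the library's source.

```python
def linear_lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps, never below 0.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


BASE_LR = 2e-5
TOTAL_STEPS = 241 * 4  # ceil(7688 / 32) = 241 batches per epoch, times EPOCHS

for step in (0, 241, 482, 964):
    multiplier = linear_lr_multiplier(step, 0, TOTAL_STEPS)
    print(step, multiplier, BASE_LR * multiplier)
```

With no warmup the learning rate starts at `2e-5` and reaches zero at the last step. One consequence of this shape is that, once `total_steps` scheduler steps have been taken, the multiplier stays at zero, so re-running the training loop without re-creating the optimizer and scheduler would no longer update the weights.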

## Training Loop

In [28]:
import random

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# Store the average loss after each epoch so we can plot them.
loss_values = []

model.zero_grad()

# For each epoch...
for epoch_i in range(0, EPOCHS):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, EPOCHS))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_loss = 0

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()
        
    # This training code is based on the `run_glue.py` script here:
    # https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

    # For each batch of training data...
    for step, batch in enumerate(train_loader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Format the elapsed time for display.
            elapsed = utils.format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(
                step, len(train_loader), elapsed))

        # Put the model into training mode.    
        model.train()

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
                
        # Forward pass (evaluate the model on this training batch)
        # `model` is of type: transformers.BertForSequenceClassification
        outputs = model(b_input_ids, 
                    token_type_ids=None, 
                    attention_mask=b_input_mask, 
                    labels=b_labels)
        
        loss = outputs[0]

        # Accumulate the loss. `loss` is a Tensor containing a single value; 
        # the `.item()` function just returns the Python value from the tensor.
        total_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

        # Clear out the gradients (by default they accumulate)
        model.zero_grad()

    # Calculate the average loss over the training data.
    avg_train_loss = total_loss / len(train_loader)            
    
    loss_values.append(avg_train_loss)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(utils.format_time(time.time() - t0)))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in val_loader:
        
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        
        # Telling the model not to compute or store gradients, saving memory and speeding up validation
        with torch.no_grad():        
            # Forward pass, calculate logit predictions
            # token_type_ids is for the segment ids, but we only have a single sentence here.
            # See https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L258 
            outputs = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
        
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        
        # Calculate the accuracy for this batch of validation sentences.
        tmp_eval_accuracy = utils.flat_accuracy(logits, label_ids)
        
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

    # Report the final accuracy for this validation run.
    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
    print("  Validation took: {:}".format(utils.format_time(time.time() - t0)))

print("")
print("Training complete!")
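
The loop above relies on `utils.flat_accuracy` and `utils.format_time`, which are not shown in this notebook. The following is a sketch of what such helpers might look like, assuming logits of shape `(batch, num_labels)` and an `h:mm:ss` display format; the actual `utils` implementations may differ.

```python
import datetime

import numpy as np


def flat_accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = np.argmax(logits, axis=1).flatten()
    return float(np.sum(preds == labels.flatten()) / len(labels))


def format_time(elapsed_seconds):
    """Round to whole seconds and format as h:mm:ss."""
    return str(datetime.timedelta(seconds=int(round(elapsed_seconds))))


toy_logits = np.array([[2.0, 1.0], [0.0, 3.0], [1.0, 0.0], [0.2, 0.1]])
toy_labels = np.array([0, 1, 1, 0])
print(flat_accuracy(toy_logits, toy_labels))
print(format_time(52.6))
```

Note that the loop above averages per-batch accuracies, so a smaller final batch carries the same weight as a full one.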


Training...
  Batch    40  of    241.    Elapsed: 0:00:09.
  Batch    80  of    241.    Elapsed: 0:00:17.
  Batch   120  of    241.    Elapsed: 0:00:26.
  Batch   160  of    241.    Elapsed: 0:00:35.
  Batch   200  of    241.    Elapsed: 0:00:44.
  Batch   240  of    241.    Elapsed: 0:00:52.

  Average training loss: 0.61
  Training epoch took: 0:00:52

Running Validation...
  Accuracy: 0.72
  Validation took: 0:00:02

Training...
  Batch    40  of    241.    Elapsed: 0:00:09.
  Batch    80  of    241.    Elapsed: 0:00:18.
  Batch   120  of    241.    Elapsed: 0:00:26.
  Batch   160  of    241.    Elapsed: 0:00:35.
  Batch   200  of    241.    Elapsed: 0:00:44.
  Batch   240  of    241.    Elapsed: 0:00:53.

  Average training loss: 0.61
  Training epoch took: 0:00:53

Running Validation...
  Accuracy: 0.72
  Validation took: 0:00:02

Training...
  Batch    40  of    241.    Elapsed: 0:00:09.
  Batch    80  of    241.    Elapsed: 0:00:18.
  Batch   120  of    241.    Elapsed: 0:00:27