<a href="https://colab.research.google.com/github/coda-nsit/BERT_experiments/blob/master/BERT_finetuning_simple.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning BERT in a simple way, without the complicated code of the Transformers library
This notebook reads two files, one containing abstracts about vaccines and one containing abstracts about therapeutics, and fine-tunes BERT to classify each passage as one or the other.
## References
This notebook follows the tutorial at https://mccormickml.com/2019/07/22/BERT-fine-tuning/.


In [1]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/37/ba/dda44bbf35b071441635708a3dd568a5ca6bf29f77389f7c7c6818ae9498/transformers-2.7.0-py3-none-any.whl (544kB)

In [0]:
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader, RandomSampler, SequentialSampler

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from transformers import BertTokenizer, BertForSequenceClassification, AdamW, BertConfig, get_linear_schedule_with_warmup

import random
import time
import datetime

In [3]:
from google.colab import drive
drive.mount('/gdrive')
%cp /gdrive/"My Drive"/OncampusJob/vaccines .
%cp /gdrive/"My Drive"/OncampusJob/therapeutics .

Mounted at /gdrive


In [4]:
# Build a list of [passage, label] pairs: label 1 = vaccines, label 0 = therapeutics.
dataset = []

with open("vaccines") as f:
  for passage in f:
    dataset.append([passage, 1])

with open("therapeutics") as f:
  for passage in f:
    dataset.append([passage, 0])

# Convert to a DataFrame and shuffle the rows so the two classes are interleaved.
dataset = pd.DataFrame(dataset, columns=["data", "label"])
dataset = dataset.sample(frac=1).reset_index(drop=True)
dataset.head()

                                                data  label
0  Like Moderna, CureVac uses man-made mRNA to sp...      1
1  To determine whether convalescent plasma trans...      0
2  Gilead’s remdesivir is being studied in five c...      0
3  An outbreak of the novel coronavirus SARS-CoV-...      0
4  GlaxoSmithKline, one of the world’s largest va...      1


In [0]:
sentences = dataset.data.values
labels = dataset.label.values

In [6]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
print('Tokenized: ', tokenizer.tokenize(sentences[0]))
print('Token IDs: ', tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentences[0])))


Tokenized:  ['Like', 'Modern', '##a', ',', 'Cure', '##V', '##ac', 'uses', 'man', '-', 'made', 'm', '##RNA', 'to', 'spur', 'the', 'production', 'of', 'proteins', '.', 'And', ',', 'like', 'Modern', '##a', ',', 'it', 'got', 'a', 'grant', 'from', 'the', 'nonprofit', 'Coalition', 'for', 'E', '##pid', '##em', '##ic', 'Pre', '##par', '##ed', '##ness', 'Innovation', '##s', 'to', 'apply', 'its', 'technology', 'to', 'co', '##rona', '##virus', '.', 'Cure', '##V', '##ac', 'has', 'said', 'it', 'expects', 'to', 'have', 'a', 'candidate', 'ready', 'for', 'animal', 'testing', 'by', 'April', ',', 'aiming', 'to', 'start', 'a', 'clinical', 'study', 'this', 'summer', '.', 'The', 'company', 'is', 'also', 'working', 'with', 'CE', '##PI', 'on', 'a', 'mobile', 'm', '##RNA', 'manufacturing', 'technology', ',', 'one', 'that', 'would', 'theoretical', '##ly', 'allow', 'health', 'care', 'workers', 'to', 'rapidly', 'produce', 'vaccine', '##s', 'to', 'respond', 'at', 'the', 'site', 'of', 'an', 'outbreak', '.']
Token
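
The `##` prefix in the tokenized output above marks a WordPiece that continues the previous token instead of starting a new word. As a rough illustration only (the helper `merge_wordpieces` is hypothetical and not part of the Transformers library), the pieces can be glued back into whole words like this:

```python
def merge_wordpieces(tokens):
    """Join WordPiece continuation tokens (prefixed with '##') onto the preceding token."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the marker and append to the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words


# A prefix of the tokenized output shown above.
tokens = ['Like', 'Modern', '##a', ',', 'Cure', '##V', '##ac', 'uses',
          'man', '-', 'made', 'm', '##RNA']
print(merge_wordpieces(tokens))
# ['Like', 'Moderna', ',', 'CureVac', 'uses', 'man', '-', 'made', 'mRNA']
```

Words that are missing from the `bert-base-cased` vocabulary, such as "Moderna" and "CureVac", are the ones that get split into several pieces.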

# Format the input for BERT

## Find the maximum sequence length in the dataset to choose the `max_length` parameter
BERT accepts at most 512 tokens per sequence, so `max_length` cannot exceed 512.

In [7]:
max_len = 0

for sent in sentences:

    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sent, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

Max sentence length:  371
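
The longest encoded passage is 371 tokens, which is under BERT's 512-token limit, so nothing would be truncated at `max_length = 512`. A minimal sketch, assuming one wanted to pad only as far as the longest passage while still respecting that limit (the helper `choose_max_length` is hypothetical; the notebook itself simply pads to 512):

```python
BERT_MAX_POSITIONS = 512  # size of BERT's position embedding table


def choose_max_length(token_counts, limit=BERT_MAX_POSITIONS):
    """Return the padding length: the longest sequence, capped at the model limit."""
    return min(max(token_counts), limit)


# Hypothetical per-passage token counts; 371 is the maximum reported above.
print(choose_max_length([120, 371, 88]))   # 371
# Anything longer than the limit would have to be truncated to 512.
print(choose_max_length([600, 40]))        # 512
```

Padding to a shorter length reduces the work done in self-attention, at the cost of having to recompute the length if longer passages are added later.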


In [8]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                         # Sentence to encode.
                        add_special_tokens = True,    # Add '[CLS]' and '[SEP]'
                        max_length = 512,             # Pad & truncate all sentences.
                        pad_to_max_length = True,
                        return_attention_mask = True, # Construct attn. masks.
                        return_tensors = 'pt',        # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', sentences[0])
print('Token IDs:', input_ids[0])

Original:  Like Moderna, CureVac uses man-made mRNA to spur the production of proteins. And, like Moderna, it got a grant from the nonprofit Coalition for Epidemic Preparedness Innovations to apply its technology to coronavirus. CureVac has said it expects to have a candidate ready for animal testing by April, aiming to start a clinical study this summer. The company is also working with CEPI on a mobile mRNA manufacturing technology, one that would theoretically allow health care workers to rapidly produce vaccines to respond at the site of an outbreak. 

Token IDs: tensor([  101,  2409,  4825,  1161,   117, 27121,  2559,  7409,  2745,  1299,
          118,  1189,   182, 15654,  1106, 16650,  1103,  1707,  1104,  7865,
          119,  1262,   117,  1176,  4825,  1161,   117,  1122,  1400,   170,
         5721,  1121,  1103, 15773, 10651,  1111,   142, 25786,  5521,  1596,
        11689, 17482,  1174,  1757, 13886,  1116,  1106,  6058,  1157,  2815,
         1106,  1884, 15789, 27608, 
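
To make steps (5) and (6) of `encode_plus` concrete, here is a simplified pure-Python sketch of padding and attention-mask construction. The helper `pad_and_mask` is hypothetical, and its truncation is naive: unlike the real tokenizer, it does not keep the closing `[SEP]` token when it cuts a sequence short. In the `bert-base-cased` vocabulary, `[PAD]` has ID 0, `[CLS]` has ID 101 and `[SEP]` has ID 102.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Pad (or naively truncate) a list of token IDs and build its attention mask."""
    token_ids = token_ids[:max_length]
    n_pad = max_length - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


# [CLS] Like Modern ##a , [SEP], padded to a length of 8.
ids, mask = pad_and_mask([101, 2409, 4825, 1161, 117, 102], max_length=8)
print(ids)   # [101, 2409, 4825, 1161, 117, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```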

## Split into training and validation sets

In [9]:
# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Calculate the number of samples to include in each set.
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

   12 training samples
    4 validation samples
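
The sizes above follow from `int(0.8 * 16) = 12` training samples and `16 - 12 = 4` validation samples. Note that `random_split` runs before any seed is set in this notebook, so the split appears to differ from run to run. A minimal sketch of a reproducible index split using only the standard library (the helper `split_indices` and its `seed` parameter are assumptions for illustration, not what `random_split` does internally):

```python
import random


def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Shuffle the indices 0..n_samples-1 with a fixed seed and split them."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_samples)
    return indices[:train_size], indices[train_size:]


train_idx, val_idx = split_indices(16)
print(len(train_idx), len(val_idx))  # 12 4
```

With PyTorch itself, calling `torch.manual_seed` before `random_split` should serve the same purpose.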


## Create the dataloader

In [0]:
# The BERT authors recommend a batch size of 16 or 32 for fine-tuning; this notebook uses 8.
batch_size = 8

train_dataloader = DataLoader(
    train_dataset,
    sampler = RandomSampler(train_dataset),
    batch_size = batch_size)

validation_dataloader = DataLoader(
    val_dataset,
    sampler = SequentialSampler(val_dataset),
    batch_size = batch_size)
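
With 12 training samples, 4 validation samples and `batch_size = 8`, the loaders yield very few batches. The sketch below works out the batch sizes, assuming the `DataLoader` default of keeping the last, smaller batch (the helper `batch_sizes` is hypothetical):

```python
def batch_sizes(n_samples, batch_size):
    """Sizes of the batches produced when the final partial batch is kept."""
    full, remainder = divmod(n_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


print(batch_sizes(12, 8))  # [8, 4]  -> 2 training batches per epoch
print(batch_sizes(4, 8))   # [4]     -> 1 validation batch
```

So `len(train_dataloader)` is 2 and `len(validation_dataloader)` is 1 for this dataset.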

# Train the model

In [0]:
%%capture
model = BertForSequenceClassification.from_pretrained(
    "bert-base-cased",
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False)

model.cuda()

In [12]:
# Get all of the model's parameters as a list of tuples.
params = list(model.named_parameters())

print('The BERT model has {:} different named parameters.\n'.format(len(params)))

print('==== Embedding Layer ====\n')

for p in params[0:5]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== First Transformer ====\n')

for p in params[5:21]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

print('\n==== Output Layer ====\n')

for p in params[-4:]:
    print("{:<55} {:>12}".format(p[0], str(tuple(p[1].size()))))

The BERT model has 201 different named parameters.

==== Embedding Layer ====

bert.embeddings.word_embeddings.weight                  (28996, 768)
bert.embeddings.position_embeddings.weight                (512, 768)
bert.embeddings.token_type_embeddings.weight                (2, 768)
bert.embeddings.LayerNorm.weight                              (768,)
bert.embeddings.LayerNorm.bias                                (768,)

==== First Transformer ====

bert.encoder.layer.0.attention.self.query.weight          (768, 768)
bert.encoder.layer.0.attention.self.query.bias                (768,)
bert.encoder.layer.0.attention.self.key.weight            (768, 768)
bert.encoder.layer.0.attention.self.key.bias                  (768,)
bert.encoder.layer.0.attention.self.value.weight          (768, 768)
bert.encoder.layer.0.attention.self.value.bias                (768,)
bert.encoder.layer.0.attention.output.dense.weight        (768, 768)
bert.encoder.layer.0.attention.output.dense.bias              (
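
The count of 201 named parameters can be checked by hand. The arithmetic below assumes the grouping used by the slices in the cell above (5 embedding tensors, 16 tensors per transformer layer, 4 tensors in the output section) together with the 12 encoder layers of `bert-base`:

```python
embedding_tensors = 5    # params[0:5]
tensors_per_layer = 16   # params[5:21] covers one transformer layer
num_layers = 12          # bert-base has 12 encoder layers
output_tensors = 4       # params[-4:]: pooler and classifier, weight and bias each

total = embedding_tensors + tensors_per_layer * num_layers + output_tensors
print(total)  # 201
```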

In [0]:
# eps: a very small number to prevent any division by zero in the implementation
optimizer = AdamW(model.parameters(),
                  lr = 2e-5,
                  eps = 1e-8)

epochs = 3
total_steps = len(train_dataloader) * epochs

# Linearly decay the learning rate from `lr` to 0 over `total_steps` (no warmup).
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)
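
To show what the scheduler does to the learning rate, here is a pure-Python sketch of a linear warmup followed by linear decay. The function `linear_schedule_lr` is a hypothetical re-implementation for illustration, not the library's code. The step count assumes the 2 training batches per epoch that 12 samples and `batch_size = 8` produce, so `total_steps = 2 * 3 = 6`.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay linearly from base_lr to 0 over the remaining steps.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


total_steps = 2 * 3  # batches per epoch * epochs, as assumed above
for step in range(total_steps + 1):
    print(step, linear_schedule_lr(step, 2e-5, 0, total_steps))
```

With `num_warmup_steps = 0`, training starts at the full `2e-5` and reaches 0 after the last update.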

In [0]:
def flat_accuracy(preds, labels):
  pred_flat = np.argmax(preds, axis=1).flatten()
  labels_flat = labels.flatten()
  return np.sum(pred_flat == labels_flat) / len(labels_flat)
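
`flat_accuracy` takes the index of the largest logit in each row as the predicted class and compares it with the label. The sketch below mirrors that logic without NumPy; the function name `accuracy_from_logits` and the logit values are made up for illustration.

```python
def accuracy_from_logits(logits, labels):
    """Fraction of rows whose largest logit sits at the labelled class index."""
    predictions = [row.index(max(row)) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Four hypothetical examples with two logits each (class 0, class 1).
logits = [[2.1, -0.3], [0.2, 1.5], [-1.0, 0.4], [0.9, 0.1]]
labels = [0, 1, 0, 0]
print(accuracy_from_logits(logits, labels))  # 0.75
```

With only four validation samples, accuracy can only take the values 0, 0.25, 0.5, 0.75 or 1.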

In [0]:
def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string h:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round(elapsed))
    
    # Format as h:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
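
A quick usage example for `format_time`, repeated here so the snippet runs on its own. The input durations are arbitrary values chosen for illustration.

```python
import datetime


def format_time(elapsed):
    """Round a duration in seconds to the nearest second and format it as h:mm:ss."""
    elapsed_rounded = int(round(elapsed))
    return str(datetime.timedelta(seconds=elapsed_rounded))


print(format_time(0.4))     # 0:00:00
print(format_time(1.2))     # 0:00:01
print(format_time(3725.4))  # 1:02:05
```

Anything that takes less than half a second is reported as `0:00:00`, which is worth keeping in mind when reading timings on a very small dataset.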

In [20]:
device = torch.device("cuda")
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities for each epoch:
# 1. training loss
# 2. validation loss
# 3. validation accuracy
# 4. timings
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # `dropout` and `batchnorm` layers behave differently during training vs validation 
    # source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch
    model.train()

    for step, batch in enumerate(train_dataloader):

        # Report progress every 40 batches.
        if step % 40 == 0 and step != 0:
            elapsed = format_time(time.time() - t0)
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch
        model.zero_grad()        

        loss, logits = model(b_input_ids, 
                             token_type_ids=None, 
                             attention_mask=b_input_mask, 
                             labels=b_labels)

        # `loss` is a Tensor containing a single value; 
        # the `.item()` function just returns the Python value from the tensor.
        total_train_loss += loss.item()

        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    print("")
    print("Running Validation...")

    t0 = time.time()

    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    for batch in validation_dataloader:
        
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            (loss, logits) = model(b_input_ids, 
                                  token_type_ids=None, 
                                  attention_mask=b_input_mask,
                                  labels=b_labels)
            
        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of validation sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))


======== Epoch 1 / 3 ========
Training...

  Average training loss: 0.47
  Training epoch took: 0:00:01

Running Validation...
  Accuracy: 0.75
  Validation Loss: 0.61
  Validation took: 0:00:00

======== Epoch 2 / 3 ========
Training...

  Average training loss: 0.45
  Training epoch took: 0:00:01

Running Validation...
  Accuracy: 0.75
  Validation Loss: 0.56
  Validation took: 0:00:00

======== Epoch 3 / 3 ========
Training...

  Average training loss: 0.32
  Training epoch took: 0:00:01

Running Validation...
  Accuracy: 0.75
  Validation Loss: 0.54
  Validation took: 0:00:00

Training complete!
Total training took 0:00:02 (h:mm:ss)
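
The training loop above calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before each optimizer step. Conceptually, this measures the L2 norm of all gradients taken together and, if it exceeds the limit, rescales every gradient by the same factor. A pure-Python sketch of that idea on a flat list of gradient values (the helper `clip_by_global_norm` and its `eps` default are assumptions for illustration, not PyTorch's code):

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Never scale up: gradients already within the limit are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before)  # 5.0
print(clipped)      # roughly [0.6, 0.8]
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length shrinks.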
