### In this notebook, we re-train BERT on the MNLI dataset plus examples from the HANS dataset. 

Before running the cells, change the runtime to GPU. You must also upload the following files:
- ``` utils.py ```, a Python script containing helper functions
- ``` heuristics_train_set.txt ``` from the HANS dataset (https://github.com/tommccoy1/hans)
- the zip file from https://gist.github.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e, which contains a script to download the MNLI dataset
- ``` heuristics_evaluation_set.txt ``` from the HANS dataset
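
A quick way to confirm the uploads before running anything else is to check for the files by name. The following is a minimal sketch: the helper `find_missing_files` is hypothetical, and the `/content` base directory is an assumption based on the paths used later in this notebook.

```python
import os

# File names taken from the upload list; the zip file is left out because
# its full name depends on the gist revision that was downloaded.
REQUIRED_FILES = [
    "utils.py",
    "heuristics_train_set.txt",
    "heuristics_evaluation_set.txt",
]

def find_missing_files(base_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in base_dir."""
    return [name for name in required
            if not os.path.isfile(os.path.join(base_dir, name))]

# Assumption: uploaded files land in /content, as in the cells below.
missing = find_missing_files("/content")
if missing:
    print("Please upload:", ", ".join(missing))
else:
    print("All required files found.")
```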




In [0]:
import tensorflow as tf
import torch
import pandas as pd
import numpy as np

In [0]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/13/33/ffb67897a6985a7b7d8e5e7878c3628678f553634bd3836404fef06ef19b/transformers-2.5.1-py3-none-any.whl (499kB)

In [0]:
!pip install transformers
!pip install wget
# unzipping glue datasets
!unzip 60c2bdb54d156a41194446737ce03e2e-17b8dd0d724281ed7c3b2aeeda662b92809aadd5.zip

Archive:  60c2bdb54d156a41194446737ce03e2e-17b8dd0d724281ed7c3b2aeeda662b92809aadd5.zip
17b8dd0d724281ed7c3b2aeeda662b92809aadd5
   creating: 60c2bdb54d156a41194446737ce03e2e-17b8dd0d724281ed7c3b2aeeda662b92809aadd5/
  inflating: 60c2bdb54d156a41194446737ce03e2e-17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py  


In [0]:
# downloading datasets
!python '/content/60c2bdb54d156a41194446737ce03e2e-17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py'

Downloading and extracting CoLA...
	Completed!
Downloading and extracting SST...
	Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
	Completed!
Downloading and extracting QQP...
	Completed!
Downloading and extracting STS...
	Completed!
Downloading and extracting MNLI...
	Completed!
Downloading and extracting SNLI...
	Completed!
Downloading and extracting QNLI...
	Completed!
Downloading and extracting RTE...
	Completed!
Downloading and extracting WNLI...
	Completed!
Downloading and extracting diagnostic...
	Completed!


-----------------
### Reading datasets
First, we read the MNLI dataset. This is done using the ``` read_data ``` function from the ``` utils.py ``` file.

In [0]:
from utils import read_data
# reading in MNLI dataset
train_premises, train_hypotheses, train_labels = read_data('/content/glue_data/MNLI/train.tsv')
val_premises, val_hypotheses, val_labels = read_data('/content/glue_data/MNLI/dev_matched.tsv')
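
The contents of ``` utils.py ``` are not shown in this notebook, so here is a rough sketch of what a reader like ``` read_data ``` might do. The function name `read_mnli_tsv` is hypothetical, and the column names `sentence1`, `sentence2`, and `gold_label` are an assumption about the header of the GLUE MNLI files.

```python
import csv
import os
import tempfile

def read_mnli_tsv(filepath):
    """Hypothetical stand-in for utils.read_data: returns three parallel lists."""
    premises, hypotheses, labels = [], [], []
    with open(filepath, encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quote characters inside sentences as ordinary text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            premise = row.get("sentence1")
            hypothesis = row.get("sentence2")
            label = row.get("gold_label")
            # Skip malformed rows that are missing any of the three fields.
            if not (premise and hypothesis and label):
                continue
            premises.append(premise)
            hypotheses.append(hypothesis)
            labels.append(label)
    return premises, hypotheses, labels

# Tiny demonstration on a two-row file that uses the assumed column names.
sample = (
    "sentence1\tsentence2\tgold_label\n"
    "A man is sleeping.\tA person rests.\tentailment\n"
    "A \"quoted\" word.\tNothing is quoted.\tcontradiction\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(sample)
premises, hypotheses, labels = read_mnli_tsv(tmp.name)
os.remove(tmp.name)
print(labels)
```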

For the next step, we must upload the ``` heuristics_train_set.txt ``` file, which can be downloaded from https://github.com/tommccoy1/hans. We also import the ``` read_and_convert_hans ``` function from our ``` utils.py ``` file to read and convert it.



In [0]:
from utils import read_and_convert_hans
hans_premises, hans_hypotheses, hans_pairIDs, hans_labels = read_and_convert_hans('/content/heuristics_train_set.txt')

We now concatenate both datasets.

In [0]:
train_premises = train_premises + hans_premises
train_hypotheses = train_hypotheses + hans_hypotheses
train_labels = train_labels + hans_labels

In [0]:
len(train_labels), len(train_premises)

(422703, 422703)

Tokenize the premises, hypotheses, and labels. This is done using helper functions from the ``` utils.py ``` file.





In [0]:
from transformers import BertTokenizer
# loading bert tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [0]:
from utils import tokenize_sentences
train_inputs, train_ids, train_masks = tokenize_sentences(train_premises, train_hypotheses, 128, tokenizer)
val_inputs, val_ids, val_masks = tokenize_sentences(val_premises, val_hypotheses, 128, tokenizer)
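
To make the three return values more concrete, here is a toy sketch of how a premise/hypothesis pair is typically laid out for BERT. It is not the code in ``` utils.py ```: `encode_pair` is a hypothetical helper, the word-piece ids in the example are made up, and only the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) match the `bert-base-uncased` vocabulary.

```python
def encode_pair(premise_ids, hypothesis_ids, max_len,
                cls_id=101, sep_id=102, pad_id=0):
    """Lay out [CLS] premise [SEP] hypothesis [SEP], then pad to max_len."""
    premise_ids = list(premise_ids)
    hypothesis_ids = list(hypothesis_ids)

    # Three positions are reserved for the special tokens; trim the longer
    # sequence one token at a time until the pair fits.
    while len(premise_ids) + len(hypothesis_ids) > max_len - 3:
        if len(premise_ids) > len(hypothesis_ids):
            premise_ids.pop()
        else:
            hypothesis_ids.pop()

    input_ids = [cls_id] + premise_ids + [sep_id] + hypothesis_ids + [sep_id]
    # Segment 0 covers [CLS] + premise + [SEP]; segment 1 covers hypothesis + [SEP].
    token_type_ids = [0] * (len(premise_ids) + 2) + [1] * (len(hypothesis_ids) + 1)
    # The mask is 1 for real tokens and 0 for padding.
    attention_mask = [1] * len(input_ids)

    padding = max_len - len(input_ids)
    return (input_ids + [pad_id] * padding,
            token_type_ids + [0] * padding,
            attention_mask + [0] * padding)

# Made-up word-piece ids for a 3-token premise and a 2-token hypothesis.
inputs, segment_ids, mask = encode_pair([7, 8, 9], [5, 6], max_len=10)
print(inputs)       # [101, 7, 8, 9, 102, 5, 6, 102, 0, 0]
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(mask)         # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

The return order mirrors the call above (inputs, segment ids, masks), with 128 taking the place of `max_len`.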

In [0]:
from utils import tokenize_labels_hans
train_labels = tokenize_labels_hans(train_labels, tokenizer)
val_labels = tokenize_labels_hans(val_labels, tokenizer)
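
Because MNLI has three labels and HANS has two, the label conversion has to collapse ``` contradiction ``` and ``` neutral ``` into a single non-entailment class. The sketch below shows one way this could look. The index assignment (0 for entailment, 1 for non-entailment) and the helper name `to_binary_label` are assumptions; the real ``` tokenize_labels_hans ``` may differ.

```python
# Assumed index assignment; the real helper in utils.py may use another one.
LABEL_TO_ID = {"entailment": 0, "non-entailment": 1}

def to_binary_label(label):
    """Map an MNLI or HANS label string to a two-class integer id."""
    label = label.strip()
    # MNLI's contradiction and neutral both count as non-entailment.
    if label in ("contradiction", "neutral"):
        label = "non-entailment"
    return LABEL_TO_ID[label]

mixed = ["entailment", "neutral", "contradiction", "non-entailment"]
print([to_binary_label(label) for label in mixed])  # [0, 1, 1, 1]
```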

Converting to tensors

In [0]:
train_inputs = torch.tensor(train_inputs)
val_inputs = torch.tensor(val_inputs)

In [0]:
train_ids = torch.tensor(train_ids)
val_ids = torch.tensor(val_ids)

In [0]:
train_masks = torch.tensor(train_masks)
val_masks = torch.tensor(val_masks)

In [0]:
train_labels = torch.tensor(train_labels)
val_labels = torch.tensor(val_labels)

Creating DataLoaders

In [0]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
batch_size = 32

In [0]:

# Create DataLoader for training set
train_data = TensorDataset(train_inputs, train_masks, train_ids, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

# Create DataLoader for validation set
validation_data = TensorDataset(val_inputs, val_masks, val_ids, val_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)

------
### Using GPU for faster training time

In [0]:
# check for GPU
device_name = tf.test.gpu_device_name()
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

# use GPU
device = torch.device("cuda")
# confirm
print('We are using a ', torch.cuda.get_device_name(0))

SystemError: ignored

In [0]:
tf.test.is_gpu_available()

False
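
The check above goes through TensorFlow and stops with an error when no GPU is attached. As an alternative sketch, the device can be selected with PyTorch alone, falling back to the CPU so that the later cells can still be smoke-tested (training on CPU would be very slow).

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if device.type == "cuda":
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU found, falling back to CPU.")

# Tensors and models can then be moved with .to(device) instead of .cuda().
example = torch.zeros(2, 3).to(device)
print(example.device)
```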

------
### Loading BERT
We now use 2 labels instead of 3, since the HANS dataset uses ``` entailment ``` and ``` non-entailment ``` rather than ``` entailment ```, ``` contradiction ```, and ``` neutral ```.



In [0]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", 
    num_labels = 2,  
    output_attentions = False, 
    output_hidden_states = False, 
)

# run model on GPU
model.cuda()

optimizer = AdamW(model.parameters(),
                  lr = 2e-5, 
                  eps = 1e-8
                )

In [0]:
from transformers import get_linear_schedule_with_warmup

# number of training epochs (authors recommend between 2 and 4)
epochs = 1 # manually train 3 times to avoid GPU connection issues

total_steps = len(train_dataloader)*epochs

# create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

NameError: ignored
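
To see what a linear schedule with warmup does to the learning rate, the multiplier can be written out in plain Python. This is a sketch of the rule rather than the library code: `linear_schedule_factor` is a hypothetical name, and `total_steps = 1000` is a placeholder instead of the real `len(train_dataloader) * epochs`.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

base_lr = 2e-5
total_steps = 1000  # placeholder value for illustration

for step in (0, 250, 500, 1000):
    lr = base_lr * linear_schedule_factor(step, 0, total_steps)
    print(step, lr)
```

With `num_warmup_steps = 0`, as in the cell above, the learning rate starts at its full value and decays linearly to zero by the last step.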

### Training BERT

In [0]:
import random
import time
import datetime
import re
import os
from google.colab import files


torch.set_default_dtype(torch.float64)

seed = 72

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

loss_values = []

for epoch in range(0, epochs):
  print('---------- Epoch %s ----------' % str(epoch))
  # start clock
  t0 = time.time()

  # reset loss for epoch
  total_loss = 0

  # put model into training mode
  model.train()

  # for each batch of the training data
  for step, batch in enumerate(train_dataloader):

    if step % 100 == 0 and not step == 0:
      time_elapsed = str(datetime.timedelta(seconds=int(round(time.time() - t0))))
      print('\t Batch %i of %i. Time elapsed: %s' % (step, len(train_dataloader), time_elapsed))
    
    # retrieve tensors from dataloader
    # copy each to GPU using to(device)
    input_ids = batch[0].to(device)
    attention_mask = batch[1].to(device)
    sequence_ids = batch[2].to(device)
    labels = batch[3].to(device)

    # clear previously calculated gradients
    model.zero_grad()

    # perform forward pass
    # the loss is returned
    outputs = model(
        input_ids = input_ids.long(),
        attention_mask = attention_mask.long(),
        token_type_ids = sequence_ids.long(),
        labels = labels.long()
        )
    
    loss = outputs[0]
    total_loss += loss.item()

    # perform backward pass to calculate gradients
    loss.backward()

    # Clip the norm of the gradients to 1.0 to help prevent "exploding gradients" 
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

    # Update parameters and take a step using the computed gradient
    optimizer.step()

    # Update the learning rate
    scheduler.step()

  try:
    # Save the weights (state_dict) and the full pickled model to plain files.
    # These are torch.save() files, so reload them with torch.load(),
    # not with from_pretrained(). Despite its name, output_dir is a file path.
    output_dir = '/content/saved_model'
    print("Saving model to %s" % output_dir)
    torch.save(model.state_dict(), output_dir)
    torch.save(model, '/content/entire_model.pth')
  except Exception as e:
    print('Saving Failed: %s' % e)

  # Calculate the average loss over the training data.
  avg_train_loss = total_loss / len(train_dataloader)
  loss_values.append(avg_train_loss)

  print('--- Average Training Loss: %f' % avg_train_loss)

  # Measure performance on validation set
  t0 = time.time()
  model.eval()

  try:
    hans_premises, hans_hypotheses, hans_pairIDs, hans_labels = read_and_convert_hans('/content/heuristics_evaluation_set.txt')
    test_inputs, test_ids, test_masks = tokenize_sentences(hans_premises, hans_hypotheses, 128, tokenizer)
    test_inputs = torch.tensor(test_inputs)
    test_ids = torch.tensor(test_ids)
    test_masks = torch.tensor(test_masks)
    hans_pairIDs = torch.tensor(hans_pairIDs)
    test_data = TensorDataset(test_inputs, test_masks, test_ids, hans_pairIDs)
    test_sampler = RandomSampler(test_data)
    test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
    predictions = []
    pair_ids = []

    model.eval()

    # Tracking variables 
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    for batch in test_dataloader:

        batch = tuple(t.to(device) for t in batch)
              
        # Unpack the inputs from dataloader
        input_ids, attention_mask, sequence_ids, batch_pair_ids = batch

        # no need for grad since evaluation
        with torch.no_grad():        

          outputs = model(input_ids = input_ids.long(),
                              attention_mask = attention_mask.long(),
                              token_type_ids = sequence_ids.long())
              
          logits = outputs[0]

          # Move logits and labels to CPU
          logits = logits.detach().cpu().numpy()
          batch_pair_ids = batch_pair_ids.to('cpu').numpy()

          for i in range(0,len(logits)): 
            predictions.append(logits[i])
            pair_ids.append('ex' + str(batch_pair_ids[i]))
          
    df = pd.DataFrame()
    df['pairID'] = pair_ids
    df['gold_label'] = predictions
    df.to_csv('hans_predictions_post.csv', index=False)

    print('---- HANS Testing Completed ----')
  except Exception as e:
    # report the cause instead of silently swallowing it
    print('---- HANS testing failed: %s ----' % e)

  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0

  for batch in validation_dataloader:

      batch = tuple(t.to(device) for t in batch)
          
      # Unpack the inputs from dataloader
      input_ids, attention_mask, sequence_ids, labels = batch

      # no need for grad since evaluation
      with torch.no_grad():        

        outputs = model(input_ids = input_ids.long(),
                          attention_mask = attention_mask.long(),
                          token_type_ids = sequence_ids.long())
          
        # Get the "logits" output by the model. The "logits" are the output
        # values prior to applying an activation function like the softmax.
        logits = outputs[0]

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = labels.to('cpu').numpy()
          
        # Calculate the accuracy for this batch of test sentences.
        tmp_eval_accuracy = np.sum(np.argmax(logits, axis=1).flatten() == label_ids.flatten())/len(label_ids)
          
        # Accumulate the total accuracy.
        eval_accuracy += tmp_eval_accuracy

        # Track the number of batches
        nb_eval_steps += 1

  # Report the final accuracy for this validation run.
  print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
  print("  Validation took: {:}".format(datetime.timedelta(seconds=int(round(time.time() - t0)))))

print('Training complete.')
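
One detail of the validation loop above is that it averages per-batch accuracies. When the final batch is smaller than the others, its examples carry more weight than the rest. The sketch below contrasts the two ways of aggregating on made-up predictions; both function names are hypothetical.

```python
def mean_of_batch_accuracies(batches):
    """Average the accuracy of each batch (what the loop above computes)."""
    per_batch = [
        sum(p == y for p, y in zip(preds, labels)) / len(labels)
        for preds, labels in batches
    ]
    return sum(per_batch) / len(per_batch)

def example_level_accuracy(batches):
    """Count correct predictions over all examples, then divide once."""
    correct = sum(p == y
                  for preds, labels in batches
                  for p, y in zip(preds, labels))
    total = sum(len(labels) for _, labels in batches)
    return correct / total

# Made-up (predictions, labels) pairs: one full batch and a short final batch.
batches = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 of 4 correct
    ([1], [1]),                    # 1 of 1 correct
]
print(mean_of_batch_accuracies(batches))  # 0.875
print(example_level_accuracy(batches))    # 0.8
```

With a batch size of 32 and thousands of validation examples the gap is small, but counting correct predictions over all examples avoids it entirely.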

In [0]:
import re
from utils import tokenize_sentences
def read_and_convert_hans(filepath): 
  premises = []
  hypotheses = []
  pairIDs = []
  gold_labels = []
  with open(filepath) as file:
    # skip the header row so that it is not read as an example
    next(file, None)
    for fline in file:
      line = re.split(r'\t+', fline)
      # columns: 0 = gold label, 5 = premise, 6 = hypothesis, 7 = pairID ('ex123')
      pairIDs.append(int(re.sub('ex', '', line[7])))
      gold_labels.append(line[0])
      premises.append(line[5])
      hypotheses.append(line[6])

  # all four lists must stay parallel, otherwise TensorDataset will reject them
  assert(len(pairIDs) == len(premises))
  assert(len(premises) == len(hypotheses))
  assert(len(pairIDs) == len(gold_labels))

  return premises, hypotheses, pairIDs, gold_labels

In [0]:
hans_premises, hans_hypotheses, hans_pairIDs, hans_labels = read_and_convert_hans('/content/heuristics_evaluation_set.txt')

In [0]:
test_inputs, test_ids, test_masks = tokenize_sentences(hans_premises, hans_hypotheses, 128, tokenizer)

In [0]:
test_inputs = torch.tensor(test_inputs)
test_ids = torch.tensor(test_ids)
test_masks = torch.tensor(test_masks)
hans_pairIDs = torch.tensor(hans_pairIDs)

In [0]:
test_data = TensorDataset(test_inputs, test_masks, test_ids, hans_pairIDs)
test_sampler = RandomSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

In [0]:
predictions = []
pair_ids = []

model.eval()

# Tracking variables 
eval_loss, eval_accuracy = 0, 0
nb_eval_steps, nb_eval_examples = 0, 0

for batch in test_dataloader:

    batch = tuple(t.to(device) for t in batch)
          
    # Unpack the inputs from dataloader
    input_ids, attention_mask, sequence_ids, batch_pair_ids = batch

    # no need for grad since evaluation
    with torch.no_grad():        

      outputs = model(input_ids = input_ids.long(),
                          attention_mask = attention_mask.long(),
                          token_type_ids = sequence_ids.long())
          
      logits = outputs[0]

      # Move logits and labels to CPU
      logits = logits.detach().cpu().numpy()
      batch_pair_ids = batch_pair_ids.to('cpu').numpy()

      for i in range(0,len(logits)): 
        predictions.append(logits[i])
        pair_ids.append('ex' + str(batch_pair_ids[i]))
      
df = pd.DataFrame()
df['pairID'] = pair_ids
df['gold_label'] = predictions
df.to_csv('hans_predictions_post.csv', index=False)

print('---- HANS Testing Completed ----')

NameError: ignored

In [0]:
# output to CSV file to submit to the Kaggle competition
df = pd.DataFrame()
df['pairID'] = pair_ids
df['gold_label'] = predictions
df.to_csv('hans_predictions_post.csv', index=False)
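
The cells above store the raw logit vectors in the ``` gold_label ``` column. If label strings are wanted in the file instead, the logits can be reduced with an argmax first. The following is a sketch with made-up logits; the class order in `ID_TO_LABEL` is an assumption and must match whatever order was used to encode the training labels.

```python
import csv
import io

# Assumed class order; it must match the encoding of the training labels.
ID_TO_LABEL = {0: "entailment", 1: "non-entailment"}

def logits_to_label(logits):
    """Return the label string of the highest-scoring class."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID_TO_LABEL[best]

# Made-up (pairID, logits) rows standing in for pair_ids and predictions.
rows = [("ex0", [2.1, -1.3]), ("ex1", [-0.4, 0.9])]

# Written to an in-memory buffer here; a real run would write to a file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["pairID", "gold_label"])
for pair_id, logits in rows:
    writer.writerow([pair_id, logits_to_label(logits)])

print(buffer.getvalue())
```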