# COLX 585 Trends in Computational Linguistic
##  Lab tutorial 4: Training on TPU device

* Utilize the Cloud TPUs to fine-tune BERT.
* PyTorch XLA package help us use Cloud TPU cores as devices, like any other PyTorch device.


In [0]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive


## Overview
Generally, we only need to change few things to utilize TPUs:
  - Install PyTorch XLA package
  - Set device as XLA device: `device = xm.xla_device()`
  - Change `optimizer.step()` function to `xm.optimizer_step(optimizer, barrier=True)` in order to prevent graphs from growing too large.


## Import require Python libraries

Colab doesn't install `transformers` library automatically. Hence, we should install `transformers` first.

In [0]:
! pip install transformers



In [0]:
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'

### Installing PyTorch/XLA
PyTorch can use Cloud **TPU cores** as devices with the **PyTorch/XLA package**. For more on PyTorch/XLA see its [Github](https://github.com/pytorch/xla) or its [documentation](http://pytorch.org/xla/).

Run the following cell to install PyTorch, Torchvision, and PyTorch/XLA. I

In [0]:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3727  100  3727    0     0  15464      0 --:--:-- --:--:-- --:--:-- 15400
Updating TPU and VM. This may take around 2 minutes.
Updating TPU runtime to pytorch-dev20200325 ...
Uninstalling torch-1.4.0:
Done updating TPU runtime: <Response [200]>
  Successfully uninstalled torch-1.4.0
Uninstalling torchvision-0.5.0:
  Successfully uninstalled torchvision-0.5.0
Copying gs://tpu-pytorch/wheels/torch-nightly+20200325-cp36-cp36m-linux_x86_64.whl...
- [1 files][ 83.4 MiB/ 83.4 MiB]                                                
Operation completed over 1 objects/83.4 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly+20200325-cp36-cp36m-linux_x86_64.whl...
- [1 files][114.5 MiB/114.5 MiB]                                                
Operation completed over 1 objects/114.5 MiB.           

In [0]:
import tensorflow
import torch

from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as autograd
from tqdm import tqdm, trange
import pandas as pd
import numpy as np
import io
import os
import matplotlib.pyplot as plt
from keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import accuracy_score, f1_score, recall_score, precision_score, classification_report, confusion_matrix
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile

# imports the torch_xla package
import torch_xla
import torch_xla.core.xla_model as xm

Using TensorFlow backend.


In [0]:
from transformers import *

Now, we set the device as `xla` device (i.e., TPU). 

In [0]:
device = xm.xla_device()
print(device)

xla:1


## Data prepare

Similar, we use the corpus from the [CL-Aff shared task](https://sites.google.com/view/affcon2019/cl-aff-shared-task?authuser=0). We focus on `sociality classification`. Sociality refers to `whether or not other people than the author are involved in the emotion situation`. 
We only use labelled dataset which include 10,560 labelled samples. 

We have already preprocessed (tokenization, removing URLs, mentions, hashtags and so on) the tweets and placed it under ``./happy_db`` folder in three files as ``train.tsv``, ``dev.tsv`` and ``test.tsv``. We split the labeled data into 80\% training set (8,448 moments) and 20\% development set (2112 moments).

First, we define a function to pre-process input data. 

In [0]:
# define a function for data preparation
def data_prepare(file_path, lab2ind, tokenizer, max_len = 32, mode = 'train'):
    '''
    file_path: the path to input file. 
                In train mode, the input must be a tsv file that includes two columns where the first is text, and second column is label.
                The first row must be header of columns.

                In predict mode, the input must be a tsv file that includes only one column where the first is text.
                The first row must be header of column.

    lab2ind: dictionary of label classes
    tokenizer: BERT tokenizer
    max_len: maximal length of input sequence
    mode: train or predict
    '''
    # if we are in train mode, we will load two columns (i.e., text and label).
    if mode == 'train':
        # Use pandas to load dataset
        df = pd.read_csv(file_path, delimiter='\t',header=0, names=['content','label'])
        print("Data size ", df.shape)
        labels = df.label.values
        
        # Create sentence and label lists
        labels = [lab2ind[i] for i in labels] 
        print("Label is ", labels[0])
        
        # Convert data into torch tensors
        labels = torch.tensor(labels)

    # if we are in predict mode, we will load one column (i.e., text).
    elif mode == 'predict':
        df = pd.read_csv(file_path, delimiter='\t',header=0, names=['content'])
        print("Data size ", df.shape)
        # create placeholder
        labels = []
    else:
        print("the type of mode should be either 'train' or 'predict'. ")
        return
        
    # Create sentence and label lists
    content = df.content.values

    # We need to add a special token at the beginning for BERT to work properly.
    content = ["[CLS] " + text for text in content]

    # Import the BERT tokenizer, used to convert our text into tokens that correspond to BERT's vocabulary.
    tokenized_texts = [tokenizer.tokenize(text) for text in content]

    # [cls,s,cd,r,[SEP]]. max_len= 3
    # if the sequence is longer the maximal length, we truncate it to the pre-defined maximal length
    tokenized_texts = [ text[:max_len+1] for text in tokenized_texts]

    # We also need to add a special token at the end.
    tokenized_texts = [ text+['[SEP]'] for text in tokenized_texts]
    print ("Tokenize the first sentence:\n",tokenized_texts[0])
    
    # Use the BERT tokenizer to convert the tokens to their index numbers in the BERT vocabulary
    input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
    print ("Index numbers of the first sentence:\n",input_ids[0])

    # Pad our input seqeunce to the fixed length (i.e., max_len) with index of [PAD] token
    pad_ind = tokenizer.convert_tokens_to_ids(['[PAD]'])[0]
    input_ids = pad_sequences(input_ids, maxlen=max_len+2, dtype="long", truncating="post", padding="post", value=pad_ind)
    print ("Index numbers of the first sentence after padding:\n",input_ids[0])

    # Create attention masks
    attention_masks = []

    # Create a mask of 1s for each token followed by 0s for pad tokens 
    for seq in input_ids:
        seq_mask = [float(i>0) for i in seq]
        attention_masks.append(seq_mask)

    # Convert all of our data into torch tensors, the required datatype for our model
    inputs = torch.tensor(input_ids)
    masks = torch.tensor(attention_masks)

    return inputs, labels, masks

We use `BertTokenizer.from_pretrained()` to load vocabulary of pretrained model. The first argument should be either a string with the `shortcut name` of a pretrained model or a path to a directory containing model vocabulary file, `vocab.txt`. `Transformers` provides many pre-trained checkpoints with pre-defined `shortcut name`. If the argument is a correct model identifier listed on [here](https://huggingface.co/models), the model will download the vocabulary and load it to tokenizer automatically. If it doesn't match any model identifier, the model will use this argument as a path to load the vocabulary. 

We give "bert-large-uncased" as the first argument, which refers to the **24-layer, 1024-hidden, 16-heads, 340M parameters** [variant of BERT model](https://huggingface.co/transformers/pretrained_models.html). "bert-large-uncased" is a model identifier in `Transformers` so the vocabulary will be downloaded and applied to our tokenizer. The vocabulary of "bert-large-uncased" was generated using bype-pair encoding and includes 30,522 WordPieces.     

In [0]:
model_path = "bert-large-uncased"
# define label to number dictionary
lab2ind = {'no': 0, 'yes': 1}

# tokenizer from pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased',do_lower_case=True)

HBox(children=(IntProgress(value=0, description='Downloading', max=231508, style=ProgressStyle(description_wid…




In [0]:
# Use defined funtion to extract data
train_inputs, train_labels, train_masks = data_prepare("./drive/My Drive/Colab Notebooks/happy_db/train.tsv", lab2ind,tokenizer)
validation_inputs, validation_labels, validation_masks = data_prepare("./drive/My Drive/Colab Notebooks/happy_db/dev.tsv", lab2ind,tokenizer)

Data size  (8448, 2)
Label is  1
Tokenize the first sentence:
 ['[CLS]', 'it', 'was', 'my', 'birthday', ',', 'and', 'my', 'wife', 'and', 'daughter', 'surprised', 'me', 'with', 'some', 'surprise', 'guests', 'and', 'a', 'small', 'party', '.', '[SEP]']
Index numbers of the first sentence:
 [101, 2009, 2001, 2026, 5798, 1010, 1998, 2026, 2564, 1998, 2684, 4527, 2033, 2007, 2070, 4474, 6368, 1998, 1037, 2235, 2283, 1012, 102]
Index numbers of the first sentence after padding:
 [ 101 2009 2001 2026 5798 1010 1998 2026 2564 1998 2684 4527 2033 2007
 2070 4474 6368 1998 1037 2235 2283 1012  102    0    0    0    0    0
    0    0    0    0    0    0]
Data size  (1056, 2)
Label is  1
Tokenize the first sentence:
 ['[CLS]', 'my', 'baby', 'took', 'a', '1', '.', '5', 'hour', 'nap', 'instead', 'of', 'a', '20', '##min', '##ute', 'nap', 'and', 'i', 'was', 'able', 'to', 'get', 'some', 'things', 'done', '!', '[SEP]']
Index numbers of the first sentence:
 [101, 2026, 3336, 2165, 1037, 1015, 1012, 1019, 

In [0]:
train_inputs.shape

torch.Size([8448, 34])

Create an iterator of our data with `torch DataLoader`. This helps us to save on memory during training.

For fine-tuning BERT on a specific task, the authors recommend a batch size of 16 or 32. We use 32 batch size here. 



In [0]:
batch_size = 32
# We'll take training samples in random order in each epoch. 
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_dataloader = DataLoader(train_data, 
                              sampler = RandomSampler(train_data), # Select batches randomly
                              batch_size=batch_size)

# We'll just read validation set sequentially.
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_dataloader = DataLoader(validation_data, 
                                   sampler = SequentialSampler(validation_data), # Pull out batches sequentially.
                                   batch_size=batch_size)


## Creating `Bert_cls` class

Now, we put everthing together. We bulid a `Bert_cls` class to train a BERT classifier end-to-end.

In [0]:
class Bert_cls(nn.Module):
    def __init__(self, lab2ind, model_path, hidden_size):
        super(Bert_cls, self).__init__()
        self.model_path = model_path
        self.hidden_size = hidden_size
        self.bert_model = BertModel.from_pretrained(model_path, output_hidden_states=True, output_attentions=True)
        self.label_num = len(lab2ind)
        self.fc = nn.Linear(self.hidden_size, self.label_num)
    def forward(self, bert_ids, bert_mask):
        last_hidden_state, pooler_output, hidden_states, attentions = self.bert_model(input_ids=bert_ids)
        fc_output = self.fc(pooler_output)
        return fc_output, attentions

Instantiate model.

In [0]:
bert_model = Bert_cls(lab2ind, 'bert-large-uncased', 1024).to(device)

HBox(children=(IntProgress(value=0, description='Downloading', max=362, style=ProgressStyle(description_width=…




HBox(children=(IntProgress(value=0, description='Downloading', max=1344997306, style=ProgressStyle(description…




Count the number of parameters. 

In [0]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

The model has 335,143,938 trainable parameters


This model will use 10.72 Gb memory!!

## Optimizer and Learning Rate Scheduler

For the purposes of fine-tuning, the authors recommend the following hyperparameter ranges (from Appendix A.3 of the [paper](https://arxiv.org/pdf/1810.04805.pdf)):

* Batch size: 16, 32
* Learning rate (Adam): 5e-5, 3e-5, 2e-5
* Number of epochs: 2, 3, 4

We use:

* Batch size: 32
* Learning rate (Adam): 2e-5
* Number of epochs: 3

In [0]:
# Parameters:
lr = 2e-5
max_grad_norm = 1.0
epochs = 3
warmup_proportion = 0.1
num_training_steps  = len(train_dataloader) * epochs
num_warmup_steps = num_training_steps * warmup_proportion

### In Transformers, optimizer and schedules are instantiated like this:
# Note: AdamW is a class from the huggingface library
# the 'W' stands for 'Weight Decay"
optimizer = AdamW(bert_model.parameters(), lr=lr, correct_bias=False)
# schedules
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)  # PyTorch scheduler

# We use nn.CrossEntropyLoss() as our loss function. 
criterion = nn.CrossEntropyLoss()

# Model training

We define a `train()` function. 

In [0]:
def train(model, iterator, optimizer, scheduler, criterion):
    
    model.train()
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        # Add batch to TPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        input_ids, input_mask, labels = batch

        outputs,_ = model(input_ids, input_mask)

        loss = criterion(outputs, labels)
        # delete used variables to free memory
        del batch, input_ids, input_mask, labels

        loss.backward()
        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)  # Gradient clipping is not in AdamW anymore

        xm.optimizer_step(optimizer, barrier=True)  # Note: Cloud TPU-specific code!!!

        scheduler.step()
        epoch_loss += loss.cpu().item()
        optimizer.zero_grad()
  
    return epoch_loss / len(iterator)

PyTorch uses Cloud TPUs through the [XLA deep learning compiler](https://www.tensorflow.org/xla). This compiler records the operations we perform into a graph which is evaluated all-at-once and as needed. `xm.optimizer_step(optimizer, barrier=True)` inserts a "barrier" in the graph that forces evaluation every time the gradients are updated. This prevents XLA's graphs from growing too large. For more details about how PyTorch uses XLA and Cloud TPUs see [the documentation](http://pytorch.org/xla/).

We define a `evaluate()` function. 

In [0]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    all_pred=[]
    all_label = []
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            # Add batch to TPU
            batch = tuple(t.to(device) for t in batch)
            # Unpack the inputs from our dataloader
            input_ids, input_mask, labels = batch

            outputs,_ = model(input_ids, input_mask)
            
            loss = criterion(outputs, labels)

            # delete used variables to free TPU memory
            del batch, input_ids, input_mask
            
            epoch_loss += loss.cpu().item()

            # identify the predicted class for each example in the batch
            probabilities, predicted = torch.max(outputs.cpu().data, 1)
            # put all the true labels and predictions to two lists
            all_pred.extend(predicted)
            all_label.extend(labels.cpu())
    
    accuracy = accuracy_score(all_label, all_pred)
    f1score = f1_score(all_label, all_pred, average='macro') 
    return epoch_loss / len(iterator), accuracy, f1score

Training model

In [0]:
# Train the model
loss_list = []
acc_list = []

for epoch in trange(epochs, desc="Epoch"):
    train_loss = train(bert_model, train_dataloader, optimizer, scheduler, criterion)  
    val_loss, val_acc, val_f1 = evaluate(bert_model, validation_dataloader, criterion)

    # Create checkpoint at end of each epoch
    state = {
        'epoch': epoch,
        'state_dict': bert_model.cpu().state_dict(),
        'optimizer': optimizer.state_dict(),
        'scheduler': scheduler.state_dict()
        }

    torch.save(state, "./drive/My Drive/Colab Notebooks/ckpt_BERT/BERT_"+str(epoch+1)+".pt")

    print('\n Epoch [{}/{}], Train Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}, Validation F1: {:.4f}'.format(epoch+1, epochs, train_loss, val_loss, val_acc, val_f1))
    

	add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
	add_(Tensor other, Number alpha)
Epoch:  33%|███▎      | 1/3 [03:24<06:49, 204.92s/it]


 Epoch [1/3], Train Loss: 0.3165, Validation Loss: 0.2382, Validation Accuracy: 0.9252, Validation F1: 0.9250


Epoch:  67%|██████▋   | 2/3 [05:26<02:59, 179.93s/it]


 Epoch [2/3], Train Loss: 0.1761, Validation Loss: 0.1928, Validation Accuracy: 0.9328, Validation F1: 0.9323


Epoch: 100%|██████████| 3/3 [07:27<00:00, 149.19s/it]


 Epoch [3/3], Train Loss: 0.1261, Validation Loss: 0.1992, Validation Accuracy: 0.9375, Validation F1: 0.9370





### References:
* http://pytorch.org/xla/release/1.5/index.html
* https://github.com/pytorch/xla
* http://mccormickml.com/2019/07/22/BERT-fine-tuning/
* https://huggingface.co/transformers/index.html
* https://colab.research.google.com/drive/1ywsvwO6thOVOrfagjjfuxEf6xVRxbUNO#scrollTo=_QXZhFb4LnV5
* https://github.com/google-research/bert
* https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1