# Transformers and BERT
In this part, we will learn how to apply a pre-trained BERT model to improve text classification. Bidirectional Encoder Representations from Transformers (BERT) is a pre-training technique for natural language processing (NLP) developed by Google. It was created and published in 2018 by Jacob Devlin and his colleagues at Google, and Google uses BERT to better understand user searches. (Source: Wikipedia)





Read more: http://jalammar.github.io/illustrated-transformer/

BERT paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

Understanding BERT: https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad



## Preparing Dataset
Set random seed

Make sure that you are using Python3 and a GPU.

In [0]:
import torch
import random
import numpy as np


SEED = 1001

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


In [0]:
# A way to enlarge RAM: https://towardsdatascience.com/upgrade-your-memory-on-google-colab-for-free-1b8b18e8791d
# d=[]
# while(1):
#   d.append('1')

We use an existing library called `transformers` to import BERT models. Let's install it first.

Read more in the repo: https://github.com/huggingface/transformers


In [3]:
# make sure that transformers library is installed
! pip install transformers




We now import the tokenizer, which splits sentences into tokens.

In [4]:
from transformers import BertTokenizer
# let's use a pre-trained version ('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

print(max_input_length)

len(tokenizer.vocab)

# tokenize a sentence: you will see the tokenizer "cleans" the sentence as well.
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')

print(tokens)

# There will be a warning, but just leave it

512
['hello', 'world', 'how', 'are', 'you', '?']
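
BERT's tokenizer splits words that are not in its vocabulary into word pieces, and marks continuation pieces with a `##` prefix. The following is a minimal sketch of greedy longest-match-first splitting, assuming a tiny hypothetical vocabulary (`TOY_VOCAB`). The real `bert-base-uncased` vocabulary is far larger, and the library tokenizer also handles punctuation and text normalization, which this sketch does not.

```python
# Toy vocabulary (hypothetical; the real vocabulary has roughly 30k entries).
TOY_VOCAB = {"hello", "world", "clause", "##n", "play", "##ing", "[UNK]"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercased word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

def toy_tokenize(sentence):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [p for w in sentence.lower().split() for p in wordpiece(w)]

print(toy_tokenize("Hello WORLD Clausen playing"))
# ['hello', 'world', 'clause', '##n', 'play', '##ing']
```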


Now we convert the tokens into IDs.

We also list the IDs of some special tokens: `[CLS]` is the classification token, `[SEP]` is a separator between two sentences, and so on.

In [5]:
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

[7592, 2088, 2129, 2024, 2017, 1029]
[CLS] [SEP] [PAD] [UNK]
101 102 0 100


In [0]:
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens
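
`tokenize_and_cut` keeps at most `max_input_length - 2` tokens because two positions are reserved for `[CLS]` and `[SEP]`. Below is a small sketch of that budget using plain lists. The IDs 101, 102 and 0 are the ones printed earlier in this notebook; `build_input_ids` is a hypothetical helper and not part of `transformers` or `torchtext`.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0   # IDs printed by the tokenizer cell
MAX_LEN = 512                          # maximum input length for bert-base-uncased

def build_input_ids(token_ids, max_len=MAX_LEN, pad_to=None):
    """Truncate, wrap in [CLS] ... [SEP], and optionally right-pad."""
    body = token_ids[:max_len - 2]          # leave room for the two special tokens
    ids = [CLS_ID] + body + [SEP_ID]
    if pad_to is not None:
        ids = ids + [PAD_ID] * (pad_to - len(ids))
    return ids

short = build_input_ids([7592, 2088], pad_to=6)
long_ = build_input_ids(list(range(1000, 1600)))  # 600 made-up token IDs

print(short)       # [101, 7592, 2088, 102, 0, 0]
print(len(long_))  # 512
```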

Prepare the `TEXT` and `LABEL` fields.

In [0]:
from torchtext import data

TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

LABEL = data.LabelField(dtype = torch.float)

In [8]:
# follow the steps to authorize Colab to access your Google Drive data
from google.colab import drive
drive.mount('/content/gdrive')
# set up the path (no leading space, no shell-style backslash escapes)
ROOT_DIR = "gdrive/My Drive/Colab Notebooks/nlp_hw2/"
DATA_DIR = ROOT_DIR + 'IMDB.gz'



Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [9]:
## TODO: Let's use the IMDB data, and split into training and testing (this may take a few minutes)
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
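# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())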
# all_data = datasets.IMDB(DATA_DIR,TEXT, LABEL)
# train_data, test_data = all_data.splits(TEXT, LABEL)

# train_data, valid_data = ...


print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 17500
Number of validation examples: 7500
Number of testing examples: 25000
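
The counts above follow from a 70/30 split of the 25,000 training reviews, which is the default ratio of torchtext's `split`. Here is a sketch of a reproducible split using only the standard library. `seeded_split` is an illustrative helper and does not reproduce torchtext's exact shuffling order.

```python
import random

def seeded_split(items, ratio=0.7, seed=1001):
    """Shuffle with a private, seeded generator and cut at the given ratio."""
    rng = random.Random(seed)        # independent of the global random state
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

train_part, valid_part = seeded_split(range(25000))
print(len(train_part), len(valid_part))  # 17500 7500
```

Calling `seeded_split` again with the same seed returns the same two lists, which is the property the fixed `SEED` is meant to provide.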


In [10]:
tokens = tokenizer.convert_ids_to_tokens(vars(train_data.examples[6])['text'])

print(tokens)

['a', 'typical', 'clause', '##n', 'film', ',', 'but', 'then', 'again', 'not', 'typical', '.', 'clause', '##n', 'writes', ',', 'directs', 'and', 'play', 'one', 'of', 'the', 'leading', 'roles', '.', 'this', 'is', 'really', 'a', 'great', 'film', 'about', 'normal', 'people', 'living', 'normal', 'lives', 'trying', 'to', 'make', 'the', 'best', 'of', 'it', '.', 'the', '4', 'primary', 'actors', 'were', 'fantastic', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'fritz', 'helmut', 'was', 'convincing', '.', 'you', 'believe', 'that', 'he', 'really', 'is', 'sick', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'son', '##ja', 'richter', 'plays', 'a', 'nurse', 'that', 'really', 'is', 'an', 'actor', ',', 'but', 'it', 'turns', 'out', 'that', 'she', 'is', 'the', 'best', 'nurse', 'to', 'take', 'care', 'of', 'the', 'old', 'man', '.', '<', 'br', '/', '>', '<', 'br', '/', '>', 'everybody', 'has', 'problems', 'and', 'those', 'who', 'nobody', 'believes', 'in', 'ends', 'up', 'being', 'happy', '.', 'but', 'not

In [0]:
# build vocab
LABEL.build_vocab(train_data)

In [0]:
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)
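
`BucketIterator` groups examples of similar length so that each batch needs little padding. Below is a rough sketch of that idea under simplifying assumptions (sort once by length, then slice into batches). The torchtext implementation also shuffles, and `bucket_batches` is a hypothetical helper.

```python
def bucket_batches(sequences, batch_size, pad_id=0):
    """Sort by length, slice into batches, and pad each batch to its own width."""
    ordered = sorted(sequences, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        chunk = ordered[i:i + batch_size]
        width = max(len(s) for s in chunk)   # longest sequence in this batch
        batches.append([s + [pad_id] * (width - len(s)) for s in chunk])
    return batches

# Six made-up sequences with lengths 9, 2, 8, 3, 10, 1.
seqs = [[1] * n for n in (9, 2, 8, 3, 10, 1)]
batches = bucket_batches(seqs, batch_size=2)
print([len(b[0]) for b in batches])  # [2, 8, 10]
```

Without bucketing, a batch that mixes lengths 1 and 10 would pad the short sequence with nine extra positions.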

## Model building

Import `BertModel` and load the pre-trained weights by giving the model name.


In [0]:
from transformers import BertModel
# It will download the pre-trained model
bert = BertModel.from_pretrained('bert-base-uncased')


## Applying BERT to classification

It is possible to fine-tune the whole BERT model directly, but that is too expensive for the free GPU.
Instead, we keep BERT frozen and use its outputs as pre-trained embeddings. Then we train our own RNN layer on top of them.

In [0]:
import torch.nn as nn

class MyBERTwithRNN(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            embedded = self.bert(text)[0]
                
        #embedded = [batch size, sent len, emb dim]
        
        _, hidden = self.rnn(embedded)
        
        #hidden = [n layers * n directions, batch size, hid dim]
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        #hidden = [batch size, hid dim]
        
        output = self.out(hidden)
        
        #output = [batch size, out dim]
        
        return output
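
The indexing `hidden[-2]` and `hidden[-1]` relies on how PyTorch orders the final hidden states of a multi-layer bidirectional GRU: layer by layer, with the forward direction before the backward one. The sketch below only labels the rows of that first dimension, with no tensors involved. `hidden_layout` is an illustrative helper.

```python
def hidden_layout(n_layers, bidirectional):
    """Label each row of the final hidden state in PyTorch's ordering."""
    directions = ["forward", "backward"] if bidirectional else ["forward"]
    return [f"layer{layer}-{d}" for layer in range(n_layers) for d in directions]

rows = hidden_layout(n_layers=2, bidirectional=True)
print(rows)
# ['layer0-forward', 'layer0-backward', 'layer1-forward', 'layer1-backward']
print(rows[-2], rows[-1])
# layer1-forward layer1-backward
```

So the concatenation in `forward` joins the top layer's forward and backward states, which is why the linear layer takes `hidden_dim * 2` inputs in the bidirectional case.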

In [0]:
## create model instance
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

model = MyBERTwithRNN(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

In [16]:
def count_parameters(model):
    ## fill here:
    param_number = 0
    param_number = sum(p.numel() for p in model.parameters() if p.requires_grad)
    
    return param_number

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 112,241,409 trainable parameters


In [0]:
# Freeze the BERT parameters so that only the GRU and linear layers are trained
for name, param in model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False
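
With BERT frozen, only the GRU and the linear layer remain trainable. As a sanity check, the sketch below recomputes their size from the standard GRU parameter layout (three gates, each with input weights, recurrent weights and two bias vectors, as in PyTorch's `nn.GRU`). The helper names are illustrative, and 768 is the hidden size of `bert-base-uncased`.

```python
def gru_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Number of parameters in a GRU with PyTorch's weight layout."""
    directions = 2 if bidirectional else 1
    total = 0
    layer_input = input_dim
    for _ in range(n_layers):
        # 3 gates x (input weights + recurrent weights + 2 bias vectors)
        per_direction = 3 * (layer_input * hidden_dim
                             + hidden_dim * hidden_dim
                             + 2 * hidden_dim)
        total += directions * per_direction
        layer_input = hidden_dim * directions  # next layer sees both directions
    return total

def linear_param_count(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

gru = gru_param_count(768, 256, n_layers=2, bidirectional=True)
head = gru + linear_param_count(256 * 2, 1)

print(f"{head:,} trainable parameters after freezing")   # 2,759,169
print(f"{112_241_409 - head:,} parameters inside BERT")  # 109,482,240
```

Rerunning `count_parameters(model)` after the freezing cell should report the smaller number, since it only counts parameters with `requires_grad=True`.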

## Model Training


In [0]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
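
`nn.BCEWithLogitsLoss` combines a sigmoid with binary cross-entropy in one numerically stable step, which is why the model returns raw logits rather than probabilities. Below is a plain-Python sketch of the per-example loss in a commonly used stable form. It is not a copy of PyTorch's source, and `bce_with_logits` and `naive_bce` are illustrative names.

```python
import math

def bce_with_logits(logit, target):
    """Stable binary cross-entropy on a raw logit: max(x, 0) - x*y + log(1 + e^-|x|)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))

def naive_bce(logit, target):
    """Textbook form; breaks down for large |logit| because log(0) appears."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

print(round(bce_with_logits(0.0, 1.0), 4))   # 0.6931, i.e. log(2)
print(bce_with_logits(1000.0, 0.0))          # 1000.0, no overflow
```

For moderate logits the two functions agree, while the naive version fails on inputs such as `naive_bce(1000.0, 0.0)`.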

In [0]:
## TODO: define the accuracy function (Hint: similar to the same function from PartB)
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() 
    acc = correct.sum()/len(correct)
    return acc
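
Rounding `sigmoid(logit)` amounts to predicting the positive class whenever the logit is positive. The sketch below mirrors `binary_accuracy` with plain Python lists, assuming a 0.5 threshold. It is only meant to make the arithmetic visible, and the logits and labels are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_accuracy_list(logits, labels):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1.0 if sigmoid(z) > 0.5 else 0.0 for z in logits]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Predictions are 1, 0, 1, 0 against labels 1, 0, 0, 0: three of four are right.
print(binary_accuracy_list([2.0, -1.0, 0.3, -0.2], [1.0, 0.0, 0.0, 0.0]))  # 0.75
```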

In [0]:
## TODO: finish the training function
## model is our model; iterator contains data in batches; criterion is to calculate the loss function
## we want to return the average epoch loss and epoch accuracy.  (Hint: use binary_accuracy() )

def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        # ...
        optimizer.zero_grad()
        batch_pred = model(batch.text).squeeze()
        loss = criterion(batch_pred, batch.label)
        acc = binary_accuracy(batch_pred, batch.label)

        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [0]:
## TODO: finish the evaluation function
## model is our model; iterator contains data in batches; criterion is to calculate the loss function
## we want to return the average epoch loss and epoch accuracy.   (Hint: use binary_accuracy() )
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            # ...
            batch_pred = model(batch.text).squeeze()
            loss = criterion(batch_pred, batch.label)
            acc = binary_accuracy(batch_pred, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
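
Both `train` and `evaluate` average the per-batch accuracies, so a smaller final batch counts as much as a full one. The difference is usually tiny. The sketch below compares the two ways of averaging on made-up numbers (the batch sizes and correct counts are illustrative, not results from this notebook).

```python
def mean_of_batch_means(correct, sizes):
    """What the notebook computes: average of per-batch accuracies."""
    return sum(c / n for c, n in zip(correct, sizes)) / len(sizes)

def example_weighted_mean(correct, sizes):
    """Accuracy over all examples, regardless of how they were batched."""
    return sum(correct) / sum(sizes)

sizes = [128, 128, 92]     # last batch is smaller
correct = [115, 118, 70]   # hypothetical number of correct predictions per batch

print(round(mean_of_batch_means(correct, sizes), 4))
print(round(example_weighted_mean(correct, sizes), 4))
```

When every batch has the same size, the two values coincide.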

In [0]:
import time
# a helper function to see how much time needed
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs
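
As a quick check of the helper, 406 elapsed seconds correspond to the `6m 46s` format used in the training log. The sketch below is an equivalent formulation with `divmod`, assuming non-negative durations; it is restated here only so that the snippet runs on its own.

```python
def epoch_time(start_time, end_time):
    """Split an elapsed duration in seconds into whole minutes and seconds."""
    elapsed = int(end_time - start_time)
    return divmod(elapsed, 60)

mins, secs = epoch_time(0.0, 406.7)
print(f"Epoch Time: {mins}m {secs}s")  # Epoch Time: 6m 46s
```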

In [23]:
# Start training.
# Note that one epoch can take up to ~17 minutes, depending on the GPU.
# The output will look like: Epoch: 01 | Epoch Time: 17m 36s...
# Validation accuracy should be higher than 85% in the first epoch and higher than 90% in the second epoch.

N_EPOCHS = 2

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
        
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Epoch: 01 | Epoch Time: 6m 46s
	Train Loss: 0.480 | Train Acc: 75.87%
	 Val. Loss: 0.267 |  Val. Acc: 89.21%
Epoch: 02 | Epoch Time: 6m 46s
	Train Loss: 0.268 | Train Acc: 89.05%
	 Val. Loss: 0.242 |  Val. Acc: 90.37%


In [24]:
# Load the best model and evaluate; this may take about 5-10 mins; the Test Accuracy is higher than 90%
model.load_state_dict(torch.load('best_model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Test Loss: 0.246 | Test Acc: 90.27%


## Inference on your own sentence

In [0]:
## When converting a sentence into indices, add the [CLS] token at the beginning and the [SEP] token at the end.
def my_predict(model, tokenizer, sentence):
    model.eval()
    tokens = tokenizer.tokenize(sentence)
    tokens = tokens[:max_input_length-2]
    indexed = [init_token_idx] + tokenizer.convert_tokens_to_ids(tokens) + [eos_token_idx]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(0)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

In [26]:
# the score should be close to 0
my_predict(model, tokenizer, "This film is terrible")

0.09066072851419449

In [27]:
# the score should be close to 1
my_predict(model, tokenizer, "I like it !")

0.9777471423149109

**Question**: How does the BERT model compare with the models from Part B? (Hint: training time, accuracy, etc.) Please answer in the next text cell.

**Answer**: BERT achieves higher accuracy than the networks in Part B. However, BERT requires more training time than the networks in Part B.

## Submission

Now that you have completed the assignment, follow the steps below to submit your assignment:
1. Click __Runtime__  > __Run all__ to generate the output for all cells in the notebook. 
2. Save the notebook (__File__  > __Save__), then download it with the output from all the cells by clicking __File__ > __Download .ipynb__.
3. **Keep the output cells** , and answers to the question in the Text cell. 
4. Put the .ipynb file under your hidden directory on the Zoo server `~/hidden/<YOUR_PIN>/Homework2/`.
5. As a final step, run a script that will set up the permissions to your homework files, so we can access and run your code to grade it. Make sure the command below runs without errors, and do not make any changes or run the code again afterwards. If you do run the code again or make any changes, you need to run the permissions script again. Submissions without the correct permissions may incur some grading penalty.
`/home/classes/cs477/bash_files/hw2_set_permissions.sh <YOUR_PIN>`
