In [None]:
# Jupyter Candies 

# Keep the notebook readable by suppressing warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Install the additional libraries.
! pip install transformers
! pip install torch==1.2.0

# Introduction

The first section of this notebook ***very*** briefly introduces some background concepts that are good to know:

- The ImageNet Moment in NLP
- A Zoo of Pre-trained Models
- BERT (Bidirectional Encoder Representations from Transformers) basics; BERT is one of the more popular transfer learning models for NLP


The **second section** demonstrates how you can use the BERT model from the `transformers` library to: 

1. **Convert text to array/list of floats** 
2. **Fill in the blanks** 

<!--
3. **Fine-tune the pre-trained model** based on the data you want to use for a specific task
4. **Apply the fine-tuned model** to a couple of downstream tasks
-->

### References

I strongly recommend these readings to better understand and appreciate the first part of the notebook =)

 - [Rush (2018) blogpost](https://nlp.seas.harvard.edu/2018/04/03/attention.html) on "The Annotated Transformer" that explains the Transformer architecture 
 - [Ruder et al. (2019) tutorial](http://ruder.io/state-of-transfer-learning-in-nlp/index.html) on "Transfer Learning in NLP" @ NAACL
 - [Weng (2019) blogpost](https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html) on "Generalized Language Models"
 - https://github.com/huggingface/transformers
 - https://github.com/explosion/spacy-transformers
 

## Background: The ImageNet Moment for NLP 


Transfer learning gained traction in Computer Vision, made popular by the [ImageNet](http://www.image-net.org) and [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html) image classification tasks. Similarly, transfer learning gained popularity in NLP with a wave of Transformer-based models, with BERT being one of the more popular ones from the zoo.


## Background: A Zoo of Pre-trained Models

There's a whole variety of pre-trained transfer learning models in the wild. [Sanh et al. (2019)](https://arxiv.org/abs/1910.01108) put them nicely into a chart of the no. of parameters* of each model against the date the model was released: 

<img src="https://miro.medium.com/max/4140/1*IFVX74cEe8U5D1GveL1uZA.png" alt="DistilBERT" style="width:700px;"/>

***Note:** "*Parameters*" approximates to how much "*memory*"/"*information*" the model is storing after pre-training.


<!-- 
Here's a summary inspired by [Weng's (2019) blogpost](https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html):

| Name | Architecture | Autoregressive | No. of Parameters | Release Date | Pre-training | Downstream tasks | Downstream Model | 
|:-|:-|:-:|:-:|:-:|:-:|:-:|:-|
| [ELMo](https://allennlp.org/elmo) | 2-layers BiLSTM | Yes | 94M | Apr 2018 | Unsupervised | Feature-based | Task-agnostic | None | 
| [ULMFit](http://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html) | AWD-LSTM | Yes | ?? | Apr 2018 | Unsupervised | Feature-based | Task-agnostic | None | 
| [GPT](https://openai.com/blog/language-unsupervised/) | Transformer Decoder | Yes | 110M | Jul 2018 | Unsupervised | Model-baed | Task-agnostic | Pre-trained layers + Task layers | 
| [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) | Transformer Encoder | No | 340M | Oct 2018 | Unsupervised | Model-based | Task-agnostic | Pre-trained layers + Task layers | 
| [Transfomer ElMo](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/training_transformer_elmo.md) | Transformer Decoder | Yes | 465M | Jan 2019 | Unsupervised | Task-agnostic | Pre-trained layers + Task layers | 
| [GPT-2](https://openai.com/blog/better-language-models/) | Transformer Decoder | Yes | 1500M | Feb 2019 | Unsupervised | Model-baed | Task-agnostic | Pre-trained layers + Task layers | 

-->

## Background: BERT Basics

Let's start with the elephant in the zoo!

<img src="https://lilianweng.github.io/lil-log/assets/images/BERT-input-embedding.png" alt="BERTInputs" style="width:700px;"/>

First, the input string needs to be prepended with the `[CLS]` token. This special token reserves a placeholder position whose output can be used to produce the label for classification tasks. 

Then, for each sentence inside the text string, an explicit `[SEP]` token needs to be added to indicate the end of the sentence. 


Then the string input needs to be converted to three components before passing them to the Transformer model:

 - **WordPiece tokenization**: The text (string) input is split into tokens using the WordPiece model, which may split natural words further into sub-word units to handle rare/unknown words. 
 
 - **Segment Indices**: This part indicates the start and end of the sentences in the string inputs, delimited by the special `[SEP]` token.
 
 - **Position Indices**: This part simply enumerates the index of WordPiece tokens. 
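
The segment and position indices described above can be sketched in plain Python. This is only an illustration that assumes the text has already been split into WordPiece tokens; the helper name `segment_and_position_ids` is hypothetical and not part of the `transformers` library.

```python
def segment_and_position_ids(tokens, sep_token="[SEP]"):
    """Return (segment_ids, position_ids) for a list of token strings.

    Hypothetical helper for illustration only.
    """
    segment_ids, position_ids = [], []
    current_segment = 0
    for position, token in enumerate(tokens):
        segment_ids.append(current_segment)
        position_ids.append(position)
        # A [SEP] closes the current sentence, so the *next* token
        # belongs to a new segment.
        if token == sep_token:
            current_segment += 1
    return segment_ids, position_ids


tokens = "[CLS] my dog is cute [SEP] he likes play ##ing [SEP]".split()
segment_ids, position_ids = segment_and_position_ids(tokens)
for token, segment, position in zip(tokens, segment_ids, position_ids):
    print(position, segment, token)
```

Each `[SEP]` is counted as part of the sentence it closes, which is the same convention the NumPy one-liner later in this notebook produces.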

In [None]:
from itertools import chain
from collections import namedtuple

import numpy as np
import torch
from transformers import BertTokenizer, BertModel, BertForMaskedLM

# Load pre-trained model tokenizer (vocabulary)
# A tokenizer will split the text into the appropriate sub-parts (aka. tokens).
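# Depending on how the pre-trained model was trained, tokenizers differ.
# Note: the 'bert-large-uncased' and 'bert-base-uncased' checkpoints share the same
# WordPiece vocabulary, so this tokenizer also works with the base model used below.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')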

In [None]:
# Example of a tokenized input after WordPiece Tokenization.
text = "[CLS] my dog is cute [SEP] he likes playing [SEP]"
tokenizer.tokenize(text)


### Gotcha! The output is different from the example in the image above!!

That's because the full word `playing` is inside the `BertTokenizer`'s WordPiece vocabulary.
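
To see why a word that is in the vocabulary stays whole while an unknown word gets split, here is a minimal sketch of greedy longest-match-first splitting, the strategy WordPiece tokenizers use at lookup time. The `toy_vocab` below is a made-up five-entry vocabulary (the real one has tens of thousands of entries), and `wordpiece` is a hypothetical helper, not the library's implementation.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No prefix matched at all: the whole word is unknown.
            return [unk_token]
        pieces.append(piece)
        start = end
    return pieces


# Assumption: a tiny made-up vocabulary, for illustration only.
toy_vocab = {"playing", "play", "slack", "##ing", "dog"}

print(wordpiece("playing", toy_vocab))
print(wordpiece("slacking", toy_vocab))
print(wordpiece("cat", toy_vocab))
```

Because the full word `playing` is in the toy vocabulary, the longest match wins and no `##ing` piece is produced.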

In [None]:
"playing" in tokenizer.wordpiece_tokenizer.vocab

### Let's try another verb that's not in the vocabulary.

In [None]:
print("slacking" in tokenizer.wordpiece_tokenizer.vocab)

text = "[CLS] my dog is cute [SEP] he likes slacking [SEP]"
tokenized_text = tokenizer.tokenize(text)  # There, we see the ##ing token!
tokenized_text

### We fetch the index of these words from the model's vocabulary. 

In [None]:
token_indices = tokenizer.convert_tokens_to_ids(tokenized_text)
token_indices

### Corresponding to the text input, we need to create the "segment indices"

In [None]:
# We need to create an array that indicates which sentence each token belongs to,
# with sentences delimited by [SEP].
text = "[CLS] my dog is cute [SEP] he likes slacking [SEP]"
tokenized_text = tokenizer.tokenize(text)  # There, we see the ##ing token!

# First we mark the positions of `[SEP]`, then take the running total (cumulative sum).
# Subtracting the mask keeps each `[SEP]` in the sentence that it closes.
# Here's some Numpy gymnastics... (Thanks to @divakar https://stackoverflow.com/a/58316889/610569)
m = np.asarray(tokenized_text) == "[SEP]"
segments_ids = m.cumsum() - m
segments_ids

### Now, we convert the list and numpy arrays to PyTorch's Tensor objects

In [None]:
tokens_tensor, segments_tensors = torch.tensor([token_indices]), torch.tensor([segments_ids])

# See the type change?
print(tokens_tensor.shape, type(token_indices), type(tokens_tensor))
print(segments_tensors.shape, type(segments_ids), type(segments_tensors))

# Let's convert our input text to an array of numbers!!!

### First, we load the pre-trained model

In [None]:
# When using the BERT model for "encoding", i.e. converting a string to an array of floats, 
# we use the `BertModel` object from the `transformers` library.
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()

In [None]:
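# Compute the last layer's hidden state for each token.
with torch.no_grad():
    # Pass the segment indices by keyword: the second positional argument of
    # `BertModel.forward` in `transformers` is `attention_mask`, not `token_type_ids`.
    encoded_layers, _ = model(tokens_tensor, token_type_ids=segments_tensors)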

In [None]:
encoded_layers

We see that the output is 3-dimensional, i.e. (`batch_size`, `sequence_length`, `hidden_dimension`), where

 - `batch_size` corresponds to "no. of sentences"
 - `sequence_length` corresponds to "no. of tokens"
 - `hidden_dimension` refers to the "information for each word provided by the pre-trained model"
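
As a rough picture of that layout, the sketch below uses a hand-written nested list in place of the real tensor. The numbers are made up and the hidden size is 4 only to keep it readable (the real `bert-base-uncased` model uses 768); `shape_of` is a hypothetical helper.

```python
# Assumption: a made-up "encoder output" with 1 sentence, 3 tokens, hidden size 4.
toy_encoded = [
    [
        [0.1, 0.2, 0.3, 0.4],  # vector for token 0, e.g. [CLS]
        [0.5, 0.6, 0.7, 0.8],  # vector for token 1
        [0.9, 1.0, 1.1, 1.2],  # vector for token 2, e.g. [SEP]
    ]
]


def shape_of(nested):
    """Return the shape of a regularly nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


batch_size, sequence_length, hidden_dimension = shape_of(toy_encoded)
print(batch_size, sequence_length, hidden_dimension)

# Indexing [sentence][token] gives the array of floats for one token.
first_token_vector = toy_encoded[0][0]
print(first_token_vector)
```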

In [None]:
encoded_layers.shape

# The BERT model is very good at the fill-in-the-blank task

The BERT model is trained using a "cloze" task, where words are randomly replaced with the `[MASK]` symbol and the model adjusts its parameters to learn which words are most probable at the `[MASK]` positions.
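
A minimal sketch of how one such cloze training example could be built, using only the standard library. It is deliberately simplified: every selected token becomes `[MASK]`, whereas BERT's actual recipe also swaps some selected tokens for random words or leaves them unchanged. The helper name `make_cloze_example` and its parameters are hypothetical.

```python
import random


def make_cloze_example(tokens, mask_probability=0.15, seed=0, mask_token="[MASK]"):
    """Randomly blank out tokens; return (masked_tokens, labels).

    `labels[i]` holds the original token where a blank was made and
    None everywhere else (no prediction is required there).
    """
    rng = random.Random(seed)
    special_tokens = {"[CLS]", "[SEP]"}
    masked, labels = [], []
    for token in tokens:
        if token not in special_tokens and rng.random() < mask_probability:
            masked.append(mask_token)
            labels.append(token)  # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "[CLS] my dog is cute [SEP]".split()
# A high probability is used here only so that the tiny example shows some blanks.
masked, labels = make_cloze_example(tokens, mask_probability=0.5, seed=1)
print(masked)
print(labels)
```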

When using the BERT model for "guessing missing words", we use the `BertForMaskedLM` object from the `transformers` library. Here's an example: if we blank out words in a sentence, BERT is able to find appropriate words to fill them in.

In [None]:
# Load the model.
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

In [None]:
# Tokenize the text and look up the vocabulary index of each token.
text = "[CLS] please don't let the [MASK] out of the [MASK] . [SEP]"
tokenized_text = tokenizer.tokenize(text)
token_indices = tokenizer.convert_tokens_to_ids(tokenized_text)

# Create the segment indices.
m = np.asarray(tokenized_text) == "[SEP]"
segments_ids = m.cumsum()-m

# Convert the lists/arrays to PyTorch tensors.
tokens_tensor, segments_tensors = torch.tensor([token_indices]), torch.tensor([segments_ids])

In [None]:
# Apply the model to the inputs.
with torch.no_grad(): # You can take this context manager to mean that we're not training.
    outputs, *_ = model(tokens_tensor, token_type_ids=segments_tensors)

In [None]:
outputs.shape

Now we see that the output tensor shape is different. The dimensions now refer to (`batch_size`, `sequence_length`, `vocab_size`), where: 

 - `batch_size` corresponds to "no. of sentences"
 - `sequence_length` corresponds to "no. of tokens"
 - `vocab_size` is the no. of WordPiece tokens in the tokenizer's vocabulary; we'll use the scores along this dimension to fetch the word that best fills in the `[MASK]` symbol.
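
To make the last dimension concrete: for each token position the model produces one unnormalized score per vocabulary entry, and the highest score picks the filler word. The sketch below uses a made-up five-word vocabulary and made-up scores; `softmax` and `top_k` are hypothetical helpers that show how the scores could be turned into ranked candidates.

```python
import math


def softmax(scores):
    """Turn unnormalized scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, id_to_token, k=3):
    """Return the k most probable (token, probability) pairs."""
    probs = softmax(scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id_to_token[i], probs[i]) for i in ranked[:k]]


# Assumption: a made-up vocabulary and made-up scores for one [MASK] position.
toy_vocab = ["cat", "dog", "bag", "house", "beer"]
toy_scores = [2.0, 3.5, 0.5, 1.0, -1.0]

candidates = top_k(toy_scores, toy_vocab, k=2)
for token, probability in candidates:
    print(token, round(probability, 3))
```

Softmax preserves the ordering of the scores, so taking the arg-max of the raw scores selects the same word as taking the most probable candidate.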

In [None]:
print(tokenized_text)

In [None]:
# We have to check where the first masked token is in the tokenized text. 
mask_index = tokenized_text.index('[MASK]') 
print(mask_index) # Index 7 (0-based), i.e. the 8th token.

# Then we fetch the vector at that index. 
# The [0, mask_index] refers to accessing the vector of vocab_size for
# the 0th sentence, mask_index-th token.
output_value = outputs[0, mask_index]

# As a sanity check we can see that the shape of the output_value
# is the same as the `vocab_size` from the outputs' shape.
print(output_value.shape, 
      output_value.shape[0] == outputs.shape[-1])

In [None]:
# Let's recap the original sentence with the masked words.
print(text)

# We have to check where the first masked token is in the tokenized text. 
mask_index = tokenized_text.index('[MASK]') 
output_value = outputs[0, mask_index]
# We use torch.argmax to get the index with the highest value.
mask_word_in_vocab = int(torch.argmax(output_value))
print(tokenizer.convert_ids_to_tokens([mask_word_in_vocab]))

In [None]:
# Let's recap the original sentence with the masked words.
print(text)

# This time we loop over every masked token, not just the first one. 
for mask_index, token in enumerate(tokenized_text):
    if token == '[MASK]':
        output_value = outputs[0, mask_index]
        mask_word_in_vocab = int(torch.argmax(output_value))
        print(tokenizer.convert_ids_to_tokens([mask_word_in_vocab]))

# Let's turn the fill-in-the-blank feature into a function.

In [None]:
def fill_in_the_blanks(text, model, return_str=False):
    # Note: this function relies on the global `tokenizer` loaded earlier.
    tokenized_text = tokenizer.tokenize(text)
    token_indices = tokenizer.convert_tokens_to_ids(tokenized_text)
    # Create the segment indices.
    m = np.asarray(tokenized_text) == "[SEP]"
    segments_ids = m.cumsum() - m
    # Convert the lists/arrays to PyTorch tensors.
    tokens_tensor = torch.tensor([token_indices])
    segments_tensors = torch.tensor([segments_ids])
    
    # Apply the model to the inputs.
    with torch.no_grad(): # You can take this context manager to mean that we're not training.
        outputs, *_ = model(tokens_tensor, token_type_ids=segments_tensors)
    
    output_tokens = []
    for mask_index, token_id in enumerate(token_indices):
        token = tokenizer.convert_ids_to_tokens([token_id])[0]
        if token == '[MASK]':
            output_value = outputs[0, mask_index]
            # The masked word index in the vocab.
            mask_word_in_vocab = int(torch.argmax(output_value))
            token = tokenizer.convert_ids_to_tokens([mask_word_in_vocab])[0]
        output_tokens.append(token)
        
    return " ".join(output_tokens).replace(" ##", "") if return_str else output_tokens

In [None]:
# Load the model.
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()

text = "[CLS] please don't let the [MASK] out of the [MASK] . [SEP] the [MASK] [MASK] [MASK] ran [MASK] . [SEP]"
print(fill_in_the_blanks(text, model, return_str=True))

In [None]:
text = "[CLS] i like to drink beer and eat [MASK] . [SEP]"
print(fill_in_the_blanks(text, model, return_str=True))

In [None]:
text = "[CLS] i like to drink coffee and eat [MASK] . [SEP]"
print(fill_in_the_blanks(text, model, return_str=True))

# Fine-tuning BERT models. 

By default, the pre-trained model is trained on:

 - BookCorpus, ~800M words
 - English Wikipedia, ~2500M words
 
If we want the model to adapt to a specific domain, we need to ***fine-tune*** the model. This section demonstrates how this can be done with the same `transformers` library.


In [None]:
phoenix_turtle = """Truth may seem but cannot be;\nBeauty brag but ’tis not she;\nTruth and beauty buried be."""
sonnet20 = """A woman’s face with Nature’s own hand painted\nHast thou, the master-mistress of my passion;\nA woman’s gentle heart, but not acquainted\nWith shifting change, as is false women’s fashion;"""
sonnet1 = """From fairest creatures we desire increase,\nThat thereby beauty’s rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:"""
sonnet73 = """In me thou see’st the glowing of such fire,\nThat on the ashes of his youth doth lie,\nAs the death-bed whereon it must expire,\nConsum’d with that which it was nourish’d by."""
venus_adonis = """It shall be cause of war and dire events,\nAnd set dissension ‘twixt the son and sire;\nSubject and servile to all discontents,\nAs dry combustious matter is to fire:\nSith in his prime Death doth my love destroy,\nThey that love best their loves shall not enjoy\n"""
sonnet29 = """When, in disgrace with fortune and men’s eyes,\nI all alone beweep my outcast state,\nAnd trouble deaf heaven with my bootless cries,\nAnd look upon myself and curse my fate,"""
sonnet130 = """I have seen roses damask’d, red and white,\nBut no such roses see I in her cheeks;\nAnd in some perfumes is there more delight\nThan in the breath that from my mistress reeks."""
sonnet116 = """Love’s not Time’s fool, though rosy lips and cheeks\nWithin his bending sickle’s compass come;\nLove alters not with his brief hours and weeks,\nBut bears it out even to the edge of doom."""
sonnet18 = """But thy eternal summer shall not fade\nNor lose possession of that fair thou ow’st;\nNor shall Death brag thou wander’st in his shade,\nWhen in eternal lines to time thou grow’st;\nSo long as men can breathe or eyes can see,\nSo long lives this, and this gives life to thee."""
anthony_cleo = """She made great Caesar lay his sword to bed;\nHe plowed her, and she cropped."""

shakespeare = [phoenix_turtle, sonnet20, sonnet1, sonnet73, venus_adonis,
              sonnet29, sonnet130, sonnet116, sonnet18, anthony_cleo]

In [None]:
from transformers import BertConfig, BertForMaskedLM, BertTokenizer

# Load the BERT model.
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# Load the BERT Tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased')
# Load the BERT Config.
config = BertConfig.from_pretrained('bert-large-uncased')

In [None]:
from torch.utils.data import DataLoader, Dataset, SequentialSampler, RandomSampler
import torch.nn.functional as F

class TextDataset(Dataset):
    def __init__(self, texts, tokenizer):
        """
        :param texts: A list of documents, each document is a string.
        :type texts: list(str)
        """
        tokenization_process = lambda s: tokenizer.build_inputs_with_special_tokens(
                                             tokenizer.convert_tokens_to_ids(
                                                 tokenizer.tokenize(s.lower())))
        pad_sent = lambda x: np.pad(x, (0,tokenizer.max_len_single_sentence - len(x)), 'constant', 
                                    constant_values=tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
        self.examples = torch.tensor([pad_sent(tokenization_process(doc)) for doc in texts])

    def __len__(self):
        return len(self.examples)

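    def __getitem__(self, item):
        # `self.examples` is already a tensor, so index it directly
        # instead of wrapping the result in `torch.tensor` again.
        return self.examples[item]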

# Initialize the Dataset object.
train_dataset = TextDataset(shakespeare, tokenizer)
# Initialize the DataLoader object; `batch_size=2` means it reads 2 poems at a time.
dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), batch_size=2)

In [None]:
# 10 poems with 510 tokens per poem; 
# if a poem has <510 tokens, pad with the 0th index.
print(train_dataset.examples.shape)

# For each batch, we read 2 poems at a time.
print(next(iter(dataloader)).shape)

In [None]:
# An example of a batch.
next(iter(dataloader))

In [None]:
def mask_tokens(inputs, tokenizer, mlm_probability=0.8):
    """ Prepare masked tokens inputs/labels for masked language modeling.

    Each non-special token is selected with probability `mlm_probability`;
    of the selected tokens, 80% become [MASK], 10% a random token and 10% stay unchanged.
    """
    labels = inputs.clone()
    # We sample the tokens in each sequence that the model has to predict.
    # Note: BERT/RoBERTa use mlm_probability=0.15; this notebook defaults to 0.8.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special_tokens_mask = [tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    # Never select padding positions either, otherwise the model is mostly trained to predict [PAD].
    padding_mask = labels.eq(tokenizer.convert_tokens_to_ids(tokenizer.pad_token))
    probability_matrix.masked_fill_(padding_mask, value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -1  # We only compute loss on the selected tokens

    # 80% of the time, we replace the selected input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time (half of the remaining 20%), we replace the selected input tokens with a random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the selected input tokens unchanged
    return inputs, labels

In [None]:
from transformers import AdamW, WarmupLinearSchedule

Arguments = namedtuple('Arguments', ['learning_rate', 'weight_decay', 'adam_epsilon', 'warmup_steps', 
                                     'max_steps', 'num_train_epochs'])

args = Arguments(learning_rate=5e-5, weight_decay=0.0, adam_epsilon=1e-8, warmup_steps=0, # Optimizer arguments
                 max_steps=10, num_train_epochs=10  # Training routine arguments
                )  

# Prepare optimizer and schedule (linear warmup and decay)
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': args.weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
    ]

optimizer = AdamW(optimizer_grouped_parameters, lr=args.learning_rate, eps=args.adam_epsilon)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=args.warmup_steps, t_total=args.max_steps)

In [None]:
for _e in range(args.num_train_epochs):
    print(_e)
    for step, batch in enumerate(dataloader):
        # Select tokens for masked-LM prediction and corrupt them (see `mask_tokens`). 
        inputs, labels = mask_tokens(batch, tokenizer)
        # Put the model in training mode (this enables dropout).
        model.train()
        # Feed forward the inputs through the models.
        loss, _ = model(inputs, masked_lm_labels=labels)
        # Backpropagate the loss.
        loss.backward()
