# Text Generation - Fine-tuning GPT-2

In this notebook we'll tackle the task of text generation with the notorious [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. We'll look at the data preparation and fine-tuning needed for GPT-2 to produce the text we want. Our goal is to fine-tune a pretrained model to produce motivational/inspirational quotes.

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100 :P):

In [None]:
!nvidia-smi

Let's install and import everything we need:

In [None]:
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/generation_utils.py
!wget https://github.com/andrejmiscic/NLP-workshop/raw/master/utils/trainer.py

In [None]:
!pip install transformers
!pip install gdown

In [None]:
import gc
import os

import numpy as np
import torch
import torch.nn as nn

from generation_utils import TextDatasetWithEpochs
from sklearn.model_selection import train_test_split
from trainer import Trainer, RunConfig
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, DataCollatorForLanguageModeling

## Data

As we already mentioned, our goal is to fine-tune a GPT-2 model to produce motivational quotes. For this reason we are working with a dataset of 32716 quotes collected and cleaned from this [Kaggle source](https://www.kaggle.com/stuffbyyc/quotes). 

Let's load the data and take a look at some of the examples:

In [None]:
!wget https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/MotivationalQuotes/quotes.txt

In [None]:
# read one quote per line, stripping surrounding whitespace
with open("/content/quotes.txt", "r", encoding="utf-8") as f:
    examples = [line.strip() for line in f]

print(f"Our dataset contains {len(examples)} quotes.")
print("Some examples:")
print("- " + "\n- ".join(examples[:3]))

Good, after looking at some of these quotes we're of course super motivated to continue. Our next step is to design a dataset class that will hold our data and serve training samples.

The input to the GPT-2 model is a batch of lists of input ids, i.e. the indices of tokens in the vocabulary. To form a batch, these lists must be of equal length. We usually achieve this by truncating quotes that are too long and padding those that are too short. However, here we opt for a different approach:

First we combine all of the quotes into one long text - a motivational essay if you will :). To delimit the individual samples we insert an "example delimiter" token, `<|endoftext|>`, which GPT-2's tokenizer already treats as a special end-of-text token. To create training samples from this motivational essay, we "cut" it into blocks of tokens (each block is of a predefined size - `block_size`). One problem remains: if we leave the quotes in the same order as they are in the dataset, the model might pick up dependencies between consecutive quotes that we don't want it to learn. To mitigate this we create a dataset that already contains the combined data for all of the required training epochs, with the quotes shuffled separately for each epoch.

Finally, there is another reason to include the delimiter `<|endoftext|>`. The model will learn to start a quote after it and to end the quote with it. Therefore we can prompt a fine-tuned model with `<|endoftext|>` to produce a new motivational quote.

We've implemented this dataset structure in `generation_utils.py` and named it *TextDatasetWithEpochs*.
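To make the idea concrete, here is a minimal sketch of the "motivational essay" approach. It is not the actual `TextDatasetWithEpochs` code: `toy_tokenize` and `build_blocks` are hypothetical helpers, a whitespace split stands in for the real GPT-2 tokenizer, and dropping the incomplete last block is an assumption.

```python
import random

EOT = "<|endoftext|>"

def toy_tokenize(text):
    """Stand-in tokenizer (assumption): whitespace split, delimiter kept as one token."""
    return text.replace(EOT, f" {EOT} ").split()

def build_blocks(quotes, block_size, num_epochs, seed=0):
    """Concatenate the quotes once per epoch (reshuffled each time) and cut into blocks."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_epochs):
        shuffled = quotes[:]
        rng.shuffle(shuffled)  # a different quote order for every epoch
        essay = "".join(EOT + quote for quote in shuffled)  # delimiter before each quote
        tokens.extend(toy_tokenize(essay))
    # cut the long token stream into equally sized blocks; the incomplete tail is dropped
    num_blocks = len(tokens) // block_size
    return [tokens[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

quotes = ["Dream big", "Never give up", "Stay curious every day"]
blocks = build_blocks(quotes, block_size=4, num_epochs=2)
for block in blocks:
    print(block)
```

Because the blocks are cut at fixed positions, a block may start or end in the middle of a quote. Every block has the same length, so no padding is needed.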

In [None]:
block_size = 128  # length of input samples
num_dataset_epochs = 8  # used to create the dataset, during training we'll only use 1 epoch

train_examples, valid_examples = train_test_split(examples, test_size=0.2)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

train_dataset = TextDatasetWithEpochs(train_examples, tokenizer, block_size, num_dataset_epochs)
val_dataset = TextDatasetWithEpochs(valid_examples, tokenizer, block_size, 1)
collate_call = DataCollatorForLanguageModeling(tokenizer, mlm=False)

## Training

We have now implemented everything we need to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external memory. If you want to do the latter, run the cell below and follow instructions.

In [None]:
# optional if you want to save your models to Google Drive
from google.colab import drive
drive.mount("/content/drive/")

Google Drive is now accessible under `/content/drive/`.

Let's set the training parameters:


In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 32,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 1,  # the dataset already encodes the epochs
    output_dir = "/content/drive/MyDrive/NLP-workshop-materials/GPT2-generation/",
    collate_fn = collate_call
)

Let's instantiate the pretrained GPT-2 model. We are using the small version of GPT-2, with 12 layers, a hidden dimension of 768 and 12 attention heads, which adds up to roughly 124M parameters (the original release reported 117M).
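As a rough sanity check on the model size, the parameter count can be reconstructed from the architecture numbers. The sketch below assumes the standard GPT-2 small configuration values that are not stated above (a vocabulary of 50257 tokens, 1024 positions, a feed-forward width of 4x the hidden dimension) and an output layer that shares its weights with the token embeddings. Under those assumptions the total lands at about 124M; the 117M figure comes from the original GPT-2 release.

```python
# Architecture numbers; vocab_size, n_positions and d_ff are assumed GPT-2 small defaults
vocab_size, n_positions = 50257, 1024
n_layer, d_model = 12, 768
d_ff = 4 * d_model

# token embeddings + position embeddings (the output head reuses the token embeddings)
embeddings = vocab_size * d_model + n_positions * d_model

layer_norm = 2 * d_model  # scale and bias
# attention: fused query/key/value projection + output projection (weights and biases)
attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
# feed-forward network: expand to d_ff, then project back to d_model
mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

per_layer = 2 * layer_norm + attention + mlp
total = embeddings + n_layer * per_layer + layer_norm  # plus the final layer norm

print(f"parameters per layer: {per_layer:,}")
print(f"total parameters:     {total:,}")
```

Once the model is loaded, the same number can be compared against `sum(p.numel() for p in model.parameters())`.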

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out of memory exception, do the following:
- cause another exception so that Python no longer holds any references to the trainer or model, e.g. run the cell below, which raises a ZeroDivisionError
- run the cell below that empties GPU cache
- decrease the batch_size in run_config and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1 / 0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()

For the purposes of this workshop we have already fine-tuned a GPT-2 model, which we'll load in the next section.

## Evaluation

We now have a fine-tuned GPT-2 model ready to produce motivational quotes. GPT-2 outputs a probability distribution over the next token conditioned on previous ones. There are a couple of ways we can go about generating text:
- Greedy decoding
- Beam search
- Top-k/Top-p sampling

You can read more [here](https://huggingface.co/blog/how-to-generate).

Let's first download and initialize the already fine-tuned model.

In [None]:
!mkdir /content/gpt2-quotes
!gdown -O /content/gpt2-quotes/config.json https://drive.google.com/uc?id=1-AYqNe0968Ru-m4qXwbyVPHyjf9cJIWc
!gdown -O /content/gpt2-quotes/pytorch_model.bin https://drive.google.com/uc?id=1-CpfjekRPQX_FWt5FQzkv81GdYIsgr4M

In [None]:
# only run if you want to use the model we've already fine-tuned for you
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("/content/gpt2-quotes/").to(device)

#### Greedy decoding
This is the simplest approach: at every step we select the most probable next word, i.e. the word with the highest output probability. Its main weakness is that after some text the model tends to start repeating itself. It would therefore be a bad decoding scheme for producing long continuous text, but since we're producing fairly short quotes it might achieve okay results.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/greedy.PNG" width="800"/>
</div>
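The selection rule itself fits in a few lines. The sketch below is only an illustration on a hand-made table of next-word probabilities (`BIGRAM` and `greedy_decode` are invented for this example); the real model conditions on the whole prefix, not just on the last word.

```python
EOT = "<|endoftext|>"

# Toy "model" (assumption): probabilities of the next word given only the previous word
BIGRAM = {
    EOT: {"believe": 0.6, "dream": 0.4},
    "believe": {"in": 0.7, EOT: 0.3},
    "in": {"yourself": 0.8, "dreams": 0.2},
    "yourself": {EOT: 0.9, "believe": 0.1},
    "dream": {"big": 1.0},
    "big": {EOT: 1.0},
    "dreams": {EOT: 1.0},
}

def greedy_decode(table, max_length=10):
    """Always append the single most probable next word until the delimiter is chosen."""
    tokens = [EOT]  # the prompt is just the delimiter
    while len(tokens) < max_length:
        probs = table[tokens[-1]]
        best = max(probs, key=probs.get)  # argmax over the next-word distribution
        if best == EOT:
            break
        tokens.append(best)
    return tokens[1:]  # drop the delimiter prompt

print(greedy_decode(BIGRAM))
```

No randomness is involved, so calling `greedy_decode` again with the same table always returns the same words.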

In [None]:
import logging
logging.getLogger("transformers.generation_utils").setLevel(logging.CRITICAL)

In [None]:
def generate_text_greedy(prompt="", max_length=64):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_greedy()

Wow, so deep. Since greedy decoding is deterministic, this is the only quote the default prompt can produce, but we can extend the prompt with some text:

In [None]:
print(generate_text_greedy("I believe"))
print(generate_text_greedy("Data science"))
print(generate_text_greedy("Just"))

#### Beam search

Beam search is also a deterministic decoding scheme, but it offers an improvement over greedy decoding. A problem with greedy decoding is that we might miss the most likely sequence, since we only ever pick the most probable word at each timestep. Beam search mitigates this by keeping track of the *n* most probable sequences at every step and ultimately selecting the most probable one.

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/beam.PNG" width="500"/>
</div>
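A small sketch shows how this can beat greedy decoding. It uses an invented table of next-word probabilities (`TABLE`) and a simplified `beam_search` helper, both assumptions made for illustration; library implementations such as `model.generate` add refinements like length normalization that are left out here.

```python
import math

EOT = "<|endoftext|>"

# Toy "model" (assumption): next-word probabilities given only the previous word.
# Words missing from the table are assumed to be followed by the delimiter.
TABLE = {
    EOT: {"just": 0.5, "never": 0.4, "always": 0.1},
    "just": {"do": 0.4, "be": 0.3, "try": 0.3},
    "never": {"quit": 0.9, "stop": 0.1},
    "always": {"smile": 1.0},
}

def beam_search(table, num_beams, max_length=10):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([EOT], 0.0)]  # (tokens, log-probability)
    finished = []
    for _ in range(max_length):
        candidates = []
        for tokens, score in beams:
            next_probs = table.get(tokens[-1], {EOT: 1.0})
            for token, p in next_probs.items():
                candidates.append((tokens + [token], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:num_beams]:  # prune to the best num_beams
            (finished if tokens[-1] == EOT else beams).append((tokens, score))
        if not beams:
            break
    best_tokens, best_score = max(finished or beams, key=lambda c: c[1])
    return [t for t in best_tokens if t != EOT], math.exp(best_score)

print(beam_search(TABLE, num_beams=1))  # one beam behaves like greedy decoding
print(beam_search(TABLE, num_beams=2))
```

With a single beam the search commits to "just" because it is the most probable first word. With two beams it also keeps "never" alive and discovers that "never quit" is more probable overall.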

In [None]:
def generate_text_beam(prompt="", max_length=64, num_beams=8):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length, num_beams=num_beams,
                                 no_repeat_ngram_size=2).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_beam()

In [None]:
print(generate_text_beam("I believe"))
print(generate_text_beam("Data science"))
print(generate_text_beam("Just"))

#### Top-k/Top-p sampling

We've looked at two deterministic decoding schemes; let's now turn to a non-deterministic one that samples the next word from a probability distribution. The output distribution covers the entire model vocabulary (on the order of tens of thousands of tokens): most of its mass sits on a small subset of probable words, followed by a very long tail. Sampling tokens from the tail would produce incoherent gibberish, so we must somehow restrict sampling to the most probable words. That's where top-k and top-p sampling come into play:

- [Top-k sampling](https://arxiv.org/abs/1805.04833) keeps only the *k* most probable words and renormalizes their probabilities so that they sum to 1. The problem is that we must choose a fixed value of *k*, which might lead to suboptimal results in some scenarios.
- [Top-p sampling](https://arxiv.org/abs/1904.09751) addresses this by keeping the smallest set of top words whose cumulative probability just exceeds *p*. The probabilities of these words are then renormalized in the same way.

We'll use a combination of both in this notebook, but you're free to test different scenarios.
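Both filters can be sketched in plain Python on a toy distribution over five words. The helper names (`top_k_filter`, `top_p_filter`, `renormalize`) and the probabilities are made up for illustration; applying top-k first and top-p second is an assumption about how the two are combined.

```python
import random

def renormalize(probs):
    """Rescale the surviving probabilities so that they sum to 1 again."""
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep only the k most probable words."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest set of top words whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

probs = [0.5, 0.3, 0.1, 0.05, 0.05]  # toy next-word distribution
print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.85))

# combine both filters, then sample the index of the next word
filtered = top_p_filter(top_k_filter(probs, k=4), p=0.85)
rng = random.Random(0)
next_word = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(next_word)
```

Words that were filtered out have probability 0, so they can never be sampled.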

There is another parameter that we haven't introduced yet: `temperature`, which rescales the logits before the softmax function and thereby controls how peaked the output distribution is. Regular softmax has `temperature` = 1. If `temperature` -> 0, more probability mass goes to the most probable words (we move towards greedy decoding). Higher values produce a more uniform distribution.
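The effect is easy to see on a toy example. The sketch below divides the logits by the temperature before applying softmax; the function name and the logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy scores for three words
for temperature in (0.1, 0.7, 1.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At low temperature almost all of the mass ends up on the first word, while at high temperature the three probabilities move closer together.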

<div>
<img src="https://github.com/andrejmiscic/NLP-workshop/raw/master/figures/topk.PNG" width="800"/>
</div>

In [None]:
def generate_text_sampling(prompt="", max_length=64, top_k=50, top_p=0.95, temp=1.0, num_return=1):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, do_sample=True, max_length=max_length, temperature=temp, 
                                 top_k=top_k, top_p=top_p, num_return_sequences=num_return).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_sampling(num_return=3, temp=0.7)

In [None]:
print(generate_text_sampling("I believe", num_return=3, temp=0.7))
print(generate_text_sampling("Data science", num_return=3, temp=0.7))
print(generate_text_sampling("Just", num_return=3, temp=0.7))