# Text Generation - Fine-tuning GPT-2

In this notebook we'll tackle the task of text generation with the notorious [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) model. We'll look at the data preparation and fine-tuning process needed for GPT-2 to produce the desired text. Our goal is to fine-tune a pretrained model to produce motivational/inspirational quotes.

First things first, let's make sure we have a GPU instance in this Colab session:
- `Edit -> Notebook settings -> Hardware accelerator` must be set to GPU
- if needed, reinitialize the session by clicking `Connect` in the top right corner

After the session is initialized, we can check our assigned GPU with the following command (fingers crossed it's a Tesla P100 :P):

In [None]:
!nvidia-smi

Let's install the *transformers* library from [Hugging Face](https://huggingface.co/) and *gdown* for downloading files from Google Drive, then import everything we need.

In [None]:
!pip install transformers
!pip install gdown

In [None]:
import gc
import itertools
import os


import numpy as np
import random
import torch
import torch.nn as nn

from dataclasses import dataclass
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from tqdm import tqdm, trange
from torch.optim import AdamW  # PyTorch's AdamW (default weight_decay=0.01); the copy in transformers is deprecated
from transformers import GPT2TokenizerFast, GPT2LMHeadModel, DataCollatorForLanguageModeling, get_linear_schedule_with_warmup

## Data

As we already mentioned, our goal is to fine-tune a GPT-2 model to produce motivational quotes. For this reason we are working with a dataset of 32716 quotes collected and cleaned from this [Kaggle source](https://www.kaggle.com/stuffbyyc/quotes). 

Let's load the data and take a look at some of the examples:

In [None]:
!wget https://raw.githubusercontent.com/andrejmiscic/NLP-workshop/master/Data/quotes.txt

In [None]:
with open("/content/quotes.txt", "r") as f:
    examples = [l.strip() for l in f.readlines()]

print(f"Our dataset contains {len(examples)} quotes.")
print(f"Some examples:")
print("- " + "\n- ".join(examples[:3]))

Good, after looking at some of these quotes we're very motivated to continue. Our next step is to design a dataset class that will hold our data and serve training samples.

Inputs to the GPT model are batches of equal-length sequences of input ids, i.e. indices of tokens in the vocabulary (GPT-2 uses a [Byte Pair Encoding](https://arxiv.org/abs/1508.07909) input representation, not WordPiece like BERT). Looking at the samples above, we can see that they differ in length quite a lot, so combining them into batches would require a lot of truncating and padding. We opt for a different approach instead: we combine all of the quotes into one long text, a motivational essay if you will :). To delimit the samples we use GPT-2's existing `<|endoftext|>` token as an example delimiter and put it in front of each quote. To create training samples from this motivational essay, we "cut" it into blocks of tokens, each of a predefined size `block_size`. One problem remains: if we leave the quotes in the same order as they are in the dataset, the model might pick up dependencies between consecutive quotes that we don't want it to learn. To mitigate this, we create a dataset that already contains the combined data for all of the required training epochs, and we shuffle the quotes separately for each epoch.

Finally, there is another reason to include the delimiter `<|endoftext|>`. The model will learn to produce a quote after it and also to end the quote with it. Therefore we can prompt a fine-tuned model with `<|endoftext|>` to produce a new motivational quote.
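To make the shuffle, concatenate and cut procedure concrete, here is a minimal sketch on made-up token ids. The helper `build_blocks`, the toy ids and the delimiter id `0` are assumptions for illustration only; no tokenizer is involved.

```python
import itertools
import random

def build_blocks(examples_ids, block_size, num_epochs, delimiter_id, seed=0):
  """Shuffle, concatenate and cut lists of token ids into fixed-size blocks."""
  rng = random.Random(seed)
  # put the delimiter id in front of every example
  delimited = [[delimiter_id] + ids for ids in examples_ids]
  stream = []
  for _ in range(num_epochs):
    epoch = delimited.copy()
    rng.shuffle(epoch)  # a fresh order for every epoch
    stream.extend(itertools.chain.from_iterable(epoch))
  # cut the stream into blocks; a trailing remainder shorter than block_size is dropped
  return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]

toy_quotes = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # hypothetical token ids
blocks = build_blocks(toy_quotes, block_size=4, num_epochs=2, delimiter_id=0)
print(blocks)
```

Note that a block can start in the middle of one quote and end in the middle of another; the delimiter is what tells the model where a quote begins.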

In [None]:
class LinesTextDatasetWithEpochs(Dataset):  # based on TextDataset by Huggingface
  def __init__(self, examples, tokenizer, block_size, num_epochs, example_del="<|endoftext|>"):
    super(LinesTextDatasetWithEpochs, self).__init__()
    examples_input_ids = []
    for ex in examples:
      # we add the delimiter to each quote, tokenize it and convert the tokens to indices
      examples_input_ids.append(tokenizer.convert_tokens_to_ids(tokenizer.tokenize(example_del + ex)))

    # for each of the training epochs shuffle the quotes and combine them
    combined_input_ids = []
    for i in range(num_epochs):
      tmp = examples_input_ids.copy()
      random.shuffle(tmp)
      combined_input_ids.extend(list(itertools.chain.from_iterable(tmp)))

    # creating training samples by cutting the combined input into blocks of length block_size
    self.data = []
    for i in range(0, len(combined_input_ids) - block_size + 1, block_size):
      self.data.append(tokenizer.build_inputs_with_special_tokens(combined_input_ids[i: i + block_size]))

  def __getitem__(self, i):
    return torch.tensor(self.data[i], dtype=torch.long)

  def __len__(self):
    return len(self.data)

Below we implement the Trainer class that contains the main train loop.

In [None]:
class Trainer:
  def __init__(self, model):
    self.model = model

  def train(self, train_dataset, val_dataset, device, run_config):
    self.model = self.model.to(device)
    # create output folder if it doesn't yet exist
    if not os.path.isdir(run_config.output_dir): 
      os.makedirs(run_config.output_dir)
    
    # train dataloader will serve us the training data in batches
    train_dataloader = DataLoader(train_dataset, sampler=RandomSampler(train_dataset), 
                                  batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    
    # optimizer and scheduler that modifies the learning rate during the training
    optimizer = AdamW(self.model.parameters(), lr=run_config.learning_rate)
    scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=run_config.num_warmup_steps,
                                                num_training_steps=len(train_dataloader)*run_config.num_epochs)
    
    print("Training started:")
    print(f"\tNum examples = {len(train_dataset)}")
    print(f"\tNum Epochs = {run_config.num_epochs}")

    global_step = 0  # to save after every save_steps if save_steps is >= 0

    train_iterator = trange(0, int(run_config.num_epochs), desc="Epoch")
    for epoch in train_iterator:
      epoch_iterator = tqdm(train_dataloader, desc="Iteration", position=0, leave=True)
      self.model.train()
      epoch_losses = []
      for step, inputs in enumerate(epoch_iterator):
        # move batch to GPU
        if isinstance(inputs, dict):
            for k, v in inputs.items():
                inputs[k] = v.to(device)
        else:
            inputs = inputs.to(device)

        # forward pass - model also outputs a computed loss
        outputs = self.model(**inputs)
        loss = outputs[0]

        epoch_losses.append(loss.item())

        # backward pass - backpropagation
        self.model.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        epoch_iterator.set_description(f"Training loss = {loss.item():.4f}")

        if run_config.save_steps > 0 and global_step > 0 and global_step % run_config.save_steps == 0:
          output_dir = os.path.join(run_config.output_dir, f"Step_{global_step}")
          self.model.save_pretrained(output_dir)
          test_loss = self.evaluate(self.model, val_dataset, device, run_config)
          self.model.train()  # evaluate() switches to eval mode, so switch back before the next batch
          print(f"After step {global_step}: val loss = {test_loss:.4f}")

        global_step += 1
      
      if run_config.save_each_epoch:
        output_dir = os.path.join(run_config.output_dir, f"Epoch_{epoch + 1}")
        self.model.save_pretrained(output_dir)  # use the trainer's own model, not a global variable
        test_loss = self.evaluate(self.model, val_dataset, device, run_config)
        print(f"After epoch {epoch + 1}: val loss = {test_loss:.4f}")


  def evaluate(self, model, test_dataset, device, run_config):
    test_dataloader = DataLoader(test_dataset, sampler=SequentialSampler(test_dataset),
                                 batch_size=run_config.batch_size, collate_fn=run_config.collate_fn)
    self.model.eval()
    losses = []
    for inputs in tqdm(test_dataloader, desc="Evaluating", position=0, leave=True):
      # move batch to GPU
      if isinstance(inputs, dict):
        for k, v in inputs.items():
          inputs[k] = v.to(device)
      else:
        inputs = inputs.to(device)

      with torch.no_grad():
        loss = model(**inputs)[0]
      losses.append(loss.item())

    return np.mean(losses)
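The scheduler used in the train loop changes the learning rate at every step. As a rough sketch of what a linear schedule with warmup does (our own re-implementation for illustration, not the library code; `linear_warmup_multiplier` is a hypothetical name), the learning rate rises linearly from 0 during warmup and then decays linearly back to 0:

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
  """Factor by which the base learning rate is multiplied at a given step."""
  if step < num_warmup_steps:
    # warmup: grow linearly from 0 to 1
    return step / max(1, num_warmup_steps)
  # afterwards: decay linearly from 1 to 0 at the last training step
  return max(0.0, (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps))

base_lr = 3e-5
for step in [0, 50, 100, 550, 1000]:
  lr = base_lr * linear_warmup_multiplier(step, num_warmup_steps=100, num_training_steps=1000)
  print(f"step {step:4d}: lr = {lr:.2e}")
```

With a very small number of warmup steps the schedule is essentially a plain linear decay.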

*RunConfig* holds the parameters for training/testing:

In [None]:
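from typing import Callable, Optional

@dataclass
class RunConfig:
  learning_rate: float
  batch_size: int
  num_epochs: int
  num_warmup_steps: int = 1
  save_steps: int = -1  # save and evaluate every save_steps steps; -1 disables it
  save_each_epoch: bool = True
  output_dir: str = "/content/"
  collate_fn: Optional[Callable] = None  # function that combines samples into a batch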

## Training

We have now implemented everything to start fine-tuning. We can save the fine-tuned models to our Colab instance (available under `/content/`) or we can connect our Google Drive to Colab and use it as external memory. If you want to do the latter, run the cell below and follow instructions.

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

Google Drive is now accessible under `/content/drive/`.

Let's prepare the datasets, tokenizer and training parameters.


In [None]:
block_size = 128  # length of input samples
num_dataset_epochs = 8  # used to create the dataset, during training we'll only use 1 epoch

# split the examples to train and validation sets
train_examples, valid_examples = train_test_split(examples, test_size=0.2)

# instantiate a GPT2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# datasets and the collate function to combine examples to input batches
train_dataset = LinesTextDatasetWithEpochs(train_examples, tokenizer, block_size, num_dataset_epochs)
val_dataset = LinesTextDatasetWithEpochs(valid_examples, tokenizer, block_size, 1)
collate_call = DataCollatorForLanguageModeling(tokenizer, mlm=False)

In [None]:
run_config = RunConfig(
    learning_rate = 3e-5,
    batch_size = 32,  # start with 32 and decrease if you get CUDA out of memory exception
    num_epochs = 1,  # the dataset already encodes the epochs
    save_steps = max(1, len(train_dataset) // (32 * num_dataset_epochs)),  # batches per dataset epoch (32 must match batch_size), so we save after each epoch in the dataset
    save_each_epoch = True,
    output_dir = "/content/drive/My Drive/NLP-workshop/GPT2/",
    collate_fn = collate_call
)

Let's instantiate the pretrained GPT-2 model. We are using the small version of GPT-2: 12 layers, a hidden dimension of 768 and 12 attention heads, which add up to roughly 124M parameters (originally reported as 117M).
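The parameter count can be sanity-checked with a bit of arithmetic. The sketch below assumes the standard small GPT-2 configuration (a vocabulary of 50257 tokens, 1024 positions, a feed-forward layer 4 times the hidden size and an output head that shares its weights with the token embedding); `gpt2_param_count` is a hypothetical helper. The tally comes out at about 124M, slightly above the 117M figure that was originally reported.

```python
def gpt2_param_count(vocab_size=50257, n_positions=1024, d_model=768, n_layer=12):
  """Count weights and biases of a GPT-2 style decoder with a tied output head."""
  embeddings = vocab_size * d_model + n_positions * d_model  # token + position embeddings
  layer_norm = 2 * d_model  # scale and bias
  # attention: fused query/key/value projection plus output projection
  attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
  # feed-forward: expand to 4 * d_model and project back
  mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
  block = 2 * layer_norm + attention + mlp
  return embeddings + n_layer * block + layer_norm  # plus the final layer norm

print(f"{gpt2_param_count():,}")
```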

In [None]:
model = GPT2LMHeadModel.from_pretrained("gpt2")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

In [None]:
trainer = Trainer(model)
trainer.train(train_dataset, val_dataset, device, run_config)

If you happen to get a CUDA out of memory exception, do the following:
- cause another exception so that Python no longer holds references to the trainer or model through the stored traceback, e.g. run the cell below that raises a ZeroDivisionError
- run the cell after it, which releases the model and empties the GPU cache
- decrease `batch_size` in `run_config` and rerun that cell
- reinstantiate the model and rerun training

In [None]:
1 / 0

In [None]:
model = None
trainer = None
gc.collect()
torch.cuda.empty_cache()

For the purposes of this workshop we have already fine-tuned a GPT-2 model, so let's load it.

In [None]:
!mkdir -p /content/gpt2-quotes
!gdown -O /content/gpt2-quotes/config.json https://drive.google.com/uc?id=10h6le0yxZ8z-HJwOmWgSFcf2eXI5im-v
!gdown -O /content/gpt2-quotes/pytorch_model.bin https://drive.google.com/uc?id=10kYtFp6tFRQClbLCi04xlGg-uXp3K3rb

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("/content/gpt2-quotes/").to(device)

## Evaluation

We now have a fine-tuned GPT-2 model ready to produce motivational quotes. GPT-2 outputs a probability distribution over the next token conditioned on previous ones. There are a couple of ways we can go about generating text:
- Greedy decoding
- Beam search
- Top-k/Top-p sampling

You can read more [here](https://huggingface.co/blog/how-to-generate).

#### Greedy decoding
This is the simplest approach: at every step we just select the most probable next token, i.e. the one with the highest output probability. Greedy decoding tends to get stuck in loops, so after some text the model will start repeating itself. This makes it a bad decoding scheme for long continuous text, but since we're producing fairly short quotes it might achieve okay results.
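The repetition problem is easy to reproduce on a toy example. The sketch below uses a hand-made table of next-token probabilities that only looks at the previous token (an assumption for illustration; GPT-2 conditions on the whole prefix) and always picks the most probable entry.

```python
# hand-made next-token probabilities, not taken from GPT-2
toy_next_greedy = {
  "<s>": {"be": 0.6, "stay": 0.4},
  "be": {"kind": 0.7, "brave": 0.3},
  "kind": {"and": 0.8, "<e>": 0.2},
  "and": {"be": 0.9, "<e>": 0.1},
  "stay": {"<e>": 1.0},
  "brave": {"<e>": 1.0},
}

def greedy_decode(start="<s>", max_len=10):
  tokens = [start]
  # stop at the end token "<e>" or when max_len tokens are reached
  while len(tokens) < max_len and tokens[-1] != "<e>":
    dist = toy_next_greedy[tokens[-1]]
    tokens.append(max(dist, key=dist.get))  # always take the most probable token
  return tokens

greedy_result = greedy_decode()
print(" ".join(greedy_result))
```

Because the most probable continuation of "and" leads back to "be", the decoder cycles through "be kind and" until it hits the length limit and never reaches the end token.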

In [None]:
import logging
logging.getLogger("transformers.generation_utils").setLevel(logging.CRITICAL)

In [None]:
def generate_text_greedy(prompt="", max_length=64):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_greedy()

So deep. Since greedy decoding is deterministic, this is the only quote the default prompt can produce, but we can extend the prompt with some text:

In [None]:
print(generate_text_greedy("I believe"))
print(generate_text_greedy("Data science"))
print(generate_text_greedy("Just"))

#### Beam search

Beam search is also a deterministic decoding scheme, but it offers an improvement over greedy decoding. A problem with greedy decoding is that we might miss the most likely sequence, since at each step we only pick the single most probable word. Beam search mitigates this by keeping track of the *n* most probable sequences at every step and ultimately selecting the most probable one.
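A toy example shows how greedy decoding can miss the most likely sequence. The probabilities below are made up, and `beam_search` is a simplified sketch (fixed number of steps, no end token, no length normalization), not the implementation used by `model.generate`. With a single beam it behaves like greedy decoding.

```python
import math

# made-up next-token probabilities, keyed by the previous token
toy_next_beam = {
  "<s>": {"A": 0.6, "B": 0.4},
  "A": {"x": 0.4, "y": 0.3, "z": 0.3},
  "B": {"x": 0.9, "y": 0.1},
}

def beam_search(num_beams, steps=2):
  beams = [(0.0, ["<s>"])]  # (log-probability, tokens)
  for _ in range(steps):
    candidates = []
    for logp, tokens in beams:
      for tok, p in toy_next_beam[tokens[-1]].items():
        candidates.append((logp + math.log(p), tokens + [tok]))
    # keep only the num_beams most probable partial sequences
    beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:num_beams]
  return beams[0]  # the most probable sequence found

for n in (1, 2):
  logp, tokens = beam_search(num_beams=n)
  print(f"num_beams={n}: {' '.join(tokens)} (p = {math.exp(logp):.2f})")
```

With one beam the search commits to "A" (0.6) and ends up with a sequence of probability 0.24, while two beams keep "B" alive and find "B x" with probability 0.36.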

In [None]:
def generate_text_beam(prompt="", max_length=64, num_beams=8):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, max_length=max_length, num_beams=num_beams,
                                 no_repeat_ngram_size=2).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_beam()

In [None]:
print(generate_text_beam("I believe"))
print(generate_text_beam("Data science"))
print(generate_text_beam("Just"))

#### Top-k/Top-p sampling

We've looked at two deterministic decoding schemes; let's now focus on a non-deterministic one that samples the next word from a probability distribution. The output probability distribution covers the entire model vocabulary (on the order of tens of thousands of tokens); most of its mass lies on a small subset of probable words, followed by a very long tail. Tokens from the tail would produce incoherent gibberish, so we must somehow restrict sampling to the most probable words. That's where top-k and top-p sampling come into play:

- [Top-k sampling](https://arxiv.org/abs/1805.04833) keeps only the *k* most probable words and renormalizes their probabilities so that they sum to 1. The problem is that we must choose a fixed *k*, which might lead to suboptimal results in some scenarios.
- [Top-p sampling](https://arxiv.org/abs/1904.09751) addresses this by keeping the smallest set of most probable words whose cumulative probability just exceeds *p*. The probabilities of these words are then renormalized in the same way.
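The two filters can be sketched in a few lines of plain Python. This is a simplified illustration on a made-up distribution over five words; `top_k_top_p_filter` is a hypothetical helper that works on probabilities, whereas a real implementation typically operates on logits.

```python
def top_k_top_p_filter(probs, top_k=0, top_p=1.0):
  """Return a renormalized {token: probability} dict after top-k, then top-p filtering."""
  ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
  if top_k > 0:
    ranked = ranked[:top_k]  # keep the k most probable tokens
  total = sum(p for _, p in ranked)
  ranked = [(tok, p / total) for tok, p in ranked]  # renormalize after top-k
  kept, cumulative = [], 0.0
  for tok, p in ranked:
    kept.append((tok, p))
    cumulative += p
    if cumulative >= top_p:  # stop once the kept mass reaches p
      break
  total = sum(p for _, p in kept)
  return {tok: p / total for tok, p in kept}

toy_probs = {"you": 0.5, "we": 0.25, "they": 0.125, "it": 0.0625, "zebra": 0.0625}
only_top_k = top_k_top_p_filter(toy_probs, top_k=3)
only_top_p = top_k_top_p_filter(toy_probs, top_p=0.7)
print(only_top_k)
print(only_top_p)
```

Here `top_k=3` always keeps three words no matter how the mass is distributed, while `top_p=0.7` keeps just two because "you" and "we" already cover 75% of the probability.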

We'll use a combination of both in this notebook, but you're free to test different scenarios.

There is another parameter that we haven't introduced: `temperature`, which rescales the logits before the softmax function and thereby controls the shape of the output distribution. Regular softmax has `temperature` = 1. As `temperature` -> 0, more probability mass goes to the most probable words (we move towards greedy decoding). Higher values produce a more uniform distribution.
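The effect of temperature is easy to see on a small example. The sketch below divides some made-up logits by the temperature before applying softmax; `softmax_with_temperature` is a hypothetical helper written for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
  """Softmax over logits divided by the temperature."""
  scaled = [x / temperature for x in logits]
  m = max(scaled)  # subtract the maximum for numerical stability
  exps = [math.exp(x - m) for x in scaled]
  total = sum(exps)
  return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three tokens
for t in (0.1, 0.7, 1.0, 5.0):
  print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
```

At a low temperature almost all of the mass sits on the first token, and at a high temperature the three probabilities move close to each other.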

In [None]:
def generate_text_sampling(prompt="", max_length=64, top_k=50, top_p=0.95, temp=1.0, num_return=1):
  model.eval()
  input_ids = tokenizer.encode("<|endoftext|>" + prompt, return_tensors='pt').to(device)
  generated_ids = model.generate(input_ids, do_sample=True, max_length=max_length, temperature=temp, 
                                 top_k=top_k, top_p=top_p, num_return_sequences=num_return).cpu().tolist()

  generated_text = [tokenizer.decode(g, skip_special_tokens=True) for g in generated_ids]
  return generated_text

In [None]:
generate_text_sampling(num_return=3, temp=0.7)

In [None]:
print(generate_text_sampling("I believe", num_return=3, temp=0.7))
print(generate_text_sampling("Data science", num_return=3, temp=0.7))
print(generate_text_sampling("Just", num_return=3, temp=0.7))