# COLX 585 Trends in Computational Linguistics
## Generating Natural Language Text using GPT-2

### Goal of this tutorial:
- Know the background of GPT-2, a Language Model (LM)
- Learn how to train GPT-2 (toy version) from scratch on Yelp dataset
- Learn about deterministic decoding techniques like greedy search and beam search.
- Learn about stochastic decoding techniques like top-k and top-p (nucleus) sampling

###  General:
- This notebook was last tested on Python 3.6.4, PyTorch 0.4.0, **transformers 2.1.1** (note: tutorial might work only with this version of transformers)
- We would like to acknowledge the transformers repository from huggingface (https://github.com/huggingface/transformers) which we used as a reference to code up both GPT-2 and decoding techniques
- This tutorial gives a CUDA out-of-memory error when run in Colab. Hence, it is recommended to run it on a local machine (CPU) with Jupyter Notebook

### References
To know more about the above-mentioned concepts, take a look at the following articles:
1. GPT-1 Original Paper https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
2. GPT-2 Original Paper https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
3. Neural Text Generation with Unlikelihood Training https://arxiv.org/pdf/1908.04319.pdf (see Section 3)
4. The Curious Case of Neural Text Degeneration https://arxiv.org/abs/1904.09751

## Recap of last week

### Bidirectional Encoder Representations from Transformers (BERT)

Last week, we saw the BERT model, which is used to represent a piece of text (e.g., a sentence, sentence pair, or document) to solve a wide variety of natural language understanding tasks (e.g., question answering, natural language inference, text classification). At its core, BERT uses the **transformer** architecture to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pretraining objectives underlying BERT are masked language modeling and next sentence prediction. BERT's model architecture is shown below.

<img src="https://drive.google.com/uc?id=1PXjMq2DEXFSsY1AsYhKIfUSreLxHtPR4" height="250"/>

Picture courtesy: https://arxiv.org/pdf/1810.04805.pdf




## Background


### Generative Pre-Training - Version 2 (GPT-2)

GPT-2 was a state-of-the-art model for natural language generation (NLG) at the time of its release. It is based on the **Transformer architecture** and is trained with a language modeling (LM) objective, which maximizes the conditional probability of each word in a text sequence (e.g., a sentence) given the words that precede it. The model architecture is shown below.

<img src="https://drive.google.com/uc?id=12dvGzUzM_hM_LTAT8XfmJmsL8cu1Jyx3" alt="GPT-2 model architecture" title="GPT-2 model" height=300 />

Unlike BERT, which learns **bidirectional** representations, GPT-2 learns **unidirectional** (left-to-right) representations by conditioning only on the left context in all layers. While BERT is used primarily for **natural language understanding** tasks, GPT-2 is used primarily for **natural language generation** (see next section).
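One way to picture the difference is as a boolean matrix where entry (i, j) says whether position i may look at position j. The sketch below is a simplified illustration in plain Python; the helper names `causal_mask` and `full_mask` are made up here, and the real models apply this kind of masking inside their attention layers.

```python
def causal_mask(seq_len):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def full_mask(seq_len):
    """Bidirectional attention: every position sees every other position."""
    return [[True] * seq_len for _ in range(seq_len)]

# Print the left-to-right (GPT-2 style) pattern: 'x' = visible, '.' = hidden
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row of the causal pattern only has entries up to and including its own position, which is what "conditioning only on the left context" means in practice.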

GPT-2 is trained on **WebText**, a large corpus of web pages. In this tutorial, we will stick to training GPT-2 on the **Yelp dataset**, which requires far fewer computing resources than training on WebText.
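To make the language modeling objective from this section a little more concrete, the sketch below scores a short sequence by summing the log of each word's conditional probability given the words before it. The probability table `toy_lm` and the helper `sequence_log_prob` are invented for this example and have nothing to do with GPT-2's actual parameters.

```python
import math

# Hypothetical conditional probabilities P(next word | previous words).
# Keys are (context tuple, next word); the values are invented for illustration.
toy_lm = {
    ((), "the"): 0.5,
    (("the",), "unicorns"): 0.2,
    (("the", "unicorns"), "spoke"): 0.4,
}

def sequence_log_prob(words):
    """Sum log P(w_t | w_1 .. w_{t-1}) over the sequence (chain rule)."""
    total = 0.0
    for t, word in enumerate(words):
        context = tuple(words[:t])
        total += math.log(toy_lm[(context, word)])
    return total

log_p = sequence_log_prob(["the", "unicorns", "spoke"])
print(round(math.exp(log_p), 4))  # product of the three conditionals: 0.5 * 0.2 * 0.4
```

Training maximizes this kind of log-probability over the training text; equivalently, it minimizes the negative log-probability per token.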


### Natural Language Generation
Once we train a GPT-2 model, we can generate text by prompting the model with a seed text (e.g., the beginning of a news article). One of the most popular generations from GPT-2 is:

<img src="https://pbs.twimg.com/media/DzYpsJOU0AA1PO9.png" alt="Famous unicorn example from GPT-2 Generations" title="GPT2 - Generation" height=600 width=550 />

As you can observe, the resulting text seems to be **mostly coherent, grammatical, using long-term context and world knowledge**. We generally rely on either deterministic or stochastic decoding techniques to generate the text given a seed text and a trained language model.


### Deterministic Decoding Techniques

There are two widely used deterministic decoding techniques, **greedy search** and **beam search**. The former is a special case of the latter (beam search with a beam size of 1). 

### Greedy search

**Greedy search**, as the name suggests, outputs the token that received the highest probability at each time step. We have already used greedy search to generate translation from the decoder of the neural machine translation model. 

Let us see a sample generation using greedy search. During decoding, we first feed the seed text (prompt) as input to the model and predict the next most likely (argmax) word (first word in generation which is `the` in this example).

<img src="
https://drive.google.com/uc?id=1QBKIlCj3y8kZ27x5ojXcKANsOXga2qGf" alt="Greedy Search" title="Greedy Search"  height=400 />

Once we generate the first word, we append this word (`the`) to the input and predict the next most likely (argmax) word (second word in generation which is `scientist` in this example).

<img src="
https://drive.google.com/uc?id=12eTq-gnjB9MM7BZb3IfXmWzBSM1LaXTL" alt="Greedy Search" title="Greedy Search"  height=400 />

We typically repeat the process until we have generated a certain number of words.

<img src="
https://drive.google.com/uc?id=1feSME5YWr36MpZbfyWX0G1mdlXsv5j7G" alt="Greedy Search" title="Greedy Search"  height=370 />
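The loop described above can be sketched without a neural network. In this toy version, the table `toy_next` is a hypothetical stand-in for the model: it maps the previous word to a made-up distribution over next words, whereas a real language model conditions on the whole prefix.

```python
# Hypothetical next-word distributions keyed by the previous word only.
# A real language model conditions on the whole prefix; this table is a toy.
toy_next = {
    "The": {"scientist": 0.6, "unicorns": 0.4},
    "scientist": {"named": 0.7, "said": 0.3},
    "named": {"the": 0.9, "a": 0.1},
    "the": {"population": 0.8, "herd": 0.2},
}

def greedy_decode(prompt, num_words):
    words = list(prompt)
    for _ in range(num_words):
        dist = toy_next.get(words[-1])
        if dist is None:  # no known continuation in the toy table
            break
        # argmax: pick the single most probable next word
        words.append(max(dist, key=dist.get))
    return words

print(greedy_decode(["The"], 3))
```

Because every step is an argmax, running this twice on the same prompt always gives the same output, which is what makes greedy search deterministic.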



### Beam Search 
**Beam search**, on the other hand, maintains a fixed-size set of partially decoded sequences, called hypotheses; the size of this set is the beam size. At each time step, beam search creates new hypotheses by appending each token in the vocabulary to each existing hypothesis, scores the resulting sequences, and keeps only the highest-scoring ones (as many as the beam size). 

Let us see an example of beam search (with a beam size of 2). In the first timestep, we feed the input (say `the unicorns spoke perfect english`) to the model, predict the top 2 words (based on the model probability of a word given the input), and store them in the beam.

<img src="
https://drive.google.com/uc?id=1gPCwzOUZOHbKl2gpHPtATr2YpLHUL-v_" alt="Beam Search" title="Beam Search"  height=370 />

**In the second timestep**, for each word in the beam, we append the word to the original input, feed it to the model, and keep track of all the resulting sequences (input along with the generated word) and their probabilities. We pick the top 2 resulting sequences and store them in the beam.

<img src="
https://drive.google.com/uc?id=1xuCidto4yjCBawiz9STcrUZ4UAVyANgr" alt="Beam Search" title="Beam Search"  height=370 />

**In the third timestep**, for each sequence in the beam, we append the sequence to the original input, feed it to the model, and keep track of all the resulting sequences (input along with the generated words) and their probabilities. We pick the top 2 resulting sequences and store them in the beam. We repeat the process until we have generated the required number of words. The sequence in the beam with the highest probability is the generated sequence.

<img src="
https://drive.google.com/uc?id=1hP8lfxkvVDN9q3Bu1WKWfKqjJMrHMVMo" alt="Beam Search" title="Beam Search"  height=370 />
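The timesteps above can be sketched on a toy problem. The table `toy_next` below is a hypothetical stand-in for the model (it conditions only on the previous word, and its probabilities are invented), and `beam_search` is an illustrative helper rather than the implementation used in the transformers library. Scores are accumulated as log-probabilities, which is a common choice to avoid multiplying many small numbers.

```python
import math

# Hypothetical next-word distributions keyed by the previous word (toy values).
toy_next = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"c": 0.55, "d": 0.45},
    "b": {"e": 0.9, "f": 0.1},
}

def beam_search(start, num_steps, beam_size):
    # each hypothesis is (words, cumulative log-probability)
    beam = [([start], 0.0)]
    for _ in range(num_steps):
        candidates = []
        for words, score in beam:
            for word, prob in toy_next.get(words[-1], {}).items():
                candidates.append((words + [word], score + math.log(prob)))
        if not candidates:
            break
        # keep only the beam_size highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][0]

print(beam_search("<s>", 2, beam_size=1))  # beam size 1 behaves like greedy search
print(beam_search("<s>", 2, beam_size=2))
```

With a beam size of 1 the search commits to `a` (0.6) and ends with a sequence of probability 0.6 × 0.55 = 0.33. With a beam size of 2 it also keeps `b` alive and finds `b e`, whose probability 0.4 × 0.9 = 0.36 is higher.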

In open-domain generation, beam search generally leads to **degenerate text** with a lot of repetition (as seen in the figure below), in contrast to the quality of the text decoded by stochastic methods (see next section) such as top-k sampling.

<img src="
https://drive.google.com/uc?id=1Zcmch-8vxIhAImn7RKJObzz0DSw5tabi" alt="Beam Search" title="Beam Search"  height=270 />

Picture courtesy: https://arxiv.org/pdf/1904.09751.pdf





### Stochastic Decoding Techniques
Stochastic decoding techniques sample from a model-dependent distribution at each step. Two successful techniques in this category are **top-k and nucleus (top-p)** sampling. To avoid sampling low-probability tokens, a typical approach is to restrict sampling to a subset of the vocabulary at each step. 

The top-k sampler restricts sampling to the k most-probable tokens as shown below in an example.

<img src="https://drive.google.com/uc?id=1rjd8FjS_r1MmXwn8Hnbms7NgJJTa3x2j" alt="Top-k Sampling" title="Top-k Sampling"  height=270 />
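As a minimal sketch of the top-k restriction, the snippet below keeps the k most probable entries of a made-up distribution and renormalizes them so they sum to 1. The dictionary `probs` and the helper `top_k_filter` are hypothetical names used only for this illustration.

```python
# Hypothetical next-word probabilities (toy values that sum to 1).
probs = {"scientist": 0.4, "researchers": 0.3, "unicorns": 0.15, "study": 0.1, "banana": 0.05}

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

filtered = top_k_filter(probs, k=2)
print(filtered)
```

Sampling then happens only among the surviving tokens, so an unlikely word such as `banana` can never be picked when k is 2.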


In contrast, **the nucleus (top-p) sampler** restricts sampling to the smallest set of most-probable tokens whose total probability mass exceeds a threshold p (a continuous value between 0 and 1), as shown in the example below.

<img src="https://drive.google.com/uc?id=1P3n2ygqxEQsYre-i_CvWQAOsqlzdho4Z" alt="Nucleus (Top-p) Sampling" title="Nucleus (Top-p) Sampling"  height=270 />
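A small sketch of the nucleus idea, under the assumption that a token is added to the set until the cumulative mass reaches p. The helper `top_p_filter` and the two toy distributions are made up for this illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most-probable tokens whose total mass reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # stop once the nucleus covers enough mass
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

peaked = {"the": 0.9, "a": 0.05, "an": 0.03, "one": 0.02}  # model is confident
flat = {"the": 0.3, "a": 0.3, "an": 0.2, "one": 0.2}       # model is uncertain
print(len(top_p_filter(peaked, 0.75)), len(top_p_filter(flat, 0.75)))
```

Unlike top-k, the number of candidate tokens adapts to the shape of the distribution: one token survives for the peaked distribution and three for the flat one.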


## Implementation

### GPT-2 Pre-training 

We will now perform the following:
- Create a toyish GPT-2 configuration by modifying the original configuration.
- Train a GPT-2 language model (using toyish hyperparameter configuration) from scratch on Yelp dataset
- Load the original GPT-2 language model (pretrained by the original authors)
- Prompt the model with seed text and decode using 
 - Greedy search
 - Top-k sampling
 - Top-p (or nucleus) sampling

### Prepare the GPT-2 training samples
* Please download Yelp dataset from https://www.kaggle.com/omkarsabnis/yelp-reviews-dataset
* the file we're going to use: **yelp.csv**

As usual, let us start by loading the drive.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install the transformers library (contains GPT-2 model)

In [2]:
!pip install transformers==2.1.1



Import the required libraries

In [3]:
'''
One place for all the imports
'''

import os
import json
import random
from tqdm import tqdm, trange
import numpy as np

import torch
from torch.utils.data import Dataset, RandomSampler, DataLoader

# import GPT-2 specific classes
from transformers import (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, AdamW, WarmupLinearSchedule)

import logging
logging.getLogger('transformers.tokenization_utils').setLevel(logging.ERROR)

# set the seed
manual_seed = 123
random.seed(manual_seed)
np.random.seed(manual_seed)
torch.manual_seed(manual_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
if n_gpu > 0:
  torch.cuda.manual_seed(manual_seed)

Load GPT-2 config from library

In [4]:
cache_dir = "/tmp" # to store pretrained checkpoints

# load config
original_config = GPT2Config.from_pretrained("gpt2", cache_dir=cache_dir)

Modify the config (hyperparameters) to train a toyish GPT-2 based language model

In [5]:
# construct a tutorial configuration file to setup a small version of GPT-2
config = original_config
config.n_layer = 2 # Number of Transformer blocks (hidden layers). (default 12)
config.n_embd = 60 # Dimensionality of the embeddings and hidden states. (default 768)
config.n_head = 2 # Number of attention heads in each Transformer block. (default 12)
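To get a feel for how much smaller the toy configuration is, here is a back-of-the-envelope parameter count. It is a sketch under stated assumptions: a vocabulary of 50257 tokens, 1024 positions, and an output layer that shares its weights with the token embedding. The helper `gpt2_param_count` is hypothetical and not part of the transformers library.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Approximate parameter count, assuming the output layer shares the token embedding."""
    token_emb = vocab_size * n_embd
    position_emb = n_positions * n_embd
    # per block: attention (4*d*d + 4*d), feed-forward (8*d*d + 5*d), two layer norms (4*d)
    per_block = 12 * n_embd * n_embd + 13 * n_embd
    final_layer_norm = 2 * n_embd
    return token_emb + position_emb + n_layer * per_block + final_layer_norm

print(gpt2_param_count(n_layer=12, n_embd=768))  # default configuration
print(gpt2_param_count(n_layer=2, n_embd=60))    # toy configuration
```

Under these assumptions the default configuration has about 124 million parameters and the toy one about 3.2 million, most of which sit in the token embedding. The number of heads does not change the count, but the embedding size should be divisible by it (60 / 2 = 30 here).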

Instantiate the toyish GPT-2 model and GPT-2 tokenizer

In [6]:
# load model
model = GPT2LMHeadModel(config)
model.to(device)

# load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", do_lower_case=False, cache_dir=cache_dir)

Prepare the Yelp dataset used to train our LM

In [7]:
'''
prepare the GPT-2 training samples
* download Yelp dataset from https://www.kaggle.com/omkarsabnis/yelp-reviews-dataset
* the file we're going to use: yelp.csv
'''
import csv

class TextDataset(Dataset):
  def __init__(self, tokenizer, file_path, block_size=512):
    assert os.path.isfile(file_path)
    print('creating features from dataset: %s'%file_path.split('/')[-1])
    # read raw text from the csv file.
    raw_text = ''
    with open(file_path) as csv_file: # close the file once reading is done
      csv_reader = csv.DictReader(csv_file)
      li = 0 # number of rows read so far
      for row in csv_reader:
        raw_text = raw_text + '%s '%row["text"]
        li = li + 1
        if li > 300: # stop once 301 rows have been read
          break

    # tokenize raw text
    tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(raw_text))
    # create segments
    self.examples = []
    for i in range(0, len(tokenized_text)-block_size+1, block_size): # Truncate in block of block_size
      self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i:i+block_size]))
  
  def __len__(self):
    return len(self.examples)
  
  def __getitem__(self, item):
    return torch.tensor(self.examples[item])
    
# path of yelp.csv file 
train_file = "yelp.csv" 
block_size = 64

train_dataset = TextDataset(tokenizer, train_file, block_size=block_size)  
print("%d instances of block size %d created"%(len(train_dataset), block_size))

creating features from dataset: yelp.csv
760 instances of block size 64 created
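The segmentation step in `TextDataset` can be tried on a tiny list to see what happens to the leftover tokens. This is a standalone sketch: `make_blocks` is a hypothetical helper that mirrors the `range(...)` loop above, and it leaves out the call to `build_inputs_with_special_tokens`.

```python
def make_blocks(token_ids, block_size):
    """Split token ids into consecutive, non-overlapping blocks; drop the leftover tail."""
    blocks = []
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        blocks.append(token_ids[start:start + block_size])
    return blocks

toy_ids = list(range(10))          # stand-in for real token ids
print(make_blocks(toy_ids, 4))     # the last 2 ids do not fill a block and are dropped
```

Every training instance therefore has exactly `block_size` tokens, which is why batches can be stacked into a tensor without padding.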


Print a sample batch (of size 2) from the training dataset

In [8]:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=2)
for step, batch in enumerate(train_dataloader):
  print(batch.size())
  break

torch.Size([2, 64])


Hyperparameters of our toyish GPT-2 model

In [9]:
train_batch_size = 50 # batch size 
max_steps = 25 # maximum training steps
learning_rate = 5e-5 # the initial learning rate for Adam.
adam_epsilon = 1e-8 # epsilon for Adam optimizer.
warmup_steps = 0 # linear warmup over warmup_steps.
num_train_epochs = 1 # total number of training epochs to perform.
max_grad_norm = 1.0 # max. gradient norm.
weight_decay = 0.0 # weight decay if we apply some.

Iterator for creating batches from training set

In [10]:
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset, sampler=train_sampler, batch_size=train_batch_size)
num_train_epochs = max_steps // (len(train_dataloader)) + 1
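The epoch count can be checked by hand with the numbers from this notebook. The sketch assumes the 760 instances reported earlier and the default `DataLoader` behaviour of keeping the final, smaller batch.

```python
import math

num_instances = 760      # reported by the dataset cell earlier in the notebook
train_batch_size = 50
max_steps = 25

# DataLoader keeps the final, smaller batch by default, hence the ceiling
batches_per_epoch = math.ceil(num_instances / train_batch_size)
num_train_epochs = max_steps // batches_per_epoch + 1

print(batches_per_epoch, num_train_epochs)
```

This gives 16 batches per epoch and 2 epochs, so the second epoch is cut short once `max_steps` is reached.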

Prepare optimizer and schedule (linear warmup and decay)

In [11]:
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in model.named_parameters() if not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
    {'params': [p for n, p in model.named_parameters() if any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=learning_rate, eps=adam_epsilon)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=warmup_steps, t_total=max_steps)
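For intuition, a linear warmup and decay schedule can be written as a multiplier on the base learning rate. The function below is an assumed form of that multiplier, written from the description "linear warmup and decay"; it is a sketch and not code copied from the transformers library.

```python
def linear_warmup_decay(step, warmup_steps, t_total):
    """Multiplier applied to the base learning rate at a given step (assumed form)."""
    if step < warmup_steps:
        # ramp up linearly from 0 to 1 during warmup
        return step / max(1, warmup_steps)
    # then decay linearly from 1 to 0 at t_total
    return max(0.0, (t_total - step) / max(1.0, t_total - warmup_steps))

base_lr = 5e-5
for step in (0, 5, 25):
    print(step, base_lr * linear_warmup_decay(step, warmup_steps=0, t_total=25))
```

With `warmup_steps = 0`, as in this notebook, the learning rate starts at its full value and falls linearly to zero by step 25.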

Kick-start the training

In [12]:
# bookkeeping variables
global_step = 0
tr_loss = 0.0
cur_epoch = 0

model.zero_grad() # clears the gradient buffer
train_iterator = trange(int(num_train_epochs), desc="Epoch", disable=False)

# run through the dataset
model.train() # set the mode as training
for _ in train_iterator:
  epoch_iterator = tqdm(train_dataloader, desc="Iteration", disable=False)
  for step, batch in enumerate(epoch_iterator):
    # read a batch
    inputs, labels = (batch, batch)
    inputs = inputs.to(device)
    labels = labels.to(device)

    # pass the data to the model
    outputs = model(inputs, labels=labels)  # note that labels are shifted inside the model (reference: https://huggingface.co/transformers/model_doc/gpt2.html)
    
    # seek the loss
    loss = outputs[0] 
    tr_loss += loss.item()
    
    # backprop
    loss.backward()

    # gradient clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    
    # update model parameters
    optimizer.step()  # update the model parameter
    scheduler.step()  # update learning rate schedule
    model.zero_grad() # clear the gradient buffer
    
    global_step += 1

    if max_steps > 0 and global_step > max_steps:
      epoch_iterator.close()
      break

  if max_steps > 0 and global_step > max_steps:
    train_iterator.close()
    break
    
  cur_epoch += 1
  print("%d epoch loss: %.2f"%(cur_epoch, tr_loss/global_step))

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]
Iteration: 100%|██████████| 16/16 [00:00<00:00, 17.48it/s]
Epoch:  50%|█████     | 1/2 [00:00<00:00,  1.08it/s]

1 epoch loss: 10.80

Iteration:  50%|█████     | 8/16 [00:00<00:00, 14.54it/s]
Epoch:  50%|█████     | 1/2 [00:01<00:01,  1.49s/it]
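Two details of the training loop can be illustrated in plain Python. First, passing the same tensor as `inputs` and `labels` works because the labels are shifted inside the model, so each position is trained to predict the next token. Second, a model that spreads probability evenly over the vocabulary has a cross-entropy of ln(vocabulary size). The helpers `shifted_pairs` and `uniform_cross_entropy` are hypothetical names for this sketch, and the vocabulary size of 50257 is assumed.

```python
import math

def shifted_pairs(token_ids):
    """Pair each position's input with the token it should predict (the next one)."""
    return list(zip(token_ids[:-1], token_ids[1:]))

def uniform_cross_entropy(vocab_size):
    """Average loss of a model that assigns equal probability to every token."""
    return -math.log(1.0 / vocab_size)

print(shifted_pairs([818, 257, 14702, 4917]))
print(round(uniform_cross_entropy(50257), 2))
```

The uniform baseline comes out at roughly 10.82, close to the loss of 10.80 reported after the first epoch, which suggests that the toy model has barely moved away from its random initialization in so few steps.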


## Text Generation with GPT-2 
Now that we got a glimpse of training GPT-2 (toy version) from scratch, we can move to generating some interesting text from GPT-2 given some seed text.

Let us create some seed text.

In [13]:
'''
define some seed text to be given as input to GPT-2 model
'''

SEED_TEXT = """In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English. The"""

Since our GPT-2 model trained on Yelp is toyish, we will use the original GPT-2 model pretrained on WebText for the rest of the tutorial.

In [14]:
'''
load model and tokenizer for GPT-2 (pretrained version)
'''
cache_dir = "/tmp" # to store pretrained checkpoints

# load model
model = GPT2LMHeadModel.from_pretrained('gpt2', cache_dir=cache_dir)
model.to(device)
model.eval()

# load tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2", do_lower_case=False, cache_dir=cache_dir)

Set the number of words to generate

In [15]:
'''
parameter for generation
'''
num_words_to_generate = 70

### Greedy Search

Let us start with the simplest, though not very effective, decoding technique: greedy search. Greedy search outputs the token that received the highest probability at each time step. 

We will first tokenize the seed text (`In a shocking finding, scientist ...`) and extract the GPT-2 tokens using GPT-2 tokenizer.



In [16]:
# tokenize all the words in seed text using GPT-2 tokenizer
seed_tokens = tokenizer.encode(SEED_TEXT, add_special_tokens=False)
print(len(seed_tokens)) # there are 53 GPT-2 tokens in the seed text
print(seed_tokens[0:10]) # token ids for first 10 tokens
print(tokenizer.decode(seed_tokens[0:10])) # raw text based on first 10 tokens

53
[818, 257, 14702, 4917, 11, 11444, 5071, 257, 27638, 286]
In a shocking finding, scientist discovered a herd of


We will convert the token ids to tensor so that we can feed it to the model.

In [17]:
# create the tensor to store the model input (initially the input is just the seed text)
generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)
print(generated)
print(generated.size())

tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
torch.Size([1, 53])


We will now feed the tensor containing token ids to the GPT-2 model for inference. 

In [18]:
# prepare gpt-2 model input
inputs = {'input_ids': generated}

# feed input to the model
outputs = model(**inputs)[0]

print(outputs.size()) # [batch size, number of tokens in input, number of tokens in GPT-2 vocabulary]

torch.Size([1, 53, 50257])


The `outputs` tensor contains the logits (unnormalized scores over the GPT-2 vocabulary, which a softmax turns into a probability distribution) for each position in the input. We extract the logits at the last position (the 53rd token) to decide which token to generate next.



In [19]:
# extract the next token logits (unnormalized probability distribution)
next_token_logits = outputs[:, -1, :]
print(next_token_logits.size()) # unnormalized probability distribution for next word prediction

torch.Size([1, 50257])
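Since logits are raw scores rather than probabilities, a softmax is needed whenever actual probabilities are required. The sketch below uses a made-up three-token vocabulary in plain Python; `softmax` here is a hand-written helper, not the PyTorch function.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.1]              # made-up scores for a 3-token vocabulary
probs = softmax(toy_logits)
print([round(p, 3) for p in probs])
print(probs.index(max(probs)) == toy_logits.index(max(toy_logits)))  # same argmax
```

Softmax preserves the ordering of the scores, so taking the argmax of the logits directly (as greedy search does) selects the same token as taking the argmax of the probabilities.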


**In greedy search, we choose the word with the highest probability as the word to be generated.**

<img src="
https://drive.google.com/uc?id=1QBKIlCj3y8kZ27x5ojXcKANsOXga2qGf" alt="Greedy Search" title="Greedy Search"  height=400 />

In [20]:
# find the token with the highest probability (argmax)
next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)
print(next_token) # id for the predicted token

tensor([[4837]], device='cuda:0')


Now we can append this word to the input tensor (`generated`) and print out the generated text so far. 

In [21]:
print('before appending the new generated token...')
print(generated)
# add the generated token to the input
generated = torch.cat((generated, next_token), dim=1)
print('after appending the new generated token...')
print(generated)
print('our text (seed text + generated text) so far...')
print(tokenizer.decode(generated.squeeze().tolist(), clean_up_tokenization_spaces=True))

before appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
after appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383,  4837]], device='cuda:0')
our text (seed tex

We can repeat the process with the new `generated` (input) tensor and predict the rest of the words you want to generate.

And the full-fledged code for greedy search is:

In [22]:
def greedy_search(seed_text):  
  # tokenize all the words in seed text using GPT-2 tokenizer
  seed_tokens = tokenizer.encode(seed_text, add_special_tokens=False)
  
  # create the tensor to store the model input (initially the input is just the seed text)
  generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)

  with torch.no_grad():
    for _ in range(num_words_to_generate): # one iteration per token to generate
      # prepare gpt-2 model input
      inputs = {'input_ids': generated}

      # feed input to the model
      outputs = model(**inputs)[0]

      # extract the next token logits (unnormalized scores)
      next_token_logits = outputs[:, -1, :]

      # find the token with highest probability (argmax)
      next_token = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

      # add the generated token to the input
      generated = torch.cat((generated, next_token), dim=1)
  
  # convert the model generation from token ids to raw text
  generated = generated[:, len(seed_tokens):].tolist() # keep only the newly generated tokens (drop the seed)
  for g in generated: # for every sequence in the batch (only one here)
    text = tokenizer.decode(g, clean_up_tokenization_spaces=True)

  return text

print("greedy search's seed text = %s"%SEED_TEXT)
print("greedy search's continuation text: %s"%greedy_search(SEED_TEXT))

greedy search's seed text = In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English. The
greedy search's continuation text:  researchers say that the unicorns were able to communicate with each other through their

speech, and that they were able to communicate with each other through their eyes.

The researchers say that the unicorns were able to communicate with each other through their eyes.

The researchers say that the unicorns were able to communicate with each other through


Greedy search clearly suffers from a repetition problem. In general, greedy search doesn't work well for open-domain generation.

To overcome the repetition problem and encourage more diverse text, we will explore stochastic methods, which sample the next token while avoiding low-probability tokens. A typical approach is to restrict sampling to a subset of the vocabulary at each step.

### Top-k sampling
The top-k sampler restricts sampling to the k most-probable tokens. This was the decoding strategy used by the original GPT-2 paper.

We will first tokenize the seed text (`In a shocking finding, scientist ...`) and extract the GPT-2 tokens using GPT-2 tokenizer.

In [23]:
# tokenize all the words in seed text using GPT-2 tokenizer
seed_tokens = tokenizer.encode(SEED_TEXT, add_special_tokens=False)
print(len(seed_tokens)) # there are 53 GPT-2 tokens in the seed text
print(seed_tokens[0:10]) # token ids for first 10 tokens
print(tokenizer.decode(seed_tokens[0:10])) # raw text based on first 10 tokens

53
[818, 257, 14702, 4917, 11, 11444, 5071, 257, 27638, 286]
In a shocking finding, scientist discovered a herd of


We will convert the token ids to tensor so that we can feed it to the model.

In [24]:
# create the tensor to store the model input (initially the input is just the seed text)
generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)
print(generated)
print(generated.size())

tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
torch.Size([1, 53])


We will now feed the tensor containing token ids to the GPT-2 model for inference. 

In [25]:
# prepare gpt-2 model input
inputs = {'input_ids': generated}

# feed input to the model
outputs = model(**inputs)[0]

print(outputs.size()) # batch size x number of tokens in input x number of tokens in GPT-2 vocabulary

torch.Size([1, 53, 50257])


The `outputs` tensor contains the logits (unnormalized scores over the GPT-2 vocabulary, which a softmax turns into a probability distribution) for each position in the input. We extract the logits at the last position (the 53rd token) to decide which token to generate next.

In [26]:
# extract the next token logits (unnormalized probability distribution)
next_token_logits = outputs[:, -1, :]
print(next_token_logits.size()) # unnormalized probability distribution for next word prediction

torch.Size([1, 50257])


**In top-k sampling, we sample the word from the k most probable tokens to generate the next word.**

<img src="
https://drive.google.com/uc?id=1rjd8FjS_r1MmXwn8Hnbms7NgJJTa3x2j" alt="Top-k Sampling" title="Top-k Sampling"  height=270 />

So we first remove all tokens whose logit is lower than that of the last (k-th) token in the top-k list.



In [27]:
# Remove all tokens with a probability less than the last token of the top-k list
indices_to_remove = next_token_logits < torch.topk(next_token_logits, 40)[0][..., -1, None] # top_k = 40
next_token_logits[indices_to_remove] = -float('Inf') # substitute negative infinity so that those words would never be sampled
print(next_token_logits.size()) # unnormalized probability distribution for next word prediction

torch.Size([1, 50257])


Now we can sample from top-k probability distribution

In [28]:
# Sample from top-k probability distribution
next_token = torch.multinomial(torch.nn.functional.softmax(next_token_logits, dim=-1), num_samples=1)
print(next_token) # id for the predicted token

tensor([[1074]], device='cuda:0')
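The masking and sampling steps can be reproduced on a handful of made-up logits to see why setting a logit to negative infinity removes a token from consideration. In this plain-Python sketch, `random.choices` plays the role that `torch.multinomial` plays in the notebook, and `sample_top_k` is a hypothetical helper.

```python
import math
import random

def sample_top_k(logits, k, rng):
    """Mask everything outside the top k with -inf, apply softmax weights, then sample one index."""
    threshold = sorted(logits, reverse=True)[k - 1]   # score of the k-th best token
    masked = [x if x >= threshold else -math.inf for x in logits]
    m = max(masked)
    weights = [math.exp(x - m) for x in masked]       # exp(-inf) == 0.0
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(123)                  # fixed seed so the run is repeatable
toy_logits = [3.0, 2.5, 0.5, -1.0, -2.0]  # made-up scores
samples = [sample_top_k(toy_logits, k=2, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))               # only indices 0 and 1 can appear
```

A masked token gets a weight of exactly zero after exponentiation, so it can never be drawn, while the surviving tokens are sampled in proportion to their softmax probabilities.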


Now we can append this word to the input tensor (`generated`) and print out the generated text so far. 

In [29]:
print('before appending the new generated token...')
print(generated)
# add the generated token to the input
generated = torch.cat((generated, next_token), dim=1)
print('after appending the new generated token...')
print(generated)
print('our text (seed text + generated text) so far...')
print(tokenizer.decode(generated.squeeze().tolist(), clean_up_tokenization_spaces=True))

before appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
after appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383,  1074]], device='cuda:0')
our text (seed tex

We can repeat the process with the new `generated` (input) tensor and predict the rest of the words you want to generate.

And the full-fledged code for top-k sampling is:

In [30]:
def topk_sampling(seed_text, top_k=100, filter_value=-float('Inf')): 
  # tokenize all the words in seed text using GPT-2 tokenizer
  seed_tokens = tokenizer.encode(seed_text, add_special_tokens=False)

  # create the tensor to store the model input (initially the input is just the seed text)
  generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)

  with torch.no_grad():
    for _ in range(num_words_to_generate): # one iteration per token to generate
      # prepare gpt-2 model input
      inputs = {'input_ids': generated}

      # feed input to the model
      outputs = model(**inputs)[0]

      # extract the next token logits (unnormalized scores)
      next_token_logits = outputs[:, -1, :] 
      
      # Safety check just in case number of words in the vocabulary is less than top_k, 
      # we need to set top_k to number of words in the vocabulary
      top_k = min(top_k, next_token_logits.size(-1))  

      # Remove all tokens with a probability less than the last token of the top-k list
      indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
      next_token_logits[indices_to_remove] = filter_value # substitute negative infinity so that those words would never be sampled

      # Sample from top-k probability distribution
      next_token = torch.multinomial(torch.nn.functional.softmax(next_token_logits, dim=-1), num_samples=1)
      
      # add the generated token to the input
      generated = torch.cat((generated, next_token), dim=1)

  # convert the model generation from token ids to raw text
  generated = generated[:, len(seed_tokens):].tolist() # keep only the newly generated tokens (drop the seed)
  for g in generated: # for every sequence in the batch (only one here)
    text = tokenizer.decode(g, clean_up_tokenization_spaces=True)

  return text
 
print("top-k sampling's seed text = %s"%SEED_TEXT)
print("top-k sampling's continuation text = %s"%topk_sampling(SEED_TEXT, top_k=40))

top-k sampling's seed text = In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English. The
top-k sampling's continuation text =  study was published in the journal Philosophical Transactions of the Royal Society B.

But despite that astonishing discovery, scientists still think they have a long way to go before researchers believe they have had a unicorn.

According to John D. Hines, an ecologist at the University of Georgia Institute of Technology, "there were no studies


### Top-p (or nucleus) sampling

In contrast to top-k sampling, which always keeps a fixed number of candidate tokens, the nucleus (top-p) sampler restricts sampling to the smallest set of most probable tokens whose total probability mass exceeds a threshold p (a continuous value between 0 and 1).
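
To make the difference from top-k concrete, here is a minimal plain-Python sketch that does not use the GPT-2 model. The helper names `top_k_set` and `top_p_set` and the two toy distributions are assumptions made for illustration only. It shows that top-k keeps the same number of candidates regardless of the shape of the distribution, while the nucleus shrinks when the model is confident and grows when it is not.

```python
def top_k_set(probs, k):
    """Indices of the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]


def top_p_set(probs, p):
    """Smallest set of most probable tokens whose total mass exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass > p:  # stop as soon as the threshold is crossed
            break
    return kept


# Toy next-token distributions (hypothetical values)
peaked = [0.90, 0.05, 0.03, 0.01, 0.01]  # the model is confident
flat = [0.22, 0.21, 0.20, 0.19, 0.18]    # the model is uncertain

print(len(top_k_set(peaked, 3)), len(top_k_set(flat, 3)))      # 3 3
print(len(top_p_set(peaked, 0.8)), len(top_p_set(flat, 0.8)))  # 1 4
```

With `k=3`, top-k would still sample from three tokens in the peaked case, even though two of them are unlikely; with `p=0.8`, the nucleus contains only the single dominant token.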

We will first tokenize the seed text (`In a shocking finding, scientist ...`) using the GPT-2 tokenizer and place the resulting token ids in a tensor.

In [31]:
# create the tensor to store the model input (initially the input is just the seed text)
generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)
print(generated)
print(generated.size())

tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
torch.Size([1, 53])


We will now feed the tensor containing token ids to the GPT-2 model for inference. 

In [32]:
# prepare gpt-2 model input
inputs = {'input_ids': generated}

# feed input to the model
outputs = model(**inputs)[0]

print(outputs.size()) # batch size x number of tokens in input x number of tokens in GPT-2 vocabulary

torch.Size([1, 53, 50257])


The `outputs` tensor contains the logits (an unnormalized probability distribution over the words in the GPT-2 vocabulary) for each token in the input. We extract the logits at the last position (the 53rd token), since these determine the next token to be generated.

In [33]:
# extract the next token logits (unnormalized probability distribution)
next_token_logits = outputs[:, -1, :]
print(next_token_logits.size()) # unnormalized probability distribution for next word prediction

torch.Size([1, 50257])


**In top-p sampling, we sample from the smallest set of tokens with total mass above a threshold p (which is a continuous value that ranges between 0 and 1) as shown below in an example.**

<img src="https://drive.google.com/uc?id=1P3n2ygqxEQsYre-i_CvWQAOsqlzdho4Z" alt="top-p sampling" title="top-p sampling" height=270 />

So we first compute the cumulative probability (the last column in the rightmost table in the figure).


In [34]:
# compute the cumulative probability
sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True) # sort words based on probability
cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)
print(cumulative_probs.size()) # 1 x number of tokens in GPT-2 vocabulary

torch.Size([1, 50257])


Then we remove the tokens whose cumulative probability lies above the given threshold, keeping the first token that crosses it so that the retained mass exceeds p.

In [35]:
# Remove tokens with cumulative probability above the threshold
sorted_indices_to_remove = cumulative_probs > 0.9 # top-p=0.9
# Shift the mask one position to the right so that the first token crossing the threshold is also kept
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0 # always keep the most probable token
# scatter sorted tensors to original indexing
indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove)
next_token_logits[indices_to_remove] = -float('Inf') # substitute negative infinity so that those words would never be sampled
print(next_token_logits.size()) # unnormalized probability distribution for next word prediction

torch.Size([1, 50257])
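
The shift-to-the-right step in the cell above is easy to misread, so here is a small plain-Python sketch of what it does. The four probabilities are made-up values, and the names `naive_remove` and `shifted_remove` are illustrative assumptions, not part of the notebook. Without the shift, the token that crosses the threshold would be removed too, and the retained mass would fall below p.

```python
from itertools import accumulate

# Toy probabilities, already sorted from most to least probable (hypothetical values)
sorted_probs = [0.5, 0.3, 0.15, 0.05]
top_p = 0.9

cumulative_probs = list(accumulate(sorted_probs))  # roughly [0.5, 0.8, 0.95, 1.0]

# Naive mask: True means "remove". This also removes the token that crosses the threshold.
naive_remove = [c > top_p for c in cumulative_probs]

# Shifted mask: move every flag one position to the right and always keep the first token.
shifted_remove = [False] + naive_remove[:-1]

# Total probability mass that survives under each mask
naive_mass = sum(p for p, r in zip(sorted_probs, naive_remove) if not r)
shifted_mass = sum(p for p, r in zip(sorted_probs, shifted_remove) if not r)

print(naive_remove)    # [False, False, True, True]
print(shifted_remove)  # [False, False, False, True]
print(naive_mass < top_p < shifted_mass)  # True
```

Only the shifted mask yields a kept set whose total mass is above the threshold, which is what the definition of the nucleus requires.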


Now we can sample from the top-p probability distribution.

In [36]:
# Sample from top-p probability distribution
next_token = torch.multinomial(torch.nn.functional.softmax(next_token_logits, dim=-1), num_samples=1)
print(next_token) # id for the predicted token     

tensor([[2050]], device='cuda:0')
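
Why does setting the removed logits to negative infinity work? The following plain-Python sketch uses a hand-written `softmax` helper and made-up logits, both assumptions for illustration. It shows that a logit of negative infinity receives exactly zero probability, while the remaining tokens are renormalized and keep their relative proportions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits before and after filtering (hypothetical values)
logits = [2.0, 1.0, 0.0, -1.0]
filtered = [2.0, 1.0, float("-inf"), float("-inf")]

probs_before = softmax(logits)
probs_after = softmax(filtered)

print(probs_after[2], probs_after[3])  # 0.0 0.0
print(abs(sum(probs_after) - 1.0) < 1e-12)  # True

# The ratio between the two surviving tokens is unchanged by the filtering
print(math.isclose(probs_before[0] / probs_before[1],
                   probs_after[0] / probs_after[1]))  # True
```

Because the filtered tokens have probability zero, `torch.multinomial` can never draw them, which is the behaviour the sampling cell relies on.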


Now we can append this token to the input tensor (`generated`) and print out the generated text so far.

In [37]:
print('before appending the new generated token...')
print(generated)
# add the generated token to the input
generated = torch.cat((generated, next_token), dim=1)
print('after appending the new generated token...')
print(generated)
print('our text (seed text + generated text) so far...')
print(tokenizer.decode(generated.squeeze().tolist(), clean_up_tokenization_spaces=True))

before appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383]], device='cuda:0')
after appending the new generated token...
tensor([[  818,   257, 14702,  4917,    11, 11444,  5071,   257, 27638,   286,
         28000, 19942,  2877,   287,   257,  6569,    11,   220,   198,  3866,
          8647, 31286,  1850, 19272,    11,   287,   262,   843,   274, 21124,
            13,  3412,   517,  6452,   284,   262,   198,   260,   325,   283,
          3533,   373,   262,  1109,   326,   262, 28000, 19942,  5158,  2818,
          3594,    13,   383,  2050]], device='cuda:0')
our text (seed tex

We can repeat this process with the new `generated` (input) tensor to predict as many further tokens as we want to generate.

The full-fledged code for top-p sampling is:

In [38]:
def top_p_sampling(seed_text, top_p=0.9, filter_value=-float('Inf')):  
  # tokenize all the words in seed text using GPT-2 tokenizer
  seed_tokens = tokenizer.encode(seed_text, add_special_tokens=False)
  
  # create the tensor to store the model input (initially the input is just the seed text)
  generated = torch.tensor(seed_tokens, dtype=torch.long, device=device).unsqueeze(0)
  
  with torch.no_grad():
    for _ in range(num_words_to_generate):
      # prepare gpt-2 model input
      inputs = {'input_ids': generated}

      # feed input to the model
      outputs = model(**inputs)[0]

      # extract the next token logits (unnormalized probability distribution)
      next_token_logits = outputs[:, -1, :]
      
      # compute the cumulative probability
      sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
      cumulative_probs = torch.cumsum(torch.nn.functional.softmax(sorted_logits, dim=-1), dim=-1)

      # Remove tokens with cumulative probability above the threshold
      sorted_indices_to_remove = cumulative_probs > top_p
      # Shift the mask one position to the right so that the first token crossing the threshold is also kept
      sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
      sorted_indices_to_remove[..., 0] = 0 # always keep the most probable token
      # scatter sorted tensors to original indexing
      indices_to_remove = sorted_indices_to_remove.scatter(dim=1, index=sorted_indices, src=sorted_indices_to_remove)
      next_token_logits[indices_to_remove] = filter_value # substitute negative infinity so that those words would never be sampled

      # Sample from top-p probability distribution
      next_token = torch.multinomial(torch.nn.functional.softmax(next_token_logits, dim=-1), num_samples=1)
      
      # add the generated token to the input
      generated = torch.cat((generated, next_token), dim=1)
  
  # convert the model generation from token ids to raw text
  generated = generated[:, len(seed_tokens):].tolist() 
  for g in generated:
    text = tokenizer.decode(g, clean_up_tokenization_spaces=True)
  
  return text

print("nucleus (top-p) sampling's seed text = %s"%SEED_TEXT)
print("nucleus (top-p) sampling's continuation text = %s"%top_p_sampling(SEED_TEXT, top_p=0.9))

nucleus (top-p) sampling's seed text = In a shocking finding, scientist discovered a herd of unicorns living in a remote, 
previously unexplored valley, in the Andes Mountains. Even more surprising to the
researchers was the fact that the unicorns spoke perfect English. The
nucleus (top-p) sampling's continuation text =  Langans knew one such unicorn -- a seabird like one created by their ancestors from India.

Sixty-four years later, archaeologists will unveil the first evidence of a "real historical living ancestor."

This fossil found near a city in the Patagonian border region of the Andes, National Geographic reports.

The


That's it!

If you're curious, try out the online GPT-2 demo: https://transformer.huggingface.co/doc/gpt2-large