Experimenting with text generation using Hugging Face Transformers, and exploring different decoding methods such as Beam Search, Top-K sampling and Top-P sampling.

In [2]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/cd/40/866cbfac4601e0f74c7303d533a9c5d4a53858bd402e08e3e294dd271f25/transformers-4.2.1-py3-none-any.whl (1.8MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=cfd26b9ee3bbd

In [3]:
SEED = 34  # Reproducibility
MAX_LEN = 70  # Maximum length of the output in tokens (not words), including the prompt

In [4]:
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer  # TensorFlow GPT-2 model class and its tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")  # Load the GPT-2 Large tokenizer
# GPT-2 has no pad token, so the EOS token id is used for padding
GPT2 = TFGPT2LMHeadModel.from_pretrained("gpt2-large", pad_token_id=tokenizer.eos_token_id)  # Load the GPT-2 Large model

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [5]:
GPT2.summary()

Model: "tfgp_t2lm_head_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
transformer (TFGPT2MainLayer multiple                  774030080 
Total params: 774,030,080
Trainable params: 774,030,080
Non-trainable params: 0
_________________________________________________________________


Decoding Methods

1. First Pass (Greedy Search):

Greedy search predicts the next word as the one with the highest conditional probability,

>$w_{t} = \arg\max_{w} P(w|w_{1:t-1})$

at each timestep $t$.
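
As a rough illustration of the argmax rule, here is a minimal sketch on a toy model. The table `TOY_MODEL`, its probabilities and the helper `greedy_decode` are made up for this example and are not part of the Transformers library. The toy model also conditions only on the previous word, whereas GPT-2 conditions on the whole prefix.

```python
# Toy next-word distributions P(w | previous word); all numbers are invented.
TOY_MODEL = {
    "the": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def greedy_decode(start, steps):
    """Repeatedly append the single most probable next word."""
    words = [start]
    for _ in range(steps):
        dist = TOY_MODEL.get(words[-1])
        if dist is None:  # no known continuation: stop early
            break
        words.append(max(dist, key=dist.get))  # argmax over the toy vocabulary
    return words


greedy_path = greedy_decode("the", 2)
print(greedy_path)
```

In this toy table the greedy path has probability 0.5 × 0.4 = 0.2, while "the dog has" would have 0.4 × 0.9 = 0.36, so always taking the local maximum does not guarantee the most probable sequence.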

In [6]:
import tensorflow as tf
tf.random.set_seed(SEED)

In [7]:
input_sequence = "This is a simple sequence, based on"

In [8]:
input_ids = tokenizer.encode(input_sequence, return_tensors='tf') #Encoding the input sequence

In [9]:
greedy_output = GPT2.generate(input_ids, max_length = MAX_LEN) #Text generated based on the Greedy Search

In [10]:
print('Output: \n')
print(tokenizer.decode(greedy_output[0], skip_special_tokens = True))

Output: 

This is a simple sequence, based on the idea that the first two steps are the most important.

The first step is to find the first element in the sequence. This is done by using the first element of the sequence as a key.

The second step is to find the second element in the sequence. This is done by using


2. Beam Search:

Because greedy search always picks the single most probable next word, it can miss high-probability words that sit behind a lower-probability word. Beam search reduces this risk.

With beam search the model keeps the $num\_beams$ most likely hypotheses at each time step, so it can compare alternative paths as it generates text. An n-gram penalty can be added by setting $no\_repeat\_ngram\_size = 2$, which ensures that no 2-gram appears twice. Setting $num\_return\_sequences = 5$ returns all 5 beams, so the lower-ranked hypotheses can be inspected as well.

These parameters are passed to the `generate` function to enable beam search.
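
The pruning step can be sketched on a toy model. This is a simplified illustration under stated assumptions, not the algorithm as implemented in Transformers: `TOY_MODEL`, its probabilities and the helper `beam_search` are hypothetical, the model conditions only on the previous word, and there is no length penalty, n-gram blocking or early stopping.

```python
import math

# Toy next-word distributions P(w | previous word); all numbers are invented.
TOY_MODEL = {
    "the": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (words, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for words, score in beams:
            dist = TOY_MODEL.get(words[-1])
            if dist is None:  # no continuation: carry the hypothesis over unchanged
                candidates.append((words, score))
                continue
            for word, prob in dist.items():
                candidates.append((words + [word], score + math.log(prob)))
        # Prune to the best num_beams hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams


best_words, best_score = beam_search("the", 2, num_beams=2)[0]
print(best_words, math.exp(best_score))
```

With `num_beams=2` the sketch keeps "the dog" alive after the first step and therefore finds "the dog has", which a single-beam (greedy) search over the same table would miss.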

In [11]:
beam_outputs = GPT2.generate(
    input_ids, 
    max_length = MAX_LEN, 
    num_beams = 5, 
    no_repeat_ngram_size = 2, 
    num_return_sequences = 5, 
    early_stopping = True
)

In [11]:
print("Output:\n" + 100 * '-')

In [12]:
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

0: This is a simple sequence, based on the fact that the first letter of each word is the same as the last letter in the previous word.

For example, if you want to say "I love you", you would write "i-love-you" and then write the rest of the words in reverse order. If you wanted to
1: This is a simple sequence, based on the fact that the first letter of each word is the same as the last letter in the previous word.

For example, if you want to say "I love you", you would write "i-love-you" and then write the rest of the words in this sequence: "love", "
2: This is a simple sequence, based on the fact that the first letter of each word is the same as the last letter in the previous word.

For example, if you want to say "I love you", you would write "i-love-you" and then write the rest of the words in reverse order. This is called a
3: This is a simple sequence, based on the fact that the first letter of each word is the same as the last letter in the previous word.

For example, if yo

The outputs above show that the 5 beam hypotheses are nearly identical and differ only in their last few words. Increasing $num\_beams$ may produce more variation.

3. Basic Sampling:

Instead of always predicting the word with the highest probability, the next word can be picked at random according to the conditional probability distribution:

$w_{t} \sim P(w|w_{1:t-1})$

The $temperature$ parameter rescales this distribution. A value below 1 (0.8 here) increases the chances of high-probability words and decreases the chances of low-probability words in the sampling.

Setting $do\_sample = True$ enables sampling.
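
The effect of the temperature can be sketched with a plain softmax. This is a minimal illustration, assuming the usual convention of dividing the logits by the temperature before normalising. The helper `softmax_with_temperature` and the three logits are made up for the example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # invented scores for a three-word vocabulary
probs_default = softmax_with_temperature(logits, 1.0)
probs_sharp = softmax_with_temperature(logits, 0.8)

# Draw the next word index at random, weighted by the sharpened distribution.
rng = random.Random(34)
next_index = rng.choices(range(len(logits)), weights=probs_sharp, k=1)[0]
print(probs_default, probs_sharp, next_index)
```

Comparing `probs_default` with `probs_sharp` shows the most likely word gaining probability and the least likely word losing it when the temperature drops below 1.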

In [13]:
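sample_output = GPT2.generate(
    input_ids,
    do_sample = True,      # sample from the distribution instead of taking the argmax
    max_length = MAX_LEN,
    top_k = 0,             # 0 disables Top-K filtering
    temperature = 0.8      # values below 1 sharpen the distribution
)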

In [20]:
print('Output : \n')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True))

Output : 

This is a simple sequence, based on the following Python code:

import random height = random.randint(1, 6) # height of the image, in pixels # randomization is done each frame, based on the image size. # The average number of bits per pixel is given by the inverse of the X-coordinate.


4. Top-K Sampling:

Only the $K$ most likely words are kept, and the entire probability mass is redistributed among these $K$ words. Rather than boosting high-probability words and damping low-probability words as the temperature does, Top-K sampling removes the low-probability words altogether.

The `top_k` argument sets how many of the top words are kept when sampling from the conditional probability distribution.
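
The filtering step can be sketched as follows. The helper `top_k_filter` and the four-word distribution are hypothetical and only illustrate the idea; they are not the Transformers implementation.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words, zero out the rest and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.3, 0.15, 0.05]  # invented distribution over four words
filtered = top_k_filter(probs, k=2)
print(filtered)
```

The two surviving words share the whole probability mass in the same ratio as before (5:3), and the other two can no longer be sampled.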

In [21]:
sample_output = GPT2.generate(input_ids, do_sample = True, max_length = MAX_LEN, top_k = 50)

In [22]:
print("Output:\n")
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

This is a simple sequence, based on a random sample and it doesn't require much in the way of memory allocation and doesn't make heavy use of random number generation. As an extreme example, it might allocate a vector or integer array of length 1,000,000 and a random object of width 100. With the "random" parameter, you ...


5. Top-P Sampling:

Instead of choosing the Top-K most likely words, Top-P sampling chooses the smallest set of words whose total probability exceeds $p$, and then the entire probability mass is redistributed among the words in this set.

The major difference between Top-K and Top-P is that the Top-K value is static, so the number of words chosen is always the same, whereas in Top-P sampling the size of the set can change from step to step.
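
A minimal sketch of this varying set size, assuming the set is built by adding words in order of decreasing probability until the cumulative mass exceeds $p$. The helper `top_p_filter` and both distributions are made up for the example and are not the Transformers implementation.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose total mass exceeds p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass > p:  # threshold crossed: the set is complete
            break
    # Renormalise the kept words and zero out everything else.
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


peaked = [0.9, 0.05, 0.03, 0.02]  # one dominant word
flat = [0.4, 0.3, 0.2, 0.1]       # probability spread more evenly

peaked_kept = sum(1 for q in top_p_filter(peaked, 0.8) if q > 0)
flat_kept = sum(1 for q in top_p_filter(flat, 0.8) if q > 0)
print(peaked_kept, flat_kept)
```

With the same `p`, the peaked distribution keeps a single word while the flatter one keeps three, which is the behaviour a fixed `top_k` cannot reproduce.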

In [23]:
sample_output = GPT2.generate(input_ids, do_sample = True, max_length = MAX_LEN, top_p = 0.8, top_k = 0)

In [24]:
print('Output:\n')
print(tokenizer.decode(sample_output[0], skip_special_tokens = True), '...')

Output:

This is a simple sequence, based on a static search:

for(i=0; i< 16; i++) {

We replace the eight bytes with a MZ bytes, and then if the result is not an NNN byte, use an explicit or getGzdMZ() to compute the remainder. If the ...


6. Check the diversity of the generated sentences by setting `top_k`, `top_p` and `temperature` together

In [25]:
sample_outputs = GPT2.generate(input_ids, do_sample = True, max_length = 2*MAX_LEN, temperature = .7, 
                               top_k = 50, top_p = 0.85, num_return_sequences = 5)

In [26]:
print('Output:\n')
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}...".format(i, tokenizer.decode(sample_output, skip_special_tokens = True)))

Output:

0: This is a simple sequence, based on a set of equations that can be solved for a given set of inputs.

If the input is a positive number, the output is a number from 1 to 9. If the input is a negative number, the output is a number from 0 to 9.

The solution is a list of numbers from 1 to 9.

The function calculates the sum of the numbers in the list.

The function returns the number in the list.

If the input is a positive number, the output is a number from 1 to 9. If the input is a negative number, the output is a number from 0 to 9.

...
1: This is a simple sequence, based on the simple rule of the first line.

The first line is the line where the program starts.

The second line is the line where the program stops.

The third line is the line where the program continues.

The fourth line is the line where the program ends.

This is the basic sequence for a program.

The basic sequence for a program is the same as the first line.

The first line is the line where the pro