In [1]:
!pip install -U transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/98/87/ef312eef26f5cecd8b17ae9654cdd8d1fae1eb6dbd87257d6d73c128a4d0/transformers-4.3.2-py3-none-any.whl (1.8MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=dbfe7

In [3]:
import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

In [4]:
tf.__version__

'2.4.1'

In [6]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

model = TFGPT2LMHeadModel.from_pretrained("gpt2", pad_token_id = tokenizer.eos_token_id)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


# Greedy Search

Greedy search simply selects the word with the highest probability as its next word.
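
As a rough illustration of this selection rule, the sketch below runs greedy decoding over a tiny hand-written table of next-word probabilities. The table TOY_MODEL, its numbers, and the helper greedy_decode are invented for this example and have nothing to do with GPT-2's actual distribution.

```python
# Hypothetical next-word probabilities, keyed by the previous word.
# The numbers are invented for illustration; they do not come from GPT-2.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.6, "drives": 0.3, "turns": 0.1},
}


def greedy_decode(start, steps):
    """Repeatedly append the single most probable next word."""
    words = [start]
    prob = 1.0
    for _ in range(steps):
        candidates = TOY_MODEL.get(words[-1])
        if not candidates:  # no known continuation: stop early
            break
        best = max(candidates, key=candidates.get)
        prob *= candidates[best]
        words.append(best)
    return words, prob


greedy_words, greedy_prob = greedy_decode("The", steps=2)
print(" ".join(greedy_words), round(greedy_prob, 2))
```

In this toy table, greedy decoding commits to "nice" at the first step because it is the locally best choice, even though the continuation after "dog" is much more probable.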

In [7]:
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')

greedy_output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll


# Beam Search

Beam search reduces the risk of missing hidden high-probability word sequences by keeping the num_beams most likely hypotheses at each time step and eventually choosing the hypothesis that has the overall highest probability.
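
The idea can be sketched on a small hand-written probability table. Everything in the block below (TOY_MODEL, its numbers, and the helper beam_search) is a hypothetical illustration, not the implementation used by transformers; in particular, it scores hypotheses by plain cumulative log-probability without any length normalization.

```python
import math

# Hypothetical next-word probabilities, keyed by the previous word.
# The numbers are invented for illustration; they do not come from GPT-2.
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "and": 0.05, "runs": 0.05},
    "car": {"is": 0.6, "drives": 0.3, "turns": 0.1},
}


def beam_search(start, steps, num_beams):
    """Keep the num_beams highest-scoring hypotheses at every step."""
    beams = [([start], 0.0)]  # (words, cumulative log-probability)
    for _ in range(steps):
        expanded = []
        for words, score in beams:
            candidates = TOY_MODEL.get(words[-1], {})
            if not candidates:  # nothing to extend: keep hypothesis as is
                expanded.append((words, score))
                continue
            for word, p in candidates.items():
                expanded.append((words + [word], score + math.log(p)))
        # Prune: retain only the num_beams most likely hypotheses.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:num_beams]
    return beams


best_words, best_score = beam_search("The", steps=2, num_beams=2)[0]
print(" ".join(best_words), round(math.exp(best_score), 2))
```

With two beams, the hypothesis starting with "dog" survives the first step and ends up with a higher overall probability than the greedy path through "nice". With num_beams=1 the same function reduces to greedy search.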

In [8]:
beam_output = model.generate(
    input_ids,
    max_length = 50,
    num_beams = 5,
    early_stopping = True
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I'm not sure if I'll ever be able to walk with him again. I'm not sure if I'll


Try setting no_repeat_ngram_size=2 so that no 2-gram appears twice.
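
One way to picture such a constraint: before each step, look up which next tokens would complete an n-gram that already occurs in the generated text, and exclude them. The helper banned_next_tokens below is a hypothetical sketch of that bookkeeping on plain word lists, not the code transformers runs internally.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return the tokens that would complete an n-gram already present in tokens."""
    prefix_len = ngram_size - 1
    if ngram_size < 1 or len(tokens) < ngram_size:
        return set()  # no complete n-gram has been generated yet
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - prefix_len:])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[i:i + ngram_size])
        if ngram[:prefix_len] == prefix:
            banned.add(ngram[-1])
    return banned


generated = ["I", "am", "not", "sure", "if", "I"]
print(banned_next_tokens(generated, ngram_size=2))
```

Here the 2-gram ("I", "am") has already been generated, so after the second "I" the word "am" is excluded as the next token.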

In [10]:
beam_output = model.generate(
    input_ids,
    max_length = 50,
    num_beams = 5,
    no_repeat_ngram_size = 2,
    early_stopping=True
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break


Another important feature of beam search is that we can compare the top beams after generation and choose the one that best fits our purpose.

In transformers, we set the parameter num_return_sequences to the number of highest-scoring beams that should be returned.
Make sure that num_return_sequences <= num_beams.

In [14]:
# set num_return_sequences > 1

beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True
)

for i, beam_output in enumerate(beam_outputs):
  print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

0: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a break
1: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to get back to
2: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to take a break
3: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with her again.

I've been thinking about this for a while now, and I think it's time for me to get back to
4: I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with him again.

I've been thinking about this for a while now, and I think it's time for me to take a step


Why might beam search not be the best option?

1. Beam search can work well in tasks where the length of the desired generation is more or less predictable, as in machine translation or summarization. Open-ended generation, such as dialog or story generation, does not have a predictable length.
2. It suffers heavily from repetitive generation. This is hard to control with n-gram or other penalties, since finding a good trade-off between forced "no-repetition" and repeating cycles of identical n-grams requires a lot of fine-tuning.
3. High-quality human language does not follow a distribution of high-probability next words.

# Sampling

Sampling means randomly picking the next word according to its conditional probability distribution. Language generation with sampling is therefore not deterministic.

In transformers, we set do_sample=True and deactivate Top-K sampling with top_k=0.
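
To make the sampling step itself concrete, the sketch below draws next words from a small hand-written distribution using only the standard library. The distribution NEXT_WORD_PROBS and the helper sample_next_word are hypothetical; a fixed seed is used so that the draws are reproducible, in the same spirit as tf.random.set_seed in the cells of this notebook.

```python
import random

# Hypothetical conditional distribution over the next word (invented numbers).
NEXT_WORD_PROBS = {"nice": 0.5, "dog": 0.4, "car": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(10)  # fixed seed for reproducibility
draws = [sample_next_word(NEXT_WORD_PROBS, rng) for _ in range(1000)]
counts = {w: draws.count(w) for w in NEXT_WORD_PROBS}
print(counts)
```

Unlike greedy search, low-probability words such as "car" are still picked occasionally, which is what makes sampled text more varied and sometimes less coherent.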

In [15]:
tf.random.set_seed(10)

sample_output = model.generate(
    input_ids, 
    do_sample = True,
    max_length = 50,
    top_k = 0
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog and I take after my car every once in a while. When I get home I'm exploring different areas of the house and in some areas I even took a little naps because (I guess) I'm way


A trick to make the distribution sharper is to lower the temperature of the softmax.

Set temperature=0.7.

In [17]:
tf.random.set_seed(10)

sample_output = model.generate(
    input_ids, 
    do_sample = True,
    max_length = 50,
    top_k = 0,
    temperature = 0.7
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog and I'm also a car enthusiast. When I left college, I did a lot of exploring and hiking and camping with my dog, but I didn't have no motivation to do it anymore. I regretted it when


As the temperature approaches 0, temperature-scaled sampling becomes equal to greedy decoding and suffers from the same problems as before.
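
The effect of the temperature can be checked on a handful of made-up scores. The sketch below assumes the usual formulation in which the logits are divided by the temperature before the softmax; the logits and the helper softmax_with_temperature are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words
for t in (1.0, 0.7, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more and more of the probability mass moves to the highest-scoring word, so sampling behaves increasingly like greedy decoding.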

# Top-K Sampling

In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words.
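
A minimal sketch of this filtering step, using a hypothetical four-word distribution and a hypothetical helper top_k_filter:

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize their probabilities."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in kept)
    return {w: probs[w] / total for w in kept}


# Hypothetical next-word distribution (invented numbers).
probs = {"nice": 0.5, "dog": 0.3, "car": 0.15, "tree": 0.05}
filtered = top_k_filter(probs, k=2)
print(filtered)
```

With k=2, only "nice" and "dog" remain, and their probabilities are rescaled so that they sum to 1 again. The next word would then be sampled from this reduced distribution.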

In [19]:
tf.random.set_seed(0)

sample_output = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog because of the warmth and safety that comes with it," says Mary Anne Anderson.

Anderson is a retired veterinarian with a passion for the care of animals and the environment, helping to manage landfills and land


# Top-p (nucleus) Sampling

Top-p sampling chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
The probability mass is then redistributed among this set of words.
This way, the size of the set of words can dynamically increase and decrease according to the next word's probability distribution.

Activate Top-p sampling by setting 0 < top_p < 1.
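
The dynamic set size is easiest to see by filtering a flat and a peaked distribution with the same p. Both distributions and the helper top_p_filter below are hypothetical; the sketch keeps words until the cumulative probability reaches p.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose cumulative probability reaches p."""
    kept = []
    cumulative = 0.0
    for word in sorted(probs, key=probs.get, reverse=True):
        kept.append(word)
        cumulative += probs[word]
        if cumulative >= p:
            break
    total = sum(probs[w] for w in kept)
    return {w: probs[w] / total for w in kept}  # renormalized


# Hypothetical next-word distributions (invented numbers).
flat = {"nice": 0.5, "dog": 0.3, "car": 0.15, "tree": 0.05}
peaked = {"nice": 0.95, "dog": 0.03, "car": 0.01, "tree": 0.01}

print(list(top_p_filter(flat, p=0.92)))
print(list(top_p_filter(peaked, p=0.92)))
```

For the flat distribution three words are needed to reach p = 0.92, while for the peaked one a single word is enough.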

In [21]:
tf.random.set_seed(0)

sample_output = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_p = 0.92,
    top_k = 0
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

I enjoy walking with my cute dog because of the security and safety, and she wasn't scared by the obstacles. But so do her two grown-ups and now she's over scared by the amount of metal there is."

It might seem


Top-p seems more elegant than Top-K. Top-p can also be used in combination with Top-K, which can avoid very low-ranked words while allowing for some dynamic selection.

To get multiple independently sampled outputs, set num_return_sequences > 1.
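
A combined filter might look like the sketch below. It assumes that the Top-K cut is applied first and that Top-p is then evaluated on the renormalized survivors; this order, the helper filter_top_k_top_p, and the distribution are assumptions made for illustration.

```python
def filter_top_k_top_p(probs, k, p):
    """Apply a Top-K cut, then a Top-p cut on the renormalized survivors."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:k]  # Top-K cut
    total = sum(probs[w] for w in ranked)
    kept = []
    cumulative = 0.0
    for word in ranked:  # Top-p cut over the renormalized probabilities
        kept.append(word)
        cumulative += probs[word] / total
        if cumulative >= p:
            break
    total_kept = sum(probs[w] for w in kept)
    return {w: probs[w] / total_kept for w in kept}


# Hypothetical next-word distribution (invented numbers).
probs = {"nice": 0.4, "dog": 0.3, "car": 0.2, "tree": 0.06, "sky": 0.04}
print(list(filter_top_k_top_p(probs, k=4, p=0.9)))
print(list(filter_top_k_top_p(probs, k=2, p=0.9)))
```

In the first call Top-K removes the lowest-ranked word and Top-p then trims the set further; in the second call Top-K alone determines the set.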

In [24]:
tf.random.set_seed(0)

sample_output = model.generate(
    input_ids,
    do_sample = True,
    max_length = 50,
    top_k = 50,
    top_p = 0.95,
    num_return_sequences = 3
)

for i, j in enumerate(sample_output):
  print("{}: {}".format(i, tokenizer.decode(j, skip_special_tokens=True)))

0: I enjoy walking with my cute dog because of the warmth and safety that comes with it," says Mary Anne Meehan, whose husband, Terry, is a mechanic. "He loves being around people, doing chores." Mary Anne, 58, and
1: I enjoy walking with my cute dog, so he was excited that I was going to show him the door," Coughlin wrote to a friend. "So excited and excited that after 10 minutes we went to the door. It is a great little
2: I enjoy walking with my cute dog and I am a regular walking partner. We do both walking and cycling daily. I don't like driving, but I do like having a nice home environment! I can only say I am a happy camper too
