# Text Generation using GPT2

Copyright @ 2020 **ABCOM Information Systems Pvt. Ltd.** All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

See the License for the specific language governing permissions and limitations under the License.

# Installing packages

In [None]:
!pip install git+https://github.com/huggingface/transformers

In [None]:
!pip install --upgrade pyarrow

# Loading Data

In [None]:
# Download the Tiny Shakespeare text.
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

In [None]:
!mkdir output

# Fine tuning for new dataset

In [None]:
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/language-modeling/run_language_modeling.py

In [None]:
!python run_language_modeling.py \
    --output_dir=output \
    --model_type=gpt2 \
    --model_name_or_path=gpt2 \
    --do_train \
    --train_data_file='/content/input.txt' \
    --per_gpu_train_batch_size=1 \
    --save_steps=-1 \
    --num_train_epochs=2

# Loading Tokenizer and Model

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained('/content/output')
model = GPT2LMHeadModel.from_pretrained('/content/output')

# Generating Text

## Greedy Search
Greedy search is the simplest decoding strategy: at each step it selects the word with the highest probability as the next word and ignores all lower-probability candidates.
The code for greedy search with our fine-tuned model is given below.

In [None]:
ids = tokenizer.encode('[BOS] The King must leave the throne now . [EOS]',
                      return_tensors='pt')

greedy_outputs = model.generate(ids, max_length=300)

print("Output:\n" + 100 * '-')
for i, greedy_output in enumerate(greedy_outputs):
  print("\n"+"==="*10)
  print("{}: {}".format(i+1, tokenizer.decode(greedy_output, skip_special_tokens=False)))
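
To make the selection rule concrete, here is a minimal sketch of greedy decoding on a hand-made next-word table. The table `TOY_MODEL`, its probabilities, and the helper `greedy_decode` are illustrative assumptions and are not part of the fine-tuned model or the `transformers` API.

```python
# Toy next-word table with made-up probabilities (an assumption for
# illustration only; a real model computes these at every step).
TOY_MODEL = {
    "The": {"king": 0.5, "queen": 0.4, "crown": 0.1},
    "king": {"must": 0.6, "shall": 0.3, "leaves": 0.1},
    "must": {"leave": 0.7, "stay": 0.3},
}


def greedy_decode(start, table, max_steps=10):
    """Repeatedly append the single most probable next word."""
    words = [start]
    for _ in range(max_steps):
        choices = table.get(words[-1])
        if not choices:  # no known continuation for the last word: stop
            break
        # Greedy step: take the highest-probability word, ignore the rest.
        words.append(max(choices, key=choices.get))
    return words


print(" ".join(greedy_decode("The", TOY_MODEL)))
```

Because only the top word is kept at each step, a slightly less likely first word that leads to a much more likely continuation is never explored.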

## Beam Search
Beam search keeps the `num_beams` most probable partial sequences (hypotheses) at each step, instead of only the single most probable word as greedy search does. It multiplies each candidate word's probability by the probability of the sequence that precedes it, and finally selects the sequence with the highest overall probability.

The code for implementing beam search with our model is given below.

We set `num_beams > 1` and `early_stopping=True` so that generation finishes when all beam hypotheses have reached the end-of-sequence (EOS) token.

In [None]:
# activate beam search and early_stopping
beam_output = model.generate(
    ids, 
    max_length=300, 
    num_beams=4, 
    early_stopping=True
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
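
The bookkeeping behind beam search can be sketched on a toy next-word table. The table `TOY_MODEL`, its probabilities, and the helper `beam_search` are hypothetical and only illustrate the idea; the real `generate` method works on model logits and handles length limits and EOS tokens.

```python
# Toy next-word table with made-up probabilities (illustration only).
TOY_MODEL = {
    "The": {"nice": 0.5, "dog": 0.4, "car": 0.1},
    "nice": {"woman": 0.4, "house": 0.3, "guy": 0.3},
    "dog": {"has": 0.9, "runs": 0.05, "and": 0.05},
}


def beam_search(start, table, num_beams=2, steps=2):
    """Keep the num_beams most probable partial sequences at every step."""
    beams = [([start], 1.0)]  # each beam is (words so far, sequence probability)
    for _ in range(steps):
        candidates = []
        for words, prob in beams:
            # Extend each beam with every possible next word and multiply
            # the probabilities. Beams without a continuation are dropped.
            for word, p in table.get(words[-1], {}).items():
                candidates.append((words + [word], prob * p))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


best_words, best_prob = beam_search("The", TOY_MODEL, num_beams=2)[0]
print(" ".join(best_words), round(best_prob, 2))

# With a single beam the procedure reduces to greedy search.
greedy_words, greedy_prob = beam_search("The", TOY_MODEL, num_beams=1)[0]
print(" ".join(greedy_words), round(greedy_prob, 2))
```

In this toy table, two beams recover "The dog has", which is more probable overall than the "The nice woman" sequence that a single beam (greedy search) settles on.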

## Sampling
Sampling means randomly picking the next word according to its conditional probability distribution.
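
As a minimal sketch of this idea, the snippet below draws next words from a small hand-made distribution using Python's standard library. The distribution `next_word_probs` and the helper `sample_next_word` are assumptions for illustration; the model produces a distribution over its whole vocabulary instead.

```python
import random

# Made-up conditional distribution over the next word (illustration only).
next_word_probs = {"leave": 0.6, "stay": 0.3, "abdicate": 0.1}


def sample_next_word(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # a fixed seed makes the draws reproducible
draws = [sample_next_word(next_word_probs, rng) for _ in range(1000)]

# The empirical frequencies should roughly follow the distribution.
for word in next_word_probs:
    print(word, draws.count(word) / len(draws))
```

Unlike greedy search, a low-probability word such as "abdicate" can still be chosen, which is what makes sampled text more varied.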

In [None]:
import tensorflow as tf

In [None]:
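# The model is a PyTorch model (the inputs use return_tensors='pt'), so seed
# PyTorch's random number generator; tf.random.set_seed has no effect on it.
import torch
torch.manual_seed(0)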
sample_output = model.generate(
    ids, 
    do_sample=True, 
    max_length=300
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

As we can see, sampling produces much better results than the previous methods, and the text is starting to make some sense.

## Top-K Sampling

In Top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words.
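
The filtering and redistribution step can be sketched as follows. The helper `top_k_filter` and the example probabilities are hypothetical and work on a plain dictionary; the library applies the same idea to the model's logits.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}


# Made-up next-word distribution (illustration only).
probs = {"leave": 0.5, "stay": 0.25, "abdicate": 0.15, "sing": 0.1}

filtered = top_k_filter(probs, k=2)
print(filtered)
```

With `k=2`, only "leave" and "stay" survive, and their probabilities are rescaled to two thirds and one third before the next word is sampled.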

In [None]:
import torch
torch.manual_seed(0)  # the model runs on PyTorch, so seed torch rather than TensorFlow

# set top_k to 50
sample_output2 = model.generate(
    ids, 
    do_sample=True, 
    max_length=300, 
    top_k=50
)

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output2[0], skip_special_tokens=True))

Now that we have implemented Top-K sampling, let us try Top-p sampling.

## Top-p (Nucleus) sampling

Top-p sampling chooses from the smallest set of highest-probability tokens whose cumulative probability mass
exceeds a pre-chosen threshold p. The probability mass is then redistributed among the tokens in this set.
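
A minimal sketch of this selection rule is shown below. The helper `top_p_filter` and the example probabilities are assumptions for illustration, and the sketch stops as soon as the cumulative mass reaches or exceeds p; the library's exact boundary handling may differ.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most probable words whose cumulative
    probability reaches p, then renormalize so they sum to 1."""
    kept, cumulative = [], 0.0
    for word, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:  # enough probability mass collected: stop
            break
    return {word: prob / cumulative for word, prob in kept}


# Made-up next-word distribution (illustration only).
probs = {"leave": 0.5, "stay": 0.25, "abdicate": 0.15, "sing": 0.1}

nucleus = top_p_filter(probs, p=0.8)
print(nucleus)
```

Unlike Top-K, the number of surviving words is not fixed: a peaked distribution keeps few words, while a flat one keeps many.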

In [None]:
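# set seed to reproduce results. Feel free to change the seed to get different results
import torch
torch.manual_seed(0)  # the model runs on PyTorch, so seed torch rather than TensorFlow

# deactivate top_k sampling (top_k=0; it defaults to 50 otherwise) and sample only
# from the most likely words whose cumulative probability mass reaches 92%
sample_output3 = model.generate(
    ids,
    do_sample=True,
    max_length=300,
    top_p=0.92,
    top_k=0
)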

print("Output:\n" + 100 * '-')
print(tokenizer.decode(sample_output3[0], skip_special_tokens=True))

## Combining Top-K and Top-p Sampling

In [None]:
# set seed to reproduce results. Feel free to change the seed to get different results
import torch
torch.manual_seed(0)  # the model runs on PyTorch, so seed torch rather than TensorFlow

# set top_k = 40 and top_p = 0.95
final_outputs = model.generate(
    ids,
    do_sample=True,
    max_length=300,
    top_k=40,
    top_p=0.95,
)

print("Output:\n" + 100 * '-')
for i, final_output in enumerate(final_outputs):
  print("{}: {}".format(i, tokenizer.decode(final_output, skip_special_tokens=True)))