<a href="https://colab.research.google.com/github/antoninodimaggio/Transformers/blob/master/Generating_Text_Using_GPT_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Text Using GPT-2
Training a robust language model from scratch takes a lot of time, a lot of data, and a lot of money, so it is much easier to start from a pretrained model. Thanks to [Hugging Face](https://huggingface.co/), this takes only a few lines of code. This notebook demonstrates how we can use a smaller version of [OpenAI's GPT-2](https://openai.com/blog/better-language-models/) to generate text given a prompt.


## Some Useful Links
* Good place to start learning Hugging Face: [How to train a new language model from scratch using Transformers and Tokenizers](https://huggingface.co/blog/how-to-train)
* All of the Hugging Face [transformer notebooks](https://huggingface.co/transformers/notebooks.html)

In [1]:
!nvidia-smi

Sat Jun 13 04:19:05 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   66C    P8    11W /  70W |      0MiB / 15079MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [0]:
%%capture
!pip install transformers

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelWithLMHead

## Tokenizer and Model
* Hugging Face's Auto classes (`AutoTokenizer`, `AutoConfig`, `AutoModelWithLMHead`) make it easy to define our tokenizer and model
* The Auto classes abstract away some configuration details, which makes things easier
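
To make "abstracting away configuration details" a little more concrete, here is a toy sketch of the general idea: a registry maps a `model_type` string to a config class, and the config class to a model class. Every name in it (`ToyGPT2Config`, `ToyGPT2Model`, `toy_auto_config`, `toy_auto_model`, the two registries) is hypothetical and invented for illustration. It is not the Transformers implementation, only a rough picture of the dispatch pattern.

```python
class ToyGPT2Config:
    """Hypothetical stand-in for a model configuration."""
    model_type = "gpt2"

    def __init__(self, n_layer=12):
        self.n_layer = n_layer


class ToyGPT2Model:
    """Hypothetical stand-in for a model built from a configuration."""

    def __init__(self, config):
        self.config = config


# Hypothetical registries: model_type string -> config class -> model class
CONFIG_REGISTRY = {"gpt2": ToyGPT2Config}
MODEL_REGISTRY = {ToyGPT2Config: ToyGPT2Model}


def toy_auto_config(config_dict):
    """Pick the config class from the 'model_type' entry and build it."""
    config_cls = CONFIG_REGISTRY[config_dict["model_type"]]
    kwargs = {k: v for k, v in config_dict.items() if k != "model_type"}
    return config_cls(**kwargs)


def toy_auto_model(config):
    """Pick the model class that matches the type of the config."""
    return MODEL_REGISTRY[type(config)](config)


toy_config = toy_auto_config({"model_type": "gpt2", "n_layer": 12})
toy_model = toy_auto_model(toy_config)
print(type(toy_model).__name__, toy_model.config.n_layer)
```

The caller only names the checkpoint family. The lookup decides which concrete classes to build, which is why the cell below never mentions a GPT-2 specific class.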

In [3]:
tokenizer = AutoTokenizer.from_pretrained('gpt2')
config = AutoConfig.from_pretrained('gpt2')
model = AutoModelWithLMHead.from_pretrained('gpt2', config=config)

## Let's Generate Some Text
* Here we give the model a prompt, say 'My dream is to travel', and have the model build on it to generate some text
* Some of the generated text will not be perfect, but it truly is amazing when you get a complete, coherent, well-written block of text
* Explanation of the `generate` parameters ([docs](https://huggingface.co/transformers/main_classes/model.html?highlight=generate))
  * max_length: maximum number of tokens in the completed text (the prompt counts toward this)
  * min_length: minimum number of tokens in the completed text
  * temperature: float value controlling the randomness of the Boltzmann (softmax) distribution that tokens are sampled from. Lower temperature results in less random completions. As the temperature approaches zero, the model becomes deterministic and repetitive.
  * top_p: cumulative probability threshold for nucleus sampling. Only the most probable tokens whose probabilities add up to top_p are kept as candidates. Must be between 0 and 1. Defaults to 1.
  * do_sample: whether to sample the next token instead of always taking the most likely one
  * num_return_sequences: how many different text sequences to generate (gives us some choices)
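
The effect of `temperature` and `top_p` can be sketched in plain Python on a toy four-token vocabulary. This is a minimal illustration under simplifying assumptions, not the code Hugging Face runs internally. The helper names (`apply_temperature`, `top_p_filter`, `sample_token`) and the toy logits are made up for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most probable tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample one token id after temperature scaling and nucleus filtering."""
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(apply_temperature(toy_logits, 1.0))
print(apply_temperature(toy_logits, 0.5))  # sharper: the top token gains mass
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.7))  # only tokens 0 and 1 survive
```

Lowering the temperature concentrates probability on the top token, and lowering `top_p` shrinks the candidate set, so both push the output toward safer, more repetitive text.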

In [8]:
prompt = 'My dream is to travel'

# Encode the prompt into a tensor of token ids on the same device as the model
e_prompt = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample 10 continuations of the prompt
output_sequences = model.generate(
    input_ids=e_prompt,
    max_length=100,
    min_length=50,
    temperature=1.0,
    top_p=0.95,
    do_sample=True,
    num_return_sequences=10)


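def clean_prediction(text):
    """Clean up some loose ends that GPT-2 leaves behind."""
    # Strip '<|endoftext|>' and any truncated prefix of it, down to '<|'
    token = '<|endoftext|>'
    while len(token) > 1:
        text = text.replace(token, '')
        token = token[:-1]
    text = text.strip()
    # Drop a dangling closing quote when the quotes are unbalanced
    # (endswith is also safe when the text is empty)
    if text.endswith('"') and text.count('"') % 2:
        text = text[:-1]
    return text.strip()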

stop_token = '\n'
for output in output_sequences:
    text = tokenizer.decode(output.tolist(), clean_up_tokenization_spaces=True)
    text = clean_prediction(text)
    # Keep only the first line. split() leaves the text intact when there is
    # no newline, whereas slicing with find() == -1 drops the last character.
    text = text.split(stop_token, 1)[0]
    print('GENERATED TEXT: ' + text + '\n')

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence


GENERATED TEXT: My dream is to travel to the moon. And I want my daughter to have that dream."

GENERATED TEXT: My dream is to travel to the other side of the world," he told The Daily Beast. "I love the idea of going with people I love and I just want to get to know them."

GENERATED TEXT: My dream is to travel with a little dignity... and a little beauty."

GENERATED TEXT: My dream is to travel to the United States from New Jersey and become fluent in English," says Mr Sanders. "I have a lot of family here, so it's something of a dream come true to go to a place like this." The idea is not new. He has visited the city and his wife, Ann, a business associate, once at the State Fair, and he is a lifelong resident. "It's a great city, with great schools," he says. But he says th

GENERATED TEXT: My dream is to travel with the beautiful and unique and have fun, be creative, and show people how it goes. My dream is to travel with the people and share my passion with them.

GENERATED TEXT: