# Word-Level Text Generation with GPT-2

GPT-2 is a large transformer-based language model trained on a dataset of 8 million web pages. Its objective is to predict the next word, given all the previous words within some text.

We'll use the Hugging Face Transformers library, which provides more than 32 pretrained models for NLG and NLU (ready to use in PyTorch and TensorFlow 2.0).

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
%cd 'drive/MyDrive/Colab Notebooks/nlg_tales_generation'

Mounted at /content/drive
[Errno 2] No such file or directory: 'drive/MyDrive/Colab Notebooks/nlg_tales_generation'
/content


In [2]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninsta

In [3]:
from transformers import (
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    TextDataset,
    GPT2LMHeadModel,
    TrainingArguments,
    Trainer,
    pipeline)

In [11]:
train_path = '/content/train.txt'
test_path = '/content/test.txt'

## Text tokenization

A tokenizer splits text into tokens (words or subwords, punctuation, etc.) and then converts them into numbers (ids) so that they can be fed to the model.

When using a pretrained transformers model, the associated pretrained tokenizer should be used so that words are transformed into tokens in the same way as during pretraining.
We can use either the tokenizer class associated with the model (e.g. GPT2Tokenizer) or the AutoTokenizer class.

The large text corpora used to train transformer models would produce a very large word-level vocabulary, which increases memory use and computation time. To avoid this, transformer models use subword tokenization (a hybrid between word-level and character-level tokenization).

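GPT-2 uses byte-level Byte-Pair Encoding (BPE), with space tokenization as the pre-tokenization step. Its vocabulary has 50,257 entries: 256 byte-level base tokens, a special end-of-text token, and the symbols learned with 50,000 merges.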
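The merge-learning idea behind BPE can be sketched in a few lines of plain Python. This is a simplified illustration, not the GPT-2 tokenizer itself: it works on characters rather than bytes, and the function names (`count_pairs`, `merge_pair`, `learn_bpe`) and the toy word counts are assumptions made up for this example.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn `num_merges` merges from a {word: count} toy corpus."""
    # Start from single characters (real GPT-2 BPE starts from bytes).
    word_freqs = {tuple(word): freq for word, freq in corpus.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        word_freqs = merge_pair(best, word_freqs)
    return merges, word_freqs


# Hypothetical toy corpus: word -> number of occurrences.
toy_corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
merges, segmented = learn_bpe(toy_corpus, num_merges=3)
print(merges)
print(segmented)
```

Each merge adds one new symbol to the vocabulary, so the number of merges directly controls the final vocabulary size.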

Learn more:
* [GPT2Tokenizer Docs](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2tokenizer)
* [Preprocessing data](https://huggingface.co/transformers/preprocessing.html)
* [Summary of tokenizers](https://huggingface.co/transformers/tokenizer_summary.html)

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

When a sentence is passed to the tokenizer, it returns a dictionary with a list of input_ids (the indices corresponding to each token). The dictionary also contains an attention_mask, which tells the model which tokens should be attended to and which should not (so that padded tokens are skipped).

In [6]:
print('vocabulary size: %d, max sequence length: %d' % (tokenizer.vocab_size, tokenizer.model_max_length))
print('tokenize sequence "Once upon a time in a little village":', tokenizer('Once upon a time in a little village'))

vocabulary size: 50257, max sequence length: 1024
tokenize sequence "Once upon a time in a little village": {'input_ids': [7454, 2402, 257, 640, 287, 257, 1310, 7404], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
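The output above contains only ones in the attention mask because a single sequence needs no padding. The sketch below shows, in plain Python, what padding a batch and building the mask could look like. The helper `pad_batch` is hypothetical, and padding on the right with id 50256 (GPT-2's end-of-text id, reused here as a pad id) is an assumption for illustration.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to skip
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


PAD_ID = 50256  # assumption: reuse the end-of-text id as the pad id

long_ids = [7454, 2402, 257, 640, 287, 257, 1310, 7404]  # ids from the cell above
short_ids = long_ids[:4]  # a shorter sequence, taken from the first four ids

batch = pad_batch([long_ids, short_ids], PAD_ID)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```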


A data collator is a function that forms a batch from elements of the train or test dataset. DataCollatorForLanguageModeling dynamically pads inputs to the maximum length in a batch if they are not all the same length. GPT-2 uses causal language modeling (not masked language modeling): its goal is to predict the token that follows a sequence of tokens, so the model only attends to the left context. That is why mlm should be set to False.

At this point, we don't apply the tokenizer and data collator to the data; that happens later, as part of the Trainer object.
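To illustrate what "predict the token following a sequence of tokens" means for the training targets, here is a small plain-Python sketch. It is a conceptual illustration and not the collator's actual code: the helper names are hypothetical, and marking padded positions with -100 (the value that PyTorch's cross-entropy loss ignores by default) is how this sketch models skipping padding.

```python
IGNORE_INDEX = -100  # positions with this label do not contribute to the loss


def make_causal_lm_labels(input_ids, attention_mask):
    """Labels are a copy of the inputs, with padded positions masked out."""
    return [
        tok if keep == 1 else IGNORE_INDEX
        for tok, keep in zip(input_ids, attention_mask)
    ]


def next_token_pairs(input_ids, labels):
    """List (left context, target token) pairs that causal LM trains on."""
    pairs = []
    for t in range(len(input_ids) - 1):
        target = labels[t + 1]  # the token right after position t
        if target != IGNORE_INDEX:
            pairs.append((input_ids[: t + 1], target))
    return pairs


ids = [7454, 2402, 257, 50256]  # last id is padding in this toy example
mask = [1, 1, 1, 0]

labels = make_causal_lm_labels(ids, mask)
for context, target in next_token_pairs(ids, labels):
    print(context, "->", target)
```

With masked language modeling, by contrast, random tokens inside the sequence would be hidden and predicted from both sides, which is not how GPT-2 was trained.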

Learn more:
* [Causal Language Modeling](https://huggingface.co/transformers/task_summary.html#causal-language-modeling)
* [Data Collator](https://github.com/huggingface/transformers/blob/master/src/transformers/data/data_collator.py)

In [7]:
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

## Load dataset

To use text data in the model, we load it as a PyTorch Dataset object. A Dataset needs to define \__init\__, \__getitem\__ and \__len\__. [This tutorial](https://huggingface.co/transformers/custom_datasets.html) provides examples of custom dataset objects.

We'll use the Hugging Face implementation, TextDataset. It splits the text into consecutive blocks of a fixed length given by block_size, e.g. with block_size=1024 it cuts the text every 1024 tokens. Below we set block_size=128.
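The block-splitting idea can be sketched with a tiny class that follows the same three-method protocol. `BlockDataset` is a hypothetical plain-Python stand-in (it does not subclass the PyTorch Dataset, to stay dependency-free), and dropping the trailing partial block is an assumption of this sketch.

```python
class BlockDataset:
    """Cut a long list of token ids into consecutive fixed-size blocks."""

    def __init__(self, token_ids, block_size):
        # Step through the ids block by block; a trailing remainder that is
        # shorter than block_size is dropped in this sketch.
        self.examples = [
            token_ids[i : i + block_size]
            for i in range(0, len(token_ids) - block_size + 1, block_size)
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]


# Ten stand-in token ids (0..9), cut into blocks of four.
dataset = BlockDataset(list(range(10)), block_size=4)
print(len(dataset))
print(dataset[0], dataset[1])
```

Because the cut ignores sentence boundaries, a block can start or end mid-sentence, which is visible in the decoded training example further down.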

Learn more:
* [Hugging Face implementation](https://github.com/huggingface/transformers/blob/master/src/transformers/data/datasets/language_modeling.py)
* [Stack Overflow explanation](https://stackoverflow.com/questions/60001698/how-exactly-should-the-input-file-be-formatted-for-the-language-model-finetuning)

In [12]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=train_path,
    block_size=128)
     
test_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=test_path,
    block_size=128)



In [13]:
print(tokenizer.decode(train_dataset[5]))

till they reached a great stone wall, many, many feet high.

'Now, prince,' said the magpie, 'the three bulrushes are behind that
wall.'

The prince wasted no time. He set his horse at the wall and leaped over
it. Then he looked about for the three bulrushes, pulled them up and
set off with them on his way home. As he rode along one of the bulrushes
happened to knock against something. It split open and, only think! out
sprang a lovely girl, who said: 'My heart's


## Fine-tune model

The Transformers library lets us either fine-tune an existing (pretrained) model or train a model from scratch (with a custom configuration).

We'll use the pretrained GPT-2 model by loading it with the .from_pretrained() method. Just like the tokenizer, the model can be loaded with the class associated with the model (e.g. GPT2LMHeadModel) or with an Auto class such as AutoModelForCausalLM.

GPT2LMHeadModel is the GPT-2 model dedicated to language modeling tasks.

Learn more:
* [GPT2LMHeadModel](https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel)
* [Fine-tuning a model](https://huggingface.co/transformers/training.html)



In [14]:
model = GPT2LMHeadModel.from_pretrained('gpt2')

The Trainer class provides an interface for feature-complete training - it enables training, fine-tuning, and evaluating any transformers model. It takes as input: the model, training arguments, datasets, data collator, tokenizer etc.

TrainingArguments groups the arguments that relate to the training loop, e.g. batch size, learning rate, and number of epochs.

Learn more:
* [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer)
* [Training Arguments](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments)

In [15]:
training_args = TrainingArguments(
    output_dir = 'data/out', # the output directory for the model predictions and checkpoints
    overwrite_output_dir = True, # overwrite the content of the output directory
    per_device_train_batch_size = 32, # the batch size for training
    per_device_eval_batch_size = 32, # the batch size for evaluation
    learning_rate = 5e-5, # defaults to 5e-5
    num_train_epochs = 3, # total number of training epochs to perform
)

trainer = Trainer(
    model = model,
    args = training_args,
    data_collator=data_collator,
    train_dataset = train_dataset,
    eval_dataset = test_dataset
)

In [16]:
trainer.train()

***** Running training *****
  Num examples = 25362
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2379


| Step | Training Loss |
|------|---------------|
| 500  | 3.391         |
| 1000 | 3.2438        |
| 1500 | 3.1625        |
| 2000 | 3.1105        |


Saving model checkpoint to data/out/checkpoint-500
Configuration saved in data/out/checkpoint-500/config.json
Model weights saved in data/out/checkpoint-500/pytorch_model.bin
Saving model checkpoint to data/out/checkpoint-1000
Configuration saved in data/out/checkpoint-1000/config.json
Model weights saved in data/out/checkpoint-1000/pytorch_model.bin
Saving model checkpoint to data/out/checkpoint-1500
Configuration saved in data/out/checkpoint-1500/config.json
Model weights saved in data/out/checkpoint-1500/pytorch_model.bin
Saving model checkpoint to data/out/checkpoint-2000
Configuration saved in data/out/checkpoint-2000/config.json
Model weights saved in data/out/checkpoint-2000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=2379, training_loss=3.2050266081468055, metrics={'train_runtime': 2712.0475, 'train_samples_per_second': 28.055, 'train_steps_per_second': 0.877, 'total_flos': 4970166386688000.0, 'train_loss': 3.2050266081468055, 'epoch': 3.0})
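The numbers in this log can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last, smaller batch of each epoch is kept; the helper name `total_optimization_steps` is hypothetical.

```python
import math


def total_optimization_steps(num_examples, batch_size, num_epochs):
    """Batches per epoch (last partial batch kept) times number of epochs."""
    return math.ceil(num_examples / batch_size) * num_epochs


# Values copied from the training log above.
num_examples = 25362
batch_size = 32
num_epochs = 3
train_runtime = 2712.0475  # seconds

steps = total_optimization_steps(num_examples, batch_size, num_epochs)
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = steps / train_runtime

print(steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The step count matches `Total optimization steps = 2379`, and the two throughput values agree with `train_samples_per_second` and `train_steps_per_second` in the TrainOutput.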

In [17]:
trainer.save_model()

Saving model checkpoint to data/out
Configuration saved in data/out/config.json
Model weights saved in data/out/pytorch_model.bin


## Text generation

To use the model for inference, we can use a pipeline. The pipeline() function is a wrapper around all the other available pipelines, e.g. calling pipeline with the task parameter set to "text-generation" returns the task-specific TextGenerationPipeline. TextGenerationPipeline uses any ModelWithLMHead to predict the words that follow a specified prefix.

The pipeline object (named generator in this case) accepts the generation arguments documented in PretrainedConfig (section: _Parameters for sequence generation_).

Learn more:
* [Pipeline](https://huggingface.co/transformers/main_classes/pipelines.html)
* [TextGenerationPipeline](https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TextGenerationPipeline)
* [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig)

In [18]:
generator = pipeline('text-generation', tokenizer='gpt2', model='data/out')

loading configuration file data/out/config.json
Model config GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torch_dtype": "float32",
  "transformers_version": "4.9.2",
  "use_cache": true,
  "vocab_size": 50257
}


In [21]:
print(generator('Once upon a time', max_length=1000)[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time some
young and wicked man had been killed by the wolves that fed on the wood. When
this happened his brother was taken to a place where he had lived
to-night, and the mother and the child went out in search of food. One of
these days he took a wolf and, coming to her, had him tied to a tree
and carried away. When he had gone out and taken the wolf, he told her
his story. After that another wolf had been killed, this time by the
whale who had hunted it. Thus one night the brothers had their supper
and they set out together. The night before they could find
some berries they brought back a wolf, who carried a bundle of wood. So the
brother set out to hunt the wolf and brought him back the bundle. Soon
after that he came to a forest where another wolf had killed his brother,
also a bundle, and the youngest became the youngest. They set out together
to hunt together in the forest together, but the youngest, still sleeping
in the bedspread of the bear, knocked at the door a

## Text generation with different decoding methods

Better decoding methods play an important role in improving the quality of text produced by language models. Hugging Face Transformers makes it easy to use decoding methods such as greedy search, beam search, top-k sampling and top-p sampling.

Learn more:
* [Different decoding methods for language generation](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb)

_Greedy search_ simply selects the word with the highest probability as the next word at every step (the default decoding mode of the generate() method). The model may then repeat itself, and it can miss high-probability words hidden behind a low-probability word.

_Beam search_ keeps the num_beams most likely partial sequences at each step and finally selects the sequence with the highest overall probability. It reduces the risk of missing hidden high-probability words.
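The difference between the two methods shows up even on a tiny hand-made example. The sketch below is plain Python over a hypothetical table of next-word probabilities (the `NEXT` dictionary is invented for illustration and was not produced by GPT-2); it is not how the library implements decoding.

```python
import math

# Hypothetical toy next-word distributions, keyed by the words so far.
NEXT = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def greedy_search(prefix, steps):
    """Always take the single most probable next word."""
    seq, logp = tuple(prefix), 0.0
    for _ in range(steps):
        dist = NEXT[seq]
        word = max(dist, key=dist.get)
        logp += math.log(dist[word])
        seq += (word,)
    return seq, math.exp(logp)


def beam_search(prefix, steps, num_beams):
    """Keep the num_beams most probable partial sequences at each step."""
    beams = [(tuple(prefix), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for word, p in NEXT[seq].items():
                candidates.append((seq + (word,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    best_seq, best_logp = beams[0]
    return best_seq, math.exp(best_logp)


greedy_seq, greedy_prob = greedy_search(["The"], steps=2)
beam_seq, beam_prob = beam_search(["The"], steps=2, num_beams=2)
print(greedy_seq, round(greedy_prob, 2))
print(beam_seq, round(beam_prob, 2))
```

Greedy search commits to "nice" because it is the best first word, while beam search also keeps "dog" alive and so finds the more probable two-word continuation.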

In [25]:
text_beam = generator('Once upon a time',
                      max_length=500,
                      num_beams=5)
print(text_beam[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time there was a king who had a son, and he
had a daughter, and she was very beautiful, and he had two sons, and
they were very good-looking, and they were very good-looking, too, and they were
very good-looking, too, and they were very good-looking, too, and they were
very good-looking, too, and they were very good-looking, too, and they were
very good-looking, too, and they were very good-looking, too, and they were
very good-looking, too, and they were very good-looking, too, and they were very good-looking,
too, and they were very good-looking, too, and they were very good-looking, too, and they
were very good-looking, too, and they were very good-looking, too, and they were very good-looking,
too, and they were very good-looking, too, and they were very good-looking, too, and they were
very good-looking, too, and they were very good-looking, too, and they were very good-looking,
too, and they were very good-looking, too, and they were very good-looking, too, and they w

Text written by humans, however, does not follow a distribution of high-probability next words, which is why it is worth introducing some randomness when decoding the model output.

We can introduce _random sampling_ of the next word, that is, picking the next word according to its conditional probability distribution. In addition, lowering the _softmax temperature_ below 1 makes the distribution sharper (increasing the likelihood of high-probability words and decreasing the likelihood of low-probability words).
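The effect of temperature is easy to see on a few made-up scores. This is a minimal sketch in plain Python: the logits are invented for illustration, and the helper names are hypothetical rather than library functions.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, rng):
    """Draw one index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate words

probs_default = softmax_with_temperature(logits, temperature=1.0)
probs_sharp = softmax_with_temperature(logits, temperature=0.7)

print([round(p, 3) for p in probs_default])
print([round(p, 3) for p in probs_sharp])

rng = random.Random(0)  # seeded so repeated runs draw the same index
print(sample_index(probs_sharp, rng))
```

With temperature 0.7, as used in the next cell, the top candidate gains probability mass and the tail loses it; a temperature close to 0 would approach greedy search.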

In [23]:
text_random_sampling = generator('Once upon a time',
                                 max_length=1000,
                                 top_k=0,
                                 do_sample=True,
                                 temperature=0.7)
print(text_random_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time this place was called "the
land of the giants." The name of it was "the Hyacinth," and the
name of its people was "the fairies." The name of their kind was
"the fairies." The place was like a great garden, so that
the natural light was so slight that even the dead could see it at
sight. All the people were asleep and dreaming, but they were
quite wise and clever. They had a large stable built in their
own land to keep the fairies and the fairies' children.

One day the fairies awoke and saw a little girl in bed. She was
beautiful, so elegant and so young, but she was not as beautiful as
her sisters had been. The fairies said to her, "Hey, what's that, little
girl?" and she said, "Well, look, my dear mother, you are a little
old lady, but what's the matter?"

"Oh! a little old lady, just come with me," said the fairies. "Go at once; it is
time to go to sleep."

"Oh! how glad my mother is!" said the little girl. "I will take you home
with me."

"You're not going to be sn

_Top-k sampling_ limits the sampling pool to the k words with the highest probability, which eliminates the most unlikely words.

_Top-p sampling_, in contrast, chooses from the smallest possible set of words whose cumulative probability exceeds the probability p.
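Both filters can be sketched in plain Python over a small, made-up distribution. The helper names `top_k_filter` and `top_p_filter` and the word probabilities are assumptions for illustration; the sketch also treats "reaches p" the same as "exceeds p" at the cutoff.

```python
def top_k_filter(probs, k):
    """Keep the k most probable words and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


def top_p_filter(probs, top_p):
    """Keep the smallest set of top words whose cumulative mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, p in ranked:
        kept.append((word, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {word: p / total for word, p in kept}


# Hypothetical next-word distribution.
probs = {"king": 0.5, "queen": 0.3, "wolf": 0.15, "basket": 0.05}

k_pool = top_k_filter(probs, k=2)
p_pool = top_p_filter(probs, top_p=0.9)
print(k_pool)
print(p_pool)
```

The top-k pool always has exactly k words, whereas the top-p pool grows or shrinks with the shape of the distribution. In the top-p cell below, top_k=0 switches off the top-k filter so that only top_p applies.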

In [26]:
text_k_sampling = generator('Once upon a time',
                            max_length=1000,
                            top_k=40,
                            do_sample=True)
print(text_k_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time they told such a story to the boy,
that he felt very much moved.  They gave him a basket and shoes, and said
they would never refuse him a basket with the same name on them, for he would
never refuse them the first time!  The second time, though he was very
young, the boy gave more and sweeter compliments on his new basket.

'What did you want, my little boy?' he asked.  "'It was you who had invented
the golden-pigeon cap, and I am your friend-boy-to-the-fairy-apple,
that we have all been waiting for, and it was me who first told you
it.'  Then they carried it away to the prince's palace, and put it on his
back, and the prince never looked at it till they had all lost all
their money.  'Well,' said the young man, 'when I come back I will show you some
things I could carry with me to fetch some apples.'

The boy's father gave each of the boy and the mother a basket, and said he
would certainly find them useful.

'What have you to do with the apples?  To-morrow I'll make

In [28]:
text_p_sampling = generator('Once upon a time',
                            max_length=1000,
                            top_k=0,
                            top_p=0.92,
                            do_sample=True)
print(text_p_sampling[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Once upon a time there lived an old man who came to seek
his fortune in beauty and wealth.  He gave his old mistress the
wonderful portrait that now stood in his stead.

He therefore entrusted the request which she now gave him.  He made a
charity of money to whoever could give it to him, and placed
her on his journeymen, and thus far he never made any mistake.

"Wherefore, my child?" said the old man.

"Well, it is all right, my child, that you have settled in home for
all eternity.  I have, by the hand of my beloved, replaced the old
olde house with a new one.  I want to give you my first diamond and
silver.  For I am well satisfied with what I have obtained from you.  You
must now purchase both of these things."

"What is that then?" cried the old woman, turning to the youth.  "What are
you saying?"

"I am saying that the lady who raised me here, when I grew up, suffered a great
distress from the well-ordered world, which she bore me to do; but from time to time,
she will forget me;