# Epic Text Generation with GPT-2

In this exercise, a language model called Generative Pre-trained Transformer 2 ([GPT-2](https://en.wikipedia.org/wiki/GPT-2)) is used to create synthetic text from a prompt taken from an epic poem. GPT-2 was developed by [OpenAI](https://en.wikipedia.org/wiki/OpenAI). It was scaled up from the original GPT and trained on roughly ten times more data, and it excels at predicting the next word in a body of text. To illustrate this ability, we select a few lines from the [Iliad](https://www.gutenberg.org/files/16452/16452-h/16452-h.htm) by Homer and observe how well GPT-2 continues a prompt from an ancient poem.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/texttpu/butler.txt
/kaggle/input/texttpu/derby.txt
/kaggle/input/texttpu/cowper.txt


In [2]:
# Load library tools and text
from transformers import pipeline, set_seed, BertTokenizer
from numpy.random import default_rng

path = '../input/texttpu/cowper.txt'

with open(path) as txt:    
    text = txt.readlines() 

A run of consecutive lines is randomly selected from the Iliad text and joined into a single string, which serves as the prompt for the generator.

In [3]:
LINES = 3
rng = default_rng()

# Select a random sample from text
RAND_START = rng.integers(len(text)-LINES)
STOP = RAND_START + LINES

prompt = ''.join(text[RAND_START:STOP])
print(prompt)

Your clamors, that your heaven-taught Kings may speak?
Scarce were they settled, and the clang had ceased,
When Agamemnon, sovereign o'er them all,
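
Because `default_rng()` is called without a seed, each run draws a different prompt. The following is a minimal sketch of the same consecutive-line selection with a fixed seed, so that a prompt can be reproduced. It uses the standard library's `random` module rather than NumPy to stay dependency-free; the function name `sample_window` and the short `sample_lines` list are illustrative assumptions, not part of the notebook.

```python
import random

def sample_window(lines, n_lines, seed=None):
    """Return n_lines consecutive lines, joined, starting at a random index."""
    if n_lines > len(lines):
        raise ValueError('n_lines exceeds the number of available lines')
    rng = random.Random(seed)
    # randint is inclusive on both ends, so the final window can also be chosen
    start = rng.randint(0, len(lines) - n_lines)
    return ''.join(lines[start:start + n_lines])

# Hypothetical stand-in for the lines read from cowper.txt
sample_lines = [f'line {i}\n' for i in range(10)]

window = sample_window(sample_lines, 3, seed=123)
print(window)
```

In the notebook itself, passing a seed to `default_rng` should have the same effect of fixing the selected lines across runs.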



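The prompt is then tokenized to find its token length. Note that the cell below uses a BERT tokenizer, so the count only approximates the number of tokens that GPT-2's own tokenizer would produce.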

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
tokens = tokenizer.tokenize(prompt)
token_length = len(tokens)
print(f'Token Length: {token_length}')

Token Length: 41


Since the generator's `max_length` parameter counts the prompt tokens as well as the generated ones, it must be greater than the token length of the prompt. We therefore multiply the token length of the prompt by an arbitrary integer greater than 1. As this integer increases, so does the amount of synthetic text generated by GPT-2.

In [5]:
MULTIPLIER = 5
MAX_LENGTH = MULTIPLIER * token_length
print(f'Max Length: {MAX_LENGTH}')
assert MAX_LENGTH > token_length, f'MAX_LENGTH must be greater than {token_length} tokens'

Max Length: 205
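
GPT-2 has a fixed context window of 1024 tokens, so a long prompt or a large multiplier can produce a `max_length` that the model cannot honor. Below is a minimal sketch of a guard for this case. The helper name `compute_max_length` is hypothetical, and the sketch assumes the token count was measured with the model's own tokenizer.

```python
GPT2_CONTEXT_TOKENS = 1024  # GPT-2's maximum sequence length (prompt plus generated text)

def compute_max_length(token_length, multiplier, limit=GPT2_CONTEXT_TOKENS):
    """Scale the prompt length by multiplier, capped at the model's context window."""
    if token_length < 1:
        raise ValueError('token_length must be at least 1')
    if multiplier < 2:
        raise ValueError('multiplier must be an integer greater than 1')
    if token_length >= limit:
        raise ValueError('prompt leaves no room for generated text')
    return min(multiplier * token_length, limit)

max_length = compute_max_length(41, 5)
print(f'Max Length: {max_length}')
```

With the values used in this notebook (41 tokens and a multiplier of 5), the cap does not come into play; it only matters for much longer prompts or larger multipliers.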


Lastly, a generator pipeline is created. Each call to this GPT-2 generator returns a list of dictionaries, one per sequence, so indexing is required to extract only the text stored under the `generated_text` key. Here three sequences are collected for comparison. We append them to a list and then print them.

In [6]:
SEED = 123
NUM_SEQ = 3
seq_list = [] 

generator = pipeline('text-generation', model='gpt2-medium')
set_seed(SEED)

for sequence in range(NUM_SEQ):
    seq_list.append(generator(prompt, max_length=MAX_LENGTH, num_return_sequences=NUM_SEQ)[sequence]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
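
The loop above calls the generator once per sequence and keeps a single result from each call. Since every call already returns `NUM_SEQ` dictionaries, one call should be enough to collect all the texts. The sketch below shows only the extraction step. `fake_generator` is a hypothetical stand-in that mimics the pipeline's return shape (a list of dictionaries with a `generated_text` key), so the example runs without downloading a model.

```python
def fake_generator(prompt, max_length, num_return_sequences):
    """Hypothetical stand-in for the text-generation pipeline (no model involved)."""
    return [{'generated_text': f'{prompt} <continuation {i}>'}
            for i in range(num_return_sequences)]

NUM_SEQ = 3
example_prompt = 'When Agamemnon,'

# One call returns all sequences; a comprehension pulls out just the text
outputs = fake_generator(example_prompt, max_length=205, num_return_sequences=NUM_SEQ)
seq_list = [output['generated_text'] for output in outputs]

for sequence in seq_list:
    print(sequence)
```

Swapping `fake_generator` for the real `generator` would trigger a single generation pass instead of three, which matters with a model as large as `gpt2-medium`.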


In [7]:
for sequence in seq_list:
    print(sequence)
    print('-' * 50, end='\n')

Your clamors, that your heaven-taught Kings may speak?
Scarce were they settled, and the clang had ceased,
When Agamemnon, sovereign o'er them all,
And great Lion Olympos, and mighty Zeus,
With the great Eagle, had set to flying;
So that from the sky he could not see
The multitude, nor do he now dwell,
Whilst he looks on them;
When the clouds are dry over them for the sea's tide
Beats them on shore and brings them to sea;
And all the stars, when they were bright, did they not shine,
Which, after a storm, in the heavens,
Omni is their sign, are silent and bright
With their own, and so may their song.
Now what is the cause of those cries!
'Why,' said Apollo, 'shall thy gods tremble
For not a moment, that thou shouldest be heard?
--------------------------------------------------
Your clamors, that your heaven-taught Kings may speak?
Scarce were they settled, and the clang had ceased,
When Agamemnon, sovereign o'er them all,
Gladly sat, and with his two wives
Thou wert to be queen.
A woma

In conclusion, the GPT-2 generator returned three sequences, each consisting of the prompt followed by synthetic text. The output shows how this language model handles an ancient poem whose style contrasts with modern writing. With prompts from different genres, GPT-2 could be used to create text datasets for classification exercises. Another task might be to distinguish synthetic text from text in the original human-written source. A system for filtering out AI-generated text may be needed in the future, but it remains to be seen how society will actually use technologies like GPT-2.
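
As a rough illustration of the dataset idea, the sketch below splits a generated sequence into its human-written prompt and its synthetic continuation and labels each part. The helper `label_sequence` and the label names are assumptions made for illustration. The sketch relies on the generated text starting with the prompt, as it does in the output above.

```python
def label_sequence(prompt, generated_text):
    """Split generated text into a human prompt record and a synthetic continuation record."""
    if not generated_text.startswith(prompt):
        raise ValueError('generated text does not begin with the prompt')
    continuation = generated_text[len(prompt):]
    return [
        {'text': prompt, 'label': 'human'},
        {'text': continuation, 'label': 'synthetic'},
    ]

# One prompt line and one generated line taken from the output above
human_part = "When Agamemnon, sovereign o'er them all,\n"
generated = human_part + "And great Lion Olympos, and mighty Zeus,\n"

records = label_sequence(human_part, generated)
for record in records:
    print(record['label'], repr(record['text']))
```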