# GA Capstone
## Transformer Modeling with GPT2

Fine-tuning a pretrained GPT-2 model to generate Shakespeare-style text.

Much of the code below is adapted from the Hugging Face language modeling tutorial at https://huggingface.co/docs/transformers/v4.22.2/en/tasks/language_modeling

### Imports and Preliminaries

In [1]:
# data formatting for model
from datasets import Dataset, DatasetDict

# train/test split
from sklearn.model_selection import train_test_split

# tokenizer
from transformers import AutoTokenizer

# lm collator
from transformers import DataCollatorForLanguageModeling

# model and support
from transformers import TFAutoModelForCausalLM, create_optimizer, AdamWeightDecay

# other utilities
from itertools import chain
import os
import random
import re

In [2]:
# set the model we are using
MODELS = [
    'gpt', # original GPT
    'distilgpt2', # 82M parameters
    'gpt2', # 124M parameters
    'gpt2-medium', # 355M parameters
    'gpt2-large', # 774M parameters
    'ctrl',
    'transformerxl',
    'reformer',
    'xlnet'
]
    
model_type = 'distilgpt2'

In [3]:
# run settings and directories
MODEL_NAME = 'shakespeare'
N_EPOCHS = 20
DIR_MODEL = '../models/'
DIR_DATA = '../data/'

In [4]:
# regexes
RE_SENTENCE = re.compile(r'\w.*?[.?!:;]', re.S)
RE_WHITESPACE = re.compile(r'\s+')
RE_BLANKLINE = re.compile(r'\n\n')

### Load and Format Data

In [5]:
# load data
paths = [
    os.path.join(DIR_DATA, 'shakespeare-sonnets.clean.txt'),
    os.path.join(DIR_DATA, 'shakespeareplays.txt')
]

text = list()

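for path in paths:
    # read as UTF-8 so the curly apostrophes (’) in the source text decode the same way on every platform
    with open(path, 'r', encoding='utf-8') as f:
        # keep non-empty lines only, stripped of surrounding whitespace
        text.append([line.strip() for line in f if line.strip()])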

text = ' '.join(chain(*text))

In [6]:
# Split into relevant subsets
# POEMS
#poems = RE_BLANKLINE.split(text)

# LINES
#lines = [line.strip() for line in text.split('\n')]

# SENTENCES
sentences = RE_SENTENCE.findall(text)
sentences = [RE_WHITESPACE.sub(' ', sentence) for sentence in sentences]

print(f'# Sentences: {len(sentences)}')
#print(f'# Poems: {len(poems)}\n# Sentences: {len(sentences)}\n# Lines: {len(lines)}\n# Chars: {len(text)}')

# Sentences: 76578
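
To make the sentence split concrete, here is a minimal sketch that applies the same two patterns to a short sample string (the sample is chosen for illustration and is not read from the data files). Because the pattern also stops at colons and semicolons, many of the resulting "sentences" are really clauses.

```python
import re

# same patterns as in the regex cell above
RE_SENTENCE = re.compile(r'\w.*?[.?!:;]', re.S)
RE_WHITESPACE = re.compile(r'\s+')

# illustrative sample with an embedded line break
sample = "Shall I compare thee\nto a summer's day? Thou art more lovely: and more temperate."

# lazily match from a word character up to the first terminator, then collapse whitespace
pieces = [RE_WHITESPACE.sub(' ', s) for s in RE_SENTENCE.findall(sample)]

for piece in pieces:
    print(piece)
# Shall I compare thee to a summer's day?
# Thou art more lovely:
# and more temperate.
```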


In [7]:
# test code - small sample of lines
nlines = 2500
sentences = sentences[500:500 + nlines]

In [8]:
# split train and test
lines_train, lines_test = train_test_split(sentences, test_size=0.05, shuffle=False)
len(lines_train), len(lines_test)

(2375, 125)
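
With `shuffle=False`, the split simply holds out the tail of the list, so the test sentences come from later in the text than the training sentences. The sketch below reproduces the sizes in plain Python; `ordered_split` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import math

def ordered_split(items, test_size):
    """Hold out the last `test_size` fraction of `items` without shuffling."""
    n_test = math.ceil(test_size * len(items))  # test count is rounded up
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

# stand-in for the 2500 sampled sentences
items = list(range(2500))
kept, held_out = ordered_split(items, 0.05)

print(len(kept), len(held_out))
# 2375 125
```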

### Cleaning and Data Preparation

In [9]:
# wrap the splits in Dataset / DatasetDict objects so they can be tokenized with map()
train_dataset = Dataset.from_dict({'text': lines_train})
test_dataset = Dataset.from_dict({'text': lines_test})
datasets = DatasetDict({'train': train_dataset, 'test': test_dataset})
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 2375
    })
    test: Dataset({
        features: ['text'],
        num_rows: 125
    })
})

In [10]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_type)

In [11]:
# Preprocessing function for tokenizer to use with map() method of datasetdict
def token_preproc(data):
    return tokenizer(data['text'])

In [12]:
# tokenize data
tokened_data = datasets.map(token_preproc, batched=True, num_proc=4, remove_columns=['text'])
tokened_data

      

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 2375
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 125
    })
})

In [13]:
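# spot-check a few random training examples: index, tokens, original sentence
for _ in range(4):
    # randrange excludes the upper bound, so the index is always valid
    n = random.randrange(len(tokened_data['train']))
    print(n, tokenizer.convert_ids_to_tokens(tokened_data['train'][n]['input_ids']), lines_train[n])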

973 ['And', 'ĠI', 'Ġwill', 'Ġdo', 'Ġso', '.'] And I will do so.
1419 ['Ay', ',', 'Ġsir', ',', 'Ġhe', ',', 'Ġsir', ',', 'ĠâĢ', 'Ļ', 's', 'Ġa', 'Ġgood', 'Ġwork', 'man', ',', 'Ġa', 'Ġvery', 'Ġgood', 'Ġtailor', '.'] Ay, sir, he, sir, ’s a good workman, a very good tailor.
961 ['It', 'Ġwas', 'Ġthis', 'Ġvery', 'Ġsword', 'Ġentrenched', 'Ġit', '.'] It was this very sword entrenched it.
2157 ['Therefore', 'Ġyou', 'Ġmust', 'Ġdie', '.'] Therefore you must die.
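
The odd-looking tokens in the output above are not corruption. GPT-2's tokenizer works on UTF-8 bytes and displays every byte as a printable character, so a leading space appears as `Ġ` and the curly apostrophe `’` (three bytes) appears as `âĢĻ`. The sketch below rebuilds that byte-to-character table in plain Python, following the scheme used by the original GPT-2 encoder. The function name is an assumption for illustration, and the code does not call the tokenizer itself.

```python
def byte_display_table():
    """Map each byte value 0-255 to the character GPT-2 uses to display it."""
    # bytes that are already printable keep their own character
    printable = list(range(33, 127)) + list(range(161, 173)) + list(range(174, 256))
    table = {b: chr(b) for b in printable}
    # every other byte is shifted to an unused code point starting at 256
    n = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + n)
            n += 1
    return table

table = byte_display_table()

# a space, a curly apostrophe, and the letter s, as in "he, sir, ’s"
shown = ''.join(table[b] for b in " ’s".encode('utf-8'))
print(shown)
# ĠâĢĻs
```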


In [14]:
# pad encodings and prep for modeling
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors='tf')
collator

DataCollatorForLanguageModeling(tokenizer=PreTrainedTokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}), mlm=False, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='tf')
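
With `mlm=False`, the collator pads each batch to its longest sequence and copies the input ids into a `labels` column, with padded positions set to -100 so the loss ignores them. The following is a rough plain-Python sketch of that behaviour, not the library's code: `collate_causal` is a hypothetical helper, and the token ids in the example are arbitrary.

```python
PAD_ID = 50256  # id of <|endoftext|>, which this notebook reuses as the pad token
IGNORE = -100   # label value that the loss skips

def collate_causal(batch, pad_id=PAD_ID):
    """Right-pad a list of token-id lists and build causal LM labels."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = longest - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # labels mirror the inputs; every pad-id position is masked out
        labels.append([IGNORE if tok == pad_id else tok for tok in padded])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}

# two sequences of different length (arbitrary ids)
out = collate_causal([[1537, 314, 481], [26583, 345, 1276, 4656, 13]])
print(out['input_ids'][0])
print(out['labels'][0])
```

One consequence of reusing the end-of-text token for padding is that a genuine end-of-text token inside a sequence would be masked in the labels as well.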

### Modeling

In [15]:
# instantiate model: start from the pretrained checkpoint unless a fine-tuned copy was already saved
model_path = os.path.join(DIR_MODEL, f'{model_type}.{MODEL_NAME}.{N_EPOCHS}')

if not os.path.exists(model_path):
    model = TFAutoModelForCausalLM.from_pretrained(model_type, pad_token_id=tokenizer.eos_token_id)
else:
    model = TFAutoModelForCausalLM.from_pretrained(model_path)
    
#model.resize_token_embeddings(len(tokenizer))

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [16]:
# GPT-2 has no pad token, so reuse EOS for padding, then build batched tf.data datasets for Keras
tokenizer.pad_token = tokenizer.eos_token
tf_train_set = model.prepare_tf_dataset(tokened_data['train'], shuffle=True, batch_size=32, collate_fn=collator)
tf_test_set = model.prepare_tf_dataset(tokened_data['test'], shuffle=False, batch_size=32, collate_fn=collator)
tf_train_set

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(32, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(32, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(32, None), dtype=tf.int64, name=None))>
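
For a sense of scale, the arithmetic below estimates how many batches the model sees. It assumes that `prepare_tf_dataset` drops the final partial batch when `shuffle=True` and keeps it when `shuffle=False`; the fixed batch dimension of 32 in the training spec above is consistent with that, but the test-set behaviour is an assumption here.

```python
import math

n_train, n_test = 2375, 125
batch_size, n_epochs = 32, 20

# training set: incomplete final batch assumed dropped
train_batches = n_train // batch_size
# test set: incomplete final batch assumed kept
test_batches = math.ceil(n_test / batch_size)
# optimizer steps over the full run
total_steps = train_batches * n_epochs

print(train_batches, test_batches, total_steps)
# 74 4 1480
```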

In [17]:
# compile model
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [18]:
# fit model (if pretrained does not exist)
if not os.path.exists(model_path):
    model.fit(tf_train_set, validation_data=tf_test_set, epochs=N_EPOCHS)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [19]:
# save model
if not os.path.exists(model_path):
    os.makedirs(model_path)
    model.save_pretrained(model_path)

### Test Model

In [20]:
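# function to get predicted text
def test(text, max_new=50, temp=1, top_k=50, rep_penalty=1.5, len_penalty=0.75, n_seq=1):
    """Sample a continuation of `text` from the model and return it as a decoded string."""
    tokened = tokenizer(text, return_tensors='tf')
    output = model.generate(**tokened,
                            do_sample=True,  # sample instead of decoding greedily
                            max_new_tokens=max_new,
                            temperature=temp,  # < 1 sharpens the distribution, > 1 flattens it
                            top_k=top_k,  # sample only from the k most likely tokens
                            repetition_penalty=rep_penalty,  # > 1 discourages repeated tokens
                            length_penalty=len_penalty,  # only used by beam-based generation
                            num_return_sequences=n_seq)
    # only the first sequence is decoded, even when n_seq > 1
    return tokenizer.decode(output[0], skip_special_tokens=False)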
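
To show what the `temperature` and `top_k` arguments control, here is a minimal plain-Python sketch of temperature scaling followed by top-k filtering on a toy set of logits. It illustrates the general technique, not the code inside `generate`; `top_k_probs` and the logit values are made up for the example.

```python
import math
import random

def top_k_probs(logits, temperature=1.0, top_k=50):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    # indices of the k highest-scoring tokens
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # softmax over the kept tokens only (shifted by the max for numerical stability)
    peak = max(scaled[i] for i in keep)
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    # tokens outside the top k get probability 0
    return [weights.get(i, 0.0) / total for i in range(len(logits))]

logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
warm = top_k_probs(logits, temperature=1.0, top_k=2)
cool = top_k_probs(logits, temperature=0.5, top_k=2)

print([round(p, 3) for p in warm])
print([round(p, 3) for p in cool])

# draw one token id according to the filtered distribution
next_token = random.choices(range(len(logits)), weights=cool, k=1)[0]
```

Lowering the temperature shifts probability toward the top-ranked token, which is why the generation cell below uses `temp=0.5` for more conservative output.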

In [24]:
# pick 10 random sentences and prompt the model with the first 40% of each one's words
test_lines = [random.choice(sentences) for _ in range(10)]
fragment_ratio = 0.4

for line in test_lines:
    words = line.split()
    stop_ix = int(len(words) * fragment_ratio) or 1
    fragment = ' '.join(words[0:stop_ix])
    print(f'Original: {fragment}')
    output = test(fragment,
                  temp=0.5,
                  max_new=100,
                  top_k=200,
                  rep_penalty=1.5,
                  len_penalty=0.75,
                  n_seq=1)
    print(f'Output: {output}\n')

Original: Fair, kind, and true,’ is all
Output: Fair, kind, and true,’ is all the King does. And yet it shall be no more than a pastime for his lordship to come here tonight: ‘I will not steal my riches nor give them away; but I am free now To do what you please! But in your hand are they bound? The sword from me comes when thou dost hold thee captive— That which so holds thy heart When thine own tongue flies out of balance with mine eye For that whose motion makes love fair as if she were both proud

Original: You’ll be gone, sir
Output: You’ll be gone, sir. But no more for your sake: I shall stay here where you live till death is found! The rest of the world will die if they do not repent Of their sinning and virtue-saving deeds which were done Before God himself; when nature approves them with a single hand And in this hope to show mercy upon him For goodness does well help but fails To give it grace by all that doth make It possible without constraint That heaven knows what may com

### Conclusion

The generated text often sounds reasonably Shakespearean. The next step is to combine this generator with a classification model to test that impression.