# GA Capstone
## Causal Language Modeling with GPT-2

This notebook fine-tunes a pretrained GPT-2 model (`distilgpt2`) to generate text.

Much of the code below is adapted from the Hugging Face causal language modeling tutorial at https://huggingface.co/docs/transformers/v4.22.2/en/tasks/language_modeling

### Imports and Preliminaries

In [1]:
# data formatting for model
from datasets import Dataset, DatasetDict

# train/test split
from sklearn.model_selection import train_test_split

# tokenizer
from transformers import AutoTokenizer

# lm collator
from transformers import DataCollatorForLanguageModeling

# model and support
from transformers import TFAutoModelForCausalLM, create_optimizer, AdamWeightDecay

# support
from itertools import chain
import os
import random
import re
import json

# custom utilities
from utilities.utilities import load_config, get_dataset_from_config
from utilities.utilities import split_text_and_labels
from utilities.utilities import generate_from

In [2]:
CONFIG_FILE = 'config.json'

config_vars = load_config(CONFIG_FILE)
config_vars

{'RANDOM_SEED': 1,
 'MODEL_DIR': '../models/',
 'DATA_DIR': '../data/',
 'CAUSAL_N_EPOCHS': 8,
 'CLASS_N_EPOCHS': 2,
 'BATCH_SIZE': 16,
 'CAUSAL_MODEL': 'distilgpt2',
 'CLASS_MODEL': 'distilbert-base-uncased',
 'MODEL_NAME': 'shakespeare',
 'DATA_SHAKESPEARE': ['shakespeare-sonnets.clean.txt', 'shakespeareplays.txt'],
 'DATA_OTHER': ['belloc_hilaire-sonnets_and_verse.clean.txt',
  'blake_william-poems.clean.txt',
  'browning_elizabeth-sonnets_from_the_portuguese.clean.txt',
  'daniel_samuel_and_constable_henry-elizabethan_sonnet_cycles.clean.txt',
  'donne_john-poetry_vol_1.clean.txt',
  'drayton_michael_et_al-elizabethan_sonnet_cycles.clean.txt',
  'farjeon_eleanor-sonnets_and_poems.clean.txt',
  'keats_john-poems_1820.clean.txt',
  'lodge_thomas_and_fletcher_giles-elizabethan_sonnet_cycles.clean.txt',
  'lovell_robert_and_southey_robert-poems.clean.txt',
  'milton_john-poetical_works.clean.txt',
  'seward_anna-sonnets-and-odes.clean.txt',
  'shelley_percy-complete_poetic_works.clean.

In [3]:
# random seed
SEED = config_vars.get('RANDOM_SEED', 1)

# pretrained model designator
MODEL_TYPE = config_vars.get('CAUSAL_MODEL', 'distilgpt2')

# model batch size
BATCH_SIZE = config_vars.get('BATCH_SIZE', 16)

# model num epochs
N_EPOCHS = config_vars.get('CAUSAL_N_EPOCHS', 8)

# whether to downsample
SAMPLE = config_vars.get('N_SAMPLES', 1)

In [4]:
# directories and other constants, from config.json

# model name for saving
MODEL_NAME = config_vars.get('MODEL_NAME', 'shakespeare')

# directory for saved models
MODEL_DIR = config_vars.get('MODEL_DIR', '../models/')

# full model save path
MODEL_PATH = os.path.join(MODEL_DIR, f'{MODEL_NAME}.{MODEL_TYPE}.{N_EPOCHS}')
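
Each constant above falls back to a default when its key is missing from the config. The sketch below shows that pattern end to end. It assumes `load_config` simply parses a JSON file into a dict; the real helper lives in `utilities/utilities.py` and is not shown in this notebook, so `load_config_sketch` is a hypothetical stand-in, and the config values are a trimmed-down illustration.

```python
import json
import os
import tempfile


def load_config_sketch(path):
    """Hypothetical stand-in for load_config: parse a JSON file into a dict."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# write a tiny config to a temp file so the example is self-contained
with tempfile.TemporaryDirectory() as tmp_dir:
    cfg_path = os.path.join(tmp_dir, 'config.json')
    with open(cfg_path, 'w', encoding='utf-8') as f:
        json.dump({'CAUSAL_MODEL': 'distilgpt2', 'CAUSAL_N_EPOCHS': 8}, f)
    cfg = load_config_sketch(cfg_path)

# dict.get returns the second argument when the key is missing
model_type = cfg.get('CAUSAL_MODEL', 'distilgpt2')
n_epochs = cfg.get('CAUSAL_N_EPOCHS', 8)
batch_size = cfg.get('BATCH_SIZE', 16)  # key absent, so this falls back to 16

model_path = os.path.join('../models/', f'shakespeare.{model_type}.{n_epochs}')
print(model_path)
```

Encoding the model type and epoch count in the save path means a change to either one in `config.json` produces a new path, so an older saved model is not silently reused.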

### Load and Format Data

In [5]:
# load data - keep only the Shakespeare portion (element 1 of the returned datasets)
data = get_dataset_from_config(config_vars, limit=SAMPLE)[1]
len(data), data[:2]

(76578,
 [('From fairest creatures we desire increase, That thereby beauty’s rose might never die, But as the riper should by time decease, His tender heir might bear his memory:',
   1),
  ('But thou, contracted to thine own bright eyes, Feed’st thy light’s flame with self-substantial fuel, Making a famine where abundance lies, Thyself thy foe, to thy sweet self too cruel:',
   1)])

In [6]:
# split into train and validation sets - no separate test set is needed here
data_train, data_val = train_test_split(data, test_size=0.01, random_state=SEED)

# this is the labeled dataset - split into text and label lists
data_train = split_text_and_labels(data_train)
data_val = split_text_and_labels(data_val)

# we don't need labels for causal LM
data_train = data_train['text']
data_val = data_val['text']

len(data_train), len(data_val)

(75812, 766)
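
The sizes above follow from how `train_test_split` handles a fractional `test_size`: the held-out count is rounded up and the training set gets the remainder. A quick arithmetic check, using only the row count reported when the data was loaded:

```python
import math

n_rows = 76578    # len(data) from the load step
test_size = 0.01  # fraction held out for validation

n_val = math.ceil(n_rows * test_size)  # held-out count is rounded up
n_train = n_rows - n_val               # everything else is used for training

print(n_train, n_val)  # 75812 766
```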

### Cleaning and Data Preparation

In [7]:
# create Dataset and DatasetDict instances - these provide the map() method used for tokenizing
# and are what prepare_tf_dataset() expects later on
train_dataset = Dataset.from_dict({'text': data_train})
val_dataset = Dataset.from_dict({'text': data_val})
datasets = DatasetDict({'train': train_dataset, 'val': val_dataset})
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 75812
    })
    val: Dataset({
        features: ['text'],
        num_rows: 766
    })
})

In [8]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)

In [9]:
# Preprocessing function for tokenizer to use with map() method of datasetdict
def token_preproc(data):
    return tokenizer(data['text'])

In [10]:
# tokenize data
tokened_data = datasets.map(token_preproc, batched=True, num_proc=4, remove_columns=['text'])
tokened_data

     

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 75812
    })
    val: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 766
    })
})

In [11]:
# pad encodings and prep for modeling
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors='tf')
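
With `mlm=False`, the collator pads each batch to its longest sequence and uses a copy of the input ids as labels, with padding positions set to -100 so the loss ignores them. The one-token shift between inputs and targets happens inside the model. The function below is a plain-Python illustration of that behavior, not the library's implementation; `collate_causal_sketch` is a hypothetical name, and the token ids in the example batch are made up.

```python
PAD_ID = 50256        # GPT-2's end-of-text id, reused as the pad token in this notebook
IGNORE_INDEX = -100   # label value that the loss skips


def collate_causal_sketch(batch, pad_id=PAD_ID):
    """Pad a batch of token-id lists and derive causal-LM labels (illustrative only)."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad  # pad on the right
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # labels copy the inputs; every pad-id position is masked out of the loss
        labels.append([IGNORE_INDEX if tok == pad_id else tok for tok in padded])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': labels}


batch = collate_causal_sketch([[10, 11, 12], [20, 21]])
print(batch['labels'])  # [[10, 11, 12], [20, 21, -100]]
```

One consequence of reusing EOS as the pad token is that any genuine EOS tokens in the text would be masked out of the loss too, since the mask is applied by token id.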

### Modeling

In [12]:
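# instantiate model: load the saved fine-tuned model if one exists,
# otherwise start from the pretrained checkpoint

if not os.path.exists(MODEL_PATH):
    # GPT-2 has no pad token, so use the EOS token id for padding
    model = TFAutoModelForCausalLM.from_pretrained(MODEL_TYPE, pad_token_id=tokenizer.eos_token_id)
else:
    model = TFAutoModelForCausalLM.from_pretrained(MODEL_PATH)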
    
#model.resize_token_embeddings(len(tokenizer))

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [13]:
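# convert data to the tf.data.Dataset format the tf model expects
# GPT-2 has no pad token; reuse EOS so the collator can pad each batch
tokenizer.pad_token = tokenizer.eos_token
# note: batch size is hard-coded to 32 here; BATCH_SIZE from config.json is not used
tf_train_set = model.prepare_tf_dataset(tokened_data['train'], shuffle=True, batch_size=32, collate_fn=collator)
tf_val_set = model.prepare_tf_dataset(tokened_data['val'], shuffle=False, batch_size=32, collate_fn=collator)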
tf_train_set

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(32, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(32, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(32, None), dtype=tf.int64, name=None))>

In [14]:
# compile model
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [15]:
# fit model (only if no saved fine-tuned model exists at MODEL_PATH)
if not os.path.exists(MODEL_PATH):
    model.fit(tf_train_set, validation_data=tf_val_set, epochs=N_EPOCHS)

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


In [16]:
# save model
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
    model.save_pretrained(MODEL_PATH)

### Test Model

In [17]:
# sample 10 random training lines; index against data_train (not data, which is longer)
test_lines = [data_train[random.randint(0, len(data_train) - 1)] for _ in range(10)]
fragment_ratio = 0.7

generated = list()
for line in test_lines:
    words = line.split()
    fragment_end_ix = int(len(words) * fragment_ratio) or 1
    line = ' '.join(words[:fragment_end_ix])
    generated.append((line, generate_from(line, model, tokenizer)))
    print(len(generated), end=', ')
    
print('Done!\n\n')

for line in generated:
    print(f'Original: {line[0]}\nGenerated: {line[1]}\n\n')

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, Done!


Original: Have I once lived to see
Generated: Have I once lived to see in this a man? Would he had not been born of such a burden as thee, for that fortune hath been in him since my father died. For one thing: ’tis the least false which may be true; It hath in one thing only known to me. In another it is the very privilege That thou comest with mine eye When thou hast first received her fair queen and her child So well deserved when she died— But there must at least be more from me than they show From


Original: Live And deal with
Generated: Live And deal with all our fortunes. This is the Duke of Norfolk that brings it forward, Or any of the country’s great armies come To fight against this sight and love, That no man in France but himself Will be consul to you if I do not make his Grace a queen? If my true heart so true As he hath done her wrongings (Who bears here by false title), Make him peace for herself! The King shall live well enough ere
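
Note that the "Original" lines above are the truncated prompts given to the model, not the full source lines. The truncation step from the test cell can be isolated as a small function; `make_prompt` is a hypothetical name used only for this illustration.

```python
def make_prompt(line, fragment_ratio=0.7):
    """Keep the first `fragment_ratio` of the words in `line`, but at least one word."""
    words = line.split()
    # int() truncates toward zero; `or 1` guards against an empty prompt for short lines
    fragment_end_ix = int(len(words) * fragment_ratio) or 1
    return ' '.join(words[:fragment_end_ix])


print(make_prompt('From fairest creatures we desire increase,'))
print(make_prompt('Alas'))
```

A six-word line keeps its first four words, and a one-word line is passed through unchanged rather than becoming an empty prompt.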

### Evaluation

Perplexity is the standard metric for language models: it is the exponential of the average negative log-likelihood per token, so a lower score means the model was less "confused" when predicting each next word. Because of dependency issues with [Hugging Face's Evaluate library](https://huggingface.co/docs/evaluate/index), I have not computed it here. I also feel that perplexity is too general a measure for what I'm looking for. See the classification notebook for my alternative metric.
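
For reference, perplexity can be computed directly from the probabilities a model assigns to the tokens that actually occur, so in principle it could be approximated from the validation loss without the Evaluate library. A minimal sketch, with made-up token probabilities standing in for real model outputs:

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# hypothetical probabilities the model assigned to each actual next token
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # like guessing among 4 equally likely words
confident = [0.9, 0.8, 0.95, 0.85]

print(perplexity(uniform_guess))
print(perplexity(confident))
```

A model that always spreads its probability evenly over four candidates has a perplexity of about 4, which is why the score is often read as the effective number of choices the model is torn between at each step.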

### Conclusion

The fine-tuned GPT-2 model generates text that, to me, seems reasonably Shakespearean. The real test will be pitting it against other period works, which often sound similar. The classification notebook builds a model that classifies text as Shakespearean or non-Shakespearean.