# GA Capstone
## Causal Language Transformer Modeling with GPT2

Fine-tuning a pretrained GPT-2 model to generate text.

Much of the code below is adapted from the Hugging Face causal language modeling tutorial at https://huggingface.co/docs/transformers/v4.22.2/en/tasks/language_modeling

### Imports and Preliminaries

In [1]:
# data formatting for model
from datasets import Dataset, DatasetDict

# train/test split
from sklearn.model_selection import train_test_split

# tokenizer
from transformers import AutoTokenizer

# lm collator
from transformers import DataCollatorForLanguageModeling

# model and support
from transformers import TFAutoModelForCausalLM, create_optimizer, AdamWeightDecay

# other utilities
from itertools import chain
import os
import random
import re
import json
from utilities.utilities import load_config, get_dataset_from_config
from utilities.utilities import split_text_and_labels

In [None]:
CONFIG_FILE = 'config.json'

config_vars = load_config(CONFIG_FILE)
config_vars

In [2]:
# pretrained model designator
MODEL_TYPE = config_vars['CAUSAL_MODEL'] if 'CAUSAL_MODEL' in config_vars else 'distilgpt2'

# model batch size
BATCH_SIZE = config_vars['BATCH_SIZE'] if 'BATCH_SIZE' in config_vars else 16

# model num epochs
N_EPOCHS = config_vars['CAUSAL_N_EPOCHS'] if 'CAUSAL_N_EPOCHS' in config_vars else 8

# whether to downsample
SAMPLE = config_vars['N_SAMPLES'] if 'N_SAMPLES' in config_vars else 1

In [3]:
# directories and other constants, from config.json

# model name for saving
MODEL_NAME = config_vars['MODEL_NAME'] if 'MODEL_NAME' in config_vars else 'shakespeare'

# directory for saved models
DIR_MODEL = config_vars['MODEL_DIR'] if 'MODEL_DIR' in config_vars else '../models/'

# directory for text data
DIR_DATA = config_vars['DATA_DIR'] if 'DATA_DIR' in config_vars else '../data/'
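
The constants above all follow one pattern: use the value from `config.json` if the key is present, otherwise fall back to a default. The sketch below shows the same pattern with `dict.get`. The real `load_config` lives in `utilities.utilities` and is not shown in this notebook, so `load_config_sketch` is a hypothetical stand-in that assumes the config is a flat JSON object.

```python
import json
import os
import tempfile


def load_config_sketch(path):
    """Hypothetical stand-in for load_config: read a JSON file into a dict.

    Returns an empty dict when the file is missing, so every constant
    falls back to its default.
    """
    if not os.path.exists(path):
        return {}
    with open(path, encoding='utf-8') as f:
        return json.load(f)


# write a throwaway config so the example runs on its own
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'config.json')
    with open(demo_path, 'w', encoding='utf-8') as f:
        json.dump({'BATCH_SIZE': 32}, f)
    demo_config = load_config_sketch(demo_path)

# dict.get(key, default) behaves like `cfg[key] if key in cfg else default`
demo_batch_size = demo_config.get('BATCH_SIZE', 16)              # key present
demo_model_type = demo_config.get('CAUSAL_MODEL', 'distilgpt2')  # key absent
```

With this form, each constant's key and default sit together in a single call.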

### Load and Format Data

In [5]:
# load data - keep only element [1] of the result, the Shakespeare text
data = get_dataset_from_config(config_vars, limit=SAMPLE)[1]
len(data), data[:2]

In [8]:
# hold out 1% of the data as a validation set - no separate test set is needed here
data_train, data_val = train_test_split(data, test_size=0.01)

# this is the labeled dataset - split into text and label lists
data_train = split_text_and_labels(data_train)
data_val = split_text_and_labels(data_val)

# we don't need labels for causal LM
data_train = data_train['text']
data_val = data_val['text']

len(data_train), len(data_val)

(74663, 1915)
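
The split above is not seeded, so the validation lines change on every run. Passing `random_state` to `train_test_split` is one way to fix that. The sketch below shows the same idea in plain Python, only to illustrate what a seeded hold-out split does. `holdout_split` and the demo data are hypothetical, and it is not a replacement for the scikit-learn function.

```python
import math
import random


def holdout_split(items, val_fraction=0.01, seed=42):
    """Shuffle a copy of items with a fixed seed and hold out a fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # always keep at least one validation item
    n_val = max(1, math.ceil(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


demo_lines = [f'line {i}' for i in range(1000)]
demo_train, demo_val = holdout_split(demo_lines)
demo_train_again, demo_val_again = holdout_split(demo_lines)
```

Because the seed is fixed, both calls return identical splits.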

### Cleaning and Data Preparation

In [9]:
# wrap the splits in Dataset and DatasetDict instances so they can be tokenized with map()
train_dataset = Dataset.from_dict({'text': data_train})
val_dataset = Dataset.from_dict({'text': data_val})
datasets = DatasetDict({'train': train_dataset, 'val': val_dataset})
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 74663
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1915
    })
})

In [10]:
# Create tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_TYPE)

In [11]:
# Preprocessing function for tokenizer to use with map() method of datasetdict
def token_preproc(data):
    return tokenizer(data['text'])

In [12]:
# tokenize data
tokened_data = datasets.map(token_preproc, batched=True, num_proc=4, remove_columns=['text'])
tokened_data

     


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 74663
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 1915
    })
})

In [13]:
# pad encodings and prep for modeling
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors='tf')
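
With `mlm=False`, the collator pads each batch to a common length and builds the labels as a copy of the input ids, with padded positions marked so the loss ignores them. The pure-Python sketch below illustrates that behavior under simplifying assumptions (plain lists instead of tensors, right-side padding). `collate_causal` is a hypothetical name, not part of the library. The one-position shift between inputs and targets is not shown, since the model handles it when it computes its loss.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def collate_causal(batch, pad_id):
    """Pad token-id lists to a common length and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = ids + [pad_id] * n_pad
        input_ids.append(padded)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
        # labels copy the inputs; every position holding pad_id is ignored
        labels.append([IGNORE_INDEX if t == pad_id else t for t in padded])
    return {'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': labels}


demo_batch = collate_causal([[5, 6, 7], [8, 9]], pad_id=0)
```

The labels are masked by token id, not by position. When the pad token is set to the end-of-sequence token, as this notebook does, any end-of-sequence tokens in the data are ignored by the loss as well.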

### Modeling

In [14]:
# instantiate model
model_path = os.path.join(DIR_MODEL, f'{MODEL_TYPE}.{MODEL_NAME}.{N_EPOCHS}')

if not os.path.exists(model_path):
    model = TFAutoModelForCausalLM.from_pretrained(MODEL_TYPE, pad_token_id=tokenizer.eos_token_id)
else:
    model = TFAutoModelForCausalLM.from_pretrained(model_path)
    
#model.resize_token_embeddings(len(tokenizer))

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at distilgpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


In [15]:
# GPT-2 has no pad token by default, so reuse the EOS token, then build batched tf.data datasets
tokenizer.pad_token = tokenizer.eos_token
tf_train_set = model.prepare_tf_dataset(tokened_data['train'], shuffle=True, batch_size=32, collate_fn=collator)
tf_val_set = model.prepare_tf_dataset(tokened_data['val'], shuffle=False, batch_size=32, collate_fn=collator)
tf_train_set

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(32, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(32, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(32, None), dtype=tf.int64, name=None))>

In [16]:
# compile model
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [17]:
# fit model only if no saved fine-tuned model exists at model_path
if not os.path.exists(model_path):
    model.fit(tf_train_set, validation_data=tf_val_set, epochs=N_EPOCHS)

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [18]:
# save model
if not os.path.exists(model_path):
    os.makedirs(model_path)
    model.save_pretrained(model_path)

### Test Model

In [19]:
# generate a sampled continuation of `text` and return it as a string
def test(text, max_new=50, temp=1, top_k=50, rep_penalty=1.5, len_penalty=0.75, n_seq=1):
    tokened = tokenizer(text, return_tensors='tf')
    output = model.generate(**tokened,
                            do_sample=True,
                            max_new_tokens=max_new,
                            temperature=temp,
                            top_k=top_k,
                            repetition_penalty=rep_penalty,
                            # length_penalty only affects beam-based generation,
                            # so it is inert with plain sampling as used here
                            length_penalty=len_penalty,
                            num_return_sequences=n_seq)
    # only the first sequence is decoded, even when n_seq > 1
    return tokenizer.decode(output[0], skip_special_tokens=False)
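
The `temperature` and `top_k` arguments control how each next token is drawn. The sketch below shows the general technique on a toy list of logits: divide the logits by the temperature, keep the `top_k` largest, renormalize with a softmax, and sample. It is a simplified illustration, not the library's implementation, and the function names and `demo_logits` are hypothetical.

```python
import math
import random


def top_k_probs(logits, temperature=1.0, top_k=50):
    """Return {token_index: probability} over the top_k highest logits."""
    scaled = [x / temperature for x in logits]
    # indices of the top_k largest scaled logits
    keep = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # subtract the max before exponentiating for numerical stability
    peak = max(scaled[i] for i in keep)
    weights = {i: math.exp(scaled[i] - peak) for i in keep}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def sample_token(logits, temperature=1.0, top_k=50, rng=random):
    """Draw one token index from the truncated, renormalized distribution."""
    probs = top_k_probs(logits, temperature, top_k)
    indices = list(probs)
    return rng.choices(indices, weights=[probs[i] for i in indices], k=1)[0]


demo_logits = [2.0, 1.0, 0.5, -1.0]
sharp = top_k_probs(demo_logits, temperature=0.5, top_k=3)  # peaked
flat = top_k_probs(demo_logits, temperature=2.0, top_k=3)   # spread out
```

A temperature below 1 concentrates probability on the most likely token, which is why the test cell's `temp=0.5` gives more conservative output than the default. A smaller `top_k` cuts off more of the unlikely tail.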

In [20]:
# sample 10 validation lines to use as prompts
test_lines = [random.choice(data_val) for _ in range(10)]
fragment_ratio = 0.4

for line in test_lines:
    words = line.split()
    stop_ix = int(len(words) * fragment_ratio) or 1
    fragment = ' '.join(words[0:stop_ix])
    print(f'Original: {fragment}')
    output = test(fragment,
                  temp=0.5,
                  max_new=100,
                  top_k=200,
                  rep_penalty=1.5,
                  len_penalty=0.75,
                  n_seq=1)
    print(f'Output: {output}\n')

Original: I will
Output: I will not. But I know my love is dead, And that it shall be with her; and do so by the time of death: For if she sleep tonight or tomorrow night, She’ll die for me in a coffin where no man can touch him With this dearest handkerchief to his head—a story Which must have been more true than truth! That he would draw upon your face To make you laugh at what thou hast said thus far As never before done here on Earth?

Original: I do
Output: I do not understand this man. But I know him by his name, and will tell you what he is; And to my knowledge the King of England hath sent for me— For here we are at odds with our countrymen! He’s a traitor that would have had us dead in arms upon any other occasion That might have been so grievously wronged as it is against your Majesty To make such an act himself As may be proved guilty or given up again Against all those which did bear them

Original: Sir
Output: Sir, you are a very poor gentleman. I’ll tell thee what is true
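
The loop above builds each prompt from the first 40% of a line's words, with a minimum of one word. The sketch below isolates that step in a small helper so it can be checked on its own. `make_prompt` is a hypothetical name, and returning `None` for an empty line is an added assumption so the caller can skip it instead of passing an empty prompt to the model.

```python
def make_prompt(line, fragment_ratio=0.4):
    """Return the leading fragment of a line, or None if it has no words."""
    words = line.split()
    if not words:
        return None
    # keep at least one word, matching `int(...) or 1` in the loop above
    stop_ix = max(1, int(len(words) * fragment_ratio))
    return ' '.join(words[:stop_ix])


demo_prompt = make_prompt('I will not budge for any man')  # 7 words -> first 2
demo_single = make_prompt('Sir')                           # 1 word -> kept whole
demo_empty = make_prompt('   ')                            # no words -> None
```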

### Conclusion

The fine-tuned model often generates text that sounds reasonably Shakespearian. The next step is to combine it with the classification model to test that impression.