# Modeling on Sentences

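This notebook fine-tunes a pretrained causal language model (DistilGPT2 by default) on sonnets that have been split into sentences.

Much of the code below is adapted from the Hugging Face language modeling tutorial at https://huggingface.co/docs/transformers/v4.22.2/en/tasks/language_modeling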

## Imports and Preliminaries

In [1]:
# data formatting for model
from datasets import Dataset, DatasetDict

# train/test split
from sklearn.model_selection import train_test_split

# tokenizer
from transformers import AutoTokenizer

# lm collator
from transformers import DataCollatorForLanguageModeling

# model and support
from transformers import TFAutoModelForCausalLM, create_optimizer, AdamWeightDecay

# other utilities
from itertools import chain
import os
import random
import re

In [2]:
# candidate models; model_type below selects the one actually used
MODELS = [
    'gpt', # original GPT
    'distilgpt2', # 82M parameters
    'gpt2', # 124M parameters (originally reported as 117M)
    'gpt2-medium', # 355M parameters
    'gpt2-large', # 774M parameters
    'ctrl',
    'transformerxl',
    'reformer',
    'xlnet'
]
    
model_type = 'distilgpt2'

In [3]:
# directories
MODEL_FORMAT = 'sentences-2'
DIR_MODEL = '../models/'
DIR_DATA = '../data/'

In [47]:
# other special constants
EOL_TOKEN = r'|eol|'

In [67]:
# regexes
RE_SENTENCE = re.compile(r'\w.*?[.?!:;]', re.S)
RE_WHITESPACE = re.compile(r'\s+')
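
To see what these two patterns do, here is a small self-contained check on a made-up string (the sample text is only an illustration, not taken from the corpus). `RE_SENTENCE` lazily matches from the first word character up to the nearest `.`, `?`, `!`, `:` or `;`, so clauses ending in a colon or semicolon count as "sentences" too; `RE_WHITESPACE` then collapses line breaks and repeated spaces.

```python
import re

RE_SENTENCE = re.compile(r'\w.*?[.?!:;]', re.S)
RE_WHITESPACE = re.compile(r'\s+')

# illustrative sample with a line break and a double space
sample = "Shall I compare thee\nto a summers day?  Thou art more lovely: and more temperate."

# split into sentence-like chunks, then normalise whitespace inside each chunk
found = [RE_WHITESPACE.sub(' ', s) for s in RE_SENTENCE.findall(sample)]
print(found)
```

Because of `re.S`, the `.` also matches newlines, which is why the first chunk can span the line break.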

## Load and Format Data

In [68]:
# load data
paths = [
    os.path.join(DIR_DATA, 'shakespeare-sonnets.clean.txt'),
    os.path.join(DIR_DATA, 'browning-sonnets.clean.txt'),
    os.path.join(DIR_DATA, 'daniel-constable-sonnets.clean.txt'),
    os.path.join(DIR_DATA, 'drayton-griffin-smith-sonnet-cycles.clean.txt'),
    os.path.join(DIR_DATA, 'farjeon-sonnets.clean.txt'),
    os.path.join(DIR_DATA, 'lovell-southey-sonnets.clean.txt')
]

text = []

for path in paths:
    with open(path, 'r') as f:
        # keep only non-empty lines, stripped of surrounding whitespace
        text.append([line.strip() for line in f if line.strip()])

# join all lines from all files into one string, then split into sentences
text = ' '.join(chain(*text))
sentences = RE_SENTENCE.findall(text)
sentences = [RE_WHITESPACE.sub(' ', sentence) for sentence in sentences]
len(text), len(sentences), sentences[0:4]

(338274,
 3836,
 ['From fairest creatures we desire increase, That thereby beautys rose might never die, But as the riper should by time decease, His tender heir might bear his memory:',
  'But thou, contracted to thine own bright eyes, Feedst thy lights flame with self-substantial fuel, Making a famine where abundance lies, Thyself thy foe, to thy sweet self too cruel:',
  'Thou that art now the worlds fresh ornament, And only herald to the gaudy spring, Within thine own bud buriest thy content, And tender churl makst waste in niggarding:',
  'Pity the world, or else this glutton be, To eat the worlds due, by the grave and thee.'])

In [69]:
# split train and test
lines_train, lines_test = train_test_split(sentences, test_size=0.05)
len(lines_train), len(lines_test)

(3644, 192)
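
Note that the split above is different on every run, because no `random_state` is passed to `train_test_split`; passing an integer there makes it repeatable. The stdlib-only sketch below shows the same idea with a seeded shuffle. `seeded_split` and the seed value are hypothetical names chosen for illustration, and it will not reproduce the exact partition scikit-learn would pick.

```python
import math
import random

def seeded_split(items, test_size=0.05, seed=42):
    """Shuffle a copy of items with a fixed seed, then cut off the test share."""
    rng = random.Random(seed)          # private generator, global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# stand-in data with the same number of items as the sentence list above
demo = [f'sentence {i}' for i in range(3836)]
train_a, test_a = seeded_split(demo)
train_b, test_b = seeded_split(demo)
print(len(train_a), len(test_a), train_a == train_b)
```

With a fixed seed the two calls return identical splits, which makes later comparisons between training runs easier to interpret.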

## Cleaning and Data Preparation

In [70]:
# wrap the splits in Dataset / DatasetDict objects so they can be tokenized with map()
train_dataset = Dataset.from_dict({'text': lines_train})
test_dataset = Dataset.from_dict({'text': lines_test})
datasets = DatasetDict({'train': train_dataset, 'test': test_dataset})
datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 3644
    })
    test: Dataset({
        features: ['text'],
        num_rows: 192
    })
})

In [71]:
# create tokenizer (registering EOL_TOKEN via additional_special_tokens is commented out)
tokenizer = AutoTokenizer.from_pretrained(model_type) #, additional_special_tokens=[EOL_TOKEN])

In [53]:
# Preprocessing function for tokenizer to use with map() method of datasetdict
def token_preproc(data):
    return tokenizer(data['text'])

In [54]:
# tokenize data
tokened_data = datasets.map(token_preproc, batched=True, num_proc=4, remove_columns=['text'])
tokened_data

      


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3200
    })
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 169
    })
})

In [55]:
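# spot-check 10 random examples: index, tokens, and the source sentence
for _ in range(10):
    # randrange excludes the upper bound; randint(0, len(...)) could index past the end
    n = random.randrange(len(tokened_data['train']))
    print(n, tokenizer.convert_ids_to_tokens(tokened_data['train'][n]['input_ids']), lines_train[n])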

370 ['O', 'ft', 'Ġand', 'Ġin', 'Ġvain', 'Ġmy', 'Ġrebel', 'Ġthoughts', 'Ġhave', 'Ġventured', 'Ġ', '|eol|', 'ĠTo', 'Ġstop', 'Ġthe', 'Ġpassage', 'Ġof', 'Ġmy', 'Ġvanquished', 'Ġheart', ';', 'Ġ', '|eol|'] Oft and in vain my rebel thoughts have ventured |eol| To stop the passage of my vanquished heart; |eol|
2712 ['Nature', 'Ġyou', 'Ġmade', 'Ġof', 'Ġpure', 'Ġand', 'Ġfaire', 'st', 'Ġmould', ',', 'Ġ', '|eol|', 'ĠThe', 'Ġpomp', 'Ġand', 'Ġglory', 'Ġof', 'Ġman', 'Ġto', 'Ġdepress', ',', 'Ġ', '|eol|', 'ĠAnd', 'Ġas', 'Ġyour', 'Ġslaves', 'Ġin', 'Ġth', 'ral', 'dom', 'Ġthem', 'Ġto', 'Ġhold', ';', 'Ġ', '|eol|'] Nature you made of pure and fairest mould, |eol| The pomp and glory of man to depress, |eol| And as your slaves in thraldom them to hold; |eol|
1022 ['Much', 'Ġsorrow', 'Ġin', 'Ġitself', 'Ġmy', 'Ġlove', 'Ġd', 'oth', 'Ġmove', ',', 'Ġ', '|eol|', 'ĠMore', 'Ġmy', 'Ġdespair', 'Ġto', 'Ġlove', 'Ġa', 'Ġhopeless', 'Ġbliss', ',', 'Ġ', '|eol|', 'ĠMy', 'Ġfolly', 'Ġmost', 'Ġto', 'Ġlove', 'Ġwhom', 'Ġsure', 'Ġt
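
In the token lists above, `Ġ` is how the GPT-2 byte-level BPE vocabulary marks a leading space. The sketch below is a simplified illustration that handles only that space marker (the real tokenizer maps every byte, so in practice `tokenizer.decode` is the right tool); `tokens_to_text` is a hypothetical helper name.

```python
def tokens_to_text(tokens):
    """Join GPT-2 style tokens, turning the 'Ġ' marker back into a space."""
    return ''.join(tokens).replace('Ġ', ' ')

# first few tokens from one of the printed examples
tokens = ['O', 'ft', 'Ġand', 'Ġin', 'Ġvain', 'Ġmy', 'Ġrebel', 'Ġthoughts']
print(tokens_to_text(tokens))
```

This also shows why `Oft` is split into `O` + `ft`: the second piece has no `Ġ`, so it attaches directly to the first.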

In [56]:
# pad encodings and prep for modeling
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors='tf')
collator

DataCollatorForLanguageModeling(tokenizer=PreTrainedTokenizerFast(name_or_path='distilgpt2', vocab_size=50257, model_max_len=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>', 'additional_special_tokens': ['|eol|']}), mlm=False, mlm_probability=0.15, pad_to_multiple_of=None, tf_experimental_compile=False, return_tensors='tf')
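
With `mlm=False`, the collator pads each batch to its longest sequence and uses a copy of `input_ids` as `labels`, with every position that holds the pad token id set to `-100` so the loss ignores it (the one-position shift for next-token prediction happens inside the model). A plain-Python sketch of that behaviour, under those assumptions; `collate_causal` is a hypothetical name and the real collator returns tensors rather than lists:

```python
def collate_causal(batch, pad_id, ignore_index=-100):
    """Right-pad token id lists and build labels for causal LM training."""
    max_len = max(len(ids) for ids in batch)
    out = {'input_ids': [], 'attention_mask': [], 'labels': []}
    for ids in batch:
        n_pad = max_len - len(ids)
        padded = list(ids) + [pad_id] * n_pad
        out['input_ids'].append(padded)
        # 1 for real tokens, 0 for padding
        out['attention_mask'].append([1] * len(ids) + [0] * n_pad)
        # labels copy the inputs; positions holding pad_id are ignored by the loss
        out['labels'].append([ignore_index if t == pad_id else t for t in padded])
    return out

batch = collate_causal([[5, 6, 7], [8, 9]], pad_id=0)
print(batch)
```

Because the masking goes by token id, reusing the EOS token as the pad token means genuine EOS tokens are masked out of the loss as well.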

## Modeling

In [57]:
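# instantiate model: reuse a previously fine-tuned checkpoint if one was saved,
# otherwise start from the pretrained weights
model_path = os.path.join(DIR_MODEL, f'{model_type}.{MODEL_FORMAT}')

if not os.path.exists(model_path):
    model = TFAutoModelForCausalLM.from_pretrained(model_type, pad_token_id=tokenizer.eos_token_id)
else:
    model = TFAutoModelForCausalLM.from_pretrained(model_path)

# only needed if tokens (e.g. EOL_TOKEN) are added to the tokenizer
#model.resize_token_embeddings(len(tokenizer))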

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at ../models/distilgpt2.sentences-2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


<tf.Variable 'tfgpt2lm_head_model_1/transformer/wte/weight:0' shape=(50258, 768) dtype=float32, numpy=
array([[-0.13957573, -0.0419271 ,  0.00102456, ..., -0.15793537,
         0.02027452,  0.09916671],
       [ 0.05113355, -0.07777626,  0.02975089, ...,  0.07002807,
        -0.0025974 ,  0.04696284],
       [-0.118467  ,  0.03001494,  0.20445536, ...,  0.0537251 ,
        -0.11486189, -0.14482298],
       ...,
       [ 0.1986433 ,  0.01529839,  0.01595281, ..., -0.0820727 ,
         0.07669435, -0.02853521],
       [ 0.02912564,  0.06097761,  0.03508702, ...,  0.03643134,
         0.09626068,  0.06251005],
       [ 0.00911447,  0.03074369, -0.00523561, ...,  0.01551951,
         0.02257361, -0.01079206]], dtype=float32)>

In [61]:
# GPT-2 tokenizers ship without a pad token, so reuse EOS for padding,
# then build batched tf.data datasets for training and evaluation
tokenizer.pad_token = tokenizer.eos_token
tf_train_set = model.prepare_tf_dataset(tokened_data['train'], shuffle=True, batch_size=64, collate_fn=collator)
tf_test_set = model.prepare_tf_dataset(tokened_data['test'], shuffle=False, batch_size=64, collate_fn=collator)
tf_train_set

<PrefetchDataset element_spec=({'input_ids': TensorSpec(shape=(64, None), dtype=tf.int64, name=None), 'attention_mask': TensorSpec(shape=(64, None), dtype=tf.int64, name=None)}, TensorSpec(shape=(64, None), dtype=tf.int64, name=None))>

In [62]:
# compile model
optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


In [63]:
# fit model (if pretrained does not exist)
if not os.path.exists(model_path):
    model.fit(tf_train_set, validation_data=tf_test_set, epochs=1)



In [64]:
# save model
if not os.path.exists(model_path):
    os.makedirs(model_path)
    model.save_pretrained(model_path)

## Test Model

In [65]:
# function to get predicted text
# note: length_penalty only affects beam-based generation, so it has no effect
# with pure sampling; only the first sequence is decoded, even if n_seq > 1
def test(text, max_new=50, temp=1, top_k=50, rep_penalty=1.5, len_penalty=0.75, n_seq=1):
    tokened = tokenizer(text, return_tensors='tf')
    output = model.generate(**tokened,
                            do_sample=True,
                            max_new_tokens=max_new,
                            temperature=temp,
                            top_k=top_k,
                            repetition_penalty=rep_penalty,
                            length_penalty=len_penalty,
                            num_return_sequences=n_seq)
    return tokenizer.decode(output[0], skip_special_tokens=False)
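
To make the `repetition_penalty`, `temperature` and `top_k` arguments more concrete, here is a stdlib-only sketch of a single sampling step. It illustrates the general technique as I understand it, not how `model.generate` is implemented internally; both function names and the example logits are hypothetical.

```python
import math
import random

def apply_repetition_penalty(logits, seen_ids, penalty=1.5):
    """Make token ids already in the sequence less likely by rescaling their logits."""
    out = list(logits)
    for i in set(seen_ids):
        # dividing a positive logit or multiplying a negative one both lower it
        out[i] = out[i] * penalty if out[i] < 0 else out[i] / penalty
    return out

def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Draw one token id from the top_k highest logits after temperature scaling."""
    # indices of the top_k largest logits
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # temperature < 1 sharpens the distribution, > 1 flattens it
    scaled = [logits[i] / temperature for i in top]
    # softmax weights, shifted by the max for numerical stability
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

logits = [2.0, 0.5, -1.0, 3.0, 0.0]          # made-up scores for a 5-token vocabulary
penalized = apply_repetition_penalty(logits, seen_ids=[3], penalty=1.5)
rng = random.Random(0)
draws = [sample_top_k(penalized, temperature=0.5, top_k=2, rng=rng) for _ in range(5)]
print(penalized, draws)
```

With `top_k=2` only token ids 0 and 3 can ever be drawn here; everything outside the top k is cut off before sampling.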

In [66]:
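# pick 10 random prompts, each terminated with EOL_TOKEN
# note: split() yields a single element unless `text` contains EOL_TOKEN markers
lines = text.split(EOL_TOKEN)
test_lines = [random.choice(lines) + EOL_TOKEN for _ in range(10)]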

for line in test_lines:
    output = test(line,
                  temp=0.5,
                  max_new=100,
                  top_k=200,
                  rep_penalty=1.5,
                  len_penalty=0.75,
                  n_seq=1)
    #output = output.replace(EOL_TOKEN, '\n')
    print(f'Original: {line}\nOutput: {output}\n')

Original:  He still could face his soul and lie to her. |eol|
Output:  He still could face his soul and lie to her. |eol| And she would never, ever be in love with thee!

And so I have found that my heart is not sweet; but it may bear on me as well: For now thou art the one who loves thyself best? That which if you find yourself a friend of mine own hand more than your hearts do know.  My breast might see what he did steal from him when they were born—though by their nature no longer can touch them like this,—and yet these are all things

Original:  Some fresher stamp of the time-bettering days. |eol|
Output:  Some fresher stamp of the time-bettering days. |eol| Visions in my heart, when I have been fair! 
I am not a bird nor an angel; but do you know how to fly? My wings are so high and heavy that they move|eol| And take away their beauty from me: for though it is too late then thou art worthy. But if thy love be overstered with some sweet gift,—and still remains here—it must hold the