### Fine Tuning GPT-2 Model

This notebook fine-tunes a GPT-2 model for a given genre using a given set of training parameters (see the notebook variables cell to change the genre and training parameters). It saves the model and generates lyrics with it. The notebook can also generate text and compute perplexity for previously saved models. At the very end, there is a report on hyperparameter tuning for GPT-2.

Before running, please modify notebook variables as you wish.

In [None]:
# import necessary libraries
import pandas as pd
import math
import numpy as np
from gpt2_utils import Dset 
from gpt2_utils import get_model_tokenizer, train_model, generate_texts, load_model, compute_perplexity

Set notebook variables

In [None]:
GENRE = 'country' # either "metal" or "country"

# training parameters 
MAX_SEQ_LEN = 10 # maximum token length for each lyric datapoint
EPOCHS = 1
BATCH_SIZE = 4
LR = 1e-3 # learning rate
PCT_TRAIN = 0.05 # percent of training and validation data to use as a decimal (0-1)

# name of this trained model, will be used for filename when saving the model
MODEL_INSTANCE_NAME = 'Test'
# path to the saved model to load when generating text and computing perplexity
# defaults to the model fine-tuned in this run of the notebook
LOAD_MODEL_PATH = f"gpt2_trained_models/{GENRE.lower()}/{MODEL_INSTANCE_NAME}"

Read in train, validation, and test data

In [None]:
# read in cleaned data
if GENRE not in ('country', 'metal'):
    raise ValueError('Incorrect genre given.')

# file names follow the pattern data/<genre>_<split>.csv
train_lines = pd.read_csv(f'data/{GENRE}_train.csv', header=None).values.tolist()
val_lines = pd.read_csv(f'data/{GENRE}_val.csv', header=None).values.tolist()
test_lines = pd.read_csv(f'data/{GENRE}_test.csv', header=None).values.tolist()

In [None]:
print('Total train lines :', len(train_lines))
print('Total val lines : ', len(val_lines))
print('Total test lines : ', len(test_lines))

In [None]:
train_end = math.ceil(len(train_lines)*PCT_TRAIN)
train_lines = train_lines[0:train_end]

val_end = math.ceil(len(val_lines)*PCT_TRAIN)
val_lines = val_lines[0:val_end]

print('Train lines to use :', len(train_lines))
print('Val lines to use : ', len(val_lines))
print('Test lines to use: ', len(test_lines))
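
The subsetting above keeps the first `ceil(N * PCT_TRAIN)` rows in file order rather than a random sample, and it leaves the test set untouched. A minimal sketch of the same arithmetic on toy data (the helper name `head_fraction` and the toy lines are illustrative, not part of `gpt2_utils`):

```python
import math

def head_fraction(lines, pct):
    """Keep the first ceil(len(lines) * pct) items, mirroring the slicing above."""
    end = math.ceil(len(lines) * pct)
    return lines[:end]

# toy stand-in for the rows returned by pd.read_csv(...).values.tolist()
toy_lines = [[f"line {i}"] for i in range(10)]
subset = head_fraction(toy_lines, 0.25)

print(len(subset))  # ceil(2.5) = 3
print(subset[0])    # ['line 0'], the first row of the file is always kept
```

Because of the ceiling, any non-empty dataset contributes at least one line whenever `PCT_TRAIN` is greater than 0.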

Fine Tuning GPT-2 Model

In [None]:
# get model and tokenizer
model, tokenizer = get_model_tokenizer(MAX_SEQ_LEN)

In [None]:
# encode data
train_encodings = [tokenizer(text=x, return_tensors='tf', padding='max_length', max_length=MAX_SEQ_LEN, truncation=True) for x in train_lines]
train_encodings = [enc['input_ids'].numpy().tolist()[0] for enc in train_encodings]

val_encodings = [tokenizer(text=x, return_tensors='tf', padding='max_length', max_length=MAX_SEQ_LEN, truncation=True) for x in val_lines]
val_encodings = [enc['input_ids'].numpy().tolist()[0] for enc in val_encodings]

test_encodings = [tokenizer(text=x, return_tensors='tf', padding='max_length', max_length=MAX_SEQ_LEN, truncation=True) for x in test_lines]
test_encodings = [enc['input_ids'].numpy().tolist()[0] for enc in test_encodings]
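
The combination of `padding='max_length'` and `truncation=True` means every encoded lyric ends up with exactly `MAX_SEQ_LEN` token ids. The sketch below imitates that effect in plain Python, assuming right-side padding and truncation (the usual defaults for Hugging Face tokenizers, which the call signature above resembles). The helper `pad_or_truncate` and the pad id `0` are hypothetical; the real pad token is whatever `get_model_tokenizer` configures.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Clip token_ids to max_len, then right-pad with pad_id up to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))

PAD_ID = 0  # hypothetical pad id; the real value comes from the tokenizer

short_line = pad_or_truncate([11, 12, 13], max_len=5, pad_id=PAD_ID)
long_line = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_len=5, pad_id=PAD_ID)

print(short_line)  # [11, 12, 13, 0, 0]
print(long_line)   # [11, 12, 13, 14, 15]
```

With `MAX_SEQ_LEN = 10`, any lyric line longer than 10 tokens loses its tail, so raising the value trades training time for more context per line.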

In [None]:
# create training, validation, and testing datasets
dset_train = Dset(train_encodings)
dset_val = Dset(val_encodings)
dset_test = Dset(test_encodings)

In [None]:
# NOTE: only run if you want to fine-tune a model. It may take a long time to run depending
# on the training parameters you set in notebook variables.

# fine tune the model
model = train_model(model, dset_train, dset_val, GENRE, MODEL_INSTANCE_NAME, batch_size=BATCH_SIZE, epochs=EPOCHS, lr=LR)

In [None]:
# generate lyrics
gen_texts = generate_texts(model, tokenizer, 15)
for text in gen_texts:
    print(''.join(text))

Generate Text from a Loaded Model

In [None]:
loaded_model = load_model(LOAD_MODEL_PATH)
save_text_path = f"generated_txts/{MODEL_INSTANCE_NAME}.txt"
gen_texts = generate_texts(loaded_model, tokenizer, 2, save_text_path)
for text in gen_texts:
    print(''.join(text))

Compute Perplexity

In [None]:
# compute perplexity on test data
test_lines_flt = np.array(test_lines).flatten().tolist()
ppl = compute_perplexity(LOAD_MODEL_PATH, tokenizer, test_lines_flt, MAX_SEQ_LEN)
ppl
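
For reference, perplexity is conventionally the exponential of the mean per-token negative log-likelihood. The sketch below shows only that formula; it is not the implementation of `compute_perplexity` in `gpt2_utils`, whose details (for example, whether padding tokens are masked out) are not shown in this notebook. The function name is illustrative.

```python
import math

def perplexity_from_token_losses(token_nlls):
    """Return exp(mean NLL) for per-token negative log-likelihoods (natural log)."""
    if not token_nlls:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
uniform_losses = [-math.log(0.25)] * 8
print(perplexity_from_token_losses(uniform_losses))  # approximately 4.0
```

Lower values mean the model assigns higher probability to the held-out lyrics.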

Hyperparameter Tuning

We tuned the number of epochs, the batch size, and the percent of training data used. The percent of data used may seem like an odd parameter to tune, but we noticed significant differences in the quality of output sentences when training on different portions of the training data, so we decided to tune it as well. Further, we could not always train models on the full set of training data, as it would take too long or overload our computers, so we report the percent of training data we used below. While we could tune the epochs parameter for country models, we could not do so for metal models, as training for more than 1 epoch would take too long even with a small portion of the training data. The best country and metal parameters are bolded below. Best parameters were assessed by choosing those that balanced a low perplexity with sensible generated texts.



| Genre | Number of Epochs | Batch Size | Percent of Train Data | Mean Validation Perplexity | Generated Examples |
|---|---|---|---|---|---|
|Country| 1  | 200  | 100%  | 637.6  | -ide me tonight <br> y never been good <br> i've been <br> here before <br> the things i've made <br> cause i feel like a bird in the grass <br> in and on <br> the love in your eyes <br> i need you  |
|__Country__| 1  | 100  | 50%  | 1957.4  |  to leave the wind away <br> how you call me... <br> up in my heart <br> all that i've missed <br>, the best of her <br> and all i used to do it go to  |
|Country|  1 | 100  | 25%  | 2184.99  |    and we've gonna see <br>  it's in one <br> like a little woman <br> me than we're walking over my lips <br> but all in heaven you're walking in the <br> to lose me more i've been good on <br> what you're all the sun <br> the little to make your life.|
|Country|  5 | 200  | 50%  | 126.26  |    . <br> w <br>sy dogs and pine trees <br>w <br>sy-doodle <br>sy-doodle <br>fore the leaves at home <br>sy-dons with the each other |
|Metal|  1 | 200  | 100%  | 931.29  |  's magic so strange a mystery <br>  <br>, its the last time, i'll come <br> <br> and you need it <br> and let's find a way <br> the time <br> <br> the light <br>and <br> <br> that we see their way forward<br> <br> by their own|
|__Metal__|  1 | 100  | 50%  | 3102.7  |  in the beast and the sky <br>  on the night <br> for peace, we all are calling <br> <br> and fear <br> like a second <br>, death and we fight <br> at the world <br> with the blood <br> the chains of the war|
|Metal|  1 | 100  | 25%  | 2844.39  |  to the future, the fight to kill my <br> it down just here, it goes away! <br> life <br> us <br>, the truth <br> <br> the one-through |
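
As a quick check on the table, the sketch below ranks the reported runs by perplexity alone. The numbers are copied from the table; the tuple layout and the helper `lowest_perplexity` are illustrative. The bolded rows are not the lowest-perplexity rows for either genre, which is consistent with the stated criterion that the quality of the generated text was weighed alongside perplexity.

```python
# Reported runs copied from the table:
# (genre, epochs, batch_size, percent_of_train_data, mean_validation_perplexity)
runs = [
    ("country", 1, 200, 100, 637.6),
    ("country", 1, 100, 50, 1957.4),
    ("country", 1, 100, 25, 2184.99),
    ("country", 5, 200, 50, 126.26),
    ("metal", 1, 200, 100, 931.29),
    ("metal", 1, 100, 50, 3102.7),
    ("metal", 1, 100, 25, 2844.39),
]

def lowest_perplexity(all_runs, genre):
    """Return the run with the smallest perplexity for the given genre."""
    candidates = [run for run in all_runs if run[0] == genre]
    return min(candidates, key=lambda run: run[4])

for genre in ("country", "metal"):
    print(genre, lowest_perplexity(runs, genre))
```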