<div style=background-color:#EEEEFF>

## 4. Fine-tune GPT-2 to tell jokes

It turns out that vanilla GPT-2 doesn't tell very good jokes.  That's not very surprising, given that GPT-2 is trained to generate a wide variety of text; most text doesn't take the form of Q/A jokes!  
    
We can improve GPT-2's joke-telling ability with "fine-tuning", by running some additional training on top of the pre-trained GPT-2 model, using a dataset of Q/A jokes.  For this, we'll use the jokes training set we used previously to train our BERT joke classifier.  

<div style=background-color:#EEEEFF>

As before, we'll load our short jokes training set.  This time, we only want to load the "real" jokes, not the "fake" jokes, because we're trying to get GPT-2 to generate punchlines that look like real punchlines.  We only load the training dataset; we won't evaluate as we train because, unlike with classification, there isn't a simple quantitative metric to assess "is this a good joke?".

In [1]:
from datasets import load_dataset

train_files = ['data/short_jokes_train.csv']   # Only need the "real" joke training data
downsample = 100   # Let's start by using only a subset of the data

dataset = load_dataset('csv', data_files={'train':train_files})

# Remove any badly-formatted data and downsample, if requested
dataset = dataset.filter(lambda ex, j: (isinstance(ex['setup'], str) and isinstance(ex['punchline'], str)
                                        and j % downsample == 0),
                         with_indices=True)
print('{} rows in the train dataset.'.format(dataset['train'].num_rows))

Using custom data configuration default-818532f3475b5e35
Reusing dataset csv (/home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-818532f3475b5e35/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/1 [00:00<?, ?it/s]

Loading cached processed dataset at /home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-818532f3475b5e35/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-d7eb08caaadb6229.arrow


1028 rows in the train dataset.
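To make the filter's effect concrete, here is a minimal plain-Python sketch of the same row-selection rule applied to a handful of made-up rows. The `keep_row` helper and the sample rows are hypothetical and only for illustration; the notebook itself applies the rule through `dataset.filter`.

```python
def keep_row(example, index, downsample=100):
    """Keep a row only if both fields are strings and it falls on the downsample stride."""
    well_formed = isinstance(example['setup'], str) and isinstance(example['punchline'], str)
    return well_formed and index % downsample == 0

# Made-up rows for illustration; None stands in for a missing CSV cell.
rows = [
    {'setup': 'Why did the chicken cross the road?', 'punchline': 'To get to the other side.'},
    {'setup': 'A setup with no punchline', 'punchline': None},
    {'setup': 'What do you call a fake noodle?', 'punchline': 'An impasta.'},
    {'setup': None, 'punchline': 'A punchline with no setup'},
]

# With downsample=2, only even-indexed, well-formed rows survive.
kept = [i for i, row in enumerate(rows) if keep_row(row, i, downsample=2)]
print(kept)
```

Note that the stride is applied to the original row index, so a badly formatted row that happens to fall on the stride is simply dropped rather than replaced by a neighbor.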


<div style=background-color:#EEEEFF>

We then need to load our model, using the pre-trained version as our starting checkpoint, along with its tokenizer.  
    
We'll parse the joke setups and punchlines the same way we did before for the BERT classifier, then pass the "Question:/Answer:" format jokes through the GPT-2 tokenizer.
    
We're doing this a little differently from how we passed the data to BERT.  Because GPT-2 is such a large model, we're going to use gradient accumulation: the gradients from several small batches are summed before each optimizer step, so only a small batch has to fit in GPU memory at once.  We'll let that gradient accumulation process drive the data batching, rather than doing it ahead of time.  This means the text padding will happen at that stage, rather than here in the data-prep stage.
    
We reformat the data for PyTorch, and then take only the *input_ids* column to pass to the model.  

In [2]:
import data_tools as dtools
import model_tools as mtools

checkpoint, tokenizer, model = mtools.load_model('gpt2')    

# Tokenize the dataset

def tokenize_function(example):
    # Reformat the joke strings into the "Question: XX Answer: YY" format
    full_qa = dtools.joke_as_qa(example['setup'], example['punchline'])
    # Split the questions from the answers (these are our two sequences)
    q = [x[:x.find('Answer:')].strip() for x in full_qa]
    a = [x[x.find('Answer:'):].strip() for x in full_qa]
    # Tokenize the sequences
    #  - pad and truncate to make all the same length for happy PyTorch tensors
    output = tokenizer(q, a, padding="max_length", max_length=60, truncation=True)
    # Unmask the first pad position so the model also learns to generate the
    #     <|endoftext|> token (this relies on the pad token being <|endoftext|>)
    for am in output['attention_mask']:
        if 0 in am:
            pad_start = am.index(0)
            am[pad_start] = 1
    return output

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch")    
train_dataset = tokenized_datasets['train']['input_ids']

Using checkpoint "gpt2"


Using pad_token, but it is not set yet.
Loading cached processed dataset at /home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-818532f3475b5e35/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-6812a04ec3e6e694.arrow
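
The attention-mask adjustment in `tokenize_function` is easy to check on a toy mask. The sketch below repeats the same loop in plain Python on two made-up masks; `unmask_first_pad` is a hypothetical helper name, not part of `data_tools` or `model_tools`.

```python
def unmask_first_pad(attention_masks):
    """Set the first padded position (the first 0) of each mask to 1, in place."""
    for mask in attention_masks:
        if 0 in mask:
            mask[mask.index(0)] = 1
    return attention_masks

# Made-up masks: 1 marks a real token, 0 marks padding.
masks = [
    [1, 1, 1, 0, 0, 0],   # three real tokens, then padding
    [1, 1, 1, 1, 1, 1],   # no padding (for example, a truncated sequence)
]
result = unmask_first_pad(masks)
print(result)
```

Only one extra position per sequence is exposed; the rest of the padding stays masked, and sequences with no padding are left unchanged.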


<div style=background-color:#EEEEFF>

Finally, we pass the tokenized data to the model and do our fine-tuning training pass.  The model training loop is encapsulated in the *train_generator* function, which handles the gradient accumulation and gradient descent.  
    
We'll start by training for 3 epochs on our small test dataset.

In [3]:
model = mtools.train_generator(train_dataset, model, tokenizer, epochs=3)

Running on cuda:0
Training epoch 0
0


1028it [00:17, 59.25it/s]


Training epoch 1
tensor(9.3609, device='cuda:0', grad_fn=<NllLossBackward>)


1028it [00:17, 58.61it/s]


Training epoch 2
tensor(10.8984, device='cuda:0', grad_fn=<NllLossBackward>)


1028it [00:17, 58.19it/s]
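
The internals of `train_generator` aren't shown here, so the following is only a sketch of the general idea behind gradient accumulation, not the function's actual code. It uses a made-up one-parameter least-squares model in plain Python (no PyTorch) to show that summing gradients over small micro-batches and then taking one step gives the same update as a single large batch. The names `grad_sum` and `accumulated_step`, the data, and the learning rate are all assumptions for illustration.

```python
def grad_sum(w, batch):
    """Sum of d/dw (w*x - y)^2 over the (x, y) pairs in batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)

def accumulated_step(w, data, micro_batch_size, lr):
    """One gradient-descent step using gradients accumulated over micro-batches."""
    accumulated = 0.0
    for start in range(0, len(data), micro_batch_size):
        micro_batch = data[start:start + micro_batch_size]
        accumulated += grad_sum(w, micro_batch)   # accumulate; no weight update yet
    mean_grad = accumulated / len(data)           # average over all examples seen
    return w - lr * mean_grad                     # single update after accumulation

# Made-up data generated from y = 3x, so the ideal weight is 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

w_full = accumulated_step(0.0, data, micro_batch_size=4, lr=0.01)    # one big batch
w_accum = accumulated_step(0.0, data, micro_batch_size=1, lr=0.01)   # four micro-batches
print(w_full, w_accum)
```

In PyTorch the same pattern usually appears as several `loss.backward()` calls, which add into each parameter's `.grad`, followed by one `optimizer.step()` and `optimizer.zero_grad()`.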


<div style=background-color:#EEEEFF>

Now we're ready to train the generator on the entire dataset, and over more epochs.  Running on the full dataset takes about half an hour per epoch, so we recommend running it from the command line in a detached screen, as we have done before.

* `$> screen -S train_generator`
* `$> python fine_tune.py --train data/short_jokes_train.csv --nepochs=10`

Then press `Ctrl-a d` to detach.  You can reattach later with `screen -r train_generator`.
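
`fine_tune.py` itself is not shown in this notebook. As a rough sketch of how a script could accept the two flags used in the command above, here is a hypothetical `argparse` setup; the default value, help strings, and the `parse_args` wrapper are assumptions, and the real script may define additional options.

```python
import argparse

def parse_args(argv=None):
    """Parse the two flags used in the command above (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description='Fine-tune GPT-2 on Q/A jokes.')
    parser.add_argument('--train', required=True,
                        help='Path to the CSV file of training jokes')
    parser.add_argument('--nepochs', type=int, default=3,
                        help='Number of training epochs (this default is an assumption)')
    return parser.parse_args(argv)

# argparse accepts both "--flag value" and "--flag=value", as in the command above.
args = parse_args(['--train', 'data/short_jokes_train.csv', '--nepochs=10'])
print(args.train, args.nepochs)
```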

<div style=background-color:#EEEEFF>

When the model is done training, it gets stored in the *models/* directory.  The default filename includes identifying information about the run (base model, subset fraction, number of epochs) as well as a date stamp.
    
In [5.Performance](5.Performance.ipynb), we'll take a look at how much better (or not) the fine-tuned generator does at joke-telling compared with the original pre-trained GPT-2 generator.