<div style=background-color:#EEEEFF>

# Can you teach an AI to tell jokes?

In this tutorial we will explore what happens when we get an AI to tell jokes using Transformer models and PyTorch, how much we can improve the performance with fine-tuning, and some of the major limitations we encounter along the way.  

<div style=background-color:#EEEEFF>

We will do the following:
    
* Create a jokes dataset
* Use a pretrained Transformer model to generate punchlines
* Train a "joke classifier" to distinguish real punchlines from the "fake" Transformer-generated ones
* Fine-tune the Transformer model to improve its generated punchlines
* Use the joke classifier to measure the improvement from fine-tuning

The full version of this tutorial, including longer model training runs, can be found in [FullTutorial.ipynb](FullTutorial.ipynb)---this is a quick overview to showcase the methodology.

<div style=background-color:#EEEEFF>

## 1. The Jokes Dataset
    
For this project, we'll use a large public dataset: ["One Million Reddit Jokes"](https://www.kaggle.com/pavellexyr/one-million-reddit-jokes).
    
Jokes can take many forms.  Here, we'll limit ourselves to jokes that include a "setup" question, followed by a short "punchline" answer, and a few other requirements:
    
* Question/Answer format
* Short punchline (20 words max)
* No missing or deleted punchlines
* At least one up-vote
* Remove duplicates

This leaves a final dataset of ~150,000 short "Q/A" format jokes, which we will split into "train" and "test" sets. We'll also write out a "mini" subset of train/test jokes for quick experimentation.  (For details on *why* we made these choices, see [1.JokesDataset.ipynb](1.JokesDataset.ipynb).)
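
As a rough illustration of the filters listed above, here is a minimal sketch in plain Python. It is not the code in `clean_data`; the function names, the "setup ends in a question mark" test for the Q/A format, and the `[deleted]`/`[removed]` markers are all assumptions made for the example, and the rows are toy data.

```python
MAX_PUNCHLINE_WORDS = 20
# Assumed markers for missing or deleted punchlines (hypothetical)
DELETED_MARKERS = {"", "[deleted]", "[removed]"}

def is_short_qa_joke(setup, punchline, score):
    """Return True if a (setup, punchline, score) triple passes the filters."""
    setup = setup.strip()
    punchline = punchline.strip()
    if not setup.endswith("?"):                        # Question/Answer format
        return False
    if punchline.lower() in DELETED_MARKERS:           # missing or deleted punchline
        return False
    if len(punchline.split()) > MAX_PUNCHLINE_WORDS:   # short punchline only
        return False
    if score < 1:                                      # at least one up-vote
        return False
    return True

def filter_jokes(rows):
    """Keep rows that pass the filters, dropping case-insensitive duplicates."""
    seen = set()
    kept = []
    for setup, punchline, score in rows:
        if not is_short_qa_joke(setup, punchline, score):
            continue
        key = (setup.strip().lower(), punchline.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((setup.strip(), punchline.strip()))
    return kept

# Toy rows: (setup, punchline, score)
rows = [
    ("Why did the chicken cross the road?", "To get to the other side.", 5),
    ("why did the chicken cross the road?", "To get to the other side.", 2),  # duplicate
    ("A man walks into a bar.", "Ouch.", 10),                                 # not Q/A
    ("What's up?", "[removed]", 3),                                           # deleted
    ("Why?", " ".join(["word"] * 21), 2),                                     # too long
    ("Why not?", "Because.", 0),                                              # no up-votes
]
kept = filter_jokes(rows)
print(kept)
```

Only the first row survives: the others fail one filter each.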
    
Let's make our jokes dataset:

In [None]:
file_path = '/opt/tljh/user/share/nlp_punchlines_data/one-million-reddit-jokes.csv'

from clean_data import clean_data
clean_data(file=file_path)

print('Done.')

<div style=background-color:#EEEEFF>

## 2. "Fake" Punchlines, generated by an AI

Let's see if an existing, pre-trained Transformer model can tell jokes.  (See [2.FakePunchlines.ipynb](2.FakePunchlines.ipynb) for the detailed version.)

<div style=background-color:#EEEEFF>

We'll use GPT-2, the freely-available forerunner of the recent GPT-3 text-generation model that generated tons of press last year. GPT-2 and GPT-3 are both *auto-regressive* models, which are optimized for generating text.  They take a prompt and generate additional new text that "continues" the thread.

We'll format our jokes as:
    
> "Question: [joke setup, ends in '?'] Answer: [joke punchline]"
    
We load the GPT-2 model and its associated tokenizer (which converts text into numeric model input) and pass it a "prompt", in the form 
    
> "Question: [joke setup] Answer:" 
    
GPT-2 will then generate the continuing text that should come after "Answer:".  
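
The prompt format above can be sketched with a few string helpers. These helper names are hypothetical and are not taken from `fake_punchlines.py`; the sketch only shows how a setup becomes a prompt and how the continuation could be separated from it, assuming the generated text starts with the prompt.

```python
def make_prompt(setup):
    """Build the generation prompt: 'Question: <setup> Answer:'."""
    return "Question: {} Answer:".format(setup.strip())

def format_joke(setup, punchline):
    """Build a full joke string in the same format."""
    return "{} {}".format(make_prompt(setup), punchline.strip())

def extract_punchline(generated_text, prompt):
    """Return only the text that follows the prompt, if the prompt is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

setup = "Why did the chicken cross the road?"
prompt = make_prompt(setup)
full_text = format_joke(setup, "To get to the other side.")
print(prompt)
print(extract_punchline(full_text, prompt))
```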
    
The code checks whether a GPU is available and uses it if so.  The GPU gave a 6.5x speedup over the CPU in our tests, but it will still take hours to generate fake punchlines for our full dataset of 140,000+ jokes.

Let's do a small run with our "mini" dataset of 300 jokes (split 70%/30% into "minitrain" and "minitest") and take a look (takes ~2 minutes with GPU):

In [None]:
from fake_punchlines import add_fake_punchlines
add_fake_punchlines('data/short_jokes_minitrain.csv')
add_fake_punchlines('data/short_jokes_minitest.csv')
print('Done.')

In [None]:
# Take a look at examples of the real jokes and the fake punchlines we just created
import pandas as pd
mini_jokes = pd.read_csv('data/short_jokes_minitest.csv')
mini_fakes = pd.read_csv('data/short_jokes_minitest_fake.csv')
for i in [0,1,4]:
    print('      Question: "{}"'.format(mini_jokes.iloc[i]['setup']))
    print('Real Punchline: "{}"'.format(mini_jokes.iloc[i]['punchline']))
    print('GPT2 Punchline: "{}"'.format(mini_fakes.iloc[i]['punchline']))
    print('----')

<div style=background-color:#EEEEFF>
    
A few interesting things to note:

* Responses are on-topic and sound (mostly) like coherent English.  This is what GPT-2 is good at!
* Responses ramble on and cut off arbitrarily.  We set a 30-token limit if no end-of-string (EOS) token is received; an EOS token is basically *never* generated.  GPT-2 is not good at knowing when to shut up!
* GPT-2 often answers questions with more questions (although structuring our prompts with explicit "Question:/Answer:" format seems to have helped a lot compared to my previous tests...)

<div style=background-color:#EEEEFF>

Generating punchlines takes ~30 seconds per 100 jokes, even with the GPU.
    
This means generating fake punchlines for the full dataset of ~150,000 jokes will take roughly 12 hours!
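
As a rough check on that estimate, using the rounded figures quoted above (30 seconds per 100 jokes, 150,000 jokes):

```python
seconds_per_100_jokes = 30   # measured rate quoted above (approximate)
n_jokes = 150_000            # approximate size of the full dataset

total_seconds = n_jokes / 100 * seconds_per_100_jokes
total_hours = total_seconds / 3600
print("{:.1f} hours".format(total_hours))  # prints "12.5 hours"
```

Since both inputs are approximate, the result is only good to the nearest hour or so.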

Here, we'll stick with our quick 300-joke training set for illustration purposes.  Instructions for doing a full model training, like the one used to power jokes.cloudburst.host, are in the [Full Tutorial](FullTutorial.ipynb).

<div style=background-color:#EEEEFF>

## 3. A Classifier to recognize "real" and "fake" punchlines
    
GPT-2 straight out of the box has trouble telling jokes: she tends to ramble on and on, she often answers questions with more questions, and her jokes aren't very funny.  

She can't fool a human into thinking that she's a real, human comedian.  But can GPT-2 fool another AI?
    
Let's find out, by training a classifier to distinguish "real" from "fake" jokes!

<div style=background-color:#EEEEFF>

For this exercise, we'll use a different type of Transformer model: an *auto-encoding* model called BERT.  The process of training and testing BERT is implemented in the *train_punchline_classifier()* function in [punchline_classifier.py](punchline_classifier.py).  
    
We need to train BERT on a joke dataset that includes both real jokes and fake jokes.  We also need a test set of jokes to quantify how well he does.
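
Conceptually, this means combining the real and fake jokes into one labeled dataset. Here is a minimal sketch, assuming the label convention 1 = "real" and 0 = "fake"; the helper name is hypothetical and this is not the code in `punchline_classifier.py`.

```python
import random

def build_labeled_examples(real_jokes, fake_jokes, seed=0):
    """Label real jokes 1 and fake jokes 0, then shuffle them together."""
    examples = [(text, 1) for text in real_jokes]
    examples += [(text, 0) for text in fake_jokes]
    random.Random(seed).shuffle(examples)  # seeded so the order is reproducible
    return examples

# Toy inputs, for illustration only
real = ["Question: Why? Answer: Because.", "Question: Who? Answer: Me."]
fake = ["Question: Why? Answer: Why not? Question:", "Question: Who? Answer: The"]
examples = build_labeled_examples(real, fake)
for text, label in examples:
    print(label, text)
```

Shuffling matters because the real and fake jokes come from separate files; without it, the classifier would see all of one class before any of the other.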
    
It takes several hours to train BERT on the full dataset.  Here, we'll just use the small 300-joke dataset we created earlier.  Instructions for training on the full dataset are in the [Full Tutorial](FullTutorial.ipynb).

In [None]:
from punchline_classifier import train_punchline_classifier

# 210 "real" and 210 "fake" jokes in our training set
train_files = ['data/short_jokes_minitrain.csv','data/short_jokes_minitrain_fake.csv']
# 90 "real" and 90 "fake" jokes in our test set
test_files = ['data/short_jokes_minitest.csv','data/short_jokes_minitest_fake.csv']

# Set downsample=1 or leave out to train on the full training set (it defaults to 1)
model = train_punchline_classifier(train_files, test_files)

print('Done.')

<div style=background-color:#EEEEFF>

Notice that BERT is able to achieve 95%+ accuracy, even just training on a small subset of the available training data.
    
So the answer is "No", out-of-the-box GPT-2 cannot fool BERT with her joke-telling abilities.
    
But what if we train her *specifically to tell jokes*?  Can she get better at it?

<div style=background-color:#EEEEFF>

## 4. Fine-Tuning a Text Generator 
    
Next, we'll use our training set of short Q/A-style jokes to do some "fine-tuning" on the GPT-2 generator model.  This process makes use of the pre-trained GPT-2 ability to generate realistic English language text, but then trains a few more neural network layers to specifically generate the kind of text we're looking for.  
    
This fine-tuning step is implemented in [fine_tune.py](fine_tune.py).  Our "minitrain" set alone doesn't contain enough data to show any improvement from fine-tuning, so instead we'll use our full training set, downsampled by a factor of 10.  This runs in about 10 minutes on a GPU and shows significant improvement over the un-tuned model.  

In [None]:
from fine_tune import fine_tune

fine_tune(train_files='data/short_jokes_train.csv',
          use_model='gpt2', downsample=10, nepochs=3)

print('Done.')

<div style=background-color:#EEEEFF>

## 5. How well does our fine-tuned Joke Generator perform?
    
Okay, so now we've done a round of fine-tuning on the joke generator.  Let's see if it performs any better than vanilla out-of-the-box GPT-2.  We'll do this two ways (see [Full Tutorial](FullTutorial.ipynb) for additional performance metrics and an in-depth analysis): 
    
- By passing our AI-generated jokes to the Punchline Classifier to see if it can fool the classifier
- By playing around with the joke-generator to see if it can make us laugh!

In [None]:
import pandas as pd
mini_jokes = pd.read_csv('data/short_jokes_minitest.csv')
mini_fakes = pd.read_csv('data/short_jokes_minitest_fake.csv')

from test_generator import load_all_models, generate_punchlines, get_class_predictions

# These are the models we trained.  Feel free to substitute your own when you have created some!
generator_filename = 'models/JokeGen_gpt2.pt'
classifier_filename = 'models/ClassifyJokes_bert.pt'

# Load the models
load_all_models(generator_filename=generator_filename, classifier_filename=classifier_filename)
# Generate punchlines using vanilla GPT-2 and our fine-tuned version
generated_gpt2, generated_ft = generate_punchlines(mini_jokes)
# Get predictions (1="real" joke, 0="fake" joke)
p_human, p_gpt2, p_ft = get_class_predictions(mini_jokes, generated_gpt2, generated_ft)

<div style=background-color:#EEEEFF>

Almost all (>97%) human-generated jokes are recognized as being "real" jokes. In contrast, almost none of the jokes generated by out-of-the-box GPT-2 can convince the classifier they are "real".

The fine-tuned model does better, convincing the classifier that it has produced a "real" joke about 25-30% of the time.  This increases to ~40% if you take the time to fine-tune with the full training dataset over more epochs (recall we only trained with 1/10th of the training set for 3 epochs here).
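
The percentages above are simply the fraction of punchlines that the classifier labels "real". Here is a minimal sketch of that calculation, assuming the predictions are sequences of 0/1 labels (1 = "real"); the actual return type of `get_class_predictions` is not shown here, and the values below are toy data, not results.

```python
def fraction_real(predictions):
    """Fraction of predictions equal to 1 ("real")."""
    predictions = list(predictions)
    if not predictions:
        raise ValueError("no predictions given")
    return sum(1 for p in predictions if p == 1) / len(predictions)

# Toy predictions, for illustration only
toy_predictions = [1, 0, 0, 1]
rate = fraction_real(toy_predictions)
print("{:.0%} classified as real".format(rate))
```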

<div style=background-color:#EEEEFF>

Both out-of-the-box GPT-2 and our fine-tuned model have a tendency to produce multiple "Question:... Answer:..." sequences in the punchline (see [5.Performance](5.Performance.ipynb) for details).  For example, vanilla GPT-2 produced this punchline in one of our runs:

- Question: "Did you know Google now has a platform for recording your bowel movements?"
- Answer: "Google+ Question: Does the internet only allow you to search for words on the internet? Answer: Yes. Answer: The internet"
    
This would make more sense if we truncated the punchline after its first "Answer", before it starts asking more questions and supplying more answers, i.e.:
    
- Question: Did you know Google now has a platform for recording your bowel movements? 
- Answer: Google+ 
    
If we "cheat" and clean up the generated output by truncating it before any redundant "Question:" or "Answer:" sequences, can we do a better job of fooling the classifier?

In [None]:
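def strip_extras(text):
    """Truncate generated text at any repeated 'Question:' or 'Answer:' marker."""
    text = text.replace('\n','')
    # Cut at the last 'Question:' until only the first one remains
    while text.count('Question:') > 1:
        text = text[:text.rfind('Question:')]
    # Likewise, cut at the last 'Answer:' until only the first one remains
    while text.count('Answer:') > 1:
        text = text[:text.rfind('Answer:')]
    return text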

p_human, p_gpt2, p_ft = get_class_predictions(mini_jokes, 
                                              [strip_extras(x) for x in generated_gpt2], 
                                              [strip_extras(x) for x in generated_ft])

<div style=background-color:#EEEEFF>

That clearly helped!  Now >50% of the punchlines generated by the fine-tuned model can fool the classifier (while vanilla GPT-2 is still struggling, with a hit rate below 10%).  
    
Now let's bundle up the fine-tuned generator to tell a joke.  We'll do the cleaning process on the result, and then we'll pass it through the classifier.  If the classifier thinks it's a real joke, we'll display the results.  Otherwise, we'll generate a new punchline and keep trying until we get one that can fool the classifier.
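
The generate-clean-classify loop described above can be sketched as follows. This is not the implementation of `tell_a_joke`; the function name, the callables passed in, and the `max_attempts` cap (added so the loop always terminates) are assumptions for illustration, and the stand-ins at the bottom replace the real generator and classifier.

```python
def tell_joke_with_retries(setup, generate, clean, is_real, max_attempts=10):
    """Generate punchlines until the classifier accepts one.

    generate(setup) -> raw punchline text
    clean(text)     -> cleaned punchline text
    is_real(setup, punchline) -> True if the classifier calls it "real"
    Returns (punchline, attempts), or (None, max_attempts) if none is accepted.
    """
    for attempt in range(1, max_attempts + 1):
        punchline = clean(generate(setup))
        if is_real(setup, punchline):
            return punchline, attempt
    return None, max_attempts

# Stand-ins for the real generator and classifier (toy behavior only)
candidates = iter(["Question: why? Answer: because", " To get to the other side. "])
punchline, attempts = tell_joke_with_retries(
    "Why did the chicken cross the road?",
    generate=lambda setup: next(candidates),
    clean=str.strip,
    is_real=lambda setup, p: "Question:" not in p,
)
print(punchline, attempts)
```

With these stand-ins the first candidate is rejected and the second is accepted, so the loop stops after two attempts.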

In [None]:
from test_generator import tell_a_joke

setup = "How many nerds does it take to change a lightbulb?"

joke = tell_a_joke(setup)

<div style=background-color:#EEEEFF>

Try running the same joke setup several times!  Usually, the joke generator will come up with something that is approved by the classifier within 1-3 attempts.  
    
Now try some different joke setups.

In [None]:
setup = "Why did the chicken cross the road?"

joke = tell_a_joke(setup)

<div style=background-color:#EEEEFF>

It isn't always funny, but at least these punchlines sound like they could be punchlines.  It's almost like your 5-year-old kid is coming up with them...