<div style=background-color:#EEEEFF>

# Can you teach an AI to tell jokes?

In this tutorial we will explore what happens when we get an AI to tell jokes using Transformer models and PyTorch, how much we can improve the performance with fine-tuning, and some of the major limitations we encounter along the way.  

<div style=background-color:#EEEEFF>

This tutorial will walk through the following:

* [1.JokesDataset](1.JokesDataset.ipynb): Create a dataset of jokes to use for experimentation and model training
* [2.FakePunchlines](2.FakePunchlines.ipynb): Use a pretrained Transformer model to generate punchlines
* [3.PunchlineClassifier](3.PunchlineClassifier.ipynb): Train a "joke classifier" to tell the difference between "real" human-generated punchlines and "fake" Transformer-generated punchlines
* [4.FineTune](4.FineTune.ipynb): Use our jokes dataset to fine-tune the Transformer models and improve the punchlines they generate
* [5.Performance](5.Performance.ipynb): Use our joke classifier to quantify the improved performance we got from fine-tuning
    
Each step has its own Jupyter Notebook, so you can walk through the details of each processing step if you desire.  This notebook walks through a high-level view of the problem, with each step implemented as a single command.  

<div style=background-color:#EEEEFF>

NOTE: Some commands may take several hours to run. These commands are better run from the command line using `screen` or `tmux`, so that they can run as a detached background process that does not require a persistent internet connection.  You can get a terminal window by selecting **File&rarr;New&rarr;Terminal**.  You can find documentation for `screen` [here](https://linuxize.com/post/how-to-use-linux-screen/) and for `tmux` [here](https://linuxize.com/post/getting-started-with-tmux/).  Both are already installed on Cloudburst.

<div style=background-color:#EEEEFF>

## 1. The Jokes Dataset
    
For this project, we'll use a large, publicly available dataset of jokes: the "One Million Reddit Jokes" dataset, which covers jokes from the /r/jokes subreddit from April 1, 2020 and earlier. The jokes dataset is provided here in `./data/one-million-reddit-jokes.csv`. You can also download the jokes dataset directly from Kaggle [here](https://www.kaggle.com/pavellexyr/one-million-reddit-jokes).

<div style=background-color:#EEEEFF>

Jokes can take many forms; some are just narrative stories with amusing or surprising endings. Some are "My life" (as in, "my life is a joke", which appears 9 separate times in this dataset).
    
For this project, we'll restrict ourselves to a narrow, semi-formulaic set of jokes that include a "setup" question, followed by a "punchline" answer, and we'll require the punchlines to be rather short: 20 words or fewer.  
    
We also want to clean up the data to make sure we have a set of "real" jokes, removing jokes that have subsequently been deleted or removed and whose punchlines are therefore missing, and those that did not get a single upvote.  Finally, we want to remove duplicate jokes, while summing the upvotes across every instance of a unique joke.  
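
As a rough illustration of the filtering rules above, here is a minimal pure-Python sketch. It is not the code in `clean_data.py`: the field names (`title`, `selftext`, `score`), the `[deleted]`/`[removed]` placeholders, and the helper `clean_jokes` are all assumptions made for the example.

```python
from collections import defaultdict

# Placeholder strings assumed to mark jokes whose body is gone.
MISSING = {"", "[deleted]", "[removed]"}
MAX_PUNCHLINE_WORDS = 20


def clean_jokes(rows):
    """Keep short Q/A jokes and merge duplicates, summing their scores.

    `rows` is an iterable of dicts with assumed keys 'title' (setup),
    'selftext' (punchline) and 'score' (upvotes).
    """
    totals = defaultdict(int)
    for row in rows:
        setup = row["title"].strip()
        punchline = row["selftext"].strip()
        if not setup.endswith("?"):      # setup must be a question
            continue
        if punchline in MISSING:         # deleted or removed post
            continue
        if len(punchline.split()) > MAX_PUNCHLINE_WORDS:
            continue
        if row["score"] < 1:             # never upvoted
            continue
        totals[(setup, punchline)] += row["score"]
    return [
        {"setup": s, "punchline": p, "score": score}
        for (s, p), score in totals.items()
    ]


# Toy rows, invented purely to exercise the function.
raw = [
    {"title": "Why did the chicken cross the road?",
     "selftext": "To get to the other side.", "score": 3},
    {"title": "Why did the chicken cross the road?",
     "selftext": "To get to the other side.", "score": 2},
    {"title": "Why?", "selftext": "[removed]", "score": 10},
    {"title": "A long story", "selftext": "No question here.", "score": 5},
]
cleaned = clean_jokes(raw)
```

Here the two identical chicken jokes collapse into one entry with their scores added together, while the removed post and the non-question post are dropped.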

<div style=background-color:#EEEEFF>

This leaves us with a final dataset of ~150,000 short "Q/A" format jokes.
    
We will split this dataset into "train" and "test" sets, and also copy a small subset of the "test" jokes into a "minitest" dataset that we can use for development purposes.
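
One way such a split could look is sketched below. The 20% test fraction, the fixed seed, and the helper name `split_jokes` are assumptions for illustration; the 300-joke size matches the "minitest" set used later in this notebook.

```python
import random


def split_jokes(jokes, test_fraction=0.2, minitest_size=300, seed=0):
    """Shuffle and split jokes into train, test, and a small 'minitest' copy."""
    shuffled = list(jokes)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    minitest = test[:minitest_size]         # copied from test, not removed from it
    return train, test, minitest


# Integers stand in for jokes so the example runs without any data files.
train, test, minitest = split_jokes(range(2000))
```

Because "minitest" is a copy of part of "test", it never overlaps with the training data.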
    
If you're interested in the details of *why* we made these choices, or how we do the data cleaning, see [1_JokesDataset.ipynb](1_JokesDataset.ipynb).

<div style=background-color:#EEEEFF>

You can run this functionality here in the Notebook by executing the following cell:

In [None]:
from clean_data import clean_data
clean_data(file='data/one-million-reddit-jokes.csv')

<div style=background-color:#EEEEFF>

Alternatively, you can run the cleaning script in the terminal from the command line with:

<div style=background-color:#EEEEFF>

## 2. "Fake" Punchlines, generated by an AI

Before we try to train an AI to tell jokes, we should check what can be done out-of-the-box with existing, pre-trained Transformer models.  We'll use GPT-2, which is the freely available forerunner of the recent GPT-3 text-generation model that generated tons of press last year. 
    
GPT-2 and GPT-3 are both *auto-regressive* models, which use only the "decoder" part of the Transformer neural network architecture and are optimized for generating text.  They are built to take a text prompt and generate additional new text that "continues" the thread.

Given a joke setup, can GPT-2 produce a plausible punchline? And is it funny?
    
A more detailed walk-through of this step can be found in [2_FakePunchlines.ipynb](2_FakePunchlines.ipynb).

<div style=background-color:#EEEEFF>

Because GPT-2 is trained to be a general-purpose text generator, and not necessarily to answer questions or provide punchlines, we give it a few in-text clues to help it recognize the Q/A joke format we are trying to produce, so that each joke is formatted as:
    
> "Question: [joke setup, ends in '?'] Answer: [joke punchline]"
    
We then load the GPT-2 model and its associated tokenizer, which will convert text strings into numeric input for the model.  We pass it a "prompt", in the form 
    
> "Question: [joke setup] Answer:" 
    
and ask it to generate the continuing text that should come after "Answer:".  
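
The two string formats above can be sketched with a few small helpers. These function names are hypothetical and are not taken from `fake_punchlines.py`; the sketch also assumes the generator returns the prompt followed by its continuation, so the prompt has to be stripped off to recover the punchline.

```python
def format_joke(setup, punchline):
    """Full 'Question: ... Answer: ...' string for one joke."""
    return "Question: {} Answer: {}".format(setup.strip(), punchline.strip())


def make_prompt(setup):
    """Prompt handed to the generator; the model continues after 'Answer:'."""
    return "Question: {} Answer:".format(setup.strip())


def extract_punchline(prompt, generated_text):
    """Drop the echoed prompt, keeping only the newly generated continuation."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = make_prompt("Why did the chicken cross the road?")
# A made-up continuation, standing in for real model output.
punchline = extract_punchline(prompt, prompt + " To get to the other side.")
```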

<div style=background-color:#EEEEFF>

The code checks whether a GPU is available and uses it if so.  The GPU gives a 6.5x speedup over the CPU in our tests, but it will still take hours to generate fake punchlines for our full dataset of 140,000+ jokes.

Let's do a small run with our "minitest" dataset and take a look:

In [None]:
# This runs locally in the Notebook on our small 300-joke "mini-test" set.  

from fake_punchlines import add_fake_punchlines

add_fake_punchlines('data/short_jokes_minitest.csv')

In [None]:
# Take a look at examples of the real jokes and the fake punchlines we just created
import pandas as pd
mini_jokes = pd.read_csv('data/short_jokes_minitest.csv')
mini_fakes = pd.read_csv('data/short_jokes_minitest_fake.csv')
for i in range(3):
    print('Question: {}'.format(mini_jokes.iloc[i]['setup']))
    print('  Answer: {}'.format(mini_jokes.iloc[i]['punchline']))
    print('    Fake: {}'.format(mini_fakes.iloc[i]['punchline']))
    print('----')

<div style=background-color:#EEEEFF>
    
There are a few interesting things to note here.

* Responses are generally on-topic and sound (mostly) like coherent English.  This is what GPT-2 is good at!
* The responses just ramble on and cut off arbitrarily.  We set a 30-token limit if no end-of-string (EOS) token is received; an EOS token is basically *never* generated.  GPT-2 is not good at knowing when to shut up!
* GPT-2 often answers questions with more questions (although structuring our prompts with explicit "Question:/Answer:" format seems to have helped a lot compared to my previous tests...)

<div style=background-color:#EEEEFF>
    
It takes about 30 seconds to generate punchlines for a batch of 100 jokes (your mileage may vary somewhat).  
    
In the next step, we'll train a joke classifier to distinguish between real and fake jokes, but to do that, we'll need to generate a big dataset of fake punchlines.  Based on our small batch example, generating fake punchlines for the full dataset will take almost 12 hours!
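
The estimate follows directly from the batch timing. A quick back-of-the-envelope check, taking roughly 140,000 jokes as the size of the full dataset:

```python
seconds_per_batch = 30   # measured above for a batch of 100 jokes
batch_size = 100
n_jokes = 140_000        # approximate size of the full dataset

total_seconds = n_jokes / batch_size * seconds_per_batch
total_hours = total_seconds / 3600
print("Estimated run time: {:.1f} hours".format(total_hours))
```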
    
If you have a stable internet connection and can leave your laptop open, you can run them right here in the notebook and hope that you don't get disconnected.

However, a better choice is to run them from the terminal, using screen or tmux to background the process. That way, you launch the run, close your laptop, and walk away. The process will run overnight and the output will be waiting for you when you get up in the morning.

To run in the background:
* Open a terminal
* Enter `screen`
* Run the following command:
```
    python fake_punchlines.py data/short_jokes_train.csv
```
* Detach from the `screen`.
    
Then do the same for `data/short_jokes_test.csv`.
    
More detailed instructions on how to do this can be found in the [FakePunchlines Notebook](2_FakePunchlines.ipynb).

<div style=background-color:#EEEEFF>

## 3. A Classifier to recognize "real" and "fake" punchlines
    
We've seen that GPT-2 straight out of the box has some trouble telling jokes.  For one thing, she tends to ramble on and on, much longer than typical Q/A joke punchlines.  She also often answers questions with questions, much more so than actual punchlines do.  If the jokes make sense at all, they usually aren't very funny.  
    
In short, GPT-2 can't fool a human into thinking that she's a real, human comedian.
    
But can GPT-2 fool another AI?

<div style=background-color:#EEEEFF>

Let's find out, by training a classifier to distinguish "real" from "fake" jokes.  
    
For this exercise, we'll use a different type of Transformer model: an *auto-encoding* model.  Models of this type use only the "encoder" part of the Transformer neural network architecture, and are optimized for making sense of a text string (including classifying it).  The particular auto-encoder we use here is called BERT.  
    
Can GPT-2 fool BERT into thinking her jokes are real?  Or will BERT be able to tell the difference between her jokes and those from a human(?) comedian on Reddit?

<div style=background-color:#EEEEFF>

The process of training and testing BERT to distinguish between real and fake jokes is implemented here in the *train_punchline_classifier()* function in [punchline_classifier.py](punchline_classifier.py).  
    
We need to give him a training set of jokes, including both real jokes and fake jokes, for him to use to train the classifier.  We also give him a test set of jokes, including both real and fake, so that he can quantify how well he is able to do.
    
It takes several hours to train BERT on the full dataset, so we've made it easy to "downsample" the dataset by a large factor (e.g., 20x).  Even using this small subset of the training data, BERT is able to achieve good quality differentiation between the real and fake jokes.  Training on the full dataset only improves the results by a small amount.  
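
To make the idea concrete, here is a minimal sketch of combining real and fake jokes into one labeled set and downsampling it. The helper `build_labeled_set` is hypothetical, and keeping every 20th joke with a strided slice is an assumption; the actual code may sample differently. Labels follow the convention used later in this notebook (1 = "real", 0 = "fake").

```python
def build_labeled_set(real_jokes, fake_jokes, downsample=1):
    """Label real jokes 1 and fake jokes 0, keeping every `downsample`-th joke."""
    labeled = [(text, 1) for text in real_jokes[::downsample]]
    labeled += [(text, 0) for text in fake_jokes[::downsample]]
    return labeled


# Placeholder strings stand in for joke text.
real = ["real joke {}".format(i) for i in range(100)]
fake = ["fake joke {}".format(i) for i in range(100)]
small_set = build_labeled_set(real, fake, downsample=20)
```

Downsampling both sources by the same factor keeps the real/fake classes balanced.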
    
If you want to train on the full dataset, we recommend doing it from the command line using *screen*, as explained in [3.PunchlineClassifier](3.PunchlineClassifier.ipynb).  
    
For now, let's just train on 1/20th of the data, right here in the notebook (should take < 10 minutes).

In [None]:
from punchline_classifier import train_punchline_classifier

train_files = ['data/short_jokes_train.csv','data/short_jokes_train_fake.csv']
test_files = ['data/short_jokes_test.csv','data/short_jokes_test_fake.csv']

# Set downsample=1 or leave out to train on the full training set (it defaults to 1)
model = train_punchline_classifier(train_files, test_files, downsample=20)  

<div style=background-color:#EEEEFF>

Notice that BERT is able to achieve 97%+ accuracy, even just training on a small subset of the available training data.
    
So the answer is "No", out-of-the-box GPT-2 cannot fool BERT with her joke-telling abilities.
    
But what if we train her *specifically to tell jokes*?  Can she get better at it?

<div style=background-color:#EEEEFF>

## 4. Fine-Tuning a Text Generator 
    
Next, we'll use our training set of short Q/A-style jokes to do some "fine-tuning" on the GPT-2 generator model.  This process makes use of the pre-trained GPT-2 ability to generate realistic English language text, but then trains a few more neural network layers to specifically generate the kind of text we're looking for.  
    
This fine-tuning model training is implemented in *fine_tune.py*.  As an example, we'll run it on 1/100th of our joke training set and train for 3 epochs, just to get the code working quickly.  

If you are interested, you can see more details about setting up the dataset and models for this round of training in [4.FineTune](4.FineTune.ipynb).

In [None]:
from fine_tune import fine_tune

fine_tune(train_files='data/short_jokes_train.csv',
          use_model='gpt2', downsample=100, nepochs=3)

<div style=background-color:#EEEEFF>

We recommend training on the full dataset for about 10 epochs, which should be done in *screen* from the command line:
    
* `$> screen -S train_generator`
* `$> python fine_tune.py --train data/short_jokes_train.csv --nepochs=10`

Then "Ctrl-a d" to detach.

<div style=background-color:#EEEEFF>

## 5. How well does our fine-tuned Joke Generator perform?
    
Okay, so now we've done a round of fine-tuning on the joke generator.  Let's see if it performs any better than vanilla out-of-the-box GPT-2.  We'll do this two ways (see [5.Performance](5.Performance.ipynb) for additional performance metrics and an in-depth analysis): 
    
- By passing our AI-generated jokes to the Punchline Classifier to see if it can fool the classifier
- By playing around with the joke-generator to see if it can make us laugh!

In [None]:
from test_generator import load_all_models, generate_punchlines, get_class_predictions

# These are the models we trained.  Feel free to substitute your own when you have created some!
generator_filename = 'models/JokeGen_gpt2_1.00subset_10epochs_2022-01-07.pt'
classifier_filename = 'models/ClassifyJokes_bert_1.00subset_2021-12-16.pt'

# Load the models
load_all_models(generator_filename=generator_filename, classifier_filename=classifier_filename)
# Generate punchlines using vanilla GPT-2 and our fine-tuned version
generated_gpt2, generated_ft = generate_punchlines(mini_jokes)
# Get predictions (1="real" joke, 0="fake" joke)
p_human, p_gpt2, p_ft = get_class_predictions(mini_jokes, generated_gpt2, generated_ft)

<div style=background-color:#EEEEFF>

In our runs, almost all (>97%) of the human-generated jokes are recognized as being "real" jokes. In contrast, almost none of the jokes generated by out-of-the-box GPT-2 can convince the classifier they are "real".

The fine-tuned model does better, convincing the classifier that it has produced a "real" joke about 35-40% of the time.
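
A percentage like these can be computed as the share of jokes the classifier labels "real". The sketch below assumes the predictions are sequences of 0/1 labels (1 = "real", 0 = "fake"); `fraction_real` is a hypothetical helper and the input list is a toy example, not output from our models.

```python
def fraction_real(predictions):
    """Share of jokes the classifier labeled as real (1 = real, 0 = fake)."""
    predictions = list(predictions)
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == 1) / len(predictions)


# Toy labels for illustration only.
toy_predictions = [1, 1, 0, 1]
toy_rate = fraction_real(toy_predictions)
```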

It turns out that both out-of-the-box GPT-2 and our Fine-tuned model have a tendency to produce multiple "Question:... Answer:..." sequences in the punchline (see [5.Performance](5.Performance.ipynb) for details).  For example, vanilla GPT-2 produced this punchline in one of our runs:

- Question: Did you know Google now has a platform for recording your bowel movements? 
- Answer: Google+ Question: Does the internet only allow you to search for words on the internet? Answer: Yes. Answer: The internet
    
This would make more sense if we truncated the punchline after its first "Answer", before it starts asking more questions and supplying more answers, i.e.:
    
- Question: Did you know Google now has a platform for recording your bowel movements? 
- Answer: Google+ 
    
If we "cheat" and clean up the generated output by truncating it before any redundant "Question:" or "Answer:" sequences, can we do a better job of fooling the classifier?

In [None]:
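def strip_extras(text):
    """Truncate generated text before any repeated 'Question:' or 'Answer:'."""
    text = text.replace('\n','')
    # While more than one 'Question:' is present, drop everything from the last one onward
    while text.count('Question:') > 1:
        text = text[:text.rfind('Question:')]
    # Do the same for repeated 'Answer:' markers
    while text.count('Answer:') > 1:
        text = text[:text.rfind('Answer:')]
    return text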

p_human, p_gpt2, p_ft = get_class_predictions(mini_jokes, 
                                              [strip_extras(x) for x in generated_gpt2], 
                                              [strip_extras(x) for x in generated_ft])

<div style=background-color:#EEEEFF>

That clearly helped!  Now 60-70% of the punchlines generated by the fine-tuned model can fool the classifier (while vanilla GPT-2 is still struggling, with a hit rate below 10%).  
    
Now let's bundle up the fine-tuned generator to tell a joke.  We'll do the cleaning process on the result, and then we'll pass it through the classifier.  If the classifier thinks it's a real joke, we'll display the results.  Otherwise, we'll generate a new punchline and keep trying until we get one that can fool the classifier.
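
The retry loop described above can be sketched as follows. This is not the implementation of `tell_a_joke`: the function name, the three callables, and the `max_attempts` cap (added so the loop cannot run forever) are assumptions for illustration, and toy stand-ins replace the real generator and classifier.

```python
def tell_joke_with_retries(setup, generate, clean, is_real, max_attempts=10):
    """Generate punchlines until the classifier accepts one.

    `generate`, `clean` and `is_real` are callables supplied by the caller.
    Returns (punchline, attempts), or (None, max_attempts) if none passed.
    """
    for attempt in range(1, max_attempts + 1):
        punchline = clean(generate(setup))
        if is_real(setup, punchline):
            return punchline, attempt
    return None, max_attempts


# Toy stand-ins so the loop runs without any trained models.
drafts = iter(["What? Answer: No.", " To get to the other side. "])
punchline, attempts = tell_joke_with_retries(
    "Why did the chicken cross the road?",
    generate=lambda setup: next(drafts),           # pretend generator
    clean=str.strip,                               # pretend cleaning step
    is_real=lambda setup, p: "Answer:" not in p,   # pretend classifier
)
```

In this toy run the first draft is rejected and the second is accepted, so the loop stops after two attempts.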

In [None]:
from test_performance import tell_a_joke

setup = "How many nerds does it take to change a lightbulb?"

joke = tell_a_joke(setup)

<div style=background-color:#EEEEFF>

Try running the same joke setup several times!  Usually, the joke generator will come up with something that is approved by the classifier within 1-3 attempts.  
    
Now try some different joke setups.

In [None]:
setup = "Why did the chicken cross the road?"

joke = tell_a_joke(setup)

<div style=background-color:#EEEEFF>

It isn't always funny, but at least these punchlines sound like they could be punchlines.  It's almost like your 5-year-old kid is coming up with them...