<div style=background-color:#EEEEFF>

# Can you teach an AI to tell jokes?

In this tutorial we will explore what happens when we get an AI to tell jokes using Transformer models and PyTorch, how much we can improve the performance with fine-tuning, and some of the major limitations we encounter along the way.  

<div style=background-color:#EEEEFF>

This tutorial will walk through the following:

* [1.JokesDataset](1.JokesDataset.ipynb): Create a dataset of jokes to use for experimentation and model training
* [2.FakePunchlines](2.FakePunchlines.ipynb): Use pretrained Transformer models to generate punchlines
* [3.PunchlineClassifier](3.PunchlineClassifier.ipynb): Train a "joke classifier" to tell the difference between "real" human-generated punchlines and "fake" Transformer-generated punchlines
* [4.FineTune](4.FineTune.ipynb): Use our jokes dataset to fine-tune the Transformer models and improve the punchlines they generate
* [5.Performance](5.Performance.ipynb): Use our joke classifier to quantify the improved performance we got from fine-tuning
* [6.NLPChallenges](6.NLPChallenges.ipynb): Examine some of the problems in current NLP models, as highlighted by this joke-telling exercise
    
Each step has its own Jupyter Notebook, so you can walk through the details of each processing step if you desire.  This notebook walks through a high-level view of the problem, with each step implemented as a single command.  

<div style=background-color:#EEEEFF>

NOTE: Some commands may take several hours to run. These are better run from the command line using `screen` or `tmux`, so that they run as a detached background process that does not require a persistent internet connection.  You can get a terminal window by selecting **File&rarr;New&rarr;Terminal**.  You can find documentation for `screen` [here](https://linuxize.com/post/how-to-use-linux-screen/) and for `tmux` [here](https://linuxize.com/post/getting-started-with-tmux/).  Both are already installed on Cloudburst.

<div style=background-color:#EEEEFF>

## 1. The Jokes Dataset
    
For this project, we'll use a large, publicly available dataset of jokes: the "One Million Reddit Jokes" dataset, which covers jokes from the /r/jokes subreddit from April 1, 2020 and earlier. The jokes dataset is provided here in `./data/one-million-reddit-jokes.csv`. You can also download the jokes dataset directly from Kaggle [here](https://www.kaggle.com/pavellexyr/one-million-reddit-jokes).

<div style=background-color:#EEEEFF>

Jokes can take many forms. Some are narrative stories with amusing or surprising endings; others are simply "My life" (as in "my life is a joke", which appears 9 separate times in this dataset).
    
For this project, we'll restrict ourselves to a narrow, semi-formulaic set of jokes that consist of a "setup" question followed by a "punchline" answer, and we'll require the punchlines to be short: 20 words or fewer.  
    
We also want to clean up the data to make sure we have a set of "real" jokes, removing jokes that have subsequently been deleted or removed and whose punchlines are therefore missing, and those that did not get a single upvote.  Finally, we want to remove duplicate jokes, while summing the upvotes across every instance of a unique joke.  
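
The filtering and de-duplication rules described above can be sketched in plain Python. This is only an illustration of the idea, not the code in `clean_data.py`: the field names (`setup`, `punchline`, `score`), the `[deleted]`/`[removed]` placeholder strings, the case-insensitive duplicate key, and the helper name `clean_jokes` are all assumptions made for the example.

```python
MAX_PUNCHLINE_WORDS = 20
# Assumed placeholders for punchlines that were deleted or removed.
MISSING = {"", "[deleted]", "[removed]"}


def clean_jokes(rows):
    """Keep short Q/A jokes with 1+ upvotes and merge duplicates, summing upvotes."""
    merged = {}  # dicts preserve insertion order in Python 3.7+
    for row in rows:
        setup = row["setup"].strip()
        punchline = row["punchline"].strip()
        score = int(row["score"])
        if not setup.endswith("?"):
            continue  # the setup must be a question
        if punchline in MISSING:
            continue  # the punchline is gone
        if len(punchline.split()) > MAX_PUNCHLINE_WORDS:
            continue  # the punchline is too long
        if score < 1:
            continue  # nobody upvoted this copy of the joke
        key = (setup.lower(), punchline.lower())  # assumed notion of "duplicate"
        if key in merged:
            merged[key]["score"] += score
        else:
            merged[key] = {"setup": setup, "punchline": punchline, "score": score}
    return list(merged.values())


raw = [
    {"setup": "What do you call a boat full of dentists?", "punchline": "A tooth ferry", "score": "3"},
    {"setup": "What do you call a boat full of dentists?", "punchline": "A tooth ferry", "score": "2"},
    {"setup": "Why did the chicken cross the road?", "punchline": "[removed]", "score": "10"},
    {"setup": "A man walks into a bar.", "punchline": "Ouch.", "score": "4"},
    {"setup": "Why?", "punchline": "Because.", "score": "0"},
]
cleaned = clean_jokes(raw)
print(cleaned)
```

Only the dentist joke survives here, with the upvotes of its two copies added together.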

<div style=background-color:#EEEEFF>

This leaves us with a final dataset of ~150,000 short "Q/A" format jokes.
    
We will split this dataset into "train" and "test" sets, and also copy a small subset of the "test" jokes into a "minitest" dataset that we can use for development purposes.
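
A split like this can be sketched with the standard library. The 70/30 proportion is inferred from the file sizes the cleaning script reports (102,757 train rows out of 146,796); the shuffling, the seed, and the function name `split_jokes` are assumptions for illustration, not a description of what `clean_data.py` actually does.

```python
import random


def split_jokes(jokes, train_frac=0.7, minitest_size=300, seed=0):
    """Shuffle, then split into train/test and copy a small slice of test."""
    shuffled = list(jokes)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = int(round(len(shuffled) * train_frac))
    train, test = shuffled[:n_train], shuffled[n_train:]
    minitest = test[:minitest_size]  # a copy: these jokes stay in test as well
    return train, test, minitest


# Stand-in data: integers in place of joke records.
train, test, minitest = split_jokes(range(1000), minitest_size=30)
print(len(train), len(test), len(minitest))
```

Because "minitest" is copied out of "test", anything tuned on it should still be confirmed on the full test set.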
    
If you're interested in the details of *why* we made these choices, or how we do the data cleaning, see [1_JokesDataset.ipynb](1_JokesDataset.ipynb).

<div style=background-color:#EEEEFF>

You can run this functionality here in the Notebook by executing the following cell:

In [9]:
from clean_data import clean_data
clean_data(file='data/one-million-reddit-jokes.csv')

Reading in raw jokes data...
   1000000 jokes in the raw dataset
    146796 jokes in the final dataset (Q|A format, short punchlines, 1+ upvotes)
Joke splits written to:
    146796 in data/short_jokes_all.csv
    102757 in data/short_jokes_train.csv
     44039 in data/short_jokes_test.csv
       300 in data/short_jokes_minitest.csv


<div style=background-color:#EEEEFF>

Alternatively, you can run the cleaning script in the terminal from the command line with:

<div style=background-color:#EEEEFF>

## 2. "Fake" Punchlines, generated by an AI

Before we try to train an AI to tell jokes, we should check what can be done out-of-the-box with existing, pre-trained Transformer models.  We'll use GPT-2, which is the freely-available forerunner of the recent GPT-3 text-generation model that generated tons of press last year. 
    
GPT-2 and GPT-3 are both *auto-regressive* models, which use only the "decoder" part of the Transformer neural network architecture and are optimized for generating text.  They are built to take a text prompt and generate additional new text that "continues" the thread.

Given a joke setup, can GPT-2 produce a plausible punchline? And is it funny?
    
A more detailed walk-through of this step can be found in [2_FakePunchlines.ipynb](2_FakePunchlines.ipynb).

<div style=background-color:#EEEEFF>

Because GPT-2 is trained to be a general-purpose text generator, and not necessarily to answer questions or provide punchlines, we give it a few in-text clues to help it recognize the Q/A joke format we are trying to produce, so that each joke is formatted as:
    
> "Question: [joke setup, ends in '?'] Answer: [joke punchline]"
    
We then load the GPT-2 model and its associated tokenizer, which will convert text strings into numeric input for the model.  We pass it a "prompt", in the form 
    
> "Question: [joke setup] Answer:" 
    
and ask it to generate the continuing text that should come after "Answer:".  
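
The string handling around the prompt can be sketched as follows. The helper names (`build_prompt`, `format_joke`, `extract_punchline`) are hypothetical and are not taken from `fake_punchlines.py`; the sketch only shows how the "Question: ... Answer:" format might be assembled and how the continuation could be separated from the prompt afterwards.

```python
def build_prompt(setup):
    """Format a joke setup as the prompt handed to the text generator."""
    return "Question: {} Answer:".format(setup.strip())


def format_joke(setup, punchline):
    """Format a complete joke in the same Q/A layout."""
    return "{} {}".format(build_prompt(setup), punchline.strip())


def extract_punchline(generated_text, prompt):
    """Return only the text that the model added after the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt("What do you call a boat full of dentists?")
joke = format_joke("What do you call a boat full of dentists?", "A tooth ferry")
print(prompt)
print(extract_punchline(joke, prompt))
```

Auto-regressive generators typically return the prompt together with the continuation, which is why a step like `extract_punchline` is useful.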

<div style=background-color:#EEEEFF>

The code checks whether a GPU is available and uses it if so.  The GPU gets a 6.5x speedup over the CPU in our tests, but that still means it will take hours to generate fake punchlines for our full dataset of 140,000+ jokes.
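
A common PyTorch idiom for this kind of check is sketched below. Whether `fake_punchlines.py` does exactly this is an assumption; the helper name `pick_device` is hypothetical, and the `ImportError` fallback is included only so that the snippet also runs where PyTorch is not installed.

```python
def pick_device():
    """Return "cuda:0" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: nothing to run on a GPU
    return "cuda:0" if torch.cuda.is_available() else "cpu"


device = pick_device()
print("Running on", device)
```

The model and its input tensors both need to be moved to the chosen device before generation.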

Let's do a small run with our "minitest" dataset and take a look:

In [1]:
# This runs locally in the Notebook on our small 300-joke "mini-test" set.  

from fake_punchlines import add_fake_punchlines

add_fake_punchlines('data/short_jokes_minitest.csv')

300 jokes in the dataset
Using checkpoint "gpt2"


Using pad_token, but it is not set yet.


Working on jokes 0:100 out of 300 -- 0:00:00.000005
Running on cuda:0


100%|██████████| 100/100 [00:28<00:00,  3.49it/s]


Done writing 0:100 to CSV
Working on jokes 100:200 out of 300 -- 0:00:32.385260
Running on cuda:0


100%|██████████| 100/100 [00:28<00:00,  3.50it/s]


Done writing 100:200 to CSV
Working on jokes 200:300 out of 300 -- 0:01:00.942600
Running on cuda:0


100%|██████████| 100/100 [00:28<00:00,  3.49it/s]

Done writing 200:300 to CSV





In [3]:
# Take a look at examples of the real jokes and the fake punchlines we just created

import pandas as pd
mini_jokes = pd.read_csv('data/short_jokes_minitest.csv')
mini_fakes = pd.read_csv('data/short_jokes_minitest_fake.csv')
for i in range(3):
    print('Question: {}'.format(mini_jokes.iloc[i]['setup']))
    print('  Answer: {}'.format(mini_jokes.iloc[i]['punchline']))
    print('    Fake: {}'.format(mini_fakes.iloc[i]['punchline']))
    print('----')

Question: Did you know Google now has a platform for recording your bowel movements?
  Answer: It's called Google Sheets.
    Fake: Yes, there are multiple platforms that you can connect to to record bowel movements such as xray, ultrasound, etc. There are a few things you
----
Question: What do you call a boat full of dentists?
  Answer: A tooth ferry
    Fake: A boat full of dentists. In this case, this is an absolutely fantastic combination of both, one with a proper proper dental service and one with
----
Question: How do you know someone is feeling horny?
  Answer: They click on this post
    Fake: I know someone who feels horny. I don't want to go through that experience again, and that is a good thing. Most people who get horny
----


<div style=background-color:#EEEEFF>
    
There are a few interesting things to note here.

* Responses are generally on-topic and sound (mostly) like coherent English.  This is what GPT-2 is good at!
* The responses just ramble on and cut off arbitrarily.  We set a 30-token limit if no end-of-string (EOS) token is received; an EOS token is basically *never* generated.  GPT-2 is not good at knowing when to shut up!
* GPT-2 often answers questions with more questions (although structuring our prompts with explicit "Question:/Answer:" format seems to have helped a lot compared to my previous tests...)

<div style=background-color:#EEEEFF>
    
It takes about 30 seconds to generate punchlines for a batch of 100 jokes (your mileage may vary somewhat).  
    
In the next step, we'll train a joke classifier to distinguish between real and fake jokes, but to do that, we'll need to generate a big dataset of fake punchlines.  Based on our small batch example, generating fake punchlines for the full dataset will take roughly 12 hours!
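
The estimate follows from simple arithmetic, reproduced here as a quick check. The joke counts come from the output of the cleaning step and the 30 seconds per 100 jokes from the minitest run above; the function name `estimated_runtime` is just for illustration.

```python
from datetime import timedelta

SECONDS_PER_BATCH = 30  # observed for a batch of 100 jokes on the GPU
BATCH_SIZE = 100
N_TRAIN, N_TEST = 102757, 44039  # rows reported by the cleaning step


def estimated_runtime(n_jokes, batch_size=BATCH_SIZE, seconds_per_batch=SECONDS_PER_BATCH):
    """Scale the per-batch timing up to n_jokes."""
    return timedelta(seconds=n_jokes / batch_size * seconds_per_batch)


train_estimate = estimated_runtime(N_TRAIN)
test_estimate = estimated_runtime(N_TEST)
total_estimate = train_estimate + test_estimate
print("train:", train_estimate)
print("test: ", test_estimate)
print("total:", total_estimate)
```

The total lands at around 12 hours, with the train file accounting for roughly 70% of it.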
    
If you have a stable internet connection and can leave your laptop open, you can run them right here in the notebook and hope that you don't get disconnected.

However, a better choice is to run them from the terminal, using screen or tmux to background the process. That way, you launch the run, close your laptop, and walk away. The process will run overnight and the output will be waiting for you when you get up in the morning.

To run in the background:
* Open a terminal
* Enter `screen`
* Run the following command:
```
    python fake_punchlines.py data/short_jokes_train.csv
```
* Detach from the `screen`.
    
Then do the same for `data/short_jokes_test.csv`.
    
More detailed instructions on how to do this can be found in the [FakePunchlines Notebook](2_FakePunchlines.ipynb).

<div style=background-color:#EEEEFF>

## 3. A Classifier to recognize "real" and "fake" punchlines
    
We've seen that GPT-2 straight out of the box has some trouble telling jokes.  For one thing, she tends to ramble on and on, much longer than typical Q/A joke punchlines.  She also often answers questions with questions, much more so than actual punchlines do.  If the jokes make sense at all, they usually aren't very funny.  
    
In short, GPT-2 can't fool a human into thinking that she's a real, human comedian.
    
But can GPT-2 fool another AI?

<div style=background-color:#EEEEFF>

Let's find out, by training a classifier to distinguish "real" from "fake" jokes.  
    
For this exercise, we'll use a different type of Transformer model: an *auto-encoding* model.  Models of this type use only the "encoder" part of the Transformer neural network architecture, and are optimized for making sense of a text string (including classifying it).  The particular auto-encoding model we use here is called BERT.  
    
Can GPT-2 fool BERT into thinking her jokes are real?  Or will BERT be able to tell the difference between her jokes and those from a human(?) comedian on Reddit?

<div style=background-color:#EEEEFF>

The process of training and testing BERT to distinguish between real and fake jokes is implemented here in the *classify_punchlines()* function in [punchline_classifier.py](punchline_classifier.py).  
    
We need to give him a training set of jokes, including both real jokes and fake jokes, for him to use to train the classifier.  We also give him a test set of jokes, including both real and fake, so that he can quantify how well he is able to do.
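
One way to picture the input to the classifier is as a list of labeled text examples built from the real and fake files. The sketch below is an assumption about the general shape of that data, not the code in `punchline_classifier.py`: the label convention (1 = real, 0 = fake), the text layout, and the helper name `label_examples` are all hypothetical.

```python
REAL, FAKE = 1, 0  # assumed label convention


def label_examples(real_rows, fake_rows):
    """Combine real and fake jokes into one list of {"text", "label"} examples."""
    examples = []
    for label, rows in ((REAL, real_rows), (FAKE, fake_rows)):
        for row in rows:
            # Assumed layout: the same Q/A format used when prompting GPT-2.
            text = "Question: {} Answer: {}".format(row["setup"], row["punchline"])
            examples.append({"text": text, "label": label})
    return examples


real = [{"setup": "What do you call a boat full of dentists?", "punchline": "A tooth ferry"}]
fake = [{"setup": "What do you call a boat full of dentists?", "punchline": "A boat full of dentists."}]
examples = label_examples(real, fake)
for example in examples:
    print(example["label"], example["text"])
```

Each setup appears once with its real punchline and once with its fake one, so the classes are balanced by construction.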
    
It takes several hours to train BERT on the full dataset, so we've made it easy to "downsample" the dataset by a large factor (e.g., 20x).  Even using this small subset of the training data, BERT is able to achieve good quality differentiation between the real and fake jokes.  Training on the full dataset only improves the results by a small amount.  
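
Downsampling by a fixed factor can be as simple as keeping every Nth row, as in the sketch below. This is one plausible approach and not necessarily what `classify_punchlines()` does internally; the helper name `downsample` is hypothetical, and the row counts in the example assume that each fake file has the same number of rows as its real counterpart.

```python
def downsample(rows, factor=1):
    """Keep every `factor`-th row; factor=1 keeps everything."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return rows[::factor]


# Stand-in data: integers in place of joke rows (real + fake combined).
n_train_rows = 2 * 102757
n_test_rows = 2 * 44039
small_train = downsample(list(range(n_train_rows)), factor=20)
small_test = downsample(list(range(n_test_rows)), factor=20)
print(len(small_train), len(small_test))
```

With these inputs a stride of 20 yields 10,276 and 4,404 rows, the same sizes as the 20x-downsampled datasets in the log of the training run.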
    
If you want to train on the full dataset, we recommend doing it from the command line using *screen*, as explained in [3.PunchlineClassifier](3.PunchlineClassifier.ipynb).  
    
For now, let's just train on 1/20th of the data, right here in the notebook (should take < 10 minutes).

In [2]:
from punchline_classifier import classify_punchlines

train_files = ['data/short_jokes_train.csv','data/short_jokes_train_fake.csv']
test_files = ['data/short_jokes_test.csv','data/short_jokes_test_fake.csv']

# Set downsample=1 or leave out to train on the full training set (it defaults to 1)
model = classify_punchlines(train_files, test_files, downsample=20)  

Using custom data configuration default-22e578e895d47822
Reusing dataset csv (/home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-22e578e895d47822/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a)


  0%|          | 0/2 [00:00<?, ?it/s]

Loading cached processed dataset at /home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-22e578e895d47822/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-174f232c5e5ea289.arrow
Loading cached processed dataset at /home/jupyter-genevievegraves/.cache/huggingface/datasets/csv/default-22e578e895d47822/0.0.0/bf68a4c4aefa545d0712b2fcbb1b327f905bbe2f6425fbc5e8c25234acb9e14a/cache-3d38ffa341bd5fb6.arrow


10276 rows in the train dataset (20x downsampled).
4404 rows in the test dataset (20x downsampled).
Using checkpoint "bert-base-uncased"


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Running on cuda:0


 33%|███▎      | 1284/3855 [02:12<04:25,  9.68it/s]

Epoch: 0, Loss: 0.009, Accuracy: 0.971, F1: 0.971


 67%|██████▋   | 2569/3855 [04:42<02:15,  9.47it/s]  

Epoch: 1, Loss: 0.007, Accuracy: 0.982, F1: 0.982


100%|█████████▉| 3854/3855 [07:12<00:00,  9.71it/s]  

Epoch: 2, Loss: 0.001, Accuracy: 0.983, F1: 0.983


100%|██████████| 3855/3855 [07:34<00:00,  8.48it/s]


Saving model as models/ClassifyJokes_bert_0.05subset_2021-12-21.pt


<div style=background-color:#EEEEFF>

Notice that BERT is able to achieve 97%+ accuracy, even just training on a small subset of the available training data.
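
For reference, the two metrics reported after each epoch can be computed as in this minimal sketch. It uses the plain binary definitions; which F1 averaging the training script uses is not stated, so treat that as an assumption. The labels below are made up purely to exercise the functions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision/recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = real punchline, 0 = fake punchline.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(acc, f1)
```

On a balanced dataset where errors are spread evenly across both classes, accuracy and F1 come out nearly equal, as they do in the training log.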
    
So the answer is "No": out-of-the-box GPT-2 cannot fool BERT with her joke-telling abilities.
    
But what if we train her *specifically to tell jokes*?  Can she get better at it?