<div style=background-color:#EEEEFF>

## 1. The Jokes Dataset

To train a joke-telling AI and test its performance, we need a dataset of jokes.  Here, we'll use the "One Million Reddit Jokes" dataset, which covers jokes posted to the /r/jokes subreddit on or before April 1, 2020.  The jokes dataset is provided here in `./data/one-million-reddit-jokes.csv`.  You can also download the jokes dataset directly from Kaggle [here](https://www.kaggle.com/pavellexyr/one-million-reddit-jokes).

Let's start by reading in the jokes dataset, seeing what we have, and cleaning it up a little.

In [None]:
import pandas as pd

# Read in the raw jokes dataset
filename = '/opt/cloudburst/shared/nlp_puchlines/one-million-reddit-jokes.csv'
print('Reading in a lot of jokes...')
raw_jokes = pd.read_csv(filename,keep_default_na=False)

# Print out the column names with a single example of the data in each column
print('{} jokes with the following columns (column name: example) '.format(raw_jokes.shape[0]))
for c in raw_jokes.columns:
    print('  {:>15}: {}'.format(c,raw_jokes.iloc[1][c]))

<div style=background-color:#EEEEFF>

Let's look at a few example rows from this dataset.

In [None]:
# Show the first three rows
raw_jokes.iloc[:3]

<div style=background-color:#EEEEFF>

Of these columns, the only ones of relevance to us are the last 3:
* `title` is usually used as the joke "setup"
* `selftext` stores the joke "punchline"
* `score` is the number of upvotes the joke received and can be used as a metric of "joke quality"

It would be nice if we could use `subreddit.nsfw` to filter out inappropriate jokes (crude, hyper-sexualized, racist, homophobic, sexist, or otherwise objectionable), but sadly `subreddit.nsfw` is `False` for everything in this dataset, despite the fact that many of the jokes are clearly NOT safe for work.  We'll return to the problem of inappropriate jokes later.
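
One quick way to confirm that a flag column carries no information is to count its distinct values. The sketch below does this on a tiny made-up frame (the `toy` rows are hypothetical and not taken from the dataset); on the real data, the same `value_counts()` call would be applied to `raw_jokes['subreddit.nsfw']`.

```python
import pandas as pd

# Hypothetical stand-in for raw_jokes; only the column name matches the real data
toy = pd.DataFrame({
    'title': ['setup A', 'setup B', 'setup C'],
    'subreddit.nsfw': [False, False, False],
})

# Count how often each distinct value occurs in the flag column
nsfw_counts = toy['subreddit.nsfw'].value_counts()
print(nsfw_counts)

# A column with a single distinct value cannot separate any rows from the rest
is_constant = toy['subreddit.nsfw'].nunique() == 1
print('Column is constant:', is_constant)
```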

Here are the top 5 most common punchlines in our dataset:

In [None]:
# Show the value and the number counts for the 5 most common punchlines in the dataset
raw_jokes['selftext'].value_counts()[:5]

<div style=background-color:#EEEEFF>

Many of the jokes have been removed or deleted (maybe because they were inappropriate and the author or a moderator thought better of it?...).  We don't want these in our dataset.  
    
Some of them may be using the text "[removed]" as their actual punchline, in a self-referential meta joke about Reddit jokes.  We don't really want those either.
    
But what is going on with the jokes whose punchlines are blank?

In [None]:
pd.options.display.max_colwidth = None   # don't truncate the column text

# Get the jokes with blank punchlines (selftext=='')
#    Then show values & counts for the 5 most common setups
blank_punchline_counts = raw_jokes[raw_jokes['selftext']=='']['title'].value_counts()
blank_punchline_counts.rename_axis('title').reset_index(name='counts').set_index('counts')[:5]

<div style=background-color:#EEEEFF>

Okay, I get it.  Donald Trump is a joke.  My life is a joke.  Feminism is a joke.  Ha ha.

But this raises the larger issue that jokes can take many different narrative forms.  If we're hoping to get an AI to learn to tell jokes, we should start with a somewhat down-scoped problem. For this exercise, let's only use jokes that take the form of a "setup" question, followed by a "punchline" answer, e.g., 

* Question: Why did the chicken cross the road?
* Answer: To get to the other side.
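
As a rough sketch of how this restriction can be checked, the hypothetical helper `is_question_setup` below flags setups that end in a question mark. Ignoring trailing whitespace is an assumption of this sketch, so it is slightly more lenient than a strict last-character test.

```python
def is_question_setup(setup: str) -> bool:
    """Return True if the setup text ends with a question mark.

    Trailing whitespace is ignored (an assumption of this sketch), and an
    empty string returns False rather than raising an error.
    """
    return setup.rstrip().endswith('?')


# Made-up example setups
examples = [
    'Why did the chicken cross the road?',
    'A man walks into a bar',
    'Why did the chicken cross the road? ',
    '',
]
for text in examples:
    print('{!r:>45}: {}'.format(text, is_question_setup(text)))
```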

In [None]:
# Clean up the jokes dataset:
#   - get rid of removed or deleted punchlines
#   - replace newlines w/ spaces (easier to read)
#   - require the setup to be a question
#   - remove blank punchlines (length < 1)

remove = ['[removed]','[deleted]',r'\[removed\]']   # We'll remove jokes with these punchlines (the raw string keeps the literal backslashes)
use_columns = ['title','selftext','score']  # We only care about these columns

jokes = raw_jokes[raw_jokes['selftext'].apply(lambda x: x not in remove)][use_columns]

# Rename columns, replace newlines with spaces
jokes = jokes.rename(columns={'title':'setup','selftext':'punchline'})
jokes['setup'] = jokes['setup'].apply(lambda x: x.replace('\n',' ').replace('\r',' '))
jokes['punchline'] = jokes['punchline'].apply(lambda x: x.replace('\n',' ').replace('\r',' '))

# Is the setup a question?  (endswith also handles an empty setup safely)
jokes['question'] = jokes['setup'].apply(lambda x: x.endswith('?'))
# How long is the punchline?
jokes['punch_length'] = jokes['punchline'].apply(lambda x: len(x.split()))
# Only keep jokes with punchlines containing at least one word (some are all blank space)
jokes = jokes[jokes['punch_length'] >= 1]

print('{} jokes not missing punchlines, with the following columns:\n'.format(jokes.shape[0]))
for c in jokes.columns:
    print('  {:>15}: {}'.format(c,jokes.iloc[1][c]))

print('\n{} jokes have setups that are questions'.format(jokes[jokes['question']==True].shape[0]))

<div style=background-color:#EEEEFF>

We probably also want to restrict ourselves to jokes with short(ish) punchlines---the longer we let an AI ramble on, the less sense it tends to make.

Let's look at the length of the punchlines for our jokes that are questions:

In [None]:
import matplotlib.pyplot as plt

print('NOTE: logarithmic y-axis!')
fig, ax = plt.subplots(1,1, figsize=(8,5))
plt.rcParams['font.size'] = '18'
# Count the jokes at each punchline length, separately for question and non-question setups
for is_question, style, label in [(True, 'go', 'Question jokes'),
                                  (False, 'ro', 'Non-question jokes')]:
    counts = jokes[jokes['question']==is_question]['punch_length'].value_counts()
    ax.plot(list(counts.index), list(counts.values), style, label=label)
ax.set_yscale('log', nonpositive='clip')
ax.set_xlim((0,60))
ax.set_ylim((1,1e5))
_ = ax.set_xlabel('# words in punchline',size=14)
_ = ax.set_ylabel('# Jokes w/ this length punchline',size=14)
_ = ax.legend(fontsize=14)

<div style=background-color:#EEEEFF>

Question jokes tend to have short punchlines---typically just a few words long---whereas jokes whose setups are not questions often have a longer "narrative" format and a long tail of very long punchlines.  Note that the y-axis is logarithmic, so there are ~100x more very long punchlines among the non-question jokes than among the question jokes.
    
Let's stick with "question" jokes that have short(ish) punchlines, no more than 20 words.
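
Here is a minimal sketch of that length filter on made-up rows (the `toy` frame and the `max_words` name are illustrative only). Words are counted as whitespace-separated tokens, so punctuation attached to a word does not add to the count.

```python
import pandas as pd

max_words = 20  # longest punchline we keep, in whitespace-separated words

# Made-up punchlines for illustration
toy = pd.DataFrame({
    'punchline': [
        'To get to the other side.',
        ' '.join(['word'] * 25),   # 25 words: too long
        '   ',                     # whitespace only: zero words
    ],
})

# str.split() with no arguments splits on runs of whitespace
toy['punch_length'] = toy['punchline'].str.split().str.len()

# Keep punchlines with at least one word and at most max_words words
short = toy[(toy['punch_length'] >= 1) & (toy['punch_length'] <= max_words)]
print(toy['punch_length'].tolist())
print(short['punchline'].tolist())
```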

In [None]:
jokes = jokes[(jokes['question']==True) & (jokes['punch_length'] <= 20)]
print('{} Q/A jokes with short punchlines'.format(jokes.shape[0]))

<div style=background-color:#EEEEFF>

Let's only keep jokes that *at least one* person thought were funny.

In [None]:
jokes = jokes[jokes['score'] >= 1]
print('{} Q/A jokes with short punchlines that got 1+ upvotes'.format(jokes.shape[0]))

<div style=background-color:#EEEEFF>

And finally, some jokes get posted to Reddit again and again.  We want to deduplicate those, but we want to count *all* the upvotes received by the joke.  If we assume a Reddit user only sees and upvotes a joke once (rather than upvoting the same joke again and again), we can do that by summing the upvotes for each duplicate entry of a joke.
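
To make the idea concrete, here is a small sketch on made-up rows (the `toy` frame is hypothetical). One joke appears three times with scores 10, 5, and 1, so after deduplication it should survive once with a combined score of 16.

```python
import pandas as pd

# Made-up rows: the first joke was posted three times, the second once
toy = pd.DataFrame({
    'setup': ['Q1?', 'Q2?', 'Q1?', 'Q1?'],
    'punchline': ['A1', 'A2', 'A1', 'A1'],
    'score': [10, 3, 5, 1],
})

# Give every copy of a joke the total score across all of its copies
toy['score'] = toy.groupby(['setup', 'punchline'])['score'].transform('sum')

# Keep only the first copy of each (setup, punchline) pair
deduped = toy.drop_duplicates(subset=['setup', 'punchline'])
print(deduped)
```

Using `transform('sum')` rather than a plain aggregation keeps the frame's original shape, so the summed score can be written back into the `score` column before the duplicates are dropped.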

In [None]:
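# Work on an explicit copy so the column assignment below doesn't trigger a SettingWithCopyWarning
jokes = jokes.copy()
# Sum the scores for all jokes with the same setup and punchline
jokes['score'] = jokes.groupby(['setup', 'punchline'])['score'].transform('sum')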
# Then drop the duplicate entries
jokes = jokes.drop_duplicates(subset=['setup','punchline'])
print('{} jokes in the final dataset'.format(jokes.shape[0]))

<div style=background-color:#EEEEFF>

Let's split the jokes into a training set and a test set.  We'll use a fixed random seed so that we choose the same split each time.
    
We'll then write the jokes dataset to disk and take a look at some examples.
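
The sketch below illustrates why the fixed seed matters, using a hypothetical ten-row frame and a hypothetical helper named `split`. Sampling twice with the same `random_state` selects the same rows, and the test set is simply every row whose index was not sampled. This complement trick assumes the index values are unique.

```python
import pandas as pd

toy = pd.DataFrame({'joke_id': range(10)})  # stand-in for the jokes table


def split(df, train_frac=0.7, seed=40):
    """Randomly pick train_frac of the rows for training; the rest are test."""
    train = df.sample(frac=train_frac, random_state=seed)
    test = df[~df.index.isin(train.index)]
    return train, test


train_a, test_a = split(toy)
train_b, test_b = split(toy)

print(len(train_a), len(test_a))
# Same seed, same rows in the same order
print(train_a.index.equals(train_b.index))
```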

In [None]:
print('{:>10} jokes in our final dataset'.format(jokes.shape[0]))

train_frac = 0.7  # Use 70% of jokes for training, 30% for testing
seed = 40         # Use a fixed seed for random state so that we always get the same splits
mini_count = 300  # Let's also store a small subset of the test data as a "mini" test to use during development.

jokes_train = jokes.sample(frac=train_frac, axis=0, random_state=seed)
jokes_test = jokes[~jokes.index.isin(jokes_train.index)]

print('{:>10} jokes in our training set'.format(jokes_train.shape[0]))
print('{:>10} jokes in our test set'.format(jokes_test.shape[0]))

output_columns = ['setup','punchline','score']
outfile = 'data/short_jokes.csv'

print('Joke splits written to:')
for dset,name in [(jokes,'_all'), 
                  (jokes_train, '_train'),
                  (jokes_test,'_test'),
                  (jokes_test.iloc[:mini_count],'_minitest')]:
    dset[output_columns].to_csv(outfile.replace('.csv',name+'.csv'), header=True, index=False)
    print('{:>10} in {}'.format(dset.shape[0],outfile.replace('.csv',name+'.csv')))

In [None]:
print('\nHere are some examples:')
jokes_test.iloc[:10]

In [None]:
print('\nAnd here are the top-10 scoring short Q/A-type jokes on Reddit:')
jokes_sorted = jokes.sort_values('score',ascending=False)
jokes_sorted.iloc[:10]