# Analysing the Edinburgh Fringe Festival Jokes

**This is a notebook that I used at PyCon UK 2015 to check whether the jokes of [The Lightning talk guy](http://www.lightningtalkman.com/) are funny**. It is based on [this blog post](http://vknight.org/unpeudemath/code/2015/06/14/natural-language-and-predicting-funny/) and the series of [BBC articles that list the ranking of jokes at the Edinburgh Fringe Festival](http://www.bbc.co.uk/news/uk-scotland-edinburgh-east-fife-34039927).

The basic idea is:

![](http://vknight.org/unpeudemath/assets/images/description_of_ratio_learning_for_nlp_jokes.svg)

Here are the libraries we are going to need:

In [1]:
import pandas  # To handle our data nicely
import nltk  # For all the clever stuff

## Loading and tidying the data

In [2]:
df = pandas.read_csv('jokes.csv', quotechar='"', skipinitialspace=True)

In [3]:
df.head()

Unnamed: 0,Year,Author,Rank,Raw_joke
0,2015,Darren Walsh,1,I just deleted all the German names off my pho...
1,2015,Stewart Francis,2,Kim Kardashian is saddled with a huge arse ......
2,2015,Adam Hess,3,Surely every car is a people carrier?
3,2015,Masai Graham,4,What's the difference between a 'hippo' and a ...
4,2015,Dave Green,5,If I could take just one thing to a desert isl...


In [4]:
df.tail()

Unnamed: 0,Year,Author,Rank,Raw_joke
65,2009,Adam Hills,6,Going to Starbucks for coffee is like going to...
66,2009,Marcus Brigstocke,7,To the people who've got iPhones: you just bou...
67,2009,Rhod Gilbert,8,"A spa hotel? It's like a normal hotel, only in..."
68,2009,Dan Antopolski,9,I've been reading the news about there being a...
69,2009,Simon Brodkin,10,I started so many fights at my school - I had ...


Getting rid of the common words and tokenising the jokes

In [54]:
# nltk.download()  # Only do this once: needed to download the `stopwords` corpus

In [5]:
# Needs the `stopwords` corpus: run nltk.download('stopwords') once beforehand
commonwords = [e.upper() for e in set(nltk.corpus.stopwords.words('english'))]
commonwords.extend(['M', 'VE'])  # Leftovers of contractions such as "I'm" and "I've"
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # To be able to strip out unwanted things in strings
string_to_list = lambda x: [el.upper() for el in tokenizer.tokenize(x) if el.upper() not in commonwords]
df['Joke'] = df['Raw_joke'].apply(string_to_list)
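
To see why `'M'` and `'VE'` end up in the list of common words, here is a small sketch of the same tokenising step using only the standard library. It assumes that `re.findall` with the pattern `\w+` splits text the same way as the tokenizer above does for this pattern, and the `stop_words` set is a hand-picked, hypothetical stand-in for NLTK's English list.

```python
import re

# Hypothetical, hand-picked stop words (the notebook uses NLTK's English list)
stop_words = {"I", "THE", "M", "VE"}

def tokenise(text, stop_words):
    """Split on runs of word characters, upper-case, and drop stop words."""
    tokens = re.findall(r"\w+", text)
    return [t.upper() for t in tokens if t.upper() not in stop_words]

raw = "I'm sure I've seen the joke"

# The apostrophes are not word characters, so contractions are cut in two
print(re.findall(r"\w+", raw))
# → ['I', 'm', 'sure', 'I', 've', 'seen', 'the', 'joke']

print(tokenise(raw, stop_words))
# → ['SURE', 'SEEN', 'JOKE']
```

Without `'M'` and `'VE'` in the stop words, those fragments would be kept as if they were meaningful words.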

In [6]:
df.head()

Unnamed: 0,Year,Author,Rank,Raw_joke,Joke
0,2015,Darren Walsh,1,I just deleted all the German names off my pho...,"[DELETED, GERMAN, NAMES, PHONE, HANS, FREE]"
1,2015,Stewart Francis,2,Kim Kardashian is saddled with a huge arse ......,"[KIM, KARDASHIAN, SADDLED, HUGE, ARSE, ENOUGH,..."
2,2015,Adam Hess,3,Surely every car is a people carrier?,"[SURELY, EVERY, CAR, PEOPLE, CARRIER]"
3,2015,Masai Graham,4,What's the difference between a 'hippo' and a ...,"[DIFFERENCE, HIPPO, ZIPPO, ONE, REALLY, HEAVY,..."
4,2015,Dave Green,5,If I could take just one thing to a desert isl...,"[COULD, TAKE, ONE, THING, DESERT, ISLAND, PROB..."


## Training our classifier

**From here on in we use the jokes up until 2013 as the training set.**

We start by getting the entire set of words in all the jokes from the training set.

In [7]:
df['Year'] = df['Year'].apply(int)

def get_all_words(dataframe):
    """
    A function that gets all the words from the Joke column in a given dataframe
    """
    all_words = []
    for jk in dataframe['Joke']:
        all_words.extend(jk)
    return all_words

all_words = get_all_words(df)  # Words from every joke in the dataframe
all_words[:10]  # The first ten words (these come from the 2015 jokes)

['DELETED',
 'GERMAN',
 'NAMES',
 'PHONE',
 'HANS',
 'FREE',
 'KIM',
 'KARDASHIAN',
 'SADDLED',
 'HUGE']
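
Note that the cell above passes the full `df`, which is why the first words come from a 2015 joke. To restrict the training data to jokes up until 2013, one option is a boolean mask on the `Year` column. This is a sketch on a made-up miniature table: the `toy` dataframe and the names `cutoff_year`, `training` and `testing` are hypothetical, and it assumes "up until 2013" is meant inclusively.

```python
import pandas

# Hypothetical miniature version of the jokes table
toy = pandas.DataFrame({
    'Year': [2015, 2014, 2013, 2012, 2009],
    'Rank': [1, 2, 3, 7, 10],
})

cutoff_year = 2013  # Assumption: the cut-off year itself belongs to the training set

training = toy[toy['Year'] <= cutoff_year]  # Jokes used to train the classifier
testing = toy[toy['Year'] > cutoff_year]    # Later jokes, held back for checking

print(list(training['Year']))
print(list(testing['Year']))
```

The same mask applied to `df` would give a training dataframe that can be passed to `get_all_words` in place of the full table.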

Creating a function to extract features from a given joke

In [8]:
def extract_features(joke, all_words):
    """
    A function that maps each distinct word of a joke to whether or not it is in all_words
    """
    words = set(joke)
    features = {}
    for word in words:
        features['contains(%s)' % word] = (word in all_words)
    return features
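
As a quick illustration of what this function returns, here is a self-contained sketch on made-up input. The function body is repeated (in a compact form with the same logic) so that the block runs on its own; `vocabulary` and `new_joke` are hypothetical values.

```python
def extract_features(joke, all_words):
    """Same logic as the function above, written as a dict comprehension."""
    return {'contains(%s)' % word: (word in all_words) for word in set(joke)}

# Hypothetical vocabulary and an already tokenised joke
vocabulary = {'NASA', 'COMPANY', 'PARTIES'}
new_joke = ['NASA', 'PLANET', 'NASA']

features = extract_features(new_joke, vocabulary)
print(features == {'contains(NASA)': True, 'contains(PLANET)': False})
# → True
```

Repeated words collapse to a single entry, and a word the vocabulary has never seen is recorded as `False` rather than left out.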

In [9]:
df['Features'] = df['Joke'].apply(lambda x: extract_features(x, all_words))  # Reuse the word list built above
df.head()

Unnamed: 0,Year,Author,Rank,Raw_joke,Joke,Features
0,2015,Darren Walsh,1,I just deleted all the German names off my pho...,"[DELETED, GERMAN, NAMES, PHONE, HANS, FREE]","{'contains(GERMAN)': True, 'contains(FREE)': T..."
1,2015,Stewart Francis,2,Kim Kardashian is saddled with a huge arse ......,"[KIM, KARDASHIAN, SADDLED, HUGE, ARSE, ENOUGH,...","{'contains(ENOUGH)': True, 'contains(KANYE)': ..."
2,2015,Adam Hess,3,Surely every car is a people carrier?,"[SURELY, EVERY, CAR, PEOPLE, CARRIER]","{'contains(CARRIER)': True, 'contains(PEOPLE)'..."
3,2015,Masai Graham,4,What's the difference between a 'hippo' and a ...,"[DIFFERENCE, HIPPO, ZIPPO, ONE, REALLY, HEAVY,...","{'contains(LIGHTER)': True, 'contains(HEAVY)':..."
4,2015,Dave Green,5,If I could take just one thing to a desert isl...,"[COULD, TAKE, ONE, THING, DESERT, ISLAND, PROB...","{'contains(COULD)': True, 'contains(ISLAND)': ..."


Labelling our jokes depending on what will be deemed as funny

In [10]:
funny_threshold = 5
df['Rank'] = df['Rank'].apply(int)
df['Funny'] = df['Rank'] <= funny_threshold
df.head(10)

Unnamed: 0,Year,Author,Rank,Raw_joke,Joke,Features,Funny
0,2015,Darren Walsh,1,I just deleted all the German names off my pho...,"[DELETED, GERMAN, NAMES, PHONE, HANS, FREE]","{'contains(GERMAN)': True, 'contains(FREE)': T...",True
1,2015,Stewart Francis,2,Kim Kardashian is saddled with a huge arse ......,"[KIM, KARDASHIAN, SADDLED, HUGE, ARSE, ENOUGH,...","{'contains(ENOUGH)': True, 'contains(KANYE)': ...",True
2,2015,Adam Hess,3,Surely every car is a people carrier?,"[SURELY, EVERY, CAR, PEOPLE, CARRIER]","{'contains(CARRIER)': True, 'contains(PEOPLE)'...",True
3,2015,Masai Graham,4,What's the difference between a 'hippo' and a ...,"[DIFFERENCE, HIPPO, ZIPPO, ONE, REALLY, HEAVY,...","{'contains(LIGHTER)': True, 'contains(HEAVY)':...",True
4,2015,Dave Green,5,If I could take just one thing to a desert isl...,"[COULD, TAKE, ONE, THING, DESERT, ISLAND, PROB...","{'contains(COULD)': True, 'contains(ISLAND)': ...",True
5,2015,Mark Nelson,6,"Jesus fed 5,000 people with two fishes and a l...","[JESUS, FED, 5, 000, PEOPLE, TWO, FISHES, LOAF...","{'contains(000)': True, 'contains(PEOPLE)': Tr...",False
6,2015,Tom Parry,7,Red sky at night. Shepherd's delight. Blue sky...,"[RED, SKY, NIGHT, SHEPHERD, DELIGHT, BLUE, SKY...","{'contains(BLUE)': True, 'contains(SKY)': True...",False
7,2015,Alun Cochrane,8,"The first time I met my wife, I knew she was a...","[FIRST, TIME, MET, WIFE, KNEW, KEEPER, WEARING...","{'contains(MASSIVE)': True, 'contains(FIRST)':...",False
8,2015,Simon Munnery,9,Clowns divorce. Custardy battle,"[CLOWNS, DIVORCE, CUSTARDY, BATTLE]","{'contains(BATTLE)': True, 'contains(CUSTARDY)...",False
9,2015,Grace The Child,10,They're always telling me to live my dreams. B...,"[RE, ALWAYS, TELLING, LIVE, DREAMS, WANT, NAKE...","{'contains(EXAM)': True, 'contains(ALWAYS)': T...",False


Creating a labeled feature

In [15]:
df['Labeled_Feature'] = list(zip(df['Features'],df['Funny']))
df.head()

Unnamed: 0,Year,Author,Rank,Raw_joke,Joke,Features,Funny,Labeled_Feature
0,2015,Darren Walsh,1,I just deleted all the German names off my pho...,"[DELETED, GERMAN, NAMES, PHONE, HANS, FREE]","{'contains(GERMAN)': True, 'contains(FREE)': T...",True,"({'contains(GERMAN)': True, 'contains(FREE)': ..."
1,2015,Stewart Francis,2,Kim Kardashian is saddled with a huge arse ......,"[KIM, KARDASHIAN, SADDLED, HUGE, ARSE, ENOUGH,...","{'contains(ENOUGH)': True, 'contains(KANYE)': ...",True,"({'contains(ENOUGH)': True, 'contains(KANYE)':..."
2,2015,Adam Hess,3,Surely every car is a people carrier?,"[SURELY, EVERY, CAR, PEOPLE, CARRIER]","{'contains(CARRIER)': True, 'contains(PEOPLE)'...",True,"({'contains(CARRIER)': True, 'contains(PEOPLE)..."
3,2015,Masai Graham,4,What's the difference between a 'hippo' and a ...,"[DIFFERENCE, HIPPO, ZIPPO, ONE, REALLY, HEAVY,...","{'contains(LIGHTER)': True, 'contains(HEAVY)':...",True,"({'contains(LIGHTER)': True, 'contains(HEAVY)'..."
4,2015,Dave Green,5,If I could take just one thing to a desert isl...,"[COULD, TAKE, ONE, THING, DESERT, ISLAND, PROB...","{'contains(COULD)': True, 'contains(ISLAND)': ...",True,"({'contains(COULD)': True, 'contains(ISLAND)':..."


Creating our classifier

In [16]:
classifier = nltk.NaiveBayesClassifier.train(df['Labeled_Feature'])

In [17]:
classifier.show_most_informative_features(10)

Most Informative Features
          contains(WELL) = True            False : True   =      2.5 : 1.0
           contains(GOT) = True            False : True   =      2.5 : 1.0
           contains(SAY) = True             True : False  =      2.2 : 1.0
           contains(ONE) = True             True : False  =      2.0 : 1.0
          contains(TIME) = True            False : True   =      1.8 : 1.0
           contains(SEX) = True            False : True   =      1.8 : 1.0
          contains(MADE) = True            False : True   =      1.8 : 1.0
           contains(DAY) = True            False : True   =      1.8 : 1.0
          contains(DIDN) = True            False : True   =      1.8 : 1.0
         contains(NEVER) = True            False : True   =      1.8 : 1.0
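
Each row above compares how typical a word is of one label relative to the other. The sketch below shows the rough idea in plain Python on a made-up toy set. The data, the function names and the add-half smoothing are all assumptions for illustration: this is not NLTK's implementation, and its estimates will differ in detail.

```python
# Hypothetical toy training set: (distinct words in a joke, funny?) pairs
toy_jokes = [
    ({'ONE', 'SAY'}, True),
    ({'ONE', 'DOG'}, True),
    ({'GOT', 'WELL'}, False),
    ({'GOT', 'ONE'}, False),
    ({'GOT', 'DAY'}, False),
]

def smoothed_rate(word, label, jokes, alpha=0.5):
    """Smoothed share of the jokes carrying `label` that contain `word`."""
    with_label = [words for words, lab in jokes if lab == label]
    hits = sum(word in words for words in with_label)
    # Two outcomes (word present or absent), hence 2 * alpha in the denominator
    return (hits + alpha) / (len(with_label) + 2 * alpha)

def likelihood_ratio(word, jokes):
    """How many times more typical `word` is of funny jokes than of unfunny ones."""
    return smoothed_rate(word, True, jokes) / smoothed_rate(word, False, jokes)

for word in ['ONE', 'GOT']:
    print(word, round(likelihood_ratio(word, toy_jokes), 2))
```

A ratio above 1 means the word leans towards the funny label in this toy set, and a ratio below 1 means it leans towards the unfunny one. The smoothing keeps the ratio finite for words that never occur under one of the labels.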


In [18]:
joke = 'How does NASA organize their company parties? They planet.'
classifier.classify(extract_features(string_to_list(joke), get_all_words(df)))

True

In [19]:
joke = 'Why is 10 afraid of 7? Because 7 8 9'
classifier.classify(extract_features(string_to_list(joke), get_all_words(df)))

False