# Jeopardy

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
df = pd.read_csv('jeopardy.csv')

In [3]:
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [15]:
df.shape

(216930, 7)

In [16]:
df.dtypes

show_number     int64
air_date       object
round          object
category       object
value          object
question       object
answer         object
dtype: object

In [5]:
# Change all the column names for easier manipulation and remove whitespace
df.rename(columns={
    'Show Number': 'show_number',
    ' Air Date': 'air_date',
    ' Round': 'round',
    ' Category': 'category',
    ' Value': 'value',
    ' Question': 'question',
    ' Answer': 'answer'},
    inplace=True
)
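The padded headers can also be normalised generically instead of listing every column by hand. The following is a minimal sketch, not the method used in this notebook; `sample` is a hypothetical stand-in frame with the same kind of headers as the raw CSV.

```python
import pandas as pd

# Hypothetical stand-in with padded headers like those in the raw CSV
sample = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round'])

# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores
sample.columns = (
    sample.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_', regex=False)
)

print(list(sample.columns))
```

This keeps working if the file gains new columns, at the cost of less explicit control over each final name.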

In [6]:
df.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [7]:
def filter_dataframe(dataframe, words):
    """Filters the dataset 'dataframe' and returns only the questions that
    contain *all* of the words in the list 'words'.
    
    Parameters
    ----------
    dataframe: DataFrame
        The dataset with all the questions to be filtered
    words: list
        The list of words used to filter the dataset
        
    Returns
    -------
    DataFrame
        a new data frame with only the rows whose questions include *all* the words
    """
    # all() returns True only if every word in the list 'words' appears in the
    # question text (case-insensitive substring match)
    filter_expression = lambda x: all(word.lower() in x.lower() for word in words)
    return dataframe[ dataframe['question'].apply(filter_expression) ]
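Because the test `word.lower() in x.lower()` is a substring check, a word such as `'King'` also matches inside longer words like "talking" or "viking". If whole-word matches are wanted, one possible variant uses regular-expression word boundaries. This is a sketch under that assumption; `filter_whole_words` and `sample` are hypothetical names, not part of the original notebook.

```python
import re
import pandas as pd

def filter_whole_words(dataframe, words):
    """Hypothetical variant of filter_dataframe that matches whole words only."""
    # One case-insensitive pattern per word, anchored on word boundaries
    patterns = [
        re.compile(r'\b' + re.escape(word) + r'\b', re.IGNORECASE)
        for word in words
    ]
    # Keep a row only if every pattern is found in its question text
    mask = dataframe['question'].apply(
        lambda text: all(p.search(text) is not None for p in patterns)
    )
    return dataframe[mask]

# Hypothetical sample questions for illustration
sample = pd.DataFrame({'question': [
    'This king of France built Versailles',
    'Talking about France is fun',
    'The viking explorer reached Vinland',
]})

print(filter_whole_words(sample, ['king']))
```

Only the first sample row is kept, because "Talking" and "viking" contain "king" only as part of a longer word.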

In [8]:
filt = filter_dataframe(df, ['King', 'France'])
filt

Unnamed: 0,show_number,air_date,round,category,value,question,answer
17413,4496,2004-03-08,Double Jeopardy!,DYNASTY,$1200,The king of France in 1604 & the king of Spain...,the Bourbons
17718,5770,2009-10-16,Double Jeopardy!,MOVIE DEBUTS,$1600,Timothy Dalton was 22 when Peter O'Toole helpe...,The Lion in Winter
23825,4862,2005-11-01,Jeopardy!,"DR. PHIL, SHAKESPEAREAN COUNSELOR",$800,"Regan & Goneril got your inheritance, but you ...",Cordelia
28265,3737,2000-11-28,Double Jeopardy!,THE MIDDLE AGES,$800,"In a 1346 battle, France's Philip VI was Crecy...",Edward III
28576,4752,2005-04-12,Double Jeopardy!,KING ME,"$3,000",He's the last king of France whose reign strad...,Louis XIV
...,...,...,...,...,...,...,...
192169,3583,2000-03-15,Jeopardy!,"YOU SHOULD ""C""",$200,Name shared by the mother of 3 kings of France...,Catherine
193098,5647,2009-03-10,Double Jeopardy!,"HE SAID, SHE SAID",$2000,"<a href=""http://www.j-archive.com/media/2009-0...",Julia Child
203970,5371,2008-01-07,Jeopardy!,IT HAPPENED IN '08,$200,1808: Napoleon's brother Joseph is made king o...,Spain
205541,6061,2011-01-10,Jeopardy!,LAUNDRY DETERGENT,$1000,"Philip III, king of France 1270-1285, was nick...",Bold


The original `value` column holds strings that include a **$** sign and sometimes **,** (thousands separators).
The `lambda` function applied to the `value` column uses a regular expression to remove these characters
and then casts the result to a float.
When the value is `'None'`, the function stores `0` in the newly created `float_value` column.

In [27]:
df['float_value'] = df['value'].apply(lambda s: 
                                      float(re.sub(r'[\$,\s]', '', s)) 
                                      if s != 'None' 
                                      else 0)
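The same cleaning step can be written with vectorised string methods instead of a row-by-row `lambda`. The sketch below is an alternative under stated assumptions, not the notebook's own code: `sample` is a hypothetical frame, and missing entries are treated the same way as the string `'None'`.

```python
import pandas as pd

# Hypothetical sample covering the formats described above, plus a missing entry
sample = pd.DataFrame({'value': ['$200', '$3,000', 'None', None]})

# Remove dollar signs, commas and whitespace in one vectorised pass
cleaned = sample['value'].str.replace(r'[\$,\s]', '', regex=True)

# Anything that is not a number ('None' or missing) becomes NaN, then 0
sample['float_value'] = pd.to_numeric(cleaned, errors='coerce').fillna(0.0)

print(sample)
```

Using `errors='coerce'` means unexpected strings silently become `0`, so it is worth checking the distinct raw values first if that would hide real problems.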

In [37]:
# Create a Series with all the float values in the entire dataset
all_values = df['float_value']

18000.0

Now that we can filter the dataset by question text, we can use `float_value` (the prize money) as a rough
proxy for the **difficulty** of the matching questions.
For example, what is the average `float_value` of questions that contain the word `'King'`?

In [45]:
king_questions = filter_dataframe(df, ['King'])

# Work out the mean of the 'float_value' column of this filtered dataset
king_q_mean = np.mean(king_questions['float_value'])
print(king_q_mean)

771.8833850722094


The average value of questions with the word `'King'` is **$771.88**.

In [71]:
answer_group = king_questions.groupby('answer').float_value.count().reset_index()
answer_group.sort_values(by=['float_value'], ascending=False)

Unnamed: 0,answer,float_value
1543,Henry VIII,55
2963,Solomon,35
2723,Richard III,33
2070,Louis XIV,31
1027,David,30
...,...,...
1951,L. Frank Baum,1
1950,L'chaim,1
1949,Königsberg,1
1948,Kung Pao Chicken,1


In the filtered dataset, we have used a `groupby` to organise the rows by the answer to each
**Jeopardy!** question and count them. Sorting the counts in descending order reveals that
**Henry VIII** was the most common answer (55 questions), followed by **Solomon** (35) and **Richard III** (33).
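The count-and-sort step can also be expressed with `value_counts`, which counts occurrences and sorts them in descending order by default. This is a minimal sketch on a hypothetical `sample` frame, not the notebook's data.

```python
import pandas as pd

# Hypothetical answers column for illustration
sample = pd.DataFrame({'answer': [
    'Henry VIII', 'Solomon', 'Henry VIII', 'David', 'Henry VIII', 'Solomon',
]})

# value_counts() returns a Series indexed by answer, most frequent first
counts = sample['answer'].value_counts()

print(counts.head(3))
```

Unlike the `groupby` version, the result is a Series indexed by answer rather than a DataFrame with a column named `float_value`, which makes it clearer that the numbers are counts and not prize values.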