# Winning Jeopardy

This analysis will explore any potential patterns in the questions that are asked on the show 'Jeopardy'. Historical data will be used, which contains 20,000 questions that have been asked on the show.

In [1]:
import pandas as pd

# import the csv file into a pandas dataframe
jeopardy = pd.read_csv('jeopardy.csv')

In [2]:
# verify the first 5 rows of the data set
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# print the column names of the data set
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Since there are spaces before some of the names of the columns, the columns will be renamed not to contain spaces.

In [4]:
# rename the columns of the data set
new_col_names = ['show_number', 'air_date', 'round', 'category', 'value', 'question', 'answer']
jeopardy.columns = new_col_names

In [5]:
# verify the column names were changed
jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

In [6]:
# verify the output of the data set with new column names
jeopardy.head(2)

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe


The text fields of the data set will need to be normalized so that the same words aren't considered different due to capitalization or punctuation.

In [7]:
import re

# create a function to normalize the question and answer fields
def normalize_q_a(string):
    lower_case_string = string.lower()
    lower_case_string = re.sub('[^A-Za-z0-9\s]', '', lower_case_string)
    return lower_case_string

In [8]:
# create a new column for normalized questions
jeopardy['clean_question'] = jeopardy['question'].apply(normalize_q_a)

In [9]:
# verify the cleaned questions
jeopardy['clean_question'].head(10)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
5    in the title of an aesop fable this insect sha...
6    built in 312 bc to link rome  the south of ita...
7    no 8 30 steals for the birmingham barons 2306 ...
8    in the winter of 197172 a record 1122 inches o...
9    this housewares store was named for the packag...
Name: clean_question, dtype: object

In [10]:
# create a new column for normalized answers
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_q_a)

In [11]:
# verify the cleaned answers
jeopardy['clean_answer'].head(10)

0        copernicus
1        jim thorpe
2           arizona
3         mcdonalds
4        john adams
5           the ant
6    the appian way
7    michael jordan
8        washington
9     crate  barrel
Name: clean_answer, dtype: object

The value column will be normalized to remove the dollar sign, and convert the data type from string to numeric. Additionally, the air_date column will be changed from a string to a datetime data type to make it easier to perform analysis with.

In [12]:
# define a function to clean the value column
def clean_value(val_string):
    no_punct = val_string.replace('$', '')
    no_punct = no_punct.replace(',', '')
    
    # if there's a conversion error, set the value to zero
    try:
        int_value = int(no_punct)
    except Exception:
        int_value = 0
        
    return int_value

In [13]:
# clean the value column into a new column
jeopardy['clean_value'] = jeopardy['value'].apply(clean_value)

In [14]:
# verify the output
jeopardy['clean_value'].head(20)

0     200
1     200
2     200
3     200
4     200
5     200
6     400
7     400
8     400
9     400
10    400
11    400
12    600
13    600
14    600
15    600
16    600
17    600
18    800
19    800
Name: clean_value, dtype: int64

In [15]:
# conver air_date column into a Datetime object
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'], format='%Y-%m-%d')

In [16]:
# verify the output of the cleaned air_date column
jeopardy['air_date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: air_date, dtype: datetime64[ns]

## Historical Question Analysis

The next step is to determine how frequently historical questions are re-used, and also how often the answer can be deduced from the question itself.

In [17]:
# define a function that finds whether there is a word match from the question and answer
def q_a_word_match(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    
    # exclude the word 'the' from the split answers
    if 'the' in split_answer:
        split_answer.remove('the')
    
    # return zero if split_answer has no content in the list
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

In [18]:
# create a new column for the proportion of the answer that is found in the question
jeopardy['answer_in_question'] = jeopardy.apply(q_a_word_match, axis=1)

In [19]:
# verify the different proportions
jeopardy['answer_in_question'].value_counts()

0.000000    17375
0.500000     1452
0.333333      551
0.250000      170
1.000000      123
0.666667      103
0.200000       82
0.166667       28
0.400000       28
0.142857       20
0.750000       18
0.285714       10
0.600000        9
0.125000        9
0.428571        3
0.181818        2
0.800000        2
0.571429        2
0.300000        2
0.111111        2
0.307692        1
0.444444        1
0.222222        1
0.375000        1
0.100000        1
0.153846        1
0.875000        1
0.272727        1
Name: answer_in_question, dtype: int64

In [20]:
# verify the mean proportion of the answer_in_question colum
mean_answer_in_question = jeopardy['answer_in_question'].mean()
print(mean_answer_in_question)

0.06049325706933587


### Mean Analysis

The mean displayed above indicates that on average, approximately 6% of the question is contained within the answer. While this is not incredibly significant, this means that it could be worthwhile to pay attention to the words within the question itself to see if the answer can be derived.

In [21]:
# check whether parts of historical questions have been re-used
question_overlap = []
terms_used = set()

jeopardy.sort_values(by='air_date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    
    # remove words that are less than 6 characters
    split_question = [wrd for wrd in split_question if len(wrd) > 5]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    
    # add terms to the set of historical terms
    for word in split_question:
        terms_used.add(word)
        
        # calculate the proportion of words that match historical words
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
        
# add a new column to the data set for the overlap proportions
jeopardy['question_overlap'] = question_overlap
        
# calculate mean of the proportions where there is question overlap
jeopardy['question_overlap'].mean()

0.6908737315671962

The mean of 0.69 printed above indicates that on average, approximately 69% of words that are greater than 6 characters long matched with historical words being used. This means there could be an advantage studying historical material in order to prepare for Jeopardy.

## Correlation of historical terms to high/low value questions

The analysis below will compute the chi-square values for the correlation of historical words vs. high or low value questions. This should provide a sense of what types of questions could be asked in each category.

In [22]:
# define a function that differentiates high and low value questions 
def assign_value_category(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

# assign value category to the clean_value column in the data set
jeopardy['high_value'] = jeopardy.apply(assign_value_category, axis=1)

In [23]:
# verify the output of the high_value column
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [28]:
# create a function that determines whether a word is found in a high or low value historical question
def count_high_low(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_q = row['clean_question'].split(' ')
        if word in split_q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

observed_expected = []

#convert set to a list, and only look at the first 5 words for terms_used (for simplicty)
comparison_terms = list(terms_used)[:5]

# find the count of high value terms used and low value terms used
for term in comparison_terms: 
    observed_expected.append(count_high_low(term))

In [29]:
# verify the output of observed and expected counts
observed_expected

[(2, 3), (1, 0), (2, 7), (1, 0), (0, 4)]

In [32]:
from scipy.stats import chisquare
import numpy as np

# perform the chi-squared test and determine the p-value (statistical significance)
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    # define observed and expected data sets
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high, expected_low])
    
    # perform chi-square test
    chi_sq_p_val = chisquare(observed, expected)
    chi_squared.append(chi_sq_p_val)

In [33]:
# verify the output
chi_squared

[Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.18303865877777942, pvalue=0.6687747661279759),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948)]

# Conclusion

As shown in the p-values above, there is no statistical significance of the relationship or expected/observed discrepancy between high value words and low value words, and their historical use.