# Winning Jeopardy
## Alex Haughton
The goal of this project is to use existing questions from a Jeopardy game show dataset to maximize our odds of winning a game of jeopardy.

In [115]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.columns
jeopardy.head(5)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Lets put the columns in camel case and strip the leading spaces for good style practice.

In [116]:
jeopardy.columns = jeopardy.columns.str.replace(' ','')
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Before we start to analyze the jeopardy questions and answers, we need to normalize all of the text in both columns. We'll accomplish this by putting the string in lowercase and removing all punctuation.

In [117]:
jeopardy['clean_answer']=jeopardy['Answer'].str.lower().str.replace('[\'\,\.\(\)\";]','')
jeopardy['clean_question']=jeopardy['Question'].str.lower().str.replace('[\'\,\.\(\)\";]','')

The "Value" column should also be numeric in order to manipulate it more easily, and the "Air Date" should be in datetime format, not string. 

In [118]:
import re

def normalize_dollar_values(value):
    clean_val = re.sub('[$,]','',value)
    try:
        return int(clean_val)
    except:
        return 0
    
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar_values)

In [119]:
#Convert str to datetime format
if type(jeopardy['AirDate'][0]) == str:
    jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'],format='%Y-%m-%d')

Our goal is ultimately to determine whether it is best to study specific past questions, general knowledge, or not study it at all. Two factors will be important: how often an answer is deducible from the question, and how often new quesitons are repeats of older questions.

To answer the first question, we'll determine how many question/answer pairs have complex words (>6 characters) common to both.

In [120]:
def deducible(row):
    split_answer = row[0].split()
    
    #Remove particles
    split_answer = [word for word in split_answer if word not in ['a','an','the']]
    split_question = row[1].split()
    
    #If has no words now, return 0 to avoid division by zero error
    if len(split_answer) == 0:
        return 0 
    
    #Count how many of the words in the answer are found in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = (jeopardy[['clean_answer','clean_question']]
                                  .apply(deducible,axis=1))

In [121]:
jeopardy.sort_values('answer_in_question',ascending=False).head(3)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value,answer_in_question
10556,5281,2007-07-23,Jeopardy!,THE LARGEST IN AREA,$1000,"Libya, Egypt, Tunisia",Libya,libya,libya egypt tunisia,1000,1.0
18064,3227,1998-09-22,Double Jeopardy!,PUT 'EM IN ORDER,$400,"Calamity Jane, Jane Curtin, Lady Jane Grey","Lady Jane Grey, Calamity Jane, Jane Curtin",lady jane grey calamity jane jane curtin,calamity jane jane curtin lady jane grey,400,1.0
3225,5084,2006-10-19,Jeopardy!,THE HIGHEST-SCORING SCRABBLE WORD,$200,"Hell, heaven or limbo",heaven,heaven,hell heaven or limbo,200,1.0


Looking at the questions where all the words in the answer are found in the question ('answer_in_question' == 1), it appears that some jeopardy questions (and whole categories) are sometimes multiple choice, where the contestants choose one of several answers provided in the question.

In [122]:
jeopardy['answer_in_question'].mean()
jeopardy['answer_in_question'].value_counts().sort_values(ascending=False)
answer_in_question = [1 if val > 0 else 0 for val in jeopardy['answer_in_question']]
sum(answer_in_question)/len(answer_in_question)

0.043594567618268805

0.000000    18134
0.500000      978
0.333333      389
0.250000      131
1.000000      125
0.666667       72
0.200000       67
0.400000       24
0.166667       22
0.142857       13
0.750000       11
0.125000        7
0.285714        6
0.600000        4
0.300000        2
0.100000        2
0.111111        2
0.800000        2
0.428571        2
0.153846        1
0.222222        1
0.307692        1
0.272727        1
0.375000        1
0.857143        1
Name: answer_in_question, dtype: int64

0.09325466273313665

Overall, trying to deduce the answer based on the question doesn't seem like a good strategy, fewer than 10% of the questions have even one word of the answer contained in the question.

Now we'll check how often questions in this dataset are repeats of older questions. We'll check not by looking for verbatim repeats of questions, but by looking for repeated usage of terms (6 letters or longer)

In [123]:
question_overlap = []
terms_used = set()

for index, row in jeopardy.sort_values('AirDate').iterrows():
    
    #Get terms 6 letters or longer from each question
    split_question = row['clean_question'].split()
    split_question = [word for word in split_question if len(word)>5]
    match_count = 0
    
    #If term not used yet, add to set. If already used, counts as a match
    for term in split_question:
        if term in terms_used:
            match_count += 1
        else:
            terms_used.add(term)
            
    #Return matches as fraction of total >5 letter terms in question
    if len(split_question) > 0:
        question_overlap.append(match_count/len(split_question))
    else:
        question_overlap.append(0)
        
#Add resulting list to dataframe
jeopardy['question_overlap'] = question_overlap

In [124]:
jeopardy['question_overlap'].mean()

0.6799437402867058

From this, it appears that nearly 68% of the terms in the questions for this dataset are repeats of past questions. This indicates that studying past questions is probably a good strategy for answering questions asked in future games.

To win Jeopardy, it's important to focus more on high value questions than low, as we will score more points for answering the same number of questions correctly. One way to focus our studying is to look for terms which come up more frequently in high value questions.

In [125]:
#Classify each question as high or low value
jeopardy['high_value']=[1 if value >= 800 else 0 for value in jeopardy['clean_value']]

def word_value(word):
    low_count = 0
    high_count = 0
    
    #Count number of times word occurs in high and low value questions
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split():
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []

#Convert set from earlier into list
comparison_terms = list(terms_used)[21:30] 

for term in comparison_terms:
    observed_expected.append(word_value(term))
    
observed_expected

[(3, 3), (1, 1), (10, 13), (0, 1), (0, 1), (0, 1), (0, 2), (1, 1), (0, 1)]

This way is straightforward, but is very inefficient and cannot find the frequency for more than a few terms in a reasonable time. Lets try to write a more efficient algorithm to check for the frequency of terms in high and low value questions by avoiding nested loops and taking advantage of pandas string methods.

In [171]:
observed_high = []
observed_low = [] 

list_terms = list(terms_used)

#Add whitespace to tail and end of question so that first and last words are counted
jeopardy['clean_question'] = ' '+jeopardy['clean_question']+' '

for term in list_terms[0:1000]:
    #Add whitespace to term to avoid counting terms contained in other words
    contains_term = jeopardy['clean_question'].str.contains(' '+term+' ')
    
    #Count questions which are high/low value and contain the term
    observed_high.append((contains_term & jeopardy['high_value']==1).sum())
    observed_low.append((contains_term & (jeopardy['high_value']==0)).sum())

In [173]:
data = [('term',list_terms[0:1000]),
        ('observed_high',observed_high),
        ('observed_low',observed_low)]
terms_df = pd.DataFrame.from_items(data)
terms_df

Unnamed: 0,term,observed_high,observed_low
0,breton,1,1
1,elberon,1,0
2,buhner,0,1
3,clevelands,2,0
4,jawless,1,0
5,href=http://wwwj-archivecom/media/2009-05-04_d...,1,0
6,battlefield,3,0
7,hypertension,0,1
8,longing,0,1
9,find--,1,0


Lets test our newly created terms_df against results from our slow function to see if they match up.

In [185]:
results = []
trys = 30
for test in range(0,trys):
    test_row = terms_df.sample(n=1).iloc[0]
    test_term, test_observed_high, test_observed_low = test_row
    test_observed = (test_observed_high,test_observed_low)
    results.append(word_value(test_term)==test_observed)

if pd.Series(results).all():
    print('The results from {} random samples match with the old formula'.format(trys))

True

The results from 30 random samples match with the old formula


From 30 random samples, we get the same results using our more efficient algorithm. We will assume this algorithm is accurate enough to get the frequency for any term in our list.

The function seems to work, now lets find the expected counts for each word in the term list so we can determine which terms come up significantly more in high value questions. For now, we'll try with 1000 of the 25,000 terms found in the answers and see if we get any useful information that justifies extending the algorithm to the entire terms dataset.

In [189]:
from scipy.stats import chisquare

#Total number of high and low value numbers
high_value_count = sum(jeopardy['high_value']==1)
low_value_count = sum(jeopardy['high_value']==0)

chi_squared = []
chi_prob = []

for index, row in terms_df.iterrows():
    #Find expected number of high and low value questions containing term
    observed = (row['observed_high'],row['observed_low'])
    total = sum(observed)
    total_prop = total/jeopardy.shape[0]
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    expected = (expected_high, expected_low)
    
    #Find chi squared value for each
    chisq, p = chisquare(observed,expected)  
    chi_squared.append(chisq)
    chi_prob.append(p)

  terms = (f_obs - f_exp)**2 / f_exp


In [194]:
terms_df['chisq']=chi_squared
terms_df['chisq_prob']=chi_prob

terms_df[terms_df['chisq_prob']<0.05]
(terms_df['chisq_prob']<0.05).sum()

Unnamed: 0,term,observed_high,observed_low,chisq,chisq_prob
6,battlefield,3,0,3.885127,0.048716
145,egypt</a>,5,1,3.858039,0.049508
160,mariners,3,0,3.885127,0.048716
168,contiguous,3,0,3.885127,0.048716
274,prepares,3,0,3.885127,0.048716
382,target=_blank>seen,3,0,3.885127,0.048716
388,bosworth,3,0,3.885127,0.048716
422,lab</a>,3,0,3.885127,0.048716
430,thrillers,3,0,3.885127,0.048716
438,[gasps],3024,3586,12.737693,0.000358


21

We've successfully found a set of words that are overrepresented in the high or low value questions. Unfortunately, there are a few issues with this result. For one, knowing one word that occurs in the question doesn't tell us much about how to study for the actual question. Knowing that a question contains the word "perfect" doesn't tell us a thing about the rest of the question. Two, from 1000 questions, we were only able to find 21 terms which were overrepresented in high or low value questions, which suggests there isn't a strong correlation between the words in a question and the value of the question. Three, there are "nonsense" terms in our terms list which need to be filtered out.

Unfortunately, from our investigation, we still don't have an obvious studying strategy to follow outside of studying past questions, as we found that the majority of questions asked in Jeopardy are similar to questions asked previously. We cannot make any recommendations for specific topics to study based on our term chi-squared testing, as there are too few terms significantly overrepresented in high value questions, and the ones that we found are not helpful in determining the nature of the question or how to study for it.

One potential pattern which may be useful for future exploration is in looking at the categories of questions and more narrowly classifying them as history, science, mathematics, geography etc. This could give us an idea of which types of questions are most frequently asked, and we could focus our studying more narrowls on those topics.