# Let's win Jeopardy!


For fun and for practice, let's take a dataset of roughly 20,000 Jeopardy questions and see whether we can find an edge to win the game and make some money!



In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# strip leading whitespace from the column names
jeopardy.rename(str.lstrip,axis='columns',inplace=True) 

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
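# let's normalize the question and answer columns: lowercase, no punctuation
import re
def normalize_f(text):
    text = text.lower()
    # raw string so that \s is read as a regex escape, not a string escape
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    return text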

In [6]:
#let's clean the question and the answer columns

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_f)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_f)

In [7]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
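To see what the normalization does to a single string, here is a minimal standalone sketch. The name `normalize_text` is just for illustration (it mirrors `normalize_f` above), and since the text is lowercased first, the `A-Z` range in the character class is not needed.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

# Truncated sample taken from the Question column shown above
sample = "Signer of the Dec. of Indep., framer of the Co..."
print(normalize_text(sample))
# Apostrophes are dropped too, so the two halves of the word are joined
print(normalize_text("McDonald's"))
```

Note that punctuation is deleted rather than replaced by a space, so "McDonald's" becomes "mcdonalds" and will only match other occurrences normalized the same way.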


In [8]:
#let's also normalize the value and air date columns
def normalize_values(text):
    # strip characters such as '$' and ',' before converting to int
    text = re.sub(r'[^A-Za-z0-9\s]', '', text.lower())
    try:
        return int(text)
    except ValueError:
        # non-numeric entries are mapped to 0
        return 0

jeopardy['value_clean'] = jeopardy['Value'].apply(normalize_values)

In [9]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date']) 

In [10]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
value_clean                int64
dtype: object
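One thing to keep in mind about `value_clean`: mapping every non-numeric entry to 0 makes a missing value look like a real zero-dollar question. Below is a small sketch of a variant that keeps the two apart. It is not the function used in this notebook; `parse_value` is a made-up name, and the `"None"` placeholder is an assumption about what a non-numeric entry might look like.

```python
import re

def parse_value(raw):
    """Return the dollar amount as an int, or None when no digits are present."""
    digits = re.sub(r"[^0-9]", "", str(raw))
    return int(digits) if digits else None

# "$200" appears in the data above; the other two inputs are illustrative
for raw in ["$200", "$2,000", "None"]:
    print(repr(raw), "->", parse_value(raw))
```

Unlike `normalize_values`, this variant keeps only the digits, so a mixed string such as `"abc123"` would yield 123 instead of 0. Whether that is desirable depends on what the raw column actually contains.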

## We want to know two things:
1. How often can the answer be deduced from the question? We will measure how many of the answer's words show up in the question itself.
2. How often do questions repeat wording from past episodes?

In [11]:
#let's answer the first question
def deducible_question(series_row):
    split_answer = series_row['clean_answer'].split(' ')
    split_question = series_row['clean_question'].split(' ')
    match_counter = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer)==0:
        return 0
    for var in split_answer:
        if var in split_question:
            match_counter+=1
    match_counter = match_counter/len(split_answer)
    return match_counter

jeopardy['answer_in_question'] = jeopardy.apply(deducible_question,axis=1)
mean_coincidence_factor = jeopardy['answer_in_question'].mean()

In [12]:
str(round(mean_coincidence_factor*100,2))+'%'

'6.05%'

### Answer 1:
On average, only about 6% of the words in an answer also appear in its question. That's not very hopeful!

So hoping to deduce the answer from the question should not be our main game strategy, or even our second one!
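To make the score concrete, here is a toy version of the calculation on a single made-up question. The function name `answer_overlap` and the sample question are invented for illustration. It is a slight variant of `deducible_question`: it drops every "the" (not just the first) and splits on any whitespace.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# Made-up question: one of the two answer words ("john") appears in it
score = answer_overlap("this president named john was the second to hold the office",
                       "john adams")
print(score)  # 0.5
```

Because the column stores a fraction per row, its mean is an average share of matching answer words, not the share of questions whose answer is fully given away.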

In [13]:
#let's try to answer question 2

question_overlap = []
terms_used=set()
copy_df = jeopardy
jeopardy = jeopardy.sort_values(by='Air Date', ascending=True)
for i, value in jeopardy.iterrows():
    match_count=0
    split_question = value['clean_question'].split(" ")
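    # NOTE: removing from a list while iterating over it skips the item after
    # each removal, so some words shorter than 6 characters survive this filter
    for var in split_question:
        if len(var)<6:
            split_question.remove(var)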
    for var in split_question:
        if var in terms_used:
            match_count+=1
    for var in split_question:
        terms_used.add(var)
    if len(split_question)>0:
        match_count/=len(split_question)
        question_overlap.append(match_count)
    else:
        question_overlap.append(0)
jeopardy['question_overlap'] = question_overlap

average_repetition_factor = jeopardy['question_overlap'].mean()

print(str(round(average_repetition_factor,2)*100)+'%')


80.0%


### Answer 2:
Wow, the average repetition of words from past questions is about 80%. That means roughly 8 in 10 of the words kept from each question had already appeared in an earlier question.
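One caveat about the short-word filter in the loop above: it removes items from `split_question` while iterating over that same list, and in Python that skips the element right after each removal. The sketch below shows the effect on a made-up word list and compares it with building a new list instead. The 80% figure was produced with the loop as written; the number a stricter filter would give is not computed here.

```python
# Made-up word list for illustration
words = ["in", "1963", "live", "on", "television", "this", "company", "served"]

# Same pattern as the loop above: remove short words while iterating
mutated = list(words)
for w in mutated:
    if len(w) < 6:
        mutated.remove(w)

# Building a new list checks every word exactly once
filtered = [w for w in words if len(w) >= 6]

print(mutated)   # ['1963', 'on', 'television', 'company', 'served']
print(filtered)  # ['television', 'company', 'served']
```

Short, common words that slip through are very likely to have been seen before, so the in-place version may overstate the overlap compared with the list-comprehension filter.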

## High Value Vs Low Value questions

Now that we know that previous questions often repeat similar wording, let's focus on studying the questions that have repeated and that have a high value (reward). This will help us make the most money at the game!

We define:

1. Low value -- Any row where Value is 800 or less.
2. High value -- Any row where Value is greater than 800.

In [14]:
def determine_value(row):
    my_val = 0
    if row['value_clean'] > 800:
        my_val = 1
    return my_val

#let's run this function and create a 0/1 column flagging high-value rows
jeopardy['high_value'] = jeopardy.apply(determine_value,axis=1)



In [15]:
jeopardy.columns


Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer', 'clean_question', 'clean_answer', 'value_clean',
       'answer_in_question', 'question_overlap', 'high_value'],
      dtype='object')

In [16]:
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [17]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# sets are unordered, so these five terms are an arbitrary sample
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected


[(1417, 3301), (0, 1), (8, 12), (0, 1), (1, 0)]

In [18]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=4.28257258858768, pvalue=0.038505027441259804),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.2550672671397245, pvalue=0.2625868590248146),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
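To see where these numbers come from, here is the same calculation done by hand with only the standard library, as a sketch rather than a replacement for `scipy.stats.chisquare`. The row counts are taken from the `value_counts()` output above; the function name `chi_squared_1df` is made up for illustration. With two categories there is one degree of freedom, and in that case the p-value can be written with the complementary error function.

```python
import math

HIGH_ROWS, LOW_ROWS = 5734, 14265  # from the value_counts() output above
TOTAL_ROWS = HIGH_ROWS + LOW_ROWS

def chi_squared_1df(high_obs, low_obs):
    """Chi-squared statistic, p-value and expected counts for one term."""
    share = (high_obs + low_obs) / TOTAL_ROWS
    expected = (share * HIGH_ROWS, share * LOW_ROWS)
    stat = sum((o - e) ** 2 / e
               for o, e in zip((high_obs, low_obs), expected))
    # One degree of freedom: the chi-squared survival function
    # reduces to erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value, expected

# The term seen once, in a low-value row: observed counts (0, 1)
stat, p_value, expected = chi_squared_1df(0, 1)
print(stat, p_value, expected)
```

For the `(0, 1)` term this reproduces the second result above (statistic ≈ 0.402, p ≈ 0.526). It also makes the problem visible: the expected counts are about 0.29 and 0.71, far below the usual rule of thumb of at least 5 per cell.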


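## Chi-squared results
Only the first term, with 1417 high-value and 3301 low-value occurrences, shows a difference at the 0.05 level (p ≈ 0.039); the other four do not. With five tests run at once, a single p-value just under 0.05 is weak evidence on its own. Three of the five terms appear in just one question, so their expected counts are far below 5 and the chi-squared test isn't reliable for them. It would be better to run this test only on terms that have higher frequencies.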
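One possible way to pick higher-frequency terms is to count in how many questions each term appears and keep only those above a cutoff. The sketch below uses three made-up questions in place of the `clean_question` column, and `MIN_ROWS` is a hypothetical threshold chosen only so that the toy example returns something.

```python
from collections import Counter

# Toy stand-ins for the clean_question column (not rows from the dataset)
clean_questions = [
    "this country borders france and germany",
    "this country is the largest in africa",
    "the capital of this country is canberra",
]

# Count each term once per question, keeping only words of 6+ characters
term_counts = Counter(
    word
    for question in clean_questions
    for word in set(question.split())
    if len(word) >= 6
)

MIN_ROWS = 2  # illustrative cutoff; real data needs a larger one
frequent_terms = [term for term, n in term_counts.items() if n >= MIN_ROWS]
print(frequent_terms)  # ['country']
```

On the real data, about 28.7% of rows are high value (5734 of 19999), so a term has to appear in at least 18 rows before its expected high-value count reaches 5. Something like `MIN_ROWS = 18` would be a reasonable lower bound under that rule of thumb.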