# Winning Jeopardy

With the goal of winning Jeopardy, we will analyze a dataset of Jeopardy questions and answers to see whether any helpful patterns emerge.

In [11]:
import pandas as pd
import numpy as np
import random
from scipy import stats

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
# Strip whitespace in column names
jeopardy.columns = jeopardy.columns.str.strip()

# Normalize Question and Answer columns: lowercase, punctuation replaced by spaces
jeopardy['clean_question'] = jeopardy['Question'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)
jeopardy['clean_answer'] = jeopardy['Answer'].str.lower().str.replace(r'[^\w\s]', ' ', regex=True)

# Normalize value column: strip non-word characters such as '$', then convert to a number
jeopardy['clean_value'] = jeopardy['Value'].str.replace(r'\W', '', regex=True)
jeopardy['clean_value'] = pd.to_numeric(jeopardy['clean_value'], errors='coerce', downcast='integer')
# Values that could not be parsed become 0
jeopardy.loc[jeopardy['clean_value'].isnull(), 'clean_value'] = 0

# Convert date column to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200.0
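The normalization rules can be checked on a few hand-written strings before trusting them on the whole dataset. The sketch below uses plain Python with the same regular expressions instead of pandas; `normalize_text`, `normalize_value`, and the sample inputs are hypothetical names and values chosen for illustration.

```python
import re

def normalize_text(text):
    """Lowercase and replace every non-word, non-space character with a space."""
    return re.sub(r'[^\w\s]', ' ', text.lower())

def normalize_value(value):
    """Strip non-word characters such as '$' and ','; return 0 if no number remains."""
    digits = re.sub(r'\W', '', value)
    return int(digits) if digits.isdigit() else 0

# Hypothetical inputs in the style of the dataset
sample_text = normalize_text("McDonald's")
sample_value = normalize_value("$2,000")
missing_value = normalize_value("None")
print(sample_text, sample_value, missing_value)
```

Replacing punctuation with a space rather than deleting it is why an apostrophe splits a word in two, as in the `clean_answer` column above.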


## Effectiveness of Studying Strategies

To decide whether studying past questions, studying general knowledge, or not studying at all is most helpful, we would like to know the following:

- How often the answer can be deduced from the question itself
- How often questions are repeated

To answer the first question, we will count how often words in the answer occur in the question.

To answer the second question, we will count how often longer words (6 or more characters) recur in later questions.

Let's first write a function that takes a row from the data set and returns the fraction of words in the answer that also appear in the question.

In [4]:
def compare_q_a(row):
    # Note: str.replace removes every 'the' substring, including inside longer words
    split_answer = row.loc['clean_answer'].replace('the', '').split()
    split_question = row.loc['clean_question'].split()
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Fraction of answer words that also occur in the question
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(compare_q_a, axis=1)

In [5]:
jeopardy['answer_in_question'].mean()

0.06191608323339909

On average, just 6% of words in the answer also appear in the question. Because this percentage is so low, we can't rely on questions giving us the answer. Not studying appears to be a poor strategy.
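To make the ratio concrete, here is a minimal sketch of the same calculation on two made-up rows, written in plain Python so it runs without the dataset. `answer_overlap` is a hypothetical helper, and unlike the notebook cell it drops "the" only when it is a whole word.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up rows: the first gives nothing away, the second leaks half of the answer
no_hint = answer_overlap('the city of yuma in this state has a record', 'arizona')
partial_hint = answer_overlap('this adams was the second president', 'john adams')
print(no_hint, partial_hint)
```

A score of 0 means the question gives no part of the answer away, and 1 means every answer word is already in the question.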

Let's see how often questions are repeated in the dataset.

In [6]:
terms_used = set()
question_overlap = []
jeopardy = jeopardy.sort_values('Air Date')
for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.721603243720504

On average, 72% of the words of 6 or more characters in a question have already appeared in an earlier question. This analysis covers only a small subset of all Jeopardy questions and matches single words rather than phrases, so there is not enough information to say that studying past questions is a good strategy.
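The overlap score depends on the order in which questions are processed, which is easier to see on a toy example. The sketch below repeats the logic of the cell above on three made-up questions; `question_overlap_scores` and `toy_questions` are hypothetical names.

```python
def question_overlap_scores(questions, min_length=6):
    """For each question, the share of long words already seen in earlier questions."""
    terms_seen = set()
    scores = []
    for question in questions:
        long_words = [w for w in question.split() if len(w) >= min_length]
        repeats = 0
        for word in long_words:
            if word in terms_seen:
                repeats += 1
            else:
                terms_seen.add(word)
        scores.append(repeats / len(long_words) if long_words else 0)
    return scores

# Made-up questions, already lowercased and stripped of punctuation
toy_questions = [
    'this planet is the largest in the solar system',
    'the largest desert on this planet',
    'a famous painter from spain',
]
scores = question_overlap_scores(toy_questions)
print(scores)
```

The first question always scores 0 because nothing has been seen yet. The second question shares a topic with no earlier one but still scores highly, because common words such as "largest" and "planet" recur; this is one reason word-level overlap overstates how often questions are truly repeated.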

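Next, we check whether particular words are more common in high-value questions (worth more than $800) than in low-value ones.

In [7]: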
def value(row):
    # Flag questions worth more than $800 as high value (1), all others as low value (0)
    if row.loc['clean_value'] > 800:
        return 1
    return 0

jeopardy['high_value'] = jeopardy.apply(value, axis=1)

In [10]:
def classify_words(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row.loc['clean_question'].split()
        if word in split_question:
            if row.loc['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

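# random.sample needs a sequence; sampling directly from a set fails on recent Python versions
comparison_terms = random.sample(list(terms_used), 10)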
observed_expected = []
for word in comparison_terms:
    observed_expected.append(classify_words(word))
    
print(observed_expected)

[(1, 8), (0, 1), (2, 1), (14, 54), (1, 0), (0, 2), (0, 1), (1, 0), (1, 0), (0, 1)]
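`classify_words` scans the whole data set once per word, which becomes slow when more than a handful of words are compared. One possible alternative, sketched here in plain Python on made-up rows, counts all target words in a single pass. `count_words_by_value` and `toy_rows` are hypothetical names, and each row is assumed to be a `(clean_question, high_value)` pair.

```python
from collections import Counter

def count_words_by_value(rows, words):
    """Return (high_count, low_count) per word from one pass over the rows."""
    targets = set(words)
    high = Counter()
    low = Counter()
    for clean_question, high_value in rows:
        # Intersect with a set so each question counts a word at most once
        for word in targets & set(clean_question.split()):
            if high_value == 1:
                high[word] += 1
            else:
                low[word] += 1
    return [(high[w], low[w]) for w in words]

# Made-up rows: (clean_question, high_value)
toy_rows = [
    ('this element has the symbol au', 1),
    ('this element is a noble gas', 0),
    ('the capital of this country is paris', 0),
]
counts = count_words_by_value(toy_rows, ['element', 'capital', 'missing'])
print(counts)
```

With the DataFrame from this notebook, the rows could presumably be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.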


In [12]:
high_value_count = jeopardy.loc[jeopardy['high_value'] == 1, 'high_value'].sum()
low_value_count = len(jeopardy) - high_value_count

chi_squared = []
for array in observed_expected:
    total = sum(array)
    total_prop = total / len(jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    chi_squared.append(stats.chisquare(array, [expected_high, expected_low]))
chi_squared

[Power_divergenceResult(statistic=1.3570460299240277, pvalue=0.24405008712856013),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868264344),
 Power_divergenceResult(statistic=2.172513445240382, pvalue=0.140496438465013),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
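The statistics above can be reproduced by hand, which also exposes how small the expected counts are. For a word seen once, in a low-value question, the statistic reduces to p / (1 - p), where p is the share of high-value questions; the printed 0.402 therefore implies p of roughly 0.2867. That share is an assumption back-solved from the output, not read from the dataset, and `chi_squared_1df` is a hypothetical helper.

```python
import math

def chi_squared_1df(observed_high, observed_low, high_share):
    """Chi-squared statistic, p-value (1 degree of freedom), and smallest expected count."""
    total = observed_high + observed_low
    expected_high = total * high_share
    expected_low = total * (1 - high_share)
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, min(expected_high, expected_low)

HIGH_SHARE = 0.2867  # assumption: back-solved from the printed statistics

# The most frequent sampled word: 14 high-value and 54 low-value questions
statistic, p_value, smallest_expected = chi_squared_1df(14, 54, HIGH_SHARE)
# A word that occurs only once, in a low-value question
rare_statistic, rare_p_value, rare_smallest_expected = chi_squared_1df(0, 1, HIGH_SHARE)
print(statistic, p_value, smallest_expected)
print(rare_statistic, rare_p_value, rare_smallest_expected)
```

Only the first word has expected counts above the usual rule-of-thumb minimum of about 5; for a word seen once, the smallest expected count is below 1.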

## Chi Squared Results

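None of the p-values is below 0.05, so we fail to reject the null hypothesis for any of the ten sampled words. There is no statistically significant difference in how often these words are used in high-value and low-value questions. Most of the sampled words occur only once or twice, however, so their expected counts are too small for the chi-squared test to be reliable; words with higher frequencies would give a more trustworthy result.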