# Winning Jeopardy

In [41]:
# Importing relevant libraries
import pandas as pd

In [42]:
# Read in the data set
jeopardy = pd.read_csv('jeopardy.csv')

In [43]:
# Printing first 5 rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [44]:
# Checking column names
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

All of the column names except the first start with an extra space.

In [45]:
# Changing column names
jeopardy.columns = ['Show_Number','Air_Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
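
Hard-coding the names works for this file, but it would mislabel columns if their order ever changed. A minimal sketch of a programmatic alternative, assuming the only problems are surrounding and internal spaces (`raw_columns` is copied from the output above; `tidy_column` is a hypothetical helper name):

```python
# Column names as printed by jeopardy.columns above
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

def tidy_column(name):
    """Strip surrounding whitespace and replace inner spaces with underscores."""
    return name.strip().replace(' ', '_')

tidy_columns = [tidy_column(name) for name in raw_columns]
print(tidy_columns)

# On the real DataFrame this could be applied as:
# jeopardy.columns = [tidy_column(name) for name in jeopardy.columns]
```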

Let's normalize the questions and answers to a common format: lowercase, with punctuation removed.

In [46]:
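# Creating a function to normalize the `Question` and `Answer` columns
import re

def normalize_text(text):
    text = text.lower()
    # Remove everything except lowercase letters, digits and whitespace
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Collapse each run of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text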
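
Two regex details matter in a normalizer like this one. The range `A-z` is wider than `A-Za-z`: it spans ASCII codes 65 to 122, which also covers the brackets, backslash, caret, underscore and backtick that sit between `Z` and `a`, so those characters would survive a punctuation filter. And `[\s+]` is a character class that matches one whitespace character or a literal `+`, whereas `\s+` matches a whole run of whitespace. The following sketch, using made-up sample strings, shows the difference:

```python
import re

sample = "a_b  [c]"  # made-up input with an underscore, brackets and a double space

# 'A-z' covers ASCII 65..122, so '_', '[' and ']' count as allowed characters
wide = re.sub(r'[^A-z0-9\s]', '', sample)
# 'a-z' (sufficient once the text is lowercased) removes them
narrow = re.sub(r'[^a-z0-9\s]', '', sample)

# '[\s+]' replaces each whitespace character separately, so runs keep their length
per_char = re.sub(r'[\s+]', ' ', "a \t b")
# '\s+' collapses a whole run into one space
collapsed = re.sub(r'\s+', ' ', "a \t b")

print(repr(wide), repr(narrow), repr(per_char), repr(collapsed))
```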

In [47]:
# Creating two new columns using normalize_text() on 'Question' and 'Answer' columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

In [48]:
# Creating a function to normalize the values in 'Value' column
def normalize_value(value):
    value = re.sub(r'[^A-Za-z0-9\s]','',value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    
    return value

In [49]:
# Creating a new column using normalize_value() on 'Value' column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [50]:
# Checking the updated result
jeopardy.head()

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [16]:
# Changing the datatype of 'Air_Date' column to datetime
jeopardy['Air_Date'] = pd.to_datetime(jeopardy['Air_Date'])

## Finding Answer in the Question

In [20]:
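def answer_in_question(x):
    # Fraction of the answer's words that also appear in the question
    split_question = x['clean_question'].split()
    split_answer = x['clean_answer'].split()
    
    match_count = 0
    
    # Ignore 'the' (first occurrence), which would match almost any question
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    
    for i in split_answer:
        if i in split_question:
            match_count += 1
    
    return match_count / len(split_answer)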

In [21]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

In [23]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

On average, only about `6%` of the words in an answer also appear in its question. The answer can therefore rarely be read off the question itself, so we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
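
To make the metric concrete, here is a small worked example on two made-up rows (plain dictionaries standing in for DataFrame rows; the clue text is invented for illustration). It repeats the same logic as `answer_in_question` above:

```python
def answer_in_question(row):
    split_question = row['clean_question'].split()
    split_answer = row['clean_answer'].split()
    # Drop the first 'the', as in the notebook function
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Invented rows, not taken from the data set
rows = [
    {'clean_question': 'this canyon in arizona is called grand',
     'clean_answer': 'the grand canyon'},
    {'clean_question': 'this state is home to yuma',
     'clean_answer': 'arizona'},
]
scores = [answer_in_question(r) for r in rows]
print(scores)
```

The first row scores 1.0 because both remaining answer words occur in the question; the second scores 0.0 because "arizona" does not.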

## Recycled questions

In [55]:
jeopardy.sort_values('Air_Date', inplace=True)

In [56]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())

0.689532391454348


On average, about 70% of the longer words (six or more characters) in a question have already appeared in an earlier question. However, this only measures repeated single words, not repeated phrases, so it does not show that whole questions are recycled. The result is weak evidence on its own, but it does suggest that question recycling is worth investigating further.
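
One way to look at phrases rather than single words is to track consecutive word pairs (bigrams). The sketch below is an assumption about how such a check could look, not part of the analysis above; `bigram_overlap` is a hypothetical helper and the sample questions are invented:

```python
def bigram_overlap(questions):
    """For each question (in order), return the share of its word pairs
    that already appeared in an earlier question."""
    seen = set()
    overlaps = []
    for question in questions:
        words = question.split()
        bigrams = list(zip(words, words[1:]))
        if bigrams:
            repeated = sum(1 for pair in bigrams if pair in seen)
            overlaps.append(repeated / len(bigrams))
        else:
            overlaps.append(0)
        # Register this question's pairs only after scoring it
        seen.update(bigrams)
    return overlaps

sample_questions = [
    'this river dug the grand canyon',
    'the grand canyon is in this state',
]
overlaps = bigram_overlap(sample_questions)
print(overlaps)
```

On the sorted DataFrame, the same function could presumably be called as `bigram_overlap(jeopardy['clean_question'])`.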

## Low value vs high value questions

In [57]:
# Creating a function to separate high value questions
def question_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

In [58]:
# Creating a new column using question_value()
jeopardy['high_value'] = jeopardy.apply(question_value, axis=1)

In [59]:
# Checking updated dataframe
jeopardy.head()

Unnamed: 0,Show_Number,Air_Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0


In [60]:
# Creating a function which counts how many high value and low value questions contain a given word
def question_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count
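
`question_count` rescans the whole DataFrame for every word, which gets slow when many terms are compared. A possible single-pass alternative, sketched here on invented data with plain tuples standing in for rows (`count_terms` is a hypothetical name), tallies every word once:

```python
from collections import Counter

def count_terms(rows):
    """rows: iterable of (clean_question, high_value) pairs.
    Returns two Counters giving, per word, the number of high value
    and low value questions that contain it."""
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() so a word repeated inside one question is counted once,
        # matching the `word in split_question` test above
        words = set(question.split())
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Invented rows, not taken from the data set
sample_rows = [
    ('the capital of france', 1),
    ('the capital of peru', 0),
    ('river in peru', 0),
]
high_counts, low_counts = count_terms(sample_rows)
print(high_counts['capital'], low_counts['capital'], low_counts['peru'])
```

With the real data, the input could be built as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.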

In [30]:
# Randomly selecting 10 distinct words from 'terms_used'
import random
comparison_terms = random.sample(list(terms_used), 10)

In [31]:
# Calculating high value and low value counts for each term in comparison_terms
observed_expected = []
for term in comparison_terms:
    observed_expected.append(question_count(term))
    
# Checking the result
observed_expected

In [33]:
# Calculating number of high value questions
high_value_count = (jeopardy['high_value'] == 1).sum()

In [34]:
# Calculating number of low value questions
low_value_count = (jeopardy['high_value'] == 0).sum()

In [43]:
# Calculating the chi-squared value
from scipy.stats import chisquare
import numpy as np

chi_squared = []
for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([expected_high, expected_low])
    
    chi_squared.append(chisquare(observed, expected))

In [44]:
# Checking the result
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.0459928943532475, pvalue=0.15260738863448364),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

## Chi-squared results

None of the terms showed a statistically significant difference in usage between high value and low value rows: every p-value is above 0.05. Additionally, the frequencies were all lower than `5`, and the chi-squared test is unreliable at such low counts. It would be better to rerun the test using only terms that have higher frequencies.
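
The statistic used above is the sum of `(observed - expected)**2 / expected` over the two cells (high value and low value). The sketch below redoes that calculation by hand with hypothetical totals (6,000 high value and 14,000 low value questions, chosen for illustration and not the real counts) to show why rare terms are a problem: a term that appears once has expected counts far below 5. For one degree of freedom the p-value equals `erfc(sqrt(statistic / 2))`, which keeps the example free of a SciPy dependency.

```python
import math

def chi_squared_two_cells(observed, expected):
    """Pearson chi-squared statistic and p-value for two cells (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, chosen only for illustration
high_total, low_total = 6000, 14000
n_rows = high_total + low_total

def expected_counts(term_total):
    """Expected high/low counts if the term were spread proportionally."""
    share = term_total / n_rows
    return share * high_total, share * low_total

# A term seen once, in a low value question
rare_expected = expected_counts(1)
rare_stat, rare_p = chi_squared_two_cells((0, 1), rare_expected)

# A term seen 100 times, 45 of them in high value questions
common_expected = expected_counts(100)
common_stat, common_p = chi_squared_two_cells((45, 55), common_expected)

print(rare_expected, rare_stat, rare_p)
print(common_expected, common_stat, common_p)
```

Under these assumed totals, only the frequent term has expected counts large enough for the test to be meaningful.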
