# Winning Jeopardy

In [58]:
import pandas as pd
import csv
import string
from scipy.stats import chisquare
import numpy as np


jeopardy = pd.read_csv("jeopardy.csv")

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [59]:
# Remove leading and trailing whitespace from the column names
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


In [60]:
# Normalize the Question, Answer and Value columns

def normalize(s):
    s = s.lower()
    s = s.translate(str.maketrans('', '', string.punctuation))
    return s

def normalize_currency(s):
    s = s.translate(str.maketrans('', '', string.punctuation))
    try:
        i = int(s)
    except ValueError:
        i = 0
    return i

jeopardy['clean_question']  = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer']  = jeopardy['Answer'].apply(normalize)
jeopardy['clean_value']  = jeopardy['Value'].apply(normalize_currency)

jeopardy.head(2)
jeopardy.tail(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200
19998,3582,2000-03-14,Jeopardy!,LLAMA-RAMA,$200,Llamas are the heftiest South American members...,Camels,llamas are the heftiest south american members...,camels,200


In [61]:
# convert the Air Date column to a datetime column.
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1) How often the answer is deducible from the question.

2) How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (six or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [62]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    result = match_count / len(split_answer)
    return result

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis = 1)
jeopardy["answer_in_question"].mean()

0.058861482035140716
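
To make the metric concrete, here is a minimal standalone sketch of the same idea applied to two made-up question/answer pairs (they are not rows from `jeopardy.csv`). The helper name `answer_match_fraction` is hypothetical, and unlike `count_matches` above, which removes only the first `the`, this version drops every occurrence.

```python
import string

def normalize(text):
    """Lowercase and strip punctuation, mirroring the notebook's normalize()."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def answer_match_fraction(question, answer):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    question_words = set(normalize(question).split())
    answer_words = [w for w in normalize(answer).split() if w != "the"]
    if not answer_words:
        return 0.0
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Hypothetical examples for illustration only
print(answer_match_fraction("This Great Wall is in China", "The Great Wall"))  # 1.0
print(answer_match_fraction("The city of Yuma is in this state", "Arizona"))   # 0.0
```

A value of 1.0 means every answer word is given away by the question, and 0.0 means none is. The column mean reported above averages this fraction over all rows.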

## Recycled questions

On average, only about 6% of the words in an answer also appear in the question. This isn't a huge number, and it means that we probably can't count on the question giving away the answer. We'll probably have to study.

We want to investigate how often new questions are repeats of older ones. For that we can check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like `the` and `than`, which are commonly used, but don't tell us a lot about a question.

In [63]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6889055316620328

## Low value vs high value questions

There is about 70% overlap between terms in new questions and terms in old questions. This analysis covers only a small set of questions, and it compares single terms rather than phrases. On its own, that makes the figure weak evidence that questions are recycled, but it does suggest that recycling is worth a closer look.
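
One possible way to look at phrases instead of single terms is to compare adjacent word pairs (bigrams). The sketch below is an assumption about how that could be done, not part of the original analysis. The helper names `bigrams` and `phrase_overlap` are hypothetical, and the input is a plain list of already-normalized questions in air-date order.

```python
def bigrams(words):
    """Return the set of adjacent word pairs in a list of words."""
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, the share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question.split())
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0.0)  # single-word or empty question has no bigrams
        seen |= pairs
    return overlaps

# Hypothetical questions for illustration only
sample = [
    "this river flows through egypt",
    "this river flows through paris",
    "paris",
]
print(phrase_overlap(sample))  # [0.0, 0.75, 0.0]
```

Because it works on sets, a bigram repeated within one question counts once, which differs slightly from the per-word counting in the loop above.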

In [64]:
def determine_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis = 1)

def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if term in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
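
`count_usage` scans every row once per term, which gets slow when many terms are tested. A possible alternative, sketched here under the assumption that the questions and flags are available as plain sequences, tallies all terms in a single pass. The name `count_usage_all` is hypothetical.

```python
from collections import Counter

def count_usage_all(questions, high_flags):
    """Count, per term, how many high- and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for text, is_high in zip(questions, high_flags):
        # A set makes each question count a term at most once,
        # matching the `term in split_question` check in count_usage.
        terms = set(text.split(" "))
        (high_counts if is_high else low_counts).update(terms)
    return high_counts, low_counts

# Hypothetical data for illustration only
questions = ["ancient empire capital", "capital of france", "famous empire builder"]
high_flags = [1, 0, 1]
high, low = count_usage_all(questions, high_flags)
print(high["empire"], low["empire"])    # 2 0
print(high["capital"], low["capital"])  # 1 1
```

With the DataFrame from this notebook, the inputs would presumably be `jeopardy['clean_question']` and `jeopardy['high_value']`.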

In [65]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    v = count_usage(term)
    observed_expected.append(v)
    
observed_expected

[(3, 6),
 (2, 2),
 (1, 0),
 (0, 1),
 (1, 0),
 (1, 0),
 (1, 0),
 (0, 2),
 (3, 3),
 (0, 1)]

In [66]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
print(high_value_count)

5734


In [67]:
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
print(low_value_count)

14265


In [69]:
chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_term_count = total_prop * high_value_count
    low_term_count = total_prop * low_value_count
   
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_term_count , low_term_count])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.09564350170321084, pvalue=0.75712159875701),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
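
As a sanity check on how the expected counts are derived, the first result can be reproduced by hand from the observed pair `(3, 6)` and the row counts printed earlier. This sketch uses only the standard library. For one degree of freedom, the chi-squared p-value equals `erfc(sqrt(statistic / 2))`, which stands in here for `scipy.stats.chisquare`.

```python
import math

high_rows, low_rows = 5734, 14265    # row counts printed above
total_rows = high_rows + low_rows    # 19999
observed = (3, 6)                    # (high_count, low_count) for the first term

# Expected counts if the term were spread evenly across all rows
total = sum(observed)
expected = (total * high_rows / total_rows, total * low_rows / total_rows)

# Pearson chi-squared statistic
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# With two categories there is 1 degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 4), round(p_value, 4))  # 0.0956 0.7571
```

The expected counts come out to roughly 2.58 and 6.42, so the observed split of 3 and 6 is very close to what an even spread would produce.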

## Chi-squared results

None of the ten sampled terms showed a statistically significant difference in usage between high-value and low-value rows (every p-value is above 0.05). Additionally, almost all of the frequencies were lower than 5, so the chi-squared test is not reliable here. It would be better to rerun the test using only terms that have higher frequencies.