<h1>Using stats to win at Jeopardy</h1>

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.
Let's say we want to compete on Jeopardy and are looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to find patterns in the questions that could help us win.
Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

    Show Number -- the Jeopardy episode number of the show this question was in.
    Air Date -- the date the episode aired.
    Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
    Category -- the category of the question.
    Value -- the number of dollars answering the question correctly is worth.
    Question -- the text of the question.
    Answer -- the text of the answer.


In [24]:
import pandas as pd
import re
from scipy.stats import chisquare
import numpy as np

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')

In [3]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


The first five rows of the dataframe are shown above.

In [4]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have a leading space, which is a little annoying, so we'll correct this.

In [5]:
c = ['Show Number', 'Air Date', 'Round', 'Category', 'Value',
       'Question', 'Answer']
jeopardy.columns = c

Before we can start analysing the Jeopardy questions, we need to normalize the text columns (the Question and Answer columns). The idea is to lowercase the words and remove punctuation, so that "Don't" and "don't" aren't treated as different words when we compare them.
The Value column should also be numeric, so that we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The Air Date column should also be a datetime, not a string, to enable us to work with it more easily.

In [6]:
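def normalise_Q_A(s):
    # Lowercase, then drop everything except letters, digits and whitespace
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    return s

def normalize_values(s):
    # Strip punctuation such as the dollar sign before converting to int
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    try:
        s = int(s)
    except Exception:
        # Anything that is not a number is treated as 0
        s = 0
    return s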
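
As a quick check of what these two helpers do, the sketch below re-declares them and runs them on a few made-up strings. The sample inputs are illustrative assumptions, not rows taken from the dataset.

```python
import re

def normalise_Q_A(s):
    # Lowercase, then drop everything except letters, digits and whitespace.
    return re.sub(r"[^a-z0-9\s]", "", s.lower())

def normalize_values(s):
    # Strip punctuation such as "$" and ",", then convert to int; fall back to 0.
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    try:
        return int(s)
    except ValueError:
        return 0

# Made-up sample strings (assumptions for illustration only).
cleaned = [normalise_Q_A(s) for s in ["Don't", "don't", "McDonald's"]]
values = [normalize_values(v) for v in ["$200", "$1,000", "None"]]

print(cleaned)  # ['dont', 'dont', 'mcdonalds']
print(values)   # [200, 1000, 0]
```

After normalization, "Don't" and "don't" compare equal, and any value that cannot be parsed as a number becomes 0.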

In [7]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise_Q_A)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise_Q_A)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

    How often the answer is deducible from the question.
    How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [10]:
def match_words(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    # Remove the first 'the' from the answer, as it carries no useful information
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Share of answer words that also appear in the question
    return match_count / len(split_answer)

answer_in_question = jeopardy.apply(match_words, axis=1)
print(answer_in_question.mean())

0.0604932570693


On average, only about 6% of the words in an answer also appear in its question. We therefore can't rely on the question text alone to work out the answer.
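
To make the matching rule concrete, here is a minimal sketch of the same calculation on two made-up question/answer pairs. The function name `match_fraction` and the sample strings are illustrative assumptions; they are not part of the dataset or of the notebook.

```python
def match_fraction(clean_question, clean_answer):
    # Share of answer words (ignoring the first 'the') found in the question.
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Only "city" is shared, so 1 of the 3 answer words matches.
partial = match_fraction("the empire state building is in this city", "new york city")
# No answer word appears in the question.
none_shared = match_fraction("the city of yuma in this state has a record average", "arizona")
# An answer consisting only of 'the' is empty after removal, so the score is 0.
only_the = match_fraction("this article is the most common english word", "the")

print(partial, none_shared, only_the)
```

A low average score like the one above means most rows behave like the second pair: the answer simply does not appear in the question text.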

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
    - If it does, increment a counter.
    - Add each word to terms_used.

This will enable us to check whether the terms in a question have been used previously. Only looking at words with 6 or more characters filters out words like "the" and "than", which are commonly used but don't tell us much about a question.

In [28]:
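question_overlap = []
terms_used = set()
# The steps above call for ascending air-date order; this loop visits rows in the dataframe's current order
for i, row in jeopardy.iterrows():
    # Keep only words with 6 or more characters
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for q in split_question:
        if q in terms_used:
            match_count += 1
    # Record this question's words for later rows
    for word in split_question:
        terms_used.add(word)
    # Convert the count into the share of long words already seen
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
print(jeopardy['question_overlap'].mean())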

0.690873731567


In the 10% of the Jeopardy data that we are investigating in this project, about 69% of the long terms in a question have, on average, already appeared in an earlier question. This measures repeated single terms rather than repeated whole questions, and it is unclear whether it would hold for the full dataset. Still, it is evidence that most questions reuse terms, which may suggest that topics repeat themselves.

Let's say we want to study only high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms collected above in terms_used, and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
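
The expected counts and the chi-squared statistic for a single term can be worked through by hand. The sketch below uses hypothetical row totals (20,000 rows, 5,000 of them high value), chosen only to keep the arithmetic simple; they are not the totals of the real dataset. The statistic is Pearson's sum of (observed - expected)^2 / expected, which is the quantity `scipy.stats.chisquare` reports as `statistic`.

```python
def chi_squared_statistic(observed, expected):
    # Pearson's statistic: sum over categories of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical totals (assumptions, not taken from the dataset).
total_rows = 20000
high_value_rows = 5000
low_value_rows = 15000

# A term seen once in a high-value question and never in a low-value one.
observed = (1, 0)

# Share of all questions that contain the term.
term_share = sum(observed) / total_rows

# If the term were spread evenly, each group would get this many occurrences.
expected = (term_share * high_value_rows, term_share * low_value_rows)

statistic = chi_squared_statistic(observed, expected)
print(expected, statistic)  # expected is about (0.25, 0.75), statistic about 3.0
```

With expected counts this far below 1, a single occurrence is enough to move the statistic noticeably, which is why very rare terms make poor candidates for this test.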

In [32]:
def high_value(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
jeopardy['high_value'] = jeopardy.apply(high_value,axis=1)

In [33]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [34]:
observed_expected = []
comparison_terms = list(terms_used)[:5]
for t in comparison_terms:
    observed_expected.append(count_usage(t))
observed_expected

[(1, 0), (0, 1), (0, 1), (0, 1), (0, 1)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [35]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = []
for l in observed_expected:
    total = sum(l)
    total_prop = total/jeopardy.shape[0]
    high_val_expected = total_prop * high_value_count
    low_val_expected = total_prop * low_value_count
    observed = np.array([l[0], l[1]])
    expected = np.array([high_val_expected, low_val_expected])
    chi_squared.append(chisquare(observed, expected))

chi_squared


[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]

None of the p-values is below our threshold of 0.05, so we cannot conclude that any of these terms is used significantly differently in high and low value rows. In addition, each of these terms occurs only once, and the chi-squared test is unreliable at such low counts (a common rule of thumb asks for frequencies of at least 5). It would be better to run this test only on terms that occur more often.
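
One possible way to restrict the test to more frequent terms is to count, for each long word, how many questions it appears in, and keep only the words above a cut-off. This is a sketch rather than part of the original analysis: the function name `frequent_terms`, its parameters and the toy questions are all assumptions. A term's expected count in each group is smaller than its total count, so a real run would likely need a cut-off well above 5.

```python
from collections import Counter

def frequent_terms(clean_questions, min_length=6, min_count=5):
    # Count in how many questions each sufficiently long word appears.
    counts = Counter()
    for question in clean_questions:
        # A set ensures a word is counted once per question.
        words = {w for w in question.split(' ') if len(w) >= min_length}
        counts.update(words)
    # Keep only the terms that reach the cut-off.
    return {term: n for term, n in counts.items() if n >= min_count}

# Toy, already-normalized questions (made up for illustration).
toy_questions = [
    "this president signed the treaty",
    "the president lived in virginia",
    "a treaty ended this war",
]

kept = frequent_terms(toy_questions, min_count=2)
print(kept)
```

On the real data, the input would be something like the values of the clean_question column, and the resulting terms could replace the arbitrary five-term sample used above.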

From this brief analysis it seems that hacking Jeopardy is harder than we had initially thought...