# Winning Jeopardy with Data

In this notebook I'm going to work with a dataset of Jeopardy! questions, along with their dollar values and answers, and look for patterns in the questions.

Let's take a quick look and see what data we have. Here are the first 5 of 20,000 rows.

In [2]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


What are the column names?

In [5]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

You'll notice that some of the column names have leading spaces. Let's strip them.

In [6]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Now the column names are fixed, with no pesky extra spaces. The next bit of normalization, as for any text analysis, is to clean up the question and answer columns: make everything lowercase and remove punctuation.

In [16]:
import re

# Lowercase the text and strip every character that is not
# a letter, digit, or whitespace
def normalize(text):
    text = text.lower()
    # Raw string so that \s reaches the regex engine unchanged
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

# normalize the question and answer columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)
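
As a quick sanity check of the normalization step, the snippet below runs a standalone copy of `normalize` on a short sample string modelled on one of the questions in the table above. It is only a minimal sketch; the sample text is an illustration, not a row read from the dataset.

```python
import re

def normalize(text):
    # Lowercase, then drop everything except letters, digits, and whitespace
    text = text.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", text)

# Illustrative sample string (not read from jeopardy.csv)
sample = 'In 1963, live on "The Art Linkletter Show"...'
cleaned = normalize(sample)
print(cleaned)  # in 1963 live on the art linkletter show
```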

The next bit of normalization is to make the numeric columns properly numeric. For the column of dollar values, we remove the dollar signs (and any other punctuation) and convert what is left to an integer; anything that can't be converted becomes 0. We also want to make the air date column a proper date, not a string.

In [20]:
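# Strip punctuation (e.g. the dollar sign) and convert to int;
# anything that still is not a number becomes 0
def normalize_vals(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text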

jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_vals)

# convert air date to a datetime object
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
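
To see what `normalize_vals` does with different inputs, here is a small standalone sketch. The sample strings are assumptions chosen for illustration (the table above only shows `$200` values), and this version catches `ValueError` only.

```python
import re

def normalize_vals(text):
    # Remove punctuation such as "$" and ",", then try to parse an integer
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        # Non-numeric leftovers (e.g. the word "None") fall back to 0
        return 0

# Hypothetical inputs, chosen only to exercise each branch
samples = ["$200", "$2,000", "None"]
converted = [normalize_vals(s) for s in samples]
print(converted)  # [200, 2000, 0]
```

Mapping unparseable values to 0 keeps the column numeric, but it also means that any such rows will be treated as low value further down.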

The two questions we are going to pursue with the cleaned-up data are how often the answer can be deduced from the question, and how often new questions repeat older ones.

The first can be tested by checking how often the words of the answer actually appear in the question text. The second can be approached by checking how often complex words (here, words of six or more characters) recur in later questions.



In [24]:
# Compute the fraction of the words in the answer (ignoring
# "the") that also appear in the question
def matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for elt in split_answer:
        if elt in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(matches,
                                                axis=1)

In [28]:
jeopardy['answer_in_question'].mean()

0.060493257069335872

So, to answer the first question: the answer can rarely be inferred from the question. On average, only about 6% of the words in an answer also appear in its question. This makes total sense, because why would the writers "give away" the answer more than occasionally?
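
A toy example makes the `answer_in_question` score concrete. The sketch below applies the same logic as `matches` to a plain dictionary standing in for a DataFrame row; the row contents are made up.

```python
def matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    # Ignore "the", which would match almost any question
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Made-up row: a plain dict supports the same row["..."] lookups
toy_row = {
    "clean_answer": "the eiffel tower",
    "clean_question": "this tower in paris opened in 1889",
}
score = matches(toy_row)
print(score)  # 0.5
```

A score of 0.5 here means that half of the answer's words occur in the question, not that the question gives the answer away, which is worth keeping in mind when reading the average.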

Let's investigate the second question. How often are complex words repeated?

In [34]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for elt in split_question:
        if elt in terms_used:
            match_count += 1
        terms_used.add(elt)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap 
jeopardy['question_overlap'].mean()

0.69259600573386471

Okay, so these data suggest that, on average, about 70% of the complex words in a question have already appeared in a question earlier in the dataset. That hints that Jeopardy! recycles a lot of material, so we might be able to streamline our studying by focusing on previous questions. We've only got a small subset of the data, and we're only looking at recycled terms, not recycled phrases, so take this with a grain of salt. Still, it suggests that further investigation into recycled questions is warranted.
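
To make the overlap score easier to reason about, here is the same loop run on three made-up questions. `question_overlaps` is a hypothetical helper that works on a plain list of strings instead of the DataFrame.

```python
def question_overlaps(questions):
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only "complex" words: six or more characters
        words = [w for w in question.split(" ") if len(w) > 5]
        match_count = 0
        for w in words:
            if w in terms_used:
                match_count += 1
            terms_used.add(w)
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps

# Made-up questions, in the order they would be processed
toy_questions = [
    "galileo studied planets",
    "planets orbit the sun",
    "galileo observed jupiter",
]
overlaps = question_overlaps(toy_questions)
print([round(x, 2) for x in overlaps])  # [0.0, 1.0, 0.33]
```

Note that each score depends on what came before it, so processing the same questions in a different order gives different per-question values.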

The last little bit I'm going to explore here is which words appear most frequently in high- or low-value questions, defined as worth more than $800 and worth $800 or less, respectively.

In [44]:
# Flag a question as high value (1) if it is worth more
# than $800, otherwise low value (0)
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

# Count how many high- and low-value questions contain a term
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Take five terms to test; set order is arbitrary, so this
# sample can differ between runs
comparison_terms = list(terms_used)[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected
    

[(0, 1), (0, 1), (0, 1), (1, 0), (0, 1)]
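
`count_usage` rescans every row for each term, which gets slow as the list of terms grows. One possible alternative, sketched here on made-up data, tallies all terms in a single pass. `count_all_terms` and the `(question, high_value)` tuples are hypothetical stand-ins for the DataFrame columns, not part of this notebook.

```python
from collections import defaultdict

def count_all_terms(rows):
    # Maps term -> [high_count, low_count]
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() so a question counts once per term, like `term in list`
        for term in set(question.split(" ")):
            if high_value == 1:
                counts[term][0] += 1
            else:
                counts[term][1] += 1
    return counts

# Made-up (clean_question, high_value) pairs
rows = [
    ("galileo studied planets", 1),
    ("planets orbit the sun", 0),
    ("galileo observed jupiter", 0),
]
counts = count_all_terms(rows)
print(tuple(counts["planets"]))  # (1, 1)
print(tuple(counts["jupiter"]))  # (0, 1)
```

With all counts in one dictionary, it also becomes easy to pick the most frequent terms for testing rather than an arbitrary handful.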

In [53]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]
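
To see where numbers like these come from, the chi-squared statistic can be worked out by hand. The sketch below does so for a term observed once, in a low-value question, using made-up totals (30 high-value and 70 low-value rows, not the real counts). With two categories there is one degree of freedom, for which the p-value has the closed form `erfc(sqrt(statistic / 2))`, so SciPy is not needed for the illustration.

```python
import math

# Hypothetical totals, for illustration only
high_value_count = 30
low_value_count = 70
n_rows = high_value_count + low_value_count

observed = (0, 1)  # (high, low): the term appears once, in a low-value question
total_prop = sum(observed) / n_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Pearson's chi-squared: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Survival function of the chi-squared distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected)
print(round(statistic, 4))  # 0.4286
print(round(p_value, 2))
```

The expected counts here are well below 1, while the usual rule of thumb for the chi-squared test asks for expected counts of at least 5. None of the five p-values above is below 0.05, so no term shows a significant difference between high- and low-value questions, but terms that occur only once cannot really tell us anything either way. Testing terms that appear more often would make the comparison more meaningful.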