# Patterns in Jeopardy Questions
### Practice with Statistics

In [5]:
import pandas as pd

In [6]:
jeopardy = pd.read_csv("jeopardy.csv")

# Explore the dataset
print(jeopardy.head(5))
print(jeopardy.columns)

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [7]:
# Rename columns to get rid of spaces in front
# Not the best way but the small number of columns does not warrant a more
# sophisticated solution

jeopardy.rename(columns = {
        " Air Date": "Air Date", " Round": "Round", " Category": "Category", " Value": "Value", " Question": "Question", " Answer": "Answer"
    }, inplace = True)
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
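
As the comment in the cell notes, spelling out every column by hand is not the most general approach. A minimal sketch of an alternative, assuming the only problem is leading or trailing whitespace in the labels, uses the `str.strip` accessor on the column index. The toy frame `df` below is illustrative and stands in for `jeopardy`.

```python
import pandas as pd

# Toy frame with the same leading-space problem as the real CSV header.
df = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Strip surrounding whitespace from every column label in one step.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

This scales to any number of columns, at the cost of being less explicit about which labels changed.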


### Normalizing Text

In [8]:
import string
def normalize_text(some_string):
    lowercase_string = some_string.lower()
    # Put each punctuation character into a dictionary, mapped to None
    punc_dict = {}
    for punc in string.punctuation:
        punc_dict[punc] = None
    # maketrans turns the punctuation dictionary into a translation table that
    # the translate method can read; characters mapped to None are deleted
    final_string = lowercase_string.translate(str.maketrans(punc_dict))
    return final_string



jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)

jeopardy[jeopardy.columns[5:9]].head(5)

| | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|
| 0 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
| 1 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
| 2 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
| 3 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
| 4 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |
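
The same normalization can be written more compactly. The sketch below is one possible variant, not the notebook's own function: `normalize_text_sketch` and `PUNCT_TABLE` are hypothetical names, and it relies on the three-argument form of `str.maketrans`, whose third argument lists the characters to delete.

```python
import string

# Build the translation table once; the third argument lists characters to delete.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text_sketch(text):
    """Lowercase the text and drop ASCII punctuation (hypothetical helper)."""
    return text.lower().translate(PUNCT_TABLE)


print(normalize_text_sketch("McDonald's"))
print(normalize_text_sketch('In 1963, live on "The Art Linkletter Show"'))
```

Building the table once, outside the function, avoids rebuilding it for every row when the function is passed to `apply`.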


### Normalizing Columns

In [11]:
def num_dollar(some_string):
    punc_dict = {}
    for punc in string.punctuation:
        punc_dict[punc] = None
    no_punc_dollar = some_string.translate(str.maketrans(punc_dict))
    try:
        final_dollar = int(no_punc_dollar)
    except Exception:
        final_dollar = 0
    return final_dollar

jeopardy["clean_value"] = jeopardy["Value"].apply(num_dollar)

jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
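
The value cleanup could also be done without a row-by-row `apply`. This is a rough sketch on an illustrative Series (`values` is made up, not taken from the dataset). It removes every non-digit character, which is slightly more aggressive than stripping punctuation only, and maps anything that is left unparseable to 0, as `num_dollar` does.

```python
import pandas as pd

# Illustrative inputs: two dollar amounts and one non-numeric entry.
values = pd.Series(["$200", "$1,000", "None"])

# Drop every character that is not a digit.
digits_only = values.str.replace(r"\D", "", regex=True)

# Unparseable entries (now empty strings) become NaN, then 0.
clean = pd.to_numeric(digits_only, errors="coerce").fillna(0).astype(int)

print(clean.tolist())
```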

## Similarities between Answers and Questions
Now that our data has been cleaned so that it is easy to work with, let's consider how we might approach Jeopardy with this information in hand. We can ask two questions:
1. How often does the answer show up in the question itself?
2. How often are questions recycled?

These two questions can help inform our study strategy for an edge in Jeopardy. Let's look at the first question.

### Answers in Questions

In [12]:
def match_count(jeo_row):
    split_answer = jeo_row["clean_answer"].split(" ")
    split_question = jeo_row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(match_count, axis = 1)


print("Average fraction of words shared in both the question and the answer:")
print(jeopardy["answer_in_question"].mean())

Average fraction of words shared in both the question and the answer:
0.0603527738547


On average, only about 6 percent of the words in an answer also appear in its question. If I were to study for Jeopardy, it does not seem prudent to rely on finding the answer in the question itself. However, if I were to come across a stumper, this finding suggests that a key word of the answer is occasionally in the question, which may be a good starting point when I have nothing else.
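
To make the metric concrete, here is a small standalone sketch of the same idea applied to plain strings. `answer_overlap` is a hypothetical helper and the example strings are invented. Unlike `match_count`, it drops every occurrence of "the" and splits on any whitespace, so results can differ slightly in edge cases.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)


# Invented example: only "city" is shared, so one of three answer words matches.
fraction = answer_overlap("new york city", "this city is nicknamed the big apple")
print(fraction)
```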

### Recycled Questions

In [13]:
question_overlap = []
terms_used = set()
jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():  # iterrows yields the index i and the row
    split_question = row["clean_question"].split(" ")
    # Keep only words longer than 5 characters as a rough filter for meaningful terms
    split_question = [q for q in split_question if len(q) > 5]
    # Use a name that does not overwrite the match_count function defined earlier
    overlap = 0
    for word in split_question:
        if word in terms_used:
            overlap += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        overlap = overlap / len(split_question)
    question_overlap.append(overlap)
jeopardy["question_overlap"] = question_overlap

print("Average fraction of major terms reused from older questions:")
print(jeopardy["question_overlap"].mean())

Average fraction of major terms reused from older questions:
0.687124288097


This suggests that, on average, around 69% of the longer terms (more than five characters) in a question have already appeared in an earlier question. While this doesn't tell us whether the questions themselves are recycled, since we aren't looking at phrases, it does tell us that many of the terms are reused, so the topics are unlikely to vary widely over time. Studying old questions is likely a beneficial way to get an edge in Jeopardy!
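
The running-overlap calculation can be illustrated on a toy list, assuming the questions are already normalized and sorted from oldest to newest. `overlap_fractions` and the three sample questions are hypothetical, chosen only to show how the set of seen terms grows.

```python
def overlap_fractions(questions, min_length=6):
    """For each question, the share of its long terms already seen in earlier ones."""
    seen = set()
    fractions = []
    for question in questions:
        terms = [w for w in question.split() if len(w) >= min_length]
        reused = sum(1 for w in terms if w in seen)
        seen.update(terms)
        fractions.append(reused / len(terms) if terms else 0)
    return fractions


toy_questions = [
    "this president signed the treaty",
    "the treaty ended this conflict",
    "a short one",
]
fractions = overlap_fractions(toy_questions)
print(fractions)
```

The first question can never score above zero because nothing has been seen yet, which slightly lowers the average on small datasets.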

## Chi-Squared Test for Low and High Value Questions

While this has been helpful, the next step in making the data work for us is to identify which words appear often in high value questions so that we can focus our efforts on those. We'll define high value as being worth greater than $800.

In [14]:
def high_or_low(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy["high_value"] = jeopardy.apply(high_or_low, axis = 1)

def high_val_word(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = []
terms_used = list(terms_used)
comparison_terms = terms_used[0:5]

for term in comparison_terms:
    observed_expected.append(high_val_word(term))

In [15]:
observed_expected 
# each tuple shows how many times a certain term was in a high value question, 
# followed by times in a low value question
# each tuple is a different word

# These are all observed values

[(2, 0), (0, 1), (0, 5), (1, 0), (0, 1)]
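
Calling `high_val_word` once per term rescans the whole dataset each time. A possible alternative, sketched here on invented `(clean_question, high_value)` pairs rather than the real frame, is to tally every word in a single pass with `collections.Counter`. Each question is reduced to a set of words first, so a word is counted once per question, as in the original function.

```python
from collections import Counter


def count_terms_by_value(rows):
    """Count, per word, how many high- and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        target = high_counts if high_value == 1 else low_counts
        # A set ensures each word is counted at most once per question.
        target.update(set(clean_question.split(" ")))
    return high_counts, low_counts


# Invented rows: (normalized question text, high_value flag).
toy_rows = [
    ("famous painting museum", 1),
    ("famous river", 0),
    ("museum museum paris", 0),
]
high, low = count_terms_by_value(toy_rows)
print(high["famous"], low["famous"])
print(high["museum"], low["museum"])
```

With the counters built, the `(high, low)` tuple for any term is a dictionary lookup, which would make testing many more terms practical.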

In [18]:
# Calculating Chi-Squared
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy["high_value"] == 1])
low_value_count = len(jeopardy[jeopardy["high_value"] == 0])

chi_squared = []

for some_list in observed_expected:
    total = sum(some_list)
    total_prop = total / len(jeopardy)
    exp_term_count_high = total_prop * high_value_count
    exp_term_count_low = total_prop * low_value_count
    
    observed = np.array([some_list[0], some_list[1]])
    expected = np.array([exp_term_count_high, exp_term_count_low])
    
    chi_squared.append(chisquare(observed, expected))

# Taking the first tuple (2, 0) as an example:
# We add the high and low counts, 2 + 0 = 2.
# This word appeared in a total of 2 questions in the jeopardy data.
# Dividing that by the length of jeopardy gives the proportion of questions
# the word has appeared in.
# If, say, the word appears in 4% of the questions and there are 30 high-value
# questions, then we expect the term to appear in
# (total_prop * high_value_count) high-value questions.
# The same goes for low-value questions.
# Our observed values are the original tuple; they were gathered
# by actually checking every question.
# chisquare takes the observed counts, then the expected counts, and returns
# the chi-squared statistic and the p-value.

chi_squared

[Power_divergenceResult(statistic=4.9755842343913503, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498966),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]
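
To show what `chisquare` computes, the statistic can be worked out by hand as the sum of (observed − expected)² / expected over the two bins. The sketch below uses only the standard library. The totals of 300 high-value and 700 low-value questions are hypothetical, so the numbers will not match the output above. With two bins there is one degree of freedom, for which the p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math


def chi_squared_two_bins(observed, expected):
    """Chi-squared statistic and p-value for two bins (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical totals, NOT the real counts from the dataset.
high_total, low_total = 300, 700
observed = (2, 0)  # term seen in 2 high-value and 0 low-value questions

share = sum(observed) / (high_total + low_total)
expected = (share * high_total, share * low_total)

stat, p = chi_squared_two_bins(observed, expected)
print(stat, p)
```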

## Chi-Squared Results

The first thing to note is that the observed counts were all very low: most fall between 0 and 2, and the largest is 5. The chi-squared approximation is unreliable when counts are this small, so the results should be interpreted with caution.

All but the first of our p-values are above 0.05, so for those terms there is no statistically significant difference between their usage in high-value and low-value questions. Only the first term has a p-value below 0.05, but that term appeared just twice in total.

## Going Further

For future analysis, there are a number of ways to make the approach more sophisticated:

1. Instead of choosing significant words based on length, we could manually make a list of articles and similar common words (the, a, than, of, etc.) to exclude.
2. Perform the Chi-Squared test with more terms (the code is slow, so this will take a while!).
3. Consider categories using the Category column.
4. This data was just a subset of the whole dataset; we could expand to the complete data.
5. Consider phrases instead of single words to get a better idea of question overlap.
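
As a starting point for items 1 and 5, the following sketch filters words against a hand-made stop list and then pairs adjacent words into two-word phrases. `STOP_WORDS`, `content_words`, and `bigrams` are hypothetical names, and the stop list is deliberately tiny and illustrative.

```python
# Illustrative stop list; a real one would be longer.
STOP_WORDS = {"the", "a", "an", "of", "than", "in", "this", "to", "and"}


def content_words(clean_question):
    """Keep the words of a normalized question that are not in the stop list."""
    return [w for w in clean_question.split() if w not in STOP_WORDS]


def bigrams(words):
    """Pair each word with the one that follows it."""
    return list(zip(words, words[1:]))


words = content_words("the city of yuma in this state")
pairs = bigrams(words)
print(words)
print(pairs)
```

Tracking pairs instead of single words in the overlap loop would give a stricter measure of whether whole phrases, not just vocabulary, are reused.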