# Winning Jeopardy

In [1]:
# Disable warnings in Anaconda
import warnings
warnings.filterwarnings('ignore')

import pandas as pd # Data processing
import numpy as np # Linear algebra
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white')

import timeit # measure runtimes

In [2]:
data = pd.read_csv("jeopardy.csv")

In [3]:
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


## Data Cleaning

In [4]:
data.columns = [name.strip() for name in data.columns] # remove whitespaces in column names
data.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
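# Replace every non-word character with a space, then lowercase.
# regex=True is required in recent pandas versions, where str.replace is literal by default.
data["Question"] = (
    data["Question"].str.replace(r"\W", " ", regex=True)
    .str.lower()
)
data["Answer"] = (
    data["Answer"].str.replace(r"\W", " ", regex=True)
    .str.lower()
)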

In [6]:
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,for the last 8 years of his life galileo was ...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,no 2 1912 olympian football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,in 1963 live on the art linkletter show th...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,signer of the dec of indep framer of the co...,john adams


In [7]:
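# Strip the dollar sign and thousands separator as literal characters
data["Value"] = (
    data["Value"].str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)
# Set values that contain letters instead of a number to 0; .loc on the frame avoids chained assignment
data.loc[data["Value"].str.contains(r"[a-z]"), "Value"] = 0
data["Value"] = data["Value"].astype("int")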

In [8]:
data["Air Date"] = pd.to_datetime(data["Air Date"])

## Can we deduce answers from the asked questions?
- Create a function that takes in a row and:
    - splits the `Question` column into words
    - splits the `Answer` column into words
    - counts how many `Answer` words also appear in `Question`
    - returns that count as a fraction of the number of `Answer` words
- Apply the function to the dataframe `data` and store the result in a new column

In [9]:
def match_counts(row):
    split_question = row["Question"].split(" ")
    split_answer = row["Answer"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the") # we do not want to match "the", irrelevant word
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer) # convert to ratio between 0 and 1

data["answer_in_question"] = data.apply(match_counts, axis=1)

print("Mean share of answer words that also appear in the question: {:.2f}".format(data["answer_in_question"].mean()))

Mean share of answer words that also appear in the question: 0.10


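The value above is the mean, over all rows, of the fraction of answer words that also appear in the question. On average, about 10% of an answer's words (0.10) can be found in its question, so the question gives away little of the answer on average. Note that splitting on single spaces leaves empty strings wherever punctuation was replaced by a space, and these empty strings can match each other, so the figure is probably somewhat inflated.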
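To make the difference between the mean overlap ratio and the share of questions with any overlap concrete, here is a minimal sketch in plain Python. The function name `answer_overlap` and the example rows are hypothetical and not part of the dataset; the sketch also splits on any whitespace and drops every occurrence of "the", which differs slightly from `match_counts`.

```python
def answer_overlap(question, answer, stopwords=("the",)):
    """Fraction of answer words (minus stopwords) that also occur in the question."""
    question_words = set(question.split())  # split() ignores repeated spaces
    answer_words = [w for w in answer.split() if w not in stopwords]
    if not answer_words:
        return 0.0
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# Hypothetical rows, already lowercased and stripped of punctuation
rows = [
    ("the city of yuma in this state has a record", "arizona"),
    ("this borough shares its name with a cocktail", "manhattan"),
    ("live on the art linkletter show", "art linkletter"),
    ("paul robeson sang about this old man in show boat", "old man river"),
]

ratios = [answer_overlap(q, a) for q, a in rows]
mean_ratio = sum(ratios) / len(ratios)                       # average overlap per row
share_with_any_overlap = sum(r > 0 for r in ratios) / len(ratios)  # rows with at least one match

print(f"mean ratio: {mean_ratio:.2f}, rows with any overlap: {share_with_any_overlap:.2f}")
```

The two numbers generally differ: the first averages partial matches, while the second counts how many rows have at least one matching word.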

## Are past questions recycled / repeated?

In [10]:
question_overlap = list()
terms_used = set()
data_sorted = data.sort_values(by="Air Date")

for index, row in data_sorted.iterrows():
    split_question = row["Question"].split()
    split_question = [word for word in split_question if len(word) >= 6] # only consider words with 6 or more characters
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count +=1
        else:
            terms_used.add(word)
    if len(split_question) > 0: 
        match_count = match_count / len(split_question) # convert to ratio between 0 and 1
    question_overlap.append(match_count)

data_sorted["question_overlap"] = question_overlap
print("Mean of relevant word count ratios that have been repeated: {:.2f}".format(data_sorted["question_overlap"].mean()))

Mean of relevant word count ratios that have been repeated: 0.72


Many words of six or more characters have already appeared in earlier questions: on average, 72% of the long words in a question (0.72) were used in a previous one. However, this does not mean that the questions themselves are being recycled, since common long words can recur across unrelated questions.
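One way to probe whether whole questions (rather than single words) are repeated is to count exact duplicates of the cleaned question text. The following is only a sketch on hypothetical example strings; the helper `normalise` is an assumption introduced here, and on the real data the input would be the cleaned `Question` column.

```python
from collections import Counter

def normalise(text):
    """Collapse runs of whitespace so that equal questions compare equal."""
    return " ".join(text.split())

# Hypothetical cleaned questions
questions = [
    "this alpine country has four national languages",
    "this  alpine country has four national languages",  # same text, extra space
    "this alpine country is famous for its chocolate",
    "the city of yuma is in this state",
]

counts = Counter(normalise(q) for q in questions)
repeated = {q: n for q, n in counts.items() if n > 1}

# Share of rows that repeat an earlier question verbatim
share_repeated = sum(n - 1 for n in repeated.values()) / len(questions)
print(repeated, share_repeated)
```

Exact matching is deliberately strict; reworded questions would need a fuzzier comparison, such as the phrase-based approach listed in the To-Do section.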

## Studying high / low value questions
We will use a chi-squared test to determine which terms used correspond to high or low value questions.
- Low value: 800 or less
- High value: more than 800


In [11]:
def high_or_low(row):
    if row["Value"] > 800:
        return 1
    else:
        return 0
    
data_sorted["high_value"] = data_sorted.apply(high_or_low,axis=1)

### Storing the word counts for high and low value questions

We will create a function that:
- Takes in a word from the set terms_used
- Assign high_count = 0 and low_count = 0
- Loops through all the rows of data_sorted and, if the term is found, adds 1 to high_count or low_count
- returns high_count, low_count

We will then store these values in a dictionary, using the terms as keys and storing (high_count,low_count) as a tuple.

In [12]:
word_counts = dict()

def counts_high_low(term):
    low_count = 0
    high_count = 0
    for index, row in data_sorted.iterrows():
        if term in row["Question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

for term in list(terms_used)[:5]: # using only 5 terms to test our function
    word_counts[term] = counts_high_low(term)
    
word_counts

{'switzerland': (3, 7),
 'flimsy': (0, 1),
 'flashbacks': (1, 0),
 'phenom': (0, 2),
 'ratings': (0, 1)}
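Calling `counts_high_low` once per term iterates over the whole dataframe each time, which becomes slow for the full set of terms. A possible alternative, sketched here in plain Python under the assumption that the rows are available as `(question, high_value)` pairs, is to count all terms in a single pass. The function name `count_terms_by_value` and the example rows are hypothetical.

```python
from collections import defaultdict

def count_terms_by_value(rows, terms):
    """Single pass over rows; returns {term: (high_count, low_count)}."""
    terms = set(terms)
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # set() counts a term once per question, like the `in` check in counts_high_low
        for word in set(question.split()) & terms:
            counts[word][0 if high_value == 1 else 1] += 1
    # Terms that never occur get (0, 0)
    return {term: tuple(counts[term]) for term in terms}

# Hypothetical rows: (cleaned question, high_value flag)
rows = [
    ("switzerland borders this country", 1),
    ("the capital of switzerland", 0),
    ("this show topped the tv ratings", 0),
    ("switzerland switzerland", 0),
]

result = count_terms_by_value(rows, ["switzerland", "ratings", "phenom"])
print(result)
```

The result has the same `(high_count, low_count)` layout as `word_counts`, so it could feed the chi-squared step unchanged.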

## Chi-square test

In [13]:
from scipy.stats import chisquare

chi_squared = dict()

high_value_count = data_sorted[data_sorted["high_value"] == 1].shape[0]
low_value_count = data_sorted[data_sorted["high_value"] == 0].shape[0]

for key,value in word_counts.items():
    total = sum(value)
    total_prop = total / len(data_sorted)
    high_expected = total_prop * high_value_count
    low_expected = total_prop * low_value_count
    observed = np.array([value[0],value[1]])
    expected = np.array([high_expected,low_expected])
    chi_squared[key] = chisquare(observed,expected)

chi_squared

{'switzerland': Power_divergenceResult(statistic=0.008630851497838939, pvalue=0.9259811180040979),
 'flimsy': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'flashbacks': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 'phenom': Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 'ratings': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)}
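To show what the test computes for a single term, the statistic can be worked out by hand. This is a sketch with hypothetical numbers (an assumed 30% share of high-value questions and a made-up term seen 8 times in high-value and 2 times in low-value questions), not values from the dataset. For two categories there is one degree of freedom, for which the p-value can be written with `math.erfc`; the result should agree with `scipy.stats.chisquare` on the same inputs.

```python
import math

def chi_squared_two_bins(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom the chi-squared survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical inputs, not taken from the dataset
high_share = 0.3          # assumed proportion of high-value questions
observed = (8, 2)         # (high_count, low_count) for a made-up term

total = sum(observed)
expected = (total * high_share, total * (1 - high_share))

statistic, p_value = chi_squared_two_bins(observed, expected)
print(f"statistic={statistic:.3f}, p={p_value:.5f}")
```

The expected counts are simply the term's total frequency split according to the overall high/low proportions, which is what the loop above does with `total_prop`.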

## Conclusion
Of the 5 words analyzed, only `switzerland` (10 occurrences) comes close to having enough observations for a chi-squared test; the other four appear only once or twice. With frequencies this low, the chi-squared test does not give reliable results, and none of the p-values above is below 0.05.

We could still try to find the words with the highest frequencies and apply the chi-squared test to only those terms and see if we can find any patterns. 

## To-Do:
- Write intro/intent of this analysis and describe steps
- Examine different ways to eliminate words that do not provide any information, i.e. words that occur too often in the questions
- Do a chi-squared test on high frequency terms
- Analyze the Category column and find any correlation with high/low value questions
- Use more jeopardy data: Can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)
- Use phrases to analyze patterns in questions instead of splitting into words