## Intro

Today I'm going to explore a dataset of Jeopardy! questions, which can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

The dataset contains information on when the show aired, the question and answer, and the value of the question.
First, I'll read the dataset into a pandas DataFrame.

In [13]:
import pandas as pd

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

jeopardy = pd.read_csv("jeopardy.csv")

jeopardy.head(5)

jeopardy.columns

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces. Let's quickly remove the spaces to make the columns easier to reference.

In [15]:
jeopardy.columns = ['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question', 'Answer']

jeopardy.head(1)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
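The hard-coded list above works, but it has to match the column order exactly. As a minimal sketch, the same names can be derived from the existing ones. The `raw_columns` list below is copied from the `jeopardy.columns` output shown earlier, and `clean_columns` is a name introduced here for illustration.

```python
# Column names as printed by jeopardy.columns above (note the leading spaces)
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# Remove every space, so ' Air Date' becomes 'AirDate'
clean_columns = [name.replace(" ", "") for name in raw_columns]

print(clean_columns)
```

Assigning the result with `jeopardy.columns = clean_columns` should give the same names as the manual list.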


Now let's normalize the text columns so that words like "Don't" and "don't" aren't treated as different words.

import re

def normalize_string(s):
    # Lowercase, then drop everything except letters, digits, and whitespace
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    return s

In [45]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_string)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_string)

jeopardy.head(1)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
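To make the effect of the normalization concrete, here is a small self-contained check. It repeats the `normalize_string` logic from above so it can run on its own, and the sample strings were chosen only for illustration.

```python
import re

def normalize_string(s):
    # Same logic as above: lowercase, then keep only letters, digits, whitespace
    s = s.lower()
    return re.sub(r"[^A-Za-z0-9\s]", "", s)

# Sample strings chosen for illustration
samples = ["Don't", "don't", "McDonald's", "Dec. of Indep."]
normalized = [normalize_string(s) for s in samples]

print(normalized)  # ['dont', 'dont', 'mcdonalds', 'dec of indep']
```

Both spellings of "don't" collapse to the same token, which is what the later word matching relies on.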


Next, the Value column should be an integer, and AirDate would be easier to work with as a datetime object.

In [52]:
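def normalize_number(s):
    # Strip "$" and "," so that "$2,000" becomes "2000"
    s = re.sub(r"[^\d\.]", "", s)
    try:
        s = int(s)
    except ValueError:
        # Values that can't be parsed as an integer fall back to 0
        s = 0
    return s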

In [53]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_number)
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])

jeopardy.head(1)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
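The row-by-row `apply` works fine at this size. For comparison, here is a hedged sketch of a vectorized version using pandas string methods. The small `values` Series is made-up sample data, the sketch assumes whole-dollar amounts (it drops every non-digit, including "."), and treating unparseable entries as 0 mirrors the fallback in `normalize_number`.

```python
import pandas as pd

# Made-up sample of raw Value strings, including one with no digits
values = pd.Series(["$200", "$2,000", "None"])

# Drop every non-digit character, then convert; entries left empty become NaN
digits = values.str.replace(r"\D", "", regex=True)
clean_value = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(clean_value.tolist())  # [200, 2000, 0]
```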


## Increasing Odds of Winning

It could be helpful to understand whether it's best to study past questions, study general knowledge, or not study at all.

To get at that, I'll check how often the answer can be deduced from the question, and whether any questions are repeats of older questions.

In [58]:
def count_answer(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
        
    if len(split_answer) == 0:
        return 0
    
    for answer in split_answer:
        if answer in split_question:
            match_count+=1
            
    return match_count/len(split_answer) 

In [60]:
jeopardy['answer_in_question'] = jeopardy.apply(axis=1, func=count_answer)

jeopardy['answer_in_question'].mean()

0.060493257069335872

On average, only about 6% of an answer's words appear in its question, which is too low to be useful. Hoping to figure out the answer by hearing the question won't increase our odds of winning.
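To see what this number measures, it helps to run the same calculation on a toy example. The sketch below reimplements the matching logic of `count_answer` on plain strings; `answer_fraction` and both sample strings are hypothetical.

```python
def answer_fraction(clean_answer, clean_question):
    # Fraction of answer words (ignoring one "the") that appear in the question
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if len(answer_words) == 0:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized strings
full_match = answer_fraction("the golden gate bridge",
                             "this bridge spans the golden gate strait")
no_match = answer_fraction("copernicus",
                           "galileo was under house arrest for espousing this mans theory")

print(full_match)  # 1.0
print(no_match)    # 0.0
```

A mean of 0.06 over the whole column is the average of these per-row fractions, so most rows look like the second example.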

Now for repeated questions.

In [66]:
question_overlap = list()
terms_used = set()

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [s for s in split_question if len(s) > 5]

    match_count = 0
        
    for word in split_question:
        if word in terms_used:
            match_count +=1
        else:
            terms_used.add(word)
                
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
jeopardy['question_overlap'] = question_overlap
            
jeopardy['question_overlap'].mean()

0.69259600573386471

On average, about 69% of the words in a question have already appeared earlier in the dataset. That seems like a lot, but it only counts individual words of 6 characters or longer, not whole phrases or questions. Still, it suggests that studying past questions may be worth looking into.
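One caveat worth checking is that this measure depends on row order, because a word only counts as repeated if it appeared in an earlier row. The sketch below reruns the same loop on two made-up questions in both orders. `overlap_scores` is a name introduced here. Sorting the real data by AirDate first would be one way to make "earlier" mean "aired earlier", though the analysis above does not do this.

```python
def overlap_scores(questions, min_len=6):
    # For each question, the fraction of long words already seen in earlier ones
    terms_used = set()
    scores = []
    for question in questions:
        words = [w for w in question.split(" ") if len(w) >= min_len]
        seen = 0
        for word in words:
            if word in terms_used:
                seen += 1
            else:
                terms_used.add(word)
        scores.append(seen / len(words) if words else 0)
    return scores

# Made-up normalized questions
a = "this country borders france"
b = "this country borders germany and france"

print(overlap_scores([a, b]))  # [0.0, 0.75]
print(overlap_scores([b, a]))  # [0.0, 1.0]
```

The first question always scores 0, and the later scores change with the ordering, so the 69% figure is best read as a rough indicator.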

## Investigating High and Low Value Questions

In [72]:
def value_filter(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [73]:
jeopardy['high_value'] = jeopardy.apply(axis=1, func=value_filter)
jeopardy.head(1)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0,0.0,0


In [74]:
def count_values(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [91]:
observed_expected = list()
comparison_terms = list(terms_used)[:5]

for item in comparison_terms:
    observed_expected.append(count_values(item))

observed_expected

[(0, 3), (0, 1), (0, 1), (1, 0), (0, 1)]
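Because count_values rescans every row for each word, checking many terms gets expensive. A possible alternative, sketched here under the assumption that the rows are available as (clean_question, high_value) pairs, is to build the counts for all words in a single pass. The `rows` list and the `count_all_values` name are hypothetical.

```python
from collections import defaultdict

def count_all_values(rows):
    # Map each word to [high_count, low_count], counting a word once per question
    counts = defaultdict(lambda: [0, 0])
    for clean_question, high_value in rows:
        for word in set(clean_question.split(" ")):
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return counts

# Made-up (clean_question, high_value) pairs
rows = [
    ("this planet is closest to the sun", 0),
    ("this planet has the great red spot", 1),
    ("the sun is this type of star", 0),
]

counts = count_all_values(rows)
print(tuple(counts["planet"]))  # (1, 1)
print(tuple(counts["sun"]))     # (0, 2)
```

With the real DataFrame, the pairs could come from something like `zip(jeopardy['clean_question'], jeopardy['high_value'])`.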

## Chi-Squared Test

In [92]:
import numpy as np
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = list()

for item in observed_expected:
    total = sum(item)
    total_prop = total/jeopardy.shape[0]
     
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    exp_array = np.array([exp_high, exp_low])
    
    obs_array = np.array([item[0], item[1]])
    
    chi_squared.append(chisquare(obs_array, exp_array))

chi_squared

[Power_divergenceResult(statistic=1.2058885383806519, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]

The terms tested don't show any significant difference in usage between high and low value questions: all five p-values are above 0.05. The frequencies of these terms are also very low, and the chi-squared test is unreliable with such small expected counts. It would be better to first find the highest-frequency terms and then run the test on those. However, the count_values function scans the whole dataset for every term, so it's slow to run on a lot of terms.
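To make the test itself less of a black box, here is a small sketch of the computation done by hand for a single term. The counts are hypothetical (300 high value and 700 low value questions, with a term seen once in high value and four times in low value questions). The p-value relies on the fact that, with one degree of freedom, the chi-squared survival function equals erfc(sqrt(x / 2)).

```python
import math

def chi_squared_two_cells(observed, group_sizes):
    # Expected counts split the term's total in proportion to the group sizes
    total_obs = sum(observed)
    total_rows = sum(group_sizes)
    expected = [total_obs * size / total_rows for size in group_sizes]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Two cells give one degree of freedom, where the p-value is erfc(sqrt(x/2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: (high, low) observations and (high, low) question totals
statistic, p_value = chi_squared_two_cells((1, 4), (300, 700))
print(round(statistic, 4))  # 0.2381
print(p_value)
```

Here the expected counts are 1.5 and 3.5, both well below the common rule of thumb of 5, which is the same problem the five terms above run into.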