# Using tests to win Jeopardy

In this notebook we are going to explore the [`Jeopardy`](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file) dataset, which contains over 20000 questions and answers.

In [123]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats import chisquare
import re

## Read the data in


In [12]:
jp = pd.read_csv('jeopardy.csv')

jp.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [13]:
print(jp.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


Some of the column names have leading spaces in them. Let's strip those out.

In [14]:
jp.columns = jp.columns.str.strip()
jp.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Since we are going to work with the questions and answers, it is important to normalize the text first. In this case that means lowercasing it and removing punctuation. The `Value` column is not numeric either, so we will need to convert it to an int. The `Air Date` column can also be converted to a `datetime` object.

In [47]:
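def normalize_tex(string):
    '''
    Lowercase the string, strip every character that is not alphanumeric
    or whitespace, and collapse runs of whitespace into a single space
    '''
    string = string.lower()
    # Raw strings keep the regex escapes (\s) from being read as string escapes
    string = re.sub(r'[^A-Za-z0-9\s]', "", string)
    string = re.sub(r'\s+', " ", string)
    return string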

def normalize_values(value):
    '''
    This function tries to transform the value column to an int, by removing all non numeric characters
    '''
    value = re.sub('[^0-9]', "", value)
    try:
        value = int(value)
    except Exception:
        value = 0
        
    return value

In [48]:
jp['clean_question'] = jp.Question.apply(normalize_tex)
jp['clean_answer'] = jp.Answer.apply(normalize_tex)
jp['clean_values'] = jp.Value.apply(normalize_values)
jp['Air Date'] = pd.to_datetime(jp['Air Date'])
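
To make the effect of the normalization concrete, here is a minimal, self-contained sketch. The functions `normalize_text` and `normalize_value` are simplified stand-ins for `normalize_tex` and `normalize_values` above, and the sample strings are chosen for illustration only.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text)

def normalize_value(value):
    # Keep only the digits; fall back to 0 when nothing numeric is left.
    digits = re.sub(r'[^0-9]', '', value)
    return int(digits) if digits else 0

print(normalize_text("McDonald's"))                    # mcdonalds
print(normalize_text("Signer of the Dec. of Indep."))  # signer of the dec of indep
print(normalize_value("$200"))                         # 200
print(normalize_value("$2,000"))                       # 2000
print(normalize_value("None"))                         # 0
```

Note that a value with no digits at all ends up as 0, so such rows are later treated as low value.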

In [53]:
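def count_terms(row):
    '''
    Return the fraction of answer words that also appear in the question
    '''
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    match_count = 0
    # "the" is so common that matching it would tell us nothing
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)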
    

In [55]:
jp['answer_in_question'] = jp.apply(count_terms, axis=1)

In [56]:
jp.answer_in_question.mean()

0.059001965249777744

On average, only about 6% of the words in an answer also appear in its question. So the probability of getting the right answer just by looking at the question is pretty low. You will need to study :)
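
As a worked example of what this score measures, the sketch below applies the same logic to a few hand-written rows. The function name `answer_overlap` and the sample questions are hypothetical; they are not taken from the dataset.

```python
def answer_overlap(question, answer):
    # Fraction of answer words (ignoring one "the") that also occur in the question.
    answer_words = answer.split()
    question_words = question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical, already normalized question/answer pairs
print(answer_overlap('this state is home to the city of yuma', 'arizona'))                # 0.0
print(answer_overlap('this john was the second us president', 'john adams'))              # 0.5
print(answer_overlap('the new york city borough named for a duke of york', 'new york'))   # 1.0
```

A mean of about 0.06 over the whole dataset means that most rows look like the first case.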

In [80]:

question_overlap = list()
terms_used = set()

jp.sort_values(by='Air Date', inplace=True)

for index, row in jp.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
        
jp['question_overlap'] = question_overlap

jp['question_overlap'].mean()
    


0.6876974695057736

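Here we see that, on average, about 69% of the longer words (six or more characters) in a question have already appeared in an earlier question. But since we are only matching individual words and not whole questions, we cannot conclude that the probability of getting the same question again is that high. The same terms may simply show up in questions that are formulated differently.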
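
The toy example below sketches why word overlap overstates how often questions repeat. The helper `overlap_with_history` and the three questions are made up for illustration; it follows the same idea as the loop above, keeping a running set of terms seen so far.

```python
def overlap_with_history(questions, min_length=6):
    # For each question, in order, compute the share of its long words
    # (at least min_length characters) that occurred in an earlier question.
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            overlaps.append(0)
        # Only add the words after scoring, so a question never matches itself
        seen.update(words)
    return overlaps

questions = [
    'this country borders france and germany',
    'the capital of this country is berlin',
    'germany invaded poland in this year',
]
print([round(x, 2) for x in overlap_with_history(questions)])  # [0.0, 0.33, 0.33]
```

The second and third questions both get a non-zero score because they reuse "country" and "germany", even though all three questions ask about different things.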

## Low value vs high value questions

In [81]:
def classify_value(row):
    if row['clean_values'] > 800:
        value = 1
    else:
        value = 0
    return value

In [83]:
jp['high_value'] = jp.apply(classify_value, axis = 1)

In [85]:
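def count_type(word):
    '''
    Count how many high value and low value questions contain the given word
    '''
    low_count = 0
    high_count = 0

    # Note: this scans the whole dataframe once per word
    for i, row in jp.iterrows():
        if word in row['clean_question'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count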
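
Because `count_type` walks through every row for each word, checking many terms gets slow. One possible alternative, sketched here under the assumption that the rows are available as `(clean_question, high_value)` pairs, is to count every term in a single pass with `collections.Counter`. The function `count_all_terms` and the sample rows are hypothetical.

```python
from collections import Counter

def count_all_terms(rows):
    # rows: iterable of (clean_question, high_value) pairs.
    # Returns two Counters mapping each term to the number of high value
    # and low value questions that contain it.
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() so a term repeated within one question is counted once,
        # matching the membership test used by count_type
        terms = set(question.split(" "))
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

# Hypothetical rows for illustration
rows = [
    ('this river flows through egypt', 0),
    ('this pharaoh ruled egypt', 1),
    ('the longest river in europe', 0),
]
high_counts, low_counts = count_all_terms(rows)
print(high_counts['egypt'], low_counts['egypt'])  # 1 1
print(high_counts['river'], low_counts['river'])  # 0 2
```

With the dataframe from this notebook, the rows could be produced with something like `zip(jp['clean_question'], jp['high_value'])`.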

In [97]:
import random

comparison_terms = random.sample(list(terms_used), 10)

In [110]:
observed_expected = list()

for obs in comparison_terms:
    high_count, low_count = count_type(obs)
    observed_expected.append([high_count, low_count])
     

In [111]:
observed_expected


[[1, 0],
 [0, 1],
 [5, 19],
 [9, 32],
 [0, 1],
 [0, 1],
 [1, 0],
 [0, 1],
 [2, 1],
 [0, 1]]

In [127]:
high_value_count = jp[jp['high_value'] == 1].shape[0]
low_value_count = jp[jp['high_value'] == 0].shape[0]

chi_squared = list()

for lst in observed_expected:
    total = sum(lst)
    total_prop = total / jp.shape[0]
    expected_high = total_prop * high_value_count
    expected_low  = total_prop * low_value_count
    
    observed = np.array(lst)
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared
    

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.7209745992373746, pvalue=0.395824408918502),
 Power_divergenceResult(statistic=0.9053930713848508, pvalue=0.3413397346165769),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868263753),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

Looking at the results above, the smallest p-value in our sample is about 0.11, which is above the usual 0.05 threshold. This means none of the sampled terms shows a statistically significant difference between high and low value questions.
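
To see where these numbers come from, the sketch below recomputes the statistic for a term observed as `[0, 1]` by hand, without SciPy. The totals `5734` and `14265` are assumptions: the notebook never prints `high_value_count` or `low_value_count`, and these values were picked only because they are consistent with the statistics shown above. The helper `chi_squared_1df` is likewise hypothetical.

```python
import math

def chi_squared_1df(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two categories there is one degree of freedom, where the
    # upper tail probability has the closed form erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# ASSUMED totals, not printed anywhere in the notebook
high_value_count = 5734
low_value_count = 14265
n_rows = high_value_count + low_value_count

# A term that appears once, in a low value question
observed = [0, 1]
total_prop = sum(observed) / n_rows
expected = [total_prop * high_value_count, total_prop * low_value_count]

statistic, p_value = chi_squared_1df(observed, expected)
print(expected)
print(statistic, p_value)
```

For a term that appears only once, the expected high value count is far below 5, the usual rule of thumb for the chi-squared approximation to be reliable. So the p-values for such rare terms should be read with caution, and it may be more informative to sample only terms that occur often enough.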