# Winning Jeopardy! 
---
In this project, we are going to explore data from the American TV show ***Jeopardy***, and see if we can figure out some patterns in the questions so that we have a better chance of winning!  

In [164]:
import pandas as pd
from scipy.stats import chisquare, chi2_contingency

In [165]:
jeopardy = pd.read_csv('jeopardy.csv')

In [166]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [167]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [168]:
# Strip leading spaces from column names, lowercase them, and replace inner spaces with underscores
jeopardy.columns = jeopardy.columns.str.strip().str.lower().str.replace(r' ', '_')

In [169]:
# A function to convert a string to lowercase and remove all punctuation
import re
def normalize(string):
    string = string.lower() # Convert all characters to lowercase
    string = string.strip() # Strip whitespace at the start and end of the string
    string = re.sub(r'[^\w\s]', '', string) # Remove punctuation (anything that is not a word character or whitespace)
    return re.sub(r'\s+', ' ', string) # Collapse runs of whitespace into one space for the later word split
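
As a quick sanity check of the normalization step, the sketch below re-declares the function so it runs on its own and applies it to two strings. The first string is made up for illustration; the second is the `McDonald's` answer shown in the table above.

```python
import re

def normalize(string):
    string = string.lower().strip()            # lowercase, trim both ends
    string = re.sub(r'[^\w\s]', '', string)    # drop punctuation
    return re.sub(r'\s+', ' ', string)         # collapse runs of whitespace

# Hypothetical input with mixed case, punctuation, and extra spaces
sample_clean = normalize("  Signer of the Dec. of Indep.,   framer!  ")
answer_clean = normalize("McDonald's")
print(sample_clean)
print(answer_clean)
```

Note that removing the apostrophe turns `McDonald's` into `mcdonalds`, which is why the cleaned answers no longer match their original spelling exactly.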

In [170]:
# Normalize the `Question` column
jeopardy['clean_question'] = jeopardy.question.apply(normalize)
jeopardy.clean_question.head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [171]:
# Normalize the `Answer` column
jeopardy['clean_answer'] = jeopardy.answer.apply(normalize)
jeopardy.clean_answer.head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

In [172]:
jeopardy[jeopardy.clean_answer.str.contains(r'\s{2,}')]

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer


In [173]:
# A function to convert a `Value` string such as '$200' to an integer
def normal_value(string):
    string = re.sub(r'\D', '', string) # Keep digits only
    try:
        value = int(string)
    except ValueError: # No digits left in the string
        value = 0
    return value

In [174]:
jeopardy['clean_value'] = jeopardy.value.apply(normal_value)

In [175]:
# Convert `Air Date` to datetime data type
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])
jeopardy.drop(columns = ['question', 'answer', 'value'], inplace = True)

In [176]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,signer of the dec of indep framer of the const...,john adams,200


In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question. <br>
We can answer the second question by seeing how often complex words (6 or more characters) reoccur. 
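
To make the first measure concrete, here is a minimal sketch on a made-up question/answer pair (the strings and the helper name `answer_overlap` are hypothetical, not part of the dataset or the notebook). It computes the fraction of answer words, ignoring every `the`, that also appear in the question.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (excluding 'the') that also appear in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0.0  # nothing left to match, avoid dividing by zero
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized example
example_q = "this bridge in san francisco opened in 1937"
example_a = "the golden gate bridge"
overlap = answer_overlap(example_q, example_a)
print(overlap)
```

Here only `bridge` is shared, so one of the three remaining answer words matches.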

In [177]:
# A function to split the answer and question into words and measure how many answer words match
def a_in_q(row):
    split_answer = row.clean_answer.split()
    split_question = row.clean_question.split()
    match_count = 0
    # 'the' is very common but carries no meaning; remove it to avoid inflating the match rate.
    if 'the' in split_answer:
        split_answer.remove('the') 
    if len(split_answer) == 0:
        return 0 # Return 0 to avoid a division-by-zero error
    for w in split_answer:
        if w in split_question:
            match_count += 1 
    return match_count/len(split_answer)

In [178]:
# Compute the fraction of clean_answer words that also occur in clean_question
jeopardy['answer_in_question'] = jeopardy.apply(a_in_q, axis = 1)

In [179]:
a_in_q_mean = jeopardy.answer_in_question.mean()
print('On average, the percentage of answer words that also appear in the question is: ', a_in_q_mean*100, '%')

On average, the percentage of answer words that also appear in the question is:  5.900196524977764 %


From the result above, we can tell that words in the answer rarely occur in the question: only about 6% of them on average. 

In [180]:
jeopardy.loc[jeopardy.answer_in_question>0, 'answer_in_question'].mean()

0.4675040820246842

In [181]:
jeopardy.head()

Unnamed: 0,show_number,air_date,round,category,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,signer of the dec of indep framer of the const...,john adams,200,0.0


Next we will investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.
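
The idea can be sketched on a few made-up questions before running it on the real data. The helper name `overlap_with_history` and the toy strings are hypothetical. Unlike the notebook cell that follows, this sketch records `0.0` for questions with no long words and only adds a question's words to the history after scoring it.

```python
def overlap_with_history(questions, min_len=6):
    """For each question, the share of long words already seen in earlier questions."""
    seen = set()
    ratios = []
    for q in questions:
        words = [w for w in q.split() if len(w) >= min_len]
        if words:
            ratios.append(sum(w in seen for w in words) / len(words))
        else:
            ratios.append(0.0)  # no long words, so nothing can overlap
        seen.update(words)
    return ratios

# Hypothetical questions, assumed to be sorted by air date already
toy_questions = [
    "this president signed the declaration",
    "the declaration was signed in this colony",
    "a short one",
]
ratios = overlap_with_history(toy_questions)
print(ratios)
```

The second question reuses `declaration` and `signed` but introduces `colony`, so two of its three long words have been seen before.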

In [203]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by = 'air_date', inplace = True)

for idx, row in jeopardy.iterrows():
    split_question = row.clean_question.split()
    split_question = [w for w in split_question if len(w) > 5] # Exclude generic words like 'the', 'than'
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count+=1
        else:
            terms_used.add(w)
    if len(split_question) > 0:
        question_overlap.append(match_count/len(split_question))

In [206]:
# question_overlap skips questions with no long words, so dividing by len(jeopardy) counts them as zero overlap
question_overlap_mean = sum(question_overlap)/len(jeopardy)
print('The average percentage of words in a question that overlap with previous ones is: ', question_overlap_mean*100, '%')

The average percentage of words in a question that overlap with previous ones is:  68.94686842646593 %


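The result suggests that, on average, about 69% of the longer words in a question have already appeared in an earlier question. This measures single-word overlap only, not repeated phrases or whole questions, so it hints that questions may often be repeated rather than proving it. 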

Next, let's say we only want to study high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.<br><br>
We can check which terms are associated with high-value questions using a chi-squared test. 
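
As a rough illustration of what the test computes, the sketch below works through one term by hand. All counts are made up for illustration (300 high-value and 700 low-value questions, with a term seen 12 and 8 times), and the helper name `chi_squared_1df` is hypothetical. The p-value uses the closed form for one degree of freedom, `erfc(sqrt(stat / 2))`, so the example needs only the standard library.

```python
from math import erfc, sqrt

def chi_squared_1df(observed_high, observed_low, total_high, total_low):
    """Chi-squared statistic and p-value for one term split across two categories."""
    total_questions = total_high + total_low
    term_total = observed_high + observed_low
    # If the term were spread evenly, it would follow the overall high/low proportions
    share = term_total / total_questions
    expected_high = share * total_high
    expected_low = share * total_low
    stat = ((observed_high - expected_high) ** 2 / expected_high
            + (observed_low - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts, not taken from the dataset
stat, p_value = chi_squared_1df(12, 8, 300, 700)
print(stat, p_value)
```

With these made-up numbers the term is expected 6 times in high-value questions but is observed 12 times, which is the kind of gap the test is designed to flag.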

In [209]:
# Create a new column that labels each question as high value (> 800) or low value (<= 800)
jeopardy['value_cat'] = 'low value'
jeopardy.loc[jeopardy.clean_value > 800, 'value_cat'] = 'high value'

In [245]:
# Count how many high-value and low-value questions contain a given word
def cat_count(string):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        split_question = row.clean_question.split()
        if string in split_question:
            if row.value_cat == 'high value':
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [246]:
# Randomly pick ten elements of terms_used
from random import sample
comparison_terms = sample(list(terms_used), 10) # sample() needs a sequence, not a set

In [247]:
observed = []
for w in comparison_terms:
    observed.append(cat_count(w))

In [248]:
observed

[(1, 0),
 (3, 7),
 (0, 1),
 (1, 1),
 (1, 1),
 (4, 7),
 (1, 0),
 (0, 2),
 (1, 2),
 (1, 0)]

In [249]:
high_value_count = sum(jeopardy.value_cat == 'high value')
low_value_count = sum(jeopardy.value_cat == 'low value')

In [250]:
high_value_count, low_value_count, len(jeopardy)

(5734, 14265, 19999)

In [251]:
chi_squared = []
for counts in observed:
    print(counts)
    total = sum(counts)
    # Share of all questions that contain this term
    total_pct = total/len(jeopardy)
    # Counts we would expect if the term were spread evenly across high and low value questions
    expected_high = total_pct*high_value_count
    expected_low = total_pct*low_value_count
    chi_squared.append(chisquare(counts, [expected_high, expected_low]))

(1, 0)
(3, 7)
(0, 1)
(1, 1)
(1, 1)
(4, 7)
(1, 0)
(0, 2)
(1, 2)
(1, 0)


In [252]:
chi_squared

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.008630851497838939, pvalue=0.9259811180040979),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.31825939412255577, pvalue=0.5726555677100731),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

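None of the p-values is below 0.05, so for these ten terms we find no statistically significant difference in usage between high-value and low-value questions. This is weak evidence, though: every sampled term appears in at most 11 questions, and the chi-squared test is unreliable when the expected counts are this small. For further investigation, we could sample terms that occur more often. 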
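
One way to act on that last suggestion is to count how many questions each term appears in and keep only the terms above a threshold. The sketch below does this on made-up questions. The helper name `frequent_terms`, the toy strings, and the `min_count` threshold are all assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def frequent_terms(questions, min_len=6, min_count=2):
    """Terms of at least min_len characters that appear in at least min_count questions."""
    counts = Counter()
    for q in questions:
        # Use a set so a term is counted once per question
        counts.update({w for w in q.split() if len(w) >= min_len})
    return [term for term, n in counts.items() if n >= min_count]

# Hypothetical, already-normalized questions
toy_questions = [
    "this president signed the declaration",
    "the declaration was signed in this colony",
    "this president lived in virginia",
]
common = frequent_terms(toy_questions)
print(sorted(common))
```

On the real data, a list like `common` could replace the random sample, so that each term has enough occurrences for the chi-squared test to be meaningful.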