### Winning Jeopardy!

In this project I'll explore a dataset of questions from the popular TV show [Jeopardy!](https://it.wikipedia.org/wiki/Jeopardy!). <br>
The objective is to find patterns that could help win the game.
The dataset for this notebook is about 10% of the full Jeopardy! dataset (the full dataset can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)).

In [210]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [211]:
jeopardy = pd.read_csv("jeopardy.csv")

In [212]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [213]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


### Data Cleaning

#### Removing leading spaces in column names

In [214]:
# strip the leading spaces from the column names
jeopardy.columns = [col.strip() for col in jeopardy.columns]

In [215]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

#### Normalizing Question and Answer columns

I'm going to normalize the Question and Answer columns, replacing every punctuation character with a space and making all the strings lowercase.

In [216]:
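# regex=True is required in recent pandas versions, where str.replace defaults to literal matching
jeopardy["clean_question"] = jeopardy["Question"].str.replace(r"[^A-Za-z0-9\s]", " ", regex=True).str.lower()

In [217]:
jeopardy["clean_answer"] = jeopardy["Answer"].str.replace(r"[^A-Za-z0-9\s]", " ", regex=True).str.lower()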
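
To see what this pattern does to a single string, here is a minimal sketch using the standard `re` module with the same character class. The helper name `normalize_text` and the sample strings are illustrative and not part of the notebook.

```python
import re

def normalize_text(text):
    # Replace anything that is not a letter, digit, or whitespace with a space,
    # then lowercase (same character class as the pandas calls).
    return re.sub(r"[^A-Za-z0-9\s]", " ", text).lower()

cleaned = normalize_text("McDonald's")
print(repr(cleaned))

# Punctuation followed by a space leaves two consecutive spaces,
# so split(" ") yields empty strings while split() does not.
spaced = normalize_text("In 1963, live")
print(spaced.split(" "))
print(spaced.split())
```

The empty strings produced by `split(" ")` are worth keeping in mind whenever the cleaned columns are tokenized word by word.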

#### Normalizing the Value column
I'll convert the column to numeric.

In [218]:
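import re
def normalize_value(value):
    # strip "$" and "," and convert to int; anything that cannot be parsed becomes 0
    try:
        new_value = int(re.sub(r"\W", "", value))
    except (TypeError, ValueError):
        new_value = 0
    return new_value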
    

In [219]:
jeopardy["Value"] = jeopardy["Value"].apply(normalize_value)

In [220]:
jeopardy["Value"].head()

0    200
1    200
2    200
3    200
4    200
Name: Value, dtype: int64
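
A quick check of how `normalize_value` treats different inputs, as a sketch on hand-written values. The inputs below are assumptions about what the column may contain (for example a literal `None` string or a missing value), not rows taken from the dataset.

```python
import re

def normalize_value(value):
    # Same logic as the notebook cell: drop non-word characters, then convert to int.
    try:
        return int(re.sub(r"\W", "", value))
    except (TypeError, ValueError):
        return 0

# Hypothetical inputs: two dollar amounts, a non-numeric string, and a NaN float.
examples = ["$200", "$2,000", "None", float("nan")]
results = [normalize_value(v) for v in examples]
print(results)
```

Because anything that cannot be parsed becomes 0, such rows will be treated as low-value questions in any later split by value.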

#### Normalizing the Air Date column

In [221]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [222]:
jeopardy["Air Date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

### Answers in Questions

I want to see how often words in the answer also appear in the question.
I'll remove the article "the" from the answer because it is not meaningful for deducing the answer.

In [223]:
def count_match(row):
    match_count = 0
    split_answer = row["clean_answer"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    split_question = row["clean_question"].split(" ")
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/ len(split_answer)        

In [224]:
jeopardy["answer_in_question"] = jeopardy.apply(count_match, axis= 1)

In [225]:
jeopardy["answer_in_question"].mean()

0.09611447756854266

In [226]:
(jeopardy["answer_in_question"] > 0).sum() / len(jeopardy["answer_in_question"])

0.20826041302065104

In [227]:
jeopardy[jeopardy["answer_in_question"] > 0]["answer_in_question"].mean()

0.4615110292660948

Only about 21% of the answers contain words that are present in the question. <br>
On average, considering all the answers and questions, only about 10% of the answer is contained inside the question; considering only the cases in which there is a match, 46% of the answer is contained inside the question.
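
As a small worked example of the ratio computed by `count_match`, the sketch below applies the same idea to two made-up question/answer pairs. It is a variant, not the notebook's function: the name `match_fraction` is hypothetical, and it tokenizes with `split()` and drops every "the", so on the real data it could give slightly different numbers.

```python
def match_fraction(clean_question, clean_answer):
    # Words of the answer, ignoring the article "the".
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Fraction of answer words that also occur in the question.
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

full_match = match_fraction("this zoo is located in the bronx", "the bronx zoo")
no_match = match_fraction("signer of the dec of indep", "john adams")
print(full_match, no_match)
```

The first pair scores 1.0 because both remaining answer words appear in the question; the second scores 0.0.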

### Recycled Questions

Does Jeopardy! recycle old questions? I'm using only 10% of the original dataset, but it may be interesting to investigate in this direction.

In [228]:
question_overlap = []
terms_used = set()

In [229]:
sorted_jeopardy = jeopardy.sort_values(by=["Air Date"])

In [230]:
for index, row in sorted_jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
   
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word) 
        
    if len(split_question) > 0 :
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
    

In [231]:
len(terms_used)

20249

In [232]:
sorted_jeopardy["question_overlap"] = question_overlap 

In [233]:
sorted_jeopardy["question_overlap"].mean()

0.725653219187625

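On average, about 73% of the long words (more than five characters) in a question have already appeared in an earlier question. This measures shared terms, not whole questions being reused.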
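
The overlap score is easier to follow on a toy example. The sketch below repeats the same loop on three made-up questions; the function name `overlap_scores` and the sample strings are hypothetical.

```python
def overlap_scores(questions):
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words longer than five characters, as in the notebook loop.
        words = [w for w in question.split(" ") if len(w) > 5]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        # Fraction of long words already seen; 0 when there are no long words.
        scores.append(match_count / len(words) if words else 0)
    return scores

scores = overlap_scores([
    "this country borders france",
    "the capital of this country",
    "france borders this country",
])
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second 0.5 ("country" was seen, "capital" was not), and the third 1.0, even though none of the three questions is a copy of another.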

### Low value vs high value questions

In [234]:
def classify_value(row):
    if row["Value"] > 800:
        return 1
    return 0

In [235]:
sorted_jeopardy["high_value"] = sorted_jeopardy.apply(classify_value, axis = 1)

In [236]:
sorted_jeopardy["high_value"].value_counts(normalize=True)*100

0    71.328566
1    28.671434
Name: high_value, dtype: float64

About 71% of all the questions have a value of 800 dollars or less; only 29% of the questions have a value above 800 dollars.

In [237]:
def count_word_value(word):
    low_count = 0
    high_count = 0
    for index, row in sorted_jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]        

In [238]:
import random as rand

In [244]:
rand.seed(42)

In [245]:
comparison_terms = rand.sample(list(terms_used), 10)  # random.sample no longer accepts a set (Python 3.11+)

In [246]:
comparison_terms

['sirius',
 'drinkers',
 'arouse',
 'masters',
 'interactions',
 'letitia',
 'dependents',
 'soekarno',
 'winders',
 'stater']

In [247]:
observed_expected = [count_word_value(x) for x in comparison_terms]

In [248]:
observed_expected

[[1, 2],
 [0, 1],
 [0, 1],
 [4, 6],
 [0, 1],
 [1, 0],
 [0, 1],
 [1, 0],
 [0, 1],
 [0, 1]]
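
`count_word_value` scans the whole DataFrame once per term, which gets slow as the list of terms grows. A possible one-pass alternative is sketched below on plain Python data; the function name `count_terms_by_value` and the toy rows are hypothetical, and each row is assumed to be a `(clean_question, high_value)` pair.

```python
from collections import Counter

def count_terms_by_value(rows, terms):
    """Return {term: [high_count, low_count]} after a single pass over rows."""
    terms = set(terms)
    high, low = Counter(), Counter()
    for question, high_value in rows:
        # Intersect with the question's word set so each term counts once per question.
        for term in terms & set(question.split(" ")):
            if high_value == 1:
                high[term] += 1
            else:
                low[term] += 1
    return {term: [high[term], low[term]] for term in terms}

# Made-up rows for illustration only.
rows = [
    ("the masters is played here", 1),
    ("old masters painted this", 0),
    ("sirius is the dog star", 0),
]
counts = count_terms_by_value(rows, ["masters", "sirius"])
print(counts["masters"], counts["sirius"])
```

With a DataFrame, the rows could be supplied with something like `zip(sorted_jeopardy["clean_question"], sorted_jeopardy["high_value"])`.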

### Applying the chi-squared test

In [258]:
from scipy.stats import chisquare

In [255]:
high_value_count = sorted_jeopardy["high_value"].value_counts()[1]
high_value_count

5734

In [260]:
low_value_count = sorted_jeopardy["high_value"].value_counts()[0]
low_value_count

14265

In [263]:
chi_squared = []
for item in observed_expected:
    total = sum(item)
    total_prop = total / len(sorted_jeopardy)
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([item[0], item[1]])
    expected = np.array([expected_high, expected_low])
    chisq = chisquare(observed, expected)
    chi_squared.append(chisq)

In [264]:
chi_squared

[Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.6275336335698622, pvalue=0.42826143908800296),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

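The p-value is above 0.05 for every word in our sample, so there is no statistically significant difference in how these words are distributed between low value and high value questions. Most of these words occur only once or twice, and the chi-squared test is unreliable at such low frequencies, so the result should be read with caution.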
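
To make the calculation concrete, the first result (observed counts `[1, 2]`) can be reproduced by hand with only the standard library. This sketch uses the counts shown above (5734 high-value and 14265 low-value questions) and relies on the fact that, for one degree of freedom, the chi-squared p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

high_value_count, low_value_count = 5734, 14265
n_questions = high_value_count + low_value_count  # 19999 rows in total

observed = [1, 2]  # [high, low] occurrences of the first sampled term
total_prop = sum(observed) / n_questions
expected = [total_prop * high_value_count, total_prop * low_value_count]

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# Survival function of the chi-squared distribution, valid for 1 degree of freedom only.
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 4), round(p_value, 4))
```

The expected counts here are roughly 0.86 and 2.14, and the statistic and p-value agree with the first `Power_divergenceResult` in the output.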