### Introduction

[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV show in the US where participants answer questions to win money. It has been running for many years and is a major force in popular culture. If you need help at any point, you can consult our solution notebook <a href="https://github.com/dataquestio/solutions/blob/master/Mission210Solution.ipynb">here</a>.

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

### Import Libraries and Read Dataset

In [1]:
import pandas as pd
import re
from random import choice
from scipy.stats import chisquare
import numpy as np

jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
print(jeopardy.columns)
jeopardy.head()

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


The column names ' Air Date', ' Round', ' Category', ' Value', ' Question', and ' Answer' each start with a space. We will clean every name in jeopardy.columns with the str.replace() method, which here removes all non-word characters, including the space inside 'Show Number'.

In [3]:
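# Remove every non-word character (including spaces) from each column name.
# regex=True is required because recent pandas versions treat the pattern as a literal string by default.
jeopardy.columns = jeopardy.columns.str.replace(r'\W', '', regex=True)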
print(jeopardy.columns)
print(jeopardy.info())

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   ShowNumber  216930 non-null  int64 
 1   AirDate     216930 non-null  object
 2   Round       216930 non-null  object
 3   Category    216930 non-null  object
 4   Value       216930 non-null  object
 5   Question    216930 non-null  object
 6   Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
None
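The difference between stripping whitespace and removing all non-word characters can be seen on the raw names alone. The snippet below is a minimal sketch using plain Python lists rather than the DataFrame itself; the names `stripped` and `squeezed` are illustrative. In pandas, the first option would correspond to `jeopardy.columns.str.strip()`.

```python
import re

# Column names as printed by the first cell (note the leading spaces).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category',
               ' Value', ' Question', ' Answer']

# Option 1: strip only leading/trailing whitespace, keeping inner spaces.
stripped = [name.strip() for name in raw_columns]

# Option 2: drop every non-word character, which also removes inner spaces.
squeezed = [re.sub(r'\W', '', name) for name in raw_columns]

print(stripped)
print(squeezed)
```

The notebook uses the second approach, which is why 'Show Number' becomes 'ShowNumber'.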


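The 'AirDate' column has dtype object; we will convert it to a datetime series with pd.to_datetime() once the text columns are cleaned. First, we normalize the questions and answers by lowercasing them and removing punctuation. The 'Answer' column has two missing values, so we cast it to str before cleaning.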

In [5]:
# Write a function to normalize questions and answers. 
def remove_punctuations(string):
    string_lc = string.lower()
    string_cln = re.sub(r'[^\w\s]','',string_lc)
    return string_cln

jeopardy['clean_question'] = jeopardy['Question'].apply(remove_punctuations)
jeopardy.Answer = jeopardy.Answer.astype(str)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(remove_punctuations)
jeopardy.head()

|   | ShowNumber | AirDate | Round | Category | Value | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams |


The new clean_question and clean_answer columns hold the original text converted to lower case with punctuation removed.
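To see what the normalization does to individual strings, the sketch below reapplies the same regular expression to a few values taken from the table output. The list name `samples` is illustrative.

```python
import re

def remove_punctuations(string):
    # Lowercase, then drop everything that is not a word character or whitespace.
    return re.sub(r'[^\w\s]', '', string.lower())

samples = [
    "McDonald's",
    'In 1963, live on "The Art Linkletter Show"',
    "EVERYBODY TALKS ABOUT IT...",
]
cleaned = [remove_punctuations(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Note that `\w` also matches digits and underscores, so those characters survive the cleaning.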

In [6]:
# Normalize dollar values such as '$200' into integers
def normalize_dollar(string1):
    # Drop punctuation such as '$'
    string_no_punc = re.sub(r'[^\w\s]','',string1)
    try:
        string_int = int(string_no_punc)
    except ValueError:
        # Values with no numeric content are treated as 0
        string_int = 0
    return string_int

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)
jeopardy.AirDate = pd.to_datetime(jeopardy['AirDate'])
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   ShowNumber      216930 non-null  int64         
 1   AirDate         216930 non-null  datetime64[ns]
 2   Round           216930 non-null  object        
 3   Category        216930 non-null  object        
 4   Value           216930 non-null  object        
 5   Question        216930 non-null  object        
 6   Answer          216930 non-null  object        
 7   clean_question  216930 non-null  object        
 8   clean_answer    216930 non-null  object        
 9   clean_value     216930 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 16.6+ MB
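A quick check of the dollar normalization on a few inputs makes its behavior concrete. This is a sketch under assumptions: '$200' appears in the data shown above, while '$2,000' and 'None' are hypothetical inputs chosen to exercise the comma and the fallback branch.

```python
import re

def normalize_dollar(value):
    # Remove punctuation, then try to read what is left as an integer.
    digits = re.sub(r'[^\w\s]', '', value)
    try:
        return int(digits)
    except ValueError:
        # Non-numeric text falls back to 0.
        return 0

inputs = ['$200', '$2,000', 'None']  # the last two are hypothetical examples
normalized = [normalize_dollar(v) for v in inputs]
print(dict(zip(inputs, normalized)))
```

Mapping unparseable values to 0 keeps the column as int64, at the cost of making those rows indistinguishable from a genuine value of zero.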


In [7]:
# Compute the fraction of words in clean_answer (ignoring 'the') that also appear in clean_question
def split_row(series1):
    split_answer = series1['clean_answer'].split()
    split_question = series1['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    result = match_count/len(split_answer)
    return result

jeopardy['answer_in_question']=jeopardy.apply(split_row, axis=1)
jeopardy['answer_in_question'].mean()

0.05792070323661354

The mean is about 0.06: on average, only about 6% of the words in an answer also appear in its question. Hearing the question is therefore unlikely to give the answer away.
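The per-row score is easier to follow on a few hand-made pairs. The sketch below restates the same logic as a standalone function that takes two strings instead of a DataFrame row; the function name `answer_in_question` and the example pairs are hypothetical.

```python
def answer_in_question(clean_answer, clean_question):
    split_answer = clean_answer.split()
    split_question = clean_question.split()
    # Ignore one occurrence of 'the', as the notebook's function does.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Hypothetical (answer, question) pairs, already cleaned.
examples = [
    ("john adams", "this john was the second president"),
    ("the", "a very common english article"),
    ("arizona", "the city of yuma in this state"),
]
scores = [answer_in_question(a, q) for a, q in examples]
print(scores)
```

The first pair shares one of two answer words, the second has no answer words left after 'the' is dropped, and the third shares none.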

In [9]:
# Build a 'question_overlap' column: the share of each question's longer words already seen in earlier questions
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('AirDate')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [w for w in split_question if len(w) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question)>0:
        match_count/=len(split_question)
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean()

0.8721766377741468

On average, about 87% of the words longer than five characters in a question had already appeared in an earlier question. This measure looks only at single words, not phrases, so it suggests some question recycling without proving it, and the pattern warrants further investigation.
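The running-vocabulary idea behind question_overlap can be traced on a toy example. The following is a minimal sketch with hypothetical questions, assumed to be already cleaned and sorted by air date; `overlap_scores` is an illustrative name, not part of the notebook.

```python
def overlap_scores(questions, min_length=6):
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words of at least min_length characters (same as len(w) > 5).
        words = [w for w in question.split() if len(w) >= min_length]
        # Count words that already appeared in an earlier question.
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        scores.append(seen / len(words) if words else 0)
    return scores

# Hypothetical questions in chronological order.
toy_questions = [
    "which president signed the declaration",
    "this president lived in virginia",
    "the president signed a treaty",
    "who was he",
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question always scores 0 because the vocabulary starts empty, and scores tend to rise as the set of seen terms grows, which is one reason a high mean on a large dataset is not surprising.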

In [10]:
# Determining low_value and high_value questions
def binary_data(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(binary_data, axis=1)
jeopardy.head()

def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(" "):
            if row['high_value']==1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []
for w in comparison_terms:
    observed_expected.append(word_count(w))
    
observed_expected

[(0, 1),
 (1, 0),
 (1, 0),
 (2, 3),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (3, 7),
 (0, 2)]

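Each tuple in observed_expected holds the observed (high_count, low_count) for one randomly sampled term. To find the terms whose usage differs most between high- and low-value questions, we compare these observed counts with the counts expected from the overall share of high- and low-value rows, using a chi-squared test. Because the terms are sampled at random, the counts will differ between runs.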

In [11]:
# Compute the chi-squared value and p-value for observed and expected counts
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]
    expected_high_value = total_prop * high_value_count
    expected_low_value = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_high_value, expected_low_value])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.3363947754070794, pvalue=0.5619176551024535),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=0.014001457003849405, pvalue=0.9058079685768663),
 Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989)]
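The statistics above can be reproduced by hand, which helps show what chisquare() is doing. The sketch below uses only the standard library. It assumes the share of high-value rows can be back-solved from the printed statistic for the (0, 1) term, since that statistic reduces to share / (1 - share); the helper names are illustrative.

```python
from math import erfc, sqrt

def chi_squared_two_cells(observed, expected):
    """Return (statistic, p-value) for a two-category chi-squared test (1 d.o.f.)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With one degree of freedom the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value

# Assumption: share of high-value rows, back-solved from the printed statistic for (0, 1).
s = 0.3949764642333513
high_share = s / (1 + s)

def expected_counts(observed):
    # Split the term's total count according to the overall high/low shares.
    total = sum(observed)
    return (total * high_share, total * (1 - high_share))

for obs in [(0, 1), (3, 7)]:
    stat, p = chi_squared_two_cells(obs, expected_counts(obs))
    print(obs, round(stat, 4), round(p, 4))
```

For the (3, 7) term the expected counts are close to the observed ones, which is why its statistic is near zero and its p-value near 1.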

### Conclusions

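- None of the ten sampled terms showed a statistically significant difference in usage between high- and low-value rows; every p-value is above 0.05.
- Almost all observed and expected frequencies were below 5, so the chi-squared approximation is unreliable for these terms.
- Rerunning the test on terms with higher frequencies would give more trustworthy results.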
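One way to act on the last point is to select terms by how many questions they appear in before testing them. The following is a minimal sketch on hypothetical data, not the notebook's method; `frequent_terms`, `min_length`, and `min_count` are illustrative names, and a realistic threshold on the full dataset would be much higher than 2.

```python
from collections import Counter

# Hypothetical cleaned questions standing in for jeopardy['clean_question'].
questions = [
    "this president signed the treaty",
    "the treaty ended this conflict",
    "this president lived in virginia",
    "name the capital of virginia",
]

def frequent_terms(cleaned_questions, min_length=6, min_count=2):
    """Return terms of at least min_length characters found in min_count or more questions."""
    counts = Counter()
    for question in cleaned_questions:
        # Use a set so a term repeated within one question is counted once.
        counts.update({w for w in question.split() if len(w) >= min_length})
    return sorted(term for term, n in counts.items() if n >= min_count)

selected = frequent_terms(questions)
print(selected)
```

Terms chosen this way could then replace the random sample in comparison_terms, so that the expected counts are large enough for the chi-squared test to be meaningful.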