Winning Jeopardy
===

Jeopardy is a popular TV show in the US where participants answer questions to win money. 

For the purpose of this project, let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. 

We'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

In [1]:
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


As we can see from the dataset above, each row represents a single question on a single episode of Jeopardy.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some column names have spaces in front. We are going to fix them.

In [4]:
jeopardy.columns = jeopardy.columns.str.replace(" ","")

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


Normalizing Text
---

Before doing any analysis, we need to normalize all text columns (Question and Answer columns).

In [6]:
import re

def normalize(string):
    string = string.lower()
    string = re.sub("[^A-Za-z0-9\s]", "", string)
    string = re.sub("\s+", " ", string)
    return string

Below, we wrote a function to take in a string, convert it to lowercase, and remove all punctuation.

In [7]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)

Normalizing Columns
---

Next, we need to convert the Value column from text to numeric, and also the AirDate column should be a datetime.

In [8]:
#function to normalize dollar values

def normalize_value(value):
    value = re.sub("[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except Exception:
        value=0
    return value

In [9]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_value)

In [10]:
jeopardy["AirDate"] = pd.to_datetime(jeopardy["AirDate"])

In [11]:
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

Answers in Questions
---

Let's figure out these two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question.

We can answer the second question by seeing how often complex words (> 6 characters) reoccur. 

In [12]:
#function to count word matches between questions and answers

def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count/len(split_answer)

In [13]:
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [14]:
jeopardy["answer_in_question"].mean()

0.05900196524977763

In [15]:
(jeopardy["answer_in_question"].value_counts(normalize=True).sort_index()*100).round(3)

0.000000    87.379
0.111111     0.010
0.125000     0.045
0.142857     0.105
0.166667     0.135
0.181818     0.010
0.200000     0.340
0.250000     0.775
0.285714     0.035
0.300000     0.010
0.333333     2.470
0.350000     0.005
0.400000     0.130
0.428571     0.010
0.444444     0.005
0.500000     7.240
0.571429     0.010
0.600000     0.045
0.666667     0.520
0.750000     0.085
0.800000     0.010
0.875000     0.005
1.000000     0.620
Name: answer_in_question, dtype: float64

It looks like it won't be easy to find the answer from the question as it has a very low mean value 6%. A big proportion of the answers does not include any word from the question.

Recycled Questions
---

Next, we want to investigate how often new questions are repeats of older ones.

In [16]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by="AirDate")

In [17]:
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count +=1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [18]:
jeopardy["question_overlap"] = question_overlap

In [19]:
jeopardy["question_overlap"].mean()

0.6876260592169802

We have found that about 70% overlap between words in new question and words in old questions. Although 70% is a big proportion, we only looked at a small set of questions, and our code doesn't look at phrases, it looks only at single words. This makes it relatively insignificant. We should do more investigation.

Low Value vs High Value Questions
---

Let's say we only want to study questions that have higher values instead of low value questions. This may help us earn more money when we're on Jeopardy.

We can actually do that by figuring out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

* Low value -- Any row where Value is less than 800.
* High value -- Any row where Value is greater than 800.

In [20]:
#function to determine high and low value questions

def determine_value(row):
    if row["clean_value"] > 800:
        value = 1
    else:
        value = 0
    return value

In [21]:
jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [22]:
#function to determine low and high word usage

def usage_count(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [23]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

In [24]:
comparison_terms[:10]

['instructed',
 'polonius',
 'corbin',
 'blondes',
 'subsidies',
 'minerva',
 'pompeys',
 'afikomen',
 'leningrad',
 'contempt']

We randomly chose 10 words for comparison.

In [25]:
observed_expected = []

for i in comparison_terms:
    observed_expected.append(usage_count(i))

In [26]:
observed_expected

[(0, 3),
 (0, 2),
 (1, 0),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 1)]

Applying the Chi-Squared Test
---

We can now compute the expected counts and the chi-squared value.

In [27]:
high_value_count = (jeopardy["high_value"] == 1).sum()
low_value_count = (jeopardy["high_value"] == 0).sum()

In [28]:
from scipy.stats import chisquare
import numpy as np

chi_squared=[]

for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop*high_value_count
    expected_low = total_prop*low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

In [29]:
chi_squared

[Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

It looks like none of the words is significantly different in usage between high value and low value questions. The p_values are higher than 0.05, so the chi-squared test isn't as valid. 

It might be better to run this test with only terms that have higher frequencies.

Most Used Words
---

In [41]:
term_list = []

for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    for word in split_question:
        term_list.append(word)

In [42]:
term_sr = pd.Series(term_list)

In [45]:
term_sr.value_counts().head(50)

called              521
country             476
played              297
became              287
before              267
president           258
capital             257
american            257
famous              246
targetblankherea    244
french              243
island              216
people              184
national            183
largest             179
little              178
around              169
british             166
author              164
meaning             162
during              161
century             159
family              155
musical             153
company             151
series              148
between             145
states              142
reports             141
founded             141
character           141
targetblankthisa    140
include             138
million             129
number              125
school              120
popular             119
father              114
because             111
through             104
classic             103
german          

We can see the top 50 most used words in the questions of Jeopardy. Based on this output above, we can study topics around these words. It may increase the chance of a win. However, we should also calculate expected counts and the chi-squared value.

In [49]:
comp_terms_50 = list(term_sr.value_counts().head(50).index)

In [50]:
obs_exp_50 = []

for i in comp_terms_50:
    obs_exp_50.append(usage_count(i))

In [68]:
chi_squared_50 = []

for i in obs_exp_50:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    exp_high = total_prop*high_value_count
    exp_low = total_prop*low_value_count
    observed = np.array([i[0], i[1]])
    expected = np.array([exp_high, exp_low])
    for w in comp_terms_50:
        chi_squared_50.append([w,chisquare(observed, expected)])

In [75]:
chi_squared_50[:5]

[['called',
  Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866)],
 ['country',
  Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866)],
 ['played',
  Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866)],
 ['became',
  Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866)],
 ['before',
  Power_divergenceResult(statistic=4.048305063534577, pvalue=0.044215717944225866)]]

Let's put our result into a dataframe.

In [87]:
chi_50_df = pd.DataFrame(chi_squared_50, columns=["word","chi-square-result"])

In [94]:
chi_50_df.head()

Unnamed: 0,word,chi-square-result,chi-square,p-value
0,called,"(4.048305063534577, 0.044215717944225866)",4.048305,0.044216
1,country,"(4.048305063534577, 0.044215717944225866)",4.048305,0.044216
2,played,"(4.048305063534577, 0.044215717944225866)",4.048305,0.044216
3,became,"(4.048305063534577, 0.044215717944225866)",4.048305,0.044216
4,before,"(4.048305063534577, 0.044215717944225866)",4.048305,0.044216


This dataframe needs a bit cleaning.

In [89]:
chi_50_df["chi-square"] = chi_50_df["chi-square-result"].str[0]
chi_50_df["p-value"] = chi_50_df["chi-square-result"].str[1]

In [93]:
chi_50_clean = chi_50_df.sort_values("chi-square", ascending=False)
chi_50_clean.drop("chi-square-result", axis=1)

Unnamed: 0,word,chi-square,p-value
501,country,30.705096,3.003752e-08
538,because,30.705096,3.003752e-08
528,reports,30.705096,3.003752e-08
529,founded,30.705096,3.003752e-08
530,character,30.705096,3.003752e-08
531,targetblankthisa,30.705096,3.003752e-08
532,include,30.705096,3.003752e-08
533,million,30.705096,3.003752e-08
534,number,30.705096,3.003752e-08
535,school,30.705096,3.003752e-08


We ordered the data frame by chi-square values in descending order.

It looks like these words are significantly different in usage between high value and low value questions. The p_values are lower than 0.05, and some of them are very close to 0 so the chi-squared test is valid. It would be helpful to study these words for the show.