# A Way to Win on Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win. You can download it [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

# Exploring the Dataset

Below, we have the data dictionary of the dataset:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

Let's see how it looks like:

In [1]:
import pandas as pd
import numpy as np

In [2]:
jeo = pd.read_csv('jeopardy.csv')
print(jeo.shape)
jeo.head()

(19999, 7)


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeo.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

It seems there are spaces in front of each column name. We'll remove it.

In [4]:
jeo.columns = jeo.columns.str.strip()

In [5]:
jeo.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

# Normalizing Columns

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to ensure that we put words in lowecase and remove punctuations so, for instance, "Don't" and "don't" aren't considered to be different words when you compare them.

Let's define and apply a function that normalizes any text.

In [6]:
import string
def normalize(s):
    s_lc = s.lower()
    table = str.maketrans("","",string.punctuation)
    out = s_lc.translate(table)
    return out

jeo['clean_question'] = jeo['Question'].apply(normalize)
jeo['clean_answer'] = jeo['Answer'].apply(normalize)

jeo.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


Now we finished normalizing the `Question` and `Answer` columns, there are some other columns to normalize.

- The `Value` column should be numeric, to allow us to manipulate it easier. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.
- The `Air Date` column should also be a datetime, not a string, to enable us to work it easier.

In [7]:
def norm_dollar(s):
    table = str.maketrans("","",string.punctuation)
    s_mod = s.translate(table)
    if s_mod == 'None':
        value = 0
    else:
        value = int(s_mod)
    return value
jeo['clean_value'] = jeo['Value'].apply(norm_dollar)

jeo['Air Date'] = pd.to_datetime(jeo['Air Date'])
jeo.head(3) 

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200


In [8]:
jeo.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

- How often the answer can be used for a question.
- How often questions are repeated.

To answer the first question, let's see how many times words in the answer also occur in the question.

In [9]:
def match_words(row):
    row = row[7:9]
    split_question = row[0].split()
    split_answer = row[1].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    result = match_count / len(split_answer)
    return result

answer_in_question = jeo.apply(match_words, axis=1)
mean_answer_in_question = answer_in_question.mean()

In [10]:
print(mean_answer_in_question)
answer_in_question.value_counts()

0.058861482035140716


0.000000    17480
0.500000     1446
0.333333      494
0.250000      155
1.000000      123
0.666667      103
0.200000       68
0.166667       27
0.400000       26
0.142857       20
0.750000       17
0.600000        9
0.125000        9
0.285714        7
0.800000        2
0.428571        2
0.181818        2
0.571429        2
0.300000        2
0.111111        2
0.350000        1
0.444444        1
0.875000        1
dtype: int64

On average, the answer only makes up for about 6% of the question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study

# Recycled Questions

Now, let's try to answer the second question "How often questions are repeated". We'll see how often complex words(>6 characters) reoccur.

In [11]:
jeo = jeo.sort_values(by=['Air Date'])

question_overlap = []
terms_used = set()
for row in jeo.iterrows():
    row = row[1]
    split_question = row['clean_question'].split()
    split_question = [w for w in split_question if len(w) > 5]
    match_count = 0
    for word in split_question:
            if word in terms_used:
                match_count += 1
            else:
                terms_used.add(word) 
    if len(split_question) > 0:
        result = match_count / len(split_question)
    question_overlap.append(result)

jeo['question_overlap'] = question_overlap
print(jeo['question_overlap'].mean())
print(split_question)

0.7032718742160347
['international', 'territory', 'atlantic', 'purposes']


There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it doesn't look at phrases, it looks at single terms. This makes it relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

# Low Value vs High Value Questions

Let's say we only want to study questions that pertain to high value questions instead of low value questions. This will help you earn more money when you're on Jeopardy.

We'll use chi-squared test to figure out which terms correspond to high-value questions. Let's narrow down the questions into two categories:
- Low value -- Any row where Value is less or equal than 800.
- High value -- Any row where Value is greater than 800.

In [12]:
def classification(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeo['is_high_value'] = jeo.apply(classification, axis=1)
jeo.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,question_overlap,is_high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0


In [13]:
jeo['splitted_question'] = jeo['clean_question'].str.split()

def class_word(word):
    high_count = 0
    low_count = 0
    for i, row in jeo.iterrows():
        if word in row['splitted_question']:
            if row['is_high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [14]:
import random
terms_used_list = list(terms_used)
comparison_terms = random.sample(terms_used_list, 10)
comparison_terms

['athlete',
 'federal',
 'appearing',
 'familiarity',
 'transport',
 'hrefhttpwwwjarchivecommedia20071203dj30mp3the',
 'coming',
 'chalmette',
 'microphone',
 'undistinguished']

In [15]:
observed_expected = []
for term in comparison_terms:
    observed_expected.append(class_word(term))
observed_expected

[(5, 7),
 (6, 15),
 (1, 3),
 (0, 1),
 (3, 6),
 (1, 0),
 (8, 8),
 (0, 1),
 (0, 2),
 (1, 0)]

In [16]:
n_rows_high = jeo[jeo['is_high_value'] == 1].shape[0]
n_rows_low = jeo[jeo['is_high_value'] == 0].shape[0]

from scipy.stats import chisquare
chi_squared = []
for tup in observed_expected:
    total = tup[0] + tup[1]
    total_prop = total / jeo.shape[0]
    expected_high = total_prop * n_rows_high
    expected_low = total_prop * n_rows_low
    
    observed = np.array([tup[0], tup[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))
chi_squared

[Power_divergenceResult(statistic=0.9909151991757656, pvalue=0.31951879465803057),
 Power_divergenceResult(statistic=0.00010269512348538456, pvalue=0.9919144877590688),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.09564350170321084, pvalue=0.75712159875701),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=3.559019853290236, pvalue=0.059222698633572865),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.