# Winning Jeopardy

#### Darren Ho

## Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in pop culture.

Imagine that we want to compete on Jeopardy, and we're looking for any way to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win. 

The dataset is named `jeopardy.csv` and can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

In [1]:
# importing dataset
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

# exploring first 5 rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
print('Number of Rows:', jeopardy.shape[0])
print('Number of Columns:', jeopardy.shape[1])

Number of Rows: 19999
Number of Columns: 7


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [4]:
# print out columns of jeopardy dataset

jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
# some column names have leading spaces, so strip them and assign the result back

jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are the explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

## Normalizing Text

Before we start analyzing the Jeopardy questions, we need to normalize the text columns. We lowercase the text and remove punctuation so that `Don't` and `don't` aren't treated as different words when we compare them.

In [6]:
# function to normalize questions and answers
import re 

def normalize(string):     
    #take in string
    string = string.lower()            # convert string to lower case
    string = re.sub(r'[^\w\s]', '', string)   # remove all punctuation in string
    return string

In [7]:
# testing function

test_str = "YO! I Just WOKE..... UP"
normalize(test_str)

'yo i just woke up'

In [8]:
# normalizing Question and Answer columns and assigning the results to new cleaned columns

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams


## Normalizing Columns

There are also other columns to normalize.

The `Value` column should be numeric so that we can manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be converted from a string to a datetime so that we can work with it more easily.

In [9]:
# function to normalize dollar values

def normalize_dollar(string):
    
    string = re.sub(r'[^\w\s]', '', string)     # remove any punctuation in string
    try:
        string = int(string)          # convert string to int
    except Exception:                 # assign 0 instead if conversion has an error
        string = 0
    return string

In [10]:
# testing function

test_str2 = "$23444!!"
normalize_dollar(test_str2)

23444

In [11]:
# normalize the Value column and assign to new cleaned column

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [12]:
# convert Air Date column to a datetime column

jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
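
As a rough illustration of the `Value` cleanup described above, here is a minimal standalone sketch. The helper name `parse_dollar_value` and the sample inputs are hypothetical and are not taken from the dataset. Unlike `normalize_dollar`, this version keeps digits only, so a mixed string such as `abc12` would yield `12` rather than `0`.

```python
import re

def parse_dollar_value(raw):
    """Return the dollar amount in `raw` as an int, or 0 if it has no digits.

    Hypothetical helper for illustration; not part of the notebook.
    """
    digits = re.sub(r"[^\d]", "", str(raw))  # drop everything except digits
    return int(digits) if digits else 0

# Made-up sample inputs, including a comma-separated amount and a placeholder.
samples = ["$200", "$2,000", "None", ""]
parsed = [parse_dollar_value(s) for s in samples]
print(parsed)  # [200, 2000, 0, 0]
```

Falling back to `0` mirrors the notebook's choice, which means rows without a usable value end up in the lowest value bucket later on.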

## Answers in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question
- How often questions are repeated

In [13]:
# function that takes in a row as a series 

def split_and_count(row):
    
    #splits clean_answer and clean_question around spaces and assigns to variable
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # setting count to 0
    match_count = 0
    
    # remove 'the' from split_answer because it isn't meaningful when looking for the answer
    if 'the' in split_answer:
        split_answer.remove('the')
        
    # if length is 0, return 0, prevents division by zero error later    
    if len(split_answer) == 0:
        return 0
    
    # loop thru split_answer and see if it occurs in split_question, if it does, add 1 to count
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    # divide count by len of split_answer and return
    return match_count / len(split_answer)

In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(split_and_count, axis=1)
print(jeopardy['answer_in_question'].mean())
print(jeopardy['answer_in_question'].mean()*100)

0.05900196524977763
5.900196524977764


We used our function to compute, for each row, the proportion of words in `clean_answer` that also occur in `clean_question`, and assigned the result to the `answer_in_question` column. The mean of `answer_in_question` is approximately 0.06, so on average only about 6% of an answer's words appear in its question. Hearing the question is therefore unlikely to give the answer away.
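
To make the ratio concrete, here is a small sketch that applies the same idea to a few made-up question and answer pairs. The function name `answer_word_share` and the sample strings are hypothetical. One difference from `split_and_count` is that this version drops every occurrence of `the` from the answer, not just the first.

```python
def answer_word_share(question, answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in answer.split() if w != "the"]
    if not answer_words:
        return 0  # avoid dividing by zero when nothing is left
    question_words = set(question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Made-up, already-normalized examples.
full = answer_word_share("the grand canyon was dug by this river", "the grand canyon")
half = answer_word_share("the colorado river dug this", "the colorado plateau")
none = answer_word_share("notorious labor leader missing since 75", "jimmy hoffa")
print(full, half, none)  # 1.0 0.5 0.0
```

A mean of about 0.06 across the dataset means that most rows look like the last example, where the question shares no words with the answer.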

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't answer this completely because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [15]:
# empty list
question_overlap = []

# empty set
terms_used = set()

# sorting jeopardy by ascending air date
jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    
    # split column on the space character
    split_question = row['clean_question'].split(' ')
    
    # remove any words that are less than 6 characters long
    split_question = [word for word in split_question if len(word) > 5]
    
    # set count to 0
    match_count = 0
    
    # loop thru, if term occurs, add 1
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
    # add each word in split_question to terms used        
    for word in split_question:
        terms_used.add(word)
        
    # if len is greater than 0, divide count by len
    if len(split_question)>0:
        match_count /= len(split_question)

    # append count to list    
    question_overlap.append(match_count)
    
jeopardy["question_overlap"] = question_overlap

print(jeopardy["question_overlap"].mean())
print(jeopardy["question_overlap"].mean()*100)

0.6876235590919739
68.7623559091974


This loop checks whether the terms in each question have been used in an earlier question. Looking only at words with 6 or more characters filters out words like `the` and `than`, which are common but say little about a question.

We calculated that approximately 69% of the terms in questions have been used previously. Because this covers only a sample of the questions and compares single words rather than phrases, it suggests that recycled questions are worth a closer look rather than proving that they are common.
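
The overlap calculation can be traced by hand on a tiny example. The sketch below assumes the questions are already normalized and sorted by air date. The function name `overlap_scores` and the three sample questions are made up for illustration.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words seen in earlier questions."""
    seen = set()
    scores = []
    for text in questions:
        words = [w for w in text.split() if len(w) >= min_length]
        matched = sum(1 for w in words if w in seen)
        scores.append(matched / len(words) if words else 0)
        seen.update(words)  # only now do this question's words count as "used"
    return scores

# Made-up questions in chronological order.
questions = [
    "this president signed the treaty",
    "the treaty ended this conflict",
    "this president crossed the potomac",
]
scores = overlap_scores(questions)
print(scores[:2])  # [0.0, 0.5]
```

The first question always scores 0 because nothing has been seen yet. The third scores one third, since only `president` out of its three long words appeared earlier.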

## Low Value vs High Value Questions

Let's say we only want to study high value questions rather than low value ones. This will help us earn more money when we're on Jeopardy.


In [16]:
# 1 or 0 function depending on clean value
def classify_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(classify_value, axis=1)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0,0


In [17]:
# function that counts how many high value and low value questions contain a word
def low_high_count(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [18]:
from random import choice

terms_used_list = list(terms_used)

# randomly pick 10 elements of terms_used and append them to list
comparison_terms = [choice(terms_used_list) for _ in range(10)]                                                                  

observed_expected = []

for word in comparison_terms:
    observed_expected.append(low_high_count(word))

observed_expected

[(2, 2),
 (0, 1),
 (1, 0),
 (2, 4),
 (0, 2),
 (3, 14),
 (0, 1),
 (3, 2),
 (1, 1),
 (0, 1)]

We narrowed down the questions into two categories:

- High Value - Any row where `Value` is greater than `800`
- Low Value - Any row where `Value` is `800` or less

We then randomly picked 10 elements of `terms_used` and stored them in a list called `comparison_terms`. We looped through each term in that list, ran the `low_high_count` function to get its high value and low value counts (in that order), and appended those results to `observed_expected`.
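
Calling `low_high_count` once per term scans the whole dataset each time. As a possible alternative, the sketch below counts all terms in a single pass. The function name `count_terms_by_value` and the sample rows are hypothetical, and the input is assumed to be pairs of a normalized question and its `high_value` flag.

```python
def count_terms_by_value(rows, terms):
    """Return {term: (high_count, low_count)} after one pass over `rows`.

    `rows` is an iterable of (clean_question, high_value) pairs.
    """
    wanted = set(terms)
    counts = {term: [0, 0] for term in wanted}
    for question, high_value in rows:
        # Each term is counted at most once per question, as in low_high_count.
        for term in wanted.intersection(question.split()):
            counts[term][0 if high_value == 1 else 1] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Made-up rows: (clean_question, high_value)
rows = [
    ("notorious labor leader missing since 75", 0),
    ("this labor leader founded the union", 1),
    ("washington proclaimed this holiday", 0),
]
term_counts = count_terms_by_value(rows, ["leader", "washington", "canyon"])
print(term_counts["leader"])  # (1, 1)
```

With the notebook's DataFrame, the pairs could be built with `zip(jeopardy['clean_question'], jeopardy['high_value'])`.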

## Applying the Chi-squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [19]:
from scipy.stats import chisquare
import numpy as np

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high, exp_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=1.0102851115076668, pvalue=0.3148345448133909),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

## Conclusion

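Looking over the chi-squared statistics and their p-values, none of the results is statistically significant: every p-value is greater than 0.05. We therefore have no evidence that any of these ten terms is used differently in high and low value questions. That is weaker than showing there is no difference, because most of the observed counts are below 5, and the chi-squared test is unreliable with counts that small.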
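
To show where these numbers come from, here is a sketch that computes the statistic and p-value for one term by hand, using only the standard library. The function name `chi_squared_two_groups` is hypothetical. The totals of 5734 high value rows and 14265 low value rows are assumptions: the notebook does not print them, although they add up to 19999 and are consistent with the statistics shown above.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, high_total, low_total):
    """Chi-squared test of one term's (high, low) counts against the row totals."""
    n_rows = high_total + low_total
    term_share = (observed_high + observed_low) / n_rows
    expected_high = term_share * high_total
    expected_low = term_share * low_total
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    # Two categories give 1 degree of freedom, where the chi-squared
    # survival function reduces to erfc(sqrt(statistic / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value, expected_high, expected_low

# A term seen in 0 high value and 1 low value question; totals are assumed.
statistic, p_value, expected_high, expected_low = chi_squared_two_groups(0, 1, 5734, 14265)
print(round(statistic, 3), round(p_value, 3))  # 0.402 0.526
print(round(expected_high, 3))  # 0.287
```

An expected count of about 0.29 is far below the usual rule of thumb of 5 per cell, which is why results for such rare terms should be read with caution.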

## Next Steps

Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like `the`, `than`, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)) instead of the subset we used in this project.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
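
Two of the ideas above, removing stopwords and keeping only terms that occur often enough, could be combined along these lines. This is only a sketch: the stopword list is a tiny made-up sample, and the function name `frequent_terms` and the `min_questions` threshold are hypothetical.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration.
STOPWORDS = frozenset({"the", "than", "this", "a", "of", "in", "was", "for"})

def frequent_terms(questions, min_questions=2, stopwords=STOPWORDS):
    """Terms (minus stopwords) that appear in at least `min_questions` questions."""
    question_counts = Counter()
    for text in questions:
        # A set ensures each term is counted once per question.
        question_counts.update(set(text.split()) - stopwords)
    return sorted(t for t, n in question_counts.items() if n >= min_questions)

# Made-up normalized questions.
questions = [
    "the city of yuma in this state",
    "this state was the first",
    "a city in ohio",
]
selected = frequent_terms(questions)
print(selected)  # ['city', 'state']
```

Restricting the chi-squared test to terms selected this way would also raise the observed counts, which makes the test more trustworthy.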