# Winning Jeopardy

---


## 1. Introduction

[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV game show in the USA where participants answer questions to win money. Each episode of Jeopardy features contestants competing in three rounds of questions: `Jeopardy!`, `Double Jeopardy!` and `Final Jeopardy!`.

In this project, we will be working with a dataset of [Jeopardy questions](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). The objective is to **find patterns in the questions to improve our chances of winning a game of Jeopardy**.

---

## 2. Open and Explore the Data

Let's start by reading and understanding the `jeopardy.csv` dataset.

In [1]:
import pandas as pd

# Read csv file into dataframe
jeopardy = pd.read_csv('jeopardy.csv')

# Print size of dataset
print(f'There are {jeopardy.shape[0]} rows and {jeopardy.shape[1]} columns in the data.')

# Print first few rows
jeopardy.head()

There are 19999 rows and 7 columns in the data.


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Each row in the data represents a single question from a single episode of Jeopardy, with the following information:

| Column | Description |
| - | - |
| `Show Number` | Jeopardy episode number |
| `Air Date` | date of broadcast of the Jeopardy episode |
| `Round` | round of Jeopardy in the episode | 
| `Category` | category of the question |
| `Value` | number of dollars which the correct answer is worth |
| `Question` | the text of the question |
| `Answer` | the text of the answer |

Next, we will check the data types stored in each column.

In [2]:
print('Column:     Data Type')
print(jeopardy.dtypes)

Column:     Data Type
Show Number     int64
 Air Date      object
 Round         object
 Category      object
 Value         object
 Question      object
 Answer        object
dtype: object


All the columns contain text data, with the exception of `Show Number` which has integer values.

We will also take a closer look at the column names of the data.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Upon closer inspection, we notice that most of the column names begin with a leading space.

---

## 3. Clean the Data

**a. Clean column names**

We will start cleaning the data by removing the leading spaces from the column names.

In [4]:
# Remove leading spaces from column names
# (str.replace no longer treats the pattern as a regex by default in recent pandas)
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

**b. Normalize text columns**

We will also remove all punctuation from the `Question` and `Answer` columns, then convert them to lower case.

In [5]:
import re

# Function to return normalized text
def normalize_text(text):
    
    # Convert to lower case
    text_lower = text.lower()
    
    # Remove punctuation
    text_normal = re.sub(r'[^\w\s]', '', text_lower)
    
    return text_normal

# Normalize text in dataset
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

# Print first few rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
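
To see what the pattern `[^\w\s]` keeps and removes, here is a minimal standalone sketch on a few made-up strings (they are not rows from the dataset). It repeats the same two steps as `normalize_text`. Note that `\w` also matches digits and underscores, so those survive the clean-up.

```python
import re

def normalize_text(text):
    # Lower-case, then drop every character that is neither
    # a word character (letter, digit, underscore) nor whitespace
    return re.sub(r'[^\w\s]', '', text.lower())

# Made-up sample strings for illustration
samples = ["McDonald's", 'No. 2: 1912 Olympian', 'snake_case & "quotes"']

for sample in samples:
    print(repr(normalize_text(sample)))
```

The last sample ends up with two consecutive spaces where the `&` used to be. Splitting such text with `.split()` rather than `.split(' ')` avoids producing empty strings.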


**c. Convert columns to appropriate data types**

We will clean and convert the following columns to facilitate further data processing and analysis:
- Remove `$` sign and convert `Value` column from text to numeric
- Convert `Air Date` column from text to datetime

In [6]:
# Function to return normalized dollar values
def normalize_value(string):

    # Strip punctuation such as `$` and `,`, then convert to an integer
    try:
        value = int(re.sub(r'[^\w\s]', '', string))

    # Assign 0 as placeholder if the value is missing or not a number
    except (TypeError, ValueError):
        value = 0

    return value

# Convert value to numeric
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

# Convert air date to datetime
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

# Check data types
print('Column:                Data Type')
print(jeopardy.dtypes)

# Print first few rows
jeopardy.head()

Column:                Data Type
Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
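
As an alternative to a row-by-row function, the same clean-up can be sketched with vectorized pandas string methods. This is an illustration on a small made-up Series, not the method used in this notebook. Unlike `normalize_value`, it leaves unparseable entries as `NaN` instead of 0, which keeps missing values from being mistaken for zero-dollar questions.

```python
import pandas as pd

# Hypothetical sample of raw `Value` strings, including missing entries
raw_values = pd.Series(['$200', '$1,000', 'None', None])

# Keep digits only, then let pandas convert; anything unparseable becomes NaN
digits_only = raw_values.str.replace(r'[^\d]', '', regex=True)
clean_values = pd.to_numeric(digits_only, errors='coerce')

print(clean_values)
```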


---

## 4. Analyze the Data

From the cleaned data, we can figure out:
- how often the answer can be found in the question and
- how often questions are repeated.

This will help us determine whether to study general knowledge, focus on past questions or adopt another strategy.

**a. How often the answer can be found in the question**

In [7]:
# Function that takes in a row from the dataset
def count_matches(row):
    
    # Split into lists of words
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # Initialize variable with zero value
    match_count = 0
    
    # Do not consider `the` since it is a common word
    if 'the' in split_answer:
        split_answer.remove('the')
        
    # Return zero to prevent division error
    if len(split_answer) == 0:
        return 0
    
    # Increment variable if a common word is found in both the question and answer
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    # Return the proportion of common words in the answer
    return match_count / len(split_answer)

# New column for proportion of common words in the answer
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis = 1)

# Compute mean of new column
print(f"On average, only {jeopardy['answer_in_question'].mean() * 100:.2f}% of the words in an answer also appear in its question.")

On average, only 5.90% of the words in an answer also appear in its question.


Since the answer rarely appears in the question, hearing a question is unlikely to give the answer away, so we will probably need to study.
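
To make the metric concrete, here is a compact sketch of the same calculation applied to made-up question and answer pairs. The helper name `answer_overlap` and the sample strings are illustrative assumptions, not part of the dataset.

```python
def answer_overlap(clean_answer, clean_question):
    # Ignore `the`, since it is too common to be informative
    answer_words = [word for word in clean_answer.split() if word != 'the']

    # Avoid dividing by zero when nothing is left of the answer
    if not answer_words:
        return 0

    # A set makes each membership check fast
    question_words = set(clean_question.split())
    matches = sum(word in question_words for word in answer_words)

    # Proportion of answer words that also occur in the question
    return matches / len(answer_words)

# One of the two remaining answer words (`grand`) appears in the question
print(answer_overlap('the grand canyon',
                     'this grand gorge was carved by the colorado river'))
```

A value of 0.5 here means that half of the answer's words appear in the question. The dataset average of about 0.059 shows that this is rare.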

**b. How often questions are repeated**

In order to determine what to study, we want to investigate how often new questions are recycled from older ones.

We will store words from old questions in a set named `terms_used` and check how often new questions contain these terms.

Only words with at least six characters will be considered, to filter out common filler words like `the` or `and`.

In [8]:
# Initialise empty list and set
question_overlap = []
terms_used = set()

# Sort data in order of ascending air date
jeopardy = jeopardy.sort_values('Air Date')

# Iterate through each row of data
for i, row in jeopardy.iterrows():

    # Split question into list of words
    split_question = row['clean_question'].split(' ')

    # Retain words with at least 6 characters
    split_question = [word for word in split_question if len(word) >= 6]
    
    # Initialize variable with zero value
    match_count = 0
    
    # Iterate through each word
    for word in split_question:
        
        # Increment variable if word is already in the set
        if word in terms_used:
            match_count += 1

        # Add each new word to the set
        terms_used.add(word)
        
    # Divide number of words matched by the number of words retained
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    # Append the proportion of words matched to the list
    question_overlap.append(match_count)
    
# Add new column for the proportion of words matched in each question
jeopardy['question_overlap'] = question_overlap

# Mean proportion of words matched
print(f"There is about {jeopardy['question_overlap'].mean() * 100 :.2f}% overlap between terms in the new and old questions.")

There is about 68.94% overlap between terms in the new and old questions.


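Hence, it may be worthwhile to investigate the recycling of questions further. Note that this figure measures the overlap of individual words rather than whole phrases or questions, so it likely overstates how often the questions themselves are repeated.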
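
The running overlap can be illustrated on a handful of made-up questions. This sketch follows the same logic as the loop above; the function name `running_overlap` and the sample questions are assumptions for illustration only.

```python
def running_overlap(questions, min_length=6):
    terms_seen = set()
    overlaps = []

    for question in questions:
        # Retain only the longer words
        words = [word for word in question.split() if len(word) >= min_length]

        matched = 0
        for word in words:
            # Count the word if an earlier occurrence was already recorded
            if word in terms_seen:
                matched += 1
            terms_seen.add(word)

        # Proportion of retained words seen before (0 if nothing was retained)
        overlaps.append(matched / len(words) if words else 0)

    return overlaps

# Questions are assumed to be in chronological order
questions = [
    'this president signed the treaty',
    'this president vetoed the treaty',
    'a famous painter from florence',
]
overlaps = running_overlap(questions)
print(overlaps)
```

The second question scores two thirds even though it asks about a different event, which shows how shared vocabulary alone can produce a high overlap.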

---

## 5. Perform Chi-Squared Test

A viable approach is to focus our study on high value questions that will earn more money on Jeopardy. 

Let's define two categories to segment the dataset:
- low value - a question with `Value` less than or equal to 800, and
- high value - a question with `Value` more than 800.

In [9]:
# Function to classify value of questions
def assign_value(row):
    
    # 1 corresponds to high value
    if row['clean_value'] > 800:
        value = 1
        
    # 0 corresponds to low value
    else:
        value = 0
        
    # Return value
    return value

# Determine which questions are high and low value
jeopardy['high_value'] = jeopardy.apply(assign_value, axis = 1)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200,0.0,0.5,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200,0.0,0.0,0
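
The same classification can also be sketched without a row-wise `apply`, by comparing the whole column at once. The small DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical cleaned values; 0 stands for a value that could not be parsed
toy = pd.DataFrame({'clean_value': [200, 800, 1000, 0]})

# The comparison yields booleans; casting to int gives 1 for high value, 0 for low value
toy['high_value'] = (toy['clean_value'] > 800).astype(int)

print(toy)
```

Two boundary cases are worth noting: a value of exactly 800 counts as low value, and rows with the placeholder value 0 (such as the `Final Jeopardy!` row above) also fall into the low value group.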


For each word in `terms_used`, we will count how many high value and low value questions contain it.

Since each word requires a full pass over the dataset, this is a time-consuming process, so we will only perform it for a sample of 10 words in `terms_used`.

In [10]:
# Function to return word counts in high and low value questions
def count_value(word):
    
    # Initialize variables with zero value
    low_count = 0
    high_count = 0
    
    # Iterate through each question of data set 
    for i, row in jeopardy.iterrows():
        
        # Split the question into list of words
        split_question = row['clean_question'].split(' ')

        # If word is found in question
        if word in split_question:
            
            # Increment variable for high value count 
            if row['high_value'] == 1:
                high_count += 1
                
            # Increment variable for low value count
            else:
                low_count += 1
    
    return high_count, low_count

import random

# Random sample of ten words from set
# (random.sample needs a sequence; sets are rejected from Python 3.11 onwards)
random.seed(0)
comparison_terms = random.sample(list(terms_used), k = 10)

# Initialize empty list
observed_expected = []

# Iterate through each word
for term in comparison_terms:

    # Observed high value and low value word counts
    high_count, low_count = count_value(term)

    # Append observed word counts to list
    observed_expected.append([term, high_count, low_count])

observed_expected

[['holdings', 0, 1],
 ['payments', 1, 0],
 ['tunnel', 3, 3],
 ['velocity', 0, 1],
 ['sketch', 3, 1],
 ['fertilization', 0, 1],
 ['innovations', 2, 0],
 ['3voice', 0, 1],
 ['ajnachakra', 0, 1],
 ['nonenglish', 1, 0]]
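
Because `count_value` scans the whole dataset once per word, the cost grows with the number of terms. One possible alternative, sketched here on made-up rows, is to count all terms of interest in a single pass. The helper `count_terms` and the sample rows are assumptions for illustration, not part of the notebook.

```python
from collections import Counter

def count_terms(rows, terms):
    terms = set(terms)
    high_counts, low_counts = Counter(), Counter()

    for clean_question, high_value in rows:
        # Terms present in this question; each term counts once per question
        present = terms & set(clean_question.split(' '))

        # Add one to the matching counter for every term present
        target = high_counts if high_value == 1 else low_counts
        target.update(present)

    # Map each term to (high value count, low value count)
    return {term: (high_counts[term], low_counts[term]) for term in terms}

# Made-up (clean_question, high_value) pairs
rows = [
    ('the tunnel under the channel', 1),
    ('a tunnel sketch', 0),
    ('quick sketch artist', 1),
]
counts = count_terms(rows, ['tunnel', 'sketch', 'velocity'])
print(counts)
```

On the real data, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.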

Lastly, we can compute the expected counts and associated chi-squared values for each word in high and low value questions.

We are interested in identifying words with the highest chi-squared values, since they have the biggest differences in usage between high and low value questions.

In [11]:
import numpy as np
from scipy.stats import chisquare

# Initialize empty dictionary
chi_squared = {}

# Number of high value questions
high_value_count = sum(jeopardy['high_value'] == 1)
print(f'There are {high_value_count} high value questions in the data set.')

# Number of low value questions
low_value_count = sum(jeopardy['high_value'] == 0)
print(f'There are {low_value_count} low value questions in the data set.')

# Iterate through each word
for term, high_count, low_count in observed_expected:
    
    # Total observed word count
    total = high_count + low_count
    
    # Expected word count for high value questions
    high_value_exp = high_value_count / len(jeopardy) * total 
    
    # Expected word count for low value questions
    low_value_exp = low_value_count / len(jeopardy) * total 
    
    # Chi-squared value and p-value
    expected = np.array([high_value_exp, low_value_exp])
    observed = np.array([high_count, low_count])
    chi_squared[term] = chisquare(observed, expected)

chi_squared

There are 5734 high value questions in the data set.
There are 14265 low value questions in the data set.


{'holdings': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'payments': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 'tunnel': Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881564),
 'velocity': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'sketch': Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 'fertilization': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'innovations': Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 '3voice': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'ajnachakra': Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 'nonenglish': Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)}
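
The statistic for a single term can be reproduced by hand. The sketch below recomputes the `sketch` entry (3 high value and 1 low value occurrences) from the group sizes printed above, using only the standard library. The helper name `chi_squared_two_groups` is made up for illustration. With one degree of freedom, the chi-squared p-value equals `erfc(sqrt(statistic / 2))`.

```python
import math

def chi_squared_two_groups(observed_high, observed_low, n_high, n_low):
    total = observed_high + observed_low
    n = n_high + n_low

    # Expected counts if the term were spread in proportion to group size
    expected_high = total * n_high / n
    expected_low = total * n_low / n

    # Sum of (observed - expected)^2 / expected over both groups
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)

    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))

    return statistic, p_value, (expected_high, expected_low)

stat, p, expected = chi_squared_two_groups(3, 1, 5734, 14265)
print(stat, p, expected)
```

This agrees with the `sketch` entry in the output above (a statistic of about 4.198 and a p-value of about 0.040). The expected counts are roughly 1.15 and 2.85, well under the common rule of thumb of at least 5 per category, so the p-value should be read with caution.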

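- Two terms, `sketch` and `innovations`, have p-values below 0.05, but each appears in only a handful of questions, so these results are not reliable evidence of a difference in usage between high value and low value questions.
- The word counts of the sample terms are all very low, and the chi-squared test is only a good approximation when the expected counts are larger (a common rule of thumb is at least 5 per category).
- It would be better to rerun the chi-squared test on terms that have higher frequencies.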
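
One way to pick terms with higher frequencies is to count how many questions each term appears in and keep the most common ones. Here is a minimal sketch on made-up questions, assuming the same six-character filter as before. `most_frequent_terms` is a hypothetical helper, not part of this notebook.

```python
from collections import Counter

def most_frequent_terms(clean_questions, k=10, min_length=6):
    counts = Counter()

    for question in clean_questions:
        # A set ensures each term counts once per question
        counts.update({word for word in question.split() if len(word) >= min_length})

    # Terms ordered from most to least frequent
    return [term for term, _ in counts.most_common(k)]

# Made-up questions for illustration
questions = [
    'this president signed the treaty',
    'this president vetoed the treaty',
    'the president of france',
]
top_terms = most_frequent_terms(questions, k=2)
print(top_terms)
```

On the real data, something like `most_frequent_terms(jeopardy['clean_question'])` could replace the random sample as the source of `comparison_terms`.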