# Winning Jeopardy

## Introduction

[Jeopardy](https://en.wikipedia.org/wiki/Jeopardy!) is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. The show is a quiz competition that reverses the traditional question-and-answer format of many quiz shows. Rather than being given questions, participants are instead given general knowledge clues in the form of answers and they must identify the person, place, thing, or idea that the clue describes, phrasing each response in the form of a question. You can find its official website [here](https://www.jeopardy.com/).

The aim of this project is to find patterns in the questions that could help us win. We'll work with a dataset containing about 20,000 rows taken from the beginning of a full dataset of Jeopardy questions, available for download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Each row represents a single question on a single episode of Jeopardy.

## Initial Data Exploration

In [172]:
import pandas
import csv
import re
from scipy.stats import chisquare
import numpy as np

jeopardy = pandas.read_csv("jeopardy.csv")
print('Number of rows:', jeopardy.shape[0])
print('Number of columns:', jeopardy.shape[1])
print('Number of missing values:', jeopardy.isnull().sum().sum())
jeopardy.head()

Number of rows: 19999
Number of columns: 7
Number of missing values: 0


| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [173]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have spaces in front, which should be fixed:

In [174]:
# Removing leading spaces from column names
cleaned_column_names = []
for column in jeopardy.columns:
    cleaned_column_names.append(column.lstrip())

jeopardy.columns = cleaned_column_names
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
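
As an aside, the same cleanup can be written without an explicit loop. The sketch below is only an illustration on a small made-up frame (the `df` name and its contents are hypothetical, not the Jeopardy data); it relies on the `.str.strip()` accessor that pandas provides on an `Index` of strings.

```python
import pandas as pd

# Hypothetical frame whose column names mimic the leading spaces seen above.
df = pd.DataFrame({
    "Show Number": [4680, 4680],
    " Air Date": ["2004-12-31", "2004-12-31"],
    " Round": ["Jeopardy!", "Jeopardy!"],
})

# Strip leading and trailing whitespace from every column label at once.
df.columns = df.columns.str.strip()

print(list(df.columns))
```

Unlike `lstrip()`, `strip()` also removes trailing whitespace, which is harmless here but worth keeping in mind.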

## Normalizing Columns

Now, let's normalize some columns to make it easier to conduct data analysis:

- Question and Answer – putting words in lowercase and removing punctuation,
- Value – removing the dollar sign and converting each value to numeric,
- Air Date – making it datetime.

In [175]:
# Lowercases a string, removes all punctuation, and collapses repeated whitespace.
def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Removes punctuation (such as "$" and ",") from a string and converts it to an integer.
# Returns 0 if the string cannot be converted.
def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text

jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])
jeopardy.head(3)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
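
To make the effect of the two normalization rules concrete, here is a small self-contained check on hand-written strings. The functions are repeated so the snippet runs on its own, and the sample inputs are made up for illustration rather than taken from the dataset.

```python
import re

def normalize_text(text):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return re.sub(r"\s+", " ", text)

def normalize_values(text):
    # Drop punctuation such as "$" and ",", then try to parse an integer.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(text)
    except ValueError:
        return 0

# Hypothetical sample inputs
print(normalize_text('In 1963, live on "The Art Linkletter Show"!'))
print(normalize_values("$2,000"))   # -> 2000
print(normalize_values("None"))     # -> 0 (not a number)
```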


## Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

To answer the first question, we can check how many times words in the answer also occur in the question.

In [176]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
answer_in_question_pct = round(jeopardy["answer_in_question"].mean()*100)
print("On average, {p}% of the words in an answer also appear in its question.".format(p=answer_in_question_pct))

On average, 6.0% of the words in an answer also appear in its question.


On average, only about 6% of the words in an answer also appear in its question. Hence, we can't rely on deducing the answer from the question, so let's reject this approach and proceed with the second one.
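
To see what the score means for a single row, the sketch below applies the same matching logic to three hand-made rows. Plain dictionaries stand in for DataFrame rows, and the example clues are hypothetical.

```python
def count_matches(row):
    # Share of the answer's words (ignoring one "the") that also occur in the question.
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for item in split_answer if item in split_question)
    return match_count / len(split_answer)

# Hypothetical rows: one partial match, one with no match, one with an empty answer.
partial = {"clean_answer": "the great gatsby",
           "clean_question": "this great american novel was written by f scott fitzgerald"}
no_match = {"clean_answer": "arizona",
            "clean_question": "the city of yuma in this state has a record average"}
only_the = {"clean_answer": "the", "clean_question": "the most common english word"}

print(count_matches(partial))   # "great" matches, "gatsby" does not -> 0.5
print(count_matches(no_match))  # -> 0.0
print(count_matches(only_the))  # nothing left after removing "the" -> 0
```

The column mean reported above is the average of these per-row shares, not the share of rows in which the full answer appears.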

## Recycled Questions

We're going to find out how often new questions are repeats of older ones. This approach also has an issue from the very beginning: we only have about 10% of the full Jeopardy question dataset. However, let's investigate it anyway.

The idea is to check whether the words in each question have already been used in earlier questions. We'll only look at words with six or more characters. This filters out words like "the" and "than", which are common but don't tell us much about a question.

Let's calculate the percentage of meaningful word overlap in all the questions of our dataset:

In [177]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1   # the number of repeated words
    for word in split_question:
        terms_used.add(word)   # a set of unique words in all the questions
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap
question_overlap_pct = round(jeopardy["question_overlap"].mean()*100)
print("{p}% of meaningful word overlap in questions.".format(p=question_overlap_pct))

69.0% of meaningful word overlap in questions.


We observe a substantial overlap between the meaningful words of earlier and later questions in our dataset. Note that this measures repeated words rather than repeated questions: we're looking at only about 10% of the full Jeopardy dataset and ignoring how words combine into phrases (and, hence, losing context). Still, it suggests that some question recycling is quite possible and should be investigated in more detail.
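
The overlap score is easier to follow on a tiny example. The sketch below reimplements the same idea as a standalone function (the name `overlap_scores` and the toy questions are assumptions made for illustration) and runs it on three made-up questions listed in chronological order.

```python
def overlap_scores(questions, min_len=6):
    # For each question, compute the share of its long words already seen
    # in earlier questions, then add its words to the set of seen terms.
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        hits = sum(1 for w in words if w in seen)
        scores.append(hits / len(words) if words else 0)
        seen.update(words)
    return scores

# Hypothetical, already normalized questions
toy_questions = [
    "this author wrote hamlet",           # long words: author, hamlet (none seen yet)
    "this english author wrote macbeth",  # english, author, macbeth (1 of 3 seen)
    "hamlet is set in this country",      # hamlet, country (1 of 2 seen)
]

scores = overlap_scores(toy_questions)
print(scores)
print(sum(scores) / len(scores))  # mean overlap across the toy questions
```

The first question always scores 0 because nothing has been seen yet, which is one reason the score depends on sorting by air date first.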

## Low Value vs High Value Questions

Since our major goal is to earn more money on Jeopardy, let's focus on high value questions instead of low value ones. For this purpose, we should figure out which words correspond to high-value questions using a chi-squared test, but first, we need to split the questions into two categories:

- low value – with the value <= 800,
- high value – with the value > 800.

We're going to perform the chi-squared test on 10 terms selected at random from the terms collected in the previous step (`terms_used`) to see which ones have larger differences between the number of high and low value questions where they occurred (doing this for all of the words would take a very long time).

In [178]:
# Takes in a row in Jeopardy as a Series and categorizes questions as high or low value ones.

def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

#Takes in a word and separately returns the numbers of high and low value questions the word occurs in.

def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_high_low = []

for term in comparison_terms:
    observed_high_low.append(count_usage(term))

print("The 10 randomly selected terms:"
      "\n", comparison_terms)
print("\nNumber of times each word occurred in high and low value questions:"
      "\n",observed_high_low)

The 10 randomly selected terms:
 ['wildlife', 'haggis', 'sleight', 'gunter', 'butalso', 'integrated', '67yrold', 'believing', 'swimming', 'nursery']

Number of times each word occurred in high and low value questions:
 [(1, 5), (1, 0), (1, 0), (2, 0), (0, 1), (1, 0), (0, 1), (1, 1), (0, 3), (1, 8)]
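
Randomly chosen terms tend to be rare, which leaves very small counts for the test. One possible alternative, sketched here under the assumption that we'd rather test the terms that occur in the most questions, is to rank terms with `collections.Counter`. The function name `most_frequent_terms` and the toy questions are hypothetical; this is not what the cell above does.

```python
from collections import Counter

def most_frequent_terms(questions, n=10, min_len=6):
    # Count in how many questions each long word occurs (once per question),
    # then return the n terms that occur in the most questions.
    counts = Counter()
    for question in questions:
        counts.update({w for w in question.split() if len(w) >= min_len})
    return [term for term, _ in counts.most_common(n)]

# Hypothetical, already normalized questions
toy_questions = [
    "this country borders france",
    "this country has the largest population",
    "the capital of this country is canberra",
    "this planet is the largest",
]

top_terms = most_frequent_terms(toy_questions, n=2)
print(top_terms)  # -> ['country', 'largest']
```

With the real data, the normalized question column could be passed in place of `toy_questions`.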


## Applying the Chi-Squared Test

Now that we've found the observed counts for the 10 selected terms, we can compute the expected counts, chi-squared values, and p-values:

In [179]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_high_low:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.42281054506129573, pvalue=0.515537958129453),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047),
 Power_divergenceResult(statistic=1.3570460299240277, pvalue=0.24405008712856013)]
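
To show what happens for a single term, the sketch below works through the same calculation by hand using only the standard library. All counts are hypothetical round numbers chosen for readability, not values from the dataset. With two categories the test has one degree of freedom, in which case the p-value can be written as `erfc(sqrt(statistic / 2))`.

```python
from math import erfc, sqrt

# Hypothetical dataset totals (not the real Jeopardy counts)
total_rows = 1000
high_value_count = 300
low_value_count = 700

# Hypothetical observed counts for one term: (high value, low value)
observed = (12, 8)

# Expected counts if the term were spread in proportion to the dataset.
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# p-value for one degree of freedom
p_value = erfc(sqrt(statistic / 2))

print(expected)
print(statistic, p_value)
```

Here both expected counts are at least 5, which is the usual rule of thumb for the chi-squared approximation to be trustworthy.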

## Conclusion

In this project, we tried to figure out some successful question-based strategies to win Jeopardy. We used about 10% of the questions from a full Jeopardy dataset. Below are the approaches applied and the results obtained:

- Checking if the answer tends to be hinted at in a question.
    - It happened only in 6% of cases.
- Investigating the possibility of question recycling, whether to study past questions or not.
    - A significant overlap (69%) between meaningful words of the questions suggests that this is a promising direction to investigate further.
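- Comparing how often terms occur in high value and low value questions with a chi-squared test.
    - Of the 10 terms tested, only 'gunter' had a p-value below 0.05 (about 0.026), and that result rests on just two occurrences. Almost all of the observed counts were lower than 5, so the chi-squared test isn't reliable here and no firm conclusion can be drawn.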