# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's assume that I want to compete on Jeopardy, and I'm looking for any edge I can get to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help me win.

# Exploring the Data

The dataset is named `jeopardy.csv`, and contains the first 19,999 rows of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). Here's the beginning of the file:

In [97]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from scipy.stats import chisquare

# Load the data
jeopardy = pd.read_csv("jeopardy.csv", parse_dates=[" Air Date"])

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 30)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number of the show this question was in.
- `Air Date` - the date the episode aired.
- `Round` - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` - the category of the question.
- `Value` - the dollar amount a correct answer to the question is worth.
- `Question` - the text of the question.
- `Answer` - the text of the answer.


## Data Cleaning

Let's start the cleaning phase by fixing the names of some columns.

In [98]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [0]:
jeopardy.columns = jeopardy.columns.str.strip()

In [100]:
print(jeopardy.info())
jeopardy.describe(include="all")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null datetime64[ns]
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB
None


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
count,19999.0,19999,19999,19999,19999,19999,19999
unique,,336,4,3581,76,19988,14963
top,,2007-11-13 00:00:00,Jeopardy!,TELEVISION,$400,[audio clue],Japan
freq,,62,9901,51,3892,5,22
first,,1984-09-21 00:00:00,,,,,
last,,2012-01-19 00:00:00,,,,,
mean,4312.730537,,,,,,
std,1374.121672,,,,,,
min,10.0,,,,,,
25%,3393.0,,,,,,


### Text Normalization

I need to normalize the `Question` and `Answer` columns. The goal is to lowercase words and remove punctuation so that `Don't` and `don't` aren't treated as different words when I compare them.

In [0]:
for col in ["Question", "Answer"]:
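    # regex=True is required in recent pandas versions, where str.replace defaults to literal matching
    jeopardy["clean_"+col.lower()] = jeopardy[col].str.lower().str.replace('target|blank', '', regex=True).str.replace(r'[^A-Za-z0-9\s]', "", regex=True)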
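
To make the effect of this step concrete, here is a minimal standalone sketch of the same lowercase-and-strip idea applied to single strings. The helper name `normalize_text` is my own, and it leaves out the `target|blank` removal used above.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except letters, digits, and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

print(normalize_text("Don't"))         # dont
print(normalize_text("don't"))         # dont
print(normalize_text("[audio clue]"))  # audio clue
```

Both spellings collapse to the same token, which is what the later word comparisons rely on.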

The `Value` column should also be numeric, to allow me to manipulate it more easily. I'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

In [0]:
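# regex=True is required in recent pandas versions so that "[$,]" is treated as a character class
jeopardy["clean_value"] = jeopardy["Value"].str.replace(r"[$,]", "", regex=True).str.replace("None", "0").astype(int)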
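
As a quick sanity check of the conversion rule, here is a small sketch that applies it to individual strings. The function name `parse_value` is hypothetical, and the sample inputs are illustrative rather than taken from specific rows.

```python
def parse_value(raw):
    """Convert a value string such as "$2,000" to an int; the string "None" becomes 0."""
    cleaned = raw.replace("$", "").replace(",", "")
    return 0 if cleaned == "None" else int(cleaned)

print(parse_value("$200"))    # 200
print(parse_value("$2,000"))  # 2000
print(parse_value("None"))    # 0
```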

In [103]:
jeopardy.describe(include="all")

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
count,19999.0,19999,19999,19999,19999,19999,19999,19999,19999,19999.0
unique,,336,4,3581,76,19988,14963,19987,14223,
top,,2007-11-13 00:00:00,Jeopardy!,TELEVISION,$400,[audio clue],Japan,audio clue,japan,
freq,,62,9901,51,3892,5,22,5,22,
first,,1984-09-21 00:00:00,,,,,,,,
last,,2012-01-19 00:00:00,,,,,,,,
mean,4312.730537,,,,,,,,,748.336267
std,1374.121672,,,,,,,,,653.988299
min,10.0,,,,,,,,,0.0
25%,3393.0,,,,,,,,,400.0


# Data Analysis

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

## Percentage of Answers Deducible from Questions

Let's see how many times words in the answer (excluding `the`) also occur in the question.

In [0]:
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [105]:
jeopardy["answer_in_question"].mean()

0.06054492631946508

On average, only about `6%` of the words in an answer also appear in its question. This isn't a huge number, and means that I probably can't just hope that hearing a question will enable me to figure out the answer.
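
To show what the per-row score looks like, here is a minimal sketch of the same matching logic on plain strings. The name `answer_overlap` and the toy question/answer pairs are made up for illustration and are not rows from the dataset.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring the first "the") that also appear in the question."""
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Invented examples: no overlap, then one of two answer words found in the question
print(answer_overlap("this state is home to the grand canyon", "arizona"))         # 0.0
print(answer_overlap("john quincy was the son of this president", "john adams"))   # 0.5
```

A score of 1.0 would mean every answer word already appears in the question, so a column mean near 0.06 says that happens rarely.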

## Percentage that New Questions are Repeats of Older Questions

Let's see how often complex words (6 or more characters) reoccur.

To do this, I can:

- Sort `jeopardy` in order of ascending air date.
- Maintain a *set* called `terms_used` that will be empty initially.
- Iterate through each row of `jeopardy`.
- Split `clean_question` into words, drop any word shorter than 6 characters, and check whether each remaining word occurs in `terms_used`.
    - If it does, increment a counter.
- Add each word to `terms_used`.
- Divide the counter by the number of remaining words to get the overlap for that question.

This lets me check whether the terms in a question have been used previously. Only looking at words with 6 or more characters enables me to filter out words like `the` and `than`, which are commonly used but don't tell me a lot about a question.

In [106]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words with 6 or more characters
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6857367850919883

On average, about `69%` of the long terms in a question have already appeared in an earlier question. This only looks at a small set of questions, and it compares single terms rather than phrases, so it isn't strong evidence that whole questions repeat. It does suggest that it's worth looking more into the recycling of questions.
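
To see how the overlap score behaves, here is a toy run of the same procedure on three invented questions, assumed to be already cleaned and listed in air-date order. The function name `question_overlaps` and the `min_length` parameter are my own.

```python
def question_overlaps(clean_questions, min_length=6):
    """For each question, return the share of its long words seen in earlier questions."""
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

toy_questions = [
    "this ancient empire built the colosseum",    # nothing seen before
    "this ancient philosopher taught alexander",  # only "ancient" was seen
    "name the river",                             # no long words at all
]
print(question_overlaps(toy_questions))  # [0.0, 0.25, 0]
```

A single shared word like "ancient" is enough to raise the score, which is why a high average overlap does not by itself mean that full questions are being reused.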

## High-Value Questions

I only want to study topics that come up in high-value questions rather than low-value ones. This will help me earn more money when I'm on Jeopardy.

I can actually figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to narrow down the questions into two categories:

- Low value - Any row where `Value` is 800 or less.
- High value - Any row where `Value` is greater than 800.

In [107]:
# Create column High Value
jeopardy["high_value"] = np.where(jeopardy["clean_value"] > 800, 1, 0)

jeopardy["high_value"].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [108]:
# Creating Bag of Words
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(jeopardy["clean_question"])
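# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
jeopardy_count = pd.DataFrame(word_counts.toarray(), columns=count_vect.get_feature_names_out(), index=jeopardy.index)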

# Create new dataframe with the high_value column and the count of every word
jeopardy_count = pd.concat([jeopardy["high_value"], jeopardy_count], axis=1)

# Group to have a total for every word
chi2_df = jeopardy_count.iloc[:,1:].sum().T.to_frame(name="total")

chi2_df.head(10)

Unnamed: 0,total
1,1
7,2
11331,1
2,1
5,1
8,1
10,118
100,56
1000,23
10000,15
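
For intuition about what the `total` column holds, here is a minimal bag-of-words sketch using only the standard library. It is not what `CountVectorizer` does internally (for instance, its default tokenizer ignores one-character tokens), and the name `word_totals` and the toy questions are invented.

```python
from collections import Counter

def word_totals(clean_questions):
    """Count how many times each word appears across all questions."""
    totals = Counter()
    for question in clean_questions:
        totals.update(question.split())
    return totals

toy_questions = ["this river flows north", "this river is long"]
totals = word_totals(toy_questions)
print(totals["river"])  # 2
print(totals["north"])  # 1
```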


### Observed

Now I have to:

- Find the number of times each word occurs in low value questions.
- Find the number of times each word occurs in high value questions.

In [109]:
# Group by high_value and sum()
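# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
chi2_df = chi2_df.merge(jeopardy_count.pivot_table(index="high_value", values=list(count_vect.get_feature_names_out()), aggfunc="sum").T, left_index=True, right_index=True)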

# Change name of columns
chi2_df.rename(columns={0: "observed_low", 1: "observed_high"}, inplace=True)

chi2_df.head(10)

Unnamed: 0,total,observed_low,observed_high
1,1,0,1
7,2,2,0
11331,1,1,0
2,1,1,0
5,1,1,0
8,1,0,1
10,118,92,26
100,56,43,13
1000,23,17,6
10000,15,12,3
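
The split into observed counts can be pictured with a small pure-Python sketch. It assumes each row is a `(clean_question, high_value)` pair with `high_value` equal to 0 or 1; the function name `observed_counts` and the toy rows are hypothetical.

```python
def observed_counts(rows):
    """Return {word: [count_in_low_value, count_in_high_value]} for (question, high_value) pairs."""
    counts = {}
    for question, high_value in rows:
        for word in question.split():
            # Index 0 holds the low-value count, index 1 the high-value count
            counts.setdefault(word, [0, 0])[high_value] += 1
    return counts

toy_rows = [
    ("this river flows north", 0),
    ("this river is long", 1),
    ("name this river", 0),
]
counts = observed_counts(toy_rows)
print(counts["river"])  # [2, 1]
print(counts["long"])   # [0, 1]
```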


### Expected

Find the expected counts for each word by splitting its total number of occurrences between the low and high value groups in proportion to the number of questions in each group.

In [110]:
# Create expected_low
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
chi2_df["expected_low"] = chi2_df["total"].apply(lambda x: low_value_count * x / jeopardy.shape[0])

# Create expected_high
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
chi2_df["expected_high"] = chi2_df["total"].apply(lambda x: high_value_count * x / jeopardy.shape[0])

chi2_df.head(10)

Unnamed: 0,total,observed_low,observed_high,expected_low,expected_high
1,1,0,1,0.713286,0.286714
7,2,2,0,1.426571,0.573429
11331,1,1,0,0.713286,0.286714
2,1,1,0,0.713286,0.286714
5,1,1,0,0.713286,0.286714
8,1,0,1,0.713286,0.286714
10,118,92,26,84.167708,33.832292
100,56,43,13,39.943997,16.056003
1000,23,17,6,16.40557,6.59443
10000,15,12,3,10.699285,4.300715
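
As a worked check, the expected counts for the word `10` (118 total occurrences) can be reproduced by hand from the group sizes reported by `value_counts()` earlier. The constant and function names below are my own.

```python
LOW_VALUE_COUNT = 14265   # rows with high_value == 0
HIGH_VALUE_COUNT = 5734   # rows with high_value == 1
TOTAL_ROWS = LOW_VALUE_COUNT + HIGH_VALUE_COUNT  # 19999

def expected_counts(total):
    """Split a word's total occurrences in proportion to the size of each group."""
    expected_low = total * LOW_VALUE_COUNT / TOTAL_ROWS
    expected_high = total * HIGH_VALUE_COUNT / TOTAL_ROWS
    return expected_low, expected_high

low, high = expected_counts(118)
print(round(low, 6), round(high, 6))  # 84.167708 33.832292
```

The two expected counts always add up to the word's total, since the two group shares add up to 1.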


### Chi-Squared Test

Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

In [111]:
# Apply chisquare() to all the rows
chi2_df[["chi_value","pvalue"]] = chi2_df.iloc[:,1:].apply(lambda x: chisquare(x[:2], x[2:]), axis=1, result_type="expand")

chi2_df.head(10)

Unnamed: 0,total,observed_low,observed_high,expected_low,expected_high,chi_value,pvalue
1,1,0,1,0.713286,0.286714,2.487792,0.114733
7,2,2,0,1.426571,0.573429,0.803926,0.369922
11331,1,1,0,0.713286,0.286714,0.401963,0.526077
2,1,1,0,0.713286,0.286714,0.401963,0.526077
5,1,1,0,0.713286,0.286714,0.401963,0.526077
8,1,0,1,0.713286,0.286714,2.487792,0.114733
10,118,92,26,84.167708,33.832292,2.542042,0.110851
100,56,43,13,39.943997,16.056003,0.815467,0.366509
1000,23,17,6,16.40557,6.59443,0.075121,0.784022
10000,15,12,3,10.699285,4.300715,0.551519,0.457698
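
To show what `chisquare()` computes for each row, here is a hand-rolled sketch for the two-group case. With two categories there is one degree of freedom, and for that case the p-value can be written with the complementary error function. The function name is hypothetical; the inputs are the observed and expected counts for the word `10` from the table above.

```python
import math

def chi_squared_two_groups(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p = chi_squared_two_groups([92, 26], [84.167708, 33.832292])
print(round(chi2, 4), round(p, 4))  # 2.542 0.1109
```

These agree with the `chi_value` and `pvalue` columns for that row, so the word `10` shows no significant difference between the two groups.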


### Select Good Values

I can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the lowest associated p-values.

In [112]:
# Flag rows with pvalue < 0.01
chi2_df["good_pvalue"] = np.where(chi2_df["pvalue"] < .01, 1, 0)

# Percentage difference between observed_high and expected_high
# (positive means the word appears in high-value questions more often than expected)
chi2_df["rate_high"] = 100 * (chi2_df["observed_high"] - chi2_df["expected_high"]) / chi2_df["expected_high"]

chi2_df.head(10)

Unnamed: 0,total,observed_low,observed_high,expected_low,expected_high,chi_value,pvalue,good_pvalue,rate_high
1,1,0,1,0.713286,0.286714,2.487792,0.114733,0,248.779212
7,2,2,0,1.426571,0.573429,0.803926,0.369922,0,-100.0
11331,1,1,0,0.713286,0.286714,0.401963,0.526077,0,-100.0
2,1,1,0,0.713286,0.286714,0.401963,0.526077,0,-100.0
5,1,1,0,0.713286,0.286714,0.401963,0.526077,0,-100.0
8,1,0,1,0.713286,0.286714,2.487792,0.114733,0,248.779212
10,118,92,26,84.167708,33.832292,2.542042,0.110851,0,-23.150343
100,56,43,13,39.943997,16.056003,0.815467,0.366509,0,-19.033397
1000,23,17,6,16.40557,6.59443,0.075121,0.784022,0,-9.014119
10000,15,12,3,10.699285,4.300715,0.551519,0.457698,0,-30.244158
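
The `rate_high` column is a plain percentage difference, which a short sketch makes easy to verify against the table. The function below reuses the column name for readability; the inputs are copied from the rows for `10` and for the first word in the table.

```python
def rate_high(observed_high, expected_high):
    """Percentage by which the observed high-value count differs from the expected one."""
    return 100 * (observed_high - expected_high) / expected_high

print(round(rate_high(26, 33.832292), 2))  # -23.15
print(round(rate_high(1, 0.286714), 2))    # 248.78
```

The second case shows why rare words need care: a single occurrence in a high-value question produces a large rate, which is the reason for also requiring a minimum `total` when selecting words.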


In [113]:
# Keep significant words that are over-represented in high-value questions, sorted by rate_high
good_values = chi2_df.loc[(chi2_df["good_pvalue"] == 1) & (chi2_df["rate_high"] > 0) & (chi2_df["total"] > 50), ["total", "pvalue", "rate_high"]].sort_values("rate_high", ascending=False)

good_values.head(30)

Unnamed: 0,total,pvalue,rate_high
shows,87,1.037328e-12,120.492605
kelly,55,5.59023e-06,96.584647
very,64,5.133095e-05,79.839281
african,68,3.219584e-05,79.518712
sarah,95,2.472833e-06,76.225286
crew,368,2.4296830000000002e-17,69.650758
ancient,72,0.0005000218,64.701294
clue,420,1.42138e-16,63.594059
thisa,142,2.701016e-06,62.108648
court,71,0.0009066988,62.108648


Many words show a significant difference in usage (p-value below 1%) between high value and low value questions.

So, it will be useful to focus more on questions containing these words.