# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

In this project, I will work with a dataset of Jeopardy questions to figure out some patterns in the questions.

Here is an explanation of each column:
- `Show Number`: the Jeopardy episode number of the show this question was in
- `Air Date`: the date the episode aired
- `Round`: the round in which the question was asked
- `Category`: the category of the question
- `Value`: the amount in dollars the question is worth
- `Question`: the text of the question
- `Answer`: the text of the answer

In [1]:
# import the libraries used in this project
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces. The next cell removes all spaces from the column names.

In [2]:
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.head(10)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel
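
As a small aside, the sketch below contrasts two ways of cleaning the column names: removing every space (what the cell above does) and stripping only leading and trailing whitespace with `str.strip()`. The column list is copied from the output of the first cell; the variable names are made up for illustration.

```python
import pandas as pd

# Column names as they appear in the raw file (note the leading spaces).
raw_columns = pd.Index(['Show Number', ' Air Date', ' Round', ' Category',
                        ' Value', ' Question', ' Answer'])

# Approach used in this notebook: remove every space, including internal ones.
no_spaces = raw_columns.str.replace(' ', '')

# Alternative: strip only leading/trailing whitespace and keep internal spaces.
stripped = raw_columns.str.strip()

print(list(no_spaces))
print(list(stripped))
```

Either approach works, as long as the rest of the notebook refers to the columns consistently. This notebook uses the names without spaces, such as `AirDate`.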


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


Before we get started on the analysis, we will normalize the `Question` and `Answer` columns by converting the strings to lowercase and removing all punctuation. This ensures that words like "don't" and "Don't" are treated as the same word when we compare them.

In [4]:
import re

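def normalize_text(text):
    # Lowercase the text, then drop everything except letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    return text

def normalize_values(text):
    # Remove punctuation such as "$" and ",", then convert to an integer;
    # values that cannot be parsed as an integer become 0
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text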

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)         
jeopardy.head(10)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant,200
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ita...,the appian way,400
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan,400
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington,400
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel,400
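
To make the effect of the normalization concrete, here is a minimal sketch that applies the same regular expression to a few sample strings. The sample list is chosen for illustration only.

```python
import re

def normalize_text(text):
    # Lowercase, then keep only letters, digits, and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

samples = ["Don't", "don't", "McDonald's", "Crate & Barrel"]
normalized = [normalize_text(s) for s in samples]
print(normalized)

# Removing "&" leaves two consecutive spaces; split() without arguments
# collapses runs of whitespace, whereas split(' ') yields an empty string.
print(normalized[3].split())
print(normalized[3].split(' '))
```

One consequence worth keeping in mind: removing a character that sits between two spaces leaves a double space, so splitting on `' '` can produce empty strings, while `split()` with no argument does not.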


Also, the `AirDate` column should be of type `datetime`, not `object`.

In [5]:
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
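
A short sketch of why the conversion is useful: once a column is `datetime64`, it sorts chronologically and exposes date parts through the `.dt` accessor. The dates below are hypothetical values, not rows from the dataset.

```python
import pandas as pd

# Hypothetical air dates stored as strings, as in the raw CSV.
dates = pd.Series(['2004-12-31', '2010-07-06', '1997-03-14'])

# Convert to datetime64 so the values can be compared and sorted as dates.
parsed = pd.to_datetime(dates)

years = parsed.dt.year.tolist()
earliest_first = parsed.sort_values().dt.strftime('%Y-%m-%d').tolist()

print(parsed.dtype)
print(years)
print(earliest_first)
```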

# Study Strategies

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question
- How often new questions are repeats of older questions

We can answer the first question by seeing how many times words in the answer also occur in the question.

In [6]:
def from_question(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    # 'the' is commonly found in answers and questions, but has no meaningful use in answering
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
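    # Count the answer words that also appear in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    # Note: the score is normalized by the number of words in the question
    return match_count / len(split_question)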

jeopardy['answer_in_question'] = jeopardy.apply(from_question, axis=1)
avg = jeopardy['answer_in_question'].mean()
print(avg)

0.012300977639028241
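
The score depends on what the match count is divided by. The sketch below is a hypothetical helper, not the function used above. It computes the overlap for one made-up question and answer pair using either the question length or the answer length as the denominator. Unlike `from_question`, it drops every occurrence of 'the' and splits on any whitespace.

```python
def answer_overlap(clean_question, clean_answer, denominator="question"):
    """Share of matching answer words, relative to the chosen denominator."""
    question_words = clean_question.split()
    # Ignore 'the', which carries no information about the answer.
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words or not question_words:
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    if denominator == "answer":
        return matches / len(answer_words)
    return matches / len(question_words)

# Made-up example, already lowercased and stripped of punctuation.
question = "this roman road was called the queen of roads"
answer = "the appian road"

by_question = answer_overlap(question, answer, denominator="question")
by_answer = answer_overlap(question, answer, denominator="answer")
print(round(by_question, 3), round(by_answer, 3))
```

Dividing by the answer length gives the fraction of answer words found in the question, which is arguably closer to "how much of the answer is given away". Dividing by the question length gives a much smaller number for the same match.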


# Answers from Questions

The mean score is about 0.012, meaning that answer words appearing in the question account for only about 1% of the question's words on average. This is a negligible amount, so we can't just hope that hearing a question will let us figure out the answer; we will have to study.

Now we will investigate how often new questions are repeats of older ones. We cannot answer this completely, since we are only working with about 10% of the full Jeopardy question dataset; however, we can at least investigate it.

To do this, we will:
- Sort `jeopardy` in order of ascending air date
- Maintain a set called `terms_used` that is initially empty
- Iterate through each row of `jeopardy`
- Split `clean_question` into words, remove any words shorter than 6 characters, and check whether each remaining term is in `terms_used`
    - If it is, increment a counter
    - Otherwise, add the term to `terms_used`
    
This lets us check whether the terms in a question have been used previously. Keeping only words of 6 or more characters filters out words like 'the' and 'than', which are common but don't tell us much about the question.

In [7]:
question_overlap = list()
terms_used = set()

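# Assumes jeopardy is already sorted by ascending AirDate, e.g. with jeopardy.sort_values('AirDate')
for item, row in jeopardy.iterrows():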
    split_question = row['clean_question'].split(' ')
    split_question = [term for term in split_question if len(term) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6925960057338647
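
To see how the overlap score behaves, here is a minimal sketch of the same procedure on three made-up questions. The function name and the toy questions are hypothetical, and the list is assumed to be in air-date order already.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long terms already seen earlier."""
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only terms with at least min_length characters.
        terms = [t for t in question.split() if len(t) >= min_length]
        matches = 0
        for term in terms:
            if term in terms_used:
                matches += 1
            else:
                terms_used.add(term)
        scores.append(matches / len(terms) if terms else 0)
    return scores

# Toy questions, assumed to be listed from oldest to newest.
toy_questions = [
    "this american president signed the declaration",
    "the second american president",
    "a river in egypt",
]

scores = overlap_scores(toy_questions)
print([round(s, 2) for s in scores])
```

The first question scores 0 because nothing has been seen yet, the second reuses two of its three long terms, and the third has no terms of 6 or more characters at all. Questions like the third one contribute a 0 to the mean, which pulls the average down without saying anything about repetition.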

# Question Overlap

There is about a 70% overlap between terms in new questions and terms in old questions. However, this analysis covers only a small sample of questions and compares single terms rather than phrases, so the figure says little about whether whole questions are recycled. Still, it is worth investigating further.

One study strategy would be to focus on high-value questions, since answering them correctly earns more money on Jeopardy. We can figure out which terms correspond to high-value questions using a chi-squared test.

To do this, we first have to split the questions into two categories:

- Low value -- any row where `clean_value` is 800 or less
- High value -- any row where `clean_value` is greater than 800

Then, we can loop through each term in `terms_used` and:
- Find the number of low value questions the word occurs in
- Find the number of high value questions the word occurs in
- Find the percentage of questions the word occurs in
- Based on the percentage of questions the word occurs in, find expected counts
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions

We use a chi-squared test because it lets us check whether there is a relationship between a term appearing in a question and that question being high or low value.

In [8]:
def change_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(change_value, axis=1)
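
The same flag can also be computed without a row-wise `apply`. The sketch below uses a boolean comparison on a small made-up DataFrame; the name `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up question values, including a row exactly at the 800 threshold.
toy = pd.DataFrame({'clean_value': [200, 800, 1000, 0, 2000]})

# Comparing the whole column at once yields booleans; astype(int) maps
# True/False to 1/0, matching the output of change_value.
toy['high_value'] = (toy['clean_value'] > 800).astype(int)

flags = toy['high_value'].tolist()
print(flags)
```

Note that a value of exactly 800 is flagged as low value, because the comparison is strictly greater than 800.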

In [10]:
def counter(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
            
    return high_count, low_count

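# Observed (high, low) counts for a small sample of terms.
# A set has no defined order, so this picks five arbitrary terms.
observed_expected = []
comparison_terms = list(terms_used)[:5]
for i in comparison_terms:
    observed_expected.append(counter(i))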
    
observed_expected

[(0, 1), (0, 1), (0, 1), (1, 0), (1, 0)]

In [12]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
total = 0
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    expected_hi = total_prop * high_value_count
    expected_lo = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([expected_hi, expected_lo])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]
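
To show what the test computes, here is a worked sketch of the chi-squared statistic calculated by hand for a single term. All counts are hypothetical and are not taken from the Jeopardy data. With two categories there is one degree of freedom, and in that case the p-value can be written with the complementary error function.

```python
import math

# Hypothetical totals, not taken from the Jeopardy data.
total_rows = 1000
high_value_rows = 300
low_value_rows = 700

# Hypothetical observed counts for one term: (high-value, low-value) questions.
observed_high, observed_low = 12, 18

# Share of all questions that contain the term.
term_share = (observed_high + observed_low) / total_rows

# Expected counts if the term were spread evenly across both groups
# (about 9 and about 21 for these numbers).
expected_high = term_share * high_value_rows
expected_low = term_share * low_value_rows

# Sum of (observed - expected)^2 / expected over both categories.
chi_squared_stat = (
    (observed_high - expected_high) ** 2 / expected_high
    + (observed_low - expected_low) ** 2 / expected_low
)

# With one degree of freedom, the chi-squared survival function
# reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_squared_stat / 2))

print(round(chi_squared_stat, 4), round(p_value, 4))
```

For these made-up counts, the statistic is 1 + 9/21, roughly 1.43, which is not significant at the usual 0.05 level. Passing the same observed and expected counts to `scipy.stats.chisquare` should give the same statistic.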

# Chi-squared Results

None of the terms showed a significant difference in usage between high-value and low-value rows (all p-values are above 0.05). Additionally, the observed frequencies were all lower than 5, so the chi-squared test is not reliable here. It would be better to run this test only on terms that have higher frequencies.
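
One possible way to pick higher-frequency terms is sketched below. It counts how many questions each long term appears in and keeps only the terms above a threshold. The function name, the threshold, and the toy questions are all assumptions made for illustration.

```python
from collections import Counter

def frequent_terms(clean_questions, min_length=6, min_questions=5):
    """Map each long term to the number of questions it appears in,
    keeping only terms found in at least min_questions questions."""
    counts = Counter()
    for question in clean_questions:
        # Use a set so a term repeated within one question counts once.
        counts.update({t for t in question.split() if len(t) >= min_length})
    return {term: n for term, n in counts.items() if n >= min_questions}

# Toy data: six made-up, already-normalized questions.
toy_questions = (
    ["american president"] * 3
    + ["american history"] * 2
    + ["ancient history"]
)

frequent = frequent_terms(toy_questions, min_questions=3)
print(sorted(frequent.items()))
```

On the real data, the resulting terms could replace the five arbitrary terms taken from `terms_used`, so that the observed counts are large enough for the chi-squared test to be meaningful.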