# Winning Jeopardy

In this project, we'll work with a dataset of Jeopardy questions and look for patterns that could give us an edge in winning at Jeopardy. Information on the dataset and the `jeopardy.csv` file can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

First, we'll read in the data and format the columns.

In [1]:
import pandas as pd
import csv
import re

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Some of the column names have leading spaces; strip them
jeopardy.columns = jeopardy.columns.str.strip()

## Normalizing the Text and Numeric Columns

Before we start our analysis, we need to normalize the `Question` and `Answer` columns by removing punctuation and making sure all words are lowercase so that we will be able to compare them.

In [4]:
def normalize_text(text):
    text = text.lower()
    # Raw strings keep '\s' from being read as an invalid string escape
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # Removes all punctuation
    text = re.sub(r'\s+', ' ', text) # Replaces any run of whitespace with a single space
    return text
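
As a quick sanity check, here is a minimal sketch that reruns the same normalization on a made-up string (the sample text is an assumption for illustration, not a row from the dataset):

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text

# Hypothetical input with mixed case, punctuation, and a double space
sample = 'In 1963,  live on "The Art Linkletter Show"!'
cleaned = normalize_text(sample)
print(cleaned)
```

Only lowercase letters, digits, and single spaces should remain, which is what makes word-by-word comparison between questions and answers possible.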

In [5]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

Next, we'll normalize the `Value` column so that it is fully numeric, and convert the `Air Date` column from strings to datetime values.

In [6]:
def normalize_value(value):
    # Strip the dollar sign and thousands separators
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    try: # Anything that is not a plain integer after stripping (e.g. an empty string) becomes 0
        value = int(value)
    except Exception:
        value = 0
    return value

In [7]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
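
To see how the conversion behaves, the sketch below applies the same logic to a few made-up strings. The inputs are assumptions chosen to cover a plain dollar amount, an amount with a thousands separator, and a non-numeric value:

```python
import re

def normalize_value(value):
    """Strip punctuation and convert to int, falling back to 0."""
    value = re.sub(r'[^A-Za-z0-9\s]', '', value)
    try:
        return int(value)
    except ValueError:
        # Letters or an empty string cannot be parsed as an integer
        return 0

# Hypothetical inputs, not taken from the dataset
samples = ['$200', '$2,000', 'None', '']
converted = [normalize_value(s) for s in samples]
print(converted)
```

Mapping unparseable values to 0 keeps the column numeric, at the cost of making those rows indistinguishable from genuine zero-value questions.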

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show this c... | mcdonalds | 200 |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the const... | john adams | 200 |


In [10]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Answers in Questions

The next step is to decide whether to study past questions, study general knowledge, or not study at all. To inform that decision, we want to know how often the answer can be found in the question itself, and how often new questions are just repeats of old ones.

We'll first write a function that computes the proportion of words in the answer that also appear in the question.

In [12]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

In [13]:
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
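
To make the metric concrete, here is a small sketch that runs the same function on three made-up rows. Plain dictionaries stand in for DataFrame rows, and the questions and answers are assumptions for illustration only:

```python
def count_matches(row):
    """Share of answer words (ignoring one 'the') that appear in the question."""
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows, not taken from the dataset
rows = [
    {'clean_answer': 'the new york times',
     'clean_question': 'this new york paper is nicknamed the gray lady'},
    {'clean_answer': 'arizona',
     'clean_question': 'this state is home to the grand canyon'},
    {'clean_answer': 'the',
     'clean_question': 'this article is the most common english word'},
]
scores = [count_matches(row) for row in rows]
print(scores)
```

The first row scores 2/3 because "new" and "york" appear in the question but "times" does not. On the real data, averaging the new column with `jeopardy['answer_in_question'].mean()` would give the overall share.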

It looks like words from the answer appear in the question only about 6% of the time, so hoping to hear the answer in the question isn't a strategy we can rely on to win Jeopardy.

## Recycled Questions

Without access to the entire Jeopardy question dataset, we can't know for certain whether a question repeats an older one, but we can still investigate how often complex words (here, words longer than five characters) recur.

In [18]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [19]:
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6877034572001164

On average, about 69% of the longer terms in a question have already appeared in an earlier question. This only looks at single terms rather than whole phrases, but it tells us that recycled questions are worth looking into further.
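
The overlap measure can be illustrated on a tiny example. The sketch below uses three made-up questions and a hypothetical helper named `overlap_scores`; it follows the same idea as the loop above but is not the notebook's own code:

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of long words already seen in earlier questions."""
    terms_seen = set()
    scores = []
    for question in questions:
        # Keep only words with at least min_length characters
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            score = sum(w in terms_seen for w in words) / len(words)
        else:
            score = 0
        # Record this question's words only after scoring it
        terms_seen.update(words)
        scores.append(score)
    return scores

# Hypothetical questions, listed in air-date order
questions = [
    'which famous scientist discovered penicillin',
    'this scientist proposed general relativity',
    'famous painter of the starry night',
]
overlaps = overlap_scores(questions)
print(overlaps)
```

The first question scores 0 because nothing has been seen yet. Scores tend to rise as the vocabulary accumulates, so part of a high average simply reflects common long words recurring.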

## Low Value vs. High Value Questions

If we study high-value questions rather than low-value ones, we could earn more money on Jeopardy. We can figure out which terms correspond to higher-value questions using a chi-squared test.

We'll start by categorizing our questions into low value and high value. Then we'll loop through each term in `terms_used` to calculate the chi-squared value.

In [None]:
import random

def determine_value(row):
    # Questions worth more than $800 are high value (1); all others are low value (0)
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

# Determine which questions are high and low value
jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

def count_usage(word):
    # Count how many high-value and low-value questions contain the word
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# Randomly pick ten terms to compare
comparison_terms = random.sample(list(terms_used), 10)

# Observed (high_count, low_count) pairs, one per term
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))
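
With high-value and low-value counts for each term, the chi-squared step mentioned earlier could look roughly like the sketch below. This is an assumption about how the analysis might continue, not the notebook's own code: the helper names and the counts are hypothetical, and the statistic is computed by hand instead of with a library routine such as `scipy.stats.chisquare`.

```python
import math

def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def term_chi_squared(high_count, low_count, high_total, low_total):
    """Compare a term's observed high/low counts with counts expected by chance."""
    total_rows = high_total + low_total
    # Share of all questions that contain the term
    term_share = (high_count + low_count) / total_rows
    # Counts we would expect if the term were unrelated to question value
    expected = [term_share * high_total, term_share * low_total]
    stat = chi_squared([high_count, low_count], expected)
    # Two categories give one degree of freedom, where the upper-tail
    # probability has the closed form erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: a term found in 30 high-value and 30 low-value
# questions, in a dataset of 400 high-value and 600 low-value questions
stat, p_value = term_chi_squared(high_count=30, low_count=30,
                                 high_total=400, low_total=600)
print(stat, p_value)
```

In this made-up case the expected counts are 24 and 36, so the statistic is 36/24 + 36/36 = 2.5. Terms that appear in only a handful of questions give very small expected counts, for which the chi-squared approximation is unreliable.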