# Using Python to Win Jeopardy
Using Python to find patterns in how Jeopardy phrases its questions, in the hope of gaining an edge as a contestant.

In [27]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from scipy.stats import chisquare
import re
import string
import seaborn as sns
%matplotlib inline

## Read in the Data Set

In [2]:
jeopardy = pd.read_csv("jeopardy.csv")

In [3]:
jeopardy.head(5)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


## Cleaning the Data
Some column headers have a leading space, so we need to strip it. We also want to convert the dates to datetime, the dollar values to integers, and the questions and answers to lowercase text with no punctuation.

In [4]:
# Strip the leading space that some column headers carry.
jeopardy.columns = jeopardy.columns.str.lstrip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
def norm_text(text):
    text = text.lower()
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)
    return text
    
jeopardy["clean_question"] = jeopardy["Question"].apply(norm_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(norm_text)
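
Stripping punctuation alone does not remove HTML markup. The word-frequency output at the end of this notebook contains the token `targetblankherea`, which looks like the residue of a link such as `target="_blank">here</a>`. Below is a minimal sketch of a variant that drops tags first, assuming tags are simple `<...>` spans. `norm_text_no_html` is a hypothetical name and the sample string is invented; neither is part of the original notebook.

```python
import re
import string

# Hypothetical helper: same idea as norm_text, but HTML tags are removed first.
TAG_PATTERN = re.compile(r"<[^>]+>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def norm_text_no_html(text):
    # Replace each tag with a space so neighbouring words do not fuse together.
    text = TAG_PATTERN.sub(" ", text)
    # Lowercase and strip punctuation, as norm_text does.
    text = text.lower().translate(PUNCT_TABLE)
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join(text.split())

# Invented sample string, for illustration only.
sample = 'Seen <a href="http://example.com/pic.jpg" target="_blank">here</a>, this city...'
print(norm_text_no_html(sample))
```

If this behaviour is wanted, it could be passed to `.apply` in place of `norm_text`.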

In [6]:
def norm_dollar(text):
    try:
        translator = str.maketrans('', '', string.punctuation)
        text = text.translate(translator)
        text = int(text)
        return text
    except ValueError:
        return 0

jeopardy["clean_value"] = jeopardy["Value"].apply(norm_dollar)
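
`norm_dollar` only catches `ValueError`, which covers strings that are not numbers. If a missing value were read in as `NaN` (a float), `.translate` would raise `AttributeError` instead. The following is a hedged sketch of a variant that guards against non-string input. `norm_dollar_safe` is a hypothetical name, and whether this dataset actually contains such values is an assumption.

```python
import math
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def norm_dollar_safe(value):
    # Non-string input (for example a float NaN) has no .translate method.
    if not isinstance(value, str):
        return 0
    try:
        # "$2,000" becomes "2000" once punctuation is stripped.
        return int(value.translate(PUNCT_TABLE))
    except ValueError:
        # Strings that are not plain digits after stripping fall back to 0.
        return 0

print(norm_dollar_safe("$2,000"), norm_dollar_safe("None"), norm_dollar_safe(math.nan))
```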

In [7]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

## Is Studying Past Questions Helpful in Jeopardy?
We will write functions to help answer two questions:

* How often the answer is deducible from the question.

* How often new questions are repeats of older questions.

In [8]:
def deduct_from_answer(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count/len(split_answer)

answer_in_question = jeopardy.apply(deduct_from_answer, axis=1)
answer_in_question.mean()

0.060352773854698942

On average, only about 6% of the words in an answer also appear in its question. That is low, so the chance of working out an answer just from hearing the question is small.
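
To make the score concrete, here is the same word-matching rule applied to a single made-up row. `answer_word_share` is a hypothetical standalone version of `deduct_from_answer`, and the example strings are invented for illustration.

```python
def answer_word_share(clean_question, clean_answer):
    answer_words = clean_answer.split(" ")
    question_words = clean_question.split(" ")
    # Remove the first occurrence of "the", as the notebook's function does.
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    # Count the answer words that also occur somewhere in the question.
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Invented example: only "tower" is shared, so half of the answer words match.
share = answer_word_share("this tower in paris opened in 1889", "the eiffel tower")
print(share)
```

A score of 0.5 for one row means half of the answer's words were given away by the question; the notebook's figure is the mean of this score over all rows.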

### Finding Question Overlap

In [9]:
question_overlap = []
terms_used = set([])

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
        
jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.69195779922036438

### Question Overlap
On average, about 69% of the longer terms (six or more characters) in a question had already appeared in an earlier question. This covers only a small set of questions, and it compares single terms rather than phrases, so on its own it is weak evidence. Note also that "earlier" here means earlier in the file, because the rows are not sorted by air date before the loop runs. Still, the overlap is high enough that the recycling of questions is worth a closer look.
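
If "older" is meant chronologically, the rows would need to be visited in air-date order. The sketch below shows one way to do that in plain Python, assuming each row is an `(air_date, clean_question)` pair. `overlap_by_date` is a hypothetical name and the two sample rows are invented. Unlike the loop above, it adds a question's terms to the seen set only after scoring it, so a word repeated within one question does not count as a match.

```python
from datetime import date

def overlap_by_date(rows, min_len=6):
    """Return (air_date, overlap) pairs, visiting rows in date order."""
    seen = set()
    result = []
    for air_date, question in sorted(rows, key=lambda row: row[0]):
        # Keep only longer terms, matching the notebook's len(q) > 5 filter.
        terms = [word for word in question.split() if len(word) >= min_len]
        matches = sum(1 for word in terms if word in seen)
        result.append((air_date, matches / len(terms) if terms else 0))
        # Register this question's terms only after it has been scored.
        seen.update(terms)
    return result

# Invented rows, deliberately out of date order.
rows = [
    (date(2005, 1, 1), "this french president resigned"),
    (date(2004, 1, 1), "the first french president"),
]
for air_date, overlap in overlap_by_date(rows):
    print(air_date, round(overlap, 3))
```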

## High- vs Low-Value Questions
Checking whether certain words appear more often in high-value questions than in low-value questions.

In [10]:
def set_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(set_value,axis=1)

In [11]:
def find_value_count(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count,low_count

In [12]:
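# Note: despite its name, this list holds observed (high, low) counts only.
observed_expected = []
# Sets are unordered, so these five terms can differ from run to run.
comparison_terms = list(terms_used)[:5]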

for term in comparison_terms:
    print(term)
    observed_expected.append(find_value_count(term))

rowlands
artemis
libraries
benigni
swissstyle


In [13]:
observed_expected

[(1, 0), (3, 1), (0, 2), (1, 0), (1, 0)]

In [14]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

In [17]:
chi_squared = []

for counts in observed_expected:
    total = np.sum(counts)
    total_prop = total/jeopardy.shape[0]
    print(total_prop)
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    obs = np.array([counts[0],counts[1]])
    exp = np.array([exp_high,exp_low])
    chi_squared.append(chisquare(obs,exp))

5.0002500125e-05
0.0002000100005
0.00010000500025
5.0002500125e-05
5.0002500125e-05


In [16]:
chi_squared

[Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.1980229752219893, pvalue=0.040471136200959497),
 Power_divergenceResult(statistic=0.80392569225376798, pvalue=0.36992223780795708),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047)]
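
Only the second term (`artemis`, 3 high versus 1 low) has a p-value below 0.05. Each term appears at most four times, so the expected counts are far below the value of 5 that a common rule of thumb asks for, and none of these results should be treated as reliable. The sketch below recomputes the statistic in plain Python and flags small expected counts. `chi_squared_1dof` is a hypothetical helper, and the high-value share of 0.2867 is an assumption back-calculated from the statistic printed above for the `(1, 0)` terms; the notebook never prints the true share.

```python
import math

def chi_squared_1dof(high_obs, low_obs, high_share):
    """Chi-squared test of one term's high/low counts against the overall split."""
    total = high_obs + low_obs
    exp_high = total * high_share
    exp_low = total * (1 - high_share)
    statistic = ((high_obs - exp_high) ** 2 / exp_high
                 + (low_obs - exp_low) ** 2 / exp_low)
    # Two categories give one degree of freedom, for which the upper-tail
    # probability has the closed form erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    # Rule of thumb: the approximation is shaky if an expected count is below 5.
    reliable = min(exp_high, exp_low) >= 5
    return statistic, p_value, reliable

# ASSUMPTION: share inferred from the printed results, not taken from the data.
HIGH_SHARE = 0.2867
for counts in [(1, 0), (3, 1), (0, 2)]:
    print(counts, chi_squared_1dof(*counts, HIGH_SHARE))
```

Running the test on terms that occur often enough to give expected counts of 5 or more in both groups would make the p-values more trustworthy.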

## Calculating Word Frequencies


In [33]:
terms_used_dict = {}

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 4]
    for word in split_question:
        if word in terms_used_dict:
            terms_used_dict[word] +=1
        else:
            terms_used_dict[word] = 1
d = list((k, v) for k, v in terms_used_dict.items() if v >= 200)
d.sort(key=lambda tup: tup[1])
d

[('found', 205),
 ('where', 208),
 ('great', 208),
 ('means', 209),
 ('island', 216),
 ('their', 235),
 ('french', 243),
 ('targetblankherea', 244),
 ('famous', 246),
 ('american', 256),
 ('capital', 257),
 ('president', 258),
 ('world', 261),
 ('novel', 265),
 ('before', 267),
 ('wrote', 287),
 ('became', 287),
 ('played', 297),
 ('years', 316),
 ('known', 352),
 ('which', 361),
 ('title', 370),
 ('after', 426),
 ('state', 443),
 ('country', 476),
 ('named', 513),
 ('called', 521),
 ('about', 549),
 ('first', 949),
 ('these', 1389)]
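
The same tally can be written more compactly with `collections.Counter`. This is a minimal sketch, assuming the input is any iterable of cleaned question strings (for example `jeopardy["clean_question"]`). `frequent_terms` is a hypothetical name, and the three sample questions are invented.

```python
from collections import Counter

def frequent_terms(questions, min_len=5, min_count=200):
    """Count words of at least min_len characters and keep the frequent ones."""
    counts = Counter(
        word
        for question in questions
        for word in question.split()
        if len(word) >= min_len
    )
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    # Sort ascending by count, as the notebook's output is ordered.
    return sorted(frequent, key=lambda pair: pair[1])

# Invented sample with a low threshold so the tiny input produces output.
sample_questions = ["first state", "first country", "these first"]
print(frequent_terms(sample_questions, min_count=2))
```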