# A strategy to win Jeopardy (TV show)

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help one win.

We work with a dataset containing the first 20,000 rows of the full dataset of Jeopardy questions, which can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

-------------
Reading & cleaning dataset
------------

In [79]:
import pandas as pd
import numpy as np
import re
import random
from scipy.stats import chisquare

In [43]:
# reading the dataset
jeo = pd.read_csv('jeopardy.csv')

In [44]:
# first 5 rows
jeo.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [45]:
# column names
jeo.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [46]:
# removing spaces in column names
jeo.columns = [i.replace(' ', '') for i in jeo.columns]
jeo.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [47]:
# data format 
jeo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
ShowNumber    19999 non-null int64
AirDate       19999 non-null object
Round         19999 non-null object
Category      19999 non-null object
Value         19999 non-null object
Question      19999 non-null object
Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


-------------
Normalizing text
-----------------

In [48]:
# function to normalize text
def normalize(text):
    # changing to lower case
    text = text.lower()
    # removing punctuations
    text = re.sub(r'[^\w\s]','',text)
    return text
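
As a quick check, the function can be tried on two answers that appear in the sample rows of this notebook. The snippet repeats `normalize` so that it runs on its own; it is a small illustration, not part of the original analysis.

```python
import re

def normalize(text):
    # lower-case, then drop every character that is not a word character or whitespace
    text = text.lower()
    return re.sub(r'[^\w\s]', '', text)

samples = ["Heaven's Gate", "(James) Joyce"]
normalized = [normalize(s) for s in samples]
print(normalized)  # ['heavens gate', 'james joyce']
```

Note that `\w` also matches digits and underscores, so those characters survive the normalization.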
    

In [49]:
# apply normalize function to "Question" & "Answer"
jeo['Question_norm'] = jeo['Question'].apply(normalize)
jeo['Answer_norm'] = jeo['Answer'].apply(normalize)

In [50]:
jeo.sample(5)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Question_norm,Answer_norm
12307,6230,2011-10-21,Jeopardy!,WEATHER GEAR,$200,In winter these are often connected by a strin...,mittens,in winter these are often connected by a strin...,mittens
9879,3130,1998-03-20,Jeopardy!,"""TOMORROW""",$300,"In the novel ""Gone with the Wind"", it follows ...",Tomorrow is another day,in the novel gone with the wind it follows ill...,tomorrow is another day
16382,4344,2003-06-19,Double Jeopardy!,SPINGARN MEDAL WINNERS,$400,"A lawyer when he won in 1946, he went on to be...",Thurgood Marshall,a lawyer when he won in 1946 he went on to be ...,thurgood marshall
19942,5694,2009-05-14,Jeopardy!,ASSERTING AUTHOR-ITY,$600,"This Irish novelist's July 4, 1931 marriage oc...",(James) Joyce,this irish novelists july 4 1931 marriage occu...,james joyce
6423,3010,1997-10-03,Jeopardy!,FLOPS,$400,Roger Ebert called this 1980 Michael Cimino fi...,Heaven's Gate,roger ebert called this 1980 michael cimino fi...,heavens gate


In [51]:
# function to normalize "Value"
def norm_value(value):
    # removing punctuation
    value = re.sub(r'[^\w\s]','',value)
    # converting to integer
    try:
        value = int(value)
    except Exception:
        value = 0
    return value
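
A short sketch of how the conversion behaves. The function is repeated so the snippet is self-contained, and the sample strings are illustrative (only the `$200` format is taken from the rows shown above).

```python
import re

def norm_value(value):
    # strip punctuation such as "$" and ",", then try to parse an integer
    value = re.sub(r'[^\w\s]', '', value)
    try:
        return int(value)
    except ValueError:
        # anything that is not a plain number falls back to 0
        return 0

results = [norm_value(v) for v in ["$200", "$2,000", "None"]]
print(results)  # [200, 2000, 0]
```

Any value that cannot be parsed becomes 0, so such rows end up on the low-value side of any later threshold.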

In [52]:
# apply norm_value to "Value"
jeo['Value_norm'] = jeo['Value'].apply(norm_value)

In [53]:
# converting "AirDate" to datetime
jeo['AirDate'] = pd.to_datetime(jeo['AirDate'])

In [54]:
jeo.sample(5)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Question_norm,Answer_norm,Value_norm
4306,3403,1999-05-26,Double Jeopardy!,TOM JONES,$800,The benevolent Mr. Allworthy & the crude Mr. W...,Squire,the benevolent mr allworthy the crude mr west...,squire,800
2955,3697,2000-10-03,Double Jeopardy!,HISTORY IN MOVIES,$400,"Kathy Bates played the real-life ""Unsinkable"" ...",Titanic,kathy bates played the reallife unsinkable mol...,titanic,400
223,3673,2000-07-19,Double Jeopardy!,ALASKA,$800,One of the 3 mottos that have been featured on...,"""The Last Frontier"", ""The Great Land"", or ""Nor...",one of the 3 mottos that have been featured on...,the last frontier the great land or north to t...,800
7949,3547,2000-01-25,Jeopardy!,LITERARY ANIMALS,$200,"Cottontail & these 2 ""went down the lane to ga...",Flopsy & Mopsy,cottontail these 2 went down the lane to gath...,flopsy mopsy,200
6574,3358,1999-03-24,Double Jeopardy!,EXPLORERS,$500,In 1828 Rene Caille reached this remote Africa...,Timbuktu,in 1828 rene caille reached this remote africa...,timbuktu,500


In [55]:
# data format 
jeo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
ShowNumber       19999 non-null int64
AirDate          19999 non-null datetime64[ns]
Round            19999 non-null object
Category         19999 non-null object
Value            19999 non-null object
Question         19999 non-null object
Answer           19999 non-null object
Question_norm    19999 non-null object
Answer_norm      19999 non-null object
Value_norm       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


------------
Answers in questions
---------

We would like to know how often the answer can be deduced from the question itself. To check this, we measure what fraction of the words in each answer also occur in the question.

In [56]:
# function to calculate the fraction of answer words that appear in the question
def ans_in_quest(row):
    split_answer = row['Answer_norm'].split(' ')
    split_question = row['Question_norm'].split(' ')
    match_count = 0
    # removing "the" from answer (only the first occurrence)
    if 'the' in split_answer:
        split_answer.remove('the')
    # avoiding division by zero
    if len(split_answer) == 0:
        return 0
    # counting how many answer words appear in the question
    for i in split_answer:
        if i in split_question:
            match_count += 1
    match_count = match_count / len(split_answer)
    return match_count

In [57]:
# apply function to dataframe
answer_in_question = jeo.apply(ans_in_quest, axis=1)

In [58]:
# mean of answer in question
answer_in_question.mean()

0.06049325706933587

On average, only about 6% of the words in an answer also appear in the question. So it seems that looking for the answer in the question is not a good approach.
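
To make the metric concrete, here is a minimal sketch of the same calculation on two already-normalized pairs. The helper name `answer_word_fraction` and the first question are made up for illustration; the second pair comes from the sample rows in this notebook.

```python
def answer_word_fraction(question, answer):
    """Fraction of answer words (after dropping the first 'the') found in the question."""
    answer_words = answer.split(' ')
    question_words = question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# hypothetical question whose wording gives the answer away
overlap = answer_word_fraction("this grand canyon state is home to yuma", "the grand canyon")
# question from the dataset that shares no word with its answer
no_overlap = answer_word_fraction("notorious labor leader missing since 75", "jimmy hoffa")
print(overlap, no_overlap)  # 1.0 0.0
```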

-----------
Repeated questions
------------

We would like to investigate how often new questions repeat older ones. As a proxy, we check how often longer words (6 or more characters) in a question have already appeared in earlier questions.

In [59]:
questions_overlap = []
terms_used = set()
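# sorting dataframe by date
# note: sort_values returns a new DataFrame; the result is displayed here but not assigned,
# so jeo itself keeps its original row order in the cells below
jeo.sort_values(by=['AirDate'], axis=0)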

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Question_norm,Answer_norm,Value_norm
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,200
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,200
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),hindu hierarchy or a plays actors,a caste cast,200
19306,10,1984-09-21,Double Jeopardy!,TV TRIVIA,$200,"Last season, this series mourned the loss of S...",Hill Street Blues,last season this series mourned the loss of sg...,hill street blues,200
19307,10,1984-09-21,Double Jeopardy!,1789,$400,Why April 28th was a bad day for Capt. Bligh,the day of the mutiny on the Bounty,why april 28th was a bad day for capt bligh,the day of the mutiny on the bounty,400
19308,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$400,Seaside resort that has a monopoly on East Coa...,"Atlantic City, New Jersey",seaside resort that has a monopoly on east coa...,atlantic city new jersey,400
19309,10,1984-09-21,Double Jeopardy!,LITERATURE,$400,"He wrote ""The 3 Musketeers""; his son wrote ""Ca...",(Alexandre) Dumas,he wrote the 3 musketeers his son wrote camille,alexandre dumas,400


In [60]:
for index, row in jeo.iterrows():
    split_question = row['Question_norm'].split(' ')
    # removing words with fewer than 6 characters
    split_question = [i for i in split_question if len(i) > 5]
    # counting matching words with set of all words
    match_count = 0
    for j in split_question:
        if j in terms_used:
            match_count += 1
        # storing new words to set of all words
        terms_used.add(j)
    
    if len(split_question) > 0:
        match_count /= len(split_question)
    # storing number of overlaps
    questions_overlap.append(match_count)

# assign number of overlaps to the dataframe
jeo['questions_overlap'] = questions_overlap

# mean of number of overlapped terms
jeo['questions_overlap'].mean()
    

0.6925935056088584
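
Because `sort_values` returns a new DataFrame instead of sorting in place, its result has to be assigned for a loop to visit questions in air-date order. The sketch below shows the same overlap logic on plain Python tuples that are sorted by date first. The three questions and the helper name `overlap_by_date` are made up for illustration.

```python
from datetime import date

def overlap_by_date(rows):
    """rows: (air_date, normalized_question) pairs.
    Returns, in date order, the share of long words already seen in earlier questions."""
    terms_used = set()
    overlaps = []
    for _, question in sorted(rows, key=lambda r: r[0]):
        words = [w for w in question.split(' ') if len(w) > 5]
        seen = 0
        for w in words:
            if w in terms_used:
                seen += 1
            terms_used.add(w)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

rows = [
    (date(2000, 3, 14), "this president signed it"),
    (date(1984, 9, 21), "adventurous president"),
    (date(1990, 1, 1), "famous treaty city"),
]
result = overlap_by_date(rows)
print(result)  # [0.0, 0.0, 0.5]
```

Only the latest question scores above zero, because "president" was first used in the 1984 question.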

In [61]:
# example: looking at the last questions
jeo.tail(5)

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,Question_norm,Answer_norm,Value_norm,questions_overlap
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,200,1.0
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,200,1.0
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,200,1.0
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200,1.0
19998,3582,2000-03-14,Jeopardy!,LLAMA-RAMA,$200,Llamas are the heftiest South American members...,Camels,llamas are the heftiest south american members...,camels,200,0.666667


On average, about 69% of the longer words in a question have already appeared in a preceding question. So it seems that studying old questions is a good strategy. But which questions bring the most value?

----------
"Low-value" vs "high-value" questions
-----------


To find the words that are typical of high-value questions, for each word in the set of all words we:

- Find the number of low-value questions the word occurs in.
- Find the number of high-value questions the word occurs in.
- Find the percentage of questions the word occurs in.

We can then find the words with the biggest differences in usage between high- and low-value questions by calculating chi-squared values.

In [62]:
# function to define low vs high value questions (threshold: $800)
def low_high_value(row):
    if row['Value_norm'] > 800:
        value = 1
    else: 
        value = 0
    return value

In [63]:
# apply the function to the dataframe
jeo['high_value'] = jeo.apply(low_high_value, axis=1)

In [64]:
# function to count low and high value questions for each word
def count_high_low(word):
    low_count = 0
    high_count = 0
    for index, row in jeo.iterrows():
        split_question = row['Question_norm'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [118]:
# 10 randomly selected words to test the function
random.seed(1)
# random.sample needs a sequence (sets are rejected since Python 3.11)
terms_rand = random.sample(list(terms_used), 10)
terms_rand

['millionaires',
 'classically',
 'ranitidine',
 'generations',
 'chemotherapy',
 'patrolled',
 'shrinkage',
 'beeswax',
 'spokesgiraffe',
 'shortened']

In [119]:
# testing the function with the random words
obs_example = []
for word in terms_rand:
    obs_example.append(count_high_low(word))
# first number is count for high-value and second number for low-value 
obs_example

[(1, 0),
 (0, 1),
 (1, 0),
 (0, 3),
 (1, 0),
 (1, 0),
 (2, 0),
 (1, 0),
 (0, 1),
 (3, 6)]

-----------
Chi-squared test to find "high-value" words
-------

In [77]:
# number of high value questions
high_value_count = sum(jeo['high_value'] == 1)
high_value_count

5734

In [78]:
# number of low value questions
low_value_count = sum(jeo['high_value'] == 0)
low_value_count

14265

In [96]:
# function to calculate chi-squared for a list of (high_count, low_count) tuples
def chi_sq(obs):
    chi_squared = []
    total_rows = len(jeo)
    for i in obs:
        # sum of high value and low value counts
        total = sum(i)
        # proportion across dataset
        total_prop = total / total_rows
        # expected count for high/low value rows
        exp_high = total_prop * high_value_count
        exp_low = total_prop * low_value_count

        observed = np.array([i[0], i[1]])
        expected = np.array([exp_high, exp_low])
        # calculating chi-squared
        chisq_value, pvalue = chisquare(observed, expected)
        chi_squared.append((chisq_value, pvalue))
    return chi_squared

In [120]:
# first number: chi-squared, second: p-value
chi_sq(obs_example)

[(2.487792117195675, 0.11473257634454047),
 (0.401962846126884, 0.5260772985705469),
 (2.487792117195675, 0.11473257634454047),
 (1.205888538380652, 0.27214791766902047),
 (2.487792117195675, 0.11473257634454047),
 (2.487792117195675, 0.11473257634454047),
 (4.97558423439135, 0.025707519787911092),
 (2.487792117195675, 0.11473257634454047),
 (0.401962846126884, 0.5260772985705469),
 (0.09564350170321084, 0.75712159875701)]

Among the 10 random examples, only one result is statistically significant (p-value < 0.05): `shrinkage`, which appears in just two questions, both of them high-value.
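
For a single word the statistic reduces to a short calculation. The sketch below works it through by hand for the counts (2, 0), using the high/low totals 5734 and 14265 reported above. The variable names are chosen for this illustration.

```python
# totals from the notebook, and one word seen in 2 high-value and 0 low-value questions
high_value_count, low_value_count = 5734, 14265
observed_high, observed_low = 2, 0

total_rows = high_value_count + low_value_count
share = (observed_high + observed_low) / total_rows  # share of rows containing the word
expected_high = share * high_value_count
expected_low = share * low_value_count

# chi-squared: sum over both classes of (observed - expected)^2 / expected
chi_squared = ((observed_high - expected_high) ** 2 / expected_high
               + (observed_low - expected_low) ** 2 / expected_low)
print(round(expected_high, 3), round(expected_low, 3), round(chi_squared, 3))
# 0.573 1.427 4.976
```

Both expected counts are well below 5, the usual rule of thumb for the chi-squared approximation, so a p-value based on counts this small should be read with caution.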

-----------
Testing the algorithm on more terms
-------

In [121]:
# 1000 random words
random.seed(1)
# random.sample needs a sequence (sets are rejected since Python 3.11)
terms1000 = random.sample(list(terms_used), 1000)
obs1000 = []
for word in terms1000:
    obs1000.append(count_high_low(word))
chi_sq1000 = chi_sq(obs1000)

In [122]:
# finding the words and frequencies for the cases with p-value < 0.05
final_list = []
for i,tup in enumerate(chi_sq1000):
    # filter only rows with p-value below 0.05
    if tup[1] < 0.05:
        final_list.append([terms1000[i], obs1000[i]])
        
final_list

[['shrinkage', (2, 0)],
 ['zathura', (2, 0)],
 ['helmut', (2, 0)],
 ['jamaica', (4, 2)],
 ['obstacles', (2, 0)],
 ['seafood', (5, 2)],
 ['persian', (11, 10)],
 ['process', (15, 12)],
 ['pulitzer', (15, 9)],
 ['orator', (7, 3)],
 ['pregnant', (5, 2)],
 ['supper', (3, 1)],
 ['absorb', (2, 0)],
 ['unofficial', (2, 0)],
 ['isotope', (4, 2)],
 ['sparta', (2, 0)],
 ['scenic', (6, 2)],
 ['conversion', (3, 1)],
 ['austen', (4, 1)],
 ['movement', (17, 21)],
 ['fiendish', (2, 0)],
 ['austriaa', (3, 0)],
 ['harlan', (3, 0)],
 ['pioneering', (2, 0)],
 ['madeline', (2, 0)],
 ['target_blankkelly', (25, 16)],
 ['conditioner', (2, 0)],
 ['financed', (2, 0)],
 ['charitable', (3, 1)]]

We only consider the cases that occur more than 5 times in total and occur more often in the high-value category:

In [123]:
for word, counts in final_list:
    if sum(counts) > 5 and counts[0] > counts[1]:
        print(word)

jamaica
seafood
persian
process
pulitzer
orator
pregnant
isotope
scenic
target_blankkelly


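Among the 1000 random words, these words seem to appear significantly more often in high-value questions, so it may be worth studying the related topics for the game. The entry `target_blankkelly` is likely a leftover of HTML link markup in the question text rather than a real word.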

Ideas for improvement:
- considering the categories of the questions as well
- testing the most frequent words instead of a random selection
- testing the whole set of words instead of a random selection (long computation time)
- using the full Jeopardy dataset (available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file)) instead of the first 20,000 rows used in this study
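
One possible way to approach the second and third ideas is to count every word in a single pass over the questions instead of scanning all rows once per word. The sketch below is an assumption about how that could look, not part of the original analysis; the helper name `count_all_words` and the three rows are made up for illustration.

```python
from collections import Counter

def count_all_words(rows):
    """rows: (normalized_question, is_high_value) pairs.
    Returns two Counters: number of high-value and low-value questions per word."""
    high_counts, low_counts = Counter(), Counter()
    for question, is_high in rows:
        # a set, so a word repeated inside one question is counted once
        words = {w for w in question.split(' ') if len(w) > 5}
        (high_counts if is_high else low_counts).update(words)
    return high_counts, low_counts

# hypothetical rows: (normalized question, high_value flag)
rows = [
    ("pulitzer winner for fiction", 1),
    ("pulitzer prize founder", 1),
    ("famous persian ruler", 0),
]
high_counts, low_counts = count_all_words(rows)
totals = high_counts + low_counts
print(totals.most_common(1))                          # [('pulitzer', 2)]
print(high_counts['persian'], low_counts['persian'])  # 0 1
```

The resulting `(high, low)` pairs have the same shape as the tuples returned by `count_high_low`, and `most_common` gives the most frequent words directly.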