# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

Let's say we want to compete on Jeopardy, and we're looking for any edge we can get to win. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named `jeopardy.csv`, and contains `19999` rows from the beginning of a full dataset of Jeopardy questions, which you can download here. Here's the beginning of the file:

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
    
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

+ `Show Number` -- the Jeopardy episode number of the show this question was in.
+ `Air Date` -- the date the episode aired.
+ `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
+ `Category` -- the category of the question.
+ `Value` -- the number of dollars answering the question correctly is worth.
+ `Question` -- the text of the question.
+ `Answer` -- the text of the answer.

In [2]:
print(jeopardy.columns) # Some of the column names have spaces in front.

jeopardy.columns = jeopardy.columns.str.replace(' ','')
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')
Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


## Normalizing Text 

Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to ensure that we lowercase words and remove punctuation so `Don't` and `don't` aren't considered to be different words when we compare them.

In [3]:
#Write a function to normalize questions and answers
import re

def cleaning_answer_question(string):
    
    string = string.lower()
    result = re.sub(r'[^\w\s]', '', string)
    
    return result

# Normalize the Question column
jeopardy['clean_question'] = jeopardy['Question'].apply(cleaning_answer_question)

jeopardy['clean_question'].head(10)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
5    in the title of an aesop fable this insect sha...
6    built in 312 bc to link rome  the south of ita...
7    no 8 30 steals for the birmingham barons 2306 ...
8    in the winter of 197172 a record 1122 inches o...
9    this housewares store was named for the packag...
Name: clean_question, dtype: object

In [4]:
# Normalize the Answer column

jeopardy['clean_answer'] = jeopardy['Answer'].apply(cleaning_answer_question)
jeopardy['clean_answer'].head(10)

0        copernicus
1        jim thorpe
2           arizona
3         mcdonalds
4        john adams
5           the ant
6    the appian way
7    michael jordan
8        washington
9     crate  barrel
Name: clean_answer, dtype: object
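
As a quick check of the normalization idea, here is a minimal standalone sketch. The function name `normalize_text` and the sample strings are illustrative only; the regular expression is the same one used above. Note that removing a symbol surrounded by spaces leaves a double space behind, which is why `crate  barrel` appears with two spaces in the output above.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop every character that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

# Both spellings collapse to the same normalized token
print(normalize_text("Don't") == normalize_text("don't"))

# The ampersand is removed, but the spaces on either side of it remain
print(repr(normalize_text("Crate & Barrel")))
```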

## Normalizing Columns

Now that we've normalized the text columns, there are also some other columns to normalize.

The `Value` column should be numeric so we can manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `AirDate` column should be a datetime, not a string, so we can work with it more easily.

In [5]:
# Write a function to normalize dollar values
def cleaning_value(string):
    
    result = re.sub(r'[^\w\s]', '', string)
    
    if result == 'None':
        result = 0
    integer = int(result)
    
    return integer

# Normalize the Value column
jeopardy['Value'] = jeopardy['Value'].apply(cleaning_value)
jeopardy['Value'].head(10)

0    200
1    200
2    200
3    200
4    200
5    200
6    400
7    400
8    400
9    400
Name: Value, dtype: int64

In [6]:
# Convert the Air Date column to a datetime column
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy['AirDate'].head(10)

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
5   2004-12-31
6   2004-12-31
7   2004-12-31
8   2004-12-31
9   2004-12-31
Name: AirDate, dtype: datetime64[ns]
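
The same two conversions can also be sketched with vectorized pandas operations instead of a row-by-row function. This is only an illustration under stated assumptions: the `sample` frame and its values are made up, and treating non-numeric entries such as `None` as `0` mirrors the choice made in `cleaning_value` above.

```python
import pandas as pd

# Hypothetical sample that imitates the raw columns
sample = pd.DataFrame({
    "Value": ["$200", "$2,000", "None"],
    "AirDate": ["2004-12-31", "1984-09-21", "2010-07-06"],
})

# Strip everything that is not a digit, then convert; entries with no
# digits (such as "None") become NaN and are replaced with 0
digits_only = sample["Value"].str.replace(r"[^\d]", "", regex=True)
sample["Value"] = pd.to_numeric(digits_only, errors="coerce").fillna(0).astype(int)

# Parse the date strings into datetime64 values
sample["AirDate"] = pd.to_datetime(sample["AirDate"])

print(sample["Value"].tolist())
print(sample["AirDate"].dt.year.tolist())
```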

## Answers in questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

+ How often the answer is deducible from the question.
+ How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [7]:
def matching_words(row):
    
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for word in split_answer:
        
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

# Count how many times terms in clean_answer occur in clean_question
answer_in_question = jeopardy.apply(matching_words, axis=1)
print(answer_in_question.head(10))

0    0.000000
1    0.000000
2    0.000000
3    0.000000
4    0.000000
5    0.000000
6    0.000000
7    0.000000
8    0.000000
9    0.333333
dtype: float64


In [8]:
# Find the mean of the answer_in_question column
print(answer_in_question.mean())

0.060493257069335914


On average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question alone.
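
To make the ratio concrete, here is a minimal sketch that works on two plain strings. The name `answer_overlap` and the sample question are hypothetical. It differs from `matching_words` in two labeled ways: it drops every `the` rather than only the first, and it uses `split()` with no argument, which avoids the empty-string token that `split(' ')` produces when two spaces sit side by side.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# split(' ') keeps an empty token between two adjacent spaces; split() does not
print("crate  barrel".split(" "))
print("crate  barrel".split())

# Hypothetical question text: 'canyon' matches, 'grand' does not
print(answer_overlap("the grand canyon", "the colorado river dug this canyon"))
```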

## Recycled Questions

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

To do this, we can:

+ Sort `jeopardy` in order of ascending air date.
+ Maintain a set called `terms_used` that will be empty initially.
+ Iterate through each row of `jeopardy`.
+ Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
  + If it does, increment a counter.
  + Add each word to `terms_used`.

This lets us check whether the terms in questions have been used previously. Only looking at words with six or more characters filters out words like `the` and `than`, which are commonly used but don't tell us a lot about a question.

In [9]:
jeopardy.sort_values(['AirDate'], inplace=True)

question_overlap = []
terms_used = set()

for row in jeopardy.iterrows():
    
    row = row[1]
    split_question = row['clean_question'].split(' ')
    
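    # Caution: removing items from a list while looping over it skips the
    # element after each removal, so some words shorter than 6 characters
    # survive this filter.
    for word in split_question:
        
        if len(word) < 6:
            split_question.remove(word)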
            
    match_count = 0
    
    for word in split_question:      
        
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)

# Assign question_overlap to the question_overlap column of jeopardy
jeopardy['question_overlap'] = question_overlap
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,0,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,0.2
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,0.142857


In [10]:
# Find the mean of the question_overlap column
print(jeopardy['question_overlap'].mean())

0.8019868294831005


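The mean overlap is about 0.80: on average, roughly 80% of the terms kept from a question had already appeared in an earlier question. This figure is likely inflated, because the filtering loop removes items from `split_question` while iterating over it, which lets some words shorter than six characters through.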
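
The steps listed earlier in this section can be sketched on a toy input. This is a minimal sketch, not the notebook's code: `term_overlap` and the two sample questions are hypothetical, and the length filter is written as a list comprehension, which builds a new list instead of removing items from one that is being looped over.

```python
def term_overlap(questions, min_length=6):
    """Return the per-question overlap ratios and the set of terms seen."""
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only words with at least min_length characters
        words = [w for w in question.split() if len(w) >= min_length]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        overlaps.append(match_count / len(words) if words else 0)
    return overlaps, terms_used

# Hypothetical questions, already normalized and in air-date order
overlaps, terms_used = term_overlap([
    "notorious labor leader missing since 75",
    "famous labor leader of the teamsters",
])
print(overlaps)
print(sorted(terms_used))
```

In the second question only `leader` has been seen before, so one of its three long words overlaps.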

## Low Value vs High Value Questions

Let's say we want to focus our studying on high value questions rather than low value ones. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

+ Low value -- Any row where `Value` is `800` or less.
+ High value -- Any row where `Value` is greater than `800`.

We'll then be able to loop through each of the terms from `terms_used`, and:

+ Find the number of low value questions the word occurs in.
+ Find the number of high value questions the word occurs in.
+ Find the percentage of questions the word occurs in.
+ Based on the percentage of questions the word occurs in, find expected counts.
+ Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [11]:
jeopardy.head()

Unnamed: 0,ShowNumber,AirDate,Round,Category,Value,Question,Answer,clean_question,clean_answer,question_overlap
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,0,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug this ...,the grand canyon,0.2
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones a sa...,tom,0.142857


In [12]:
def high_low_value(row):
    
    # High value means strictly more than $800; everything else is low value
    if row['Value'] > 800:
        value = 1
    else:
        value = 0
    
    return value
   
# Determine which questions are high and low value
jeopardy['high_value'] = jeopardy.apply(high_low_value, axis=1)

jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64
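
The same split can be sketched as a single vectorized comparison. The `sample` frame below is made up for illustration; the `800` threshold is the one used above, so a question worth exactly `800` lands in the low value group.

```python
import pandas as pd

# Hypothetical values, including the boundary case of exactly 800
sample = pd.DataFrame({"Value": [200, 800, 1000, 0, 2000]})

# True/False from the comparison becomes 1/0 after the cast
sample["high_value"] = (sample["Value"] > 800).astype(int)

print(sample["high_value"].tolist())
```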

In [13]:
def high_low_count(word):
    
    low_count = 0
    high_count = 0
    
    for row in jeopardy.iterrows():
        
        row = row[1]
        split_question = row['clean_question'].split(' ')
        
        if word in split_question:
            
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count

In [14]:
# Randomly pick ten elements of terms_used
# (random.sample needs a sequence, so convert the set to a list first)
import random
comparison_terms = random.sample(list(terms_used), 10)

observed_expected = []

for term in comparison_terms:
    result = high_low_count(term)
    observed_expected.append(result)
    
print(observed_expected)

[(1, 0), (0, 1), (0, 1), (0, 1), (1, 0), (9, 35), (0, 1), (1, 0), (1, 0), (0, 2)]
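
Because `high_low_count` walks the whole dataset once per term, checking every term is slow. One possible alternative, sketched here under assumptions, is to walk the questions once and tally every term along the way. The name `count_terms_by_value` and the sample rows are hypothetical, and `split()` is used in place of `split(' ')`, so empty-string tokens are never counted.

```python
from collections import defaultdict

def count_terms_by_value(rows):
    """rows: iterable of (clean_question, high_value) pairs.

    Returns a dict mapping each term to (high_count, low_count), where each
    count is the number of questions that contain the term.
    """
    counts = defaultdict(lambda: [0, 0])
    for question, high_value in rows:
        # A set ensures a term is counted once per question
        for word in set(question.split()):
            counts[word][0 if high_value == 1 else 1] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Hypothetical rows: (clean_question, high_value)
term_counts = count_terms_by_value([
    ("notorious labor leader", 0),
    ("labor unions strike", 1),
    ("labor day parade", 0),
])
print(term_counts["labor"])
print(term_counts["unions"])
```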


## Applying the Chi-Squared Test

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [16]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy['high_value'].value_counts()[1]
low_value_count = jeopardy['high_value'].value_counts()[0]

chi_squared = []

for obs in observed_expected:
    
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared    

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=1.4526283635973305, pvalue=0.22810667716779373),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)]
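
To see where these numbers come from, the calculation for the term observed `(9, 35)` times can be redone by hand using only the standard library. This is a sketch for illustration: the row counts are copied from the `value_counts()` output above, and the p-value uses the fact that, for one degree of freedom, the chi-squared survival function equals `erfc(sqrt(x / 2))`. It should reproduce the statistic of about 1.45 and the p-value of about 0.23 reported for that term.

```python
import math

# Counts taken from the value_counts() output above
high_value_count = 5734
low_value_count = 14265
total_rows = high_value_count + low_value_count

# Observed (high, low) counts for one sampled term
observed_high, observed_low = 9, 35

# Expected counts if the term were spread evenly across all questions
total = observed_high + observed_low
expected_high = total / total_rows * high_value_count
expected_low = total / total_rows * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq = ((observed_high - expected_high) ** 2 / expected_high
          + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give one degree of freedom
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(round(expected_high, 2), round(expected_low, 2))
print(round(chi_sq, 4), round(p_value, 4))
```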

## Conclusion

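None of the sampled terms had a significant difference in usage between high value and low value rows: every p-value is well above 0.05. Most of the terms also appear in only one or two questions, so their expected counts are far below the level (commonly cited as at least 5 per category) at which the chi-squared test is reliable. It would be better to run this test with only terms that have higher frequencies.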
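
One way to pick such terms is to count how many questions each term appears in and keep only those above a cutoff. The sketch below is an assumption-laden illustration: `frequent_terms`, the cutoff `min_questions`, and the sample questions are all hypothetical, and the right cutoff for the real data would need to be chosen so that expected counts are large enough.

```python
from collections import Counter

def frequent_terms(clean_questions, min_questions=20, min_length=6):
    """Return the terms that appear in at least min_questions questions."""
    counts = Counter()
    for question in clean_questions:
        # A set ensures each term is counted once per question
        counts.update({w for w in question.split() if len(w) >= min_length})
    return sorted(term for term, n in counts.items() if n >= min_questions)

# Hypothetical normalized questions with a small cutoff for demonstration
sample_questions = [
    "adventurous president rode in an airplane",
    "this president signed the proclamation",
    "notorious labor leader missing",
]
print(frequent_terms(sample_questions, min_questions=2))
```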