In [1]:
import pandas as pd
import csv

jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for a few decades and is a major force in popular culture. The data can be downloaded from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

In [3]:
columns_name = []

for i in jeopardy.columns:
    stripped = i.strip()
    columns_name.append(stripped)
columns_name

['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [4]:
jeopardy.columns = columns_name
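
The same cleanup can be done in one step with pandas' vectorized string accessor. This is a minimal sketch on a small stand-in DataFrame (`df` is hypothetical) rather than the real file:

```python
import pandas as pd

# Stand-in frame whose headers carry the same leading spaces as jeopardy.csv.
df = pd.DataFrame(columns=["Show Number", " Air Date", " Round", " Value"])

# Index.str.strip() removes leading and trailing whitespace from every label.
df.columns = df.columns.str.strip()

print(list(df.columns))
```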

Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (the `Question` and `Answer` columns). The idea is to lowercase words and remove punctuation so that `Don't` and `don't` aren't considered different words when we compare them.

In [5]:
# Function to normalize questions and answers.
import re

def normalize_text(text):
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)  # remove punctuation
    return text

In [6]:
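# Function to normalize dollar values.
import re

def normalize_values(text):
    # Strip "$" and "," so "$1,000" becomes "1000".
    # Note: re.sub("$", "", text) would not work, because an unescaped "$" is the end-of-string anchor.
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0  # rows without a numeric value get 0
    return text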
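
To see what the two normalizers do, here is a small self-contained check. The functions are restated in compact form so the snippet runs on its own, and the sample inputs are illustrative:

```python
import re

def normalize_text(text):
    # Lowercase, then drop everything except letters, digits and whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", "", text.lower())

def normalize_values(text):
    # Drop "$" and "," and convert to int; anything non-numeric becomes 0.
    try:
        return int(re.sub(r"[^A-Za-z0-9\s]", "", text))
    except ValueError:
        return 0

# Case and punctuation no longer distinguish the two spellings.
assert normalize_text("Don't") == normalize_text("don't") == "dont"

print(normalize_text("McDonald's"))   # mcdonalds
print(normalize_values("$1,000"))     # 1000
print(normalize_values("None"))       # 0
```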

In [7]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [8]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [9]:
jeopardy["Air Date"].dtype

dtype('O')

In [10]:
# convert the Air Date column to a datetime column

jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy["Air Date"].dtype

dtype('<M8[ns]')

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

1. How often the answer is deducible from the question.
2. How often new questions are repeats of older questions.


* We can answer the first question by checking how many words in the answer also occur in the question.
* We can answer the second question by checking how often complex words (6 or more characters) recur.

In [11]:
# Working on the first question

def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    
    # "the" is common in both answers and questions but says nothing about the answer.
    if "the" in split_answer:
        split_answer.remove("the")

    if len(split_answer) == 0:  # prevents a division by zero
        return 0
    else:
        match_count = 0
        for item in split_answer:
            if item in split_question:
                match_count += 1
        return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)
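
As a quick illustration of the metric, the sketch below applies the same logic to two made-up rows. Plain dicts stand in for DataFrame rows, and both rows are hypothetical:

```python
def count_matches(row):
    split_answer = row["clean_answer"].split(" ")
    split_question = row["clean_question"].split(" ")
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for word in split_answer if word in split_question)
    return matches / len(split_answer)

# Both answer words occur in the question, so the score is 2 / 2.
full_overlap = {"clean_answer": "john adams",
                "clean_question": "john quincy adams was the son of this president"}

# "the" is dropped from the answer; "netherlands" is absent from the question.
no_overlap = {"clean_answer": "the netherlands",
              "clean_question": "tulips are a famous export of this country"}

print(count_matches(full_overlap))  # 1.0
print(count_matches(no_overlap))    # 0.0
```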

In [12]:
jeopardy["answer_in_question"].value_counts()

0.000000    17375
0.500000     1452
0.333333      551
0.250000      170
1.000000      123
0.666667      103
0.200000       82
0.166667       28
0.400000       28
0.142857       20
0.750000       18
0.285714       10
0.600000        9
0.125000        9
0.428571        3
0.181818        2
0.800000        2
0.571429        2
0.300000        2
0.111111        2
0.307692        1
0.444444        1
0.222222        1
0.375000        1
0.100000        1
0.153846        1
0.875000        1
0.272727        1
Name: answer_in_question, dtype: int64

In [13]:
jeopardy["answer_in_question"].mean()

0.060493257069335914

# Answer terms in the question
On average, only about 6% of the words in an answer also appear in its question, and roughly 87% of rows (17,375 of 19,999) have no overlap at all. This means we probably can't count on the question itself to give away the answer. We'll probably have to study.

Let's say we want to investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

In [14]:
# Sort jeopardy in order of ascending air date.

jeopardy = jeopardy.sort_values(by = "Air Date")
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first n...,thanksgiving,200,0.0


In [15]:
question_overlap = []

terms_used = set()

for _, row in jeopardy.iterrows():
    # Keep only words of 6+ characters; this filters out common words such as
    # "the" and "than" that say little about a question.
    split_question = [w for w in row["clean_question"].split() if len(w) > 5]

    # Count words already seen in earlier questions, then record this question's words.
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    terms_used.update(split_question)

    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.6876260592169776

# Question overlap
On average, about 69% of the longer terms in a question have already appeared in an earlier question. This looks only at a small sample of questions and at single terms rather than phrases, so it is weak evidence on its own, but it does suggest that the recycling of questions is worth a closer look.
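
The overlap measure is easier to follow on a toy input. The sketch below restates the loop as a function; `term_overlap`, its `min_len` parameter, and the three sample questions are hypothetical names and data introduced only for illustration:

```python
def term_overlap(questions, min_len=6):
    """Return, for each question in order, the share of its long words already seen."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

# Hypothetical, already-normalized questions in air-date order.
questions = [
    "notorious labor leader",       # long words: notorious, leader -> nothing seen yet
    "famous labor leader missing",  # long words: famous, leader, missing -> 1 of 3 seen
    "the cat sat",                  # no long words -> overlap recorded as 0
]
overlaps = term_overlap(questions)
print(overlaps)
```

Because the set of seen terms only grows, the order of the rows matters, which is why the DataFrame is sorted by air date first.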

Let's say we only want to study terms that show up in high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

* `Low value` -- Any row where Value is 800 or less.
* `High value` -- Any row where Value is greater than 800.

In [16]:
def determine_value(row):
    
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)
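
The same flag can be computed without a row-wise `apply`. A minimal sketch on a hypothetical `clean_value` column (the values are made up):

```python
import pandas as pd

# Hypothetical clean_value column; 0 stands for a row with no dollar value.
df = pd.DataFrame({"clean_value": [200, 800, 1000, 0, 2000]})

# The comparison yields booleans; astype(int) turns them into 0/1 flags.
df["high_value"] = (df["clean_value"] > 800).astype(int)

print(df["high_value"].tolist())  # [0, 0, 1, 0, 1]
```

Note that a value of exactly 800 lands in the low-value group, as it does in `determine_value`.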

In [17]:
def count_usage(term):
    
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [18]:
comparison_terms = list(terms_used)[:5] 

observed_counts = []
for i in comparison_terms:
    observed_counts.append(count_usage(i))
observed_counts   # observed counts for a few terms
    

[(2, 6), (0, 2), (1, 1), (1, 0), (0, 1)]

In [19]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []

for obs in observed_counts:
    
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]   # proportion across the dataset
    
    high_value_exp = total_prop*high_value_count   # expected term count for high value rows
    low_value_exp = total_prop*low_value_count     # expected term count for low value rows
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    
    chisq = chisquare(observed, expected)
    chi_squared.append(chisq)

chi_squared

[Power_divergenceResult(statistic=0.05272886616881538, pvalue=0.818381104912348),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
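
For reference, with its default settings `scipy.stats.chisquare` computes Pearson's chi-squared statistic. Written out for the two cells used in the loop, with symbols introduced here for illustration ($N$ is the number of rows, $N_{\text{high}}$ and $N_{\text{low}}$ are the high- and low-value row counts, and $O_{\text{high}}$, $O_{\text{low}}$ are the observed counts for one term):

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{high}}, \qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N}\, N_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}} + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

The p-value is then taken from a chi-squared distribution with one degree of freedom (two cells minus one).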

# Chi-squared results
None of the five sampled terms showed a significant difference in usage between high-value and low-value rows (the smallest p-value is about 0.11). In addition, the counts are very small: no term appears in more than 8 questions, so the chi-squared test is not very reliable here. It would be better to run this test only on terms that have higher frequencies.
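
One way to act on this is to count every term's high- and low-value usage in a single pass and test only terms that occur often enough. The sketch below is an illustration under stated assumptions, not the notebook's method: `frequent_term_tests`, `min_count`, and the toy rows are hypothetical, and the statistic is computed by hand with the standard library (for one degree of freedom the p-value equals `erfc(sqrt(chi2 / 2))`), which should agree with `scipy.stats.chisquare(observed, expected)`.

```python
from collections import Counter
from math import erfc, sqrt

def frequent_term_tests(rows, min_count=5):
    """rows: iterable of (clean_question, high_value) pairs. Returns {term: (chi2, p)}."""
    rows = list(rows)
    high_counts, low_counts = Counter(), Counter()
    n_high = sum(1 for _, high in rows if high == 1)
    n_low = len(rows) - n_high
    for question, high in rows:
        # set() counts a term once per row, matching count_usage.
        for term in set(question.split(" ")):
            (high_counts if high == 1 else low_counts)[term] += 1

    results = {}
    for term in set(high_counts) | set(low_counts):
        obs_high, obs_low = high_counts[term], low_counts[term]
        total = obs_high + obs_low
        if total < min_count:
            continue  # too rare for the test to be trustworthy
        exp_high = total / len(rows) * n_high
        exp_low = total / len(rows) * n_low
        chi2 = (obs_high - exp_high) ** 2 / exp_high + (obs_low - exp_low) ** 2 / exp_low
        results[term] = (chi2, erfc(sqrt(chi2 / 2)))
    return results

# Hypothetical rows: 4 high-value and 6 low-value questions.
rows = ([("president of france", 1)] * 4
        + [("president of mexico", 0)] * 4
        + [("capital of france", 0)] * 2)
results = frequent_term_tests(rows, min_count=5)
print(sorted(results))  # terms seen in fewer than 5 rows are skipped
```

The function assumes the data contains both high- and low-value rows; otherwise an expected count would be zero and the division would fail.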