# Winning Jeopardy Analysis

In this project, we examine a dataset of questions that have been asked on Jeopardy. Jeopardy is a popular TV show in which participants win money by answering trivia questions. Our goal is to analyze the data and figure out whether we can *gain an edge in order to win*.

## Jeopardy Questions

We'll start with a brief look at our dataset.

In [1]:
import pandas as pd

# storing data as pandas df
jeopardy = pd.read_csv('jeopardy.csv')

# examining first five rows
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# reading columns
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


We can see that most of our column names begin with a space. We can remove it and, while we're at it, convert the column names to *snake case*.

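*One more thing to note: the `dtype='object'` shown above describes the column labels themselves. Columns such as `Value` and `Air Date` still hold plain strings, so we will need to convert them later.*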

In [3]:
# fixing column name format
jeopardy.columns = (jeopardy.columns
                    .str.strip() # remove leading and trailing whitespace
                    .str.replace(' ','_') # replace remaining whitespaces with '_'
                    .str.lower() # make lowercase
                   )

# verifying
jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

## Normalizing Text

Before we can start to perform any kind of analysis on our text, we must normalize all of the text columns. We do so by removing all punctuation and changing all text to lowercase. Normalizing our text helps remove the "distinctness" between words such as "Don't" and "don't".

First, we'll write a function that takes in a string, removes punctuation, and makes all the text lowercase. Then we'll apply it to our text columns.

In [4]:
import re

# text normalizer function
def normalize_text(text):
    text = text.lower()
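    text = re.sub(r"[^A-Za-z0-9\s]", "", text) # drop everything except letters, digits, and whitespace
    text = re.sub(r"\s+", " ", text) # collapse runs of whitespace into a single space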
    return text

# creating clean_question column
jeopardy['clean_question'] = jeopardy['question'].apply(normalize_text)

# creating clean_answer columns
jeopardy['clean_answer'] = jeopardy['answer'].apply(normalize_text)

In [5]:
# verifying
jeopardy.head(10) 

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ital...,the appian way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel
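
As a quick check of what the normalizer does to individual strings, the sketch below applies the same two substitutions to a few short inputs that echo values visible in the table above. It is only an illustration: the `samples` list is chosen by hand, and the function is repeated here so the snippet runs on its own.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Hand-picked sample strings (for illustration only)
samples = ["Don't", "Crate & Barrel", "1971-72"]
cleaned = [normalize_text(s) for s in samples]
print(cleaned)
```

Note that removing the hyphen merges `1971-72` into the single token `197172`, which is the same effect we see in row 8 of the table.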


## Normalizing Columns

Now that we're done normalizing our text columns, we need to do the same for our *value* and *air_date* columns.

We'll start with the *value* column. We first need to remove the dollar sign and convert each entry into a numerical value.
Then we can move on to the *air_date* column and convert it to *datetime* format.

In [6]:
# dollar normalizer function
def normalize_dollar_value(dollar):
    dollar = re.sub(r'\W', '', dollar) # removing punctuation such as '$' and ','
    if dollar == 'None': # entries without a dollar amount count as 0
        dollar = 0
    else:
        dollar = int(dollar)
    return dollar

# creating clean_value column
jeopardy['clean_value'] = jeopardy['value'].apply(normalize_dollar_value)

# datetime formatting (converting the whole column at once)
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])
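
The same two conversions can also be written without a row-by-row function. The sketch below is one possible vectorized variant, not the approach used in this notebook; the small `raw` DataFrame and its values are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Hypothetical sample values; 'None' mimics an entry with no dollar amount
raw = pd.DataFrame({
    "value": ["$200", "$1,000", "None"],
    "air_date": ["2004-12-31", "1984-09-10", "2010-07-06"],
})

# Strip every non-digit character, then coerce what is left to a number.
# Entries with no digits at all (such as 'None') become NaN and are set to 0.
digits = raw["value"].str.replace(r"\D", "", regex=True)
raw["clean_value"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

# Parse the date strings into datetime64 values
raw["air_date"] = pd.to_datetime(raw["air_date"])

print(raw.dtypes)
```

Converting the dates matters later on, because sorting a datetime column orders the rows chronologically.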

## Answers in Questions

Now that we have normalized all our columns, we can focus on the goal of our project. We need to figure out whether to:

- study past questions
- study general knowledge
- or not study at all

In order to make a decision, we must first answer two questions:

1. How often is the answer deducible from the question?
2. How often are new questions repeats of older questions?

Let's answer one question at a time. For question 1, we can check how many of the words in an answer occur in the corresponding question.

In [7]:
# fraction of the words in an answer (ignoring 'the') that also appear in its question
def occurrence(row):
    
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(occurrence, axis=1)

print(jeopardy['answer_in_question'].mean())

0.05900196524977763


Based on our analysis, on average only about 6% of the words in each answer appear in the corresponding question. Thus, answers are rarely deducible from the questions themselves. Let's move on to the next question.

## Recycled Questions

Can we check whether some questions are repeats of older ones? To do this, we first sort the data by air date, from earliest to latest. Then, for each row, we check whether any of its words already appear in a *Set* that we build up by adding words as we go down the rows. To filter out simple words like "the" or "than" (which are not meaningful in this case), we only consider words that are 6 letters or longer.

In [8]:
# initialize list of per-question overlap proportions
question_overlap = []

# initialize set of terms seen in earlier questions
terms_used = set()

# sorting dataframe rows in ascending order
jeopardy = jeopardy.sort_values('air_date')

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split() # create list of words
    split_question = [word for word in split_question if len(word) > 5] # filtering words
    
    match_count = 0
    
    for word in split_question: # counts number of repeated words
        if word in terms_used:
            match_count += 1
    for word in split_question: # adds new words, if any
        terms_used.add(word)
    
    if len(split_question) > 0: # percentage of words in given question that have been repeated
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

jeopardy['question_overlap'].mean() # average proportion of previously seen words per question

0.6876260592169802
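
The running-set idea is easier to follow on a tiny input. The sketch below applies the same logic to three made-up questions that are assumed to be already normalized and in air-date order; `overlap_fractions` and the `toy` list are hypothetical names used only here.

```python
def overlap_fractions(questions, min_len=6):
    """For each question, the share of its long words seen in earlier questions."""
    terms_seen = set()
    fractions = []
    for text in questions:
        words = [w for w in text.split() if len(w) >= min_len]
        repeated = sum(1 for w in words if w in terms_seen)
        terms_seen.update(words)  # remember this question's words for later rows
        fractions.append(repeated / len(words) if words else 0)
    return fractions

# Made-up, already-normalized questions in chronological order
toy = [
    "napoleon invaded russia",
    "russia borders finland",
    "napoleon crowned himself emperor",
]
fractions = overlap_fractions(toy)
print(fractions)
```

The first question always scores 0 because the set starts out empty, so the earliest rows pull the overall mean down slightly.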

On average, close to 70% of the longer words in each question had already appeared in an earlier question. This looks promising, but we have to remember two things. First, we are matching single words, not phrases or whole questions, so a high overlap does not mean the questions themselves are recycled. Second, our data is only about 10% of the entire Jeopardy dataset, and the overlap in the full dataset could be higher or lower, so we can't make a final decision from this sample. The good news is that this result gives us a motive to go further in our analysis.

## Low Value vs. High Value Questions

Now, what if we wanted to study only high-value questions (in this case, questions worth more than $800)? We can figure out which words are associated with high-value questions by performing a chi-squared test. We split our questions into two categories, low-value and high-value. For each word, we then count the observed occurrences in each category, compute the expected counts, and determine the chi-squared value.

In [25]:
from random import choice

# creating a new column with classification
def classify_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(classify_value, axis=1)

def word_value(word):
    """Count how many high-value and low-value questions contain `word`."""
    # initialize counters
    low_count = 0
    high_count = 0
    
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# testing 10 words
terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

# collect the (high, low) value counts for the 10 sampled words
observed_h_and_l = []
for term in comparison_terms:
    observed_h_and_l.append(word_value(term))

In [26]:
# verifying
observed_h_and_l

[(0, 1),
 (1, 0),
 (0, 1),
 (0, 1),
 (1, 1),
 (1, 1),
 (1, 0),
 (1, 1),
 (1, 0),
 (0, 1)]
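
Because `word_value` rescans every row for each term, checking many terms gets slow quickly. One possible way around this, sketched below under the assumption that the questions are already normalized strings, is to count all words in a single pass with `collections.Counter`. The function `count_by_value` and the sample data are hypothetical and not part of the notebook.

```python
from collections import Counter

def count_by_value(questions, high_flags):
    """Count, per word, how many high-value and low-value questions contain it."""
    high_counts, low_counts = Counter(), Counter()
    for text, is_high in zip(questions, high_flags):
        words = set(text.split())  # each question counts at most once per word
        if is_high:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Made-up questions with a 1/0 high-value flag for each
questions = ["the nile river", "river of doubt", "nile delta"]
high_flags = [1, 0, 0]
high_counts, low_counts = count_by_value(questions, high_flags)
print(high_counts["river"], low_counts["river"])
```

Looking up a word in the two counters then gives the same kind of `(high, low)` pair that `word_value` returns, without another pass over the rows.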

## Applying the Chi-Square Test

We decided to use a small sample since using more would take a long time. From the previous section, we now have our *observed* counts. Our next step is to compute our *expected* counts and the *Chi-Squared* value.

In [27]:
import numpy as np
from scipy.stats import chisquare

# number of high value counts
high_value_counts = len(jeopardy[jeopardy['high_value'] == 1])

# number of low value counts
low_value_counts = len(jeopardy[jeopardy['high_value'] == 0])

chi_squared = []

for counts in observed_h_and_l:
    total = sum(counts)
    total_proportion = total / jeopardy.shape[0]

    expected_high = total_proportion * high_value_counts
    expected_low = total_proportion * low_value_counts
    observed = np.array([counts[0], counts[1]])
    expected = np.array([expected_high, expected_low])
    chisquare_value, p_value = chisquare(observed, expected)
    chi_squared.append([chisquare_value, p_value])

In [28]:
chi_squared

[[0.401962846126884, 0.5260772985705469],
 [2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469],
 [0.401962846126884, 0.5260772985705469],
 [0.4448774816612795, 0.5047776487545996],
 [0.4448774816612795, 0.5047776487545996],
 [2.487792117195675, 0.11473257634454047],
 [0.4448774816612795, 0.5047776487545996],
 [2.487792117195675, 0.11473257634454047],
 [0.401962846126884, 0.5260772985705469]]
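
To make the arithmetic behind each pair concrete, here is a minimal sketch that works through one word by hand. The totals of 30 high-value and 70 low-value questions are hypothetical, as is the function name. The p-value uses the closed form for one degree of freedom, `erfc(sqrt(chi2 / 2))`, in place of `scipy.stats.chisquare`, and the word is assumed to occur at least once.

```python
import math

def chi_squared_one_word(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value (1 degree of freedom) for one word."""
    total = observed_high + observed_low
    share = total / (n_high + n_low)           # share of all questions containing the word
    expected = (share * n_high, share * n_low)  # counts expected if value made no difference
    observed = (observed_high, observed_low)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical totals: a word seen once, in a high-value question
stat, p_value = chi_squared_one_word(1, 0, n_high=30, n_low=70)
print(round(stat, 4), round(p_value, 4))
```

With a single occurrence, the expected counts are 0.3 and 0.7, so the statistic depends only on the split between high-value and low-value questions. This is why words with the same observed pair share identical values in the output above.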

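Of the 10 words we sampled, three have a noticeably larger chi-squared value (about 2.49, with a p-value of about 0.11), but none are significant at the conventional 0.05 level, so we cannot conclude that any of these words is associated with high-value or low-value questions. Moreover, none of the 10 words occurred more than twice in total, well short of the frequency of at least 5 usually wanted for this test, so our chi-squared results are not as reliable as we had hoped.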