# Analyzing Jeopardy Data

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win. The dataset can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/?st=jmwnphdw&sh=16abc86f).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number -- the Jeopardy episode number of the show this question was in.
- Air Date -- the date the episode aired.
- Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- Category -- the category of the question.
- Value -- the number of dollars answering the question correctly is worth.
- Question -- the text of the question.
- Answer -- the text of the answer.



In [22]:
import pandas as pd
import re
from matplotlib import pyplot as plt
from scipy.stats import chisquare
import numpy as np

# Reading dataset into a Dataframe

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [23]:
# Stripping leading and trailing whitespace from column names

jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [24]:
jeopardy.dtypes

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object

In [25]:
# Normalizing Question and Answer column data. Converting words to lower case and removing punctuation.

def normalize(data):
    data = data.lower()

    # Remove punctuation
    data = re.sub(r'[^\w\s]', '', data)
    return data


jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

In [26]:
# Normalizing Air Date and Value column data.

def normalize_dollar(data):
    # Remove punctuation such as "$" and ","
    data = re.sub(r'[^\w\s]', '', data)
    try:
        return int(data)
    except ValueError:
        # Non-numeric values are treated as 0
        return 0



jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_dollar)
 
# Converting Air Date to datetime

jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
    
jeopardy.head()


Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
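
As a quick check of the two normalizing steps, the sketch below re-declares compact versions of the functions and applies them to a few sample strings. `McDonald's` comes from the table above. The inputs `"$2,000"` and `"None"` are made up for illustration and are not taken from the output shown here.

```python
import re

def normalize(text):
    # Lower-case the text and drop everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', text.lower())

def normalize_dollar(text):
    # Drop "$" and "," and fall back to 0 when no integer is left
    digits = re.sub(r'[^\w\s]', '', text)
    try:
        return int(digits)
    except ValueError:
        return 0

print(normalize("McDonald's"))     # mcdonalds
print(normalize_dollar("$2,000"))  # 2000 (hypothetical input)
print(normalize_dollar("None"))    # 0 (hypothetical non-numeric input)
```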


### Answering Questions

With the dataset in hand, a contestant can decide whether to study past questions, study general knowledge, or not study at all. Existing questions in the dataset can give us some meaningful insights about the contest.

We will analyze the data to find out how often the answer is deducible from the question. We can do this by measuring what fraction of the words in each answer also occur in its question.
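
To make this measure concrete, here is a minimal sketch on two question and answer pairs. The helper name `answer_overlap`, the shortened stop-word list, and the first pair are assumptions made for illustration. The second pair is based on a row shown in the table above.

```python
def answer_overlap(question, answer, stop_words):
    """Fraction of non-stop-word answer terms that also appear in the question."""
    question_words = [w for w in question.split(" ") if w not in stop_words]
    answer_words = [w for w in answer.split(" ") if w not in stop_words]
    if not answer_words:
        return 0  # avoid dividing by zero when the answer has no usable words
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

stop_words = ['the', 'in', 'a', 'an', 'is', 'on']

# Hypothetical pair in which the answer is given away by the question
full = answer_overlap("new york city is in this state", "new york", stop_words)

# Pair based on the dataset: no answer word appears in the question
none = answer_overlap("the city of yuma in this state has a record", "arizona", stop_words)

print(full, none)
```

A score of 1.0 means every word of the answer appears in the question, and 0.0 means none does.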

In [27]:
# Analyzing how often the answer is deducible from the question.

# Function to compute the fraction of answer words that also appear in the question
def word_count(row):
    stop_words = ['the','in','a','an','is','on','then','than','why','what','when','where']

    # Split question and answer into words and drop stop words.
    # A list comprehension is used because removing items from a list
    # while iterating over it skips the element after each removal.
    split_question = [word for word in row['clean_question'].split(" ") if word not in stop_words]
    split_answer = [word for word in row['clean_answer'].split(" ") if word not in stop_words]

    if len(split_answer) == 0:
        return 0  # prevents a division by zero error later

    match_count = 0
    for each_answer in split_answer:
        if each_answer in split_question:
            match_count += 1

    return match_count / len(split_answer)

# Fraction of the terms in each answer that also occur in its question
jeopardy['answer_in_question'] = jeopardy.apply(word_count, axis=1)


mean_word_match = jeopardy['answer_in_question'].mean()
mean_word_match

0.04377071958826046

On average, only about 4% of the words in an answer also appear in its question. This shows that the answer can rarely be deduced from the question itself, which helps us come up with a strategy to study for the contest.

Contestants need to actually know the answer. There is no way to participate in Jeopardy and win without studying!

### Finding Frequency of Repeated Questions

We want to investigate how often new questions are repeats of older ones. Comparing the words used in different questions helps us estimate this.

We will only look at words of six or more characters, which filters out words like "the" and "than" that are commonly used but don't tell you a lot about a question.

In [28]:
# Sorting dataframe by Air Date

jeopardy= jeopardy.sort_values(by='Air Date')

terms_used = set() # Empty set to contain the words from the questions
question_overlap = []

for index,row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    
    # Keeping only words that are at least 6 characters long
    split_question = [each for each in split_question if len(each) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count+= 1
        # Adding words to the set- terms_used
        terms_used.add(word)
        
    # Converting the match count to a fraction of the question's terms
    if len(split_question)>0:
        match_count = match_count/ len(split_question)
        
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap

mean_question_overlap = jeopardy['question_overlap'].mean()
mean_question_overlap
            
        

0.6894006357823182

On average, about 70% of the longer words in a question have already appeared in an earlier question. This looks only at individual words rather than whole questions, but it is a sign that going through the questions which have already been asked on the show can help us perform better in Jeopardy!
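
A small sketch of why word overlap is weaker evidence than repeated questions: the two made-up questions below are clearly different, yet half of the second question's long words were already seen in the first. The function name `overlap_with_earlier` and the sample questions are assumptions for illustration.

```python
def overlap_with_earlier(clean_questions, min_length=6):
    """For each question, in order, the fraction of its long words seen in earlier questions."""
    terms_seen = set()
    overlaps = []
    for question in clean_questions:
        words = [w for w in question.split(" ") if len(w) >= min_length]
        matches = sum(1 for w in words if w in terms_seen)
        terms_seen.update(words)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

# Hypothetical questions, already lower-cased and stripped of punctuation
questions = [
    "this african country borders the atlantic",
    "this european country borders germany",
]
overlaps = overlap_with_earlier(questions)
print(overlaps)  # [0.0, 0.5]
```

Unlike the loop in the notebook, this version adds a question's words to the set only after scoring it, so a word repeated within one question is not counted as a match against itself.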

### Analyzing the Occurrence of Terms Corresponding to the Value of Questions

In [29]:
# Function to categorize values into high and low value

def question_value(row):
    if row['clean_value']>800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(question_value, axis=1)


In [30]:
# Function to count the high-value and low-value questions that contain a given term

def high_low_counts(word):
    low_count = 0
    high_count = 0
    
    # Counting the high-value and low-value questions that contain the given term
    for index,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value']==1:
                high_count+= 1
            else:
                low_count+= 1
    return high_count,low_count

observed_expected = []
comparison_terms = list(terms_used)[:5] # Selecting a few words for analysis


for each_term in comparison_terms:
    observed_expected.append(high_low_counts(each_term))

print(comparison_terms)
print(observed_expected)



['cassie', 'wagons', 'murphy', 'legion', 'whitebellied']
[(1, 0), (0, 1), (5, 1), (0, 1), (0, 1)]
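
The function above walks through every row once per term, which gets slow as the list of terms grows. One possible alternative, sketched below under the assumption that the rows are available as `(clean_question, high_value)` pairs, counts all terms in a single pass. The function name `count_terms_by_value` and the sample rows are hypothetical.

```python
def count_terms_by_value(rows, terms):
    """Return {term: (high_count, low_count)} using one pass over the rows."""
    wanted = set(terms)
    counts = {term: [0, 0] for term in wanted}
    for question, high_value in rows:
        # set() counts a term once per question, matching `word in split_question`
        for word in set(question.split(" ")) & wanted:
            if high_value == 1:
                counts[word][0] += 1
            else:
                counts[word][1] += 1
    return {term: tuple(pair) for term, pair in counts.items()}

# Hypothetical rows: (clean_question, high_value)
rows = [
    ("murphy brown starred this actress", 1),
    ("covered wagons crossed this trail", 0),
    ("eddie murphy played this detective", 1),
]
result = count_terms_by_value(rows, ["murphy", "wagons"])
print(result)
```

With the DataFrame from this notebook, `rows` could be built as `zip(jeopardy['clean_question'], jeopardy['high_value'])`.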


### Chi-Squared Test for the Selected Terms

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value for each term.
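
As a worked illustration of that calculation, the sketch below computes the expected counts and the chi-squared statistic by hand for one term. The totals of 300 high-value and 700 low-value questions are hypothetical and are not the counts from this dataset. With two categories there is one degree of freedom, and in that case the p-value equals `erfc(sqrt(statistic / 2))`, so only the standard library is needed.

```python
import math

def chi_squared_two_bins(observed, expected):
    """Chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical totals, chosen only to keep the arithmetic easy to follow
high_value_count = 300
low_value_count = 700

observed = (5, 1)  # the term appears in 5 high-value and 1 low-value question
total_prop = sum(observed) / (high_value_count + low_value_count)

# If the term were unrelated to value, its 6 occurrences would split 30% / 70%
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_two_bins(observed, expected)
print(round(statistic, 2), p_value)
```

Here the expected counts are 1.8 and 4.2, so the statistic is (5 - 1.8)² / 1.8 + (1 - 4.2)² / 4.2, which is about 8.13.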

In [31]:


high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]
chi_squared = []

for each in observed_expected:
    total = sum(each)
    total_prop = total/jeopardy.shape[0]
    
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    # Calculating Chi-Squared value and P-value
    observed = np.array([each[0], each[1]])
    expected = np.array([exp_high, exp_low])
    
    chi_squared.append(chisquare(observed,expected))

    
print(chi_squared)

[Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=8.76612762933646, pvalue=0.003068762885281689), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]


- Four of the five selected terms show no significant difference between high-value and low-value rows (p > 0.05). The exception is 'murphy' (p ≈ 0.003), but it appears in only 6 questions, which is too few for the chi-squared test to be reliable.
- Terms with higher frequency should be selected to carry out chi-squared tests to analyze whether the difference between high-value and low-value rows is significant.
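
One way to pick higher-frequency terms, sketched here as an assumption rather than as part of the original analysis, is to count in how many questions each long word appears and keep the most common ones. The function name `most_frequent_terms` and the sample questions are hypothetical.

```python
from collections import Counter

def most_frequent_terms(clean_questions, min_length=6, top_n=10):
    """Return the top_n long words ranked by the number of questions containing them."""
    counts = Counter()
    for question in clean_questions:
        # A set ensures each term is counted once per question
        counts.update({w for w in question.split(" ") if len(w) >= min_length})
    return [term for term, _ in counts.most_common(top_n)]

# Hypothetical cleaned questions
sample = [
    "this african country borders the atlantic",
    "this european country borders germany",
    "this country exports coffee",
]
top = most_frequent_terms(sample, top_n=2)
print(top)  # ['country', 'borders']
```

In the notebook, the result of calling this on `jeopardy['clean_question']` could replace `list(terms_used)[:5]` as the list of comparison terms.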