# Answering Jeoperdy Questions to Win Money

## 1. Introduction

`Jeopardy` is a popular TV show in the US where participants answer questions to win money. 

**In this project we are going to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help win money.**

The dataset called `jeopardy.csv` contains *20000* rows. This dataset can be downloaded here __[GeopardyDataset](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/)__.

In [1]:
#import necessary modules

import numpy as np
import pandas as pd
import re
import random
from scipy.stats import chisquare

In [2]:
#reading the dataset
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Jeopardy is a dataset with *7* columns. Let's take a look at just the column names.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

As we can see above, column names have spaces at the beginning. Let us fix it by getting rid of these sapces before moving forward.

In [4]:
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

We are ready to move forward with needful analysis.

## 2. Normalizing the Text Columns

In this section we are going to normalize the columns *Questions* and *Answer*. We do this by writing a function called `Normalize`, which 

 * Takes the string in
 * Converts the string to a lower case
 * Removes all the punctuation in the string
 * Returns the string

In [5]:
def Normalize(text):
        text = text.lower()
        text = re.sub(r"[^\w\s]", "", text)
        text = re.sub(r"[_]", "", text)
        return text  

In [6]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(Normalize)
jeopardy['clean_question'] = jeopardy['Question'].apply(Normalize)

We normalized the columns *Question* and *Answer* and renamed the normalized columns as *clean_question* and *clean_answer* respectively.

In [7]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,copernicus,for the last 8 years of his life galileo was u...
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,jim thorpe,no 2 1912 olympian football star at carlisle i...
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,arizona,the city of yuma in this state has a record av...
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,mcdonalds,in 1963 live on the art linkletter show this c...
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,john adams,signer of the dec of indep framer of the const...


## 3. Normalizing the Non-Text Columns

In this section we are going to normalize the non text columns `Value` and `Air Date`. 

We will convert the `Value` column's data into a *numeric type* and get rid of the *dollar sign* at the beginning of each value. For this we will build a function called `Normalize_int`, which 

 * Takes in a string
 * Removes any punctuation
 * Converts it into an integer 
 * If there's an error assigns `0`      
 * Returns an integer

In [8]:
def Normalize_int(text):
    text = re.sub(r"[_\W]",'',text)
    try:
        text = int(text)
    
    except ValueError:
        text = 0
    return text

We will apply the above function to the `Value` column and name the new column as `clean_value`. 

In [9]:
jeopardy['clean_value'] = jeopardy['Value'].apply(Normalize_int) 

We will also convert each value in `Air Date` column to *datetime* format. 

In [10]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [11]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,copernicus,for the last 8 years of his life galileo was u...,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,jim thorpe,no 2 1912 olympian football star at carlisle i...,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,arizona,the city of yuma in this state has a record av...,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,mcdonalds,in 1963 live on the art linkletter show this c...,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,john adams,signer of the dec of indep framer of the const...,200


In [12]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_answer      19999 non-null object
clean_question    19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


From the above table we can see that we have successfully achieved our objective.  

## 4. Examining Answers in Questions

In order to find out whether to study past questions from *Jeopardy*, we need to figure out two things:
    
 - How often the answer can be used for a question - this we perform by seeing how many times words in answer also occur in the question.
 - How often questions are repeated - this we find out by seeing how often complex words (>6 characters) reoccur. 

In this section we are going to find out `how often the answer can be used for a question`. 

For this we create a function called *answer_in_question*, which

 - takes a row in jeopardy, splits the *clean_answer* and *clean_question* and assign them to variables *split_answer* & *split_question* respectively.
 - loops through each item in *split_answer* and checks if it also exists in *split_question*. If it does assiging *1* to a variable called *match_count*.
 - returns *match_count* divided by *length of split_answer*.

In [13]:
def answer_in_question(row):
    
    #split each row and assign to a variable
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    
    #assign 0 to a variable match_count
    match_count = 0
    
    #get rid of 'the' as it doesn't give any meaning
    if "the" in split_answer:
        split_answer.remove("the")
    
    #if length of split_answer is 0 return 0 to prevent division by 0
    if len(split_answer) == 0:
        return 0
    
    #if item exists both in split_answer & split_question add 1 to match_count
    for item in split_answer:
        if item in split_question:
            match_count += 1
    
    #divide match_count by length of split_answer
    result = match_count/len(split_answer) 
            
    return result    

Using pandas *dataframe.apply()* method, we will apply the above function to each row in `jeopardy`. We will assign the result to a new column called `answer_in_question`.

In [14]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

Below we calculate *mean* for *answer_in_question* column in order to understand *the percentage of answers in questions* for *Jeopardy*.

In [15]:
Mean = jeopardy['answer_in_question'].mean()
Mean

0.06049325706933587

We found that on an average `6%` of terms in answers are found in questions. So we think that relying soley on this strategy may not be effective for `jeopardy` preparation.

## 5. Examining the Recycled/Repeated Questions

In this section we address the latter point of the two things we mentioned in the earlier section. i.e. `how often the questions are repeated`. In order to achieve our objetive, we perform the following:

 * Sorting the dataset in an ascending order of `Air Date` column.
 * Create an empty set called `terms_used`.
 * Split the `clean_question` column and remove words that are less than `6` character long (this helps filter out the words like *the*, *than* etc.).
 * Add each word to `terms_used`.
 * If a word reoccurs in `terms_used`, increment a counter.
 * divide the *counts* by *length of split_question* and append this to an empty list `question_overlap`.

In [16]:
question_overlap = []

terms_used = set()

jeopardy = jeopardy.sort_values(by= 'Air Date')

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    split_question = [terms for terms in split_question if len(terms)>5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
        terms_used.add(word)
        
        if len(split_question) > 0:
            result = match_count / len(split_question)
            question_overlap.append(result)
            
jeopardy['question_overlap'] = [question_overlap[i] for i in range(len(jeopardy))]

We have added a column called `question_overlap` to the dataset `jeopardy`. Let us calculate the mean for this column. This value estimates average value of terms used in questions.  

In [17]:
Mean_question_overlap = jeopardy['question_overlap'].mean()
Mean_question_overlap

0.34281494433252024

We found that `~34%` of terms in questions are recycled/repeated. The set we are working with is only *10%* of the full *Jeopardy* question dataset. Still we think that studying past questions might be an effective strategy for `jeopardy` preparation.

## 6. Examining the Low Value and High Value Questions

In order to earn more money in `jeopardy`, studying *high value* questions instead of *low value* questions is a wise strategy. In a given row of the dataset: 

 - Low value question is one with value < 800.
 - High value questions is one with value > 800.
 
For our analysis we follow these steps:

 * Loop through the terms from `terms_used` (which contains complex words from the `clean_question` column).
 * Calculate the number of *low value* and *high value* questions the word occurs in.
 * Calculate the percentage of questions the word occurs in.
 * Based on the percentage, calculate the *expected counts*.
 * Compute the `chi square value` based on the expected counts and the observed counts for high value and low value questions.
 
Let us first build a function called *assign_value* that assigns value to the *clean_value* column of the dataframe. 

In [18]:
def assign_value(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0        

Let us apply the function to the *jeopardy* dataset. 

In [19]:
jeopardy['high_value'] = jeopardy.apply(assign_value, axis=1)
jeopardy.head(6)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_answer,clean_question,clean_value,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,theodore roosevelt,adventurous 26th president he was 1st to ride ...,0,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,jimmy hoffa,notorious labor leader missing since 75,200,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,thanksgiving,washington proclaimed nov 26 1789 this first n...,200,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,the grand canyon,both ferde grofe the colorado river dug this ...,200,0.0,0.0,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,tom,depending on the book he could be a jones a sa...,200,0.0,0.0,0
19305,10,1984-09-21,Double Jeopardy!,HOMONYMS,$200,Hindu hierarchy or a play's actors,a caste (cast),a caste cast,hindu hierarchy or a plays actors,200,0.333333,0.0,0


Further, we will build a function called `assign_count` that 

 * Takes *word* as input. 
 * Adds to the variables *high_count* or *low_count* (assigned 0 at the beginning) for the word in *clean_question* by checking if the word was assigned 1 in *high_value* or *low_value*. 
 * Returns *high_count* and *low_count*. 

In [20]:
def assign_count(word):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(" ")
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count     

Let us now pick *10* random words from *terms_used* and apply *assign_word* function on those words and calculate the *high_count* and *low_count*. 

In [21]:
# create a list of 10 random words from terms_used
comparison_terms = random.sample(terms_used, k=10)
comparison_terms= list(comparison_terms)

#create an empty list
observed_expected = []

#apply the function and append to the empty list
for term in comparison_terms:
    observed_expected.append(assign_count(term))

for i,j in zip(comparison_terms,observed_expected):
    print(i,":", j)

receives : (1, 1)
arnold : (0, 4)
stretched : (0, 4)
defender : (1, 2)
highest : (15, 45)
gullet : (0, 1)
contribution : (0, 2)
giannini : (0, 1)
ailuropoda : (0, 1)
smothered : (0, 1)


We have successfully calculated the `Observer Counts` (i.e. *the number of high value questions the word occurs in* and *the number of low value questions the word occurs in*) for *10* random words from *terms_used*.

## 7. Applying the Chi-squared Test

In this section we are going to compute the `expected counts` and `chi-squared values` for the selected 10 words. In order to calculate *chi-squared value* and *p-value* we use `scipy.stats.chisquare` function.

In [22]:
#calculate the number of rows with 'high_value = 1' & 'high_value = 0'
high_value_row = jeopardy[jeopardy['high_value']==1]
low_value_row = jeopardy[jeopardy['high_value']==0]

high_value_count = len(high_value_row)
low_value_count = len(low_value_row)

#create an empty list
chi_squared = []

#loop through each list in observed_expected
for values in observed_expected:
    #add the values to get total sum
    total = sum(values)     
    
    #get the total proportion
    total_prop = total / len(jeopardy) 
    
    #calculate expected high value count
    expected_high_value_count = total_prop * high_value_count
    
    #calcute expected low value count
    expected_low_value_count = total_prop * low_value_count
    
    expected = np.array([expected_high_value_count, expected_low_value_count])
    observed = np.array([values[0], values[1]])
    
    #calculate chisquare value using expected & observed arrays
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948),
 Power_divergenceResult(statistic=1.607851384507536, pvalue=0.20479409439225948),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.39546649626611535, pvalue=0.5294398839801588),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

The above list shows the *chisquare values* and *p-values* for *10* randomly selected words from a list of complex words. In general if the `p-value is < 0.05`, we say the `it is statistically significant` (suggesting that there is less than a 5% chance that the observed results occurred by random chance and we can reject the null-hypothesis) and if the `p-value is > 0.05`, we say that the `it is statistically insignificant`(meaning there is not enough evidence to reject the null hypothesis). In here we observed that  for words if *high_value_counts* are higher than *low_value_counts*,(i.e. if the number of high value questions the word occurs in is greater than the number of low value questions the word occurs in) we get a p-value < 0.05. We observed that such words are rare in the list of *10* complex words.  

We can follow two strategies inorder to overcome this problem:

 1. Get rid of the words that occur in more than a certain percentage (eg. 5%) of questions in order to eliminate non-informative words.
 2. Performing chi-squared test across more terms.

## Conclusions

Our objective in this project was to work with the dataset of `Jeopardy` questions (*Jeopardy* is a popular TV show in the US where participants answer questions to win money). We did this in order to figure out some patterns in the questions that could help us win money. The dataset had *20000* rows and seven columns. 

**Datacleaning:** We performed datacleaning

 1. On *Questions* and *Answers* columns by getting rid of *punctuations* and converting the strings to *lower case*.   
 2. On *Value* column by getting rid of the dollar sign and converting each column value into numeric type.
 3. By converting *Air Date* column values to datetime format.
 
**In order to find out whether to study past questions, we performed following two analysis:**

1. We studied how often the answer can be used for a question by seeing how many times words in answer also occur in the question.
   * We found that this strategy is not effective as our analysis resulted in only *6%* of terms in answers found in questions. 
2. We studied how often questions are repeated. This we found out by seeing how often complex words (>6 characters) reoccur.
   * We found that close to *34%* of the terms in questions are repeated. We think that studying past questions might be a usefule strategy.
   
**In order to earn more money we analysed *high value questions (>800)* and *low value questions (\<800)*:**
 
 - For the purpose we built a function that takes in a word from a *clean_question* column and adds to either *high value counts* or *low value counts* by checking the value of the word and returns the *high count* and *low count*. These counts, we termed them as an `observer counts`.
 - Using *observer counts*, we calculated *expected counts* for *high value counts* and *low value counts*. Then using *scipy.stats.chisquare* we calculated `chisquare value` and `p-value` for randomly selected *10* words.
    * We observed that rarely a p-value is < 0.05, implying there are not many words with *high value count* among 10 randomly selected words.
    
We conclude that either by increasing the number of randomly selected words or by eliminating non-informative words we can get better results. 