# Winning Jeopardy -- A Project On Hypothesis Testing

Jeopardy is a popular TV show in the US where participants answer questions to win money.

A friend would like to go on the show (to win, of course!) and needs my help on winning strategies to work with before the show.

## Introduction

My task in this project is to figure out patterns in the questions asked on previous Jeopardy shows. 
I'm hoping that the patterns I find will help my friend win when she goes on the show.

I'll be working with this [dataset of Jeopardy questions](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file). 



## Exploring the Dataset

In [1]:
import numpy as np
import pandas as pd
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.shape

(19999, 7)

There are about 20,000 rows and 7 columns in this dataset. Each row represents a single question asked on a single episode of  Jeopardy.

Here are detailed descriptions of each column:


| Column Name  | Description                                                                                                   |
|--------------|---------------------------------------------------------------------------------------------------------------|
| Show Number  | The Jeopardy episode number this question was in.                                                             |
| Air Date     | The date the episode aired.                                                                                   |
| Round        | The round of Jeopardy that the question was asked in. The show has several rounds as each episode progresses. |
| Category     | The category of the question                                                                                  |
| Value        | The value that a correct answer is worth (in dollars).                                                          |
| Question     | The text of the question.                                                                                     |
| Answer       | The text of the answer.                                                                                       |

## Cleaning the Column Names

As you will see below, some column names in the dataset have spaces in front of it. It is important to remove the spaces in the names to avoid errors in analyzing the data later.

In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing Columns

### 1. <u>'Question' and 'Answer' Columns</u>

Values in the `Question` and `Answer` columns are formatted as text data. I will need to get rid of punctuations and case variations in these columns to make them useful for further analysis.

For instance, by normalizing the text data, a word like `Don't` is not considered differently from `don't`.


In [4]:
#function converts to lowercase and removes punctuation 
import string
def normalize_text(text): 
    for punctuation in string.punctuation:
        text = text.lower().replace(punctuation, '')
    return text

In [5]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)

In [6]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

### 2. <u>'Value' and 'Air Date' Columns</u>

The `Value` column needs to be numeric for easy manipulation. So, the dollar sign needs to be removed from the beginning of each value in the column before making the format change.

The values in the `Air Date` column also need to be formatted as datetime values for easy manipulation. 

In [7]:
#Removes dollar sign and converts to integer
def normalize_value(text): 
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

In [8]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [9]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [10]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [11]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


## Can a Participant Guess the Answer By Looking At The Question?

Is there a slight chance that a participant can guess the correct answer to a question by looking at the question? One way to find out is to investigate how many times words in the answer occur in the question.

In [12]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [13]:
jeopardy["answer_in_question"].mean()

0.05886148203514083

**<u>Observation</u>**

On the average, ONLY 6% of the words found in the questions asked on Jeopardy were also present in the corresponding answer to the question. 

This isn't a strong number. So, it is only wise to actually study before the show rather than assume that one may be able to guess the answer to a question by picking out certain words in the question.

## How Often Are Questions Repeated?

If new questions on the Jeopardy show are sometimes repeats of questions asked on previous episodes, that may be something to keep in mind while studying for the show.

However, because this dataset is only 10% of the full Jeopardy dataset questions, I may not be able to completely answer this question. But a little investigation never hurt anyone, right?

Here's how I will investigate further with this dataset:
- Sort the `Air Date` column in ascending order. 
- Initialize an empty set called 'terms_used'
- Iterate through each row in the Jeopardy dataset.
- Split the `clean_question` column into words and remove any word that is shorter than 6 characters. Then, I'll check if these words occur in 'terms_used'.

In [14]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by='Air Date',ascending=True)
for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6902117143393427

**<u>Observation</u>**

On the average, 69% of terms in questions asked on the Jeopardy show overlap with terms in the corresponding answers.

But this is only a small set of all Jeopardy questions. What's more? The investigation done was for single words. Phrases were not considered.

However, while it is tempting to discard this result as insignificant in the larger scheme of things, it may be worth it to probe further on the possiblity of questions being recycled.

## Low Value vs High Value Questions -- Earning More

My friend is keen on securing the bag. So, she's particular about high-value questions that will earn her more money than low-value questions.

My task in helping her with a study strategy in this regard is to figure out terms that correspond to high-value questions by running a chi-squared test.

For the purpose of this analysis, 
- Low-value question will be any row where the `Value` is less than 800.
- High-value question will be any row where the `Value` is greater than 800.

In [15]:
def question_value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(question_value, axis=1)

Next, I'll loop through each of the terms in `terms_used` and find:
- How many low-value questions the term occurs in.
- How many high-value questions the word occurs in.
- The percentage of questions the word occurs in.
- Expected counts based on the percenttage of questions the terms occurs in.

In [16]:
def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows(): 
        if word in row["clean_question"].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count +=1
    return high_count, low_count

In [17]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(word_count(term))

print(comparison_terms)
observed_expected

['rodham', 'versions', 'competed', 'michel', 'meijing', 'dionnes', 'satisfies”', 'avenue', 'outright', 'larynx']


[(0, 1),
 (1, 8),
 (0, 2),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 1),
 (2, 6),
 (0, 1),
 (0, 2)]

## Applying the Chi-squared Test

In this section, I will find terms that have the largest difference in use between the high value and the low value questions. I will do this by selecting the words with the highest associated chi-squared values. 

However, I will be running this test on a small sample of the words because running it for all the words will take a long while.

In [18]:
from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy[jeopardy['high_value'] == 0])
chi_squared = []

for observation in observed_expected:
    total = observation[0] + observation[1]
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([observation[0], observation[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared


[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.3570460299240277, pvalue=0.24405008712855691),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.05272886616881538, pvalue=0.818381104912348),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571)]

**<u>Observation</u>**

- None of the tested terms has a significant difference in usage between low-value questions and high-value questions.
- The fact that none of the terms had a frequency higher than 5 made the chi-squared test invalid. A better approach will be to use only terms that have high frequencies.

## Next Steps

Here are a few steps I'll be taking next in this project to get better results:

- Use phrases instead of individual words because they capture the context of the questions better.
- Use more terms (high-frequency terms) in the chi-squared test to see which ones have large differences.
- Explore the `Category` column to find what categories appear often as well as the probability of such categories appearing in every round.