# Best Way To Prepare For Jeopardy

## Introduction
Jeopardy is a popular American TV show where participants answer questions to win money.The show debuted in March 30, 1964 and has since become very popular. You can learn more about jeopardy <a href = 'https://en.wikipedia.org/wiki/Jeopardy!'>here</a>.

The dataset we will be working with is `JEOPARDY_CSV.csv` and it conatains 216,930 rows and was collected from <a href = 'https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/'>reddit</a>. Each row on the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* `Show Number` - the Jeopardy episode number.
* `Air Date` - the date the episode aired.
* `Round` - the round of Jeopardy.
* `Category` - the category of the question.
* `Value` - the number of dollars the correct answer is worth.
* `Question` - the text of the question.
* `Answer` - the text of the answer.

The goal of this project is to analyse the dataset and look for patterns that could help you win jeopardy.

## Data Exploration

In [1]:
import pandas as pd
import re
import numpy as np
from random import choice
from scipy.stats import chisquare

In [2]:
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1    Air Date    216930 non-null  object
 2    Round       216930 non-null  object
 3    Category    216930 non-null  object
 4    Value       216930 non-null  object
 5    Question    216930 non-null  object
 6    Answer      216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


With the exception of the `Answer` column, none of the column have null values. There are just two null values in the `Answer` column so we are going to drop the rows with null values as they are quite insignificant.

## Data Cleaning

In [4]:
jeopardy.dropna(inplace=True)

In [5]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [6]:
jeopardy.columns = jeopardy.columns.str.strip() # removes white space at the beiginning and end of strings
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [7]:
def clean_text(text):
    ''' Takes in a text input, removes every punctuation
    and converts every word to lower case.'''
    text = re.sub(r'[^\w\s]', '', text) # replaces characters that are not alpha-numeric follow by a whitespace with an empty string
    text = text.lower()
    return text

def clean_value(value):
    value = re.sub(r'\W', '', value)
    
    '''value column contains none dtype, 
    use try and except to avoid raising valueError'''
    try:
        value = int(value)
    except:
        value = 0
    return value


In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(clean_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(clean_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(clean_value)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [9]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   Show Number     216928 non-null  int64         
 1   Air Date        216928 non-null  datetime64[ns]
 2   Round           216928 non-null  object        
 3   Category        216928 non-null  object        
 4   Value           216928 non-null  object        
 5   Question        216928 non-null  object        
 6   Answer          216928 non-null  object        
 7   clean_question  216928 non-null  object        
 8   clean_answer    216928 non-null  object        
 9   clean_value     216928 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 18.2+ MB


## Answer In Questions:
One strategy you might want to consider when answering questions is taking a hint from the question to derive an answer. We are going to look at all the questions and find out what percentage of questions have their answers in them.

In [10]:
def count_matches(row):
    '''returns the proportion for clean_question
    & clean_answer with matching terms'''
    
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0: # to avoid dividing by 0
        return 0
    
    for i in split_answer:
        if i in split_question:
            match_count +=1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] =  jeopardy.apply(count_matches, axis=1)

In [11]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0


In [12]:
jeopardy['answer_in_question'].mean()

0.057921237245162335

On average, only 6% of questions have their answers in the questions asked. This is not a whole lot of questions and means we can't hope to win by trying to figure out the answers of questions using the question. So the best strategy will be to actually study for jeopardy.

## Repeated Questions
It is somewhat common that in most Q & A competitions, questions are repeated. We want to find out what percentage of questions are repeated and if it is a good strategy to study past questions.

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values('Air Date')

for row in jeopardy.iterrows():
    row = row[1]
    split_question = row["clean_question"].split(" ")
    split_question = [word for word in split_question if len(word) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0: # to avoid dividing by 0
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap

In [14]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question,question_overlap
84523,1,1984-09-10,Jeopardy!,LAKES & RIVERS,$100,River mentioned most often in the Bible,the Jordan,river mentioned most often in the bible,the jordan,100,0.0,0.0
84565,1,1984-09-10,Double Jeopardy!,THE BIBLE,$1000,"According to 1st Timothy, it is the ""root of a...",the love of money,according to 1st timothy it is the root of all...,the love of money,1000,0.333333,0.0
84566,1,1984-09-10,Double Jeopardy!,'50'S TV,$1000,Name under which experimenter Don Herbert taug...,Mr. Wizard,name under which experimenter don herbert taug...,mr wizard,1000,0.0,0.0
84567,1,1984-09-10,Double Jeopardy!,NATIONAL LANDMARKS,$1000,D.C. building shaken by November '83 bomb blast,the Capitol,dc building shaken by november 83 bomb blast,the capitol,1000,0.0,0.0
84568,1,1984-09-10,Double Jeopardy!,NOTORIOUS,$1000,"After the deed, he leaped to the stage shoutin...",John Wilkes Booth,after the deed he leaped to the stage shouting...,john wilkes booth,1000,0.0,0.0


In [15]:
jeopardy['question_overlap'].mean()

0.8721734034756163

87% of terms used in old questions are repeated on newer questions so it might be worth it to look at older questions when preparing for Jeopardy.

## High Value vs Low Value Questions.
We want to find if there is any relationhsip between certain terms and high value questions so we are prepared enough to answer high value questions. We are going to be using a chisquare hypothesis testing to find this.

In [16]:
def value_category(row):
    '''categorises rows into high or low value
    1 = high value, 0 = low vaue'''
    
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(value_category, axis=1)

In [17]:
def count_value(word):
    '''counts the value of individual words 
    in the clean question column'''
    
    low_count = 0
    high_count = 0
    for row in jeopardy.iterrows():
        row = row[1]
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [18]:
comparison_terms = []
comparison_terms = [choice(list(terms_used)) for i in range(10)] # picks a random smaple of 10 terms with replacement

observed_expected = []
for i in comparison_terms:
    result = count_value(i)
    observed_expected.append(result)
    
print(observed_expected)

[(0, 1), (0, 1), (1, 0), (0, 1), (1, 5), (4, 7), (1, 1), (0, 1), (3, 1), (2, 0)]


In [19]:
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for value in observed_expected:
    total = sum(value)
    total_prop = total / jeopardy.shape[0]
    exp_high = high_value_count * total_prop
    exp_low = low_value_count * total_prop
    
    observed = np.array([value[0], value[1]])
    expected = np.array([exp_low, exp_high])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=8.948179686982321, pvalue=0.0027774621368179186),
 Power_divergenceResult(statistic=6.76146672712397, pvalue=0.009314716768224153),
 Power_divergenceResult(statistic=0.4633727036157106, pvalue=0.4960519396377898),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=0.02164944004882361, pvalue=0.8830235016084509),
 Power_divergenceResult(statistic=0.7899630882409683, pvalue=0.3741112870360538)]

In our `observed_expected` list, terms seem to be more frequent in lower value questions, this could be due to the fact that there are more low value questions than high value ones. In cases where there were significant differences(of at least 3) in the term frequencies for low and high value, the pvalues are all less than 0.05 which would mean a strong relationship between those terms and low value words which makes sense as low value questions are more common. Although it was a small sample, there are no strong relationship between terms and high value questions.

## Popular Categories Per Round.
Jeopardy has rounds and here we want to find out the most frequent category in each of the rounds.

In [20]:
jeopardy['Round'].value_counts(normalize=True)

Jeopardy!           0.495017
Double Jeopardy!    0.488231
Final Jeopardy!     0.016738
Tiebreaker          0.000014
Name: Round, dtype: float64

In [21]:
jeopardy_grp =  jeopardy.groupby(['Round'])

In [22]:
for i in jeopardy['Round'].unique():
    j_round = jeopardy_grp.get_group(i)
    top_cat_proportion = j_round['Category'].value_counts(normalize=True)[0] # returns the value for the category with the highest proportion
    top_cat_percentage = top_cat_proportion * 100
    top_cat_name = j_round['Category'].value_counts().index[0] # returns the name of the category with the highest frequency in each round
    
    print(f'''
    {top_cat_name} category make up {top_cat_percentage:.3}% of the questions in {i} round.
''')


    POTPOURRI category make up 0.237% of the questions in Jeopardy! round.


    BEFORE & AFTER category make up 0.425% of the questions in Double Jeopardy! round.


    U.S. PRESIDENTS category make up 1.38% of the questions in Final Jeopardy! round.


    THE AMERICAN REVOLUTION category make up 33.3% of the questions in Tiebreaker round.



Most of the questions in our dataset are from the `Jeopardy!` and `Double Jeopardy!` rounds, with these round making up nearly 99% of the data, even though we know the top categories for these rounds, these categories make up only a small percentage of the total questions; 0.2% and 0.3% respectively. Focusing on just one particular category of question for a specific round isn't a very good strategy.

## Conclusion
* While there is no guaranteed strategy to winning Jeopardy as we have found out, it might be worth while to look at past questions while preparing. 

* There also isn't any significant relationship between any term and high questions, so there is no keyword to look out for to prepare for high value questions.

* There isn't a significant question category to focus on for any jeopardy round, it's best to be prepared for as much ccategories as possible.