# Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for many years and is a major force in popular culture.

In this project, we work with a dataset of Jeopardy questions to look for patterns in the questions that could help you win.

In [1]:
import pandas as pd 
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


* Show Number - the Jeopardy episode number
* Air Date - the date the episode aired
* Round - the round of Jeopardy
* Category - the category of the question
* Value - the number of dollars the correct answer is worth
* Question - the text of the question
* Answer - the text of the answer

In [2]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# remove the leading space from the column names
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
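
Retyping the names works for seven columns, but the leading spaces could also be stripped programmatically. The sketch below shows the idea on a plain Python list; `raw_columns` and `clean_columns` are illustrative names, not part of the notebook.

```python
# Column names as printed by jeopardy.columns above (note the leading spaces).
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category', ' Value',
               ' Question', ' Answer']

# str.strip() removes leading and trailing whitespace from each name.
clean_columns = [name.strip() for name in raw_columns]

print(clean_columns)
```

On the DataFrame itself, the same idea could likely be written as `jeopardy.columns = jeopardy.columns.str.strip()`.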

## Normalizing text

The idea is to convert the text to lowercase and remove punctuation so that Don't and don't aren't considered to be different words when we compare them.

In [6]:
import re 

def normalise_text(string):
    string = string.lower()  # convert to lowercase
    # keep only letters, digits and whitespace; the range must be A-Za-z,
    # because A-z would also let through [ \ ] ^ _ and `
    pattern = r'[^A-Za-z0-9\s]'
    string = re.sub(pattern, '', string)
    string = re.sub(r'\s+', ' ', string)  # collapse spaces, tabs and newlines into one space
    return string

In [7]:
jeopardy['Clean_Question'] = jeopardy['Question'].apply(normalise_text)
jeopardy['Clean_Answer'] = jeopardy['Answer'].apply(normalise_text)
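
As a quick check of the normalization idea, the sketch below applies the same steps to a few strings, two of them taken from the sample rows above. `normalize` is a hypothetical stand-in written for this example, not the notebook's function.

```python
import re

def normalize(text):
    """Hypothetical stand-in for normalise_text, for illustration only."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # drop punctuation
    return re.sub(r'\s+', ' ', text)         # collapse whitespace runs

samples = ["Don't", "don't", "McDonald's", "EPITAPHS & TRIBUTES"]
normalized = [normalize(s) for s in samples]
print(normalized)
```

After normalization, `Don't` and `don't` map to the same token, which is what the word comparisons further down rely on.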

In [8]:
def normalise_values(string):
    pattern = r'[^A-Za-z0-9\s]'  # strip punctuation such as '$' and ','
    string = re.sub(pattern, '', string)
    try:
        string = int(string)
    except ValueError:  # values that are not numbers become 0
        string = 0
    return string

In [9]:
jeopardy['Clean_Value'] = jeopardy['Value'].apply(normalise_values)
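
To illustrate the conversion on a few inputs, here is a small variant that keeps only the digits. `to_dollars` is a hypothetical helper, and the `'None'` placeholder for a missing price is an assumption about how such rows might look.

```python
import re

def to_dollars(value):
    """Hypothetical helper: '$2,000' -> 2000, anything without digits -> 0."""
    digits = re.sub(r'[^0-9]', '', value)  # keep digits only
    return int(digits) if digits else 0

converted = [to_dollars(v) for v in ['$200', '$2,000', 'None']]
print(converted)
```

Unlike `normalise_values`, this variant would turn a mixed string such as `'abc123'` into 123 rather than 0, so it is not a drop-in replacement.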

In [10]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Clean_Question,Clean_Answer,Clean_Value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [11]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [12]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
Clean_Question            object
Clean_Answer              object
Clean_Value                int64
dtype: object

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer can be deduced from the question.
* How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question.

In [13]:
# Takes a row of the dataset and returns the fraction of words in the answer
# that also occur in the question.
def count_matches(row):
    split_answer = row['Clean_Answer'].split(' ')  # split on spaces
    split_question = row['Clean_Question'].split(' ')
    match_count = 0
    # drop 'the' from the answer (first occurrence only), as it is very common
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0  # to prevent division by zero below
    for word in split_answer:  # count answer words that appear in the question
        if word in split_question:
            match_count += 1
    return match_count / len(split_answer)
            

In [14]:
jeopardy['Ans_in_Que'] = jeopardy.apply(count_matches, axis  = 1)

In [16]:
jeopardy['Ans_in_Que'].mean()

0.06049325706933587
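
To make the per-row ratio concrete, the sketch below computes it for two answers from the dataset. `answer_overlap` is a hypothetical standalone version of the logic; it drops every `'the'` rather than only the first, and the first question string is just the visible start of the truncated question text.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != 'the']
    if not answer_words:
        return 0.0
    question_words = set(clean_question.split())
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# 'a' occurs in the question, 'muumuu' does not: 1 of 2 words match.
half = answer_overlap('a muumuu', 'not a hawaiian cow but a dress')
# No word of the answer occurs in the question.
none = answer_overlap('taiwan', 'formerly formosa')
print(half, none)
```

The first case also shows a weakness of the measure: the match comes from the article `a`, not from a word that would help guess the answer.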

On average, only about 6% of the words in an answer also appear in its question. This isn't a huge number, and it means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.

## Recycled questions

Let's investigate how often new questions are repeats of older ones. We can't completely answer this, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it at least.

In [24]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")  # process questions in the order the episodes aired

for i, row in jeopardy.iterrows():
    split_question = row["Clean_Question"].split(" ")
    # keep only words longer than 5 characters to filter out common words like 'the' and 'than'
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:  # word already appeared in an earlier question
            match_count += 1
    for word in split_question:
        terms_used.add(word)  # the set keeps each word only once
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

In [25]:
jeopardy["Question_overlap"] = question_overlap

jeopardy["Question_overlap"].mean()

0.6877983201427721
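
The running-set logic can be followed on a toy example. The sketch below uses three made-up questions, not rows from the dataset, and `overlap_history` is a hypothetical name; it applies the same rule of keeping words longer than 5 characters.

```python
def overlap_history(questions, min_length=6):
    """For each question, the fraction of its long words already seen earlier."""
    seen = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            overlaps.append(sum(w in seen for w in words) / len(words))
        else:
            overlaps.append(0)
        seen.update(words)  # remember this question's words for later ones
    return overlaps

toy_questions = [
    'this planet is closest to the sun',       # nothing seen yet
    'the largest planet in the solar system',  # 'planet' was seen
    'jupiter is the largest planet',           # 'largest' and 'planet' were seen
]
overlaps = overlap_history(toy_questions)
print(overlaps)
```

The overlap grows as the set of seen words grows, which is one reason the first questions in the sorted dataset all show an overlap of 0.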

There is about 70% overlap between terms in new questions and terms in old questions. This only looks at a small set of questions, and it looks at single terms rather than phrases, so the result is relatively insignificant. It does suggest, however, that it's worth looking more into the recycling of questions.

## Low value vs high value questions

Next, we compare how often terms appear in high-value questions (worth more than $800) and in low-value questions.

In [26]:
# flag questions whose value is more than 800
def determine_value(row):
    if row['Clean_Value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [27]:
jeopardy['High_Value'] = jeopardy.apply(determine_value, axis=1)

In [28]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,Clean_Question,Clean_Answer,Clean_Value,Ans_in_Que,Que_overlap,question_overlap,Question_overlap,High_Value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride ...,theodore roosevelt,0,0.0,0,0.0,0.0,0
19312,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$600,"Since '27, stars have made good impressions at...",Mann's Chinese Theatre,since 27 stars have made good impressions at t...,manns chinese theatre,600,0.0,0,0.0,0.0,0
19299,10,1984-09-21,Jeopardy!,"""B"" MOVIES",$500,Sensitive Mart Crowley treatment of gays march...,The Boys in the Band,sensitive mart crowley treatment of gays march...,the boys in the band,500,0.0,0,0.0,0.0,0
19274,10,1984-09-21,Jeopardy!,GEOGRAPHY,$100,Formerly Formosa,Taiwan,formerly formosa,taiwan,100,0.0,0,0.0,0.0,0
19275,10,1984-09-21,Jeopardy!,DOUBLE TALK,$100,"Not a Hawaiian cow, but a dress worn by Hawaii...",a muumuu,not a hawaiian cow but a dress worn by hawaiia...,a muumuu,100,0.5,0,0.0,0.0,0


In [29]:
def count_usage(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row['Clean_Question'].split(' ')
        if word in split_question:
            if row['High_Value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [31]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)] #choosing random 10 words 

observed_expected = []

for word in comparison_terms:
    observed_expected.append(count_usage(word))
    
observed_expected

[(2, 3),
 (1, 0),
 (2, 0),
 (2, 3),
 (1, 6),
 (0, 1),
 (2, 1),
 (1, 0),
 (1, 0),
 (1, 2)]

In [34]:
high_value_count = jeopardy[jeopardy['High_Value'] == 1].shape[0]
high_value_count

5734

In [35]:
low_value_count = jeopardy[jeopardy['High_Value'] == 0].shape[0]
low_value_count

14265

In [36]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for obs in observed_expected:
    total = sum(obs)  # total occurrences of the term (high + low)
    total_prop = total / jeopardy.shape[0]  # share of all rows that contain the term
    # expected counts if the term were spread evenly across high- and low-value rows
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count

    observed = np.array([obs[0], obs[1]])
    # compare against the per-term expected counts, not the overall row totals
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

In [37]:
# (statistic, p-value) for each term, rounded to two decimals
[(round(float(r.statistic), 2), round(float(r.pvalue), 2)) for r in chi_squared]

[(0.31, 0.58),
 (2.49, 0.11),
 (4.98, 0.03),
 (0.31, 0.58),
 (0.71, 0.4),
 (0.4, 0.53),
 (2.12, 0.15),
 (2.49, 0.11),
 (2.49, 0.11),
 (0.03, 0.86)]
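
To make the arithmetic behind the test concrete, here is a pure-Python sketch for the first sampled term (2 high-value and 3 low-value occurrences). It assumes the expected counts are the term's total occurrences split in proportion to the 5,734 high-value and 14,265 low-value rows; `chi_squared_one_term` is a hypothetical helper, not a SciPy function.

```python
import math

total_rows = 19999
high_rows, low_rows = 5734, 14265  # row totals reported above

def chi_squared_one_term(high_obs, low_obs):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total = high_obs + low_obs
    high_exp = total * high_rows / total_rows
    low_exp = total * low_rows / total_rows
    statistic = ((high_obs - high_exp) ** 2 / high_exp
                 + (low_obs - low_exp) ** 2 / low_exp)
    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

stat, p = chi_squared_one_term(2, 3)
print(round(stat, 2), round(p, 2))
```

For this term the expected counts are roughly 1.43 and 3.57, both below 5, which is already a warning sign for the test.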

## Chi-squared results

Every term in this sample appears only a handful of times (at most 7 occurrences in total), so the expected frequencies are all below 5 and the chi-squared test isn't reliable here. These results therefore don't establish any real difference in usage between high-value and low-value questions. It would be better to run this test only with terms that have higher frequencies.
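
One possible way to follow up is to pick only terms that occur often enough before running the test. The sketch below counts, on made-up questions, in how many questions each long word appears and keeps those above a threshold; `frequent_terms` and its parameters are hypothetical.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    """Terms of at least min_length characters found in at least min_count questions."""
    counts = Counter()
    for question in clean_questions:
        # a set, so a word repeated inside one question is counted once
        counts.update({w for w in question.split() if len(w) >= min_length})
    return sorted(term for term, n in counts.items() if n >= min_count)

toy = ['the largest planet', 'planet closest to the sun',
       'a dwarf planet', 'largest ocean']
selected = frequent_terms(toy, min_count=2)
print(selected)
```

Since high-value rows make up about 29% of this dataset (5,734 of 19,999), a term would need roughly 18 or more occurrences for its expected high-value count to reach 5.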