# Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for many years and is a major force in popular culture. The show is a quiz competition in which contestants are given general-knowledge clues phrased as answers and must phrase their responses as questions. Questions are grouped by dollar value.

Our goal is to improve our chances of winning if we ever compete on Jeopardy. In this project, we will work with a dataset of Jeopardy questions to look for patterns in the questions that could help you win.

The dataset is named jeopardy.csv and contains the first 19,999 rows of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer



In [1]:
import pandas as pd
from random import choice
from scipy.stats import chisquare
import numpy as np
import re
jeopardy=pd.read_csv("jeopardy.csv")

jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.shape

(19999, 7)

## Normalizing the data

Now, let's normalize some columns to make it easier to conduct data analysis:

- Question and Answer – putting words in lowercase and removing punctuation,
- Value – removing the dollar sign and converting each value to numeric,
- Air Date – making it datetime.

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
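def normalize_text(text):
    '''Lowercases text, strips punctuation, and collapses repeated whitespace.'''
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    '''Strips punctuation such as "$" and "," and converts the value to an
    integer; values that cannot be parsed become 0.
    '''
    try:
        return int(re.sub(r"[^A-Za-z0-9\s]", "", text))
    except (TypeError, ValueError):
        return 0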

In [6]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)
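
The same normalization can also be expressed with pandas' vectorized string methods instead of `apply`. The snippet below is a minimal sketch on a made-up two-row frame (`sample` and its contents are hypothetical, not rows from jeopardy.csv). Unlike `normalize_values`, it keeps digits only, so it is an approximation of the functions above rather than a drop-in replacement.

```python
import pandas as pd

# Toy rows standing in for the real "Question" and "Value" columns (hypothetical data).
sample = pd.DataFrame({
    "Question": ["No. 2: 1912 Olympian;  football star", "McDonald's \"Big Mac\""],
    "Value": ["$2,000", "None"],
})

# Lowercase, drop everything except letters, digits and whitespace,
# then collapse runs of whitespace into a single space.
sample["clean_question"] = (
    sample["Question"]
    .str.lower()
    .str.replace(r"[^a-z0-9\s]", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
)

# Keep digits only; anything that cannot be parsed as a number becomes 0.
digits = sample["Value"].str.replace(r"[^0-9]", "", regex=True)
sample["clean_value"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)

print(sample[["clean_question", "clean_value"]])
```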


In [7]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null object
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: int64(2), object(8)
memory usage: 1.5+ MB


In [8]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [9]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [10]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Finding how often answers and questions are repeated

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the first question by seeing how many of the words in the answer also occur in the question. We can answer the second by seeing how often longer words (six or more characters) recur across questions.

In [11]:
def count_matches(row):
    '''Takes in a row in Jeopardy as a Series and returns how many times
    words in the answer occur in the question.
    '''
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)
jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)
answer_in_question_pct = round(jeopardy['answer_in_question'].mean()*100)

In [12]:
print(f'On average, {answer_in_question_pct}% of the words in an answer also appear in its question.')

On average, 6% of the words in an answer also appear in its question.
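
To make the per-row score concrete, here is a small sketch of the same idea on plain strings. The function name `answer_overlap` and the two example clues are made up for illustration. It differs slightly from `count_matches` in that it drops every occurrence of "the" from the answer, not just the first.

```python
def answer_overlap(clean_question, clean_answer):
    """Share of the answer's words (ignoring "the") that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# Hypothetical, already-normalized examples.
print(answer_overlap("this state is home to the city of yuma", "arizona"))
print(answer_overlap("the new york times is published in this city", "new york city"))
```

The first call scores 0 because "arizona" never appears in the clue; the second scores 1 because every word of the answer does.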


## Recycled Questions
### Investigating how often new questions repeat older ones
On average, only about 6% of the words in an answer also appear in the question. This is a small share, so we probably can't count on the question itself giving away the answer. We'll probably have to study.

Next, let's check whether the terms in each question have been used in earlier questions. Looking only at words with six or more characters filters out common words like *the* and *than*.

In [13]:
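# How often do new questions reuse terms from older ones?

question_overlap = []

# A set keeps each term only once, so re-adding a word that was already seen has no effect
terms_used = set()
# Sort chronologically so that each question is compared only with earlier ones
jeopardy = jeopardy.sort_values("Air Date")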

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        #proportion of match_count in split_question
        match_count /= len(split_question)
    question_overlap.append(match_count)

print(len(question_overlap),jeopardy.shape[0])

jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
        

19999 19999


0.6876260592169776

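On average, about 69% of the longer terms in a question had already appeared in earlier questions. Note that this counts single terms rather than whole phrases, so it overstates how often entire questions are repeated.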
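
The running-set logic is easier to follow on a tiny example. The sketch below applies the same idea to three made-up questions; `overlap_with_earlier` and `toy_questions` are hypothetical names, and the list is assumed to be in chronological order already.

```python
def overlap_with_earlier(questions, min_len=6):
    """For each question, return the share of its long words (min_len or more
    characters) that already appeared in an earlier question."""
    seen = set()
    overlaps = []
    for text in questions:
        words = [w for w in text.split() if len(w) >= min_len]
        matches = sum(w in seen for w in words)
        overlaps.append(matches / len(words) if words else 0)
        # Only now mark this question's words as seen, so a question
        # is never compared with itself.
        seen.update(words)
    return overlaps

toy_questions = [
    "galileo studied planets",
    "planets orbit around galileo",
    "short words only",
]
print(overlap_with_earlier(toy_questions))
```

The first question has nothing to overlap with, the second reuses two of its three long words, and the third has no long words at all, so it scores 0.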

## Finding words that correspond to high-value questions

Let's say we only want to study questions that resemble high-value questions rather than low-value ones. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test.

Let's first split the questions into two groups:

- Low value -- any row where Value is 800 or less.
- High value -- any row where Value is greater than 800.

In [14]:
def determine_value(row):
    '''Takes in a row in Jeopardy as a Series and categorizes questions as
    high or low value ones.
    '''
    if row['clean_value']>800:
        value=1
    else:
        value=0
    return value

def count_usage(term):
    '''Takes in a word and separately returns the numbers of high and low value
    questions the word occurs in.
    '''
    low_count=0
    high_count=0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)
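
Because `count_usage` walks the whole dataframe with `iterrows` for every term, it can be slow on larger samples. One possible alternative is sketched below on a made-up three-row frame; `toy` and `count_usage_vectorized` are hypothetical names, and the function is meant to return the same `(high_count, low_count)` pair as `count_usage`.

```python
import pandas as pd

# Hypothetical stand-in for the jeopardy dataframe.
toy = pd.DataFrame({
    "clean_question": [
        "this composer wrote fidelio",
        "this river flows north",
        "a composer of operas",
    ],
    "clean_value": [1000, 200, 400],
})

# Vectorized version of determine_value: 1 if the value is above 800, else 0.
toy["high_value"] = (toy["clean_value"] > 800).astype(int)

def count_usage_vectorized(df, term):
    """Return (high_count, low_count): how many high and low value
    questions contain `term` as a whole word."""
    contains = df["clean_question"].str.split(" ").apply(lambda words: term in words)
    high_count = int((contains & (df["high_value"] == 1)).sum())
    low_count = int((contains & (df["high_value"] == 0)).sum())
    return high_count, low_count

print(count_usage_vectorized(toy, "composer"))
```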



In [15]:
terms_used_list = list(terms_used)
# Pick 50 words at random (with replacement) from all the terms seen in the questions
comparison_terms = [choice(terms_used_list) for _ in range(50)]

# Despite the name, this holds only the observed (high, low) counts for each term;
# the expected counts are computed later
observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

print(f'The 50 random words in the whole dataset:'
      f'\n{comparison_terms}')
print(f'\nNumber of times each word occurred in high and low value questions:'
      f'\n{observed_expected}')

The 50 random words in the whole dataset:
['belgian', 'condescend', 'eriksons', 'oversaw', 'tracks', 'hrefhttpwwwjarchivecommedia20110713j07ajpg', 'kidnapping', 'etchings', 'vestments', 'hrefhttpwwwjarchivecommedia20091208dj20jpg', 'hrefhttpwwwjarchivecommedia20060329j28jpg', 'millionfooted', 'lorado', 'quasar', 'horizons', 'romanov', 'revolving', 'longrunning', 'squash', 'finishes', 'serengeti', 'snappy', 'cemetery', 'mahimahi', 'composed', 'annuals', 'battalion', 'pallares', 'perlmans', 'hopkinson', 'walker', 'resistance', 'heroine', 'mosque', 'munching', 'zappafrank', 'ruling', 'charms', '19031950', 'youngblood', 'racings', 'outing', 'responsibility', 'villagers', 'wexler', 'princess', 'supplied', 'instead', 'founding', 'canaanite']

Number of times each word occurred in high and low value questions:
[(0, 4), (1, 0), (1, 0), (0, 1), (2, 4), (0, 1), (1, 4), (1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (1, 2), (0, 1), (1, 2), (1, 1), (2, 1), (1, 0), (0, 2), (

## Applying the Chi-Squared Test
Now that we've found the observed counts for the 50 random words, we can compute the expected counts, chi-squared values, and p-values:

In [16]:
# Counting high and low value questions
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

# Computing the chi-squared statistic and p-value for each word
chi_squared = []
chi_squared_dict = {}
for term, obs in zip(comparison_terms, observed_expected):
    total = sum(obs)    # the number of questions the word occurs in
    total_prop = total / len(jeopardy)
    # Expected counts if the word were spread across high and low value
    # questions in proportion to the size of each group
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_square = chisquare(observed, expected)
    chi_squared.append(chi_square)
    # Stored as [p-value, statistic] to match the dataframe columns built later
    chi_squared_dict[term] = [chi_square.pvalue, chi_square.statistic]


chi_squared[:5]

[Power_divergenceResult(statistic=1.607851384507536, pvalue=0.2047940943922556),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.06376233446880725, pvalue=0.8006453026878781)]
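
To see where these numbers come from, the statistic can be worked out by hand for a single term. The sketch below uses made-up group sizes and counts (all values are hypothetical, chosen to keep the arithmetic simple) and only the standard library. It relies on the fact that, with two categories, the test has one degree of freedom, for which the upper-tail probability is `erfc(sqrt(x / 2))`.

```python
import math

# Hypothetical group sizes, not the real counts from the dataset.
high_value_count = 1000
low_value_count = 3000
total_rows = high_value_count + low_value_count

# Hypothetical observed counts for one term: (high value, low value).
observed = (3, 3)

# Expected counts if the term were spread in proportion to the group sizes.
total_prop = sum(observed) / total_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

# Pearson's chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# One degree of freedom: p-value is the upper tail of a chi-squared(1) variable.
p_value = math.erfc(math.sqrt(statistic / 2))

print(expected, statistic, p_value)
```

Here the expected counts are 1.5 and 4.5, the statistic is 1.5 + 0.5 = 2.0, and the p-value is about 0.157, so this hypothetical term would not count as significant at the 0.05 level.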

To display the results in a more readable form, let's create a dataframe and keep only the words with a p-value lower than 0.05, meaning that the difference between their high and low value counts is unlikely to be explained by chance alone.

In [17]:
chi_squared[0][1]

0.2047940943922556

In [18]:
chi_squared_dict

{'belgian': [0.2047940943922556, 1.607851384507536],
 'condescend': [0.11473257634454047, 2.487792117195675],
 'eriksons': [0.11473257634454047, 2.487792117195675],
 'oversaw': [0.5260772985705469, 0.401962846126884],
 'tracks': [0.8006453026878781, 0.06376233446880725],
 'hrefhttpwwwjarchivecommedia20110713j07ajpg': [0.5260772985705469,
  0.401962846126884],
 'kidnapping': [0.6680941623250602, 0.18383953104516373],
 'etchings': [0.11473257634454047, 2.487792117195675],
 'vestments': [0.11473257634454047, 2.487792117195675],
 'hrefhttpwwwjarchivecommedia20091208dj20jpg': [0.11473257634454047,
  2.487792117195675],
 'hrefhttpwwwjarchivecommedia20060329j28jpg': [0.5260772985705469,
  0.401962846126884],
 'millionfooted': [0.5260772985705469, 0.401962846126884],
 'lorado': [0.5260772985705469, 0.401962846126884],
 'quasar': [0.5260772985705469, 0.401962846126884],
 'horizons': [0.5260772985705469, 0.401962846126884],
 'romanov': [0.8582887163235293, 0.03188116723440362],
 'revolving': [0.

In [19]:
chisquare_pval=pd.DataFrame.from_dict(chi_squared_dict,orient='index',columns=['P_value','Chisquare Value'])

In [20]:
significance=chisquare_pval[chisquare_pval["P_value"]<0.05]

In [21]:
print(significance)

          P_value  Chisquare Value
composed  0.01628          5.77234


## Chi-squared results
Only one of the terms, *composed*, showed a significant difference in usage between high-value and low-value rows. However, the frequencies were all lower than 5, so the chi-squared test is not very reliable here. With 50 terms tested at the 0.05 level, one or two significant results could also easily arise by chance alone. It would be better to rerun the test using only terms with higher frequencies.
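
One way to pick higher-frequency terms is to count how many questions each term occurs in and keep only those above a threshold. The sketch below does this with `collections.Counter` on made-up questions; `frequent_terms`, `toy_questions` and the default threshold of 5 are assumptions for illustration, not part of the analysis above.

```python
from collections import Counter

def frequent_terms(questions, min_questions=5, min_len=6):
    """Return terms with at least min_len characters that occur in at
    least min_questions different questions, most common first."""
    doc_freq = Counter()
    for text in questions:
        # Use a set so a word repeated inside one question is counted once.
        doc_freq.update({w for w in text.split() if len(w) >= min_len})
    return [term for term, n in doc_freq.most_common() if n >= min_questions]

# Hypothetical, already-normalized questions.
toy_questions = [
    "this composer wrote fidelio",
    "a composer of famous operas",
    "famous rivers of europe",
]
print(frequent_terms(toy_questions, min_questions=2))
```

On the real data, something like `frequent_terms(jeopardy["clean_question"])` could replace the random sample in `comparison_terms`, although the right threshold depends on how large the expected counts need to be.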