# Winning Jeopardy Project

Jeopardy is a popular TV show in the US where participants answer questions to win money. It has been running for many years and is a major force in popular culture. The rules for playing Jeopardy are [here](https://tag.rutgers.edu/wp-content/uploads/2014/05/Jeopardy-instructions.pdf).

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named `jeopardy.csv` and contains about 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).


Column descriptions:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import random

from scipy.stats import chisquare
from scipy.stats import chi2_contingency


In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


# Normalizing Text

In [3]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [4]:
columns_list = list(jeopardy.columns)

for column in range(len(columns_list)):
    columns_list[column] = columns_list[column].strip()
jeopardy.columns = columns_list
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Normalizing Columns

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [6]:
def normalize_text(text):
    text = text.lower()
    # Remove punctuation: keep only letters, digits, and whitespace
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text
    
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
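
As a quick check of what the normalization does, the sketch below applies the same steps (lowercase, strip punctuation, tidy whitespace) to two strings taken from the preview above. The name `normalize_text_demo` is made up for this illustration; it is meant to mirror `normalize_text`, not to replace it.

```python
import re

def normalize_text_demo(text):
    # Hypothetical stand-in for normalize_text, for illustration only.
    text = text.lower()
    # Drop everything except letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Collapse runs of whitespace into one space.
    text = re.sub(r"\s+", " ", text)
    return text

samples = ["McDonald's", "Signer of the Dec. of Indep."]
cleaned = [normalize_text_demo(s) for s in samples]
print(cleaned)
```

The apostrophe and the periods disappear while the words and their order are kept, which is what the later word-by-word comparisons rely on.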

In [7]:
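def normalize_value(text):
    # Strip the dollar sign and other punctuation, e.g. "$200" -> "200"
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        # Non-numeric entries are treated as 0
        text = 0
    return text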

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [9]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


In [10]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Answers in Questions

To decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the first question by seeing how many times words in the answer also occur in the question.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur.

## How often can the answer be deduced from the question?

In [11]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0

    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    return match_count/len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count_matches, axis=1)

In [12]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

In the operations above, we measured what fraction of the words in each answer also appear in its question.

On average, only about 6% of the words in an answer occur in the question, so it is rarely possible to work out an answer just from hearing the question. This means we will probably need to study in order to answer Jeopardy questions.
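
To make the ratio concrete, here is a minimal sketch of the same calculation on two made-up rows. The helper `match_ratio` and both example strings are assumptions for illustration; they are not taken from `jeopardy.csv`.

```python
def match_ratio(clean_answer, clean_question):
    # Hypothetical helper mirroring count_matches on plain strings.
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    # Like count_matches, drop one occurrence of "the" from the answer.
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    hits = sum(word in question_words for word in answer_words)
    return hits / len(answer_words)

# Made-up rows, for illustration only.
no_overlap = match_ratio("arizona", "the city of yuma in this state")
half_overlap = match_ratio("the red sea", "this sea separates africa from arabia")
print(no_overlap, half_overlap)
```

In the second row only "sea" is shared, so one of the two remaining answer words matches. A dataset-wide mean near 0.06 says that rows like the first one are far more common.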


## How often are questions repeated?

In [13]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values(by='Air Date')

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)

    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6876260592169802

In the loop above, we dropped all words shorter than 6 characters so that only longer, more specific words are compared.

On average, about 69% of the long words in a question have already appeared in an earlier question. This looks only at single words and not at phrases. Since our dataset contains only about 10% of the full set of questions, the result is far from conclusive, but it does suggest that looking into how questions are recycled is worthwhile.
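
The overlap calculation can be sketched on a tiny example. The helper `overlap_by_question` and the three questions below are made up for illustration, and the list is assumed to be in air-date order already.

```python
def overlap_by_question(questions, min_length=6):
    # Hypothetical helper; questions must already be in air-date order.
    seen = set()
    ratios = []
    for question in questions:
        # Keep only the longer words, as in the loop above.
        words = [w for w in question.split() if len(w) >= min_length]
        # Count words that appeared in an earlier question.
        repeated = sum(w in seen for w in words)
        seen.update(words)
        ratios.append(repeated / len(words) if words else 0)
    return ratios

# Made-up questions, for illustration only.
toy_questions = [
    "this planet is closest to the sun",
    "the largest planet in the solar system",
    "this closest neighbor galaxy is andromeda",
]
ratios = overlap_by_question(toy_questions)
print(ratios)
```

The first question always scores 0 because nothing has been seen yet, which is one reason the ordering by `Air Date` matters.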

# Low Value vs High Value Questions

In [14]:
def determine_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(determine_value, axis=1)

In [23]:
def count_usage(term):
    low_count = 0
    high_count = 0
    
    for index, row in jeopardy.iterrows():
        if term in row['clean_question'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count            

In [24]:
terms_used

{'territorynunavut',
 'impulsive',
 'motorwagen',
 'leaders',
 'billboard',
 'liberata',
 'concoction',
 'intake',
 'cytosine',
 'associates',
 'highwire',
 'hoekstras',
 'garbage',
 'colter',
 'schultz',
 'whistlestop',
 'alcatraz',
 'comets',
 'bianco',
 'founded',
 'consisted',
 'sheryl',
 'beggars',
 'liquids',
 'actons',
 'instructed',
 'fabriano',
 'speechless',
 '12letter',
 'halleys',
 'langera',
 'lipton',
 'hrefhttpwwwjarchivecommedia20010718j12jpg',
 'romano',
 'siddim',
 'belmont',
 'babysitters',
 'carpet',
 '125000',
 'miserables',
 'targetblankhanda',
 'hrefhttpwwwjarchivecommedia20050502dj18jpg',
 'mcneese',
 'targetblankputs',
 '900000',
 'luhsootoo',
 'robbie',
 'wilhelm',
 'replay',
 'disappear',
 'anagram',
 'ingest',
 'conways',
 'website',
 'donelson',
 'dominata',
 'sniffling',
 'hrefhttpwwwjarchivecommedia20060601j24jpg',
 'youngest',
 'stimulate',
 'butchart',
 '6pronged',
 'neuman',
 'scottsboro',
 'progressive',
 'abroad',
 'subsidence',
 'gauche',
 'valiant'

In [25]:
count_usage('schultz')

(1, 0)

In [26]:
terms_used_list = list(terms_used)

random.seed(42)
comparison_terms = [random.choice(terms_used_list) for i in range(10)]
print("The random sampled terms are:", comparison_terms)

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))
    
observed_expected

[(0, 1),
 (0, 1),
 (1, 1),
 (0, 1),
 (5, 6),
 (1, 2),
 (2, 8),
 (1, 3),
 (1, 1),
 (0, 1)]

In [27]:
high_value_count = jeopardy[jeopardy['high_value']==1].shape[0]
low_value_count = jeopardy[jeopardy['high_value']==0].shape[0]

chi_squared = []

for element in observed_expected:
    total = sum(element)
    total_prop = total / len(jeopardy.index)
    expected_high_value_count = total_prop * high_value_count
    expected_low_value_count = total_prop * low_value_count

    observed = np.array([element[0], element[1]])
    expected = np.array([expected_high_value_count, expected_low_value_count])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.5150423082236086, pvalue=0.21837128417807639),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]
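
To see where these numbers come from, the sketch below recomputes the test for a term that appears once, in a low value question, using only the standard library. The counts 5734 and 14265 are an assumption: the notebook never prints `high_value_count` or `low_value_count`, and these values were picked because they sum to the 19999 rows reported earlier and appear consistent with the statistic shown above. The function name `chi_squared_1dof` is also made up for this illustration.

```python
import math

def chi_squared_1dof(observed, expected):
    # Pearson statistic: sum of (observed - expected)^2 / expected.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With two categories there is 1 degree of freedom, where the
    # chi-squared survival function reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assumed counts: not printed anywhere in the notebook.
high_value_count = 5734
low_value_count = 14265
n_rows = high_value_count + low_value_count

observed = (0, 1)  # (high_count, low_count) for a term seen once
total_prop = sum(observed) / n_rows
expected = (total_prop * high_value_count, total_prop * low_value_count)

statistic, p_value = chi_squared_1dof(observed, expected)
print(round(statistic, 6), round(p_value, 6))
```

The expected counts here are far below 5, and the chi-squared approximation is generally considered unreliable at such low frequencies. None of the sampled terms reaches a conventional significance threshold, and testing only terms with higher counts would give more trustworthy results.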

In [33]:
# Transform the observed_expected values to a dataframe.
# count_usage returns (high_count, low_count), so the columns follow that order.
results = pd.DataFrame(observed_expected,
                       index=comparison_terms,
                       columns=['High value count', 'Low value count']
                       )

# Adding the chi-square and p-value results as columns
results['Chi'] = chi_squared
results[['Chi Square', 'p value']] = pd.DataFrame(results.Chi.tolist(), index=results.index)
results.drop('Chi', axis=1, inplace=True)

# Display results
results

Unnamed: 0,High value count,Low value count,Chi Square,p value
winglike,0,1,0.401963,0.526077
fleamail,0,1,0.401963,0.526077
sixteen,1,1,0.444877,0.504778
gizzards,0,1,0.401963,0.526077
direct,5,6,1.515042,0.218371
heating,1,2,0.031881,0.858289
covering,2,8,0.367679,0.544272
slugger,1,3,0.026364,0.871013
pigments,1,1,0.444877,0.504778
shootout,0,1,0.401963,0.526077
