# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in US popular culture.

In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help a prospective candidate win.

The dataset has the following columns:
1. Show Number - the Jeopardy episode number
2. Air Date - the date the episode aired
3. Round - the round of Jeopardy
4. Category - the category of the question
5. Value - the number of dollars the correct answer is worth
6. Question - the text of the question
7. Answer - the text of the answer

Let's read and review the dataset.

In [1]:
import warnings
warnings.simplefilter(action='ignore')
import numpy as np
import pandas as pd
import re
import random
from scipy.stats import chisquare
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt

In [2]:
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


Let's clean the column names in the jeopardy dataset.

In [3]:
old_cols = [i for i in jeopardy.columns]
old_cols

['Show Number',
 ' Air Date',
 ' Round',
 ' Category',
 ' Value',
 ' Question',
 ' Answer']

In [4]:
new_cols = [i.strip() for i in old_cols]
new_cols

['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [5]:
col_names = dict(zip(old_cols, new_cols))

col_names

{'Show Number': 'Show Number',
 ' Air Date': 'Air Date',
 ' Round': 'Round',
 ' Category': 'Category',
 ' Value': 'Value',
 ' Question': 'Question',
 ' Answer': 'Answer'}

In [6]:
jeopardy.rename(columns=col_names, inplace=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


We'll need to clean the dataset further to reflect appropriate datatypes.

In [7]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


### Normalizing Text
Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (i.e., the Question and Answer columns). We'll write a function that converts the text to lowercase and replaces punctuation with spaces, so that words differing only in case or punctuation are treated as the same word.

We'll apply the function with the apply() method and store the results in new columns.

In [8]:
def normalize_text(row_string):
    # Replace each non-word character with a space (raw string avoids an invalid-escape warning)
    row_string = re.sub(r'\W', ' ', row_string)
    row_string = row_string.lower()
    return row_string

In [9]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams
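
To see what the normalization does to individual strings, here is a small standalone sketch. It redefines `normalize_text` so that it runs on its own; the sample strings are illustrative, and the `tokens` helper is hypothetical rather than part of the notebook.

```python
import re

def normalize_text(row_string):
    # Replace every non-word character (anything but letters, digits, underscore) with a space
    row_string = re.sub(r'\W', ' ', row_string)
    return row_string.lower()

def tokens(text):
    # Hypothetical helper: str.split() with no arguments collapses runs of spaces
    return normalize_text(text).split()

print(normalize_text("McDonald's"))                          # mcdonald s
print(tokens('In 1963, live on "The Art Linkletter Show"'))
```

Because punctuation becomes a space instead of being deleted, an apostrophe splits a word in two (`mcdonald s`), which is why the later steps always call `.split()` before comparing words.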


### Normalizing Columns
We'll also use a function to normalize the Value column and convert the Air Date column into datetime type.

In [10]:
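def normalize_values(row_string):
    # Replace punctuation such as '$' with a space, then convert to int.
    # Strings that int() cannot parse (for example, ones left with an
    # internal space after the substitution) fall back to 0.
    try:
        row_int = int(re.sub(r'\W', ' ', row_string))
        return row_int
    except (ValueError, TypeError):
        return 0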

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200


In [11]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB
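
One detail of `normalize_values` is worth checking. Because punctuation is replaced with a space rather than removed, a value string containing a thousands separator would fail `int()` and fall back to 0. The sketch below shows this on illustrative strings (whether such strings occur in this dataset is an assumption) alongside a hypothetical `parse_value` variant that deletes non-digit characters instead.

```python
import re

def normalize_values(row_string):
    # Same logic as the notebook cell: punctuation becomes a space, failures become 0
    try:
        return int(re.sub(r'\W', ' ', row_string))
    except (ValueError, TypeError):
        return 0

def parse_value(row_string):
    # Hypothetical variant: drop every non-digit character before converting
    digits = re.sub(r'\D', '', str(row_string))
    return int(digits) if digits else 0

# Illustrative inputs only; they are not taken from the dataset
for raw in ['$200', '$2,000', 'None']:
    print(raw, normalize_values(raw), parse_value(raw))
```

Switching to a variant like this would change `clean_value` for any affected rows, and with it the high-value and low-value counts computed later in the notebook.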


### Answers in Questions
When preparing for Jeopardy, we want to use our study time efficiently. To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:
- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the second question by finding out how often complex words (i.e., six or more characters) recur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [12]:
def answer_in_question(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    match_count = 0
    try:
        split_answer.remove('the')
    except ValueError:
        pass

    if len(split_answer) == 0:
        return 0
    else:
        for item in split_answer:
            if item in split_question:
                match_count += 1

    result = match_count / len(split_answer)
    return result

In [13]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,200,0.0


Let's find the mean of the answer_in_question column.

In [14]:
jeopardy['answer_in_question'].mean()

0.06294645581984949

On average, only about 6% of the words in an answer also appear in its question, so relying on the question text to supply the answer is unlikely to be a helpful study strategy.
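
To make the score concrete, here is a compact standalone version of the same calculation applied to a single made-up row. The row contents are invented for illustration, and a plain dict stands in for a DataFrame row.

```python
def answer_in_question(row):
    # Same logic as the notebook's function, applied to a plain dict
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    if 'the' in split_answer:
        split_answer.remove('the')  # drops only the first 'the', as in the notebook
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    return match_count / len(split_answer)

# Made-up row for illustration only
row = {
    'clean_question': 'this grand landmark in arizona was carved by the colorado river',
    'clean_answer': 'the grand canyon',
}
print(answer_in_question(row))  # 0.5
```

After 'the' is dropped, one of the two remaining answer words ('grand') appears in the question, giving 0.5. A column mean of about 0.06 therefore means that very few answer words show up in their questions.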

### Recycled Questions
Another study strategy is to investigate how often new questions are repeats of older ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can investigate it using our existing sample.

In order to do this, we'll:
- Sort jeopardy by the Air Date column in ascending order.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used. If it does, increment a counter.
- Add each word to terms_used.

This will allow us to check whether the terms in questions have been used previously. Only looking at words with six or more characters enables us to filter out words like 'the' and 'than', which are commonly used but don't tell us much about the question.
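
The steps above can be sketched on a toy list before running them on the full DataFrame. The function name `term_overlap`, its `min_len` parameter, and the three sample questions are all hypothetical; the logic follows the list of steps.

```python
def term_overlap(clean_questions, min_len=6):
    # clean_questions: normalized question strings, ordered oldest first
    terms_used = set()
    overlaps = []
    for question in clean_questions:
        # Keep only words with at least min_len characters
        words = [w for w in question.split() if len(w) >= min_len]
        # Count words already seen in earlier questions, then record the new ones
        seen = sum(1 for w in words if w in terms_used)
        terms_used.update(words)
        overlaps.append(seen / len(words) if words else 0)
    return overlaps

questions = [
    'this president signed the declaration',
    'the declaration was signed in this city',
    'a river',
]
print(term_overlap(questions))  # [0.0, 1.0, 0]
```

The first question has no earlier terms to match, the second reuses both of its long words, and the third has no long words at all, so its overlap is recorded as 0.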

We'll begin by creating an empty list and an empty set.

In [15]:
question_overlap = list()
terms_used = set()

Now, we'll sort jeopardy by air date in ascending order.

In [16]:
jeopardy.sort_values(by='Air Date', ascending=True, inplace=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride...,theodore roosevelt,0,0.0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,$200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,200,0.0
19302,10,1984-09-21,Double Jeopardy!,1789,$200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first...,thanksgiving,200,0.0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,$200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug thi...,the grand canyon,200,0.0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,$200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones ...,tom,200,0.0


We'll loop through each row in the dataset to determine whether any questions were recycled.

In [17]:
for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    split_question = [i for i in split_question if len(i) >= 6]

    match_count = 0
    for item in split_question:
        if item in terms_used:
            match_count += 1

    for item in split_question:
        terms_used.add(item)

    if len(split_question) > 0:
        match_count /= len(split_question)

    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.7197989717809739

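On average, about 72% of the terms (six or more characters) in a question have already appeared in earlier questions. This measures overlap of single words rather than whole repeated questions, but it suggests that reviewing past Jeopardy questions before competing is a useful strategy.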

### Low Value vs High Value Questions
We may want to consider a strategy that involves only studying higher-value questions. This could help us earn more money in the game.

We can identify which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:
- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

Then we'll loop through each of the terms from the terms_used set, and:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.

Based on the percentage of questions the word occurs in, we'll find expected counts and compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

We'll then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values.

In [18]:
jeopardy['clean_value'].head()

19325      0
19301    200
19302    200
19303    200
19304    200
Name: clean_value, dtype: int64

In [19]:
def high_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(high_value, axis=1)

In [20]:
def value_count(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split():
            if row['high_value'] == 1:
                high_count += 1
            elif row['high_value'] == 0:
                low_count += 1

    return high_count, low_count


We'll randomly pick 10 elements from the terms_used set and store them in a list called comparison_terms. Then we'll create an empty list called observed_expected. We'll loop through each term in comparison_terms, and:
- Run the value_count function on the term to get the high value and low value counts.
- Append the result (a tuple of the two counts) to observed_expected.

In [21]:
# sample() draws without replacement, so the 10 terms are distinct
comparison_terms = random.sample(list(terms_used), k=10)
comparison_terms

['attempted',
 'drinker',
 'sublicius',
 'archibald',
 'zigzag',
 'neuchatel',
 'floors',
 'carton',
 'hassock',
 'scotty']

In [23]:
observed_expected = list()
for item in comparison_terms:
    observed_expected.append(value_count(item))

observed_expected

[(2, 2),
 (0, 3),
 (0, 1),
 (0, 3),
 (0, 1),
 (0, 1),
 (2, 1),
 (0, 1),
 (0, 1),
 (1, 1)]

### Applying the Chi-squared Test
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [24]:
high_value_count = len(jeopardy.loc[jeopardy['high_value'] == 1])
low_value_count = len(jeopardy.loc[jeopardy['high_value'] == 0])

print(high_value_count, low_value_count)

4972 15027


In [25]:
chi_squared = list()
for item in observed_expected:
    total = sum(item)
    total_prop = total / len(jeopardy)
    exp_count_high_value = total_prop * high_value_count
    exp_count_low_value = total_prop * low_value_count

    observed = np.array([item[0], item[1]])
    expected = np.array([exp_count_high_value, exp_count_low_value])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=1.353196118801657, pvalue=0.2447201432712674),
 Power_divergenceResult(statistic=0.9926132960670793, pvalue=0.31910449982424866),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.9926132960670793, pvalue=0.31910449982424866),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=2.80672372637985, pvalue=0.09386990525628017),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.3308710986890265, pvalue=0.565146603267378),
 Power_divergenceResult(statistic=0.6765980594008285, pvalue=0.4107606373026975)]

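Unfortunately, none of the selected words has a p-value below 5%, so none of the differences is statistically significant: there is no strong evidence that any of these words is associated with high-value questions. Each word also appears in only one to four questions, and the chi-squared test is unreliable when counts are this low (expected counts below 5), so terms with higher frequencies would be needed for a meaningful test.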
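
To show where one of these statistics comes from, the sketch below reproduces the result for the observed pair (2, 1) by hand, using only the standard library. The counts 4972 and 15027 are the ones printed earlier. The p-value line relies on the fact that, with one degree of freedom, the chi-squared upper tail equals `erfc(sqrt(x / 2))`.

```python
import math

high_value_count, low_value_count = 4972, 15027  # counts printed by the notebook
observed_high, observed_low = 2, 1               # one observed pair from the output above

total_rows = high_value_count + low_value_count
total = observed_high + observed_low

# Expected counts if the term were spread across rows in proportion to group size
expected_high = total / total_rows * high_value_count
expected_low = total / total_rows * low_value_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Upper-tail probability for one degree of freedom
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(expected_high, 3), round(expected_low, 3))  # 0.746 2.254
print(round(statistic, 4), round(p_value, 4))           # 2.8067 0.0939
```

These values match the corresponding `Power_divergenceResult` entry, which confirms that `chisquare` is comparing the observed pair against expected counts scaled by group size.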

### Next Steps
Here are some potential next steps:
- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset instead of the subset we used in this lesson.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
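
As one possible starting point for the speed-up mentioned above, the counts for every term can be collected in a single pass instead of rescanning the dataset once per word. This is a sketch under assumptions: `count_terms_by_value`, its `min_len` parameter, and the sample rows are hypothetical, and the input is a list of `(clean_question, high_value)` pairs rather than the DataFrame itself.

```python
from collections import Counter

def count_terms_by_value(rows, min_len=6):
    # rows: iterable of (clean_question, high_value) pairs, where high_value is 0 or 1
    high_counts, low_counts = Counter(), Counter()
    for clean_question, high_value in rows:
        # A set counts each term once per question, like the membership test in value_count
        terms = {w for w in clean_question.split() if len(w) >= min_len}
        (high_counts if high_value == 1 else low_counts).update(terms)
    return high_counts, low_counts

# Made-up rows for illustration only
rows = [
    ('this archipelago includes mindanao', 1),
    ('this archipelago is in the pacific', 0),
    ('pacific island nation', 0),
]
high_counts, low_counts = count_terms_by_value(rows)
print(high_counts['archipelago'], low_counts['archipelago'])  # 1 1
print(high_counts['pacific'], low_counts['pacific'])          # 0 2
```

With the notebook's data, the pairs could presumably be built with `zip(jeopardy['clean_question'], jeopardy['high_value'])`. Terms could then be filtered by total frequency (`high_counts[t] + low_counts[t]`) before running the chi-squared test on the ones that remain.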
