# Introduction

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions.

# Cleaning the Data

In [1]:
import pandas as pd

data = pd.read_csv("jeopardy.csv")
data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
data.dtypes

Show Number     int64
 Air Date      object
 Round         object
 Category      object
 Value         object
 Question      object
 Answer        object
dtype: object

In [3]:
# Remove whitespace in columns
data.columns = data.columns.str.replace(" ", "")
data.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [4]:
# Function to normalize text

import re

def normal(text):
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    return text

In [5]:
# Normalize question column

data["clean_question"] = data["Question"].apply(normal)
data["clean_question"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [6]:
# Normalize answer column

data["clean_answer"] = data["Answer"].apply(normal)
data["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
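
As a quick check of the normalization step, the sketch below applies the same three operations (lowercase, strip punctuation, collapse whitespace) to a single made-up string. The function name `normalize_text` and the sample string are illustrative and not part of the notebook.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove anything that is not a word character or whitespace
    text = re.sub(r"\s+", " ", text)     # collapse repeated whitespace into a single space
    return text

sample = "No. 2:  1912 Olympian; football star"
print(normalize_text(sample))  # no 2 1912 olympian football star
```

Note that apostrophes are removed rather than replaced with a space, which is why `McDonald's` becomes `mcdonalds` in the output above.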

In [7]:
data["Value"].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389'], dtype=object)

In [8]:
# Function to change value column to numeric

def val_clean(value):
    value = value.replace("$", "")
    value = value.replace(",", "")
    value = value.replace("None", "0")
    value = int(value)
    return value

In [9]:
# Clean value column

data["clean_value"] = data["Value"].apply(val_clean)
data["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

In [10]:
# Convert the 'AirDate' column to datetime format

data['AirDate'] = pd.to_datetime(data['AirDate'])

In [11]:
data.dtypes

ShowNumber                 int64
AirDate           datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

# Investigating Repeat Words

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- **How often the answer can be deduced from the question.**
- **How often questions are repeated.**

You can answer the second question by seeing how often complex words (six or more characters) reoccur. You can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [12]:
# Function to compute the fraction of answer words that also appear in the question

def match(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    
    match_count = 0
    
    # If "the" is in answer, remove it as it doesn't help finding the answer.
    
    if "the" in split_answer: 
        split_answer.remove("the")
        
    if len(split_answer) == 0:
        return 0
    
    # Loop through each item in split_answer, 
    # and see if it occurs in split_question.
    
    for word in split_answer:
        if word in split_question:
            match_count += 1
            
    return match_count / len(split_answer)


In [13]:
data["answer in question"] = data.apply(match, axis = 1)
data["answer in question"].head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: answer in question, dtype: float64

In [14]:
data["answer in question"].mean()

0.05900196524977763

On average, only about 5.9% of the words in an answer also appear in its question. This isn't a huge number, and means that we probably can't just hope that hearing a question will enable us to figure out the answer. We'll probably have to study.
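
To make the metric concrete, here is a minimal sketch on made-up question and answer strings. The helper `answer_overlap` is hypothetical; it mirrors the `match` function above, except that it drops every occurrence of "the" from the answer rather than only the first.

```python
def answer_overlap(question, answer):
    """Fraction of answer words (ignoring 'the') that also appear in the question."""
    answer_words = [w for w in answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    question_words = set(question.split())
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# All inputs are assumed to be normalized already (lowercase, no punctuation).
print(answer_overlap("this state is home to the city of new york", "new york"))  # 1.0
print(answer_overlap("the city of yuma in this state has a record", "arizona"))  # 0.0
print(answer_overlap("this adams was the second president", "john adams"))       # 0.5
```

A score of 1.0 means every answer word appears in the question, and 0.0 means none do. The column mean reported above averages this score over all rows.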

Let's say you want to investigate how often new questions are repeats of older ones. You can't completely answer this, because you only have about 10% of the full Jeopardy question dataset, but you can at least investigate it by checking whether the terms in each question have been used in earlier questions. Only looking at words with six or more characters filters out words like "the" and "than", which are commonly used but don't tell you a lot about a question.

In [15]:
# Find repeating questions

question_overlap = []
terms_used = set()

data.sort_values(by='AirDate', ascending = True, inplace = True)

for i, row in data.iterrows():
    split_question = row["clean_question"].split()
    for word in split_question:
        if len(word) < 6:
            split_question.remove(word)
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count +=1
        else:
            terms_used.add(word)
            
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
    
data["question_overlap"] = question_overlap


data["question_overlap"].mean()

0.7990394226899685

On average, about 80% of the terms in a question had already appeared in an earlier question. This covers only a small sample of questions, and it compares single terms rather than phrases, so the figure says little on its own. Still, it suggests that the recycling of questions is worth a closer look.
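
One caveat about the length filter in the loop above: calling `remove` on a list while iterating over it makes the iterator skip the element that slides into the removed slot, so consecutive short words are not all removed. Short, common words that slip through are likely to have been seen before, so the overlap figure is probably inflated. The sketch below demonstrates the effect on a made-up sentence and shows a list comprehension that filters reliably; the sample text and variable names are illustrative.

```python
words = "in 1963 live on the art linkletter show this company".split()

# Removing while iterating: the loop skips the item after each removal.
filtered_in_place = list(words)
for word in filtered_in_place:
    if len(word) < 6:
        filtered_in_place.remove(word)

# Building a new list: every word is checked exactly once.
filtered_copy = [word for word in words if len(word) >= 6]

print(filtered_in_place)  # ['1963', 'on', 'art', 'linkletter', 'this', 'company']
print(filtered_copy)      # ['linkletter', 'company']
```

Swapping the comprehension into the loop would change the computed overlap, so the mean would need to be recalculated before drawing conclusions from it.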

# Low Value vs High Value Questions

Let's say you only want to study terms that appear in high value questions rather than low value questions. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:
- Low value: Any row where Value is 800 or less.
- High value: Any row where Value is greater than 800.
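
A minimal sketch of the split, on a handful of made-up dollar amounts. It assumes a value of exactly 800 is treated as low value, which is what the `high_low` function below does. The name `is_high_value` is illustrative.

```python
def is_high_value(value, threshold=800):
    """Return 1 for values strictly above the threshold, else 0."""
    return 1 if value > threshold else 0

# 0 stands in for a missing value (val_clean maps 'None' to 0),
# so rows without a dollar amount end up in the low-value group.
values = [200, 800, 1000, 0, 2000]
labels = [is_high_value(v) for v in values]
print(labels)  # [0, 0, 1, 0, 1]
```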

In [16]:
# Function to classify high and low value questions

def high_low(row):
    if row["clean_value"] > 800:
        return 1
    else:
        return 0

data["Value"] = data.apply(high_low, axis = 1)        
data["Value"].head()

19325    0
19301    0
19302    0
19303    0
19304    0
Name: Value, dtype: int64

In [19]:
# Function counting how many times a word appears in high and low value questions

def word_count(word):
    low_count = 0
    high_count = 0
    for i, row in data.iterrows():
        if word in row["clean_question"].split():
            if row["Value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
        

In [26]:
# Pick random terms in terms_used to test this

from random import choice

comparison_terms = []
term_list = list(terms_used)

for i in range (10):
    comparison_terms.append(choice(term_list))

comparison_terms

['rancic',
 'farmers',
 'hrefhttpwwwjarchivecommedia20071204_j_30jpg',
 'viceroyalty',
 'shertogengosch',
 'kupets',
 'deliverance',
 'safinas',
 'target_blankexamplea',
 'quothed']

In [27]:
observed_expected = []

for term in comparison_terms:
    observed_expected.append(word_count(term))
    
observed_expected

[(0, 1),
 (2, 8),
 (1, 0),
 (0, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1)]

Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.
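
As a worked example of that calculation, the sketch below computes the expected counts and the chi-squared statistic by hand for a single term, using the observed counts (2, 8) from the list above and the high-value and low-value row counts reported in the cells that follow (5734 and 14265). It is plain Python arithmetic rather than a call to `scipy.stats.chisquare`, and the variable names are illustrative.

```python
# Observed counts for one term: (high-value rows, low-value rows)
observed_high, observed_low = 2, 8

# Row counts for each class, as reported by the notebook
high_q_count, low_q_count = 5734, 14265
total_rows = high_q_count + low_q_count

# Expected counts if the term were spread evenly across all rows
total = observed_high + observed_low
total_prop = total / total_rows
expected_high = total_prop * high_q_count
expected_low = total_prop * low_q_count

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_sq = ((observed_high - expected_high) ** 2 / expected_high
          + (observed_low - expected_low) ** 2 / expected_low)

print(round(expected_high, 2), round(expected_low, 2))  # 2.87 7.13
print(round(chi_sq, 4))  # 0.3677
```

The statistic agrees with the second entry in the `chisquare` results further down, which is a useful sanity check on the expected-count formula.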

In [31]:
high_q_count = data[data["Value"] == 1]["Value"].count()
high_q_count

5734

In [32]:
low_q_count = data[data["Value"] == 0]["Value"].count()
low_q_count

14265

In [36]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []

for counts in observed_expected:
    total = sum(counts)
    total_prop = total / data["Value"].count()
    exp_high_count = total_prop * high_q_count
    exp_low_count = total_prop * low_q_count
    
    observed = np.array(counts)
    expected = np.array([exp_high_count, exp_low_count])
    
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.36767906209032747, pvalue=0.5442721040962595),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

# Chi-Squared Results

None of the terms showed a statistically significant difference in usage between high value and low value rows: every p-value is above 0.05. Additionally, nine of the ten terms appeared only once, and at least one expected count falls below 5 for every term, so the chi-squared approximation is unreliable here. It would be better to run this test only on terms that occur more frequently.
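
One way to act on that suggestion is to count how many questions each term appears in and keep only terms whose expected counts in both groups reach a chosen minimum. The sketch below is hypothetical: the toy questions, the helper `frequent_terms`, and the threshold of 5 (a common rule of thumb for chi-squared tests) are assumptions, not part of the notebook.

```python
from collections import Counter

def frequent_terms(questions, high_share, min_expected=5):
    """Return terms whose expected high and low counts both reach min_expected.

    questions: iterable of already-normalized question strings
    high_share: fraction of all rows that are high value (between 0 and 1)
    """
    counts = Counter()
    for question in questions:
        # Count each term once per question, matching a per-row tally
        counts.update(set(question.split()))

    # The smaller group determines the smaller expected count
    smaller_share = min(high_share, 1 - high_share)
    return sorted(term for term, n in counts.items()
                  if n * smaller_share >= min_expected)

toy_questions = ["this president signed the treaty"] * 20 + ["this river flows north"] * 3
print(frequent_terms(toy_questions, high_share=0.3))
# ['president', 'signed', 'the', 'this', 'treaty']
```

With the row counts reported above (5734 high-value rows out of 19999), a term would need to appear in at least 18 questions for its expected high-value count to reach 5.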