# Winning Jeopardy: A Statistical Analysis
## Data Introduction

In this project, titled **Winning Jeopardy**, the focus is on analyzing the American TV show Jeopardy. This program, a notable fixture in U.S. popular culture, is well-known for its challenging trivia questions and the appeal of cash prizes. The objective is to identify patterns within the questions that may provide contestants with a strategic advantage.

The analysis utilizes a dataset named `jeopardy.csv`, which includes 20,000 rows sampled from a comprehensive collection of Jeopardy questions. Each row represents a single question from a specific episode of the show, reflecting the diverse and complex nature of the game's format. The dataset is available for download at [this link https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). 

The dataset includes the following key columns:

- **Show Number**: A unique identifier for each Jeopardy episode.
- **Air Date**: The date when the episode was broadcast.
- **Round**: The round of Jeopardy during which the question was asked.
- **Category**: The thematic category of the question.
- **Value**: The monetary value awarded for a correct answer.
- **Question**: The text of the trivia question posed to the contestants.
- **Answer**: The correct response to the question.

The goal of this analysis is to uncover trends and insights that could inform strategies for succeeding on Jeopardy.

In [2]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import re
import random
from scipy.stats import chisquare

In [4]:
# Read the dataset into a DataFrame called jeopardy using Pandas.
jeopardy = pd.read_csv("jeopardy.csv")

In [6]:
# Display the columns of jeopardy using jeopardy.columns.
jeopardy.columns


Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [8]:
# Some of the column names have spaces in front. Remove the spaces and assign the result back to jeopardy.columns
jeopardy.columns = ["Show Number", "Air Date", "Round", "Category", "Value", "Question", "Answer"]


In [10]:
# Define a function to normalize questions and answers.
def normalize_text(text):
    text = text.lower()
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    text = re.sub("\s+", " ", text)
    return text

def normalize_values(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text


In [12]:
# Normalize the Question and Answer columns.
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)


In [14]:
# Normalize the Value column and convert the Air Date column to a datetime column.
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])


In [16]:
# Display jeopardy
jeopardy

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
...,...,...,...,...,...,...,...,...,...,...
19994,3582,2000-03-14,Jeopardy!,U.S. GEOGRAPHY,$200,"Of 8, 12 or 18, the number of U.S. states that...",18,of 8 12 or 18 the number of us states that tou...,18,200
19995,3582,2000-03-14,Jeopardy!,POP MUSIC PAIRINGS,$200,...& the New Power Generation,Prince,the new power generation,prince,200
19996,3582,2000-03-14,Jeopardy!,HISTORIC PEOPLE,$200,In 1589 he was appointed professor of mathemat...,Galileo,in 1589 he was appointed professor of mathemat...,galileo,200
19997,3582,2000-03-14,Jeopardy!,1998 QUOTATIONS,$200,"Before the grand jury she said, ""I'm really so...",Monica Lewinsky,before the grand jury she said im really sorry...,monica lewinsky,200


In [235]:
# Display data types
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

In [18]:
# Define a function to check how often words in the answer also occur in the question.
def match_calculation(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()

    # Remove 'the' from split_answer.
    if "the" in split_answer:
        split_answer.remove("the")

    # If the length of split_answer is 0, return 0 to prevent division by zero error.
    if len(split_answer) == 0:
        return 0
    
    match_count = 0
    
    # Loop through each word in split_answer and check if it occurs in split_question.
    for item in split_answer:
        if item in split_question:
            match_count += 1

    # Divide match_count by the length of split_answer and return the result.
    return match_count / len(split_answer)


In [20]:
# Apply the match_calculation function to each row in jeopardy.
jeopardy["answer_in_question"] = jeopardy.apply(match_calculation, axis=1)


In [22]:
# Display the mean of the answer_in_question column.
jeopardy["answer_in_question"].mean()


0.05900196524977763

## Recycled Questions  
On average, the answer accounts for only about 6% of the question. This is a relatively small portion, suggesting that simply hearing a question isn't enough to reliably identify the correct answer. It implies that studying will likely be necessary.

In [24]:
# Check how often new questions are repeats of older ones.
question_overlap = []
terms_used = set()


In [26]:
# Sort jeopardy by the "Air Date" column.
jeopardy = jeopardy.sort_values("Air Date")


In [28]:
# Iterate through each row of jeopardy to calculate question overlap.
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [question for question in split_question if len(question) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)


In [30]:
# Add question_overlap column to jeopardy.
jeopardy["question_overlap"] = question_overlap


In [32]:
# Find the mean of the question_overlap column.
jeopardy["question_overlap"].mean()


0.6876260592169802

## Low Value vs. High Value Questions

Approximately `70%` of the terms in new questions also appear in previous questions. This analysis is based on a limited sample and focuses on individual terms rather than entire phrases, making its significance somewhat limited. However, this overlap suggests that further exploration into the repetition of questions could be valuable.

In [243]:
# Define a function to categorize questions as high or low value.
def categorize_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value


In [36]:
# Apply the categorize_value function to each row in jeopardy.
jeopardy["high_value"] = jeopardy.apply(categorize_value, axis=1)


In [38]:
# Define a function to count the usage of a word in high and low value questions.
def count_usage(word):
    low_count = 0
    high_count = 0

    for index, row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1

    return high_count, low_count


In [40]:
# Randomly pick ten elements of terms_used for comparison.
terms_used_list = list(terms_used)
comparison_terms = [random.choice(terms_used_list) for _ in range(10)]

observed_expected = []


In [42]:
# Loop through each term in comparison_terms and append the result of count_usage to observed_expected.
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected


[(1, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 1)]

In [44]:
observed_expected

[(1, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (0, 1),
 (1, 0),
 (0, 1),
 (0, 2),
 (0, 1),
 (0, 1)]

In [47]:
# Find the number of rows in jeopardy where high_value is 1 and 0.
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []


In [49]:
# Loop through each list in observed_expected and compute the chi-squared value and p-value.
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

print(chi_squared)


[Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]


In [251]:
print(chi_squared)

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469), Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]


## Chi-Squared Analysis

There were no notable differences in term usage between high-value and low-value rows. Additionally, the chi-squared test's reliability is uncertain because all frequencies were below `5`. It would be more suitable to perform this test again with terms that have higher frequencies.