# Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture.

This project will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help a participant win.

The dataset is named `jeopardy.csv`, and contains `20000` rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Here are explanations of each column:

* `Show Number` -- the Jeopardy episode number of the show this question was in.
* `Air Date` -- the date the episode aired.
* `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* `Category` -- the category of the question.
* `Value` -- the number of dollars answering the question correctly is worth.
* `Question` -- the text of the question.
* `Answer` -- the text of the answer.

## Introduction to the Dataset

In [1]:
import pandas as pd
import re

jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
jeopardy.columns = jeopardy.columns.str.lstrip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalising Text

In [4]:
def normalise_str(s):
    s = s.lower()
    s = re.sub(r'[^\w\s]','',s)
    return s

In [5]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalise_str)
jeopardy['clean_question'].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [6]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalise_str)
jeopardy['clean_answer'].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
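
To make the effect of `normalise_str` easier to see, here is a small standalone sketch that applies the same function to a few strings. The hyphenated input is a made-up example, not a row from `jeopardy.csv`; it is included to show that removing punctuation joins the two halves of a hyphenated word.

```python
import re

def normalise_str(s):
    # Lowercase, then drop every character that is neither a word
    # character (letter, digit, underscore) nor whitespace.
    s = s.lower()
    return re.sub(r'[^\w\s]', '', s)

samples = [
    "McDonald's",                           # apostrophe is removed
    "No. 2: 1912 Olympian; football star",  # '.', ':' and ';' are removed
    "well-known",                           # hypothetical input: hyphen is removed
]

for text in samples:
    print(repr(text), '->', repr(normalise_str(text)))
```

Because the hyphen is deleted rather than replaced with a space, `well-known` becomes the single token `wellknown`, which affects any later word-by-word comparison.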

## Normalising Columns

In [7]:
def normalise_value(s):
    s = s[1:]
    s = re.sub(r'[^\w\s]','',s)
    try:
        return int(s)
    except ValueError:
        return 0

jeopardy['clean_value'] = jeopardy['Value'].apply(normalise_value)
jeopardy['clean_value'].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
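
The sketch below runs `normalise_value` on a few hand-written strings to show which inputs fall through to the `except` branch. The inputs are illustrative assumptions about what the `Value` column may contain, not values read from the file.

```python
import re

def normalise_value(s):
    s = s[1:]                      # drop the first character, assumed to be '$'
    s = re.sub(r'[^\w\s]', '', s)  # drop punctuation such as a thousands comma
    try:
        return int(s)
    except ValueError:
        # Anything that is not a plain integer after cleaning becomes 0.
        return 0

# Hypothetical inputs chosen to exercise each path through the function.
for raw in ['$200', '$2,000', 'None']:
    print(raw, '->', normalise_value(raw))
```

Note that the function slices its argument, so it expects a string; a missing value stored as a float `NaN` would raise a `TypeError` before reaching the `try` block.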

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
jeopardy['Air Date'].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: Air Date, dtype: datetime64[ns]

## Answer the Questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

In [9]:
def matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for x in split_answer:
        if x in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(matches, axis=1)
jeopardy['answer_in_question'].head(20)

0     0.000000
1     0.000000
2     0.000000
3     0.000000
4     0.000000
5     0.000000
6     0.000000
7     0.000000
8     0.000000
9     0.333333
10    0.000000
11    0.000000
12    0.000000
13    0.000000
14    0.500000
15    0.000000
16    0.000000
17    0.000000
18    0.000000
19    0.000000
Name: answer_in_question, dtype: float64
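
As a rough check of what `matches` returns, the sketch below calls an equivalent function on three plain dictionaries standing in for DataFrame rows. The question and answer strings are invented for illustration and do not come from the dataset.

```python
def matches(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    # Ignore one occurrence of 'the', which would match almost any question.
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = sum(1 for word in split_answer if word in split_question)
    # Fraction of the remaining answer words that also occur in the question.
    return match_count / len(split_answer)

# Hypothetical rows, written as dicts so the sketch runs without pandas.
rows = [
    {'clean_question': 'this king of england signed the magna carta',
     'clean_answer': 'king john of england'},
    {'clean_question': 'this band recorded abbey road',
     'clean_answer': 'the beatles'},
    {'clean_question': 'the most common english word',
     'clean_answer': 'the'},
]

for row in rows:
    print(row['clean_answer'], '->', matches(row))
```

The third row shows why the length check is needed: once `the` is removed, the answer has no words left, and the function returns `0` instead of dividing by zero.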

In [10]:
mean_answer_in_question = jeopardy['answer_in_question'].mean()
print(mean_answer_in_question)

0.06049325706933587

On average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question text alone.


## Recycled Questions

In [11]:
question_overlap = list()
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [s for s in split_question if len(s) >= 6]
    match_count = 0
    for x in split_question:
        if x in terms_used:
            match_count += 1
        terms_used.add(x)
    if len(split_question) > 0:
        question_overlap.append(match_count / len(split_question))

jeopardy['question_overlap'] = pd.Series(question_overlap)
mean_question_overlap = jeopardy['question_overlap'].mean()

print(mean_question_overlap)

0.707450713451737
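
One detail of cell `In [11]` is worth spelling out: a value is appended to `question_overlap` only when a question has at least one word of six or more characters, so the list can be shorter than the DataFrame, and `pd.Series(question_overlap)` is then lined up by position rather than by the row it was computed from. The sketch below is one possible variant, not the notebook's method, that records a value for every question. It uses a short made-up list of questions instead of the dataset.

```python
# Hypothetical cleaned questions; the second has no word of 6+ characters.
questions = [
    'galileo studied the planets',
    'who is he',
    'galileo observed jupiter',
]

terms_used = set()
question_overlap = []

for question in questions:
    words = [w for w in question.split(' ') if len(w) >= 6]
    match_count = 0
    for w in words:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    # Append for every question, using 0 when no long words remain,
    # so that position i in the list always refers to question i.
    question_overlap.append(match_count / len(words) if words else 0)

print(len(question_overlap) == len(questions))
print(question_overlap)
```

With one entry per row, the list could be assigned directly as a column without any shift between rows and values. Whether skipped rows should count as `0` or as missing is a design choice; `0` is assumed here.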


## Low Value vs High Value Questions

* Low value -- Any row where `Value` is `800` or less.
* High value -- Any row where `Value` is greater than `800`.

In [12]:
def is_high(row):
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
jeopardy['high_value'] = jeopardy.apply(is_high, axis=1)
jeopardy['high_value'].head()

0    0
1    0
2    0
3    0
4    0
Name: high_value, dtype: int64
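
The row-by-row `apply` above can also be written as a single column comparison. The following is a minimal sketch of that alternative on a small hypothetical DataFrame; the `clean_value` numbers are made up, and the result is intended to match what `is_high` would return for the same values.

```python
import pandas as pd

# Hypothetical values, not taken from jeopardy.csv.
df = pd.DataFrame({'clean_value': [200, 800, 1000, 0, 2000]})

# Compare the whole column at once, then cast True/False to 1/0.
df['high_value'] = (df['clean_value'] > 800).astype(int)

print(df)
```

A value of exactly `800` is not greater than `800`, so it is labelled `0` here, the same as in `is_high`.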

In [13]:
def counts(w):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if w in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

observed_expected = list()
comparison_terms = list(terms_used)[:5]

for term in comparison_terms:
    observed_expected.append(counts(term))
    
print(observed_expected)

[(0, 1), (1, 0), (1, 4), (0, 1), (0, 3)]


## Applying the Chi-Squared Test

In [14]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
chi_squared = list()

for x in observed_expected:
    total = sum(x)
    total_prop = total / jeopardy.shape[0]
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    observed = np.array([x[0], x[1]])
    expected = np.array([expected_high, expected_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.18383953104516373, pvalue=0.6680941623250602),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766902047)]
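
None of the five p-values above is below 0.05, and each term occurs only one to five times, so the expected counts are well under the value of 5 that is commonly recommended for the chi-squared approximation to be reliable. To show where the statistic and p-value come from, the sketch below repeats the calculation by hand for one term. The row counts are hypothetical round numbers chosen to keep the arithmetic readable; they are not the counts from the dataset, and the notebook itself uses `scipy.stats.chisquare`.

```python
import math

# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
total_rows = 100
high_value_count = 30
low_value_count = 70
observed_high, observed_low = 1, 4   # occurrences of one term

# Expected counts if the term were spread evenly across all rows.
total = observed_high + observed_low
total_prop = total / total_rows
expected_high = total_prop * high_value_count   # 1.5
expected_low = total_prop * low_value_count     # 3.5

# Chi-squared statistic: sum of (observed - expected)^2 / expected.
statistic = ((observed_high - expected_high) ** 2 / expected_high
             + (observed_low - expected_low) ** 2 / expected_low)

# Two categories give one degree of freedom, and for one degree of freedom
# the chi-squared survival function reduces to erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(round(statistic, 4), round(p_value, 4))
```

The `erfc` shortcut applies only to the one-degree-of-freedom case used here; for more categories a general chi-squared distribution function would be needed.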