# Statistics on Jeopardy
---
The dataset contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the number of dollars a correct answer to the question is worth.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

## Importing the data

In [1]:
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [3]:
# Removing white spaces in column names
jeopardy.columns = jeopardy.columns.str.strip()

In [4]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [5]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalizing `Question` and `Answer` columns

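Before analysing the Jeopardy questions, we normalize the text columns by lowercasing them and removing punctuation, so that the same word always compares equal regardless of case or surrounding punctuation.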

In [6]:
import string
import re

regex = re.compile('[%s]' % re.escape(string.punctuation))

def removing_punctuation(s):
    return regex.sub('', s.lower())

jeopardy['clean_question'] = jeopardy['Question'].apply(removing_punctuation)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(removing_punctuation)

## Normalizing `Value` column

In [7]:
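# `re`, `string` and the punctuation pattern `regex` are already defined in the previous cell

def norm_value(s):
    try:
        # Strip punctuation ('$', ',') and convert what is left to an integer
        return int(regex.sub('', s))
    except (ValueError, TypeError):
        # Values that cannot be parsed as a number are mapped to 0
        return 0
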
jeopardy['clean_value'] = jeopardy['Value'].apply(norm_value)
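
To make the behaviour of this normalization concrete, here is a minimal, self-contained sketch of the same idea. The helper name `to_dollars` and the sample strings are made up for illustration, and the `'None'` placeholder is an assumption about how non-numeric values might appear in the `Value` column.

```python
import re
import string

# Same pattern as in the notebook: matches any single punctuation character
punct = re.compile('[%s]' % re.escape(string.punctuation))

def to_dollars(s):
    """Hypothetical stand-in for norm_value: strip punctuation, then parse as int."""
    try:
        return int(punct.sub('', s))
    except (ValueError, TypeError):
        # Non-numeric strings and non-string inputs fall back to 0
        return 0

samples = ['$200', '$2,000', 'None']  # illustrative inputs, not taken from the dataset
cleaned = [to_dollars(s) for s in samples]
print(cleaned)
```

One caveat of this approach: a decimal point is punctuation too, so a string like `'$1.50'` would become `150`. That is harmless if all values are whole dollar amounts, but worth checking before relying on `clean_value`.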

In [8]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

## Analysing term frequencies

### Question 1
How often is the answer deducible from the question?

In [9]:
def count(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    match_count = 0
    # The term 'the' is very frequent, so we remove it
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for each in split_answer:
        if each in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(count, axis=1)

In [10]:
jeopardy['answer_in_question'].mean()

0.060352773854699004

On average, only about 6% of the terms in an answer also appear in its question, so the answer can rarely be deduced from the question alone.
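
To see what a single row contributes to that average, the sketch below applies the same logic to two made-up question/answer pairs. The function name `answer_overlap` and the example strings are hypothetical; the steps mirror the `count` function above but take plain strings instead of a DataFrame row.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer terms that also appear in the question."""
    split_answer = clean_answer.split(' ')
    split_question = clean_question.split(' ')
    # As in the notebook, drop one occurrence of the very frequent term 'the'
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    matches = sum(1 for term in split_answer if term in split_question)
    return matches / len(split_answer)

# Illustrative, already-normalized strings (not rows from the dataset)
partial = answer_overlap('this john was the second us president', 'john adams')
none = answer_overlap('the city of yuma is in this state', 'arizona')
print(partial, none)
```

In the first pair one of the two answer terms occurs in the question, giving 0.5; in the second pair none does, giving 0. A dataset mean near 0.06 therefore indicates that most rows look like the second case.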

### Question 2

Let's investigate how often new questions are repeats of older ones.

In [11]:
question_overlap = []
terms_used = set()

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    # We only take "complex" words into account --> length >= 6
    split_question = [each for each in split_question if len(each) >= 6]
    match_count = 0
    for each in split_question:
        if each in terms_used:
            match_count += 1
        terms_used.add(each)
    if len(split_question) > 0: 
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6919577992203563

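Interesting! On average, about 69% of the long words in a question have already appeared in an earlier question, which suggests that questions might be recycled. This measure compares single words rather than whole phrases, so it is only a rough indicator. Note that we are only looking at 10% of the full Jeopardy question dataset in this analysis.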
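
The overlap measure is easier to interpret on a tiny example. The sketch below reimplements the loop for a plain list of strings; the function name `overlap_scores` and the three sample questions are made up for illustration.

```python
def overlap_scores(clean_questions, min_len=6):
    """For each question, the share of its long words already seen in earlier text."""
    terms_used = set()
    scores = []
    for question in clean_questions:
        words = [w for w in question.split(' ') if len(w) >= min_len]
        match_count = 0
        for w in words:
            if w in terms_used:
                match_count += 1
            # Add the word after checking it, exactly as the notebook loop does
            terms_used.add(w)
        scores.append(match_count / len(words) if words else 0)
    return scores

# Illustrative, already-normalized questions (not rows from the dataset)
toy = [
    'this italian astronomer was under house arrest',
    'this polish astronomer placed the sun at the center',
    'a short one',
]
scores = overlap_scores(toy)
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second scores 0.25 because only `astronomer` is repeated, and the third scores 0 because it has no words of six or more letters. Since `terms_used` only ever grows, later rows tend to score higher than earlier ones, so the mean also depends on how many rows are processed.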

### Question 3
Which terms correspond to high-value questions? To answer this, we will use a **chi-squared test**.

We will narrow down the questions into two categories:
- `Low value` -- Any row where Value is 800 or less.
- `High value` -- Any row where Value is greater than 800.

Then, we will loop through each of the terms from `terms_used`, and we will:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi-squared value based on the expected counts and the observed counts for high and low value questions.

Finally, we can find the words with the biggest differences in usage between high and low value questions by selecting the words with the highest associated chi-squared values.

In [12]:
def categorize(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(categorize, axis=1)

In [13]:
jeopardy['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [14]:
def count_w(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1: 
                high_count += 1
            else:
                low_count += 1
    return [high_count, low_count]

observed_expected = []
# We only do the comparison with a small sample of words
# (a set has no fixed order, so this sample can differ between runs)
comparison_terms = list(terms_used)[0:5]
for each in comparison_terms:
    observed_expected.append(count_w(each))

In [15]:
comparison_terms

['evaporation', 'dealer', 'pacino', 'inductees', 'bowler']

In [16]:
observed_expected

[[1, 1], [4, 3], [1, 3], [1, 1], [3, 1]]

In [17]:
# We compute the number of high/low value questions in the dataset
high_value_count = jeopardy['high_value'].value_counts()[1]
low_value_count = jeopardy['high_value'].value_counts()[0]

In [18]:
from scipy.stats import chisquare
chi_squared = []
total_rows = jeopardy.shape[0]
for each in observed_expected:
    total = sum(each)
    total_prop = total/total_rows
    high_value_count_exp = total_prop * high_value_count
    low_value_count_exp = total_prop * low_value_count
    chi_sq, p_val = chisquare(each, [high_value_count_exp, low_value_count_exp])
    chi_squared.append([chi_sq, p_val])

In [19]:
chi_squared

[[0.44487748166127949, 0.50477764875459963],
 [2.7746199271818219, 0.09576938744167536],
 [0.026364433084407689, 0.87101348468892104],
 [0.44487748166127949, 0.50477764875459963],
 [4.1980229752219893, 0.040471136200959497]]
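
As a sanity check, the last row of this output (the term `bowler`, observed counts `[3, 1]`) can be reproduced by hand from the counts printed earlier. The sketch below uses only the standard library; it is meant to mirror what `chisquare` computes for two categories, not to replace it.

```python
import math

# Counts taken from the outputs above
high_value_count = 5734   # rows with high_value == 1
low_value_count = 14265   # rows with high_value == 0
total_rows = high_value_count + low_value_count  # 19999

observed = [3, 1]  # 'bowler': [high_count, low_count]

# Expected counts if the term were spread proportionally over both groups
total_prop = sum(observed) / total_rows
expected = [total_prop * high_value_count, total_prop * low_value_count]

# Pearson's statistic: sum of (observed - expected)^2 / expected
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Two categories give one degree of freedom, for which the chi-squared
# survival function reduces to erfc(sqrt(x / 2))
p_val = math.erfc(math.sqrt(chi_sq / 2))

print(round(chi_sq, 4), round(p_val, 4))
```

The result agrees with the values reported for the fifth term (about 4.198 and 0.0405). The expected counts are roughly 1.15 and 2.85, which shows how little data stands behind each of these statistics.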

### Chi-squared results
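Only one of the five terms, `bowler`, had a p-value below 0.05 (p ≈ 0.04); the other four showed no significant difference in usage between high value and low value rows. However, all observed frequencies were lower than 5, so the chi-squared test is not reliable here and even the `bowler` result should not be read as evidence of a real difference. It would be better to run this test with only terms that have higher frequencies.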
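
One possible way to follow up on this is to count every term in a single pass and keep only those whose expected counts are large enough. The sketch below is a hedged illustration on made-up data: the function names `term_counts` and `frequent_terms`, the `min_expected` parameter, and the toy rows are all assumptions, not part of the original analysis.

```python
from collections import Counter

def term_counts(rows, min_len=6):
    """Count, per term, how many high- and low-value questions contain it.

    `rows` is an iterable of (clean_question, high_value) pairs.
    """
    high, low = Counter(), Counter()
    for question, is_high in rows:
        # A set ensures each term is counted at most once per question
        terms = {t for t in question.split(' ') if len(t) >= min_len}
        (high if is_high else low).update(terms)
    return high, low

def frequent_terms(high, low, n_high, n_low, min_expected=5):
    """Keep terms whose expected count reaches `min_expected` in both groups."""
    total_rows = n_high + n_low
    kept = []
    for term in set(high) | set(low):
        total = high[term] + low[term]
        exp_high = total / total_rows * n_high
        exp_low = total / total_rows * n_low
        if min(exp_high, exp_low) >= min_expected:
            kept.append(term)
    return sorted(kept)

# Toy data (not from the dataset); the threshold is lowered so something survives
toy_rows = [
    ('famous painter of sunflowers', 1),
    ('this painter cut off his ear', 0),
    ('dutch painter', 1),
    ('capital of france', 0),
]
high, low = term_counts(toy_rows)
kept = frequent_terms(high, low, n_high=2, n_low=2, min_expected=1)
print(kept)
```

On the notebook's data, the pairs could be built with `zip(jeopardy['clean_question'], jeopardy['high_value'])`. With 5734 high-value rows out of 19999, a term would need to occur in at least 18 questions for its expected high-value count to reach 5.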