# Winning Jeopardy

The project is aimed to work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help to potentially win.

The dataset, containing 20,000 rows, is available [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Let's get familiar with the dataset.


In [2]:
#Reading the dataset
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
#Exploring the columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
#Removing spaces 
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

## Normalizing text

Before starting the analysis we should normalize all of the text columns to be lowercase and without punctuation.

In [5]:
#Writing a function to normalize text
def normalize(string):
    string  = string.lower()
    string = string.replace('[^A-Za-z0-9\s]','')
    string = string.replace('\s+','')
    return string

#Normilizing the 'Question' column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)

#Normilizing the 'Answer' column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)




## Normalizing columns

Besides normalizing the text columns, we should also normalize columns 'Value' and 'Air Date' to convert them to numeric and datetime types respectively.

In [6]:
#Writing a function to convert to numeric type
import re
def numeric(text):
    text = re.sub("[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

#Normalizing 'Value' column
jeopardy['clean_value'] = jeopardy['Value'].apply(numeric)

#Normalizing 'Air Date' column
jeopardy['Air Date']  = pd.to_datetime(jeopardy['Air Date'])


In [7]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Answers in Questions

It would be helpful to figure out the following things for our analysis:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

The first question could be answered by seeing how many times words in answer also occur in the question.
The second question could be answered by seeing how often complex words (> 6 characters) reoccur.

In [8]:
#First question
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)
   
jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)        

In [9]:
jeopardy["answer_in_question"].mean()

0.045522472593826156

The answer appears in question only 5% of the time, which means that we can't rely solely on questions to reply to the questions.

## Recycled Questions

At this point we will try to investigate how often new questions are repeats of older ones. We will look at words longer than 6 characters to filter out words like _the_ and _than_, which are commonly used, but don't tell a lot about a question.

In [10]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        split_question = [q for q in split_question if len(q) > 5]
        match_count = 0
        for word in split_question:
            if word in terms_used:
                match_count += 1
        for word in split_question:
            terms_used.add(word)
        if len(split_question) > 0:
            match_count /= len(split_question)
        question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()
    

0.6019236436252964

So there is a 60% overlap between terms in new questions and terms in old questions. However, it does take only terms into the consideration, not phrases.

## Low value vs High value questions

Now we are going to study questions that pertain to high value questions instead of low value quesitons.

The low value question will be in any row where 'Value' is less than $800, and vice versa for high value questions.

In [11]:
def high_low_count(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

jeopardy['high_value'] = jeopardy.apply(high_low_count, axis=1)

def word_count(word):
    low_count = 0
    high_count = 0
    for i,row in jeopardy.iterrows():
        split_question = row['clean_question'].split(' ')
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count 

In [12]:
import random
comparison_terms = random.sample(terms_used, 10)
observed_expected = []

for term in comparison_terms:
    counts = word_count(term)
    observed_expected.append(counts)

observed_expected
    

[(1, 2),
 (2, 3),
 (2, 2),
 (0, 1),
 (1, 2),
 (0, 1),
 (1, 0),
 (1, 0),
 (0, 1),
 (1, 3)]

## Applying the chi-squared test

Now that we have found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [18]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total/jeopardy.shape[0]
    exp_high = total_prop * high_value_count
    exp_low = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high, exp_low])
    chi_squared.append(chisquare(observed, expected))

chi_squared
    

[Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.3137668167849311, pvalue=0.5753778622944691),
 Power_divergenceResult(statistic=0.889754963322559, pvalue=0.3455437191483468),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921)]

## Chi-squared results

None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.