# Winning Jeopardy

In this project we analyse a dataset of [Jeopardy](https://www.jeopardy.com/) questions to look for patterns that might offer an advantage in winning the show.

In [6]:
import pandas as pd
import datetime as datetime

jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
jeopardy[' Air Date'] = pd.to_datetime(jeopardy[' Air Date'])
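# .copy() gives an independent DataFrame, so the later column assignments do not act on a slice
jeopardy = jeopardy[jeopardy[' Air Date'] > '2008-12-31'].copy()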
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41804 entries, 56 to 216746
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Show Number  41804 non-null  int64         
 1    Air Date    41804 non-null  datetime64[ns]
 2    Round       41804 non-null  object        
 3    Category    41804 non-null  object        
 4    Value       41804 non-null  object        
 5    Question    41804 non-null  object        
 6    Answer      41803 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 2.6+ MB


In [7]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [8]:
jeopardy.rename(columns = {'Show Number':'show_number',' Air Date':'air_date', ' Round':'round', ' Category':'category', 
                           ' Value':'value', ' Question':'question', ' Answer':'answer'}, inplace=True)
jeopardy.columns

Index(['show_number', 'air_date', 'round', 'category', 'value', 'question',
       'answer'],
      dtype='object')

In [9]:
jeopardy.dtypes

show_number             int64
air_date       datetime64[ns]
round                  object
category               object
value                  object
question               object
answer                 object
dtype: object

In [10]:
jeopardy['answer'].head()

56        England
57    Miley Cyrus
58       the skin
59             48
60      dribbling
Name: answer, dtype: object

## Normalising Text

Before we can commence our analysis we need to normalise some of the columns, specifically:

- `question` and `answer`: remove punctuation, change all characters to lower case
- `value`: remove dollar signs and convert to integers
- `air_date`: convert to datetime format

In [11]:
import re

def normalize_text(text):
    text = text.lower()
    # Raw strings keep \s from being read as an invalid string escape
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text
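
As a quick sanity check, the two helpers can be tried on a few sample strings. This is a minimal sketch: the functions are repeated so the snippet runs on its own, the question is taken from the dataset, and the two value strings (`"$2,000"` and `"None"`) are hypothetical inputs chosen for illustration.

```python
import re

def normalize_text(text):
    # Lower-case, strip punctuation, collapse runs of whitespace
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    # Strip symbols such as "$" and ",", then convert to int (0 if that fails)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except Exception:
        text = 0
    return text

sample_question = "It's the largest kingdom in the United Kingdom"
print(normalize_text(sample_question))  # its the largest kingdom in the united kingdom
print(normalize_values("$2,000"))       # 2000
print(normalize_values("None"))         # 0 (hypothetical non-numeric value)
```

Note that any value that cannot be parsed as an integer is silently mapped to 0, so such rows end up in the low value group later on.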

In [12]:
jeopardy['answer'] = jeopardy['answer'].astype('str')

In [13]:
jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["value"].apply(normalize_values)

In [14]:
jeopardy.head()

| | show_number | air_date | round | category | value | question | answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 56 | 5957 | 2010-07-06 | Jeopardy! | GEOGRAPHY "E" | $200 | It's the largest kingdom in the United Kingdom | England | its the largest kingdom in the united kingdom | england | 200 |
| 57 | 5957 | 2010-07-06 | Jeopardy! | RADIO DISNEY | $200 | "Party In The U.S.A." is by this singer who al... | Miley Cyrus | party in the usa is by this singer who also pl... | miley cyrus | 200 |
| 58 | 5957 | 2010-07-06 | Jeopardy! | PARTS OF PEACH | $200 | If this part of a peach is downy or fuzzy, the... | the skin | if this part of a peach is downy or fuzzy the ... | the skin | 200 |
| 59 | 5957 | 2010-07-06 | Jeopardy! | BE FRUITFUL & MULTIPLY | $200 | 4 x 12 | 48 | 4 x 12 | 48 | 200 |
| 60 | 5957 | 2010-07-06 | Jeopardy! | LET'S BOUNCE | $200 | This verb for bouncing a basketball sounds lik... | dribbling | this verb for bouncing a basketball sounds lik... | dribbling | 200 |


In [15]:
import datetime as datetime
jeopardy['air_date'] = pd.to_datetime(jeopardy['air_date'])
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41804 entries, 56 to 216746
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   show_number     41804 non-null  int64         
 1   air_date        41804 non-null  datetime64[ns]
 2   round           41804 non-null  object        
 3   category        41804 non-null  object        
 4   value           41804 non-null  object        
 5   question        41804 non-null  object        
 6   answer          41804 non-null  object        
 7   clean_question  41804 non-null  object        
 8   clean_answer    41804 non-null  object        
 9   clean_value     41804 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 3.5+ MB


## Answers in Questions

In order to decide whether we should study past questions, study general knowledge, or not study at all, it would be helpful to understand:

- How often the answer can be derived from the question
- How often new questions are repeats of older questions

For the former we can look at how many times words in the answer also occur in the question.

In [16]:
def answer_in_question(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0
    
    if 'the' in split_answer:
        split_answer.remove('the')
    
    if len(split_answer) == 0:
        return 0
    
    for i in split_answer:
        if i in split_question:
            match_count += 1
    
    result = match_count/len(split_answer)
    return result    
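
To make the metric concrete, the function can be called on plain dictionaries standing in for DataFrame rows. This is a sketch only: the function is repeated so the snippet is self-contained, the first row uses a cleaned question and answer from the dataset, and the other two rows are hypothetical.

```python
def answer_in_question(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    match_count = 0

    # Drop one leading-style "the" so it does not inflate the match rate
    if 'the' in split_answer:
        split_answer.remove('the')

    if len(split_answer) == 0:
        return 0

    for word in split_answer:
        if word in split_question:
            match_count += 1

    # Fraction of answer words that also appear in the question
    return match_count / len(split_answer)

question = "its the largest kingdom in the united kingdom"
real_row = {'clean_question': question, 'clean_answer': "england"}
full_match = {'clean_question': question, 'clean_answer': "the united kingdom"}  # hypothetical
half_match = {'clean_question': question, 'clean_answer': "united states"}       # hypothetical

print(answer_in_question(real_row))    # 0.0
print(answer_in_question(full_match))  # 1.0
print(answer_in_question(half_match))  # 0.5
```

The result for each row is a fraction between 0 and 1, so the mean over the whole column is an average share of matching answer words, not a count of questions that give their answer away.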

In [17]:
jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis=1)
jeopardy['answer_in_question'].mean()

0.06187277230419057

From the output above, we can see that on average only about 6% of the words in an answer also appear in the question. The answer can therefore rarely be deduced from the question, so this approach is unlikely to give us an advantage, and we should now look at how often questions are repeated.

In [18]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by=['air_date'], inplace=True)

for i, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(" ")
    #Looking at more complex words, using an arbitrary length filter of 6+ characters
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.7486609603581684

On average, about 75% of the longer words in a question have already appeared in an earlier question. Our dataset contains only around 10% of all Jeopardy questions, and the comparison looks at individual words rather than phrases, so this figure is only indicative. Even so, studying past questions looks like a more worthwhile strategy than hoping to find answers in the questions.
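
The overlap measure can be illustrated on a toy example. The sketch below mirrors the loop above as a standalone function; `term_overlap` and the three questions are hypothetical names and data introduced for illustration, and the questions are assumed to be already in air-date order.

```python
def term_overlap(questions, min_length=6):
    """Share of each question's long words that were seen in earlier text."""
    terms_seen = set()
    overlaps = []
    for question in questions:
        # Keep only words of min_length or more characters, as in the loop above
        words = [w for w in question.split(" ") if len(w) >= min_length]
        matches = 0
        for word in words:
            if word in terms_seen:
                matches += 1
            terms_seen.add(word)
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

toy_questions = [
    "which president signed the treaty",
    "this president vetoed another treaty",
    "name the famous painter",
]
print(term_overlap(toy_questions))  # [0.0, 0.5, 0.0]
```

The first question always scores 0 because nothing has been seen yet, which is one reason the ordering by `air_date` matters.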

## Low Value vs High Value Questions

If we segment the dataset into two categories we can then identify, using a chi-squared test, which terms correspond to high value questions.

To do this we can categorise the data using the `clean_value` column, with any row containing a value of 800 or less being considered low value, and anything above 800 classified as high value.

We can then loop through the `terms_used` set to:

- Find the number of low value questions the word occurs in
- Find the number of high value questions the word occurs in
- Find the percentage of questions the word occurs in
- Based on the above percentage, find expected counts
- Calculate the chi-squared value based on expected counts and the observed counts for high and low value questions

The words with the highest chi-squared values are the ones whose usage differs most between high and low value questions.
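
The steps above can be sketched for a single word using only the standard library. All numbers here are hypothetical: a word seen in 8 high value and 6 low value questions, in a dataset assumed to hold 400 high value and 600 low value questions. The helper name `chi_squared_for_word` is also an assumption made for this sketch.

```python
import math

def chi_squared_for_word(high_obs, low_obs, high_total, low_total):
    n = high_total + low_total
    # Share of all questions that contain the word
    total_prop = (high_obs + low_obs) / n
    # Expected counts if the word were spread evenly across both groups
    exp_high = total_prop * high_total
    exp_low = total_prop * low_total
    statistic = ((high_obs - exp_high) ** 2 / exp_high
                 + (low_obs - exp_low) ** 2 / exp_low)
    # Two categories give one degree of freedom, for which the chi-squared
    # survival function reduces to erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

statistic, p_value = chi_squared_for_word(8, 6, high_total=400, low_total=600)
print(round(statistic, 3))  # 1.714
print(p_value)
```

Here the expected counts are 5.6 and 8.4, so the statistic is 2.4²/5.6 + 2.4²/8.4 ≈ 1.714.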

In [19]:
#Vectorised alternative to the high_value function below
#jeopardy['high_value'] = (jeopardy['clean_value'] > 800).astype(int)

In [20]:
def high_value(row):
    value = 0
    if row['clean_value'] > 800:
        value = 1
    return value

jeopardy['high_value'] = jeopardy.apply(high_value, axis=1)

In [21]:
jeopardy['high_value'].value_counts()
jeopardy.shape

(41804, 13)

In [22]:
def high_or_low(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split(" "):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count    
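
`high_or_low` scans every row once per word, which becomes slow when many terms are tested. One possible alternative, sketched below, is to count all words in a single pass with `collections.Counter`. The function name `count_terms_by_value` and the toy data are hypothetical; on the real data the two arguments would be `jeopardy['clean_question']` and `jeopardy['high_value']`.

```python
from collections import Counter

def count_terms_by_value(questions, high_flags):
    """Return (high_counts, low_counts): number of questions containing each word."""
    high_counts, low_counts = Counter(), Counter()
    for question, is_high in zip(questions, high_flags):
        # set() counts a word once per question, matching high_or_low
        words = set(question.split(" "))
        if is_high == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

toy_questions = [
    "this river flows through egypt",
    "this river is the longest river in europe",
    "this painter cut off his ear",
]
toy_flags = [1, 0, 0]

high_counts, low_counts = count_terms_by_value(toy_questions, toy_flags)
print(high_counts["river"], low_counts["river"])      # 1 1
print(high_counts["painter"], low_counts["painter"])  # 0 1
```

After this single pass, the observed counts for any word are a pair of dictionary lookups, and a `Counter` returns 0 for words it has never seen.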

In [23]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for word in comparison_terms:
    observed_expected.append(high_or_low(word))

observed_expected

[(8, 6),
 (1, 5),
 (2, 4),
 (0, 1),
 (1, 0),
 (3, 0),
 (1, 0),
 (5, 15),
 (0, 1),
 (2, 0)]

## Applying the Chi-Squared Test

Now that we have found the observed counts for a few words, we can compute the expected counts and the [chi-squared](https://en.wikipedia.org/wiki/Chi-squared_test) value for each.

In [27]:
import numpy as np
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []
for i in observed_expected:
    total = sum(i)
    total_prop = total / jeopardy.shape[0]
    
    exp_high_value = total_prop * high_value_count
    exp_low_value = total_prop * low_value_count
    
    observed = np.array([i[0], i[1]])
    expected = np.array([exp_high_value, exp_low_value])
    
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=1.69721811617478, pvalue=0.19265218905181142),
 Power_divergenceResult(statistic=1.369540142184742, pvalue=0.24189090934291385),
 Power_divergenceResult(statistic=0.11371488228441806, pvalue=0.7359537954729485),
 Power_divergenceResult(statistic=0.6688889776038963, pvalue=0.41343922833074787),
 Power_divergenceResult(statistic=1.4950164130110415, pvalue=0.22143976330838874),
 Power_divergenceResult(statistic=4.485049239033124, pvalue=0.03419256146400087),
 Power_divergenceResult(statistic=1.4950164130110415, pvalue=0.22143976330838874),
 Power_divergenceResult(statistic=1.893771514307635, pvalue=0.16877714486249135),
 Power_divergenceResult(statistic=0.6688889776038963, pvalue=0.41343922833074787),
 Power_divergenceResult(statistic=2.990032826022083, pvalue=0.08377847019037524)]

In [30]:
comparison_terms[5]

'reacts'

From the chi-squared test results we can see that only one term, 'reacts', has a p-value below 0.05 (about 0.034). It appears in just three questions, however, so this result should be treated with caution.
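
A common rule of thumb is that the chi-squared approximation is only reliable when every expected count is around 5 or more. A word that occurs in fewer than five questions can never meet this, because its expected counts sum to its total number of occurrences. A minimal sketch of such a check follows; the helper names, the threshold default, and the group sizes of 400 and 600 are assumptions for illustration, not values from the dataset.

```python
def expected_counts(high_obs, low_obs, high_total, low_total):
    # Expected counts if the word were spread evenly across both groups
    n = high_total + low_total
    total = high_obs + low_obs
    return total * high_total / n, total * low_total / n

def large_enough(high_obs, low_obs, high_total, low_total, minimum=5):
    # True only when both expected counts reach the rule-of-thumb minimum
    return min(expected_counts(high_obs, low_obs, high_total, low_total)) >= minimum

# Hypothetical group sizes, for illustration only
print(large_enough(3, 0, high_total=400, low_total=600))   # False
print(large_enough(5, 15, high_total=400, low_total=600))  # True
```

Filtering terms this way before testing would remove most of the rare words that random sampling from `terms_used` tends to pick.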

This is as far as we will go within the scope of this project, but possible next steps include:

- Creating a more systematic approach to selecting complex words, rather than using an arbitrary character limit
- Repeating the chi-squared test across a larger range of terms to identify which show the largest differences
- Analysing the `category` column to calculate how often each category appears, so that people going on the show can weight their preparation accordingly