# Winning Jeopardy
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for a few decades, and is a major force in popular culture. In this project, we'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

Here are explanations of each column:
- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the dollar amount awarded for answering the question correctly.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

Let's begin by reading in our data set with pandas and looking at the first few rows as well as the column names:

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head(5))
print("\n", jeopardy.columns)

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  

 Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Two things stand out:
- `Air Date` holds date strings that pandas can parse. Let's re-read our data set and parse the dates.
- Several column names have a leading space. After we've re-read the data set we'll remove these spaces.

In [2]:
# The raw column name still carries its leading space at read time
jeopardy = pd.read_csv('jeopardy.csv', parse_dates=[' Air Date'])

# Strip stray whitespace from every column name
jeopardy.columns = [col.strip() for col in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
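
As a possible shortcut, pandas exposes vectorized string methods on an `Index`, so the column clean-up can also be written as a single call. The sketch below uses a small made-up frame (`demo` is a hypothetical name, not part of the project data) purely to illustrate the idea.

```python
import pandas as pd

# Toy frame with the same kind of leading-space column names (illustrative only)
demo = pd.DataFrame({"Show Number": [4680], " Air Date": ["2004-12-31"], " Round": ["Jeopardy!"]})

# Index.str.strip() removes leading/trailing whitespace from every column name
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```

On the real frame the equivalent line would be `jeopardy.columns = jeopardy.columns.str.strip()`.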

## Normalizing text
Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (`Question` and `Answer`). We will write a function that converts the text to lower case and removes all punctuation, and assign the cleaned values to two new columns: `clean_question` and `clean_answer`.

In [3]:
import re

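def normalize_str(s):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of whitespace into a single space.
    # Raw strings keep "\s" from being treated as an invalid escape sequence.
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    s = re.sub(r"\s+", " ", s)
    return s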

In [4]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_str)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_str)
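
To see what the normalization does to individual strings, here is a small self-contained check. The function is restated so the snippet runs on its own, and the inputs are short strings of the kind that appear in the data.

```python
import re

def normalize_str(s):
    # Same logic as the notebook function, restated for a standalone demo
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    s = re.sub(r"\s+", " ", s)
    return s

samples = ["Crate & Barrel", "McDonald's", "1971-72"]
cleaned = [normalize_str(s) for s in samples]
print(cleaned)  # ['crate barrel', 'mcdonalds', '197172']
```

Note the last case: removing the hyphen fuses `1971-72` into a single token, which is worth keeping in mind when matching words later on.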

## Normalizing values
The text columns are now normalized, but the `Value` column still holds strings such as `$200`. We will write a corresponding function that removes any punctuation and converts the result to an integer. If the conversion fails, we'll just put in 0.

In [5]:
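def normalize_val(value):
    # Strip the dollar sign and any other punctuation before converting
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        value = int(value)
    except ValueError:
        # Non-numeric entries fall back to 0
        value = 0
    return value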

In [6]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_val)
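
A quick standalone check of the value clean-up follows. The inputs `"$2,000"` and `"None"` are assumed examples of what the column might contain; they are not taken from the output shown in this notebook.

```python
import re

def normalize_val(value):
    # Restated from the notebook so the snippet runs on its own
    value = re.sub(r"[^A-Za-z0-9\s]", "", value)
    try:
        return int(value)
    except ValueError:
        return 0

# Hypothetical raw values: a plain amount, one with a thousands separator,
# and a non-numeric placeholder
raw_values = ["$200", "$2,000", "None"]
converted = [normalize_val(v) for v in raw_values]
print(converted)  # [200, 2000, 0]
```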

In [7]:
jeopardy.head(15)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sha...,the ant,200
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 bc to link rome the south of ital...,the appian way,400
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2306 ...,michael jordan,400
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 197172 a record 1122 inches o...,washington,400
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel,400


## Answers in questions
In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We can answer the second question by seeing how often longer words (6 or more characters) reoccur. We can answer the first question by seeing how many of the words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

In [8]:
def count_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    match_count = 0
    
    # 'the' commonly occurs in both questions and answer and is not meaningful in this context so we'll remove it
    if 'the' in split_answer:
        split_answer.remove('the')
    
    # Returning 0 if length of split_answer is 0, to avoid risking dividing with 0 later
    if len(split_answer) == 0:
        return 0
    
    question_words = set(split_question)
    
    for word in split_answer:
        if word in question_words:
            match_count += 1
    
    return match_count/len(split_answer)

In [9]:
answer_in_question = jeopardy.apply(count_matches, axis = 1)
answer_in_question.mean()

0.059001965249777744

On average, only about 5.9% of the words in an answer also appear in its question. This is a small share, and it likely means we can't rely on deducing an answer purely from hearing the question.
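
To make the score concrete, the sketch below applies the same matching logic to two made-up rows (plain dictionaries standing in for DataFrame rows; the texts are invented for illustration).

```python
def count_matches(row):
    # Share of answer words (ignoring one 'the') that also occur in the question
    split_answer = row["clean_answer"].split()
    question_words = set(row["clean_question"].split())
    if "the" in split_answer:
        split_answer.remove("the")
    if not split_answer:
        return 0
    return sum(word in question_words for word in split_answer) / len(split_answer)

# Invented rows, not taken from the data set
rows = [
    {"clean_question": "this ant shares a fable title with a grasshopper", "clean_answer": "the ant"},
    {"clean_question": "this state is home to the city of yuma", "clean_answer": "arizona"},
]
scores = [count_matches(r) for r in rows]
print(scores)  # [1.0, 0.0]
```

The first row scores 1.0 because its only meaningful answer word appears in the question; the second scores 0.0. The notebook's figure is the mean of such per-row scores.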

## Recycled questions
Another interesting matter to study is how often questions are recycled. If a large share of questions are repetitions of old ones, studying previous questions can be a viable strategy.

In [10]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by='Air Date', inplace = True)

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split()
    # Keep only longer words (6+ characters) to filter out filler such as 'the' or 'than'
    split_question = [word for word in split_question if len(word) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
        else:
            terms_used.add(word)
            
    if len(split_question) > 0:
        match_count /= len(split_question)
    
    question_overlap.append(match_count)

jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6894031359073217

On average, about 68.9% of the longer words in a question had already appeared in an earlier question, which is quite a large share. Note that this measures overlapping single words rather than whole phrases, so it does not by itself show that entire questions were repeated. Keep in mind too that the data we're looking at makes up only about 10% of the entire data set, so it may not be fully representative. Still, the result is interesting and justifies looking further at recycled questions.
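
The overlap measure can be illustrated on a toy list of questions. This is a minimal sketch with invented text; `overlap_scores` is a hypothetical helper name that mirrors the loop in the cell above.

```python
def overlap_scores(questions, min_len=6):
    """For each question, return the share of long words already seen earlier."""
    seen = set()
    scores = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_len]
        matches = 0
        for word in words:
            if word in seen:
                matches += 1
            else:
                seen.add(word)
        scores.append(matches / len(words) if words else 0)
    return scores

# Invented questions, assumed to be in air-date order
questions = [
    "this president signed the declaration",
    "this president vetoed the treaty",
    "name the river",
]
scores = overlap_scores(questions)
print(scores)
```

The second question scores one third because only `president` was seen before, even though it is clearly a different question from the first. The third has no long words at all and scores 0.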

## Low value vs High value questions
Let's say we only want to study terms that appear in high value questions rather than low value questions. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to split the questions into two categories:

- Low value -- Any row where `Value` is 800 or less.
- High value -- Any row where `Value` is greater than 800.

We'll then be able to loop through each of the terms in `terms_used`, and:
- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
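
The steps above can be sketched for a single term using only the standard library. All counts below are hypothetical, and `chi_squared_1dof` is an illustrative helper rather than a library function. With two categories the test has one degree of freedom, for which the p-value can be written as `erfc(sqrt(stat / 2))`.

```python
import math

def chi_squared_1dof(observed_low, observed_high, n_low, n_high):
    """Chi-squared statistic and p-value for one term across two groups."""
    total = observed_low + observed_high
    share = total / (n_low + n_high)   # share of questions containing the term
    expected_low = share * n_low       # expected count among low value rows
    expected_high = share * n_high     # expected count among high value rows
    stat = ((observed_low - expected_low) ** 2 / expected_low
            + (observed_high - expected_high) ** 2 / expected_high)
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical term: seen in 30 low value and 30 high value questions,
# in a data set with 7000 low value and 3000 high value rows
stat, p_value = chi_squared_1dof(30, 30, n_low=7000, n_high=3000)
print(round(stat, 3), p_value)
```

Here the expected counts are 42 and 18, so a term split 30/30 is over-represented among high value questions and the statistic is large.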

In [11]:
def set_val(row):
    val = 0
    if row["clean_value"] > 800:
        val = 1
    return val

jeopardy["high_value"] = jeopardy.apply(set_val, axis=1)
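
As an alternative to the row-wise `apply`, the same flag can be computed with a vectorized comparison. The sketch below uses a small made-up frame (`demo` is a hypothetical name) to show that values of exactly 800 land in the low value group.

```python
import pandas as pd

# Toy values, including the 800 boundary case (illustrative only)
demo = pd.DataFrame({"clean_value": [200, 800, 1000, 0]})

# Boolean comparison on the whole column, then cast True/False to 1/0
demo["high_value"] = (demo["clean_value"] > 800).astype(int)
flags = demo["high_value"].tolist()
print(flags)  # [0, 0, 1, 0]
```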

In [12]:
def count_low_high(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row["clean_question"].split(" "):
            if row["high_value"] == 0:
                low_count += 1
            else:
                high_count += 1
    return low_count, high_count

In [13]:
import random

# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
comparison_terms = random.sample(list(terms_used), 10)
comparison_terms

observed_expected = []
for word in comparison_terms:
    observed_expected.append(count_low_high(word))

observed_expected

[(1, 0),
 (1, 2),
 (2, 0),
 (2, 0),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (2, 1),
 (0, 1)]
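
Calling `count_low_high` scans every row once per word, which is why testing all terms would be slow. One possible way around this is to tally every word in a single pass. The sketch below assumes the input is an iterable of `(clean_question, high_value)` pairs; `count_all_terms` is a hypothetical helper name and the rows are invented.

```python
from collections import Counter

def count_all_terms(rows):
    """Count, per word, how many low and high value questions contain it."""
    low_counts, high_counts = Counter(), Counter()
    for question, high_value in rows:
        # set() counts a word once per question, matching the
        # `word in question.split()` membership check
        words = set(question.split())
        if high_value:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return low_counts, high_counts

# Invented rows for illustration
rows = [
    ("this president signed the declaration", 0),
    ("this president vetoed the treaty", 1),
    ("president president of france", 1),
]
low, high = count_all_terms(rows)
print(low["president"], high["president"])  # 1 2
```

With the notebook's frame, the pairs could be supplied as `zip(jeopardy["clean_question"], jeopardy["high_value"])`.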

## Applying the chi-squared test
Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

We'll use the `scipy.stats.chisquare` function to compute the chi-squared value and p-value for each term from its observed and expected counts, and append each result to `chi_squared`. We can then look over the chi-squared values and their p-values to see whether any of the results are statistically significant.

In [14]:
high_value_count = jeopardy['high_value'].sum()
low_value_count = jeopardy['high_value'].shape[0] - jeopardy['high_value'].sum()

In [15]:
from scipy.stats import chisquare
import numpy as np

chi_squared = []
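for low_observed, high_observed in observed_expected:
    # Share of all questions that contain the term
    total = low_observed + high_observed
    total_prop = total / jeopardy.shape[0]
    # Expected counts if the term were spread proportionally across both groups
    expected_low = total_prop * low_value_count
    expected_high = total_prop * high_value_count

    observed = np.array([low_observed, high_observed])
    expected = np.array([expected_low, expected_high])
    chi_squared.append(chisquare(observed, expected))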

chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.1177104383031944, pvalue=0.14560406868263753),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.803925692253768, pvalue=0.3699222378079571),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047)]

## Chi-squared results
All of the p-values exceed 0.10, so none of the sampled terms shows a statistically significant difference in usage between high value and low value questions. Most of these words, however, occur only once or twice, as the observed counts above show, and the chi-squared test is unreliable when the expected counts are that small. It would be better to run the test only for words with relatively high frequencies.
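
One way to act on this is to filter terms before testing. The sketch below keeps only terms whose expected count is at least 5 in both groups. That threshold is a common rule of thumb and an assumption here, not something derived from this data set; `frequent_terms` and the counts are hypothetical.

```python
def frequent_terms(low_counts, high_counts, n_low, n_high, min_expected=5):
    """Return terms whose expected count reaches min_expected in both groups."""
    n_total = n_low + n_high
    kept = []
    for word in set(low_counts) | set(high_counts):
        total = low_counts.get(word, 0) + high_counts.get(word, 0)
        expected_low = total * n_low / n_total
        expected_high = total * n_high / n_total
        if min(expected_low, expected_high) >= min_expected:
            kept.append(word)
    return sorted(kept)

# Hypothetical per-term counts and group sizes
low_counts = {"president": 40, "yuma": 1}
high_counts = {"president": 20, "treaty": 2}
kept = frequent_terms(low_counts, high_counts, n_low=7000, n_high=3000)
print(kept)  # ['president']
```

Only `president` survives the filter; the rare terms would have produced expected counts well below 1, as in the sample tested above.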