# Guided Project #14: Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, we will work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.

The dataset is named jeopardy.csv, and contains 20,000 rows from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

### Let's start by reading in and looking at the dataset

In [20]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.head()

|   | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | $200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | $200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


As we can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- Show Number - the Jeopardy episode number
- Air Date - the date the episode aired
- Round - the round of Jeopardy
- Category - the category of the question
- Value - the number of dollars the correct answer is worth
- Question - the text of the question
- Answer - the text of the answer

In [21]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some of the column names have leading spaces. Let's get rid of them. Note that the replacement below removes every space, so Show Number also becomes ShowNumber.

In [22]:
jeopardy.columns = [col.replace(' ','') for col in jeopardy.columns]
jeopardy.columns

Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
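The comprehension above deletes every space, including the one inside Show Number. As a minimal sketch of an alternative, the snippet below trims only leading and trailing whitespace with the pandas `Index.str.strip` accessor. The `toy` DataFrame is a hypothetical stand-in for jeopardy, and the rest of this notebook would need Air Date instead of AirDate if this variant were used.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset: same kind of padded headers.
toy = pd.DataFrame(columns=["Show Number", " Air Date", " Round"])

# Trim whitespace at both ends of each column name, keep interior spaces.
toy.columns = toy.columns.str.strip()

stripped_names = list(toy.columns)
print(stripped_names)
```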

### Cleaning data
Before we can start analyzing the Jeopardy questions, we need to normalize the text columns (Question and Answer). Normalization lowercases every word and removes punctuation, so that Don't and don't aren't treated as different words when we compare them.

In [30]:
import re
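def normalize(text):
    # lowercase, drop punctuation, collapse repeated whitespace
    text = text.lower()
    # raw strings keep \s from being read as an invalid string escape
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text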

# quick test
normalize("That's a string we wanted Normalized.")

'thats a string we wanted normalized'

In [26]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.



In [33]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize)

jeopardy[['Question','clean_question','Answer','clean_answer']].head()

|   | Question | clean_question | Answer | clean_answer |
|---|---|---|---|---|
| 0 | For the last 8 years of his life, Galileo was ... | for the last 8 years of his life galileo was u... | Copernicus | copernicus |
| 1 | No. 2: 1912 Olympian; football star at Carlisl... | no 2 1912 olympian football star at carlisle i... | Jim Thorpe | jim thorpe |
| 2 | The city of Yuma in this state has a record av... | the city of yuma in this state has a record av... | Arizona | arizona |
| 3 | In 1963, live on "The Art Linkletter Show", th... | in 1963 live on the art linkletter show this c... | McDonald's | mcdonalds |
| 4 | Signer of the Dec. of Indep., framer of the Co... | signer of the dec of indep framer of the const... | John Adams | john adams |
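The same normalization can probably be expressed with pandas string methods instead of a row-by-row apply. This is only a sketch on a hypothetical two-item Series, not the notebook's own method. It lowercases first, so the character class only needs to list lowercase letters.

```python
import pandas as pd

# Hypothetical sample text; the real notebook would use jeopardy['Question'].
text = pd.Series(["Don't  stop!", "don't stop"])

clean = (
    text.str.lower()
        .str.replace(r"[^a-z0-9\s]", "", regex=True)  # drop punctuation
        .str.replace(r"\s+", " ", regex=True)         # collapse whitespace
)

clean_list = clean.tolist()
print(clean_list)
```

Both inputs end up as the same string, which is the point of normalizing before comparing words.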


The Value column should be numeric so that we can manipulate it more easily. We'll need to remove the dollar sign (and any thousands separator) from each value and convert the column from text to numeric.

In [53]:
def normalize_value(string):
    string = re.sub('[$,]','',string)
    try:
        return int(string)
    except ValueError:
        return 0
    
# quick test
print(normalize_value('$2,000'))
print(normalize_value('None'))

2000
0


In [61]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
jeopardy[['AirDate','Value','clean_value']].head()

|   | AirDate | Value | clean_value |
|---|---|---|---|
| 0 | 2004-12-31 | $200 | 200 |
| 1 | 2004-12-31 | $200 | 200 |
| 2 | 2004-12-31 | $200 | 200 |
| 3 | 2004-12-31 | $200 | 200 |
| 4 | 2004-12-31 | $200 | 200 |
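As a hedged alternative to the try/except helper, the conversion can be sketched with `pd.to_numeric` and `errors="coerce"`, which turns unparseable strings such as None into NaN before they are replaced with 0. The `values` Series below is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample of the Value column, including a missing value.
values = pd.Series(["$200", "$2,000", "None"])

# Remove dollar signs and thousands separators, then parse as numbers.
stripped = values.str.replace(r"[$,]", "", regex=True)
clean_values = pd.to_numeric(stripped, errors="coerce").fillna(0).astype(int)

clean_list = clean_values.tolist()
print(clean_list)
```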


To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer can be deduced from the question.
- How often questions are repeated.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question and come back to the second.

In [73]:
def match_ratio(row):
    split_question = row['clean_question'].split(' ')
    split_answer = row['clean_answer'].split(' ')
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    return match_count / len(split_question)

jeopardy['answer_in_question'] = jeopardy.apply(match_ratio, axis=1)
jeopardy['answer_in_question'].mean()

0.011923255048185965

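As computed, the ratio divides the number of matched answer words by the number of words in the question, and it averages about 1%. Answer words rarely appear in the question, so we can't count on the question giving the answer away.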
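The function above divides the matched words by the length of the question. A variant that divides by the number of answer words instead would measure the share of the answer that appears in the question. The sketch below is an assumption about what that variant could look like (the name `answer_word_share` and the sample strings are hypothetical). It would produce a different mean than the one reported above.

```python
def answer_word_share(clean_question, clean_answer):
    """Share of answer words (ignoring 'the') that also occur in the question."""
    question_words = set(clean_question.split())
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0.0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical, already normalized question and answer.
share = answer_word_share("this river is the longest river in africa", "the nile river")
print(share)
```

Here one of the two remaining answer words (river) occurs in the question, so the share is one half.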

Let's investigate how often new questions are repeats of older ones. We can't answer this completely, because we only have about 10% of the full Jeopardy question dataset, but we can at least investigate it.

To do this, we will:

- Sort jeopardy in order of ascending air date.
- Maintain a set called terms_used that will be empty initially.
- Iterate through each row of jeopardy.
- Split clean_question into words, remove any word shorter than 6 characters, and check if each word occurs in terms_used.
- If it does, increment a counter.
- Add each word to terms_used.

This allows us to check if the terms in questions have been used previously or not. Only looking at words with six or more characters enables us to filter out words like the and than, which are commonly used, but don't tell you a lot about a question.

In [79]:
question_overlap = []
terms_used = set()

jeopardy.sort_values('AirDate', inplace = True)

for index, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [w for w in split_question if len(w) >= 6]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
        terms_used.add(w)
    if len(split_question) == 0:
        question_overlap.append(0)
    else:
        question_overlap.append(match_count/len(split_question))

In [80]:
jeopardy['question_overlap'] = question_overlap
jeopardy['question_overlap'].mean()

0.6895114174922486

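On average, about 69% of the long words in a question have already appeared in an earlier question, which is pretty high. Keep in mind that this compares single words rather than whole phrases, so it is only a rough signal that questions are recycled.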

### High value questions

Let's say we only want to study questions that are worth a lot of money instead of low value questions. This will help us earn more money when we're on Jeopardy.

We can figure out which terms correspond to high value questions using a chi-squared test. We'll first need to split the questions into two categories:

- Low value -- Any row where Value is 800 or less.
- High value -- Any row where Value is greater than 800.

In [87]:
jeopardy['high_value'] = jeopardy['clean_value'].apply(lambda x: 1 if x>800 else 0)
jeopardy[['Value','high_value']].head()

|   | Value | high_value |
|---|---|---|
| 19323 | $1000 | 1 |
| 19295 | $500 | 0 |
| 19294 | $400 | 0 |
| 19293 | $400 | 0 |
| 19292 | $400 | 0 |


In [91]:
def value_counts(word):
    low_count = 0
    high_count = 0
    for index, row in jeopardy.iterrows():
        if word in row['clean_question'].split(' '):
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count 
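Looping over every row with iterrows for each word can be slow on 20,000 rows. One possible speed-up, sketched here on a hypothetical `toy` DataFrame rather than the real jeopardy data, is to split each question into a set of words once and reuse those sets for every lookup. The helper name `count_high_low` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in with the two columns the counting function needs.
toy = pd.DataFrame({
    "clean_question": ["this river flows north", "name this river", "a famous painter"],
    "high_value": [1, 0, 0],
})

# Split every question once, up front, instead of once per word lookup.
word_sets = toy["clean_question"].str.split(" ").apply(set)

def count_high_low(word):
    """Return (high_count, low_count) of rows whose question contains word."""
    has_word = word_sets.apply(lambda words: word in words)
    high = int((has_word & (toy["high_value"] == 1)).sum())
    low = int((has_word & (toy["high_value"] == 0)).sum())
    return high, low

river_counts = count_high_low("river")
print(river_counts)
```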

In [96]:
import random
sample = random.sample(list(terms_used), 10)  # random.sample no longer accepts a set (Python 3.11+)
sample

['valette',
 'aumont',
 'collaborative',
 'yourself',
 'perked',
 'november',
 'siddals',
 'elders',
 'contracts',
 'headmaster']

In [97]:
observed_expected = []
for word in sample:
    observed_expected.append(value_counts(word))
observed_expected

[(0, 1),
 (0, 1),
 (1, 0),
 (5, 8),
 (0, 1),
 (10, 32),
 (0, 1),
 (1, 2),
 (0, 1),
 (0, 1)]

In [99]:
from scipy.stats import chisquare

high_value_count = len(jeopardy[jeopardy['high_value']==1])
low_value_count = len(jeopardy[jeopardy['high_value']==0])
chi_squared = []

for counts in observed_expected:
    total = sum(counts)
    total_ratio = total/len(jeopardy)
    high_expected = total_ratio * high_value_count
    low_expected = total_ratio * low_value_count
    
    chi_squared.append(chisquare(counts,[high_expected, low_expected]))
    
chi_squared

[Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468),
 Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468),
 Power_divergenceResult(statistic=2.4879665155214514, pvalue=0.11471986177699109),
 Power_divergenceResult(statistic=0.6094601352366873, pvalue=0.4349911511897153),
 Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468),
 Power_divergenceResult(statistic=0.4851846064951335, pvalue=0.48608324093624855),
 Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468),
 Power_divergenceResult(statistic=0.03190173163299733, pvalue=0.8582435032724245),
 Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468),
 Power_divergenceResult(statistic=0.4019346698443852, pvalue=0.5260918005187468)]
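To make the numbers above less opaque, here is a sketch of the same calculation done by hand for one term. It assumes the two-category setup used above (one degree of freedom), for which the p-value equals erfc(sqrt(statistic / 2)). The function name and the totals passed to it are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math

def chi_squared_two_groups(high_observed, low_observed, high_total, low_total):
    """Chi-squared statistic and p-value for one term, with 1 degree of freedom."""
    n_rows = high_total + low_total
    term_total = high_observed + low_observed
    # Share of all rows that contain the term.
    share = term_total / n_rows
    # Expected counts if the term were spread evenly across both groups.
    high_expected = share * high_total
    low_expected = share * low_total
    statistic = ((high_observed - high_expected) ** 2 / high_expected
                 + (low_observed - low_expected) ** 2 / low_expected)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical data: 20 high value rows, 80 low value rows,
# and a term seen in 6 high value and 4 low value questions.
statistic, p_value = chi_squared_two_groups(6, 4, high_total=20, low_total=80)
print(statistic, p_value)
```

With these made-up counts the expected values are 2 and 8, so the statistic is (6 - 2)² / 2 + (4 - 8)² / 8 = 10.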

### Chi-squared results
None of the sampled terms showed a significant difference in usage between high value and low value rows (every p-value is above 0.05). Additionally, most of the frequencies were lower than 5 (only yourself and november appeared more often), so the chi-squared test isn't very reliable here. It would be better to run this test with only terms that have higher frequencies.
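One way to restrict the test to more frequent terms is to count how many questions each term appears in and keep only the terms above a threshold. The sketch below uses `collections.Counter` on a hypothetical list of normalized questions. The threshold `min_count` is an assumption and is set low here only because the toy list is tiny.

```python
from collections import Counter

# Hypothetical normalized questions standing in for jeopardy['clean_question'].
questions = [
    "this ancient capital city",
    "capital of an ancient empire",
    "ancient capital on the nile",
    "painter of sunflowers",
]
min_count = 3  # assumed threshold; a larger value suits the real dataset

term_counts = Counter()
for question in questions:
    # Count each long term at most once per question.
    term_counts.update({w for w in question.split() if len(w) >= 6})

frequent_terms = sorted(t for t, c in term_counts.items() if c >= min_count)
print(frequent_terms)
```

The resulting list could then be used in place of the random sample when computing the observed counts.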