## Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. Let's say you want to compete on Jeopardy, and you're looking for any edge you can get to win. In this project, we'll work with a dataset of Jeopardy questions to look for patterns in the questions that could help you win. The dataset was compiled by crawling the [archive of Jeopardy](http://www.j-archive.com/) and can be downloaded from [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

The file is in JSON format. Its structure is an unordered list of questions, where each question has the following fields:

- `category` : the question category, e.g. "HISTORY".
- `value` : value of the question as a string, e.g. "\$200". _Note_: This is "None" for Final Jeopardy! and Tiebreaker questions.
- `question` : text of question. _Note_: This sometimes contains hyperlinks and other messy text, for example when there's a picture or video question.
- `answer` : text of answer.
- `round` : one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker". _Note_: Tiebreaker questions do happen but they're very rare (like once every 20 years).
- `show_number` : string of show number, e.g. "4680".
- `air_date` : the show air date in format YYYY-MM-DD.
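As a rough illustration of the structure above, here is a minimal sketch that parses two made-up records with the standard library. The records are hypothetical (they follow the field list, they are not rows from the real file), and the missing value is assumed to be stored as JSON `null`.

```python
import json

# Hypothetical sample that mirrors the field list above (not real data).
# Assumption: a missing value is stored as JSON null.
raw = '''
[
  {"category": "HISTORY", "value": "$200", "question": "A sample clue",
   "answer": "A sample answer", "round": "Jeopardy!",
   "show_number": "4680", "air_date": "2004-12-31"},
  {"category": "SCIENCE", "value": null, "question": "Another sample clue",
   "answer": "Another answer", "round": "Final Jeopardy!",
   "show_number": "4680", "air_date": "2004-12-31"}
]
'''

records = json.loads(raw)

# Each record is a plain dict keyed by the field names.
fields = sorted(records[0].keys())
# Rounds whose questions carry no dollar value in this sample.
missing_value = [r["round"] for r in records if r["value"] is None]

print(fields)
print(missing_value)
```

Because every record is a flat dict with the same keys, a list like this maps directly onto a table with one row per question.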

We'll use a subset of the dataset: a sample of roughly 20,000 rows stored as a CSV file, where each row represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` -- the Jeopardy episode number of the show this question was in.
- `Air Date` -- the date the episode aired.
- `Round` -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
- `Category` -- the category of the question.
- `Value` -- the number of dollars answering the question correctly is worth.
- `Question` -- the text of the question.
- `Answer` -- the text of the answer.

### Reading and cleaning dataset

In [1]:
# Read dataset
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')
rows, columns = jeopardy.shape
print('Nr of rows: {}\tNr of columns: {}'.format(rows, columns))
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')
print('\nColumns: {}'.format(jeopardy.columns))

Nr of rows: 19999	Nr of columns: 7
First rows of Jeopardy:
   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  

Columns: Index(['Show Number', ' Air Date', ' Round

In [2]:
# Clean columns: some headers carry a leading space (e.g. ' Air Date')
new_columns = []
for column in jeopardy.columns:
    column = column.strip()
    new_columns.append(column)
jeopardy.columns = new_columns
print('Columns: {}'.format(jeopardy.columns))

Columns: Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


Before we can start doing analysis on the Jeopardy questions, we need to normalize all of the text columns (the `Question` and `Answer` columns). The idea is to ensure that we lowercase words and remove punctuation so _Don't_ and _don't_ aren't considered to be different words when we compare them.

After that, there are also some other columns to normalize. The `Value` column should also be numeric, to allow us to manipulate it more easily. We'll need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

The `Air Date` column should also be a datetime, not a string, to enable us to work with it more easily.
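To make the three normalizations concrete, here is a small sketch on single values, using only the standard library. The helper names `normalize_text_example` and `normalize_value_example` are hypothetical and exist only for this illustration. Treating any value without digits as 0 is an assumption of the sketch.

```python
import re
from datetime import datetime

def normalize_text_example(text):
    # Lowercase, then drop everything except letters, digits,
    # underscores and whitespace.
    text = re.sub(r"[^a-z0-9_\s]", "", text.lower())
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text)

def normalize_value_example(value):
    # "$200" -> 200; a value without any digits (e.g. "None") -> 0.
    digits = re.sub(r"\D", "", value)
    return int(digits) if digits else 0

print(normalize_text_example("Don't"), normalize_text_example("don't"))  # dont dont
print(normalize_value_example("$200"), normalize_value_example("None"))  # 200 0
print(datetime.strptime("2004-12-31", "%Y-%m-%d").date())  # 2004-12-31
```

After normalization, _Don't_ and _don't_ compare as equal, dollar amounts can be compared numerically, and dates can be sorted chronologically.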

In [3]:
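import re

def normalize_text(string_):
    # Lowercase, then drop every character that is not a letter, digit, underscore or whitespace
    string_ = string_.lower()
    string_ = re.sub(r'[^a-zA-Z0-9_\s]', '', string_)
    # Collapse consecutive whitespace into a single space
    string_ = re.sub(r'\s+', ' ', string_)
    return string_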

jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')

First rows of Jeopardy:
   Show Number    Air Date      Round                         Category Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  
0  for the last 8 

In [4]:
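import re

def normalize_values(string_):
    # Remove the dollar sign and any other punctuation
    string_ = re.sub(r'[^a-zA-Z0-9_\s]', '', string_)
    try:
        string_ = int(string_)
    except ValueError:
        # Non-numeric values such as "None" (Final Jeopardy!) become 0
        string_ = 0
    return string_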

jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_values)
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')

First rows of Jeopardy:
   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  clean_value  
0  for the 

In [5]:
print(jeopardy.dtypes)

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object


### Finding matches for answer in questions and recycled questions

In order to figure out whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

- How often the answer is deducible from the question.
- How often new questions are repeats of older questions.

We'll begin with the first question. We can answer it by seeing how many times words in the answer also occur in the question.
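As a small worked example of this idea, the sketch below computes the share of answer words that also appear in the question for two made-up pairs. The function name `answer_word_fraction` is hypothetical. Unlike a single `list.remove('the')` call, it drops every occurrence of _the_, which is an assumption of the sketch.

```python
def answer_word_fraction(clean_answer, clean_question):
    # Words of the answer, ignoring the article "the" (every occurrence).
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        return 0
    question_words = set(clean_question.split())
    # Count the answer words that also show up in the question.
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

print(answer_word_fraction("the arizona desert", "this arizona city is very dry"))  # 0.5
print(answer_word_fraction("copernicus", "galileo was under house arrest"))  # 0.0
```

A value close to 1 means the question nearly spells out its answer, and a value close to 0 means the question text gives nothing away.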

In [6]:
def match_answer_in_question(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy['answer_in_question'] = jeopardy.apply(match_answer_in_question, axis=1)
mean = jeopardy['answer_in_question'].mean()
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')
print('\nMean of matches answer inside question: {}'.format(mean))

First rows of Jeopardy:
   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  \
0  for the last 8 years

We now turn to the second question. Let's say we want to investigate how often new questions are repeats of older ones. To do this, we can:

- Sort the dataset in order of ascending air date.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of the dataset.
- Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
- If it does, increment a counter.
- Add each word to `terms_used`.

This will enable us to check whether the terms in a question have been used previously. Only looking at words with six or more characters filters out common words like _the_ and _than_, which tell us little about a question.
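The steps above can be sketched on a toy list of questions as follows. The function name `term_overlap` and the sample questions are made up for illustration, and the list is assumed to be sorted from oldest to newest already.

```python
def term_overlap(questions, min_length=6):
    # `questions` must already be ordered from oldest to newest.
    terms_used = set()
    overlaps = []
    for question in questions:
        # Keep only the words with at least `min_length` characters.
        words = [w for w in question.split() if len(w) >= min_length]
        # Count the words that appeared in an earlier question.
        seen = sum(1 for w in words if w in terms_used)
        overlaps.append(seen / len(words) if words else 0)
        # Only now record this question's words as used.
        terms_used.update(words)
    return overlaps

toy_questions = [
    "this planet is known as the red planet",
    "the largest planet in the solar system",
    "a river in egypt",
]
overlaps = term_overlap(toy_questions)
print(overlaps)
```

Adding the words to `terms_used` only after counting matters: otherwise a word repeated inside one question would count as recycled from an earlier one.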

In [13]:
question_overlap = []
terms_used = set()
jeopardy.sort_values(by='Air Date', inplace=True)

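for row in jeopardy.itertuples():
    # Access the column by name rather than by position
    split_question = row.clean_question.split(" ")
    # Keep only words with six or more characters
    split_question = [q for q in split_question if len(q) > 5]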
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
mean = jeopardy['question_overlap'].mean()
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')
print('\nMean of matches recycled questions: {}'.format(mean))

First rows of Jeopardy:
       Show Number   Air Date             Round         Category  Value  \
19325           10 1984-09-21   Final Jeopardy!  U.S. PRESIDENTS   None   
19286           10 1984-09-21         Jeopardy!      DOUBLE TALK   $300   
19285           10 1984-09-21         Jeopardy!        GEOGRAPHY   $300   
19324           10 1984-09-21  Double Jeopardy!        TV TRIVIA  $1000   
19301           10 1984-09-21  Double Jeopardy!     LABOR UNIONS   $200   

                                                Question              Answer  \
19325  Adventurous 26th president, he was 1st to ride...  Theodore Roosevelt   
19286              Adopted baby of Barney & Betty Rubble           Bamm-Bamm   
19285  8th most populous country in the world, this "...          Bangladesh   
19324  In court, he'd always make mincemeat of Hamilt...         Perry Mason   
19301           Notorious labor leader missing since '75         Jimmy Hoffa   

                                          cl

As per the results, the mean share of answer words that also appear in the question is very low, so the question rarely gives away the answer and we can set this feature aside. By contrast, there is about 70% overlap between terms in new questions and terms in older ones. This figure comes from a small sample of questions and counts single terms rather than phrases, so on its own it is weak evidence. Still, it suggests that the recycling of questions is worth a closer look.

### Finding high value questions with chi-squared test

Let's say we only want to study high-value questions instead of low-value ones. This will help us earn more money when we're on Jeopardy.

We can actually figure out which terms correspond to high-value questions using a chi-squared test. We'll first need to narrow down the questions into two categories:

- `Low value` -- Any row where Value is 800 or less (this includes questions with no dollar value, which were set to 0).
- `High value` -- Any row where Value is greater than 800.

We'll then be able to loop through the terms collected in `terms_used` in the previous step and, for each word:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.
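To make the procedure concrete, here is a worked sketch for a single hypothetical word, written with the standard library only. The function name `chi_squared_two_groups` and all the counts are invented for illustration. The p-value uses the fact that, with two categories (one degree of freedom), the upper-tail probability of the chi-squared distribution equals `erfc(sqrt(chi2 / 2))`.

```python
import math

def chi_squared_two_groups(observed_low, observed_high, n_low, n_high):
    total_questions = n_low + n_high
    word_total = observed_low + observed_high
    # Share of all questions that contain the word.
    proportion = word_total / total_questions
    # Counts expected if the word were equally likely in both groups.
    expected_low = proportion * n_low
    expected_high = proportion * n_high
    # Observed and expected counts are paired group by group.
    chi2 = ((observed_low - expected_low) ** 2 / expected_low
            + (observed_high - expected_high) ** 2 / expected_high)
    # One degree of freedom: upper-tail probability via erfc.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical word found in 30 low-value and 30 high-value questions,
# in a dataset with 700 low-value and 300 high-value questions.
chi2, p_value = chi_squared_two_groups(30, 30, n_low=700, n_high=300)
print(chi2, p_value)
```

Here the expected counts are 42 (low) and 18 (high), so the word shows up in high-value questions more often than its overall frequency predicts. Note that each observed count has to be compared with the expected count of the same group.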

In [14]:
def assign_value(row):
    # 1 for high value questions (more than 800), 0 otherwise
    return 1 if row["clean_value"] > 800 else 0

jeopardy['high_value'] = jeopardy.apply(assign_value, axis=1)
print('First rows of Jeopardy:', jeopardy.head(), sep='\n')

First rows of Jeopardy:
       Show Number   Air Date             Round         Category  Value  \
19325           10 1984-09-21   Final Jeopardy!  U.S. PRESIDENTS   None   
19286           10 1984-09-21         Jeopardy!      DOUBLE TALK   $300   
19285           10 1984-09-21         Jeopardy!        GEOGRAPHY   $300   
19324           10 1984-09-21  Double Jeopardy!        TV TRIVIA  $1000   
19301           10 1984-09-21  Double Jeopardy!     LABOR UNIONS   $200   

                                                Question              Answer  \
19325  Adventurous 26th president, he was 1st to ride...  Theodore Roosevelt   
19286              Adopted baby of Barney & Betty Rubble           Bamm-Bamm   
19285  8th most populous country in the world, this "...          Bangladesh   
19324  In court, he'd always make mincemeat of Hamilt...         Perry Mason   
19301           Notorious labor leader missing since '75         Jimmy Hoffa   

                                          cl

In [32]:
def count_value(word):
    low_count = 0
    high_count = 0
    for row in jeopardy.itertuples():
        # Access columns by name rather than by position
        split_question = row.clean_question.split(" ")
        if word in split_question:
            if row.high_value == 1:
                high_count += 1
            else:
                low_count += 1
    return low_count, high_count

import random
# random.sample needs a sequence; passing a set raises TypeError on Python 3.11+
comparison_terms = random.sample(list(terms_used), 10)
observed_expected = []

for word in comparison_terms:
    low_value, high_value = count_value(word)
    observed_expected.append([low_value, high_value])

print('10 random words used:\n')
print(comparison_terms)
print('\n10 random words frequencies result [low_value questions, high_value questions]:\n')
print(observed_expected)

10 random words used:

['forked', 'cleverly', 'telekinetic', 'statue', 'hrefhttpwwwjarchivecommedia20040628_j_08jpg', 'classics', 'golightly', 'spensers', 'medals', 'manmade']

10 random words frequencies result [low_value questions, high_value questions]:

[[2, 1], [2, 0], [1, 0], [20, 14], [1, 0], [2, 2], [1, 0], [1, 0], [2, 2], [3, 3]]


In [42]:
# Compute the expected counts and the chi-squared value
high_value_count = (jeopardy['high_value'] == 1).sum()
low_value_count = (jeopardy['high_value'] == 0).sum()
print('high_value_questions: {}\tlow_value_questions: {}'.format(high_value_count, low_value_count))

import numpy as np
from scipy.stats import chisquare

chi_squared = []
for item in observed_expected:
    total = np.sum(item)
    total_prop = total / rows
    high_value_expected = total_prop * high_value_count
    low_value_expected = total_prop * low_value_count
    # Keep both arrays in the same order: [low value, high value]
    observed = np.array([item[0], item[1]])
    expected = np.array([low_value_expected, high_value_expected])
    chisqr_value, p_value = chisquare(observed, expected)
    chi_squared.append([chisqr_value, p_value])

print('\nChi Squared test results (word: chi_squared_value, p_value):')
for word, (chisqr_value, p_value) in zip(comparison_terms, chi_squared):
    print('{}: {:.3f}, {:.3f}'.format(word, chisqr_value, p_value))

high_value_questions: 5734	low_value_questions: 14265

Chi Squared test results (word: chi_squared_value, p_value):
forked: 0.032, 0.858
cleverly: 0.804, 0.370
telekinetic: 0.402, 0.526
statue: 2.600, 0.107
hrefhttpwwwjarchivecommedia20040628_j_08jpg: 0.402, 0.526
classics: 0.890, 0.346
golightly: 0.402, 0.526
spensers: 0.402, 0.526
medals: 0.890, 0.346
manmade: 1.335, 0.248


From the terms analyzed, none shows a statistically significant difference in usage between high value and low value rows at the 0.05 level. Most of these terms occur fewer than five times, so their expected counts are too small for the chi-squared test to be reliable. It would be better to run this test only on terms that have higher frequencies.

Even the most frequent term, `statue` (34 occurrences: 20 in low value and 14 in high value questions), falls short of significance, with a p_value of about 0.107. It appears in high value questions somewhat more often than the roughly 9.7 occurrences we would expect, but a difference of this size could easily be due to chance.

We should repeat this experiment with all the terms to analyze the ones with a significant use in high value questions.
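One way to prepare such a follow-up is to keep only the terms that occur often enough before testing them. The sketch below does this on toy data. The function name `frequent_terms`, the sample questions and the `min_count=5` threshold are placeholders, not values taken from the analysis above.

```python
from collections import Counter

def frequent_terms(clean_questions, min_count=5, min_length=6):
    # Count in how many questions each sufficiently long term appears.
    counts = Counter()
    for question in clean_questions:
        # A set avoids counting a term twice within one question.
        counts.update({w for w in question.split() if len(w) >= min_length})
    # Keep only the terms that reach the minimum count.
    return {term: n for term, n in counts.items() if n >= min_count}

toy = ["famous statue in new york"] * 6 + ["a telekinetic heroine"]
print(frequent_terms(toy))
```

A threshold on the raw count is only a proxy: what the chi-squared test needs is reasonably large expected counts in both groups, so the cut-off may have to be higher when one group is much smaller than the other.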