# How to Win Jeopardy

## Introduction

We would like to compete on Jeopardy, the popular TV game show where contestants answer trivia questions to win money. To develop a strategy to help us win, we analyze a dataset of roughly 20,000 past Jeopardy questions, looking for any patterns that appear in the questions.

The dataset we are working with is a subset of a full dataset of Jeopardy questions, which can be downloaded from [this Reddit post](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). Each row of the dataset represents an individual question from an individual episode of the show and contains the following columns:
- `Show Number` -- the episode number of the show in which the question appeared;
- `Air Date` -- the date on which the episode aired;
- `Round` -- the round of Jeopardy in which the question was posed;
- `Category` -- the category of the question, e.g. `HISTORY`;
- `Value` -- the monetary value, in US dollars, of the correct answer to the question;
- `Question` -- the text of the question itself;
- `Answer` -- the text of the answer to the question.

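Our goal is to decide how best to prepare for the show: whether we can guess answers from the wording of the questions, whether past questions are worth studying, and whether certain terms mark high-value questions. We find that, on average, only about 6% of the words in an answer appear in its question, that about 72% of the longer terms in a question have already appeared in earlier questions, and that a chi-squared test on ten sampled terms is too small to be conclusive.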

## Exploring and Cleaning the Dataset

Before we begin the analysis, we read in, explore, and clean the dataset.

In [1]:
# read in and explore dataset
import pandas as pd
jeopardy = pd.read_csv('jeopardy.csv')
# view first five rows
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | \$200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | \$200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | \$200 | The city of Yuma in this state has a record av... | Arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | \$200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | \$200 | Signer of the Dec. of Indep., framer of the Co... | John Adams |


In [2]:
# view shape
jeopardy.shape

(19999, 7)

In [3]:
# view columns
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

We notice first that some of the column names contain erroneous leading spaces. Let's remove the spaces for consistency.

In [4]:
# clean up column names by stripping surrounding whitespace
jeopardy.columns = jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

We next want to normalize the text columns. This entails converting the text to lowercase and replacing every non-word character, such as punctuation, with a space. Below, we write a function to perform the normalization and then apply the function to the `Question` and `Answer` columns.

In [5]:
# import regular expression library
import re

# define normalization function
def normalize_text(text):
    """
    Normalize text by converting to lowercase and removing 
    punctuation.
    
    Parameters
    ----------
    text : str
        Text to be normalized.
        
    Returns
    -------
    str
        Normalized text.
    """

    text = text.lower()
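    # replace each non-word character with a space (raw string
    # avoids an invalid escape sequence warning)
    text = re.sub(r'\W', ' ', text)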
    
    return text

In [6]:
# normalize text columns
jeopardy['clean_question'] = jeopardy['Question'].apply(
    normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(
    normalize_text)
# view updates
jeopardy.head()

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | \$200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was ... | copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | \$200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisl... | jim thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | \$200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona |
| 3 | 4680 | 2004-12-31 | Jeopardy! | THE COMPANY LINE | \$200 | In 1963, live on "The Art Linkletter Show", th... | McDonald's | in 1963 live on the art linkletter show th... | mcdonald s |
| 4 | 4680 | 2004-12-31 | Jeopardy! | EPITAPHS & TRIBUTES | \$200 | Signer of the Dec. of Indep., framer of the Co... | John Adams | signer of the dec of indep framer of the co... | john adams |


We will also need to normalize the non-text columns. Most of the columns are initially string types, but some of these contain numeric and date-time information that would be easier to manipulate and analyze if converted to the appropriate type. Therefore, we normalize the `Value` and `Air Date` columns, reformatting the `Value` entries as integers and the `Air Date` entries as datetime objects.

In [7]:
# define numeric normalization function
def normalize_num(text):
    """
    Normalize numeric text by removing punctuation and converting 
    to integer type.
    
    Parameters
    ----------
    text : str
        Numeric text to be normalized.
        
    Returns
    -------
    int
        Normalized numeric text.
    """

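    # replace each non-word character with a space; int() ignores
    # surrounding whitespace, so '$200' parses as 200, but a value
    # with a thousands separator such as '$2,000' becomes ' 2 000',
    # fails to parse, and falls through to 0
    text = re.sub(r'\W', ' ', text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text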

In [8]:
# normalize `Value` column
jeopardy['clean_value'] = jeopardy['Value'].apply(
    normalize_num)
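
Because `normalize_num` replaces punctuation with spaces, any value containing a thousands separator would be recorded as 0. A minimal sketch of an alternative parser is shown below. The name `parse_value` is hypothetical, and the sketch assumes values are whole-dollar amounts. Swapping it in would change `clean_value` and therefore every count derived from it later in this analysis.

```python
import re

def parse_value(text):
    """Parse a dollar string such as '$2,000' into an int.

    Returns 0 when the text contains no digits. Assumes whole-dollar
    amounts (no cents, no negative values).
    """
    digits = re.sub(r'\D', '', text)  # keep digits only
    return int(digits) if digits else 0

print(parse_value('$200'), parse_value('$2,000'), parse_value('n/a'))
```

Removing non-digit characters, rather than replacing them with spaces, keeps the digits of a comma-separated value together so that `int()` can parse them.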

In [9]:
# normalize `Air Date` column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [10]:
# check column types
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

## Investigating Study Strategies

Now that we've cleaned our dataset, we are ready to begin analyzing it. We want to use the data to investigate which study strategies are likely to maximize our potential winnings. The analysis should help us decide whether it is more worthwhile to study past questions or general knowledge, or whether we should not bother studying at all.

### Looking for Answers in Questions

One metric of interest is the relative frequency with which the answer appears in the question. This metric will help us to determine our chances of correctly guessing from the question alone an answer we do not know. If this metric is large, we may not need to study at all; instead, we can simply guess the answer each time based on the wording of the question. 

We calculate this metric by defining a function that measures the fraction of words in an individual answer that appear in the corresponding question. We then apply this function to each question-answer pair in the dataset and calculate the mean of the resulting distribution of answer-in-question fractions. If this mean is close to one, then all or most of the answer appears in the question very often, and we can safely guess the answers based on the wordings of the questions without studying. If the mean is close to zero, however, then the answer can rarely be found in the question, and we will have to develop a study strategy to prepare for the show.

In [11]:
def frac_ans_in_q(row):
    """
    Calculate the fraction of words in the answer that appear in 
    the question.
    
    Parameters
    ----------
    row : pandas.Series
        Row of dataframe containing questions and answers.
        
    Returns
    -------
    float
        Fraction of answer appearing in question.
    """
    
    # split answer and question into lists
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # remove 'the' instances from answer list
    split_answer = [answer for answer in split_answer if 
                    answer != 'the']
    
    # avoid dividing by zero
    if len(split_answer) == 0:
        return 0
    
    # loop through answer list
    match_count = 0
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
            
    # find fraction of answer contained in question
    return match_count / len(split_answer)

In [12]:
# calculate answer-in-question fraction for each row
jeopardy['answer_in_question'] = jeopardy.apply(
    frac_ans_in_q, axis=1)

In [13]:
mean_ans_in_q = jeopardy['answer_in_question'].mean()
mean_ans_in_q

0.06229526885934705

On average, only about 6% of the words in an answer appear in the corresponding question. Note that this is a mean fraction of answer words, not the share of questions that contain their full answer, but either way it is far too small for us to hope to win by guessing answers from the wording of the questions alone. It looks like we'll have to study for the show after all...
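
The mean answer-in-question fraction and the share of questions that contain their entire answer are different quantities. The sketch below illustrates the distinction on a few made-up question-answer pairs; the function name `frac_answer_in_question` and the sample data are hypothetical and only mirror the logic of `frac_ans_in_q`.

```python
def frac_answer_in_question(question, answer):
    """Fraction of answer words (ignoring 'the') found in the question."""
    answer_words = [w for w in answer.split() if w != 'the']
    if not answer_words:
        return 0.0
    question_words = set(question.split())
    matches = sum(w in question_words for w in answer_words)
    return matches / len(answer_words)

# made-up, already-normalized (question, answer) pairs
pairs = [
    ('the city of yuma is in this state', 'arizona'),
    ('new york city is the largest city in this state', 'new york'),
    ('this john was the second president', 'john adams'),
    ('this city is the capital of france', 'paris'),
]

fractions = [frac_answer_in_question(q, a) for q, a in pairs]
mean_fraction = sum(fractions) / len(fractions)
share_full = sum(f == 1.0 for f in fractions) / len(fractions)
print(mean_fraction, share_full)
```

In this toy sample the mean fraction is larger than the share of fully contained answers, because a partially matched answer contributes to the former but not to the latter.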

### Considering Recycled Questions

Another metric that could help us decide whether to study past questions is the frequency of recycled questions, i.e. questions that are reused in multiple episodes throughout the show's run.

Unfortunately, we cannot answer this question completely, since we are only working with about 10% of the full Jeopardy question dataset, but we can still investigate this metric in the subset of data we do have. We will not have a concrete number for the frequency of recycled questions, but we should be able to get a general idea of whether questions are recycled often or not.

Below, we sort the questions chronologically by air date and compare the meaningful terms in an individual question to the set of all meaningful terms from previously aired questions. The mean of the distribution of the fraction of terms in a question that appear in old questions gives us a rough measure of how much overlap there is between the words used in new questions and old, with a value close to one indicating a large overlap, and a value close to zero indicating a small or negligible overlap. 

Because this overlap is calculated based on individual words rather than phrases, however, it is not a clear indicator of whether questions have been fully recycled from individual past questions, or if they simply reuse words from many different past questions. Still, this overlap metric can indicate whether recycling questions is likely a rare occurrence, if the overlap is very small, or if it is a real possibility that we should investigate further, if the overlap is large.

In [14]:
# initialize list of question overlap + set of terms
question_overlap = []
terms_used = set()

# sort dataset by ascending air date
jeopardy = jeopardy.sort_values(by=['Air Date'])

In [15]:
# iterate through dataset rows
for i, row in jeopardy.iterrows():
    
    # split question into list of terms
    split_question = row['clean_question'].split()
    
    # remove short (likely uninformative) terms with fewer than 6 characters
    split_question = [q for q in split_question if 
                      len(q) >= 6]
    
    # calculate fraction of terms in question used previously
    match_count = 0
    for q in split_question:
        if q in terms_used:
            match_count += 1
        terms_used.add(q)  # add term to set
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    # add fraction to overlap list
    question_overlap.append(match_count)

In [16]:
# assign question overlap to column in dataset
jeopardy['question_overlap'] = question_overlap

In [17]:
# calculate mean of question overlap
mean_q_overlap = jeopardy['question_overlap'].mean()
mean_q_overlap

0.721603243720504

On average, about 72% of the longer terms (six or more characters) in a question have already appeared in an earlier question. This is a mean per-question fraction, so it does not mean that 72% of questions are wholly recycled; individual words can recur across many unrelated questions. The metric also understates the overlap for the earliest questions, which are compared against a nearly empty term set. Still, the large overlap shows that many terms are reused, which points to the possibility of question recycling. We should thus investigate this further.

This result also tells us that we should probably dedicate some time to studying past questions, since it appears that many of the same topics, if not the same questions, come up more than once. We may well encounter a question in our episode that is at least reminiscent of one of the past questions we will have studied.
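
One direct way to follow up on whole-question recycling is to count questions whose normalized text is identical to that of an earlier question. The sketch below is a minimal illustration on made-up strings; the function name `count_repeated_questions` is hypothetical, and exact text matching is an assumption that would miss reworded repeats.

```python
from collections import Counter

def count_repeated_questions(clean_questions):
    """Return how many questions repeat the exact text of an earlier one."""
    counts = Counter(clean_questions)
    # a text seen n times contributes n - 1 repeats
    return sum(n - 1 for n in counts.values())

# made-up, already-normalized question texts
sample_questions = [
    'this state is home to the city of yuma',
    'he signed the declaration of independence',
    'this state is home to the city of yuma',
    'this planet is closest to the sun',
]
print(count_repeated_questions(sample_questions))
```

On the real data, the same function could be given the values of the `clean_question` column.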

### Targeting High-Value Questions

Rather than studying all previous questions, we want to instead focus only on the high-value questions, i.e. those worth the most money. This will help us to increase the amount of our winnings when we are on the show.

We can isolate specific terms that correspond to high-value questions, which will give us an idea of the topics to study to maximize our earning potential. We first divide the questions into two categories based on their monetary value: (1) low value, where the value is less than \$800, and (2) high value, where the value is at least \$800. We then count the numbers of times each term in our term set is used in a high-value question, a low-value question, and any question. From there, we can calculate the chi-squared value for our observed and expected high/low-value counts for each term, which we can use to identify the terms with statistically significant differences in usage between high- and low-value questions. 

If the chi-squared value of a term is small, and the p-value correspondingly large, the difference in usage is not statistically significant, and we have no evidence of a relationship between the word and question value. On the other hand, if the chi-squared value is large, and the p-value correspondingly small, there is a statistically significant association between the word and question value, with the word more likely to be used in either high- or low-value questions.
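
The calculation can be written out as follows. The notation is introduced here for illustration only: $N$ is the total number of questions, $N_{\text{high}}$ and $N_{\text{low}}$ are the numbers of high- and low-value questions, and $O_{\text{high}}$ and $O_{\text{low}}$ are the observed numbers of high- and low-value questions containing the term.

$$
E_{\text{high}} = \frac{O_{\text{high}} + O_{\text{low}}}{N} \, N_{\text{high}},
\qquad
E_{\text{low}} = \frac{O_{\text{high}} + O_{\text{low}}}{N} \, N_{\text{low}}
$$

$$
\chi^2 = \frac{(O_{\text{high}} - E_{\text{high}})^2}{E_{\text{high}}}
       + \frac{(O_{\text{low}} - E_{\text{low}})^2}{E_{\text{low}}}
$$

Since $N_{\text{high}} + N_{\text{low}} = N$, the two expected counts sum to the observed total, and the statistic is compared against a chi-squared distribution with one degree of freedom.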

The procedure outlined above entails looping over every word in the term set, which is time-consuming, so we begin by testing a small sample of words only.

In [18]:
# define function to categorize question by value
def value_type(row):
    """
    Categorize question as high- or low-value.
    
    Parameters
    ----------
    row : pandas.Series
        Individual row of dataset.
        
    Returns
    -------
    int
        Integer value of question: 0 for low; 1 for high.
    """
    
    value = 0
    if row['clean_value'] >= 800:
        value = 1
    return value

In [19]:
# apply value categorization to each row of dataframe
jeopardy['high_value'] = jeopardy.apply(value_type, axis=1)

In [31]:
# define function to count high- and low-value occurrences of word
def value_count(word):
    """
    Return counts of occurrences of word in high- and low-value 
    questions.
    
    Parameters
    ----------
    word : str
        Word of which to count occurrences.
        
    Returns
    -------
    high_count, low_count : int
        Numbers of occurrences of word in high- and low-value
        questions.
    """
    
    # initialize counts
    low_count = 0
    high_count = 0
    
    # iterate through dataset rows
    for i, row in jeopardy.iterrows():
        if word in row['clean_question'].split():
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

In [49]:
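# randomly sample 10 elements from term set
# (random.sample requires a sequence in Python 3.11+, so convert
# the set to a list first; note that set ordering of strings can
# differ between sessions, so the seed alone does not guarantee
# the same sample)
import random
random.seed(1)
comparison_terms = random.sample(list(terms_used), 10)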
comparison_terms

['collided',
 'shouldered',
 'indians',
 'arabella',
 'calling',
 'benetton',
 'invigorate',
 'fitzroy',
 'allworthy',
 'jeweler']

In [50]:
# get counts for each term
observed_expected = []
for term in comparison_terms:
    observed_expected.append(value_count(term))

In [51]:
observed_expected

[(0, 1),
 (1, 0),
 (6, 6),
 (0, 1),
 (5, 8),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 0),
 (3, 0)]

In [52]:
# get total high/low-value question counts
high_value_count = jeopardy['high_value'].sum()
low_value_count = jeopardy.shape[0] - high_value_count

In [53]:
# calculate chi-squared of value counts for each term
from scipy.stats import chisquare
import numpy as np

chi_squared = []
p_value = []

for obs in observed_expected:
    
    # get total word count
    total = obs[0] + obs[1]
    
    # get total word proportion across all questions
    total_prop = total / jeopardy.shape[0]
    
    # get expected counts
    exp_high_count = total_prop * high_value_count
    exp_low_count = total_prop * low_value_count
    
    # compute chi-squared value and p-value
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high_count, exp_low_count])
    chisq, pvalue = chisquare(observed, expected)
    
    # append values to lists
    chi_squared.append(chisq)
    p_value.append(pvalue)

In [54]:
chi_squared

[0.6600813480534574,
 1.5149647887323945,
 0.5251384103575547,
 0.6600813480534574,
 0.00917892259470195,
 0.6600813480534574,
 0.6600813480534574,
 0.6600813480534574,
 1.5149647887323945,
 4.544894366197182]

In [55]:
p_value

[0.41653122582698476,
 0.2183830639074722,
 0.4686579749653339,
 0.41653122582698476,
 0.9236741008612599,
 0.41653122582698476,
 0.41653122582698476,
 0.41653122582698476,
 0.2183830639074722,
 0.03301705930176248]

Of the ten randomly sampled terms, only one has a chi-squared value large enough, and hence a p-value small enough, to be considered statistically significant. The word "jeweler" has a chi-squared value of approximately 4.54 and a p-value of about 0.03, making it the only term in our sample with a p-value below the conventional 0.05 threshold. The p-value alone does not tell us the direction of the difference, but the observed counts do: all three questions containing "jeweler" are high-value. We should treat this result with caution, however, because the chi-squared approximation is unreliable when expected counts are small (a common rule of thumb asks for at least 5 per cell), and a term that appears in only three questions falls well short of that.
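
To make the role of the expected counts concrete, the sketch below computes the two-cell chi-squared statistic by hand and flags results whose expected counts are small. The function name `chi2_high_low`, the `min_expected` threshold of 5, and the totals of 40 high-value and 60 low-value questions are all hypothetical and are not taken from our dataset. For one degree of freedom, the p-value can be obtained from the standard library via `math.erfc`.

```python
import math

def chi2_high_low(obs_high, obs_low, n_high, n_low, min_expected=5):
    """Two-cell chi-squared test of a term's counts against the overall split.

    Assumes the term occurs at least once and both groups are non-empty.
    Returns (chi2, p_value, reliable), where reliable is False when
    either expected count is below min_expected.
    """
    total = obs_high + obs_low
    n = n_high + n_low
    exp_high = total * n_high / n
    exp_low = total * n_low / n
    chi2 = ((obs_high - exp_high) ** 2 / exp_high
            + (obs_low - exp_low) ** 2 / exp_low)
    # survival function of chi-squared with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    reliable = min(exp_high, exp_low) >= min_expected
    return chi2, p_value, reliable

# hypothetical totals: 40 high-value and 60 low-value questions
chi2, p, reliable = chi2_high_low(3, 0, n_high=40, n_low=60)
print(chi2, p, reliable)
```

A term seen only three times can produce a p-value below 0.05 while still being flagged as unreliable, which is why rare terms deserve caution.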

Only one of the ten sampled terms shows a statistically significant difference in usage between high- and low-value questions. However, ten terms is a tiny fraction of the full term set, and most of the sampled terms appear in only one question, so we cannot draw a reasonable conclusion from this sample. We should repeat the chi-squared test across a larger sample of terms, ideally restricted to terms that occur often enough for the test to be meaningful, but this would require speeding up our procedure, since the current method loops over the entire dataset once for every term. We leave this for the future, along with some other ideas for better developing useful study strategies.
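
One possible way to speed up the counting is to tally every term in a single pass over the questions, rather than scanning the whole dataset once per term. The sketch below is an assumption about how this could be done, not part of the analysis above; the function name `count_terms_by_value` and the sample rows are hypothetical.

```python
from collections import Counter

def count_terms_by_value(rows, min_len=6):
    """Count how many high- and low-value questions contain each term.

    rows: iterable of (clean_question, high_value) pairs, where
    high_value is 1 for high-value questions and 0 otherwise.
    """
    high_counts, low_counts = Counter(), Counter()
    for question, high_value in rows:
        # use a set so each term is counted once per question
        terms = {t for t in question.split() if len(t) >= min_len}
        if high_value == 1:
            high_counts.update(terms)
        else:
            low_counts.update(terms)
    return high_counts, low_counts

# made-up rows for illustration
rows = [
    ('this jeweler founded a famous company', 1),
    ('a jeweler and a jeweler', 1),
    ('the indians played here', 0),
    ('indians of this region', 1),
]
high, low = count_terms_by_value(rows)
print(high['jeweler'], low['jeweler'], high['indians'], low['indians'])
```

On the real data, the rows could be supplied as `zip(jeopardy['clean_question'], jeopardy['high_value'])`. Counting each term once per question matches the behavior of `value_count`, which counts questions containing the word rather than total occurrences.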

## Conclusion

In this analysis, we determined it would suit us best to study for our upcoming appearance on the show "Jeopardy". At this point, it is clear we should study past questions, but since there are so many, we should find a way to narrow down the set of past questions to study. 

We are interested in maximizing our winnings, so we would like to focus on high-value questions about topics that occur frequently. We found that analyzing individual words yielded results from which it was difficult to draw any meaningful conclusions. We should instead try to search for common phrases when analyzing both the question overlap and the chi-squared of the high/low-value counts.

We would also like to analyze the category data, looking at the most frequently used categories or the probability that a given category will occur in an episode or a specific round. This may help us to determine which topics to spend the most time studying.
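
As a starting point for the category analysis, the most frequent categories can be tallied with a counter. The sketch below uses made-up category labels, and the function name `top_categories` is hypothetical.

```python
from collections import Counter

def top_categories(categories, n=3):
    """Return the n most frequent categories with their counts."""
    return Counter(categories).most_common(n)

# made-up category labels for illustration
sample_categories = ['HISTORY', 'SCIENCE', 'HISTORY',
                     'LITERATURE', 'HISTORY', 'SCIENCE']
print(top_categories(sample_categories, n=2))
```

On the real data, the values of the `Category` column could be passed in directly; pandas also offers `Series.value_counts()` for the same purpose.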

We leave these steps for the future.