# Introduction
Let us say we are going to compete on the TV show Jeopardy, and we wish to have an edge. We will be looking at a dataset of previous Jeopardy questions to figure out patterns in the questions that can help us win. 

We will be looking at a dataset from "jeopardy.csv", downloaded from: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

# Exploring and Normalizing the Dataset
We will first examine the jeopardy dataset, and we will also clean string objects.

In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

# Print out the first five rows of jeopardy
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [2]:
# Examine the columns of jeopardy
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [3]:
# Remove the spaces in the column names
jeopardy.columns = jeopardy.columns.str.strip()
print(jeopardy.columns)

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')


We wish to normalize the text columns, particularly the Question and Answer columns. We also wish to normalize the Value column (convert to numeric, remove the dollar sign) and the Air Date column (convert to datetime).

In [13]:
# Take in a string, convert string to lowercase, remove all punctuation, return string
def normalizeText(jeopardy_string):
    jeopardy_string = jeopardy_string.lower()
    
    punctuations = [".", ",", ";", ":", "\'", "-", "_", "\"", "?", "!", "(", ")"]
    
    for punct in punctuations:
        jeopardy_string = jeopardy_string.replace(punct, "")
        
    return jeopardy_string

# Take in the dollar value, remove the dollar sign, return number

def normalizeValue(jeopardy_string):
    jeopardy_string = jeopardy_string.replace("$", "")
    jeopardy_string = jeopardy_string.replace(",", "")
    
    # Convert the string to an integer. If there is a conversion error, return 0
    try:
        return int(jeopardy_string)
    except:
        return 0

In [17]:
# Normalize the Question column - Result will be in the new clean_question column
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizeText)

# Normalize the Answer column - Result will be in the new clean_answer column
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizeText)

# Normalize the Value columns - Result will be in the new clean_value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizeValue)

# Convert the Air Date column to a datetime column
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'], format = '%Y-%m-%d')

jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200


# Extracting Answers and Repeats from Questions
If we want to prepare for Jeopardy, we want to know if it is worth studying past questions or general knowledge, or if we are better off not studying at all. We want to figure out:
- How often the answer is deucible from the question
- How often new questions are repeats of older questions

For the first question, we can examine how many times words in the answer also occur in the question. For the second question, we can see how often complex words (> 6 characters) reoccur.

In [20]:
# See how often words in the answer also appear in the question
def answer_in_question(jeopardy_row):
    # Split the clean_answer and clean_question columns objects into lists of strings
    split_answer = jeopardy_row['clean_answer'].split()
    split_question = jeopardy_row['clean_question'].split()
    
    # Count number of matches between answer and question strings
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")        # "the" is common, but there is no meaningful use
    
    if len(split_answer) == 0:
        return 0        # Prevents a division by zero error
    
    else:
        # Loop through each item in split_answer and see if it occurs in split_question
        for item in split_answer:
            if item in split_question:
                match_count += 1
                
        return match_count / float(len(split_answer))

In [23]:
# Count how many times terms in clean_answer occur in clean_question by applying our function on each row in jeopardy
# These numbers will be written into the new answer_in_question column

jeopardy['answer_in_question'] = jeopardy.apply(answer_in_question, axis = 1)

# Get average of the number of times terms in the answer occur in the question
mean_answer_in_question = jeopardy['answer_in_question'].mean()
print(mean_answer_in_question)

0.05860143628782072


Only 5.8% of the 20,000 answers in the jeopardy dataset have terms that are also found in the corresponding questions. If we want to study for Jeopardy, it would not be fruitful to prepare by seeking the answers in the questions, considering how this occurs rarely.

To answer the second question, we stated that we can see how often complex words (>6 characters) reoccur. To do this, we will be counting reoccurring terms through a jeopardy DataFrame that is sorted by ascending air date. We will be looking at words with more than 6 characters as this allows us to filter out words like "the" and "then", which are commonly used, but do not tell us a lot about a question.

In [30]:
# We will be keeping track of what complex terms in questions get reused
# and how often complex terms in questions get reused, indicative of recycled questions
question_overlap = []
terms_used = set()

# The jeopardy dataset is already ordered by Air Date in ascending order
for index, row in jeopardy.iterrows():
    # Convert the clean_question into a list of strings
    split_question = row['clean_question'].split()
    
    # Remove any words in split_question that are less than 6 characters long
    split_question_duplicate = split_question
    
    for word in split_question_duplicate:
        if len(word) < 6:
            split_question.remove(word)
    
    # Loop through each word in split_question, keeping track of matches
    # Add complex terms to terms_used. Unique terms will be added, repeated ones won't add to set
    # Return a probability for the given question
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
            
        terms_used.add(word)
        
    if len(split_question) > 0:
        match_count /= float(len(split_question))
        question_overlap.append(match_count)
            
    else:
        question_overlap.append(0)
            
jeopardy['question_overlap'] = question_overlap

print(jeopardy['question_overlap'].mean())

0.8023190764775775


It appears that with time, 80% of the words in a given question were used in a previous question. It means that present and future questions are likely going to have words that have appeared in previous questions. That said, there is no guarantee that the subject of the question will be the same as past questions, as we only looked at matches in words, not phrases. That said, this may still indicate that certain _ideas_ may still be repeated in Jeopardy questions.

## Repetition in High Value Questions
Let's say we only want to study questions that pertain to high-value questions instead of low-value questions, as this may help us earn more money when we're on Jeopardy. We can figure out which terms correspond to high-value questions using a chi-squared test. We have defined the dollar amount that divides low-value questions from high-value questions to be $800.

In [32]:
# Classify a question as high value (above $800) or low value
def highValueQuestion(jeopardy_row):
    if jeopardy_row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    
    return value

jeopardy['high_value'] = jeopardy.apply(highValueQuestion, axis = 1)

In [33]:
# This function takes in a word. We look through the jeopardy dataset to see how many times
# the word appears in a high-value question and how many times it appears in a low-value question
def count_high_low(word):
    low_count = 0
    high_count = 0
    
    # Loop through each row in jeopardy to count how often the words appears in high/low-value question
    for index, row in jeopardy.iterrows():
        split_question = row['clean_question'].split()
        
        if word in split_question:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count, low_count

# Create a list of observed high/low counts for each word
observed_expected = []

# Convert the sets to a list. We will look at the first five words (looking at all words would take too much time)
terms_used = list(terms_used)
comparison_terms = terms_used[:5]

for term in comparison_terms:
    high, low = count_high_low(term)
    observed_expected.append([high, low])
    
observed_expected

[[4, 3], [0, 1], [41, 111], [0, 1], [0, 1]]

We have found the observed counts for the first five terms. We now want to compute the expected counts and the chi-squared value.

In [35]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]

chi_squared = []

for term in observed_expected:
    total = sum(term)
    
    total_prop = total / float(jeopardy.shape[0])
    
    expected_high = total_prop * high_value_count
    expected_low = total_prop * low_value_count
    
    chi, p_value = chisquare(term, [expected_high, expected_low])
    
    chi_squared.append([chi, p_value])
    
for term in chi_squared:
    print(term)

[2.774619927181822, 0.09576938744167536]
[0.401962846126884, 0.5260772985705469]
[0.21422879036359924, 0.6434729205350347]
[0.401962846126884, 0.5260772985705469]
[0.401962846126884, 0.5260772985705469]


Here are some potential next steps:

- Find a better way to eliminate non-informative words than just removing words that are less than 6 characters long. Some ideas:
    - Manually create a list of words to remove, like the, than, etc.
    - Find a list of stopwords to remove.
    - Remove words that occur in more than a certain percentage (like 5%) of questions.
- Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
    - Use the apply method to make the code that calculates frequencies more efficient.
    - Only select terms that have high frequencies across the dataset, and ignore the others.
- Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
    - See which categories appear the most often.
    - Find the probability of each category appearing in each round.
- Use the whole Jeopardy dataset (available here) instead of the subset we used in this mission.
- Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
We recommend creating a Github repository and placing this project there. It will help other people, including employers, see your work. As you start to put multiple projects on Github, you'll have the beginnings of a strong portfolio.

You're welcome to keep working on the project here, but we recommend downloading it to your computer using the download icon above and working on it there.