In [1]:
import pandas as pd

jeopardy = pd.read_json('data/jeopardy_questions1.json')
jeopardy.head(10)

Unnamed: 0,air_date,answer,category,question,round,show_number,value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200
5,2004-12-31,the ant,3-LETTER WORDS,"'In the title of an Aesop fable, this insect s...",Jeopardy!,4680,$200
6,2004-12-31,the Appian Way,HISTORY,'Built in 312 B.C. to link Rome & the South of...,Jeopardy!,4680,$400
7,2004-12-31,Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 8: 30 steals for the Birmingham Barons; 2...,Jeopardy!,4680,$400
8,2004-12-31,Washington,EVERYBODY TALKS ABOUT IT...,"'In the winter of 1971-72, a record 1,122 inch...",Jeopardy!,4680,$400
9,2004-12-31,Crate & Barrel,THE COMPANY LINE,'This housewares store was named for the packa...,Jeopardy!,4680,$400


In [None]:
jeopardy.columns

## Cleaning the Data
A review of the data shows a few cleanup tasks that will make the analysis easier.

The first step is to normalise the `question` and `answer` columns. We'll do this with a function that lowercases each string and removes common punctuation (periods, commas, colons, quotes, slashes, exclamation marks and question marks).

Next, the `value` column needs to be cleaned. It should be numeric, but the source data stores it as a string with a dollar sign. The dollar sign (and any comma used as a thousands separator) must be removed before the column is converted to an integer type.

`air_date` is stored as a string but should be a date. This is a straightforward conversion with `pd.to_datetime`.
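
As a quick sanity check, the three cleaning steps can be tried on a tiny hand-made frame first. This is only a sketch: the `toy` frame and its values are made up, and `regex=True` is passed explicitly because `Series.str.replace` treats the pattern as a literal string by default from pandas 2.0 onwards.

```python
import pandas as pd

# Hypothetical two-row frame that mimics the raw columns
toy = pd.DataFrame({
    "question": ["'What's up, Doc?'", "'No. 2: 1912 Olympian'"],
    "value": ["$1,200", None],
    "air_date": ["2004-12-31", "2005-01-03"],
})

# Strip the punctuation set described above, then lowercase
toy["clean_question"] = (
    toy["question"].str.replace(r'[.\':",\\/!?]', "", regex=True).str.lower()
)

# Missing values become "0"; "$" and "," are removed before the integer cast
toy["clean_value"] = (
    toy["value"].fillna("0").str.replace(r"[$,]", "", regex=True).astype("int32")
)

# Parse ISO-formatted date strings into datetimes
toy["air_date"] = pd.to_datetime(toy["air_date"])

print(toy[["clean_question", "clean_value", "air_date"]])
```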

In [2]:
import re

def normalise_string(s):
    # Raw string avoids invalid-escape warnings; the character set is unchanged
    pattern = r'[.\':",\\/!?]'
    s = re.sub(pattern, '', s)
    return s.lower()

def remove_dollar(cash):
    # str has no .remove() method; use .replace() and return the result
    return cash.replace('$', '')

jeopardy["clean_question"] = jeopardy["question"].apply(normalise_string)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalise_string)
jeopardy["clean_value"] = jeopardy["value"].replace([None],'0')
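# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
jeopardy["clean_value"] = jeopardy["clean_value"].str.replace('[$,]', '', regex=True).astype('int32')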
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

In [3]:
jeopardy.head(10)

Unnamed: 0,air_date,answer,category,question,round,show_number,value,clean_question,clean_answer,clean_value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200,for the last 8 years of his life galileo was u...,copernicus,200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200,no 2 1912 olympian; football star at carlisle ...,jim thorpe,200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200,the city of yuma in this state has a record av...,arizona,200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200,signer of the dec of indep framer of the const...,john adams,200
5,2004-12-31,the ant,3-LETTER WORDS,"'In the title of an Aesop fable, this insect s...",Jeopardy!,4680,$200,in the title of an aesop fable this insect sha...,the ant,200
6,2004-12-31,the Appian Way,HISTORY,'Built in 312 B.C. to link Rome & the South of...,Jeopardy!,4680,$400,built in 312 bc to link rome & the south of it...,the appian way,400
7,2004-12-31,Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 8: 30 steals for the Birmingham Barons; 2...,Jeopardy!,4680,$400,no 8 30 steals for the birmingham barons; 2306...,michael jordan,400
8,2004-12-31,Washington,EVERYBODY TALKS ABOUT IT...,"'In the winter of 1971-72, a record 1,122 inch...",Jeopardy!,4680,$400,in the winter of 1971-72 a record 1122 inches ...,washington,400
9,2004-12-31,Crate & Barrel,THE COMPANY LINE,'This housewares store was named for the packa...,Jeopardy!,4680,$400,this housewares store was named for the packag...,crate & barrel,400


## Finding our Strategy
Now that our data has been cleaned, the next step is to look for patterns that tell us whether studying past questions, studying general knowledge, or not studying at all would be the best strategy. To do this, we have two questions to answer:
- How often is the answer deducible from the question?
- How often are new questions repeats of older questions?

For the first question, we can analyze how often words in the answer also occur in the question.
For the second one, we can look at how often complex words (defined by the project as six or more characters) recur across questions.

We will first investigate by looking at how often the answer is deducible from the question:

In [16]:
def count_matches(row):
    match_count = 0
    split_question = row["clean_question"].split(" ")
    split_answer = row["clean_answer"].split(" ")
    
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
        
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [21]:
print("Mean: ",jeopardy["answer_in_question"].mean())
print("Mode: ",jeopardy["answer_in_question"].mode()[0])

Mean:  0.05294048584813152
Mode:  0.0


The `answer_in_question` value is the fraction of words in an answer that also appear in its question. Its mean is about 5%, and the mode is zero, so the most common outcome is that no answer word appears in the question. Trying to deduce the answer from the question does not look like an effective way of studying.
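
To make the metric concrete, here is a small worked example of the ratio on two made-up rows. The strings are hypothetical, and `answer_ratio` is an illustrative stand-in for `count_matches` that takes plain strings rather than a DataFrame row.

```python
def answer_ratio(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that occur in the question."""
    question_words = clean_question.split(" ")
    answer_words = clean_answer.split(" ")
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    hits = sum(word in question_words for word in answer_words)
    return hits / len(answer_words)

# "way" is in the question but "appian" is not, so 1 of 2 answer words match
print(answer_ratio("this way linked rome to the south", "the appian way"))
# "ant" never appears in the question, so nothing matches
print(answer_ratio("this insect shames the grasshopper", "the ant"))
```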

Moving on... let's find out how often questions repeat. Note that the data we have is only a sample: it contains about 215K Jeopardy questions, while the post it came from noted about 250K total questions at the time of compilation. The dataset was prepared in 2014, and this project was created in 2019. We won't have a complete count, but we have enough to see if there are any patterns in the questions.

In [32]:
# walk through the steps...
question_overlap = []
terms_used = set()
# row by row... is there a better way?
for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    terms_used.update(split_question)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

In [34]:
jeopardy["question_overlap"].mean()

0.85513053959041

## Question Overlap
Our mean indicates about an 85% overlap between the terms in a question and the terms used in earlier questions. This looks good on the surface, but it only considers individual words, not phrases, so it may not be as significant as it seems. Still, it is a more promising signal than the previous method.
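
One way to probe the phrase-level concern is to repeat the overlap measure on pairs of adjacent words instead of single words. The following is only a sketch on made-up strings: `bigram_overlap` is a hypothetical helper, not part of the original analysis, and it has not been run on the Jeopardy data.

```python
def bigram_overlap(clean_questions):
    """For each question, the share of its adjacent word pairs seen in earlier questions."""
    seen = set()
    overlaps = []
    for text in clean_questions:
        words = text.split()
        pairs = list(zip(words, words[1:]))
        if pairs:
            overlaps.append(sum(pair in seen for pair in pairs) / len(pairs))
        else:
            overlaps.append(0)
        # Remember this question's pairs for the questions that follow
        seen.update(pairs)
    return overlaps

sample = ["this roman road", "this roman emperor", "road emperor roman"]
print(bigram_overlap(sample))
```

In the last sample string every individual word has appeared before, yet none of its word pairs has, which is the kind of gap a single-word measure cannot show.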

## Prioritising Study Time
Instead, why don't we focus on high-value questions? This may be a good strategy, since we would earn more by doing well on questions that are worth more.

To compare word usage with a chi-squared test, we first split the questions into two categories:
- Low value: value of 800 or less
- High value: value greater than 800
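
The same 0/1 flag can also be built without a row-wise function. A minimal sketch on a made-up `clean_value` column, assuming the cut-off of 800 used in this notebook:

```python
import pandas as pd

# Hypothetical values; 800 itself falls on the low side of the cut-off
toy = pd.DataFrame({"clean_value": [200, 800, 1000, 2000]})

# The comparison yields booleans; casting to int gives the 0/1 flag
toy["high_value"] = (toy["clean_value"] > 800).astype(int)
print(toy["high_value"].tolist())
```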

In [36]:
def determine_high_value(row):
    if row["clean_value"] > 800:
        return 1
    else:
        return 0

jeopardy["high_value"] = jeopardy.apply(determine_high_value, axis=1)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
Name: high_value, dtype: int64

In [41]:
def count_word(word):
    low_count = 0
    high_count = 0
    
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(" ")
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

terms_used = list(terms_used)
comparison_terms = terms_used[:5]
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_word(term))

In [42]:
observed_expected

[(1, 5), (0, 4), (0, 1), (1, 0), (1, 0)]

In [72]:
import numpy as np
from scipy import stats
high_value_count = jeopardy[jeopardy["high_value"]==1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"]==0].shape[0]
chi_squared = []

for tu in observed_expected:
    total = sum(tu)
    total_prop = total / jeopardy.shape[0]
    print(total_prop)
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    
    expected = np.array([expected_high_count, expected_low_count])
    observed = np.array([tu[0], tu[1]])
    chi_squared.append(stats.chisquare(observed, expected))

chi_squared


2.7658691743880515e-05
1.8439127829253676e-05
4.609781957313419e-06
4.609781957313419e-06
4.609781957313419e-06


[Power_divergenceResult(statistic=0.4010346717612653, pvalue=0.5265553925560025),
 Power_divergenceResult(statistic=1.5799058569334052, pvalue=0.2087742545638461),
 Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751)]

## Results
None of the p-values is below 0.05, so there is no significant difference in the usage of these words between high- and low-value questions. Note that the five terms are simply the first five elements of `terms_used`, not the most frequent terms, and their counts are very low, with nothing greater than five. Counts this small make the chi-squared test unreliable, so this check gives no evidence that focusing on terms used more often in high-value questions offers a statistically significant advantage.
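
To see why such small counts limit the test, the expected counts can be worked out by hand. The sketch below uses made-up round numbers (a hypothetical data set of 200,000 questions, 60,000 of them high value) rather than the real totals, and it computes the statistic directly from its definition instead of calling `scipy.stats.chisquare`.

```python
def chi_squared_by_hand(observed_high, observed_low, high_total, low_total):
    """Expected counts and chi-squared statistic for one term."""
    total_questions = high_total + low_total
    # Share of all questions that contain the term
    term_share = (observed_high + observed_low) / total_questions
    expected_high = term_share * high_total
    expected_low = term_share * low_total
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    return expected_high, expected_low, statistic

# A term seen once in high-value and five times in low-value questions
exp_high, exp_low, stat = chi_squared_by_hand(1, 5, 60_000, 140_000)
print(exp_high, exp_low, stat)
```

With these assumed totals both expected counts come out below five, the range in which the chi-squared approximation is generally considered unreliable.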