### Read in the data set

In [24]:
import pandas as pd
jeopardy=pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [25]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [26]:
### Some column names have a leading space; strip it.
cols = [i.strip() for i in jeopardy.columns]
jeopardy.columns = cols
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
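
An equivalent and slightly more compact way to strip the header whitespace is the vectorized `str.strip()` accessor on the column index. The snippet below is a minimal sketch on a made-up inline CSV (the text in `csv_text` is an assumption for illustration, not the real file):

```python
import io
import pandas as pd

# Hypothetical miniature CSV whose header has spaces after the commas.
csv_text = "Show Number, Air Date, Round\n4680, 2004-12-31, Jeopardy!\n"
df = pd.read_csv(io.StringIO(csv_text))

# Strip leading/trailing whitespace from every column name at once.
df.columns = df.columns.str.strip()
print(list(df.columns))
```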

### Normalizing Question and Answer columns

In [27]:
import re
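def normalize(s):
    # Lowercase, drop everything except letters, digits and whitespace,
    # then collapse runs of spaces into one.
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    s = re.sub(r" +", " ", s)
    return s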

In [28]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize)
jeopardy["clean_question"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [29]:
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize)
jeopardy["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object
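
To make the effect of `normalize` concrete, here is a small self-contained sketch. The second sample string is invented for illustration. Note that punctuation is deleted rather than replaced by a space, so hyphenated words are fused together:

```python
import re

def normalize(s):
    # Same steps as above: lowercase, drop punctuation, collapse spaces.
    s = s.lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    s = re.sub(r" +", " ", s)
    return s

apostrophe = normalize("McDonald's")            # apostrophe is removed
hyphenated = normalize("A well-known  STAR")    # made-up sample string
print(apostrophe)
print(hyphenated)
```

If fused words turn out to matter for the word matching later on, one option would be to replace punctuation with a space instead of an empty string.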

### Normalizing Value and Air Date columns

In [30]:
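def norm_value(s):
    # Keep only digits (and whitespace), then convert to an integer.
    s = re.sub(r"[^0-9\s]", "", s)
    try:
        s = int(s)
    except ValueError:
        # Nothing numeric left to parse: treat the value as 0.
        s = 0
    return s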

jeopardy["clean_value"] = jeopardy["Value"].apply(norm_value)
jeopardy["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

In [31]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object
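
The value cleaning can also be done without a Python-level function. The following is a hedged sketch using pandas string methods and `pd.to_numeric`; the sample values (including `"$1,000"` and `"None"`) are assumptions about what the column might contain, not rows taken from the file:

```python
import pandas as pd

# Hypothetical sample of raw "Value" entries.
values = pd.Series(["$200", "$1,000", "None"])

# Remove every non-digit character, then convert; anything that cannot be
# parsed becomes NaN and is replaced by 0, mirroring norm_value above.
digits = values.str.replace(r"[^0-9]", "", regex=True)
clean = pd.to_numeric(digits, errors="coerce").fillna(0).astype(int)
print(clean.tolist())
```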

### What strategy?

To decide whether to study past questions, study general knowledge, or not study at all, it would be helpful to figure out two things:

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

We can answer the second question by seeing how often complex words (6 or more characters) reoccur. We can answer the first question by measuring what share of the words in each answer also occur in its question. We'll work on the first question now.

In [32]:
def deducible(r):
    split_answer = r["clean_answer"].split(' ')
    split_question = r["clean_question"].split(' ')
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(deducible, axis=1)
jeopardy["answer_in_question"].mean()

0.05898946462474648

On average, only about 6% of the words in an answer also appear in its question. That is a small share, so we probably can't count on the question itself giving the answer away. We'll probably have to study.
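
To see how the score behaves on individual rows, here is a minimal sketch that applies the same function to two invented rows (plain dictionaries standing in for DataFrame rows; the question and answer text is made up):

```python
def deducible(r):
    split_answer = r["clean_answer"].split(' ')
    split_question = r["clean_question"].split(' ')
    match_count = 0
    # "the" is too common to be informative, so drop it from the answer.
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

# Hypothetical rows for illustration only.
partial = {"clean_question": "this city of lights is the capital of france",
           "clean_answer": "paris france"}
no_match = {"clean_question": "this band from liverpool had 20 number one hits",
            "clean_answer": "the beatles"}

partial_score = deducible(partial)    # 1 of 2 answer words is in the question
no_match_score = deducible(no_match)  # "the" is dropped, "beatles" is absent
print(partial_score, no_match_score)
```

The score is a fraction of answer words, so a mean near 0.06 does not mean that 6% of answers can be fully read off the question.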

Let's work on the second question now:
* How often new questions are repeats of older questions.

In [33]:
jeopardy = jeopardy.sort_values(["Air Date"])

question_overlap = []
terms_used = set()
for index, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    split_question = [s for s in split_question if len(s)>5]
    match_count = 0
    for w in split_question:
        if w in terms_used:
            match_count += 1
    for w in split_question:
        terms_used.add(w)
    if len(split_question) > 0:
        match_count = match_count/len(split_question)
    question_overlap.append(match_count)

jeopardy["question_overlap"] = question_overlap
jeopardy["question_overlap"].mean()

0.6876260592169776

### Question overlap

On average, about 69% of the longer terms (6 or more characters) in a question have already appeared in an earlier question. This analysis covers only a subset of the questions, and it matches single terms rather than phrases, so it is weak evidence on its own. Still, the overlap is high enough that the recycling of questions is worth looking into further.
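
The running-overlap computation can be traced on a tiny example. This is a sketch with three invented questions, and `running_overlap` is a hypothetical helper name rather than part of the notebook:

```python
def running_overlap(questions, min_len=6):
    """For each question, the share of its long words already seen earlier."""
    seen = set()
    overlaps = []
    for q in questions:
        words = [w for w in q.split(" ") if len(w) >= min_len]
        matches = sum(1 for w in words if w in seen)
        seen.update(words)  # add only after counting, as in the loop above
        overlaps.append(matches / len(words) if words else 0)
    return overlaps

# Made-up, already-normalized questions in air-date order.
questions = [
    "this italian astronomer defended copernicus",
    "copernicus placed this object at the center",
    "this italian city hosted the olympics",
]
overlaps = running_overlap(questions)
print(overlaps)
```

The first question always scores 0 and later questions are compared against an ever larger vocabulary, so the mean depends on how many questions are included and on their order.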

### What to study?
Let's say we now want to focus on the terms that show up in high-value questions rather than low-value ones. This will help us earn more money on Jeopardy.

We can find the words with the biggest differences in usage between high- and low-value questions by selecting the words with the highest chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [34]:
def weight_values(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

### We first need to split the questions into two categories:
### * Low value -- any row where clean_value is 800 or less.
### * High value -- any row where clean_value is greater than 800.

jeopardy["high_value"] = jeopardy.apply(weight_values, axis=1)

In [35]:
def high_low_count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question = row["clean_question"].split(' ')
        if word in split_question:
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

### Find the number of high/low value questions the word occurs in
observed_expected = []
comparison_terms = list(terms_used)[0:5]  # small sample; set order is arbitrary, so the terms vary between runs
for t in comparison_terms:
    observed_expected.append(high_low_count(t))
observed_expected

[(0, 1), (1, 0), (1, 3), (1, 0), (0, 1)]

Now that we've found the observed counts for a few terms, we can compute the expected counts and the chi-squared value.

In [36]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
chi_squared = []

for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    exp_high = total_prop*high_value_count
    exp_low = total_prop*low_value_count
    observed = np.array([obs[0], obs[1]])
    expected = np.array([exp_high,exp_low])
    chi_squared.append(chisquare(observed,expected))

chi_squared

[Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.026364433084407689, pvalue=0.87101348468892104),
 Power_divergenceResult(statistic=2.4877921171956752, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.40196284612688399, pvalue=0.52607729857054686)]

### Chi-squared results

None of the terms showed a significant difference in usage between high-value and low-value rows: every p-value is well above 0.05. Additionally, every term occurred fewer than 5 times, and the chi-squared approximation is unreliable at such low counts. It would be better to run this test only on terms with higher frequencies.
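
The statistic itself is simple enough to compute by hand, which also makes the low-count caveat easy to check. Below is a minimal sketch using only the standard library. The group sizes and term counts are hypothetical, and the "expected count of at least 5" threshold is a common rule of thumb rather than something taken from this notebook. With two categories there is 1 degree of freedom, for which the p-value equals `erfc(sqrt(stat / 2))`:

```python
import math

def chi_squared_stat(observed, expected):
    """Pearson statistic: sum of (observed - expected)^2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_1dof(stat):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical counts, chosen only for illustration.
n_high, n_low = 300, 700   # rows in each value group
observed = (12, 8)         # (high, low) questions containing the term

# Expected counts if the term were spread evenly across both groups.
share = sum(observed) / (n_high + n_low)
expected = (share * n_high, share * n_low)

stat = chi_squared_stat(observed, expected)
p_value = p_value_1dof(stat)
reliable = min(expected) >= 5   # rule-of-thumb check on expected counts
print(round(stat, 4), reliable)
```

Feeding the first statistic from the output above (about 0.402) into `p_value_1dof` gives roughly 0.526, matching the p-value that `chisquare` reported.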

### Further investigations

Here are some potential next steps:

* Find a better way to eliminate non-informative words than just removing words that are fewer than 6 characters long. Some ideas:
  * Manually create a list of words to remove, like the, than, etc.
  * Find a list of stopwords to remove.
  * Remove words that occur in more than a certain percentage (like 5%) of questions.


* Perform the chi-squared test across more terms to see what terms have larger differences. This is hard to do currently because the code is slow, but here are some ideas:
  * Use the apply method to make the code that calculates frequencies more efficient.
  * Only select terms that have high frequencies across the dataset, and ignore the others.


* Look more into the Category column and see if any interesting analysis can be done with it. Some ideas:
  * See which categories appear the most often.
  * Find the probability of each category appearing in each round.


* Use the whole Jeopardy dataset instead of the subset we used in this mission.
* Use phrases instead of single words when seeing if there's overlap between questions. Single words don't capture the whole context of the question well.
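
As a starting point for the speed-up and the "high-frequency terms only" ideas above, one possible approach is to count every term in a single pass instead of rescanning the whole dataset once per term. This is a sketch under assumptions: `count_terms` is a hypothetical helper, the rows are invented, and the frequency threshold of 2 is arbitrary.

```python
from collections import Counter

def count_terms(rows, min_len=6):
    """Count, per term, how many high- and low-value questions contain it."""
    high, low = Counter(), Counter()
    for question, is_high in rows:
        # A set, so each question counts at most once per term.
        words = {w for w in question.split(" ") if len(w) >= min_len}
        (high if is_high else low).update(words)
    return high, low

# Made-up (clean_question, high_value) pairs.
rows = [
    ("this italian astronomer defended copernicus", 1),
    ("copernicus placed this object at the center", 0),
    ("copernicus copernicus studied in italian cities", 0),
]
high, low = count_terms(rows)

# Keep only terms that occur in at least 2 questions overall.
frequent = sorted(t for t in set(high) | set(low) if high[t] + low[t] >= 2)
print(frequent)
```

With the real data, the pairs could come from something like `zip(jeopardy["clean_question"], jeopardy["high_value"])`, and the resulting counts could then be passed to the chi-squared step.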
