# TODO:

__Question Difficulty__

Monetary value acts as a proxy for the difficulty of the questions asked in Jeopardy! Using topic modeling, explore which aspects of the questions make them more difficult. Tying this to the above, how have these trends changed over time?


Details for Question 2:

- First, take the questions from the 'Jeopardy!' and 'Double Jeopardy!' rounds across all time and break them into separate collections according to their `value`. For each `value`, use LDA and/or NMF to identify the topics of the questions asked. Compare and contrast the topics identified for each `value` and analyze your findings. Are these topics inherently different? Is there overlap among the difficulty levels? What kinds of questions seem to be the most difficult?

- Now take the questions from above and further split them by year, so that you have one collection of questions per `value` for each year. Use LDA and/or NMF to identify topics in each `value`-year combination. Investigate the dynamics of the questions and analyze which topics were considered easy and which were considered difficult over time. Are there any interesting trends here? Think about what would be interesting to report to the world about Jeopardy!

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import re

__Helper Functions__

In [2]:
def clean_column_names(cols: list) -> list:
    """
    Given a list (or any iterable) of column names,
    returns a list of the cleaned column names in lower case, 
    with extra spaces stripped off,
    and words separated by underscores.
    """
    clean_cols = []
    for col in cols:
        clean_cols.append(col.lower().strip().replace(" ", "_"))
    return clean_cols


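def clean_value(amounts: np.ndarray) -> np.ndarray:
    """
    Given an array of string values under the `value`
    column, returns an array of floating point
    monetary values, replacing `None` with NaN.
    """
    clean_values = np.zeros(len(amounts))
    for idx, amount in enumerate(amounts):
        # treat the literal string "None" as missing; also accept values that
        # read_csv may already have parsed as NaN, depending on the pandas version
        if pd.isna(amount) or amount == "None":
            clean_values[idx] = np.nan
        else:
            # strip the dollar sign and thousands separator, e.g. "$1,000" -> 1000.0
            clean_values[idx] = float(amount.replace("$", "").replace(",", ""))
    return clean_values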

def get_encoding_dict(df: pd.DataFrame, col: str) -> dict:
    """
    Given a dataframe and the name of the column
    to be encoded, returns a dictionary with
    original values as keys and encodings as values.
    """
    uniq_values = df[col].unique()
    enc_dict = {}
    for idx, val in enumerate(uniq_values):
        enc_dict[val] = idx
    return enc_dict

def extract_hyperlink_media(df: pd.DataFrame, col: str) -> tuple:
    """
    Given a dataframe and the name of the column
    containing hyperlinks, returns two arrays: the hyperlinks
    and the associated media types ('nan' where no link is found).
    """
    s = df[col]
    hyperlink = np.empty(len(s), dtype=object)
    media_type = np.empty(len(s), dtype=object)
    for i, q in enumerate(s):
        match = re.search('(http://(.+).)">(.+)$', q)
        if match is None:
            hyperlink[i] = 'nan'
            media_type[i] = 'nan'
        else:
            hyp_link = match.group(1).split('"')[0]
            hyperlink[i] = hyp_link
            media_type[i] = hyp_link.rpartition(".")[-1]
    return hyperlink, media_type

__Load Data__

In [3]:
raw_data = pd.read_csv("../data/jeopardy.csv")
raw_data.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [4]:
raw_data.shape

(216930, 7)

__Transform Data__

In [5]:
# clean column names
raw_data.columns = clean_column_names(raw_data.columns)
raw_data.dtypes

show_number     int64
air_date       object
round          object
category       object
value          object
question       object
answer         object
dtype: object

In [6]:
# convert `air_date` to datetime
raw_data.air_date = pd.to_datetime(raw_data.air_date)
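Once `air_date` is a datetime column, the year needed for the per-year split can be read off with the `.dt` accessor. A minimal sketch, using two dates taken from the sample rows (the variable names are illustrative):

```python
import pandas as pd

# two air dates as they appear in the sample rows
air_dates = pd.Series(["2004-12-31", "2006-05-11"])

parsed = pd.to_datetime(air_dates)   # object -> datetime64
years = parsed.dt.year               # integer year per row

print(years.tolist())  # [2004, 2006]
```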

In [7]:
# clean `value` column
raw_data.value = clean_value(raw_data.value.values)
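The same cleaning can probably be done without an explicit loop. The sketch below is an alternative under assumptions, not the helper used above: it strips `$` and `,` with a regular expression and lets `pd.to_numeric` coerce anything non-numeric (such as the string `None`) to NaN. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# illustrative raw values in the same format as the `value` column
raw_values = pd.Series(["$200", "$1,000", "None"])

# remove "$" and ",", then coerce non-numeric entries to NaN
cleaned = pd.to_numeric(
    raw_values.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(cleaned.tolist())
```

One trade-off is that `errors="coerce"` silently turns any unexpected string into NaN, whereas the loop-based helper would raise on it.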

In [8]:
raw_data["round"].value_counts()

Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!       3631
Tiebreaker               3
Name: round, dtype: int64

In [10]:
# encode `round` column (values are already clean)
round_dict = get_encoding_dict(raw_data, "round")
print(round_dict)
enc_data = raw_data.replace({"round": round_dict})
print(enc_data["round"].value_counts())

{'Jeopardy!': 0, 'Double Jeopardy!': 1, 'Final Jeopardy!': 2, 'Tiebreaker': 3}
0    107384
1    105912
2      3631
3         3
Name: round, dtype: int64
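The encoding above assigns integer codes in order of first appearance. For comparison, `pd.factorize` produces the same kind of codes directly; the sketch below uses a small made-up series rather than the full column.

```python
import pandas as pd

# small illustrative sample of round labels
rounds = pd.Series(["Jeopardy!", "Double Jeopardy!", "Jeopardy!", "Final Jeopardy!"])

# codes follow order of first appearance, like the dictionary built from unique()
codes, uniques = pd.factorize(rounds)
mapping = {label: code for code, label in enumerate(uniques)}

print(codes.tolist())  # [0, 1, 0, 2]
print(mapping)
```

Keeping an explicit dictionary, as the notebook does, has the advantage that the mapping can be printed and reused to decode the column later.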


In [11]:
# extract hyperlinks and the media types associated with hyperlink
hyperlink, media_type = extract_hyperlink_media(enc_data, "question")
enc_data["hyperlink"] = hyperlink
enc_data["media_type"] = media_type
enc_data.media_type.value_counts()

nan                           206407
jpg                             8280
wmv                             1224
mp3                             1017
mov                                1
com/media/2001-04-02_DJ_29         1
Name: media_type, dtype: int64
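To see where a value like `com/media/2001-04-02_DJ_29` comes from, it helps to run the same pattern on a few strings. The sketch below reuses the regular expression from `extract_hyperlink_media`; the clue texts and `example.com` URLs are hypothetical, and `extract` is an illustrative stand-in for the body of the helper's loop.

```python
import re

# same pattern as in extract_hyperlink_media
PATTERN = '(http://(.+).)">(.+)$'


def extract(question):
    """Return (hyperlink, media_type) for one question, or ('nan', 'nan')."""
    match = re.search(PATTERN, question)
    if match is None:
        return "nan", "nan"
    link = match.group(1).split('"')[0]
    # everything after the last "." is treated as the media type
    return link, link.rpartition(".")[-1]


# hypothetical clue texts
with_ext = '<a href="http://www.example.com/media/2004-12-31_DJ_23.mp3">This tune</a> was a hit'
without_ext = '<a href="http://www.example.com/media/2001-04-02_DJ_29">this</a> landmark'
plain = "The city of Yuma in this state has a record"

print(extract(with_ext))
print(extract(without_ext))
print(extract(plain))
```

When the URL has no file extension, the last `.` is the one in the domain name, so everything after it ends up in `media_type`.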

In [12]:
# one of the hyperlinks is broken (missing a .jpg extension). Fix the hyperlink and the associated media type
hlink = enc_data.loc[enc_data.media_type=="com/media/2001-04-02_DJ_29", "hyperlink"]
enc_data.loc[enc_data.media_type=="com/media/2001-04-02_DJ_29", "hyperlink"] = hlink + ".jpg"
enc_data.loc[enc_data.media_type=="com/media/2001-04-02_DJ_29", "media_type"] = "jpg"

# check the media_type value counts again
enc_data.media_type.value_counts()

nan    206407
jpg      8281
wmv      1224
mp3      1017
mov         1
Name: media_type, dtype: int64

In [13]:
# encode media_type column
media_type_dict = get_encoding_dict(enc_data, col = "media_type")
print(media_type_dict)
enc_data = enc_data.replace({"media_type": media_type_dict})

# check the media_type value counts again
enc_data.media_type.value_counts()

{'nan': 0, 'mp3': 1, 'jpg': 2, 'wmv': 3, 'mov': 4}


0    206407
2      8281
3      1224
1      1017
4         1
Name: media_type, dtype: int64

In [14]:
enc_data

Unnamed: 0,show_number,air_date,round,category,value,question,answer,hyperlink,media_type
0,4680,2004-12-31,0,HISTORY,200.0,"For the last 8 years of his life, Galileo was ...",Copernicus,,0
1,4680,2004-12-31,0,ESPN's TOP 10 ALL-TIME ATHLETES,200.0,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,,0
2,4680,2004-12-31,0,EVERYBODY TALKS ABOUT IT...,200.0,The city of Yuma in this state has a record av...,Arizona,,0
3,4680,2004-12-31,0,THE COMPANY LINE,200.0,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,,0
4,4680,2004-12-31,0,EPITAPHS & TRIBUTES,200.0,"Signer of the Dec. of Indep., framer of the Co...",John Adams,,0
...,...,...,...,...,...,...,...,...,...
216925,4999,2006-05-11,1,RIDDLE ME THIS,2000.0,This Puccini opera turns on the solution to 3 ...,Turandot,,0
216926,4999,2006-05-11,1,"""T"" BIRDS",2000.0,In North America this term is properly applied...,a titmouse,,0
216927,4999,2006-05-11,1,AUTHORS IN THEIR YOUTH,2000.0,"In Penny Lane, where this ""Hellraiser"" grew up...",Clive Barker,,0
216928,4999,2006-05-11,1,QUOTATIONS,2000.0,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo,,0
