# Analying Jeopardy (the TV show) Dataset
In this analysis, the historical questions/answers from Jeopardy are to be analyzed to determine if any particular patterns existed.

In [1]:
# import modules
import pandas as pd
import string
import datetime

In [2]:
# read in data
jeopardy = pd.read_csv('jeopardy.csv')
print(jeopardy.head(5))
print(jeopardy.columns)

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype

In [3]:
# notice that some column names contain spaces; clean the column names...
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [4]:
# check
print(jeopardy.columns)
print(jeopardy.head(5))

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
   Show Number    Air Date      Round                         Category Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [5]:
# before analysis can be performed, we need to clean/normalize the current data

# create a function to perform normalization text
def normalize_text(data):
    # build the punctuation table for translation
    punc_table = str.maketrans('', '', string.punctuation)
    # remove the punctuations
    new_data = data.lower().translate(punc_table)
    return new_data

# create a function to normalize the 'Value' data into numerics
def normalize_value(data):
    try:
        new_data = normalize_text(data)
        new_data = int(new_data)
    except:
        new_data = 0
    return new_data

In [6]:
# normalize 'Question', 'Answer', and 'Value'
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)

In [7]:
# normalize 'Air Date' into datetime format
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

## How often the answer is deducible from the question?
To analyze this question, we measure how many times word in the answer also occur in its corresponding question.

In [8]:
# define a function that calculate, for each question, what percentage
# of the words in its answer also appear in its question
def analyze1(row):
    split_answer = row['clean_answer'].split(' ')
    split_question = row['clean_question'].split(' ')
    split_answer_dropped = [each for each in split_answer if each != 'the' and each != '']
    match_count = 0
    if len(split_answer_dropped) == 0:
        return 0
    else:
        for each_answer in split_answer_dropped:
            if each_answer in split_question:
                match_count += 1
        result = match_count / len(split_answer_dropped)
    return result

In [9]:
# apply the above function across all rows
jeopardy['answer_in_question'] = jeopardy.apply(analyze1, axis=1)

In [10]:
# find the average for 'answer_in_question'
print(jeopardy['answer_in_question'].mean())

# find the number of rows where 'answer_in_question' > 0
print(jeopardy[jeopardy['answer_in_question'] > 0].shape[0])

0.0582069615746
2485


## Findings So Far
Out of the 20000 historical questions, 2485 (~12.4%) of them had words in their answers also appear in their questions.

However, most of the answers contain multiple words, and thus, on average, approximately 5.82% of all the words in answers appear in the questions.

Even though the probabilites are rather low, when desperate, it may be useful to look for answers that have the keywords from the questions.
___

## How often are the new questions repeats of past questions?

In [11]:
# first sort by 'Air Date' in ascending order
jeopardy.sort_values(by='Air Date',inplace=True,ascending=True)

# check sorted dataframe, confirm that last row has a later date than first row
print(jeopardy.head(1))
print(jeopardy.tail(1))

       Show Number   Air Date            Round         Category Value  \
19325           10 1984-09-21  Final Jeopardy!  U.S. PRESIDENTS  None   

                                                Question              Answer  \
19325  Adventurous 26th president, he was 1st to ride...  Theodore Roosevelt   

                                          clean_question        clean_answer  \
19325  adventurous 26th president he was 1st to ride ...  theodore roosevelt   

       clean_value  answer_in_question  
19325            0                 0.0  
      Show Number   Air Date      Round         Category Value  \
1922         6294 2012-01-19  Jeopardy!  THAT'S BUSINESS  $400   

                                               Question   Answer  \
1922  In 1997 Tyco International moved to this U.K. ...  Bermuda   

                                         clean_question clean_answer  \
1922  in 1997 tyco international moved to this uk te...      bermuda   

      clean_value  answer_in_quest

In [12]:
# for each row, calculate the percentage of complex words in questions that have
  # appeared before
question_overlap = []
terms_used = set()
for idx, row in jeopardy.iterrows():
    split_question = row['clean_question'].split(' ')
    split_question = [each for each in split_question if len(each) >= 6]
    match_count = 0
    for each in split_question:
        if each in terms_used:
            match_count += 1
        terms_used.add(each)
    if len(split_question) > 0:
        result = match_count / len(split_question)
    question_overlap.append(result)
jeopardy['question_overlap'] = question_overlap

In [13]:
# calculate the mean for 'question_overlap'
jeopardy['question_overlap'].mean()

0.70327187421603465

## Findings So Far
On average, 70% of the complex words (those with at least 6 characters) have appeared at one point or another in the history of Jeopardy.

This is interesting as it indicates that it is likely complex words in future questions may have already appeared in the question sets before, and thus contestants may use the historical questions as references to prepare in advance.

Nonetheless, it may be useful to further identify the natures of these complex words.
___

## Which words are associated with high / low values?

In [14]:
# define a function to determine whether a row is of high value (over $800)
def value(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [15]:
# apply the above function across the dataframe
jeopardy['high_value'] = jeopardy.apply(value, axis=1)

In [16]:
# define a function that determine the frequencies of a word appearing in
# high and low value questions
def word_value(word):
    low_count = 0
    high_count = 0
    for idx, row in jeopardy.iterrows():
        splitted = row['clean_question'].split(' ')
        if word in splitted:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [17]:
# sample 5 words from the previously generated 'terms_used'
comparison_terms = list(terms_used)[0:5]
comparison_terms

['blancmange', 'pacific', 'dumber', 'michaelmas', 'riyadh']

In [18]:
# generate a list of observed frequencies in both high/low value questions
# for those 5 words
observed_expected = []
for each in comparison_terms:
    observed_expected.append(word_value(each))
print(observed_expected)

[(1, 0), (9, 28), (0, 1), (0, 1), (0, 1)]


## Computing expected counts and chi-squared value for the sampled words

In [19]:
# calculate the number of questions that are either of high value or low value
high_value_count = jeopardy[jeopardy['high_value'] == 1].shape[0]
low_value_count = jeopardy[jeopardy['high_value'] == 0].shape[0]
print(high_value_count, low_value_count)

5734 14265


In [20]:
# for each of the 5 sampled words, calculate their chi-squared / p-value
# based on their observed frequencies (calculated previously) and their
# expected frequencies (to be calculated below)
from scipy.stats import chisquare
chi_squared = []
for each in observed_expected:
    total = sum(each)
    total_prop = total / jeopardy.shape[0]
    expected_high_count = total_prop * high_value_count
    expected_low_count = total_prop * low_value_count
    chi2_value, p_value = chisquare(each, (expected_high_count, expected_low_count))
    chi_squared.append((chi2_value, p_value))

In [21]:
# display the chi-squared values and p-values
chi_squared

[(2.4877921171956752, 0.11473257634454047),
 (0.34189277990072214, 0.55873870578406826),
 (0.40196284612688399, 0.52607729857054686),
 (0.40196284612688399, 0.52607729857054686),
 (0.40196284612688399, 0.52607729857054686)]

## Findings So Far
First of all, the sampled frequencies were rather low, ranging from 0 occurences to 2 occurences across both high/low value questions. Therefore, the statistical outcomes would be questionable at best.

Secondly, taking the chi-squared statistics at face values. None of the outcomes exhibited significant differences. In other words, there were no differences in their likliehood of appearances in either high-value questions or low-value questions.
___