Read the dataset and get some info about it.

In [12]:
import pandas as pd
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')
print(jeopardy.head())

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


In [13]:
print(jeopardy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [14]:
jeopardy.columns = jeopardy.columns.str.strip()

In [15]:
print(jeopardy.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Show Number  216930 non-null  int64 
 1   Air Date     216930 non-null  object
 2   Round        216930 non-null  object
 3   Category     216930 non-null  object
 4   Value        216930 non-null  object
 5   Question     216930 non-null  object
 6   Answer       216928 non-null  object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB
None


Clean the question, answer, value and date columns to analyze further.

In [16]:
import re
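def normalize_function(text):
    # Lowercase the text and strip punctuation so words compare cleanly.
    text = str(text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

def normalize_value(value):
    # Drop punctuation such as '$' and ',' and convert to int;
    # values that are not numeric become 0.
    value = re.sub(r'[^\w\s]', '', value)
    try:
        value = int(value)
    except ValueError:
        value = 0
    return value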

In [17]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_function)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_function)
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
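
As a quick sanity check, the two normalizers can be tried on a few sample strings. This is a minimal sketch: the functions are restated compactly so the block runs on its own, and the inputs "$2,000" and "None" are made-up samples, not values confirmed to be in the dataset.

```python
import re

def normalize_function(text):
    # Lowercase and strip everything that is not a word character or whitespace.
    text = str(text).lower()
    return re.sub(r'[^\w\s]', '', text)

def normalize_value(value):
    # Strip punctuation, then convert to int; fall back to 0 if not numeric.
    value = re.sub(r'[^\w\s]', '', value)
    try:
        return int(value)
    except ValueError:
        return 0

# Hypothetical sample inputs for illustration only.
samples = {
    "text": normalize_function("McDonald's"),
    "dollars": normalize_value("$200"),
    "thousands": normalize_value("$2,000"),
    "non_numeric": normalize_value("None"),
}
print(samples)
```

Note that any value that cannot be parsed ends up as 0, so it is counted as low value later on.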

In [18]:
import pandas as pd
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [19]:
print(jeopardy.head())

   Show Number   Air Date      Round                         Category Value  \
0         4680 2004-12-31  Jeopardy!                          HISTORY  $200   
1         4680 2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES  $200   
2         4680 2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...  $200   
3         4680 2004-12-31  Jeopardy!                 THE COMPANY LINE  $200   
4         4680 2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES  $200   

                                            Question      Answer  \
0  For the last 8 years of his life, Galileo was ...  Copernicus   
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe   
2  The city of Yuma in this state has a record av...     Arizona   
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's   
4  Signer of the Dec. of Indep., framer of the Co...  John Adams   

                                      clean_question clean_answer  clean_value  
0  for the last 8 years of his life

To decide whether to study past questions, study general knowledge, or not study at all, it helps to figure out two things:

How often the answer can be deduced from the question.
How often new questions repeat older ones.

You can answer the first question by seeing how many of the words in the answer also occur in the question. You can answer the second question by seeing how often complex words (6 or more characters) recur. We'll work on the first question and come back to the second.
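
To make the first measure concrete, here is a minimal sketch on a single made-up question and answer pair. The helper name answer_overlap and the toy strings are assumptions for illustration; the notebook's own version follows further down.

```python
def answer_overlap(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up, already normalized strings (not rows from the dataset).
toy_question = "this river is mentioned most often in the bible"
toy_answer = "the jordan river"
toy_score = answer_overlap(toy_question, toy_answer)
print(toy_score)
```

Here one of the two remaining answer words ("river") occurs in the question, so the score is 0.5.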

In [20]:
jeopardy['split_answer'] = jeopardy['clean_answer'].str.split()
jeopardy['split_question'] = jeopardy['clean_question'].str.split()

In [21]:
print(type(jeopardy['split_answer']))

<class 'pandas.core.series.Series'>


In [22]:
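def func(row):
    # Fraction of the answer's words that also appear in the question.
    match_count = 0
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    # Ignore one occurrence of 'the', which is too common to be informative.
    if 'the' in split_answer:
        split_answer.remove('the')
    # Avoid dividing by zero when nothing is left of the answer.
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1

    return match_count / len(split_answer)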

In [23]:
jeopardy['answer_in_question'] = jeopardy.apply(func, axis=1)

In [24]:
jeopardy['answer_in_question'].mean()

0.05792070323661065

On average, only about 6% of the words in an answer also appear in the question. We can't rely on the question text to reveal the answer.

Next, let's investigate how often new questions are repeats of older ones.

In [25]:
jeopardy = jeopardy.sort_values(by='Air Date')

In [26]:
print(jeopardy.head())

       Show Number   Air Date             Round            Category  Value  \
84523            1 1984-09-10         Jeopardy!      LAKES & RIVERS   $100   
84565            1 1984-09-10  Double Jeopardy!           THE BIBLE  $1000   
84566            1 1984-09-10  Double Jeopardy!            '50'S TV  $1000   
84567            1 1984-09-10  Double Jeopardy!  NATIONAL LANDMARKS  $1000   
84568            1 1984-09-10  Double Jeopardy!           NOTORIOUS  $1000   

                                                Question             Answer  \
84523            River mentioned most often in the Bible         the Jordan   
84565  According to 1st Timothy, it is the "root of a...  the love of money   
84566  Name under which experimenter Don Herbert taug...         Mr. Wizard   
84567    D.C. building shaken by November '83 bomb blast        the Capitol   
84568  After the deed, he leaped to the stage shoutin...  John Wilkes Booth   

                                          clean_question

In [27]:
print(jeopardy.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216930 entries, 84523 to 105930
Data columns (total 13 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   Show Number         216930 non-null  int64         
 1   Air Date            216930 non-null  datetime64[ns]
 2   Round               216930 non-null  object        
 3   Category            216930 non-null  object        
 4   Value               216930 non-null  object        
 5   Question            216930 non-null  object        
 6   Answer              216928 non-null  object        
 7   clean_question      216930 non-null  object        
 8   clean_answer        216930 non-null  object        
 9   clean_value         216930 non-null  int64         
 10  split_answer        216930 non-null  object        
 11  split_question      216930 non-null  object        
 12  answer_in_question  216930 non-null  float64       
dtypes: datetime64[ns](1), flo

In [28]:
print(jeopardy['split_question'].head())

84523      [river, mentioned, most, often, in, the, bible]
84565    [according, to, 1st, timothy, it, is, the, roo...
84566    [name, under, which, experimenter, don, herber...
84567    [dc, building, shaken, by, november, 83, bomb,...
84568    [after, the, deed, he, leaped, to, the, stage,...
Name: split_question, dtype: object


In [32]:
def check_question(col):
    terms_used = set()
    match_counter = 0
    for index, cell in enumerate(col):
        for i in cell:
            if len(i) < 6:
                cell.remove(i)
            if i in terms_used:
                match_counter += 1
            else:
                terms_used.add(i)
        if index % 10000 == 0:
            print(match_counter)
    return (match_counter/len(terms_used))
            

In [33]:
print(check_question(jeopardy['split_question']))

0
39019
87476
138083
189424
239526
288869
337850
388409
439243
493262
546869
602052
657416
714890
772202
829654
887413
944223
1000548
1057069
1112823
11.889935808788623


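The function above has several problems: it removes items from each list while iterating over it, which makes the loop skip words; it still counts words shorter than 6 characters; and it divides the number of repeats by the number of unique terms, so the result (about 11.89) is not a proportion. Use another method that filters out the short words first and computes, for each question, the fraction of its words already seen in earlier questions.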
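
The pitfall in check_question of calling remove() on a list while looping over it can be reproduced on a toy list. This is a small sketch with made-up words; MIN_LENGTH is a name introduced here for illustration.

```python
MIN_LENGTH = 6
words = ['a', 'bb', 'river', 'the', 'mentioned']  # made-up toy input

# Removing items while iterating shifts the remaining items left,
# so the element right after each removed one is never visited.
mutated = list(words)
for word in mutated:
    if len(word) < MIN_LENGTH:
        mutated.remove(word)

# Building a new list visits every element exactly once.
filtered = [word for word in words if len(word) >= MIN_LENGTH]

print(mutated)
print(filtered)
```

The mutated list still contains short words that the loop skipped, while the list comprehension keeps only the long ones.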

In [35]:
question_overlap = []
terms_used = set()

for i, row in jeopardy.iterrows():
    # Keep only words of 6 or more characters.
    split_qion = [q for q in row['split_question'] if len(q) > 5]
    match_count = 0
    # Count the words already seen in earlier questions.
    for word in split_qion:
        if word in terms_used:
            match_count += 1
    for word in split_qion:
        terms_used.add(word)
    # Turn the count into a fraction of this question's words.
    if len(split_qion) > 0:
        match_count /= len(split_qion)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8721766377742689

# Find the low value and high value questions.

Let's say you only want to study terms that show up in high-value questions rather than low-value ones. This will help you earn more money when you're on Jeopardy.

You can actually figure out which terms correspond to high-value questions using a chi-squared test. You'll first need to narrow down the questions into two categories:

Low value -- Any row where Value is less than 800.
High value -- Any row where Value is greater than 800.
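
A minimal sketch, assuming the strict "greater than 800" cutoff from the list above: deriving both group sizes from a single flag guarantees that a row worth exactly 800 lands in one group only. The toy frame below is made up; only the column names clean_value and high_value come from this notebook.

```python
import pandas as pd

# Made-up values standing in for the real clean_value column.
toy = pd.DataFrame({'clean_value': [200, 800, 1000, 0, 2000]})

# One flag defines the split; everything that is not high value is low value.
toy['high_value'] = (toy['clean_value'] > 800).astype(int)

high_value_count = int(toy['high_value'].sum())
low_value_count = len(toy) - high_value_count
print(high_value_count, low_value_count)
```

With this approach the two counts always add up to the number of rows, which is what the expected-count calculation later relies on.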

In [36]:
low_value = jeopardy[jeopardy['clean_value'] < 800]
high_value = jeopardy[jeopardy['clean_value'] >= 800]

In [37]:
def value_check(row):
    if row['clean_value'] > 800:
        value = 1
    else:
        value = 0
    return value

In [39]:
jeopardy['high_value'] = jeopardy.apply(value_check, axis=1)

In [40]:
def count(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if word in row['split_question']:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count
            

In [46]:
count('crisper')

(0, 2)

In [47]:
import random
comparison_terms = random.choices(list(terms_used), k=10)
print(comparison_terms)
observed_expected = []
for i in comparison_terms:
    observed_expected.append(count(i))
print(observed_expected)

['tyrannis', 'ingvar', 'grammarian', 'hepburns', 'storsjon', 'riders', 'refectory', 'rokeby', 'hrefhttpwwwjarchivecommedia20100506_j_16jpg', 'hrefhttpwwwjarchivecommedia20071115_j_28jpg']
[(2, 4), (3, 2), (0, 1), (3, 7), (0, 1), (12, 34), (0, 1), (1, 0), (0, 1), (0, 1)]


Now that you've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.
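
Before running the real computation, the arithmetic can be sketched by hand for one term. The group sizes and observed counts below are made-up toy numbers, and the function name is hypothetical. With two categories there is one degree of freedom, in which case the p-value equals erfc(sqrt(statistic / 2)).

```python
import math

def chi_squared_two_groups(observed_high, observed_low, n_high, n_low):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total = observed_high + observed_low
    # Expected counts if the term were spread in proportion to group size.
    expected_high = total * n_high / (n_high + n_low)
    expected_low = total * n_low / (n_high + n_low)
    statistic = ((observed_high - expected_high) ** 2 / expected_high
                 + (observed_low - expected_low) ** 2 / expected_low)
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Toy numbers: 40 high-value rows, 60 low-value rows, and a term seen
# 12 times in high-value rows and 34 times in low-value rows.
statistic, p_value = chi_squared_two_groups(12, 34, n_high=40, n_low=60)
print(statistic, p_value)
```

The cell below does the same thing with scipy.stats.chisquare, using the real group sizes.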

In [48]:
import numpy as np
from scipy.stats import chisquare
high_value_count = len(high_value.axes[0])
low_value_count = len(low_value.axes[0])
chi_squared = []
for i in observed_expected:
    total = i[0] + i[1]
    total_prop = total / len(jeopardy.axes[0])
    term_count_high = total_prop * high_value_count
    term_count_low = total_prop * low_value_count
    observed = np.array(i)
    expected = np.array([term_count_high, term_count_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

[Power_divergenceResult(statistic=0.22879472296719727, pvalue=0.6324189668574394),
 Power_divergenceResult(statistic=0.5894848824462151, pvalue=0.44261838063075143),
 Power_divergenceResult(statistic=0.7544157608695651, pvalue=0.38508176583769604),
 Power_divergenceResult(statistic=0.6896133651361509, pvalue=0.4062959206925859),
 Power_divergenceResult(statistic=0.7544157608695651, pvalue=0.38508176583769604),
 Power_divergenceResult(statistic=5.369147857940487, pvalue=0.02049598824755676),
 Power_divergenceResult(statistic=0.7544157608695651, pvalue=0.38508176583769604),
 Power_divergenceResult(statistic=1.325529040972535, pvalue=0.24960216618620146),
 Power_divergenceResult(statistic=0.7544157608695651, pvalue=0.38508176583769604),
 Power_divergenceResult(statistic=0.7544157608695651, pvalue=0.38508176583769604)]

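Only one of the ten sampled terms, 'riders' (12 high-value and 34 low-value rows), has a p-value below 0.05 (about 0.02), and it appears in low-value questions more often than expected. With ten tests, a single result like this could easily be chance. The other nine terms show no significant difference, and most of them occur fewer than 5 times, so the chi-squared test isn't reliable for them. Note also that the group totals use a cutoff of >= 800 while value_check uses > 800, so rows worth exactly 800 are treated inconsistently. It would be better to make the cutoff consistent and run this test with only terms that have higher frequencies.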
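
One way to restrict the test to more frequent terms is to count every term in a single pass and keep only those above a threshold. This is a sketch under stated assumptions: the rows are made-up stand-ins for (split_question, high_value) pairs, and MIN_TOTAL is a hypothetical threshold set low so that the toy data has survivors.

```python
from collections import Counter

# Made-up (words, high_value flag) pairs; not rows from the dataset.
rows = [
    (['river', 'mentioned', 'bible'], 0),
    (['river', 'capital', 'country'], 1),
    (['capital', 'country', 'president'], 1),
    (['river', 'country'], 0),
    (['president', 'bible'], 0),
]
MIN_TOTAL = 3  # toy threshold; a real run would need a larger one

high_counts = Counter()
low_counts = Counter()
for words, is_high in rows:
    # set() counts each row once per word, like `word in row[...]` does.
    for word in set(words):
        if is_high:
            high_counts[word] += 1
        else:
            low_counts[word] += 1

# Keep (high, low) observed counts only for sufficiently frequent terms.
frequent_terms = {
    word: (high_counts[word], low_counts[word])
    for word in sorted(set(high_counts) | set(low_counts))
    if high_counts[word] + low_counts[word] >= MIN_TOTAL
}
print(frequent_terms)
```

A single pass like this also avoids scanning the whole frame once per term, which is what calling count() for each word does.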