# Winning Jeopardy

Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture.

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, I'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help one win.

The dataset is named jeopardy.csv, and contains about 20,000 rows (19,999 after loading) from the beginning of a full dataset of Jeopardy questions, which can be downloaded [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

In [26]:
import pandas as pd
from scipy.stats import chisquare
import numpy as np
import re
from random import choice

## Reading the Data and First Inspections

In [27]:
df = pd.read_csv('jeopardy.csv')
df.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [28]:
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Some columns have spaces in front of them. These need to be removed before continuing the analysis:

In [29]:
# strip leading and trailing whitespace from the column names
df.columns = df.columns.str.strip()
df.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1   Air Date     19999 non-null  object
 2   Round        19999 non-null  object
 3   Category     19999 non-null  object
 4   Value        19999 non-null  object
 5   Question     19999 non-null  object
 6   Answer       19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


The `Air Date` column contains the date on which the question was asked, but it is stored as text. It therefore needs to be converted to a datetime type.

In [31]:
df['Air Date'] = pd.to_datetime(df['Air Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Show Number  19999 non-null  int64         
 1   Air Date     19999 non-null  datetime64[ns]
 2   Round        19999 non-null  object        
 3   Category     19999 non-null  object        
 4   Value        19999 non-null  object        
 5   Question     19999 non-null  object        
 6   Answer       19999 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


## Normalizing Columns

To make the `Question`, `Answer` and `Value` columns easier to analyze, they need to be cleaned up. The cleaned-up versions will be stored in new columns at the end of the dataframe:

- `Question` becomes `clean_question`
- `Answer` becomes `clean_answer`
- `Value` becomes `clean_value`

`Question` and `Answer` will be made lowercase and the punctuation will be removed.
For `Value`, the `$` sign and other punctuation are removed and the result is converted to an integer. This column also contains some `None` entries that cannot be converted; these are set to zero.
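To make the intended clean-up concrete, here is a minimal sketch that applies the same two regular expressions to a few sample strings. The helper names `clean_text` and `parse_value` and the sample inputs are illustrative assumptions, not part of the notebook.

```python
import re

def clean_text(text):
    """Lowercase the text and drop everything except word characters and whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower())

def parse_value(text):
    """Strip punctuation such as '$' and ',' and parse an int; non-numeric text becomes 0."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        return int(cleaned)
    except ValueError:
        # anything that is not a plain number counts as 0
        return 0

# hypothetical sample inputs
cleaned_texts = [clean_text(s) for s in ["McDonald's", "No. 2: 1912 Olympian"]]
parsed_values = [parse_value(s) for s in ["$200", "$2,000", "None"]]
print(cleaned_texts)
print(parsed_values)
```

Keeping the text and value clean-up in separate helpers means the fallback to zero only applies where a number is expected.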

In [32]:
df = pd.read_csv('jeopardy.csv') # load data
df.columns = df.columns.str.strip() # clean column names
df['Air Date'] = pd.to_datetime(df['Air Date']) # convert Air Date to datetime-obj

def normalize(string):
    '''
    take in string and:
    - convert to lowercase
    - remove punctuation
    - return cleaned string
    '''
    string = string.lower() # lowercase
    string = re.sub(r'[^\w\s]', '', string) # drop everything except word characters and whitespace
    return string

def normalize_values(text):
    '''
    extra function for the Value column:
    - remove '$' signs and thousands separators
    - convert to int, falling back to 0 for non-numeric entries such as 'None'
    '''
    text = re.sub(r"[^A-Za-z0-9\s]", "", text) # raw string avoids an invalid-escape warning for \s
    try:
        text = int(text)
    except ValueError: # raised when the remaining text is not a number
        text = 0
    return text

# apply normalize func to Question column and add result to clean_question column
df['clean_question'] = df['Question'].apply(normalize) 

# apply normalize func to Answer column and add result to clean_answer column
df['clean_answer'] = df['Answer'].apply(normalize)

# apply normalize func to Value column and add result to clean_value column
df['clean_value'] = df['Value'].apply(normalize_values)

df['clean_question'] 

0        for the last 8 years of his life galileo was u...
1        no 2 1912 olympian football star at carlisle i...
2        the city of yuma in this state has a record av...
3        in 1963 live on the art linkletter show this c...
4        signer of the dec of indep framer of the const...
                               ...                        
19994    of 8 12 or 18 the number of us states that tou...
19995                             the new power generation
19996    in 1589 he was appointed professor of mathemat...
19997    before the grand jury she said im really sorry...
19998    llamas are the heftiest south american members...
Name: clean_question, Length: 19999, dtype: object

In [33]:
df['clean_answer']

0             copernicus
1             jim thorpe
2                arizona
3              mcdonalds
4             john adams
              ...       
19994                 18
19995             prince
19996            galileo
19997    monica lewinsky
19998             camels
Name: clean_answer, Length: 19999, dtype: object

Visual inspection of the data checks out!

In [34]:
df['clean_value'].value_counts()

400     3892
800     2980
200     2784
1000    1980
600     1890
        ... 
4100       1
2021       1
3300       1
2900       1
6100       1
Name: clean_value, Length: 72, dtype: int64

No `$` signs remain and all values are integers, so the conversion of `Value` checks out as well.

# Finding Patterns in the Data

Now I will check some expected trends in the data. These are:

- How often does the question contain the answer?
- How often do questions repeat themselves?
- Which terms are associated with high value questions?

## How big is the Chance of the Question containing the Answer?

I will start by determining the chance that the question already contains the answer. For this I will build a function `count_matches`, which takes a row of df as an argument and returns the fraction of words in the answer that also occur in the question. For this purpose I am removing 'the' from the answer, as it is irrelevant and would distort the result.

In [35]:
def count_matches(row):
    '''
    takes a row of the dataframe
    counts the words of the (clean) answer that also occur in the (clean) question
    returns the matches as a fraction of the answer length
    '''
    
    # split clean answer and question columns:
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    match_count = 0
    
    # drop 'the' from the answer words, since it would match almost any question
    # (list.remove only drops the first occurrence):
    if 'the' in split_answer:
        split_answer.remove("the")
    
    # make sure no division by zero later on
    if len(split_answer) == 0:
        return 0
    
    # finally count matches between split answer and question:
    for item in split_answer:
        if item in split_question:
            match_count += 1
            
    return match_count / len(split_answer)

df["answer_in_question"] = df.apply(count_matches, axis=1)
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show this c...,mcdonalds,200,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the const...,john adams,200,0.0


In [36]:
df['answer_in_question'].mean()

0.059001965249777744

On average, only about 6% of the words in an answer also appear in the question. We can conclude that the chance of reading the answer off the question is pretty low!
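To check this measure by hand on a single pair, a minimal standalone sketch follows. The function name `answer_overlap` and the example strings are hypothetical, and unlike `count_matches` this variant drops every occurrence of 'the' from the answer rather than only the first one.

```python
def answer_overlap(clean_answer, clean_question):
    """Fraction of answer words (ignoring 'the') that also occur in the question."""
    answer_words = [w for w in clean_answer.split() if w != "the"]
    if not answer_words:
        # nothing left to compare, avoid dividing by zero
        return 0.0
    question_words = set(clean_question.split())  # set gives fast membership tests
    hits = sum(1 for w in answer_words if w in question_words)
    return hits / len(answer_words)

# hypothetical example: one of the two remaining answer words is in the question
example = answer_overlap("the eiffel tower", "this tower in paris opened in 1889")
print(example)
```

A value of 1.0 would mean every answer word is given away by the question, so the column mean is an average share of matching words, not a share of questions.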

## How often are Questions Repeats of older ones?

Now let's take a look at the chance that questions are repeated. To do this, I will use the following workflow:

- Sort df in order of ascending `Air Date`.
- Maintain a set called `terms_used` that will be empty initially.
- Iterate through each row of df.
- Split `clean_question` into words, remove any word shorter than 6 characters, and check if each word occurs in `terms_used`.
    - If it does, increment a counter.
    - Add each word to `terms_used`.

In [42]:
questions_overlap = [] # list to be filled with the overlap fraction of each question
terms_used = set() # empty set

df = df.sort_values(by='Air Date') # sort dataframe by date

for i, row in df.iterrows():
    split_question = row['clean_question'].split(' ') # split each row of clean_questions into list of strings
    split_question = [q for q in split_question if len(q) > 5] # remove words with less than 6 characters
    match_count = 0 # counting the no of words that match each other
    
    for word in split_question: # iterate through each word 
        if word in terms_used: # if word already used, match_count + 1
            match_count += 1
    for word in split_question: # add each word in split_question to terms_used
        terms_used.add(word)
    if len(split_question) > 0: # if split_questions contains a value, take average of match_count
        match_count /= len(split_question)
    
    questions_overlap.append(match_count) # append match_count for each row to questions_overlap
    
df['questions_overlap'] = questions_overlap # append questions overlap to dataframe

# print mean of questions_overlap
print(df['questions_overlap'].mean())

0.6877209804150494


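On average, about 69% of the longer words (six or more characters) in a question had already appeared in an earlier question. This measures the overlap of single terms rather than of whole questions, so it only hints that questions are being recycled. Still, it feels like a good strategy to look at old questions.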
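The same bookkeeping can be tried on a handful of sentences to see what the score does and does not measure. This is a minimal sketch with a hypothetical function name `overlap_fractions` and made-up toy questions; it assumes the questions are already cleaned and sorted from oldest to newest.

```python
def overlap_fractions(questions, min_length=6):
    """For each question, return the share of its long words seen in earlier questions."""
    terms_seen = set()
    fractions = []
    for question in questions:
        # keep only words with at least min_length characters
        words = [w for w in question.split() if len(w) >= min_length]
        if words:
            fractions.append(sum(w in terms_seen for w in words) / len(words))
        else:
            fractions.append(0)
        # remember this question's words only after scoring it
        terms_seen.update(words)
    return fractions

# toy questions (hypothetical), oldest first
toy = [
    "galileo studied planets",
    "planets orbit around stars",
    "galileo observed planets nightly",
]
toy_fractions = overlap_fractions(toy)
print(toy_fractions)
```

The third toy question scores 0.5 although it is not a repeat of either earlier question, which is why a high average overlap points to recurring topics rather than identical questions.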

## Determining the Value of Questions

Let's say you only want to study questions that are worth a lot of money instead of low value questions. This will help you earn more money when you're on Jeopardy.

One can actually figure out which terms correspond to high-value questions using a chi-squared test. I'll first need to split the questions into two categories:

- Low value -- Any row where `clean_value` is 800 or less.
- High value -- Any row where `clean_value` is greater than 800.

Again, I will loop through each of the terms in `terms_used` and:

- Find the number of low value questions the word occurs in.
- Find the number of high value questions the word occurs in.
- Find the percentage of questions the word occurs in.
- Based on the percentage of questions the word occurs in, find expected counts.
- Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

Afterwards I will be able to find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so I'll just do it for a small sample now.
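The expected-count and chi-squared steps listed above can be checked by hand for a single term. The sketch below uses plain arithmetic instead of `scipy`, takes the class sizes this notebook reports (5,734 high value and 14,265 low value rows), and assumes a term observed in 3 high value and 1 low value questions. The p-value line relies on the fact that a chi-squared variable with one degree of freedom is a squared standard normal.

```python
import math

n_high, n_low = 5734, 14265      # high and low value row counts in this dataset
n_total = n_high + n_low

high_obs, low_obs = 3, 1         # assumed observed counts for one term

# share of all questions that contain the term
term_share = (high_obs + low_obs) / n_total

# expected counts if the term were spread evenly over both groups
high_exp = term_share * n_high
low_exp = term_share * n_low

# chi-squared statistic: sum of (observed - expected)^2 / expected
chi2 = (high_obs - high_exp) ** 2 / high_exp + (low_obs - low_exp) ** 2 / low_exp

# p-value for one degree of freedom (two categories minus one)
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(high_exp, 3), round(low_exp, 3))
print(round(chi2, 3), round(p_value, 4))
```

Both expected counts are far below 5 here, which is the usual rule-of-thumb minimum for trusting the chi-squared approximation.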

In [38]:
def determine_value(row):
    '''
    Function that returns 1 for high value questions and 0 for low value ones.
    It checks if clean_value is greater than 800 and:
    returns 1 for yes and 0 for no
    accepts row of dataframe
    '''
    if row['clean_value'] > 800:
        return 1
    else:
        return 0
    
df['high_value'] = df.apply(determine_value, axis = 1)
df['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In the code cell below I randomly choose 10 words from `terms_used` and count how many high value and low value questions each one occurs in. These observed counts, stored as (high, low) pairs in `observed_expected`, are the input for the chi-squared test. Because the words are picked at random, the results will change each time the cell is run.

In [39]:
def count_usage(term):
    '''
    function that takes in a word and counts how many times it is found within high and low value columns.
    returns how many times the given word has been high and low counted
    '''
    low_count = 0
    high_count = 0
    
    for i, row in df.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

# randomly pick ten elements of terms_used and append them to list comparison_terms:
terms_used_list = list(terms_used) # transform terms_used from set to list
comparison_terms = [choice(terms_used_list) for _ in range(10)] # randomly pick ten words (with replacement)
print(comparison_terms)

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))
observed_expected

['elburz', 'posters', 'shipwrecks', 'rocking', 'labrador', 'alamos', 'removal', 'robber', 'percent', 'morial']


[(0, 1),
 (0, 1),
 (0, 1),
 (1, 1),
 (0, 1),
 (0, 1),
 (0, 3),
 (3, 1),
 (1, 3),
 (0, 1)]

Now I can compute the expected counts and the chi-squared value for each term:

In [40]:
df['high_value'].value_counts()

0    14265
1     5734
Name: high_value, dtype: int64

In [41]:
high_value_count = df['high_value'].value_counts()[1] # number of high value rows (clean_value > 800)
low_value_count = df['high_value'].value_counts()[0] # number of low value rows (clean_value <= 800)

print('number of high value questions (value > 800): ' + str(high_value_count))
print('number of low value questions (value <= 800): ' + str(low_value_count))

# chi-squared test for each sampled term
chi_squared = []

for high_obs, low_obs in observed_expected:
    total = high_obs + low_obs # number of questions containing the term
    total_prop = total / len(df) # share of all questions containing the term
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([high_obs, low_obs])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared

number of high value questions (value > 800): 5734
number of low value questions (value <= 800): 14265


[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.4448774816612795, pvalue=0.5047776487545996),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=1.205888538380652, pvalue=0.27214791766901714),
 Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

Only one of the ten terms ('robber', p ≈ 0.04) shows a difference in usage between high value and low value rows that is significant at the 0.05 level. However, the frequencies were all lower than 5, so the chi-squared test isn't reliable here and even that result should not be trusted. It would be better to run this test with only terms that have higher frequencies.
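One possible way to restrict the test to more frequent terms is sketched below. It counts all terms in a single pass instead of scanning the dataframe once per term, then keeps only terms whose expected count is at least 5 in both groups. The function names, the `min_expected` threshold and the toy rows are assumptions for illustration; in the notebook the rows would come from the `clean_question` and `high_value` columns.

```python
from collections import Counter

def term_counts(rows):
    """Count how many high and low value questions each term occurs in.

    rows: iterable of (clean_question, high_value) pairs, high_value being 1 or 0.
    """
    high, low = Counter(), Counter()
    for question, high_value in rows:
        target = high if high_value == 1 else low
        target.update(set(question.split()))  # count each term once per question
    return high, low

def frequent_terms(high, low, n_high, n_low, min_expected=5):
    """Return the terms whose expected count in both groups is at least min_expected."""
    n_total = n_high + n_low
    kept = []
    for term in set(high) | set(low):
        total = high[term] + low[term]
        smallest_expected = min(total * n_high / n_total, total * n_low / n_total)
        if smallest_expected >= min_expected:
            kept.append(term)
    return sorted(kept)

# toy rows (hypothetical): 10 high value and 10 low value questions
rows = [("river delta", 1)] * 10 + [("river rare", 0)] + [("river bank", 0)] * 9
high, low = term_counts(rows)
kept_terms = frequent_terms(high, low, n_high=10, n_low=10)
print(kept_terms)
```

The surviving terms could then be passed to `chisquare` exactly as in the cell above, with expected counts large enough for the test to be meaningful.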

# Overall Conclusion:

In this project I looked at three major trends in the Jeopardy questions sample dataset. I was able to derive the following conclusions:

- The chance of finding the answer in the question is pretty low. On average, only about 6% of the words in an answer also appear in the question.
- On average, 69% of the longer words in a question had already been used in older questions. This suggests that topics recur, so it feels like a good strategy to at least look at old questions.
- The chi-squared test to figure out which terms correspond to high-value questions wasn't reliable, as the frequencies of the sampled terms were all lower than 5. It should be repeated with more frequent terms before drawing any conclusion.