![](https://upload.wikimedia.org/wikipedia/commons/c/ce/Jeopardy%21_logo.png)

"Jeopardy!" is a popular American TV game show created by Merv Griffin. You can find further information about the show [here](https://en.wikipedia.org/wiki/Jeopardy!).

In this project, we will assume that we want to compete on the show, and we will look for insights in a dataset of "Jeopardy!" questions that could improve our chances of winning.

You can find information about the data and its columns [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/). It is available in both JSON and CSV formats, with one row per question. Let's import and explore the data.

# 1. Reading in the data

In [1]:
import pandas as pd
import numpy as np
import re
import random
from scipy.stats import chisquare

from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

In [2]:
jeopardy = pd.read_csv("jeopardy.csv")
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Show Number  19999 non-null  int64 
 1    Air Date    19999 non-null  object
 2    Round       19999 non-null  object
 3    Category    19999 non-null  object
 4    Value       19999 non-null  object
 5    Question    19999 non-null  object
 6    Answer      19999 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


As we can see, our dataset includes 19,999 questions with no missing data. However, several column names begin with a space, which is why they look indented in the output above. We will fix this by calling strip() on each name.

In [4]:
jeopardy.columns = [column.strip() for column in jeopardy.columns]
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

Now that we are done with this, let's further explore and see if the data requires some cleaning.

# 2. Exploring and cleaning the data

As we saw above, the "Air Date" column contains dates but is stored as strings (object dtype) rather than as datetime. Let's fix this first.

In [5]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])
jeopardy.head()
jeopardy.info()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Show Number  19999 non-null  int64         
 1   Air Date     19999 non-null  datetime64[ns]
 2   Round        19999 non-null  object        
 3   Category     19999 non-null  object        
 4   Value        19999 non-null  object        
 5   Question     19999 non-null  object        
 6   Answer       19999 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 1.1+ MB


Now that we are done with converting that column, we will take a look at other columns.

In [6]:
jeopardy.Round.value_counts(dropna=False)

Jeopardy!           9901
Double Jeopardy!    9762
Final Jeopardy!      335
Tiebreaker             1
Name: Round, dtype: int64

In [7]:
jeopardy.Category.value_counts(dropna=False)

TELEVISION                51
U.S. GEOGRAPHY            50
LITERATURE                45
AMERICAN HISTORY          40
BEFORE & AFTER            40
                          ..
SISTER CITIES              1
THEATRE                    1
20th CENTURY NOVELISTS     1
SPORTS & THE MOVIES        1
HISTORIC HEADLINES         1
Name: Category, Length: 3581, dtype: int64

In [8]:
jeopardy.Value.head(10)

0    $200
1    $200
2    $200
3    $200
4    $200
5    $200
6    $400
7    $400
8    $400
9    $400
Name: Value, dtype: object

The Value column needs some cleaning. We will strip the dollar signs (and any other punctuation) and convert the column to a numeric type. Questions whose value is recorded as "None" will be assigned 0.

In [9]:
def convert_value(value):
    value = re.sub(r"([^\w\s]*)", "", value)
    
    if value == "None":
        return 0
    else:
        return int(value)

jeopardy.Value = jeopardy.Value.apply(convert_value)
jeopardy.Value = jeopardy.Value.astype(int)
jeopardy.Value.head(10)

0    200
1    200
2    200
3    200
4    200
5    200
6    400
7    400
8    400
9    400
Name: Value, dtype: int32
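
As a side note, the same cleanup can be sketched without a per-row Python function by using pandas string methods. This is only an illustration on a toy Series: the sample values (including the thousands separator) are assumptions about what the raw column may contain, and the names raw_values and clean_values are hypothetical.

```python
import pandas as pd

# Toy values shaped like the raw "Value" column (assumed: "$"-prefixed strings,
# optional thousands separators, and the literal string "None").
raw_values = pd.Series(["$200", "$1,000", "None", "$400"])

clean_values = (
    raw_values
    .str.replace(r"[^\d]", "", regex=True)  # keep digits only; "None" becomes ""
    .replace("", "0")                       # empty strings stand for "no value"
    .astype(int)
)

print(clean_values.tolist())  # [200, 1000, 0, 400]
```

Keeping only digits handles "None" and "$1,000" with the same rule, at the cost of silently turning any other non-numeric text into 0.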

We will now move on to the Question and Answer columns. We will normalize them by making all characters lowercase and removing punctuation.

In [10]:
# regex=True is passed explicitly: newer pandas versions treat the pattern as a literal string by default
jeopardy.Question = jeopardy.Question.str.lower().str.replace(r"([^\w\s]*)", "", regex=True)
jeopardy.Answer = jeopardy.Answer.str.lower().str.replace(r"([^\w\s]*)", "", regex=True)
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,200,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,the city of yuma in this state has a record av...,arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,in 1963 live on the art linkletter show this c...,mcdonalds
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,signer of the dec of indep framer of the const...,john adams


# 3. Choosing what to study

With so many things to consider, choosing which kinds of questions to study can be complicated. We will first check how often the answer can be deduced from the question itself by comparing the words in the Question and Answer columns.

In [11]:
def count_word_matches(row):
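    # Select by column label; positional access such as row[-2] is deprecated on labeled Series
    split_question = row["Question"].split()
    split_answer = row["Answer"].split()

    match_counter = 0 # Initializing counter
    
    if "the" in split_answer:
        split_answer.remove("the") # Removing the first "the" from the answer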
            
    if len(split_answer) == 0: #Avoiding zero division error
        return 0
    
    else:
        for word in split_answer:
            if word in split_question:
                match_counter += 1
        return match_counter / len(split_answer)
        
obvious_answer = jeopardy.apply(count_word_matches, axis=1)
round(obvious_answer.mean(), 2)

0.06

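As we can see, on average only about 6% of the words in an answer also appear in its question, so we cannot count on deducing answers from the questions. Next, we will look at how often questions are recycled. To do this, we will track "complex" words, meaning words longer than 5 characters, and check whether they have appeared in earlier questions.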
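
Before moving on, here is a minimal sketch of the matching score from the previous cell, applied to two made-up question and answer pairs. The function name match_fraction and the sample strings are hypothetical; the logic is meant to mirror count_word_matches on already normalized text.

```python
def match_fraction(question, answer):
    """Fraction of answer words (ignoring one "the") that also appear in the question."""
    question_words = question.split()
    answer_words = answer.split()

    if "the" in answer_words:
        answer_words.remove("the")  # drops only the first "the"

    if not answer_words:  # avoid dividing by zero
        return 0

    hits = sum(word in question_words for word in answer_words)
    return hits / len(answer_words)


# Made-up examples, already lowercased and stripped of punctuation.
print(match_fraction("this ocean is the largest ocean on earth", "the pacific ocean"))  # 0.5
print(match_fraction("he painted the mona lisa", "leonardo da vinci"))                  # 0.0
```

In the first pair, one of the two remaining answer words ("ocean") appears in the question, which gives 0.5. The 0.06 reported above is the mean of this score over all rows.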

In [12]:
jeopardy = jeopardy.sort_values("Air Date")

words_used = set() #Using a set to prevent duplications

def overlap_counter(question):
    split_question = question.split()
    
    match_counter = 0
    
    split_question = [word for word in split_question if len(word) > 5]
    
    for word in split_question:
        if word in words_used:
            match_counter += 1
        words_used.add(word)
        
    if len(split_question) > 0:
        match_counter /= len(split_question)
        
    return match_counter
    
jeopardy["question_overlap"] = jeopardy.Question.apply(overlap_counter)
round(jeopardy.question_overlap.mean(), 2)

0.69

As we can see, on average about 69% of the complex words in a question have already appeared in an earlier question. Because this compares single words rather than whole phrases, it only suggests that some questions might be recycled.

An important factor in choosing what to study is question value. As we recall from the cleaning step, some questions were assigned a value of 0 dollars. We will now use the Value column to separate low value from high value questions, treating any question worth more than 600 dollars as high value. We will then look for words whose usage differs most between the two groups, using a random sample of 10 words from the words_used set built in the cell above.
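
To see how this overlap score behaves, the sketch below applies the same idea to three made-up questions, processed in order. The function name overlap_scores and the sample questions are hypothetical. As in the cell above, a word counts as repeated if it has been seen earlier, including earlier in the same question.

```python
def overlap_scores(questions, min_length=6):
    """For each question, the share of its long words that were already seen."""
    seen = set()
    scores = []

    for question in questions:
        long_words = [word for word in question.split() if len(word) >= min_length]

        repeated = 0
        for word in long_words:
            if word in seen:
                repeated += 1
            seen.add(word)

        scores.append(repeated / len(long_words) if long_words else 0)

    return scores


toy_questions = [
    "the capital of france is paris",           # long words: capital, france
    "this french capital hosted the olympics",  # french, capital, hosted, olympics
    "olympics mascot",                          # olympics, mascot
]
print(overlap_scores(toy_questions))  # [0.0, 0.25, 0.5]
```

The scores rise as the vocabulary of seen words grows, which is one reason a high average on a large dataset does not by itself prove that whole questions are reused.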

In [13]:
jeopardy["high_value"] = jeopardy.Value > 600

words_used = list(words_used)

def high_low_counter(word):
    low = 0
    high = 0
    
    for index, row in jeopardy.iterrows():
        question_split = row["Question"].split()
        
        if word in question_split:
            if row["high_value"]:
                high += 1
            else:
                low += 1
    
    return high, low

random.seed(1) # Seeds the sampler; the list built from a set can still be ordered differently between sessions
sample_words = random.sample(words_used, 10)

print(sample_words)

observation_expectation = [high_low_counter(w) for w in sample_words]

print(observation_expectation)

['middlesboro', 'drummers', 'debuted', 'bermudas', 'descartes', 'overwhelming', 'respecta', 'screenwriting', 'hacked', 'foundation']
[(1, 0), (0, 2), (13, 17), (0, 1), (1, 0), (0, 1), (1, 0), (0, 1), (0, 1), (5, 7)]


Now we have our sample of words (each longer than 5 characters) and, for each word, the number of high valued and low valued questions it occurs in, respectively.
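
The counting cell above rescans the whole DataFrame once per word, which gets slow for larger samples. A possible alternative, sketched here on a toy DataFrame, is to pass over the questions once and tally every sampled word as it appears. The function name count_high_low and the toy rows are hypothetical; only the column names Question and high_value come from the notebook.

```python
from collections import Counter

import pandas as pd


def count_high_low(frame, words):
    """Return (high, low) question counts for each word, in one pass over the rows."""
    targets = set(words)
    high, low = Counter(), Counter()

    for question, is_high in zip(frame["Question"], frame["high_value"]):
        # Each word is counted at most once per question, as in the cell above.
        present = targets.intersection(question.split())
        (high if is_high else low).update(present)

    return [(high[word], low[word]) for word in words]


toy = pd.DataFrame({
    "Question": [
        "this empire debuted its coinage",
        "the foundation debuted in 1913",
        "a foundation of rock",
    ],
    "high_value": [True, False, False],
})
print(count_high_low(toy, ["debuted", "foundation"]))  # [(1, 1), (0, 2)]
```

The result has the same shape as observation_expectation, so it could be fed to the chi-squared step unchanged.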

# 4. Chi-Squared test

We will now run a Chi-Squared test to determine statistical significance.

In [14]:
high_value_count = jeopardy.high_value.sum()
low_value_count = jeopardy.high_value.shape[0] - high_value_count

for high, low in observation_expectation:
    total = high + low
    
    total_proportion = total / jeopardy.shape[0]
    
    expected_high_term_freq = total_proportion * high_value_count
    expected_low_term_freq = total_proportion * low_value_count
    
    observe = np.array([high, low])
    expect = np.array([expected_high_term_freq, expected_low_term_freq])
    
    chisquare_value, pvalue = chisquare(observe, expect)

    print(chisquare_value, pvalue)

1.290836197021764 0.2558939073829579
1.5493832638211025 0.21322651092741213
0.0012399639609127077 0.9719097988788106
0.7746916319105512 0.37876956025360686
1.290836197021764 0.2558939073829579
0.7746916319105512 0.37876956025360686
1.290836197021764 0.2558939073829579
0.7746916319105512 0.37876956025360686
0.7746916319105512 0.37876956025360686
0.01923290743009192 0.8897008433673155


As we can see, all of the p-values are above 0.05, so none of the words in our sample shows a statistically significant difference in usage between high and low valued questions. In other words, the observed differences in occurrence can be attributed to chance. Note also that most of these words occur only once or twice, and the chi-squared test is not reliable at such low frequencies.

# Conclusion
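
To make the test statistic less of a black box, the sketch below recomputes it by hand as the sum of (observed - expected)^2 / expected and compares it with scipy's result. The split of 400 high valued and 600 low valued questions is a made-up assumption, not the real split in our dataset, and the helper names are hypothetical. It also flags cases where an expected count is below 5, a common rule of thumb for when the chi-squared approximation becomes unreliable.

```python
import numpy as np
from scipy.stats import chisquare


def expected_counts(high, low, n_high, n_low):
    """Counts we would expect if the word were spread evenly across all questions."""
    share = (high + low) / (n_high + n_low)
    return np.array([share * n_high, share * n_low])


def chi_squared_by_hand(observed, expected):
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(((observed - expected) ** 2 / expected).sum())


n_high, n_low = 400, 600  # hypothetical split, for illustration only

for high, low in [(30, 20), (1, 0)]:
    expected = expected_counts(high, low, n_high, n_low)
    by_hand = chi_squared_by_hand([high, low], expected)
    statistic, pvalue = chisquare([high, low], expected)
    large_enough = bool(expected.min() >= 5)  # rule-of-thumb check
    print((high, low), round(by_hand, 4), round(float(statistic), 4), large_enough)
```

For the pair (1, 0), both expected counts are below 1, so a p-value computed from it says very little, whatever its size.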

In this project, after brief exploration and cleaning of our dataset, we have analyzed our data from different aspects. We have gained insights that might help us create a more fitting study path for competition prep. These include:

* On average, about 6% of the words in an answer also occurred in its question.
* On average, about 69% of the complex words in a question (usually terms) had occurred in earlier questions.
* None of the sampled terms showed a significant difference in occurrence between high and low valued questions. Most of them also occurred too rarely for the test to be meaningful.