In [1]:
from IPython.display import Image
from IPython.core.display import HTML 
Image(url= "http://d3v3hv0xzfxuou.cloudfront.net/wp-content/uploads/2018/04/10103408/maxresdefault-2.jpg")


## Introduction:

Jeopardy is a popular TV show in the United States where participants answer random trivia questions to win money. It's been running since the early 60's and has still maintained a large viewership. 

## Objective:

Let's say I want to compete on Jeopardy, so I decide to analyze some questions from their database to see if I can discover any patterns or insights for a competitive advantage. In this project I am going to clean up the dataset, come up with a strategy, and explore the questions to see if there are any relevant insights to help me win the game.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("JEOPARDY_CSV.csv")
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
Show Number    216930 non-null int64
 Air Date      216930 non-null object
 Round         216930 non-null object
 Category      216930 non-null object
 Value         216930 non-null object
 Question      216930 non-null object
 Answer        216928 non-null object
dtypes: int64(1), object(6)
memory usage: 11.6+ MB


In [4]:
df = df.dropna()

## Tidying up the data:

- Since some of the columns names have spaces I'll tidy up the names.
- The answer, question, and value columns need to be cleaned. Both the answer and question columns have non-numeric characters like "." that need to be removed. I'll create a function to this.
- The value column isn't in numeric format, so I'll convert that to numeric type.
- The air_date column needs to be converted to datetime.

In [5]:
df.columns = ['show_number', 'air_date', 'round', 'category', 'value', 'question', 'answer']

In [6]:
df.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [7]:
df["answer"].sample(5), df["question"].sample(5)

(71230            shin splints
 57064         Double entendre
 172170    Freddie Prinze, Jr.
 105169       Ulysses S. Grant
 42834           Vinson Massif
 Name: answer, dtype: object,
 72905     The pygmy variety of this small South American...
 150680    Once home to the Hittites, today this region i...
 110763    Her final testament, read in public after her ...
 153370             To poke, perhaps with a cattle implement
 189599    In 1954 Monaco's Prince Rainier created the Or...
 Name: question, dtype: object)

Creating a function to clean answer and question columns - this will make it much easier to analyze later.

In [8]:
import re
def cleaner(string):
    lower = string.lower()
    clean = re.sub("[^A-za-z0-9\s]", "", lower)
    return clean

In [9]:
# check function
cleaner("No way. That's impossible.")

'no way thats impossible'

In [10]:
df["clean_answer"] = df["answer"].apply(cleaner)
df["clean_answer"].head()

0    copernicus
1    jim thorpe
2       arizona
3     mcdonalds
4    john adams
Name: clean_answer, dtype: object

In [11]:
df["clean_question"] = df["question"].apply(cleaner)
df["clean_question"].head()

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
3    in 1963 live on the art linkletter show this c...
4    signer of the dec of indep framer of the const...
Name: clean_question, dtype: object

In [12]:
df["value"].head(), df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 9 columns):
show_number       216928 non-null int64
air_date          216928 non-null object
round             216928 non-null object
category          216928 non-null object
value             216928 non-null object
question          216928 non-null object
answer            216928 non-null object
clean_answer      216928 non-null object
clean_question    216928 non-null object
dtypes: int64(1), object(8)
memory usage: 16.6+ MB


(0    $200
 1    $200
 2    $200
 3    $200
 4    $200
 Name: value, dtype: object, None)

Creating a function to convert value column into integers - this way we can do calculations.

In [13]:
def cleaner2(string):
    string = re.sub("[^A-za-z0-9\s]", "", string)
    try:
        num = int(string)
    except Exception:
        num = 0
        
    return num

In [14]:
df["clean_value"] = df["value"].apply(cleaner2)
df["clean_value"].head()

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64

Converting air_date into datetime - this will make analyzing time much easier.

In [15]:
df["air_date"] = pd.to_datetime(df["air_date"])
df["air_date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: air_date, dtype: datetime64[ns]

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 216928 entries, 0 to 216929
Data columns (total 10 columns):
show_number       216928 non-null int64
air_date          216928 non-null datetime64[ns]
round             216928 non-null object
category          216928 non-null object
value             216928 non-null object
question          216928 non-null object
answer            216928 non-null object
clean_answer      216928 non-null object
clean_question    216928 non-null object
clean_value       216928 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 18.2+ MB


In [17]:
df.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_answer,clean_question,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,copernicus,for the last 8 years of his life galileo was u...,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,jim thorpe,no 2 1912 olympian football star at carlisle i...,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,arizona,the city of yuma in this state has a record av...,200
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,mcdonalds,in 1963 live on the art linkletter show this c...,200
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,john adams,signer of the dec of indep framer of the const...,200


## Strategy: 

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things before choosing a strategy:

- How often the answer is deducible from the question?
- How often new questions are repeats of older questions?

I can answer the second question by figuring out how often "complex" words (any word over 6 characters) reoccur. I can answer the first question by figuring out how many times some words in the answer also occur in the question. I'll start with the first question then work my way to the second.

I'll create a function to go through the answers and questions, removing words like "the" (I consider those worse extraneous for this analysis), and finding the proportion of answers that occur in the questions. Afterwards, I'll take the average and convert the result into a percentage.

In [18]:
def jeopordy(row):
    split_answer = row['clean_answer'].split(" ")
    split_question = row['clean_question'].split(" ")
    match_count = 0
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count/len(split_answer)
            

In [19]:
df["answer_in_question"] = df.apply(jeopordy, axis=1)

In [20]:
df["answer_in_question"].mean()*100

5.932118970835468

## Findings:

Only 6% of the answers are in the question and this isn't a large amount given the sample size - this dataset is only a small sample from Jeoparday's database of questions. 

Now, I'll create a function to filter for the words that have at least 5 letters, and then find the proportion of repeated questions. Afterwards, I'll the find mean and convert it to a percentage. Along the way I'll find the unique terms used using a set().

In [21]:
question_overlap = []
terms_used = set()
for i, v in df.iterrows():
    split_question = v["clean_question"].split(" ")
    split_question = [i for i in split_question if len(i) > 5]
    match_count = 0
    for i in split_question:
        if i in terms_used:
            match_count += 1
    for i in split_question:
        terms_used.add(i)
    if len(split_question) > 0:
        match_count /= len(split_question)
        
    question_overlap.append(match_count)
        



In [31]:
df["question_overlap"] = question_overlap
df["question_overlap"].mean()*100

87.34641512699052

In [23]:
terms_used

{'19941996',
 'jockey',
 'veraellen',
 'stitzer',
 'goldsmith',
 'wheeling',
 'stagecoaches',
 'deceive',
 'grooving',
 'dingdong',
 'gramps',
 'rostropovich',
 'mormons',
 'divebombing',
 'kaminker',
 '60foot',
 'hrefhttpwwwjarchivecommedia20080918_dj_08jpg',
 'giftsbut',
 'spelman',
 'hrefhttpwwwjarchivecommedia20060221_dj_17jpg',
 'tomorrowland',
 'hrefhttpwwwjarchivecommedia20120102_j_02ajpg',
 'coworker',
 'charmingly',
 'omniscient',
 'excitement',
 'burgos',
 'handiwork',
 'instrumentals',
 'publicize',
 'yemanja',
 'defibrillator',
 'autoclaved',
 'javelin',
 'ginastera',
 'attallah',
 'hrefhttpwwwjarchivecommedia20080725_j_04jpg',
 'rijsttafel',
 'struts',
 '22211foot',
 'westerplatte',
 'overlooking',
 'officesa',
 'copperrolling',
 'longricola',
 'microprocessor',
 'hrefhttpwwwjarchivecommedia20070404_j_09jpg',
 'hrefhttpwwwjarchivecommedia20090316_j_12mp3monsieur',
 'ananas',
 'desertdwelling',
 'rawalpindi',
 'reginalds',
 'singlehorned',
 'writfen',
 'classifieds',
 'perf

## Findings:

There is about an 87% overlap between the terms in new questions and the terms in old questions. This only looks at a small set of questions though - it doesn't look at phrases - and it only looks at single terms. This makes it relatively insignificant because were not looking at the entirety of the question; but, it does mean that it's worth looking more into the recycling of questions to see if we can discover any insight.

I'll now take a look at the values of the questions - maybe I can find something by comparing high and low value questions. I'll create a function to categorize questions into high (greater than 800) and low: 0 corresponds to low, 1 corresponds to high.

In [24]:
def values(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value
        
df["high_value"] = df.apply(values, axis=1)
df["high_value"].value_counts(normalize=True)

0    0.716855
1    0.283145
Name: high_value, dtype: float64

With value categories, I can create a function to count up the low and high value questions. Afterwards, I will find the amount of times the terms - in the terms_used (unique words) set - are found within high or low value questions. We can then use this observed counts to find their expected counts to do a chi square test. A chi square test can possibly help me figure out which terms correspond to high value questions - which can maybe help me earn more money on the game.

In [25]:
def counts(term):
    low_count = 0
    high_count = 0
    for i, v in df.iterrows():
        if term in v["clean_question"].split(" "):
            if v["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return low_count, high_count   

observed_expected = []
comparison_terms = list(terms_used)[:5]
for term in comparison_terms:
    observed_expected.append(counts(term))
    
observed_expected

[(1, 0), (31, 6), (0, 1), (1, 0), (8, 11)]

Finding chi-square values and p-values for a small sample of terms.

In [29]:
from scipy.stats import chisquare


high_value_count = df[df["high_value"] == 1].shape[0]
low_value_count = df[df["high_value"] == 0].shape[0]

chi_squared = []
for tup in observed_expected:
    total = sum(tup)
    total_prop = total/len(df)
    high_val_rows = total_prop * high_value_count
    low_val_rows = total_prop * low_value_count
    
    observed = np.array([tup[0], tup[1]])
    expected = np.array([high_val_rows, low_val_rows])
    chi_squared.append(chisquare(observed, expected))
    

In [30]:
chi_squared

[Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=56.08768670372833, pvalue=6.930983431963572e-14),
 Power_divergenceResult(statistic=0.39498154412048414, pvalue=0.5296924445254796),
 Power_divergenceResult(statistic=2.5317638631109367, pvalue=0.11157543053393834),
 Power_divergenceResult(statistic=1.7802975830357144, pvalue=0.18211278920192042)]

## Conclusion:

None of the terms had a statistically significant difference in their usage between high value and low value rows. Also, the frequencies were all lower than 5, giving the chi-squared test not much statistical significance. It would be better to run this test with only terms that have higher frequencies, but I'll save that for a future project.