# Winning Jeopardy
---
Jeopardy is a popular TV show in the US where participants answer questions to win money. It's been running for many years, and is a major force in popular culture. 

Imagine that you want to compete on Jeopardy, and you're looking for any way to win. In this project, you'll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help you win.

The dataset is named jeopardy.csv, and contains 20000 rows from the beginning of a full dataset of Jeopardy questions, which you can download [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file).

Each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

- `Show Number` - the Jeopardy episode number
- `Air Date` - the date the episode aired
- `Round` - the round of Jeopardy
- `Category` - the category of the question
- `Value` - the number of dollars the correct answer is worth
- `Question` - the text of the question
- `Answer` - the text of the answer

### Prerequisite
#### 1. Import modules and CSV

In [1]:
#Importing modules
import pandas as pd
import numpy as np
import re

In [2]:
#Importing the dataset
jeopardy=pd.read_csv('jeopardy.csv')
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


#### 2. Removing spaces from column names

In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

As the column names above show, some of the names have a leading space. We will remove it.

In [4]:
#Stripping leading and trailing whitespace from the column names
jeopardy.columns=jeopardy.columns.str.strip()
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


#### 3. Normalizing string columns (Question and Answer)
Before doing our analysis, we will convert every word to lowercase and remove punctuation, so that `Don't` and `don't` aren't considered different words when we compare them.

In [5]:
def stringnorm(x):
    #Lowercasing, then removing everything except word characters and whitespace
    temp=x.lower()
    temp=re.sub(r'[^\w\s]','',temp)
    return temp

#Applying the function above
jeopardy['clean_question']=jeopardy['Question'].apply(stringnorm)
jeopardy['clean_answer']=jeopardy['Answer'].apply(stringnorm)

#Displaying the result
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona
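
To see what this normalization does to individual strings, here is a minimal standalone sketch. The helper name `normalize_text` and the sample strings are made up for illustration; the logic mirrors the lowercase-and-strip-punctuation approach of `stringnorm`.

```python
import re

def normalize_text(text):
    # Lowercase, then remove every character that is not a word character or whitespace.
    return re.sub(r'[^\w\s]', '', text.lower())

# Made-up sample strings.
samples = ["Don't", "don't", "McDonald's", "well-known"]
for s in samples:
    print(repr(s), '->', repr(normalize_text(s)))
```

One side effect worth knowing: removing punctuation joins the parts of hyphenated words (`well-known` becomes `wellknown`), and `\w` keeps underscores.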


#### 4. Normalizing the Value and Air Date Columns

The `Value` column should be numeric, so we need to remove the dollar sign and convert the string into an integer. `Air Date` should be a datetime.

In [6]:
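#Modifying the value column
def valuenorm(x):
    #Removing '$' and ',' then converting to int; non-numeric or missing values become 0
    try:
        return int(re.sub(r'[$,]','',x))
    except (ValueError,TypeError):
        return 0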

jeopardy['clean_value']=jeopardy['Value'].apply(valuenorm)

jeopardy['clean_value'].head(2)

0    200
1    200
Name: clean_value, dtype: int64
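
The fallback behaviour of this conversion is easy to miss, so here is a small standalone sketch of it. The function name `parse_value` and the inputs are hypothetical; in particular, it is an assumption that the `Value` column may contain non-numeric text or missing values.

```python
import re

def parse_value(raw):
    # Strip '$' and ',' and convert to int; fall back to 0 when that fails.
    try:
        return int(re.sub(r'[$,]', '', raw))
    except (ValueError, TypeError):
        # ValueError: text that is not a number; TypeError: non-string input such as NaN.
        return 0

# Made-up inputs, including two that cannot be converted.
for raw in ['$200', '$2,000', 'None', float('nan')]:
    print(repr(raw), '->', parse_value(raw))
```

Keep in mind that any row that falls back to 0 is treated as a low-value question by the `clean_value > 800` comparison used later.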

In [7]:
#Converting Air Date into datetime
jeopardy['Air Date']=pd.to_datetime(jeopardy['Air Date'])

jeopardy['Air Date'].head(2)

0   2004-12-31
1   2004-12-31
Name: Air Date, dtype: datetime64[ns]

### How often can the answer be found in the question?

In [8]:
def ansque(x):
    split_answer=x['clean_answer'].split(' ')
    split_question=x['clean_question'].split(' ')
    match_count=0
    if 'the' in split_answer:split_answer.remove('the')
    if len(split_answer)==0:return 0
    for i in split_answer:
        if i in split_question:match_count+=1
    return match_count/len(split_answer)

#Applying the function
jeopardy['answer_in_question']=jeopardy.apply(ansque,axis=1)

#Finding the mean of answer in question
print('{}% of the time, the answer is in the question itself.'.format(round(jeopardy['answer_in_question'].mean()*100,3)))

6.049% of the time, the answer is in the question itself.


On average, about 6% of the words in an answer also appear in its question. This is too low to rely on: we usually cannot read the answer off the question, so we will still need to study.
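
The per-row score behind this figure can be illustrated on a single made-up clue. The sketch below follows the same steps as `ansque` but takes plain strings instead of a DataFrame row; the name `answer_match_fraction` and the example clue are assumptions for illustration only.

```python
def answer_match_fraction(clean_question, clean_answer):
    # Fraction of answer words (ignoring one 'the') that also occur in the question.
    answer_words = clean_answer.split(' ')
    question_words = clean_question.split(' ')
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Made-up clue, already normalized (lowercase, no punctuation).
question = 'this new york paper is nicknamed the gray lady'
answer = 'the new york times'
print(answer_match_fraction(question, answer))
```

Here two of the three remaining answer words (`new`, `york`) occur in the question. The reported mean averages these per-row fractions, so it is not the share of questions that contain their full answer.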

### How often are new questions repeats of older ones?
We cannot fully answer this question, since we only have about 10% of all Jeopardy questions, but the sample can at least give us a rough idea.

In [9]:
question_overlap=[]
terms_used=set()
jeopardy.sort_values(by='Air Date',inplace=True)

for i, x in jeopardy.iterrows():
    split_question=x['clean_question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    match_count=0
    for j in split_question:
        if j in terms_used:match_count+=1
        terms_used.add(j)
    if len(split_question)>0:match_count=match_count/len(split_question)
    question_overlap.append(match_count)
    
jeopardy['question_overlap']=question_overlap

print('{}% of the questions overlap'.format(100*jeopardy['question_overlap'].mean()))
            

68.94006357823183% of the questions overlap


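Based on our sample, on average about 69% of the longer words (six or more characters) in a question have already appeared in an earlier question. This measures overlap of individual words, not of whole questions, so it only suggests that question topics tend to recur.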
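
To make the overlap score concrete, here is a minimal sketch of the same bookkeeping on three made-up questions, processed in order. The function name `overlap_scores` and the sample questions are hypothetical.

```python
def overlap_scores(clean_questions):
    # For each question (in order), the share of its 6+ character words seen earlier.
    seen = set()
    scores = []
    for text in clean_questions:
        words = [w for w in text.split(' ') if len(w) > 5]
        hits = 0
        for w in words:
            if w in seen:
                hits += 1
            seen.add(w)
        scores.append(hits / len(words) if words else 0)
    return scores

# Made-up, already normalized questions in air-date order.
questions = [
    'galileo studied planets orbiting jupiter',
    'jupiter has many moons',
    'galileo built a telescope',
]
scores = overlap_scores(questions)
print(scores)
print(sum(scores) / len(scores))
```

The second question gets a perfect score because its only long word, `jupiter`, was seen before, even though the question itself is new. A high average is therefore a word-level signal rather than evidence of repeated questions.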

### Words in low vs high value questions 

In [10]:
def highval(x):
    if x['clean_value']>800:return 1
    else: return 0

#Applying the function which determines if the value is greater than 800
jeopardy['high_value']=jeopardy.apply(highval,axis=1)

In [11]:
def use_count(x):
    low_count=0
    high_count=0
    for i,row in jeopardy.iterrows():
        split_question=row['clean_question'].split(' ')
        if x in split_question: 
            if row['high_value']==1:
                high_count+=1
            else:low_count+=1
    return high_count,low_count

In [12]:
from random import sample

terms_used_list = list(terms_used)
comparison_terms = sample(terms_used_list,10)

observed_expected=[]

for i in comparison_terms:
    observed_expected.append(use_count(i))

observed_expected

[(0, 1),
 (0, 1),
 (0, 1),
 (5, 16),
 (0, 1),
 (1, 0),
 (3, 1),
 (0, 1),
 (1, 0),
 (0, 1)]
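
Because `use_count` scans every row once per term, counting many terms gets slow. One possible alternative, sketched below on made-up rows, tallies all terms in a single pass with `collections.Counter`. The function name `count_terms_by_value` and the sample rows are assumptions, not part of the original analysis.

```python
from collections import Counter

def count_terms_by_value(rows):
    # rows: iterable of (clean_question, high_value) pairs, with high_value equal to 0 or 1.
    high_counts = Counter()
    low_counts = Counter()
    for clean_question, high_value in rows:
        # A set counts each word once per question, like `x in split_question`.
        words = set(clean_question.split(' '))
        if high_value == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Made-up rows for illustration.
rows = [
    ('the galileo telescope', 1),
    ('galileo galileo was famous', 0),
    ('a famous telescope', 0),
]
high_counts, low_counts = count_terms_by_value(rows)
for term in ['galileo', 'famous', 'telescope']:
    print(term, (high_counts[term], low_counts[term]))
```

With the real DataFrame, the pairs could be produced with `zip(jeopardy['clean_question'], jeopardy['high_value'])`.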

### Calculating chi-square

In [13]:
from scipy.stats import chisquare

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared=[]

for i in observed_expected:
    total=sum(i)
    total_prop=total/jeopardy.shape[0]
    high_ex=total_prop*high_value_count
    low_ex=total_prop*low_value_count
    expected=[high_ex,low_ex]
    observed=[i[0],i[1]]
    chi_squared.append(chisquare(observed,expected))
    
chi_squared

[Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.24272816849400825, pvalue=0.6222425942945591),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=4.198022975221989, pvalue=0.0404711362009595),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469)]

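Of the 10 sampled terms, only one (observed counts `(3, 1)`) has a p-value below the common 0.05 threshold (about 0.040); the other nine are not statistically significant. Even that one result should not be trusted: most of these terms occur only once in the dataset, and the chi-square test is only reliable when the expected counts are reasonably large (a common rule of thumb is at least 5 per category). A more meaningful test would restrict the comparison to terms that occur more frequently.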
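
The calculation behind each result can be written out by hand, which also makes the small-sample problem visible. The sketch below uses only the standard library. The group sizes (6000 high-value and 14000 low-value rows) are hypothetical round numbers, not the real counts from `jeopardy.csv`, so the statistics will not match the output above.

```python
import math

def chi_square_two_groups(high_obs, low_obs, high_rows, low_rows):
    # Expected counts assume the term is spread in proportion to the group sizes.
    total_rows = high_rows + low_rows
    total_obs = high_obs + low_obs
    high_exp = total_obs * high_rows / total_rows
    low_exp = total_obs * low_rows / total_rows
    stat = (high_obs - high_exp) ** 2 / high_exp + (low_obs - low_exp) ** 2 / low_exp
    # Two categories give 1 degree of freedom, so the p-value is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    # Rule of thumb: the approximation is shaky when any expected count is below 5.
    reliable = min(high_exp, low_exp) >= 5
    return stat, p_value, reliable

# Hypothetical group sizes, NOT the real counts from jeopardy.csv.
HIGH_ROWS, LOW_ROWS = 6000, 14000
for observed in [(3, 1), (5, 16)]:
    stat, p_value, reliable = chi_square_two_groups(*observed, HIGH_ROWS, LOW_ROWS)
    note = 'ok' if reliable else 'expected counts too small'
    print(observed, round(stat, 3), round(p_value, 3), note)
```

Under these assumed group sizes, a term seen only four times has an expected high-value count of about 1.2, well below 5, so its p-value says little regardless of which side of 0.05 it falls on.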