# Winning Jeopardy

In this project, I am going to explore Jeopardy data to see if any  pattern in the questiones.

The dataset is named jeopardy.csv and contains 20000 rows from the beginning of a full dataset of Jeopardy questions.

Each row in the dataset represents a single question on a single episode of Jeopardy.

## Review the Data

In [1]:
import numpy as np
import pandas as pd
df=pd.read_csv('jeopardy.csv')

In [2]:
df.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
 Air Date      19999 non-null object
 Round         19999 non-null object
 Category      19999 non-null object
 Value         19999 non-null object
 Question      19999 non-null object
 Answer        19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [4]:
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [5]:
#creat a function to remove the spaces in each item in jeopardy.columns.
df.columns=df.columns.str.strip()
df.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

We need to normalize all of the text columns (the Question and Answer columns).We need to ensure that you lowercase words and remove punctuation so Don't and don't aren't considered to be different words

In [6]:
import re
def normalize(string):
    text=string.lower()
    text=re.sub('[^\w\s]','',text)
    return text

df['clean_question']=df['Question'].apply(normalize)
df['clean_answer']=df['Answer'].apply(normalize)

The Value column should also be numeric, to allow you to manipulate it more easily. We need to remove the dollar sign from the beginning of each value and convert the column from text to numeric.

In [7]:
def normalize_value(string):
    value=re.sub('[^\w\s]','',string)
    try:
        value=int(value)
    except Exception:
        value=0
    return value

normalize_value('$200')

200

In [8]:
df['clean_value']=df['Value'].apply(normalize_value)

Change Air Date data type from object to datetime

In [9]:
df['Air Date']=pd.to_datetime(df['Air Date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


# Analysis
In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

1,How often the answer is deducible from the question.
2,How often new questions are repeats of older questions.
We can answer the second question by seeing how often complex words (> 6 characters) reoccur. We can answer the first question by seeing how many times words in the answer also occur in the question. We'll work on the first question now, and come back to the second.

### 1,How often the answer is deducible from the question?

In [10]:
def count_matches_ratio(row):
    answer = row['clean_answer']
    question = row['clean_question']
    split_answer = answer.split()
    split_question = question.split()
    match_count = 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count/len(split_answer)
  
df['answer_in_question'] = df.apply(count_matches_ratio, axis = 1)  
df['answer_in_question'].mean()

0.05900196524977763

Only 6% of words are found in the questions. It is a quiet low number. We can't deduce answer from the question.  Hence,We have to study the question. 

### 2,How often new questions are repeats of older questions?

In [11]:
question_overlap = []
terms_used = set()
df.sort_values('Air Date', inplace = True)
for i, row in df.iterrows():
    split_question = row['clean_question'].split()
    split_question = [q for q in split_question if len(q)>= 6]
    match_count = 0
    for term in split_question:
        if term in terms_used:
            match_count += 1
        else:
            terms_used.add(term)
    if len(split_question) > 0:
        match_count = match_count / len(split_question)
    question_overlap.append(match_count)
    
df['question_overlap']=question_overlap
df['question_overlap'].mean()

0.6894006357823182

69% of question overlap means that 69% of words in new questions are same as old question. It is not low number, however, we are studying on single words only instaed of a phrase, which means that it is relatively insignificant, but it does mean that it's worth looking more into the recycling of questions.

### Low value vs high value questions

If we want to study only high value questions to win more money. We can figure out which terms correspond to high-value questions using a chi-squared test.

We'll first need to narrow down the questions into two categories:

Low value -- Any row where Value is less than 800.
High value -- Any row where Value is greater than 800.

We'll then be able to loop through each of the terms from the last screen, terms_used, and:

-Find the number of low value questions the word occurs in.
-Find the number of high value questions the word occurs in.
-Find the percentage of questions the word occurs in.
-Based on the percentage of questions the word occurs in, find expected counts.
-Compute the chi squared value based on the expected counts and the observed counts for high and low value questions.

We can then find the words with the biggest differences in usage between high and low value questions, by selecting the words with the highest associated chi-squared values. Doing this for all of the words would take a very long time, so we'll just do it for a small sample now.

In [12]:
def determin_value(row):
        if row['clean_value']>800:
            value=1
        else:
            value=0
        return value

df['high_value'] = df.apply(determin_value, axis = 1)    
df['high_value'].sum()/len(df['high_value'])

0.28671433571678584

Only 29% questions are categoried as high-value question.

In [13]:
def count_word(word):
    low_count=0
    high_count=0
    for i,row in df.iterrows():
        spli_ques=row['clean_question'].split()
        if word in spli_ques:
            if row['high_value']==1:
                high_count +=1
            else:
                low_count +=1
    return low_count,high_count

In [14]:
terms_used_list=list(terms_used)
import random 
sample=random.sample(terms_used_list,10)
comparison_terms=sample
comparison_terms




['immense',
 'livres',
 'unauthorized',
 'dagger',
 'selling',
 'topten',
 'undefeated',
 'chalabi',
 'leprosy',
 'christendom']

In [15]:
observed_expected=[]
for word in comparison_terms:
    observed_expected.append(count_word(word))

    
print(comparison_terms)    
print(observed_expected)#use observed counts to compute expercted counts

['immense', 'livres', 'unauthorized', 'dagger', 'selling', 'topten', 'undefeated', 'chalabi', 'leprosy', 'christendom']
[(2, 0), (1, 0), (1, 0), (1, 0), (14, 3), (1, 0), (2, 0), (1, 0), (1, 0), (1, 0)]


### Chi-Squared Test


Now that we've found the observed counts for a few terms, you can compute the expected counts and the chi-squared value.

In [20]:
from scipy.stats import chisquare
import numpy as np

high_value_count = df[df["high_value"] == 1].shape[0]
low_value_count = df[df["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / df.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=22.21568659171183, pvalue=2.436747872923759e-06),
 Power_divergenceResult(statistic=11.268291701322743, pvalue=0.0007884224991512313),
 Power_divergenceResult(statistic=33.2332216282488, pvalue=8.17420115468397e-09),
 Power_divergenceResult(statistic=11.268291701322743, pvalue=0.0007884224991512313),
 Power_divergenceResult(statistic=44.63061715194016, pvalue=2.3794153105180522e-11),
 Power_divergenceResult(statistic=11.107843295855915, pvalue=0.0008596339784277207),
 Power_divergenceResult(statistic=377.36865867552007, pvalue=4.654261402112294e-84),
 Power_divergenceResult(statistic=11.268291701322743, pvalue=0.0007884224991512313),
 Power_divergenceResult(statistic=11.268291701322743, pvalue=0.0007884224991512313),
 Power_divergenceResult(statistic=11.107843295855915, pvalue=0.0008596339784277207)]

### Chi-squared results
None of the terms had a significant difference in usage between high value and low value rows. Additionally, the frequencies were all lower than 5, so the chi-squared test isn't as valid. It would be better to run this test with only terms that have higher frequencies.

# Conclusion

-On average about 6% of the words of answers are found in the questions. So the chance of deducing the answer from the question is quite low.
-About 69% of the complex words in questions are repeated so studying the past questions can be really helpful to win.