## Winning Jeopardy

In this project I work with a dataset of Jeopardy questions and look for patterns in the questions that could help a contestant win.

In [1]:
import pandas as pd
import numpy as np
import re
import random
from scipy.stats import chisquare
import warnings
warnings.filterwarnings('ignore')

In [2]:
jeopardy= pd.read_csv('jeopardy.csv')
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

#### Columns cleaning

In [4]:
jeopardy.columns= jeopardy.columns.str.strip()
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

#### Normalizing text in the Question and Answer columns

In [5]:
def normalize(string):
    punctuations = '''!()-[]{};:'"\,<>./?@#$%^&*+_~'''
    string= string.lower()
    clean_str= ''
    for char in string:
        if char not in punctuations:
            clean_str += char
    return  clean_str
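
The character-by-character loop in `normalize` can also be written with `str.translate`. The sketch below is one possible alternative, not the method used in this notebook; the function name `normalize_translate` is a made-up name for illustration, and the sample sentence is based on the second row shown earlier.

```python
def normalize_translate(text):
    # Same punctuation set as `normalize`, written as a raw string
    punctuations = r'''!()-[]{};:'"\,<>./?@#$%^&*+_~'''
    # Build a table that maps every punctuation character to None (delete it)
    table = str.maketrans('', '', punctuations)
    return text.lower().translate(table)

cleaned = normalize_translate("No. 2: 1912 Olympian; football star")
print(cleaned)  # no 2 1912 olympian football star
```

Both versions lowercase the text and drop the same characters, so either could be passed to `apply`.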

In [6]:
# Preview the normalized text without overwriting the original Question column
jeopardy.Question.apply(normalize).head(3)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
Name: Question, dtype: object

In [7]:
jeopardy['clean_question']= jeopardy.Question.apply(normalize)
jeopardy.clean_question.head(3)

0    for the last 8 years of his life galileo was u...
1    no 2 1912 olympian football star at carlisle i...
2    the city of yuma in this state has a record av...
Name: clean_question, dtype: object

In [8]:
jeopardy['clean_answer']= jeopardy.Answer.apply(normalize)
jeopardy.clean_answer.head(3)

0    copernicus
1    jim thorpe
2       arizona
Name: clean_answer, dtype: object

#### Normalizing the Value and Air Date columns

In [9]:
def normalize_values(string):
    # Raw string so that \s is read as a regex class, not a string escape
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        string = int(string)
    except Exception:
        string = 0
    return string

In [10]:
jeopardy['clean_value']= jeopardy['Value'].apply(normalize_values)
jeopardy.clean_value[:5]

0    200
1    200
2    200
3    200
4    200
Name: clean_value, dtype: int64
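
To make the behavior of `normalize_values` concrete, the sketch below applies the same logic to a few hand-written strings. The sample inputs are assumptions made up for illustration, not rows taken from the dataset, and the function is repeated here only so the block runs on its own.

```python
import re

def normalize_values(string):
    # Drop everything except letters, digits and whitespace, e.g. "$2,000" -> "2000"
    string = re.sub(r"[^A-Za-z0-9\s]", "", string)
    try:
        return int(string)
    except ValueError:
        # Non-numeric text such as "None" falls back to 0
        return 0

samples = ["$200", "$2,000", "None"]  # hypothetical inputs
results = [normalize_values(s) for s in samples]
print(results)  # [200, 2000, 0]
```

Note that a missing value ends up as 0, so it is indistinguishable from a question that is genuinely worth nothing.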

In [11]:
jeopardy.dtypes

Show Number        int64
Air Date          object
Round             object
Category          object
Value             object
Question          object
Answer            object
clean_question    object
clean_answer      object
clean_value        int64
dtype: object

In [12]:
jeopardy['Air Date']= pd.to_datetime(jeopardy['Air Date'])

In [13]:
jeopardy.dtypes

Show Number                int64
Air Date          datetime64[ns]
Round                     object
Category                  object
Value                     object
Question                  object
Answer                    object
clean_question            object
clean_answer              object
clean_value                int64
dtype: object

#### Answers in questions

In [14]:
def ans(row):
    split_answer= row['clean_answer'].split()
    split_question= row['clean_question'].split()
    match_count= 0
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    for i in split_answer:
        if i in split_question:
            match_count += 1
    return match_count/ len(split_answer)       

In [15]:
jeopardy['answer_in_question']= jeopardy.apply(ans, axis=1)
jeopardy['answer_in_question'][:5]

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: answer_in_question, dtype: float64

In [16]:
jeopardy['answer_in_question'].mean()

0.058861482035140716

On average, only about 6% of the words in an answer also appear in its question, so the answer can rarely be deduced from the question itself.
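
As a quick check of the matching logic, the sketch below runs the same fraction-of-matching-words calculation on two made-up question/answer pairs. The pairs and the helper name `answer_match_fraction` are illustrative assumptions, not part of the dataset or the notebook.

```python
def answer_match_fraction(clean_question, clean_answer):
    answer_words = clean_answer.split()
    question_words = clean_question.split()
    # "the" is so common that it would inflate the match rate
    if 'the' in answer_words:
        answer_words.remove('the')
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# Hypothetical rows: one answer word ("canal") appears in the first question
partial = answer_match_fraction("this canal links the red sea", "the suez canal")
no_match = answer_match_fraction("this state is home to yuma", "arizona")
print(partial, no_match)  # 0.5 0.0
```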

#### Recycled questions

In [17]:
question_overlap= list()
terms_used= set()
jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question= row['clean_question'].split()
    split_question= [word for word in split_question if len(word)>5]       
    match_count= 0    
    
    for word in split_question:
        if word in terms_used:
            match_count +=1
        terms_used.add(word)
    
    if len(split_question) >0:
        match_count /= len(split_question)
    question_overlap.append(match_count)    

In [18]:
jeopardy['question_overlap']= question_overlap
jeopardy['question_overlap'][:3]

19325    0.0
19301    0.0
19302    0.0
Name: question_overlap, dtype: float64

In [19]:
jeopardy['question_overlap'].mean()

0.6889055316620328

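On average, about 69% of the words longer than five characters in a question had already appeared in an earlier question. This compares single words rather than phrases, so it hints that questions may be recycled but doesn't prove it.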
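
To see how the overlap score behaves, the sketch below applies the same loop to three made-up questions, taken in air-date order. The helper name `overlap_scores` and the sample questions are assumptions for illustration only.

```python
def overlap_scores(questions):
    terms_used = set()
    scores = []
    for question in questions:
        # Keep only words longer than five characters, as in the notebook
        words = [w for w in question.split() if len(w) > 5]
        match_count = 0
        for word in words:
            if word in terms_used:
                match_count += 1
            terms_used.add(word)
        scores.append(match_count / len(words) if words else 0)
    return scores

toy_questions = [
    "this famous scientist studied planets",   # all long words are new
    "planets orbit around this yellow star",   # "planets" was seen before
    "a famous yellow flower",                  # "famous" and "yellow" were seen before
]
scores = overlap_scores(toy_questions)
print(scores)
```

The first question scores 0 because nothing has been seen yet, the second scores 1/3, and the third scores 2/3. Scores naturally drift upward for later questions as `terms_used` grows.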

#### Low value vs high value questions

I want to focus on terms that appear more often in high-value questions than in low-value questions, since studying those could help a contestant earn more money on Jeopardy.

A chi-squared test is used to figure out which terms correspond to high-value questions.

In [20]:
def determine_value(row):
    # 1 for questions worth more than $800, 0 otherwise
    return 1 if row['clean_value'] > 800 else 0

In [21]:
jeopardy['high_value']= jeopardy.apply(determine_value, axis=1)

In [22]:
def low_high(word):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        split_question= row['clean_question'].split()
        if word in split_question:
            if row['high_value'] ==1:
                high_count +=1
            else:
                low_count +=1
    return high_count, low_count           

In [23]:
comparison_terms = random.sample(list(terms_used), k=10)

In [24]:
observed_expected= list()

for i in comparison_terms:
    observed_expected.append(low_high(i))

In [25]:
observed_expected

[(3, 3),
 (0, 1),
 (0, 1),
 (0, 1),
 (2, 0),
 (1, 2),
 (0, 1),
 (0, 1),
 (0, 1),
 (1, 2)]
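
`low_high` scans every row once per term, which gets slow as the number of terms grows. One possible alternative, sketched below on made-up data, counts all terms in a single pass with `collections.Counter`. The helper name `count_terms_by_value` and the sample questions are assumptions for illustration, not part of this notebook.

```python
from collections import Counter

def count_terms_by_value(questions, high_flags):
    high_counts, low_counts = Counter(), Counter()
    for question, is_high in zip(questions, high_flags):
        # set() counts a word once per question, like `word in split_question`
        words = set(question.split())
        if is_high == 1:
            high_counts.update(words)
        else:
            low_counts.update(words)
    return high_counts, low_counts

# Hypothetical cleaned questions and their high_value flags
toy_questions = ["ancient empire capital", "capital of france", "ancient roman empire"]
toy_flags = [1, 0, 1]
high_counts, low_counts = count_terms_by_value(toy_questions, toy_flags)
print(high_counts['ancient'], low_counts['ancient'])  # 2 0
print(high_counts['capital'], low_counts['capital'])  # 1 1
```

With the real data, the two arguments would be `jeopardy['clean_question']` and `jeopardy['high_value']`, and the `(high_count, low_count)` pair for any term is then a dictionary lookup.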

#### Applying the chi-squared test

In [26]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared= list()

for i in observed_expected:
    total= sum(i)
    total_prop= total/ jeopardy.shape[0]
    exp_high= total_prop*high_value_count
    exp_low= total_prop*low_value_count
    
    observed= np.array([i[0],i[1]])
    expected= np.array([exp_high, exp_low])
    chi_squared.append(chisquare(observed, expected))
    
chi_squared    

[Power_divergenceResult(statistic=1.3346324449838385, pvalue=0.24798277007881886),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=4.97558423439135, pvalue=0.025707519787911092),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.03188116723440362, pvalue=0.8582887163235293)]
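
The statistic reported by `chisquare` is the sum of (observed - expected)² / expected over the two categories. A minimal sketch of that arithmetic in plain Python follows; the dataset size and the counts are made-up round numbers, not values from this dataset, and `chi_squared_statistic` is a hypothetical helper name.

```python
def chi_squared_statistic(observed, expected):
    # Sum of squared deviations, each scaled by its expected count
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical dataset: 1000 questions, 300 of them high value
total_rows = 1000
high_value_count = 300
low_value_count = 700

observed = (12, 8)  # a term seen in 12 high-value and 8 low-value questions
total_prop = sum(observed) / total_rows
# Expected counts if the term were spread evenly: roughly 6 high, 14 low
expected = (total_prop * high_value_count, total_prop * low_value_count)

stat = chi_squared_statistic(observed, expected)
print(round(stat, 2))  # 8.57
```

The term shows up in high-value questions twice as often as its overall frequency would predict, which is what drives the statistic up.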

#### Chi-squared results

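Only one term (observed 2 times in high-value questions and 0 times in low-value questions) had a p-value below 0.05 (about 0.026); the other nine showed no significant difference in usage between high-value and low-value rows. Even that one result shouldn't be trusted, because the observed frequencies were all lower than 5, which makes the chi-squared test unreliable. It would be better to run this test with only terms that have higher frequencies.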
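
One way to follow up on this, sketched here under assumptions, is to count how many questions each term appears in and keep only the terms above a threshold before testing. The threshold of 5, the helper name `frequent_terms`, and the sample questions are illustrative choices, not something this notebook ran.

```python
from collections import Counter

def frequent_terms(questions, min_count=5, min_length=6):
    counts = Counter()
    for question in questions:
        # Count each sufficiently long word once per question
        counts.update({w for w in question.split() if len(w) >= min_length})
    # Keep terms that appear in at least `min_count` questions
    return sorted(term for term, n in counts.items() if n >= min_count)

# Hypothetical cleaned questions
toy_questions = ["famous author"] * 5 + ["famous painter"] * 2
terms = frequent_terms(toy_questions)
print(terms)  # ['author', 'famous']
```

Sampling `comparison_terms` from a list like this, rather than from all of `terms_used`, would avoid testing terms that occur only once or twice.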