# Jeopardy patterns
The goal of this project is to find patterns in Jeopardy questions and answers that could help you win.

This is a guided project completed for the Dataquest data science program.

The dataset used in this project is a subset (10%) of a Jeopardy dataset found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).



In [1]:
import pandas as pd

jeopardy = pd.read_csv('jeopardy.csv')

In [2]:
jeopardy.head(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
# Remove the leading spaces from the column names
jeopardy.columns = jeopardy.columns.str.lstrip()

In [5]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [6]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
Show Number    19999 non-null int64
Air Date       19999 non-null object
Round          19999 non-null object
Category       19999 non-null object
Value          19999 non-null object
Question       19999 non-null object
Answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


In [7]:
# Normalize question and answer text: lowercase it and strip punctuation
import re

def normalizing_text(string):
    string_lower = string.lower()
    # Raw string avoids invalid-escape warnings; \w already covers digits
    string_cln = re.sub(r"[^\w\s]", "", string_lower)
    return string_cln
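
To make the effect of the normalization concrete, here is a minimal standalone sketch. The helper name `normalize_text` and the sample strings are illustrative assumptions, not part of the notebook; the pattern is the same "keep only word characters and whitespace" idea used above.

```python
import re

def normalize_text(text):
    """Lowercase the text and drop everything except word characters and whitespace."""
    # Hypothetical helper mirroring the notebook's normalization step
    return re.sub(r"[^\w\s]", "", text.lower())

# Punctuation such as '.', ':' and ';' is removed, digits and spaces are kept
print(normalize_text("No. 2: 1912 Olympian; football star"))

# Apostrophes are removed too, so possessives collapse into a single token
print(normalize_text("McDonald's"))
```

Note that `\w` also matches the underscore, so underscores survive this cleaning step.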

In [8]:
jeopardy['clean_question'] = jeopardy['Question'].apply(normalizing_text)

In [9]:
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalizing_text)

In [10]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona


In [11]:
# Normalize dollar values: strip punctuation and convert to int

def normalizing_values(string):
    string_no_punc = re.sub(r"[^\w\s]", "", string)
    try:
        string_cln = int(string_no_punc)
    except ValueError:
        # Non-numeric values are treated as 0
        string_cln = 0
    return string_cln
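
As a quick check of the value cleaning, the sketch below runs the same logic on a few sample strings. The helper name `normalize_value` is hypothetical, and the inputs `"$2,000"` and `"None"` are assumed examples of a value with a thousands separator and a non-numeric value; only `"$200"` appears in the output shown in this notebook.

```python
import re

def normalize_value(raw):
    """Strip punctuation from a dollar string and convert it to int (0 if not numeric)."""
    # Hypothetical helper mirroring the notebook's value-cleaning step
    digits = re.sub(r"[^\w\s]", "", raw)
    try:
        return int(digits)
    except ValueError:
        # Anything that is not a plain number after cleaning falls back to 0
        return 0

for raw in ["$200", "$2,000", "None"]:
    print(raw, "->", normalize_value(raw))
```

One consequence of this design is that a value of 0 can mean either a genuinely missing value or one that could not be parsed, so later analysis cannot tell the two apart.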
        

In [12]:
jeopardy['clean_value'] = jeopardy['Value'].apply(normalizing_values)

In [13]:
jeopardy.head(3)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was u...,copernicus,200
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisle i...,jim thorpe,200
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,200


In [14]:
jeopardy['Air Date'] = pd.to_datetime(jeopardy['Air Date'])

In [15]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
Show Number       19999 non-null int64
Air Date          19999 non-null datetime64[ns]
Round             19999 non-null object
Category          19999 non-null object
Value             19999 non-null object
Question          19999 non-null object
Answer            19999 non-null object
clean_question    19999 non-null object
clean_answer      19999 non-null object
clean_value       19999 non-null int64
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


### Questions to consider
Two questions to consider are:

1. How often is the answer contained in the question? We answer this by measuring what share of the words in each answer also occur in its question.
2. How often are questions repeated? We answer this by checking how often complex words (more than 6 characters) recur across questions.
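
One possible reading of the second question is sketched below on invented data. The function name `overlap_with_earlier`, the `min_length` parameter, and the toy questions are all assumptions made for illustration; the sketch also assumes the questions are processed in the order given and have already been cleaned as above.

```python
def overlap_with_earlier(questions, min_length=7):
    """For each question, return the fraction of its long words seen in earlier questions."""
    seen = set()        # long words encountered in earlier questions
    fractions = []
    for question in questions:
        # "Complex" words: more than 6 characters, i.e. at least min_length
        long_words = {w for w in question.split() if len(w) >= min_length}
        if long_words:
            fractions.append(len(long_words & seen) / len(long_words))
        else:
            fractions.append(0.0)
        seen |= long_words
    return fractions

# Invented, already-normalized toy questions
toy_questions = [
    "this italian astronomer supported copernicus",
    "this olympian played football at carlisle",
    "copernicus proposed this model of the planets",
]
print(overlap_with_earlier(toy_questions))
```

Only the third toy question reuses a long word ("copernicus") from an earlier one, so it is the only entry with a non-zero fraction.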


## How often is the answer used in the question?

In [19]:
def counting_matches(row):
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()
    
    # 'the' is too common to count as a meaningful match
    if 'the' in split_answer:
        split_answer.remove('the')
    if len(split_answer) == 0:
        return 0
    
    # Count the answer words that also occur in the question
    match_count = 0
    for word in split_answer:
        if word in split_question:
            match_count += 1
    
    # Share of answer words found in the question
    return match_count / len(split_answer)
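
To see what the returned ratio means, the following standalone sketch applies the same logic to plain strings instead of DataFrame rows. The function name `match_fraction` and both example pairs are invented for illustration.

```python
def match_fraction(answer, question):
    """Fraction of answer words (ignoring one 'the') that also appear in the question."""
    answer_words = answer.split()
    question_words = question.split()
    # list.remove drops only the first 'the', as in the notebook's function
    if "the" in answer_words:
        answer_words.remove("the")
    if not answer_words:
        return 0
    matches = sum(1 for word in answer_words if word in question_words)
    return matches / len(answer_words)

# 'tower' appears in the question, 'eiffel' does not: 1 of 2 words match
print(match_fraction("the eiffel tower", "this tower in paris opened in 1889"))

# Neither answer word appears in the question: 0 of 2 words match
print(match_fraction("jim thorpe", "this olympian starred in football"))
```

A value of 1 would mean every remaining answer word also occurs in the question; the column mean computed next averages this per-row fraction over all rows.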

In [20]:
jeopardy['answer_in_question'] = jeopardy.apply(counting_matches, axis = 1)

In [22]:
jeopardy['answer_in_question'].mean()

0.05900196524977763

### On average, only about 6% of the words in an answer also appear in its question.
The question therefore rarely gives the answer away, so we will need another strategy to study.