# Winning Jeopardy

## Project Summary

Suppose you want to win the popular TV show "Jeopardy!". We will use a dataset of Jeopardy! questions to look for patterns across questions that could improve our odds.

The dataset can be found [here](https://drive.google.com/file/d/0BwT5wj_P7BKXb2hfM3d2RHU1ckE/view?pli=1&resourcekey=0-1abK4cJq-mqxFoSg86ieIg), provided by the subreddit /r/datasets. It is a collection of questions gathered from the website www.j-archive.com and includes questions from the show up to October of 2021.

## Data Dictionary

* Category - Question category such as "History".
* Value - The USD value assigned to the question, *e.g.* $200. Questions used only for *Final Jeopardy!* and Tiebreaker rounds have a value of **None**.
* Question - The text of the question. This may include hyperlinks or other non-standard text when an image or video is present.
* Answer - The text of the answer.
* Round - One of "Jeopardy!", "Final Jeopardy!", "Double Jeopardy!", or "Tiebreaker".
* Show Number - Contains the show number for that episode.
* Air Date - The show air date in the form of YYYY-MM-DD.




In [1]:
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
import json


# Reviewing the Data

First we will use the json library to load the data into a dataframe. We will then review the data types, check for any obvious missing values, and decide which data needs to be normalized or cleaned before evaluation.




In [2]:
# Load the JSON file
with open("JEOPARDY_QUESTIONS1.json", "r") as f:
    jep_json_data = json.load(f)


jep_df = pd.DataFrame(jep_json_data)

jep_df.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",$200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,$200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,$200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",$200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",$200,John Adams,Jeopardy!,4680


In [4]:
jep_df.columns

Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')

In [7]:
jep_df.dtypes

category       object
air_date       object
question       object
value          object
answer         object
round          object
show_number    object
dtype: object

In [30]:
jep_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   category     216930 non-null  object
 1   air_date     216930 non-null  object
 2   question     216930 non-null  object
 3   value        216930 non-null  int32 
 4   answer       216930 non-null  object
 5   round        216930 non-null  object
 6   show_number  216930 non-null  object
dtypes: int32(1), object(6)
memory usage: 10.8+ MB


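The `dtypes` output shows that every column is read as an object. The only column with missing values is 'value', which is expected given the data dictionary. Note that the `info()` output above comes from a later execution (cell 30), run after the cleaning step below, which is why it already reports 'value' as int32 with no nulls.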
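The missing-value check described above can be made explicit with a per-column null count. The following is a minimal sketch on a small stand-in frame, not the notebook's actual data: the `sample` frame and the "FINAL CATEGORY" label are hypothetical, and the same calls would apply to `jep_df` before any cleaning.

```python
import pandas as pd

# Hypothetical stand-in for a few raw records; the real frame is jep_df.
sample = pd.DataFrame(
    {
        "category": ["HISTORY", "THE COMPANY LINE", "FINAL CATEGORY"],
        "value": ["$200", "$2,000", None],
        "round": ["Jeopardy!", "Double Jeopardy!", "Final Jeopardy!"],
    }
)

# Count missing entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Show which rounds the missing values come from.
missing_by_round = sample.loc[sample["value"].isna(), "round"].value_counts()
print(missing_by_round)
```

Breaking the missing values down by round is a quick way to confirm that they come only from the rounds the data dictionary says have no value.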

Let's start cleaning the data for easier review. The value column contains a dollar sign and a comma as a thousands separator. The data dictionary already states that values are in USD, so the symbols are no longer needed. We can remove them and convert the column to integers for later arithmetic.




In [22]:
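# Strip the dollar sign and thousands separator, e.g. "$2,000" -> "2000"
jep_df["value"] = jep_df["value"].str.replace("[$,]", "", regex=True)
# Fill missing values in the value column only; indexing with the mask alone
# would overwrite every column of those rows with 0
jep_df.loc[jep_df["value"].isna(), "value"] = 0
jep_df["value"] = jep_df["value"].astype("int32")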


Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",200,John Adams,Jeopardy!,4680
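As an alternative illustration of the value cleaning, the sketch below parses a single value string and keeps a missing value as `None` instead of turning it into 0. The function name `parse_value` is hypothetical and is not part of the notebook; it only uses the standard library.

```python
import re
from typing import Optional


def parse_value(raw: Optional[str]) -> Optional[int]:
    """Convert a string such as "$2,000" to the integer 2000; keep None as None."""
    if raw is None:
        return None
    # Remove the dollar sign and the thousands separator.
    digits = re.sub(r"[$,]", "", raw)
    return int(digits)


examples = ["$200", "$2,000", None]
parsed = [parse_value(v) for v in examples]
print(parsed)
```

Keeping missing values distinct from 0 is a design choice: a 0 would count toward any later average of question values, whereas a missing value can be excluded explicitly.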


We can now convert the air_date column to datetime values, which may help reveal any correlation between dates and types of questions. One issue must be addressed first: in the recorded run, 3,634 records have an air_date of 0. Filtering on that value shows zeroes across all features for these records. This pattern is what results when the missing-value fill assigns 0 to entire rows rather than to the value column alone. These records carry no usable information, so they are dropped before the conversion.


In [48]:
# Drop records whose air_date is 0 before converting the dates
jep_df = jep_df[jep_df["air_date"] != 0].copy()

# Dates are stored as YYYY-MM-DD strings
jep_df["air_date"] = pd.to_datetime(jep_df["air_date"], yearfirst=True)

jep_df.head()


Unnamed: 0,category,air_date,question,value,answer,round,show_number
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",200,Copernicus,Jeopardy!,4680
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,200,Jim Thorpe,Jeopardy!,4680
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,200,Arizona,Jeopardy!,4680
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",200,McDonald\'s,Jeopardy!,4680
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",200,John Adams,Jeopardy!,4680
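The date conversion can also be written so that invalid entries are detected during parsing rather than filtered beforehand. This is a minimal sketch on hypothetical stand-in values (`raw_dates` is not taken from the dataset), assuming the valid dates all follow the YYYY-MM-DD form given in the data dictionary.

```python
import pandas as pd

# Hypothetical stand-in values: two valid dates and one placeholder 0.
raw_dates = pd.Series(["2004-12-31", "2010-07-06", 0], dtype="object")

# Cast to str so every entry is parsed the same way, then coerce
# anything that does not match YYYY-MM-DD to NaT.
parsed_dates = pd.to_datetime(
    raw_dates.astype(str), format="%Y-%m-%d", errors="coerce"
)

# Count the entries that failed to parse, then drop them.
n_invalid = int(parsed_dates.isna().sum())
clean_dates = parsed_dates.dropna()
print(n_invalid, len(clean_dates))
```

Counting the `NaT` entries before dropping them gives a quick check on how many records were discarded.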


Next we will focus on normalizing the question and answer data. Removing punctuation and lowercasing the text will let us find matching questions or answers.
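One way this normalization might look is sketched below. The function name `normalize_text` is hypothetical, and the sketch assumes ASCII-only matching is acceptable: accented letters would be removed along with punctuation.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase the text and strip punctuation, collapsing extra whitespace."""
    lowered = text.lower()
    # Keep only ASCII letters, digits, and whitespace (an assumption of this sketch).
    no_punct = re.sub(r"[^a-z0-9\s]", "", lowered)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", no_punct).strip()


a = normalize_text("McDonald's")
b = normalize_text("  MCDONALDS! ")
print(a, b, a == b)
```

After normalization, answers that differ only in case, punctuation, or spacing compare as equal, which is what makes repeated questions and answers easy to find.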

