## Data Science Portfolio - Jeopardy! ##

### Created by: Albert Schultz ###

### Date Created: 06/01/2023 ###

### Version: 1.00 ###

### Executive Summary ###
This notebook extracts the data from the jeopardy.csv file, imports it as a dataframe named **jeopardy**, and presents an exploratory data analysis (EDA) of the game of Jeopardy!.

## Table of Contents ##

1. [Introduction](#1.-Introduction)
2. [Understanding Purpose, Goals, and Vision](#2.-Understanding-Purpose,-Goals,-and-Vision)
3. [Import the Raw CSV File](#3.-Import-the-Raw-CSV-File)
4. [Perform Data Cleaning](#4.-Perform-Data-Cleaning)
5. [Perform Data Exploratory Analysis (EDA)](#5.-Perform-Data-Exploratory-Analysis-(EDA))
6. [Summary](#Summary)

## 1. Introduction ##

This notebook consists of analysis scripts that can be run below to extract information from the raw Jeopardy CSV file. The raw file is a single table whose data may be missing, misplaced, or duplicated. The dataset has 7 columns, including the ID.

**Initialize the Notebook for data access, import library modules, and set the working directory for this project.**

In [79]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/jeopardy-dataset/jeopardy.csv


## 2. Understanding Purpose, Goals, and Vision ##

The vision of this notebook is to guide users through importing the CSV file, then reviewing, sorting, extracting, and exploring the data, and finally presenting findings from the cleaned dataframe for the Jeopardy game.

**Vision:** To provide insights and aspects of the Jeopardy game so others can understand Jeopardy.

**Goals:**
1. Review the raw data in the **jeopardy.csv** file.
2. Import the dataset into the Python IDE environment for staging, review, extraction, data manipulation, and presentation.
3. Create lambda functions and use various Pandas functions to clean, filter, and manipulate the dataframe for easier data exploration.
4. Perform Exploratory Data Analysis to understand aspects of the Jeopardy game.
5. Present the cleaned dataset as a new dataframe that explains the Jeopardy game via data insights.

## 3. Import the Raw CSV File ##

**Introduction:** In this section, I import the library modules that were needed along with the raw jeopardy.csv file for review and quick analysis.

1. Import the required library modules needed for this project.

In [80]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import warnings
from datetime import datetime
warnings.filterwarnings("ignore", category=UserWarning, module='bs4')

2. Import the **jeopardy.csv** file and set the name of the new dataframe as jeopardy.

In [81]:
jeopardy = pd.read_csv('/kaggle/input/jeopardy-dataset/jeopardy.csv')

3. Review the first **5 rows** of the new dataframe **jeopardy**.

In [82]:
print(jeopardy.head(5))

   Show Number    Air Date      Round                         Category  Value  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


We can see that the dataframe has 7 columns and that each row pairs a Jeopardy! question with its answer.
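
Since the raw file may contain missing or duplicated data, a quick check right after the import can be useful. The following is a minimal sketch on a hypothetical three-column miniature of the table (the `sample` dataframe and its values are made up for illustration, not taken from jeopardy.csv); the same two calls could be run on **jeopardy** itself.

```python
import pandas as pd

# Hypothetical miniature of the jeopardy table, with one missing value
# and one exact duplicate row planted on purpose.
sample = pd.DataFrame({
    "Show Number": [4680, 4680, 4680, 5957],
    "Value": ["$200", "$200", None, "$400"],
    "Answer": ["Copernicus", "Copernicus", "Arizona", "England"],
})

# Count the missing entries in each column.
missing_per_column = sample.isna().sum()

# Count the rows that are identical to an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {duplicate_rows}")
```

If duplicates turn up, `drop_duplicates()` is one way to remove them before the cleaning steps.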

## 4. Perform Data Cleaning ##

**Introduction:** In this section, I perform data cleaning of the imported dataset as a dataframe **jeopardy** to ensure that the data are cleaned for analysis.

1. Rename the **Value** column to **Amount** and the **Category** column to **Title Category** by assigning a new list of names to the **df.columns** attribute of the dataframe **jeopardy**.

In [83]:
jeopardy.columns = ['Show Number', 'Air Date', 'Round', 'Title Category', 'Amount', 'Question', 'Answer']

2. Double check the column name change by printing out the first **five rows with the updated table attributes**.

In [84]:
print(jeopardy.head(5))

   Show Number    Air Date      Round                   Title Category Amount  \
0         4680  2004-12-31  Jeopardy!                          HISTORY   $200   
1         4680  2004-12-31  Jeopardy!  ESPN's TOP 10 ALL-TIME ATHLETES   $200   
2         4680  2004-12-31  Jeopardy!      EVERYBODY TALKS ABOUT IT...   $200   
3         4680  2004-12-31  Jeopardy!                 THE COMPANY LINE   $200   
4         4680  2004-12-31  Jeopardy!              EPITAPHS & TRIBUTES   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
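
Assigning a full list to `df.columns` depends on the column order staying the same. As a hedged alternative, the sketch below renames by name with `DataFrame.rename` on a hypothetical one-row `sample` dataframe, so only the listed columns change.

```python
import pandas as pd

# Hypothetical one-row sample using three of the original column names.
sample = pd.DataFrame({
    "Category": ["HISTORY"],
    "Value": ["$200"],
    "Answer": ["Copernicus"],
})

# rename() maps old names to new names and returns a new dataframe;
# columns that are not listed keep their names.
renamed = sample.rename(columns={"Category": "Title Category", "Value": "Amount"})

print(list(renamed.columns))
```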


3. Convert the **Title Category** column to **title** case to keep the dataframe consistent, and convert the **Air Date** column from string to **datetime**.

In [85]:
jeopardy['Title Category'] = jeopardy['Title Category'].apply(lambda x: x.title())
jeopardy['Air Date'] = jeopardy['Air Date'].apply(lambda date: datetime.strptime(date, '%Y-%m-%d'))

4. Print out the first **five** observations to see the changes in the **Title Category** and **Air Date** columns.

In [86]:
print(jeopardy.head(5))

   Show Number   Air Date      Round                   Title Category Amount  \
0         4680 2004-12-31  Jeopardy!                          History   $200   
1         4680 2004-12-31  Jeopardy!  Espn'S Top 10 All-Time Athletes   $200   
2         4680 2004-12-31  Jeopardy!      Everybody Talks About It...   $200   
3         4680 2004-12-31  Jeopardy!                 The Company Line   $200   
4         4680 2004-12-31  Jeopardy!              Epitaphs & Tributes   $200   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  
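
The output shows that `title()` also capitalizes the letter after an apostrophe (for example **Espn'S**). One possible workaround, sketched here on two category strings, is `string.capwords` from the standard library, which only splits on whitespace. Note that it also lowercases the part after a hyphen, so it is a trade-off rather than a drop-in fix.

```python
import string

categories = ["ESPN's TOP 10 ALL-TIME ATHLETES", "LET'S BOUNCE"]

# str.title() starts a new word after every non-letter, including apostrophes.
with_title = [c.title() for c in categories]

# string.capwords() splits on whitespace only, then capitalizes each piece.
with_capwords = [string.capwords(c) for c in categories]

for before, after in zip(with_title, with_capwords):
    print(f"{before!r} -> {after!r}")
```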


5. Create a function called **question_filter** that returns the rows of the **jeopardy** dataframe whose **Question** column contains any of the given words.

In [87]:
def question_filter(words):
    return jeopardy[jeopardy['Question'].astype(str).str.contains('|'.join(words), case=False)]


In [88]:
question_filter(['King'])

Unnamed: 0,Show Number,Air Date,Round,Title Category,Amount,Question,Answer
34,4680,2004-12-31,Double Jeopardy!,"""X""S & ""O""S",$400,Around 100 A.D. Tacitus wrote a book on how th...,oratory
40,4680,2004-12-31,Double Jeopardy!,Dr. Seuss At The Multiplex,$1200,"<a href=""http://www.j-archive.com/media/2004-1...",Yertle
50,4680,2004-12-31,Double Jeopardy!,Dr. Seuss At The Multiplex,$2000,"<a href=""http://www.j-archive.com/media/2004-1...",Bartholomew Cubbins
56,5957,2010-07-06,Jeopardy!,"Geography ""E""",$200,It's the largest kingdom in the United Kingdom,England
72,5957,2010-07-06,Jeopardy!,Let'S Bounce,$600,"In this kid's game, you bounce a small rubber ...",jacks
...,...,...,...,...,...,...,...
5503,2349,1994-11-17,Double Jeopardy!,Ancient History,$1000,This Old Kingdom capital of Egypt was original...,Memphis
5590,3537,2000-01-11,Jeopardy!,Anniversary Gifts,$400,"19th century American ""King of the South"" that...",Cotton
5643,3911,2001-09-10,Jeopardy!,Larry King'S Public Figures,$300,"At the bottom of the hour, bet you won't miss ...",Pete Rose
5647,3911,2001-09-10,Jeopardy!,Exports,$300,This crop is king in Mali; about 1/2 of its ex...,cotton


As you can see, the result shows **185 rows** whose question contains **King**. The match is a case-insensitive substring match, so words such as **kingdom** are also counted.
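
If only the standalone word is wanted, a word-boundary regular expression is one option. This is a minimal sketch, assuming a hypothetical helper named `whole_word_mask` and four made-up questions; it is not the filter used in the rest of this notebook.

```python
import re
import pandas as pd

# Made-up questions for illustration.
questions = pd.Series([
    "It's the largest kingdom in the United Kingdom",
    "This crop is king in Mali",
    "He was talking to the press",
    "The King of Rock and Roll",
])

def whole_word_mask(series, words):
    """Return a boolean mask for rows containing any word as a whole word."""
    # \b marks a word boundary; re.escape protects special characters in the words.
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"
    return series.astype(str).str.contains(pattern, case=False, regex=True)

# Plain substring match: also hits "kingdom" and "talking".
substring_hits = questions.str.contains("King", case=False)

# Whole-word match: only the standalone word.
whole_word_hits = whole_word_mask(questions, ["King"])

print(int(substring_hits.sum()), int(whole_word_hits.sum()))
```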

6. Use the **Beautiful Soup** library to remove the **HTML** tags, including everything inside the angle brackets **<>**, from the **Question** column.
**Note:** If you get a warning about the input, it can be ignored. Run **question_filter(['href'])** afterwards to confirm that the HTML tags have been removed from all of the observations in the **Question** column.

In [89]:
jeopardy['Question'] = jeopardy['Question'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())



7. Confirm that the **href** tags have been removed by the **BeautifulSoup** library module.

In [90]:
question_filter(['href'])

Unnamed: 0,Show Number,Air Date,Round,Title Category,Amount,Question,Answer


The output above shows no rows containing href, which means that the HTML code has been removed from the Question column.
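
To see what tag stripping does to a single question, here is a small standard-library sketch. It uses `html.parser.HTMLParser` as a stand-in for Beautiful Soup, and the `strip_tags` helper, the `_TextExtractor` class, and the example.com link are all hypothetical names made up for this illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; tags and their attributes never reach here.
        self.parts.append(data)

def strip_tags(text):
    """Return the text with all HTML tags removed."""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()  # flush any buffered text
    return "".join(parser.parts)

# Made-up question with a link, similar in shape to the href rows above.
raw = '<a href="http://example.com/clip.jpg" target="_blank">This turtle</a> is king of the pond'
clean = strip_tags(raw)
print(clean)
```

The visible text is kept, while the tag and its `href` attribute are dropped.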

8. Next, convert the **Amount** column of the dataframe **jeopardy** from **string** to **float** for proper calculations.

In [91]:
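# Strip every non-digit character (such as $ and ,) from the Amount column, then convert it to numeric.
# regex=True is passed explicitly because the default differs between pandas versions.
jeopardy['Amount'] = jeopardy['Amount'].str.replace(r'[^0-9]', '', regex=True)
jeopardy['Amount'] = pd.to_numeric(jeopardy['Amount'], errors='coerce')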



9. Review the column **Amount** type.

In [92]:
print(jeopardy['Amount'].dtype)
print(jeopardy.head(5))

float64
   Show Number   Air Date      Round                   Title Category  Amount  \
0         4680 2004-12-31  Jeopardy!                          History   200.0   
1         4680 2004-12-31  Jeopardy!  Espn'S Top 10 All-Time Athletes   200.0   
2         4680 2004-12-31  Jeopardy!      Everybody Talks About It...   200.0   
3         4680 2004-12-31  Jeopardy!                 The Company Line   200.0   
4         4680 2004-12-31  Jeopardy!              Epitaphs & Tributes   200.0   

                                            Question      Answer  
0  For the last 8 years of his life, Galileo was ...  Copernicus  
1  No. 2: 1912 Olympian; football star at Carlisl...  Jim Thorpe  
2  The city of Yuma in this state has a record av...     Arizona  
3  In 1963, live on "The Art Linkletter Show", th...  McDonald's  
4  Signer of the Dec. of Indep., framer of the Co...  John Adams  


The output confirms that the **Amount** column has been converted to the float data type.
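
The conversion can be checked on a few values in isolation. The sketch below uses a made-up `amounts` series, including a thousands separator and two entries without a usable number, to show how non-digits are stripped and how `errors="coerce"` turns unparseable entries into NaN.

```python
import pandas as pd

# Made-up values: a plain amount, one with a comma, the text "None", and a missing entry.
amounts = pd.Series(["$200", "$1,200", "None", None])

# Remove everything that is not a digit; missing entries stay missing.
digits_only = amounts.str.replace(r"[^0-9]", "", regex=True)

# Empty strings and missing entries become NaN instead of raising an error.
numeric = pd.to_numeric(digits_only, errors="coerce")

print(numeric)
```

Because NaN is a floating-point value, a column that contains it ends up as a float type rather than an integer type.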

10. Sum the **Amount** column with the sum function, and format the total with thousands separators and two decimal places to make it legible.

In [93]:
total_amount = '{:,.2f}'.format(jeopardy['Amount'].sum())

In [94]:
print(f"The Total Amount of all the Jeopardy! observations in the Amount Column is ${total_amount}.")

The Total Amount of all the Jeopardy! observations in the Amount Column is $4,274,392.00.


## 5. Perform Data Exploratory Analysis (EDA) ##

**Introduction:** This section performs exploratory data analysis on the jeopardy dataframe to answer questions about the Jeopardy! game over the years.

1. Find and print the average amount of money per question in Jeopardy.

In [95]:
jeopardy_cn_questions = '{:.2f}'.format(jeopardy['Amount'].mean())
print(f"The average amount of money per question for participants is ${jeopardy_cn_questions}.")


The average amount of money per question for participants is $765.75.


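2. Find and print the average length, in words, of the questions that contain the word **King**.

In [96]:
# Pass the word inside a list: '|'.join('King') would join the individual characters into the pattern 'K|i|n|g'.
jeopardy_avg_q_king = jeopardy[jeopardy['Question'].astype(str).str.contains('|'.join(['King']), case=False)]

In [97]:
javg_q_king = '{:.2f}'.format(jeopardy_avg_q_king['Question'].str.split().str.len().mean())

In [98]:
print(f"The average number of words in questions that contain the word King is {javg_q_king}.")

The average number of words in questions that contain the word King is 14.47.

**Note:** The figure above was recorded with the pattern **K|i|n|g**, which matches any question containing one of those letters; rerun the cells to refresh it for the **King**-only filter.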
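
One detail worth checking in this kind of filter is what `'|'.join(...)` receives. Joining a bare string iterates over its characters, while joining a list iterates over its items. The short sketch below, using a made-up question, shows the difference.

```python
import re

# Joining a string splits it into characters.
pattern_from_string = "|".join("King")

# Joining a list keeps each word intact.
pattern_from_list = "|".join(["King"])

# Made-up question that does not contain the word "king".
question = "The city of Yuma in this state"

# The character-level pattern matches because the question contains an "i".
matches_string_pattern = re.search(pattern_from_string, question, re.IGNORECASE) is not None
matches_list_pattern = re.search(pattern_from_list, question, re.IGNORECASE) is not None

print(pattern_from_string, matches_string_pattern)
print(pattern_from_list, matches_list_pattern)
```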


3. Count the questions that contain the word **King** and whose answer contains **Henry VIII**.

In [99]:
# Redefine question_filter so that it first filters on the Question column, then filters those rows on the Answer column.
def question_filter(words, answers):
    jeopardy_questions_filter = jeopardy[jeopardy['Question'].astype(str).str.contains('|'.join(words), case=False)]
    jeopardy_qanswers_filter = jeopardy_questions_filter[jeopardy_questions_filter['Answer'].astype(str).str.contains('|'.join(answers), case=False)]
    return jeopardy_qanswers_filter

# Execute the question_filter function and count the matching rows.
items_q_a = len(question_filter(['King'],['Henry VIII']))


In [100]:
print(f"The number of Jeopardy! questions that mention the word King and have the answer Henry VIII is {items_q_a}.")

The number of Jeopardy! questions that mention the word King and have the answer Henry VIII is 3.
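
The figure above is a row count. If the number of distinct answers is wanted instead, `nunique()` on the **Answer** column is one way to get it. This is a minimal sketch on a made-up `sample` dataframe, not on the real dataset.

```python
import pandas as pd

# Made-up questions and answers for illustration.
sample = pd.DataFrame({
    "Question": [
        "This king had six wives",
        "This Tudor king broke with Rome",
        "The King of Rock and Roll",
        "This state is home to Yuma",
    ],
    "Answer": ["Henry VIII", "Henry VIII", "Elvis Presley", "Arizona"],
})

# Rows whose question mentions "king" (case-insensitive substring match).
king_rows = sample[sample["Question"].str.contains("king", case=False)]

matching_rows = len(king_rows)                       # number of matching questions
distinct_answers = int(king_rows["Answer"].nunique())  # number of different answers
henry_rows = int((king_rows["Answer"] == "Henry VIII").sum())  # rows answered "Henry VIII"

print(matching_rows, distinct_answers, henry_rows)
```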


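4. Examine how the questions change over time by filtering on the air date. The cell below counts the questions aired in the 20th century and in the 21st century; both date ranges cover most of the year 2000, so questions aired that year are counted in both totals. A follow-up question is how many questions from the 1990s use the word **Computer** compared with the 2000s.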

In [101]:
def date_range_questions(date_start_year, date_start_month, date_start_day, date_end_year, date_end_month, date_end_day):
    mask = (jeopardy['Air Date'] > datetime(date_start_year, date_start_month, date_start_day)) & (jeopardy['Air Date'] <= datetime(date_end_year, date_end_month, date_end_day))
    jeopardy_filtered = jeopardy.loc[mask]
    return jeopardy_filtered

# Count the questions aired in the 20th century using the date_range_questions function.
cent_twentieth_q_a_count = len(date_range_questions(1900, 1, 1, 2000, 12, 31))
print(f"The number of questions in the 20th century is {cent_twentieth_q_a_count}.")

# Count the questions aired in the 21st century using the date_range_questions function.
cent_twentyfirst_q_a_count = len(date_range_questions(2000, 1, 1, 2020, 12, 31))
print(f"The number of questions in the 21st century is {cent_twentyfirst_q_a_count}.")

The number of questions in the 20th century is 2083.
The number of questions in the 21st century is 4014.
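
The decade comparison for the word **Computer** could be approached along the following lines. This is a hedged sketch on a made-up `sample` dataframe with a hypothetical helper named `count_word_in_years`; it filters on the calendar year so that the two decades cannot overlap.

```python
import pandas as pd

# Made-up air dates and questions for illustration.
sample = pd.DataFrame({
    "Air Date": pd.to_datetime(
        ["1994-11-17", "1999-12-31", "2000-01-11", "2004-12-31", "2010-07-06"]
    ),
    "Question": [
        "This computer company was founded in a garage",
        "A history clue",
        "A personal computer needs this chip",
        "Computer scientists use this language",
        "A geography clue",
    ],
})

def count_word_in_years(df, word, first_year, last_year):
    """Count questions containing `word` that aired from first_year to last_year inclusive."""
    years = df["Air Date"].dt.year
    in_range = years.between(first_year, last_year)  # inclusive on both ends
    has_word = df["Question"].str.contains(word, case=False, regex=False)
    return int((in_range & has_word).sum())

nineties = count_word_in_years(sample, "computer", 1990, 1999)
two_thousands = count_word_in_years(sample, "computer", 2000, 2009)

print(f"1990s: {nineties}, 2000s: {two_thousands}")
```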


5. Find the difference between the number of questions in the 21st and the 20th centuries.

In [102]:
differences_in_questions_cnt = int(cent_twentyfirst_q_a_count) - int(cent_twentieth_q_a_count)
print(f"The 21st century has {differences_in_questions_cnt} more questions than the 20th century.")

The 21st century has 1931 more questions than the 20th century.


## Summary ##

This portfolio project walked through cleaning raw real-world data and then extracting, manipulating, and exploring the cleaned dataset.