# Jeopardy! Game
The main goal of this project is to practice writing several functions to:
1. investigate a dataset of *Jeopardy! Game* questions and answers
2. filter the dataset for topics that the players are interested in
3. compute the average difficulty of the questions

In [1]:
# Import needed libraries
import pandas as pd
import datetime
import random
from time import sleep

In [2]:
# Loading the data
df = pd.read_csv('jeopardy.csv', parse_dates=[1]) #parse_dates to convert the date column to datetime type
pd.set_option('display.max_colwidth', None) # to display the full contents of the columns

## Data Wrangling
In this section, I will check for the cleanliness of the data, then trim and clean the dataset to make it ready for the analysis.

In [3]:
# Check the first few lines of the dataset
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams


In [4]:
# Check the data types
df.dtypes

Show Number             int64
 Air Date      datetime64[ns]
 Round                 object
 Category              object
 Value                 object
 Question              object
 Answer                object
dtype: object

> The **Value** column has the *object* data type. To do arithmetic on the values in this column, it is necessary to convert them to *float*.

In [5]:
# Check columns 
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

> Most of the column names start with a space --> Rename all the columns to remove the leading spaces and make the names consistent.

In [6]:
# Rename the columns
df.rename(columns={'Show Number':'show_number',
		' Air Date':'air_date',
		' Round':'round',
		' Category': 'category',
		' Value':'value',
		' Question':'question',
		' Answer':'answer'},
		inplace = True)
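
As an alternative to listing every column by hand, the names can be normalized in one pass with the string methods on the column index. This is a minimal sketch on a hypothetical empty frame (`toy`) that only reproduces the header quirk seen above; it is not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame with the same headers as the raw CSV (note the leading spaces).
toy = pd.DataFrame(columns=['Show Number', ' Air Date', ' Round', ' Category',
                            ' Value', ' Question', ' Answer'])

# Strip surrounding whitespace, lowercase, and replace inner spaces with underscores.
toy.columns = (toy.columns
               .str.strip()
               .str.lower()
               .str.replace(' ', '_', regex=False))

print(list(toy.columns))
```

The explicit `rename` mapping used in the notebook is easier to audit; the chained version is shorter when many columns share the same problem.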

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 216930 entries, 0 to 216929
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   show_number  216930 non-null  int64         
 1   air_date     216930 non-null  datetime64[ns]
 2   round        216930 non-null  object        
 3   category     216930 non-null  object        
 4   value        216930 non-null  object        
 5   question     216930 non-null  object        
 6   answer       216928 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(5)
memory usage: 11.6+ MB


> The dataset has **216 930 rows** and **7 columns**.

In [8]:
# Check if there are null values
df.isnull().sum()

show_number    0
air_date       0
round          0
category       0
value          0
question       0
answer         2
dtype: int64

> **answer** column has **2** null values.

In [9]:
# Check the rows that have null values in the answer column
answer_null_data = df[df.isnull().any(axis=1)]
answer_null_data

Unnamed: 0,show_number,air_date,round,category,value,question,answer
94817,4346,2003-06-23,Jeopardy!,"GOING ""N""SANE",$200,"It often precedes ""and void""",
143297,6177,2011-06-21,Double Jeopardy!,NOTHING,$400,"This word for ""nothing"" precedes ""and void"" to mean ""not valid""",


> The answer for both of these questions is **"Null"**, so instead of leaving the answer empty, I will assign the answer **Null** to these empty fields.

In [10]:
# Fill the empty values with the string 'Null'
df.fillna('Null', inplace=True)
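
Calling `fillna` on the whole DataFrame writes `'Null'` into any column that has a missing value. Here only **answer** has nulls, so the result is the same, but a column-scoped fill is a safer habit. The sketch below uses a small hypothetical frame (`toy`) to show the difference.

```python
import pandas as pd

# Hypothetical data: one missing answer and one missing value.
toy = pd.DataFrame({
    'question': ['It often precedes "and void"', 'Some other clue'],
    'answer': [None, 'Copernicus'],
    'value': ['$200', None],
})

# Fill only the 'answer' column; missing entries in other columns are left alone.
toy['answer'] = toy['answer'].fillna('Null')

print(toy)
```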

In [11]:
# Double check if the empty values were filled
df.isnull().sum()

show_number    0
air_date       0
round          0
category       0
value          0
question       0
answer         0
dtype: int64

In [12]:
# Check the number of unique values in each column
df.nunique()

show_number      3640
air_date         3640
round               4
category        27995
value             150
question       216124
answer          88269
dtype: int64

In [13]:
# Check for duplicate rows
df.duplicated().sum()

0

## Cleaning Data
**To be fixed:** convert datatype of column *value* into float and save the float values to a new column named *float_value*.


In [14]:
# Take a look at unique values of column value
df.value.unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', 'None', '$5,000', '$100', '$300',
       '$500', '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263',
       '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500',
       '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250',
    

In [15]:
# Strip the $ sign in front of the value and remove the ',' thousands separator
# Then convert the value to float as long as it's not the string 'None'
# If it's 'None', then replace 'None' with 0
df['float_value'] = df.value.apply(lambda x:
				float(x[1:].replace(',',''))
				if x != 'None'
				else 0)
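
The same conversion can also be written with vectorized string methods and `pd.to_numeric`. This is a sketch on a hypothetical four-row frame, and it makes a different choice from the cell above: with `errors='coerce'`, the string `'None'` becomes `NaN` instead of `0`, so those rows are skipped by `.mean()` rather than pulling the average down. Which behavior is preferable depends on how the missing values should be interpreted.

```python
import pandas as pd

# Hypothetical sample of the raw 'value' strings.
toy = pd.DataFrame({'value': ['$200', '$2,000', 'None', '$1,492']})

# Remove the dollar sign and the thousands separator as literal characters.
cleaned = (toy['value']
           .str.replace('$', '', regex=False)
           .str.replace(',', '', regex=False))

# Anything that is not a number (here: 'None') is turned into NaN.
toy['float_value'] = pd.to_numeric(cleaned, errors='coerce')

print(toy)
print(toy['float_value'].mean())  # NaN rows are ignored by mean()
```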

In [16]:
df.head()

Unnamed: 0,show_number,air_date,round,category,value,question,answer,float_value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus,200.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,200.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona,200.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's,200.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams,200.0


In [17]:
# Rearrange the columns so that value and float_value are next to each other
df = df.reindex(columns= ['show_number', 'air_date', 'round', 'category', 'value', 
		'float_value', 'question', 'answer'])

In [18]:
df.head()

Unnamed: 0,show_number,air_date,round,category,value,float_value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,200.0,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,200.0,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,200.0,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,200.0,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,200.0,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams


# Task 1: find_in_question
Write a function that filters the dataset for questions that contain all of the words in a list of words. Then compute the average value of the questions containing that specific list of words.

For example, when the list *["King", "England"]* was passed to our function, the function returned a DataFrame in which every row had the strings *"King"/"king"* and *"England"/"england"* somewhere in its *"question"*.

In [19]:
# Function to filter a dataset by a list of words
# .lower(): lowercases both the search word and the question, so searching for 'King' also matches 'king'
# all(): returns True if every item in an iterable is true, otherwise it returns False.
# If the iterable is empty, all() also returns True, so an empty word list keeps every row.

def find_in_question(dataframe, words):
	# contains_all_words returns True if every word in 'words' appears in the question.
	# each_question = each question in the column 'question'
	contains_all_words = lambda each_question: all(word.lower() in each_question.lower() for word in words)
	return dataframe.loc[dataframe['question'].apply(contains_all_words)]
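
Note that `in` performs a substring test, so `'king'` is also found inside `'Viking'`. If whole-word matches are wanted, one option is a regular expression with word boundaries. The helper below, `find_whole_words`, is a hypothetical variant for illustration and is not used elsewhere in this notebook.

```python
import re
import pandas as pd

def find_whole_words(dataframe, words):
    # Build one case-insensitive, word-bounded pattern per search word.
    patterns = [re.compile(r'\b' + re.escape(word) + r'\b', flags=re.IGNORECASE)
                for word in words]
    # Keep a row only if every pattern is found in its question.
    mask = dataframe['question'].apply(
        lambda question: all(p.search(question) for p in patterns))
    return dataframe.loc[mask]

# Hypothetical sample questions.
toy = pd.DataFrame({'question': [
    'This king of England beat the odds',
    'In retaliation for Viking raids in England',
    'The first Stuart King ruled England and Scotland',
]})

result = find_whole_words(toy, ['King', 'England'])
print(result)
```

The second row is dropped because `king` only occurs inside `Viking`; the substring version would have kept it.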

In [20]:
# Test the find_in_question function
filtered_question = find_in_question(df, ['King', 'England'])
filtered_question.head()

Unnamed: 0,show_number,air_date,round,category,value,float_value,question,answer
4953,3003,1997-09-24,Double Jeopardy!,"""PH""UN WORDS",$200,200.0,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",Philately (stamp collecting)
6337,3517,1999-12-14,Double Jeopardy!,Y1K,$800,800.0,"In retaliation for Viking raids, this ""Unready"" king of England attacks Norse areas of the Isle of Man",Ethelred
9191,3907,2001-09-04,Double Jeopardy!,WON THE BATTLE,$800,800.0,This king of England beat the odds to trounce the French in the 1415 Battle of Agincourt,Henry V
11710,2903,1997-03-26,Double Jeopardy!,BRITISH MONARCHS,$600,600.0,"This Scotsman, the first Stuart king of England, was called ""The Wisest Fool in Christendom""",James I
13454,4726,2005-03-07,Jeopardy!,A NUMBER FROM 1 TO 10,$1000,1000.0,It's the number that followed the last king of England named William,4


In [21]:
# Check how many questions contain the words "King"/"king" and "England"/"england"
len(filtered_question)

152

### Calculating Average Value for Questions that have specific word(s)
**Method 1:** Use the **find_in_question** function to filter the questions that contain the specific words, then use **.mean()** to compute the average value of the **float_value** column.

**Method 2:** Write a new function named **compute_average_value** that takes the *dataset* and a *list of words* as its parameters. It calls **find_in_question** to filter the questions, then computes the average of the float_value column of the filtered data.

In [22]:
# Method 1: calculate the average value of questions that contain the words "King"/"king" and "England"/"england"
avg_value = filtered_question.float_value.mean()
print(f'The average value of these filtered questions is {avg_value:.2f}$.')

The average value of these filtered questions is 886.84$.


In [23]:
# Method 2:
def compute_average_value(dataset, words):
	filtered_data = find_in_question(dataset, words)
	avg_value = filtered_data.float_value.mean()
	return f'The average value of the questions that have the word(s) {words} is {avg_value:.2f}$.'

In [24]:
# Testing method 2
average_value = compute_average_value(df, ['King','England'])
print(average_value)

The average value of the questions that have the word(s) ['King', 'England'] is 886.84$.


## Task 2:
Write a function that returns the **count of unique answers** to all of the questions in a dataset. For example, after filtering the entire dataset to only questions containing the word *"King"*, we could then find all of the unique answers to those questions. The answer *"Henry VIII"* appeared *55* times and was the most common answer.

In [25]:
# Function to return the count of how many times the answers were used for the questions that have the given keywords
def count_unique_answer(dataset, words):
	filtered_data = find_in_question(dataset, words)
	unique_answers = filtered_data.groupby('answer').float_value.count().sort_values(ascending=False).reset_index()
	unique_answers.rename(columns={'float_value':'count'}, inplace=True)
	return unique_answers
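
The counting step can also be expressed with `value_counts`, which sorts in descending order by default. This is a sketch on a hypothetical list of answers; `rename_axis` and `reset_index(name=...)` are used so that the resulting column names do not depend on the pandas version.

```python
import pandas as pd

# Hypothetical answers standing in for the filtered dataset.
toy = pd.DataFrame({'answer': ['Henry VIII', 'Solomon', 'Henry VIII',
                               'David', 'Henry VIII', 'Solomon']})

# Count each distinct answer and return a two-column DataFrame.
counts = (toy['answer']
          .value_counts()
          .rename_axis('answer')
          .reset_index(name='count'))

print(counts)
```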

In [26]:
# Test count_unique_answer function
unique_answers = count_unique_answer(df, ['King'])
unique_answers.head()

Unnamed: 0,answer,count
0,Henry VIII,55
1,Solomon,35
2,Richard III,33
3,Louis XIV,31
4,David,30


## Task 3:
Investigate the ways in which questions change over time by filtering by date. E.g., how many questions from the **90s** use the word **Computer** compared to questions from the **2000s**?

In [27]:
# Filter questions that contain the word 'Computer'
computer_df = find_in_question(df, ['Computer'])
computer_df.head()

Unnamed: 0,show_number,air_date,round,category,value,float_value,question,answer
309,5690,2009-05-08,Jeopardy!,OLD FOLKS IN THEIR 30s,$600,600.0,Linus Torvalds is the father of this operating system used on cell phones & supercomputers,Linux
342,5690,2009-05-08,Double Jeopardy!,MATHEM-ATTACK!,$1200,1200.0,"(<a href=""http://www.j-archive.com/media/2009-05-08_DJ_28.jpg"" target=""_blank"">Kelly of the Clue Crew shows an array of numbers enclosed in brackets on the monitor.</a>) A set of numbers in rows and columns can be used in many ways--for example, to encrypt a code or create 3-D computer graphics; the set shares this name with a 1999 film",a matrix
1106,4085,2002-05-10,Double Jeopardy!,"""EN"" THE BEGINNING",$800,800.0,"2-word term for the consumer, for whom a computer is ultimately designed",an end user
1430,4960,2006-03-17,Jeopardy!,RECORD LOSSES IN 2005,$200,200.0,"A computer with 98,000 names & SSNs was reported stolen from this oldest campus of the Univ. of Calif.",Berkeley
2410,3214,1998-07-16,Jeopardy!,PRE-COLUMBIAN CULTURES,$500,500.0,Warriors of this Yucatan civilization battle in the computer-enhanced mural seen here:,Mayans


In [28]:
# Filter questions from computer_df that have an air_date in the 90s (1990-1999)
computer_90s = computer_df[(computer_df.air_date >= datetime.datetime(1990, 1, 1)) 
			& (computer_df.air_date <= datetime.datetime(1999, 12, 31))]

In [29]:
# Filter questions from computer_df that have an air_date in the 2000s (2000-2009)
computer_2000s = computer_df[(computer_df.air_date >= datetime.datetime(2000, 1, 1)) 
			& (computer_df.air_date <= datetime.datetime(2009, 12, 31))]

In [30]:
# Count how many questions are in the 90s and the 2000s
rows_computer_90s = computer_90s.shape[0] # Return the number of rows
rows_computer_2000s = computer_2000s.shape[0]
print(f'The number of questions that contain the word "Computer" in the 90s and 2000s \
is {rows_computer_90s} and {rows_computer_2000s} respectively.')

The number of questions that contain the word "Computer" in the 90s and 2000s is 98 and 268 respectively.
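
Instead of one filter per decade, the decade can be derived from the year and used as a grouping key, which gives the count for every decade at once. The sketch below uses a few hypothetical air dates. One caveat: raw counts also depend on how many episodes from each decade are in the dataset, so a higher count does not by itself mean the word became more common per episode.

```python
import pandas as pd

# Hypothetical air dates standing in for computer_df.
toy = pd.DataFrame({
    'air_date': pd.to_datetime(['1997-09-24', '1999-12-14', '2001-09-04',
                                '2005-03-07', '2009-05-08', '2011-06-21']),
})

# Integer-divide the year by 10 and multiply back: 1997 -> 1990, 2005 -> 2000.
decade = toy['air_date'].dt.year // 10 * 10

# Number of rows per decade.
per_decade = toy.groupby(decade).size()

print(per_decade)
```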


## Task 4:
Investigate if there is a connection between the round and the category. Are we more likely to find certain categories, like "Literature" in Single Jeopardy or Double Jeopardy?

In [31]:
# Function to count the number of questions that belong to a given category per round
def category_by_round(dataset, category_name):
	category_df = dataset[dataset.category == category_name.upper()]
	questions_per_round = category_df.groupby('round').category.count().sort_values(ascending=False).reset_index()
	questions_per_round.rename(columns={'category':'count'}, inplace=True)
	return questions_per_round

In [32]:
# Test the category_by_round function
# Find out how many questions belong to Literature category per round
literature = category_by_round(df, 'literature')
literature.head()

Unnamed: 0,round,count
0,Double Jeopardy!,381
1,Jeopardy!,105
2,Final Jeopardy!,10


> We are more likely to find *Literature* category questions in the *Double Jeopardy!* round than in the *Jeopardy!* round. *Literature* category questions are least likely to appear in the *Final Jeopardy!* round.
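
To compare rounds as shares rather than raw counts, `pd.crosstab` with `normalize='index'` gives, for each category, the fraction of its questions that fall in each round. This is a sketch on a hypothetical six-row frame, not the real dataset.

```python
import pandas as pd

# Hypothetical rows with the same 'round' and 'category' columns as the dataset.
toy = pd.DataFrame({
    'round': ['Jeopardy!', 'Double Jeopardy!', 'Double Jeopardy!',
              'Jeopardy!', 'Double Jeopardy!', 'Final Jeopardy!'],
    'category': ['LITERATURE', 'LITERATURE', 'LITERATURE',
                 'HISTORY', 'HISTORY', 'LITERATURE'],
})

# Each row of the result sums to 1: the distribution of a category over rounds.
shares = pd.crosstab(toy['category'], toy['round'], normalize='index')

print(shares)
```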

## Task 5: This Is Jeopardy! Quiz Game
In this task, I will write a function to build a quiz system. The quiz system will ask the player a random question from its question bank, then use the input function to get the answer from the player. Next, it will check whether the player's answer is correct and display the number of correct and incorrect answers that the player has given so far.

In [33]:
def quiz_system(df):
	# Adjacent string literals are joined, so the indentation is not printed in the prompt
	player_choice = input('Do you wanna play the Jeopardy! game? '
		'(Answer YES to continue, NO to quit) ').upper()
	correct = 0 # to keep track of how many correct and incorrect answers
	incorrect = 0
	
	while (player_choice == 'YES'):
		random_index = random.randrange(len(df)) # Return a number between 0 and len(df) - 1, so it is always a valid position
		question = df.question.iloc[random_index] # Select the question at location of random_index
		answer = df.answer.iloc[random_index] # Store the corresponding answer
		
		print('\n')
		print(question)
		player_answer = input('Your answer is: \n') # Store the player's answer
		print('\n')
		sleep(0.5)

		if player_answer.lower() == answer.lower(): # If the player's answer is correct
			correct += 1
			print('Yay! It\'s correct. Very great job!!!\n')
			print(f'You have {correct} correct answer(s) and {incorrect} incorrect answers.')
		else: # If the player's answer is incorrect
			incorrect += 1
			print('Ohh noo! It\'s incorrect. But don\'t give up!!!\n')
			print(f'The correct answer is {answer}.\n')
			print(f'You have {correct} correct answer(s) and {incorrect} incorrect answers.')
		# Ask if the player wants to try another question
		another_question = input('Do you wanna try another question? '
						'(Answer YES to continue, NO to quit)\n').upper()
		sleep(0.5)
		if another_question == 'YES':
			player_choice = 'YES' # the while loop (player_choice == 'YES') runs again
		else:
			player_choice = 'NO' # leave this loop and continue to the (player_choice == 'NO') loop
		
	while (player_choice == 'NO'):
		print('Bye bye! Hope to see you again soon!')
		break	
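
The question picking and the answer check can be pulled out into small functions so they can be tested without typing into `input`. This is a sketch under assumptions: `pick_question` and `is_correct` are hypothetical helper names, and stripping surrounding whitespace before comparing is an added choice, not something the function above does. `randrange(n)` never returns `n`, so the position is always valid for `iloc`.

```python
import random
import pandas as pd

def pick_question(df, rng=random):
    # randrange(len(df)) yields 0 .. len(df) - 1, a valid row position.
    row = df.iloc[rng.randrange(len(df))]
    return row['question'], row['answer']

def is_correct(player_answer, answer):
    # Compare case-insensitively and ignore surrounding whitespace.
    return player_answer.strip().lower() == str(answer).strip().lower()

# Hypothetical two-question bank.
toy = pd.DataFrame({
    'question': ["In 1903 this Boston team won baseball's first World Series",
                 'The city of Yuma is in this state'],
    'answer': ['Boston Red Sox', 'Arizona'],
})

# A seeded random.Random instance makes the pick repeatable.
question, answer = pick_question(toy, random.Random(0))
print(question)
print(is_correct('  arizona ', 'Arizona'))
```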

In [34]:
quiz_system(df)



In 1903 this Boston team won baseball's first World Series


Ohh noo! It's incorrect. But don't give up!!!

The correct answer is Boston Red Sox.

You have 0 correct answer(s) and 1 incorrect answers.
Bye bye! Hope to see you again soon!
