<a href="https://colab.research.google.com/github/gapself/machine-learning-projects/blob/main/jeopardy_game_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 200,000+ Jeopardy! Questions. Data about the game show.

# Kaggle API installation:

In [2]:
!pip install kaggle



In [4]:
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/

In [5]:
!chmod 600 ~/.kaggle/kaggle.json

In [6]:
!kaggle datasets download -d tunguz/200000-jeopardy-questions

Downloading 200000-jeopardy-questions.zip to /content
  0% 0.00/11.5M [00:00<?, ?B/s] 43% 5.00M/11.5M [00:00<00:00, 30.1MB/s]
100% 11.5M/11.5M [00:00<00:00, 60.5MB/s]


In [7]:
!unzip 200000-jeopardy-questions.zip

Archive:  200000-jeopardy-questions.zip
  inflating: JEOPARDY_CSV.csv        


In [10]:
import pandas as pd

df = pd.read_csv('JEOPARDY_CSV.csv')
df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [11]:
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

# Remove leading spaces from column names

## I. First way

In [12]:
df = df.rename(columns = {
    ' Air Date':'Air Date'
})

In [13]:
df['Air Date']

0         2004-12-31
1         2004-12-31
2         2004-12-31
3         2004-12-31
4         2004-12-31
             ...    
216925    2006-05-11
216926    2006-05-11
216927    2006-05-11
216928    2006-05-11
216929    2006-05-11
Name: Air Date, Length: 216930, dtype: object

## II. Faster way
The **.strip()** method removes leading and trailing whitespace from a string, so applying it to every column name fixes all of the columns at once.
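
As a small illustration of what `strip()` does and does not remove, here is a sketch on a plain Python list. The list `raw_columns` is a hypothetical sample that mirrors some of the header names printed by `df.columns`; it is not read from the CSV.

```python
# Hypothetical sample mirroring the header names shown by df.columns
raw_columns = ['Show Number', ' Air Date', ' Round', ' Category']

# strip() trims whitespace at both ends only; the space inside 'Air Date' stays
clean_columns = [name.strip() for name in raw_columns]
print(clean_columns)  # ['Show Number', 'Air Date', 'Round', 'Category']

# Tabs and newlines count as whitespace too
print('\t Value \n'.strip())  # Value
```

Names that have no surrounding whitespace, such as 'Show Number', pass through unchanged, so the comprehension is safe to run on every column.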

In [14]:
df.columns = [col.strip() for col in df.columns]

In [15]:
df.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

# Notes:
Two ways to delete every space from a string, including the spaces between words:



In [31]:
word_with_spaces = 'Air Date'
print(word_with_spaces.replace(' ',''))
print(''.join(word_with_spaces.split()))

AirDate
AirDate


In [78]:
test = "i Wanna uSe split. i love england's landscape"
print(test.lower().split())

['i', 'wanna', 'use', 'split.', 'i', 'love', "england's", 'landscape']


# 1. Write a filtering function (by list of words):
 Write a function that filters the dataset for questions that contain all of the words in a list of words. For example, when the list ["King", "England"] was passed to our function, the function returned a DataFrame of 152 rows. Every row had the strings "King" and "England" somewhere in its "Question" column.

 - The all() function takes an iterable and returns True only if every element is truthy. Here that means every word in the list appears in the question.
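
To make the behavior of `all()` concrete, here is a minimal sketch on two made-up sentences (they are illustrative strings, not rows from the dataset). It uses the same generator-expression pattern as the filter.

```python
words = ['King', 'England']

# Every word is present, so all() returns True
question_a = "The King of England signed the charter"
match_a = all(word in question_a for word in words)
print(match_a)  # True

# 'King' with a capital K is missing here, so all() returns False
question_b = "This king of England beat the odds"
match_b = all(word in question_b for word in words)
print(match_b)  # False

# With no words to check, all() is vacuously True
match_empty = all(word in question_b for word in [])
print(match_empty)  # True
```

The last case is worth knowing: an empty word list matches every question.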

In [37]:
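def filter_function(data, words):
  # True only when every word occurs in the question (case-sensitive substring match)
  contains_all = lambda text: all(word in text for word in words)
  # Boolean mask with one entry per row; avoids shadowing the built-in filter()
  mask = data['Question'].apply(contains_all)
  return data.loc[mask]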

In [42]:
filtered = filter_function(df, ['King','England'])
print(filtered['Question'])

4953      Both England's King George V & FDR put their s...
14912     This country's King Louis IV was nicknamed "Lo...
21511     this man and his son ruled England following t...
23810     William the Conqueror was crowned King of Engl...
27555     This member of the Medici family was the mothe...
33294     (Sarah of the Clue Crew delivers the clue from...
41148     This French king recognized William of Orange ...
41357     England's King Henry VIII had 3 wives named Ca...
43122                The father of England's King Edward VI
47814     This steak sauce was created for King George I...
49994     Elizabeth I's half-brother, he reigned before ...
51115     (<a href="http://www.j-archive.com/media/2000-...
51565     He wrote several anthems, including "The King ...
56600     This city known for its 24-hour auto race was ...
57516     Famous (& rather insulting) adjective for Engl...
58949     He became King of England in 1399 after forcin...
71808     Number of the William who was 

# Improve function I

We want to find questions that contain either 'king' or 'King'.

Solution: lowercase every word in the list of words as well as each question before comparing. The function returns True if all of the words in the list appear in the question.


In [43]:
def filter_function(data, words):
  # Lowercase both sides so the substring match ignores case
  contains_all = lambda text: all(word.lower() in text.lower() for word in words)
  mask = data['Question'].apply(contains_all)
  return data.loc[mask]

In [45]:
filtered = filter_function(df, ['King','England'])
print(filtered['Question'])

4953      Both England's King George V & FDR put their s...
6337      In retaliation for Viking raids, this "Unready...
9191      This king of England beat the odds to trounce ...
11710     This Scotsman, the first Stuart king of Englan...
13454     It's the number that followed the last king of...
                                ...                        
208295    In 1066 this great-great grandson of Rollo mad...
208742    Dutch-born king who ruled England jointly with...
213870    In 1781 William Herschel discovered Uranus & i...
216021    His nickname was "Bertie", but he used this na...
216789    This kingdom of England grew from 2 settlement...
Name: Question, Length: 152, dtype: object


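# Improve function II
But we don't want matches inside longer words, such as 'king' in 'kingdom', or possessive forms such as "england's". We split each question on whitespace and test membership in the resulting list, so only whole tokens count as matches. Note that punctuation stays attached to the tokens (as with 'split.' in the notes above), so a word followed directly by a period or comma is not matched either.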

In [75]:
def filter_function(data, words_to_find):
  # Split on whitespace so each word must equal a whole token, not a substring
  contains_all = lambda sentence: all(word.lower() in sentence.lower().split() for word in words_to_find)
  mask = data['Question'].apply(contains_all)
  return data.loc[mask]

In [76]:
filtered = filter_function(df, ['King','England'])
print(filtered['Question'])

6337      In retaliation for Viking raids, this "Unready...
9191      This king of England beat the odds to trounce ...
13454     It's the number that followed the last king of...
14912     This country's King Louis IV was nicknamed "Lo...
18076     In 1199 this crusader king of England was mort...
                                ...                        
200369    8th C. King Offa built a 170-mile north-south ...
201168    Popular Saint-Exupery character waiting around...
208742    Dutch-born king who ruled England jointly with...
213870    In 1781 William Herschel discovered Uranus & i...
216021    His nickname was "Bertie", but he used this na...
Name: Question, Length: 74, dtype: object
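
Because whitespace splitting leaves punctuation attached, a question ending in "England." is not counted. One possible refinement, sketched here under the assumption that trailing punctuation should be ignored while possessives such as "England's" should still be excluded, is to trim punctuation from the edges of each token. The helper name `contains_all_words` is hypothetical and the sentences are made up for illustration.

```python
import string

def contains_all_words(sentence, words_to_find):
    """Return True if every word appears as a whole token in the sentence."""
    # Lowercase, split on whitespace, then trim punctuation from token edges only
    tokens = {token.strip(string.punctuation) for token in sentence.lower().split()}
    return all(word.lower() in tokens for word in words_to_find)

print(contains_all_words("He became King of England.", ["King", "England"]))    # True
print(contains_all_words("This kingdom of England grew", ["King", "England"]))  # False
print(contains_all_words("England's King Henry VIII", ["King", "England"]))     # False
```

It could be applied with `data['Question'].apply(lambda q: contains_all_words(q, words))` in place of the lambda above. The row count would likely differ from 74, since more questions would qualify.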


# Calculate statistics like mean() on the Value column

First, check the type of the values in the Value column.

 * Type 'object' - typically represents strings.

The strings must be converted to floats before statistics can be computed.

In [87]:
# Type of the first entry in the Value column
first_value_type = type(df.at[0, 'Value'])
print(first_value_type)

<class 'str'>


In [90]:
column_data_types = df.dtypes
print(column_data_types)

Show Number     int64
Air Date       object
Round          object
Category       object
Value          object
Question       object
Answer         object
dtype: object


In [91]:
print(df['Value'][0])

$200


In [97]:
# Drop the leading '$' and any thousands separators; entries stored as 'None' become 0
df['Float Value'] = df['Value'].apply(lambda x: float(x[1:].replace(',','')) if x != 'None' else 0)

In [98]:
print(df['Float Value'])

0          200.0
1          200.0
2          200.0
3          200.0
4          200.0
           ...  
216925    2000.0
216926    2000.0
216927    2000.0
216928    2000.0
216929       0.0
Name: Float Value, Length: 216930, dtype: float64
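
The conversion can also be written as a named helper, which makes the edge cases easier to test. The sketch below is an alternative, not the notebook's method: `parse_value` is a hypothetical name, and it maps missing entries to NaN rather than 0. It also tolerates missing values arriving as NaN instead of the string 'None', which may happen depending on how `read_csv` is configured.

```python
import math

def parse_value(raw):
    """Convert a string such as '$2,000' to 2000.0; return NaN when missing."""
    # Missing entries may arrive as the string 'None', as None, or as a float NaN
    if raw is None or raw == 'None':
        return math.nan
    if isinstance(raw, float) and math.isnan(raw):
        return math.nan
    # Remove the dollar sign and thousands separators before converting
    return float(str(raw).lstrip('$').replace(',', ''))

print(parse_value('$200'))    # 200.0
print(parse_value('$2,000'))  # 2000.0
print(parse_value('None'))    # nan
```

With `df['Value'].apply(parse_value)`, the pandas `mean()` skips NaN by default, so questions without a dollar value would no longer pull the average down the way zeros do. The resulting mean would therefore differ from the figure printed in this notebook.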


In [99]:
print(df['Float Value'].mean())

739.9884755451067


In [100]:
filtered = filter_function(df, ['King'])
print(filtered['Float Value'].mean())

805.4698795180723


# A function to find the unique answers of a set of data
For example, the unique answers in the data filtered by the word 'King'.


 * The value_counts() method in pandas is used to count the number of occurrences of unique values in a Series (a column in a DataFrame). It returns a Series containing the unique values in the original Series and their respective counts in descending order.
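
The counting idea can be illustrated without pandas using `collections.Counter` from the standard library. This is an analogue for illustration only, not what `value_counts()` uses internally, and the `answers` list is a made-up sample.

```python
from collections import Counter

# Made-up sample of answers; the real column has over 200,000 entries
answers = ['Henry VIII', 'Sweden', 'Henry VIII', 'Solomon', 'Sweden', 'Henry VIII']

# most_common() returns (value, count) pairs sorted by count, highest first
counts = Counter(answers).most_common()
print(counts)  # [('Henry VIII', 3), ('Sweden', 2), ('Solomon', 1)]
```

As with `value_counts()`, each distinct value appears once in the result, paired with how often it occurred.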



In [104]:
def get_answer_counts(data):
  return data['Answer'].value_counts()

In [105]:
print(get_answer_counts(filtered))

Henry VIII                           41
Sweden                               24
Solomon                              23
Norway                               22
Richard III                          21
                                     ..
Tory                                  1
Naomi Watts Riots                     1
Bad, Bad Leroy Brown                  1
Elephants                             1
a pyramid (the pyramids accepted)     1
Name: Answer, Length: 1165, dtype: int64
