## Winning Jeopardy!

In this project, we write several functions to investigate a dataset of Jeopardy! questions and answers. The dataset can be found [here](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).

We will look for ways in which we might gain an advantage in order to win at Jeopardy. We will filter the dataset for topics that we’re interested in, compute the average difficulty of those questions, and train to become the next Jeopardy champion!

In [40]:
# Let's import the packages
import pandas as pd
import re
import warnings
warnings.filterwarnings("ignore")

We are provided with a CSV file containing data about the game show Jeopardy!
Let's load the data into a DataFrame and investigate its contents.



In [41]:
pd.set_option('display.max_colwidth', None) # Display the full contents of each column (None replaces the deprecated -1).

df = pd.read_csv(r"C:\Users\amanp\OneDrive\Desktop\jeopardy.csv")
df.head(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,$200,"In the title of an Aesop fable, this insect shared billing with a grasshopper",the ant
6,4680,2004-12-31,Jeopardy!,HISTORY,$400,"Built in 312 B.C. to link Rome & the South of Italy, it's still in use today",the Appian Way
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$400,"No. 8: 30 steals for the Birmingham Barons; 2,306 steals for the Bulls",Michael Jordan
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$400,"In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state",Washington
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$400,This housewares store was named for the packaging its merchandise came in & was first displayed on,Crate & Barrel


### Cleaning the dataset

The dataset requires a lot of cleaning. Let's do that first.


#### 1. Renaming the columns

In [42]:
# Let's examine the columns
df.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

We see that every column name except 'Show Number' has a leading space. Let's rename them to make life easier for the rest of the project.

In [43]:
df = df.rename(columns = {" Air Date": "Air Date", " Round" : "Round", " Category": "Category", " Value": "Value", " Question":"Question", " Answer": "Answer"})

In [44]:
# Let's check the column names again.
df.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
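
As an alternative to listing every column by hand, the whitespace can be stripped from all column names at once. The following is a minimal sketch on a toy DataFrame (the name `toy` and its contents are made up for illustration), not the approach used in this notebook:

```python
import pandas as pd

# Toy frame whose headers mimic the leading-space problem seen above.
toy = pd.DataFrame(
    [[4680, "2004-12-31", "HISTORY"]],
    columns=["Show Number", " Air Date", " Category"],
)

# Strip surrounding whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()
stripped_names = list(toy.columns)
print(stripped_names)
```

This keeps working if the file gains or loses columns, whereas an explicit `rename` mapping silently ignores names it does not list.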

#### 2. Normalizing the Text Columns
Before we start our analysis, we need to normalize the Question and Answer columns by removing punctuation and making sure all words are lowercase so that we will be able to compare them.

In [45]:
def normalize_text(text):
    text = text.lower()
    # Raw strings keep Python from treating \s as an invalid string escape.
    text = re.sub(r'[^A-Za-z0-9\s]', '', text) # Removes all punctuation
    text = re.sub(r'\s+', ' ', text) # Replaces any run of whitespace with a single space
    return text

In [46]:
df['clean_question'] = df['Question'].apply(normalize_text)
df['clean_answer'] = df['Answer'].apply(normalize_text)
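
Some questions in this dataset wrap their text in HTML anchor tags, and removing punctuation alone leaves the tag contents behind (for example `a hrefhttpwwwjarchivecom...`). One possible refinement, sketched here under the assumption that a simple `<...>` pattern is enough for these tags, is to strip the tags before normalizing. The helper names `strip_html_tags` and `normalize_text_without_tags` are hypothetical and not part of the notebook:

```python
import re

def strip_html_tags(text):
    """Remove anything that looks like an HTML tag, e.g. <a href="..."> or </a>."""
    return re.sub(r"<[^>]+>", "", text)

def normalize_text_without_tags(text):
    # Same steps as normalize_text, but with tags removed first
    # and leading/trailing whitespace trimmed at the end.
    text = strip_html_tags(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = (
    '<a href="http://www.j-archive.com/media/2004-12-31_DJ_26.mp3">'
    "Ripped from today's headlines, he was a turtle king gone mad</a>"
)
cleaned_sample = normalize_text_without_tags(sample)
print(cleaned_sample)
```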

#### 3. Normalizing the Numeric Column

The values in the 'Value' column are strings. Let's create a new column with the float values.

In [47]:
# Add a new column. If the value is not "None", drop the first character (the dollar sign),
# remove any commas, and cast the result to a float. If the value is "None", use 0.
df["Float Value"] = df["Value"].apply(lambda x: float(x[1:].replace(',','')) if x != "None" else 0)
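
The lambda above assumes every entry is a string. Depending on the pandas version and how the file is read, missing values may arrive as `NaN` rather than the string "None", in which case slicing would fail. A more defensive variant could look like the sketch below; `parse_dollar_value` is a hypothetical helper name, not something defined elsewhere in this notebook:

```python
def parse_dollar_value(value):
    """Convert strings such as "$1,000" to 1000.0; treat "None" or non-strings as 0.0."""
    if not isinstance(value, str) or value.strip() == "None":
        return 0.0
    # Drop the dollar sign and thousands separators before casting.
    return float(value.strip().lstrip("$").replace(",", ""))

examples = ["$200", "$1,000", "None", float("nan")]
parsed_examples = [parse_dollar_value(v) for v in examples]
print(parsed_examples)
```

It could be applied in the same way as the lambda, for example `df["Value"].apply(parse_dollar_value)`.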

#### 4. Normalizing the Air Date column

We convert the values in the Air Date column to datetime objects so that they are no longer strings.

In [48]:
df['Air Date'] = pd.to_datetime(df['Air Date'])
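
If the date strings might contain malformed entries, the conversion can be made explicit about the expected format and tolerant of bad values. This is a small sketch on made-up data (`raw_dates` is an illustrative Series, and the `%Y-%m-%d` format is assumed from the dates shown in the table above):

```python
import pandas as pd

raw_dates = pd.Series(["2004-12-31", "1997-09-24", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Once parsed, date parts are available through the .dt accessor.
years = parsed.dt.year
print(parsed)
```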

In [49]:
# Let's see our cleaned dataframe now.

df.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,Float Value
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory",Copernicus,for the last 8 years of his life galileo was under house arrest for espousing this mans theory,copernicus,200.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves,jim thorpe,200.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,"The city of Yuma in this state has a record average of 4,055 hours of sunshine each year",Arizona,the city of yuma in this state has a record average of 4055 hours of sunshine each year,arizona,200.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", this company served its billionth burger",McDonald's,in 1963 live on the art linkletter show this company served its billionth burger,mcdonalds,200.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States",John Adams,signer of the dec of indep framer of the constitution of mass second president of the united states,john adams,200.0


## Now our data is clean and ready for analysis

### Let's write a function that filters the dataset for questions that contain all of the words in a list of words.

For example, when the list ["King", "England"] is passed to our function, it returns a DataFrame in which every row has the strings "King" and "England" somewhere in its "Question". On the subset of the data loaded here, that is a single row.

#### 1. Defining the function:

In [50]:
# Filtering a dataset by a list of words
def filter_data(data, words):
    # Lowercases the words and the question, then returns True if every word appears in the question.
    # Note that this is a substring match, so "king" also matches "kingdom" or "speaking".
    contains_all_words = lambda question: all(word.lower() in question.lower() for word in words)
    # Applies the lambda function to the Question column and returns the rows where it returned True
    return data.loc[data["Question"].apply(contains_all_words)]

#### 2. Testing the function:

In [51]:
# Testing the filter function

filtered_1 = filter_data(df, ["King", "England"])

filtered_1

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,Float Value
4953,3003,1997-09-24,Double Jeopardy!,"""PH""UN WORDS",$200,"Both England's King George V & FDR put their stamp of approval on this ""King of Hobbies""",Philately (stamp collecting),both englands king george v fdr put their stamp of approval on this king of hobbies,philately stamp collecting,200.0


In [52]:
# Let's print just the question in filtered
filtered_1["Question"]

4953    Both England's King George V & FDR put their stamp of approval on this "King of Hobbies"
Name: Question, dtype: object

Great job! Our function is working properly.

Let's try it with some other lists of words.

In [53]:
filtered_2 = filter_data(df, ["Indian", "School"])
filtered_2

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,Float Value
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,"No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves",Jim Thorpe,no 2 1912 olympian football star at carlisle indian school 6 mlb seasons with the reds giants braves,jim thorpe,200.0
944,3834,2001-04-12,Double Jeopardy!,"THE ""BUTLER"" DID IT","$1,000","The mission statement of this school says it's located ""In...Indianapolis, one of America's most livable cities""",Butler University,the mission statement of this school says its located inindianapolis one of americas most livable cities,butler university,1000.0
1451,4960,2006-03-17,Jeopardy!,COLLEGES & UNIVERSITIES,"$1,000","This West Lafayette, Indiana school's Hall of Music has seating for more than 6,000",Purdue,this west lafayette indiana schools hall of music has seating for more than 6000,purdue,1000.0


In [54]:
filtered_3 = filter_data(df, ["King"])
filtered_3

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,Float Value
34,4680,2004-12-31,Double Jeopardy!,"""X""s & ""O""s",$400,Around 100 A.D. Tacitus wrote a book on how this art of persuasive speaking had declined since Cicero,oratory,around 100 ad tacitus wrote a book on how this art of persuasive speaking had declined since cicero,oratory,400.0
40,4680,2004-12-31,Double Jeopardy!,DR. SEUSS AT THE MULTIPLEX,"$1,200","<a href=""http://www.j-archive.com/media/2004-12-31_DJ_26.mp3"">Ripped from today's headlines, he was a turtle king gone mad; Mack was the one good turtle who'd bring him down</a>",Yertle,a hrefhttpwwwjarchivecommedia20041231dj26mp3ripped from todays headlines he was a turtle king gone mad mack was the one good turtle whod bring him downa,yertle,1200.0
50,4680,2004-12-31,Double Jeopardy!,DR. SEUSS AT THE MULTIPLEX,"$2,000","<a href=""http://www.j-archive.com/media/2004-12-31_DJ_24.mp3"">""500 Hats""... 500 ways to die. On July 4th, this young boy will defy a king... & become a legend</a>",Bartholomew Cubbins,a hrefhttpwwwjarchivecommedia20041231dj24mp3500 hats 500 ways to die on july 4th this young boy will defy a king become a legenda,bartholomew cubbins,2000.0
56,5957,2010-07-06,Jeopardy!,"GEOGRAPHY ""E""",$200,It's the largest kingdom in the United Kingdom,England,its the largest kingdom in the united kingdom,england,200.0
72,5957,2010-07-06,Jeopardy!,LET'S BOUNCE,$600,"In this kid's game, you bounce a small rubber ball while picking up 6-pronged metal objects",jacks,in this kids game you bounce a small rubber ball while picking up 6pronged metal objects,jacks,600.0
...,...,...,...,...,...,...,...,...,...,...
5503,2349,1994-11-17,Double Jeopardy!,ANCIENT HISTORY,"$1,000",This Old Kingdom capital of Egypt was originally named Hikouptah,Memphis,this old kingdom capital of egypt was originally named hikouptah,memphis,1000.0
5590,3537,2000-01-11,Jeopardy!,ANNIVERSARY GIFTS,$400,"19th century American ""King of the South"" that's a 2nd anniversary gift",Cotton,19th century american king of the south thats a 2nd anniversary gift,cotton,400.0
5643,3911,2001-09-10,Jeopardy!,LARRY KING'S PUBLIC FIGURES,$300,"At the bottom of the hour, bet you won't miss my chat with this all time ""hit king"" of baseball...Cincinnati, hello?",Pete Rose,at the bottom of the hour bet you wont miss my chat with this all time hit king of baseballcincinnati hello,pete rose,300.0
5647,3911,2001-09-10,Jeopardy!,EXPORTS,$300,This crop is king in Mali; about 1/2 of its export income comes from it,cotton,this crop is king in mali about 12 of its export income comes from it,cotton,300.0
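
The output above includes questions where "king" only appears inside a longer word, such as "speaking", "picking" or "kingdom", because `filter_data` checks for substrings. If whole-word matches are wanted instead, one option is a regular expression with word boundaries. The sketch below uses a hypothetical function name, `filter_whole_words`, and a toy DataFrame built from three questions shown in the output:

```python
import re
import pandas as pd

def filter_whole_words(data, words, column="Question"):
    # \b anchors the match at word boundaries; re.escape guards against
    # regex metacharacters in the search words.
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in words
    ]
    mask = data[column].apply(
        lambda text: all(pattern.search(text) for pattern in patterns)
    )
    return data.loc[mask]

toy = pd.DataFrame({"Question": [
    "This crop is king in Mali",
    "It's the largest kingdom in the United Kingdom",
    "This art of persuasive speaking had declined since Cicero",
]})
whole_word_hits = filter_whole_words(toy, ["King"])
print(whole_word_hits)
```

Only the first toy question survives, since the other two contain "king" only as part of another word.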


#### 3. Computing aggregate statistics

Now we want to compute aggregate statistics, like .mean(), on the "Float Value" column.

In [55]:
# Filtering the dataset and finding the average value of those questions
filtered = filter_data(df, ["King"])
print(filtered["Float Value"].mean())


670.8108108108108


So, the average value of questions that contain the word "King" is about $671.
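
To compare several word lists without repeating the filter and mean steps, the two can be wrapped in one helper. This is only a sketch: `average_value` is a hypothetical name, it uses the same substring matching as `filter_data`, and the toy values are invented for illustration:

```python
import pandas as pd

def average_value(data, words, text_column="Question", value_column="Float Value"):
    lowered = data[text_column].str.lower()
    # Start with every row selected, then require each word in turn.
    mask = pd.Series(True, index=data.index)
    for word in words:
        mask &= lowered.str.contains(word.lower(), regex=False)
    return data.loc[mask, value_column].mean()

toy = pd.DataFrame({
    "Question": ["This crop is king in Mali", "King of the South", "The largest state"],
    "Float Value": [300.0, 400.0, 200.0],
})
toy_average = average_value(toy, ["King"])
print(toy_average)
```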

### Let's write a function that returns the count of the unique answers to all of the questions in a dataset.

For example, after filtering the entire dataset to only questions containing the word "King", we could then find all of the unique answers to those questions.

#### 1. Defining the function:

In [56]:
# A function to find the unique answers of a set of data
def get_answer_counts(data):
    return data["Answer"].value_counts()

#### 2. Testing the function:

In [57]:
# Testing the answer count function
get_answer_counts(filtered)

Henry VIII                      3
Louisiana                       2
Richard III                     2
cotton                          1
Merlin                          1
                               ..
PayPal                          1
a picture                       1
Romania                         1
Philately (stamp collecting)    1
Bald eagle                      1
Name: Answer, Length: 181, dtype: int64

After filtering the entire dataset to only questions containing the word "King", we can see that the answer "Henry VIII" appeared 3 times and was the most common answer.
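
Because `value_counts()` is case-sensitive, answers such as "Cotton" and "cotton" are counted separately. A variant that lowercases the answers first might look like the sketch below (`get_normalized_answer_counts` is a hypothetical name, and the toy answers are chosen for illustration):

```python
import pandas as pd

def get_normalized_answer_counts(data, column="Answer"):
    # Lowercase and trim the answers so that spelling variants that differ
    # only in case or surrounding whitespace are counted together.
    normalized = data[column].str.lower().str.strip()
    return normalized.value_counts()

toy = pd.DataFrame({"Answer": ["Cotton", "cotton", "Henry VIII", "Merlin"]})
counts = get_normalized_answer_counts(toy)
print(counts)
```

In the notebook itself, counting the existing `clean_answer` column would have a similar effect, since it is already lowercased.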

We can do this for other words too.

### Recycled Questions

Without access to the entire Jeopardy question dataset, we can't know exactly if a question is a repeat of an older one, but we can still investigate how often complex words reoccur.

#### 1. Computing the overlap:

In [58]:

question_overlap = []
terms_used = set()

df = df.sort_values('Air Date')

for i, row in df.iterrows():
    split_question = row['Question'].split(' ')
    split_question = [q for q in split_question if len(q) > 5]
    
    match_count = 0
    
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
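
The loop above can also be packaged as a reusable function, which makes it easy to try on a handful of sentences before running it on the whole DataFrame. The sketch below is intended to follow the same logic (terms longer than 5 characters, matched against terms from earlier questions); the name `compute_question_overlap` and the toy questions are illustrative assumptions:

```python
def compute_question_overlap(questions, min_length=6):
    """For each question, return the share of its long terms already seen earlier."""
    terms_used = set()
    overlaps = []
    for question in questions:
        terms = [word for word in question.split(" ") if len(word) >= min_length]
        if terms:
            seen = sum(1 for word in terms if word in terms_used)
            overlaps.append(seen / len(terms))
        else:
            # No long terms in this question, so there is nothing to overlap.
            overlaps.append(0)
        terms_used.update(terms)
    return overlaps

toy_questions = [
    "this state borders canada",
    "canada borders this northern state",
    "a cat",
]
toy_overlap = compute_question_overlap(toy_questions)
print(toy_overlap)
```

In the second toy question, two of the three long terms ("canada" and "borders") were already seen, so its overlap is 2/3. The questions must be passed in chronological order for the result to be meaningful, which is why the notebook sorts by Air Date first.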

#### 2. Inspecting the result:

In [59]:
df['question_overlap'] = question_overlap
df['question_overlap'].mean()

0.4244994573320742

It looks like, on average, about 42% of the long terms (six or more characters) in a question have already appeared in an earlier question. This only looks at single terms, but it suggests that recycled questions are worth investigating further.

### Conclusion & Next Steps

In this project we explored the Jeopardy! dataset to look for potential winning strategies.

Some next steps we could take to further our analysis could be to do the following:

- Investigate the ways in which questions change over time by filtering by the date. How many questions from the 90s use the word "Computer" compared to questions from the 2000s?

- Is there a connection between the round and the category? Are you more likely to find certain categories, like "Literature" in Single Jeopardy or Double Jeopardy?

- Find a better way to remove non-informative words by manually removing words like 'the', 'than', etc, or find a list of stop-words, or remove words that occur in more than a certain percentage of questions.

- Dig deeper into the Category column to see which categories appear more often and find the probability of each category appearing in each round.

- Use the entire Jeopardy dataset instead of just the subset we used.

- Use phrases instead of just single words to see if there is any overlap between questions.
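
As a starting point for the first suggestion above, the number of questions mentioning a word can be grouped by decade. This is a rough sketch rather than a finished analysis: `count_word_by_decade` is a hypothetical helper, the toy rows are invented, and matching is by substring as in `filter_data`:

```python
import pandas as pd

def count_word_by_decade(data, word, text_column="Question", date_column="Air Date"):
    dates = pd.to_datetime(data[date_column])
    # Integer-divide the year to bucket each question into its decade.
    decade = (dates.dt.year // 10) * 10
    contains_word = data[text_column].str.lower().str.contains(word.lower(), regex=False)
    # Summing booleans counts the matching questions per decade.
    return contains_word.groupby(decade).sum()

toy = pd.DataFrame({
    "Air Date": ["1997-09-24", "1994-11-17", "2004-12-31", "2006-03-17"],
    "Question": [
        "This computer company was founded in a garage",
        "This Old Kingdom capital of Egypt",
        "A personal computer made by this firm",
        "Computer pioneer who cracked the Enigma code",
    ],
})
by_decade = count_word_by_decade(toy, "Computer")
print(by_decade)
```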