In [1]:
import pandas as pd

jeopardy = pd.read_json('data/jeopardy_questions1.json')
jeopardy.head(10)

Unnamed: 0,air_date,answer,category,question,round,show_number,value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200
5,2004-12-31,the ant,3-LETTER WORDS,"'In the title of an Aesop fable, this insect s...",Jeopardy!,4680,$200
6,2004-12-31,the Appian Way,HISTORY,'Built in 312 B.C. to link Rome & the South of...,Jeopardy!,4680,$400
7,2004-12-31,Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 8: 30 steals for the Birmingham Barons; 2...,Jeopardy!,4680,$400
8,2004-12-31,Washington,EVERYBODY TALKS ABOUT IT...,"'In the winter of 1971-72, a record 1,122 inch...",Jeopardy!,4680,$400
9,2004-12-31,Crate & Barrel,THE COMPANY LINE,'This housewares store was named for the packa...,Jeopardy!,4680,$400


In [None]:
jeopardy.columns

## Cleaning the Data
A review of the data shows a few clean-up tasks that will make the analysis easier.

The first step is to normalise the `question` and `answer` columns. We'll do this with a function that lowercases each string and strips common punctuation (periods, commas, colons, quotes, slashes, exclamation marks and question marks).
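
If the goal were to remove every ASCII punctuation character rather than a chosen subset, one option is `str.translate` with `string.punctuation`. The sketch below is an alternative for illustration only; `strip_all_punctuation` is a hypothetical helper, not the function used in this notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_all_punctuation(s):
    """Hypothetical helper: lowercase s and drop all ASCII punctuation."""
    return s.translate(_PUNCT_TABLE).lower()

sample = "'No. 2: 1912 Olympian; football star"
print(strip_all_punctuation(sample))
```

Note that this is more aggressive than a hand-picked character class: it also removes `&` and `-`, so a range such as `1971-72` would collapse to `197172`.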

Next, the `value` column needs cleaning. It should be numeric, but the source data stores each amount as a string with a dollar sign (for example `$200`). The dollar sign and any thousands separators must be removed before the column can be converted to a numeric type.
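
As a rough illustration of that conversion, here is a minimal sketch on a toy Series. The sample values are made up to mirror the shape of the `value` column, and `pd.to_numeric` with `errors='coerce'` is just one possible approach: it keeps missing entries as `NaN` until we decide how to fill them.

```python
import pandas as pd

# Toy values shaped like the `value` column; None marks a missing amount.
values = pd.Series(['$200', '$1,000', None])

# Remove dollar signs and thousands separators (regex character class).
stripped = values.str.replace(r'[$,]', '', regex=True)

# Convert to numbers; anything that cannot be parsed becomes NaN.
numeric = pd.to_numeric(stripped, errors='coerce')

# Treat missing amounts as 0 and store the result as integers.
clean = numeric.fillna(0).astype('int32')
print(clean.tolist())
```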

Finally, `air_date` is stored as a string but should be a datetime. This is a straightforward conversion with `pd.to_datetime`.
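
To show why the date conversion is worth doing, the sketch below parses two sample strings (made up, in the same `YYYY-MM-DD` shape as `air_date`) and then filters by date, which only behaves reliably once the column is a datetime type. The explicit `format` argument is an assumption about the data layout.

```python
import pandas as pd

# Sample strings in the same layout as the air_date column.
air_dates = pd.Series(['2004-12-31', '2004-12-30'])

# Parse the strings into datetime values.
parsed = pd.to_datetime(air_dates, format='%Y-%m-%d')

# Datetime values support date arithmetic and comparisons directly.
after_cutoff = parsed > pd.Timestamp('2004-12-30')
print(parsed.dt.year.tolist())
print(after_cutoff.tolist())
```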

In [26]:
import re

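def normalise_string(s):
    # Remove the listed punctuation characters, then lowercase the result.
    pattern = r'[.\':",\\/!?]'
    s = re.sub(pattern, '', s)
    return s.lower()

def remove_dollar(cash):
    # Return the value string without its dollar sign.
    return cash.replace('$', '')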

jeopardy["clean_question"] = jeopardy["question"].apply(normalise_string)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalise_string)
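# Missing values become '0' so the column can be cast to an integer type.
jeopardy["clean_value"] = jeopardy["value"].replace([None],'0')
# Pass regex=True so '[$,]' is treated as a character class
# (pandas 2.0+ matches the pattern literally by default).
jeopardy["clean_value"] = jeopardy["clean_value"].str.replace('[$,]', '', regex=True).astype('int32')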
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

In [28]:
jeopardy.head(10)

Unnamed: 0,air_date,answer,category,question,round,show_number,value,clean_question,clean_answer,clean_value
0,2004-12-31,Copernicus,HISTORY,"'For the last 8 years of his life, Galileo was...",Jeopardy!,4680,$200,for the last 8 years of his life galileo was u...,copernicus,200
1,2004-12-31,Jim Thorpe,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 2: 1912 Olympian; football star at Carlis...,Jeopardy!,4680,$200,no 2 1912 olympian; football star at carlisle ...,jim thorpe,200
2,2004-12-31,Arizona,EVERYBODY TALKS ABOUT IT...,'The city of Yuma in this state has a record a...,Jeopardy!,4680,$200,the city of yuma in this state has a record av...,arizona,200
3,2004-12-31,McDonald\'s,THE COMPANY LINE,"'In 1963, live on ""The Art Linkletter Show"", t...",Jeopardy!,4680,$200,in 1963 live on the art linkletter show this c...,mcdonalds,200
4,2004-12-31,John Adams,EPITAPHS & TRIBUTES,"'Signer of the Dec. of Indep., framer of the C...",Jeopardy!,4680,$200,signer of the dec of indep framer of the const...,john adams,200
5,2004-12-31,the ant,3-LETTER WORDS,"'In the title of an Aesop fable, this insect s...",Jeopardy!,4680,$200,in the title of an aesop fable this insect sha...,the ant,200
6,2004-12-31,the Appian Way,HISTORY,'Built in 312 B.C. to link Rome & the South of...,Jeopardy!,4680,$400,built in 312 bc to link rome & the south of it...,the appian way,400
7,2004-12-31,Michael Jordan,ESPN's TOP 10 ALL-TIME ATHLETES,'No. 8: 30 steals for the Birmingham Barons; 2...,Jeopardy!,4680,$400,no 8 30 steals for the birmingham barons; 2306...,michael jordan,400
8,2004-12-31,Washington,EVERYBODY TALKS ABOUT IT...,"'In the winter of 1971-72, a record 1,122 inch...",Jeopardy!,4680,$400,in the winter of 1971-72 a record 1122 inches ...,washington,400
9,2004-12-31,Crate & Barrel,THE COMPANY LINE,'This housewares store was named for the packa...,Jeopardy!,4680,$400,this housewares store was named for the packag...,crate & barrel,400
