# Location of dataset

https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file


### Description:
  The JSON file is an unordered list of questions, where each question has:

- 'category' : the question category, e.g. "HISTORY"
- 'value' : dollar value of the question as a string, e.g. "\$200". Note: This is "None" for Final Jeopardy! and Tiebreaker questions
- 'question' : text of the question. Note: This sometimes contains hyperlinks and other messy text, such as when there is a picture or video question
- 'answer' : text of the answer
- 'round' : one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker". Note: Tiebreaker questions do happen, but they are very rare (roughly once every 20 years)
- 'show_number' : show number as a string, e.g. "4680"
- 'air_date' : the show air date in the format YYYY-MM-DD
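
As a rough illustration of the record layout described above, the sketch below parses a single hand-written record with the standard `json` module. The record is assembled from values that appear in this notebook's output (the question text is left truncated exactly as displayed), so it is an assumption about what one entry looks like rather than a verbatim excerpt of the file.

```python
import json

# One hand-written record in the layout described above (illustrative only;
# the question text is truncated exactly as it appears in the notebook output).
raw = """
[
  {
    "category": "HISTORY",
    "value": "$200",
    "question": "For the last 8 years of his life, Galileo was ...",
    "answer": "Copernicus",
    "round": "Jeopardy!",
    "show_number": "4680",
    "air_date": "2004-12-31"
  }
]
"""

questions = json.loads(raw)   # a list of dicts, one dict per question
first = questions[0]

# Every field is a string, including the dollar value and the show number.
print(sorted(first))
print(type(first["value"]).__name__, type(first["show_number"]).__name__)
```

Because the dollar value and show number arrive as strings, any numeric work on them needs an explicit conversion step.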


In [69]:
import pandas as pd
import numpy as np
import re
import string

pd.options.display.max_rows = 4
pd.options.display.max_columns = 15
pd.options.display.width = 200

In [70]:
#jeopardy = pd.read_json('JEOPARDY_QUESTIONS1.json')
jeopardy = pd.read_csv('JEOPARDY_CSV.csv')

In [71]:
jeopardy.sample(10)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
52285,3478,1999-10-20,Double Jeopardy!,FORGOTTEN MUSICALS,$400,"Critics didn't ""fawn"" over the musical based o...","""The Yearling"""
206092,5385,2008-01-25,Jeopardy!,MY CHEMICAL ROMANCE,$200,The simplest substance that can also be called...,H2O
...,...,...,...,...,...,...,...
50262,1285,1990-03-16,Double Jeopardy!,LANGUAGES,$1000,Standard Spanish originated in this ancient ki...,Castille
60222,1260,1990-02-09,Jeopardy!,HOLIDAYS & OBSERVANCES,$400,This Jewish holiday marks the end of the 10 da...,Yom Kippur


In [72]:
jeopardy.columns = jeopardy.columns.str.strip().str.replace(' ','_').str.lower()

In [73]:
#new_names = [i.lower().strip().replace(' ','_') for i in jeopardy.columns.tolist()]

In [74]:
#jeopardy.columns = new_names

In [75]:
jeopardy.sample(2)

Unnamed: 0,show_number,air_date,round,category,value,question,answer
127010,4676,2004-12-27,Double Jeopardy!,MAX-IMUM OVERDRIVE,$2000,"The ""colorful"" title of this George Peppard-Ur...",The Blue Max
54980,961,1988-11-07,Double Jeopardy!,THE THREE R'S,$600,"To celebrate its 60th anniversary, this newspa...","""Weekly Reader"""


In [76]:
jeopardy

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
...,...,...,...,...,...,...,...
216928,4999,2006-05-11,Double Jeopardy!,QUOTATIONS,$2000,"From Ft. Sill, Okla. he made the plea, Arizona...",Geronimo
216929,4999,2006-05-11,Final Jeopardy!,HISTORIC NAMES,,A silent movie title includes the last name of...,Grigori Alexandrovich Potemkin


In [87]:
# Python 3: the third argument of str.maketrans lists the characters to delete
tbl = str.maketrans("", "", string.punctuation)
jeopardy.question.str.translate(tbl).str.lower()

0         for the last 8 years of his life galileo was u...
1         no 2 1912 olympian football star at carlisle i...
                                ...                        
216928    from ft sill okla he made the plea arizona is ...
216929    a silent movie title includes the last name of...
Name: question, dtype: object

### Performing basic cleanup operations:

- Stripping all punctuation and capitalization from the 'question' and 'answer' columns.
- Converting the 'value' column to integer type after stripping the dollar sign.
- Converting the 'air_date' column values to datetime objects.
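
The three steps listed above can be tried on plain Python values before applying them to the DataFrame. This is a minimal Python 3 sketch using only the standard library; the helper names `strip_punct_lower`, `dollars_to_int`, and `parse_air_date` are made up for illustration, and mapping a missing or non-numeric value to 0 is an assumption.

```python
import string
from datetime import datetime

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punct_lower(text):
    """Remove ASCII punctuation and lowercase the text."""
    return text.translate(_PUNCT_TABLE).lower()

def dollars_to_int(value):
    """Convert a string such as '$2000' to an int; anything else becomes 0 (assumption)."""
    try:
        return int(strip_punct_lower(value))
    except (AttributeError, TypeError, ValueError):
        return 0

def parse_air_date(text):
    """Parse a YYYY-MM-DD string into a datetime object."""
    return datetime.strptime(text, "%Y-%m-%d")

print(strip_punct_lower('"Weekly Reader"'))
print(dollars_to_int("$2000"), dollars_to_int("None"))
print(parse_air_date("2004-12-31"))
```

Note that only ASCII punctuation is removed, so characters such as curly quotes would survive this cleanup.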

Refer to these links:

http://www.tutorialspoint.com/python3/string_maketrans.htm

https://stackoverflow.com/questions/11692199/string-translate-with-unicode-data-in-python

https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python


Alternate Method:
```python
print(string.punctuation)

def clean(obj):
    punctuation = string.punctuation
    remove_uni_ordinal_mapping = dict((ord(char),None) for char in punctuation)
    obj = obj.translate(remove_uni_ordinal_mapping)
    
    return obj.lower()
```

In [77]:
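def clean(obj):
    # Delete every ASCII punctuation character, then lowercase (Python 3).
    tbl = str.maketrans("", "", string.punctuation)
    return obj.translate(tbl).lower()

def str_2_num(obj):
    # Fall back to 0 for anything that is not a valid integer (e.g. "none").
    try:
        return int(obj)
    except (TypeError, ValueError):
        return 0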

In [30]:
jeopardy['clean_question'] = jeopardy.question.apply(clean)
jeopardy['clean_answer'] = jeopardy.answer.apply(clean)
jeopardy['clean_value'] = jeopardy.value.apply(clean)
jeopardy.clean_value = jeopardy.clean_value.apply(str_2_num)
jeopardy.air_date = pd.to_datetime(jeopardy.air_date, format='%Y-%m-%d')

In [31]:
jeopardy.sample(2)

Unnamed: 0,show_number,air_date,round,...,clean_question,clean_answer,clean_value
1927,6294,2012-01-19,Jeopardy!,...,the chorus of romeo juliet tells us its in th...,verona,600
39123,5032,2006-06-27,Double Jeopardy!,...,there are 2 kinds of this creature soft hard ...,ticks,800


In [19]:
(jeopardy[['question','answer']]
                                 #.question
                                 #.apply(clean)
                                 .answer
                                 .apply(clean)
)

0                                                copernicus
1                                                jim thorpe
2                                                   arizona
3                                                 mcdonalds
4                                                john adams
5                                                   the ant
6                                            the appian way
7                                            michael jordan
8                                                washington
9                                             crate  barrel
10                                           jackie gleason
11                                                  the cud
12                                      ceylon or sri lanka
13                                                jim brown
14                                             the uv index
15                                                   bulova
16                                      

In [None]:
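def splitS(Series):
    # Split each string on single spaces; returns a Series of word lists.
    try:
        return Series.str.split(" ")
    except AttributeError:
        return 0

def removeS(Series, rm_string):
    # Remove every occurrence of rm_string from each word list, in place.
    # Returns 0 on success, 1 if an entry is not a list (e.g. NaN).
    try:
        for k, v in Series.items():
            while rm_string in v:
                v.remove(rm_string)
        return 0
    except TypeError:
        return 1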
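
One detail worth keeping in mind when removing a word from tokenized text: `list.remove` deletes only the first matching element and raises `ValueError` when the element is absent. The sketch below shows a filtering alternative on plain lists. `drop_token` is a hypothetical helper name; the first two token lists come from the cleaned answers shown earlier, and the third is made up to show a repeated word.

```python
def drop_token(tokens, token):
    """Return a new list without any occurrence of `token`."""
    return [t for t in tokens if t != token]

rows = [
    ["the", "appian", "way"],
    ["jim", "thorpe"],                      # no 'the' at all
    ["the", "top", "of", "the", "world"],   # made-up row with 'the' twice
]

cleaned = [drop_token(row, "the") for row in rows]
print(cleaned)

# list.remove, by contrast, fails when the token is missing.
try:
    ["jim", "thorpe"].remove("the")
    remove_failed = False
except ValueError:
    remove_failed = True
print(remove_failed)
```

Returning a new list also leaves the original tokens untouched, which makes the step easier to re-run in a notebook.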

In [None]:
jeopardy['split_answer'] = splitS(jeopardy.clean_answer)
jeopardy['split_question'] = splitS(jeopardy.clean_question)

In [None]:
#Picking out a 'the' in jeopardy.split_question column of dataframe:
jeopardy.split_question.loc[:3]

In [None]:
test = removeS(jeopardy.split_question,'the')
answer = removeS(jeopardy.split_answer,'the')

In [None]:
def matching_function(df):
    # Tokenize both columns and drop the word 'the' from each token list.
    match_count = 0
    split_question = splitS(df['clean_question'])
    split_answer = splitS(df['clean_answer'])
    removeS(split_question, 'the')
    removeS(split_answer, 'the')
    return split_question, split_answer
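
`match_count` is initialized above but never updated. One possible way to use it, assuming the goal is to count how many answer words also appear in the question, is sketched below on plain token lists. `count_matches` and `match_ratio` are hypothetical names, the sample tokens are made up, and this is only a guess at the intended metric, not the notebook's actual method.

```python
def count_matches(question_tokens, answer_tokens):
    """Count the answer tokens that also occur in the question."""
    question_words = set(question_tokens)
    return sum(1 for word in answer_tokens if word in question_words)

def match_ratio(question_tokens, answer_tokens):
    """Fraction of answer tokens found in the question (0.0 for an empty answer)."""
    if not answer_tokens:
        return 0.0
    return count_matches(question_tokens, answer_tokens) / len(answer_tokens)

# Made-up token lists, already cleaned and with 'the' removed.
sample_question = ["this", "way", "leads", "to", "rome"]
sample_answer = ["appian", "way"]

print(count_matches(sample_question, sample_answer))
print(match_ratio(sample_question, sample_answer))
```

Applied row by row, a ratio like this would give a rough sense of how often an answer can be read off the question text.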
    

In [None]:
X, Y = matching_function(jeopardy[['clean_question','clean_answer']])

In [None]:
X.loc[0] , Y.loc[0]

In [None]:
test, answer