# Jeopardy Questions Analysis

The objective here is to see what this dataset looks like and how the data is categorized, in order to meet the challenge requirements.

In [None]:
import pandas as pd

In [70]:
df = pd.read_json('../../files/JEOPARDY_QUESTIONS1.json')

In [71]:
df.shape

(216930, 7)

In [72]:
df.columns

Index(['category', 'air_date', 'question', 'value', 'answer', 'round',
       'show_number'],
      dtype='object')

In [73]:
df['category'].unique()

array(['HISTORY', "ESPN's TOP 10 ALL-TIME ATHLETES",
       'EVERYBODY TALKS ABOUT IT...', ..., 'OFF-BROADWAY',
       'RIDDLE ME THIS', 'AUTHORS IN THEIR YOUTH'],
      shape=(27995,), dtype=object)

There are too many categories (27,995 unique values). Maybe it is better to handle categorization in the prompt instead?
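
To judge how skewed the categories are, a frequency count can help. The following is a minimal sketch on a tiny stand-in DataFrame; the rows in `sample` are illustrative and not taken from the real file.

```python
import pandas as pd

# Tiny stand-in for the real DataFrame (illustrative rows only).
sample = pd.DataFrame({
    'category': ['HISTORY', 'HISTORY', 'SCIENCE', 'HISTORY', 'OFF-BROADWAY', 'SCIENCE'],
})

# Number of questions per category, most frequent first.
category_counts = sample['category'].value_counts()

# Share of all rows covered by the two most frequent categories.
top_share = category_counts.head(2).sum() / len(sample)

print(category_counts)
print(top_share)
```

If a small number of categories covers most rows on the real data, filtering by category may still be practical; otherwise leaving it to the prompt looks more reasonable.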

In [74]:
df['air_date'].unique()

array(['2004-12-31', '2010-07-06', '2000-12-18', ..., '2006-09-29',
       '2007-03-23', '2006-05-11'], shape=(3640,), dtype=object)

In [75]:
df['show_number'].unique()

array([4680, 5957, 3751, ..., 5070, 5195, 4999], shape=(3640,))

In [87]:
df['air_date'] = pd.to_datetime(df['air_date'])
df['air_date'].corr(df['show_number'])

np.float64(0.9999810894074881)

The correlation between 'air_date' and 'show_number' is almost 1 (about 0.99998), so the two columns carry nearly the same information.
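
A high correlation does not by itself prove that each show number maps to exactly one air date. A possible way to check that directly is sketched below, on a small stand-in DataFrame (the `sample` rows reuse values seen above, but their pairing is only illustrative).

```python
import pandas as pd

# Stand-in data; the real check would run on df itself.
sample = pd.DataFrame({
    'air_date': pd.to_datetime(['2004-12-31', '2004-12-31', '2010-07-06', '2000-12-18']),
    'show_number': [4680, 4680, 5957, 3751],
})

# How many distinct dates each show has, and how many shows each date has.
dates_per_show = sample.groupby('show_number')['air_date'].nunique()
shows_per_date = sample.groupby('air_date')['show_number'].nunique()

# One-to-one only if every group contains exactly one distinct partner value.
is_one_to_one = bool((dates_per_show == 1).all() and (shows_per_date == 1).all())
print(is_one_to_one)
```

If this holds on the full data, either column could be dropped when filtering by time.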

In [77]:
df['value'].unique()

array(['$200', '$400', '$600', '$800', '$2,000', '$1000', '$1200',
       '$1600', '$2000', '$3,200', None, '$5,000', '$100', '$300', '$500',
       '$1,000', '$1,500', '$1,200', '$4,800', '$1,800', '$1,100',
       '$2,200', '$3,400', '$3,000', '$4,000', '$1,600', '$6,800',
       '$1,900', '$3,100', '$700', '$1,400', '$2,800', '$8,000', '$6,000',
       '$2,400', '$12,000', '$3,800', '$2,500', '$6,200', '$10,000',
       '$7,000', '$1,492', '$7,400', '$1,300', '$7,200', '$2,600',
       '$3,300', '$5,400', '$4,500', '$2,100', '$900', '$3,600', '$2,127',
       '$367', '$4,400', '$3,500', '$2,900', '$3,900', '$4,100', '$4,600',
       '$10,800', '$2,300', '$5,600', '$1,111', '$8,200', '$5,800',
       '$750', '$7,500', '$1,700', '$9,000', '$6,100', '$1,020', '$4,700',
       '$2,021', '$5,200', '$3,389', '$4,200', '$5', '$2,001', '$1,263',
       '$4,637', '$3,201', '$6,600', '$3,700', '$2,990', '$5,500',
       '$14,000', '$2,700', '$6,400', '$350', '$8,600', '$6,300', '$250',
      

Maybe convert these to numbers and create a range category...

In [78]:
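def clean_value(val):
    # Strip the '$' sign and thousands separators, e.g. '$1,200' -> 1200.
    if val:
        val = val.replace(',', '').replace('$', '').strip()
        return int(val)
    # Missing values (None) stay missing.
    return None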

df['value'] = df['value'].apply(clean_value)
df['value'].unique()

array([2.000e+02, 4.000e+02, 6.000e+02, 8.000e+02, 2.000e+03, 1.000e+03,
       1.200e+03, 1.600e+03, 3.200e+03,       nan, 5.000e+03, 1.000e+02,
       3.000e+02, 5.000e+02, 1.500e+03, 4.800e+03, 1.800e+03, 1.100e+03,
       2.200e+03, 3.400e+03, 3.000e+03, 4.000e+03, 6.800e+03, 1.900e+03,
       3.100e+03, 7.000e+02, 1.400e+03, 2.800e+03, 8.000e+03, 6.000e+03,
       2.400e+03, 1.200e+04, 3.800e+03, 2.500e+03, 6.200e+03, 1.000e+04,
       7.000e+03, 1.492e+03, 7.400e+03, 1.300e+03, 7.200e+03, 2.600e+03,
       3.300e+03, 5.400e+03, 4.500e+03, 2.100e+03, 9.000e+02, 3.600e+03,
       2.127e+03, 3.670e+02, 4.400e+03, 3.500e+03, 2.900e+03, 3.900e+03,
       4.100e+03, 4.600e+03, 1.080e+04, 2.300e+03, 5.600e+03, 1.111e+03,
       8.200e+03, 5.800e+03, 7.500e+02, 7.500e+03, 1.700e+03, 9.000e+03,
       6.100e+03, 1.020e+03, 4.700e+03, 2.021e+03, 5.200e+03, 3.389e+03,
       4.200e+03, 5.000e+00, 2.001e+03, 1.263e+03, 4.637e+03, 3.201e+03,
       6.600e+03, 3.700e+03, 2.990e+03, 5.500e+03, 

In [86]:
def value_categories(val):
    # Upper bounds are inclusive so each label matches its range
    # (e.g. 1000 belongs to '0-1000', not '1001-3000').
    if pd.isna(val):
        return 'Final'
    elif val <= 1000:
        return '0-1000'
    elif val <= 3000:
        return '1001-3000'
    elif val <= 5000:
        return '3001-5000'
    elif val <= 8000:
        return '5001-8000'
    elif val <= 10000:
        return '8001-10000'
    else:
        return '10001+'

df['value_categories'] = df['value'].apply(value_categories)
df['value_categories'].unique()

array(['0-1000', '1001-3000', '3001-5000', 'Final', '5001-8000',
       '8001-10000', '10001+'], dtype=object)
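
The same range buckets could also be built with `pd.cut`, which keeps the bin edges and labels in one place. This is a minimal sketch on a handful of made-up values. It assumes each bin includes its upper edge, so that 1000 lands in '0-1000' as the label text suggests.

```python
import numpy as np
import pandas as pd

# Illustrative values, including edge cases and one missing value.
values = pd.Series([200, 1000, 1001, 3000, 5000, 12000, np.nan])

bins = [0, 1000, 3000, 5000, 8000, 10000, np.inf]
labels = ['0-1000', '1001-3000', '3001-5000', '5001-8000', '8001-10000', '10001+']

# right=True (the default) makes each bin include its upper edge;
# include_lowest=True also lets the first bin include 0.
binned = pd.cut(values, bins=bins, labels=labels, include_lowest=True)

# Missing values get their own 'Final' bucket.
binned = binned.cat.add_categories('Final').fillna('Final')

print(binned.tolist())
```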

In [8]:
df['round'].unique()

array(['Jeopardy!', 'Double Jeopardy!', 'Final Jeopardy!', 'Tiebreaker'],
      dtype=object)

In [None]:
df['question'].unique()

array(["'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'",
       "'No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves'",
       "'The city of Yuma in this state has a record average of 4,055 hours of sunshine each year'",
       ...,
       '\'In Penny Lane, where this "Hellraiser" grew up, the barber shaves another customer--then flays him alive!\'',
       '\'From Ft. Sill, Okla. he made the plea, Arizona is my land, my home, my father\'s land, to which I now ask to... return"\'',
       "'A silent movie title includes the last name of this 18th c. statesman & favorite of Catherine the Great'"],
      shape=(216132,), dtype=object)

It looks like there are some duplicate questions (216,132 unique values out of 216,930 rows). Maybe it is best to just remove them.

In [81]:
df_dupli = df[df.duplicated('question', keep=False)].sort_values(by='question')
df_dupli

Unnamed: 0,category,air_date,question,value,answer,round,show_number
76511,PRESIDENTIAL BIOGRAPHIES,2008-10-14,"'""A Time to Heal""'",2000.0,Gerald Ford,Double Jeopardy!,5542
144726,BOOKS BY PRESIDENTS,1998-06-29,"'""A Time to Heal""'",400.0,Gerald Ford,Double Jeopardy!,3201
30412,MOVIE LOVE THEMES,1998-05-27,"'""A Whole New World""'",400.0,Aladdin,Double Jeopardy!,3178
189941,SONGS FROM DISNEY FILMS,2003-02-20,"'""A Whole New World""'",800.0,Aladdin,Double Jeopardy!,4259
59633,WHAT'S THE PITCH?,1996-01-16,"'""A mind is a terrible thing to waste""'",500.0,United Negro College Fund (UNCF),Jeopardy!,2622
...,...,...,...,...,...,...,...
78645,SIGNS & SYMBOLS,1997-12-29,'[video clue]',100.0,Slippery When Wet,Jeopardy!,3071
78651,SIGNS & SYMBOLS,1997-12-29,'[video clue]',200.0,Lost And Found,Jeopardy!,3071
78657,SIGNS & SYMBOLS,1997-12-29,'[video clue]',300.0,(Registered) Trademark,Jeopardy!,3071
144796,AUTO LOGOS,2002-10-17,'[video clue]',1200.0,Toyota,Double Jeopardy!,4169


Although the question text is duplicated, the rows come from different shows, with different values, categories, etc. So removing them could interfere with the filtering used to create the subset dataset.
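
One way to separate the two kinds of duplicates is to count, for each repeated question, how many distinct shows it appears in. The sketch below uses four rows copied from the output above; the variable names are only for illustration.

```python
import pandas as pd

# Four rows taken from the duplicate listing above.
sample = pd.DataFrame({
    'question': ["'\"A Time to Heal\"'", "'\"A Time to Heal\"'",
                 "'[video clue]'", "'[video clue]'"],
    'show_number': [5542, 3201, 3071, 3071],
    'value': [2000.0, 400.0, 100.0, 200.0],
})

# Keep every row whose question text appears more than once.
dupes = sample[sample.duplicated('question', keep=False)]

# Number of distinct shows per repeated question.
shows_per_question = dupes.groupby('question')['show_number'].nunique()

# Questions repeated within a single show are more likely to be placeholders.
same_show_repeats = shows_per_question[shows_per_question == 1]
print(same_show_repeats)
```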

But these '[video clue]' questions were not expected and can be a problem. I'm going to remove them. Also, I'm going to search for more questions of this type.
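
A possible way to search for similar placeholders is to match questions whose whole text is a bracketed tag. In this sketch, '[audio clue]' and '[filler]' are hypothetical examples, and whether the real data contains anything besides '[video clue]' is not known here.

```python
import pandas as pd

# Stand-in rows; only '[video clue]' is known to exist in the real data.
sample = pd.DataFrame({
    'question': [
        "'[video clue]'",
        "'[audio clue]'",   # hypothetical placeholder
        "'This man's theory got Galileo in trouble'",
        "'[filler]'",       # hypothetical placeholder
    ],
})

# Match questions that consist only of a bracketed tag wrapped in single quotes.
placeholder_mask = sample['question'].str.fullmatch(r"'\[[^\]]*\]'", na=False)

# Count each distinct placeholder text.
placeholders = sample.loc[placeholder_mask, 'question'].value_counts()
print(placeholders)
```

Questions that merely contain a bracketed note inside a longer clue would not match this pattern; `str.contains` with a looser pattern could surface those for manual review.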

In [47]:
df = df[df['question'] != "'[video clue]'"]
df.shape

(216916, 7)

Only 14 of those.

In [88]:
df.head()

Unnamed: 0,category,air_date,question,value,answer,round,show_number,value_categories
0,HISTORY,2004-12-31,"'For the last 8 years of his life, Galileo was...",200.0,Copernicus,Jeopardy!,4680,0-1000
1,ESPN's TOP 10 ALL-TIME ATHLETES,2004-12-31,'No. 2: 1912 Olympian; football star at Carlis...,200.0,Jim Thorpe,Jeopardy!,4680,0-1000
2,EVERYBODY TALKS ABOUT IT...,2004-12-31,'The city of Yuma in this state has a record a...,200.0,Arizona,Jeopardy!,4680,0-1000
3,THE COMPANY LINE,2004-12-31,"'In 1963, live on ""The Art Linkletter Show"", t...",200.0,McDonald\'s,Jeopardy!,4680,0-1000
4,EPITAPHS & TRIBUTES,2004-12-31,"'Signer of the Dec. of Indep., framer of the C...",200.0,John Adams,Jeopardy!,4680,0-1000


In [90]:
df.to_csv('../../files/questions.csv')