# Winning Jeopardy


Jeopardy is a popular TV show in the US where participants answer questions to win money. In this project, we will work with the dataset of Jeopardy questions to figure out some patterns in the questions that could help win the game.

![image](https://images2.minutemediacdn.com/image/upload/c_crop,w_4184,h_2353,x_0,y_297/c_fill,w_1440,ar_16:9,f_auto,q_auto,g_auto/images/GettyImages/mmsport/mentalfloss/01g2aftm5529evy4bpwa.jpg)


The dataset can be found at this [link](https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/).
Each row in the dataset corresponds to one questions asked on a single episode. Description of a few columns:

* Show Number - the Jeopardy episode number of the show this question was in.
* Air Date - the date the episode aired.
* Round - the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category - the category of the question.
* Value - the number of dollars answering the question correctly is worth.
* Question - the text of the question.
* Answer - the text of the answer.

In [98]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import bigrams
from scipy.stats import chisquare,chi2_contingency

In [8]:
import pandas as pd

In [9]:
jeopardy= pd.read_csv("jeopardy.csv")

In [10]:
jeopardy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [12]:
jeopardy.columns

Index(['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')

In [14]:
cols = jeopardy.columns
jeopardy.columns = cols.str.strip().str.lower().str.replace(" ","_")
jeopardy.head(3)

Unnamed: 0,show_number,air_date,round,category,value,question,answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona


In [16]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 7 columns):
show_number    19999 non-null int64
air_date       19999 non-null object
round          19999 non-null object
category       19999 non-null object
value          19999 non-null object
question       19999 non-null object
answer         19999 non-null object
dtypes: int64(1), object(6)
memory usage: 1.1+ MB


## Normalize 

Let us normalize the questions and ansewrs columns to remove punctuations and convert all words to lower case. Some questions also contains html tags, so we will remove them as well. This way we can easily use the words for comparision later on.

In [20]:
jeopardy["question"].head()

0    For the last 8 years of his life, Galileo was ...
1    No. 2: 1912 Olympian; football star at Carlisl...
2    The city of Yuma in this state has a record av...
3    In 1963, live on "The Art Linkletter Show", th...
4    Signer of the Dec. of Indep., framer of the Co...
Name: question, dtype: object

In [21]:
import re
def normalize_text(text):
    text = text.lower()           #Convert str to lowercase
    text = re.sub("[^a-zA-Z0-9/s]"," ",text)
    text = re.sub("/s+"," ",text)   #dấu câu
    return text

jeopardy["clean_question"] = jeopardy["question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["answer"].apply(normalize_text)

In [22]:
jeopardy["clean_question"].head(5)

0    for the last 8 years of his life  galileo was ...
1    no  2  1912 olympian  football star at carlisl...
2    the city of yuma in this state has a record av...
3    in 1963  live on  the art linkletter show   th...
4    signer of the dec  of indep   framer of the co...
Name: clean_question, dtype: object

## Normalize value

The value column must be numeric and the air_date a 'datetime' object rather than a string. So let us normalize these as well.

In [27]:
import re
def normalize_value(value):
    value = re.sub("[^a-zA-Z0-9/s]", "",value)
    if value != 'None':
        value = value
    else:
        value = 0
    value = int(value)
    return value
    

In [28]:
jeopardy["value"] = jeopardy["value"].apply(normalize_value)
jeopardy["value"].head()

0    200
1    200
2    200
3    200
4    200
Name: value, dtype: int64

## Normalize Date

In [24]:
jeopardy["air_date"] = pd.to_datetime(jeopardy["air_date"])

In [25]:
jeopardy["air_date"].head()

0   2004-12-31
1   2004-12-31
2   2004-12-31
3   2004-12-31
4   2004-12-31
Name: air_date, dtype: datetime64[ns]

## Answers in Questions

It would be helpful to figure two things when trying to analyze the game inorder to win it.

* How often the answer is deducible from the question.
* How often new questions are repeats of older questions.

For the first question, we will see how many times on average do the answers appear or are mentioned of in the questions. For every answer we will check the corresponding questions to see if the answer or any part of the answer was in it. We will remove the Stopwords from the questions and answers as Stopwords are very common and can be misleading in our case.

The basic idea is to find on average how many times do the questions contain the answers, so we will, for each answer check the corresponding questions and find the proportion of answer present in the question, we will then take its mean to get a general idea.

In [29]:

def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for answer in split_answer:
        if answer in split_question:
            match_count += 1
    return match_count/len(split_answer)
     

In [35]:
jeopardy.head(20)

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,0.0
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sh...,the ant,0.0
6,4680,2004-12-31,Jeopardy!,HISTORY,400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 b c to link rome the south of ...,the appian way,0.0
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2 ...,michael jordan,0.0
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 1971 72 a record 1 122 inche...,washington,0.0
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel,0.0


In [31]:
jeopardy["answer_in_question"] =  jeopardy.apply(count_matches,axis = 1)

In [32]:
jeopardy.head(20)

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,answer_in_question
0,4680,2004-12-31,Jeopardy!,HISTORY,200,"For the last 8 years of his life, Galileo was ...",Copernicus,for the last 8 years of his life galileo was ...,copernicus,0.0
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe,no 2 1912 olympian football star at carlisl...,jim thorpe,0.0
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,200,The city of Yuma in this state has a record av...,Arizona,the city of yuma in this state has a record av...,arizona,0.0
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's,in 1963 live on the art linkletter show th...,mcdonald s,0.0
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,200,"Signer of the Dec. of Indep., framer of the Co...",John Adams,signer of the dec of indep framer of the co...,john adams,0.0
5,4680,2004-12-31,Jeopardy!,3-LETTER WORDS,200,"In the title of an Aesop fable, this insect sh...",the ant,in the title of an aesop fable this insect sh...,the ant,0.0
6,4680,2004-12-31,Jeopardy!,HISTORY,400,Built in 312 B.C. to link Rome & the South of ...,the Appian Way,built in 312 b c to link rome the south of ...,the appian way,0.0
7,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,400,"No. 8: 30 steals for the Birmingham Barons; 2,...",Michael Jordan,no 8 30 steals for the birmingham barons 2 ...,michael jordan,0.0
8,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,400,"In the winter of 1971-72, a record 1,122 inche...",Washington,in the winter of 1971 72 a record 1 122 inche...,washington,0.0
9,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,400,This housewares store was named for the packag...,Crate & Barrel,this housewares store was named for the packag...,crate barrel,0.0


In [36]:
jeopardy.answer_in_question.mean()

0.06291895444478074

### Only 6% the answer can be used for a question

We found the mean to be - 6%
This is actually a very small proportion (only 6.29%) of questions that contain some part of the answer in them. This tells us that just by this idea, we cannot win Jeopardy.



## Investigate about repeat question
Lets now try to see how often new questions are repeat of older ones. Now the dataset(sample) we are working with is just a representative of the population, hence we can only investigate this phenomenon and try to generalize it.

In [37]:
jeopardy = jeopardy.sort_values('air_date',ascending = True)
question_overlap = []
terms_used = set()
for i,row in jeopardy.iterrows():
    split_question = row["clean_question"].split()
    for characters in split_question:
        if len(characters) <=6:
            split_question.remove(characters)
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
            
jeopardy["question_overlap"] = question_overlap
    

In [39]:
jeopardy["question_overlap"].head(10)

19325    0.000000
19301    0.000000
19302    0.000000
19303    0.200000
19304    0.142857
19305    0.000000
19306    0.000000
19307    0.200000
19308    0.166667
19309    0.000000
Name: question_overlap, dtype: float64

In [50]:
terms_used.d # xóa .d để hiện kết quả

AttributeError: 'set' object has no attribute 'd'

In [51]:
jeopardy["question_overlap"].mean()

0.8258987099594451

The percentage is around - 82.58%. This is a considerable amount but we are only considering unigrams. This high percentage can be because certain words repeat multiple times but not neccessarily in the same context.


## Low Value vs High Value Question

The game is all about answering questions and earning money for every correct answer. So let us try to seggregate our analysis into high value questions and low value questions.

Let us consider a threshold for high and low separation.

## Determine Value

In [55]:
def determine_value(row):
    value = 0
    if row["value"] > 800:
        value = 1
    return value

## Determine which questions are high and low value.

In [56]:
jeopardy["high_value"] =  jeopardy.apply(determine_value,axis = 1)

In [58]:
jeopardy["high_value"].head()

19325    0
19301    0
19302    0
19303    0
19304    0
Name: high_value, dtype: int64

In [59]:
## Muốn lặp DF thì dùng iterrows để lặp các dòng
## Tạo ra 1 function chỉ ra "từ cần tìm" trong câu hỏi xuất hiện ở high_value bao nhiêu lần, ở lơ_value bap nhiêu lần
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count


In [60]:
count_usage("term")

(89, 189)

In [61]:
jeopardy.head() 

Unnamed: 0,show_number,air_date,round,category,value,question,answer,clean_question,clean_answer,answer_in_question,question_overlap,high_value
19325,10,1984-09-21,Final Jeopardy!,U.S. PRESIDENTS,0,"Adventurous 26th president, he was 1st to ride...",Theodore Roosevelt,adventurous 26th president he was 1st to ride...,theodore roosevelt,0.0,0.0,0
19301,10,1984-09-21,Double Jeopardy!,LABOR UNIONS,200,Notorious labor leader missing since '75,Jimmy Hoffa,notorious labor leader missing since 75,jimmy hoffa,0.0,0.0,0
19302,10,1984-09-21,Double Jeopardy!,1789,200,"Washington proclaimed Nov. 26, 1789 this first...",Thanksgiving,washington proclaimed nov 26 1789 this first...,thanksgiving,0.0,0.0,0
19303,10,1984-09-21,Double Jeopardy!,TOURIST TRAPS,200,Both Ferde Grofe' & the Colorado River dug thi...,the Grand Canyon,both ferde grofe the colorado river dug thi...,the grand canyon,0.0,0.2,0
19304,10,1984-09-21,Double Jeopardy!,LITERATURE,200,"Depending on the book, he could be a ""Jones"", ...",Tom,depending on the book he could be a jones ...,tom,0.0,0.142857,0


### RANDOM 
Now we have this, let us use the set words_used that we created earlier and observe the frequency of that word for high and low value questions.

In [63]:
import random 

terms_used_list = list(terms_used)
comparison_terms = random.sample(terms_used_list,10)

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

In [64]:
observed_expected

[(0, 1),
 (1, 0),
 (1, 0),
 (0, 2),
 (0, 1),
 (2, 2),
 (0, 1),
 (1, 1),
 (0, 1),
 (0, 1)]

## Chi-squared Test

In [65]:
high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
high_value_count

5734

In [66]:
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]
low_value_count


14265

In [69]:
from scipy.stats import chisquare
import numpy as np
chi_squared = []
for obs in observed_expected:
    total = jeopardy.value[0] + jeopardy.value[1]
    total_prop = total/(jeopardy.shape[0])
    high_value_expect = total_prop * high_value_count
    low_value_expecet = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expect = np.array([high_value_expect,low_value_expecet])
    chi_squared.append(chisquare(observed,expect))
    
chi_squared

[Power_divergenceResult(statistic=398.00350490711526, pvalue=1.498129036812735e-88),
 Power_divergenceResult(statistic=398.00871948029294, pvalue=1.4942183328599873e-88),
 Power_divergenceResult(statistic=398.00871948029294, pvalue=1.4942183328599873e-88),
 Power_divergenceResult(statistic=396.01401962846126, pvalue=4.0610956130963573e-88),
 Power_divergenceResult(statistic=398.00350490711526, pvalue=1.498129036812735e-88),
 Power_divergenceResult(statistic=392.0488975496332, pvalue=2.9636897808839765e-87),
 Power_divergenceResult(statistic=398.00350490711526, pvalue=1.498129036812735e-88),
 Power_divergenceResult(statistic=396.0122243874083, pvalue=4.064751739805223e-88),
 Power_divergenceResult(statistic=398.00350490711526, pvalue=1.498129036812735e-88),
 Power_divergenceResult(statistic=398.00350490711526, pvalue=1.498129036812735e-88)]

For every word, the p-value is much higher than the threshold - 0.05. Hence we fail to reject the null hypothesis. This means that by examining these 5 words, we found no statistical significance suggesting that these words can help us identify the type of question (high-value or low-value) we are dealing with.

The above result is only for 5 terms, and maybe inconclusive of the bigger picture. Thus let us try it again with more words.

In [70]:
chi_squared = {}
chi_test(chi_squared ,comparison_terms )
chi_squared

NameError: name 'chi_test' is not defined

In [71]:
jeopardy['round'].value_counts()

Jeopardy!           9901
Double Jeopardy!    9762
Final Jeopardy!      335
Tiebreaker             1
Name: round, dtype: int64

Looking at the data and making its cross table with the value_level column, we can tell that Doube Jeopardy round holds the most high-value questions. But how do we know whether this phenomenon if just by chance (for this sample) or is this true for the population.

In [80]:
cross_table = pd.crosstab(jeopardy['round'],jeopardy['high_value'])
cross_table

high_value,0,1
round,Unnamed: 1_level_1,Unnamed: 2_level_1
Double Jeopardy!,5340,4422
Final Jeopardy!,335,0
Jeopardy!,8589,1312
Tiebreaker,1,0


For this purpose, let us perform a chi-square test using the scipy.stats.chi2_contingency function on the cross table.
The null hypothesis is that there is no correlation between the rounds and the value level of the questions.

The alternative hypothesis is that there exists some correlation between the rounds and value level of the questions.

In [81]:
from scipy.stats import chisquare,chi2_contingency
chi_sq,p_value,dof,expected = chi2_contingency(cross_table)
p_value

0.0

In [123]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,8))
cross_table[0].plot.bar(align='center',color='#009999',label='low-level',width=0.25)
cross_table[1].plot.bar(align='edge',color = '#ff9933',label='high-level',width=0.25)
plt.legend()
plt.yticks([])
plt.xticks(rotation=0)
plt.ylabel('Number of questions')
plt.xlabel('Round')
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

In [83]:
jeopardy .category.value_counts()

TELEVISION                    51
U.S. GEOGRAPHY                50
LITERATURE                    45
HISTORY                       40
BEFORE & AFTER                40
AMERICAN HISTORY              40
AUTHORS                       39
WORD ORIGINS                  38
WORLD CAPITALS                37
SPORTS                        36
BODIES OF WATER               36
RHYME TIME                    35
SCIENCE                       35
MAGAZINES                     35
SCIENCE & NATURE              35
WORLD GEOGRAPHY               33
WORLD HISTORY                 32
ANNUAL EVENTS                 32
HISTORIC NAMES                32
FICTIONAL CHARACTERS          31
BIRDS                         31
IN THE DICTIONARY             31
U.S. PRESIDENTS               30
MEDICINE                      30
OPERA                         30
TRAVEL & TOURISM              30
ISLANDS                       30
POTPOURRI                     30
BALLET                        29
ART                           28
          

In [86]:
catgs = jeopardy.category.value_counts().sort_values(ascending=False)[:10].index

def observed(catg):
    high_count = 0
    low_count = 0
    
    for i,row in jeopardy.iterrows():
        if row.category == catg:
            if row.high_value == 1:
                high_count += 1
            else:
                low_count += 1
                
    return high_count,low_count

observed_values = []
for catg in catgs:
    observed_values.append(observed(catg))
    
observed_values

[(6, 45),
 (14, 36),
 (13, 32),
 (7, 33),
 (19, 21),
 (12, 28),
 (6, 33),
 (12, 26),
 (8, 29),
 (6, 30)]

In [88]:
chi_squared = {}

def chi_test(observed_values,catgs):
    high_value_count = np.count_nonzero(jeopardy.high_value == 1)
    low_value_count = np.count_nonzero(jeopardy.high_value == 0)

    total_rows = len(jeopardy)

    for catg,l in zip(catgs,observed_values):
        total = sum(l)
        total_prop = total / total_rows

        expected_high = total_prop * high_value_count
        expected_low = total_prop * low_value_count

        observed = np.array([l[0],l[1]])
        expected = np.array([expected_high,expected_low])

        chi_squared[catg] = chisquare(observed,expected)
    
chi_test(observed_values,catgs)
chi_squared

{'AMERICAN HISTORY': Power_divergenceResult(statistic=0.034523405991355754, pvalue=0.8525978776056389),
 'AUTHORS': Power_divergenceResult(statistic=3.366616811569767, pvalue=0.06653024865486724),
 'BEFORE & AFTER': Power_divergenceResult(statistic=6.933964236239863, pvalue=0.008457402515593288),
 'HISTORY': Power_divergenceResult(statistic=2.4409838293691184, pvalue=0.11820206989580724),
 'LITERATURE': Power_divergenceResult(statistic=0.0010404942221835507, pvalue=0.9742673454576186),
 'SPORTS': Power_divergenceResult(statistic=2.5368632703677747, pvalue=0.11121553152067523),
 'TELEVISION': Power_divergenceResult(statistic=7.1281427377644, pvalue=0.007588328882660597),
 'U.S. GEOGRAPHY': Power_divergenceResult(statistic=0.011022071015878569, pvalue=0.9163868768161757),
 'WORD ORIGINS': Power_divergenceResult(statistic=0.15707760152502, pvalue=0.6918614751677927),
 'WORLD CAPITALS': Power_divergenceResult(statistic=0.8991742998170983, pvalue=0.3430032082144119)}

Here we can see the majority of topics do not have p_value <= 0.05, meaning for these topics we fail to reject the null hypothesis. However, for two topics - TELEVISION and BEFORE & AFTER, the null hypothesis is rejected and hence can be said that it does have a correlation with the value levels.

We have only performed these tests for the top 10 most frequent categories (topics) in the data. Let us perform the same for the top 20 categories (topics).

In [90]:
catgs = jeopardy.category.value_counts().sort_values(ascending=False)[:20].index

observed_values = []
for catg in catgs:
    observed_values.append(observed(catg))
    
print(catgs)
observed_values

Index(['TELEVISION', 'U.S. GEOGRAPHY', 'LITERATURE', 'HISTORY',
       'BEFORE & AFTER', 'AMERICAN HISTORY', 'AUTHORS', 'WORD ORIGINS',
       'WORLD CAPITALS', 'SPORTS', 'BODIES OF WATER', 'MAGAZINES',
       'SCIENCE & NATURE', 'RHYME TIME', 'SCIENCE', 'WORLD GEOGRAPHY',
       'WORLD HISTORY', 'ANNUAL EVENTS', 'HISTORIC NAMES',
       'FICTIONAL CHARACTERS'],
      dtype='object')


[(6, 45),
 (14, 36),
 (13, 32),
 (7, 33),
 (19, 21),
 (12, 28),
 (6, 33),
 (12, 26),
 (8, 29),
 (6, 30),
 (5, 31),
 (6, 29),
 (14, 21),
 (7, 28),
 (15, 20),
 (6, 27),
 (7, 25),
 (7, 25),
 (10, 22),
 (8, 23)]

In [91]:
chi_squared = {}
chi_test(observed_values,catgs)
chi_squared

{'AMERICAN HISTORY': Power_divergenceResult(statistic=0.034523405991355754, pvalue=0.8525978776056389),
 'ANNUAL EVENTS': Power_divergenceResult(statistic=0.72276851787158, pvalue=0.395237283932548),
 'AUTHORS': Power_divergenceResult(statistic=3.366616811569767, pvalue=0.06653024865486724),
 'BEFORE & AFTER': Power_divergenceResult(statistic=6.933964236239863, pvalue=0.008457402515593288),
 'BODIES OF WATER': Power_divergenceResult(statistic=3.846697168272983, pvalue=0.049844052805717624),
 'FICTIONAL CHARACTERS': Power_divergenceResult(statistic=0.1244206806982202, pvalue=0.7242884987195274),
 'HISTORIC NAMES': Power_divergenceResult(statistic=0.10403841390560437, pvalue=0.7470361694584846),
 'HISTORY': Power_divergenceResult(statistic=2.4409838293691184, pvalue=0.11820206989580724),
 'LITERATURE': Power_divergenceResult(statistic=0.0010404942221835507, pvalue=0.9742673454576186),
 'MAGAZINES': Power_divergenceResult(statistic=2.27460770890725, pvalue=0.13150839701567232),
 'RHYME TI

We have new additions to our list of topics that have correlation with the value levels, they are - SPORTS, SCIENCE, SCIENCE & NATURE, BIRDS and the ones from previous analysis as well as this, TELEVISION and BEFORE & AFTER.

Let us make a cross table for these topics, to understand the frequencies of these topics with respect to the value level.

In [94]:
catg_interest = [
    'SPORTS',
    'SCIENCE',
    'SCIENCE & NATURE',
    'BIRDS',
    'TELEVISION',
    'BEFORE & AFTER'
]

subset = jeopardy[jeopardy.category.isin(catg_interest)]
cross_table = pd.crosstab(subset.category,subset.high_value)
cross_table

high_value,0,1
category,Unnamed: 1_level_1,Unnamed: 2_level_1
BEFORE & AFTER,21,19
BIRDS,29,2
SCIENCE,20,15
SCIENCE & NATURE,21,14
SPORTS,30,6
TELEVISION,45,6


In [132]:
plt.figure(figsize=(12,8))
cross_table[0].plot.bar(align='center',color='#009999',label='low-level',width=0.25)
cross_table[1].plot.bar(align='edge',color = '#ff9933',label='high-level',width=0.25)
plt.legend()
plt.yticks([])
plt.xticks(rotation=0)
plt.ylabel('Number of questions')
plt.xlabel('Topics')
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['left'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.show()

<matplotlib.figure.Figure at 0x7f5999599278>


I found a path object that I don't think is part of a bar chart. Ignoring.



Looking into the cross table, the plot and the p_values obtained from before, we can say that the topics SPORTS, TELEVISION, BIRDS have a higher chance of being low-level questions, whereas the topics BEFORE & AFTER, SCIENCE and SCIENCE & NATURE have a higher chance of being high-level questions.

From our analysis, we can conclude :-

1. The answers are hardly hidden in the questions and hence the participant has to be revised with all categories (topics).

2. The repetition of questions is rare, the participant must not rely on reading previous questions only to win the game.

3. No relationship was found between the level of the question (>750 or <750 dollars) and the words present in the questions. Thus the participant cannot estimate the level of the question with respect to words in  the question.

4. The first round, Jeopardy! hosts mostly low-level (<750 dollars) questions. Whereas the second round    Double Jeopardy! hosts high-level (>750 dollars) questions. Participant's aim to win more money can utilize these findings and play accordingly.

5. The categories (topics) - SPORTS, TELEVISION and BIRDS have a higher chance of having low-level (<750 dollars) questions, whereas the categories (topics) BEFORE & AFTER, SCIENCE and SCIENCE & NATURE have a highe chance of having high-level (> 750 dollars) questions. 

From the above conclusions, the participant can accordingly prepare and choose to answer questions in the game in order to win more money and overall be successful in the game.