# Analysing Text: Winning Jeopardy

In [23]:
import pandas

jeopardy = pandas.read_csv("data/jeopardy.csv")

jeopardy.head(3)

| | Show Number | Air Date | Round | Category | Value | Question | Answer |
|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona |


As you can see, each row in the dataset represents a single question on a single episode of Jeopardy. Here are explanations of each column:

* Show Number -- the Jeopardy episode number of the show this question was in.
* Air Date -- the date the episode aired.
* Round -- the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
* Category -- the category of the question.
* Value -- the number of dollars answering the question correctly is worth.
* Question -- the text of the question.
* Answer -- the text of the answer.


In [24]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

Every column name except the first starts with a space. Let's remove leading and trailing whitespace from the column names.

In [25]:
columns = [x.strip(' ') for x in jeopardy.columns]

In [26]:
columns

['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [27]:
jeopardy.columns = columns

In [28]:
import re
import string
def normalise(s):
    s = s.lower()
    # Removing punctuation
    pattern = '[{}]'.format(re.escape(string.punctuation))
    regexp = re.compile(pattern)
    return regexp.sub('', s)
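To check what `normalise` does, here is a minimal self-contained sketch. It repeats the function with the pattern compiled once at module level (a small variation, not the notebook's exact code) and applies it to a made-up sample string.

```python
import re
import string

# Compile once: a character class matching any ASCII punctuation mark.
PUNCTUATION = re.compile('[{}]'.format(re.escape(string.punctuation)))

def normalise(s):
    """Lowercase the text and remove ASCII punctuation."""
    return PUNCTUATION.sub('', s.lower())

# Made-up sample input, for illustration only.
sample = "No. 2: 1912 Olympian; football star"
print(normalise(sample))  # no 2 1912 olympian football star
```

Note that apostrophes are removed as well, so a word such as `ESPN's` becomes `espns`.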

In [29]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalise)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalise)

In [30]:
def normalise_dollars(s):
    s = normalise(s)
    try:
        return int(s)
    except Exception:
        return 0
    

In [31]:
jeopardy["clean_value"] = jeopardy["Value"].apply(normalise_dollars)
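The following sketch shows how the dollar clean-up behaves on a few inputs. The strings `"$1,000"` and `"None"` are assumed examples and are not taken from the output above, and the sketch catches `ValueError` specifically, which is slightly narrower than the notebook's `except Exception`.

```python
import re
import string

def normalise(s):
    """Lowercase the text and remove ASCII punctuation."""
    pattern = '[{}]'.format(re.escape(string.punctuation))
    return re.sub(pattern, '', s.lower())

def normalise_dollars(s):
    """Turn a value such as '$200' into an integer; return 0 if that fails."""
    s = normalise(s)
    try:
        return int(s)
    except ValueError:
        # int() raises ValueError for text that is not a whole number.
        return 0

# Assumed sample inputs, for illustration only.
for raw in ["$200", "$1,000", "None"]:
    print(raw, "->", normalise_dollars(raw))
```

One consequence of this design is that any value that cannot be parsed becomes 0, so it is later counted together with the low-value questions.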

In [32]:
jeopardy["Air Date"] = pandas.to_datetime(jeopardy["Air Date"])
jeopardy.head(3)

| | Show Number | Air Date | Round | Category | Value | Question | Answer | clean_question | clean_answer | clean_value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4680 | 2004-12-31 | Jeopardy! | HISTORY | $200 | For the last 8 years of his life, Galileo was ... | Copernicus | for the last 8 years of his life galileo was u... | copernicus | 200 |
| 1 | 4680 | 2004-12-31 | Jeopardy! | ESPN's TOP 10 ALL-TIME ATHLETES | $200 | No. 2: 1912 Olympian; football star at Carlisl... | Jim Thorpe | no 2 1912 olympian football star at carlisle i... | jim thorpe | 200 |
| 2 | 4680 | 2004-12-31 | Jeopardy! | EVERYBODY TALKS ABOUT IT... | $200 | The city of Yuma in this state has a record av... | Arizona | the city of yuma in this state has a record av... | arizona | 200 |


## Answering questions
### Can we figure out the answer from the question?

One way to test this is to count how many words from the answer also appear in the question.

We discard words that are unlikely to help identify the answer (e.g., 'the', 'this', 'a') by keeping only answer words longer than four characters, and then compute the fraction of those words that also appear in the question.

In [33]:
def process(row):
    quest_words = row['clean_question'].split()
    answ_words = row['clean_answer'].split()
    match_count = 0
    # To get rid of common words such as 'the', 'this', 'a', etc: we can eliminate short words.
    answ_words = [word for word in answ_words if len(word)>4]
    if len(answ_words) == 0:
        return 0
    for i in answ_words:
        if i in quest_words:
            match_count += 1
    return match_count / len(answ_words)

jeopardy["answer_in_question"] = jeopardy.apply(process, axis=1)  

print(jeopardy["answer_in_question"].mean())

0.02989317744953176


On average, only about 3% of the longer answer words also appear in the question. Hearing the question alone is therefore unlikely to give the answer away.
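To make the per-row calculation concrete, here is a small sketch of the same logic applied to plain dictionaries instead of DataFrame rows. The function name `answer_overlap` and the three rows are hypothetical and chosen only to show the three possible outcomes.

```python
def answer_overlap(row):
    """Fraction of longer answer words (5+ characters) found in the question."""
    question_words = set(row['clean_question'].split())
    answer_words = [w for w in row['clean_answer'].split() if len(w) > 4]
    if not answer_words:
        # No answer word survived the length filter.
        return 0
    matches = sum(1 for w in answer_words if w in question_words)
    return matches / len(answer_words)

# Hypothetical rows, for illustration only.
rows = [
    {'clean_question': 'this state is home to the arizona cardinals',
     'clean_answer': 'arizona'},
    {'clean_question': 'no 2 1912 olympian football star',
     'clean_answer': 'jim thorpe'},
    {'clean_question': 'he wrote the book',
     'clean_answer': 'poe'},
]

for row in rows:
    print(row['clean_answer'], '->', answer_overlap(row))
```

The first row scores 1.0 because `arizona` appears in the question, the second scores 0.0 because `thorpe` does not (`jim` is filtered out as too short), and the third scores 0 because no answer word passes the length filter.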

### How often are words repeated within and across questions?

We can investigate how often longer words (6 or more characters) recur.

In [41]:
question_overlap = []
terms_used = set()
for i, row in jeopardy.iterrows():
    # Keep only words of 6 or more characters.
    split_question = [q for q in row["clean_question"].split() if len(q) > 5]
    match_count = 0
    for word in split_question:
        # Count the word if it was already seen, in this or an earlier question.
        if word in terms_used:
            match_count += 1
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.8734646239825284

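On average, about 87% of the longer words in a question have already appeared earlier in the dataset. This measure only looks at single words, not phrases, so a high overlap does not necessarily mean that whole questions are repeated.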
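The running-set approach can be illustrated on a handful of strings. This is a minimal sketch with a hypothetical helper `question_overlaps` and made-up questions; it follows the same rule as the loop above (words of 6 or more characters, counted if already seen).

```python
def question_overlaps(questions, min_length=6):
    """For each question, the share of long words that were seen before."""
    terms_used = set()
    overlaps = []
    for question in questions:
        words = [w for w in question.split() if len(w) >= min_length]
        seen_before = 0
        for word in words:
            if word in terms_used:
                seen_before += 1
            terms_used.add(word)
        overlaps.append(seen_before / len(words) if words else 0)
    return overlaps

# Made-up questions, for illustration only.
questions = [
    "this american president signed the treaty",
    "the treaty was signed by this president in paris",
    "a river in africa",
]
print(question_overlaps(questions))  # [0.0, 1.0, 0.0]
```

The result depends on the order of the rows: the first question can only match words repeated within itself, while later questions are compared against everything seen so far.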

We can figure out which terms are associated with high-value questions using a chi-squared test. First, split the questions into two categories:

* Low value -- Any row where Value is 800 or less.
* High value -- Any row where Value is greater than 800.


In [44]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split():
            if row["clean_value"] > 800:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

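# Take five arbitrary terms (sets are unordered, so the selection can vary between runs).
comparison_terms = list(terms_used)[:5]
# Observed (high_count, low_count) pairs, one per term.
observed_expected = []
for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected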


[(1, 0), (1, 0), (4, 6), (1, 0), (1, 2)]
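The counts above are only the observed values. A chi-squared test also needs the counts we would expect if a term were used equally often in high- and low-value questions. The sketch below shows one way to compute the statistic in plain Python. The function name is hypothetical, and the totals `high_rows` and `low_rows` are placeholder numbers, not figures from this dataset; in the notebook they would be the number of rows with `clean_value` above 800 and the number of remaining rows.

```python
import math

def chi_squared_for_term(high_count, low_count, high_rows, low_rows):
    """Chi-squared statistic and p-value (1 degree of freedom) for one term."""
    total_rows = high_rows + low_rows
    total_uses = high_count + low_count
    # Expected counts if the term were spread across both groups in
    # proportion to the group sizes.
    expected_high = total_uses * high_rows / total_rows
    expected_low = total_uses * low_rows / total_rows
    chi2 = ((high_count - expected_high) ** 2 / expected_high
            + (low_count - expected_low) ** 2 / expected_low)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# HYPOTHETICAL group sizes, used only to make the example runnable.
high_rows, low_rows = 6000, 14000

# Observed (high_count, low_count) pairs from the output above.
observed = [(1, 0), (1, 0), (4, 6), (1, 0), (1, 2)]
for high_count, low_count in observed:
    chi2, p_value = chi_squared_for_term(high_count, low_count, high_rows, low_rows)
    print((high_count, low_count), round(chi2, 3), round(p_value, 3))
```

With counts this small the test is not reliable, so in practice it is better suited to terms that occur many times. If SciPy is available, `scipy.stats.chisquare` applied to the same observed and expected counts should give matching values.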