In [1]:
import pandas as pd
# Raw string so the backslashes in the Windows path are not read as escape sequences.
jeopardy = pd.read_csv(r'OneDrive\Documents\my_datasets\Jeopardy.csv')
jeopardy.shape

(19999, 7)

In [2]:
jeopardy.sample(5)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
8539,5443,2008-04-16,Double Jeopardy!,SIGNS & SYMBOLS,"$2,000","At her swearing-in, Speaker Pelosi wore this c...",purple
11840,3053,1997-12-03,Double Jeopardy!,WORLD CITIES,$1000,"The Prefecture, the palace of Pizarro, still s...",Cuzco
18047,3227,1998-09-22,Jeopardy!,PHONIES,$400,A medical syndrome is named for this baron fam...,Baron von Munchausen
15498,3272,1998-11-24,Double Jeopardy!,TV COMEDY,$600,"For the Flintstones, it's Barney Rubble; for t...",Neighbors
3539,4576,2004-06-28,Jeopardy!,GONE TOMORROW?,$600,The endangerment of the New Mexico ridge-nosed...,a rattlesnake


In [3]:
jeopardy.columns

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

In [4]:
jeopardy.columns  = ['Show Number', 'Air Date', 'Round', 'Category', 'Value', 'Question', 'Answer']

In [5]:
import re

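def normalize_text(text):
    """Lowercase the text, drop punctuation, and collapse runs of whitespace."""
    text = text.lower()
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

def normalize_values(text):
    """Turn a dollar string such as "$2,000" into an int; non-numeric values become 0."""
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)
    try:
        text = int(text)
    except ValueError:
        text = 0
    return text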

In [6]:
jeopardy["clean_question"] = jeopardy["Question"].apply(normalize_text)
jeopardy["clean_answer"] = jeopardy["Answer"].apply(normalize_text)
jeopardy["clean_value"] = jeopardy["Value"].apply(normalize_values)

In [7]:
jeopardy["Air Date"] = pd.to_datetime(jeopardy["Air Date"])

In [8]:
jeopardy.sample(2)

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer,clean_question,clean_answer,clean_value
18849,4612,2004-09-28,Double Jeopardy!,ART & ARTISTS,$1600,"His ""Jolly Toper"" of the 1600s is seen <a href...",Franz Hals,his jolly toper of the 1600s is seen a hrefhtt...,franz hals,1600
9307,3820,2001-03-23,Double Jeopardy!,"FILE UNDER ""K""",$600,"His ""Ode To A Nightingale"" says, ""With beaded ...",John Keats,his ode to a nightingale says with beaded bubb...,john keats,600
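
The `clean_question` value in the first sampled row contains `a hrefhtt...`, which suggests that HTML anchor tags in the raw question survive normalization as stray tokens. Below is a minimal sketch of one way to remove them first. It assumes the markup consists of simple tags with no `>` inside attribute values. `strip_tags` and `normalize_text_no_html` are hypothetical names, and the sample string and URL are invented for illustration.

```python
import re

# Assumption: a tag never contains a literal '>' inside an attribute value.
TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text):
    """Replace each HTML tag with a space so neighbouring words stay separate."""
    return TAG_RE.sub(" ", text)

def normalize_text_no_html(text):
    """Remove tags, then lowercase, drop punctuation, collapse whitespace, and trim."""
    text = strip_tags(text).lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Invented sample; the URL is a placeholder, not taken from the dataset.
sample = 'His "Jolly Toper" is seen <a href="http://example.com/x.jpg">here</a>'
print(normalize_text_no_html(sample))
```

Removing tags before the punctuation filter keeps URL fragments out of `terms_used`, where they would otherwise count as long, unique terms.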


In [9]:
jeopardy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19999 entries, 0 to 19998
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Show Number     19999 non-null  int64         
 1   Air Date        19999 non-null  datetime64[ns]
 2   Round           19999 non-null  object        
 3   Category        19999 non-null  object        
 4   Value           19999 non-null  object        
 5   Question        19999 non-null  object        
 6   Answer          19999 non-null  object        
 7   clean_question  19999 non-null  object        
 8   clean_answer    19999 non-null  object        
 9   clean_value     19999 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(7)
memory usage: 1.5+ MB


In [10]:
def count_matches(row):
    split_answer = row["clean_answer"].split()
    split_question = row["clean_question"].split()
    if "the" in split_answer:
        split_answer.remove("the")
    if len(split_answer) == 0:
        return 0
    match_count = 0
    for item in split_answer:
        if item in split_question:
            match_count += 1
    return match_count / len(split_answer)

jeopardy["answer_in_question"] = jeopardy.apply(count_matches, axis=1)

In [11]:
jeopardy["answer_in_question"].mean()

0.05900196524977763

## Recycled questions

On average, only about 6% of the words in an answer also appear in its question. That share is small, so simply hearing a question is unlikely to give the answer away. We will probably have to study.
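
To make the 6% figure concrete, here is a small worked example of the per-row fraction being averaged. It is a sketch on invented strings, not rows from the dataset, and `match_fraction` is a hypothetical standalone version of the row-wise logic in `count_matches`.

```python
def match_fraction(clean_question, clean_answer):
    """Fraction of answer words (ignoring one 'the') that also occur in the question."""
    answer_words = clean_answer.split()
    question_words = set(clean_question.split())
    if "the" in answer_words:
        answer_words.remove("the")  # removes only the first occurrence
    if not answer_words:
        return 0
    hits = sum(1 for word in answer_words if word in question_words)
    return hits / len(answer_words)

# Invented, already-normalized examples.
print(match_fraction("this river flows through cairo", "the nile river"))  # 0.5
print(match_fraction("this poet wrote ode to a nightingale", "john keats"))  # 0.0
```

Most rows look like the second case, which is why the mean across the frame stays close to zero.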

In [12]:
question_overlap = []
terms_used = set()

jeopardy = jeopardy.sort_values("Air Date")

for i, row in jeopardy.iterrows():
    split_question = row["clean_question"].split(" ")
    # Keep only words longer than five characters to skip common short words.
    split_question = [q for q in split_question if len(q) > 5]
    match_count = 0
    for word in split_question:
        if word in terms_used:
            match_count += 1
    for word in split_question:
        terms_used.add(word)
    if len(split_question) > 0:
        match_count /= len(split_question)
    question_overlap.append(match_count)
jeopardy["question_overlap"] = question_overlap

jeopardy["question_overlap"].mean()

0.6876260592169802

## Low-value vs. high-value questions

On average, about 70% of the terms in a new question (words longer than five characters) had already appeared in an earlier question. This analysis covers only a small set of questions, and it compares single terms rather than phrases, so the figure carries little weight on its own. It does suggest that question recycling is worth investigating further.
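
One way to move from single terms toward phrases is to compare adjacent word pairs (bigrams) instead. The following is a rough sketch on invented data, not part of the original analysis. `bigrams` and `phrase_overlap` are hypothetical helpers, and the input is assumed to be normalized questions already sorted by air date.

```python
def bigrams(words):
    """Return the set of adjacent word pairs in a list of words."""
    return set(zip(words, words[1:]))

def phrase_overlap(questions):
    """For each question, in order, the share of its bigrams seen in earlier questions."""
    seen = set()
    overlaps = []
    for question in questions:
        pairs = bigrams(question.split())
        if pairs:
            overlaps.append(len(pairs & seen) / len(pairs))
        else:
            overlaps.append(0)  # questions with fewer than two words have no bigrams
        seen |= pairs
    return overlaps

# Invented, already-normalized questions in air-date order.
toy_questions = [
    "this capital city lies on the river thames",
    "the river thames flows through this capital city",
    "he wrote hamlet",
]
print(phrase_overlap(toy_questions))
```

In the notebook, the same function could be applied to `jeopardy["clean_question"]` after the sort on `Air Date`. Bigram overlap would be expected to come out much lower than the term-level figure, since a repeated phrase is a stricter match than a repeated word.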

In [13]:
def determine_value(row):
    value = 0
    if row["clean_value"] > 800:
        value = 1
    return value

jeopardy["high_value"] = jeopardy.apply(determine_value, axis=1)

In [14]:
def count_usage(term):
    low_count = 0
    high_count = 0
    for i, row in jeopardy.iterrows():
        if term in row["clean_question"].split(" "):
            if row["high_value"] == 1:
                high_count += 1
            else:
                low_count += 1
    return high_count, low_count

In [15]:
from random import choice

terms_used_list = list(terms_used)
comparison_terms = [choice(terms_used_list) for _ in range(10)]

observed_expected = []

for term in comparison_terms:
    observed_expected.append(count_usage(term))

observed_expected

[(1, 3),
 (1, 0),
 (0, 1),
 (0, 1),
 (0, 1),
 (3, 2),
 (0, 1),
 (0, 1),
 (0, 1),
 (0, 5)]

In [16]:
from scipy.stats import chisquare
import numpy as np

high_value_count = jeopardy[jeopardy["high_value"] == 1].shape[0]
low_value_count = jeopardy[jeopardy["high_value"] == 0].shape[0]

chi_squared = []
for obs in observed_expected:
    total = sum(obs)
    total_prop = total / jeopardy.shape[0]
    high_value_exp = total_prop * high_value_count
    low_value_exp = total_prop * low_value_count
    
    observed = np.array([obs[0], obs[1]])
    expected = np.array([high_value_exp, low_value_exp])
    chi_squared.append(chisquare(observed, expected))

chi_squared

[Power_divergenceResult(statistic=0.02636443308440769, pvalue=0.871013484688921),
 Power_divergenceResult(statistic=2.487792117195675, pvalue=0.11473257634454047),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.3995960878537224, pvalue=0.12136658322360773),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=0.401962846126884, pvalue=0.5260772985705469),
 Power_divergenceResult(statistic=2.00981423063442, pvalue=0.1562844540498912)]

## Chi-squared results

None of the terms showed a statistically significant difference in usage between high-value and low-value rows: every p-value is above 0.05. In addition, no term had an observed frequency higher than 5, which makes the chi-squared test unreliable here. It would be better to run the test only on terms with higher frequencies.
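
A possible way to restrict the test to frequent terms is sketched below, using only the standard library and invented rows. `term_counts`, `chi_squared_1dof`, and `chi_squared_for_frequent_terms` are hypothetical names, and the threshold of 5 on expected counts is the usual rule of thumb, taken here as an assumption. `chi_squared_1dof` is a stand-in for `scipy.stats.chisquare` with two categories, where the p-value for one degree of freedom equals `erfc(sqrt(statistic / 2))`.

```python
from collections import Counter
from math import erfc, sqrt

def term_counts(rows):
    """Count, per term, how many high- and low-value questions contain it."""
    high, low = Counter(), Counter()
    for clean_question, high_value in rows:
        target = high if high_value == 1 else low
        target.update(set(clean_question.split()))  # count each question once per term
    return high, low

def chi_squared_1dof(observed, expected):
    """Pearson chi-squared statistic and p-value for two categories (1 degree of freedom)."""
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, erfc(sqrt(stat / 2))

def chi_squared_for_frequent_terms(rows, min_expected=5):
    """Run the test only for terms whose expected counts both reach min_expected."""
    high, low = term_counts(rows)
    n_rows = len(rows)
    n_high = sum(1 for _, high_value in rows if high_value == 1)
    results = {}
    for term in set(high) | set(low):
        total = high[term] + low[term]
        expected = (total * n_high / n_rows, total * (n_rows - n_high) / n_rows)
        if min(expected) >= min_expected:
            results[term] = chi_squared_1dof((high[term], low[term]), expected)
    return results

# Invented rows of (clean_question, high_value): 20 high-value and 20 low-value.
rows = (
    [("river delta", 1)] * 12 + [("other words", 1)] * 8
    + [("river delta", 0)] * 4 + [("other words", 0)] * 15
    + [("rare term", 0)]
)
results = chi_squared_for_frequent_terms(rows)
print(sorted(results))
print(results["river"])
```

Counting all terms in a single pass also avoids calling `iterrows()` once per term, which is what makes `count_usage` slow on the full frame. In the notebook, `rows` could be built with `list(zip(jeopardy["clean_question"], jeopardy["high_value"]))`.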