"We need you to use a collection of questions and answers from the television show 'Jeopardy!' to build a classifier that might work. It is a long-shot, since there isn't a lot of linguistic difference between low- and high-value questions, but even if you can demonstrate it can't be done effectively using some common approaches, it will at least give us something to go on."

**Basics**

Parse, clean, and organize the Jeopardy! question data file to train a Naive Bayesian classifier.

Your aim here is to make sense of the data presented, and create a binary classifier ("high value" and "low value," based on the points available for each) for questions. Despite the large number of questions, this is an extraordinarily difficult classification problem. 

Consider it as a human coder: how often could you tell "easy" questions from "hard" ones? The degree to which you are successful in this is largely based on your own contextual knowledge--indeed, you might be tempted to classify questions you know the answer to as "easy" and those you do not as "hard." The computer doesn't know the answers to any of these.

For that reason, do not be discouraged if your classifier does not perform well. This constitutes an especially difficult problem for a simple classifier to solve.

Put the script and its output (which may merely report the accuracy of the trial) in your GitHub repository, and share the link/filenames when you start your quiz.

In [20]:
#import dependencies
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import os
import re
import json
from string import punctuation
from datetime import datetime

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\denee\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\denee\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\denee\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Next, import the 'Jeopardy!' JSON file and view the structure

In [21]:
#import the jeopardy JSON file
file_path = 'jeopardy.json'

with open(file_path, 'r') as file:
    jeopardy_questions = json.load(file)

# Print the first record to inspect its structure (each record is a dictionary)
print(jeopardy_questions[0])  

{'category': 'HISTORY', 'air_date': '2004-12-31', 'question': "'For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory'", 'value': '$200', 'answer': 'Copernicus', 'round': 'Jeopardy!', 'show_number': '4680'}


Now, clean and organize the questions into "high" or "low" values:

In [25]:
# The function below converts the 'value' string into an integer by removing dollar signs and commas.

# I ran into a lot of errors here: not all 'value' entries were strings or integers, so the function handles both cases as well as missing values.

def clean_value(value):
    """Ensure the value is returned as an integer."""
    # Handle cases where the value is None
    if value is None:
        return 0
    if isinstance(value, str):
        try:
            # Strip '$' and ',' and convert the string to an integer
            value = int(value.replace('$', '').replace(',', ''))
        except ValueError:
            # Fall back to 0 if the string cannot be converted
            value = 0
    elif not isinstance(value, int):
        # Fall back to 0 if the value is neither a string nor an integer
        value = 0
    return value
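
One side effect of returning 0 for missing or unparseable values is that those questions fall below any positive threshold and end up labeled "low" later on. A possible alternative, sketched here with a hypothetical helper named `parse_value` (not part of the notebook), is to return `None` so that such questions can be filtered out before labeling. The sample inputs are made up for illustration.

```python
def parse_value(value):
    """Hypothetical variant of clean_value: return an int, or None if there is no usable value."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        try:
            # Strip '$' and ',' before converting, as clean_value does
            return int(value.replace('$', '').replace(',', ''))
        except ValueError:
            return None
    # None and any other type are treated as "no usable value"
    return None

# Illustrative inputs only; these are not taken from the dataset
samples = ['$200', '$2,000', None, 'n/a', 800]
parsed = [parse_value(v) for v in samples]
print(parsed)  # [200, 2000, None, None, 800]
```

With this variant, questions whose parsed value is `None` could be skipped with a list comprehension before the labeling step. Whether that is preferable to labeling them "low" is a judgment call.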

In [26]:
# The next function determines whether a question is of 'high' or 'low' value, as determined by a particular threshold:

def determine_value(question, threshold=800):
    """Determine if the question is of high or low value based on the threshold."""
    if question['value'] > threshold:
        return 'high'
    else:
        return 'low'
    
# Clean the 'value' field and add a 'value_label' field for each question
for question in jeopardy_questions:
    question['value'] = clean_value(question['value'])
    question['value_label'] = determine_value(question)   
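
Because the comparison is strict (`>`), a question worth exactly the threshold is labeled "low". It can also help to count how many questions land in each class before training, since accuracy is hard to interpret without knowing the class balance. The sketch below uses made-up toy values and a hypothetical `label_value` helper that applies the same rule as `determine_value`.

```python
from collections import Counter

# Toy stand-in for the cleaned records (values are made up for illustration)
toy_questions = [{'value': 200}, {'value': 400}, {'value': 800},
                 {'value': 1000}, {'value': 2000}]

def label_value(value, threshold=800):
    """Same rule as determine_value: strictly greater than the threshold is 'high'."""
    return 'high' if value > threshold else 'low'

# Count how many toy questions fall into each class
label_counts = Counter(label_value(q['value']) for q in toy_questions)
print(label_counts)  # the 800 question counts as 'low'
```

In the notebook itself, the same count could be taken over `question['value_label']` for every entry in `jeopardy_questions`.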

Next, I pre-process the text data (the questions) to make it suitable for training a Naive Bayes classifier.

In [29]:
# Tokenize, remove stopwords, and lemmatize text.
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # Build the stopword set once instead of once per token

def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # Remove non-alphabetical characters
    tokens = word_tokenize(text)  # Tokenize
    # Remove stopwords and lemmatize
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return cleaned_tokens

# Pre-process the question text
for question in jeopardy_questions:
    question['cleaned_question'] = preprocess_text(question['question'])
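
To see what the lowercasing and regular-expression steps do on their own, here is a small sketch that applies them to the sample question printed earlier. It uses a plain whitespace split as a stand-in for `word_tokenize` and skips stopword removal and lemmatization, so it illustrates only the first two steps, not the full function.

```python
import re

sample = ("'For the last 8 years of his life, Galileo was under house arrest "
          "for espousing this man's theory'")

lowered = sample.lower()
# Keep only lowercase letters and whitespace; digits, quotes, and commas are dropped
letters_only = re.sub(r'[^a-z\s]', '', lowered)

# Whitespace split is a simplification used here instead of word_tokenize
tokens = letters_only.split()
print(tokens)
```

Note that the digit "8" disappears entirely and "man's" becomes "mans", because the apostrophe is removed rather than treated as a word boundary.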

Next, I work on feature extraction.

For Naive Bayes, I convert the questions into a format that the algorithm can process, specifically a *'bag-of-words'* model, in which each question is represented by counts of the words it contains.

In [30]:
# Convert cleaned questions into a list for CountVectorizer
cleaned_questions = [" ".join(question['cleaned_question']) for question in jeopardy_questions]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned_questions)
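
As a rough picture of what a bag-of-words matrix looks like, the sketch below builds one by hand for two made-up documents. This is an illustration of the idea only, not how `CountVectorizer` is implemented: `CountVectorizer` applies its own tokenization rules and returns a sparse matrix rather than nested lists.

```python
from collections import Counter

# Two made-up, already-cleaned documents
docs = ["galileo house arrest", "house theory house"]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document, one column per vocabulary word, each cell a count
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['arrest', 'galileo', 'house', 'theory']
print(matrix)      # [[1, 1, 1, 0], [0, 0, 2, 1]]
```

Word order is discarded. Only the counts survive, and the counts are the only input the Naive Bayes classifier sees.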

In [31]:
# Prepare the target variable: the "high" or "low" label for each question
y = [question['value_label'] for question in jeopardy_questions]

Finally, the following code splits the data into training and testing sets and trains the Naive Bayes classifier.

In [32]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predict on the test set
y_pred = classifier.predict(X_test)

# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.6935416954778039


An accuracy of approximately 69.35% means that the classifier correctly labels a 'Jeopardy!' question as "high" or "low" value about 69.35% of the time on the held-out test set, using only the question's text.
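
An accuracy figure is easier to judge next to a simple baseline: the accuracy obtained by always predicting the most frequent label. The sketch below computes that baseline with a hypothetical helper, `majority_baseline_accuracy`, on made-up toy labels. The notebook does not report the actual class balance, so no baseline figure for the real data is given here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label (hypothetical helper)."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Made-up labels for illustration: three 'low', one 'high'
toy_labels = ['low', 'low', 'low', 'high']
baseline = majority_baseline_accuracy(toy_labels)
print(f'Baseline accuracy: {baseline}')  # 0.75 for these toy labels
```

In the notebook, calling the helper on `y_test` would give the number to compare against the classifier's accuracy. The classifier is only adding value to the extent that it beats that baseline.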

This isn't bad, but it could be better. We could explore additional features and try more complex models (that I have yet to learn) to improve the model's predictive accuracy.