**Essentials**

Extend the classification attempt above by trying at least two other approaches to classifying the text.

These approaches might also prepare the materials in different ways.

Given the semantic value of these short texts, using word embeddings might lead to more effective vectorization.

You might try:

--Support Vector Machines

--More deep learning approaches, following a TensorFlow tutorial.

In [8]:
!pip install tensorflow


In [9]:
# Import dependencies (only the modules the cells below actually use)
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.utils import to_categorical




In [10]:
#import the jeopardy JSON file
file_path = 'jeopardy.json'

with open(file_path, 'r') as file:
    jeopardy_questions = json.load(file)

**First, I will attempt classification with TensorFlow, using a deep learning model (LSTM)**

In [12]:
# Convert 'value' to numerical and categorize as high or low
value_threshold = 800  # Define threshold for high/low value
for question in jeopardy_questions:
    # If 'value' is not a string/integer, set a default value 0
    if question['value'] is None or (not isinstance(question['value'], str) and not isinstance(question['value'], int)):
        question['value'] = 0
    elif isinstance(question['value'], str):
        # If 'value' is a string, remove $ and commas and convert to int
        question['value'] = int(question['value'].replace('$', '').replace(',', ''))
    # If 'value' is an int, no changes
    
    # Categorize as 'high' or 'low' based on the threshold
    question['value_label'] = 'high' if question['value'] > value_threshold else 'low'
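
The conversion logic in the cell above can also be factored into a small helper, which makes it easier to check on a few hand-written values. This is only a sketch: `parse_value` and `value_label` are hypothetical names, and unlike the loop above it maps unparseable strings to 0 instead of raising an error, which is an assumption about how such values should be treated.

```python
def parse_value(raw):
    """Return a dollar amount as an int; anything unparseable becomes 0 (assumption)."""
    if isinstance(raw, int):
        return raw
    if isinstance(raw, str):
        cleaned = raw.replace('$', '').replace(',', '').strip()
        return int(cleaned) if cleaned.isdigit() else 0
    return 0  # None, floats, and other types fall back to the default


def value_label(raw, threshold=800):
    """Label a clue 'high' only when its value is strictly above the threshold."""
    return 'high' if parse_value(raw) > threshold else 'low'


examples = ['$200', '$1,200', None, 800, 'n/a']
example_labels = [value_label(v) for v in examples]
print(example_labels)
```

Note that a value of exactly 800 is labeled 'low', because the comparison is strict.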

Next, I pre-process the text data (the questions) to make it suitable for training.

In [13]:
# Convert to Pandas DataFrame
df = pd.DataFrame(jeopardy_questions)

In [14]:
# Preprocess and tokenize questions
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['question'])
sequences = tokenizer.texts_to_sequences(df['question'])
max_sequence_length = max(len(x) for x in sequences)
# Pad the sequences to a common length so the model can process them together
question_padded = pad_sequences(sequences, maxlen=max_sequence_length)
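
To make the tokenize-and-pad step more concrete, here is a simplified pure-Python sketch of the same idea: words become integer ids (more frequent words get smaller ids, 0 is reserved for padding) and shorter sequences are padded on the left, which is the default side for `pad_sequences`. The helper names and the two example sentences are made up for illustration, and the sketch skips the punctuation filtering that the Keras `Tokenizer` applies by default.

```python
from collections import Counter


def fit_word_index(texts):
    """Assign ids starting at 1, most frequent word first; 0 is left for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index):
    """Replace each known word with its id; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen ids (assumes maxlen >= 1)."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]


texts = ["The river Nile", "The capital of France"]
word_index = fit_word_index(texts)
toy_sequences = to_sequences(texts, word_index)
toy_padded = pad_pre(toy_sequences, maxlen=max(len(s) for s in toy_sequences))
print(toy_padded)  # [[0, 1, 2, 3], [1, 4, 5, 6]]
```

Padding to the longest question means a single very long question sets the input length for every example; capping `maxlen` at a smaller value is a common alternative, at the cost of truncating the longest questions.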

In [15]:
# Encode labels
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(df['value_label'])
labels = to_categorical(labels)
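
As a rough illustration of what the two encoding steps produce, the sketch below reimplements them in plain Python. `LabelEncoder` assigns ids to the sorted class names, so 'high' becomes 0 and 'low' becomes 1, and the one-hot step turns each id into a two-element vector. The function names are hypothetical and the three labels are toy data.

```python
def encode_labels(labels):
    """Map each class name to an integer id, using sorted class order."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    return classes, [index[name] for name in labels]


def one_hot(ids, num_classes):
    """Turn each integer id into a vector with a single 1.0 at that position."""
    return [[1.0 if j == i else 0.0 for j in range(num_classes)] for i in ids]


classes, ids = encode_labels(['low', 'high', 'low'])
print(classes)                         # ['high', 'low']
print(ids)                             # [1, 0, 1]
print(one_hot(ids, len(classes)))      # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

The column order matters when reading the softmax output later: column 0 corresponds to 'high' and column 1 to 'low'.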

In [16]:
# Split data as seen in class
X_train, X_val, y_train, y_val = train_test_split(question_padded, labels, test_size=0.2, random_state=42)

In [17]:
# Define the model; I opted for a sequential LSTM model
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=64, input_length=max_sequence_length),
    LSTM(64),
    Dense(2, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
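
For a sense of the model's size, the trainable parameter counts of the three layers can be worked out by hand. The sketch below uses the standard formulas for an embedding layer, an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias), and a dense layer. The vocabulary size of 10,000 is a made-up placeholder, since the real value of `len(tokenizer.word_index) + 1` is not shown in this notebook.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 10_000  # hypothetical; substitute len(tokenizer.word_index) + 1
total_params = (embedding_params(vocab_size, 64)
                + lstm_params(64, 64)
                + dense_params(64, 2))
print(lstm_params(64, 64))   # 33024
print(dense_params(64, 2))   # 130
print(total_params)          # 673154 with the placeholder vocabulary size
```

With any realistic vocabulary the embedding layer dominates the total, so the vocabulary size has the largest effect on model size.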





In [19]:
# Train model
model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1a790862090>

In [22]:
# Evaluate the model on the held-out validation set (X_val, y_val)
test_loss, test_accuracy = model.evaluate(X_val, y_val)

# Print test accuracy
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.6396763920783997


**Support Vector Machines (SVM) approach for classifying text data**

Next, I will try the classification using an SVM on the 'Jeopardy!' dataset.

In [42]:
#import dependencies not mentioned above
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score
import gensim.downloader as api
import json

In [27]:
#Load GloVe model from gensim
glove_model = api.load('glove-wiki-gigaword-300')


In [43]:
# Function to create an average embedding for a text
def document_vector(word_list):
    # Assume word_list is already a list of words, directly use them
    embeddings = [glove_model[word] for word in word_list if word in glove_model]
    if not embeddings:
        return np.zeros(300)  # Return a zero vector if embeddings list is empty
    return np.mean(embeddings, axis=0)
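
The averaging idea can be tried without downloading GloVe by swapping in a tiny hand-made lookup table. In the sketch below, `toy_vectors` is a hypothetical stand-in for `glove_model` with made-up 3-dimensional vectors, and `coverage` is an extra helper (not part of the notebook) that reports what fraction of a question's words were found.

```python
# Hypothetical stand-in for the GloVe model: two words, three dimensions
toy_vectors = {
    "river": [1.0, 0.0, 2.0],
    "nile": [3.0, 2.0, 0.0],
}


def mean_vector(words, vectors, dim):
    """Average the vectors of known words; return zeros if none are known."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(column) / len(found) for column in zip(*found)]


def coverage(words, vectors):
    """Fraction of words that have a vector (0.0 for an empty list)."""
    if not words:
        return 0.0
    return sum(w in vectors for w in words) / len(words)


words = ["the", "river", "nile"]
print(mean_vector(words, toy_vectors, dim=3))  # [2.0, 1.0, 1.0]
print(coverage(words, toy_vectors))            # 2 of 3 words found
```

Checking coverage on the real data would show how many questions end up as all-zero vectors, which carry no information for the classifier.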

In [44]:
file_path = 'jeopardy.json'

with open(file_path, 'r') as file:
    jeopardy_questions = json.load(file)

In [45]:
# Convert 'value' to numerical and categorize as high or low
value_threshold = 800  # Define threshold for high/low value
for question in jeopardy_questions:
    if question['value'] is None or (not isinstance(question['value'], str) and not isinstance(question['value'], int)):
        question['value'] = 0
    elif isinstance(question['value'], str):
        question['value'] = int(question['value'].replace('$', '').replace(',', ''))
    question['value_label'] = 'high' if question['value'] > value_threshold else 'low'

In [46]:
# Convert to Pandas DataFrame
df = pd.DataFrame(jeopardy_questions)

In [47]:
# Pre-process the questions to split them into words
df['processed_questions'] = df['question'].apply(lambda x: x.lower().split())
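
One thing to watch with `x.lower().split()` is that punctuation stays attached to words, so a token such as `raven,"` will not match any entry in the embedding vocabulary. A minimal alternative, assuming that keeping only runs of letters and digits is acceptable, is a regular-expression tokenizer. `simple_tokenize` and the example sentence are hypothetical.

```python
import re


def simple_tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


raw = 'This poet wrote "The Raven,"'
whitespace_tokens = raw.lower().split()
regex_tokens = simple_tokenize(raw)
print(whitespace_tokens)  # ['this', 'poet', 'wrote', '"the', 'raven,"']
print(regex_tokens)       # ['this', 'poet', 'wrote', 'the', 'raven']
```

This pattern also splits contractions ("don't" becomes "don" and "t"), so it is a trade-off rather than a strict improvement.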

In [48]:
# Vectorize text data using the document_vector function
df['question_vector'] = df['processed_questions'].apply(document_vector)

In [49]:
#Train/Test Split
X = np.array(list(df['question_vector']))  # Convert to array
y = df['value_label'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [50]:
#Train SVM
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)
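
Because the two classes are not equally common, one option worth trying is `SVC(kernel='linear', class_weight='balanced')`, which weights each class by `n_samples / (n_classes * class_count)`. The sketch below only works through that formula in plain Python. The counts are the held-out support figures from the classification report further down, used here as a stand-in because the training-set counts are not printed in this notebook.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count), as 'balanced' does."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: n_samples / (n_classes * count)
            for name, count in class_counts.items()}


# Stand-in counts taken from the held-out report, not the training split
counts = {"high": 12194, "low": 31192}
weights = balanced_class_weights(counts)
print(weights)
```

The rarer 'high' class receives the larger weight, so misclassifying a 'high' question costs the model more during training. Whether this improves results on this dataset is untested here.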

In [51]:
#Evaluation
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
# Start the report on its own line so its column headers stay aligned
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.7189415940626008
Classification Report:
               precision    recall  f1-score   support

        high       0.00      0.00      0.00     12194
         low       0.72      1.00      0.84     31192

    accuracy                           0.72     43386
   macro avg       0.36      0.50      0.42     43386
weighted avg       0.52      0.72      0.60     43386
