<a href="https://colab.research.google.com/github/ealatorr/sds510/blob/main/Module_5/mod_5_basics_and_essentials_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook contains both module 5 Basics and Essentials.

**Start of Mod 5 Basics**

Mounted Google Drive to retrieve the files.

In [37]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Imported the packages needed to load the JSON data file, converted it to a DataFrame, and normalized the column names to lowercase. Lastly, printed the column names and the first rows to check that the data loaded properly.

In [38]:
import json
import pandas as pd
json_path = "/content/drive/MyDrive/Colab Notebooks/jeopardy.json"

with open(json_path, "r", encoding="utf-8") as f:
    raw_data = json.load(f)
jeopardy_data = pd.DataFrame(raw_data)
jeopardy_data.columns = [c.lower() for c in jeopardy_data.columns]
print("Columns:", jeopardy_data.columns.tolist())
jeopardy_data.head()


Columns: ['category', 'air_date', 'question', 'value', 'answer', 'round', 'show_number']


| | category | air_date | question | value | answer | round | show_number |
|---|---|---|---|---|---|---|---|
| 0 | HISTORY | 2004-12-31 | 'For the last 8 years of his life, Galileo was... | $200 | Copernicus | Jeopardy! | 4680 |
| 1 | ESPN's TOP 10 ALL-TIME ATHLETES | 2004-12-31 | 'No. 2: 1912 Olympian; football star at Carlis... | $200 | Jim Thorpe | Jeopardy! | 4680 |
| 2 | EVERYBODY TALKS ABOUT IT... | 2004-12-31 | 'The city of Yuma in this state has a record a... | $200 | Arizona | Jeopardy! | 4680 |
| 3 | THE COMPANY LINE | 2004-12-31 | 'In 1963, live on "The Art Linkletter Show", t... | $200 | McDonald\'s | Jeopardy! | 4680 |
| 4 | EPITAPHS & TRIBUTES | 2004-12-31 | 'Signer of the Dec. of Indep., framer of the C... | $200 | John Adams | Jeopardy! | 4680 |


This next cell cleans and prepares the data. I tried a couple of different methods, which is why it imports several packages. First, I set the high-value threshold to 1000, so any question worth $1000 or more is high value and anything below that is low value. Next, I strip dollar signs, commas, and extra whitespace from the values and lowercase the question text. I then realized that a modified stopword list could do most of the heavy lifting, so I added one in the cell after this one.

In [39]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
import string
from sklearn.feature_extraction.text import CountVectorizer
nltk.download("stopwords")

HIGH_VALUE_THRESHOLD = 1000

def clean_value(v):
    if not isinstance(v, str):
        return 0
    v = v.replace("$", "").replace(",", "").strip()
    return int(v) if v.isdigit() else 0

def clean_text(text):
    if not isinstance(text, str):
        return ""
    return text.lower().strip()

def prepare_dataframe(df):
    df = df.copy()
    df["clean_value"] = df["value"].apply(clean_value)
    df["high_value"] = (df["clean_value"] >= HIGH_VALUE_THRESHOLD).astype(int)
    df["clean_question"] = df["question"].apply(clean_text)
    df = df[df["clean_question"].str.len() > 0]
    return df

jeopardy_data = prepare_dataframe(jeopardy_data)
print(jeopardy_data["high_value"].value_counts())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


high_value
0    155622
1     61308
Name: count, dtype: int64
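
As a quick sanity check of the cleaning and labeling logic, here is a minimal sketch that runs `clean_value` on a few made-up strings. The sample inputs are hypothetical and are not rows from the dataset; the function body is copied from the cell above so the example runs on its own.

```python
HIGH_VALUE_THRESHOLD = 1000

def clean_value(v):
    # Same logic as the notebook's clean_value, repeated so this runs standalone.
    if not isinstance(v, str):
        return 0
    v = v.replace("$", "").replace(",", "").strip()
    return int(v) if v.isdigit() else 0

# Hypothetical sample inputs, not taken from jeopardy.json.
samples = ["$200", "$1,000", " $2,000 ", None, "n/a"]

cleaned = [clean_value(s) for s in samples]
labels = [int(c >= HIGH_VALUE_THRESHOLD) for c in cleaned]

print(cleaned)  # dollar amounts as integers, 0 for missing or non-numeric input
print(labels)   # 1 = high value, 0 = low value
```

Note that missing or non-numeric values are mapped to 0, so those rows end up in the low-value class rather than being dropped.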


In [40]:
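# Separate name so the nltk `stopwords` module is not shadowed; the vectorizer below uses scikit-learn's built-in "english" list, not this one.
custom_stopwords = stopwords.words("english") + list(string.punctuation)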

Now I use the vectorizer to turn the question text into numeric features, use the high-value label as y, and split the data into training and test sets in the following cells.

In [41]:
vectorizer = CountVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(jeopardy_data["clean_question"])
y = jeopardy_data["high_value"].values
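
To make the vectorizing step more concrete, here is a hand-rolled sketch of the same idea: drop stop words, keep only terms that appear in at least two documents (the role `min_df=2` plays above), and count the remaining terms per document. This is an illustration in plain Python, not scikit-learn's implementation. The toy sentences and the tiny `TOY_STOP_WORDS` set are assumptions made up for the example, and whitespace splitting is a simplification of the real tokenizer.

```python
from collections import Counter

# Hypothetical miniature stop-word list; scikit-learn's "english" list is much longer.
TOY_STOP_WORDS = {"this", "has", "a", "for", "the", "of", "most"}
MIN_DF = 2  # keep terms found in at least this many documents

# Made-up example sentences, not rows from the dataset.
toy_questions = [
    "this state has a record for sunshine",
    "this state borders the pacific ocean",
    "the ocean covers most of the planet",
]

def tokenize(text):
    # Lowercase, split on whitespace, and drop stop words.
    return [tok for tok in text.lower().split() if tok not in TOY_STOP_WORDS]

# Document frequency: in how many documents does each term occur?
doc_freq = Counter()
for question in toy_questions:
    doc_freq.update(set(tokenize(question)))

vocabulary = sorted(term for term, df in doc_freq.items() if df >= MIN_DF)

def to_counts(text):
    # One count per vocabulary term, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

toy_matrix = [to_counts(q) for q in toy_questions]
print(vocabulary)
print(toy_matrix)
```

Only "ocean" and "state" survive the document-frequency filter here, which shows how `min_df` trims rare words out of the feature space.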

In [42]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y,
)
X_train.shape, X_test.shape

((173544, 52987), (43386, 52987))
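
Because the split uses `stratify=y`, the test set should keep roughly the same share of high-value questions as the full data. The sketch below works that out by hand from the class counts printed earlier. The rounding is only an approximation of how the rows get allocated, so the exact counts could differ by a row.

```python
# Class counts from the value_counts() output above.
class_counts = {0: 155622, 1: 61308}
TEST_SIZE = 0.2

total = sum(class_counts.values())
high_share = class_counts[1] / total

# Approximate per-class test rows under a stratified 20% split.
expected_test = {label: round(n * TEST_SIZE) for label, n in class_counts.items()}

print(f"total rows: {total}")
print(f"high-value share: {high_share:.4f}")
print(f"expected test rows per class: {expected_test}")
print(f"expected test size: {sum(expected_test.values())}")
```

The expected test size agrees with the 43386 rows in `X_test.shape`, and the per-class figures can be compared with the support column in the classification reports.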

Now that I have created my training and test sets, I move on to training the Naive Bayes classifier.

In [43]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)
nb_acc = accuracy_score(y_test, y_pred_nb)
print(f"Naive Bayes accuracy: {nb_acc:.4f}\n")
print("Naive Bayes classification report")
print(classification_report(y_test, y_pred_nb, target_names=["low_value", "high_value"]))


Naive Bayes accuracy: 0.6948

Naive Bayes classification report
              precision    recall  f1-score   support

   low_value       0.74      0.89      0.81     31124
  high_value       0.41      0.19      0.26     12262

    accuracy                           0.69     43386
   macro avg       0.58      0.54      0.53     43386
weighted avg       0.65      0.69      0.65     43386



In [44]:
output_path = "/content/drive/MyDrive/Colab Notebooks/jeopardy_basic_results.txt"
with open(output_path, "w") as f:
    f.write(f"Naive Bayes accuracy: {nb_acc:.4f}\n")

The Naive Bayes classifier reached 69% accuracy, but a closer look shows a large gap between the two classes: the low-value questions (under $1000) had 74% precision and 89% recall, while the high-value questions ($1000 and over) had only 41% precision and 19% recall. This leads me to believe that the high-value questions have textual markers similar to those of the low-value questions, so the classifier has a difficult time distinguishing between them.
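
The f1-score column in the report is the harmonic mean of precision and recall, which is why the high-value f1 is pulled down so far by the 19% recall. Here is a minimal sketch using the rounded Naive Bayes figures from the report above. Since the inputs are already rounded, the result can differ from scikit-learn's own value in the last digit for other reports.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the Naive Bayes classification report.
nb_report = {"low_value": (0.74, 0.89), "high_value": (0.41, 0.19)}

nb_f1 = {label: round(f1_from(p, r), 2) for label, (p, r) in nb_report.items()}
print(nb_f1)
```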

**Start of Essentials Sections**
Instructions: Extend the above classification attempt and try two other approaches to classifying the text.

For the essentials badge I decided to use Linear SVM and Logistic Regression.

In [45]:
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

In [46]:
def evaluate_model(name, model, X_train, y_train, X_test, y_test):
    print(f"\n=== {name} ===")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"{name} accuracy: {acc:.4f}")
    print("Classification report:")
    print(classification_report(y_test, y_pred, target_names=["low_value", "high_value"]))

svm_model = LinearSVC()
evaluate_model("Linear SVM", svm_model, X_train, y_train, X_test, y_test)

logreg_model = LogisticRegression(max_iter=1000, n_jobs=-1)
evaluate_model("Logistic Regression", logreg_model, X_train, y_train, X_test, y_test)



=== Linear SVM ===
Linear SVM accuracy: 0.6739
Classification report:
              precision    recall  f1-score   support

   low_value       0.74      0.84      0.79     31124
  high_value       0.38      0.24      0.30     12262

    accuracy                           0.67     43386
   macro avg       0.56      0.54      0.54     43386
weighted avg       0.64      0.67      0.65     43386


=== Logistic Regression ===
Logistic Regression accuracy: 0.7017
Classification report:
              precision    recall  f1-score   support

   low_value       0.74      0.91      0.81     31124
  high_value       0.43      0.16      0.24     12262

    accuracy                           0.70     43386
   macro avg       0.58      0.54      0.53     43386
weighted avg       0.65      0.70      0.65     43386



Naive Bayes had 69% accuracy, the Linear SVM had 67%, and Logistic Regression had 70%, so all three classifiers were within about 3 percentage points of each other. All three also show that the high-value questions consistently have lower precision and recall than the low-value questions. The Linear SVM achieved the highest recall for the high-value questions at 24%, possibly indicating that it is best suited to finding textual markers in this data, but it still misses around 76% of the high-value questions, so it is not very reliable. This again supports the conclusion from the Naive Bayes classifier: with these bag-of-words features, you cannot reliably decipher the difficulty of a Jeopardy question from its textual markers.
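
One more way to put these accuracies in context is to compare them with a majority-class baseline, meaning a model that always predicts low value. The sketch below computes that baseline from the support counts in the reports and compares it with the three reported accuracies. The numbers are copied from the outputs above; nothing is re-trained here.

```python
# Test-set support per class, from the classification reports.
test_support = {"low_value": 31124, "high_value": 12262}

# Accuracies as printed by the notebook.
reported_accuracy = {
    "Naive Bayes": 0.6948,
    "Linear SVM": 0.6739,
    "Logistic Regression": 0.7017,
}

# Accuracy of always predicting the most common class.
baseline = max(test_support.values()) / sum(test_support.values())
print(f"Majority-class baseline: {baseline:.4f}")

for name, acc in reported_accuracy.items():
    print(f"{name}: {acc:.4f} ({acc - baseline:+.4f} vs. baseline)")

below_baseline = [name for name, acc in reported_accuracy.items() if acc < baseline]
```

On these figures all three classifiers land slightly below the baseline, which is consistent with the conclusion that the word counts carry little signal about question value.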

In [47]:
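output_path = "/content/drive/MyDrive/Colab Notebooks/jeopardy_essential_results.txt"
fitted_models = {"Naive Bayes": nb_model, "Linear SVM": svm_model, "Logistic Regression": logreg_model}
with open(output_path, "w") as f:
    # Write the results for every fitted model, not only Naive Bayes.
    for name, model in fitted_models.items():
        y_pred = model.predict(X_test)
        f.write(f"{name} accuracy: {accuracy_score(y_test, y_pred):.4f}\n")
        f.write(f"{name} report:\n")
        f.write(classification_report(y_test, y_pred, target_names=["low_value", "high_value"]))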
