Multinomial Naive Bayes (MNB) vs Support Vector Machine (SVM)

- Since the course already covered BERT and RoBERTa in a homework assignment, we wanted to demonstrate a different approach to topic analysis. We therefore trained two other classifiers and analyzed their results in this part, rather than reusing the BERT and RoBERTa examples.



*Multinomial Naive Bayes (MNB)*
The reviews are vectorized with TF-IDF, and a Multinomial Naive Bayes classifier is then trained on the resulting features.

In [4]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Loading the movie reviews and labeling them as movies

movies_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Movies.csv", encoding="latin-1")
movies_df.columns = ["ReviewContent"]
movies_df["Label"] = "movies"

# Use only 40% of the movie reviews so that this larger dataset does not dominate the other data
movies_df = movies_df.sample(frac=0.4, random_state=42)

# Loading the book reviews from the Reviews file and labeling them as book
reviews_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Reviews.csv", encoding="latin-1")
reviews_df.columns = ["ReviewContent"]
reviews_df["Label"] = "book"

# Loading the first set of sports tweets and labeling them as sports
sports_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Sports1.csv", encoding="utf-8")
sports_df.columns = ["ReviewContent"]
sports_df["Label"] = "sports"


# Loading the second set of sports tweets and labeling them as sports
sports2_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Sports2.csv", encoding="utf-8")
sports2_df.columns = ["ReviewContent"]
sports2_df["Label"] = "sports"

# Concatenate movie, book, and sports reviews
all_reviews_df = pd.concat([movies_df, reviews_df, sports_df, sports2_df], ignore_index=True)

# Drop rows with missing review text
all_reviews_df.dropna(inplace=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    all_reviews_df["ReviewContent"], all_reviews_df["Label"], test_size=0.2, random_state=42
)

# Fit TF-IDF on the training split only, keeping the 10,000 most frequent terms
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train the Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# Evaluate on the held-out test split
y_pred = clf.predict(X_test_tfidf)
print(classification_report(y_test, y_pred))

# Predict topics for the given test sentences
test_sentences = [
    "I wouldn't be caught dead watching the NFL if it weren't for Taylor Swift.",
    "Chris O'Donnell stated that while filming for this movie, he felt like he was in a Toys ''R'' Us commercial.",
    "The whole game was a rollercoaster ride, but Los Angeles Lakers ultimately persevered and won!",
    "Zendaya slayed in Dune 2, as she does in all her movies.",
    "While my favorite player was playing this match and started off strongggg, it went downhill after Messi's injury midgame.",
    "My uncle's brother's neighbor's cat's veterinarian David reads the communist manifesto in his spare time.",
    "He said that The Great Gatsby is the best novel ever, and I was about to throw hands.",
    "I could not look away from this train wreck of a movie, on February 14th of all days.",
    "The film Everything Everywhere All At Once follows Evelyn Wang, a woman drowning under the stress of her family's failing laundromat.",
    "I just finished reading pride and prejudice which had me HOOOKED from the beginning."
]

X_test_sentences = tfidf_vectorizer.transform(test_sentences)
predicted_topics = clf.predict(X_test_sentences)

for sentence, topic in zip(test_sentences, predicted_topics):
    print(f"Sentence: {sentence}\t Predicted Topic: {topic}") 

              precision    recall  f1-score   support

        book       1.00      0.92      0.96       956
      movies       0.93      1.00      0.96      4055
      sports       1.00      0.99      0.99     19989

    accuracy                           0.99     25000
   macro avg       0.97      0.97      0.97     25000
weighted avg       0.99      0.99      0.99     25000

Sentence: I wouldn't be caught dead watching the NFL if it weren't for Taylor Swift.	 Predicted Topic: movies
Sentence: Chris O'Donnell stated that while filming for this movie, he felt like he was in a Toys ''R'' Us commercial.	 Predicted Topic: movies
Sentence: The whole game was a rollercoaster ride, but Los Angeles Lakers ultimately persevered and won!	 Predicted Topic: movies
Sentence: Zendaya slayed in Dune 2, as she does in all her movies.	 Predicted Topic: movies
Sentence: While my favorite player was playing this match and started off strongggg, it went downhill after Messi's injury midgame.	 Predicted 
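
The hard labels above do not show how confident the classifier is in each prediction. As a minimal sketch, assuming a small made-up corpus in place of the Kaggle files, `predict_proba` can be used to print the per-class probabilities for a sentence. The names `toy_texts`, `toy_labels`, `toy_model`, and `topic_probabilities` are hypothetical and are not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy corpus standing in for the review and tweet files (assumption)
toy_texts = [
    "the film had a great cast and a weak plot",
    "this movie was a train wreck but the acting was fine",
    "the novel has a slow first chapter",
    "I finished reading the book in one night",
    "the team won the match in extra time",
    "what a goal by the striker in the second half",
]
toy_labels = ["movies", "movies", "book", "book", "sports", "sports"]

# TF-IDF followed by Multinomial Naive Bayes, bundled into one estimator
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)


def topic_probabilities(model, sentences):
    """Return one {label: probability} dict per input sentence."""
    probabilities = model.predict_proba(sentences)
    return [dict(zip(model.classes_, row)) for row in probabilities]


sample_sentences = ["the striker scored a late goal"]
for sentence, scores in zip(sample_sentences, topic_probabilities(toy_model, sample_sentences)):
    formatted = ", ".join(f"{label}={prob:.2f}" for label, prob in scores.items())
    print(f"Sentence: {sentence}\t Probabilities: {formatted}")
```

With the objects from the cell above, the same idea would be `clf.predict_proba(X_test_sentences)`, with `clf.classes_` giving the column order.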

*Support Vector Machine (SVM) classification with TF-IDF (Term Frequency-Inverse Document Frequency) vectorization*
The text is transformed with the TF-IDF vectorizer, and a linear-kernel SVM classifier is trained to predict the labels. The test sentences are then printed together with the trained model's predictions.

In [6]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Read movie reviews
movies_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Movies.csv", encoding="latin-1")
movies_df.columns = ["ReviewContent"]
movies_df["Label"] = "movies"
movies_df = movies_df.sample(frac=0.7, random_state=42)  # Use only 70% of the movie reviews

# Read book reviews
reviews_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Reviews.csv", encoding="latin-1")
reviews_df.columns = ["ReviewContent"]
reviews_df["Label"] = "book"

# Read sports tweets from CSV
sports_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Sports1.csv", encoding="utf-8")
sports_df.columns = ["ReviewContent"]
sports_df["Label"] = "sports"

# Read another set of sports tweets from CSV
sports2_df = pd.read_csv("C:/Users/mberk/OneDrive/Masaüstü/New folder/Sports2.csv", encoding="utf-8")
sports2_df.columns = ["ReviewContent"]
sports2_df["Label"] = "sports"

all_reviews_df = pd.concat([movies_df, reviews_df, sports_df, sports2_df], ignore_index=True)
all_reviews_df.dropna(inplace=True)

X_train, X_test, y_train, y_test = train_test_split(
    all_reviews_df["ReviewContent"], all_reviews_df["Label"], test_size=0.3, random_state=42
)

tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # Reduced number of features

X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Initialize SVM classifier
svm_clf = SVC(kernel='linear')
svm_clf.fit(X_train_tfidf, y_train)

# Predictions using SVM
y_pred_svm = svm_clf.predict(X_test_tfidf)
print("SVM Classification Report:")
print(classification_report(y_test, y_pred_svm))

# Test sentences
test_sentences = [
    "I wouldn't be caught dead watching the NFL if it weren't for Taylor Swift.",
    "Chris O'Donnell stated that while filming for this movie, he felt like he was in a Toys ''R'' Us commercial.",
    "The whole game was a rollercoaster ride, but Los Angeles Lakers ultimately persevered and won!",
    "Zendaya slayed in Dune 2, as she does in all her movies.",
    "While my favorite player was playing this match and started off strongggg, it went downhill after Messi's injury midgame.",
    "My uncle's brother's neighbor's cat's veterinarian David reads the communist manifesto in his spare time.",
    "He said that The Great Gatsby is the best novel ever, and I was about to throw hands.",
    "I could not look away from this train wreck of a movie, on February 14th of all days.",
    "The film Everything Everywhere All At Once follows Evelyn Wang, a woman drowning under the stress of her family's failing laundromat.",
    "I just finished reading pride and prejudice which had me HOOOKED from the beginning."
]

X_test_sentences = tfidf_vectorizer.transform(test_sentences)

predicted_topics = svm_clf.predict(X_test_sentences)

for sentence, topic in zip(test_sentences, predicted_topics):
    print(f"Sentence: {sentence}\t Predicted Topic: {topic}")


SVM Classification Report:
              precision    recall  f1-score   support

        book       0.99      0.98      0.98      1524
      movies       1.00      1.00      1.00     10537
      sports       1.00      1.00      1.00     29939

    accuracy                           1.00     42000
   macro avg       0.99      0.99      0.99     42000
weighted avg       1.00      1.00      1.00     42000

Sentence: I wouldn't be caught dead watching the NFL if it weren't for Taylor Swift.	 Predicted Topic: sports
Sentence: Chris O'Donnell stated that while filming for this movie, he felt like he was in a Toys ''R'' Us commercial.	 Predicted Topic: sports
Sentence: The whole game was a rollercoaster ride, but Los Angeles Lakers ultimately persevered and won!	 Predicted Topic: sports
Sentence: Zendaya slayed in Dune 2, as she does in all her movies.	 Predicted Topic: movies
Sentence: While my favorite player was playing this match and started off strongggg, it went downhill after Messi's 
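
Training `SVC(kernel='linear')` on roughly 98,000 documents (70% of the 140,000 rows implied by the report above) can take noticeably longer than training MNB. The sketch below shows one way to measure this. It times the fit step for MNB, for the linear-kernel `SVC` used here, and for `LinearSVC`, a different scikit-learn estimator that is often faster for linear text classification. `LinearSVC` is included only as an option to try, not as the method used in this notebook. The corpus and the helper `time_fit` are made up, so the numbers illustrate the measurement and not the real training cost.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC, LinearSVC

# Made-up sentences, repeated so that fitting takes a measurable amount of time
base_texts = [
    "the film had a great cast and a weak plot",
    "the novel has a slow first chapter",
    "the team won the match in extra time",
]
base_labels = ["movies", "book", "sports"]
texts = base_texts * 300
labels = base_labels * 300

X = TfidfVectorizer().fit_transform(texts)


def time_fit(estimator, X, y):
    """Fit the estimator and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    estimator.fit(X, y)
    return time.perf_counter() - start


fit_seconds = {
    "MultinomialNB": time_fit(MultinomialNB(), X, labels),
    "SVC(kernel='linear')": time_fit(SVC(kernel="linear"), X, labels),
    "LinearSVC": time_fit(LinearSVC(), X, labels),
}

for name, seconds in fit_seconds.items():
    print(f"{name}: {seconds:.4f} s")
```

To time the real models, `X` and `labels` would be replaced by `X_train_tfidf` and `y_train` from the cell above.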

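The results are similar, but SVM scores slightly better on the sports label (F1 of 1.00 versus 0.99 for MNB). However, MNB is faster to train and more time efficient. Note that the two runs use different settings (movies sampling fraction, test size, and max_features), so the scores are not a strictly like-for-like comparison. Different weights (sampling fractions) for the movies data were tried for both models, and the current values (0.4 for MNB and 0.7 for SVM) appear to work best for each model separately.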
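
Because the two cells above differ in sampling fraction, test size, and `max_features`, one way to compare the models and the movie sampling fractions directly is to score both models on the same split. The following is a minimal sketch under stated assumptions: the toy frames and the helper `macro_f1_by_model` are hypothetical, the test size of 0.25 and the stratified split are illustrative choices, and macro F1 is only one possible summary metric.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

# Made-up toy data standing in for the Kaggle files (assumption)
movie_sentences = [
    "the film had a great cast and a weak plot",
    "this movie was a train wreck but the acting was fine",
    "the director shot every scene on location",
    "the sequel is better than the first movie",
]
book_sentences = [
    "the novel has a slow first chapter",
    "I finished reading the book in one night",
]
sports_sentences = [
    "the team won the match in extra time",
    "what a goal by the striker in the second half",
    "the coach changed the lineup before the game",
]
movies_toy = pd.DataFrame({"ReviewContent": movie_sentences * 5, "Label": "movies"})
others_toy = pd.concat(
    [
        pd.DataFrame({"ReviewContent": book_sentences * 4, "Label": "book"}),
        pd.DataFrame({"ReviewContent": sports_sentences * 4, "Label": "sports"}),
    ],
    ignore_index=True,
)


def macro_f1_by_model(movies, others, frac, seed=42):
    """Score MNB and a linear SVM on one shared split for a given movies fraction."""
    sampled_movies = movies.sample(frac=frac, random_state=seed)
    data = pd.concat([sampled_movies, others], ignore_index=True)
    X_train, X_test, y_train, y_test = train_test_split(
        data["ReviewContent"],
        data["Label"],
        test_size=0.25,
        random_state=seed,
        stratify=data["Label"],  # keep class proportions equal in both splits
    )
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)

    scores = {}
    for name, model in [("MNB", MultinomialNB()), ("SVM", SVC(kernel="linear"))]:
        model.fit(X_train_tfidf, y_train)
        scores[name] = f1_score(y_test, model.predict(X_test_tfidf), average="macro")
    return scores


for frac in (0.4, 0.7, 1.0):
    print(f"movies frac={frac}: {macro_f1_by_model(movies_toy, others_toy, frac)}")
```

Both models see identical training and test rows here, so any score difference comes from the classifier or the sampling fraction and not from the split.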

*THE DATA FOR THE ANALYSIS*

Sports Data:
https://www.kaggle.com/datasets/kumari2000/fifa-world-cup-twitter-dataset-2022
https://www.kaggle.com/datasets/eliasdabbas/european-football-soccer-clubs-tweets

Movies Data:
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Book Reviews Data:
https://www.kaggle.com/datasets/shrutimehta/amazon-book-reviews-webscraped