# Naive Bayes Model to Classify Genres

In this notebook, we will train and test models on our book summary data. First, we will establish a baseline with Multinomial Naive Bayes; the more robust transformer modeling comes later on.

In [1]:
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
# load the data
train_data = 'data/train_data.zip'
test_data = 'data/test_data.zip'

df_train = pd.read_csv(train_data)
df_test = pd.read_csv(test_data)

# preview the dataframes (a notebook cell renders only its last expression,
# so only df_test.head() is displayed below)
df_train.head()
df_test.head()

| Unnamed: 0.1 | Unnamed: 0 | Wikipedia ID | Freebase ID | Title | Author | Publication Date | Genres | Summary | CleanSummary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1092 | 453257 | /m/02bc8q | Encounter With Tiber | Buzz Aldrin | 1996 | Science Fiction | 9,000 years ago an alien society in the Alpha... | years ago alien society alpha centauri system ... |
| 1 | 10276 | 12185816 | /m/02vtnv1 | Princess Academy | Shannon Hale | 2005-06-16 | Children's literature | Miri is a fourteen-year-old girl from Mount E... | miri fourteenyearold girl mount eskel isolated... |
| 2 | 7421 | 6564761 | /m/0gbsxh | Cyborg | Martin Caidin | 1972-04 | Techno-thriller | Cyborg is the story of an astronaut-turned-te... | cyborg story astronautturnedtest pilot steve a... |
| 3 | 14634 | 24913860 | /m/09gghmb | Star Wars: Crosscurrent | Paul S. Kemp | | Science Fiction | In 5000 BBY, the Sith Lord Saes Rrogan gather... | bby sith lord saes rrogan gathers forceenhanci... |
| 4 | 16374 | 33580925 | /m/0hgmd01 | Children of Paranoia | | 2011-09 | Thriller | Since the age of eighteen, Joseph has been as... | since age eighteen joseph assassinating people... |


#### Tokenizing data

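We will tokenize the cleaned summaries with `CountVectorizer()`, which turns each summary into a vector of word counts. These counts are the input features for our Multinomial Naive Bayes baseline.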

In [3]:
# tokenize data for Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()

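# fit the vectorizer on the training summaries; the result is a sparse document-term matrix
training_counts = cvect.fit_transform(df_train.CleanSummary.values.astype('U'))
words = cvect.get_feature_names_out()  # vocabulary learned from the training set
arr = training_counts.toarray()  # dense copy for inspection only; can be memory-heavy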

# load training genres
train_nb_actual = df_train.Genres

In [4]:
# vectorize the test data
testing_counts = cvect.transform(df_test.CleanSummary.values.astype('U'))
test_nb_actual = df_test.Genres
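
The test set is passed through `transform` rather than `fit_transform` so that it reuses the vocabulary learned from the training set. The sketch below illustrates this on made-up text; the `demo_*` names are hypothetical and exist only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data for illustration only
demo_train = ["sith lord gathers crystals", "pilot tests aircraft"]
demo_test = ["sith pilot discovers wormhole"]

demo_vect = CountVectorizer()
demo_train_counts = demo_vect.fit_transform(demo_train)  # learns the vocabulary
demo_test_counts = demo_vect.transform(demo_test)        # reuses that vocabulary

# Both matrices have the same number of columns, one per training word
print(demo_train_counts.shape, demo_test_counts.shape)

# Words unseen during fitting ("discovers", "wormhole") are silently dropped
print(demo_test_counts.toarray())
```

Keeping the column layout identical is what lets a model fitted on the training counts score the test counts.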

In [5]:
# fit the Multinomial Naive Bayes model on the training counts and predict test genres
mnb_model = MultinomialNB().fit(training_counts, train_nb_actual)
mnb_outputs = mnb_model.predict(testing_counts)

# calculate accuracy
mnb_acc = accuracy_score(test_nb_actual, mnb_outputs)  # signature is (y_true, y_pred)
print("Multinomial Naive Bayes Accuracy:", mnb_acc)

Multinomial Naive Bayes Accuracy: 0.4275506153235411
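
A single accuracy number does not show which genres the baseline confuses. The sketch below shows how the `confusion_matrix` and `classification_report` functions imported earlier could be used for a per-genre view. It runs on made-up labels (`toy_true`, `toy_pred`, and `toy_labels` are hypothetical); on the real data, one would presumably pass `test_nb_actual` and `mnb_outputs` instead.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels for illustration only (not results from the model above)
toy_true = ["Fantasy", "Fantasy", "Thriller", "Science Fiction", "Thriller"]
toy_pred = ["Fantasy", "Thriller", "Thriller", "Science Fiction", "Fantasy"]
toy_labels = ["Fantasy", "Science Fiction", "Thriller"]

# Rows are true genres, columns are predicted genres, both in toy_labels order
toy_cm = confusion_matrix(toy_true, toy_pred, labels=toy_labels)
print(toy_cm)

# Per-genre precision, recall, and F1; zero_division=0 avoids warnings
# for genres that are never predicted
print(classification_report(toy_true, toy_pred, labels=toy_labels, zero_division=0))
```

Off-diagonal entries of the matrix mark the genre pairs that get mixed up, which is often more informative than overall accuracy when classes are imbalanced.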
