# Naive Bayes Model to Classify Genres

In this notebook, we train and test a genre classifier on our book summary data. We first establish a baseline with Multinomial Naive Bayes; the more robust transformer models come later.

In [1]:
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [2]:
# load the data
train_data = 'data/train_data.zip'
test_data = 'data/test_data.zip'

df_train = pd.read_csv(train_data)
df_test = pd.read_csv(test_data)

# preview the dataframes; a notebook cell renders only its last expression,
# so the table below is df_test.head()
df_train.head()
df_test.head()

Unnamed: 0.1,Unnamed: 0,Wikipedia ID,Freebase ID,Title,Author,Publication Date,Genres,Summary,CleanSummary
0,1092,453257,/m/02bc8q,Encounter With Tiber,Buzz Aldrin,1996,Science Fiction,"9,000 years ago an alien society in the Alpha...",years ago alien society alpha centauri system ...
1,10276,12185816,/m/02vtnv1,Princess Academy,Shannon Hale,2005-06-16,Children's literature,Miri is a fourteen-year-old girl from Mount E...,miri fourteenyearold girl mount eskel isolated...
2,7421,6564761,/m/0gbsxh,Cyborg,Martin Caidin,1972-04,Techno-thriller,Cyborg is the story of an astronaut-turned-te...,cyborg story astronautturnedtest pilot steve a...
3,14634,24913860,/m/09gghmb,Star Wars: Crosscurrent,Paul S. Kemp,,Science Fiction,"In 5000 BBY, the Sith Lord Saes Rrogan gather...",bby sith lord saes rrogan gathers forceenhanci...
4,16374,33580925,/m/0hgmd01,Children of Paranoia,,2011-09,Thriller,"Since the age of eighteen, Joseph has been as...",since age eighteen joseph assassinating people...


#### Tokenizing data

We tokenize the cleaned summaries with CountVectorizer(), which turns each summary into a vector of word counts for our Multinomial Naive Bayes baseline.

In [3]:
# tokenize data for Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
cvect = CountVectorizer()

# apply the vectorizer to training data
training_counts = cvect.fit_transform(df_train.CleanSummary.values.astype('U'))
train_nb_actual = df_train.Genres

In [4]:
# apply the vectorizer to the test data
testing_counts = cvect.transform(df_test.CleanSummary.values.astype('U'))
test_nb_actual = df_test.Genres
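
The difference between `fit_transform` on the training data and `transform` on the test data can be seen on a toy corpus. This is a minimal sketch with made-up sentences standing in for the `CleanSummary` column; the `toy_*` names are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the CleanSummary column (not the real data)
train_docs = ["alien society alpha centauri", "girl mount village academy"]
test_docs = ["alien girl spaceship"]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary from the training documents only
toy_train_counts = toy_vect.fit_transform(train_docs)
# transform reuses that vocabulary; words never seen in training are ignored
toy_test_counts = toy_vect.transform(test_docs)

print(sorted(toy_vect.vocabulary_))   # learned vocabulary
print(toy_train_counts.shape)         # (documents, vocabulary size)
print(toy_test_counts.toarray())      # counts for the test document
```

Because the vocabulary is fixed at fit time, the test matrix has the same number of columns as the training matrix, which is what the classifier expects.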

#### Apply the Naive Bayes Model

We now fit the model on the training counts and compute a baseline accuracy on the test set.

In [5]:
# apply naive bayes model
mnb_model = MultinomialNB().fit(training_counts, train_nb_actual)
mnb_outputs = mnb_model.predict(testing_counts)

# calculate accuracy (scikit-learn's convention is y_true first, then y_pred)
mnb_acc = accuracy_score(test_nb_actual, mnb_outputs)
print("Multinomial Naive Bayes Accuracy:", mnb_acc)

Multinomial Naive Bayes Accuracy: 0.4275506153235411
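
An accuracy figure is easier to interpret next to a trivial reference point. The sketch below is not part of the original notebook: it assumes a majority-class predictor is a reasonable floor and uses toy labels in place of `df_train.Genres` and `df_test.Genres`. The same pattern should apply to the real label columns.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels standing in for the real genre columns (not the real data)
toy_train_labels = np.array(["Science Fiction"] * 6 + ["Thriller"] * 3 + ["Comedy"])
toy_test_labels = np.array(["Science Fiction"] * 3 + ["Thriller"] * 2)

# DummyClassifier ignores the features, so a placeholder matrix is enough
toy_train_X = np.zeros((len(toy_train_labels), 1))
toy_test_X = np.zeros((len(toy_test_labels), 1))

# Always predict the most frequent training label
majority = DummyClassifier(strategy="most_frequent").fit(toy_train_X, toy_train_labels)
majority_acc = accuracy_score(toy_test_labels, majority.predict(toy_test_X))
print("Majority-class accuracy:", majority_acc)
```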


In [None]:
# confusion matrix
genres = list(set(df_test.Genres))
nb_conf = confusion_matrix(test_nb_actual, mnb_outputs, labels=genres)

plt.figure(figsize=(50, 50))
ax = plt.subplot()
# annot=True annotates each cell; fmt='g' disables scientific notation
sns.heatmap(nb_conf, annot=True, fmt='g', ax=ax);

# labels, title and ticks
ax.set_xlabel('Predicted labels', fontsize=14)
ax.set_ylabel('True labels', fontsize=14)
ax.set_title('Multinomial NB Confusion Matrix', fontsize=20)
ax.xaxis.set_ticklabels(genres, fontsize=10)
ax.yaxis.set_ticklabels(genres, fontsize=10)

plt.savefig('figures/mnb_confmat.jpg')
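
With many genres of very different sizes, raw counts can be hard to compare across rows. As a hedged sketch on toy labels (not the notebook's data), the example below sorts the label list so the axis order is reproducible and passes `normalize="true"` to `confusion_matrix`, so each row sums to 1 and the diagonal holds per-genre recall.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted genres (not the real data)
toy_true = ["Comedy", "Comedy", "Thriller", "Thriller", "Thriller", "Fantasy"]
toy_pred = ["Comedy", "Thriller", "Thriller", "Thriller", "Comedy", "Thriller"]

# Sorting gives a reproducible label order; iterating a set does not guarantee one
toy_labels = sorted(set(toy_true))

# normalize="true" divides each row by the number of samples with that true label
toy_conf = confusion_matrix(toy_true, toy_pred, labels=toy_labels, normalize="true")
print(toy_labels)
print(np.round(toy_conf, 2))
```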

In [10]:
metrics = classification_report(test_nb_actual, mnb_outputs)
print(metrics)

                                          precision    recall  f1-score   support

                         Adventure novel       0.00      0.00      0.00         8
                       Alternate history       1.00      0.03      0.05        40
Apocalyptic and post-apocalyptic fiction       0.00      0.00      0.00         1
                  Autobiographical novel       0.00      0.00      0.00        15
                           Autobiography       0.00      0.00      0.00        25
                           Bildungsroman       0.00      0.00      0.00         3
                               Biography       0.00      0.00      0.00        16
                            Black comedy       0.00      0.00      0.00         4
                   Children's literature       0.43      0.57      0.49       249
                                  Comedy       0.00      0.00      0.00        12
                             Comic novel       0.00      0.00      0.00        15
               

  _warn_prf(average, modifier, msg_start, len(result))
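
The `_warn_prf` output above comes from scikit-learn's undefined-metric warning, which is raised when a genre is never predicted and its precision therefore has no defined value. A minimal sketch on toy labels (not the notebook's data) of making that choice explicit with the `zero_division` parameter of `classification_report`:

```python
from sklearn.metrics import classification_report

# Toy labels: "Fantasy" appears in the truth but is never predicted
toy_true = ["Comedy", "Comedy", "Thriller", "Fantasy"]
toy_pred = ["Comedy", "Thriller", "Thriller", "Thriller"]

# zero_division=0 reports 0.0 for undefined metrics instead of warning
toy_report = classification_report(
    toy_true, toy_pred, zero_division=0, output_dict=True
)
print(toy_report["Fantasy"])
print(toy_report["macro avg"])
```

Setting `zero_division=0` does not change the scores shown in the report; it only states how undefined values are filled and silences the warning.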
