In [1]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
import ast

# Training and Testing Multinomial Naive Bayes Classifier

## Reading in Training Data and Simple Transforms

In [2]:
train = pd.read_csv('./data/processed/goodreads_books_train_processed.csv',
                    converters={'cleaned_descriptions':ast.literal_eval},
                    index_col=0)
train.head()

index,is_english_description,cleaned_descriptions,genre_1,votes_1
0,True,"[angri, rebel, john, drop, school, enlist, arm...",Romance,3493.0
1,True,"[thi, onli, complet, unabridg, one, volum, edi...",Reference,77.0
2,True,"[maria, moran, first, inkl, troubl, wa, copper...",Thriller,32.0
3,True,"[indi, savag, cop, daughter, rock, chick, use,...",Romance,1926.0
4,True,"[inexplic, worldwid, event, forti, seven, extr...",Sequential Art,3214.0


Scikit-learn can do all of the text cleaning we have already done except stemming, so I'll reuse our cleaned (stemmed) descriptions, joining each token list back into a single string per description.

The features are per-description counts of each unique word, so `X_train` has shape `(number of descriptions, number of unique words)`. The targets in `y_train` are simply the genre names as strings.
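
To make the shape claim concrete, here is a minimal sketch on a few made-up token lists (the tokens are hypothetical stand-ins for `cleaned_descriptions`, not rows from the real data). It joins each list into a string, fits a default `CountVectorizer`, and checks that the columns correspond to unique words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stemmed token lists, standing in for `cleaned_descriptions`.
token_lists = [
    ["angri", "rebel", "john", "drop", "school"],
    ["indi", "savag", "cop", "daughter", "rock", "rock"],
    ["john", "cop", "school"],
]

# Stitch each token list back into one space-separated string.
docs = [" ".join(tokens) for tokens in token_lists]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of word counts

n_docs, n_words = X.shape
# "rock" appears twice in the second document, so its count there is 2.
rock_count = X[1, vectorizer.vocabulary_["rock"]]

print(X.shape)                      # (number of descriptions, number of unique words)
print(len(vectorizer.vocabulary_))  # number of unique words seen during fit
print(rock_count)
```

Note that the result is a SciPy sparse matrix rather than a dense array, which `MultinomialNB` accepts directly.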

In [3]:
train['mod_descriptions'] = train['cleaned_descriptions'].apply(lambda x: ' '.join(x))

valid_train = train[train[['mod_descriptions', 'genre_1']].notnull().all(axis=1)]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(valid_train.mod_descriptions)
y_train = valid_train['genre_1']

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

MultinomialNB()

## Read in Test Data

In [4]:
test = pd.read_csv('./data/processed/goodreads_books_test_processed.csv',
                   converters={'cleaned_descriptions':ast.literal_eval},
                   index_col=0)
test.head()

index,is_english_description,cleaned_descriptions,genre_1,votes_1
1,True,"[jim, butcher, 1, new, york, time, bestsel, au...",Fantasy,2560.0
2,True,"[bicker, struggl, piski, problem, liam, beth, ...",,
3,True,"[hi, posit, tracker, snowdanc, pack, drew, kin...",Romance,1462.0
4,True,"[action, doesnt, let, explos, tomorrow, book, ...",Young Adult,423.0
7,True,"[berti, think, quest, almost, done, help, arie...",Fantasy,187.0


## Make Similar Transforms and Score

In [5]:
test['mod_descriptions'] = test['cleaned_descriptions'].apply(lambda x: ' '.join(x))
valid_test = test[test[['mod_descriptions', 'genre_1']].notnull().all(axis=1)]
X_test = vectorizer.transform(valid_test.mod_descriptions)
y_test = valid_test['genre_1']  # 1-D Series, matching how y_train was built

In [6]:
classifier.score(X_test,y_test)

0.5129518929689724

The score above (an accuracy of about 0.51) is actually not bad considering how much the genres overlap. It could probably be improved considerably just by making the genres more distinct.
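
One way to check how much genre overlap contributes to the errors is a confusion matrix. The sketch below uses small hypothetical label lists rather than the real predictions; in this notebook, `y_true` would be `y_test` and `y_pred` would be `classifier.predict(X_test)`. The `score` method of a scikit-learn classifier reports mean accuracy, which `accuracy_score` reproduces here.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only.
y_true = ["Fantasy", "Fantasy", "Romance", "Romance", "Young Adult", "Young Adult"]
y_pred = ["Fantasy", "Young Adult", "Romance", "Romance", "Fantasy", "Young Adult"]

labels = ["Fantasy", "Romance", "Young Adult"]

# Rows are true genres, columns are predicted genres, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Fraction of exact matches, the same quantity classifier.score reports.
acc = accuracy_score(y_true, y_pred)

print(cm)
print(acc)
```

In this toy case every error is a swap between Fantasy and Young Adult, which is the kind of pattern that would suggest two genres are hard to tell apart from descriptions alone.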