 # NOTEBOOK 04d: MULTINOMIAL NAIVE BAYES

Multinomial Naive Bayes is a Bayesian classification model that is well suited to large feature spaces. Although it is designed for discrete features such as word counts, it is also commonly used with Tfidf vectorized text data. As the name suggests, naive Bayes rests on the "naive" assumption that the features are conditionally independent of one another given the class, which removes the need to estimate the joint conditional probabilities that dependent features would require. Though we know this assumption is contrary to the true nature of our data, the model still produces surprisingly accurate predictions given its simplicity.
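
The independence assumption can be written out as a short formula. The notation here is introduced for illustration and does not appear elsewhere in this notebook: $y$ is the class, $x_i$ is the value of feature $i$, $n$ is the number of features, $N_{yi}$ is the total weight of feature $i$ across training samples of class $y$, $N_y = \sum_i N_{yi}$, and $\alpha$ is the smoothing parameter (scikit-learn's `MultinomialNB` defaults to `alpha=1.0`).

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\hat{\theta}_{yi} = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

The first line is the naive Bayes decision rule; the second is the smoothed estimate of the per-class feature probability that the multinomial variant plugs into it.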

In [1]:
import pandas as pd
import numpy as np

from scipy import sparse
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

np.random.seed(42)

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%matplotlib inline

In [2]:
!ls '../assets'

1544820504_comments_df.csv     1545277666_y_test.csv
1544843302_comments_df.csv     1545277666_y_train.csv
1544988010_comments_df.csv     1545285057_tfidf_col.csv
1545241316_clean_target.csv    1545336727_SVD_col.csv
1545241316_clean_text.csv      1545336727_XtestSVD_coo.npz
1545249401_cvec_coo.npz        1545336727_XtestSVD_raw.csv
1545249401_cvec_coo_col.csv    1545336727_XtrainSVD_coo.npz
1545254581_clean_text.csv      1545336727_XtrainSVD_raw.csv
1545266972_clean_text.csv      cvec_1545266972_coo_col.csv
1545266972_cvec_coo.npz        file_log.txt
1545266972_tfidf_coo.npz       test_1545277666_tfidf_coo.npz
1545272821_eda_words.csv       tfidf_1545266972_coo_col.csv
1545277666_tfidf_col.csv       train_1545277666_tfidf_coo.npz


Importing the Tfidf vectorized training and validation data.

In [3]:
columns = pd.read_csv('../assets/1545277666_tfidf_col.csv', na_filter=False, header=None)
cols = np.array(columns[0])

In [4]:
X_train_tfidf_coo = sparse.load_npz('../assets/train_1545277666_tfidf_coo.npz')
X_train_tfidf = pd.SparseDataFrame(X_train_tfidf_coo, columns=cols)

In [5]:
X_test_tfidf_coo = sparse.load_npz('../assets/test_1545277666_tfidf_coo.npz')
X_test_tfidf = pd.SparseDataFrame(X_test_tfidf_coo, columns=cols)

Filling null values for modeling.

In [6]:
X_train_tfidf.fillna(0, inplace=True)
X_test_tfidf.fillna(0, inplace=True)
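
As an alternative sketch, scikit-learn estimators accept SciPy sparse matrices directly, so a loaded matrix can be passed to the model without a DataFrame wrapper or a null-filling step. This may also be useful because `pd.SparseDataFrame` was removed in pandas 1.0. The toy matrix and the names `X_toy`, `y_toy`, and `toy_model` below are hypothetical stand-ins, not this notebook's data.

```python
import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in for a TF-IDF matrix (hypothetical values: 4 documents x 3 terms).
X_toy = sparse.coo_matrix(np.array([
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
    [0.0, 0.7, 0.6],
    [0.1, 0.0, 0.9],
]))
y_toy = np.array([0, 0, 1, 1])

# Convert COO to CSR, the row-oriented sparse format, and fit directly on it.
X_toy_csr = X_toy.tocsr()
toy_model = MultinomialNB().fit(X_toy_csr, y_toy)
toy_preds = toy_model.predict(X_toy_csr)
print(toy_preds)
```

Absent entries in a sparse matrix are implicit zeros, so there are no nulls to fill in this form.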

In [7]:
y_train = pd.read_csv('../assets/1545277666_y_train.csv', header=None)
y_test = pd.read_csv('../assets/1545277666_y_test.csv', header=None)

Transforming the target data into a flattened 1-d array for modeling.

In [8]:
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
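
A minimal illustration of what the flattening does, using a hypothetical three-row target: reading a single-column CSV yields a two-dimensional column, while scikit-learn expects a one-dimensional label array and warns when given a column vector.

```python
import numpy as np

# Hypothetical single-column target, shaped like a one-column CSV read into memory.
y_column = np.array([[1], [0], [1]])
print(y_column.shape)

# np.ravel returns a flattened 1-d array with the same values in the same order.
y_flat = np.ravel(y_column)
print(y_flat.shape)
```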

Instantiating the Multinomial Naive Bayes model.

In [9]:
nb = MultinomialNB()

Fitting the model.

In [10]:
model = nb.fit(X_train_tfidf, np.ravel(y_train))

Scoring the model on the training data.

In [21]:
model.score(X_train_tfidf, y_train)

0.7434105765245653

Our fitted model predicted with 74.3% accuracy on our training data, compared against our baseline accuracy of 53.9%.
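
For reference, here is a sketch of how a baseline of this kind is commonly computed, assuming the baseline is the share of the majority class (the notebook does not state how its 53.9% was derived). The helper name `baseline_accuracy` is hypothetical; the class counts are the validation totals implied by the confusion matrix reported later in this notebook.

```python
import numpy as np

def baseline_accuracy(y):
    """Accuracy of always predicting the most frequent class in y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

# Validation class totals from the confusion matrix:
# 13708 + 7441 actual negatives, 5234 + 19461 actual positives.
y_counts_example = np.array([0] * (13708 + 7441) + [1] * (5234 + 19461))
baseline = baseline_accuracy(y_counts_example)
print(round(baseline, 3))
```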

Scoring the model on the validation data.

In [14]:
model.score(X_test_tfidf, y_test)

0.7235188901492017

The validation dataset was scored with 72.4% accuracy, slightly below the training score. This suggests that the model may be slightly overfit, though the gap could be within a reasonable margin of error. Overall, this is a very good score considering the simplicity of the model.

In [22]:
pred_proba = pd.DataFrame(model.predict_proba(X_test_tfidf))

Reloading the original clean dataframe to connect the predicted probabilities to the sample text.

In [1]:
df = pd.read_csv('../assets/1545272821_eda_words.csv', na_filter=False)

NameError: name 'pd' is not defined

In [None]:
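# Confident false positives: P(class 1) > 0.7 but true label is not 1 (assumes df rows align with X_test_tfidf).
df.loc[X_test_tfidf[((pred_proba[1] > .7) & (y_test != 1)).values].index, :]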

Reviewing the predicted probabilities, we see that many of our observations either had wide margins or were very closely split, with a margin of less than 10%.
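
One way to quantify the margin described above is to take the absolute difference between the two class probabilities for each observation. This is a sketch on hypothetical probabilities; `toy_proba` stands in for the output of `predict_proba`, and the 0.10 cutoff mirrors the 10% figure mentioned in the text.

```python
import numpy as np

# Hypothetical predicted probabilities (columns: class 0, class 1).
toy_proba = np.array([
    [0.95, 0.05],
    [0.52, 0.48],
    [0.10, 0.90],
    [0.46, 0.54],
    [0.30, 0.70],
])

# Margin = gap between the two class probabilities for each row.
margin = np.abs(toy_proba[:, 1] - toy_proba[:, 0])

# Flag observations where the classes are split by less than 10 percentage points.
close_calls = margin < 0.10
n_close = int(close_calls.sum())
print(n_close, "of", len(margin), "observations are closely split")
```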

Using the fitted model to predict on the validation set.

In [11]:
preds = model.predict(X_test_tfidf)

Generating a confusion matrix to review the individual predictions.

In [15]:
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

In [16]:
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)

True Negatives: 13708
False Positives: 7441
False Negatives: 5234
True Positives: 19461
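
The four counts above can be turned into rates in more than one way, and the choice of denominator matters. The sketch below computes each error type both as a share of all validation observations and as a rate within its actual class; the variable names are illustrative.

```python
# Counts copied from the confusion matrix output above.
tn, fp, fn, tp = 13708, 7441, 5234, 19461
total = tn + fp + fn + tp

accuracy = (tp + tn) / total

# Errors as a share of all validation observations.
fp_share = fp / total
fn_share = fn / total

# Errors as rates within the actual negative / actual positive class.
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
sensitivity = tp / (tp + fn)

print("accuracy: %.4f" % accuracy)
print("FP share of all: %.3f, FN share of all: %.3f" % (fp_share, fn_share))
print("FPR: %.3f, FNR: %.3f, sensitivity: %.3f"
      % (false_positive_rate, false_negative_rate, sensitivity))
```

With these counts, false positives are about 16% of all observations but roughly 35% of the actual negatives, and false negatives are about 11% of all observations but roughly 21% of the actual positives.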


False positives (type I errors) accounted for 16% of all validation observations, noticeably more than false negatives (type II errors) at 11%, though in this case we don't necessarily consider one type of misclassification better or worse than the other. For our purposes we can consider them equivalent; however, if we felt strongly that we should err on the side of type II error, we could decrease the sensitivity. Likewise, if our priority were aggressive screening of topical/syntactical variation, we could increase the sensitivity at the expense of increasing the type I error.
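
One common way to adjust sensitivity is to move the probability cutoff used to assign the positive class instead of relying on the default 0.5. The sketch below uses hypothetical probabilities; `predict_with_threshold` is an illustrative helper, not part of scikit-learn, and in this notebook its input would be the class 1 column of `model.predict_proba(X_test_tfidf)`.

```python
import numpy as np

def predict_with_threshold(proba_pos, threshold=0.5):
    """Label an observation positive when its class 1 probability meets the threshold."""
    return (np.asarray(proba_pos) >= threshold).astype(int)

# Hypothetical class 1 probabilities for four observations.
toy_proba_pos = np.array([0.20, 0.45, 0.55, 0.80])

default_preds = predict_with_threshold(toy_proba_pos)        # cutoff 0.5
sensitive_preds = predict_with_threshold(toy_proba_pos, 0.4)  # more positives, fewer type II errors
strict_preds = predict_with_threshold(toy_proba_pos, 0.6)     # fewer positives, fewer type I errors

print(default_preds, sensitive_preds, strict_preds)
```

Lowering the cutoff raises sensitivity at the cost of more false positives; raising it does the reverse.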