# Naive Bayes Example, revisited

This notebook repeats the previous ML demo on the same SMS data, but uses the BernoulliNB model instead of the MultinomialNB model.


In [1]:
import pandas as pd
df = pd.read_csv('sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...


In the previous notebook we got better results by replacing runs of digits, repeated punctuation, and all-caps words with placeholder tokens, so let's do that again.

In [2]:
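# replace runs of 2+ digits, repeated punctuation marks, and all-caps words
# with placeholder tokens; assign back rather than relying on inplace=True
df['text'] = df['text'].replace(r'\d\d+', ' num ', regex=True)
df['text'] = df['text'].replace(r'[!@#*][!@#*]+', ' punct ', regex=True)
df['text'] = df['text'].replace(r'[A-Z][A-Z]+', ' caps ', regex=True)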
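
To see what the three patterns do, here is a minimal sketch that applies them to one message from the data with the standard `re` module. The helper name `tag_tokens` is made up for this illustration; the notebook itself applies the same patterns through pandas.

```python
import re

def tag_tokens(text):
    """Apply the three substitutions in the same order as the notebook cell."""
    text = re.sub(r'\d\d+', ' num ', text)               # runs of 2+ digits
    text = re.sub(r'[!@#*][!@#*]+', ' punct ', text)     # 2+ of ! @ # * in a row
    text = re.sub(r'[A-Z][A-Z]+', ' caps ', text)        # 2+ capital letters in a row
    return text

sample = "Free entry in 2 a wkly comp to win FA Cup"
print(tag_tokens(sample).split())
# ['Free', 'entry', 'in', '2', 'a', 'wkly', 'comp', 'to', 'win', 'caps', 'Cup']
```

Note that the single digit `2` is left alone because each pattern needs at least two matching characters in a row, and `Free` is untouched because only its first letter is a capital.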
    

### Text preprocessing

Before applying a machine learning algorithm, we need to do some preprocessing on the text. The following code removes stop words while creating a tf-idf representation of the data.

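We changed the vectorizer settings. With binary=True, every nonzero term frequency is set to 1 instead of using counts. The idf weighting and normalization still apply, so the resulting values are not strictly 0 or 1.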
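
To make the effect of binary=True concrete, here is a hand-rolled sketch on a two-document toy corpus. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how scikit-learn documents its default tf-idf. The toy documents and the helper `tfidf_row` are invented for this example and are not part of the notebook.

```python
import math

# toy corpus, already tokenized (hypothetical data)
docs = [["free", "free", "entry"], ["ok", "entry"]]
vocab = sorted({word for doc in docs for word in doc})   # ['entry', 'free', 'ok']
n_docs = len(docs)

# document frequency and smoothed idf for each term
doc_freq = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc, binary):
    """Return the L2-normalized tf-idf row for one document."""
    tf = {t: (1 if t in doc else 0) if binary else doc.count(t) for t in vocab}
    raw = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

print(vocab)
print("counts:", tfidf_row(docs[0], binary=False))
print("binary:", tfidf_row(docs[0], binary=True))
```

With counts, the repeated word "free" dominates the row. With binary=True it counts only once, but its weight is still a fraction rather than exactly 1.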

In [3]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

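# keep the stop words as a list (recent scikit-learn versions reject a set here)
# and use a new name so we don't shadow the imported stopwords module
stop_words = stopwords.words('english')
vectorizer = TfidfVectorizer(stop_words=stop_words, binary=True)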

In [4]:
# set up X and y
X = vectorizer.fit_transform(df.text)
y = df.spam

### train and test sets

Above we set X to the tf-idf features built from the text column and y to the spam label we are trying to learn. Now we split the data into train and test sets, with 20% of the data going to the test set.

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)
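
As a rough sketch of what a shuffled 80/20 split does, the snippet below splits row indices with the standard library only. It assumes the test size is rounded up, which matches the 968 test rows in the confusion matrix further down. It does not reproduce the exact rows that train_test_split selects, since the shuffling algorithm differs.

```python
import math
import random

n_samples = 4837                        # rows in the SMS data
n_test = math.ceil(0.2 * n_samples)     # assumption: test size is rounded up
n_train = n_samples - n_test

# shuffle the row indices with a fixed seed, then cut them into two groups
indices = list(range(n_samples))
random.Random(1234).shuffle(indices)
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))    # 3869 968
```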


### train the naive bayes classifier

For this data, let's try BernoulliNB instead of MultinomialNB. With its default binarize=0.0 setting, BernoulliNB converts our data to binary features: any value greater than 0 becomes 1.

We used the default settings. You should always read the documentation to see what they mean:

- alpha: additive (Laplace) smoothing (0 for no smoothing)
- binarize: threshold for mapping input features to booleans; if None, input is assumed to be binary vectors
- fit_prior: if True, learn priors from data; if False, use a uniform prior
- class_prior: lets you specify class priors
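
To illustrate how binarize and alpha interact, here is a small pure-Python sketch. It assumes the smoothed estimate `(count + alpha) / (n + 2 * alpha)` for the probability that a feature is present in a class. The function names and toy rows are made up for this example, and this is not the scikit-learn implementation.

```python
def binarize(row, threshold=0.0):
    """Map each feature to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

def smoothed_feature_prob(rows, alpha=1.0):
    """Estimate P(feature present | class) with additive smoothing."""
    n_rows = len(rows)
    n_features = len(rows[0])
    return [
        (sum(row[j] for row in rows) + alpha) / (n_rows + 2 * alpha)
        for j in range(n_features)
    ]

# two toy tf-idf rows that both belong to the same class (hypothetical values)
spam_rows = [binarize(row) for row in [[0.58, 0.0, 0.81], [0.0, 0.0, 1.0]]]
probs = smoothed_feature_prob(spam_rows, alpha=1.0)

print(spam_rows)   # [[1, 0, 1], [0, 0, 1]]
print(probs)       # [0.5, 0.25, 0.75]
```

The second feature never appears in this class, yet smoothing keeps its probability at 0.25 rather than 0, so a single unseen word cannot zero out the whole prediction.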


In [6]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import BernoulliNB

naive_bayes = BernoulliNB()
naive_bayes.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)


### evaluate on the test data

In [7]:
# make predictions on the test data
pred = naive_bayes.predict(X_test)

# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)


array([[847,   1],
       [ 13, 107]])

In [8]:
# confusion matrix has this form for our example, where "negative" means not spam:
#     tn   fp
#     fn   tp

# breaking down the confusion matrix to understand terminology
# not_spam = 0, the negative class
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(tn, fp, fn, tp)


847 1 13 107
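
The layout of the matrix can be sketched by counting label pairs by hand, with rows for the actual class and columns for the predicted class. The helper `confusion_counts` and the toy labels below are invented for this illustration.

```python
def confusion_counts(y_true, y_pred):
    """Build a 2x2 matrix: rows are actual labels, columns are predicted labels."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

# toy labels: 0 = not spam, 1 = spam
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]

toy_matrix = confusion_counts(toy_true, toy_pred)
(toy_tn, toy_fp), (toy_fn, toy_tp) = toy_matrix

print(toy_matrix)                        # [[2, 1], [1, 2]]
print(toy_tn, toy_fp, toy_fn, toy_tp)    # 2 1 1 2
```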


In [9]:
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.985537190083
precision score:  0.990740740741
recall score:  0.891666666667
f1 score:  0.938596491228
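
As a check on the scores above, the same four numbers can be recomputed directly from the confusion matrix counts using the standard definitions:

```python
# counts taken from the confusion matrix above
tn, fp, fn, tp = 847, 1, 13, 107

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # share of predicted spam that is spam
recall = tp / (tp + fn)                      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9855 0.9907 0.8917 0.9386
```

The 13 false negatives are what hold recall below precision: the model almost never flags a legitimate message, but it still misses some spam.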


### analysis

How did the BernoulliNB model with binarized vectors compare to the MultinomialNB model with count vectors?

Accuracy was slightly higher and precision very slightly lower, but recall jumped from .83 to .89. This raised the f1 score to .9386, over 3 points higher than with the MultinomialNB model.