# Naive Bayes Example

This example uses an SMS spam data set (slightly modified) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). The original collection contains 5574 SMS messages that have been labeled as ham or spam, in a tab-delimited file whose first column is the label and whose second column is the message content. I edited the data set to remove some unwanted columns and add headings.



In [1]:
import pandas as pd
df = pd.read_csv('sms-spam.csv', header=0, usecols=[1,2], encoding='latin-1')
print('rows and columns:', df.shape)
print(df.head())

rows and columns: (4837, 2)
   spam                                               text
0     0  Go until jurong point, crazy.. Available only ...
1     0                      Ok lar... Joking wif u oni...
2     1  Free entry in 2 a wkly comp to win FA Cup fina...
3     0  U dun say so early hor... U c already then say...
4     0  Nah I don't think he goes to usf, he lives aro...



### Text preprocessing

Before applying a machine learning algorithm, we need to do some preprocessing on the text. The following code removes stop words while creating a tf-idf representation of the data.

The TfidfVectorizer combines two other sklearn classes: CountVectorizer and TfidfTransformer.

There are too many parameters to list here; [refer to the documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).

An excellent primer on sklearn text processing [is found here](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).
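To make the relationship between the three classes concrete, here is a minimal sketch on a made-up toy corpus (the `toy_docs` list is an assumption for illustration, not part of the SMS data). With default settings, counting and then weighting should give the same matrix as TfidfVectorizer alone.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus, not taken from the SMS data set
toy_docs = [
    "free entry to win a prize",
    "are you free for lunch",
    "win cash now",
]

# Two-step route: raw term counts first, then tf-idf weighting
counts = CountVectorizer().fit_transform(toy_docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
one_step = TfidfVectorizer().fit_transform(toy_docs)

# Compare the two dense matrices element by element
same = np.allclose(one_step.toarray(), two_step.toarray())
print('same result:', same)
```

The equivalence holds only when both routes use the same parameters; the notebook below additionally passes a stop word list.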

In [2]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

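# use a new name so the imported stopwords module is not shadowed;
# scikit-learn expects stop_words as a list (or the string 'english')
stop_words = list(stopwords.words('english'))
vectorizer = TfidfVectorizer(stop_words=stop_words)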

In [3]:
# set up X and y
X = vectorizer.fit_transform(df.text)
y = df.spam

In [4]:
# take a peek at X
X.toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [5]:
# look at y
y[:20]

0     0
1     0
2     1
3     0
4     0
5     1
6     0
7     0
8     1
9     1
10    0
11    1
12    1
13    0
14    0
15    1
16    0
17    0
18    0
19    1
Name: spam, dtype: int64

### train and test sets

The tf-idf features built from the text column are our predictors (X), and the spam label we are trying to learn is our target (y). We split the data into train and test sets, with 20% of the data going to the test set.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

# take a peek at the data
print('train size:', X_train.shape)
X_train.toarray()[:5]

# this is a very sparse matrix because most of the 8613 words don't occur in each sms message

train size: (3869, 8613)


array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
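One way to quantify how sparse such a matrix is: divide the number of stored non-zero entries by the total number of cells. The sketch below does this on a hypothetical toy corpus (`toy_docs` and `X_toy` are illustrative names, not part of the notebook); the same two lines should work on `X_train`.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus for illustration
toy_docs = [
    "free entry to win a prize",
    "are you free for lunch",
    "win cash now",
]

X_toy = TfidfVectorizer().fit_transform(toy_docs)

# nnz is the number of explicitly stored (non-zero) entries
n_rows, n_cols = X_toy.shape
density = X_toy.nnz / (n_rows * n_cols)

print('is sparse:', sparse.issparse(X_toy))
print('shape:', X_toy.shape)
print('density:', round(density, 3))
```

Because the matrix is stored in a sparse format, calling `toarray()` on the full data is only needed for peeking; the classifier can work on the sparse matrix directly.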

### train the naive bayes classifier

For this data, let's try the MultinomialNB. 

We use the default settings. Always check the documentation to see what they mean:

- alpha: additive (Laplace) smoothing (0 for no smoothing)
- fit_prior: if True, learn priors from data; if False, use a uniform prior
- class_prior: lets you specify class priors
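
To see what alpha does, here is a small sketch of the additive-smoothing estimate (count + alpha) / (class total + alpha * vocabulary size). The helper name `smoothed_likelihood` and all counts are made up for illustration; this is not scikit-learn's internal code.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha=1.0):
    """Additive-smoothing estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts for illustration only
vocab_size = 4     # number of distinct words
class_total = 10   # total word occurrences seen in the class
unseen = 0         # a word that never occurred in this class

# Without smoothing, an unseen word gets probability zero,
# which would zero out the whole product for that class
no_smoothing = smoothed_likelihood(unseen, class_total, vocab_size, alpha=0.0)

# With alpha=1 (Laplace), the unseen word keeps a small probability
laplace = smoothed_likelihood(unseen, class_total, vocab_size, alpha=1.0)

print('alpha=0:', no_smoothing)
print('alpha=1:', laplace)
```

With alpha=1 the unseen word gets 1/14 instead of 0, so a single unfamiliar word cannot rule out a class on its own.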


In [7]:
from sklearn.naive_bayes import MultinomialNB

naive_bayes = MultinomialNB()
naive_bayes.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


### evaluate on the test data

In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# make predictions on the test data
pred = naive_bayes.predict(X_test)

# print confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred)


array([[848,   0],
       [ 32,  88]])

In [9]:
# confusion matrix has this form for our example, where "negative" means not spam:
#     tn   fp
#     fn   tp

# breaking down the confusion matrix to understand terminology
# not_spam = 0, the negative class
tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
print(tn, fp, fn, tp)


848 0 32 88
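
To make the four cells concrete, the following sketch counts them by hand on made-up labels. The function name `confusion_counts` and the toy lists are assumptions for illustration; sklearn's confusion_matrix is what the notebook actually uses.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count tn, fp, fn, tp for a binary problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # spam correctly flagged
        elif actual == positive:
            fn += 1   # spam that slipped through
        elif predicted == positive:
            fp += 1   # not-spam wrongly flagged
        else:
            tn += 1   # not-spam correctly passed
    return tn, fp, fn, tp

# Hypothetical labels: 1 = spam, 0 = not spam
toy_true = [0, 0, 1, 1, 1, 0]
toy_pred = [0, 1, 1, 0, 1, 0]

toy_counts = confusion_counts(toy_true, toy_pred)
print(toy_counts)  # (2, 1, 1, 2)
```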


In [10]:
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))

accuracy score:  0.96694214876
precision score:  1.0
recall score:  0.733333333333
f1 score:  0.846153846154


### Metrics

In our example, spam = 1, not-spam = 0. Spam is the "positive" class.

#### precision

Precision measures what proportion of our spam classifications was correct.

Precision = tp / (tp + fp)

If you have no false positives, you get 100% precision.

#### recall

Recall measures what proportion of spam messages were correctly identified.

Recall = tp / (tp + fn)

We got 88 / (88 + 32) = .73 recall

#### f1 - the harmonic mean

f1 = 2pr / (p + r) = 2 * 1 * .733 / (1 + .733) = .85
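
As a check on the arithmetic, the sketch below recomputes the scores from the four confusion matrix counts reported above (848, 0, 32, 88). It also confirms that 2pr / (p + r) equals the harmonic mean of precision and recall. The variable names are illustrative.

```python
# Counts taken from the confusion matrix output above
tn, fp, fn, tp = 848, 0, 32, 88

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# The harmonic mean of two numbers is 2 / (1/a + 1/b)
harmonic_mean = 2 / (1 / precision + 1 / recall)

print('accuracy: ', round(accuracy, 4))
print('precision:', round(precision, 4))
print('recall:   ', round(recall, 4))
print('f1:       ', round(f1, 4))
```

The values should match what the sklearn metric functions printed in the previous cell.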

How good is our accuracy?

In the data set, there are 4199 not-spam messages out of 4837. The test data distribution is similar. So if we guessed not spam every time, we would have about 87% accuracy. It seems that Naive Bayes did learn something: its accuracy of about 97% is roughly 9 points above this simple baseline.

In [11]:
print('not-spam size in test data:', y_test[y_test==0].shape[0])
print('test size: ', len(y_test))
baseline = y_test[y_test==0].shape[0] / y_test.shape[0] 
print(baseline)

not-spam size in test data: 848
test size:  968
0.8760330578512396


Examine some wrong classifications.

In [12]:
y_test[y_test != pred]

4179    1
677     1
2073    1
2466    1
4721    1
3144    1
1788    1
801     1
511     1
4062    1
4731    1
1150    1
2754    1
924     1
444     1
1583    1
4757    1
760     1
165     1
4214    1
1266    1
851     1
827     1
2932    1
3003    1
4635    1
4363    1
1558    1
366     1
3528    1
4479    1
2584    1
Name: spam, dtype: int64

In [13]:
for i in [1536, 4692, 2915, 2464, 1101, 1268, 227]:
    print(df.loc[i])

spam                                                    1
text    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
Name: 1536, dtype: object
spam                                                    1
text    Santa Calling! Would your little ones like a c...
Name: 4692, dtype: object
spam                                                    1
text    You have 1 new voicemail. Please call 08719181...
Name: 2915, dtype: object
spam                                                    1
text    INTERFLORA - ÂIt's not too late to order Inte...
Name: 2464, dtype: object
spam                                                    1
text    CLAIRE here am havin borin time & am now alone...
Name: 1101, dtype: object
spam                                                    1
text    500 free text msgs. Just text ok to 80488 and ...
Name: 1268, dtype: object
spam                                                    1
text    Will u meet ur dream partner soon? Is ur caree...
Name: 227, dtype: object


#### analysis

Many of these spam messages, which were misclassified as not spam, contain capital letters and exclamation points. The way we preprocessed the text got rid of this information, so our algorithm could not learn from it.
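
To see why this information is lost, here is a small sketch that runs a default TfidfVectorizer analyzer on a made-up message (the `sample` string is hypothetical). By default the analyzer lowercases the text and keeps only word tokens of two or more characters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer applies to each
# document: preprocessing (lowercasing) followed by tokenization
analyze = TfidfVectorizer().build_analyzer()

# Hypothetical message for illustration
sample = "FREE entry!!! Call NOW"
tokens = analyze(sample)

print(tokens)  # ['free', 'entry', 'call', 'now']
```

After this step "FREE" and "free" are the same feature, and the run of exclamation points is gone.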

Will we get better performance if we process the data differently?


# Second Try

Let's preprocess the text differently to recognize punctuation and caps.

In [14]:
import re

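# assign the result back instead of using inplace=True on a column,
# and use raw strings for the regex patterns
df['text'] = df['text'].replace(r'\d\d+', ' num ', regex=True)
df['text'] = df['text'].replace(r'[!@#*][!@#*]+', ' punct ', regex=True)
df['text'] = df['text'].replace(r'[A-Z][A-Z]+', ' caps ', regex=True)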
    
# these are known problem messages 
for i in [1536, 4692, 2915, 2464, 1101, 1268, 227]:
    print(df.loc[i])

spam                                                    1
text     caps   num  &  caps   caps   caps   caps   ca...
Name: 1536, dtype: object
spam                                                    1
text    Santa Calling! Would your little ones like a c...
Name: 4692, dtype: object
spam                                               1
text    You have 1 new voicemail. Please call  num .
Name: 2915, dtype: object
spam                                                    1
text     caps  - ÂIt's not too late to order Interflo...
Name: 2464, dtype: object
spam                                                    1
text     caps  here am havin borin time & am now alone...
Name: 1101, dtype: object
spam                                                    1
text     num  free text msgs. Just text ok to  num  an...
Name: 1268, dtype: object
spam                                                    1
text    Will u meet ur dream partner soon? Is ur caree...
Name: 227, dtype: object
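
The output shows that only some messages change. The sketch below applies the same three patterns with plain `re.sub` (a stand-in for the pandas call) to two made-up strings. The helper name `mark_patterns` is an assumption for illustration.

```python
import re

def mark_patterns(text):
    """Replace digit runs, punctuation runs, and all-caps runs with tokens."""
    text = re.sub(r'\d\d+', ' num ', text)             # two or more digits
    text = re.sub(r'[!@#*][!@#*]+', ' punct ', text)   # two or more of ! @ # *
    text = re.sub(r'[A-Z][A-Z]+', ' caps ', text)      # two or more capitals
    return text

# Hypothetical messages for illustration
shouty = mark_patterns("CALL 09090900040 NOW!!! Win a prize")
quiet = mark_patterns("Santa Calling! Call 1 friend")

print(shouty.split())
print(quiet.split())
```

Each pattern needs at least two matching characters in a row, so a single digit, a single exclamation point, or a capitalized word such as "Santa" is left alone. That would explain why messages 4692 and 227 are unchanged.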


In [15]:
# do the rest of the processing

X = vectorizer.fit_transform(df.text)
y = df.spam

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=1234)

naive_bayes.fit(X_train, y_train)



MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [16]:
# evaluate

pred = naive_bayes.predict(X_test)
print('accuracy score: ', accuracy_score(y_test, pred))
print('precision score: ', precision_score(y_test, pred))
print('recall score: ', recall_score(y_test, pred))
print('f1 score: ', f1_score(y_test, pred))
confusion_matrix(y_test, pred)


accuracy score:  0.979338842975
precision score:  1.0
recall score:  0.833333333333
f1 score:  0.909090909091


array([[848,   0],
       [ 20, 100]])

It seems that 12 spam messages that were previously misclassified as not-spam are now correctly classified as spam. Recall rose from .73 to .83, which in turn raised the f1 score from .85 to .91.