# Naive Bayes classifier applied to spam filtering

In the following, we analyze text data from about 5,500 SMS messages to train a multinomial Naive Bayes (MNB) classifier for spam filtering. The scikit-learn class CountVectorizer transforms the text into a sparse matrix whose entries count how often each word occurs in a message (raw counts, not probabilities of occurrence). Because the features are counts, we use the multinomial variant of Naive Bayes for classification.
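For reference, the decision rule of a multinomial Naive Bayes classifier can be written as below. This follows the smoothed estimate described in the scikit-learn documentation; the symbols are notation introduced here and do not appear in the notebook.

$$
\hat{c} = \arg\max_{c} \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}
$$

Here $x_i$ is the count of word $i$ in the message, $N_{ci}$ is the total count of word $i$ over all training messages of class $c$, $N_c = \sum_i N_{ci}$, $n$ is the vocabulary size, and $\alpha$ is the smoothing parameter (`alpha=1.0` by default in `MultinomialNB`). The counts $x_i$ enter as exponents of the word probabilities, which is why raw counts are a natural input for this model.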

In [7]:
#import necessary packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline
import matplotlib.pyplot as plt

from helper import plot_classifier #helper.py is saved in the repository

In [2]:
#defining the data frame
#data taken from: https://www.kaggle.com/uciml/sms-spam-collection-dataset
#preprocessed as follows:
#  - 2 "Unnamed" columns dropped
#  - encoding changed to utf-8
#  - columns renamed

df = pd.read_csv("spam.csv")

df.head()

| | type | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [18]:
print(df.shape)

(5572, 2)


In [13]:
#define variables
X = df["message"] # single brackets, no .values

Y = df["type"] # single brackets, no .values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = .25)

#transform the text into a matrix of integer token counts with CountVectorizer; the vocabulary is learned from the training set only
cv = CountVectorizer()

cv.fit(X_train)

X_train = cv.transform(X_train)
X_test = cv.transform(X_test)
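Fitting the vectorizer on the training set only means that words first seen at test time have no column and are silently ignored. The following is a minimal sketch of that behaviour; the two tiny corpora and the name `cv_demo` are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["free prize now", "see you tonight"]  # hypothetical training messages
test_docs = ["free pizza tonight"]                  # "pizza" never occurs in training

cv_demo = CountVectorizer()
cv_demo.fit(train_docs)  # vocabulary comes from the training messages only

# Tokens ordered by their column index in the count matrix
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)
test_counts = cv_demo.transform(test_docs).toarray()

print(columns)
print(test_counts)  # "pizza" contributes to no column
```

This keeps the feature space identical for training and test data, which the classifier requires.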

In [15]:
print(X_train.shape)

(4179, 7323)


Comment: The result is a sparse matrix with 4179 rows (one per training message) and 7323 columns (one per distinct token in the training vocabulary).
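To see what "sparse" means here, one can compare the number of stored non-zero entries with the total number of cells. The sketch below uses a hypothetical three-message corpus rather than the SMS data; `nnz` and `shape` are standard attributes of SciPy sparse matrices.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the SMS messages
messages = [
    "free entry to win a prize",
    "are you coming home tonight",
    "win cash now",
]

counts = CountVectorizer().fit_transform(messages)  # SciPy sparse matrix

n_rows, n_cols = counts.shape
stored = counts.nnz                   # explicitly stored (non-zero) entries
density = stored / (n_rows * n_cols)  # fraction of cells that are non-zero

print(type(counts).__name__, counts.shape)
print(f"{stored} non-zero entries, density {density:.2f}")
```

Applied to the notebook's `X_train`, the same two attributes would show how small a fraction of the 4179 x 7323 cells is actually stored.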

In [9]:
#small example to see in detail how text is transformed into a matrix of integer counts
cv_test = CountVectorizer()

#example to clarify what's happening here:
cv_test.fit(pd.Series(["Hello world","Hello Jupiter"]))

cv_test.transform(pd.Series(["Hello world","Hello Jupiter"])).toarray()

#or in a combined fashion via 
#cv_test.fit_transform(pd.Series(["Hello world","Hello Jupiter"])).toarray()

array([[1, 0, 1],
       [1, 1, 0]])
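The columns of this array correspond to the learned vocabulary in alphabetical order. A small sketch that makes the mapping explicit through the fitted `vocabulary_` attribute (the names `cv_demo`, `docs` and `columns` are chosen here for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Hello world", "Hello Jupiter"]

cv_demo = CountVectorizer()  # lowercases the text by default
matrix = cv_demo.fit_transform(docs).toarray()

# vocabulary_ maps each token to its column index
columns = sorted(cv_demo.vocabulary_, key=cv_demo.vocabulary_.get)

for doc, row in zip(docs, matrix):
    print(f"{doc!r:>16} -> {dict(zip(columns, row.tolist()))}")
```

Each row is one message and each column one token, so the first row has a 1 for "hello" and "world" and a 0 for "jupiter".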

In [17]:
#train model using MultinomialNB
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, Y_train)

print(model.score(X_test, Y_test))

0.9863603732950467


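Comment: An accuracy of about 98.6% on the held-out test set is a strong result for a model trained on only 4179 messages (75% of the roughly 5500 SMS). Accuracy alone can be misleading when one class is much more frequent than the other, so it is worth also checking how many spam messages are actually caught.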
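One way to put an accuracy score into context is to compare it with the accuracy of always predicting the most frequent class and to look at the confusion matrix. The sketch below uses made-up labels, not results from the SMS data; in the notebook, `y_true` would be `Y_test` and `y_pred` would be `model.predict(X_test)`.

```python
from collections import Counter
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for Y_test and model.predict(X_test)
y_true = ["ham", "ham", "ham", "ham", "spam", "spam"]
y_pred = ["ham", "ham", "ham", "spam", "spam", "ham"]

# Accuracy obtained by always predicting the most frequent class
majority_label, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

# Rows are true classes, columns are predicted classes, ordered as in `labels`
cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

print(f"baseline ({majority_label}) accuracy: {baseline_accuracy:.3f}")
print(cm)
```

The off-diagonal entries separate the two kinds of error: ham flagged as spam and spam that slips through.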

We may simplify this model by limiting the number of features:

In [33]:
X = df["message"] # single brackets, no .values

Y = df["type"] # single brackets, no .values

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state = 0, test_size = .25)


#cv = CountVectorizer(max_features = 1000) #keeps only the 1000 most frequent tokens, leads to same result
cv = CountVectorizer(min_df = 10) #keeps only tokens that occur in at least ten training messages (document frequency), leads to similar result

cv.fit(X_train)

X_train = cv.transform(X_train)
X_test = cv.transform(X_test)

model = MultinomialNB()
model.fit(X_train, Y_train)

print(model.score(X_test, Y_test))

0.9849246231155779
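The vectorize-then-classify steps repeated above can also be bundled into a single scikit-learn pipeline, which makes it easier to try different vectorizer settings. This is a sketch on hypothetical toy data, not the notebook's method; `min_df=2` is used so that the effect of the document-frequency threshold is visible on six messages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for df["message"] and df["type"]
texts = [
    "win a free prize now", "free cash prize", "claim your free prize",
    "see you at lunch", "are you home tonight", "call me after lunch",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Vectorizer and classifier are fitted together, so the vocabulary is
# always learned from the data passed to fit()
pipeline = make_pipeline(CountVectorizer(min_df=2), MultinomialNB())
pipeline.fit(texts, labels)

# Only tokens that occur in at least two messages survive min_df=2
vocabulary = sorted(pipeline.named_steps["countvectorizer"].vocabulary_)
predictions = pipeline.predict(["free prize inside", "lunch tonight"])

print(vocabulary)
print(predictions)
```

With the SMS data, one would call `pipeline.fit` on the raw training messages and `pipeline.score` on the raw test messages, without transforming them by hand.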
