## Building a spam detector

In this project, we test-drive a spam detector using two different ways of vectorizing the data. The dataset contains 5572 text messages and is available on Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset.

In [1]:
#import necessary libraries
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np

In [2]:
#import data and delete irrelevant columns
data = pd.read_csv('spam.csv', encoding = 'latin-1')
data = data[['v1', 'v2']]

In [3]:
print(data.shape)
data.head()

(5572, 2)


|   | v1 | v2 |
|---|----|----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### Now that we've imported the data, we need to preprocess it. The label is binary (a message is either spam or not), so only one of the two one-hot encoded columns is needed in the dataframe we will use for feature extraction.

In [4]:
v1_hot = pd.get_dummies(data[['v1']])
v1_hot.head()

|   | v1_ham | v1_spam |
|---|--------|---------|
| 0 | 1 | 0 |
| 1 | 1 | 0 |
| 2 | 0 | 1 |
| 3 | 1 | 0 |
| 4 | 1 | 0 |
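Since only the spam column is needed, the same 0/1 indicator can also be built directly from the label column. The sketch below is one possible alternative, not the approach used in this notebook; the five labels are copied from the first five rows shown earlier, and the variable names are hypothetical. Note that recent pandas versions (2.0 and later) return boolean dummy columns by default, so the explicit cast to `int` keeps the 0/1 encoding.

```python
import pandas as pd

# Labels of the first five messages, copied from the output of data.head()
labels = pd.DataFrame({'v1': ['ham', 'ham', 'spam', 'ham', 'ham']})

# Option 1: compare against the positive class and cast the booleans to 0/1
spam_flag = (labels['v1'] == 'spam').astype(int)

# Option 2: one-hot encode the column, then keep only the spam indicator
spam_flag_from_dummies = pd.get_dummies(labels['v1'])['spam'].astype(int)

print(spam_flag.tolist())  # [0, 0, 1, 0, 0]
```

Both options give the same column; the first skips the intermediate one-hot frame.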


In [5]:
feature = pd.merge(v1_hot['v1_spam'], data['v2'], left_index = True, right_index = True)
feature = feature.rename(columns={'v1_spam' : 'Spam T/F', 'v2' : 'Message Text'})
feature.head()

|   | Spam T/F | Message Text |
|---|----------|--------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


#### Now we're ready to build the training and test sets using two different ways of defining our inputs (two feature extraction methods).

In [6]:
y = feature[['Spam T/F']]

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf = TfidfVectorizer(decode_error='ignore')
X = tfidf.fit_transform(feature['Message Text'])
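To make the difference between the two vectorizers imported above concrete, here is a small sketch on a hypothetical three-message corpus (the messages are made up for illustration and are not taken from the dataset). `CountVectorizer` stores raw token counts, while `TfidfVectorizer` rescales those counts by inverse document frequency and, with its default settings, normalizes each row to unit length.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
corpus = ["free entry win win", "ok see you later", "win a free phone"]

# Raw counts: one row per message, one column per token
count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)

# TF-IDF weights: same vocabulary, but tokens that appear in many messages are down-weighted
tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(corpus)

# Column index of the token "win" in both matrices
win_col = count_vec.vocabulary_["win"]

print(counts.shape)         # (3, 8): single-character tokens such as "a" are dropped by default
print(counts[0, win_col])   # 2: "win" appears twice in the first message
print(weights[0, win_col])  # a fractional weight instead of a raw count
```

`MultinomialNB` accepts either matrix, which is why the two representations can be swapped without changing the model code.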

Let's split the data into training and test sets. 

In [8]:
# Train-test split: hold out 20% of the messages for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (4457, 8672) (4457, 1)
Test set: (1115, 8672) (1115, 1)
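The sizes above follow from the arithmetic of the split: 20% of 5572 is 1114.4, which scikit-learn rounds up to 1115 test rows, leaving 4457 for training. The sketch below reproduces those counts on plain row indices. The `random_state=0` argument is an assumption added here so the split is repeatable; the cell above does not set it, so its split changes on every run.

```python
import math
from sklearn.model_selection import train_test_split

n_messages = 5572  # number of rows in the dataset
row_ids = list(range(n_messages))

# random_state is an illustrative addition that makes the shuffle reproducible
train_ids, test_ids = train_test_split(row_ids, test_size=0.2, random_state=0)

print(len(train_ids), len(test_ids))  # 4457 1115
print(math.ceil(0.2 * n_messages))    # 1115
```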


In [9]:
model = MultinomialNB()
# ravel() flattens the one-column label frame into the 1-D array that fit() expects
model.fit(X_train, y_train.values.ravel())
print("Classification rate for NB with TF/IDF, test data:", model.score(X_test, y_test))
print("Classification rate for NB with TF/IDF, train data:", model.score(X_train, y_train))

Classification rate for NB with TF/IDF, test data: 0.9632286995515695
Classification rate for NB with TF/IDF, train data: 0.9694862014808167


Now let's try a different vectorization: raw token counts instead of TF-IDF weights.

In [10]:
count_vectorizer = CountVectorizer(decode_error='ignore')
X = count_vectorizer.fit_transform(feature['Message Text'])
# Note: this call draws a new random split, so the rows differ from the TF-IDF run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print('Train set:', X_train.shape, y_train.shape)
print('Test set:', X_test.shape, y_test.shape)

Train set: (4457, 8672) (4457, 1)
Test set: (1115, 8672) (1115, 1)


In [11]:
model = MultinomialNB()
model.fit(X_train, y_train.values.ravel())
print("Classification rate for NB with Count Vectorizer, test data:", model.score(X_test, y_test))
print("Classification rate for NB with Count Vectorizer, train data:", model.score(X_train, y_train))

Classification rate for NB with Count Vectorizer, test data: 0.9901345291479821
Classification rate for NB with Count Vectorizer, train data: 0.9925959165357864
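Because each call to `train_test_split` above draws a fresh random split, the two sets of scores are measured on different test rows. A fairer comparison splits the raw messages once and reuses that split for both vectorizers. The sketch below shows one way to do this on a hypothetical eight-message corpus; the messages, `random_state=0`, and `stratify` are assumptions for illustration, not settings from the runs above. Wrapping each vectorizer in a pipeline also means its vocabulary is learned from the training messages only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages and labels (1 = spam, 0 = ham), not the real dataset
texts = [
    "free entry win a prize now",
    "call now to claim your free prize",
    "winner you have won a cash prize",
    "txt back to win free minutes",
    "ok see you at lunch",
    "are you coming home tonight",
    "i will call you later",
    "thanks for the lift today",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text once so both vectorizers see exactly the same rows
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

scores = {}
for name, vectorizer in [("tfidf", TfidfVectorizer()), ("count", CountVectorizer())]:
    # The pipeline fits the vectorizer on the training messages only
    pipeline = make_pipeline(vectorizer, MultinomialNB())
    pipeline.fit(train_texts, train_labels)
    scores[name] = pipeline.score(test_texts, test_labels)

print(scores)
```

On the real data, the same loop would take `feature['Message Text']` and the `Spam T/F` column in place of the toy lists.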


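The count-vectorizer features give the higher accuracy here (about 99.0% versus 96.3% on the test data), although the two runs used different random splits. Let's compare the predictions with the actual labels to see which messages fall through the cracks.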

In [13]:
# Predict on every message, including the ones the model was trained on
feature['Predictions'] = model.predict(X)
feature.head(10)

|   | Spam T/F | Message Text | Predictions |
|---|----------|--------------|-------------|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | 0 | Ok lar... Joking wif u oni... | 0 |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | 0 | U dun say so early hor... U c already then say... | 0 |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | 1 | FreeMsg Hey there darling it's been 3 week's n... | 0 |
| 6 | 0 | Even my brother is not like to speak with me. ... | 0 |
| 7 | 0 | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | 1 | WINNER!! As a valued network customer you have... | 1 |
| 9 | 1 | Had your mobile 11 months or more? U R entitle... | 1 |
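Accuracy alone hides which kind of mistake the model makes. As a minimal sketch, the ten rows shown above can be summarized with a confusion matrix. The labels and predictions below are copied from that table; the use of scikit-learn's `confusion_matrix`, `precision_score`, and `recall_score` is an addition for illustration rather than part of the original analysis.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# 'Spam T/F' and 'Predictions' values for rows 0-9 of the table above
y_true = [0, 0, 1, 0, 0, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 6 0 1 3

# Precision: share of predicted spam that really is spam
precision = precision_score(y_true, y_pred)
# Recall: share of actual spam that the model caught
recall = recall_score(y_true, y_pred)
print(precision, recall)  # 1.0 0.75
```

In these ten rows the single error is a false negative (row 5), which is the kind of message listed next.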


These are the spam messages that our model predicted were not spam (false negatives).

In [23]:
sneaky_spam = feature[(feature['Predictions'] == 0) & (feature['Spam T/F'] == 1)]
for x in sneaky_spam['Message Text']:
    print(x)
    print('')

FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Did you hear about the new "Divorce Barbie"? It comes with all of Ken's stuff!

Do you realize that in about 40 years, we'll have thousands of old ladies running around with tattoos?

Ever thought about living a good life with a perfect partner? Just txt back NAME and AGE to join the mobile community. (100p/SMS)

Hello. We need some posh birds and chaps to user trial prods for champneys. Can i put you down? I need your address and dob asap. Ta r

Can U get 2 phone NOW? I wanna chat 2 set up meet Call me NOW on 09096102316 U can cum here 2moro Luv JANE xx Calls£1/minmoremobsEMSPOBox45PO139WA

Hi its LUCY Hubby at meetins all day Fri & I will B alone at hotel U fancy cumin over? Pls leave msg 2day 09099726395 Lucy x Calls£1/minMobsmoreLKPOBOX177HP51FL

Would you like to see my XXX pics they are so hot they were nearly banned in the uk!


These are the legitimate messages that our model marked as spam (false positives).

In [24]:
accidental_spam = feature[(feature['Predictions'] == 1) & (feature['Spam T/F'] == 0)]
for x in accidental_spam['Message Text']:
    print(x)
    print('')

Finally the match heading towards draw as your prediction.

Waiting for your call.

Can u get pic msgs to your phone?

We have sent JD for Customer Service cum Accounts Executive to ur mail id, For details contact us

Hey...Great deal...Farm tour 9am to 5pm $95/pax, $50 deposit by 16 May

Total video converter free download type this in google search:)

Madam,regret disturbance.might receive a reference check from DLF Premarica.kindly be informed.Rgds,Rakhesh,Kerala.

I know complain num only..bettr directly go to bsnl offc nd apply for it..

Boy; I love u Grl: Hogolo Boy: gold chain kodstini Grl: Agalla Boy: necklace madstini Grl: agalla Boy: Hogli 1 mutai eerulli kodthini! Grl: I love U kano;-)

Unlimited texts. Limited minutes.

"GRAN ONLYFOUND OUT AFEW DAYS AGO.CUSOON HONI"

Mathews or tait or edwards or anderson

Gettin rdy to ship comp

\CHA QUITEAMUZING THAT'SCOOL BABE

Have you laid your airtel line to rest?

I liked the new mobile

Anytime...

Nokia phone is lovly..

We hav