# Spam Classifier



1. Reading a text-based dataset into pandas
2. Adding features
3. Vectorizing our dataset
4. Building and evaluating a model
5. Comparing models


In [36]:
import pandas as pd
import re
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

## 1: Reading a text-based dataset into pandas

In [6]:
# read file into pandas using a relative path
path = 'data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [7]:
# examine the shape
sms.shape

(5572, 2)

In [8]:
# examine the first 10 rows
sms.head(10)

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... |
| 6 | ham | Even my brother is not like to speak with me. ... |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... |
| 8 | spam | WINNER!! As a valued network customer you have... |
| 9 | spam | Had your mobile 11 months or more? U R entitle... |


In [9]:
# examine the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [10]:
# convert label to a numerical variable
sms['label_num'] = sms.label.map({'ham':0, 'spam':1})

In [11]:
# check that the conversion worked
sms.head(10)

| | label | message | label_num |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |


In [78]:
# define X and y (from the SMS data) for use with CountVectorizer
X = sms.message
y = sms.label_num
print(X.shape)
print(y.shape)

(5572,)
(5572,)


In [402]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)


## 2: Adding features

The following function has been provided to help you combine new features into the training data:

In [403]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
   
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
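
To see what `add_feature` does, here is a minimal sketch on made-up data. The names `toy_dtm` and `toy_lengths` and their values are assumptions for illustration only; the helper is repeated so the block runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    # same helper as above: append the feature(s) as extra column(s)
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# hypothetical document-term matrix: 3 documents, 2 terms
toy_dtm = csr_matrix(np.array([[1, 0], [0, 2], [3, 0]]))
# hypothetical extra feature: one value per document
toy_lengths = [10, 20, 30]

combined = add_feature(toy_dtm, toy_lengths)
print(combined.shape)      # one extra column compared with toy_dtm
print(combined.toarray())
```

The new feature ends up as the last column, so the row order of the feature must match the row order of the matrix it is added to.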

This function should return a tuple (average length not spam, average length spam).

In [404]:
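def add_len_msg():
    # store the number of characters in each message as a new column
    sms['length'] = sms['message'].str.len()
    # split the data by class (0 = ham, 1 = spam)
    nonSpam = sms[sms['label_num'] == 0]
    spam = sms[sms['label_num'] == 1]
    # average length per class
    return (nonSpam['length'].sum()/len(nonSpam), spam['length'].sum()/len(spam))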

In [405]:
add_len_msg()

(71.48248704663213, 138.6706827309237)

This function should return a tuple (average # digits not spam, average # digits spam).

In [406]:
def no_digit():
    # collect every digit found in each spam / non-spam message
    spam = [re.findall(r"[0-9]", i) for i in sms['message'][sms.label_num == 1]]
    non_spam = [re.findall(r"[0-9]", i) for i in sms['message'][sms.label_num == 0]]
    # average number of digits per message in each class
    return (np.mean(list(map(len, non_spam))), np.mean(list(map(len, spam))))

In [407]:
no_digit()

(0.30528497409326427, 15.812583668005354)

This function should return a tuple (average # non-word characters not spam, average # non-word characters spam).


In [408]:
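def no_nonWord_char():
    # raw strings keep "\W" from being read as an (invalid) string escape
    spam = [re.findall(r"\W", i) for i in sms['message'][sms.label_num == 1]]
    non_spam = [re.findall(r"\W", i) for i in sms['message'][sms.label_num == 0]]
    # average number of non-word characters per message in each class
    return (np.mean(list(map(len, non_spam))), np.mean(list(map(len, spam))))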

In [409]:
no_nonWord_char()

(17.396269430051813, 29.104417670682732)
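
As an alternative sketch, the three per-class averages can be computed in one pass with `Series.str.count` and `groupby`. The `toy` DataFrame below is a made-up stand-in for `sms`, and the column names `length`, `digits` and `non_word` are assumptions for illustration.

```python
import pandas as pd

# hypothetical toy data standing in for the `sms` DataFrame
toy = pd.DataFrame({
    'label_num': [0, 0, 1],
    'message': ['hi there', 'ok', 'WIN 100 now!!'],
})

# one row per message: class label plus the three hand-made features
features = pd.DataFrame({
    'label_num': toy['label_num'],
    'length': toy['message'].str.len(),
    'digits': toy['message'].str.count(r'\d'),
    'non_word': toy['message'].str.count(r'\W'),
})

# average of each feature per class (0 = ham, 1 = spam)
class_means = features.groupby('label_num').mean()
print(class_means)
```

On the real data, replacing `toy` with `sms` should give the same pairs of averages as the three functions above, one column per feature.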

## 3: Vectorizing our dataset

Fit and transform the training data `X_train` using a `CountVectorizer` that ignores terms with a document frequency strictly lower than **5** and uses **character n-grams from n=2 to n=5**.

To make `CountVectorizer` use character n-grams, pass in `analyzer='char_wb'`, which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.

The resulting document-term matrix is then combined with the following additional features:
* the length of the document (number of characters)
* the number of digits per document
* the **number of non-word characters** (anything other than a letter, digit or underscore)
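
A minimal sketch of why `char_wb` n-grams help with spelling mistakes: a misspelled word still shares most of its character n-grams with the correct spelling. The words and the shorter `ngram_range=(2, 3)` are assumptions chosen to keep the output small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy vectorizer; a shorter n-gram range than the one used in this notebook
toy_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
analyze = toy_vect.build_analyzer()

# char_wb pads each word with a space on both sides before slicing n-grams
ngrams_free = set(analyze('free'))
ngrams_typo = set(analyze('freee'))

print(sorted(ngrams_free))
print(sorted(ngrams_typo - ngrams_free))  # n-grams only the misspelling has
```

Every n-gram of `'free'` also occurs in `'freee'`, so the two spellings activate almost the same columns of the document-term matrix.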

In [535]:
# instantiate the vectorizer
# (stop_words is omitted: it only applies when analyzer='word', so it had no effect here)
vect = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb')

In [536]:
# learn training data vocabulary, then use it to create a document-term matrix
vect.fit(X_train)
X_train_vectorized = vect.transform(X_train)

In [537]:
# add the three extra features: message length, digit count, non-word character count
X_train_vectorized = add_feature(X_train_vectorized, X_train.str.len())
X_train_digits = X_train.str.findall(r'(\d)')
X_train_vectorized = add_feature(X_train_vectorized, list(map(len, X_train_digits)))
X_train_nonChar = X_train.str.findall(r'(\W)')
X_train_vectorized = add_feature(X_train_vectorized, list(map(len, X_train_nonChar)))

In [538]:
# transform testing data (using the fitted vocabulary) and add the same three features
X_test_vectorized = vect.transform(X_test)
X_test_vectorized = add_feature(X_test_vectorized, X_test.str.len())
X_test_digits = X_test.str.findall(r'(\d)')
X_test_vectorized = add_feature(X_test_vectorized, list(map(len, X_test_digits)))
X_test_nonChar = X_test.str.findall(r'(\W)')
X_test_vectorized = add_feature(X_test_vectorized, list(map(len, X_test_nonChar)))


## 4: Building and evaluating a model

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
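
A small sketch of how `MultinomialNB` turns counts into smoothed per-class feature probabilities. The toy counts and `alpha=1.0` are assumptions for illustration and are not the values used in this notebook.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# hypothetical counts: 4 documents, 2 features
toy_counts = np.array([[3, 0], [2, 0], [0, 3], [0, 2]])
toy_labels = np.array([0, 0, 1, 1])

toy_nb = MultinomialNB(alpha=1.0).fit(toy_counts, toy_labels)

# smoothed P(feature | class) = (count + alpha) / (class total + alpha * n_features)
print(np.exp(toy_nb.feature_log_prob_))

# each new document is assigned to the class whose feature it contains
toy_pred = toy_nb.predict(np.array([[4, 0], [0, 4]]))
print(toy_pred)
```

A smaller `alpha` applies less smoothing, so features never seen in a class are penalized more heavily.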

In [539]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=.0155)

In [540]:
# train the model using X_train_vectorized (timing it with an IPython "magic command")
%time nb.fit(X_train_vectorized, y_train)

Wall time: 31.2 ms


MultinomialNB(alpha=0.0155, class_prior=None, fit_prior=True)

In [541]:
# make class predictions for X_test_vectorized
y_pred_class = nb.predict(X_test_vectorized)

In [542]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.99282124910265612

In [543]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1204,    1],
       [   9,  179]])
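
The accuracy above can be recovered from the confusion matrix, along with precision and recall for the spam class. This is a minimal sketch that hard-codes the matrix printed above and relies on scikit-learn's layout (rows are true classes, columns are predicted classes).

```python
import numpy as np

# confusion matrix copied from the output above
cm = np.array([[1204, 1],
               [9, 179]])

# for a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all messages classified correctly
precision = tp / (tp + fp)        # share of predicted spam that is really spam
recall = tp / (tp + fn)           # share of real spam that was caught

print(accuracy, precision, recall)
```

The 9 false negatives lower recall more than the single false positive lowers precision, which is why the next cells inspect the misclassified messages.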

In [544]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

4293    G.W.R
Name: message, dtype: object

In [545]:
# print message text for the false negatives (spam incorrectly classified as ham)
X_test[y_test > y_pred_class]

869     Hello. We need some posh birds and chaps to us...
2430    Guess who am I?This is the first time I create...
5098    TheMob>Hit the link to get a premium Pink Pant...
227     Will u meet ur dream partner soon? Is ur caree...
54      SMS. ac Sptv: The New Jersey Devils and the De...
3132    LookAtMe!: Thanks for your purchase of a video...
2663    Hello darling how are you today? I would love ...
955             Filthy stories and GIRLS waiting for your
2003    TheMob>Yo yo yo-Here comes a new selection of ...
Name: message, dtype: object

In [546]:
# example false negative
X_test[3132]

"LookAtMe!: Thanks for your purchase of a video clip from LookAtMe!, you've been charged 35p. Think you can do better? Why not send a video in a MMSto 32323."

In [547]:
# calculate predicted probabilities for X_test_vectorized (poorly calibrated)
y_pred_prob = nb.predict_proba(X_test_vectorized)[:, 1]
y_pred_prob

array([  1.00000000e+000,   1.18515931e-079,   1.00000000e+000, ...,
         5.16066083e-109,   4.91489430e-081,   1.00000000e+000])

In [548]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.98929107442394282

## 5: Comparing models

We will compare multinomial Naive Bayes with [logistic regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression):

> Logistic regression, despite its name, is a **linear model for classification** rather than regression. Logistic regression is also known in the literature as logit regression, maximum-entropy classification (MaxEnt) or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
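
To make "modeled using a logistic function" concrete, the sketch below checks that, for a binary problem, the probability returned by `predict_proba` is the logistic (sigmoid) function applied to the linear score from `decision_function`. The one-dimensional toy data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: one feature, two classes
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_logreg = LogisticRegression(C=1).fit(X_toy, y_toy)

# linear score w*x + b for each sample
scores = toy_logreg.decision_function(X_toy)

# logistic function applied by hand
manual_prob = 1.0 / (1.0 + np.exp(-scores))

# probability of class 1 as reported by the model
model_prob = toy_logreg.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_prob, model_prob))
```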

In [549]:
# import and instantiate a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1)

In [550]:
# train the model using X_train_vectorized
%time logreg.fit(X_train_vectorized, y_train)

Wall time: 648 ms


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [551]:
# make class predictions for X_test_vectorized
y_pred_class = logreg.predict(X_test_vectorized)

In [552]:
# calculate predicted probabilities for X_test_vectorized (well calibrated)
y_pred_prob = logreg.predict_proba(X_test_vectorized)[:, 1]
y_pred_prob

array([  1.00000000e+00,   5.42575909e-03,   9.99986859e-01, ...,
         1.69885925e-06,   1.24753278e-02,   9.99999992e-01])

In [553]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

0.99282124910265612

In [554]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[1204,    1],
       [   9,  179]])

In [555]:
# print message text for the false positives (ham incorrectly classified as spam)
X_test[y_test < y_pred_class]

3797    They have a thread on the wishlist section of ...
Name: message, dtype: object

In [526]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.98790941996998327

## SVM

In [594]:
# import and instantiate a support vector classifier, train it, then make class predictions
from sklearn import svm
model = svm.SVC(C=64)
model.fit(X_train_vectorized, y_train)
y_pred = model.predict(X_test_vectorized)

In [595]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred)

0.9921033740129217

In [596]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred)

array([[1204,    1],
       [  10,  178]])