**Title:** Data Analysis Report **Date:** 21/11/2017 **Student ID Number:** 17350942-1 **Name:** Tittaya MAIRITTHA

## Purpose 
Classify SMS text messages as spam or ham (legitimate) using two supervised machine learning algorithms: a naïve Bayes classifier and logistic regression.

## Dataset
The data come from the SMS Spam Collection Data Set (https://archive.ics.uci.edu/ml/datasets/sms+spam+collection), a public set of labeled SMS messages collected for mobile phone spam research. It contains 5,572 text messages: 4,825 ham and 747 spam.

## Methods 
1.	Naïve Bayes classifiers are a family of simple probabilistic classifiers based on Bayes' theorem, with the "naive" assumption that the features are conditionally independent given the class. This report uses the multinomial naïve Bayes classifier, which implements the naive Bayes algorithm for multinomially distributed data and is suitable for classification with discrete features (e.g., word counts for text classification).
2.	Logistic regression is another way to predict a class label from the features. It takes features that can be continuous or discrete (here, the counts of words in SMS texts), maps a weighted sum of them to a probability, and thresholds that probability to obtain a discrete label (spam or not spam).
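
As a compact reference, the two decision rules can be written as follows. The notation is an assumption introduced here, not taken from the report: $x_i$ is the count of term $i$ in a message, $N_{ci}$ is the count of term $i$ over all training messages of class $c$, $N_c$ is the total term count for class $c$, $n$ is the vocabulary size, $\alpha$ is the smoothing parameter, and $w$, $b$ are the logistic regression weights and intercept.

```latex
% Multinomial naive Bayes: pick the class with the highest log-posterior score
\hat{y} = \arg\max_{c \in \{\text{ham},\,\text{spam}\}}
          \Big[ \log P(c) + \sum_{i=1}^{n} x_i \log \theta_{ci} \Big],
\qquad
\theta_{ci} = \frac{N_{ci} + \alpha}{N_c + \alpha n}

% Logistic regression: squash a weighted sum of the features into a probability
P(\text{spam} \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}}
```

With the usual 0.5 threshold, logistic regression predicts spam when $w^\top x + b \ge 0$.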

## Analyzing

Each model is evaluated by its accuracy, precision, recall, and F-measure. Here tp, tn, fp, and fn are the numbers of true positives, true negatives, false positives, and false negatives, with spam as the positive class.
1.	The accuracy is the ratio (tp + tn) / (tp + tn + fp + fn), i.e., the fraction of all messages that are classified correctly.
2.	The precision is the ratio tp / (tp + fp). It is intuitively the ability of the classifier not to label a negative sample as positive.
3.	The recall is the ratio tp / (tp + fn). It is intuitively the ability of the classifier to find all the positive samples.
4.	The F-measure can be interpreted as a weighted harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0.
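
The definitions above can be checked on a small example. The sketch below uses hypothetical counts (they are not results from this report) and also shows macro averaging, where each class is treated as the positive class in turn and the per-class scores are averaged without weighting.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts with "spam" as the positive class (illustration only).
tp, fp, fn, tn = 8, 2, 4, 86

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

# For the "ham" class the roles swap: tn is its tp, fn is its fp, fp is its fn.
ham_precision, ham_recall, ham_f1 = precision_recall_f1(tp=tn, fp=fn, fn=fp)
macro_precision = (precision + ham_precision) / 2
macro_recall = (recall + ham_recall) / 2

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
print(f"macro precision={macro_precision:.3f} macro recall={macro_recall:.3f}")
```

Macro averaging gives the minority class the same weight as the majority class, which matters for an imbalanced dataset such as this one.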

## Experiment
### Part I Data preprocessing

Import modules 

In [22]:
%matplotlib inline

import numpy as np 
import pandas as pd 
import zipfile
import chardet
import nltk
import matplotlib.pyplot as plt
import itertools
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import preprocessing
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support


Read a zip file

In [4]:
path = './data/'
dataset = 'sms-spam-collection-dataset'
with zipfile.ZipFile(path + dataset +".zip","r") as z:
    z.extractall(path)

Read CSV file into DataFrame

In [27]:
with open(path + 'spam.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = pd.read_csv(path + 'spam.csv', encoding=result['encoding'])
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df.columns = ["label", "message"]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
label      5572 non-null object
message    5572 non-null object
dtypes: object(2)
memory usage: 87.1+ KB


Convert label to a numerical variable

In [28]:
le = preprocessing.LabelEncoder()
df['label_num'] = le.fit_transform(df['label'])  # encode the string labels; y is not defined yet at this point
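
LabelEncoder assigns integer codes to the class names in sorted order, which is why ham becomes 0 and spam becomes 1. A minimal pure-Python sketch of that mapping, using a short hypothetical label list rather than the real column:

```python
# Hypothetical labels standing in for the df['label'] column.
labels = ["ham", "spam", "ham", "ham", "spam"]

# Sort the distinct class names, then number them from 0.
classes = sorted(set(labels))
mapping = {name: idx for idx, name in enumerate(classes)}
encoded = [mapping[name] for name in labels]

print(mapping)
print(encoded)
```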

Define X and y

In [29]:
X = df.message
y = df.label_num
print(X.shape)
print(y.shape)

(5572,)
(5572,)


Check that the conversion worked

In [30]:
df.head(10)

|   | label | message | label_num |
|---|-------|---------|:---------:|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
| 5 | spam | FreeMsg Hey there darling it's been 3 week's n... | 1 |
| 6 | ham | Even my brother is not like to speak with me. ... | 0 |
| 7 | ham | As per your request 'Melle Melle (Oru Minnamin... | 0 |
| 8 | spam | WINNER!! As a valued network customer you have... | 1 |
| 9 | spam | Had your mobile 11 months or more? U R entitle... | 1 |


Examine the class distribution

In [31]:
df.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

Split X and y into training (75%) and testing (25%) sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
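
The call above does not pass `test_size`, so scikit-learn's default of 25% applies. The printed shapes can be reproduced with simple arithmetic; this sketch assumes the test size is rounded up and the remainder goes to training.

```python
import math

n_samples = 5572       # total messages in the dataset
test_fraction = 0.25   # scikit-learn default when test_size is not given

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```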


Represent the text as numerical data using CountVectorizer, tuned to remove English stop words and to include both 1-grams and 2-grams

In [32]:
vect = CountVectorizer(stop_words='english', ngram_range=(1, 2))
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)
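
To make the vectorizer settings concrete, here is a simplified pure-Python sketch of unigram and bigram counting for a single message. It is not the CountVectorizer implementation: the stop-word set is a tiny hypothetical one, and the function name `ngram_counts` is introduced only for illustration. Like CountVectorizer's defaults, it lowercases the text, keeps tokens of two or more word characters, and removes stop words before building n-grams.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; the real 'english' list is much longer.
STOP_WORDS = {"a", "to", "in", "the", "you"}


def ngram_counts(text, ngram_range=(1, 2), stop_words=STOP_WORDS):
    """Count word n-grams in one message after stop-word removal."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    tokens = [t for t in tokens if t not in stop_words]
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


counts = ngram_counts("Free entry in a comp to win the FA Cup")
for term, count in sorted(counts.items()):
    print(term, count)
```

Because stop words are dropped first, a bigram such as "entry comp" can join two words that were not adjacent in the original message.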

### Part II Create models

### Naive Bayes
Instantiate a Multinomial Naive Bayes model

In [13]:
nb = MultinomialNB()

Train the model

In [33]:
%time nb.fit(X_train_dtm, y_train)

CPU times: user 5.89 ms, sys: 2.44 ms, total: 8.33 ms
Wall time: 7.23 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
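
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. The following toy sketch shows how smoothing enters a multinomial naive Bayes model. It is an illustration on a hypothetical four-message corpus, not scikit-learn's implementation, and the helper names `train_nb` and `predict_nb` are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit log-priors and smoothed log-likelihoods from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counter in word_counts.values() for w in counter}
    model = {}
    for label, n_class in class_counts.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)  # smoothing adds alpha per vocabulary term
        log_prior = math.log(n_class / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) / denom)
                   for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior score."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        # Words never seen in training are ignored.
        scores[label] = log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(scores, key=scores.get)


# Hypothetical toy corpus (not taken from the SMS dataset).
docs = [["win", "cash", "now"], ["free", "cash", "prize"],
        ["see", "you", "at", "lunch"], ["call", "you", "later"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels)
prediction = predict_nb(model, ["free", "cash"])
print(prediction)
```

Without smoothing, a word that never appears in one class would give that class a zero likelihood and dominate the decision.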

Make class predictions

In [15]:
y_pred_class = nb.predict(X_test_dtm)

Calculate accuracy of class predictions

In [34]:
accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
print('accuracy_score score: {0:0.2f}'.format(accuracy_score))

accuracy_score score: 0.99


Compute macro-averaged precision, recall, and F-measure

In [17]:
precision_recall_fscore_support(y_test, y_pred_class, average='macro')

(0.99009104241662382, 0.97139781991389573, 0.9804894673937401, None)

Print the confusion matrix

In [20]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred_class)
cnf_matrix

array([[1211,    2],
       [  10,  170]])

Print message text for the false positives (ham incorrectly classified as spam)

In [19]:
X_test[y_test < y_pred_class]

1289    Hey...Great deal...Farm tour 9am to 5pm $95/pa...
1081                    Can u get pic msgs to your phone?
Name: message, dtype: object

Print message text for the false negatives (spam incorrectly classified as ham)

In [19]:
X_test[y_test > y_pred_class]

3528    Xmas & New Years Eve tickets are now on sale f...
1662    Hi if ur lookin 4 saucy daytime fun wiv busty ...
3417    LIFE has never been this much fun and great un...
2773    How come it takes so little time for a child w...
1457    CLAIRE here am havin borin time & am now alone...
2429    Guess who am I?This is the first time I create...
4067    TBS/PERSOLVO. been chasing us since Sept for£...
3358    Sorry I missed your call let's talk when you h...
2821    ROMCAPspam Everyone around should be respondin...
2247    Back 2 work 2morro half term over! Can U C me ...
Name: message, dtype: object

Calculate predicted probabilities 

In [20]:
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([  6.22825609e-03,   1.66438619e-03,   9.23897333e-05, ...,
         1.61421692e-04,   2.64886679e-04,   7.44334421e-04])

Calculate AUC

In [21]:
roc_auc_score = metrics.roc_auc_score(y_test, y_pred_prob)
print('roc_auc_score score: {0:0.2f}'.format(roc_auc_score))

roc_auc_score score: 0.99
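
AUC can be read as the probability that a randomly chosen spam message receives a higher predicted probability than a randomly chosen ham message. The sketch below computes it by direct pairwise comparison on a small hypothetical example. It is quadratic in the number of samples and meant only to illustrate the definition, and it is not how `metrics.roc_auc_score` is implemented.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted spam probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```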


### Logistic Regression
Instantiate a Logistic Regression model

In [37]:
logreg = LogisticRegression()

Train the model

In [38]:
%time logreg.fit(X_train_dtm, y_train)

CPU times: user 43 ms, sys: 3.03 ms, total: 46 ms
Wall time: 53.1 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Make class predictions 

In [39]:
y_pred_class = logreg.predict(X_test_dtm)

Calculate accuracy of class predictions

In [26]:
accuracy_score = metrics.accuracy_score(y_test, y_pred_class)
print('accuracy_score score: {0:0.2f}'.format(
      accuracy_score))

accuracy_score score: 0.98


Compute macro-averaged precision, recall, and F-measure

In [27]:
precision_recall_fscore_support(y_test, y_pred_class, average='macro')

(0.99029911075181887, 0.93333333333333335, 0.95938775510204088, None)

Print the confusion matrix

In [21]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred_class)
cnf_matrix

array([[1213,    0],
       [  24,  156]])

Print message text for the false positives (ham incorrectly classified as spam)

In [30]:
X_test[y_test < y_pred_class]

Series([], Name: message, dtype: object)

Print message text for the false negatives (spam incorrectly classified as ham)

In [31]:
X_test[y_test > y_pred_class]

1117    449050000301 You have won a £2,000 price! To ...
3528    Xmas & New Years Eve tickets are now on sale f...
1662    Hi if ur lookin 4 saucy daytime fun wiv busty ...
1448    As a registered optin subscriber ur draw 4 £1...
5110      You have 1 new message. Please call 08715205273
4247    accordingly. I repeat, just text the word ok o...
3417    LIFE has never been this much fun and great un...
2773    How come it takes so little time for a child w...
1960    Guess what! Somebody you know secretly fancies...
5       FreeMsg Hey there darling it's been 3 week's n...
517     Your credits have been topped up for http://ww...
4071    Loans for any purpose even if you have Bad Cre...
1457    CLAIRE here am havin borin time & am now alone...
190     Are you unique enough? Find out from 30th Augu...
2429    Guess who am I?This is the first time I create...
3057    You are now unsubscribed all services. Get ton...
1021    Guess what! Somebody you know secretly fancies...
4067    TBS/PE

Calculate predicted probabilities

In [None]:
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

Calculate AUC

In [33]:
roc_auc_score = metrics.roc_auc_score(y_test, y_pred_prob)
print('roc_auc_score score: {0:0.2f}'.format(roc_auc_score))

roc_auc_score score: 0.99


## Summary 

Results

| Model                | Accuracy  | Precision (macro) | Recall (macro) | F-measure (macro) |
| ---------------------|:---------:|:-----------------:|:--------------:|:-----------------:|
| Naive Bayes          | 0.99      |   0.99            |   0.97         |    0.98           |
| Logistic Regression  | 0.98      |   0.99            |   0.93         |    0.96           |


**Naïve Bayes classifier**
A total of only 2 + 10 = 12 of the 1,393 test messages were incorrectly classified (0.86%).
Among the errors, 2 of the 1,213 ham messages were misidentified as spam (false positives), and 10 of the 180 spam messages were incorrectly labeled as ham (false negatives).

**Logistic Regression**
A total of 24 of the 1,393 test messages were incorrectly classified (1.72%).
None of the 1,213 ham messages was misidentified as spam, while 24 of the 180 spam messages were incorrectly labeled as ham (false negatives).

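Ham messages that are incorrectly classified as spam could cause significant problems for the deployment of a filtering algorithm, because the filter could cause a person to miss an important text message. On this test set, naïve Bayes produced 2 such false positives and logistic regression produced none, although logistic regression let more spam through.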