### Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



### Import

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_curve

In [2]:
df = pd.read_csv("./SMSSpamCollection",sep='\t', names=['spam', 'txt'])

### Prepare

In [3]:
df.head()

|   | spam | txt |
|---|------|-----|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
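# Preprocess and vectorize text
# (the NLTK stop word list must be available: run nltk.download('stopwords') once)
# Recent scikit-learn releases accept only 'english', a list, or None for stop_words, so pass a list, not a set
stop_word_list = list(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stop_word_list)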
X = vectorizer.fit_transform(df.txt)
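
To make the weighting concrete, here is a minimal pure-Python sketch of the TF-IDF computation on a toy corpus. It assumes scikit-learn's documented defaults (raw term counts, smoothed idf, L2 row normalization). The corpus and the helper name `tfidf_rows` are made up for illustration, and the real vectorizer additionally handles tokenization, stop words, and accent stripping.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

# Toy messages, not taken from the SMS dataset
toy_docs = ["free entry win cup", "ok see you later", "win free prize now"]
rows = tfidf_rows(toy_docs)
```

In the first toy message, `entry` appears in only one document while `free` appears in two, so `entry` ends up with the larger weight. Rarer terms are treated as more informative.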

In [76]:
# Get labels and parse to binary
y = [1 if label == 'spam' else 0 for label in df.spam]

### Train

In [77]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
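
Two details of this split may be worth revisiting. The vectorizer above was fit on the full dataset before splitting, so its idf statistics already include the test messages, and the split is not stratified even though spam is the minority class. The following is a sketch of one alternative, assuming a scikit-learn `Pipeline` and a made-up toy corpus standing in for `df.txt`; it is not what this notebook ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df.txt and the binary labels (made up for illustration)
texts = ["win free prize now", "free entry to win cash", "claim your free prize",
         "urgent win cash now", "see you at lunch", "ok call me later",
         "are you home yet", "running late sorry", "thanks see you soon",
         "lunch at noon ok", "call you later tonight", "home in ten minutes"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the spam/ham ratio in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

# The pipeline fits TF-IDF on the training messages only, then the classifier
model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
model.fit(txt_train, lab_train)

# Column 1 of predict_proba is the estimated probability of the spam class
spam_proba = model.predict_proba(txt_test)[:, 1]
```

Because the pipeline owns the vectorizer, calling `model.predict` on new messages applies the same vocabulary and idf weights learned from the training split.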

In [78]:
# Build and train naive bayes classifier
clf = naive_bayes.MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
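
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough sketch of what it does, the per-class likelihood of a term is estimated as (term total + alpha) / (class total + alpha × vocabulary size), so a term never seen in a class still gets a small non-zero probability. The helper name and the totals below are made up for illustration.

```python
def smoothed_likelihoods(term_totals, alpha=1.0):
    """term_totals: {term: summed weight of the term within one class}."""
    vocab_size = len(term_totals)
    class_total = sum(term_totals.values())
    return {term: (weight + alpha) / (class_total + alpha * vocab_size)
            for term, weight in term_totals.items()}

# Made-up totals for the spam class over a four-term vocabulary
spam_totals = {"free": 3.0, "win": 2.0, "lunch": 0.0, "ok": 0.0}
spam_likelihoods = smoothed_likelihoods(spam_totals)
```

Without smoothing, `lunch` would have likelihood zero in the spam class, and any message containing it could never be classified as spam regardless of its other words.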

### Evaluate

In [79]:
# Evaluate performance
y_test_array = np.array(y_test)
# Column 1 of predict_proba holds the predicted probability of the positive class (spam)
y_pred_proba_array = clf.predict_proba(X_test)[:, 1]
roc_auc_score(y_test_array, y_pred_proba_array)

0.98589322144123448
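
One way to read this score: ROC AUC is the probability that a randomly chosen spam message receives a higher predicted score than a randomly chosen ham message, with ties counted as half. The sketch below computes that directly on made-up toy values; `pairwise_auc` is a hypothetical helper, and its quadratic loop is meant for illustration rather than for large datasets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (not from the SMS dataset)
toy_true = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_true, toy_scores)  # 3 of 4 pairs ordered correctly
```

Since the metric depends only on how the scores are ordered, it says nothing about whether the default 0.5 decision threshold is a good one, which is why the confusion matrix is examined separately.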

In [80]:
# Generate confusion matrix
y_pred_array = clf.predict(X_test)
confusion_matrix(y_test_array, y_pred_array)

array([[1207,    0],
       [  35,  151]])

**True Positive**<br>
Predicted Spam / Actual Spam: 151

**False Positive**<br>
Predicted Spam / Actual Ham: 0

**True Negative**<br>
Predicted Ham / Actual Ham: 1207

**False Negative**<br>
Predicted Ham / Actual Spam: 35

In [81]:
# Calculate precision and recall from the confusion matrix counts
# recall = TP / (TP + FN), precision = TP / (TP + FP)
recall = 151 / (35 + 151)
precision = 151 / (0 + 151)
print('Precision: ', str(precision), '\nRecall: ', str(recall))

Precision:  1.0 
Recall:  0.8118279569892473
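
The counts above were typed in by hand. As a small sketch, they can instead be unpacked from the matrix itself, assuming scikit-learn's layout for binary labels (rows are actual classes, columns are predicted classes, class 0 first). The matrix is copied from the output above, and `safe_ratio` is a hypothetical helper that avoids a division by zero when a class is never predicted.

```python
# Confusion matrix as printed above: rows = actual, columns = predicted
cm = [[1207, 0],
      [35, 151]]
(tn, fp), (fn, tp) = cm

def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

precision = safe_ratio(tp, tp + fp)  # share of predicted spam that is spam
recall = safe_ratio(tp, tp + fn)     # share of actual spam that was caught
```

With a fitted model, `confusion_matrix(y_test_array, y_pred_array).ravel()` yields the same four numbers in the order `tn, fp, fn, tp`.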


### Summary

Recall was modest at about 81% (35 of the 186 spam messages in the test set were missed), but precision was 100%: no ham message was flagged as spam. This trade-off is usually acceptable for a spam filter. Users are willing to let some spam through to their inboxes, but they find it unacceptable when ham is marked as spam, since most people do not actively check their spam folders for legitimate messages.
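
If the missed spam turned out to matter more than assumed here, one option that this notebook does not explore is to lower the decision threshold applied to the predicted spam probability. The sketch below uses made-up labels and probabilities to show the direction of the effect; `precision_recall_at` is a hypothetical helper, and the thresholds are arbitrary.

```python
def precision_recall_at(y_true, spam_proba, threshold):
    """Precision and recall when proba >= threshold is labelled spam."""
    pred = [1 if p >= threshold else 0 for p in spam_proba]
    tp = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, pred) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, pred) if y == 1 and yhat == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and predicted spam probabilities
demo_true = [1, 1, 1, 1, 0, 0, 0, 0]
demo_proba = [0.9, 0.7, 0.45, 0.3, 0.4, 0.2, 0.1, 0.05]

strict = precision_recall_at(demo_true, demo_proba, 0.5)    # default-style cut-off
lenient = precision_recall_at(demo_true, demo_proba, 0.35)  # catches more spam
```

On this toy data the lower threshold raises recall at the cost of one ham message flagged as spam, which is exactly the kind of error the summary argues against. Any real threshold would need to be chosen on held-out data.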