<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

We all hate spam, so a classifier that labels each message as spam or not spam ("ham") is useful. In this lab we train a Naive Bayes classifier on a collection of SMS messages.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

PySpark provides the `NaiveBayes` class (in `pyspark.ml.classification`) for Naive Bayes classification.

In [1]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /Users/sujee/spark
Spark config:
	 spark.app.name=TestApp
	spark.master=local[*]
	executor.memory=2g
	spark.sql.warehouse.dir=/var/folders/lp/qm_skljd2hl4xtps5vw0tdgm0000gn/T/tmp9058zjrn
	some_property=some_value
Spark UI running on port 4040


## Step 1: Let's load the dataframe

We will load the data into a Spark dataframe. The outcome column, `isspam`, holds the string "ham" or "spam"; we will turn it into a numeric `label` column in Step 3.

In [2]:
%%time

dataset = spark.read.format("csv").\
          option('header','true').\
          option('delimiter', '\t').\
          option('inferSchema', 'true').\
          load("/data/spam/SMSSpamCollection.txt")

CPU times: user 1.91 ms, sys: 1.17 ms, total: 3.08 ms
Wall time: 2.96 s


In [3]:
print("records count : {:,}".format(dataset.count()))

dataset.printSchema()
dataset.show()

records count : 5,574
root
 |-- isspam: string (nullable = true)
 |-- text: string (nullable = true)

+------+--------------------+
|isspam|                text|
+------+--------------------+
|   ham|Go until jurong p...|
|   ham|Ok lar... Joking ...|
|  spam|Free entry in 2 a...|
|   ham|U dun say so earl...|
|   ham|Nah I don't think...|
|  spam|FreeMsg Hey there...|
|   ham|Even my brother i...|
|   ham|As per your reque...|
|  spam|WINNER!! As a val...|
|  spam|Had your mobile 1...|
|   ham|I'm gonna be home...|
|  spam|SIX chances to wi...|
|  spam|URGENT! You have ...|
|   ham|I've been searchi...|
|   ham|I HAVE A DATE ON ...|
|  spam|XXXMobileMovieClu...|
|   ham|Oh k...i'm watchi...|
|   ham|Eh u remember how...|
|   ham|Fine if thats th...|
|  spam|England v Macedon...|
+------+--------------------+
only showing top 20 rows



In [4]:
## Count spam/ham
dataset.groupby("isspam").count().show()

+------+-----+
|isspam|count|
+------+-----+
|   ham| 4827|
|  spam|  747|
+------+-----+



## Step 2: Vectorize using tf/idf

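Let's start with TF/IDF vectorization. TF (term frequency) counts how often each term occurs in a message. IDF (inverse document frequency) then scales that count down for terms that appear in many messages across the dataset, so very common words carry less weight.

Text leads to very high-dimensional data, because every distinct word would otherwise need its own dimension. `HashingTF` caps this by hashing words into a fixed number of buckets (`numFeatures`).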
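
To make the weighting concrete, here is a small pure-Python sketch of TF/IDF on a toy corpus. It uses the smoothed formula documented for Spark ML's `IDF`, `idf = log((N + 1) / (df + 1))`. The toy documents and the helper name `tf_idf` are made up for illustration, and the sketch skips the hashing step entirely.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((n_docs + 1) / (d + 1)) for t, d in df.items()}
    # weight = raw term count in the document * idf of the term
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]

docs = [
    ["free", "entry", "win", "free"],
    ["ok", "see", "you", "later"],
    ["win", "a", "free", "prize"],
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

Note that `free` occurs twice in the first message but still gets a lower weight than `entry`, because `free` also appears in another message while `entry` does not.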

In [5]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

## TODO : split the text into words
## Hint : outputCol = 'words'
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(dataset)
wordsData.show()


+------+--------------------+--------------------+
|isspam|                text|               words|
+------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|
|  spam|Free entry in 2 a...|[free, entry, in,...|
|   ham|U dun say so earl...|[u, dun, say, so,...|
|   ham|Nah I don't think...|[nah, i, don't, t...|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|
|   ham|Even my brother i...|[even, my, brothe...|
|   ham|As per your reque...|[as, per, your, r...|
|  spam|WINNER!! As a val...|[winner!!, as, a,...|
|  spam|Had your mobile 1...|[had, your, mobil...|
|   ham|I'm gonna be home...|[i'm, gonna, be, ...|
|  spam|SIX chances to wi...|[six, chances, to...|
|  spam|URGENT! You have ...|[urgent!, you, ha...|
|   ham|I've been searchi...|[i've, been, sear...|
|   ham|I HAVE A DATE ON ...|[i, have, a, date...|
|  spam|XXXMobileMovieClu...|[xxxmobilemoviecl...|
|   ham|Oh k...i'm watchi...|[o
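
The `words` column shows that tokens keep their punctuation (`winner!!`, `lar...`). As a rough pure-Python approximation of what `Tokenizer` does (lowercase the text, then split on whitespace), consider the sketch below. `simple_tokenize` is a hypothetical helper, not a Spark API, and edge cases may differ from Spark's behaviour.

```python
import re

def simple_tokenize(text):
    # lowercase, then split on every single whitespace character;
    # two consecutive spaces therefore produce an empty-string token
    return re.split(r"\s", text.lower())

tokens = simple_tokenize("WINNER!  Click here to claim your prize !!!!")
print(tokens)
```

The empty token caused by the double space matches the `[winner!, , click...` entry that appears in the Step 9 output.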

In [6]:
## compute the hash of words

## we will tweak this later
number_of_features = 2000

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_features)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.show()


+------+--------------------+--------------------+--------------------+--------------------+
|isspam|                text|               words|         rawFeatures|            features|
+------+--------------------+--------------------+--------------------+--------------------+
|   ham|Go until jurong p...|[go, until, juron...|(2000,[7,77,165,2...|(2000,[7,77,165,2...|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|(2000,[20,484,131...|(2000,[20,484,131...|
|  spam|Free entry in 2 a...|[free, entry, in,...|(2000,[30,128,140...|(2000,[30,128,140...|
|   ham|U dun say so earl...|[u, dun, say, so,...|(2000,[57,372,381...|(2000,[57,372,381...|
|   ham|Nah I don't think...|[nah, i, don't, t...|(2000,[388,426,89...|(2000,[388,426,89...|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|(2000,[68,91,98,9...|(2000,[68,91,98,9...|
|   ham|Even my brother i...|[even, my, brothe...|(2000,[47,48,57,2...|(2000,[47,48,57,2...|
|   ham|As per your reque...|[as, per, your, r...|(2000,[272,388,39...

In [7]:
rescaledData.select("isspam", "text", "features").show()

+------+--------------------+--------------------+
|isspam|                text|            features|
+------+--------------------+--------------------+
|   ham|Go until jurong p...|(2000,[7,77,165,2...|
|   ham|Ok lar... Joking ...|(2000,[20,484,131...|
|  spam|Free entry in 2 a...|(2000,[30,128,140...|
|   ham|U dun say so earl...|(2000,[57,372,381...|
|   ham|Nah I don't think...|(2000,[388,426,89...|
|  spam|FreeMsg Hey there...|(2000,[68,91,98,9...|
|   ham|Even my brother i...|(2000,[47,48,57,2...|
|   ham|As per your reque...|(2000,[272,388,39...|
|  spam|WINNER!! As a val...|(2000,[74,153,388...|
|  spam|Had your mobile 1...|(2000,[82,279,343...|
|   ham|I'm gonna be home...|(2000,[26,263,333...|
|  spam|SIX chances to wi...|(2000,[15,46,214,...|
|  spam|URGENT! You have ...|(2000,[68,196,388...|
|   ham|I've been searchi...|(2000,[39,185,317...|
|   ham|I HAVE A DATE ON ...|(2000,[44,82,712,...|
|  spam|XXXMobileMovieClu...|(2000,[78,273,388...|
|   ham|Oh k...i'm watchi...|(2

## Step 3: Create a numeric label out of the string column "isspam."
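
By default `StringIndexer` assigns indices in order of descending label frequency, which is why the more common `ham` becomes `0.0` and `spam` becomes `1.0`. A pure-Python sketch of that rule, using the counts from Step 1 (`index_labels` is a hypothetical helper):

```python
from collections import Counter

def index_labels(labels):
    # the most frequent label gets 0.0, the next one 1.0, and so on
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

labels = ["ham"] * 4827 + ["spam"] * 747
mapping = index_labels(labels)
print(mapping)
```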

In [8]:
from pyspark.ml.feature import StringIndexer

## TODO : Index 'isspam' column into 'label' column
## Hint : inputCol = 'isspam',   outputCol = 'label'
indexer = StringIndexer(inputCol="isspam", outputCol="label")
indexed = indexer.fit(rescaledData).transform(rescaledData)

indexed.select(['text', 'isspam', 'label', 'features']).show()


+--------------------+------+-----+--------------------+
|                text|isspam|label|            features|
+--------------------+------+-----+--------------------+
|Go until jurong p...|   ham|  0.0|(2000,[7,77,165,2...|
|Ok lar... Joking ...|   ham|  0.0|(2000,[20,484,131...|
|Free entry in 2 a...|  spam|  1.0|(2000,[30,128,140...|
|U dun say so earl...|   ham|  0.0|(2000,[57,372,381...|
|Nah I don't think...|   ham|  0.0|(2000,[388,426,89...|
|FreeMsg Hey there...|  spam|  1.0|(2000,[68,91,98,9...|
|Even my brother i...|   ham|  0.0|(2000,[47,48,57,2...|
|As per your reque...|   ham|  0.0|(2000,[272,388,39...|
|WINNER!! As a val...|  spam|  1.0|(2000,[74,153,388...|
|Had your mobile 1...|  spam|  1.0|(2000,[82,279,343...|
|I'm gonna be home...|   ham|  0.0|(2000,[26,263,333...|
|SIX chances to wi...|  spam|  1.0|(2000,[15,46,214,...|
|URGENT! You have ...|  spam|  1.0|(2000,[68,196,388...|
|I've been searchi...|   ham|  0.0|(2000,[39,185,317...|
|I HAVE A DATE ON ...|   ham|  

## Step 4: Split into training and test

We will split our dataset into training and test sets.
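
`randomSplit` assigns each row to a split at random, so the resulting sizes are only approximately 80/20 and change from run to run unless a `seed` is passed. The following pure-Python sketch mimics that behaviour with per-row sampling. It illustrates the idea only and is not Spark's actual implementation; `random_split` and the seed value are assumptions made for this example.

```python
import random

def random_split(rows, train_fraction, seed):
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        # each row independently lands in train with probability train_fraction
        (train_rows if rng.random() < train_fraction else test_rows).append(row)
    return train_rows, test_rows

rows = list(range(5574))  # stand-in for the 5,574 records
train_rows, test_rows = random_split(rows, 0.8, seed=42)
print(len(train_rows), len(test_rows))
```

Re-running with the same seed reproduces the same split, which makes accuracy numbers comparable between runs.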

In [9]:
# TODO : Split the data into train and test sets, 80/20
(train, test) = indexed.randomSplit([.8, .2])

print("training set count : ", train.count())
print("testing set count : ", test.count())

training set count :  4418
testing set count :  1156


## Step 5: Fit Naive Bayes model
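
As a rough picture of what `modelType="multinomial"` with `smoothing=1.0` means, the pure-Python sketch below trains a tiny multinomial Naive Bayes on word counts with add-one (Laplace) smoothing. The toy messages and function names are invented for illustration; Spark's implementation works on the feature vectors and differs in detail.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, smoothing=1.0):
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)  # per-class word counts
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c in class_counts:
        # smoothing adds a pseudo-count to every vocabulary word
        total = sum(word_counts[c].values()) + smoothing * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + smoothing) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict_nb(model_nb, doc):
    log_prior, log_likelihood = model_nb
    # score = log prior + sum of log likelihoods of known words
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in doc if w in log_likelihood[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

docs = [["win", "free", "prize"], ["free", "entry", "win"],
        ["see", "you", "later"], ["call", "me", "later"]]
labels = ["spam", "spam", "ham", "ham"]
model_nb = train_nb(docs, labels)
print(predict_nb(model_nb, ["free", "prize", "later"]))  # prints: spam
```

Without smoothing, a word never seen in one class (such as `later` in spam) would give that class a probability of zero regardless of the other words.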

In [10]:
from pyspark.ml.classification import NaiveBayes

## TODO : create the trainer and set its parameters
## Hint : NaiveBayes  (see the class name above)
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

In [11]:
%%time

# train the model
## TODO : fit on training data (hint: train)
print("training starting...")
model = nb.fit(train)
print ("training done.")

training starting...
training done.
CPU times: user 7.32 ms, sys: 2.67 ms, total: 9.98 ms
Wall time: 797 ms


## Step 6: Run test data

Let's call `.transform` on our model to make predictions on the test data. The output will be in the "prediction" column, while the correct label is in the "label" column.

We can evaluate the model by comparing these two columns.

In [12]:
# select example rows to display.
## TODO : transform on test data (hint : test)
predictions_test = model.transform(test)
predictions_test.limit(5).toPandas()


Unnamed: 0,isspam,text,words,rawFeatures,features,label,rawPrediction,probability,prediction
0,ham,&lt;#&gt; mins but i had to stop somewhere f...,"[, &lt;#&gt;, , mins, but, i, had, to, stop, s...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[-236.57647504147462, -252.88586086018597]","[0.9999999174107163, 8.258928376801963e-08]",0.0
1,ham,and picking them up from various points,"[, and, , picking, them, up, from, various, po...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[-217.0008742346316, -253.6499421962386]","[0.9999999999999998, 1.2120262263282667e-16]",0.0
2,ham,what number do u live at? Is it 11?,"[, what, number, do, u, live, at?, is, it, 11?]","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[-218.10860554362733, -258.4689566910269]","[1.0, 2.962935578863434e-18]",0.0
3,ham,"""Happy valentines day"" I know its early but i ...","[""happy, valentines, day"", i, know, its, early...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",0.0,"[-706.8283258134346, -799.8643432541411]","[1.0, 3.93523803605778e-41]",0.0
4,ham,"""Wen u miss someone, the person is definitely ...","[""wen, u, miss, someone,, the, person, is, def...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.4288230189921...",0.0,"[-716.5227052362972, -766.7611408132099]","[1.0, 1.5195837808258532e-22]",0.0


## Step 7: Evaluate the model

Let's look at how our model performs. We will measure accuracy first and then look at the confusion matrix.

In [13]:
predictions_test = model.transform(test)
predictions_train = model.transform(train)

### 7.1 - Accuracy

In [14]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")

print("Training set accuracy = " , evaluator.evaluate(predictions_train))
print("Test set accuracy = " , evaluator.evaluate(predictions_test))


Training set accuracy =  0.9692168401991852
Test set accuracy =  0.9472318339100346


### 7.2 - Confusion Matrix

In [15]:
cm = predictions_test.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label')
cm.show()

## Can you explain the confusion matrix?

+-----+---+---+
|label|  0|  1|
+-----+---+---+
|  0.0|954| 43|
|  1.0| 18|141|
+-----+---+---+
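
Rows are the true labels and columns are the predictions, with `0` = ham and `1` = spam. Reading the table: 954 ham messages were correctly passed, 43 ham messages were wrongly flagged as spam, 18 spam messages slipped through, and 141 spam messages were caught. The sketch below derives accuracy, precision and recall for the spam class from those four numbers (the variable names are illustrative).

```python
# counts copied from the confusion matrix above (label 1 = spam)
tn, fp = 954, 43   # true ham:  predicted ham / predicted spam
fn, tp = 18, 141   # true spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of actual spam, how much was caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

The accuracy agrees with the evaluator's test-set value, while the lower precision shows that a fair share of flagged messages were legitimate, something accuracy alone hides on an imbalanced dataset.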



In [16]:
import seaborn as sns

cm_pd = cm.toPandas()
cm_pd.set_index("label", inplace=True)
# print(cm_pd)

# colormaps : cmap="YlGnBu" , cmap="Greens", cmap="Blues",  cmap="Reds"
sns.heatmap(cm_pd, annot=True, fmt=',', cmap="Blues").plot()

[]

## Step 8: Improve prediction results

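We used only 2,000 hashed features above, so many different words end up sharing the same feature index. Increase `number_of_features` for HashingTF in Step 2, re-run the notebook, and check whether the test accuracy improves.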
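
To see why the number of features matters, the sketch below hashes a synthetic vocabulary into different numbers of buckets and counts how many words share a bucket with at least one other word. It uses `zlib.crc32` as a stand-in hash (Spark's `HashingTF` uses a different hash function), and the vocabulary is generated rather than taken from the dataset.

```python
import zlib
from collections import Counter

def colliding_words(vocab, num_features):
    # map every word to a bucket index, as the hashing trick does
    buckets = Counter(zlib.crc32(w.encode("utf-8")) % num_features for w in vocab)
    # a word collides if its bucket holds more than one word
    return sum(n for n in buckets.values() if n > 1)

vocab = [f"word{i}" for i in range(10_000)]
for num_features in (2_000, 20_000, 200_000):
    print(num_features, colliding_words(vocab, num_features))
```

With far more words than buckets, most words must share a feature index, so the classifier cannot tell them apart. A larger `numFeatures` reduces this at the cost of wider vectors.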

## Step 9:  Run your own test

Now it's your turn!   Make a new dataframe with some sample test data of your own creation.  Make some "spammy" SMSes and some ordinary ones.  See how our spam filter does.

In [18]:
# TODO: make a dataframe with some of your own data.
import pandas as pd

mydata = pd.DataFrame({'text' : ['hey, can we meet 1 hr later?', 
                                'WINNER!  Click here to claim your prize !!!!',
                                'CHEAP DEGREEES !!', 
                                'your text here']
                         })

mydata2 = spark.createDataFrame(mydata)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
fv = tokenizer.transform(mydata2)
fv.show()

## NOTE : make sure this 'numFeatures' matches the 'numFeatures' in step-2
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_features)
fv = hashingTF.transform(fv)

## NOTE : fit IDF on the original dataset (featurizedData), not on these few samples
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
fv = idfModel.transform(fv)
fv.show()

+--------------------+--------------------+
|                text|               words|
+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|
|WINNER!  Click he...|[winner!, , click...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|
|      your text here|  [your, text, here]|
+--------------------+--------------------+

+--------------------+--------------------+--------------------+--------------------+
|                text|               words|         rawFeatures|            features|
+--------------------+--------------------+--------------------+--------------------+
|hey, can we meet ...|[hey,, can, we, m...|(2000,[238,486,74...|(2000,[238,486,74...|
|WINNER!  Click he...|[winner!, , click...|(2000,[388,493,53...|(2000,[388,493,53...|
|   CHEAP DEGREEES !!|[cheap, degreees,...|(2000,[119,339,16...|(2000,[119,339,16...|
|      your text here|  [your, text, here]|(2000,[1135,1169,...|(2000,[1135,1169,...|
+--------------------+--------------------+--

In [19]:
predictions = model.transform(fv)
predictions.select(['text', 'prediction']).show()

+--------------------+----------+
|                text|prediction|
+--------------------+----------+
|hey, can we meet ...|       0.0|
|WINNER!  Click he...|       1.0|
|   CHEAP DEGREEES !!|       1.0|
|      your text here|       1.0|
+--------------------+----------+



## FUN : How will you defeat this algorithm? :-) 

If you were a spammer, how would you defeat this algorithm?

<img src="../assets/images/come-tothe-dark-side-iin-we-have-cookies.png">

## Further Reading
Check out [Amazon Comprehend](https://us-west-2.console.aws.amazon.com/comprehend/v2/home?region=us-west-2#welcome) to parse natural-language text and extract meaning.

## BONUS: Word2Vec Instead of TF/IDF

We used the TF/IDF encoding. We might get better results if we use Word2Vec instead. Run with Word2Vec and see if you get a better accuracy rate.

Refer to the [Spark Word2Vec](https://spark.apache.org/docs/2.2.0/mllib-feature-extraction.html#word2vec) implementation for details.