<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

We all hate spam, so a classifier that labels each message as spam or not spam ("ham") is useful. In this lab we build one for SMS messages.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

PySpark has a class called NaiveBayes that can be used to do Naive Bayes classification.

In [None]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

print('Spark UI running on http://YOURIPADDRESS:' + sc.uiWebUrl.split(':')[2])

## Step 1: Let's load the dataframe

We will load the data into a Spark dataframe. The outcome column, `isspam`, holds the string "ham" or "spam"; we will convert it into a numeric `label` column in Step 3.

In [None]:
t1 = time.perf_counter()

dataset = spark.read.format("csv").\
          option('header','true').\
          option('delimiter', '\t').\
          option('inferSchema', 'true').\
          load("/data/spam/SMSSpamCollection.txt")

t2 = time.perf_counter() 

print("read {:,} records in {:,.2f} ms".format(dataset.count(), (t2-t1)*1000))

dataset.printSchema()
dataset.show()

In [None]:
## Count spam/ham
dataset.groupby("isspam").count().show()

## Step 2: Vectorize using tf/idf

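Let's use TF/IDF for vectorization at first. TF/IDF counts how often each term occurs in a message (term frequency) and then down-weights terms that appear in many messages across the dataset (inverse document frequency), so common words contribute less than distinctive ones.

This leads to very high-dimensional data, because every distinct word in the dataset becomes a dimension. HashingTF caps the dimensionality by hashing each word into one of `numFeatures` buckets.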
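
To make the weighting concrete, here is a minimal pure-Python sketch of TF/IDF on three made-up messages. It uses raw counts for term frequency and the smoothed formula log((N + 1) / (df + 1)) for inverse document frequency, which is the formula Spark's IDF documents. The helper name `tf_idf` and the sample messages are assumptions for illustration; the lab itself uses `HashingTF` and `IDF`.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    idf = {term: math.log((n_docs + 1) / (count + 1)) for term, count in df.items()}
    # Term frequency is the raw count of the term within the document.
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()} for doc in docs]

# Hypothetical messages, already split into words.
docs = [
    "free prize call now".split(),
    "call me later".split(),
    "see you later".split(),
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In the first message, "free" appears in only one document while "call" appears in two, so "free" ends up with the larger weight.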

In [None]:
from pyspark.ml.feature import HashingTF, IDF, Tokenizer

## TODO : split the text into words
## Hint : outputCol = 'words'
tokenizer = Tokenizer(inputCol="text", outputCol="???")
wordsData = tokenizer.transform(dataset)
wordsData.show()


In [None]:
## compute the hash of words
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.show()


In [None]:
rescaledData.select("isspam", "text", "features").show()

## Step 3: Create a numeric label out of the string column "isspam."

In [None]:
from pyspark.ml.feature import StringIndexer

## TODO : Index 'isspam' column into 'label' column
## Hint : inputCol = 'isspam',   outputCol = 'label'
indexer = StringIndexer(inputCol="???", outputCol="???")
indexed = indexer.fit(rescaledData).transform(rescaledData)

indexed.select(['text', 'isspam', 'label', 'features']).show()
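
By default, `StringIndexer` assigns indices by descending frequency, so the most common string gets 0.0. The pure-Python sketch below mimics that ordering on a made-up label list; `fit_string_index` is a hypothetical helper, not part of PySpark, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Hypothetical labels: ham is the majority class.
labels = ["ham", "ham", "spam", "ham", "spam", "ham"]
mapping = fit_string_index(labels)
print(mapping)  # {'ham': 0.0, 'spam': 1.0}
```

If ham is the majority class in your data, expect label 0.0 to mean ham and 1.0 to mean spam. Check the output of the cell above to confirm before reading the confusion matrix later.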


## Step 4: Split into training and test

We will split our dataset into training and test sets.

In [None]:
# TODO : Split the data into train and test into 80/20
(train, test) = indexed.randomSplit([???, ???])

print("training set count : ", train.count())
print("testing set count : ", test.count())

## Step 5: Fit Naive Bayes model

In [None]:
from pyspark.ml.classification import NaiveBayes

## TODO : create the trainer and set its parameters
## Hint : NaiveBayes  (see the class name above)
nb = ???(smoothing=1.0, modelType="multinomial")

# train the model
t1 = time.perf_counter()
## TODO : fit on training data (hint: train)
model = nb.fit(???)
t2 = time.perf_counter()

print("trained on {:,} records  in {:,.2f} ms".\
      format(train.count(), (t2-t1)*1000))
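
To see what `smoothing=1.0` and `modelType="multinomial"` refer to, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It works on raw word counts from four made-up messages, whereas Spark's `NaiveBayes` works on the feature vectors built in Step 2, so treat it as an illustration of the idea and not as Spark's implementation. All function and variable names here are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, smoothing=1.0):
    """Return (log_priors, log_likelihoods, vocabulary) for tokenized docs."""
    vocab = {term for doc in docs for term in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)

    # Prior: fraction of training documents in each class.
    log_priors = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_likelihoods = {}
    for c in class_docs:
        total = sum(term_counts[c].values()) + smoothing * len(vocab)
        # Additive smoothing keeps unseen (term, class) pairs from getting probability 0.
        log_likelihoods[c] = {
            t: math.log((term_counts[c][t] + smoothing) / total) for t in vocab
        }
    return log_priors, log_likelihoods, vocab

def predict(doc, log_priors, log_likelihoods, vocab):
    """Pick the class with the highest log posterior; skip out-of-vocabulary terms."""
    scores = {
        c: log_priors[c] + sum(log_likelihoods[c][t] for t in doc if t in vocab)
        for c in log_priors
    }
    return max(scores, key=scores.get)

# Hypothetical training messages.
docs = [
    "win free prize now".split(),
    "free cash prize".split(),
    "see you at lunch".split(),
    "call me after lunch".split(),
]
labels = ["spam", "spam", "ham", "ham"]
model_parts = train_multinomial_nb(docs, labels)
print(predict("free prize".split(), *model_parts))
print(predict("lunch later".split(), *model_parts))
```

Without smoothing, a word that never appeared in spam would force the spam probability of any message containing it to zero, no matter what the other words say.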

## Step 6: Run test data

Let's call `.transform` on our model to make predictions on our test data. The output is in the "prediction" column, while the correct label is in the "label" column.

We can evaluate the model by comparing these two columns.

In [None]:
# select example rows to display.
## TODO : transform on test data (hint : test)
predictions = model.transform(???)
predictions.show()


## Step 7: Evaluate the model

Let's look at how our model performs. We will start by measuring accuracy.

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Let's build a confusion matrix.

In [None]:
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()
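
Accuracy alone can hide problems when one class dominates. The sketch below computes accuracy, precision, and recall from the four cells of a 2x2 confusion matrix. The counts are hypothetical and are not results from this lab, and the sketch assumes label 1 means spam, which depends on how the indexer ordered the labels in Step 3.

```python
def summarize(tn, fp, fn, tp):
    """Return accuracy, precision, and recall for the positive (spam) class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for illustration only.
# Rows are the true label, columns are the prediction (0 = ham, 1 = spam).
#            pred 0   pred 1
# label 0      950        10
# label 1      120        20
accuracy, precision, recall = summarize(tn=950, fp=10, fn=120, tp=20)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With these made-up counts the accuracy looks respectable even though the filter catches only a small fraction of the spam, which is why the confusion matrix is worth reading alongside the accuracy score.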

## Can you explain the confusion matrix?

## Step 8: Improve prediction results

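We used too few features above (`numFeatures=20`), so many different words were hashed into the same feature, and we got bad accuracy. Increase the number of features for HashingTF in Step 2, then rerun Steps 2 through 7.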
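
The sketch below shows why a small `numFeatures` hurts: with far more distinct words than buckets, unrelated words are forced to share a feature. It uses CRC32 as a stand-in hash (Spark's `HashingTF` uses MurmurHash3) and a hypothetical vocabulary of 500 words, so the exact counts will not match what Spark produces.

```python
import zlib

def bucket(term, num_features):
    """Map a term to a feature index with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(term.encode("utf-8")) % num_features

def count_collisions(vocabulary, num_features):
    """Count terms that land in a bucket already taken by another term."""
    used = {bucket(term, num_features) for term in vocabulary}
    return len(vocabulary) - len(used)

# Hypothetical vocabulary of 500 distinct terms.
vocabulary = [f"word{i}" for i in range(500)]
for num_features in (20, 1000, 2 ** 16):
    print(num_features, count_collisions(vocabulary, num_features))
```

With 20 buckets, at least 480 of the 500 words must collide with another word. Raising the bucket count leaves room for most words to get their own feature.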

## Step 9:  Run your own test

Now it's your turn!   Make a new dataframe with some sample test data of your own creation.  Make some "spammy" SMSes and some ordinary ones.  See how our spam filter does.

In [None]:
# TODO: make a dataframe with some of your own data.
mydata = pd.DataFrame({'text' : ['hey, can we meet 1 hr later?', 
                                'WINNER!  Click here to claim your prize !!!!',
                                'CHEAP DEGREEES !!', 
                                'your text here']
                         })

mydata2 = spark.createDataFrame(mydata)
tokenizer = Tokenizer(inputCol="text", outputCol="words")
fv = tokenizer.transform(mydata2)
fv.show()

## NOTE : 'numFeatures' here must match the 'numFeatures' you used in step-2 (after any change made in step-8)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1000)
fv = hashingTF.transform(fv)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
fv = idfModel.transform(fv)
fv.show()

In [None]:
predictions = model.transform(fv)
predictions.select(['text', 'prediction']).show()

## FUN : How will you defeat this algorithm? :-) 

If you were a spammer, how would you get your messages past this filter?

<img src="../assets/images/come-tothe-dark-side-iin-we-have-cookies.png">

# BONUS: Word2Vec Instead of TF/IDF

We used the TF/IDF encoding. We might get better results if we use Word2Vec instead. Run with Word2Vec and see if you get a better accuracy rate.
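
As a rough picture of what a Word2Vec encoding gives you, the sketch below turns a message into a single vector by averaging the vectors of its words. The 2-dimensional embeddings are made up (a trained model would learn them), `average_vectors` is a hypothetical helper, and skipping unknown words is a choice of this sketch that may differ from how Spark's `Word2Vec` handles them.

```python
def average_vectors(tokens, embeddings):
    """Average the vectors of known tokens; return zeros if none are known."""
    dim = len(next(iter(embeddings.values())))
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

# Hypothetical embeddings for illustration only.
embeddings = {
    "free": [0.5, -0.5],
    "prize": [1.0, -1.0],
    "lunch": [-1.0, 0.5],
}
doc_vector = average_vectors("free prize".split(), embeddings)
has_negative = any(x < 0 for x in doc_vector)
print(doc_vector, has_negative)  # [0.75, -0.75] True
```

Note that the averaged vector can contain negative values. Multinomial Naive Bayes expects non-negative features, so you may need a different `modelType` or a different classifier for this experiment; check the documentation for your Spark version.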