<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview

We all hate spam, so a classifier that labels messages as spam or not spam ("ham") is useful. In this lab we train a Naive Bayes classifier on a collection of SMS messages.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

PySpark has a class called `NaiveBayes` (in `pyspark.ml.classification`) that can be used to train Naive Bayes classification models.

In [None]:
%matplotlib inline

import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer


## Step 1: Let's load the dataframe

We will load the dataframe into Spark. The outcome column, `isspam`, holds the string "ham" or "spam"; we will turn it into a numeric `label` column in Step 3.

In [None]:
t1 = time.perf_counter()

dataset = spark.read.format("csv").option('header','true').option('delimiter', '\t').\
  option('inferSchema', 'true').load("/data/spam/SMSSpamCollection.txt")

t2 = time.perf_counter() 

print("read {:,} records in {:,.2f} ms".format(dataset.count(), (t2-t1)*1000))

dataset.printSchema()
dataset.show()

In [None]:
## Count spam/ham
dataset.groupby("isspam").count().show()

## Step 2: Vectorize using tf/idf

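Let's start with TF-IDF vectorization. The term frequency (TF) counts how often each term occurs in a message, and the inverse document frequency (IDF) down-weights terms that occur in many messages across the dataset.

Giving every distinct word its own dimension would lead to very high-dimensional data. `HashingTF` avoids this by hashing each word into a fixed number of buckets (`numFeatures`). We use only 20 buckets below, so many different words end up sharing a bucket.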
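
To make the two stages concrete, here is a minimal pure-Python sketch of hashed term counts followed by IDF weighting. It is not how Spark implements them: the CRC32 hash is a stand-in chosen for illustration (Spark uses a different hash function), the function names are hypothetical, and the three messages are made up. The IDF formula is the smoothed form `log((N + 1) / (df + 1))`.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 20  # same bucket count as the HashingTF cell in this step


def bucket(term, num_features=NUM_FEATURES):
    # Stand-in hash for illustration only; Spark's HashingTF hashes differently.
    return zlib.crc32(term.encode("utf-8")) % num_features


def hashing_tf(words, num_features=NUM_FEATURES):
    """Count how many words of one message land in each bucket."""
    counts = Counter(bucket(w, num_features) for w in words)
    return [counts.get(i, 0) for i in range(num_features)]


def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    num_features = len(tf_vectors[0])
    weights = []
    for i in range(num_features):
        # Document frequency: number of messages with a nonzero count in bucket i.
        doc_freq = sum(1 for vec in tf_vectors if vec[i] > 0)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights


# Made-up messages, not rows from the SMS dataset.
messages = [
    "free prize call now",
    "are we still on for lunch",
    "call me when you are free",
]
tf = [hashing_tf(m.lower().split()) for m in messages]
idf = idf_weights(tf)
tfidf = [[count * weight for count, weight in zip(vec, idf)] for vec in tf]

for message, vec in zip(messages, tfidf):
    print(message, [round(v, 2) for v in vec])
```

A bucket that is hit by every message gets an IDF weight of 0, so it contributes nothing to the final vector. With only 20 buckets this can happen through collisions alone, which is one reason to try a larger `numFeatures`.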

In [None]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(dataset)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("isspam", "features").show()

## Step 3: Create a numeric label out of the string column "isspam"

In [None]:

indexer = StringIndexer(inputCol="isspam", outputCol="label")
indexed = indexer.fit(rescaledData).transform(rescaledData)
indexed.show()
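
By default, `StringIndexer` assigns indices by descending frequency, so the most common string gets 0.0. The pure-Python sketch below mimics that idea on made-up values. The function name is hypothetical, and the alphabetical tie-break is a choice made for this sketch.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically in this sketch.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Made-up column values in which "ham" is the majority class.
isspam = ["ham", "spam", "ham", "ham", "spam", "ham"]
mapping = fit_string_index(isspam)
labels = [mapping[v] for v in isspam]

print(mapping)  # {'ham': 0.0, 'spam': 1.0}
print(labels)
```

If "ham" is also the more frequent class in the real data (the count in Step 1 lets you check), then label 0.0 means ham and 1.0 means spam, which matters when reading the predictions later.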


## Step 4: Split into Training and Test

We will split our dataset into training (80%) and test (20%) sets, using a fixed seed so the split is reproducible.

In [None]:
# Split the data into train and test
(train, test) = indexed.randomSplit([0.8, 0.2], seed=1234)

print("training set count : ", train.count())
print("testing set count : ", test.count())

## Step 5: Fit Naive Bayes Model

In [None]:

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
t1 = time.perf_counter()
model = nb.fit(train)
t2 = time.perf_counter()

print("trained on {:,} records  in {:,.2f} ms".\
      format(train.count(), (t2-t1)*1000))
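
For intuition about what `smoothing=1.0` and `modelType="multinomial"` mean, here is a textbook-style multinomial Naive Bayes in pure Python. It is a sketch under simplifying assumptions, not Spark's implementation: the function names are hypothetical, the class prior is left unsmoothed, and the count vectors are made up.

```python
import math
from collections import defaultdict


def train_multinomial_nb(vectors, labels, smoothing=1.0):
    """Return log priors and per-class log feature probabilities."""
    num_features = len(vectors[0])
    class_docs = defaultdict(int)
    class_feature_totals = defaultdict(lambda: [0.0] * num_features)
    for vec, label in zip(vectors, labels):
        class_docs[label] += 1
        totals = class_feature_totals[label]
        for i, value in enumerate(vec):
            totals[i] += value

    log_prior, log_likelihood = {}, {}
    for label, n_docs in class_docs.items():
        log_prior[label] = math.log(n_docs / len(labels))
        totals = class_feature_totals[label]
        denom = sum(totals) + smoothing * num_features
        # Additive smoothing keeps unseen features from getting probability zero.
        log_likelihood[label] = [math.log((t + smoothing) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(vec, log_prior, log_likelihood):
    """Pick the class with the highest log posterior score."""
    scores = {
        label: log_prior[label]
        + sum(v * w for v, w in zip(vec, log_likelihood[label]))
        for label in log_prior
    }
    return max(scores, key=scores.get)


# Tiny made-up count vectors over 3 features; 1.0 = spam, 0.0 = ham (assumed).
X = [[3, 0, 1], [2, 0, 0], [0, 2, 1], [0, 3, 0]]
y = [1.0, 1.0, 0.0, 0.0]
prior, likelihood = train_multinomial_nb(X, y, smoothing=1.0)

print(predict([2, 0, 0], prior, likelihood))  # 1.0
print(predict([0, 1, 1], prior, likelihood))  # 0.0
```

Without smoothing, feature 1 would have probability zero under the spam class, and its logarithm would be undefined. Adding 1.0 to every count avoids that.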

## Step 6: Run Test Data

Let's call `.transform` on our model to make predictions on our test data. The output will be in the "prediction" column, while the correct label is in the "label" column.

We can then evaluate the model by comparing these two columns.

In [None]:

# make predictions on the test set and display some example rows
predictions = model.transform(test)
predictions.show()


## Step 7: Evaluate Model

Let's look at how our model performs. We will measure accuracy: the fraction of test rows where the prediction matches the label.

In [None]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Next, let's build a confusion matrix: each row is an actual label and each column is a predicted label.

In [None]:
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()
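
The same two measurements can be reproduced by hand, which may help when reading the Spark output. This is a pure-Python sketch on made-up labels and predictions; the helper names are hypothetical, and the encoding 0.0 = ham, 1.0 = spam is an assumption.

```python
from collections import Counter


def confusion_counts(labels, predictions):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(labels, predictions))


def accuracy(labels, predictions):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(
        1 for actual, predicted in zip(labels, predictions) if actual == predicted
    )
    return correct / len(labels)


# Made-up values; 0.0 = ham, 1.0 = spam (assumed encoding).
labels = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
predictions = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

counts = confusion_counts(labels, predictions)
classes = [0.0, 1.0]

# Rows are actual labels, columns are predicted labels.
print("label  " + "  ".join("pred={}".format(c) for c in classes))
for actual in classes:
    row = [counts[(actual, predicted)] for predicted in classes]
    print("{:<7}".format(actual) + "  ".join("{:>8}".format(n) for n in row))

print("accuracy =", accuracy(labels, predictions))
```

The diagonal cells are correct predictions and the off-diagonal cells are errors, so accuracy is the diagonal sum divided by the total number of rows.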

## Can you explain the confusion matrix?

## Step 8: Your own test

Now it's your turn! Make a new dataframe with some sample test data of your own creation. Write some "spammy" SMS messages and some ordinary ones, then see how our spam filter does.

In [None]:
# TODO: make a dataframe with some of your own data.

mydata = pd.DataFrame({'isspam' : ['spam', 'ham', ...],
              'text' : ['My text here', 'My Text Here 2', ...]
             })

