<link rel='stylesheet' href='../assets/css/main.css'/>

[<< back to main index](../README.md)

# Naive Bayes Spam Filtering

### Overview
Instructor to demo this on screen.
 
### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

PySpark has a class called `NaiveBayes` (in `pyspark.ml.classification`) that can be used to train simple text classifiers such as a spam filter.

In [21]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer


## Example: SMS Spam
Here is our SMS data. Each observation is a text message marked as either "ham" (a legitimate message) or "spam".

## Step 1: Let's load the dataframe

We will load the dataframe into Spark. The outcome column, `isspam`, holds the string "ham" or "spam"; we will turn it into a numeric column called `label` in Step 3.

In [19]:
dataset = spark.read.format("csv").option('header','true').option('delimiter', '\t').\
  option('inferSchema', 'true').load("../../datasets/spam/SMSSpamCollection.txt")
dataset.show()

+------+--------------------+
|isspam|                text|
+------+--------------------+
|   ham|Go until jurong p...|
|   ham|Ok lar... Joking ...|
|  spam|Free entry in 2 a...|
|   ham|U dun say so earl...|
|   ham|Nah I don't think...|
|  spam|FreeMsg Hey there...|
|   ham|Even my brother i...|
|   ham|As per your reque...|
|  spam|WINNER!! As a val...|
|  spam|Had your mobile 1...|
|   ham|I'm gonna be home...|
|  spam|SIX chances to wi...|
|  spam|URGENT! You have ...|
|   ham|I've been searchi...|
|   ham|I HAVE A DATE ON ...|
|  spam|XXXMobileMovieClu...|
|   ham|Oh k...i'm watchi...|
|   ham|Eh u remember how...|
|   ham|Fine if thats th...|
|  spam|England v Macedon...|
+------+--------------------+
only showing top 20 rows



## Step 2: Vectorize using tf/idf

Let's use TF-IDF for vectorization at first.

In [24]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
wordsData = tokenizer.transform(dataset)

hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)

idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)

rescaledData.select("isspam", "features").show()

+------+--------------------+
|isspam|            features|
+------+--------------------+
|   ham|(20,[0,2,5,6,7,10...|
|   ham|(20,[0,4,6,16,19]...|
|  spam|(20,[0,4,5,6,7,8,...|
|   ham|(20,[1,2,3,6,8,12...|
|   ham|(20,[0,3,4,6,8,9,...|
|  spam|(20,[0,2,5,6,7,8,...|
|   ham|(20,[1,3,6,7,8,10...|
|   ham|(20,[0,2,3,5,6,8,...|
|  spam|(20,[0,1,3,4,5,6,...|
|  spam|(20,[0,2,3,4,6,7,...|
|   ham|(20,[0,1,3,5,6,8,...|
|  spam|(20,[0,1,2,3,6,8,...|
|  spam|(20,[0,2,4,5,6,7,...|
|   ham|(20,[0,1,2,3,5,7,...|
|   ham|(20,[1,2,4,9,10,1...|
|  spam|(20,[1,3,5,6,7,8,...|
|   ham|(20,[2,6,9,15],[0...|
|   ham|(20,[0,5,6,7,8,9,...|
|   ham|(20,[0,1,6,9,10,1...|
|  spam|(20,[0,1,4,5,6,8,...|
+------+--------------------+
only showing top 20 rows
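
To make the `features` column less opaque, here is a minimal pure-Python sketch of what hashed term frequencies and IDF weighting compute. It is an illustration under assumptions, not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, the three messages are made up, and the helper names (`bucket`, `term_frequencies`, `inverse_document_frequencies`) are hypothetical. The IDF formula is the smoothed form `log((m + 1) / (df + 1))`, where `m` is the number of documents and `df` is the number of documents that hit a bucket.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 20  # same bucket count as the HashingTF call above


def bucket(word, num_features=NUM_FEATURES):
    """Map a word to a bucket index (CRC32 is a stand-in hash for this sketch)."""
    return zlib.crc32(word.encode("utf-8")) % num_features


def term_frequencies(words):
    """Count how many words of one message land in each bucket."""
    return Counter(bucket(w) for w in words)


def inverse_document_frequencies(docs_tf):
    """Smoothed IDF per bucket: log((m + 1) / (df + 1))."""
    m = len(docs_tf)
    df = Counter(b for tf in docs_tf for b in tf)
    return {b: math.log((m + 1) / (df[b] + 1)) for b in df}


# Three toy messages (made up for illustration, not taken from the dataset)
messages = ["free entry win prize", "ok see you later", "win a free prize now"]
docs_tf = [term_frequencies(msg.lower().split()) for msg in messages]
idf = inverse_document_frequencies(docs_tf)

# TF-IDF weight = count of words in the bucket * IDF of the bucket
tfidf = [{b: count * idf[b] for b, count in tf.items()} for tf in docs_tf]
```

With only 20 buckets, many different words share an index, so the model cannot tell them apart. Raising `numFeatures` reduces these collisions at the cost of longer vectors.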



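## Step 3: Index the label column

Naive Bayes needs a numeric label, so we use `StringIndexer` to map the `isspam` strings to numbers: 0.0 for ham and 1.0 for spam.

In [25]: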


indexer = StringIndexer(inputCol="isspam", outputCol="label")
indexed = indexer.fit(rescaledData).transform(rescaledData)
indexed.show()


+------+--------------------+--------------------+--------------------+--------------------+-----+
|isspam|                text|               words|         rawFeatures|            features|label|
+------+--------------------+--------------------+--------------------+--------------------+-----+
|   ham|Go until jurong p...|[go, until, juron...|(20,[0,2,5,6,7,10...|(20,[0,2,5,6,7,10...|  0.0|
|   ham|Ok lar... Joking ...|[ok, lar..., joki...|(20,[0,4,6,16,19]...|(20,[0,4,6,16,19]...|  0.0|
|  spam|Free entry in 2 a...|[free, entry, in,...|(20,[0,4,5,6,7,8,...|(20,[0,4,5,6,7,8,...|  1.0|
|   ham|U dun say so earl...|[u, dun, say, so,...|(20,[1,2,3,6,8,12...|(20,[1,2,3,6,8,12...|  0.0|
|   ham|Nah I don't think...|[nah, i, don't, t...|(20,[0,3,4,6,8,9,...|(20,[0,3,4,6,8,9,...|  0.0|
|  spam|FreeMsg Hey there...|[freemsg, hey, th...|(20,[0,2,5,6,7,8,...|(20,[0,2,5,6,7,8,...|  1.0|
|   ham|Even my brother i...|[even, my, brothe...|(20,[1,3,6,7,8,10...|(20,[1,3,6,7,8,10...|  0.0|
|   ham|As
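
The mapping that the indexer learns can be sketched in plain Python. By default `StringIndexer` orders values by descending frequency, so the most common string gets index 0.0. The helper name `fit_string_indexer`, the alphabetical tie-break, and the sample column are assumptions made for this illustration.

```python
from collections import Counter


def fit_string_indexer(values):
    """Return a {string: float index} map, most frequent string first.

    Ties are broken alphabetically here, which is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}


# Made-up label column with more ham than spam
isspam = ["ham", "ham", "spam", "ham", "spam", "ham"]
mapping = fit_string_indexer(isspam)
labels = [mapping[v] for v in isspam]

print(mapping)  # {'ham': 0.0, 'spam': 1.0}
print(labels)   # [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```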

## Step 4: Run Naive Bayes

Let's train a multinomial Naive Bayes model on 60% of the data and hold out the remaining 40% for testing.

In [29]:

# Split the data into train and test
splits = indexed.randomSplit([0.6, 0.4], 1234)
train = splits[0]
test = splits[1]

# create the trainer and set its parameters
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")

# train the model
model = nb.fit(train)
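
To show what the `smoothing` parameter does, here is a minimal pure-Python sketch of multinomial Naive Bayes with additive (Laplace) smoothing. It follows the textbook formulation and is not Spark's code; the function names, the way the prior is smoothed, and the toy rows are assumptions made for illustration. Each row is a label plus a sparse `{feature_index: value}` dictionary, similar in spirit to the sparse vectors in the `features` column.

```python
import math
from collections import defaultdict


def train_multinomial_nb(rows, num_features, smoothing=1.0):
    """Estimate log priors and per-feature log likelihoods for each label."""
    doc_counts = defaultdict(int)
    feature_totals = defaultdict(lambda: [0.0] * num_features)
    for label, features in rows:
        doc_counts[label] += 1
        for index, value in features.items():
            feature_totals[label][index] += value

    n_docs, n_labels = len(rows), len(doc_counts)
    log_prior, log_likelihood = {}, {}
    for label in doc_counts:
        totals = feature_totals[label]
        log_prior[label] = math.log(
            (doc_counts[label] + smoothing) / (n_docs + smoothing * n_labels)
        )
        # Smoothing adds a pseudo-count to every feature, so no probability is zero
        denom = sum(totals) + smoothing * num_features
        log_likelihood[label] = [math.log((t + smoothing) / denom) for t in totals]
    return log_prior, log_likelihood


def predict(model, features):
    """Return the label with the highest log posterior score."""
    log_prior, log_likelihood = model
    scores = {
        label: log_prior[label]
        + sum(v * log_likelihood[label][i] for i, v in features.items())
        for label in log_prior
    }
    return max(scores, key=scores.get)


# Made-up training rows: 1.0 = spam, 0.0 = ham, four feature buckets
rows = [
    (1.0, {0: 2.0, 1: 1.0}),
    (1.0, {0: 3.0}),
    (0.0, {2: 2.0, 3: 1.0}),
    (0.0, {2: 1.0, 1: 1.0}),
    (0.0, {3: 2.0}),
]
toy_model = train_multinomial_nb(rows, num_features=4)
print(predict(toy_model, {0: 2.0}))          # 1.0
print(predict(toy_model, {2: 1.0, 3: 1.0}))  # 0.0
```

Without smoothing, a bucket that never appears in spam training messages would get probability zero, and any test message touching that bucket could never be classified as spam.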


## Step 5: Run Test Data


In [30]:

# run the trained model on the test set and display some predictions
predictions = model.transform(test)
predictions.show()


+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|isspam|                text|               words|         rawFeatures|            features|label|       rawPrediction|         probability|prediction|
+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+----------+
|   ham| &lt;#&gt;  in mc...|[, &lt;#&gt;, , i...|(20,[3,5,6,7,12,1...|(20,[3,5,6,7,12,1...|  0.0|[-18.221704364333...|[0.88107418613854...|       0.0|
|   ham| &lt;#&gt;  mins ...|[, &lt;#&gt;, , m...|(20,[3,7,8,9,10,1...|(20,[3,7,8,9,10,1...|  0.0|[-23.095541279732...|[0.89433849088310...|       0.0|
|   ham| &lt;DECIMAL&gt; ...|[, &lt;decimal&gt...|(20,[1,3,4,7,8,9,...|(20,[1,3,4,7,8,9,...|  0.0|[-61.220479232963...|[0.94184803832905...|       0.0|
|   ham| says that he's q...|[, says, that, he...|(20,[0,2,3,4,5,6,...|(20,[0,2,3,4,5,6,

## Step 6: Evaluate Model

In [31]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction",
                                              metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Test set accuracy = " + str(accuracy))

Test set accuracy = 0.860548807917229
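
Accuracy alone can be misleading when one class is much more common than the other. The sketch below, written in plain Python with made-up labels, shows how a classifier that always answers "ham" can still score a high accuracy while catching no spam at all. The helper name `binary_metrics` and the sample data are assumptions made for illustration.

```python
def binary_metrics(labels, predictions, positive=1.0):
    """Compute accuracy, plus precision and recall for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up test labels: 8 ham (0.0) and 2 spam (1.0)
labels = [0.0] * 8 + [1.0] * 2
# A classifier that always predicts ham
always_ham = [0.0] * 10

metrics = binary_metrics(labels, always_ham)
print(metrics)  # {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0}
```

In PySpark, `MulticlassClassificationEvaluator` also accepts other `metricName` values such as "f1", "weightedPrecision", and "weightedRecall", which are worth checking alongside accuracy.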
