# Spam SMS Prediction

### Creating a Spark Session

In [1]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("spam_sms").getOrCreate()

Intitializing Scala interpreter ...

Spark Web UI available at http://Varun-CK:4040
SparkContext available as 'sc' (version = 2.3.0, master = local[*], app id = local-1577807752107)
SparkSession available as 'spark'


2019-12-31 21:26:07 WARN  SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.


import org.apache.spark.sql.SparkSession
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@743de7d6


### Initializing Logger

In [2]:
import org.apache.log4j._
Logger.getLogger("org").setLevel(Level.ERROR)

import org.apache.log4j._


### Setting Driver memory to 8GB

In [3]:
import org.apache.spark.SparkConf

import org.apache.spark.SparkConf


In [4]:
val conf = new SparkConf()

conf: org.apache.spark.SparkConf = org.apache.spark.SparkConf@6bd27f57


In [5]:
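// Note: this SparkConf object is created after the session has started, so the value
// is recorded on the object but is not applied to the already-running session.
conf.set("spark.driver.memory", "8g")
//conf.set("spark.executor.memory", "8g")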

res1: org.apache.spark.SparkConf = org.apache.spark.SparkConf@6bd27f57


### Verifying Configuration

In [6]:
conf.getAll

res2: Array[(String, String)] = Array((spark.repl.class.outputDir,C:\Users\Varun\AppData\Local\Temp\tmp_16fjgpz), (spark.driver.memory,8g), (spark.master,local[*]), (spark.submit.deployMode,client), (spark.ui.showConsoleProgress,true), (spark.app.name,pyspark-shell))


### Using Spark to read spam SMS data set

In [7]:
var data = spark.read.options(Map(("header","false"),("inferSchema","true"),("delimiter","\t"))).csv("SMS_Spam_Collection\\SMSSpamCollection")

data: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string]


### Count

In [8]:
data.count()

res3: Long = 5574


### Printing the first few rows of the dataframe

In [9]:
data.show(5)

+----+--------------------+
| _c0|                 _c1|
+----+--------------------+
| ham|Go until jurong p...|
| ham|Ok lar... Joking ...|
|spam|Free entry in 2 a...|
| ham|U dun say so earl...|
| ham|Nah I don't think...|
+----+--------------------+
only showing top 5 rows



### Renaming the columns

In [10]:
data = data.withColumnRenamed("_c0","class").withColumnRenamed("_c1","text")

data: org.apache.spark.sql.DataFrame = [class: string, text: string]


In [11]:
data.show(5)

+-----+--------------------+
|class|                text|
+-----+--------------------+
|  ham|Go until jurong p...|
|  ham|Ok lar... Joking ...|
| spam|Free entry in 2 a...|
|  ham|U dun say so earl...|
|  ham|Nah I don't think...|
+-----+--------------------+
only showing top 5 rows



### Counting ham and spam messages

In [12]:
data.groupBy("class").count().show()

+-----+-----+
|class|count|
+-----+-----+
|  ham| 4827|
| spam|  747|
+-----+-----+



## Cleaning and preparing the data

### Creating a new length feature

In [13]:
import org.apache.spark.sql.functions.length

import org.apache.spark.sql.functions.length


In [14]:
data = data.withColumn("length",length($"text"))

data: org.apache.spark.sql.DataFrame = [class: string, text: string ... 1 more field]


In [15]:
data.show(5)

+-----+--------------------+------+
|class|                text|length|
+-----+--------------------+------+
|  ham|Go until jurong p...|   111|
|  ham|Ok lar... Joking ...|    29|
| spam|Free entry in 2 a...|   155|
|  ham|U dun say so earl...|    49|
|  ham|Nah I don't think...|    61|
+-----+--------------------+------+
only showing top 5 rows



### Average message length for ham and spam

In [16]:
data.groupBy("class").mean().show()

+-----+-----------------+
|class|      avg(length)|
+-----+-----------------+
|  ham|71.45431945307645|
| spam|138.6706827309237|
+-----+-----------------+



The averages above show that spam messages are about twice as long as ham messages (roughly 139 versus 71 characters on average), which suggests that message length is a useful feature for classification.

## Feature Transformations

In [17]:
import org.apache.spark.ml.feature.{Tokenizer,StopWordsRemover,CountVectorizer,IDF,StringIndexer}

import org.apache.spark.ml.feature.{Tokenizer, StopWordsRemover, CountVectorizer, IDF, StringIndexer}


In [18]:
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("token_text")
val remover = new StopWordsRemover().setInputCol("token_text").setOutputCol("stop_tokens")
val count_vec = new CountVectorizer().setInputCol("stop_tokens").setOutputCol("c_vec")
val idf = new IDF().setInputCol("c_vec").setOutputCol("tf_idf")

tokenizer: org.apache.spark.ml.feature.Tokenizer = tok_885048ce3990
remover: org.apache.spark.ml.feature.StopWordsRemover = stopWords_ab36cab887c9
count_vec: org.apache.spark.ml.feature.CountVectorizer = cntVec_6f5a645b33a3
idf: org.apache.spark.ml.feature.IDF = idf_840de68286f9
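
The four stages above can be illustrated without Spark. The following is a minimal plain-Scala sketch, not Spark's implementation: the three sample messages, the tiny stop-word list, and all helper names (`tokenize`, `removeStopWords`, `idfWeights`, and so on) are assumptions made for illustration. It uses the smoothed formula `log((m + 1) / (df + 1))` for the inverse document frequency, where `m` is the number of documents and `df` the number of documents containing the term.

```scala
// Hypothetical toy corpus and stop-word list, for illustration only.
val sampleDocs = Seq(
  "Free entry to win a prize",
  "Ok see you in the evening",
  "Win a free prize now"
)
val sampleStopWords = Set("the", "a", "to", "in")

// Tokenizer step: lowercase the text, then split on whitespace.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\s+").toSeq.filter(_.nonEmpty)

// StopWordsRemover step: drop tokens that appear in the stop-word list.
def removeStopWords(tokens: Seq[String]): Seq[String] =
  tokens.filterNot(sampleStopWords.contains)

val cleaned: Seq[Seq[String]] = sampleDocs.map(d => removeStopWords(tokenize(d)))

// CountVectorizer step: vocabulary ordered by corpus frequency
// (ties broken alphabetically here, which is an assumption of this sketch).
val vocabulary: Seq[String] =
  cleaned.flatten.groupBy(identity).toSeq
    .sortBy { case (term, occurrences) => (-occurrences.size, term) }
    .map(_._1)

// Term counts for a single document.
def termCounts(tokens: Seq[String]): Map[String, Int] =
  tokens.groupBy(identity).map { case (term, occurrences) => term -> occurrences.size }

// IDF step: smoothed inverse document frequency per vocabulary term.
val m = cleaned.size
val idfWeights: Map[String, Double] = vocabulary.map { term =>
  val df = cleaned.count(_.contains(term))
  term -> math.log((m + 1.0) / (df + 1.0))
}.toMap

// TF-IDF weight of each term in a document: term count times IDF weight.
def tfIdf(tokens: Seq[String]): Map[String, Double] =
  termCounts(tokens).map { case (term, tf) => term -> tf * idfWeights(term) }

val weights: Seq[Map[String, Double]] = cleaned.map(tfIdf)
weights.foreach(println)
```

Terms that occur in many messages, such as "free" in this toy corpus, receive a smaller weight than terms that occur in only one message, such as "entry".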


### Converting the categorical class column (ham/spam) to a numeric label

In [19]:
val ham_spam_to_num = new StringIndexer().setInputCol("class").setOutputCol("label")

ham_spam_to_num: org.apache.spark.ml.feature.StringIndexer = strIdx_7fdfc0712563
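
By default, `StringIndexer` assigns indices in order of descending label frequency, which is why the more common class `ham` becomes `0.0` and `spam` becomes `1.0` in the outputs further down. A minimal plain-Scala sketch of that idea follows; `fitLabelIndex` and the sample labels are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```scala
// Build a label -> index map: the most frequent label gets 0.0, the next 1.0, and so on.
def fitLabelIndex(labels: Seq[String]): Map[String, Double] =
  labels.groupBy(identity).toSeq
    .sortBy { case (label, occurrences) => (-occurrences.size, label) }
    .zipWithIndex
    .map { case ((label, _), i) => label -> i.toDouble }
    .toMap

// Hypothetical sample in the same order as the first five rows of the data set.
val sampleLabels = Seq("ham", "ham", "spam", "ham", "ham")
val labelIndex = fitLabelIndex(sampleLabels)

// Apply the fitted mapping to encode each label as a number.
val encoded = sampleLabels.map(labelIndex)
println(encoded)
```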


In [20]:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors


In [21]:
val assembler = new VectorAssembler().setInputCols(Array("tf_idf","length")).setOutputCol("features")

assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_4b0b3b5f3361
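
`VectorAssembler` concatenates its input columns into one vector, so a TF-IDF vector of size 13423 plus the scalar `length` gives the 13424-dimensional `features` column seen in the outputs below. The sketch below mimics this with a small hypothetical `Sparse` case class; the TF-IDF values are made up for illustration.

```scala
// A sparse vector as (size, index -> value); hypothetical helper type, not Spark's.
final case class Sparse(size: Int, values: Map[Int, Double])

// Append one scalar feature in a new slot after the last index of the vector.
def appendScalar(v: Sparse, x: Double): Sparse =
  Sparse(v.size + 1, if (x != 0.0) v.values + (v.size -> x) else v.values)

// Made-up TF-IDF weights at two indices of a 13423-dimensional vector.
val tfIdfVec = Sparse(13423, Map(7 -> 1.2, 11 -> 0.4))

// The message length (111 for the first row of the data set) lands at index 13423.
val assembled = appendScalar(tfIdfVec, 111.0)
println(assembled.size)
```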


## Building the model

### Using Naive Bayes and Logistic Regression

In [22]:
import org.apache.spark.ml.classification.{NaiveBayes,LogisticRegression,DecisionTreeClassifier}

import org.apache.spark.ml.classification.{NaiveBayes, LogisticRegression, DecisionTreeClassifier}


In [23]:
val nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features")
val lor = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val dtc = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")

nb: org.apache.spark.ml.classification.NaiveBayes = nb_5985ae8a8e88
lor: org.apache.spark.ml.classification.LogisticRegression = logreg_a8c8c4cf222b
dtc: org.apache.spark.ml.classification.DecisionTreeClassifier = dtc_f4c9373a4ef4


### Building the Pipeline

In [24]:
import org.apache.spark.ml.Pipeline

import org.apache.spark.ml.Pipeline


In [25]:
val pipeline_nb = new Pipeline().setStages(Array(tokenizer, remover, count_vec, idf, ham_spam_to_num, assembler, nb))
val pipeline_lor = new Pipeline().setStages(Array(tokenizer, remover, count_vec, idf, ham_spam_to_num, assembler, lor))

pipeline_nb: org.apache.spark.ml.Pipeline = pipeline_1bc6a2b0f5bf
pipeline_lor: org.apache.spark.ml.Pipeline = pipeline_dd9bc9a26553


In [26]:
val nb_model = pipeline_nb.fit(data)
val lor_model = pipeline_lor.fit(data)

2019-12-31 21:26:55 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
2019-12-31 21:26:55 WARN  BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS


nb_model: org.apache.spark.ml.PipelineModel = pipeline_1bc6a2b0f5bf
lor_model: org.apache.spark.ml.PipelineModel = pipeline_dd9bc9a26553


In [30]:
var clean_data_nb = nb_model.transform(data)
var clean_data_lor = lor_model.transform(data)

clean_data_nb: org.apache.spark.sql.DataFrame = [class: string, text: string ... 10 more fields]
clean_data_lor: org.apache.spark.sql.DataFrame = [class: string, text: string ... 10 more fields]


In [31]:
clean_data_nb.show(3)

+-----+--------------------+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|class|                text|length|          token_text|         stop_tokens|               c_vec|              tf_idf|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|  ham|Go until jurong p...|   111|[go, until, juron...|[go, jurong, poin...|(13423,[7,11,31,6...|(13423,[7,11,31,6...|  0.0|(13424,[7,11,31,6...|[-1000.1575901830...|[1.0,9.5553840178...|       0.0|
|  ham|Ok lar... Joking ...|    29|[ok, lar..., joki...|[ok, lar..., joki...|(13423,[0,24,297,...|(13423,[0,24,297,...|  0.0|(13424,[0,24,297,...|[-299.78954560145...|[1.0,2.5831095772...|       0.0|


In [32]:
clean_data_lor.show(3)

+-----+--------------------+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|class|                text|length|          token_text|         stop_tokens|               c_vec|              tf_idf|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+------+--------------------+--------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|  ham|Go until jurong p...|   111|[go, until, juron...|[go, jurong, poin...|(13423,[7,11,31,6...|(13423,[7,11,31,6...|  0.0|(13424,[7,11,31,6...|[28.2282693346141...|[0.99999999999944...|       0.0|
|  ham|Ok lar... Joking ...|    29|[ok, lar..., joki...|[ok, lar..., joki...|(13423,[0,24,297,...|(13423,[0,24,297,...|  0.0|(13424,[0,24,297,...|[26.1076534066338...|[0.99999999999541...|       0.0|


#### Training and Prediction

In [33]:
clean_data_nb = clean_data_nb.select("label","features")
clean_data_lor = clean_data_lor.select("label","features")

clean_data_nb: org.apache.spark.sql.DataFrame = [label: double, features: vector]
clean_data_lor: org.apache.spark.sql.DataFrame = [label: double, features: vector]


In [34]:
clean_data_nb.show(3)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(13424,[7,11,31,6...|
|  0.0|(13424,[0,24,297,...|
|  1.0|(13424,[2,13,19,3...|
+-----+--------------------+
only showing top 3 rows



In [35]:
clean_data_lor.show(3)

+-----+--------------------+
|label|            features|
+-----+--------------------+
|  0.0|(13424,[7,11,31,6...|
|  0.0|(13424,[0,24,297,...|
|  1.0|(13424,[2,13,19,3...|
+-----+--------------------+
only showing top 3 rows



In [36]:
val Array(train_nb,test_nb) = clean_data_nb.randomSplit(Array(0.7,0.3), seed=12345)
val Array(train_lor,test_lor) = clean_data_lor.randomSplit(Array(0.7,0.3), seed=12345)

train_nb: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
test_nb: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
train_lor: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
test_lor: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [label: double, features: vector]
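
`randomSplit` assigns rows to the splits at random, so the 70/30 weights are only approximate (the confusion matrices further down add up to 1694 test rows, about 30.4% of the 5574 messages), while a fixed seed makes the split reproducible. The plain-Scala sketch below illustrates the idea with an independent seeded draw per row; `seededSplit` is a hypothetical helper and does not reproduce Spark's actual sampling.

```scala
import scala.util.Random

// Assign each row to train or test with an independent draw from a seeded generator.
def seededSplit[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  val tagged = rows.map(r => (r, rng.nextDouble() < trainFraction))
  (tagged.collect { case (r, true) => r }, tagged.collect { case (r, false) => r })
}

// Hypothetical stand-in for the rows of the data set.
val sampleRows: Seq[Int] = (1 to 1000).toSeq

// The same seed yields the same split on every run.
val (trainA, testA) = seededSplit(sampleRows, 0.7, seed = 12345L)
val (trainB, testB) = seededSplit(sampleRows, 0.7, seed = 12345L)

println(s"train = ${trainA.size}, test = ${testA.size}")
```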


In [37]:
val spam_predictor_nb = nb.fit(train_nb)
val spam_predictor_lor = lor.fit(train_lor)

spam_predictor_nb: org.apache.spark.ml.classification.NaiveBayesModel = NaiveBayesModel (uid=nb_5985ae8a8e88) with 2 classes
spam_predictor_lor: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_a8c8c4cf222b


In [38]:
val test_results_nb = spam_predictor_nb.transform(test_nb)
val test_results_lor = spam_predictor_lor.transform(test_lor)

test_results_nb: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 more fields]
test_results_lor: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 more fields]


In [39]:
test_results_nb.show(3)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,4,50,...|[-819.92914430065...|[1.0,9.2949203991...|       0.0|
|  0.0|(13424,[0,1,5,15,...|[-997.90523073055...|[1.0,1.6286640272...|       0.0|
|  0.0|(13424,[0,1,7,8,1...|[-874.03499580029...|[1.0,1.8162872740...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



In [40]:
test_results_lor.show(3)

+-----+--------------------+--------------------+--------------------+----------+
|label|            features|       rawPrediction|         probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
|  0.0|(13424,[0,1,4,50,...|[39.9998544815758...|[1.0,4.2489725140...|       0.0|
|  0.0|(13424,[0,1,5,15,...|[48.0074481259798...|[1.0,1.4145887133...|       0.0|
|  0.0|(13424,[0,1,7,8,1...|[29.8269701451849...|[0.99999999999988...|       0.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 3 rows



#### Evaluation

In [41]:
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator


In [42]:
val acc_eval = new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")

acc_eval: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_9ae58527c216


In [44]:
// MulticlassClassificationEvaluator reports the weighted F1 score by default, not accuracy
val acc_nb = acc_eval.evaluate(test_results_nb)
println(f"F1 score of Naive Bayes Model at predicting spam was: ${acc_nb}%1.2f")
println("*"*80)

val acc_lor = acc_eval.evaluate(test_results_lor)
println(f"F1 score of Logistic Regression Model at predicting spam was: ${acc_lor}%1.2f")
println("*"*80)

F1 score of Naive Bayes Model at predicting spam was: 0.92
********************************************************************************
F1 score of Logistic Regression Model at predicting spam was: 0.96
********************************************************************************


acc_nb: Double = 0.9246423704291925
acc_lor: Double = 0.9636361033474314


### Getting Confusion Matrix

In [45]:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.evaluation.MulticlassMetrics


In [46]:
val nb_predictionAndLabel = test_results_nb.select("prediction","label").as[(Double,Double)].rdd
val lor_predictionAndLabel = test_results_lor.select("prediction","label").as[(Double,Double)].rdd

nb_predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[295] at rdd at <console>:51
lor_predictionAndLabel: org.apache.spark.rdd.RDD[(Double, Double)] = MapPartitionsRDD[299] at rdd at <console>:52


In [47]:
val nb_metrics = new MulticlassMetrics(nb_predictionAndLabel)
val lor_metrics = new MulticlassMetrics(lor_predictionAndLabel)

nb_metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@11c75c67
lor_metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@673345ba


#### Confusion Matrix

In [48]:
println("Naive Bayes Model - Confusion Matrix:")
println(nb_metrics.confusionMatrix)

Naive Bayes Model - Confusion Matrix:
1346.0  131.0  
9.0     208.0  


In [49]:
println("Logistic Regression Model - Confusion Matrix:")
println(lor_metrics.confusionMatrix)

Logistic Regression Model - Confusion Matrix:
1475.0  2.0    
56.0    161.0  


#### Accuracy

In [50]:
println("Naive Bayes Model - Accuracy:")
println(nb_metrics.accuracy)

Naive Bayes Model - Accuracy:
0.9173553719008265


In [51]:
println("Logistic Regression Model - Accuracy:")
println(lor_metrics.accuracy)

Logistic Regression Model - Accuracy:
0.9657615112160567
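
The accuracy values here (0.917 and 0.966) differ slightly from the values returned by `MulticlassClassificationEvaluator` earlier (0.925 and 0.964) because that evaluator computes a weighted F1 score unless another metric is set. The plain-Scala sketch below recomputes both quantities from the confusion matrices printed above, assuming rows are actual classes and columns are predicted classes; `weightedF1` and `overallAccuracy` are hypothetical helper names.

```scala
// Weighted F1: per-class F1 = 2*TP / (actual + predicted), weighted by class frequency.
def weightedF1(cm: Array[Array[Double]]): Double = {
  val n = cm.length
  val total = cm.map(_.sum).sum
  (0 until n).map { k =>
    val tp = cm(k)(k)
    val actual = cm(k).sum                              // row sum: true instances of class k
    val predicted = (0 until n).map(i => cm(i)(k)).sum  // column sum: predictions of class k
    val f1 = if (actual + predicted == 0.0) 0.0 else 2 * tp / (actual + predicted)
    f1 * actual / total
  }.sum
}

// Accuracy: correctly classified rows (the diagonal) divided by all rows.
def overallAccuracy(cm: Array[Array[Double]]): Double =
  cm.indices.map(k => cm(k)(k)).sum / cm.map(_.sum).sum

// Confusion matrices copied from the outputs above.
val nbMatrix = Array(Array(1346.0, 131.0), Array(9.0, 208.0))
val lrMatrix = Array(Array(1475.0, 2.0), Array(56.0, 161.0))

println(f"Naive Bayes:         F1 = ${weightedF1(nbMatrix)}%.4f, accuracy = ${overallAccuracy(nbMatrix)}%.4f")
println(f"Logistic Regression: F1 = ${weightedF1(lrMatrix)}%.4f, accuracy = ${overallAccuracy(lrMatrix)}%.4f")
```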


#### Precision

In [52]:
println("Naive Bayes Model - Precision:")
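// Note: the no-argument precision is deprecated since Spark 2.0 and returns the overall accuracy;
// nb_metrics.precision(label) gives the precision for a single class.
println(nb_metrics.precision)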

Naive Bayes Model - Precision:
0.9173553719008265


In [53]:
println("Logistic Regression Model - Precision:")
println(lor_metrics.precision)

Logistic Regression Model - Precision:
0.9657615112160567


#### Recall

In [54]:
println("Naive Bayes Model - Recall:")
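// Note: the no-argument recall is deprecated since Spark 2.0 and returns the overall accuracy;
// nb_metrics.recall(label) gives the recall for a single class.
println(nb_metrics.recall)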

Naive Bayes Model - Recall:
0.9173553719008265


In [55]:
println("Logistic Regression Model - Recall:")
println(lor_metrics.recall)

Logistic Regression Model - Recall:
0.9657615112160567
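
The precision and recall values above are identical to the accuracy because they are computed over all classes at once. Per-class values are more informative for the spam class (label `1.0`). The sketch below derives them from the confusion matrices printed earlier, assuming rows are actual classes and columns are predicted classes; `precisionOf` and `recallOf` are hypothetical helper names.

```scala
// Precision for class k: correct predictions of k divided by all predictions of k (column sum).
def precisionOf(cm: Array[Array[Double]], k: Int): Double =
  cm(k)(k) / cm.indices.map(i => cm(i)(k)).sum

// Recall for class k: correct predictions of k divided by all true instances of k (row sum).
def recallOf(cm: Array[Array[Double]], k: Int): Double =
  cm(k)(k) / cm(k).sum

// Confusion matrices copied from the outputs earlier in this notebook.
val nbCm = Array(Array(1346.0, 131.0), Array(9.0, 208.0))
val lrCm = Array(Array(1475.0, 2.0), Array(56.0, 161.0))

val spam = 1 // index of the spam class
println(f"Naive Bayes         spam precision = ${precisionOf(nbCm, spam)}%.3f, recall = ${recallOf(nbCm, spam)}%.3f")
println(f"Logistic Regression spam precision = ${precisionOf(lrCm, spam)}%.3f, recall = ${recallOf(lrCm, spam)}%.3f")
```

On these matrices, Naive Bayes catches more of the spam (208 of 217 messages) but flags 131 ham messages as spam, while Logistic Regression flags only 2 ham messages but misses 56 spam messages.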


### Closing Spark Session

In [56]:
spark.stop()

## Thank You!