# Spam Mail Prediction using Random Forest

* Random forests (or random decision forests) are an ensemble learning method for classification, regression, and other tasks. They build many decision trees at training time and output the mode of the trees' predicted classes (classification) or the mean of their predictions (regression).
* Random decision forests correct for decision trees' habit of overfitting to their training set.
* You can read more about it [here](https://towardsdatascience.com/the-random-forest-algorithm-d457d499ffcd)
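
The aggregation step described above can be sketched in plain Python. This is a minimal illustration of majority voting and averaging, not Spark's implementation; the helper names `forest_classify` and `forest_regress` and the per-tree outputs are hypothetical.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Return the most common class among the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_outputs):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_outputs)

# Hypothetical outputs of five trees for a single email
votes = ["spam", "nospam", "spam", "spam", "nospam"]
outputs = [0.9, 0.4, 0.8, 0.7, 0.2]

print(forest_classify(votes))   # spam
print(forest_regress(outputs))
```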

Importing required components

In [32]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)
from pyspark.ml.feature import IndexToString, StringIndexer, VectorAssembler, VectorIndexer
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

Reading the dataset from Azure Blob Storage. Replace the path with the location where your data is stored.

In [1]:
spam_data = spark.read.csv('wasb:///spambase.csv', header=True, inferSchema=True)

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
0,application_1522641190072_0004,pyspark3,idle,Link,Link,✔


SparkSession available as 'spark'.


Verifying that the data was imported correctly

In [2]:
spam_data.limit(3).toPandas()

   word_freq_make  word_freq_address  word_freq_all  word_freq_3d  \
0            0.00               0.64           0.64             0   
1            0.21               0.28           0.50             0   
2            0.06               0.00           0.71             0   

   word_freq_our  word_freq_over  word_freq_remove  word_freq_internet  \
0           0.32            0.00              0.00                0.00   
1           0.14            0.28              0.21                0.07   
2           1.23            0.19              0.19                0.12   

   word_freq_order  word_freq_mail     ...       char_freq_;  char_freq_(  \
0             0.00            0.00     ...              0.00        0.000   
1             0.00            0.94     ...              0.00        0.132   
2             0.64            0.25     ...              0.01        0.143   

   char_freq_[  char_freq_!  char_freq_$  char_freq_#  \
0          0.0        0.778        0.000        0.000   
1  

In [30]:
spam_data.describe().toPandas()

  summary       word_freq_make    word_freq_address       word_freq_all  \
0   count                 4601                 4601                4601   
1    mean  0.10455335796565962  0.21301456205172783  0.2806563790480323   
2  stddev   0.3053575620234701   1.2905751909453216  0.5041428838471845   
3     min                  0.0                  0.0                 0.0   
4     max                 4.54                14.28                 5.1   

          word_freq_3d       word_freq_our       word_freq_over  \
0                 4601                4601                 4601   
1  0.06476852858074332  0.3122234296891974  0.09590089111062813   
2    1.392893374183434  0.6725127692846672  0.27382408300980804   
3                    0                 0.0                  0.0   
4                   43                10.0                 5.88   

      word_freq_remove   word_freq_internet     word_freq_order  \
0                 4601                 4601                4601   
1  0.1142077

In [4]:
spam_data.printSchema()

root
 |-- word_freq_make: double (nullable = true)
 |-- word_freq_address: double (nullable = true)
 |-- word_freq_all: double (nullable = true)
 |-- word_freq_3d: integer (nullable = true)
 |-- word_freq_our: double (nullable = true)
 |-- word_freq_over: double (nullable = true)
 |-- word_freq_remove: double (nullable = true)
 |-- word_freq_internet: double (nullable = true)
 |-- word_freq_order: double (nullable = true)
 |-- word_freq_mail: double (nullable = true)
 |-- word_freq_receive: double (nullable = true)
 |-- word_freq_will: double (nullable = true)
 |-- word_freq_people: double (nullable = true)
 |-- word_freq_report: double (nullable = true)
 |-- word_freq_addresses: double (nullable = true)
 |-- word_freq_free: double (nullable = true)
 |-- word_freq_business: double (nullable = true)
 |-- word_freq_email: double (nullable = true)
 |-- word_freq_you: double (nullable = true)
 |-- word_freq_credit: double (nullable = true)
 |-- word_freq_your: double (nullable = true)
 |--

Split the data into training and testing sets in a 70/30 ratio.

The training set is data on which our model will be trained, while the testing set is holdout data on which we test the performance of our trained model.

In [5]:
train, test = spam_data.randomSplit([0.7, 0.3])

In [8]:
print("We have %d training examples and %d test examples." % (train.count(), test.count()))

We have 3229 training examples and 1372 test examples.
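
The counts above are close to, but not exactly, 70/30 (3229 of 4601 rows is about 70.2%). That is consistent with each row being assigned to a split by an independent random draw rather than by exact counting. A plain-Python sketch of that idea follows; `bernoulli_split` is a hypothetical helper, not a Spark API, and the seed value is arbitrary. In PySpark itself, `randomSplit` accepts an optional `seed` argument when a reproducible split is needed.

```python
import random

def bernoulli_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train_rows, test_rows = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train_rows.append(row)
        else:
            test_rows.append(row)
    return train_rows, test_rows

# Row ids standing in for the 4601 rows of the dataset
rows = list(range(4601))
train_rows, test_rows = bernoulli_split(rows)

# Sizes are approximately, not exactly, 70/30
print(len(train_rows), len(test_rows))
```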

Encoding the label to a column of label indices.

In [22]:
labelIndexer = StringIndexer(inputCol="spam_nospam", outputCol="indexedLabel").fit(spam_data)
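
By default, `StringIndexer` assigns indices in order of descending label frequency, so the most common label receives index 0.0. The plain-Python sketch below mimics that default on made-up labels; `index_labels` is a hypothetical helper and the label counts are illustrative, not taken from the dataset.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to an index, most frequent label first."""
    ordered = [label for label, _ in Counter(labels).most_common()]
    return {label: float(i) for i, label in enumerate(ordered)}

# Illustrative labels: "0" appears three times, "1" twice
labels = ["0", "1", "0", "0", "1"]
mapping = index_labels(labels)
indexed = [mapping[label] for label in labels]

print(mapping)  # {'0': 0.0, '1': 1.0}
print(indexed)  # [0.0, 1.0, 0.0, 0.0, 1.0]
```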

Creating a single vector column for all predictors

In [10]:
Cols = spam_data.columns
Cols.remove('spam_nospam')
vectorAssembler = VectorAssembler(inputCols=Cols, outputCol="Features")
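
Conceptually, the assembler concatenates the listed predictor columns of each row into one vector, leaving the label out. A plain-Python sketch of that step is below; `assemble` is a hypothetical helper, only three of the predictor columns are shown, and the label value is made up for illustration.

```python
def assemble(row, input_cols):
    """Concatenate the chosen columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]

# One row with three predictors and a (hypothetical) label value
row = {
    "word_freq_make": 0.21,
    "word_freq_address": 0.28,
    "word_freq_all": 0.50,
    "spam_nospam": 1,
}

# Every column except the label is a predictor
cols = [col for col in row if col != "spam_nospam"]
features = assemble(row, cols)

print(features)  # [0.21, 0.28, 0.5]
```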

Specifying Random Forest model

In [19]:
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="Features", numTrees=50)

Creating evaluator to check model performance


In [41]:
evaluator = BinaryClassificationEvaluator(
    labelCol=rf.getLabelCol(), rawPredictionCol=rf.getPredictionCol())
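
The evaluator above reads `rf.getPredictionCol()`, that is, the hard 0/1 predictions, as its raw-prediction input. A ranking metric such as the area under the ROC curve is usually computed from continuous scores (in Spark ML, the column named by `rf.getRawPredictionCol()`), and thresholding the scores first typically gives a different value. The plain-Python sketch below shows the effect on made-up numbers; `pairwise_auc` is a hypothetical helper, not Spark's implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and classifier scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Hard predictions obtained by thresholding the scores at 0.5
hard = [1 if s >= 0.5 else 0 for s in scores]

auc_scores = pairwise_auc(labels, scores)
auc_hard = pairwise_auc(labels, hard)
print(round(auc_scores, 3))  # 0.889
print(round(auc_hard, 3))    # 0.667
```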

Specifying the hyperparameter values to search over

In [66]:
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [2, 5, 10, 20, 50, 100]) \
    .addGrid(rf.maxDepth, [10, 20, 30]) \
    .build()
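
The grid is the Cartesian product of the listed values, so six `numTrees` values and three `maxDepth` values give 18 candidate settings. A plain-Python sketch of that enumeration (the dictionary layout is an illustration, not the structure Spark returns):

```python
from itertools import product

num_trees_values = [2, 5, 10, 20, 50, 100]
max_depth_values = [10, 20, 30]

# One dictionary per combination of the two hyperparameters
grid = [{"numTrees": n, "maxDepth": d}
        for n, d in product(num_trees_values, max_depth_values)]

print(len(grid))  # 18
print(grid[0])    # {'numTrees': 2, 'maxDepth': 10}
```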

Creating cross validator to loop through all hyperparameter values and choose the best model

In [67]:
cv = CrossValidator(estimator=rf, evaluator=evaluator, estimatorParamMaps=paramGrid, numFolds=5)
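
With 5 folds, every candidate setting is trained five times, each time validating on a different fifth of the training data, and the best setting is then refit on all of it. The plain-Python sketch below illustrates the fold bookkeeping and the resulting number of model fits for the 6 x 3 grid used here. `k_fold_indices` is a hypothetical helper that assigns rows to folds by stride for simplicity, whereas Spark assigns folds randomly.

```python
def k_fold_indices(n_rows, k):
    """Return k (train_indices, validation_indices) pairs."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits

splits = k_fold_indices(n_rows=10, k=5)

# 6 numTrees values x 3 maxDepth values
n_param_maps = 6 * 3

# One fit per (setting, fold) pair, plus one final refit of the best setting
total_fits = n_param_maps * len(splits) + 1
print(total_fits)  # 91
```

The fit count is worth keeping in mind when widening the grid, since training time grows with the product of grid size and fold count.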

Combining all required transformers and estimators into a pipeline

In [68]:
pipeline = Pipeline(stages=[labelIndexer, vectorAssembler, cv])
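
A pipeline applies its stages in order, each stage consuming the output of the previous one. The toy sketch below shows that chaining in plain Python for the first two stages. All function names, the two-column feature list, and the label value are hypothetical, and `index_label` simply casts the label rather than reproducing `StringIndexer`.

```python
def index_label(row):
    """Toy stand-in for the label indexer: add an indexedLabel field."""
    out = dict(row)
    out["indexedLabel"] = float(row["spam_nospam"])
    return out

def assemble_features(row):
    """Toy stand-in for the assembler: add a Features field."""
    out = dict(row)
    out["Features"] = [row["word_freq_make"], row["word_freq_address"]]
    return out

def run_stages(row, stages):
    """Apply each stage in order, feeding its output to the next stage."""
    for stage in stages:
        row = stage(row)
    return row

row = {"word_freq_make": 0.0, "word_freq_address": 0.64, "spam_nospam": 1}
result = run_stages(row, [index_label, assemble_features])
print(result["indexedLabel"], result["Features"])  # 1.0 [0.0, 0.64]
```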

Training on the training data

In [69]:
model = pipeline.fit(train)

Predicting outcomes for test data

In [70]:
predictions = model.transform(test)

Verifying that the predictions were generated as required

In [71]:
predictions.select("prediction", "spam_nospam", "Features").show(3)

+----------+-----------+--------------------+
|prediction|spam_nospam|            Features|
+----------+-----------+--------------------+
|       0.0|          0|(57,[54,55,56],[1...|
|       0.0|          0|(57,[54,55,56],[1...|
|       0.0|          0|(57,[54,55,56],[1...|
+----------+-----------+--------------------+
only showing top 3 rows
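
The `Features` column is printed in sparse form: `(57,[54,55,56],[...])` reads as a vector of size 57 whose only non-zero entries sit at indices 54, 55, and 56. A plain-Python sketch of that encoding, using a hypothetical `to_sparse` helper and made-up values:

```python
def to_sparse(dense):
    """Return (size, indices of non-zero entries, their values)."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

# A 57-element vector that is zero everywhere except the last three slots
dense = [0.0] * 57
dense[54], dense[55], dense[56] = 1.5, 4.0, 12.0

sparse = to_sparse(dense)
print(sparse)  # (57, [54, 55, 56], [1.5, 4.0, 12.0])
```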

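AUC (area under the ROC curve) for the model on the test data. Note that the evaluator was configured with the `prediction` column as its raw-prediction input, so this value is computed from hard 0/1 predictions rather than class scores. You can read more about the metric [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)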

In [72]:
evaluator.evaluate(predictions)

0.9551047372201578

Area under the precision-recall curve (`areaUnderPR`) for the model on the test data. You can read more about the metric [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html)

In [73]:
evaluator.evaluate(predictions, {evaluator.metricName: 'areaUnderPR'})

0.9551047372201578
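
The value printed above matches the area under ROC reported earlier digit for digit, which would be an unusual coincidence for two different metrics. It may be worth re-checking with a separate evaluator constructed with `metricName='areaUnderPR'` (a suggestion, not something verified here). For reference, the precision-recall curve is traced by computing precision and recall at a range of score thresholds. The plain-Python sketch below does this for made-up labels and scores; `precision_recall` is a hypothetical helper.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels and classifier scores
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

# Lowering the threshold raises recall and, here, lowers precision
curve = [precision_recall(labels, scores, t) for t in (0.85, 0.5, 0.2)]
for point in curve:
    print(point)
```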