<font size=5>

Classification with PySpark ML and MLlib

</font>

<font size=5> This dataset has no header row, so we will assign proper column names ourselves. The data is downloadable online. You need to specify its path: prefix it with file:// if the file is on the local OS file system; otherwise the dataset is assumed to be in an HDFS directory.
    
    George Jen  -- Jen Tek LLC
    
</font>

In [134]:
import sys,os,os.path
os.environ['SPARK_HOME']='/opt/spark'
%matplotlib inline
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
#Read Data
df = spark.read.csv("file:///home/hadoop/SMS/SMSSpamCollection", sep = "\t", inferSchema=True, header = False)

<font size=5> Show 3 rows to get an idea of the dataset. _c0 looks like the label and _c1 looks like the feature (the message text). </font> 

In [135]:
df.show(3, truncate = False)

+----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|_c0 |_c1                                                                                                                                                        |
+----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                            |
|ham |Ok lar... Joking wif u oni...                                                                                                                              |
|spam|Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's|
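+----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+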

<font size=5> Note: "spam" marks a junk message and "ham" marks a legitimate one. Rename column _c0 to status and _c1 to message. </font>

In [136]:
df = df.withColumnRenamed('_c0', 'status').withColumnRenamed('_c1', 'message')
df.show(3, truncate = False)

+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|status|message                                                                                                                                                    |
+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|ham   |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                            |
|ham   |Ok lar... Joking wif u oni...                                                                                                                              |
|spam  |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's|
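+------+-----------------------------------------------------------------------------------------------------------------------------------------------------------+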

<font size=5> 

Encode the status column as a number: ham becomes 1.0 and spam becomes 0.0. All fields fed to the learner need to be numeric. The query also renames the status column to label.
    
</font>

In [137]:
df.createOrReplaceTempView('temp')
df = spark.sql('select case status when "ham" then 1.0  else 0 end as label, message from temp')
df.show(3, truncate = False)

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|message                                                                                                                                                    |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+
|1.0  |Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...                                            |
|1.0  |Ok lar... Joking wif u oni...                                                                                                                              |
|0.0  |Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's|
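+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------+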

<font size=5> 1.0 means ham (a legitimate message), 0.0 means spam (junk) </font>

<font size=5>
Tokenize the messages
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). Let’s tokenize the messages and create a list of words for each message.
</font>

In [139]:
from pyspark.ml.feature import  Tokenizer
tokenizer = Tokenizer(inputCol="message", outputCol="words")
wordsData = tokenizer.transform(df)
wordsData.show(3)

+-----+--------------------+--------------------+
|label|             message|               words|
+-----+--------------------+--------------------+
|  1.0|Go until jurong p...|[go, until, juron...|
|  1.0|Ok lar... Joking ...|[ok, lar..., joki...|
|  0.0|Free entry in 2 a...|[free, entry, in,...|
+-----+--------------------+--------------------+
only showing top 3 rows
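
<font size=5>

To make the tokenization step concrete, here is a minimal plain-Python sketch of the behaviour seen in the output above: lowercase the message, then split it on whitespace. The function name simple_tokenize is hypothetical, and this only approximates what Tokenizer does; it is not Spark's implementation. Note that punctuation stays attached to the words, as in "lar..." above.

</font>

```python
def simple_tokenize(message):
    """Lowercase a message and split it on runs of whitespace."""
    return message.lower().split()


tokens = simple_tokenize("Ok lar... Joking wif u oni...")
print(tokens)
```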



<font size=5> CountVectorizer converts a collection of text documents to vectors of token counts. 
    
See:
https://spark.apache.org/docs/latest/ml-features#countvectorizer


</font>


In [140]:
from pyspark.ml.feature import CountVectorizer
count = CountVectorizer (inputCol="words", outputCol="rawFeatures")
model = count.fit(wordsData)
featurizedData = model.transform(wordsData)
featurizedData.show(3)

+-----+--------------------+--------------------+--------------------+
|label|             message|               words|         rawFeatures|
+-----+--------------------+--------------------+--------------------+
|  1.0|Go until jurong p...|[go, until, juron...|(13587,[8,42,52,6...|
|  1.0|Ok lar... Joking ...|[ok, lar..., joki...|(13587,[5,75,411,...|
|  0.0|Free entry in 2 a...|[free, entry, in,...|(13587,[0,3,8,20,...|
+-----+--------------------+--------------------+--------------------+
only showing top 3 rows
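
<font size=5>

The rawFeatures column above is a sparse vector written as (vocabulary size, indices of the terms present, their counts), so 13587 is the size of the learned vocabulary. The plain-Python sketch below illustrates the idea on two tiny documents. The function names are hypothetical, and the alphabetical tie-breaking is an assumption made only to keep the example deterministic; it is not a description of Spark's internals.

</font>

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each term to an index, most frequent term first (ties broken alphabetically)."""
    totals = Counter(term for doc in docs for term in doc)
    ordered = sorted(totals, key=lambda term: (-totals[term], term))
    return {term: index for index, term in enumerate(ordered)}


def to_sparse_counts(doc, vocabulary):
    """Return (vocabulary size, sorted indices, counts) for one tokenized document."""
    counts = Counter(vocabulary[term] for term in doc if term in vocabulary)
    indices = sorted(counts)
    return len(vocabulary), indices, [float(counts[i]) for i in indices]


docs = [["free", "entry", "free"], ["ok", "entry"]]
vocabulary = fit_vocabulary(docs)
print(to_sparse_counts(docs[0], vocabulary))
```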



<font size=5>
Apply term frequency-inverse document frequency (TF-IDF)

IDF down-weights features (terms) that appear often across the corpus. When text is used as a feature, this usually improves performance, because the most common words, which tend to be the least informative, receive smaller weights.

</font>

In [141]:
from pyspark.ml.feature import  IDF
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)
rescaledData.select("label", "features").show(3)  #Only needed to train


+-----+--------------------+
|label|            features|
+-----+--------------------+
|  1.0|(13587,[8,42,52,6...|
|  1.0|(13587,[5,75,411,...|
|  0.0|(13587,[0,3,8,20,...|
+-----+--------------------+
only showing top 3 rows
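
<font size=5>

As a rough illustration of the weighting, the sketch below computes IDF as log((N + 1) / (df + 1)), where N is the number of documents and df is the number of documents containing the term, and then multiplies raw counts by that weight. This is the smoothed formula given in the Spark feature documentation as far as I know; the helper names and the three-document corpus are made up for the example.

</font>

```python
import math


def idf_weights(doc_term_counts):
    """Compute log((N + 1) / (df + 1)) per term; df counts the documents containing it."""
    n_docs = len(doc_term_counts)
    doc_freq = {}
    for counts in doc_term_counts:
        for term in counts:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}


def tf_idf(counts, weights):
    """Scale raw term counts by their IDF weight."""
    return {term: count * weights[term] for term, count in counts.items()}


# Hypothetical corpus: "u" occurs in every document, so its weight drops to zero.
corpus = [{"free": 2, "u": 1}, {"ok": 1, "u": 1}, {"u": 3}]
weights = idf_weights(corpus)
print(tf_idf(corpus[0], weights))
```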



<font size=5>
Randomly split the DataFrame into 80% training (trainDF) and 20% testing (testDF)
    
</font>


In [142]:
seed = 0  # random seed 0
trainDF, testDF = rescaledData.randomSplit([0.8,0.2],seed)

<font size=5> Row counts of the training and testing DataFrames </font>

In [143]:
trainDF.count()

4450

In [144]:
testDF.count()

1124
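
<font size=5>

The counts above (4450 and 1124 out of 5574 rows) are close to, but not exactly, an 80/20 split. That is what one would expect if each row is assigned by an independent random draw rather than by cutting the data at a fixed position. The plain-Python sketch below illustrates that idea; random_split is a hypothetical helper and not Spark's randomSplit.

</font>

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(10000))
train, test = random_split(rows, 0.8, seed=0)
# The sizes land near 8000 and 2000, but usually not exactly on them.
print(len(train), len(test))
```

<font size=5>

Reusing the same seed reproduces the same split, which is why the notebook fixes seed = 0.

</font>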

<font size=5>
Try different classifiers. 

Logistic regression classifier

Logistic regression is a common method for predicting a categorical response. It is a special case of generalized linear models that predicts the probability of an outcome. In spark.ml, logistic regression can predict a binary outcome with binomial logistic regression, or a multiclass outcome with multinomial logistic regression. Use the family parameter to choose between these two algorithms, or leave it unset and Spark will infer the correct variant.

</font>

In [147]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
import numpy as np
lr = LogisticRegression(maxIter = 100)

model_lr = lr.fit(trainDF)


In [148]:
prediction_lr = model_lr.transform(testDF)
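
<font size=5>

For intuition, binomial logistic regression scores a message as a weighted sum of its features, squashes that score into a probability with the sigmoid function, and thresholds the probability to get a 0/1 prediction. The sketch below shows this in plain Python. The weights, intercept, and feature names are hypothetical values chosen for illustration; they are not taken from model_lr.

</font>

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(features, weights, intercept, threshold=0.5):
    """Return (probability of class 1.0, hard 0/1 prediction) for one sparse feature dict."""
    score = intercept + sum(weights.get(term, 0.0) * value for term, value in features.items())
    probability = sigmoid(score)
    return probability, 1.0 if probability > threshold else 0.0


# Hypothetical weights: negative values push toward spam (0.0), positive toward ham (1.0).
weights = {"free": -2.0, "win": -1.5, "ok": 0.8}
print(predict({"free": 1.0, "win": 1.0}, weights, intercept=1.0))
print(predict({"ok": 1.0}, weights, intercept=1.0))
```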

In [149]:
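from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Note: rawPredictionCol is set to the hard 0/1 'prediction' column here, so the
# resulting AUC reflects a single decision threshold. Passing 'rawPrediction'
# instead would score the model across all thresholds.
my_eval_lr = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label', metricName='areaUnderROC')
my_eval_lr.evaluate(prediction_lr)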

0.9103448275862068

In [150]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
my_mc_lr = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
my_mc_lr.evaluate(prediction_lr)

0.9758808361858744

In [151]:
my_mc_lr = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
my_mc_lr.evaluate(prediction_lr)

0.9768683274021353

In [152]:
train_fit_lr = prediction_lr.select('label','prediction')
train_fit_lr.groupBy('label','prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|  0.0|       1.0|   26|
|  1.0|       1.0|  979|
|  0.0|       0.0|  119|
+-----+----------+-----+
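
<font size=5>

As a cross-check, the metrics reported above can be recomputed by hand from this table. The sketch below is plain Python, not Spark, and it assumes that the (label 1.0, prediction 0.0) cell, which is absent from the table, has a count of zero. It also suggests why areaUnderROC came out as 0.9103: when hard 0/1 predictions are passed as rawPredictionCol, the ROC curve has a single interior point, and its area reduces to (TPR + 1 - FPR) / 2.

</font>

```python
# Counts read from the groupBy output above; (label 1.0, prediction 0.0) is
# absent from that table, so it is assumed to be zero.
true_ham = 979    # label 1.0, prediction 1.0
missed_spam = 26  # label 0.0, prediction 1.0
true_spam = 119   # label 0.0, prediction 0.0
missed_ham = 0    # label 1.0, prediction 0.0 (assumed)

total = true_ham + missed_spam + true_spam + missed_ham
accuracy = (true_ham + true_spam) / total

# Treat ham (1.0) as the positive class.
tpr = true_ham / (true_ham + missed_ham)
fpr = missed_spam / (missed_spam + true_spam)

# Area under the ROC polyline (0,0) -> (fpr,tpr) -> (1,1).
auc_single_threshold = (tpr + 1.0 - fpr) / 2.0

print(accuracy, auc_single_threshold)
```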



<font size=5>
Naive Bayes 
Naive Bayes classifiers are a family of simple probabilistic classifiers that apply Bayes' theorem with strong (naive) independence assumptions between features. The spark.ml implementation currently supports multinomial naive Bayes and Bernoulli naive Bayes.
</font>

In [154]:
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
Model_nb = nb.fit(trainDF)

In [155]:
predictions_nb = Model_nb.transform(testDF)
predictions_nb.select('label', 'prediction').show(5)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
|  0.0|       0.0|
+-----+----------+
only showing top 5 rows



In [158]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval_nb = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label', metricName='areaUnderROC')
my_eval_nb.evaluate(predictions_nb)

0.9545419323024902

In [159]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
my_mc_nb = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='f1')
my_mc_nb.evaluate(predictions_nb)

0.9452781419741524

In [160]:
my_mc_nb = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='label', metricName='accuracy')
my_mc_nb.evaluate(predictions_nb)

0.9412811387900356
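
<font size=5>

To show what the independence assumption buys, here is a small multinomial naive Bayes sketch in plain Python: each class gets a log prior plus a sum of per-term log likelihoods, with additive smoothing so that unseen terms do not zero out a class. The helper names and the toy corpus are hypothetical, and the sketch works on raw token lists rather than the TF-IDF vectors used in this notebook.

</font>

```python
import math
from collections import Counter


def train_nb(docs, labels, smoothing=1.0):
    """Fit a log prior and smoothed log term likelihoods for every class."""
    vocabulary = {term for doc in docs for term in doc}
    model = {}
    for label in set(labels):
        class_docs = [doc for doc, y in zip(docs, labels) if y == label]
        counts = Counter(term for doc in class_docs for term in doc)
        denom = sum(counts.values()) + smoothing * len(vocabulary)
        model[label] = (
            math.log(len(class_docs) / len(docs)),
            {term: math.log((counts[term] + smoothing) / denom) for term in vocabulary},
        )
    return model


def predict_nb(model, doc):
    """Pick the class with the highest log posterior; terms outside the vocabulary are ignored."""
    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)


# Toy corpus: 0.0 is spam and 1.0 is ham, matching the labels used above.
docs = [["free", "win", "prize"], ["win", "cash", "now"], ["see", "you", "later"], ["ok", "see", "you"]]
labels = [0.0, 0.0, 1.0, 1.0]
nb_model = train_nb(docs, labels)
print(predict_nb(nb_model, ["win", "free", "cash"]), predict_nb(nb_model, ["see", "you"]))
```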

<font size=5>

Now let's try Random Forest classification to see how it performs on the same data
    
    
</font>

In [162]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


In [163]:
rf = RandomForestClassifier(featuresCol='features', labelCol='label', predictionCol='prediction', probabilityCol='probability', rawPredictionCol='rawPrediction',maxDepth=3)
Model_rf = rf.fit(trainDF)

In [164]:
predictions_rf = Model_rf.transform(testDF)
predictions_rf.select('label', 'prediction').show(5)

+-----+----------+
|label|prediction|
+-----+----------+
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
|  0.0|       1.0|
+-----+----------+
only showing top 5 rows



In [165]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
my_eval_rf = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='label', metricName='areaUnderROC')
my_eval_rf.evaluate(predictions_rf)

0.5

<font size=5>

An area under the ROC curve (AUC) of 0.5 is no better than random guessing, so we do not want to use this Random Forest classifier. AUC ranges from 0 to 1,
and the closer it is to 1, the better the classification.

We will give up on Random Forest for this classification task.

</font>
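
<font size=5>

One plausible reading of an AUC of exactly 0.5 is that the shallow forest (maxDepth=3) predicted 1.0 for every test row, as the five rows shown above hint. That is an assumption, since only five rows are visible. The plain-Python sketch below shows that a constant predictor scores exactly 0.5 when AUC is computed from hard 0/1 predictions; single_threshold_auc is a hypothetical helper, not a Spark function.

</font>

```python
def single_threshold_auc(labels, predictions):
    """Area under the ROC polyline (0,0) -> (FPR,TPR) -> (1,1) for hard 0/1 predictions."""
    positives = sum(1 for y in labels if y == 1.0)
    negatives = len(labels) - positives
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1.0 and p == 1.0)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0.0 and p == 1.0)
    return (tp / positives + 1.0 - fp / negatives) / 2.0


# Hypothetical labels: 8 ham (1.0) and 2 spam (0.0).
labels = [1.0] * 8 + [0.0] * 2
always_ham = [1.0] * len(labels)
print(single_threshold_auc(labels, always_ham))
```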

<font size=5>

Apache Spark provides two machine learning libraries: pyspark.ml and pyspark.mllib.

Data for pyspark.ml takes the form of a DataFrame, whereas pyspark.mllib requires data as an RDD (Resilient Distributed Dataset). On top of that, pyspark.mllib training and testing data must be in labeled point format. See the definition:

https://spark.apache.org/docs/2.1.0/mllib-data-types.html#labeled-point

In fact, it is a lot harder to prepare data for machine learning in pyspark.mllib than in pyspark.ml.

Now we switch from pyspark.ml to pyspark.mllib.
    
    
    
</font>

In [166]:
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors



<font size=5>

The function below takes a row from the RDD and returns a LabeledPoint built from the row's label and features.
    
</font>

In [167]:
def parsePoint(line):
    return LabeledPoint(line[0], MLLibVectors.fromML(line[1]))

In [168]:

trainDF_2=trainDF.selectExpr("label as first",'features')
trainDF_2.columns
    

['first', 'features']

<font size=5>

Convert a training DataFrame into RDD


</font>

In [169]:
trainDF_rdd=trainDF_2.rdd

<font size=5>

The resulting RDD is mapped through the parsePoint function, producing an RDD in which every element is a LabeledPoint. This is the format required for training and testing data when calling train and predict.


</font>

In [170]:
trainDF_lp=trainDF_rdd.map(parsePoint)
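
<font size=5>

Conceptually, a labeled point is just a numeric label paired with a feature vector. The plain-Python sketch below mimics the conversion with two hypothetical stand-in classes, SparseFeatures and LabeledExample. They are not the pyspark.mllib classes; they only illustrate the shape of the data that the map step produces.

</font>

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SparseFeatures:
    """Hypothetical stand-in for a sparse vector: its size plus (index, value) pairs."""
    size: int
    entries: List[Tuple[int, float]]


@dataclass
class LabeledExample:
    """Hypothetical stand-in for a labeled point: a numeric label and its features."""
    label: float
    features: SparseFeatures


def parse_row(row):
    """Turn a (label, features) row into a LabeledExample, mirroring parsePoint."""
    return LabeledExample(float(row[0]), row[1])


rows = [(1.0, SparseFeatures(5, [(0, 1.2), (3, 0.4)])), (0.0, SparseFeatures(5, [(2, 2.0)]))]
examples = list(map(parse_row, rows))
print(examples[0].label, examples[0].features.entries)
```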

<font size=5>

Train the SVMWithSGD model using the RDD of labeled points

</font>

In [171]:
from pyspark.mllib.classification import SVMWithSGD
model_mllib_svm = SVMWithSGD.train(trainDF_lp, iterations=100)


<font size=5>
    
Evaluate SVM model using training data

</font>

In [172]:
# Evaluating the model on training data
labelsAndPreds = trainDF_lp.map(lambda p: (p.label, model_mllib_svm.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(trainDF_lp.count())
print("Training Error = " + str(trainErr))


Training Error = 0.0035955056179775282


<font size=5>

Evaluate the SVM model using testing data, which also needs to be converted to labeled point format.

</font>

In [173]:
testDF_2=testDF.selectExpr("label as first",'features')
testDF_2.columns

['first', 'features']

In [174]:
testDF_rdd=testDF_2.rdd


In [175]:
testDF_lp=testDF_rdd.map(parsePoint)

In [176]:
# Evaluating the model on testing data
labelsAndPreds = testDF_lp.map(lambda p: (p.label, model_mllib_svm.predict(p.features)))
testErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(testDF_lp.count())
print("Testing Error = " + str(testErr))


Testing Error = 0.023131672597864767
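
<font size=5>

The error computed above is simply the fraction of (label, prediction) pairs that disagree, so it equals 1 minus accuracy. The plain-Python sketch below reproduces the printed value under the assumption that 26 of the 1124 test rows were misclassified. The figure 26 is inferred from the printed error and the test-set size; it is not shown in the output.

</font>

```python
def error_rate(label_prediction_pairs):
    """Fraction of pairs whose prediction differs from the label."""
    pairs = list(label_prediction_pairs)
    wrong = sum(1 for label, prediction in pairs if label != prediction)
    return wrong / float(len(pairs))


# Hypothetical pairs sized to match the test set: 1124 rows, 26 of them misclassified.
pairs = [(0.0, 1.0)] * 26 + [(1.0, 1.0)] * 1098
print(error_rate(pairs))
```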


<font size=5>
    
Now get additional testing metrics with BinaryClassificationMetrics:

area under the precision-recall curve and area under the ROC curve.

Personally, I rely more on the area under the ROC curve to evaluate model performance, as I find it more reliable.



</font>

In [177]:
from pyspark.mllib.evaluation import BinaryClassificationMetrics

In [178]:
predictionAndLabels = testDF_lp.map(lambda lp: (float(model_mllib_svm.predict(lp.features)), lp.label))

In [179]:
# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under PR = 0.9850256872959893
Area under ROC = 0.9455954351731183


<font size=5>

While we are at it, try LogisticRegressionWithLBFGS from pyspark.mllib in the same way as the SVM above.

    
</font>

In [180]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel


In [181]:
# Build the model
model_mllib_lr_BFGS = LogisticRegressionWithLBFGS.train(trainDF_lp)


In [182]:
# Evaluating the model on training data
labelsAndPreds = trainDF_lp.map(lambda p: (p.label, model_mllib_lr_BFGS.predict(p.features)))
trainErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(trainDF_lp.count())
print("Training Error = " + str(trainErr))

Training Error = 0.0


In [183]:
# Evaluating the model on testing data
labelsAndPreds = testDF_lp.map(lambda p: (p.label, model_mllib_lr_BFGS.predict(p.features)))
testErr = labelsAndPreds.filter(lambda lp: lp[0] != lp[1]).count() / float(testDF_lp.count())
print("Testing Error = " + str(testErr))

Testing Error = 0.023131672597864767


In [184]:
predictionAndLabels = testDF_lp.map(lambda lp: (float(model_mllib_lr_BFGS.predict(lp.features)), lp.label))

In [185]:
# Instantiate metrics object
metrics = BinaryClassificationMetrics(predictionAndLabels)

# Area under precision-recall curve
print("Area under PR = %s" % metrics.areaUnderPR)

# Area under ROC curve
print("Area under ROC = %s" % metrics.areaUnderROC)

Area under PR = 0.9713207462181085
Area under ROC = 0.8978302983339791


<font size=5>

This concludes the PySpark ML/MLlib classification exercise.

</font>