# **Classification Algorithms**
Spark MLlib provides a (limited) set of classification algorithms:
- Logistic regression
    - Binomial logistic regression
    - Multinomial logistic regression
- Decision tree classifier
- Random forest classifier
- Gradient-boosted tree classifier
- Multilayer perceptron classifier
- Linear Support Vector Machine

All the Spark classification algorithms are trained on top of an input DataFrame containing (at least) two columns:
- label:
    - The class label, i.e., the attribute to be predicted by the classification model
    - It is an integer value (cast to a double)
- features:
    - A vector of doubles containing the values of the predictive attributes of the input records/data points
    - The data type of this column is pyspark.ml.linalg.Vector
    - Both dense and sparse vectors can be used
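
The difference between the two vector layouts can be sketched in plain Python. This is only an illustration: `sparse_to_dense` is a hypothetical helper, not a Spark function. In PySpark the corresponding vectors would be built with `Vectors.dense(...)` and `Vectors.sparse(size, indices, values)` from `pyspark.ml.linalg`.

```python
# Hypothetical helper (not part of Spark): expand a sparse description
# (size, indices, values) into the equivalent dense list of doubles.
def sparse_to_dense(size, indices, values):
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense

# The record (attr1=0.0, attr2=1.1, attr3=0.1) written both ways
dense_features = [0.0, 1.1, 0.1]
# size, positions of the non-zero entries, values of the non-zero entries
sparse_features = (3, [1, 2], [1.1, 0.1])

# Both layouts describe the same feature vector
print(sparse_to_dense(*sparse_features) == dense_features)
```

A sparse layout only pays off when most entries are zero; for three attributes, as in the examples of this section, a dense vector is the natural choice.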
    
## **Logistic Regression**

Consider the following example input training data file

label, attr1, attr2, attr3

  1.0,   0.0,   1.1,   0.1

  0.0,   2.0,   1.0,  -1.0

  0.0,   2.0,   1.3,   1.0

  1.0,   0.0,   1.2,  -0.5
  
It contains four records/data points. This is a binary classification problem because
the class label assumes only two values: **0 and 1**

In [1]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# input and output folders
trainingData = "./databases/trainingData.csv"
unlabeledData = "./databases/unlabeledData.csv"
outputPath = "./predictionsLR/"

In [2]:
# *************************
# Training step
# *************************

# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData,\
                                format="csv",\
                                header=True,\
                                inferSchema=True)

# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"],\
outputCol="features")

# Apply the assembler to create column features for the training data
trainingDataDF = assembler.transform(trainingData)

In [3]:
trainingDataDF.show()

+-----+-----+-----+-----+--------------+
|label|attr1|attr2|attr3|      features|
+-----+-----+-----+-----+--------------+
|  1.0|  0.0|  1.1|  0.1| [0.0,1.1,0.1]|
|  0.0|  2.0|  1.0| -1.0|[2.0,1.0,-1.0]|
|  0.0|  2.0|  1.3|  1.0| [2.0,1.3,1.0]|
|  1.0|  0.0|  1.2| -0.5|[0.0,1.2,-0.5]|
+-----+-----+-----+-----+--------------+



In [4]:
# Create a LogisticRegression object.
# LogisticRegression is an Estimator that is used to
# create a classification model based on logistic regression.
lr = LogisticRegression()

# We can set the values of the parameters of the
# Logistic Regression algorithm using the setter methods.
# There is one set method for each parameter
# For example, we are setting the maximum number of iterations to 10
# and the regularization parameter to 0.01
lr.setMaxIter(10)
lr.setRegParam(0.01)

# Train a logistic regression model on the training data
classificationModel = lr.fit(trainingDataDF)

In [5]:
# *************************
# Prediction step
# *************************

# Create a DataFrame from unlabeledData.csv
# Unlabeled data in raw format
unlabeledData = spark.read.load(unlabeledData,\
format="csv", header=True, inferSchema=True)

# Apply the same assembler we created before also on the unlabeled data
# to create the features column
unlabeledDataDF = assembler.transform(unlabeledData)

# Make predictions on the unlabeled data using the transform() method of the
# trained classification model. transform() uses only the content of 'features'
# to perform the predictions.
predictionsDF = classificationModel.transform(unlabeledDataDF)

In [6]:
predictionsDF.show()

+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
|label|attr1|attr2|attr3|      features|       rawPrediction|         probability|prediction|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
| null| -1.0|  1.5|  1.3|[-1.0,1.5,1.3]|[-6.5872014439355...|[0.00137599470692...|       1.0|
| null|  3.0|  2.0| -0.1|[3.0,2.0,-0.1]|[3.98018281942565...|[0.98166040093741...|       0.0|
| null|  0.0|  2.2| -1.5|[0.0,2.2,-1.5]|[-6.3765177028604...|[0.00169814755783...|       1.0|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
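
The relation between the `rawPrediction` and `probability` columns can be checked by hand. The sketch below assumes that, for this binomial model, the first probability is the logistic (sigmoid) function of the first raw value, and that the predicted class is the one with the larger probability (default threshold 0.5). The raw values are copied from the table above, where they are printed truncated, so the results agree with the table only up to the displayed precision.

```python
import math

def sigmoid(z):
    """Logistic function: maps a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# First element of rawPrediction for the three unlabeled records
# (truncated values copied from the printed table)
raw_first = [-6.5872014439355, 3.98018281942565, -6.3765177028604]

results = []
for raw0 in raw_first:
    p0 = sigmoid(raw0)      # probability of class 0
    p1 = 1.0 - p0           # probability of class 1
    prediction = 1.0 if p1 > p0 else 0.0
    results.append((p0, p1, prediction))
    print(f"P0={p0:.6f}  P1={p1:.6f}  prediction={prediction}")
```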



In [None]:
# The returned DataFrame has the following schema (attributes)
# - attr1
# - attr2
# - attr3
# - features: vector (values of the attributes)
# - label: double (value of the class label)
# - rawPrediction: vector (nullable = true)
# - probability: vector (the i-th cell contains the probability that the current
# record belongs to the i-th class)
# - prediction: double (the predicted class label)
# Select only the original features (i.e., the value of the original attributes
# attr1, attr2, attr3) and the predicted class for each record
predictions = predictionsDF.select("attr1", "attr2", "attr3", "prediction")

# Save the result in an HDFS output folder
predictions.write.csv(outputPath, header=True)

In [8]:
predictions.show()

+-----+-----+-----+----------+
|attr1|attr2|attr3|prediction|
+-----+-----+-----+----------+
| -1.0|  1.5|  1.3|       1.0|
|  3.0|  2.0| -0.1|       0.0|
|  0.0|  2.2| -1.5|       1.0|
+-----+-----+-----+----------+



In [9]:
# Select only predictions with prob of class 1 > 0.9

from pyspark.sql.types import DoubleType 

def extractProb(probabilities, labelIndex):
    return float(probabilities[labelIndex])

In [11]:
# probability is a vector. We want to select the rows
# where the second element of this vector (the probability of class 1)
# is greater than 0.9

spark.udf.register("extractProb", extractProb, DoubleType())

<function __main__.extractProb(probabilities, labelIndex)>

In [13]:
tempDF = predictionsDF.selectExpr("probability",\
                                "extractProb(probability, 0) as P0",
                                "extractProb(probability, 1) as P1",
                                "prediction")
tempDF.printSchema()
tempDF.show(truncate=False)

root
 |-- probability: vector (nullable = true)
 |-- P0: double (nullable = true)
 |-- P1: double (nullable = true)
 |-- prediction: double (nullable = false)

+------------------------------------------+---------------------+-------------------+----------+
|probability                               |P0                   |P1                 |prediction|
+------------------------------------------+---------------------+-------------------+----------+
|[0.0013759947069214356,0.9986240052930786]|0.0013759947069214356|0.9986240052930786 |1.0       |
|[0.9816604009374171,0.01833959906258293]  |0.9816604009374171   |0.01833959906258293|0.0       |
|[0.0016981475578358176,0.9983018524421641]|0.0016981475578358176|0.9983018524421641 |1.0       |
+------------------------------------------+---------------------+-------------------+----------+



highProbDF = predictionsDF.filter("extractProb(probability,1)>0.9")
highProbDF.show()
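
The effect of this filter can be mirrored in plain Python on the three probability vectors printed above. This is not Spark code: the list of dictionaries stands in for the DataFrame rows, and `extract_prob` simply mirrors the `extractProb` function registered earlier.

```python
# Rows copied from the printed tempDF output (probability and prediction only)
rows = [
    {"probability": [0.0013759947069214356, 0.9986240052930786], "prediction": 1.0},
    {"probability": [0.9816604009374171, 0.01833959906258293], "prediction": 0.0},
    {"probability": [0.0016981475578358176, 0.9983018524421641], "prediction": 1.0},
]

def extract_prob(probabilities, label_index):
    """Return the probability of the class at position label_index."""
    return float(probabilities[label_index])

# Keep only the rows whose probability of class 1 exceeds 0.9
high_prob = [r for r in rows if extract_prob(r["probability"], 1) > 0.9]

for r in high_prob:
    print(r)
```

Only the first and third records pass the filter; the second one has a class 1 probability of about 0.018.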

In [None]:
# If you want to store the model
# parameter -> folder in which the model is stored
classificationModel.save("modelLR") 

In [15]:
# Saving the model creates two folders inside modelLR:
# - metadata: description of the model and its configuration
# - data

## **Logistic Regression with Pipelines**


In [6]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel

# input and output folders
trainingData = "./databases/trainingData.csv"
unlabeledData = "./databases/unlabeledData.csv"
outputPath = "./predictionsLR/"

In [7]:
# READ DATA

# *************************
# Training step
# *************************
# Create a DataFrame from trainingData.csv
# Training data in raw format
trainingData = spark.read.load(trainingData,\
        format="csv",\
        header=True,\
        inferSchema=True)

In [8]:
# Define an assembler to create a column (features) of type Vector
# containing the double values associated with columns attr1, attr2, attr3
assembler = VectorAssembler(inputCols=["attr1", "attr2", "attr3"], outputCol="features")

# Create a LogisticRegression object
# LogisticRegression is an Estimator that is used to
# create a classification model based on logistic regression.
lr = LogisticRegression()

# We can set the values of the parameters of the
# Logistic Regression algorithm using the setter methods.
# There is one set method for each parameter
# For example, we are setting the maximum number of iterations to 10
# and the regularization parameter to 0.01
lr.setMaxIter(10)
lr.setRegParam(0.01)

LogisticRegression_86931dac514d

In [10]:
print(assembler)

VectorAssembler_130260aa10ff


Now we can create our **pipeline**. The argument of **.setStages(...)** is a list of stages, which are executed in order: first the assembler, then the logistic regression.

In [11]:
# Now we can define our Pipeline
# The pipeline also includes the preprocessing step
pipeline = Pipeline().setStages([assembler, lr])

# Execute the pipeline on the training data to build the
# classification model
classificationModel = pipeline.fit(trainingData)
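
What `fit` does with a list of stages can be sketched with a toy, pure-Python pipeline. Every class below is hypothetical and only imitates the transformer/estimator pattern; in particular, the toy estimator learns a simple threshold on the first feature and is not logistic regression.

```python
class ToyAssembler:
    """Transformer-like stage: packs the chosen columns into a 'features' list."""
    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.input_cols]} for r in rows]


class ToyThresholdModel:
    """Fitted model: predicts 1.0 when the first feature is below the threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [{**r, "prediction": 1.0 if r["features"][0] < self.threshold else 0.0}
                for r in rows]


class ToyThresholdEstimator:
    """Estimator-like stage: fit() learns a threshold and returns a model."""
    def fit(self, rows):
        ones = [r["features"][0] for r in rows if r["label"] == 1.0]
        zeros = [r["features"][0] for r in rows if r["label"] == 0.0]
        return ToyThresholdModel((max(ones) + min(zeros)) / 2.0)


class ToyPipelineModel:
    """Result of ToyPipeline.fit(): applies every fitted stage in order."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)   # estimator -> fitted model
            rows = stage.transform(rows)  # output feeds the next stage
            fitted.append(stage)
        return ToyPipelineModel(fitted)


training = [
    {"label": 1.0, "attr1": 0.0, "attr2": 1.1, "attr3": 0.1},
    {"label": 0.0, "attr1": 2.0, "attr2": 1.0, "attr3": -1.0},
    {"label": 0.0, "attr1": 2.0, "attr2": 1.3, "attr3": 1.0},
    {"label": 1.0, "attr1": 0.0, "attr2": 1.2, "attr3": -0.5},
]
unlabeled = [
    {"attr1": -1.0, "attr2": 1.5, "attr3": 1.3},
    {"attr1": 3.0, "attr2": 2.0, "attr3": -0.1},
    {"attr1": 0.0, "attr2": 2.2, "attr3": -1.5},
]

toy_model = ToyPipeline([ToyAssembler(["attr1", "attr2", "attr3"]),
                         ToyThresholdEstimator()]).fit(training)
toy_predictions = [r["prediction"] for r in toy_model.transform(unlabeled)]
print(toy_predictions)
```

The point of the sketch is that the fitted pipeline remembers the preprocessing stage, so raw records (without a `features` column) can be passed directly to `transform`.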


Now, the classification model can be used to predict the class label of new unlabeled data

In [12]:
# *************************
# Prediction step
# *************************
# Create a DataFrame from unlabeledData.csv
# Unlabeled data in raw format
unlabeledData = spark.read.load(unlabeledData, format="csv", header=True, inferSchema=True)

# We simply apply transform to the classification model, giving the unlabeled data as input
predictionsDF = classificationModel.transform(unlabeledData)

In [13]:
# The returned DataFrame has the following schema (attributes)
# - attr1
# - attr2
# - attr3
# - features: vector (values of the attributes)
# - label: double (value of the class label)
# - rawPrediction: vector (nullable = true)
# - probability: vector (the i-th cell contains the probability that the current
# record belongs to the i-th class)
# - prediction: double (the predicted class label)
# Select only the original features (i.e., the value of the original attributes
# attr1, attr2, attr3) and the predicted class for each record

predictions = predictionsDF.select("attr1", "attr2", "attr3", "prediction")

# Save the result in an HDFS output folder
# predictions.write.csv(outputPath, header="true")


In [14]:
trainingData.show()
predictionsDF.show()

+-----+-----+-----+-----+
|label|attr1|attr2|attr3|
+-----+-----+-----+-----+
|  1.0|  0.0|  1.1|  0.1|
|  0.0|  2.0|  1.0| -1.0|
|  0.0|  2.0|  1.3|  1.0|
|  1.0|  0.0|  1.2| -0.5|
+-----+-----+-----+-----+

+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
|label|attr1|attr2|attr3|      features|       rawPrediction|         probability|prediction|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+
| null| -1.0|  1.5|  1.3|[-1.0,1.5,1.3]|[-6.5872014439355...|[0.00137599470692...|       1.0|
| null|  3.0|  2.0| -0.1|[3.0,2.0,-0.1]|[3.98018281942565...|[0.98166040093741...|       0.0|
| null|  0.0|  2.2| -1.5|[0.0,2.2,-1.5]|[-6.3765177028604...|[0.00169814755783...|       1.0|
+-----+-----+-----+-----+--------------+--------------------+--------------------+----------+



## **Decision Tree**
The solution is the same as the previous one; only the **algorithm** changes: a DecisionTreeClassifier is used instead of LogisticRegression.

In [30]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml import Pipeline
from pyspark.ml import PipelineModel
...

Ellipsis

In [31]:
# Create a DecisionTreeClassifier object.
# DecisionTreeClassifier is an Estimator that is used to
# create a classification model based on decision trees.
dt = DecisionTreeClassifier()

# We can set the values of the parameters of the DecisionTree
# For example we can set the measure that is used to decide if a
# node must be split. In this case we set gini index
dt.setImpurity("gini")

DecisionTreeClassifier_56504325f349
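
As a reminder of what the Gini index measures, here is a minimal pure-Python sketch. The `gini` function is a hypothetical helper written for illustration, not Spark's implementation; it computes one minus the sum of the squared class frequencies of a set of labels.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# Labels of the four training records: two of class 1 and two of class 0
root_impurity = gini([1.0, 0.0, 0.0, 1.0])

# A node that contains records of a single class is pure
pure_impurity = gini([1.0, 1.0])

print(root_impurity, pure_impurity)
```

A split is useful when the child nodes are purer than their parent: here the root has impurity 0.5, while a node holding only class 1 records has impurity 0.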

In [None]:
...