-- Notepad to myself --

# Classification

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

Source: Iris dataset -> https://archive.ics.uci.edu/ml/datasets/iris

This data set contains data about three species of irises. The features are measurements of two parts of the flower, the sepal and the petal. There is a length and width measurement for both the sepal and the petal, creating a total of four features. The label in this data set is the name of the species.

In [2]:
iris_df = spark.read.csv("data/iris.data", 
                         inferSchema=True)
iris_df.printSchema()

root
 |-- _c0: double (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: double (nullable = true)
 |-- _c4: string (nullable = true)



In [3]:
from pyspark.sql.functions import col  # only col() is used below
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer

In [4]:
iris_df = iris_df.select(col("_c0").alias("sepal_length"),
                        col("_c1").alias("sepal_width"),
                        col("_c2").alias("petal_length"),
                        col("_c3").alias("petal_width"),
                        col("_c4").alias("species"))
iris_df.show(5)

+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width|    species|
+------------+-----------+------------+-----------+-----------+
|         5.1|        3.5|         1.4|        0.2|Iris-setosa|
|         4.9|        3.0|         1.4|        0.2|Iris-setosa|
|         4.7|        3.2|         1.3|        0.2|Iris-setosa|
|         4.6|        3.1|         1.5|        0.2|Iris-setosa|
|         5.0|        3.6|         1.4|        0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
only showing top 5 rows



Now, let's combine the four measurement columns into a single feature vector column, which is the input format the Spark ML classifiers expect.

In [5]:
vectorAssembler = VectorAssembler(inputCols=["sepal_length", "sepal_width", "petal_length", "petal_width"], 
                                  outputCol="features")

In [6]:
viris_df = vectorAssembler.transform(iris_df)

In [7]:
viris_df.take(1)

[Row(sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2, species='Iris-setosa', features=DenseVector([5.1, 3.5, 1.4, 0.2]))]

So, each row now has the four measurements (sepal length, sepal width, petal length, and petal width), the species, and a new features column that holds a vector of the same four measurements.
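To make the assembling step concrete, here is a plain-Python illustration of what happens to one row. This is not Spark code: `assemble` is a hypothetical helper, and a list stands in for Spark's DenseVector.

```python
# Plain-Python illustration of what the assembler does to a single row.
# Not Spark code: `assemble` is a hypothetical helper for these notes.
FEATURE_COLS = ["sepal_length", "sepal_width", "petal_length", "petal_width"]


def assemble(row, input_cols=FEATURE_COLS, output_col="features"):
    """Return a copy of `row` with the input columns packed into one list."""
    out = dict(row)  # keep the original columns untouched
    out[output_col] = [float(row[c]) for c in input_cols]
    return out


row = {
    "sepal_length": 5.1,
    "sepal_width": 3.5,
    "petal_length": 1.4,
    "petal_width": 0.2,
    "species": "Iris-setosa",
}
assembled = assemble(row)
print(assembled["features"])
```

The original columns stay in place; the vector is added alongside them, just as in the `take(1)` output above.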

Now, the last piece of preprocessing is to convert the label, which is the species name, into a numeric value. For that we use a transformation called StringIndexer.

In [8]:
indexer = StringIndexer(inputCol="species", outputCol="label")

In [9]:
iviris_df = indexer.fit(viris_df).transform(viris_df)

In [10]:
iviris_df.show(5, truncate=False)

+------------+-----------+------------+-----------+-----------+-----------------+-----+
|sepal_length|sepal_width|petal_length|petal_width|species    |features         |label|
+------------+-----------+------------+-----------+-----------+-----------------+-----+
|5.1         |3.5        |1.4         |0.2        |Iris-setosa|[5.1,3.5,1.4,0.2]|0.0  |
|4.9         |3.0        |1.4         |0.2        |Iris-setosa|[4.9,3.0,1.4,0.2]|0.0  |
|4.7         |3.2        |1.3         |0.2        |Iris-setosa|[4.7,3.2,1.3,0.2]|0.0  |
|4.6         |3.1        |1.5         |0.2        |Iris-setosa|[4.6,3.1,1.5,0.2]|0.0  |
|5.0         |3.6        |1.4         |0.2        |Iris-setosa|[5.0,3.6,1.4,0.2]|0.0  |
+------------+-----------+------------+-----------+-----------+-----------------+-----+
only showing top 5 rows



So, we now have our four measurements (sepal length, sepal width, petal length, petal width), the species, the feature vector, and a new column called "label" with a value of 0.0. That's because the species in these rows, Iris-setosa, has been mapped to an index value of zero. This concludes our preprocessing step, and now we're ready to apply classification algorithms.
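A rough sketch of how the index values get assigned. By default StringIndexer orders labels by descending frequency, so the most common label gets index 0. The alphabetical tie-breaking below is an assumption on my part (it matches recent Spark versions as far as I know), and `fit_string_index` is a hypothetical helper, not a Spark function. With 50 rows per species, the tie case is exactly what we have here.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct string to a float index, most frequent first."""
    counts = Counter(values)
    # Sort by descending count; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: float(i) for i, v in enumerate(ordered)}


# The Iris data set has 50 rows for each of the three species.
species = (["Iris-setosa"] * 50
           + ["Iris-versicolor"] * 50
           + ["Iris-virginica"] * 50)
mapping = fit_string_index(species)
print(mapping)
```

Under these assumptions Iris-setosa maps to 0.0, which agrees with the label column shown above.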

## Classification Algorithms

### Naive Bayes

In [11]:
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.evaluation import MulticlassClassificationEvaluator #to evaluate the accuracy of the model

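We split our iris data set into training and test sets, with roughly 60% of the rows for training and 40% for testing. The second argument is a seed, so the split can be reproduced.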

In [12]:
splits = iviris_df.randomSplit([0.6, 0.4], 1)

In [13]:
train_df = splits[0]
test_df = splits[1]

In [14]:
train_df.count()

98

In [15]:
test_df.count()

52

In [16]:
iviris_df.count()

150
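Note that the counts are 98 and 52, not exactly 90 and 60. That is expected: the weights act as probabilities for each row rather than exact proportions. Here is a plain-Python sketch of that idea. `random_split` is a hypothetical helper, and Python's random generator will not reproduce Spark's exact row assignment, so the sizes it prints will differ from the ones above.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability
    proportional to that split's weight."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.6, 1.0] for weights [0.6, 0.4].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if u < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(list(range(150)), [0.6, 0.4], seed=1)
print(len(train), len(test))
```

Every row lands in exactly one split, so the sizes always add up to the total, but the individual sizes only approximate the 60/40 weights.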

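The model type is multinomial, which is the default. In Spark's Naive Bayes the model type describes how the features are modeled, not how many classes there are: the multinomial variant treats each feature as a non-negative count or frequency, so it requires non-negative feature values, which our measurements satisfy. The classifier handles our three types of irises regardless of the model type.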
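A minimal sketch of how a multinomial Naive Bayes scores a row, to show what "multinomial" refers to: each class gets a log prior plus a weighted sum of log feature probabilities. The smoothing of 1.0 mirrors what I believe is Spark's default, but the function names and the toy data are made up for illustration, and this is not Spark's implementation.

```python
import math


def fit_multinomial_nb(rows, labels, smoothing=1.0):
    """Estimate a log prior and per-feature log probabilities per class."""
    classes = sorted(set(labels))
    n_features = len(rows[0])
    model = {}
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        feature_sums = [sum(r[j] for r in class_rows) for j in range(n_features)]
        total = sum(feature_sums)
        log_prior = math.log(
            (len(class_rows) + smoothing) / (len(rows) + smoothing * len(classes))
        )
        log_theta = [
            math.log((s + smoothing) / (total + smoothing * n_features))
            for s in feature_sums
        ]
        model[c] = (log_prior, log_theta)
    return model


def raw_scores(model, x):
    """Unnormalized log-score of row x for every class."""
    return {
        c: log_prior + sum(xj * lt for xj, lt in zip(x, log_theta))
        for c, (log_prior, log_theta) in model.items()
    }


def predict(model, x):
    scores = raw_scores(model, x)
    return max(scores, key=scores.get)


# Toy data (hypothetical): class 0 is heavy on feature 0, class 1 on feature 1.
toy_rows = [[5.0, 1.0], [4.0, 1.0], [1.0, 5.0], [1.0, 4.0]]
toy_labels = [0, 0, 1, 1]
toy_model = fit_multinomial_nb(toy_rows, toy_labels)
print(predict(toy_model, [5.0, 1.0]), predict(toy_model, [1.0, 5.0]))
```

The scores are sums of logarithms, which is why raw scores from this kind of model come out as negative numbers.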

In [17]:
nb = NaiveBayes(modelType="multinomial")

In [18]:
nbmodel = nb.fit(train_df)

In [19]:
predictions_df = nbmodel.transform(test_df)

In [20]:
predictions_df.take(1)

[Row(sepal_length=4.3, sepal_width=3.0, petal_length=1.1, petal_width=0.1, species='Iris-setosa', features=DenseVector([4.3, 3.0, 1.1, 0.1]), label=0.0, rawPrediction=DenseVector([-9.9894, -11.3476, -11.902]), probability=DenseVector([0.7118, 0.183, 0.1051]), prediction=0.0)]

The model adds three columns to each row: rawPrediction, a vector of raw per-class scores; probability, a vector of per-class probabilities; and the final column, prediction. In this case the prediction is 0.0, which is an index into the species. Zero is the index for Iris-setosa, so this prediction is correct.
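The probability vector in that row can be recovered from the raw prediction if we assume the raw values are unnormalized log-scores: exponentiate and normalize so the values sum to one. A small sketch using the numbers from the output above (`normalize_log_scores` is a hypothetical helper name):

```python
import math


def normalize_log_scores(raw):
    """Turn unnormalized log-scores into probabilities that sum to 1."""
    m = max(raw)  # subtract the max first for numerical stability
    exps = [math.exp(r - m) for r in raw]
    total = sum(exps)
    return [e / total for e in exps]


raw_prediction = [-9.9894, -11.3476, -11.902]  # copied from the row above
probability = normalize_log_scores(raw_prediction)
prediction = float(probability.index(max(probability)))
print([round(p, 4) for p in probability], prediction)
```

Up to rounding, this gives the same values as the probability column, and the index of the largest one is the prediction.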

Now, looking at one example does not tell us how well the model behaves overall, so let's do a more thorough evaluation. Let's create an evaluator, and the metric that we're trying to measure is accuracy.

In [21]:
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

In [22]:
nb_accuracy = evaluator.evaluate(predictions_df)
nb_accuracy

0.9807692307692307
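Accuracy here is just the fraction of test rows where the prediction equals the label. The value above is consistent with 51 correct predictions out of the 52 test rows. A plain-Python sketch; the label and prediction lists are invented for illustration (only the totals are chosen to match), and `accuracy` is a hypothetical helper.

```python
def accuracy(labels, predictions):
    """Fraction of rows where the predicted index equals the label index."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


# Hypothetical test set: 52 rows with exactly one misclassification.
labels = [0.0] * 20 + [1.0] * 16 + [2.0] * 16
predictions = list(labels)
predictions[25] = 2.0  # one class-1 row predicted as class 2

print(accuracy(labels, predictions))
```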

### Multi-layer perceptron classification

Now we're going to work with a multi-layer perceptron, which is a type of neural network.

In [23]:
from pyspark.ml.classification import MultilayerPerceptronClassifier

The way a multi-layer perceptron classifier works is that we have, as the name implies, multiple layers of neurons. In all cases, **the first layer has the same number of nodes as there are inputs**. We have four measurements, so our first layer will have four nodes. And **the last layer should have the same number of neurons as there are types of outputs**. In our case there are three iris species. Finally, we want layers in between, the hidden layers, which help the multi-layer perceptron learn how to classify correctly. So I'm going to insert two hidden layers of five neurons each.

So we are going to have a four-layer perceptron. The first layer will have four neurons, the middle two layers will have five neurons each, and the output layer will have three neurons, one for each iris species.
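To get a feel for what that layer layout implies, here is a plain-Python sketch that counts the trainable parameters and runs one forward pass. It assumes one bias per non-input neuron, sigmoid activations in the hidden layers, and a softmax output, which is how I understand Spark's classifier to be structured. The weights are random rather than trained, and all function names are hypothetical.

```python
import math
import random

layers = [4, 5, 5, 3]


def count_parameters(layers):
    """Weights plus one bias per neuron for each pair of adjacent layers."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


def init_weights(layers, seed=1):
    """Random (untrained) weight matrix and bias vector for each layer pair."""
    rng = random.Random(seed)
    return [
        (
            [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [rng.uniform(-1, 1) for _ in range(n_out)],
        )
        for n_in, n_out in zip(layers, layers[1:])
    ]


def forward(x, weights):
    """Sigmoid on hidden layers, softmax on the output layer."""
    a = x
    for idx, (W, b) in enumerate(weights):
        z = [sum(w * ai for w, ai in zip(row, a)) + bi for row, bi in zip(W, b)]
        if idx < len(weights) - 1:
            a = [1.0 / (1.0 + math.exp(-zi)) for zi in z]
        else:
            m = max(z)
            exps = [math.exp(zi - m) for zi in z]
            total = sum(exps)
            a = [e / total for e in exps]
    return a


n_params = count_parameters(layers)  # (4+1)*5 + (5+1)*5 + (5+1)*3
probs = forward([5.1, 3.5, 1.4, 0.2], init_weights(layers))
print(n_params, probs)
```

The three output values sum to one, giving one probability per species. Training is what adjusts the weights so those probabilities become meaningful.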

In [24]:
layers = [4, 5, 5, 3]

In [25]:
mlp = MultilayerPerceptronClassifier(layers=layers, seed=1)

In [26]:
mlp_model = mlp.fit(train_df)

In [27]:
mlp_predictions = mlp_model.transform(test_df)

In [28]:
mlp_evaluator = MulticlassClassificationEvaluator(metricName="accuracy")

In [29]:
mlp_accuracy = mlp_evaluator.evaluate(mlp_predictions)
mlp_accuracy

0.6923076923076923

### Decision tree classification

In [30]:
from pyspark.ml.classification import DecisionTreeClassifier

In [31]:
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

In [32]:
dt_model = dt.fit(train_df)

In [33]:
dt_predictions = dt_model.transform(test_df)

In [34]:
dt_evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")

In [35]:
dt_accuracy = dt_evaluator.evaluate(dt_predictions)
dt_accuracy

0.9423076923076923

*Notes:*
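- The multilayer perceptron required us to make some configuration decisions about how to structure the neural net. With the [4, 5, 5, 3] layout used here it reached only about 69% accuracy on the test set, the lowest of the three models, so the configuration is worth experimenting with.
- The decision tree reached about 94% accuracy, close to the 98% from Naive Bayes, and didn't require us to make any configuration decisions.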
- In general, when working with classification algorithms it's helpful to experiment with a number of different algorithms and a number of different configurations, if that's required by the algorithm. 
- Naive Bayes can work well in some cases, for example when the attributes in your data set are independent of each other, that is, they don't tightly correlate with each other.
- In other cases, when you have non-linear relationships between data elements, the multilayer perceptron is a good option. 
- Decision trees are a good option in many cases and it's worth starting with decision trees and then trying other classification algorithms from there.