### Classification Using SparkML
Classification is a supervised machine learning task that predicts the category (class) of each record from a given set of features.

For example:
- Will a student pass or fail the next exam?
- Is the image of a dog or a cat?

**Why Spark for machine learning?**
- It scales well as the data and the cluster grow.
- It can run machine learning algorithms across distributed systems.
- It handles big data.

**What are the steps for implementing classification algorithms using PySpark and Spark ML?**
1. Import the necessary libraries.
2. Create a SparkSession instance.
3. Load the dataset as a Spark DataFrame.
4. Decide on the feature and target variables.
5. Assemble the feature columns into a single vector column, named for instance `features`.
6. Split the dataset into training and testing sets.
7. Create a LogisticRegression instance.
8. Train the model on the training dataset.
9. Use the model to predict on the testing data to check the quality of the model.
10. Evaluate the model using classification metrics (Accuracy, Confusion Matrix, Precision, Recall).
11. Don't forget to stop the SparkSession.

In [40]:
# Import the needed libraries
import findspark
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

In [41]:
#Initiate SparkSession
findspark.init()
spark = SparkSession.builder.appName('Classification Prediction').getOrCreate()

In [42]:
%%bash
# Download the data file unless it is already present in data/
filename='drybeans.csv'
if test -f "data/$filename"; then
    echo "file already exists"
else
    # Make sure the target directory exists before moving the file into it
    mkdir -p data
    if wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0231EN-SkillsNetwork/datasets/drybeans.csv; then
        echo 'file downloaded successfully'
        mv "$filename" data/
    else
        echo 'download failed'
    fi
fi

file already exists
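
The same check-then-download logic can also be written in Python, which avoids depending on `wget` being installed. This is a minimal sketch using only the standard library; the helper name `ensure_dataset` is hypothetical and is not part of the original notebook.

```python
import os
import urllib.request


def ensure_dataset(url, dest_path):
    """Download url to dest_path unless the file already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.isfile(dest_path):
        return False
    # Create the target directory (for example data/) if it is missing
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, dest_path)
    return True
```

Calling `ensure_dataset(url, 'data/drybeans.csv')` with the dataset URL from the cell above would mirror the bash cell: it returns `False` when the file is already present and downloads it otherwise.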


In [43]:
#Load the data from the file into spark dataframe
df = spark.read.csv('data/drybeans.csv',header=True, inferSchema=True)
df.printSchema()

root
 |-- Area: integer (nullable = true)
 |-- Perimeter: double (nullable = true)
 |-- MajorAxisLength: double (nullable = true)
 |-- MinorAxisLength: double (nullable = true)
 |-- AspectRation: double (nullable = true)
 |-- Eccentricity: double (nullable = true)
 |-- ConvexArea: integer (nullable = true)
 |-- EquivDiameter: double (nullable = true)
 |-- Extent: double (nullable = true)
 |-- Solidity: double (nullable = true)
 |-- roundness: double (nullable = true)
 |-- Compactness: double (nullable = true)
 |-- ShapeFactor1: double (nullable = true)
 |-- ShapeFactor2: double (nullable = true)
 |-- ShapeFactor3: double (nullable = true)
 |-- ShapeFactor4: double (nullable = true)
 |-- Class: string (nullable = true)



In [44]:
df.show(5)

+-----+---------+---------------+---------------+------------+------------+----------+-------------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+-----+
| Area|Perimeter|MajorAxisLength|MinorAxisLength|AspectRation|Eccentricity|ConvexArea|EquivDiameter|     Extent|   Solidity|  roundness|Compactness|ShapeFactor1|ShapeFactor2|ShapeFactor3|ShapeFactor4|Class|
+-----+---------+---------------+---------------+------------+------------+----------+-------------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+-----+
|28395|  610.291|    208.1781167|     173.888747| 1.197191424| 0.549812187|     28715|  190.1410973|0.763922518|0.988855999|0.958027126|0.913357755| 0.007331506| 0.003147289| 0.834222388| 0.998723889|SEKER|
|28734|  638.018|    200.5247957|    182.7344194| 1.097356461| 0.411785251|     29172|  191.2727505|0.783968133|0.984985603|0.887033637|0.953860842| 0.006978659| 0.00356362

In [45]:
df.groupBy('Class').agg(
    F.count('AspectRation').alias('Count')
).show()

+--------+-----+
|   Class|Count|
+--------+-----+
|    CALI| 1630|
|   SEKER| 2027|
|    SIRA| 2636|
|   HOROZ| 1928|
|  BOMBAY|  522|
|BARBUNYA| 1322|
|DERMASON| 3546|
+--------+-----+



Notice that `Class` is a string column. We want to convert it into numeric category indices (0, 1, 2, ...), which is what `StringIndexer` does.

In [46]:
indexer = StringIndexer(inputCol='Class', outputCol='label')
df = indexer.fit(df).transform(df)
df.select(['Class','label']).show(10)

+-----+-----+
|Class|label|
+-----+-----+
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
|SEKER|  2.0|
+-----+-----+
only showing top 10 rows


In [47]:
# Cast the label from double to int. Spark resolves column names
# case-insensitively by default, so writing to 'Label' replaces the existing
# 'label' column; there is no leftover column to drop afterwards.
df = df.withColumn('Label', F.col('label').cast('int'))

In [48]:
df.select(['Class','Label']).show(10)

+-----+-----+
|Class|Label|
+-----+-----+
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
|SEKER|    2|
+-----+-----+
only showing top 10 rows


In [54]:
df.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|    1| 2636|
|    6|  522|
|    3| 1928|
|    5| 1322|
|    4| 1630|
|    2| 2027|
|    0| 3546|
+-----+-----+



In [None]:
#Show how Class is labeled with integer values
df.select('Class','Label').distinct().orderBy('Label').show()

+--------+-----+
|   Class|Label|
+--------+-----+
|DERMASON|    0|
|    SIRA|    1|
|   SEKER|    2|
|   HOROZ|    3|
|    CALI|    4|
|BARBUNYA|    5|
|  BOMBAY|    6|
+--------+-----+
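
The mapping above follows from `StringIndexer`'s default ordering (`stringOrderType='frequencyDesc'`): the most frequent class gets index 0, the next most frequent gets 1, and so on. The plain-Python sketch below reproduces that rule from the class counts shown earlier. It only illustrates the ordering and is not Spark's implementation; the helper name `frequency_desc_index` is hypothetical.

```python
def frequency_desc_index(class_counts):
    """Map each class name to an index, most frequent class first."""
    ordered = sorted(class_counts, key=lambda name: class_counts[name], reverse=True)
    return {name: index for index, name in enumerate(ordered)}


# Class counts taken from the groupBy('Class') output earlier in this notebook
counts = {
    "CALI": 1630, "SEKER": 2027, "SIRA": 2636, "HOROZ": 1928,
    "BOMBAY": 522, "BARBUNYA": 1322, "DERMASON": 3546,
}

mapping = frequency_desc_index(counts)
for name, index in sorted(mapping.items(), key=lambda item: item[1]):
    print(index, name)
```

Because the indices depend on class frequencies, fitting the indexer on a different dataset can produce a different mapping, so the fitted indexer should be reused rather than refit when labeling new data.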



In [55]:
featureColumns = ["Area","Perimeter","Solidity","roundness","Compactness"]
assembler = VectorAssembler(inputCols=featureColumns, outputCol='features')
transformed_df  = assembler.transform(df)
transformed_df.show(5)

+-----+---------+---------------+---------------+------------+------------+----------+-------------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+-----+-----+--------------------+
| Area|Perimeter|MajorAxisLength|MinorAxisLength|AspectRation|Eccentricity|ConvexArea|EquivDiameter|     Extent|   Solidity|  roundness|Compactness|ShapeFactor1|ShapeFactor2|ShapeFactor3|ShapeFactor4|Class|Label|            features|
+-----+---------+---------------+---------------+------------+------------+----------+-------------+-----------+-----------+-----------+-----------+------------+------------+------------+------------+-----+-----+--------------------+
|28395|  610.291|    208.1781167|     173.888747| 1.197191424| 0.549812187|     28715|  190.1410973|0.763922518|0.988855999|0.958027126|0.913357755| 0.007331506| 0.003147289| 0.834222388| 0.998723889|SEKER|    2|[28395.0,610.291,...|
|28734|  638.018|    200.5247957|    182.7344194| 1.097356461| 0
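
Conceptually, `VectorAssembler` copies the selected columns of each row, in the order given by `inputCols`, into a single numeric vector. The sketch below gives a rough plain-Python picture of that step for the first row of the dataset; the function `assemble` is a hypothetical stand-in and not a Spark API.

```python
def assemble(row, feature_columns):
    """Collect the chosen columns of one row into a list of floats, in order."""
    return [float(row[column]) for column in feature_columns]


feature_columns = ["Area", "Perimeter", "Solidity", "roundness", "Compactness"]

# Values copied from the first row shown by df.show(5)
first_row = {
    "Area": 28395, "Perimeter": 610.291, "Solidity": 0.988855999,
    "roundness": 0.958027126, "Compactness": 0.913357755, "Class": "SEKER",
}

features = assemble(first_row, feature_columns)
print(features)
```

The resulting list starts with 28395.0 and 610.291, which matches the beginning of the `features` value shown for that row in the output above.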

In [56]:
# Split the data 70:30 into training and testing sets

(training_data, testing_data) = transformed_df.randomSplit([0.7,0.3], seed=123)

In [57]:
#Build the model
logisticRegression = LogisticRegression(featuresCol='features', labelCol='Label')
model =  logisticRegression.fit(training_data)

25/12/09 16:38:16 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


In [58]:
predictions = model.transform(testing_data)

In [59]:
predictions.columns

['Area',
 'Perimeter',
 'MajorAxisLength',
 'MinorAxisLength',
 'AspectRation',
 'Eccentricity',
 'ConvexArea',
 'EquivDiameter',
 'Extent',
 'Solidity',
 'roundness',
 'Compactness',
 'ShapeFactor1',
 'ShapeFactor2',
 'ShapeFactor3',
 'ShapeFactor4',
 'Class',
 'Label',
 'features',
 'rawPrediction',
 'probability',
 'prediction']
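
The three columns added by the model are related. For multinomial logistic regression, `rawPrediction` holds one score per class, `probability` is the softmax of those scores, and `prediction` is the index of the largest probability. The sketch below illustrates that relationship in plain Python. The scores are made-up values for a 3-class problem and do not come from the model above.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a 3-class problem (not taken from the model above)
raw_scores = [2.0, 0.5, -1.0]
probabilities = softmax(raw_scores)

# The predicted class is the index with the highest probability
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(probabilities)
print(predicted_class)
```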

In [67]:
predictions.sample(False,0.1,seed=0).select(['Label','prediction']).show()

+-----+----------+
|Label|prediction|
+-----+----------+
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 20 rows


In [None]:
predictions.groupBy('Label').count().show()

+-----+-----+
|Label|count|
+-----+-----+
|    1|  785|
|    6|  149|
|    3|  617|
|    5|  395|
|    4|  475|
|    2|  601|
|    0| 1119|
+-----+-----+
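
`randomSplit` assigns rows to the splits at random, so the weights are targets rather than exact proportions. A quick arithmetic check, using only the counts printed in this notebook, shows how close the testing set came to 30%:

```python
# Per-class row counts from the groupBy('Class') output (full dataset)
full_counts = [1630, 2027, 2636, 1928, 522, 1322, 3546]
# Per-label row counts from predictions.groupBy('Label') (testing set)
test_counts = [785, 149, 617, 395, 475, 601, 1119]

total_rows = sum(full_counts)
test_rows = sum(test_counts)
test_fraction = test_rows / total_rows

print(total_rows, test_rows, round(test_fraction, 4))  # prints: 13611 4141 0.3042
```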



In [69]:
predictions.groupBy('prediction').count().show()

+----------+-----+
|prediction|count|
+----------+-----+
|       0.0| 1116|
|       1.0|  789|
|       4.0|  485|
|       3.0|  611|
|       2.0|  613|
|       6.0|  149|
|       5.0|  378|
+----------+-----+



In [71]:
#Get Accuracy of the model
evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='Label', metricName='accuracy')
accuracy = evaluator.evaluate(predictions)
print(f'Accuracy of the model = {accuracy}')

Accuracy of the model = 0.9128229896160348
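
Accuracy is only one of the metrics listed in the steps. The plain-Python sketch below shows how confusion-matrix counts and per-class precision and recall are derived from (label, prediction) pairs. The sample pairs are invented for illustration and the helper names are hypothetical. In Spark itself, `MulticlassClassificationEvaluator` also accepts `metricName` values such as `weightedPrecision`, `weightedRecall`, and `f1`, and `predictions.groupBy('Label', 'prediction').count()` gives the confusion-matrix counts.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count how often each (label, prediction) combination occurs."""
    return Counter(pairs)


def precision_recall(pairs, target):
    """Precision and recall for one class, treating it as the positive class."""
    counts = confusion_counts(pairs)
    true_pos = counts[(target, target)]
    predicted_pos = sum(n for (label, pred), n in counts.items() if pred == target)
    actual_pos = sum(n for (label, pred), n in counts.items() if label == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Invented (label, prediction) pairs, not taken from the model above
pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (2, 2), (2, 2)]

accuracy = sum(1 for label, pred in pairs if label == pred) / len(pairs)
print("accuracy:", accuracy)
for cls in sorted({label for label, _ in pairs}):
    print(cls, precision_recall(pairs, cls))
```

Looking at precision and recall per class is useful here because the classes are imbalanced (BOMBAY has far fewer rows than DERMASON), and a single accuracy figure can hide weak performance on a small class.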


In [72]:
#Stop Spark Session
spark.stop()