## Objective
---------------------
This notebook uses the Spark ML library to perform a classification task with **`DecisionTreeClassifier`** on the UCI wine dataset.

Dataset Source: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

### [1] Explore Wine Dataset

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql import DataFrame

In [2]:
data_loc = '../Course Materials/spark-2-building-machine-learning-models/02/demos/datasets/wine.data'

In [3]:
# Create Spark Session Object

spark = SparkSession.builder.appName('Decision Tree Classifier').getOrCreate()

In [4]:
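# The file has no header row and no schema is inferred here,
# so columns are named _c0.._c13 and every value is read as a string
wine = spark.read.csv(data_loc)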
wine.show(5)

+---+-----+----+----+----+---+----+----+---+----+----+----+----+----+
|_c0|  _c1| _c2| _c3| _c4|_c5| _c6| _c7|_c8| _c9|_c10|_c11|_c12|_c13|
+---+-----+----+----+----+---+----+----+---+----+----+----+----+----+
|  1|14.23|1.71|2.43|15.6|127| 2.8|3.06|.28|2.29|5.64|1.04|3.92|1065|
|  1| 13.2|1.78|2.14|11.2|100|2.65|2.76|.26|1.28|4.38|1.05| 3.4|1050|
|  1|13.16|2.36|2.67|18.6|101| 2.8|3.24| .3|2.81|5.68|1.03|3.17|1185|
|  1|14.37|1.95| 2.5|16.8|113|3.85|3.49|.24|2.18| 7.8| .86|3.45|1480|
|  1|13.24|2.59|2.87|  21|118| 2.8|2.69|.39|1.82|4.32|1.04|2.93| 735|
+---+-----+----+----+----+---+----+----+---+----+----+----+----+----+
only showing top 5 rows



In [5]:
# Apply Column Names to DF

wine = wine.toDF('Label',
                'Alcohol',
                'MalicAcid',
                'Ash',
                'AshAlkalinity',
                'Magnesium',
                'TotalPhenols',
                'Flavanoids',
                'NonflavanoidPhenols',
                'Proanthocyanins',
                'ColorIntensity',
                'Hue',
                'OD',
                'Proline')

In [6]:
wine.printSchema()

root
 |-- Label: string (nullable = true)
 |-- Alcohol: string (nullable = true)
 |-- MalicAcid: string (nullable = true)
 |-- Ash: string (nullable = true)
 |-- AshAlkalinity: string (nullable = true)
 |-- Magnesium: string (nullable = true)
 |-- TotalPhenols: string (nullable = true)
 |-- Flavanoids: string (nullable = true)
 |-- NonflavanoidPhenols: string (nullable = true)
 |-- Proanthocyanins: string (nullable = true)
 |-- ColorIntensity: string (nullable = true)
 |-- Hue: string (nullable = true)
 |-- OD: string (nullable = true)
 |-- Proline: string (nullable = true)



In [7]:
wine.toPandas().head(5)

|   | Label | Alcohol | MalicAcid | Ash | AshAlkalinity | Magnesium | TotalPhenols | Flavanoids | NonflavanoidPhenols | Proanthocyanins | ColorIntensity | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.5 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |


In [8]:
wine.describe().toPandas()

|   | summary | Label | Alcohol | MalicAcid | Ash | AshAlkalinity | Magnesium | TotalPhenols | Flavanoids | NonflavanoidPhenols | Proanthocyanins | ColorIntensity | Hue | OD | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| 1 | mean | 1.9382022471910112 | 13.000617977528083 | 2.336348314606741 | 2.3665168539325854 | 19.49494382022472 | 99.74157303370788 | 2.295112359550562 | 2.029269662921348 | 0.3618539325842697 | 1.5908988764044951 | 5.058089882022473 | 0.9574494382022468 | 2.6116853932584254 | 746.8932584269663 |
| 2 | stddev | 0.7750349899850563 | 0.811826538005858 | 1.1171460976144625 | 0.2743440090608148 | 3.339563767173504 | 14.282483515295652 | 0.6258510488339892 | 0.9988586850169472 | 0.1244533402966794 | 0.5723588626747612 | 2.318285871822413 | 0.2285715658298232 | 0.7099904287650503 | 314.9074742768492 |
| 3 | min | 1.0 | 11.03 | 0.74 | 1.36 | 10.6 | 100.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 1015.0 |
| 4 | max | 3.0 | 14.83 | 5.8 | 3.23 | 30.0 | 99.0 | 3.88 | 5.08 | 0.66 | 3.58 | 9.899999 | 1.71 | 4.0 | 990.0 |


#### Observations:
----------------------
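1. The 'count' row is 178 for every column => no column has missing values. Pretty clean dataset.
2. The 'mean' row shows attributes on very different scales (for example, `Proline` averages about 747 while `Hue` averages about 0.96), probably because they are measured in different units. Normalization / standardization may be worth considering for scale-sensitive models; decision trees split on thresholds, so they are not affected by feature scale.
3. The 'min' and 'max' rows are not reliable here: every column was read as a string, so the values are compared as text. That is why `Magnesium` shows a min of 100 and a max of 99, and `Proline` a min of 1015 and a max of 990.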
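
The odd `min` / `max` values for `Magnesium` in the `describe()` output come from string comparison, since the schema shows every column as `string`. A minimal pure-Python sketch of the effect follows. The list is only a small sample (the five `Magnesium` values from the rows shown earlier plus the 99 that `describe()` reports as the maximum), not the full column, and the variable names are illustrative.

```python
# Sample of Magnesium values as strings: five from the rows shown earlier,
# plus "99", the value that describe() reports as the maximum.
magnesium_as_text = ["127", "100", "101", "113", "118", "99"]

# Strings are compared character by character, so "100" < "99" because "1" < "9".
text_min, text_max = min(magnesium_as_text), max(magnesium_as_text)

# Converting to numbers first gives the ordering we actually want.
magnesium = [float(v) for v in magnesium_as_text]
num_min, num_max = min(magnesium), max(magnesium)

print(f"as strings: min={text_min}, max={text_max}")
print(f"as numbers: min={num_min}, max={num_max}")
```

Casting the columns to a numeric type, or reading the file with `inferSchema=True`, before calling `describe()` should avoid the problem.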

In [9]:
# Check Distinct labels and their distribution

wine.select('label').distinct().show()

+-----+
|label|
+-----+
|    3|
|    1|
|    2|
+-----+



In [10]:
wine.groupBy('label').count().show()

+-----+-----+
|label|count|
+-----+-----+
|    3|   48|
|    1|   59|
|    2|   71|
+-----+-----+



In [11]:
num_rec = wine.count()
print(f'Total Number of Records : {num_rec}')

Total Number of Records : 178


In [12]:
wine.groupBy('label').agg(func.count('label').alias('class_count')
                         ,func.round(((func.count('label')/num_rec) * 100),2).alias('class%')
                         ).show()

+-----+-----------+------+
|label|class_count|class%|
+-----+-----------+------+
|    3|         48| 26.97|
|    1|         59| 33.15|
|    2|         71| 39.89|
+-----+-----------+------+



The classes are reasonably balanced (roughly 27%, 33% and 40%), so there does not seem to be a class imbalance problem.

### [2] Data preparation for Spark ML Model

* Spark ML models expect data in the form of a dataframe with 2 columns:
    1. label => the target, as a single numeric column
    2. features => the inputs, packed into one vector column (here a dense vector)
    

* In our case, the dataset is pretty clean and no specific pre-processing is needed.


* We just have to pack the feature columns into a vector and convert the `label` column to a numeric type.
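
The notebook itself continues with a hand-written RDD mapping. As an alternative, the same preparation can be sketched with Spark's built-in `VectorAssembler`. This is only a sketch under assumptions: the helper name `assemble_features` and the constants are illustrative and not part of the original notebook, the column names are the ones assigned by the `toDF(...)` call earlier, and the pyspark imports sit inside the function so that the column bookkeeping can be checked without a running Spark installation.

```python
# Column names as assigned by the toDF(...) call earlier in this notebook.
COLUMNS = [
    'Label', 'Alcohol', 'MalicAcid', 'Ash', 'AshAlkalinity', 'Magnesium',
    'TotalPhenols', 'Flavanoids', 'NonflavanoidPhenols', 'Proanthocyanins',
    'ColorIntensity', 'Hue', 'OD', 'Proline',
]
LABEL_COL = 'Label'
FEATURE_COLS = [c for c in COLUMNS if c != LABEL_COL]


def assemble_features(df):
    """Cast all columns to double and pack the 13 features into one vector column.

    Hypothetical helper: returns a dataframe with 'features' and 'labelIndex'.
    """
    # Imported lazily so the constants above can be used without pyspark installed.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import functions as F

    # The CSV was read as strings, so cast everything to a numeric type first.
    typed = df.select(
        F.col(LABEL_COL).cast('double').alias('labelIndex'),
        *[F.col(c).cast('double').alias(c) for c in FEATURE_COLS],
    )
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol='features')
    return assembler.transform(typed).select('features', 'labelIndex')
```

Calling `assemble_features(wine)` would be expected to produce the same two-column layout as the manual approach, without dropping down to the RDD API.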

In [13]:
from pyspark.ml.linalg import Vectors

In [14]:
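def vectorize(data):
    # The first column is the label; the remaining 13 columns become one dense feature vector.
    # Vectors.dense converts the string values to floats.
    return data.rdd.map(lambda x : [x[0], Vectors.dense(x[1:])]).toDF(["label","features"])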

In [15]:
vectorizedData = vectorize(wine)

In [16]:
vectorizedData.show(5, truncate=False)

+-----+---------------------------------------------------------------------+
|label|features                                                             |
+-----+---------------------------------------------------------------------+
|1    |[14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0]|
|1    |[13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0] |
|1    |[13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0] |
|1    |[14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0] |
|1    |[13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0] |
+-----+---------------------------------------------------------------------+
only showing top 5 rows



In [17]:
vectorizedData = vectorizedData.withColumn('labelIndex', vectorizedData.label.cast('float')).drop('label')
vectorizedData.printSchema()

root
 |-- features: vector (nullable = true)
 |-- labelIndex: float (nullable = true)



In [18]:
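# Split into training (80%) and test (20%) data
# No seed is passed to randomSplit, so the split and the scores below vary from run to run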

(trainData, testData) = vectorizedData.randomSplit([0.8, 0.2])

In [19]:
# Build DT Classifier

from pyspark.ml.classification import DecisionTreeClassifier

In [20]:
dtclf = DecisionTreeClassifier(labelCol='labelIndex', featuresCol='features', maxDepth=3, impurity='gini')
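
`impurity='gini'` asks the tree to pick splits that reduce Gini impurity, which for class proportions p_k is 1 - sum(p_k^2). As a rough illustration (the function name `gini_impurity` is hypothetical, not a Spark API), the sketch below computes the impurity of the whole dataset from the class counts found earlier (59, 71, 48) and of a pure node containing a single class.

```python
def gini_impurity(class_counts):
    """Gini impurity 1 - sum(p_k^2) for a list of per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # an empty node is treated as pure
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


# Class counts of the full wine dataset (labels 1, 2, 3) from the groupBy above.
root_impurity = gini_impurity([59, 71, 48])

# A node that holds samples of only one class has zero impurity.
pure_impurity = gini_impurity([0, 50, 0])

print(f"whole dataset: {root_impurity:.4f}")
print(f"pure node    : {pure_impurity:.4f}")
```

A split is useful when the weighted impurity of the child nodes is lower than that of the parent; `maxDepth=3` caps how many such splits can be stacked.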

In [21]:
model = dtclf.fit(trainData)

In [22]:
predictions = model.transform(testData)

In [23]:
predictions.show(5)

+--------------------+----------+------------------+--------------------+----------+
|            features|labelIndex|     rawPrediction|         probability|prediction|
+--------------------+----------+------------------+--------------------+----------+
|[11.61,1.35,2.7,2...|       2.0|[0.0,0.0,50.0,0.0]|   [0.0,0.0,1.0,0.0]|       2.0|
|[11.84,2.89,2.23,...|       2.0|[0.0,0.0,50.0,0.0]|   [0.0,0.0,1.0,0.0]|       2.0|
|[12.22,1.29,1.94,...|       2.0|[0.0,0.0,50.0,0.0]|   [0.0,0.0,1.0,0.0]|       2.0|
|[12.29,1.41,1.98,...|       2.0|[0.0,0.0,50.0,0.0]|   [0.0,0.0,1.0,0.0]|       2.0|
|[12.36,3.83,2.38,...|       3.0|[0.0,0.0,1.0,37.0]|[0.0,0.0,0.026315...|       3.0|
+--------------------+----------+------------------+--------------------+----------+
only showing top 5 rows
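
In the output above, `rawPrediction` holds the per-class training counts at the leaf a row lands in, and `probability` is that vector scaled to sum to 1. The vectors have four entries because the labels are 1, 2 and 3 and Spark sizes the vector as largest label + 1, so index 0 goes unused. A minimal sketch of the normalization for the fifth row (the function name `raw_to_probability` is illustrative, not a Spark API):

```python
def raw_to_probability(raw_counts):
    """Scale per-class leaf counts so that they sum to 1."""
    total = sum(raw_counts)
    if total == 0:
        return [0.0 for _ in raw_counts]
    return [c / total for c in raw_counts]


# rawPrediction of the fifth row shown above: 1 sample of class 2, 37 of class 3.
raw = [0.0, 0.0, 1.0, 37.0]
prob = raw_to_probability(raw)

# The predicted label is the index with the highest probability.
predicted = float(prob.index(max(prob)))

print(prob)
print(predicted)
```

Mapping the labels to 0, 1, 2 before training (for example with `StringIndexer`) would remove the unused slot at index 0.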



In [24]:
# Evaluate Model

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='labelIndex', metricName='f1')

In [25]:
score = evaluator.evaluate(predictions)

In [26]:
score

0.9731443994601889

In [27]:
# Other Scores

evaluator1 = MulticlassClassificationEvaluator(predictionCol='prediction', labelCol='labelIndex')

In [28]:
# Reuse evaluator1 and override metricName for each call
accuracy = evaluator1.evaluate(predictions, {evaluator1.metricName: 'accuracy'})
precision = evaluator1.evaluate(predictions, {evaluator1.metricName: 'weightedPrecision'})
recall = evaluator1.evaluate(predictions, {evaluator1.metricName: 'weightedRecall'})

print(f'Accuracy : {accuracy}\nPrecision : {precision}\nRecall : {recall}')

Accuracy : 0.9736842105263158
Precision : 0.9750000000000001
Recall : 0.9736842105263158
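
The weighted metrics can be recomputed by hand from a confusion matrix, which helps when reading the scores above. The notebook does not print the confusion matrix, so the one below is an assumption: a 38-row test set with a single error (one class-1 wine predicted as class 2) is consistent with the reported accuracy, precision, recall and F1, but which labels carry the counts 8 and 11 is a guess. The function name `weighted_metrics` is illustrative.

```python
def weighted_metrics(confusion):
    """Accuracy plus support-weighted precision, recall and F1.

    confusion[i][j] = number of rows with true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    precision = recall = f1 = 0.0
    for k in range(n):
        true_k = sum(confusion[k])                       # rows whose true class is k
        pred_k = sum(confusion[r][k] for r in range(n))  # rows predicted as k
        tp = confusion[k][k]
        p_k = tp / pred_k if pred_k else 0.0
        r_k = tp / true_k if true_k else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = true_k / total                          # weight by class support
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


# Assumed confusion matrix (rows: true labels 1, 2, 3; columns: predicted 1, 2, 3).
assumed_confusion = [
    [7, 1, 0],    # 8 class-1 rows, one predicted as class 2
    [0, 19, 0],   # 19 class-2 rows, all correct
    [0, 0, 11],   # 11 class-3 rows, all correct
]
scores = weighted_metrics(assumed_confusion)
for name, value in scores.items():
    print(f"{name:9s}: {value}")
```

Weighted recall always equals accuracy, which is why those two numbers match in the output above.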
