<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 
# Spark MLlib Lab

*Authors: Christoph Rahmede (LDN)*

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Create the Spark context

In [2]:
import pyspark as ps
from pyspark.sql import SQLContext

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler

In [3]:
sc = ps.SparkContext('local[4]')
sqlContext = SQLContext(sc)
spark = ps.sql.SparkSession(sc)

## Label encoding categorical features

Often we have categorical features with values given as strings which we would like to transform to numerical values. The analogue of sklearn's `LabelEncoder` is the `StringIndexer`.

In [4]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

In [5]:
ex_1 = sqlContext.createDataFrame([
    (4, "high"),
    (5, "low"),
    (6, "high"),
    (7, "high"),
    (8,'medium')
], ["id", "label"])

In [6]:
string_indexer = StringIndexer(
        inputCols=['label'],
        outputCols=['label' + "_index"]
    )

In [7]:
ex_2 = string_indexer.fit(ex_1).transform(ex_1)
ex_2.show()

+---+------+-----------+
| id| label|label_index|
+---+------+-----------+
|  4|  high|        0.0|
|  5|   low|        1.0|
|  6|  high|        0.0|
|  7|  high|        0.0|
|  8|medium|        2.0|
+---+------+-----------+
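
The index values in the table above can be reproduced by hand. The following is a minimal pure-Python sketch, not Spark's implementation. It assumes the default ordering (`stringOrderType="frequencyDesc"`, most frequent label first) and that ties in frequency are broken alphabetically, as documented for Spark 3.0+. `index_labels` is a hypothetical helper name used only for illustration.

```python
from collections import Counter

def index_labels(labels):
    """Map each label to a float index, most frequent label first.

    Illustrative helper (not part of PySpark): labels with equal
    frequency are ordered alphabetically.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

labels = ["high", "low", "high", "high", "medium"]
mapping = index_labels(labels)
print(mapping)                           # {'high': 0.0, 'low': 1.0, 'medium': 2.0}
print([mapping[lab] for lab in labels])  # [0.0, 1.0, 0.0, 0.0, 2.0]
```

Because the indices depend on label frequencies in the data used for fitting, the same label can receive a different index on a different dataset.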



In [8]:
onehot = OneHotEncoder(
        dropLast=True,
        inputCols=['label_index'],
        outputCols=['label' + "_index_1"]
    )

In [9]:
onehot.fit(ex_2).transform(ex_2).show()

+---+------+-----------+-------------+
| id| label|label_index|label_index_1|
+---+------+-----------+-------------+
|  4|  high|        0.0|(2,[0],[1.0])|
|  5|   low|        1.0|(2,[1],[1.0])|
|  6|  high|        0.0|(2,[0],[1.0])|
|  7|  high|        0.0|(2,[0],[1.0])|
|  8|medium|        2.0|    (2,[],[])|
+---+------+-----------+-------------+



The one-hot-encoded values are given as a sparse vector for each observation. The first number is the length of the vector, the first bracket lists the positions of the non-zero entries, and the second bracket lists the values stored at those positions. As the last row shows, the redundant final category is dropped by default (`dropLast=True`), so it is encoded as an all-zero vector. You can also apply both `StringIndexer` and `OneHotEncoder` to multiple columns at once.
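
To make the `(size,[indices],[values])` notation in the output above concrete, here is a small pure-Python sketch that expands such a triple into an ordinary dense list. `sparse_to_dense` is a hypothetical helper written for this illustration. In PySpark itself, `SparseVector.toArray()` serves the same purpose.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense

print(sparse_to_dense(2, [0], [1.0]))  # high   -> [1.0, 0.0]
print(sparse_to_dense(2, [1], [1.0]))  # low    -> [0.0, 1.0]
print(sparse_to_dense(2, [], []))      # medium -> [0.0, 0.0] (dropped last category)
```

With three categories and the last one dropped, two positions are enough: the all-zero vector identifies the dropped category.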

## Read in the car evaluation dataset 

Use `acceptability` as the target.

In [10]:
spark_df = spark.read.csv(
    path="data/car.csv",
    header=True,
    # Malformed CSV rows are dropped instead of failing the whole read
    mode="DROPMALFORMED",
    # Schema inference is not always perfect but works well in most cases in Spark 2.1+
    inferSchema=True
)

In [11]:
spark_df.first()

Row(buying='vhigh', maint='vhigh', doors='2', persons='2', lug_boot='small', safety='low', acceptability='unacc')

In [12]:
spark_df.dtypes

[('buying', 'string'),
 ('maint', 'string'),
 ('doors', 'string'),
 ('persons', 'string'),
 ('lug_boot', 'string'),
 ('safety', 'string'),
 ('acceptability', 'string')]

In [13]:
[name for name, dtype in spark_df.dtypes if dtype == 'string']

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability']

In [14]:
spark_df

DataFrame[buying: string, maint: string, doors: string, persons: string, lug_boot: string, safety: string, acceptability: string]

## Dummify the categorical variables

First use the `StringIndexer`, then the `OneHotEncoder`, to create the dummified variables. Be careful not to use one-hot encoding on the target variable (`acceptability`).

In [15]:
[col+'_index' for col in spark_df.columns]

['buying_index',
 'maint_index',
 'doors_index',
 'persons_index',
 'lug_boot_index',
 'safety_index',
 'acceptability_index']

In [16]:
string_indexer = StringIndexer(
        inputCols = spark_df.columns,
        outputCols = [col+'_index' for col in spark_df.columns]
    )

spark_df = string_indexer.fit(spark_df).transform(spark_df)

In [17]:
spark_df

DataFrame[buying: string, maint: string, doors: string, persons: string, lug_boot: string, safety: string, acceptability: string, lug_boot_index: double, persons_index: double, maint_index: double, safety_index: double, buying_index: double, acceptability_index: double, doors_index: double]

In [18]:
label_column = 'acceptability_index'
feature_columns = [col for col in spark_df.columns if col != label_column and col.endswith('_index')]

In [19]:
onehot = OneHotEncoder(
        dropLast=True,
        inputCols=feature_columns,
        outputCols=[col + "_1" for col in feature_columns]
    )
spark_df = onehot.fit(spark_df).transform(spark_df)

In [20]:
spark_df.show(2)

+------+-----+-----+-------+--------+------+-------------+--------------+-------------+-----------+------------+------------+-------------------+-----------+--------------+-------------+----------------+---------------+--------------+-------------+
|buying|maint|doors|persons|lug_boot|safety|acceptability|lug_boot_index|persons_index|maint_index|safety_index|buying_index|acceptability_index|doors_index|safety_index_1|maint_index_1|lug_boot_index_1|persons_index_1|buying_index_1|doors_index_1|
+------+-----+-----+-------+--------+------+-------------+--------------+-------------+-----------+------------+------------+-------------------+-----------+--------------+-------------+----------------+---------------+--------------+-------------+
| vhigh|vhigh|    2|      2|   small|   low|        unacc|           2.0|          0.0|        3.0|         1.0|         3.0|                0.0|        0.0| (2,[1],[1.0])|    (3,[],[])|       (2,[],[])|  (2,[0],[1.0])|     (3,[],[])|(3,[0],[1.0])|
| vh

## Prepare your feature columns with `VectorAssembler`

In [21]:
from pyspark.ml.feature import VectorAssembler

In [22]:
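# Select the one-hot-encoded columns by their suffix rather than by any '1' in the name
feature_columns = [col for col in spark_df.columns if col.endswith('_index_1')]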

In [23]:
feature_columns

['safety_index_1',
 'maint_index_1',
 'lug_boot_index_1',
 'persons_index_1',
 'buying_index_1',
 'doors_index_1']

In [24]:
vectorAssembler = VectorAssembler(inputCols=feature_columns,
                                  outputCol="features")

vector_df = vectorAssembler.transform(spark_df)

vector_df.first()

Row(buying='vhigh', maint='vhigh', doors='2', persons='2', lug_boot='small', safety='low', acceptability='unacc', lug_boot_index=2.0, persons_index=0.0, maint_index=3.0, safety_index=1.0, buying_index=3.0, acceptability_index=0.0, doors_index=0.0, safety_index_1=SparseVector(2, {1: 1.0}), maint_index_1=SparseVector(3, {}), lug_boot_index_1=SparseVector(2, {}), persons_index_1=SparseVector(2, {0: 1.0}), buying_index_1=SparseVector(3, {}), doors_index_1=SparseVector(3, {0: 1.0}), features=SparseVector(15, {1: 1.0, 7: 1.0, 12: 1.0}))

In [25]:
vector_df.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(15,[1,7,12],[1.0...|
|(15,[7,12],[1.0,1...|
|(15,[0,7,12],[1.0...|
|(15,[1,6,7,12],[1...|
|(15,[6,7,12],[1.0...|
+--------------------+
only showing top 5 rows
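
The `features` vector of the first row can be traced back to its inputs: the assembler concatenates the per-column vectors in the order given by `inputCols`, shifting each column's indices by the total length of the columns before it. Below is a minimal pure-Python sketch of that bookkeeping, not Spark's implementation. `assemble` is a hypothetical helper, and the input values are copied from the `vector_df.first()` output above.

```python
def assemble(vectors):
    """Concatenate sparse vectors given as (size, {index: value}) pairs."""
    offset = 0
    merged = {}
    for size, entries in vectors:
        for index, value in entries.items():
            merged[offset + index] = value  # shift by the length seen so far
        offset += size
    return offset, merged

first_row = [
    (2, {1: 1.0}),  # safety_index_1
    (3, {}),        # maint_index_1
    (2, {}),        # lug_boot_index_1
    (2, {0: 1.0}),  # persons_index_1
    (3, {}),        # buying_index_1
    (3, {0: 1.0}),  # doors_index_1
]
print(assemble(first_row))  # (15, {1: 1.0, 7: 1.0, 12: 1.0})
```

The result matches `SparseVector(15, {1: 1.0, 7: 1.0, 12: 1.0})`: positions 0 to 1 belong to `safety`, 7 to 8 to `persons`, and 12 to 14 to `doors`.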



## Fit and evaluate a Spark decision tree model and tune with grid search

Once done, try other models as well.

In [26]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

In [27]:
model = DecisionTreeClassifier(featuresCol='features',
                           labelCol=label_column)

In [28]:
(data_train, data_test) = vector_df.randomSplit([0.7, 0.3], seed=1)

evaluator = MulticlassClassificationEvaluator(
                    predictionCol='prediction',
                    labelCol=label_column,
                    metricName='accuracy'
                         )

paramGrid = ParamGridBuilder() \
    .addGrid(model.maxDepth, range(3, 11)) \
    .build()

# the actual gridsearch
crossval = CrossValidator(estimator=model,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)  

# Run cross-validation, and choose the best set of parameters.
model_fit = crossval.fit(data_train)

print('Average cv scores:')
print(np.around(np.array(model_fit.avgMetrics), 4))

best_model = model_fit.bestModel

print('Best model parameters:')
# Read the tuned values from the fitted model (model params are exposed in Spark 3.0+)
print({param.name: best_model.getOrDefault(param.name)
    for param in paramGrid[0]})
print()
#print(best_model.explainParams())

predictions = model_fit.transform(data_test)

print('Best model test accuracy:')
print(evaluator.evaluate(predictions))

Average cv scores:
[0.7817 0.7888 0.835  0.8659 0.8823 0.9131 0.9155 0.9438]
Best model parameters:
{'maxDepth': 10}

Best model test accuracy:
0.958
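
The link between the average cross-validation scores and the selected `maxDepth` can be checked by hand. The sketch below assumes that the scores are listed in the same order as the parameter grid, that is, `maxDepth` from 3 to 10, and that a larger accuracy is better. The numbers are copied from the output above, and the variable names are illustrative only.

```python
# Grid values in the order they were passed to addGrid
depths = list(range(3, 11))
# Average cross-validation accuracies printed above
avg_scores = [0.7817, 0.7888, 0.835, 0.8659, 0.8823, 0.9131, 0.9155, 0.9438]

# Position of the highest average score
best_index = max(range(len(avg_scores)), key=avg_scores.__getitem__)
best_depth = depths[best_index]
print(best_depth, avg_scores[best_index])  # 10 0.9438
```

The best value sits at the upper edge of the grid, so it may be worth extending the range of `maxDepth` to see whether the score keeps improving.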
