<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px"> 
# Spark MLlib Lab

*Authors: Christoph Rahmede (LDN)*

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

## Create the Spark context

In [2]:
import pyspark as ps
from pyspark.sql import SQLContext

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StandardScaler

In [3]:
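# Start a local Spark context with 4 worker threads
sc = ps.SparkContext('local[4]')
# SQLContext is the older entry point; SparkSession is the newer one
sqlContext = SQLContext(sc)
spark = ps.sql.SparkSession(sc)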

## Label encoding categorical features

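Often we have categorical features whose values are given as strings, and we would like to transform them to numerical values. The analogue of sklearn's `LabelEncoder` is the `StringIndexer`. By default it orders the labels by descending frequency, so the most frequent label gets index `0.0`.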

In [4]:
from pyspark.ml.feature import StringIndexer, OneHotEncoder

In [5]:
ex_1 = sqlContext.createDataFrame([
    (4, "high"),
    (5, "low"),
    (6, "high"),
    (7, "high"),
    (8, "medium")
], ["id", "label"])

In [6]:
ex_1.show()

+---+------+
| id| label|
+---+------+
|  4|  high|
|  5|   low|
|  6|  high|
|  7|  high|
|  8|medium|
+---+------+



In [7]:
string_indexer = StringIndexer(
        inputCol='label',
        outputCol='label' + "_index"
    )

In [8]:
ex_2 = string_indexer.fit(ex_1).transform(ex_1)
ex_2.show()

+---+------+-----------+
| id| label|label_index|
+---+------+-----------+
|  4|  high|        0.0|
|  5|   low|        1.0|
|  6|  high|        0.0|
|  7|  high|        0.0|
|  8|medium|        2.0|
+---+------+-----------+



In [9]:
from pyspark.ml.feature import OneHotEncoder

onehot = OneHotEncoder(
        dropLast=True,
        inputCol='label_index',
        outputCol='label' + "_index_1"
    )

onehot.fit(ex_2).transform(ex_2).show()

+---+------+-----------+-------------+
| id| label|label_index|label_index_1|
+---+------+-----------+-------------+
|  4|  high|        0.0|(2,[0],[1.0])|
|  5|   low|        1.0|(2,[1],[1.0])|
|  6|  high|        0.0|(2,[0],[1.0])|
|  7|  high|        0.0|(2,[0],[1.0])|
|  8|medium|        2.0|    (2,[],[])|
+---+------+-----------+-------------+



The one-hot-encoded values are given as a sparse vector for each observation, printed as `(size, [indices], [values])`: the first number is the length of the vector, the first bracket lists the positions of the non-zero entries, and the second bracket lists the values at those positions. As you can see from the last row, the last category (`medium`) is encoded as the all-zero vector, because dropping one redundant category (`dropLast=True`) is the default here. In Spark 3.0 and later you can also apply both `StringIndexer` and `OneHotEncoder` to multiple columns at once through `inputCols` and `outputCols`.
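To make the sparse notation concrete, here is a minimal pure-Python sketch that expands the `(size, [indices], [values])` triples from the table above into ordinary lists. The helper name `sparse_to_dense` is made up for this illustration and is not part of Spark.

```python
def sparse_to_dense(size, indices, values):
    """Expand Spark's (size, [indices], [values]) notation into a plain list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


# Triples copied from the one-hot output table above
high = sparse_to_dense(2, [0], [1.0])    # label_index 0.0
low = sparse_to_dense(2, [1], [1.0])     # label_index 1.0
medium = sparse_to_dense(2, [], [])      # label_index 2.0, the dropped last category

print(high, low, medium)
```

In Spark itself, `pyspark.ml.linalg.SparseVector(size, indices, values).toArray()` performs the same expansion and returns a NumPy array.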

## Read in the car evaluation dataset 

Use `acceptability` as target.

In [10]:
spark_df = spark.read.csv(
    path="data/car.csv",
    header=True,
    # Poorly formed rows in CSV are dropped rather than erroring entire operation
    mode="DROPMALFORMED",
    # Not always perfect but works well in most cases as of 2.1+
    inferSchema=True
)

In [11]:
spark_df.first()

Row(buying='vhigh', maint='vhigh', doors='2', persons='2', lug_boot='small', safety='low', acceptability='unacc')

In [12]:
spark_df.dtypes

[('buying', 'string'),
 ('maint', 'string'),
 ('doors', 'string'),
 ('persons', 'string'),
 ('lug_boot', 'string'),
 ('safety', 'string'),
 ('acceptability', 'string')]

In [13]:
spark_df.select('buying').dtypes

[('buying', 'string')]

In [14]:
[name for name, dtype in spark_df.dtypes if dtype == 'string']

['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'acceptability']

In [15]:
spark_df

DataFrame[buying: string, maint: string, doors: string, persons: string, lug_boot: string, safety: string, acceptability: string]

## Dummify the categorical variables

First use the `StringIndexer`, then the `OneHotEncoder` (named `OneHotEncoderEstimator` in Spark 2.3 and 2.4) to create the dummified variables. Be careful not to use one-hot encoding on the target variable (`acceptability`); the target only needs to be indexed.

## Prepare your feature columns with `VectorAssembler`

In [16]:
from pyspark.ml.feature import VectorAssembler

## Fit and evaluate a Spark decision tree model and tune with grid search

Once done, try other models as well.

In [17]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator