# Decision trees with PySpark

In [2]:
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import VectorAssembler, StringIndexer, IndexToString
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

## Getting the data

We load the CSV file directly into a DataFrame using _spark.read.csv()_. Setting _header=True_ takes the column names from the first line of the file, and _inferSchema=True_ asks Spark to infer the type of each column instead of reading everything as strings.

In [5]:
digits = spark.read.csv("/FileStore/tables/kox6mzsb1490346252745/digits.csv", 
                        header=True, inferSchema=True)
digits.show(5)

## Preparing the data (for ML API)

### Vectorizing the features

As explained earlier, _pyspark.ml_ models expect the features to be assembled together as a single **vector**. For that we have to use the _VectorAssembler_ **transformer** and create a new column.

In [9]:
va = VectorAssembler(inputCols=list('abcdefg'),
                     outputCol='features')
digits = va.transform(digits)
digits.show(5)
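
As a rough plain-Python picture of what the assembler does (an illustration only, not Spark's implementation), the listed input columns of each row are packed, in the order given by _inputCols_, into a single feature list. The row values and the helper name `assemble` below are invented for the example.

```python
# Hypothetical row with the seven input columns used above (values are invented).
toy_row = {'a': 1, 'b': 0, 'c': 1, 'd': 1, 'e': 0, 'f': 1, 'g': 1, 'digit': 'two'}
input_cols = list('abcdefg')

def assemble(row, cols):
    """Collect the listed columns, in order, into one list of floats."""
    return [float(row[c]) for c in cols]

features = assemble(toy_row, input_cols)
print(features)
```

The label column is left out on purpose: only the columns named in `input_cols` end up in the vector.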

### Encoding the labels

As explained earlier, _pyspark.ml_ models cannot deal with non-numeric datatypes, therefore we have to encode such data as numbers. To that end we use the _StringIndexer_ **estimator**; fitting it returns a _StringIndexerModel_, which is the **transformer** that adds the encoded column.

In [12]:
si = StringIndexer(inputCol='digit',
                   outputCol='digit (indexed)')
digits = si.fit(digits).transform(digits)
digits.show(5)
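
By default _StringIndexer_ orders the labels by descending frequency, so the most frequent label is encoded as 0. The sketch below mimics that rule in plain Python; it is an illustration, not Spark's code. The toy data, the function names, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import Counter

def fit_string_indexer(values):
    """Return the labels ordered by descending frequency; position = index."""
    counts = Counter(values)
    # Ties are broken alphabetically here (an assumption of this sketch).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]

def transform_labels(values, ordered_labels):
    """Replace every label with its numeric index."""
    lookup = {label: float(i) for i, label in enumerate(ordered_labels)}
    return [lookup[v] for v in values]

toy_digits = ['seven', 'one', 'seven', 'three', 'seven', 'one']  # invented data
toy_labels = fit_string_indexer(toy_digits)
toy_indexed = transform_labels(toy_digits, toy_labels)
print(toy_labels)
print(toy_indexed)
```

The split into a fitting function and a transforming function mirrors the estimator/transformer pair described above.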

#### Understanding the mapping

Later on we will apply the classification model, which will inevitably produce the encoded (indexed) labels, so a natural question is how to interpret these numbers. Luckily, like any other fitted estimator, the _StringIndexerModel_ holds some metadata about the model itself. This information is also part of the data, and can be accessed with the following syntax (adapted from [this Stack Overflow answer][1]).

[1]: http://stackoverflow.com/a/33903867/3121900

In [15]:
labels = digits.schema.fields[-1].metadata['ml_attr']['vals']
print(labels)

And in a more readable fashion...

In [17]:
print(list(enumerate(labels)))

Even more exciting, we will be able to use the _IndexToString_ transformer to reverse the mapping and turn indexed values back into the original labels.
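
Conceptually, going back from indices to labels is just a lookup in the ordered label list. The plain-Python sketch below shows the idea; the label list, the predictions, and the helper `index_to_string` are invented for illustration and are not the Spark transformer itself.

```python
def index_to_string(indices, ordered_labels):
    """Map each numeric index back to the label stored at that position."""
    return [ordered_labels[int(i)] for i in indices]

toy_labels = ['seven', 'one', 'three']   # invented; position = encoded index
toy_predictions = [0.0, 2.0, 1.0, 0.0]   # invented model output
decoded = index_to_string(toy_predictions, toy_labels)
print(decoded)
```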

### Split the data

In [20]:
train, test = digits.randomSplit([0.7, 0.3],
                                 seed=1234)
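
Note that _randomSplit()_ normalizes the weights and assigns each row at random, so the resulting sizes are only approximately 70% and 30%; the seed makes the split reproducible. A minimal plain-Python sketch of this idea, assuming one uniform random draw per row (the function `random_split` is hypothetical and much simpler than Spark's sampler):

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split according to normalized weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if x < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(range(1000), [0.7, 0.3], seed=1234)
print(len(train_rows))
print(len(test_rows))
```

Running the function twice with the same seed gives the same split, which is why the notebook fixes `seed=1234`.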

## Creating the classification model

First we instantiate the [_DecisionTreeClassifier_][1] **estimator**.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier "DecisionTreeClassifier API"

In [23]:
dt = DecisionTreeClassifier(featuresCol='features',
                            labelCol='digit (indexed)',
                            maxDepth=5)
print(type(dt))

Next, we fit the estimator to get the **model**, which is a **transformer**.

In [25]:
dt_model = dt.fit(train)
print(type(dt_model))

### Visualize the model

_pyspark.ml_ does not support visualizations at the moment, but we can get a textual representation of the tree from the _toDebugString_ attribute.

In [28]:
print(dt_model.toDebugString[:500])  # partial printout

We can see the feature importances using the attribute [featureImportances][1], which returns a vector.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances "featureImportances API"

In [30]:
importances = dt_model.featureImportances
list(zip('abcdefg', importances.toArray()))  # pair each feature name with its importance
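
To give a feeling for where these numbers come from: for a single tree, the importance of a feature is the impurity gain of every split on that feature, weighted by the number of rows reaching the split, and the results are normalized to sum to 1. The sketch below reproduces that bookkeeping in plain Python on invented nodes; the node list and the function name are assumptions for illustration only.

```python
# Each tuple is an invented internal node:
# (feature used for the split, impurity gain, number of rows reaching the node)
toy_nodes = [('c', 0.40, 100), ('a', 0.25, 60), ('c', 0.10, 40), ('f', 0.30, 20)]

def toy_feature_importances(nodes, feature_names):
    """Sum gain * row count per feature, then normalize to sum to 1."""
    totals = {name: 0.0 for name in feature_names}
    for name, gain, count in nodes:
        totals[name] += gain * count
    norm = sum(totals.values())
    return {name: totals[name] / norm for name in feature_names}

toy_importances = toy_feature_importances(toy_nodes, list('abcdefg'))
for name in 'abcdefg':
    print(name, toy_importances[name])
```

Features that never appear in a split, such as `b` here, get an importance of zero.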

## Evaluating the model

### Applying the model

We can now apply the model to predict the digit for the data used to train the model. The model adds three new columns:

* _rawPrediction_ - for a decision tree, the number of training rows of each class in the leaf that the row falls into.
* _probability_ - the normalized _rawPrediction_, reflecting the probability of each class.
* _prediction_ - the (indexed) label with the highest probability.
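
The relation between the three columns can be sketched in a few lines of plain Python. The counts below are invented, and the helper names are hypothetical; the point is only to show the normalize-then-argmax logic.

```python
toy_raw_prediction = [0.0, 2.0, 38.0, 0.0]  # invented per-class counts in one leaf

def to_probability(raw):
    """Normalize the counts so that they sum to 1."""
    total = float(sum(raw))
    return [count / total for count in raw]

def to_prediction(probability):
    """Return the index of the most probable class, as a float."""
    return float(probability.index(max(probability)))

toy_probability = to_probability(toy_raw_prediction)
toy_prediction = to_prediction(toy_probability)
print(toy_probability)
print(toy_prediction)
```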

In [34]:
train = dt_model.transform(train)
train.show(5)

In [35]:
train.select('rawPrediction').take(10)

### Scoring the model

We will use the [MulticlassClassificationEvaluator][1] object to evaluate the model. This is a new type of object called an **evaluator**.

[1]: https://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.evaluation.MulticlassClassificationEvaluator

In [38]:
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction",
                                              labelCol="digit (indexed)",
                                              metricName="accuracy")
print(type(evaluator))

In [39]:
print(evaluator.evaluate(train))
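
The accuracy metric is simply the fraction of rows whose prediction equals the label. A minimal plain-Python sketch on invented (prediction, label) pairs, with a hypothetical helper name:

```python
def accuracy(pairs):
    """Fraction of (prediction, label) pairs that agree."""
    pairs = list(pairs)
    correct = sum(1 for predicted, actual in pairs if predicted == actual)
    return correct / float(len(pairs))

toy_pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0)]  # invented data
toy_accuracy = accuracy(toy_pairs)
print(toy_accuracy)
```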

#### Note about MLlib

As we said before, _pyspark.ml_ is constantly growing, but it still lacks many features that have already been developed and implemented in [_pyspark.mllib_][1]. One of these is the confusion matrix, and to illustrate the differences we will now use the [_MulticlassMetrics_][2] class from the _pyspark.mllib.evaluation_ module.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.mllib.html "pyspark.mllib API"
[2]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MulticlassMetrics "MulticlassMetrics API"

In [42]:
from pyspark.mllib.evaluation import MulticlassMetrics

We start by noting that the _pyspark.mllib_ module works with **RDDs**, which are available through the _rdd_ attribute of the _DataFrame_ class. From the documentation of the _MulticlassMetrics_ class we learn that it requires an RDD containing only the predictions and the labels, so we select the relevant columns of the DataFrame using the method [_select()_][1] and create the proper RDD.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select "DataFrame.select() API"

In [44]:
predictionAndLabels = train\
    .select(['prediction', 'digit (indexed)'])\
    .rdd
predictionAndLabels.take(5)

In [45]:
metrics = MulticlassMetrics(predictionAndLabels)
print(type(metrics))

In [46]:
print(metrics.confusionMatrix())
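
In the matrix returned by _confusionMatrix()_ the rows correspond to the actual classes and the columns to the predicted classes, both ordered by ascending class label. The following plain-Python sketch builds a matrix with the same layout from invented (prediction, label) pairs; the function is a hypothetical illustration, not the MLlib implementation.

```python
def confusion_matrix(pairs):
    """Count pairs into a matrix with actual classes as rows, predicted as columns."""
    pairs = list(pairs)
    classes = sorted(set(p for p, _ in pairs) | set(a for _, a in pairs))
    position = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for predicted, actual in pairs:
        matrix[position[actual]][position[predicted]] += 1
    return matrix

toy_pairs = [(0.0, 0.0), (1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (0.0, 2.0)]  # invented
toy_matrix = confusion_matrix(toy_pairs)
for row in toy_matrix:
    print(row)
```

The diagonal holds the correctly classified rows, so its sum divided by the number of rows is the accuracy.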

## Playing with parameters

### Changing the parameters

> **NOTE:** As part of the unified API pyspark is trying to expose, there is a class called [_Param_][1], which is used by all the models in _pyspark.ml_. In this session, however, we will use the parameters in their simplest form: keyword arguments of the estimator.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#module-pyspark.ml.param "pyspark.ml.param API"

Each **estimator** contains some documentation about its parameters, available by the methods _explainParam()_ and _explainParams()_.

In [51]:
print(dt.explainParams()[:564])  # partial printout

In [52]:
print(dt.explainParam('maxDepth'))

When we instantiated the estimator we used the default values, but of course we can tweak the model by changing its parameters. The method [_setParams()_][1], which is available for all _pyspark.ml_ estimators, makes it easy to modify the model parameters. We note that it is possible and advisable to use the [\*\*kwargs notation][2] when calling this method.

[1]: http://spark.apache.org/docs/2.0.2/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier.setParams "setParams() API"
[2]: http://book.pythontips.com/en/latest/args_and_kwargs.html "kwargs explanation"

In [54]:
params = {'maxDepth':10}
dt.setParams(**params)

We can see that the parameters were indeed changed by looking at the explanation again.

In [56]:
print(dt.explainParam('maxDepth'))
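
For readers unfamiliar with the \*\*kwargs notation, the sketch below shows the mechanism on a toy class. `ToyEstimator` is a hypothetical stand-in written for this example, not a pyspark class; it only demonstrates that unpacking a dictionary with `**` is equivalent to passing keyword arguments.

```python
class ToyEstimator(object):
    """Hypothetical stand-in for an estimator; not part of pyspark."""

    def __init__(self, maxDepth=5, maxBins=32):
        self.maxDepth = maxDepth
        self.maxBins = maxBins

    def setParams(self, **kwargs):
        # Every keyword argument arrives as one entry of the kwargs dict.
        for name, value in kwargs.items():
            if not hasattr(self, name):
                raise AttributeError('unknown parameter: %s' % name)
            setattr(self, name, value)
        return self

toy = ToyEstimator()
toy_params = {'maxDepth': 10}
toy.setParams(**toy_params)  # same as toy.setParams(maxDepth=10)
print(toy.maxDepth)
print(toy.maxBins)
```

Keeping the parameters in a dictionary makes it easy to store, print, or loop over several configurations before passing them to the estimator.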

### Re-fitting the model

We repeat the steps of the process described earlier. We note that it is necessary to overwrite _train_ and _test_, otherwise we will get an error in the fit step, because a column with the name _prediction_ already exists.

In [59]:
train, test = digits.randomSplit([0.7, 0.3], seed=1234)

In [60]:
dt_model = dt.fit(train)  # keep the re-fitted model so it can be applied to the test data later
train = dt_model.transform(train)
train.show(5)

In [61]:
print(evaluator.evaluate(train))

## Validating the model

We now apply the model to the test data and see how it performs.

In [64]:
test = dt_model.transform(test)
test.show(5)

In [65]:
print(evaluator.evaluate(test))

In [66]:
predictionAndLabels = test\
    .select(['prediction', 'digit (indexed)'])\
    .rdd
print(MulticlassMetrics(predictionAndLabels).confusionMatrix())

> **Your turn 1:**

> * Part I - Create a classification model for predicting the sex of the people listed in the weight.txt file based on their heights and weights.
> * Part II - Repeat the previous task, but now use also the age as a feature. Did it improve the model?

> **Your turn 2:**

> The file dessert.csv contains some information about groups who arrived at a restaurant, and the field “dessert” states whether they purchased a dessert or not. Use the file to develop a classification model for predicting which groups will order a dessert. Do not forget to split your data and validate your models.

> * Part I - Develop the classification model, where “num.of.guests” is an integer.
> * Part II - Improve your model by considering “num.of.guests” as categorical.