<img style="float: right" src="images/surfsara.png">
<br/>
<hr style="clear: both" />

# Machine Learning - random forests in Spark
In the previous notebook you have seen how to explore a credit data set, preprocess it, train a decision tree, and evaluate the model's performance.

In this notebook, you will work on a similar problem. However, instead of a credit data set, we will work on the well-known [Covertype data set](https://archive.ics.uci.edu/ml/datasets/covertype). This data set contains cartographic variables, such as elevation and slope, and you will need to predict the _type_ of forest cover. There are seven possible forest cover types, so we are dealing with a multiclass classification problem.

Whereas we used decision trees in our last notebook, we will use a more powerful model this time, called a random forest. As the name suggests, the random forest will build multiple trees, each based on a random subset of our data, and with a random subset of our features considered at each node in the tree. When we want to predict a new instance, we simply combine the predictions of the individual trees to arrive at the final classification.
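
To make the combining step concrete, here is a minimal sketch in plain Python, assuming the trees are combined by a simple majority vote. The function name `majority_vote` and the five votes are hypothetical, and Spark's own implementation may aggregate the trees differently (for example by averaging class probabilities).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class label predicted most often by the individual trees."""
    counts = Counter(tree_predictions)
    # most_common(1) returns a list holding the single most frequent (label, count) pair
    label, _ = counts.most_common(1)[0]
    return label

# Hypothetical predictions of five trees for one instance
votes = [2, 1, 2, 2, 7]
prediction = majority_vote(votes)
print(prediction)  # 2
```

Three of the five trees vote for class 2, so the forest as a whole predicts class 2, even though two trees disagree.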

We will generally follow the structure of the last notebook, but will depart from it in certain cases.

**A general hint: it may help to keep the previous notebook open for reference.**

As with all our notebooks, we will start by setting up a SparkSession:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .getOrCreate()

We read in the Covertype data set. This data set has been preprocessed a bit for your convenience, and can be loaded from Parquet:

In [None]:
covtype_df = spark.read.load("../data/covtype.parquet")

In [None]:
covtype_df.printSchema()

## Data exploration

## Assignment 1
Count the number of rows in the data set.

In [None]:
<FILL IN>

## Assignment 2
Inspect the first few rows of the data set to verify the data makes sense to you. Please note that the data set has been preprocessed: the original one-hot encoded binary columns for the soil type and the wilderness type have each been converted into a single numeric column.

In [None]:
<FILL IN>

## Assignment 3
We would like to get an idea of the number of instances per class. If one class is vastly over-represented, for example because it makes up 90% of the instances, a classifier may simply decide to always predict this class. This will give a high accuracy (about 90%), but the model will not generalise very well. In that case, we may need to resample our data set or apply other techniques to compensate.

Calculate the number of instances per class, ordered from most-occurring to least-occurring.

In [None]:
<FILL IN>

You will see that some classes are over-represented, and others under-represented. Although we will ignore this observation for now, we must be aware of the impact this will have on our models and their performance.
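
The effect of class imbalance on accuracy can be sketched in plain Python. The example below computes the accuracy of a classifier that always predicts the most frequent class. The function name and the 90/10 label distribution are hypothetical and are not taken from the Covertype data.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical, heavily imbalanced labels: 90 instances of class 1, 10 of class 2
labels = [1] * 90 + [2] * 10
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.9
```

Any model we train should be compared against such a baseline, since a high accuracy on its own says little when the classes are imbalanced.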

## Data preprocessing

Now that we have some idea of what our data looks like, we will need to assemble the relevant features into feature vectors. As in the previous notebook, we will use the [`VectorAssembler`](https://spark.apache.org/docs/2.1.1/ml-features.html#vectorassembler) for this.

## Assignment 4
Make a list of columns you would like to include in the input feature vectors.

In [None]:
feature_column_names = [
    <FILL IN>
]

## Assignment 5
Use the [`VectorAssembler`](https://spark.apache.org/docs/2.1.1/ml-features.html#vectorassembler) to create a new data set with the feature vectors added. Remember: the `VectorAssembler` needs to know the input columns and the output column.

**Hint**: it may help to keep the previous notebook open for reference.

In [None]:
from pyspark.ml.feature import VectorAssembler  

assembler = VectorAssembler(<FILL IN>)
covtype_features_df = assembler.transform(<FILL IN>)
covtype_features_df.head()

## Assignment 6
Now that we have feature vectors, we need to deal with the two categorical features contained in the data set, `wilderness` and `soil`. Use the [`VectorIndexer`](https://spark.apache.org/docs/2.1.1/ml-features.html#vectorindexer) to convert these features into categorical features. As with all transformers, the `VectorIndexer` requires input and output column names. In addition, it needs to know the maximum number of categories, called `maxCategories`.

**Hint**: be careful in specifying the maximum number of categories. You can calculate the number of distinct values of a column by first selecting it, and then using the [`distinct`](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.distinct) method.

In [None]:
from pyspark.ml.feature import VectorIndexer

indexer = VectorIndexer(<FILL IN>)
model = indexer.fit(<FILL IN>)
covtype_features_idx_df = model.transform(<FILL IN>)
covtype_features_idx_df.head()
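
The role of `maxCategories` can be illustrated without Spark. The sketch below mimics the documented rule that a feature with at most `maxCategories` distinct values is treated as categorical. The function name and the feature values are hypothetical, and this is not Spark's actual implementation.

```python
def categorical_feature_indices(rows, max_categories):
    """Return the indices of features with at most max_categories distinct values."""
    num_features = len(rows[0])
    categorical = []
    for i in range(num_features):
        distinct_values = {row[i] for row in rows}
        if len(distinct_values) <= max_categories:
            categorical.append(i)
    return categorical

# Hypothetical feature vectors: (continuous feature, category with 4 values, category with 3 values)
rows = [
    (2596.0, 0.0, 28.0),
    (2590.0, 1.0, 28.0),
    (2804.0, 0.0, 11.0),
    (2785.0, 2.0, 29.0),
    (2595.0, 3.0, 11.0),
]

print(categorical_feature_indices(rows, max_categories=3))  # [2]
print(categorical_feature_indices(rows, max_categories=4))  # [1, 2]
print(categorical_feature_indices(rows, max_categories=5))  # [0, 1, 2]
```

A value that is too low misses a genuinely categorical feature, while a value that is too high causes a continuous feature to be treated as categorical. This is why the hint above asks you to count the distinct values first.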

## Assignment 7
This assignment doesn't require any programming. The following cell will show the metadata for the column that was added by the `VectorIndexer`. Verify that the correct numeric and categorical features are present. How can you tell?

In [None]:
covtype_features_idx_df.schema[-1].metadata

## Model training
Having successfully preprocessed our data set, we will need to split our data into a training and test set. The following cell will split it 80%-20%, for training and test, respectively:

In [None]:
# Fix the seed so that the split is reproducible
train_data, test_data = covtype_features_idx_df.randomSplit([0.8, 0.2], seed=0)
train_data.count(), test_data.count()
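
The two counts will be close to, but usually not exactly, 80% and 20% of the rows. A rough plain-Python sketch of why: assume each row is assigned to a split independently, based on a seeded random draw. The function `random_split` is a hypothetical illustration and not Spark's actual implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by comparing a random draw with cumulative weights."""
    total = sum(weights)
    boundaries = []
    cumulative = 0.0
    for weight in weights:
        cumulative += weight / total  # normalise so the weights sum to 1
        boundaries.append(cumulative)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        index = len(weights) - 1  # fall back to the last split to guard against rounding
        for i, bound in enumerate(boundaries):
            if draw < bound:
                index = i
                break
        splits[index].append(row)
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.8, 0.2], seed=0)
print(len(train), len(test))
```

Because the seed is fixed, running the split again gives the same result, which makes experiments comparable.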

## Assignment 8
We are now ready to fit a random forest classifier on our training data. Please fill out the relevant information to train our model.

**Hint**: because of the data size, please keep the number of trees relatively low (i.e. fewer than 100).

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol=<FILL IN>,
    labelCol=<FILL IN>,
    numTrees=<FILL IN>,
    maxDepth=<FILL IN>,
    seed=<FILL IN>
)
model = rf.fit(train_data)
model

## Assignment 9
With the trained model, transform the test data to obtain the predictions. Have a look at the first prediction for verification (second line).

In [None]:
predictions = <FILL IN>
<FILL IN>

## Model evaluation
Having obtained the model's predictions, we can evaluate the model's performance. Instead of the `BinaryClassificationEvaluator`, we will use the `MulticlassClassificationEvaluator` and calculate the model's accuracy. A random classifier that guesses each class in proportion to how often it occurs in the data will have an accuracy of roughly 37%. How does your model compare?

In [None]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol='Cover_Type', metricName='accuracy')
evaluator.evaluate(predictions) 
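
For intuition, accuracy and a random baseline can be computed by hand. The sketch below assumes that "random" means guessing each class with a probability equal to its share of the data, in which case the expected accuracy is the sum of the squared class proportions. The function names and the counts are hypothetical, not the Covertype figures.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def proportional_guess_accuracy(class_counts):
    """Expected accuracy when guessing each class with probability equal to its share."""
    total = sum(class_counts.values())
    return sum((n / total) ** 2 for n in class_counts.values())

# Three of the four hypothetical predictions are correct
print(accuracy([1, 2, 2, 3], [1, 2, 1, 3]))  # 0.75

# Two equally frequent hypothetical classes: random guessing is right half of the time
print(proportional_guess_accuracy({1: 50, 2: 50}))  # 0.5
```

Feeding the per-class counts from Assignment 3 into `proportional_guess_accuracy` lets you check the random baseline quoted above for yourself.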

## Assignment 10
Try to play around with the different settings of the `RandomForestClassifier`. Can you improve on the result above? What seems to influence performance, and why do you think that is?

**Hint**: an outline to train, predict and evaluate in a single cell is provided below:

In [None]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(
    featuresCol=<FILL IN>,
    labelCol=<FILL IN>,
    numTrees=<FILL IN>,
    maxDepth=<FILL IN>,
    seed=<FILL IN>
)
model = rf.fit(train_data)
predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(labelCol='Cover_Type', metricName='accuracy')
evaluator.evaluate(predictions) 

## Assignment 11
We can plot the importance of each feature using seaborn. Do the importances make sense to you? Why, or why not?

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

mpl.rcParams['axes.labelsize'] = 24
mpl.rcParams['xtick.labelsize'] = 18
mpl.rcParams['ytick.labelsize'] = 18

importances_df = pd \
    .DataFrame({'importance': model.featureImportances.toArray(), 'feature': feature_column_names}) \
    .sort_values('importance', ascending=False)

plt.figure(figsize=(16, 8))    
sns.barplot(data=importances_df, x='feature', y='importance')
plt.xticks(rotation=90, fontsize=18);