# Decision Trees and Random Forests
Decision tree algorithms are used for classification and regression tasks. Using a sequence of if-else tests on the features, they recursively divide the input space into subsets, ideally until all the samples in each subset belong to the same class.
Splits that produce pure subsets increase our information about the data, and are better than splits that produce subsets with mixed observations. We can quantify the amount of information using a measure called entropy. Entropy is given by the following equation:

$$H=-{\sum}_c P_{c}\log_{2}P_{c}$$

where $P_{c}$ is the proportion of the samples at a node that belong to class $c$.
Because each $P_{c}$ is at most one, $\log_{2}P_{c}$ is never positive, so the leading minus sign makes the entropy non-negative. The entropy is 0 if all samples at a node belong to the same class ($\log_{2}(1)=0$), and it is maximal for a uniform class distribution (1 bit for two equally frequent classes). In other words, a better split produces child subsets with lower entropy.
The reduction in entropy is measured by a metric called information gain:

$$I_{G}=H_{P}-\frac{N_{L}}{N_{P}}H_{L}-\frac{N_{R}}{N_{P}}H_{R}$$

Here, $H_{P}$ is the entropy of the parent set, $H_{L}$ and $H_{R}$
are the entropies of the left and right child sets, $N_{P}$ is the
total number of samples in the parent set, and $N_{L}$ and $N_{R}$ are
the numbers of samples in the left and right child sets. The candidate
split with the largest $I_{G}$ is chosen.
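The two formulas above can be checked numerically. The following is a minimal sketch in plain Scala (no Spark required); the helper names `entropy` and `informationGain` and the class counts are hypothetical, chosen only for illustration.

```scala
// Shannon entropy (base 2) of a node, given the number of samples per class.
// Classes with zero samples are skipped, since they contribute nothing to the sum.
def entropy(counts: Seq[Int]): Double = {
  val n = counts.sum.toDouble
  counts.filter(_ > 0).map { c =>
    val p = c / n
    -p * math.log(p) / math.log(2)
  }.sum
}

// Information gain of splitting a parent node into a left and a right child.
def informationGain(parent: Seq[Int], left: Seq[Int], right: Seq[Int]): Double = {
  val nP = parent.sum.toDouble
  entropy(parent) - left.sum / nP * entropy(left) - right.sum / nP * entropy(right)
}

// Hypothetical parent node: 10 samples of class 0 and 10 samples of class 1.
val parent = Seq(10, 10)

// A perfect split: each child is pure.
val pureGain = informationGain(parent, Seq(10, 0), Seq(0, 10))

// An uninformative split: both children keep the 50/50 mix.
val mixedGain = informationGain(parent, Seq(5, 5), Seq(5, 5))
```

With these counts the perfect split recovers the full parent entropy of 1 bit, while the split that leaves both children at a 50/50 mix gains nothing.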

This process can result in a very deep tree with many nodes, which
can lead to overfitting. Thus, it is common practice to set a maximum
depth for the tree, but this alone is often not enough. Combining multiple
estimators further reduces the effect of overfitting.

A $\textbf{random forest}$ is a collection of decision trees that typically
has better predictive performance than a single decision tree. The algorithm
trains each decision tree in the ensemble independently on a random
sample of the data. The prediction combines the predictions of the
individual trees: an average for regression, and a vote (or averaged class
probabilities) for classification. The number of trees in an ensemble is
typically of the order of hundreds.
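To make the ensemble idea concrete, here is a rough sketch in plain Scala of the two ingredients: resampling the training rows for each tree and combining the trees' outputs. It is not Spark's implementation. The helper names `bootstrap` and `combine`, the sampling with replacement (the usual bagging scheme, assumed here), and the per-tree probabilities are all hypothetical.

```scala
import scala.util.Random

// Bootstrap sample: draw data.size rows with replacement.
def bootstrap[A](data: Vector[A], rng: Random): Vector[A] =
  Vector.fill(data.size)(data(rng.nextInt(data.size)))

// Average the per-tree class probabilities and return them
// together with the index of the most probable class.
def combine(treeProbs: Seq[Seq[Double]]): (Seq[Double], Int) = {
  val avg = treeProbs.transpose.map(_.sum / treeProbs.size)
  (avg, avg.indexOf(avg.max))
}

// Hypothetical outputs of three trees for one sample: (P(class 0), P(class 1)).
val treeProbs = Seq(Seq(0.9, 0.1), Seq(0.6, 0.4), Seq(0.2, 0.8))
val (avgProbs, predictedClass) = combine(treeProbs)

// Each tree would be trained on its own resample of the row indices.
val sample = bootstrap(Vector(0, 1, 2, 3, 4), new Random(42))
```

Here two of the three trees favour class 0 and the averaged probabilities do too, so the ensemble predicts class 0 even though one tree disagrees.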

In [1]:
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.DenseVector
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import breeze.plot._
import convert.jfc.tohtml

## Load the dataset

In [2]:
val df = spark.read.
  format("csv").
  option("header", "true").
  option("inferSchema", "true").
  option("delimiter",",").
  load("../Datasets/Circles.csv")

df = [x1: double, x2: double ... 1 more field]



## Explore the dataset

In [3]:
df.show(10)

+----------------+---------------+---+
|              x1|             x2|  y|
+----------------+---------------+---+
|  0.846649756819| 0.487329981917|  0|
|  -1.03849757478|-0.274275182577|  0|
|   1.03690844231|-0.384525436991|  0|
| -0.584861591913| 0.996708715045|  0|
|   1.03363614647| 0.373811127616|  0|
|  -0.61080949638|    1.006592094|  0|
|-0.0516171570897| 0.139178302889|  1|
|  0.133018383164|0.0940726995832|  1|
| -0.924003510567| 0.222330133012|  0|
|-0.0860432056902|0.0420813350282|  1|
+----------------+---------------+---+
only showing top 10 rows



In [4]:
df.printSchema

root
 |-- x1: double (nullable = true)
 |-- x2: double (nullable = true)
 |-- y: integer (nullable = true)



In [5]:
df.describe().show()

+-------+--------------------+--------------------+------------------+
|summary|                  x1|                  x2|                 y|
+-------+--------------------+--------------------+------------------+
|  count|                1000|                1000|              1000|
|   mean|-9.40353624820699...|0.001200687946179...|               0.5|
| stddev|  0.5221233975699611|  0.5239333827823931|0.5002501876563866|
|    min|      -1.18437664394|      -1.17585629601|                 0|
|    max|       1.21443132534|       1.25587185691|                 1|
+-------+--------------------+--------------------+------------------+



In [6]:
df.groupBy("y").count.show()

+---+-----+
|  y|count|
+---+-----+
|  1|  500|
|  0|  500|
+---+-----+



## Apply Vector Assembler

In [7]:
val assembler = new VectorAssembler().
  setInputCols(Array("x1", "x2")).
  setOutputCol("features")

assembler = vecAssembler_9ade0ac31156



In [8]:
val df_v = assembler.transform(df).select(col("features"),col("y").cast("double").as("label"))
df_v.show(10)

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.846649756819,0...|  0.0|
|[-1.03849757478,-...|  0.0|
|[1.03690844231,-0...|  0.0|
|[-0.584861591913,...|  0.0|
|[1.03363614647,0....|  0.0|
|[-0.61080949638,1...|  0.0|
|[-0.0516171570897...|  1.0|
|[0.133018383164,0...|  1.0|
|[-0.924003510567,...|  0.0|
|[-0.0860432056902...|  1.0|
+--------------------+-----+
only showing top 10 rows



df_v = [features: vector, label: double]



## Split into train and test sets

In [9]:
val Array(trainingData, testData) = df_v.randomSplit(Array(0.7, 0.3))

trainingData = [features: vector, label: double]
testData = [features: vector, label: double]



## Train the model

In [10]:
val rf = new RandomForestClassifier().
    setNumTrees(100).
    setMaxDepth(10)

rf = rfc_55975a216bfd



In [11]:
rf.explainParams



cacheNodeIds: If false, the algorithm will pass trees to executors to match instances with nodes. If true, the algorithm will cache node IDs for each instance. Caching can speed up training of deeper trees. (default: false)
checkpointInterval: set checkpoint interval (>= 1) or disable checkpoint (-1). E.g. 10 means that the cache will get checkpointed every 10 iterations (default: 10)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: auto, all, onethird, sqrt, log2, (0.0-1.0], [1-n]. (default: auto)
featuresCol: features column name (default: features)
impurity: Criterion used for information gain calculation (case-insensitive). Supported options: entropy, gini (default: gini)
labelCol: label column name (default: label)
maxBins: Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature. (default: 32)
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 mean

In [12]:
val model = rf.fit(trainingData)

model = RandomForestClassificationModel (uid=rfc_f4c2f87751c8) with 100 trees



## Feature importance

In [13]:
// Pair each assembler input column with its importance (same order as in the feature vector)
val fi = assembler.getInputCols.zip(model.featureImportances.toArray).toSeq.toDF("feature", "importance")
fi.sort(col("importance").desc).show()

+-------+-------------------+
|feature|         importance|
+-------+-------------------+
|     x2| 0.5682885834147079|
|     x1|0.43171141658529216|
+-------+-------------------+



fi = [feature: string, importance: double]



## Make predictions

In [14]:
val predictions = model.transform(testData)
predictions.show(10)

+--------------------+-----+-------------+-----------+----------+
|            features|label|rawPrediction|probability|prediction|
+--------------------+-----+-------------+-----------+----------+
|[-1.16248692634,-...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.13954035663,0...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.12819135003,-...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.07740659047,0...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.04526759498,0...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.01430737658,-...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.01115751569,-...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.00825132088,-...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-1.00193170155,0...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
|[-0.968726624436,...|  0.0|  [100.0,0.0]|  [1.0,0.0]|       0.0|
+--------------------+-----+-------------+-----------+----------+
only showing top 10 rows



predictions = [features: vector, label: double ... 3 more fields]



## Plot predictions

In [15]:
val x1 = predictions.select("features").collect.map(row=>row(0).asInstanceOf[DenseVector](0))

x1 = Array(-1.16248692634, -1.13954035663, -1.12819135003, -1.07740659047, -1.04526759498, -1.01430737658, -1.01115751569, -1.00825132088, -1.00193170155, -0.968726624436, -0.963957290914, -0.962787689277, -0.958047075336, -0.955793548895, -0.953455441962, -0.950549404207, -0.914348006217, -0.910685065827, -0.902187151805, -0.88317749423, -0.877127210879, -0.873078480076, -0.858976159578, -0.857533335587, -0.850207401739, -0.832613418806, -0.821940876465, -0.820566310549, -0.794309111682, -0.788562640801, -0.767305839441, -0.757583275275, -0.757466409686, -0.728474317033, -0.724602687449, -0.696470425313, -0.688534888825, -0.61080949638, -0.593493134074, -0.55759457081, -0.555819240103, -0.549826787006, -0.547671957612, -0.531576430493, -0.511814723138, -0.50292129151, -0...



In [16]:
val x2 = predictions.select("features").collect.map(row=>row(0).asInstanceOf[DenseVector](1))

x2 = Array(-0.301233995694, 0.011278119127, -0.00441023742714, 0.129495560338, 0.165053807973, -0.497797051188, -0.114707144458, -0.0410405527739, 0.0636945464323, -0.60658340289, -0.126531129403, 0.132307280096, 0.416965500712, 0.116224846513, 0.386488512151, 0.0209826010842, -0.338386978365, -0.0567778562773, 0.21760617646, -0.433726713136, -0.59511618466, 0.248582432801, -0.590115278181, -0.575687806863, -0.44912138132, -0.648995999503, 0.784862974465, 0.0169802387372, -0.0318665555085, 0.628715258976, -0.580357403783, -0.722161751535, -0.479421226203, 0.680256326939, 0.809688308346, -0.563555340848, 0.818651740645, 1.006592094, 0.606958421855, 0.955120700792, -0.781454087152, 0.937439546893, -0.993230465057, -0.723985811026, 0.969242877753, -0.819756142857, -0.9862613...



In [17]:
val fig = Figure()
val plt = fig.subplot(0)

fig = breeze.plot.Figure@61d909e5
plt = breeze.plot.Plot@bc57abb



In [18]:
// Map each predicted class to a color: 0.0 -> red, 1.0 -> blue
val c = predictions.select("prediction").as[Double].collect.map(Map(0.0 -> java.awt.Color.red, 1.0 -> java.awt.Color.blue))

// Scatter plot of the test points, colored by predicted class
plt += scatter(x1, x2, size = x1.map(_ => 0.05), colors = c)

plt.legend = true
plt.title = "RFC"
plt.xlabel = "x1"
plt.ylabel = "x2"

c = Array(java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255,g=0,b=0], java.awt.Color[r=255...



In [19]:
kernel.magics.html(tohtml(plt.chart))

## Evaluate the model

In [20]:
val evaluator = new MulticlassClassificationEvaluator().
    setLabelCol("label").
    setPredictionCol("prediction").
    setMetricName("accuracy")

evaluator = mcEval_d4810f9c41ba



In [21]:
val accuracy = evaluator.evaluate(predictions)

accuracy = 0.9936708860759493
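
Accuracy alone does not show which class the errors fall in. As a rough cross-check, and assuming the `predictions` DataFrame and the `accuracy` value from the cells above are still in scope in the same Spark session, the per-class counts and a hand-computed accuracy could be obtained along these lines (the name `manualAccuracy` is introduced here only for illustration):

```scala
import org.apache.spark.sql.functions.col

// Confusion-matrix-style counts: one row per (label, prediction) pair.
// Rows where label and prediction differ are misclassifications.
predictions.groupBy("label", "prediction").count().orderBy("label", "prediction").show()

// Fraction of test rows whose prediction equals the label.
val manualAccuracy =
  predictions.filter(col("label") === col("prediction")).count.toDouble / predictions.count
```

The hand-computed value should agree with the evaluator's `accuracy` metric, since both count the share of correctly classified test rows.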

