
# 3.4.3 Gradient-Boosting Trees

Welcome to Gradient-Boosting Trees (GBTs)

After completing this set of lessons about Predicting Grant Applications, you should be able to:

* Understand how to fit together the functions available in Spark's machine learning libraries to solve real problems
* Use a Spark cluster to fit models in a fraction of the time
* Perform classification and regression with Gradient-Boosted Trees
* Understand and use Gradient-Boosted Trees parameters

## Gradient-Boosting Trees

* Like Random Forests, they are ensembles of decision trees
* They are trained iteratively to minimize a loss function
* They support binary classification
* They support regression
* They support continuous and categorical features

The Pipelines API for Gradient-Boosted Trees supports regression and binary classification. It also supports continuous and categorical features.

This is a quick description of the basic algorithm of Gradient-Boosted Trees:

* Iteratively trains a sequence of decision trees
* On each iteration it uses the current ensemble to make label predictions and then it compares these to true labels
* Next it re-labels the dataset to put more emphasis on instances with poor predictions, according to a given loss function
* With each iteration it reduces the loss function, thus correcting for previous mistakes
* Supported loss functions:
  * `classification`: Log Loss (twice binomial negative log likelihood)
  * `regression`: Squared Error (L2 loss, default) and Absolute Error (L1 loss, more robust to outliers)
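
The steps above can be sketched in plain Scala. This is a toy illustration under stated assumptions, not Spark's implementation: it uses a single feature, decision stumps instead of full trees, and squared error as the loss, for which the "re-labelling" step amounts to fitting each new stump to the residuals of the current ensemble. The names `BoostingSketch`, `Stump`, `fitStump` and `train` are hypothetical.

```scala
// Toy gradient boosting for regression with one-feature decision stumps.
// Illustration only; this is not how Spark implements GBTs.
object BoostingSketch {
  // A stump predicts `left` when x <= threshold and `right` otherwise.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Fit the stump that minimises squared error against the given targets.
  // Assumes xs contains at least two distinct values.
  def fitStump(xs: Seq[Double], targets: Seq[Double]): Stump = {
    val candidates = xs.distinct.sorted.dropRight(1) // keeps both sides non-empty
    val stumps = candidates.map { t =>
      val (l, r) = xs.zip(targets).partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    stumps.minBy(s => xs.zip(targets).map { case (x, y) => math.pow(y - s.predict(x), 2) }.sum)
  }

  // The ensemble prediction is the sum of the stumps, each scaled by the learning rate.
  def predict(ensemble: Seq[Stump], learningRate: Double, x: Double): Double =
    ensemble.map(s => learningRate * s.predict(x)).sum

  def mse(ensemble: Seq[Stump], learningRate: Double, xs: Seq[Double], ys: Seq[Double]): Double =
    mean(xs.zip(ys).map { case (x, y) => math.pow(y - predict(ensemble, learningRate, x), 2) })

  // Each iteration fits one stump to the residuals of the current ensemble.
  def train(xs: Seq[Double], ys: Seq[Double], numIterations: Int, learningRate: Double): Seq[Stump] =
    (1 to numIterations).foldLeft(Seq.empty[Stump]) { (ensemble, _) =>
      val residuals = xs.zip(ys).map { case (x, y) => y - predict(ensemble, learningRate, x) }
      ensemble :+ fitStump(xs, residuals)
    }
}

// Made-up data, used only to show the training loss after 0, 5, 10, 15 and 20 iterations.
val xs = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
val ys = Seq(1.2, 1.9, 3.1, 3.9, 5.2, 6.1)
val lossPerIteration = (0 to 20 by 5).map { n =>
  BoostingSketch.mse(BoostingSketch.train(xs, ys, n, 0.5), 0.5, xs, ys)
}
println(lossPerIteration.mkString(", "))
```

The printed training loss should not increase as iterations are added, which mirrors the last step in the list: each new tree corrects part of what the ensemble still gets wrong.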

## Gradient-Boosted Trees Parameters

* `loss`: loss function (Log Loss for classification; Squared Error and Absolute Error for regression)
* `numIterations`: number of trees in the ensemble
    * each iteration produces one tree
    * if it increases:
        * model gets more expressive, improving training data accuracy
        * test-time accuracy may suffer (if too large)
* `learningRate`: should NOT need to be tuned
    * if behaviour seems unstable, decreasing it may improve stability
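
As a rough sketch of how these parameters are set in the Pipelines API: the estimators expose them as `lossType`, `maxIter` (the number of iterations) and `stepSize` (the learning rate). The values below are illustrative assumptions, not tuned recommendations, and the variable names are hypothetical.

```scala
import org.apache.spark.ml.classification.GBTClassifier
import org.apache.spark.ml.regression.GBTRegressor

// Illustrative values only; they are not tuned recommendations.
val gbtClassifierParams = new GBTClassifier()
  .setLossType("logistic") // log loss, the classification loss
  .setMaxIter(20)          // number of trees, one per iteration
  .setStepSize(0.1)        // learning rate

val gbtRegressorParams = new GBTRegressor()
  .setLossType("absolute") // L1 loss; "squared" (L2) is the default
  .setMaxIter(20)
  .setStepSize(0.1)

// Lists every parameter with its documentation and current value.
println(gbtClassifierParams.explainParams())
```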



## Validation While Training

* Gradient-Boosted Trees can overfit when trained with more trees
* The method `runWithValidation` allows validation while training
  * takes a pair of RDDs: training and validation datasets
* Training is stopped when the improvement in validation error is less than the tolerance specified as `validationTol` in `BoostingStrategy`
  * validation error decreases initially and later increases
  * there might be cases in which the validation error does not change monotonically
    * set a large enough negative tolerance
    * examine validation curve using `evaluateEachIteration`, which gives the error or loss per iteration
    * tune the number of iterations
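
A minimal sketch of validation while training, assuming the RDD-based `spark.mllib` API (where `runWithValidation` and `BoostingStrategy` live) and the sample file used later in this lesson. The split ratio, seed, number of iterations and tree depth are illustrative assumptions, and the variable names are hypothetical.

```scala
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.LogLoss
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// RDD[LabeledPoint] with 0/1 labels, loaded from the lesson's sample file.
val points = MLUtils.loadLibSVMFile(spark.sparkContext, "/resources/data/sample_libsvm_data.txt")
val Array(trainRDD, validationRDD) = points.randomSplit(Array(0.7, 0.3), seed = 42L)

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 30      // upper bound on the number of trees
boostingStrategy.validationTol = 0.001   // stop once the improvement falls below this
boostingStrategy.treeStrategy.maxDepth = 3

// Training may stop before numIterations if the validation error stops improving.
val validatedModel = new GradientBoostedTrees(boostingStrategy)
  .runWithValidation(trainRDD, validationRDD)

// Loss on the validation set after each iteration (the validation curve).
val validationCurve = validatedModel.evaluateEachIteration(validationRDD, LogLoss)
validationCurve.zipWithIndex.foreach { case (loss, i) =>
  println(f"iteration ${i + 1}%2d: validation loss = $loss%.4f")
}
```

Inspecting `validationCurve` is one way to pick the number of iterations when the validation error does not change monotonically.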



## Inputs & Outputs

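| Direction | Parameter | Type | Description |
|---|---|---|---|
| Input | `labelCol` | Double | Label to predict |
| Input | `featuresCol` | Vector | Feature vector |
| Output | `predictionCol` | Double | Predicted label |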

Here we have inputs and outputs. The inputs taken by Gradient-Boosted Trees in the Pipelines API are just the same as the inputs taken by Decision Trees, that is, the label and features columns. However, Gradient-Boosted Trees output only one column, the prediction itself.
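
The default names of these columns can be read from a freshly created estimator. This is a small sketch; `defaultGbt` is a hypothetical name used only here.

```scala
import org.apache.spark.ml.classification.GBTClassifier

// A new estimator carries the default column names until they are overridden
// with setLabelCol, setFeaturesCol or setPredictionCol.
val defaultGbt = new GBTClassifier()
println(s"label column:      ${defaultGbt.getLabelCol}")
println(s"features column:   ${defaultGbt.getFeaturesCol}")
println(s"prediction column: ${defaultGbt.getPredictionCol}")
```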



## Continuing From Previous Example I

You need to run the following script from previous lessons to be able to run this example. If you haven't downloaded the data set from the previous lesson then there is a link in the script to download it to your temporary folder and load it.

In [None]:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
import org.apache.spark.sql.functions._

import org.apache.spark.mllib.util.MLUtils.{
  convertVectorColumnsFromML => fromML,
  convertVectorColumnsToML => toML
}

In [None]:
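import org.apache.spark.mllib.util.MLUtils

// Load the LIBSVM file as an RDD, turn it into a DataFrame and convert
// its vector column to the ML vector type.
val data = toML(MLUtils.loadLibSVMFile(sc, "/resources/data/sample_libsvm_data.txt").toDF())

// Split into 70% training and 30% test data; `data` already holds ML vectors.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))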

Training data:

In [None]:
trainingData.show(5)

Test data:

In [None]:
testData.show(5)

## Continuing From Previous Example II

In the previous lesson we also created two preprocessing estimators and one post-processing transformer. We will use the same estimators and transformers in our Gradient-Boosted Trees Pipeline. For a GBT classifier, first create a new instance of it and set its label and features columns, just as we did for Random Forests.

In [None]:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}
import org.apache.spark.ml.feature.{StringIndexer, IndexToString, VectorIndexer}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator


val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)
  
val gbt = new GBTClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setMaxIter(10)

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, gbt, labelConverter))

// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions.select("predictedLabel", "label", "features").show(5)

// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println("Test Error = " + (1.0 - accuracy))

val gbtModel = model.stages(2).asInstanceOf[GBTClassificationModel]
println("Learned classification GBT model:\n" + gbtModel.toDebugString)

## GBT Regression

Having completed an example of classification with Gradient-Boosted Trees, it is time for an example of regression. Once again, I will build upon previous regression examples. The Pipelines for regression had only two stages, and I replace the second one with my current `regressor`, a `GBTRegressor`.

We use the same data, already split into training and test sets. Everything else is the same as before: call the `fit` method to get a model and the `transform` method to make predictions:

In [None]:
import org.apache.spark.ml.regression.GBTRegressor
import org.apache.spark.ml.regression.GBTRegressionModel

val gbtR = new GBTRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures").setMaxIter(10)

val pipelineGBTR = new Pipeline().setStages(Array(featureIndexer, gbtR))

val modelGBTR = pipelineGBTR.fit(trainingData)

The predictions are then returned in the `predictionsGBTR` `DataFrame`:

In [None]:
val predictionsGBTR = modelGBTR.transform(testData)
predictionsGBTR.show()
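
One way to summarize regression predictions in a single number is the root mean squared error. The sketch below assumes the default `label` and `prediction` column names; `rmseOf` is a hypothetical helper, not part of Spark.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.sql.DataFrame

// Hypothetical helper: root mean squared error of a predictions DataFrame
// that has "label" and "prediction" columns.
def rmseOf(predictions: DataFrame): Double =
  new RegressionEvaluator()
    .setLabelCol("label")
    .setPredictionCol("prediction")
    .setMetricName("rmse")
    .evaluate(predictions)
```

Calling `rmseOf(predictionsGBTR)` would then give the error on the test data; lower values mean predictions closer to the labels.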

As you can see, the Pipelines API makes it very easy to manage the workflow and replace and/or extend models as you go.


## Random Forests vs GBTs

Finally, let's compare both ensemble algorithms, Random Forests and Gradient-Boosted Trees. As the number of trees increases, Random Forests reduce the variance and the likelihood of overfitting, improving the performance monotonically. Gradient-Boosted Trees, on the other hand, reduce the bias but increase the likelihood of overfitting, so the performance can actually decrease if the number of trees grows too large.

Other important differences are that Random Forests are highly parallelizable, with each tree trained independently of the others, while Gradient-Boosted Trees are trained one at a time. The algorithms also differ in the usual depth of their trees. Random Forests usually grow deeper trees, since they can rely on a large number of trees to compensate for overfitting, while Gradient-Boosted Trees are usually grown shallower.

* Number of trees
  * **RFs**: more trees reduce variance and the likelihood of overfitting; improves performance monotonically
  * **GBTs**: more trees reduce bias, but increase the likelihood of overfitting and performance can start to decrease if the number of trees grows too large
* Parallelization
  * **RFs**: can train multiple trees in parallel
  * **GBTs**: train one tree at a time
* Depth of trees
  * **RFs**: deeper trees
  * **GBTs**: shallower trees
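
These differences tend to show up in how the two estimators are configured. The following is only a sketch: the specific numbers are illustrative assumptions, not tuned values, and `rfSketch` and `gbtSketch` are hypothetical names.

```scala
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// Illustrative settings only; the numbers are assumptions, not tuned values.
val rfSketch = new RandomForestClassifier()
  .setNumTrees(100) // many trees, trained independently of each other
  .setMaxDepth(10)  // deeper trees; averaging offsets their variance

val gbtSketch = new GBTClassifier()
  .setMaxIter(30)   // fewer trees, trained one after another
  .setMaxDepth(3)   // shallower trees; each one corrects the ensemble so far
```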

## Lesson Summary

Having completed this lesson, you should now be able to:

* Understand the Pipelines API for Random Forests and Gradient-Boosted Trees
* Describe default Input and Output columns
* Perform classification and regression with RFs and GBTs
* Understand and use RFs and GBTs parameters
* Outline the differences between RFs and GBTs regarding their parameters

### About the Authors

[Petro Verkhogliad](https://www.linkedin.com/in/vpetro) is Consulting Manager at Lightbend. He holds a Master's degree in Computer Science with specialization in Intelligent Systems. He is passionate about functional programming and applications of AI.