#### Hyperparameter Tuning for Gradient Boosted Regression

Boosting is an iterative ensemble technique that combines many weak learners into a single, more accurate predictor. In this use case, each tree K is trained to correct the errors left by the ensemble of trees 1 through K-1. Like individual decision trees, boosted trees are a form of supervised learning: each prediction Y hat is compared against the actual target value Y. What boosting adds is error-driven feedback between rounds: the samples the current ensemble predicts poorly contribute the most to the next tree's training signal, so the model is 'punished' for faulty outcomes in a way that increases the predictive power of the ensemble as a whole. In our use case, there are several hyperparameters we can tune to limit overfitting, prevent underfitting, and increase overall model robustness.

We're using gradient boosted trees, which, in contrast to random forest, build their trees sequentially. A random forest trains its trees independently of one another, each on a random sample of the rows and features, and averages their predictions to produce a kind of 'collective intelligence'. Gradient boosting instead fits each tree K to the negative gradient of the loss function, evaluated at the predictions of the ensemble built so far (for squared error, this is proportional to the residuals). Simply put, rather than averaging independent trees, gradient boosting adds trees one at a time, each correcting the errors of the trees before it, and the final model is the sum of all of them.
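
To make the sequential idea concrete, here is a minimal sketch in plain Python, not Spark's implementation. It assumes squared-error loss and uses one-split trees ('stumps') on the ten rows shown in this notebook's data preview; `fit_stump` and `boost` are hypothetical helper names introduced only for illustration.

```python
# Toy gradient boosting for squared-error loss on a single feature.
# Data: the ten (hours, label) rows shown in this notebook's preview.
hours = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7]
scores = [21, 47, 27, 75, 30, 20, 88, 60, 81, 25]


def mean(values):
    return sum(values) / len(values)


def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        threshold = (lo + hi) / 2
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def boost(x, y, n_trees, learning_rate):
    """Fit n_trees stumps sequentially; return training MSE after each round."""
    prediction = [mean(y)] * len(y)  # start from a constant model
    history = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is proportional to the residual.
        residuals = [yi - pi for yi, pi in zip(y, prediction)]
        threshold, left_value, right_value = fit_stump(x, residuals)
        # Add a shrunken copy of the new stump to the running prediction.
        prediction = [
            pi + learning_rate * (left_value if xi <= threshold else right_value)
            for xi, pi in zip(x, prediction)
        ]
        history.append(mean([(yi - pi) ** 2 for yi, pi in zip(y, prediction)]))
    return history


mse_history = boost(hours, scores, n_trees=20, learning_rate=0.1)
print('MSE after first tree:', mse_history[0])
print('MSE after last tree: ', mse_history[-1])
```

The training error shrinks with every round because each stump is fitted to whatever the previous rounds left unexplained. Training error alone says nothing about generalization, which is why the hyperparameters below matter.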

We will focus on a few prevalent hyperparameters:

1 -> Max tree depth--prevents overfitting by keeping individual trees from learning patterns overly specific to single samples

2 -> Minimum samples per split--reduces overfitting for the same reason as the previous hyperparameter: a split that would leave too few samples in a child node is rejected, which keeps trees from learning overly specific patterns

3 -> Learning rate--scales the contribution of each tree; a lower learning rate usually generalizes better but needs more trees to reach the same fit, which makes training more computationally expensive

4 -> Iteration count--the number of trees; as a rule of thumb, more trees yield a more flexible model. However, too many trees lead to overfitting, which will hurt the model's performance on data that differs from the training data

5 -> Loss function--controls the 'punishment' the model receives for bad predictions (for example, squared versus absolute error); this hyperparameter should be selected by comparing the resulting regression metrics
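
The last item is easier to see with numbers. The sketch below is plain Python rather than Spark code, and the function names are hypothetical. It compares squared and absolute error, the two loss types this notebook searches over later, and shows the negative gradient each one hands to the next tree.

```python
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2


def absolute_loss(y_true, y_pred):
    return abs(y_true - y_pred)


def negative_gradient_squared(y_true, y_pred):
    # d/d(pred) of (y - pred)^2 is -2 * (y - pred), so the negative
    # gradient grows with the size of the error.
    return 2 * (y_true - y_pred)


def negative_gradient_absolute(y_true, y_pred):
    # d/d(pred) of |y - pred| is -sign(y - pred), so the negative
    # gradient only carries the direction of the error.
    residual = y_true - y_pred
    return (residual > 0) - (residual < 0)


# A small miss (2 points) versus a large miss (20 points).
for y_true, y_pred in [(50, 48), (50, 30)]:
    print('error:', y_true - y_pred,
          '| squared:', squared_loss(y_true, y_pred),
          '| absolute:', absolute_loss(y_true, y_pred),
          '| -grad squared:', negative_gradient_squared(y_true, y_pred),
          '| -grad absolute:', negative_gradient_absolute(y_true, y_pred))
```

Squared error punishes the large miss a hundred times harder than the small one, while absolute error punishes it only ten times harder. This is why absolute error tends to be less sensitive to outliers.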

In the absence of nuanced understanding regarding one's chosen model and dataset, the prospect of manually editing hyperparameters is intimidating. Instead, we'll automate the process via grid search, a brute-force mechanism which tries every combination of the candidate values we supply and returns the best-scoring set of hyperparameters. Let's begin!
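
Before handing the work to Spark, it helps to see what 'every combination' means. The following is a rough sketch in plain Python. The search space values are illustrative, `expand_grid` and `grid_search` are hypothetical helpers, and `toy_score` is a stand-in for a real cross-validated metric.

```python
from itertools import product

# Hypothetical search space; the values are illustrative only.
search_space = {
    'maxDepth': [2, 5, 10],
    'stepSize': [0.1, 0.3],
    'lossType': ['squared', 'absolute'],
}


def expand_grid(space):
    """Return one dict per combination of candidate values."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]


def grid_search(space, score):
    """Return the candidate with the highest score (larger is better)."""
    return max(expand_grid(space), key=score)


candidates = expand_grid(search_space)
num_folds = 3
print(len(candidates), 'candidates ->',
      len(candidates) * num_folds, 'model fits with 3-fold cross-validation')


# Stand-in scoring function so the sketch runs without Spark; a real run
# would return a cross-validated metric such as R-squared.
def toy_score(params):
    return -abs(params['maxDepth'] - 5) - params['stepSize']


best = grid_search(search_space, toy_score)
print(best)
```

The number of fits grows multiplicatively with every hyperparameter added, so a grid that looks modest on paper can become expensive quickly.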

First, we'll rename our target column to 'label', the default label column name that Spark ML estimators and evaluators (and therefore CrossValidator) look for.

In [0]:
df = spark.read.options(header='True', inferSchema='True', delimiter=',').csv('dbfs:/FileStore/bitamss_mlib/score.csv')
data = df.selectExpr("Hours as hours", "Scores as label")
display(data)

hours,label
2.5,21
5.1,47
3.2,27
8.5,75
3.5,30
1.5,20
9.2,88
5.5,60
8.3,81
2.7,25


In [0]:
# Imports
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.feature import VectorAssembler, MinMaxScaler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml import Pipeline


# Assembler
vec_assembler = VectorAssembler(inputCols=['hours'], outputCol='hoursV')

# Scaler
scaler = MinMaxScaler(inputCol="hoursV", outputCol="features")

# Split into train and test sets (fixed seed for reproducibility)
train_data_tune, test_data_tune = data.randomSplit([0.8, 0.2], seed=24)
tuned_gbt = GBTRegressor(labelCol='label', featuresCol='features')
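
As a rough illustration of what the scaling stage does, here is min-max scaling written out in plain Python on the ten previewed rows. This is a sketch of the formula with the default [0, 1] output range, not Spark's code, and `min_max_scale` is a hypothetical helper.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError('cannot scale a constant column')
    return [(v - lo) / (hi - lo) for v in values]


hours = [2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7]
scaled = min_max_scale(hours)
for raw, s in zip(hours, scaled):
    print(raw, '->', round(s, 4))
```

Inside a pipeline, the minimum and maximum are learned from the data the pipeline is fitted on, so rows seen later can land slightly outside [0, 1].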

Before we go any further, let's make sure the hyperparameters we want are available to us by calling .extractParamMap() on our gradient boosted tree object.

In [0]:
tuned_gbt.extractParamMap()

Out[3]: {Param(parent='GBTRegressor_f76fcabc91ec', name='seed', doc='random seed.'): -6682481135904123338,
 Param(parent='GBTRegressor_f76fcabc91ec', name='maxDepth', doc='Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30].'): 5,
 Param(parent='GBTRegressor_f76fcabc91ec', name='maxBins', doc='Max number of bins for discretizing continuous features.  Must be >=2 and >= number of categories for any categorical feature.'): 32,
 Param(parent='GBTRegressor_f76fcabc91ec', name='minInstancesPerNode', doc='Minimum number of instances each child must have after split. If a split causes the left or right child to have fewer than minInstancesPerNode, the split will be discarded as invalid. Should be >= 1.'): 1,
 Param(parent='GBTRegressor_f76fcabc91ec', name='minInfoGain', doc='Minimum information gain for a split to be considered at a tree node.'): 0.0,
 Param(parent='GBTRegressor_f76fcabc91ec', name='maxMemor

In [0]:
# Set pipeline
tuned_pipeline = Pipeline(stages=[vec_assembler, scaler, tuned_gbt])

# Define search space
gridSearch = (ParamGridBuilder()
             # Uncomment to widen the search; maxDepth must stay within [0, 30]
             #.addGrid(tuned_gbt.maxDepth, [5, 10, 20, 30])
             #.addGrid(tuned_gbt.subsamplingRate, [.1, .3, .7])
             #.addGrid(tuned_gbt.stepSize, [.1, .3, .7])
             #.addGrid(tuned_gbt.maxIter, [5, 10, 35, 80])
             .addGrid(tuned_gbt.lossType, ['squared', 'absolute'])
             .build())

# Build evaluator
gbtEval = RegressionEvaluator(labelCol='label', metricName='r2')
# Build crossvalidator
cv = CrossValidator(estimator=tuned_pipeline, estimatorParamMaps=gridSearch,
                    evaluator=gbtEval, numFolds=3, parallelism=4)
tunedModel = cv.fit(train_data_tune)
bestModel = tunedModel.bestModel
tunedPred = tunedModel.transform(test_data_tune)
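
With numFolds=3, every candidate in the grid is trained three times, each time holding out a different third of the training data for validation. The sketch below shows the idea in plain Python. It assigns rows to folds by striding purely for readability, which is an assumption of this example (Spark assigns rows to folds randomly), and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) once per fold."""
    folds = [list(range(start, n_rows, num_folds)) for start in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(n_rows=10, num_folds=3):
    print('train:', sorted(train), '| validation:', validation)
```

Each row is used for validation exactly once, and the metric reported for a candidate is the average over the folds.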

In [0]:
# Parameters of the estimator that produced the best model
best_gbt = bestModel.stages[-1]._java_obj.parent()
print('Best maxDepth: ', best_gbt.getMaxDepth())
print('Best subsampling rate: ', best_gbt.getSubsamplingRate())
print('Best stepSize: ', best_gbt.getStepSize())
print('Best maxIter: ', best_gbt.getMaxIter())
print('Best loss type: ', best_gbt.getLossType())

Best maxDepth:  5
Best subsampling rate:  1.0
Best stepSize:  0.1
Best maxIter:  20
Best loss type:  squared


In [0]:
# R-squared on the held-out test set
print(gbtEval.evaluate(tunedPred))

0.7328948379359923
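
For intuition about this number, R-squared compares the model's squared error with that of a baseline that always predicts the mean label. Here is a minimal sketch in plain Python using the standard definition, 1 - SS_res / SS_tot. The labels and predictions are made up for illustration and are not taken from this notebook's model.

```python
def r_squared(labels, predictions):
    """Return 1 - SS_res / SS_tot for paired labels and predictions."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Hypothetical values, chosen only to exercise the formula.
example_labels = [10, 20, 30, 40]
example_predictions = [12, 18, 33, 39]

print(r_squared(example_labels, example_predictions))
print(r_squared(example_labels, example_labels))   # perfect predictions
print(r_squared(example_labels, [25] * 4))         # always predicting the mean
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean, and negative values mean worse than that baseline.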


In [0]:
# Set best params
best_gbt = bestModel.stages[-1]._java_obj.parent()
max_depth = best_gbt.getMaxDepth()
subsampling_rate = best_gbt.getSubsamplingRate()
step_size = best_gbt.getStepSize()
max_iter = best_gbt.getMaxIter()
loss_type = best_gbt.getLossType()

# Train new tree object with best params
best_tree = GBTRegressor(maxDepth=max_depth, subsamplingRate=subsampling_rate, stepSize=step_size, maxIter=max_iter, lossType=loss_type, labelCol='label', featuresCol='features')
pipe = Pipeline(stages=[vec_assembler, scaler, best_tree])
model = pipe.fit(train_data_tune)
pred = model.transform(test_data_tune)

# Display the test features and labels (add col("prediction") to the select to see the model's output)
from pyspark.sql.functions import col
pred.select(col("features"), col("label")).show()

+--------------------+-----+
|            features|label|
+--------------------+-----+
|[0.2592592592592593]|   27|
|[0.49382716049382...|   47|
|[0.7777777777777779]|   69|
|[0.8271604938271605]|   86|
+--------------------+-----+

