As one would expect, this is a Scala
implementation of GBDT.
I have implemented GBDT in Python
, see Machine-Learning-From-Scratch. And now I rewrite it to Scala
. The structure of the code is almost the same as the Python
implementation.
This implementation supports most of the core features of xgboost
. Briefly, it supports:
- Built-in loss: Mean squared loss for regression task and log loss for classfication task.
- Customized loss: Other loss are also supported. User should provide the activation function, the gradient, and the hessian.
- Hessian information: It uses Newton Method for boosting, thus makes full use of the second-order derivative information.
- Regularization: lambda and gamma, as in
xgboost
. - Multi-processing: It uses
Scala
parallel collection
for multi-processing
To keep the code neat, some features of xgboost
are not implemented. For example, it does not handle missing value, and randomness is not supported.
The algorithm is implemented in Scala 2.12
. The algorithm is build from scratch: It depends on Breeze
for matrix calculation, and there is no dependence on any other machine learning package.
The project is build with Maven
. So you can find the information about the dependencies in pom.xml
.
As for peformance, Scala implementation is faster than Python implementation with Numpy, tested on my machine. But it could not compare to the Python implementation with Numba. I have to say that Numba
is really fast.
In terms of implementation, two remarks are in order. First, parallelization could be easily implemented in Scala
with parallel collection
, which makes the implementation process quite lovely. Second, I have to say that Breeze
is not as mature as Numpy
. It takes a lot of time to make Breeze
code work.
Refer to example.scala
for usage.
Initialize model
var model = new GBDT(loss="mse",max_depth=3,min_sample_split=10,lambda=1.0,gamma=0.0,learning_rate=0.1,n_estimators=100)
loss
: Loss function for gradient boosting.'mse'
is mean squared error for regression task and'log'
is log loss for classification task.loss_object
: Pass an object that extends theloss
trait to use customized loss. See source code for details.max_depth
: The maximum depth of a tree.min_sample_split
: The minimum number of samples required to further split a node.lambda
: The regularization coefficient for leaf score, also known as lambda.gamma
: The regularization coefficient for number of tree nodes, also know as gamma.learning_rate
: The learning rate of gradient boosting.n_estimators
: Number of trees.
Train
model.fit(train,target)
train
should be a DenseMatrix
of Double
, and target
should be a DenseVector
of Double
.
Returns Unit
.
Predict
var prediction = model.predict(test)
test
should be a DenseMatrix
of Double
.
Returns predictions as DenseVector
.
Customized loss
Define a object that inheritates the loss
trait (see source code), which specifies the activation function, the gradients and the hessian.
For example:
object log extends loss{
def act(score:DenseVector[Double]): DenseVector[Double] ={
1.0/(exp(-score)+1.0)
}
def g(score:DenseVector[Double],target:DenseVector[Double]):DenseVector[Double]={
var pred = act(score)
pred-target
}
def h(score:DenseVector[Double],target:DenseVector[Double]):DenseVector[Double]={
var pred = act(score)
pred*:*(1.0-pred)
}
}
g
is gradient and h
is hessian.
And the object could be passed when initializing the model.
var model = new GBDT(loss_object=log,learning_rate=0.1,n_estimators=100)