GBDT speed benchmarks

This benchmark implements optimization algorithm on a pre-defined grid of hyper-parameters for every library. It measures time to perform boosting iteration and metric's value on test set using library primitives. XGBoost benchmark code served as a starting point for this benchmark.

Library dependencies

xgboost==0.80
matplotlib==2.2.2
tqdm==4.26.0
scipy==1.1.0
lightgbm==2.2.1
pandas==0.23.4
catboost==0.10.3
numpy==1.15.2
scikit_learn==0.20.0

TL;DR

How to run all experiments in one line and get table with results. It will run each library on each dataset (from list: abalone, airline, epsilon, higgs, letters, msrank, msrank-classification, synthetic, synthetic-5k-features) with max_depth=6 three times (with learning rate 0.03, 0.05 and 0.15) for 1000 iterations, dump logs to directory 'logs' write all results to file result.json and create table with time per iteration and total learning time (in common-table.txt).

python benchmark.py --use-gpu

For CPU benchmark run without --use-gpu flag.

Run benchmark (how to define your own benchmark parameters)

First of all you need to define a hyperparameters grid in json file, (you may find grid example in the root directory named example_params_grid.json)

Here every line define a hyperparameter name and a list of values to iterate in experiments.

{
    "max_depth": [6, 10, 12],
    "learning_rate": [0.02, 0.05, 0.08],
    "reg_lambda": [1, 10],
    "subsample": [0.5, 1.0]
}

This is example of how you can run an experiments on GPU for all three libraries for dataset airline using grid ''example_params_grid.json'' with 5000 iterations.

python run.py --learners cat lgb xgb --experiment airline --params-grid example_params_grid.json --iterations 5000 --use-gpu

Logs by default will be written to directory 'logs', its compressed version (containing only timestamps and quality values on iteration) will be written to file 'result.json'.

Supported datasets

Higgs, Epsilon, MSRank (MSLR-WEB10K), CoverType, Airline, Synthetic, Abalone, Letters.

Names for script run.py:

abalone,airline,cover-type,epsilon,higgs,letters,msrank,msrank-classification,syntetic,synthetic-5k-features

Draw plots:

It is supported to draw four types of plots:

Learning curves for top N runs between all experiments (parameter name -- best).
Box plot of time per iteration for each library (time-per-iter).
Draw time to achieve percent from best quality (quality-vs-time).
Learning curves for concrete experiments (custom).

Third method calculates the best achieved quality between all experiments. Then fix a grid of relative percents of this value. Then for each level of quality the method filter all experiments from grid search that reach it and compute median (circle point), minimum and maximum time when algorithm achieved that score.

Examples

Draw learning curves for 5 best experiments on dataset MSRank (multiclass mode):

python plot.py --type best --top 5 --from-iter 500 -i result.json -o msrank_mc_plots

Draw time curves with figure size 10x8 for experiments on Higgs dataset starting from 85% of best achieved quality to 100% in 30 points:

python plot.py --type quality-vs-time -f 10 8 -o higgs_plots -i result.json --low-percent 0.85 --num-bins 30

You can pass option --only-min in this mode, to draw only minimum time to achive percent of quality.

Draw time per iteration box plots for every library:

python plot.py --type time-per-iter -f 10 8 -o higgs_plots -i result.json

If you want to draw experiments of concrete method pass option --only method_name. Choices for method_name are cat, xgb, lgb, cpu, gpu and combinations of those.

Experiments

Infrastructure

Used library versions: CatBoost -- 0.10.3, XGBoost -- 0.80, LightGBM -- 2.2.1.
Hardware for CPU experiments: Intel Xeon E312xx (Sandy Bridge) (32 cores).
Hardware for GPU experiments: NVIDIA GTX1080TI + Intel Core i7-6800K (12 cores).

Description

We took the most popular datasets between benchmarks in other libraries and kaggle competitions. Datasets present different task types (regression, classification, ranking), and have different sparsity number of features and samples. This set of datasets cover the majority of computation bottlenecks in gbdt algorithm, for example with large number of samples computation of loss function gradients may become bottleneck, with large number of features the phase of searching of the tree structure may become computational bottleneck. To demonstrate performance of each library in such experiments we created two synthetic datasets -- the first with 10 million objects and 100 features, the second with 100K objects and 5K features.

In such different tasks the optimal hyper parameters, for example maximal tree depth or learning rate, will be different. Thus we choose different hyper parameter grids for different datasets. The selection of the grid for optimization consisted of the following steps:

Run each library with different maximum depth (or number of leaves in case of LightGBM) and learning rate.
Select best tries for each library and run with different sample rate and regularization parameters.
Create grid based on all observations. It must contain all best tries for every algorithm.

Learning curves, distribution of time per iteration and other plots you may find in plots folder.

Datasets charateristics

Table with dataset characteristics. Sparsity values was calculated using the formula sparsity = (100 - 100 * float(not_null_count) / (pool.shape[0] * pool.shape[1])), where not_null_count is number of NaN, None or zero (abs(x) < 1e-6) values in dataset.

Dataset characteristics	#samples	#features	sparsity
Abalone	4.177	8	3.923
Letters	20.000	16	2.621
Epsilon	500.000	2000	3.399
Higgs	11.000.000	28	7.912
MSRank	1.200.192	137	37.951
Airlines	115.069.017	13	7.821
Synthetic (10M samples, 100 features)	10.000.000	100	0.089
Synthetic (100K samples, 5K features)	100.000	5000	0.079

Chosen grid

Parameters grid	iterations	max_depth	learning_rate	subsample	reg_lambda
Abalone	2000	[6, 8, 10]	[.03, .07, .15]	[.5, .75, 1]	[1]
Letters	2000	[6, 8, 10]	[.03, .07, .15]	[.5, .75, 1]	[1]
Epsilon	5000	[4,6,8]	[.09, .15, .25]	[.5, 1]	[1, 10, 100]
Higgs	5000	[6, 8, 10]	[.01, .03, .07, .15, .3]	[.25, .5, .75, 1]	[2, 4, 8, 16]
MSRank-RMSE	8000	[6, 10, 12]	[.02, .05, .08]	[.5, 1]	[1, 10]
MSRank-MultiClass	5000	[6, 8, 10]	[.03, .07, .15]	[.5, .75, 1]	[1]
Airlines	5000	[6, 8, 10]	[.05]	default	default
Synthetic (10M samples, 100 features)	5000	[6, 8, 10]	[.05, .15]	default	default
Synthetic (100K samples, 5K features)	1000	[6, 8]	[.05]	default	default

Table of quality values

Metric abbreviations: RMSE -- root mean squared error. Error = # incorrect classified samples / # samples.

Dataset	Metric	CatBoost	XGBoost	LightGBM
Abalone	RMSE	2.154	2.154	2.143
Letters	Error	0.020	0.026	0.023
Epsilon	Error	0.109	0.111	0.111
Higgs	Error	0.227	0.224	0.222
MSRank-RMSE	RMSE	0.737	0.738	0.739
MSRank-MultiClass	Error	0.424	0.425	0.424
Airlines	Error	0.188	0.190	0.176
Synthetic (10M samples, 100 features)	RMSE	8.363	8.287	9.354
Synthetic (100K samples, 5K features)	RMSE	6.814	10.39	11.060

Table with speed measurements. Here we present only experiments with max_depth = 6 (other parameters were set to default).

CPU time per iteration, seconds:

median time-per-iter,	catboost	xgboost	lightgbm
abalone	0.005 +/- 0.004	0.01 +/- 0.001	0.002
airline	7.08 +/- 2.684	6.293 +/- 0.378	5.002 +/- 0.375
epsilon	0.271 +/- 0.015	9.008 +/- 0.456	1.344 +/- 0.185
higgs	0.723 +/- 0.032	0.75 +/- 0.192	0.691 +/- 0.254
letters	0.265 +/- 0.011	0.125 +/- 0.025	0.074 +/- 0.02
msrank	0.092 +/- 0.004	0.256 +/- 0.09	0.131 +/- 0.027
msrank-classification	0.565 +/- 0.02	1.701 +/- 0.198	0.95 +/- 0.284
synthetic	0.541 +/- 0.079	1.199 +/- 0.23	1.097 +/- 0.242
synthetic-5k-features	0.166 +/- 0.011	12.481 +/- 3.716	1.1 +/- 0.213

CPU total training time 1000 iterations, seconds:

total, sec	catboost	xgboost	lightgbm
abalone	6.57	13.70	6.43
airline	974.107	611.934	526.182
epsilon	266.77	8935.36	1323.40
higgs	733.27	804.18	788.64
letters	259.75	176.82	112.74
msrank	89.72	341.99	185.26
msrank-classification	55.349	174.726	92.670
synthetic	593.87	1327.40	1154.13
synthetic-5k-features	166.57	14951.47	1135.94

GPU time per iteration, seconds:

median time-per-iter	catboost	xgboost	lightgbm
abalone	0.006 ± 0.001	0.005 ± 0.001	0.004 ± 0.016
airline	0.486 ± 0.016	0.932 ± 0.094	3.641 ± 0.612
epsilon	0.045 ± 0.005	2.686 ± 0.283	0.659 ± 0.107
higgs	0.056 ± 0.001	0.111 ± 0.019	0.166 ± 0.06
letters	0.009	0.034 ± 0.008	0.349 ± 0.099
msrank	0.011 ± 0.001	0.039 ± 0.006	0.053 ± 0.013
msrank-classification	0.022 ± 0.001	0.366 ± 0.021	0.191 ± 2.961
synthetic	0.035 ± 0.003	0.209 ± 0.022	0.112 ± 0.03
synthetic-5k-features	0.027 ± 0.002	1.357 ± 0.149	1.51 ± 0.148

GPU total training time 1000 iterations, seconds:

total, sec	catboost	xgboost	lightgbm
abalone	5.763	5.764	23.109
airline	485.825	928.99	3716.344
epsilon	44.971	2661.741	686.4
higgs	55.601	110.485	166.228
letters	9.271	36.79	115.383
msrank	11.057	40.003	60.688
msrank-classification	22.156	372.541	204.634
synthetic	36.658	210.305	123.255
synthetic-5k-features	28.014	1353.212	1557.471

Third Party Licenses

This repository contains modified versions of files released by third parties under other licenses. See the corresponding notifications in these files for the details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

GBDT speed benchmarks

Library dependencies

TL;DR

Run benchmark (how to define your own benchmark parameters)

Supported datasets

Draw plots:

Examples

Experiments

Infrastructure

Description

Datasets charateristics

Chosen grid

Table of quality values

CPU time per iteration, seconds:

CPU total training time 1000 iterations, seconds:

GPU time per iteration, seconds:

GPU total training time 1000 iterations, seconds:

Third Party Licenses

Files

README.md

Latest commit

History

README.md

File metadata and controls

GBDT speed benchmarks

Library dependencies

TL;DR

Run benchmark (how to define your own benchmark parameters)

Supported datasets

Draw plots:

Examples

Experiments

Infrastructure

Description

Datasets charateristics

Chosen grid

Table of quality values

CPU time per iteration, seconds:

CPU total training time 1000 iterations, seconds:

GPU time per iteration, seconds:

GPU total training time 1000 iterations, seconds:

Third Party Licenses