
The truth: CatBoost slower than LightGBM? #505

Closed
thomas9000 opened this issue Oct 19, 2018 · 4 comments

Comments

@thomas9000

Problem: A review at the following website claims that CatBoost is slower than LightGBM:

https://blog.griddynamics.com/xgboost-vs-catboost-vs-lightgbm-which-is-best-for-price-prediction/

Is this true?

Operating System: Windows
CPU: Intel

@khrisanfov

LightGBM is much faster than CatBoost on CPU; in my task it is about 10x faster. But on GPU, CatBoost is faster than LightGBM and supports multiclass learning, which is a killer feature for me.
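For reference, a minimal sketch of what the GPU/multiclass setup looks like with the CatBoost Python package (the training data variables are placeholders):

```python
from catboost import CatBoostClassifier

# task_type="GPU" moves training off the default CPU backend;
# "MultiClass" is CatBoost's built-in multiclass objective.
model = CatBoostClassifier(
    iterations=500,
    task_type="GPU",
    loss_function="MultiClass",
)
# model.fit(X_train, y_train)  # X_train / y_train are placeholder arrays
```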

@annaveronika
Contributor

On GPU we are always faster than all other libraries.

On CPU the speed depends heavily on dataset properties.
First of all, we don't have specific support for sparse data (we are currently working on that), so we will be slower on sparse data.

Then there are several details of the algorithm that we do differently because they improve quality.
For small datasets (<50k objects) we use ordered boosting. It helps prevent overfitting, but it is 2-3 times slower than classical boosting, so we will be slower on small datasets.
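For illustration, a minimal sketch of switching between the two schemes through the `boosting_type` parameter of the CatBoost Python API (data variables are placeholders):

```python
from catboost import CatBoostRegressor

# Ordered boosting: default for small datasets, better overfitting
# protection, but roughly 2-3x slower to train.
ordered_model = CatBoostRegressor(boosting_type="Ordered", iterations=1000)

# Plain boosting: the classical scheme, comparable to other GBDT libraries.
plain_model = CatBoostRegressor(boosting_type="Plain", iterations=1000)

# ordered_model.fit(X_train, y_train)
# plain_model.fit(X_train, y_train)
```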

For datasets with a small number of features (<= 10) we are also slower, because we use a specific procedure for calculating leaf values, and it becomes visible in the profiler when there are few features. This procedure also helps boost quality.

For other datasets - large dense datasets with many features - we should be more or less on par with LightGBM. An example of a dataset where CatBoost is faster than LightGBM is the Epsilon dataset: it has 2000 features and 400k samples, and on it CatBoost is 2 times faster than LightGBM.

About the blog post - thanks for pointing it out. The timings are high for CatBoost because its features are marked as categorical, while for the other algorithms they are marked as numeric. Having categorical features is not free in terms of time - we generate many combinations of categorical features, and for each combination we generate several numerical features. If you treat these features as float (the same as in the other libraries), the difference in time will be much smaller.

As for quality - if I understand correctly, they used dates as categorical features for CatBoost and as numeric features in the other libraries. It's not always a good idea to use dates as categorical features. So to get comparable quality and speed, you need to set all features to numeric for all libraries.
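A minimal sketch of the difference on the CatBoost side (the column indices and data variables below are hypothetical):

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=1000)

# All features treated as numeric - the setup comparable to LightGBM/XGBoost:
# model.fit(X_train, y_train)

# Some columns (hypothetically 0 and 3, e.g. the date columns) marked as
# categorical - extra time goes into building categorical feature combinations:
# model.fit(X_train, y_train, cat_features=[0, 3])
```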

One more thing that appears to be important for this particular dataset is binarization. It is 256 by default for LightGBM and XGBoost and 128 for CatBoost. It usually doesn't have a large influence on quality, so we didn't include it on our parameter tuning page. But in this case it's better to set it to the same value as in the other libraries.
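A sketch of aligning the binarization setting across the two libraries (parameter names from the CatBoost and LightGBM Python APIs; 255 is just an example value):

```python
from catboost import CatBoostRegressor
import lightgbm as lgb

# CatBoost controls the number of borders via border_count;
# LightGBM controls the number of histogram bins via max_bin.
# Setting them to the same value makes the speed/quality comparison fairer.
cb_model = CatBoostRegressor(border_count=255)
lgb_model = lgb.LGBMRegressor(max_bin=255)
```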

If you do that and run parameter tuning again, CatBoost will win on quality. And it will no longer be the slowest by far, though it still won't be the fastest, because the dataset has few features.

@thomas9000
Author

OK, thank you for your explanation.

@pankaj-kvhld

@annaveronika : you guys have done a marvelous job. Kudos!
