
The truth: CatBoost slower than LightGBM? #505

Closed
thomas9000 opened this issue Oct 19, 2018 · 4 comments

Comments

@thomas9000

Problem: A review at the following website claims that CatBoost is slower than LightGBM:

https://blog.griddynamics.com/xgboost-vs-catboost-vs-lightgbm-which-is-best-for-price-prediction/

Is this true?

Operating System: Windows
CPU: Intel

@khrisanfov

LightGBM is much faster than CatBoost on CPU; in my task it is about 10x faster. But on GPU, CatBoost is faster than LightGBM and supports multiclass learning, which is a killer feature for me.
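For reference, a minimal sketch of what the GPU/multiclass setup looks like with the CatBoost Python package (the training data variables are placeholders):

```python
from catboost import CatBoostClassifier

# task_type="GPU" moves training off the default CPU backend;
# "MultiClass" is CatBoost's built-in multiclass objective.
model = CatBoostClassifier(
    iterations=500,
    task_type="GPU",
    loss_function="MultiClass",
)
# model.fit(X_train, y_train)  # X_train / y_train are placeholder arrays
```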

@annaveronika
Contributor

On GPU we are always faster than all other libraries.

On CPU the speed depends heavily on dataset properties.
First of all, we don't have specific support for sparse data (we are currently working on that), so we will be slower on sparse data.

Then there are several details of the algorithm that we do differently because they improve quality.
For small datasets (<50k objects) we use ordered boosting. It helps prevent overfitting, but it is 2-3 times slower than classical boosting, so we will be slower on small datasets.
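For illustration, a minimal sketch of switching between the two schemes through the `boosting_type` parameter of the CatBoost Python API (data variables are placeholders):

```python
from catboost import CatBoostRegressor

# Ordered boosting: default for small datasets, better overfitting
# protection, but roughly 2-3x slower to train.
ordered_model = CatBoostRegressor(boosting_type="Ordered", iterations=1000)

# Plain boosting: the classical scheme, comparable to other GBDT libraries.
plain_model = CatBoostRegressor(boosting_type="Plain", iterations=1000)

# ordered_model.fit(X_train, y_train)
# plain_model.fit(X_train, y_train)
```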

For datasets with a small number of features (<= 10) we are also slower, because we use a specific procedure for calculating leaf values, and it becomes visible in the profiler when there are few features. This procedure also helps boost quality.

For other datasets - large dense datasets with many features - we should be more or less on par with LightGBM. An example of a dataset where CatBoost is faster than LightGBM is the Epsilon dataset: it has 2000 features and 400k samples, and on it CatBoost is 2 times faster than LightGBM.

About the blog post - thanks for pointing it out. The timings are high for CatBoost because its features are marked as categorical, while for the other algorithms they are marked as numeric. Having categorical features is not free in terms of time - we generate many combinations of categorical features, and for each combination we generate several numerical features. If you treat these features as float (the same as in the other libraries), the difference in time will be much smaller.

As for quality - if I understand correctly, they used dates as categorical features for CatBoost and as numeric features in the other libraries. It's not always a good idea to use dates as categorical features. So to get comparable quality and speed, you need to set all features to numeric for all libraries.
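A minimal sketch of the difference on the CatBoost side (the column indices and data variables below are hypothetical):

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(iterations=1000)

# All features treated as numeric - the setup comparable to LightGBM/XGBoost:
# model.fit(X_train, y_train)

# Some columns (hypothetically 0 and 3, e.g. the date columns) marked as
# categorical - extra time goes into building categorical feature combinations:
# model.fit(X_train, y_train, cat_features=[0, 3])
```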

One more thing that appears to be important for this particular dataset is binarization. It is 256 by default for LightGBM and XGBoost and 128 for CatBoost. It usually doesn't have a large influence on quality, so we didn't include it on our parameter tuning page. But in this case it's better to set it to the same value as in the other libraries.
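A sketch of aligning the binarization setting across the two libraries (parameter names from the CatBoost and LightGBM Python APIs; 255 is just an example value):

```python
from catboost import CatBoostRegressor
import lightgbm as lgb

# CatBoost controls the number of borders via border_count;
# LightGBM controls the number of histogram bins via max_bin.
# Setting them to the same value makes the speed/quality comparison fairer.
cb_model = CatBoostRegressor(border_count=255)
lgb_model = lgb.LGBMRegressor(max_bin=255)
```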

If you do that and run parameter tuning again, CatBoost will win on quality. And it will no longer be the slowest by far, though it still won't be the fastest, because the dataset has few features.

@thomas9000
Author

OK, thank you for your explanation.

@pankaj-kvhld

@annaveronika : you guys have done a marvelous job. Kudos!
