Multi-threaded pool creation #385
Comments
You can try reading the dataset from a file in multithreaded mode; this should help. It will also reduce memory consumption, since you will not have to keep a copy of the dataset in dataframe format.
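A minimal sketch of the file-based route suggested above, assuming a tab-separated dataset without a header row and a CatBoost column description file. The helper names `write_dataset` and `load_pool`, the file names, and the thread count are illustrative assumptions, not part of the library; `Pool` is called with its `column_description` and `thread_count` arguments, which should be checked against the documentation for your installed version.

```python
import os


def write_dataset(rows, cat_columns, label_column, directory):
    """Write rows as a TSV file plus a matching column description file."""
    data_path = os.path.join(directory, "train.tsv")  # hypothetical file name
    cd_path = os.path.join(directory, "train.cd")     # hypothetical file name

    # One object per line, values separated by tabs, no header row.
    with open(data_path, "w") as f:
        for row in rows:
            f.write("\t".join(str(value) for value in row) + "\n")

    # Column description: "<column index>\t<type>" per line.
    # Columns that are not listed are treated as numerical.
    with open(cd_path, "w") as f:
        f.write(f"{label_column}\tLabel\n")
        for index in cat_columns:
            f.write(f"{index}\tCateg\n")

    return data_path, cd_path


def load_pool(data_path, cd_path, threads=4):
    """Build a Pool from the files, letting CatBoost parse them in parallel."""
    # Imported here so that write_dataset can be used without catboost installed.
    from catboost import Pool

    return Pool(data=data_path, column_description=cd_path, thread_count=threads)
```

If the dataframe is only needed for experimentation, it could be written out once (for example with `DataFrame.to_csv(path, sep="\t", header=False, index=False)`) and the Pool rebuilt from the file afterwards, so the dataframe copy does not have to stay in memory during training.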
My workflow is that I do some experimentation on the dataframe before running catboost.
We'll work on this, but it will not make it into the next release, so you'll have to wait for some time.
You can try using the FeaturesData structure to create a pool from a numpy array. It's included in the new version, which is published on PyPI.
@annaveronika Could you please explain further? I can't find any reference to this structure in the module documentation.
We provide Pool construction using the FeaturesData class. The Pool constructors are described at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/ and the structure itself at https://tech.yandex.com/catboost/doc/dg/concepts/python-features-data__desc-docpage/#python-features-data__desc
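A rough sketch of how the FeaturesData route might look, assuming the layout described in the linked documentation: numerical features as a 2-D `float32` array, and categorical features as a 2-D `object` array whose elements are `bytes`. The helper names `to_features_data_arrays` and `build_pool` are illustrative, not part of CatBoost, and the exact dtype requirements should be verified against the documentation for your version.

```python
import numpy as np


def to_features_data_arrays(num_values, cat_values):
    """Convert raw feature columns into the array layout FeaturesData expects."""
    # Numerical features: 2-D array of float32, one row per object.
    num = np.asarray(num_values, dtype=np.float32)

    # Categorical features: 2-D object array with each value encoded as bytes.
    cat = np.array(
        [[str(value).encode("utf-8") for value in row] for row in cat_values],
        dtype=object,
    )
    return num, cat


def build_pool(num_values, cat_values, label):
    """Create a Pool from numpy arrays without going through a dataframe."""
    # Imported here so that the conversion helper works without catboost installed.
    from catboost import FeaturesData, Pool

    num, cat = to_features_data_arrays(num_values, cat_values)
    features = FeaturesData(num_feature_data=num, cat_feature_data=cat)
    return Pool(data=features, label=label)
```

Starting from a dataframe, the inputs could be taken as `df[num_cols].to_numpy()` and `df[cat_cols].to_numpy()`, where `num_cols` and `cat_cols` are hypothetical lists of column names. Whether this is faster than passing the dataframe directly depends on the dataset and the CatBoost version, so it is worth timing both.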
With large datasets, creating the Pool data structure from a dataframe (a Pandas dataframe in Python, although the same holds in R) takes a very long time, often dominating the other parts of training and especially prediction.
I noticed that this process is single-threaded and was wondering if it was possible to optimise it in some way.