Multi-threaded pool creation #385

Closed
ctlaltdefeat opened this issue Jun 13, 2018 · 6 comments

Comments

@ctlaltdefeat

With large datasets, creating the Pool data structure from a dataframe takes a very long time (I'm working in Python with a pandas DataFrame, but the same holds in R). Pool creation often dominates the other parts of training, and especially prediction.

I noticed that this process is single-threaded and was wondering if it was possible to optimise it in some way.

@annaveronika
Contributor

You can try reading the dataset from a file in multithreaded mode; this should help. It will also reduce memory consumption, since you will not have to store a copy of the dataset in dataframe format.
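A minimal sketch of the file-based route suggested here, assuming a tab-separated data file plus a column-description file. The helper names (`write_tsv_with_cd`, `load_pool_from_file`), the file names, and the thread count are hypothetical. The `thread_count` argument of `Pool` is documented as applying only when data is read from a file; check it against the version you have installed.

```python
import csv
import os


def write_tsv_with_cd(rows, label_index, cat_indices, out_dir):
    """Write rows to a TSV file and a matching column-description file.

    Columns not listed in the description file are treated as numeric.
    """
    data_path = os.path.join(out_dir, "train.tsv")  # hypothetical file name
    cd_path = os.path.join(out_dir, "train.cd")     # hypothetical file name
    with open(data_path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
    with open(cd_path, "w") as f:
        f.write(f"{label_index}\tLabel\n")
        for idx in cat_indices:
            f.write(f"{idx}\tCateg\n")
    return data_path, cd_path


def load_pool_from_file(data_path, cd_path, threads=4):
    """Build a Pool from the files, letting CatBoost parse them with several threads."""
    from catboost import Pool  # imported lazily so the helper above works without catboost

    return Pool(data_path, column_description=cd_path, thread_count=threads)
```

With a pandas dataframe, the same files could be produced with `df.to_csv(path, sep="\t", header=False, index=False)`. Note that this still requires a write to disk after each change to the data.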

@ctlaltdefeat
Author

ctlaltdefeat commented Jun 13, 2018

My workflow involves experimenting on the dataframe before running CatBoost.
That makes reading from a file difficult: after every new operation on the data, I would have to write it to disk before applying CatBoost.

@annaveronika
Contributor

We'll work on this, but it will not happen in the next release, so you'll have to wait for some time.

@annaveronika
Contributor

You can try using the FeaturesData structure to create a Pool from a NumPy array. It's included in the new version, which is published on PyPI.

@ctlaltdefeat
Author

@annaveronika Could you please explain further? I can't find any reference to this structure in the module documentation.

@annaveronika
Contributor

We added Pool construction using the FeaturesData class. The Pool constructors are described at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/ and the structure itself at https://tech.yandex.com/catboost/doc/dg/concepts/python-features-data__desc-docpage/#python-features-data__desc
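A rough sketch of how the FeaturesData route might look in practice, under these assumptions: numerical features go in as a 2-D `float32` NumPy array and categorical features as a 2-D `object` array. The helper names (`build_feature_arrays`, `pool_from_arrays`) are hypothetical and not part of CatBoost. Treat the linked documentation as the authority on the exact requirements.

```python
import numpy as np


def build_feature_arrays(num_columns, cat_columns):
    """Stack per-column sequences into one numeric and one categorical array.

    Returns (num, cat); either is None when no columns of that kind are given.
    """
    num = None
    cat = None
    if num_columns:
        # Numerical features: one float32 column per input sequence.
        num = np.column_stack([np.asarray(c, dtype=np.float32) for c in num_columns])
    if cat_columns:
        # Categorical features: kept as Python objects (e.g. strings).
        cat = np.column_stack([np.asarray(c, dtype=object) for c in cat_columns])
    return num, cat


def pool_from_arrays(num, cat, label):
    """Wrap the arrays in FeaturesData and build a Pool without touching disk."""
    from catboost import FeaturesData, Pool  # imported lazily; requires catboost

    features = FeaturesData(num_feature_data=num, cat_feature_data=cat)
    return Pool(data=features, label=label)
```

Starting from a pandas dataframe, the inputs could be taken as `df[num_cols].to_numpy(dtype=np.float32)` and `df[cat_cols].to_numpy(dtype=object)`, where `num_cols` and `cat_cols` are your own lists of column names. Whether this is faster than passing the dataframe directly is worth measuring on your own data.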


3 participants