Multi-threaded pool creation #385

ctlaltdefeat · 2018-06-13T19:40:17Z

With large datasets, creating the Pool data structure from a dataframe (a pandas DataFrame in Python, although the same holds in R) takes a very long time, often longer than the rest of training and especially prediction.

I noticed that this process is single-threaded and was wondering if it was possible to optimise it in some way.

annaveronika · 2018-06-13T19:44:19Z

You can try reading the dataset from a file in multithreaded mode; this should help. It will also reduce memory consumption, since you will not have to store a copy of the dataset in dataframe format.
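A minimal sketch of the file-based route, assuming a recent `catboost` release in which `Pool` accepts a file path together with the `column_description` and `thread_count` parameters. The helper names `dump_for_catboost` and `load_pool`, the file names, and the toy dataframe are hypothetical and only illustrate the idea.

```python
import os
import tempfile

import pandas as pd


def dump_for_catboost(df, label_col, cat_cols, out_dir):
    """Write df as a headerless TSV plus a column description (.cd) file."""
    data_path = os.path.join(out_dir, "train.tsv")
    cd_path = os.path.join(out_dir, "train.cd")
    df.to_csv(data_path, sep="\t", header=False, index=False)
    with open(cd_path, "w") as cd:
        for idx, name in enumerate(df.columns):
            if name == label_col:
                cd.write(f"{idx}\tLabel\n")
            elif name in cat_cols:
                cd.write(f"{idx}\tCateg\n")
            # Columns not listed are treated as numerical features.
    return data_path, cd_path


def load_pool(data_path, cd_path, threads=4):
    """Build a Pool from the files, reading them with several threads."""
    from catboost import Pool  # imported lazily; requires catboost

    return Pool(data_path, column_description=cd_path, thread_count=threads)


# Hypothetical toy data standing in for a large dataframe.
df = pd.DataFrame(
    {"y": [1, 0, 1], "x1": [0.5, 1.5, 2.5], "color": ["red", "blue", "red"]}
)
out_dir = tempfile.mkdtemp()
data_path, cd_path = dump_for_catboost(df, "y", {"color"}, out_dir)
```

Calling `load_pool(data_path, cd_path)` would then construct the pool from disk; whether this is faster overall depends on how long the extra write takes.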

ctlaltdefeat · 2018-06-13T20:26:39Z

My workflow involves some experimentation on the dataframe before running catboost.
This makes reading from a file difficult, because every new operation on the data would require writing it to disk before applying catboost.

annaveronika · 2018-06-15T13:55:02Z

We'll work on this, but it will not happen in the next release, so you'll have to wait for some time.

annaveronika · 2018-07-06T10:30:27Z

You can try using the FeaturesData structure to create a pool from a numpy array. It's included in the new version, which is published on PyPI.

ctlaltdefeat · 2018-07-07T00:06:30Z

@annaveronika Could you please explain further? I can't find any reference to this structure in the module documentation.

annaveronika · 2018-07-09T09:44:01Z

We provided pool construction using the FeaturesData class. The Pool constructors are described at https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_pool-docpage/ and the structure itself at https://tech.yandex.com/catboost/doc/dg/concepts/python-features-data__desc-docpage/#python-features-data__desc
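As a rough sketch of what that looks like from a dataframe, assuming a recent `catboost` release in which `FeaturesData` takes `num_feature_data` as a float32 array and `cat_feature_data` as an object array (older versions may require `bytes` rather than `str` values for categorical data). The helper names `split_features` and `build_pool` and the toy dataframe are hypothetical.

```python
import numpy as np
import pandas as pd


def split_features(df, cat_cols):
    """Split a dataframe into a float32 numeric array and an object array."""
    num_cols = [c for c in df.columns if c not in cat_cols]
    num = df[num_cols].to_numpy(dtype=np.float32)
    cat = df[list(cat_cols)].astype(str).to_numpy(dtype=object)
    return num, cat, num_cols


def build_pool(num, cat, labels, num_names, cat_names):
    """Wrap the arrays in FeaturesData and build a Pool without a dataframe."""
    from catboost import FeaturesData, Pool  # imported lazily; requires catboost

    features = FeaturesData(
        num_feature_data=num,
        cat_feature_data=cat,
        num_feature_names=num_names,
        cat_feature_names=cat_names,
    )
    return Pool(data=features, label=labels)


# Hypothetical toy data; the label column is kept separately.
df = pd.DataFrame(
    {"x1": [0.5, 1.5, 2.5], "x2": [1, 2, 3], "color": ["red", "blue", "red"]}
)
labels = [1, 0, 1]
cat_cols = ["color"]
num, cat, num_cols = split_features(df, cat_cols)
```

Calling `build_pool(num, cat, labels, num_cols, cat_cols)` would then create the pool directly from the in-memory arrays, which avoids the round trip through disk.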


annaveronika added the planned label Jun 15, 2018

annaveronika closed this as completed Jul 6, 2018

RunxingZhong mentioned this issue Nov 30, 2023

Request for Multithreading Support in CatBoost Pool Dataset Construction #2542

Closed

andrey-khropov added the performance label Feb 6, 2024

robot-piglet pushed a commit that referenced this issue Feb 7, 2024

Process numerical features in numpy.ndarray for float32 in C++ code asynchronously with categorical features and in parallel. #385, Fix #2542. (588e1bc)
