
Request for Multithreading Support in CatBoost Pool Dataset Construction #2542

Closed
RunxingZhong opened this issue Nov 29, 2023 · 19 comments

@RunxingZhong

Hello,

I've noticed that CatBoost's Pool class uses only a single thread during the construction of datasets. I was wondering if there's any possibility to support multithreading in this process?

The reason for this request is that, when dealing with large datasets, I find that the time taken to build the Pool is actually longer than the time it takes for training the model. This seems inefficient and likely not the intended behavior.

Implementing multithreading could significantly reduce the dataset construction time, leading to a more streamlined and efficient workflow.

Thank you for considering this enhancement.

Best regards

@andrey-khropov
Member

The reason for this request is that, when dealing with large datasets, I find that the time taken to build the Pool is actually longer than the time it takes for training the model.

This should not be the case; in fact, Pool construction uses multithreading internally where it can (in the C++ part of the code). When data is stored in efficient Python structures like numpy ndarrays and pandas DataFrames, Pool creation speed should not be an issue.
Pool construction is also multithreaded when data is loaded directly from files.

Python-to-C++ conversion for data structures like Python's list will be slower, because this part has to call the CPython interpreter API and cannot run effectively on multiple threads due to CPython's Global Interpreter Lock (GIL).
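As a rough illustration of the difference (a sketch, not CatBoost's actual code path): a numpy ndarray exposes a single contiguous buffer that C++ threads can read directly, while a nested Python list is a graph of boxed PyObjects that must be walked through the CPython API under the GIL:

```python
import numpy as np

# One contiguous float32 buffer: C++ code can read it directly from
# multiple threads without touching the Python interpreter.
X_nd = np.random.random((1000, 20)).astype(np.float32)

# Nested Python lists: every element is a boxed PyObject, so any
# conversion must make per-element CPython API calls under the GIL.
X_list = X_nd.tolist()

print(X_nd.flags['C_CONTIGUOUS'])  # True
print(type(X_list[0][0]))          # <class 'float'> - boxed objects
```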

If you provide a more detailed example (maybe code similar to what you use, but with synthetic data instead of the real data) that in your opinion should work faster, I could look into it.

If your data is big, you might also consider running distributed CatBoost using Apache Spark.

@RunxingZhong
Author

RunxingZhong commented Nov 30, 2023

I am using numpy with float32 for continuous values, but I find it surprisingly slow, which is not what I expected. First of all, I can confirm that when using a GPU, our CatBoost training is indeed very fast. However, the data construction process is extremely slow. Let's focus on comparing the data construction time of the different methods. Here is my code:

import time
import numpy as np
import pandas as pd
from catboost import FeaturesData, Pool

np.random.seed(42)
n, m = int(1e7), 1000
X = np.random.random((n, m)).astype(np.float32)
y = np.random.random(n).astype(np.float32)
print(f'{np.prod(X.shape) * 4:,.0f}')

# -------------------- method_1 --------------------
t1 = time.time()
dtrain_1 = Pool(X, y)
t2 = time.time()
print(f'method_1 = {t2 - t1:.2f}')

# -------------------- method_2 --------------------
t1 = time.time()
dtrain_2 = Pool(FeaturesData(X), y)
t2 = time.time()
print(f'method_2 = {t2 - t1:.2f}')

# -------------------- method_3 --------------------
t1 = time.time()
dtrain_3 = Pool(pd.DataFrame(X), pd.Series(y))
t2 = time.time()
print(f'method_3 = {t2 - t1:.2f}')

In this example, I used about 40GB of data (in fact, I will use even larger datasets for training). You can adjust the size of 'n' based on the memory of your machine. During the test run, the time taken by each method is as follows:

method_1 = 245s
method_2 = 251s
method_3 = 101s

The time taken by method_1 and method_2 is too long, especially considering that my subsequent training time might be less than 5 minutes.

Interestingly, I found that using pandas DataFrame was the fastest. Why is this the case? Also, during the process, I observed that the data construction only used 1 CPU. Is there a way to utilize multiple CPUs for data construction? Please help me speed up this process.

@RunxingZhong
Author

Hi @andrey-khropov, please test this code~

@RunxingZhong
Author

A similar issue was discussed in 2018: #385

@andrey-khropov
Member

andrey-khropov commented Nov 30, 2023

In this example, I used about 40GB of data (in fact, I will use even larger datasets for training). You can adjust the size of 'n' based on the memory of your machine. During the test run, the time taken by each method is as follows:

method_1 = 245s
method_2 = 251s
method_3 = 101s

The time taken by method_1 and method_2 is too long, especially considering that my subsequent training time might be less than 5 minutes.

Can you specify your hardware configuration (CPUs, GPUs, total RAM), Operating system (with version), python version, CatBoost version, numpy version, pandas version?

Also, for method_2 and method_3 you measure FeaturesData, pandas.DataFrame, and pandas.Series creation together with Pool creation. Although I expect FeaturesData creation to be very fast, it would still be interesting to measure the creation of these data structures separately from the Pool construction call.

Interestingly, I found that using pandas DataFrame was the fastest. Why is this the case?

Because CatBoost likes a columnar layout of the data (i.e. when the data for each feature is stored in a contiguous array; in that case CatBoost uses the data as is, without copying or additional transposition internally).

You can try specifying the additional parameter order='F' in your numpy astype calls to get this layout for numpy data. This should make Pool construction, as well as pandas.DataFrame construction, faster.
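For example (a minimal sketch; the Pool call is left as a comment since it needs a CatBoost install), the layout difference is visible in the array flags:

```python
import numpy as np

n, m = 1000, 50
X_c = np.random.random((n, m)).astype(np.float32)             # row-major (C order)
X_f = np.random.random((n, m)).astype(np.float32, order='F')  # column-major (F order)

# In the F-order array each feature column is one contiguous block,
# which is the layout CatBoost can use as is, without an internal transpose.
print(X_c.flags['F_CONTIGUOUS'])  # False
print(X_f.flags['F_CONTIGUOUS'])  # True

# from catboost import Pool
# Pool(X_f, y)  # expected to skip the copy/transposition step
```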

@RunxingZhong
Author

Thank you for the information provided. My system configuration is as follows: CPUs = 256, GPUs = 8 * A100 (80G), RAM = 2TB, running Python 3.9 on a Linux system. I have updated numpy, pandas, and catboost to their latest versions.

In methods 2 and 3, constructing FeaturesData, pandas.DataFrame, and pandas.Series takes almost no time, because they only set a reference to the data. I tested it, and it took only about 100 microseconds, which is even less than 1 millisecond. This is consistent with the time taken when testing Pool separately. You can try it yourself.

-------------------- method_1 --------------------
t0 = time.time()
X = X.astype(np.float32, order = 'F')
t1 = time.time()
dtrain_1 = Pool(X, y)
t2 = time.time()
print(f'method_1_pre = {t1 - t0:.2f}')
print(f'method_1 = {t2 - t1:.2f}')

At this point, the time taken is:
method_1_pre = 145s
method_1 = 5ms (which seems to be the target format? I tested it, and indeed, the training starts very quickly.)

However, the line X = X.astype(np.float32, order = 'F') itself takes xx seconds. In actual use, I read data in the default numpy layout as np.float32, so the actual time taken for data construction is 145 seconds, which is still slower than using pandas.DataFrame (100s).

@andrey-khropov
Member

Thank you for the information provided. My system configuration is as follows: CPUs = 256, GPUs = 8 * A100 (80G), RAM = 2TB, running Python 3.9 on a Linux system. I have updated numpy, pandas, and catboost to their latest versions.

What is the exact model of the CPUs? Do you use an SMP system with several CPU sockets (each with multiple cores, of course)?

In methods 2 and 3, using FeaturesData, pandas.DataFrame, and pandas.Series almost doesn't consume any time. This is because it sets a reference. I tested it, and it only took about 100 microseconds, which is even less than 1 millisecond. This is consistent with the time taken when testing Pool separately. You can try it yourself.

-------------------- method_1 --------------------
t0 = time.time()
X = X.astype(np.float32, order = 'F')
t1 = time.time()
dtrain_1 = Pool(X, y)
t2 = time.time()
print(f'method_1_pre = {t1 - t0:.2f}')
print(f'method_1 = {t2 - t1:.2f}')

At this point, the time taken is:
method_1_pre = 145s
method_1 = 5ms (which seems to be the target format? I tested it, and indeed, the training starts very quickly.)

That confirms that the time is spent transposing the array. Let me check if we can improve the performance here.

Meanwhile you can:

  • Try tricks that make the relayout faster. Changing order to 'F' is basically a transposition at the physical level, and unfortunately numpy uses a single-threaded implementation for it. See, for example, the discussion here, and then use reshape to return to the original logical shape.
  • Persist numpy ndarrays in 'F' order like that if you use the same data repeatedly for training multiple times.
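Both suggestions can be sketched as follows (an illustrative sketch, not CatBoost code: `parallel_asfortranarray` is a hypothetical helper name, and the speedup assumes NumPy releases the GIL during the bulk slice copies, so plain Python threads can run them in parallel):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_asfortranarray(X, n_jobs=8):
    """Copy X into a freshly allocated F-order array, one column block
    per thread; NumPy releases the GIL for the bulk slice copies."""
    out = np.empty(X.shape, dtype=X.dtype, order='F')
    bounds = np.linspace(0, X.shape[1], n_jobs + 1).astype(int)

    def copy_block(k):
        out[:, bounds[k]:bounds[k + 1]] = X[:, bounds[k]:bounds[k + 1]]

    with ThreadPoolExecutor(n_jobs) as ex:
        list(ex.map(copy_block, range(n_jobs)))
    return out

X = np.random.random((2000, 100)).astype(np.float32)
Xf = parallel_asfortranarray(X)
assert Xf.flags['F_CONTIGUOUS'] and np.array_equal(Xf, X)

# Persisting the F-order array: the .npy format stores a fortran_order
# flag in its header, so np.load restores the same physical layout.
path = os.path.join(tempfile.gettempdir(), 'X_f.npy')
np.save(path, Xf)
assert np.load(path).flags['F_CONTIGUOUS']
```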

However, the line X = X.astype(np.float32, order = 'F') takes xx seconds. In actual use, I read in the default numpy array format, np.float32. So, the actual time taken for data construction is 145 seconds, which is still slower than using pandas.DataFrame(100s).

@andrey-khropov
Member

andrey-khropov commented Nov 30, 2023

However, the line X = X.astype(np.float32, order = 'F') takes xx seconds. In actual use, I read in the default numpy array format, np.float32. So, the actual time taken for data construction is 145 seconds, which is still slower than using pandas.DataFrame(100s).

Actually X = X.astype(np.float32, order = 'F') does two things at once: conversion from float64 (which is what np.random.random returns) and re-laying out the result in 'F' order. So if you want to estimate the preprocessing to F order for your data more realistically, you can split these two steps and check the time for the second stage only:

X = X.astype(np.float32)
t0 = time.time()
X = np.asfortranarray(X)
t1 = time.time()

@RunxingZhong
Author

Hello, I have another question. How can I use GPU for prediction? Because using CPU for prediction is relatively slow, and I would like to use GPU for prediction. Is it supported? You may refer to https://github.com/rapidsai/cuml for reference.

@RunxingZhong
Author

However, the line X = X.astype(np.float32, order = 'F') takes xx seconds. In actual use, I read in the default numpy array format, np.float32. So, the actual time taken for data construction is 145 seconds, which is still slower than using pandas.DataFrame(100s).

Actually X = X.astype(np.float32, order = 'F') does two things at once - conversion from float64 (that's what np.random.random returns) and reshaping the result in 'F' order. So if you want to estimate the preprocessing to F order for your data more realistically you can split these two and check the time for the second stage only:

X = X.astype(np.float32)
t0 = time.time()
X = np.asfortranarray(X)
t1 = time.time()

I've tested it, and the time is still around 140 seconds, because I pass X already as np.float32. I've resolved this issue by directly copying to a Fortran array using Numba. Thank you.

@andrey-khropov
Member

Hello, I have another question. How can I use GPU for prediction? Because using CPU for prediction is relatively slow, and I would like to use GPU for prediction. Is it supported? You may refer to https://github.com/rapidsai/cuml for reference.

Yes, you can, but only if you have only numeric features. Just pass task_type='GPU' to the predict method.

You can see the example in the tests here

@RunxingZhong
Author

Hello, I have another question. How can I use GPU for prediction? Because using CPU for prediction is relatively slow, and I would like to use GPU for prediction. Is it supported? You may refer to https://github.com/rapidsai/cuml for reference.

Yes, you can but only if you have only numeric features. You just pass task_type='GPU' to the predict method.

You can see the example in the tests here

I've tested it, and when predicting on a larger dataset, it takes approximately 10 seconds on CPU and 15 seconds on GPU, which is even slower. In both cases I pass numpy.float32 with order = 'F'.

Additionally, how can I specify which GPU to use during prediction, similar to specifying devices during training?
(Passing task_type='GPU' uses only the first GPU.)

@andrey-khropov
Member

Hello, I have another question. How can I use GPU for prediction? Because using CPU for prediction is relatively slow, and I would like to use GPU for prediction. Is it supported? You may refer to https://github.com/rapidsai/cuml for reference.

Yes, you can but only if you have only numeric features. You just pass task_type='GPU' to the predict method.
You can see the example in the tests here

I've tested it, and when predicting on a larger dataset, it takes approximately 10 seconds on CPU and 15 seconds on GPU, which is even slower. both I pass numpy.float32 with order = 'F'

How large is this dataset and how large is the model (primarily tree depth and the number of iterations)?

And what is the CPU and GPU configuration? The same as you mentioned above when we were discussing training?

Additionally, how can I specify which GPU to use during prediction? Similar to specifying devices during training. (pass task_type='GPU', only using the first gpu )

Unfortunately it is not possible inside CatBoost right now; it's our omission. I've created a new issue #2545 for that. But it is possible to limit the GPUs visible to the process by using the CUDA_VISIBLE_DEVICES environment variable.
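For example (a sketch; the variable must be set before the first CUDA context is created in the process, i.e. before any CatBoost GPU call):

```python
import os

# Make only the first GPU visible to this process; CUDA renumbers it
# as device 0. Set this before importing/using any CUDA-backed library.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# ... later, model.predict(X, task_type='GPU') would run on that GPU only.
print(os.environ['CUDA_VISIBLE_DEVICES'])  # 0
```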

@RunxingZhong
Author

RunxingZhong commented Dec 1, 2023

n_estimators = 3000, depth = 8.

dtest.shape # (10000000, 1000)
dtest.dtype # np.float32
np.isfortran(dtest) # True

Run 3 times as follows:

  1. for i in range(3): y_pred = model.predict(dtest, thread_count = 64) # 10.5s
  2. for i in range(3): y_pred = model.predict(dtest, thread_count = 64, task_type = 'GPU') # 22.4s

Thank you. You may compare with https://github.com/rapidsai/cuml : loading the same CatBoost model there and predicting on dtest takes only 2~3 seconds.

@andrey-khropov
Member

you may compare with this https://github.com/rapidsai/cuml
load catboost and predict, same model predict for dtest only take 2~3 seconds

What exactly do you mean? CatBoost does not support applying on cudf.DataFrame right now.

@RunxingZhong
Author

Hello, sorry for the delayed response.
The link I mentioned, https://github.com/rapidsai/cuml, is not something I personally tested.

I know that some people have used this project for making predictions with CatBoost, and it seems to be significantly faster than the official GPU predictions provided by CatBoost. However, that is internal code that I don't have access to.

If you have some free time, you might want to explore this project. It supports loading a model and making GPU predictions, and the performance appears to be very fast.

@andrey-khropov
Member

andrey-khropov commented Feb 5, 2024

@RunxingZhong I've found a serious performance bug and fixed it in 45175a2 .
After the fix the Pool creation in your first case is approximately 3 times faster.

We're planning a new release very soon.

@andrey-khropov
Member

    np.random.seed(42)
    
    # reduced the first dimension, otherwise there's not enough RAM during the dataset generation stage
    #n, m = int(1e7), 1000
    n, m = int(5e6), 1000

    X = np.random.random((n, m)).astype(np.float32)
    y = np.random.random(n).astype(np.float32)
    print(f'{np.prod(X.shape) * 4:,.0f}')

    t1 = time.time()
    Pool(X, y)
    t2 = time.time()
    print(f'Pool (C-order X) = {t2 - t1:.2f}')

    t1 = time.time()
    Xpd = pd.DataFrame(X)
    t2 = time.time()
    print(f'C-order X -> pd.DataFrame = {t2 - t1:.2f}')

    t1 = time.time()
    Xf = X.astype(np.float32, order = 'F')
    t2 = time.time()
    print(f'C-order X -> F-order Xf = {t2 - t1:.2f}')

    # to reduce memory usage
    del X

    t1 = time.time()
    Pool(Xf, y)
    t2 = time.time()
    print(f'Pool (F-order X) = {t2 - t1:.2f}')

On 32 core 2x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz :

Before 45175a2 and 588e1bc

20,000,000,000
Pool (C-order X) = 213.50
C-order X -> F-order Xf = 129.11
Pool (F-order X) = 0.01

after:

20,000,000,000
Pool (C-order X) = 14.41
C-order X -> F-order Xf = 103.45
Pool (F-order X) = 0.01

@dazzle-me

dazzle-me commented Apr 14, 2024

Please add a warning if the data layout is not optimal. I spent a couple of hours debugging what might be wrong with the data before I finally found this issue thread.

Indeed, data = data.astype(data.dtype, order='F') works, thank you, but finding it was not user-friendly.
I used a dataset with more than 20 million rows and >200 features; Pool creation took >10 minutes, while converting the data to the proper numpy layout took less than 30 seconds. Thank you.
