Request for Multithreading Support in CatBoost Pool Dataset Construction #2542
Comments
This should not be the case; in fact, Pool construction already uses multithreading internally where it can (in the C++ part). When data is stored in efficient Python structures like numpy ndarrays and pandas DataFrames, Pool creation speed should not be an issue. Python/C++ conversion for data structures like Python lists will be slower, because that part needs to call the CPython interpreter API and cannot run effectively on multiple threads due to CPython's Global Interpreter Lock. If you provide a more detailed example (perhaps code similar to what you use, but with synthetic data instead of the real data) that in your opinion should run faster, I can look into it. If your data is big, you might also consider running distributed CatBoost on Apache Spark.
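The GIL point above can be demonstrated directly with NumPy alone: converting a Python list has to unbox every element through the CPython API, while converting an existing ndarray is essentially free. A minimal illustration (timings are machine-dependent; the names are illustrative):

```python
import time

import numpy as np

n = 1_000_000
as_list = [float(i) for i in range(n)]    # boxed Python floats
as_arr = np.arange(n, dtype=np.float32)   # contiguous native buffer

t0 = time.perf_counter()
np.asarray(as_list, dtype=np.float32)     # element-by-element conversion via CPython API
t_list = time.perf_counter() - t0

t0 = time.perf_counter()
np.asarray(as_arr)                        # no-op: returns the same array object
t_arr = time.perf_counter() - t0

print(f'from list: {t_list:.4f}s, from ndarray: {t_arr:.6f}s')
```

The gap is typically several orders of magnitude, which is why feeding Pool from ndarrays or DataFrames is the fast path.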
I am using numpy with float32 for continuous values, but I find it surprisingly slow, which is not what I expected. First, I can confirm that when using a GPU, our CatBoost training is indeed very fast. However, the data construction process is extremely slow. Let's focus on comparing the data construction time of different methods. Here is my code:
In this example, I used about 40 GB of data (in practice, I will use even larger datasets for training). You can adjust the size of 'n' based on the memory of your machine. During the test run, the time taken by each method was as follows: method_1 = 245s. The time taken by method_1 and method_2 is too long, especially considering that my subsequent training time might be under 5 minutes. Interestingly, I found that using a pandas DataFrame was the fastest. Why is this the case? Also, during the process, I observed that data construction used only 1 CPU. Is there a way to utilize multiple CPUs for data construction? Please help me speed up this process.
@andrey-khropov Hi, please test this code.
A similar issue was discussed in 2018.
Can you specify your hardware configuration (CPUs, GPUs, total RAM), operating system (with version), Python version, CatBoost version, numpy version, and pandas version? Also, for method_2 and method_3 you measure FeaturesData, pandas.DataFrame, and pandas.Series creation together with Pool creation; although I expect FeaturesData creation to be very fast, it would still be interesting to measure the creation of these data structures separately from the Pool construction call.
Because CatBoost prefers a columnar data layout (where the data for each feature is stored in a contiguous array; in that case CatBoost uses the data as is, without copying or additional transposition internally). You can try specifying the additional parameter
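The layout difference can be seen in plain NumPy, independent of CatBoost: in a Fortran-ordered array each feature column is itself a contiguous block of memory, so a columnar consumer can read it without copying, whereas columns of a C-ordered array are strided. A small sketch:

```python
import numpy as np

X_c = np.zeros((1000, 50), dtype=np.float32)  # C order: each ROW is contiguous
X_f = np.asfortranarray(X_c)                  # F order: each COLUMN is contiguous (one-time copy)

assert X_c.flags['C_CONTIGUOUS'] and X_f.flags['F_CONTIGUOUS']

# A feature column of the F-order array is a contiguous view, usable as-is:
assert X_f[:, 0].flags['C_CONTIGUOUS']
# The same column of the C-order array is strided (elements 50 floats apart):
assert not X_c[:, 0].flags['C_CONTIGUOUS']
```

This is why passing already-columnar data lets Pool construction skip the internal transposition.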
Thank you for the information provided. My system configuration is as follows: CPUs = 256, GPUs = 8 * A100 (80G), RAM = 2TB, running Python 3.9 on a Linux system. I have updated numpy, pandas, and catboost to their latest versions. In methods 2 and 3, creating FeaturesData, pandas.DataFrame, and pandas.Series consumes almost no time, because they only store a reference to the data. I tested it, and it took only about 100 microseconds, which is less than 1 millisecond. This is consistent with the time taken when testing Pool separately. You can try it yourself.
At this point, the time taken is: However, the line X = X.astype(np.float32, order='F') takes xx seconds. In actual use, I read the data in the default numpy array format as np.float32. So the actual time taken for data construction is 145 seconds, which is still slower than using pandas.DataFrame (100 s).
What is the exact model of CPUs? Do you use an SMP system with several CPUs (each of which has multiple cores, of course)?
That confirms that the time is spent transposing the array. Let me check if we can improve the performance here. Meanwhile you can:
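One way to sidestep the transpose entirely, assuming the features can be loaded column by column anyway (as is common when reading columnar files): allocate the final buffer in Fortran order up front and fill it per feature, so no C-to-F conversion ever happens. A sketch with synthetic data:

```python
import numpy as np

n, m = 100_000, 20
# Allocate the columnar (F-order) buffer once...
X = np.empty((n, m), dtype=np.float32, order='F')
for j in range(m):
    # ...then fill one feature column at a time, e.g. as columns
    # are read from disk; each column write hits contiguous memory.
    X[:, j] = np.random.random(n).astype(np.float32)

assert X.flags['F_CONTIGUOUS']  # ready for Pool with no transpose step
```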
Actually
Hello, I have another question: how can I use the GPU for prediction? Prediction on the CPU is relatively slow, and I would like to run it on the GPU. Is it supported? You may refer to https://github.com/rapidsai/cuml for reference.
I've tested it, and the time is still around 140 seconds because I passed X as np.float32. I've resolved this issue by directly copying to a Fortran array using Numba. Thank you.
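For readers without Numba, a similar parallel copy can be sketched with a plain thread pool: NumPy's slice assignment releases the GIL, so several threads can each copy a disjoint block of columns into a preallocated F-order buffer. The function name and thread count below are illustrative, not part of CatBoost:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def fortran_copy_parallel(X, n_threads=8):
    """Copy a C-order 2-D array into a new F-order array using several threads."""
    out = np.empty(X.shape, dtype=X.dtype, order='F')
    # Split the columns into n_threads roughly equal blocks.
    bounds = np.linspace(0, X.shape[1], n_threads + 1).astype(int)

    def copy_block(i):
        lo, hi = bounds[i], bounds[i + 1]
        out[:, lo:hi] = X[:, lo:hi]  # NumPy releases the GIL during this copy

    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        list(ex.map(copy_block, range(n_threads)))
    return out

Xc = np.random.random((10_000, 64)).astype(np.float32)
Xf = fortran_copy_parallel(Xc)
```

Whether this beats a single-threaded `astype(order='F')` depends on memory bandwidth, but it uses more than one core.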
Yes, you can, but only if you have only numeric features — you just pass the corresponding parameter. You can see the example in the tests here.
I've tested it, and when predicting on a larger dataset, it takes approximately 10 seconds on the CPU and 15 seconds on the GPU, which is even slower. In both cases I pass numpy.float32 with order='F'. Additionally, how can I specify which GPU to use during prediction, similar to specifying devices during training?
How large is this dataset and how large is the model (primarily tree depth and the number of iterations)? And what is the CPU and GPU configuration — the same as you mentioned above when we were discussing training?
Unfortunately it is not possible inside CatBoost right now; it's our omission. I've created a new issue #2545 for that. But it is possible to limit the GPUs visible to the process by using
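Assuming NVIDIA GPUs, the standard mechanism for limiting device visibility is the `CUDA_VISIBLE_DEVICES` environment variable — a CUDA-level setting, not a CatBoost parameter — which must be set before the library first initializes the GPU:

```python
import os

# Restrict this process to physical GPUs 2 and 3; inside the process
# they will be renumbered as devices 0 and 1. This must be set before
# any library in the process initializes CUDA.
os.environ['CUDA_VISIBLE_DEVICES'] = '2,3'
```

Setting it in the shell (`CUDA_VISIBLE_DEVICES=2,3 python train.py`) works equally well and avoids ordering concerns.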
n_estimators = 3000, depth = 8. dtest.shape # (10000000, 1000). Run 3 times as follows.
Thank you, |
What exactly do you mean? CatBoost does not support applying on |
Hello, sorry for the delayed response. I know that some people have used this project for making predictions with CatBoost, and it seems to be significantly faster than the official GPU predictions provided by CatBoost. However, it is internal code that I don't have access to. If you have some free time, you might want to explore that project. It supports loading a model and making GPU predictions, and the performance appears to be very fast.
@RunxingZhong I've found a serious performance bug and fixed it in 45175a2 . We're planning a new release very soon. |
import time

import numpy as np
import pandas as pd
from catboost import Pool

np.random.seed(42)
# reduced the first dimension, otherwise there's not enough RAM during the dataset generation stage
#n, m = int(1e7), 1000
n, m = int(5e6), 1000
X = np.random.random((n, m)).astype(np.float32)
y = np.random.random(n).astype(np.float32)
print(f'{np.prod(X.shape) * 4:,.0f}')  # data size in bytes

t1 = time.time()
Pool(X, y)
t2 = time.time()
print(f'Pool (C-order X) = {t2 - t1:.2f}')

t1 = time.time()
Xpd = pd.DataFrame(X)
t2 = time.time()
print(f'C-order X -> pd.DataFrame = {t2 - t1:.2f}')

t1 = time.time()
Xf = X.astype(np.float32, order='F')
t2 = time.time()
print(f'C-order X -> F-order Xf = {t2 - t1:.2f}')

# delete the C-order copy to reduce memory usage
del X

t1 = time.time()
Pool(Xf, y)
t2 = time.time()
print(f'Pool (F-order X) = {t2 - t1:.2f}')

On 32 core 2x Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz :
after:
Could you please add a warning when the data format is not optimal? Indeed,
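Until such a warning exists in the library, a small user-side check can flag the slow path before constructing the Pool. The helper name and message below are hypothetical, not part of CatBoost:

```python
import warnings

import numpy as np

def warn_if_not_columnar(X):
    """Warn when a 2-D ndarray is not F-ordered float32, i.e. when
    Pool construction will likely transpose/copy it internally."""
    if isinstance(X, np.ndarray) and X.ndim == 2:
        if not X.flags['F_CONTIGUOUS'] or X.dtype != np.float32:
            warnings.warn('X is not F-ordered float32; Pool() may make '
                          'a slow single-threaded copy')

warn_if_not_columnar(np.zeros((10, 3)))  # C-order float64 -> warns
```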
Hello,
I've noticed that CatBoost's Pool class uses only a single thread during the construction of datasets. I was wondering if there's any possibility to support multithreading in this process?

The reason for this request is that, when dealing with large datasets, I find that the time taken to build the Pool is actually longer than the time it takes to train the model. This seems inefficient and likely not the intended behavior. Implementing multithreading could significantly reduce dataset construction time, leading to a more streamlined and efficient workflow.
Thank you for considering this enhancement.
Best regards