The performance of XGBoost can degrade significantly when it runs with default parameters in a container whose CPU is limited via cgroups.

Projects such as https://mybinder.org/ and other container-based platforms use cgroups to limit resource usage. This causes a large performance hit when the default parameters (`n_jobs=-1`) are used: XGBoost reads the host CPU count via OpenMP, which leads to over-subscription and causes the container to be throttled immediately. In a small test using `XGBClassifier` with the Wisconsin breast cancer dataset (569 rows, 32 columns) in a container limited by cgroups to 1 CPU core, the execution time was about 90x slower than with `n_jobs = container_cpus`:
| n_jobs | Execution time | Container CPUs | Host CPUs |
|--------|----------------|----------------|-----------|
| -1     | 28 s           | 1              | 16        |
| 1      | 0.3 s          | 1              | 16        |
Looking at the number of threads spawned during the test, one can see that the host CPU count is indeed what determines the number of threads.

Therefore, I suggest that:

- XGBoost should respect the limits imposed by cgroups when running with default parameters / `n_jobs=-1`.
- These limits may be overridden by `OMP_THREAD_LIMIT`, if that variable is set.
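The override could be resolved in a few lines. This is only a sketch of one possible reading of the suggestion above (the environment variable takes precedence over the cgroup-derived count when set to a positive integer); `resolve_n_threads` is a hypothetical helper name, not an existing XGBoost API:

```python
import os

def resolve_n_threads(cgroup_threads):
    """Return the thread count to use, letting OMP_THREAD_LIMIT
    override the cgroup-derived default when it is set.

    Hypothetical helper; the precedence rule is an assumption."""
    raw = os.environ.get("OMP_THREAD_LIMIT")
    # Only honor the variable if it parses as a positive integer.
    if raw is not None and raw.isdigit() and int(raw) > 0:
        return int(raw)
    return cgroup_threads
```

For example, with `OMP_THREAD_LIMIT=4` exported, `resolve_n_threads(1)` would return 4; with the variable unset, the cgroup-derived value is used unchanged.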
sklearn, for example, respects the limits imposed by cgroups and can therefore determine the number of suitable cores correctly:

> [...] take cgroups quotas into account when deciding the number of threads used by OpenMP. This avoids performance problems caused by over-subscription when using those classes in a docker container for instance
An implementation could look similar to sklearn's: `min(openmp.omp_get_max_threads(), cpu_count())`, where `cpu_count = min(physical_cpu_count(), container_cpu_count())`.

`container_cpu_count` could be implemented similarly to joblib, by reading `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` and `/sys/fs/cgroup/cpu/cpu.cfs_period_us` and computing the usable CPU cores as `int(math.ceil(cfs_quota_us / cfs_period_us))`.
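As a rough sketch, the joblib-style lookup could look like the following (cgroup v1 paths only, as in the description above; on hosts without a CFS quota, or under cgroup v2, it falls back to the OS-reported CPU count — `container_cpu_count` and `effective_n_jobs` are illustrative names, not an existing API):

```python
import math
import os

def container_cpu_count():
    """Number of CPUs usable under a cgroup v1 CFS quota.

    Sketch modeled on joblib's approach; falls back to os.cpu_count()
    when no quota is readable (e.g. bare metal or cgroup v2 hosts)."""
    try:
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            cfs_quota_us = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            cfs_period_us = int(f.read())
    except OSError:
        # Quota files absent or unreadable: no container limit known.
        return os.cpu_count() or 1
    if cfs_quota_us <= 0 or cfs_period_us <= 0:
        # A quota of -1 means "no limit" in cgroup v1.
        return os.cpu_count() or 1
    # e.g. quota=150000, period=100000 -> ceil(1.5) = 2 usable cores
    return int(math.ceil(cfs_quota_us / cfs_period_us))

def effective_n_jobs():
    """The suggested default: never exceed host CPUs or the cgroup quota."""
    return min(os.cpu_count() or 1, container_cpu_count())
```

With such a helper, `n_jobs=-1` inside a 1-core-limited container would resolve to 1 thread instead of the host's 16, avoiding the throttling measured above.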