
1.1.0RC2/RC1 Performance degradation #5666

Closed
ShvetsKS opened this issue May 14, 2020 · 22 comments
@ShvetsKS
Contributor

A slowdown was discovered when comparing training time on the release packages 1.1.0rc2/1.1.0rc1 (https://pypi.org/project/xgboost/#history) against a custom build from the head (c42f533) of the release branch (https://github.com/dmlc/xgboost/commits/release_1.1.0):

| dataset    | 1.1.0rc2 (1.1.0rc1) | custom build on c42f533 |
|------------|---------------------|-------------------------|
| higgs1m    | 21.33s              | 16.24s                  |
| mortgage1Q | 22.0s               | 17.67s                  |

This is a ~25% slowdown on average.

The custom build was produced with the default build instructions:

mkdir build
cd build
cmake ..
make -j4

gcc version:
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

higgs1m is a public benchmark (https://github.com/dmlc/xgboost-bench/tree/master/hist_method).

Note that there is a large run-to-run variance in the rc1/rc2 measurements:
higgs1m:
rc1: 21.3283 sec ( [19.96872500400059, 20.516250001004664, 21.93073065700446, 22.897386002005078, 32.79936764600279] )
vs
custom: 16.2467 sec ( [15.885650180993252, 15.897585067999898, 16.517098495001846, 16.686414559000696, 18.001827495994803] )
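
For context, the comparison above boils down to timing the same benchmark against two different installs of the package; a minimal sketch (assuming the custom library has already been built with the cmake/make steps above, and with the benchmark itself coming from the xgboost-bench repository linked below):

# PyPI wheel under test (version string as listed on the PyPI history page above)
pip install xgboost==1.1.0rc2
# ... run the higgs1m benchmark and record the training time ...

# custom build: one way to install the locally built Python package on top of the wheel
pip uninstall -y xgboost
pip install -e python-package
# ... run the same benchmark again and compare the timings ...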

@hcho3
Collaborator

hcho3 commented May 14, 2020

RC2 is actually at commit 8aaabce of the same release branch. Two commits have been made on top of RC2. So it seems like these two commits actually improved the performance?

@hcho3
Collaborator

hcho3 commented May 14, 2020

Another possibility is the build environment. The PyPI wheels are built using CentOS 6 + GCC 5.x. So your custom build may be using native platform-specific optimization that’s not available in the PyPI build.

@SmirnovEgorRu
Contributor

From what I can see, the new PRs shouldn't affect performance.
A different build environment is the most likely root cause. Do we use a Docker container to build the PyPI wheels? If so, how can I get it?

Thank you!

@hcho3
Collaborator

hcho3 commented May 14, 2020

Yes, we use Docker containers to build the PyPI wheels. You can run the following commands at the project root:

# Build XGBoost
tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON
# Get binary wheel
tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"

@trivialfis
Member

I think there are techniques for detecting supported instruction sets at runtime, depending on the platform. On Linux that would simply be parsing /proc/cpuinfo and checking which of those instruction sets were used during the build.
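
For example, a quick Linux-only check of which relevant instruction sets the host CPU advertises (an illustrative sketch, not something XGBoost does today):

# List the SIMD-related flags reported in /proc/cpuinfo; these could then be
# compared against the flags the binary was actually compiled with.
grep -o -w -E 'sse2|avx|avx2|avx512f' /proc/cpuinfo | sort -u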

@SmirnovEgorRu
Contributor

@trivialfis, perhaps I misunderstood you - do you mean that the supported instruction sets depend on the platform where XGBoost was built?
If so, that means the build will differ depending on the platform where it was built.
Also, the old Makefile always enabled sse2 explicitly. Is that no longer the case with CMake?

I suppose the issue is not solved yet. We probably have a suboptimal environment for building XGBoost, or something along those lines. @ShvetsKS, could you please try building XGBoost using the appropriate container?
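
For reference, forcing a specific instruction-set baseline through CMake, in the spirit of the explicit sse2 usage in the old Makefile, would look roughly like this (an illustrative sketch, not the project's actual configuration; the exact flag is an assumption):

# Illustrative only: pass the instruction-set flag explicitly to the compiler.
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-msse2"
make -j4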

SmirnovEgorRu reopened this May 15, 2020
@trivialfis
Member

trivialfis commented May 15, 2020

@SmirnovEgorRu The default binary distribution is not optimal, and I don't think it can be: it uses an older GCC (5.x) and a less aggressive optimization level, and I'm not sure LTO is enabled in that case. If you want optimal performance from the default build, then we would need to do PGO on the test farm.
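
As an illustration of what a more aggressive machine-local build could look like (a sketch only; these flags are not what the PyPI wheels use, and -march=native ties the binary to the build machine):

# Aggressive optimization plus link-time optimization for a local build.
# Not portable - a wheel built like this could not be shipped on PyPI.
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
make -j4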

> do you mean that the supported instruction sets depend on the platform where XGBoost was built?

No. I haven't used TensorFlow for a while, but I remember that if you install it from pip, it prints a warning saying that TensorFlow supports AVX2 but the current binary was not built with that compiler flag enabled.

@hcho3
Collaborator

hcho3 commented May 15, 2020

We are limited in what kind of optimizations we can do, since the PyPI wheels have to support a wide range of machines. For example, we cannot use AVX-512 instructions.

However, I think the more probable cause is the library dependencies. Unfortunately, there's not much we can do about libraries either, since PyPI requires that all Linux builds use CentOS 6. See https://www.python.org/dev/peps/pep-0571/

If the build environment has a significant impact on performance, we should look into alternative distribution channels, such as Conda.

@SmirnovEgorRu
Contributor

@hcho3, @trivialfis, I agree that there are things we can't do in public pip releases.
But it still makes sense to investigate the problem. There are two scenarios:

  • we leave it as is, because the changes would hurt the usability of the product
  • we fix it, if there is no impact on users in terms of usability

@hcho3
Collaborator

hcho3 commented May 15, 2020

@SmirnovEgorRu Thanks. We'll keep this issue open for now.

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

I ran experiments with GCC 5.5 instead of 7.5. The results are the same:

| dataset | custom build on 8aaabce, gcc 5.5 | custom build on c42f533, gcc 7.5 |
|---------|----------------------------------|----------------------------------|
| higgs1m | 16.29s                           | 16.24s                           |

And if I build using Docker:

sudo tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh -DOPEN_MP:BOOL=ON

I obtain the same result: 16.26s

The option -DHIDE_CXX_SYMBOLS=ON causes: libxgboost.so: undefined symbol: RabitGetRank
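
A quick way to check whether that symbol actually remains exported from the built library (illustrative; the library path may differ depending on the build setup):

# List the dynamic (exported) symbols of the shared library and look for RabitGetRank.
# With -DHIDE_CXX_SYMBOLS=ON most internal symbols should no longer be visible.
nm -D lib/libxgboost.so | grep RabitGetRank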

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

But if I additionally use -DUSE_CUDA=ON -DUSE_NCCL=ON, I get the described regression.

To summarize the current problem: a build with -DUSE_CUDA=ON introduces a ~25% degradation for CPU training.

| dataset    | custom docker build on 8aaabce, -DUSE_CUDA=ON | custom docker build on 8aaabce, -DUSE_CUDA=OFF |
|------------|-----------------------------------------------|------------------------------------------------|
| higgs1m    | 21.33s                                        | 16.24s                                         |
| mortgage1Q | 22.0s                                         | 17.67s                                         |
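
For clarity, the two columns in the table differ only in the CUDA switch; roughly (an illustrative sketch of the two configurations - the actual numbers above came from builds done through the Docker script shown earlier):

# CPU-only configuration
mkdir build-cpu
cd build-cpu
cmake .. -DUSE_CUDA=OFF
make -j4
cd ..

# CUDA-enabled configuration - the one showing ~25% slower CPU training
mkdir build-gpu
cd build-gpu
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j4
cd ..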

@trivialfis
Member

That's weird. Let me try a bit more.

@hcho3
Collaborator

hcho3 commented May 15, 2020

Interesting. Could it be because of code bloat?

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

I hope that my modification of the ci_build.sh file doesn't affect the results:

set -x
${DOCKER_BINARY} run --rm --pid=host \
    -v "${WORKSPACE}":/workspace \
    -w /workspace \
    "${DOCKER_IMG_NAME}" \
    "${COMMAND[@]}"

I needed this change to get libxgboost.so out of the docker run; the lines

    ${USER_IDS} \
    "${CI_DOCKER_EXTRA_PARAMS[@]}" \

were deleted

@hcho3
Collaborator

hcho3 commented May 17, 2020

I released 1.1.0 today, as scheduled. If we happen to find a way to speed up PyPI releases without compromising usability, we could release a patch release (1.1.1).

@SmirnovEgorRu
Contributor

@hcho3, got it, thank you!
@ShvetsKS, let's make that possible.

@ShvetsKS
Contributor Author

It seems the root cause has been found, and #5720 was prepared as an attempt to fix it.

@SmirnovEgorRu
Contributor

@hcho3, the issue is solved in master. Can we release 1.1.1?
Thank you!

@hcho3
Collaborator

hcho3 commented May 31, 2020

@SmirnovEgorRu @ShvetsKS I filed #5732 to release 1.1.1.

@hcho3
Collaborator

hcho3 commented Jun 7, 2020

1.1.1 is now out.

@SmirnovEgorRu
Contributor

@hcho3, I appreciate this, thank you!
