
1.1.0RC2/RC1 Performance degradation #5666

Closed
ShvetsKS opened this issue May 14, 2020 · 22 comments
@ShvetsKS
Contributor

A slowdown was discovered when comparing training time on the release packages 1.1.0rc2/1.1.0rc1 (https://pypi.org/project/xgboost/#history) against a custom build from the head (c42f533) of the release branch (https://github.com/dmlc/xgboost/commits/release_1.1.0):

| dataset    | 1.1.0rc2 (1.1.0rc1) | custom build on c42f533 |
|------------|---------------------|-------------------------|
| higgs1m    | 21.33s              | 16.24s                  |
| mortgage1Q | 22.0s               | 17.67s                  |

This is a ~25% slowdown on average.

The custom build was produced with the default build instructions:

mkdir build
cd build
cmake ..
make -j4

gcc version:
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

higgs1m is a public benchmark (https://github.com/dmlc/xgboost-bench/tree/master/hist_method).

Note that there is a large run-to-run variance in the rc1/rc2 measurements:
higgs1m:
rc1: 21.3283 sec ( [19.96872500400059, 20.516250001004664, 21.93073065700446, 22.897386002005078, 32.79936764600279] )
vs
custom: 16.2467 sec ( [15.885650180993252, 15.897585067999898, 16.517098495001846, 16.686414559000696, 18.001827495994803] )
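
For context, the comparison above boils down to timing the same benchmark against two different installs of the package; a minimal sketch (assuming the custom library has already been built with the cmake/make steps above, and with the benchmark itself coming from the xgboost-bench repository linked below):

# PyPI wheel under test (version string as listed on the PyPI history page above)
pip install xgboost==1.1.0rc2
# ... run the higgs1m benchmark and record the training time ...

# custom build: one way to install the locally built Python package on top of the wheel
pip uninstall -y xgboost
pip install -e python-package
# ... run the same benchmark again and compare the timings ...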

@hcho3
Collaborator

hcho3 commented May 14, 2020

RC2 is actually at commit 8aaabce of the same release branch. Two commits have been made on top of RC2. So it seems like these two commits actually improved the performance?

@hcho3
Collaborator

hcho3 commented May 14, 2020

Another possibility is the build environment. The PyPI wheels are built using CentOS 6 + GCC 5.x. So your custom build may be using native platform-specific optimization that’s not available in the PyPI build.

@SmirnovEgorRu
Contributor

From what I can see, the new PRs shouldn't affect performance.
A different build environment is the most likely root cause. Do we use a Docker container to build the PyPI wheels? If so, how can I get it?

Thank you!

@hcho3
Collaborator

hcho3 commented May 14, 2020

Yes, we use Docker containers to build the PyPI wheels. You can run the following commands at the project root:

# Build XGBoost
tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh -DUSE_CUDA=ON -DUSE_NCCL=ON \
  -DOPEN_MP:BOOL=ON -DHIDE_CXX_SYMBOLS=ON
# Get binary wheel
tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  bash -c "cd python-package && rm -rf dist/* && python setup.py bdist_wheel --universal"

@trivialfis
Member

I think there are techniques for detecting supported instruction sets at runtime, depending on the platform. On Linux that would simply be parsing /proc/cpuinfo and checking which of those instruction sets were used during the build.
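
For example, a quick Linux-only check of which relevant instruction sets the host CPU advertises (an illustrative sketch, not something XGBoost does today):

# List the SIMD-related flags reported in /proc/cpuinfo; these could then be
# compared against the flags the binary was actually compiled with.
grep -o -w -E 'sse2|avx|avx2|avx512f' /proc/cpuinfo | sort -u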

@SmirnovEgorRu
Contributor

@trivialfis, perhaps I misunderstood you - do you mean that the supported instruction sets depend on the platform where XGBoost was built?
If so, that means the build will differ depending on the platform where it was built.
Also, the old Makefile always enabled sse2 explicitly. Is that no longer the case with CMake?

I suppose the issue is not solved yet. We probably have a suboptimal environment for building XGBoost, or something along those lines. @ShvetsKS, could you please try building XGBoost using the appropriate container?
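
For reference, forcing a specific instruction-set baseline through CMake, in the spirit of the explicit sse2 usage in the old Makefile, would look roughly like this (an illustrative sketch, not the project's actual configuration; the exact flag is an assumption):

# Illustrative only: pass the instruction-set flag explicitly to the compiler.
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-msse2"
make -j4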

SmirnovEgorRu reopened this May 15, 2020
@trivialfis
Member

trivialfis commented May 15, 2020

@SmirnovEgorRu The default binary distribution is not optimal, and I don't think it can be: it uses an older GCC (5.x) and a less aggressive optimization level, and I'm not sure LTO is enabled in that case. If you want optimal performance from the default build, then we would need to do PGO on the test farm.
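
As an illustration of what a more aggressive machine-local build could look like (a sketch only; these flags are not what the PyPI wheels use, and -march=native ties the binary to the build machine):

# Aggressive optimization plus link-time optimization for a local build.
# Not portable - a wheel built like this could not be shipped on PyPI.
mkdir build
cd build
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON
make -j4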

> do you mean that the supported instruction sets depend on the platform where XGBoost was built?

No. I haven't used TensorFlow for a while, but I remember that if you install it from pip, it prints a warning saying that TensorFlow supports AVX2 but the current binary was not built with that compiler flag enabled.

@hcho3
Collaborator

hcho3 commented May 15, 2020

We are limited in what kind of optimizations we can do, since the PyPI wheels have to support a wide range of machines. For example, we cannot use AVX-512 instructions.

However, I think the more probable cause is the library dependencies. Unfortunately, there's not much we can do about libraries either, since PyPI requires that all Linux builds use CentOS 6. See https://www.python.org/dev/peps/pep-0571/

If the build environment has a significant impact on performance, we should look into alternative distribution channels, such as Conda.

@SmirnovEgorRu
Contributor

@hcho3, @trivialfis, I agree that there are things we can't do in public pip releases.
But it still makes sense to investigate the problem. There are two scenarios:

  • we leave it as is, because the changes would hurt the usability of the product
  • we fix it, if there is no impact on users in terms of usability

@hcho3
Collaborator

hcho3 commented May 15, 2020

@SmirnovEgorRu Thanks. We'll keep this issue open for now.

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

I ran experiments with GCC 5.5 instead of 7.5. The results are the same:

| dataset | custom build on 8aaabce, gcc 5.5 | custom build on c42f533, gcc 7.5 |
|---------|----------------------------------|----------------------------------|
| higgs1m | 16.29s                           | 16.24s                           |

And if I build using Docker:

sudo tests/ci_build/ci_build.sh gpu_build docker -it --build-arg CUDA_VERSION=10.0 \
  tests/ci_build/build_via_cmake.sh -DOPEN_MP:BOOL=ON

I obtain the same result: 16.26s

The option -DHIDE_CXX_SYMBOLS=ON causes: libxgboost.so: undefined symbol: RabitGetRank
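
A quick way to check whether that symbol actually remains exported from the built library (illustrative; the library path may differ depending on the build setup):

# List the dynamic (exported) symbols of the shared library and look for RabitGetRank.
# With -DHIDE_CXX_SYMBOLS=ON most internal symbols should no longer be visible.
nm -D lib/libxgboost.so | grep RabitGetRank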

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

But if I additionally use -DUSE_CUDA=ON -DUSE_NCCL=ON, I get the described regression.

To summarize the current problem: a build with -DUSE_CUDA=ON introduces a ~25% degradation for CPU training.

| dataset    | custom docker build on 8aaabce, -DUSE_CUDA=ON | custom docker build on 8aaabce, -DUSE_CUDA=OFF |
|------------|-----------------------------------------------|------------------------------------------------|
| higgs1m    | 21.33s                                        | 16.24s                                         |
| mortgage1Q | 22.0s                                         | 17.67s                                         |
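
For clarity, the two columns in the table differ only in the CUDA switch; roughly (an illustrative sketch of the two configurations - the actual numbers above came from builds done through the Docker script shown earlier):

# CPU-only configuration
mkdir build-cpu
cd build-cpu
cmake .. -DUSE_CUDA=OFF
make -j4
cd ..

# CUDA-enabled configuration - the one showing ~25% slower CPU training
mkdir build-gpu
cd build-gpu
cmake .. -DUSE_CUDA=ON -DUSE_NCCL=ON
make -j4
cd ..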

@trivialfis
Member

That's weird. Let me try a bit more.

@hcho3
Collaborator

hcho3 commented May 15, 2020

Interesting. Could it be because of code bloat?

@ShvetsKS
Contributor Author

ShvetsKS commented May 15, 2020

I hope that my modification of the ci_build.sh file doesn't affect the results:

set -x
${DOCKER_BINARY} run --rm --pid=host \
    -v "${WORKSPACE}":/workspace \
    -w /workspace \
    "${DOCKER_IMG_NAME}" \
    "${COMMAND[@]}"

I needed this change to get libxgboost.so out of the docker run; the lines

    ${USER_IDS} \
    "${CI_DOCKER_EXTRA_PARAMS[@]}" \

were deleted

@hcho3
Collaborator

hcho3 commented May 17, 2020

I released 1.1.0 today, as scheduled. If we happen to find a way to speed up PyPI releases without compromising usability, we could release a patch release (1.1.1).

@SmirnovEgorRu
Contributor

@hcho3, got it, thank you!
@ShvetsKS, let's make that possible.

@ShvetsKS
Contributor Author

It seems the root cause has been found, and #5720 was prepared as an attempt to fix it.

@SmirnovEgorRu
Contributor

@hcho3, the issue is solved in master. Can we release 1.1.1?
Thank you!

@hcho3
Collaborator

hcho3 commented May 31, 2020

@SmirnovEgorRu @ShvetsKS I filed #5732 to release 1.1.1.

@hcho3
Collaborator

hcho3 commented Jun 7, 2020

1.1.1 is now out.

@SmirnovEgorRu
Contributor

@hcho3, I appreciate this, thank you!
