Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated CircleCI tests to cap number of OpenMP threads #407

Merged
merged 5 commits into from Mar 10, 2020

Conversation

christopherbunn
Copy link
Contributor

@christopherbunn christopherbunn commented Feb 21, 2020

Removed cap on XGBoost version. Instead, version 1.0.0 is excluded from requirements.txt so that Python 3.5 compatibility is maintained.

Resolves #401

jeremyliweishih
jeremyliweishih previously requested changes Feb 21, 2020
docs/source/changelog.rst Outdated Show resolved Hide resolved
@dsherry dsherry mentioned this pull request Feb 21, 2020
@dsherry
Copy link
Collaborator

dsherry commented Feb 25, 2020

@christopherbunn @jeremyliweishih hmm why are we doing this? I thought #402 fixed this.

@christopherbunn
Copy link
Contributor Author

christopherbunn commented Feb 25, 2020

@christopherbunn @jeremyliweishih hmm why are we doing this? I thought #402 fixed this.

That PR excluded anything above and including 1.0.0. Since then, XGBoost has released 1.0.1 which is supposed to fix the Python 3.5 errors. This PR only excludes 1.0.0.

@christopherbunn christopherbunn force-pushed the 401_remove_xg_cap branch 2 times, most recently from e3b8516 to 3da7f69 Compare Mar 2, 2020
@codecov
Copy link

codecov bot commented Mar 2, 2020

Codecov Report

Merging #407 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #407   +/-   ##
=======================================
  Coverage   98.26%   98.26%           
=======================================
  Files         104      104           
  Lines        3292     3292           
=======================================
  Hits         3235     3235           
  Misses         57       57

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0e9acab...7568a3c. Read the comment docs.

Copy link
Collaborator

@dsherry dsherry left a comment

Oh weird, I just saw this review was marked as "pending". Not sure if my comments showed up or not. Sorry about that!

mkdir build
cd build
cmake ..
make -j4
Copy link
Collaborator

@dsherry dsherry Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have to build xgboost from source instead of using pip? This makes me nervous, because our tests should mirror our user install experience--do we have to ask people to build xgboost from source now?

requirements.txt Outdated
@@ -5,7 +5,7 @@ scikit-learn>=0.21.3,!=0.22
dask[complete]>=2.1.0
numpy>=1.16.4
pandas>=0.25.0,<1.0.0
xgboost>=0.82,<1.0.0
xgboost>=0.82,!=1.0.0
Copy link
Collaborator

@dsherry dsherry Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

dsherry
dsherry approved these changes Mar 3, 2020
Copy link
Collaborator

@dsherry dsherry left a comment

Thanks @christopherbunn. Left a comment about the pytest verbosity thing, otherwise LGTM!

Sorry again about the delayed review... it was probably because I was on my phone at the time.

@@ -20,11 +20,11 @@ test:

.PHONY: circleci-test
circleci-test:
pytest evalml/ -n 8 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure
pytest evalml/ -n 8 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure -v
Copy link
Collaborator

@dsherry dsherry Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason to keep this, or did you just forget to remove this after debugging?

Copy link
Contributor Author

@christopherbunn christopherbunn Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It spits out the individual names for testing. Since it's more descriptive about what test is being done imo it's better to keep it so that if we need to see what tests specific are passing in CircleCI. However, I can understand that it might look cluttered. Take a look in one of the checks and lmk what you think.

Copy link
Collaborator

@dsherry dsherry Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh cool. That's a great idea! I like it. Thanks.

In the future let's keep separate changes like this in separate PRs instead of lumping them into one. That'll keep it easy to look at our merge history and understand exactly what changed and when it changed. Having separate smaller PRs like that also makes it easier to tell where a particular test failure is coming from. That sound good?

@christopherbunn
Copy link
Contributor Author

christopherbunn commented Mar 3, 2020

@dsherry For some reason it doesn't pass still on the XGBoost 1.0.1 release on CircleCI. Locally on my machine and in the CircleCI CLI (which runs in the Docker container that they use), it runs within 5-8 minutes. However, it times out on CircleCI's machines. I'm not exactly 100% sure why this is the case.

After some digging the best explanation I have is based on the fact that CircleCI runs multiple Docker containers on their machine at a time. It could be possible that XGBoost is configured to grab all of the cores of the machine (I think it's 36 cores?) when n_jobs=-1 rather than the 8 cores that it has in Docker. Since it doesn't have access to all of the cores, the workers have the potential to fail, then restart (and loop in a cycle).

I'm going to test this by fixing n_jobs in XGBoost only to confirm but I definitely wanted to post an update on this.

UDPATE: tried the above and did not work, I'm going to adjust the CircleCI timeout time then look at the resulting Junit XML file to see what tests exactly are taking up all of the time.

@christopherbunn christopherbunn force-pushed the 401_remove_xg_cap branch 2 times, most recently from a862668 to 0adba96 Compare Mar 3, 2020
@dsherry
Copy link
Collaborator

dsherry commented Mar 3, 2020

@christopherbunn ah got it. Weird. Good digging. Any idea why this would start happening for xgboost 1.0.1 but not for our previous version? Good idea to test fixing n_jobs and see if that helps.

Hmm, I had assumed circleci's docker setup would limit the number of cores available to the process, so that something running in a circleci docker container could only take so many CPUs. Is that not the case?

And do you know how long the timeout is?

@christopherbunn
Copy link
Contributor Author

christopherbunn commented Mar 3, 2020

@christopherbunn ah got it. Weird. Good digging. Any idea why this would start happening for xgboost 1.0.1 but not for our previous version? Good idea to test fixing n_jobs and see if that helps.

Hmm, I had assumed circleci's docker setup would limit the number of cores available to the process, so that something running in a circleci docker container could only take so many CPUs. Is that not the case?

I assumed the same thing as well. However, since. This might be a stretch but I looked though Anglea's joblib error since I thought it might be related. In the CircleCI output here, you can see that the Parallel class is initiated with 26 jobs (i.e. Parallel(n_jobs=26)) despite the fact that the Docker container is set to use 8 cores. The 26 value most likely matches the EC2 instance they're using to run the tests. I'm guessing it's because Docker on macOS runs containers in a VM (which would restrict core counts) whereas Docker for Linux isn't a VM.

I'm pretty sure the same Parallel library is used in XGBoost but I would need to double check.

And do you know how long the timeout is?

I bumped up the timeout to 30 minutes. Definitely not a long term solution, but I wanted to be able to generate a JUnit report so that we can take a look at what tests ran the longest.

I've attached a pdf of the results (since Github for some reason doesn't support attaching XML or HTML files to comments 😬). It ran the entire test suite in ~21 minutes, which is significantly greater than the 5-8 minutes it would typically run.

junit.xml - Junit Test Report.pdf

@dsherry dsherry changed the title Excluded XGBoost 1.0.0 from requirements.txt [WIP] Excluded XGBoost 1.0.0 from requirements.txt Mar 3, 2020
@dsherry dsherry changed the title [WIP] Excluded XGBoost 1.0.0 from requirements.txt [WIP] Exclude XGBoost 1.0.0 from requirements.txt Mar 3, 2020
@dsherry dsherry changed the title [WIP] Exclude XGBoost 1.0.0 from requirements.txt [WIP] XGBoost: update to v1.0.1, and disallow v1.0.0 Mar 3, 2020
@christopherbunn christopherbunn changed the title [WIP] XGBoost: update to v1.0.1, and disallow v1.0.0 Updated CircleCI tests to cap number of OpenMP threads Mar 10, 2020
@christopherbunn
Copy link
Contributor Author

christopherbunn commented Mar 10, 2020

So after some digging I found a workaround for our CircleCI issues. Because it looks like CircleCI does not report the number of CPUs available in a container when nproc is called, XGBoost is most likely overallocating the actual number of threads available. The library that handles CPU thread parallelism, OpenMP, has the ability to set an upper limit through an environment variable.

For the CircleCI tests, I set the cap to be 8 threads since that's the current size of our Docker container. It looks like it runs without timing out as of now. I think this issue should be specific to CircleCI only and most regular machines should have the ability to correctly report the number of threads available to them.

@@ -12,6 +12,8 @@ executors:
working_directory: ~/evalml
docker:
- image: circleci/python:<< parameters.python_version >>
environment:
OMP_NUM_THREADS: 8
Copy link
Collaborator

@dsherry dsherry Mar 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

Copy link
Collaborator

@dsherry dsherry left a comment

@christopherbunn oh that's great! Excellent digging :) have you verified this for our case? Either way, I suppose it can't hurt to try.

@dsherry
Copy link
Collaborator

dsherry commented Mar 10, 2020

@christopherbunn do you happen to know where the circleci doc is for OMP_NUM_THREADS ? I can't find it at first google.

I wonder if this is the same root cause as the failures in #167 / #441

@dsherry
Copy link
Collaborator

dsherry commented Mar 10, 2020

I do see that sklearn sets this parameter in their circleci config, which is promising.

@dsherry
Copy link
Collaborator

dsherry commented Mar 10, 2020

Aha, I think I found it: https://www.openmp.org/spec-html/5.0/openmpse50.html

@christopherbunn christopherbunn dismissed jeremyliweishih’s stale review Mar 10, 2020

Found root cause of CircleCI errors

@christopherbunn christopherbunn merged commit e66ac68 into master Mar 10, 2020
2 checks passed
@dsherry dsherry deleted the 401_remove_xg_cap branch Mar 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update to XGBoost 1.0.2
3 participants