Updated CircleCI tests to cap number of OpenMP threads #407
Conversation
@christopherbunn @jeremyliweishih hmm why are we doing this? I thought #402 fixed this.
That PR excluded everything at or above 1.0.0. Since then, XGBoost has released 1.0.1, which is supposed to fix the Python 3.5 errors. This PR only excludes 1.0.0.
Force-pushed from e3b8516 to 3da7f69
Codecov Report
@@            Coverage Diff            @@
##           master     #407   +/-   ##
=======================================
  Coverage   98.26%   98.26%
=======================================
  Files         104      104
  Lines        3292     3292
=======================================
  Hits         3235     3235
  Misses         57       57
Continue to review full report at Codecov.
Oh weird, I just saw this review was marked as "pending". Not sure if my comments showed up or not. Sorry about that!
.circleci/config.yml
Outdated
mkdir build
cd build
cmake ..
make -j4
Why do we have to build xgboost from source instead of using pip? This makes me nervous, because our tests should mirror our user install experience--do we have to ask people to build xgboost from source now?
requirements.txt
Outdated
@@ -5,7 +5,7 @@ scikit-learn>=0.21.3,!=0.22
 dask[complete]>=2.1.0
 numpy>=1.16.4
 pandas>=0.25.0,<1.0.0
-xgboost>=0.82,<1.0.0
+xgboost>=0.82,!=1.0.0
👍
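For reference, here's a quick way to see which releases the new specifier admits (a minimal sketch; the `packaging` library is an assumption, not part of this change):

```python
# Minimal sketch (assumes the `packaging` library): check which xgboost
# releases satisfy the updated requirements.txt specifier.
from packaging.specifiers import SpecifierSet

spec = SpecifierSet(">=0.82,!=1.0.0")
for version in ["0.82", "0.90", "1.0.0", "1.0.1"]:
    print(version, "allowed" if version in spec else "excluded")
# 1.0.0 is excluded, while 1.0.1 (which should fix the Python 3.5 errors) is allowed.
```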
Thanks @christopherbunn. Left a comment about the pytest verbosity thing, otherwise LGTM!
Sorry again about the delayed review... it was probably because I was on my phone at the time.
Makefile
@@ -20,11 +20,11 @@ test:

 .PHONY: circleci-test
 circleci-test:
-    pytest evalml/ -n 8 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure
+    pytest evalml/ -n 8 --doctest-modules --cov=evalml --junitxml=test-reports/junit.xml --doctest-continue-on-failure -v
Any particular reason to keep this, or did you just forget to remove this after debugging?
It prints out the individual test names as they run. Since it's more descriptive about which test is being run, IMO it's better to keep it so that we can see which specific tests are passing in CircleCI. However, I can understand that it might look cluttered. Take a look at one of the checks and let me know what you think.
Oh cool. That's a great idea! I like it. Thanks.
In the future let's keep separate changes like this in separate PRs instead of lumping them into one. That'll keep it easy to look at our merge history and understand exactly what changed and when it changed. Having separate smaller PRs like that also makes it easier to tell where a particular test failure is coming from. That sound good?
@dsherry For some reason it still doesn't pass on CircleCI with the XGBoost 1.0.1 release. Locally on my machine and in the CircleCI CLI (which runs in the Docker container that they use), it finishes within 5-8 minutes. However, it times out on CircleCI's machines. I'm not 100% sure why this is the case.
UPDATE: I tried the above and it did not work. I'm going to adjust the CircleCI timeout and then look at the resulting JUnit XML file to see exactly which tests are taking up all of the time.
Force-pushed from a862668 to 0adba96
@christopherbunn ah got it. Weird. Good digging. Any idea why this would start happening for xgboost 1.0.1 but not for our previous version? Hmm, I had assumed CircleCI's Docker setup would limit the number of cores available to the process, so that something running in a CircleCI Docker container could only take so many CPUs. Is that not the case? And do you know how long the timeout is?
I assumed the same thing as well. This might be a stretch, but I looked through Angela's joblib error since I thought it might be related. In the CircleCI output here, you can see that the Parallel class is initiated with 26 jobs. I'm pretty sure the same Parallel library is used in XGBoost, but I would need to double check.
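For illustration, here's a minimal sketch (not from this PR) of how joblib sizes its default worker pool, which is how Parallel can end up with 26 jobs inside a container that is allotted far fewer CPUs:

```python
# Minimal sketch (not from this PR): with n_jobs=-1, joblib launches one
# worker per CPU it can see. Inside a CircleCI Docker container that can be
# the host machine's core count rather than the container's allotment.
import os

import joblib
from joblib import Parallel, delayed

print("os.cpu_count():", os.cpu_count())          # CPUs the OS reports
print("joblib.cpu_count():", joblib.cpu_count())  # what n_jobs=-1 expands to

# With n_jobs=-1, Parallel launches one worker per reported CPU.
results = Parallel(n_jobs=-1)(delayed(pow)(i, 2) for i in range(8))
print(results)
```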
I bumped up the timeout to 30 minutes. Definitely not a long-term solution, but I wanted to be able to generate a JUnit report so that we can take a look at which tests ran the longest. I've attached a PDF of the results (since GitHub for some reason doesn't support attaching XML or HTML files to comments 😬). It ran the entire test suite in ~21 minutes, which is significantly longer than the 5-8 minutes it would typically take.
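In case it's useful, a minimal sketch for pulling the slowest tests out of that JUnit report (the path matches the `--junitxml=test-reports/junit.xml` flag in the Makefile; the top-20 cutoff is arbitrary):

```python
# Minimal sketch: list the slowest tests from the JUnit XML report
# written by the circleci-test target.
import xml.etree.ElementTree as ET

tree = ET.parse("test-reports/junit.xml")
durations = [
    (float(case.get("time", 0)), f"{case.get('classname')}::{case.get('name')}")
    for case in tree.iter("testcase")
]
for seconds, test in sorted(durations, reverse=True)[:20]:
    print(f"{seconds:8.2f}s  {test}")
```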
Force-pushed from 3dc98b2 to 659ee6e
So after some digging I found a workaround for our CircleCI issues. It looks like CircleCI does not report the number of CPUs actually available to a container, so OpenMP sizes its thread pool from the host machine's core count. For the CircleCI tests, I set the cap to 8 threads since that's the current size of our Docker container. It looks like it runs without timing out as of now. I think this issue should be specific to CircleCI only; most regular machines should be able to correctly report the number of threads available to them.
.circleci/config.yml
@@ -12,6 +12,8 @@ executors:
     working_directory: ~/evalml
     docker:
       - image: circleci/python:<< parameters.python_version >>
+        environment:
+          OMP_NUM_THREADS: 8
Cool!
@christopherbunn oh that's great! Excellent digging :) have you verified this for our case? Either way, I suppose it can't hurt to try.
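One way to verify it for our case would be something like this inside the CircleCI container (a minimal sketch; `threadpoolctl` is an assumed extra dependency, not part of this PR):

```python
# Minimal sketch: confirm the OpenMP runtime actually picked up the
# OMP_NUM_THREADS=8 cap set in the CircleCI executor environment.
# threadpoolctl is an assumed extra dependency, not part of this change.
import os

print("OMP_NUM_THREADS =", os.environ.get("OMP_NUM_THREADS"))

import numpy as np
import xgboost as xgb
from threadpoolctl import threadpool_info

# Fit a tiny model so the OpenMP runtime gets loaded.
X = np.random.rand(64, 4)
y = np.random.randint(0, 2, size=64)
xgb.XGBClassifier(n_estimators=2).fit(X, y)

for pool in threadpool_info():
    print(pool["internal_api"], "num_threads =", pool["num_threads"])
```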
@christopherbunn do you happen to know where the CircleCI doc for this is? I wonder if this is the same root cause as the failures in #167 / #441.
I do see that sklearn sets this parameter in their circleci config, which is promising.
Aha, I think I found it: https://www.openmp.org/spec-html/5.0/openmpse50.html |
Found root cause of CircleCI errors
Removed cap on XGBoost version. Instead, version 1.0.0 is excluded from requirements.txt so that Python 3.5 compatibility is maintained.
Resolves #401