-
Notifications
You must be signed in to change notification settings - Fork 1.2k
modify cifar test to avoid timeouts #309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
tests/integ/test_tf_cifar.py
Outdated
framework_version=tf_full_version, training_steps=500, evaluation_steps=5, | ||
train_instance_count=2, train_instance_type='ml.p2.xlarge', | ||
framework_version=tf_full_version, training_steps=100, evaluation_steps=5, | ||
train_instance_count=2, train_instance_type='ml.c4.xlarge', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure there's a good solution for this, but it would be nice if we could keep using a GPU instance type in regions where it makes sense because this is our only integ test that currently uses GPU (which is why it made it continuous testing cut)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I wasn't sure about this. Do you think we should use g2? The problem seems to be capacity, and I'd rather not keep increasing the timeouts. (We'd have to make it over an hour, since I think it's an hour before ICE).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ouch, yeah, an hour is too much. Are g2s more available?
Another possibility could be adding a parameter so that we can specify when running pytest if the particular test run should avoid trying to get p2s - haven't thought about it long enough to figure out if that would be really messy, though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess let's increase the timeout temporarily and communicate with teams related to this for long term solution?
Codecov Report
@@ Coverage Diff @@
## master #309 +/- ##
==========================================
- Coverage 92.74% 92.69% -0.06%
==========================================
Files 50 50
Lines 3475 3475
==========================================
- Hits 3223 3221 -2
- Misses 252 254 +2
Continue to review full report at Codecov.
|
9de2361
to
d1787d6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just make sure to communicate with related teams on this.
Issue #, if available:
Description of changes: This test frequently fails in FRA, seemingly due to difficulty getting p2.xlarge (takes more than 20 minutes to start in some cases), causing spurious canary failures.
Merge Checklist
Put an
x
in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.