modify cifar test to avoid timeouts #309

Merged
merged 4 commits into from
Jul 30, 2018

andremoeller
Contributor

Issue #, if available:

Description of changes: This test frequently fails in FRA, seemingly because p2.xlarge instances are hard to obtain there (in some cases they take more than 20 minutes to start), which causes spurious canary failures.

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • I have updated the changelog with a description of my changes (if appropriate)
  • I have updated any necessary documentation (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

- framework_version=tf_full_version, training_steps=500, evaluation_steps=5,
- train_instance_count=2, train_instance_type='ml.p2.xlarge',
+ framework_version=tf_full_version, training_steps=100, evaluation_steps=5,
+ train_instance_count=2, train_instance_type='ml.c4.xlarge',

Contributor

I'm not sure there's a good solution for this, but it would be nice if we could keep using a GPU instance type in regions where it makes sense, because this is our only integ test that currently uses a GPU (which is why it made the continuous testing cut).
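
One way to read "keep using a GPU instance type in regions where it makes sense" is to pick the instance type per region. The following is only a minimal sketch under stated assumptions: `GPU_CONSTRAINED_REGIONS` and `cifar_instance_type` are hypothetical names that do not exist in the SDK or its tests, and `eu-central-1` is assumed to be the region code behind "FRA".

```python
# Hypothetical helper: not part of the SageMaker Python SDK or its test suite.
# Regions where p2 capacity is assumed to be scarce (illustrative only;
# 'eu-central-1' is an assumed mapping for "FRA").
GPU_CONSTRAINED_REGIONS = frozenset({'eu-central-1'})


def cifar_instance_type(region_name, constrained_regions=GPU_CONSTRAINED_REGIONS):
    """Return the GPU type unless the region is flagged as capacity-constrained."""
    if region_name in constrained_regions:
        return 'ml.c4.xlarge'  # CPU fallback, as in this PR's diff
    return 'ml.p2.xlarge'      # GPU type the test used before this PR
```

The test could then pass `train_instance_type=cifar_instance_type(region)` instead of a hard-coded string, so that GPU coverage is only dropped where capacity is known to be a problem.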

Contributor Author

Yeah, I wasn't sure about this. Do you think we should use g2? The problem seems to be capacity, and I'd rather not keep increasing the timeouts. (We'd have to make it over an hour, since I think it's an hour before ICE).

Contributor

ouch, yeah, an hour is too much. Are g2s more available?

Another possibility would be adding a parameter so that, when running pytest, we can specify whether a particular test run should avoid trying to get p2s. I haven't thought about it long enough to figure out whether that would be really messy, though.
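
The pytest parameter idea could look roughly like the `conftest.py` sketch below. It relies on pytest's `pytest_addoption` hook and `request.config.getoption`; the option name `--avoid-p2`, the helper `choose_instance_type`, and the fixture `cifar_instance_type` are all hypothetical and are not taken from this repository.

```python
# conftest.py (sketch). The option, helper, and fixture names are hypothetical.
import pytest


def pytest_addoption(parser):
    # Register a command-line flag that is off by default, so existing
    # invocations keep requesting the GPU instance type.
    parser.addoption('--avoid-p2', action='store_true', default=False,
                     help='Use a CPU instance type instead of ml.p2.xlarge.')


def choose_instance_type(avoid_p2):
    """Map the flag value to the instance type the test should request."""
    return 'ml.c4.xlarge' if avoid_p2 else 'ml.p2.xlarge'


@pytest.fixture(scope='session')
def cifar_instance_type(request):
    # Tests that accept this fixture receive the chosen instance type string.
    return choose_instance_type(request.config.getoption('--avoid-p2'))
```

A canary run in a capacity-constrained region would then be started with `pytest --avoid-p2`, while other runs keep the GPU instance type without any change to the invocation.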

Contributor

I guess let's increase the timeout temporarily and communicate with the teams related to this for a long-term solution?

codecov-io commented Jul 19, 2018

Codecov Report

Merging #309 into master will decrease coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #309      +/-   ##
==========================================
- Coverage   92.74%   92.69%   -0.06%     
==========================================
  Files          50       50              
  Lines        3475     3475              
==========================================
- Hits         3223     3221       -2     
- Misses        252      254       +2
Impacted Files Coverage Δ
src/sagemaker/local/image.py 87.09% <0%> (-0.59%) ⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 789ec37...fd1fe72.

andremoeller force-pushed the cifar-canary-change branch from 9de2361 to d1787d6 on July 19, 2018 22:23

Contributor

yangaws left a comment

Just make sure to communicate with related teams on this.

yangaws merged commit fe5acbf into aws:master on Jul 30, 2018
jnclt pushed a commit to jnclt/sagemaker-python-sdk that referenced this pull request Aug 3, 2018

4 participants