This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

flaky tests in tutorial CI #10846

Closed
zheng-da opened this issue May 8, 2018 · 11 comments

Comments

@zheng-da
Contributor

zheng-da commented May 8, 2018

I have seen the tutorial CI time out multiple times. It takes about one hour to run.

======================================================================

FAIL: test_tutorials.test_onnx_fine_tuning_gluon
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 158, in test_onnx_fine_tuning_gluon
    assert _test_tutorial_nb('onnx/fine_tuning_gluon')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out


--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_tutorials.test_onnx_inference_on_onnx_model
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 161, in test_onnx_inference_on_onnx_model
    assert _test_tutorial_nb('onnx/inference_on_onnx_model')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out

--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_tutorials.test_python_predict_image
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 173, in test_python_predict_image
    assert _test_tutorial_nb('python/predict_image')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 33 tests in 3242.108s

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10843/1/pipeline/
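
For context, the tutorial tests execute each notebook end to end and fail when any cell exceeds a per-cell timeout, which is what surfaces as "Cell execution timed out" above. Below is a minimal sketch of that kind of notebook runner, assuming nbformat/nbconvert; the function name and timeout value are illustrative and not the actual _test_tutorial_nb implementation.

```python
# Minimal sketch of a notebook-based tutorial test with a per-cell timeout.
# Assumes nbformat/nbconvert are installed; names and the timeout value are
# illustrative, not the actual _test_tutorial_nb helper from test_tutorials.py.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError


def run_tutorial_notebook(path, timeout=900):
    """Execute every cell of the notebook at `path`; return True on success."""
    with open(path) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')
    try:
        ep.preprocess(nb, {'metadata': {'path': '.'}})
        return True
    except (CellExecutionError, TimeoutError) as err:
        # A slow model download inside a cell can exhaust the timeout and
        # surface as "Cell execution timed out", as in the failures above.
        print('Notebook execution failed:', err)
        return False
```

A test can then simply do `assert run_tutorial_notebook('onnx/fine_tuning_gluon.ipynb')`, which is why a timed-out cell shows up as a bare AssertionError in the nose output.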

@szha
Member

szha commented May 8, 2018

@ThomasDelteil

@ThomasDelteil
Contributor

ThomasDelteil commented May 8, 2018

@zheng-da do you see multiple timeouts in this log, or have you experienced this issue multiple times?
The fact that test_tutorials.test_python_predict_image and test_tutorials.test_onnx_inference_on_onnx_model time out is really strange. They run simple inference code on CPU. The only obvious common denominator in these tests is that they download a sizable model. A possible reason I can think of for this timeout is the underlying instance running out of disk space, or bandwidth usage being very high. @marcoabreu could that be possible?

Tomorrow I'll look into replacing these models with smaller ones now that ONNX supports more models; that should benefit users anyway.

@zheng-da
Contributor Author

zheng-da commented May 8, 2018

I have seen it twice myself: once in my own PR and once in someone else's PR.

@TaoLv
Member

TaoLv commented May 8, 2018

I also encountered a connection error for test_onnx_fine_tuning_gluon after rebasing my PR onto the master branch.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10104/39/pipeline#step-1512-log-901

@ThomasDelteil
Contributor

@TaoLv that could be related. Someone was working on implementing retry logic in mx.test_utils.download to avoid failed HTTP connection issues, but I can't remember who.
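
For illustration, the retry idea boils down to wrapping the download in a loop with backoff so that a transient HTTP failure does not kill the whole notebook cell. Below is a minimal sketch under that assumption; the function name and parameters are illustrative, not the actual mx.test_utils.download API.

```python
# Minimal sketch of download-with-retries; illustrative only, not the actual
# mx.test_utils.download signature.
import time
import urllib.error
import urllib.request


def download_with_retries(url, dest, retries=3, backoff=2.0):
    """Download `url` to `dest`, retrying transient connection failures."""
    for attempt in range(1, retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except (urllib.error.URLError, ConnectionError) as err:
            if attempt == retries:
                raise
            wait = backoff ** attempt
            print('Download of %s failed (%s), retrying in %.1fs' % (url, err, wait))
            time.sleep(wait)
```

Keeping the call signature compatible with the existing download helper (as discussed below) would let the tutorials pick up the retries without any changes.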

@marcoabreu
Contributor

If slaves run out of disk space, they'll be automatically disabled and no more jobs will be enqueued. Nonetheless, your test should be able to catch that. Network bandwidth should not be a problem either: EC2 has a quite good connection and our slaves are connected at 10 Gbit/s.

@marcoabreu
Contributor

@ThomasDelteil it was @KellenSunderland who worked on it. He had to close his PR due to a breaking API change.

@KellenSunderland
Contributor

KellenSunderland commented May 9, 2018

@ThomasDelteil and @marcoabreu I should bring that PR back to life. Classic move: get some feedback, not find the time to address it, and put the PR in the backlog. I think making it backwards compatible shouldn't be an issue.

@anirudh2290
Member

szha added this to To Do in Tests Improvement via automation May 14, 2018
@ThomasDelteil
Contributor

Thanks @anirudh2290, I have been distracted by other projects; I will try to prioritize moving to smaller models this week.

@ThomasDelteil
Contributor

@zheng-da a fix has been put in to reduce the size of the models used in the tutorials from 500 MB to under 50 MB. That should prevent this bug from happening in the future. Please consider closing for now.

szha closed this as completed May 17, 2018
Tests Improvement automation moved this from To Do to Done May 17, 2018