This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

flaky tests in tutorial CI #10846

Closed
zheng-da opened this issue May 8, 2018 · 11 comments

Comments

@zheng-da
Contributor

zheng-da commented May 8, 2018

I have seen the tutorial CI time out multiple times. It takes about one hour to run.

======================================================================

FAIL: test_tutorials.test_onnx_fine_tuning_gluon
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 158, in test_onnx_fine_tuning_gluon
    assert _test_tutorial_nb('onnx/fine_tuning_gluon')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out


--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_tutorials.test_onnx_inference_on_onnx_model
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 161, in test_onnx_inference_on_onnx_model
    assert _test_tutorial_nb('onnx/inference_on_onnx_model')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out

--------------------- >> end captured stdout << ----------------------

======================================================================
FAIL: test_tutorials.test_python_predict_image
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/tutorials/test_tutorials.py", line 173, in test_python_predict_image
    assert _test_tutorial_nb('python/predict_image')
AssertionError: 
-------------------- >> begin captured stdout << ---------------------
Cell execution timed out

--------------------- >> end captured stdout << ----------------------

----------------------------------------------------------------------
Ran 33 tests in 3242.108s

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10843/1/pipeline/
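
For context, the tutorial tests execute each notebook end to end and fail when any cell exceeds a per-cell timeout, which is what surfaces as "Cell execution timed out" above. Below is a minimal sketch of that kind of notebook runner, assuming nbformat/nbconvert; the function name and timeout value are illustrative and not the actual _test_tutorial_nb implementation.

```python
# Minimal sketch of a notebook-based tutorial test with a per-cell timeout.
# Assumes nbformat/nbconvert are installed; names and the timeout value are
# illustrative, not the actual _test_tutorial_nb helper from test_tutorials.py.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError


def run_tutorial_notebook(path, timeout=900):
    """Execute every cell of the notebook at `path`; return True on success."""
    with open(path) as f:
        nb = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=timeout, kernel_name='python3')
    try:
        ep.preprocess(nb, {'metadata': {'path': '.'}})
        return True
    except (CellExecutionError, TimeoutError) as err:
        # A slow model download inside a cell can exhaust the timeout and
        # surface as "Cell execution timed out", as in the failures above.
        print('Notebook execution failed:', err)
        return False
```

A test can then simply do `assert run_tutorial_notebook('onnx/fine_tuning_gluon.ipynb')`, which is why a timed-out cell shows up as a bare AssertionError in the nose output.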

@szha
Member

szha commented May 8, 2018

@ThomasDelteil

@ThomasDelteil
Contributor

ThomasDelteil commented May 8, 2018

@zheng-da do you see multiple timeouts in this log, or have you experienced this issue multiple times?
The fact that test_tutorials.test_python_predict_image and test_tutorials.test_onnx_inference_on_onnx_model time out is really strange. They run simple inference code on CPU. The only obvious common denominator in these tests is that they download a sizable model. A possible reason I can think of for this timeout is the underlying instance running out of disk space, or bandwidth usage being very high. @marcoabreu could that be possible?

Tomorrow I'll look into replacing these models with smaller ones now that ONNX supports more models; that should benefit users anyway.

@zheng-da
Contributor Author

zheng-da commented May 8, 2018

I have seen it twice myself: once in my own PR and once in someone else's PR.

@TaoLv
Member

TaoLv commented May 8, 2018

I also encountered a connection error for test_onnx_fine_tuning_gluon after rebasing my PR onto the master branch.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10104/39/pipeline#step-1512-log-901

@ThomasDelteil
Contributor

@TaoLv that could be related. Someone was working on implementing retry logic in mx.test_utils.download to avoid failed HTTP connection issues, but I can't remember who.
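
For illustration, the retry idea boils down to wrapping the download in a loop with backoff so that a transient HTTP failure does not kill the whole notebook cell. Below is a minimal sketch under that assumption; the function name and parameters are illustrative, not the actual mx.test_utils.download API.

```python
# Minimal sketch of download-with-retries; illustrative only, not the actual
# mx.test_utils.download signature.
import time
import urllib.error
import urllib.request


def download_with_retries(url, dest, retries=3, backoff=2.0):
    """Download `url` to `dest`, retrying transient connection failures."""
    for attempt in range(1, retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return dest
        except (urllib.error.URLError, ConnectionError) as err:
            if attempt == retries:
                raise
            wait = backoff ** attempt
            print('Download of %s failed (%s), retrying in %.1fs' % (url, err, wait))
            time.sleep(wait)
```

Keeping the call signature compatible with the existing download helper (as discussed below) would let the tutorials pick up the retries without any changes.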

@marcoabreu
Contributor

If slaves run out of disk space, they'll be automatically disabled and no more jobs will be enqueued. Nonetheless, your test should be able to catch that. Network bandwidth should not be a problem either: EC2 has a quite good connection and our slaves are connected at 10 Gbit/s.

@marcoabreu
Contributor

@ThomasDelteil it was @KellenSunderland who worked on it. He had to close his PR due to a breaking API change.

@KellenSunderland
Contributor

KellenSunderland commented May 9, 2018

@ThomasDelteil and @marcoabreu I should bring that PR back to life. Classic move: get some feedback, not find the time to address it, and put the PR in the backlog. I think making it backwards compatible shouldn't be an issue.

@anirudh2290
Member

szha added this to To Do in Tests Improvement via automation May 14, 2018
@ThomasDelteil
Contributor

Thanks @anirudh2290, I have been distracted by other projects; I will try to prioritize moving to smaller models this week.

@ThomasDelteil
Contributor

@zheng-da a fix has been put in to reduce the size of the models used in the tutorials from 500 MB to under 50 MB. That should prevent this bug from happening in the future. Please consider closing for now.

szha closed this as completed May 17, 2018
Tests Improvement automation moved this from To Do to Done May 17, 2018