flaky tests in tutorial CI #10846
Comments
@zheng-da do you see multiple timeouts in this log, or have you experienced that issue multiple times? Tomorrow I'll look into replacing these models with smaller ones now that ONNX supports more models; that should benefit users anyway.
I've seen it two times myself: once in my own PR and once in someone else's PR.
I also encountered a connection error for
@TaoLv this could be related; someone was working on implementing retry logic on
If slaves run out of disk space, they'll be automatically disabled and no more jobs will be enqueued. Nonetheless, your test should be able to catch that. Network bandwidth should not be a problem: EC2 has quite a good connection, and our slaves are connected at 10Gbit/s.
@ThomasDelteil it was @KellenSunderland who worked on it. He had to close his PR because it broke the API.
@ThomasDelteil and @marcoabreu I should bring that PR back to life. Classic move: get some feedback, don't have time to address it, put the PR in the backlog. I think making it backwards compatible shouldn't be an issue.
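The PR itself isn't linked here, so as a rough illustration of what retry logic for flaky downloads could look like, here is a minimal sketch with exponential backoff. The helper name `with_retries` and its parameters are illustrative, not the actual implementation from that PR:

```python
import time

def with_retries(fn, retries=3, backoff=1.0, exceptions=(ConnectionError, TimeoutError)):
    """Call fn(), retrying transient failures with exponential backoff.

    Sleeps backoff, 2*backoff, 4*backoff, ... seconds between attempts;
    re-raises the last exception if all attempts fail.
    """
    for attempt in range(retries):
        try:
            return fn()
        except exceptions:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (2 ** attempt))

# Example: wrap a model download so a transient connection error
# doesn't immediately fail the whole tutorial run.
# with_retries(lambda: urllib.request.urlretrieve(url, path))
```

Keeping it backwards compatible would mainly mean defaulting to a single attempt unless the caller opts in.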
Thanks @anirudh2290, I have been distracted by other projects; I will try to prioritize moving to smaller models this week.
@zheng-da a fix has been put in to reduce the size of the models used by the tutorials from 500MB to under 50MB. That should prevent this bug from happening in the future. Please consider closing this issue for now.
I have seen the tutorial CI time out multiple times. It takes about one hour to run.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10843/1/pipeline/