This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Flaky Tests Tracking Issue #9412

Closed
szha opened this issue Jan 13, 2018 · 17 comments

Comments

@szha
Member

szha commented Jan 13, 2018

What

Thanks to community members, we identified tests that are flaky and need fixing. I'm putting together the list of open issues for tracking them and calling for help on fixing them.

We use this issue for tracking progress and coordinating efforts.

TODO

| Issue | Requester | Category | Cause | Status |
|---|---|---|---|---|
| #7645 | @rahul003 | Operator | numerical stability | Disabled |
| #8211 | @indhub | Autograd | autograd memory footprint | Disabled |
| #8230 | @indhub | Operator | numerical stability | Disabled |
| #8283 | @indhub | Utility | external dependency | Fixed #9503 |
| #8288 | @indhub | Operator | numerical stability (?) | Disabled |
| #8299 | @indhub | Operator | testing through training. randomness | Disabled |
| #8892 | @marcoabreu | Operator | testing through training. randomness | Disabled |
| #8934 | @marcoabreu | Operator | segfault in MKL version | Disabled |
| #9295 | @marcoabreu | Operator | laop hangs in MKL version | Disabled |
| #9384 | @eric-haibin-lin | Sparse/KVStore | segfault for sparse | Disabled |
| #8834 | @marcoabreu | Scala | Operator numerical stability | Disabled |
| #9415 | @sergeykolychev | Perl | segfault in gluon rnn | Disabled |
| #9669 | @KellenSunderland | Python | external dependency | Flaky |
| #9649 | @KellenSunderland | Mem test | timer | Flaky |
| #10087 | @anirudhacharya | Operator | precision | Flaky |

Completed

| Issue | Requester | Category | Cause | Status |
|---|---|---|---|---|
| #9604 | @zhreshold | Python | external dependencies | Fixed #9620 |
| #8928 | @marcoabreu | Perl | CPU segfault | Fixed #9414 |
| #9332 | @KellenSunderland | R | external dependency | Fixed #9598 |
| #9553 | @marcoabreu | Operator | need investigation | Fixed #9581 |

Meaning of status:

  • Flaky: test is enabled and flaky, and is impacting CI.
  • Disabled: temporarily disabled after discovery. Fix is needed.
  • Fixed: fix has finished and test is no longer flaky.
  • @Someone: @Someone is fixing the test.

How

To add a newly discovered flaky test

Create a new issue for the test, then comment here with a reference to the new issue.

To help fix the tests

Pick an issue that hasn't been taken. Comment here to say which issue you are working on, and I will update the status in the table. Then start working on the issue, and put details, findings and resolutions in the original issue. A good resource for understanding an issue is the people who wrote the feature and the tests. We can identify them from the commit history and ping them for help.

The requester of the original issue, as well as @apache/mxnet-committers, should make sure that, as a result of the fix, the tests:

  • Pass reliably with good coverage.
  • Avoid randomness unless necessary.
  • Avoid external dependencies unless necessary (e.g. due to license).
  • Have their root cause found and fixed if it is actually a problem in the code base.
  • Are not resource-intensive unless necessary (e.g. scaling tests).
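
As an illustration of the "avoid randomness" point, here is a minimal sketch of a test that pins its random seed and compares floating-point results with a tolerance instead of exact equality. It uses only the Python standard library; `noisy_sum` and the seed value are hypothetical stand-ins, not code from the MXNet test suite.

```python
import math
import random


def noisy_sum(values, rng):
    """Hypothetical function under test: sums values with tiny random jitter."""
    return sum(v + rng.uniform(-1e-9, 1e-9) for v in values)


def test_noisy_sum_is_reproducible():
    # Assumption: a fixed seed chosen for illustration. Logging the seed
    # lets a failing run be replayed with the same random stream.
    seed = 1234
    values = [0.1] * 10

    first = noisy_sum(values, random.Random(seed))
    second = noisy_sum(values, random.Random(seed))

    # Same seed, same inputs: the two runs must agree exactly.
    assert first == second
    # Compare against the expected value with a tolerance, not ==.
    assert math.isclose(first, 1.0, rel_tol=1e-6)
```

Passing the generator in as an argument, rather than relying on global random state, keeps the test independent of whatever other tests ran before it.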

Reference

Discussions on dev

On GPU Randomness

@sergeykolychev
Contributor

sergeykolychev commented Jan 13, 2018 via email

@sergeykolychev
Contributor

@szha merged #9414, which should address two flaky Perl tests and a small change for viz that happened upstream recently.

@bhavinthaker
Contributor

A few related PRs for reference on the randomness problem in CI tests:

  1. Ci test randomness #8313
  2. Ci test randomness2 #8526

Info from @szha: https://pypi.python.org/pypi/flaky

See also: email threads on dev@ titled "Improving and rationalizing unit tests" and "Call for Help for Fixing Flaky Tests"

@szha It may help to add this information to the list above for easy reference.

@szha
Member Author

szha commented Jan 14, 2018

@bhavinthaker good suggestions. I added these references.

@szha
Member Author

szha commented Jan 19, 2018

Working on #8283

@szha
Member Author

szha commented Jan 20, 2018

I fixed #8283 in #9503.

@eric-haibin-lin
Member

#9581 fixes #9553

@jeremiedb
Contributor

#9598 fixes #9332

@KellenSunderland
Contributor

Two more to be tracked: #9669 and #9649.

@anirudhacharya
Member

One more issue to be tracked - #10087

@anirudhacharya
Member

test_layer_norm has precision issues - #10114

@eric-haibin-lin
Member

@szha
Member Author

szha commented Mar 23, 2018

@eric-haibin-lin is there an open issue for test_correlation?

@haojin2
Contributor

haojin2 commented Mar 27, 2018

@szha This problem was solved once by #9581 but popped up again due to my recent changes to the operator to support all float data types. I'll do a deep dive on this issue today and work on a possible fix.

@marcoabreu
Contributor

I'm always adding them to https://github.com/apache/incubator-mxnet/projects/9#card-6995282 - do I have to call them out here as well?

@KellenSunderland
Contributor

KellenSunderland commented Mar 30, 2018

I've partially disabled the test test_op_output_names_monitor in PR #10342, as it's causing long hangs on our CI server. Tracked in issue #10341.

@szha szha moved this from To Do to In progress in Tests Improvement Jun 21, 2018
@szha
Member Author

szha commented Jun 22, 2018

We now use the GitHub project functionality for issue tracking: https://github.com/apache/incubator-mxnet/projects/9

@szha szha closed this as completed Jun 22, 2018
Tests Improvement automation moved this from In progress to Done Jun 22, 2018
@szha szha self-assigned this Jun 23, 2018