This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

improve dataloader signals and messages #16114

Merged
merged 3 commits into from Sep 19, 2019

Conversation

zhreshold
Member

Description

Improve the dataloader user experience.
With this PR, dataloaders:

  • Respond faster to termination with Ctrl + C
  • Show helpful messages when exceptions are raised in multiple workers. DataLoaders used to produce unreadable messages.
  • Are less likely to hang forever, because a timeout is added to the fetch logic
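
The third point can be illustrated with a minimal sketch using only the Python standard library. This is not the code from this PR: `slow_batch`, `fetch`, and `demo` are hypothetical names, and a `ThreadPool` stands in for the dataloader's worker pool so the example runs the same way on every platform. It only shows the general idea of bounding the wait on a worker result with a timeout and turning the timeout into a readable error.

```python
import multiprocessing
import time
from multiprocessing.pool import ThreadPool


def slow_batch(seconds):
    """Hypothetical stand-in for a worker that loads one batch."""
    time.sleep(seconds)
    return "batch"


def fetch(async_result, timeout):
    """Wait for a worker result, but give up after `timeout` seconds."""
    try:
        return async_result.get(timeout)
    except multiprocessing.TimeoutError:
        # Replace the bare timeout with a message the user can act on.
        raise RuntimeError(
            "Worker timed out after {} seconds; consider using fewer "
            "workers or a larger timeout.".format(timeout)
        ) from None


def demo():
    pool = ThreadPool(2)
    try:
        # A fast batch is returned well within the timeout.
        fast = fetch(pool.apply_async(slow_batch, (0.01,)), timeout=5)
        # A slow batch exceeds the timeout and raises instead of hanging.
        try:
            fetch(pool.apply_async(slow_batch, (0.5,)), timeout=0.05)
            timed_out = False
        except RuntimeError:
            timed_out = True
        return fast, timed_out
    finally:
        pool.terminate()
```

Without the `timeout` argument, `get()` would block until the worker finishes, which is the "hang forever" case described above.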

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@zhreshold zhreshold requested a review from szha as a code owner September 6, 2019 22:34
@zhreshold
Member Author

@szha, @eric-haibin-lin @sxjscience for review

@zhreshold
Member Author

ping for review 😄

@leezu
Contributor

leezu commented Sep 12, 2019

Thanks @zhreshold! I wonder if adding a default timeout is a backwards incompatible change that affects our semantic versioning guarantee? (some user code may currently run 130 seconds per batch, and it will stop working when users upgrade to MXNet 1.6)

If I understand correctly, the motivation to introduce timeout is:

Sometimes full shared_memory will cause all workers to hang and causes timeout. In these
cases please reduce num_workers or increase system shared_memory size instead.

@zhreshold
Member Author

@leezu the timeout applies only to the dataloader workers; it does not include network training on the main thread. Is there any use case where loading a single batch on CPU takes up to 2 minutes?

@leezu
Contributor

leezu commented Sep 13, 2019

I don't think it's a common or intended use case to have workers process a batch for more than 120 seconds. It's still a breaking change, but it may be fine to break a feature that's potentially unused (i.e. no one may rely on a timeout > 120).

Member

@szha szha left a comment

It's reasonable to assume that the loading time of a batch is less than 120 seconds.

@wkcn
Member

wkcn commented Sep 16, 2019

I have a suggestion: on timeout, the DataLoader could print a warning instead of terminating the program, and let users decide whether to terminate it.
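
A rough sketch of what this warn-instead-of-terminate behavior might look like, using only the Python standard library. This is an illustration of the suggestion, not code from this PR or from MXNet: `load_batch`, `fetch_with_warning`, `max_warnings`, and `demo` are hypothetical names, and a `ThreadPool` stands in for the worker pool.

```python
import multiprocessing
import time
import warnings
from multiprocessing.pool import ThreadPool


def load_batch(seconds):
    """Hypothetical stand-in for a worker that loads one batch."""
    time.sleep(seconds)
    return "batch"


def fetch_with_warning(async_result, timeout, max_warnings=None):
    """Warn each time `timeout` elapses, then keep waiting for the result."""
    issued = 0
    while True:
        try:
            return async_result.get(timeout)
        except multiprocessing.TimeoutError:
            issued += 1
            warnings.warn(
                "Batch not ready after {:.1f}s; still waiting.".format(
                    issued * timeout
                ),
                RuntimeWarning,
            )
            # Optional upper bound so the caller can still give up eventually.
            if max_warnings is not None and issued >= max_warnings:
                raise


def demo():
    pool = ThreadPool(1)
    try:
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            result = fetch_with_warning(
                pool.apply_async(load_batch, (0.3,)), timeout=0.1
            )
        return result, len(caught)
    finally:
        pool.terminate()
```

With `max_warnings=None` the loop never gives up on its own, so a genuinely stuck worker would still block the program; the user would have to interrupt it manually.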

@zhreshold
Member Author

@wkcn A timeout is mandatory in order to catch exceptions in subprocesses, due to a Python bug.

@zhreshold zhreshold merged commit 53b2b40 into apache:master Sep 19, 2019
@zhreshold zhreshold deleted the improve-dataloader branch September 19, 2019 01:23
larroy pushed a commit to larroy/mxnet that referenced this pull request Sep 28, 2019
* improve dataloader signals and messages

* address comments

* fix spawn tests on windows

5 participants