Improving multi-processing reliability for gluon DataLoader #13318

Merged
merged 2 commits into apache:master on Nov 26, 2018

Conversation

@YutingZhang
Contributor

YutingZhang commented Nov 19, 2018

I found some multi-processing-related issues in the Gluon DataLoader.

  1. Each time a _MultiWorkerIter shuts down, it can leave some dangling processes behind. The shutdown mechanism cannot guarantee that all worker processes are terminated, so after running for several epochs, more and more dangling processes accumulate.

This problem rarely happens during training, where there is a decent time interval between prefetching the last batch and shutting down the _MultiWorkerIter.
But it happens frequently 1) when I stop the iterator before the end of an epoch, and 2) when I use the DataLoader as a data-loading service and load data as fast as possible. In both cases, the interval between the most recent prefetch and the iterator shutdown is short. My guess is that the _MultiWorkerIter cannot shut down properly while data prefetching is still active.

To fix this, I explicitly terminate the worker processes inside the shutdown function.

  2. When loading data quickly (again, mostly during testing and data serving), there appears to be a risk of a data race. The _MultiWorkerIter uses a dict to cache prefetched data, but the dict does not seem to be safe for concurrent insertion and deletion from different threads, so occasionally data can go missing from it.

To prevent this, I use a scoped lock to guard access to the dict.
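
A minimal sketch of this kind of lock-guarded cache (the class and attribute names below are illustrative only, not the actual DataLoader internals):

```python
import threading

class _PrefetchCache(object):
    """Illustrative cache of prefetched batches keyed by batch index."""

    def __init__(self):
        self._data_buffer = {}
        self._lock = threading.Lock()

    def put(self, idx, batch):
        # Producer side: the fetcher thread inserts a prefetched batch.
        with self._lock:
            self._data_buffer[idx] = batch

    def take(self, idx):
        # Consumer side: the iterator removes the batch it is waiting for.
        with self._lock:
            return self._data_buffer.pop(idx, None)
```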

Yuting Zhang
improving multi-processing reliability for gluon dataloader

@YutingZhang YutingZhang requested a review from szha as a code owner Nov 19, 2018

@stu1130

Contributor

stu1130 commented Nov 19, 2018

@mxnet-label-bot add [pr-awaiting-review, Gluon, Data-loading]
Thanks for the contribution @YutingZhang

```python
for w in self._workers:
    if w.is_alive():
        import time
        time.sleep(0.5)  # allow 0.5 sec to join
```


@zhreshold

zhreshold Nov 20, 2018

Member

sleep in the main thread is not a good idea.


@YutingZhang

YutingZhang Nov 21, 2018

Contributor

I have two options:

  1. Simply remove this loop: do not wait for the workers to finish, and just kill them if they are still alive.
  2. Fork a new process to do the waiting and termination.


@zhreshold

zhreshold Nov 21, 2018

Member

It's safe to go with 1.


@YutingZhang

YutingZhang Nov 23, 2018

Contributor

pushed with option 1
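
For reference, a rough, hypothetical sketch of an option-1 style shutdown (the function and queue names are illustrative, and the (None, None) sentinel follows the convention mentioned elsewhere in this conversation; this is not necessarily the exact code that was merged):

```python
def shutdown_workers(workers, key_queue, data_queue):
    """Hypothetical option-1 shutdown: signal the workers, then terminate
    any stragglers without sleeping or joining in the main thread."""
    # Ask each worker to exit by sending the (None, None) sentinel.
    for _ in workers:
        key_queue.put((None, None))
    # Unblock the fetcher loop as well so it can leave its get() call.
    data_queue.put((None, None))
    # Do not wait for the workers to finish; forcibly terminate any
    # worker process that is still alive.
    for w in workers:
        if w.is_alive():
            w.terminate()
```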

@zhreshold

Member

zhreshold commented Nov 20, 2018

@YutingZhang I think it's okay to terminate worker processes right after shutdown, but I don't understand why you said "The shutdown mechanism could not guarantee that all worker processes can be terminated".

In some cases, if a worker still has pending keys in its queue (i.e., the workers are busy), it may need more time to exit, but the terminate signal (None, None) should make sure these daemon processes quit when they finish their jobs, or get killed by the main process.

So I am curious: do you know what caused the dangling processes?

@stu1130

Contributor

stu1130 commented Nov 20, 2018

@mxnet-label-bot update [pr-awaiting-response, Gluon, Data-loading]

@YutingZhang

Contributor

YutingZhang commented Nov 21, 2018

@zhreshold It was actually a bit confusing to me too, but that is what happened.

One guess:

Is there any size limit or get/put synchronization on the data_queue? Is it possible that a worker got stuck at data_queue.put? The fetcher_loop thread can get joined before the workers (this is possible in the original code, and more likely with my PR), and then, if the data_queue is full, the put can get stuck. Is there any such possibility?

By the way, I tried joining the workers before sending (None, None) to the fetcher_loop thread, but this can cause the fetcher_loop to get stuck at data_queue.get (the requested data is not in the queue, and no worker will ever put it there). This can then block the main thread if we try to join the fetcher_loop thread, or otherwise leaves a dangling thread.

If the above guess is true, a cleaner solution might be to move the logic for joining the workers into the fetcher_loop.
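
The guess above matches a documented pitfall of multiprocessing queues: a process that has put items on a queue does not exit until its feeder thread has flushed them to the underlying pipe, so joining it before the queue is drained can hang. A standalone illustration (not DataLoader code):

```python
import multiprocessing as mp

def worker(q):
    # A large object will not fit in the pipe buffer, so the worker's
    # feeder thread blocks until someone reads from the queue.
    q.put(bytearray(64 * 1024 * 1024))

if __name__ == '__main__':
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    p.join()        # risks hanging: the queue has not been drained yet
    data = q.get()  # draining before the join would avoid the hang
```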

@ThomasDelteil

Contributor

ThomasDelteil commented Nov 21, 2018

Can you see if your implementation solves this bug? #13126

@stu1130

Contributor

stu1130 commented Nov 21, 2018

@mxnet-label-bot update [pr-awaiting-review, Gluon, Data-loading]

@zhreshold

Member

zhreshold commented Nov 21, 2018

@YutingZhang Oh yes, "Is it possible that the worker got stuck at the data_queue.put": this is possible. If you destroy the queue before the workers exit, it may cause problems.
So even though daemon processes get killed after the main process exits, there is no guarantee that the dangling processes will be freed while the main process is still running.

Now I get your point, thanks @YutingZhang.

The solution is fairly simple: make sure shutdown kills all workers and the fetcher.

@YutingZhang

Contributor

YutingZhang commented Nov 21, 2018

Can you see if your implementation solves this bug? #13126

@ThomasDelteil Yes, I think it solves the problem. I have run into this type of problem before, but I cannot remember the exact cause now (I did the fixes for my own code a while ago). It is possibly due to incorrect ordering when shutting down the workers and the fetcher_loop.

I tried your example: I could reproduce the problem, and my fixes resolved it. Could you try it out on your side?

@YutingZhang

Contributor

YutingZhang commented Nov 21, 2018

@zhreshold Right. If shutdown() does not wait for the processes to join, the queue can be destroyed first.
As for the "daemon processes" point: the whole program is the parent process of the workers, so as long as the program keeps running, the daemon processes will stay around. Each epoch therefore has a chance of adding a few more dangling processes.

@zhreshold

Member

zhreshold commented Nov 26, 2018

This PR is good to merge. Before merging, I need to add a note that for statement (2), built-in structures are thread-safe for single operations, so the original code was actually fine.

However, introducing the lock here will protect the code if it is ever split into separate operations in the future, and adds peace of mind. So I think the lock is good here.
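
For context, a single dict operation is atomic in CPython, while a check followed by a separate mutation is not, which is exactly the kind of future refactoring the lock guards against. A small illustration (hypothetical names):

```python
import threading

cache = {}
lock = threading.Lock()

# A single dict operation is atomic in CPython, so this is safe on its own:
batch = cache.pop(0, None)

# A check followed by a separate pop spans two operations; another thread
# could remove the key in between, so this compound pattern needs the lock:
with lock:
    if 0 in cache:
        batch = cache.pop(0)
```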

Thanks @YutingZhang for the contribution.

@zhreshold zhreshold merged commit 7b1e7a5 into apache:master Nov 26, 2018

14 checks passed

@YutingZhang

Contributor

YutingZhang commented Nov 26, 2018

@zhreshold Thank you for merging it.
I also recall that single dict operations should be thread-safe. However, I did see weird problems as described in statement (2), so I started doubting that ...

@YutingZhang

Contributor

YutingZhang commented Nov 26, 2018

Submitted a similar PR for gluoncv: dmlc/gluon-cv#486

zhreshold added a commit to dmlc/gluon-cv that referenced this pull request Nov 27, 2018
