Improving multi-processing reliability for gluon DataLoader #13318
I found some multi-processing-related issues in the Gluon DataLoader.
This problem rarely happens during training, because there is usually a decent time interval between prefetching the last batch and the _MultiWorkerIter shutting down.
To fix this, I explicitly terminate the worker processes inside the shutdown function.
To prevent this, I use a scope lock to guard the dict access.
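As a rough illustration of the scope-lock idea, here is a minimal sketch of a dict guarded by a `with` block. The class name `BatchBuffer` and its methods are hypothetical and are not taken from the PR. The sketch only assumes that one thread fills the dict while another drains it.

```python
import threading


class BatchBuffer:
    """Hypothetical buffer shared by a fetcher thread and the main thread."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def put(self, idx, batch):
        # The lock is released automatically when the with-block exits.
        with self._lock:
            self._data[idx] = batch

    def pop_if_ready(self, idx):
        # Check-then-pop is two dict operations; the lock makes them one step.
        with self._lock:
            if idx in self._data:
                return self._data.pop(idx)
            return None
```

The `with` form keeps the critical section limited to the dict access, so an exception inside it cannot leave the lock held.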
@YutingZhang I think it's okay to terminate worker processes right after shutdown, but I don't understand why you mentioned "The shutdown mechanism could not guarantee that all worker processes can be terminated".
In some cases, if a worker has a propagated key queue (i.e., the workers are busy), the workers will likely need longer to exit, but the terminate signal (None, None) makes sure these daemon processes quit when they finish their jobs, or get killed by the main process.
So I am curious: do you know what caused the dangling processes?
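For readers unfamiliar with the pattern, the (None, None) terminate signal can be sketched as a sentinel that a worker loop checks after each `get`. This is an illustration only: `worker_loop`, `load_batch`, and `run_demo` are hypothetical names, and a thread stands in for a worker process to keep the sketch short.

```python
import queue
import threading


def worker_loop(key_queue, data_queue, load_batch):
    """Consume (idx, samples) pairs until the (None, None) sentinel arrives."""
    while True:
        idx, samples = key_queue.get()
        if idx is None:
            break  # terminate signal; earlier jobs have already been processed
        data_queue.put((idx, load_batch(samples)))


def run_demo():
    key_queue, data_queue = queue.Queue(), queue.Queue()
    worker = threading.Thread(
        target=worker_loop, args=(key_queue, data_queue, sum), daemon=True
    )
    worker.start()
    key_queue.put((0, [1, 2, 3]))
    key_queue.put((1, [4, 5]))
    key_queue.put((None, None))  # queued behind the pending jobs
    worker.join(timeout=5)
    results = [data_queue.get_nowait() for _ in range(2)]
    return worker.is_alive(), results
```

Because the sentinel is queued behind the pending keys, a busy worker only sees it after draining its earlier jobs, which is why exit can take longer when the key queue is full.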
@zhreshold It was actually also a bit confusing to me, but that was what happened.
Is there any size limit or get-put sync of the
By the way, I tried to join the workers before sending
If the above guess is true, a possibly more decent solution is to add the logic of joining workers in the
@YutingZhang Oh yes, "Is it possible that the worker got stuck at the data_queue.put": this is possible. If you destroy the queue before the workers exit, it may cause problems.
Now I get your point, thanks @YutingZhang
The solution is fairly simple: make sure shutdown kills all workers and the fetcher.
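A minimal sketch of such a shutdown, assuming a list of `multiprocessing.Process` workers that share one key queue: send the sentinel first, wait briefly, then terminate anything still alive. The names `shutdown_workers` and `stuck_worker` and the timeout value are hypothetical, and this is not the PR's actual code.

```python
import multiprocessing as mp
import time


def stuck_worker(key_queue):
    """Hypothetical worker that never reads the sentinel, e.g. one blocked elsewhere."""
    while True:
        time.sleep(0.1)


def shutdown_workers(workers, key_queue, timeout=0.5):
    """Ask workers to exit, then forcibly terminate any that do not."""
    # Cooperative step: one sentinel per worker.
    for _ in workers:
        key_queue.put((None, None))
    for w in workers:
        w.join(timeout)
        if w.is_alive():
            # Forced step: the worker did not react to the sentinel in time.
            w.terminate()
            w.join()
```

Calling `join()` after `terminate()` reaps the process, so no dangling worker is left behind even if it never saw the sentinel.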
@ThomasDelteil Yes, I think it solves the problem. I have met this type of problem before, but I cannot remember the exact cause now (I fixed my own code a while ago). It is possibly due to the incorrect ordering of shutting down the workers and the fetcher_loop.
I tried your example. I could replicate the problem, and it was solved by my fixes. How about trying it out on your side?
@zhreshold Right. If the
This PR is good to merge. Before merging, I need to add a note on statement (2): built-in structures are thread-safe for single operations, so the original code was actually fine.
However, introducing a lock here will protect the code if the access is split into separate operations in the future, and it adds peace of mind. So I think the lock is good here.
Thanks @YutingZhang for the contribution.