dataloader crashes with threads and slowdowns with processes #13945
Comments
@mfiore Thanks for raising this. I'm labelling it so that other community members can help resolve this. @mxnet-label-bot Add [Bug, Data-loading, Python]
ThreadPool support was introduced in this PR #13606 cc @zhreshold
Is your dataset small? Do you know how many batches there are in each epoch? If it's small, the prefetching step will push all the workloads at once and you will have to wait until the first worker finishes its job. Regarding your question,
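A quick way to check the batch count, as a trivial sketch (the dataset and batch size names below are placeholders for your own values):

```python
# Minimal sketch, assuming `train_dataset` is the dataset fed to the DataLoader
# and `batch_size` matches the DataLoader's setting.
import math

batch_size = 32
batches_per_epoch = math.ceil(len(train_dataset) / batch_size)
print('batches per epoch:', batches_per_epoch)
```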
@zhreshold thanks for your answer!
It's fixed by overriding the pickling behavior of RecordIO files without touching the mutex. I am closing this now. Please ping me if the problem persists.
Can you please elaborate? I am running into a similar issue. How should I override the pickling behavior?
@mathephysicist Multiprocessing access to the same RecordIO file is already fixed in master.
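For anyone on an older version who needs a workaround, here is a minimal sketch of what overriding the pickling behavior could look like (an illustrative wrapper, not the actual fix that landed in master): drop the unpicklable record handle in __getstate__ and re-open the record file lazily in each worker process.

```python
# Hedged sketch: a dataset that owns an MXIndexedRecordIO handle but excludes it
# from its pickled state, so worker processes re-open the file themselves instead
# of inheriting a handle (and its internal lock). Names are illustrative.
import mxnet as mx
from mxnet import gluon


class SafeRecordDataset(gluon.data.Dataset):
    def __init__(self, idx_path, rec_path):
        self._idx_path = idx_path
        self._rec_path = rec_path
        self._record = None  # opened lazily, once per process

    def _ensure_open(self):
        if self._record is None:
            self._record = mx.recordio.MXIndexedRecordIO(
                self._idx_path, self._rec_path, 'r')

    def __getitem__(self, idx):
        self._ensure_open()
        # returns the raw packed record; unpack/decode as needed
        return self._record.read_idx(self._record.keys[idx])

    def __len__(self):
        self._ensure_open()
        return len(self._record.keys)

    def __getstate__(self):
        # Exclude the non-picklable record handle; it will be re-opened lazily
        # after unpickling in the worker process.
        state = self.__dict__.copy()
        state['_record'] = None
        return state
```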
Description
(Sorry for the long issue; should it be split in two, one for multiprocessing and one for threads? I've posted them together because I thought they might be related, since part of the code is shared.)
Hello, I'm trying to train an SSD network using GluonCV. My dataset is a record file loaded with RecordFileDetection, and I'm using gluon.data.DataLoader with SSDDefaultTrainTransform (I took most of the code from the sample script at https://github.com/dmlc/gluon-cv/blob/master/scripts/detection/ssd/train_ssd.py).
There are some heavy slowdowns while iterating over batches. I've tried different batch sizes and numbers of workers. If I measure the time to load a batch inside the loop, it is normally around 0.02 s, but it has random spikes of 4, 5 or even 7 seconds.
I've then tried using thread_pool=True in my DataLoader. In that case, reading from the RecordIO file makes the program crash.
Environment info (Required)
Error Message:
Error message when training with thread_pool=True
Minimum reproducible example
(set use_threads=True to reproduce the threading issue)
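A rough sketch of the kind of setup described above, modelled on the train_ssd.py script rather than the original snippet (the record file path, the anchor extraction, and the use_threads flag are placeholders):

```python
# Illustrative reproduction sketch, assuming a local train.rec/train.idx pair.
import mxnet as mx
from mxnet import gluon
from gluoncv import model_zoo
from gluoncv.data import RecordFileDetection
from gluoncv.data.transforms.presets.ssd import SSDDefaultTrainTransform
from gluoncv.data.batchify import Tuple, Stack

use_threads = False  # set to True to reproduce the thread_pool crash

net = model_zoo.get_model('ssd_512_resnet50_v1_voc', pretrained_base=False)
net.initialize()
with mx.autograd.train_mode():
    _, _, anchors = net(mx.nd.zeros((1, 3, 512, 512)))  # anchors for the transform

dataset = RecordFileDetection('train.rec')  # placeholder record file
width, height = 512, 512
batchify_fn = Tuple(Stack(), Stack(), Stack())  # image, cls_targets, box_targets
train_loader = gluon.data.DataLoader(
    dataset.transform(SSDDefaultTrainTransform(width, height, anchors)),
    batch_size=32, shuffle=True, batchify_fn=batchify_fn,
    last_batch='rollover', num_workers=4, thread_pool=use_threads)

for batch in train_loader:
    pass  # iterating is enough to hit the slowdowns / the thread_pool crash
```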
What have you tried to solve it?
With thread_pool=False: I've tried changing num_workers. With num_workers=0 the slowdowns don't seem to happen (it's always slow, of course =) ). Even with two workers I start encountering the issue, and it doesn't seem to change as I keep increasing them. I've compared the time to iterate over 20 batches with batch_size 32 against the training script from the deprecated MXNet SSD repository, and this DataLoader is less than half as fast. After measuring execution times in the code, the issue seems to be related to the pickle.loads call on line 443 of dataloader.py (see the illustrative timing sketch below).
With thread_pool=True: I've tried removing the prefetching and the error disappears, but the speed is pretty slow (about 3 seconds per batch).
Looking a bit into the dataloader code, I've found the following:
In dataloader.py I see that self._data_buffer is filled, but many of the results are not successful. Trying to get them gives me one of the errors reported above (if I add a wait of 5 seconds there, the program doesn't crash).
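For context, a rough illustration of the general pattern being discussed (a generic sketch, not the actual mxnet dataloader.py source): worker tasks are submitted to a thread pool and their AsyncResults are kept in a buffer keyed by batch index, and ready()/successful() indicate whether a given result has finished and whether it raised.

```python
# Generic sketch of a prefetch buffer built on multiprocessing.pool.ThreadPool.
from multiprocessing.pool import ThreadPool


def _load_batch(idx):
    # placeholder for the real per-batch work (read record, transform, batchify)
    return idx


pool = ThreadPool(4)
data_buffer = {i: pool.apply_async(_load_batch, (i,)) for i in range(8)}

for i, res in data_buffer.items():
    if res.ready() and not res.successful():
        print('batch %d failed' % i)      # res.get() would re-raise the error
    else:
        print('batch %d ->' % i, res.get())  # blocks until the result is ready

pool.close()
pool.join()
```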