Can't break loop when using > 1 worker with DataLoader #13126

Open
ThomasDelteil opened this Issue Nov 5, 2018 · 2 comments

ThomasDelteil commented Nov 5, 2018

Description

When a DataLoader uses more than one worker and the iteration loop is broken immediately after it starts, the workers crash and throw an error.

Environment info (Required)

MXNet 1.3.1

Minimum reproducible example

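import mxnet as mx

# MNIST training set, with each image converted by ToTensor()
train_dataset = mx.gluon.data.vision.MNIST(train=True).transform_first(mx.gluon.data.vision.transforms.ToTensor())
# num_workers=2: the report concerns loaders with more than one worker
train_data = mx.gluon.data.DataLoader(train_dataset, shuffle=True, last_batch='rollover', batch_size=100, num_workers=2)

for data, label in train_data:
    break  # leave the loop as soon as the first batch arrives

Exception in thread Thread-5: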
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/anaconda3/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 195, in fetcher_loop
    idx, batch = data_queue.get()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mxnet/gluon/data/dataloader.py", line 57, in rebuild_ndarray
    fd = fd.detach()
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/home/ubuntu/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 153, in recvfds
    msg, ancdata, flags, addr = sock.recvmsg(1, socket.CMSG_LEN(bytes_size))
ConnectionResetError: [Errno 104] Connection reset by peer
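
The traceback is raised in the fetcher thread (`fetcher_loop`) while it is still receiving a batch from a worker process, so one plausible reading is that the workers go away on `break` before that thread has stopped. The sketch below is not MXNet's code; it is a standard-library toy, using threads instead of processes, that shows one way to shut a multi-worker loader down cleanly when the consumer exits early. `TinyLoader`, `_STOP`, and the doubling "work" are all hypothetical names chosen for illustration.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a worker thread to exit


class TinyLoader:
    """Toy multi-worker loader; an illustration, not MXNet's DataLoader."""

    def __init__(self, batches, num_workers=2):
        batches = list(batches)
        self._remaining = len(batches)
        self._tasks = queue.Queue()
        self._results = queue.Queue()
        for b in batches:
            self._tasks.put(b)
        self._threads = [
            threading.Thread(target=self._work, daemon=True)
            for _ in range(num_workers)
        ]
        for t in self._threads:
            t.start()

    def _work(self):
        # Each worker pulls tasks until it receives the sentinel.
        while True:
            item = self._tasks.get()
            if item is _STOP:
                return
            self._results.put(item * 2)  # stand-in for loading a batch

    def __iter__(self):
        while self._remaining:
            self._remaining -= 1
            yield self._results.get()

    def close(self):
        # Discard tasks nobody will consume, so workers reach the sentinel soon.
        while True:
            try:
                self._tasks.get_nowait()
            except queue.Empty:
                break
        # One sentinel per worker, then wait for every worker to finish.
        for _ in self._threads:
            self._tasks.put(_STOP)
        for t in self._threads:
            t.join()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()


with TinyLoader(range(10), num_workers=2) as loader:
    for first in loader:
        break  # leave the loop early, as in the report above
# When the with-block exits, close() has stopped and joined every worker.
```

The context manager makes the shutdown order explicit: the consumer stops first, then the workers are told to exit and are joined, so nothing is left reading from a producer that has already gone away.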

leleamol commented Nov 5, 2018

@ThomasDelteil Thanks for submitting this issue. I am labeling this issue so that MXNet Community members can help resolve it.

@mxnet-label-bot [Gluon, Data-loading]

harshp8l commented Nov 6, 2018

@mxnet-label-bot [Gluon, Data-loading]
