
OSError: [Errno 24] Too many open files #158

Closed
zimenglan-sysu-512 opened this issue Nov 14, 2018 · 3 comments
Labels
question Further information is requested

Comments

@zimenglan-sysu-512
Contributor

zimenglan-sysu-512 commented Nov 14, 2018

❓ Questions and Help

After merging the commit "fix maskrnn typo" (#154), when I run the training procedure it always hits the error below:

 Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 234, in _feed
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 243, in reduce_storage
RuntimeError: unable to open shared memory object </torch_30997_2076642173> in read-write mode
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 176, in send_handle
    with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
  File "/usr/lib/python3.6/socket.py", line 460, in fromfd
    nfd = dup(fd)
OSError: [Errno 24] Too many open files

Traceback (most recent call last):
  File "tools/train_net.py", line 170, in <module>
    main()
  File "tools/train_net.py", line 163, in main
    model = train(cfg, args.local_rank, args.distributed)
  File "tools/train_net.py", line 73, in train
    arguments,
  File "maskrcnn-benchmark/maskrcnn_benchmark/engine/trainer.py", line 60, in do_train
    for iteration, (images, targets, _) in enumerate(data_loader, start_iter):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 631, in __next__
    idx, batch = self._get_batch()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 610, in _get_batch
    return self.data_queue.get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.6/dist-packages/torch/multiprocessing/reductions.py", line 204, in rebuild_storage_fd
    fd = df.detach()
  File "/usr/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
    return recvfds(s, 1)[0]
  File "/usr/lib/python3.6/multiprocessing/reduction.py", line 155, in recvfds
    raise EOFError
EOFError

Does anyone know how to fix this?
Thanks.

@zimenglan-sysu-512
Contributor Author

zimenglan-sysu-512 commented Nov 14, 2018

I followed OSError: Too many open files #396 and added these two lines to /etc/security/limits.conf:

*               soft    nofile         65535
*               hard    nofile         65535

Then I rebooted, which solved it.
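
For reference, the same limit can also be inspected and raised from inside the training script with Python's standard resource module. This is only a minimal sketch, not part of the fix above; the soft limit can only be raised up to the hard limit, so the limits.conf change is still what lifts the ceiling.

import resource

# Query the current per-process open-file limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"nofile limit: soft={soft}, hard={hard}")

# Raise the soft limit to the hard limit (e.g. 65535 after the
# /etc/security/limits.conf change and a reboot).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))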

@fmassa added the question (Further information is requested) label on Nov 14, 2018
@yaohuaxin

Do we really need to open so many files?

@fmassa
Contributor

fmassa commented Feb 28, 2019

@yaohuaxin this is due to how the DataLoader with multiple worker processes works with some particular combinations of settings
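
A common workaround for this failure mode (a hedged sketch, not something proposed in this thread) is to switch PyTorch's tensor sharing strategy so that workers do not hold one file descriptor per shared tensor:

import torch.multiprocessing as mp

# The default "file_descriptor" strategy can exhaust the nofile limit when
# many tensors are shared between DataLoader worker processes; the
# "file_system" strategy avoids keeping those descriptors open.
mp.set_sharing_strategy("file_system")  # call before the DataLoader is created
print(mp.get_sharing_strategy())        # prints: file_system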
