"EOFError" during training for both Matterport and Gibson datasets #26

Closed

frankthecr0c opened this issue Jan 18, 2021 · 6 comments

frankthecr0c commented Jan 18, 2021

Hi all, I successfully installed Neural-SLAM, and I can run the test code; everything seems to work.
Since I want to train the network with different datasets, I downloaded Matterport3D and Gibson.
Following your tutorial, with minor changes, I can run the training code with both datasets, but at a certain (random) episode I receive the following error message:

File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/lince/Documenti/MythicalSeg/EnvGibsonSingle/ds_study_neuralslam_chaplot/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
command, data = connection_read_fn()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
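
For reference, an EOFError from connection.recv() in a worker just means the other end of the pipe was closed, which typically happens when the main process dies. A minimal reproduction of the mechanism, independent of Habitat:

```python
import multiprocessing as mp

def worker(conn):
    try:
        conn.recv()  # same call path as vector_env's connection_read_fn
    except EOFError:
        print("worker: EOFError -- the other end of the pipe is gone")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # child gets only the pipe end passed to it
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()   # parent drops its copy of the child's end
    parent_conn.close()  # simulates the main process being killed (e.g. by the OOM killer)
    p.join()
```

So the worker raising EOFError is usually a symptom, not the cause; the interesting question is why the main process went away.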

Since it happens with both datasets, I investigated memory consumption (my computer has two Nvidia GeForce RTX 2080 Ti GPUs): the memory used during training stayed at acceptable levels, even when the error occurred.

I tried different configurations; the last one I used for Gibson was: --exp_name gibson_orig --save_periodic 2500 --slam_memory_size 100000

Do you have any suggestions?

I really appreciate any help you can provide.

@GracefulMan

I also get the same error when I use the Gibson dataset to train the model. I don't know how to solve it.

I0126 20:40:37.990522  8736 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990520  8739 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990571  8737 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990518  8738 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.994117  8736 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994128  8739 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994138  8736 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994138  8739 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994135  8737 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994141  8736 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994143  8739 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994148  8737 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994151  8737 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994186  8738 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994195  8738 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994202  8738 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:38.046005  8737 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.075238  8737 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.085269  8736 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.085384  8736 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.087087  8739 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.087186  8739 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.092942  8738 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.093048  8738 WindowlessContext.cpp:240] Deconstructing GL context
Process ForkServerProcess-4:
Process ForkServerProcess-3:
Process ForkServerProcess-2:
Process ForkServerProcess-1:
Killed
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
Traceback (most recent call last):
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

ZhuFengdaaa commented Feb 2, 2021

Try the solution here. I had the same problem and observed that it was due to running out of CPU memory. I fixed it by closing the envs every few iterations and creating a new instance.
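
A sketch of that workaround: periodically close() the vectorized envs and rebuild them so the worker processes release their memory. Here construct_envs and train_one_iteration are hypothetical stand-ins for the repo's env builder and update step:

```python
def train_with_periodic_env_reset(args, num_iterations, reset_every=500):
    """Workaround for the OOM kill: rebuild the vectorized envs every
    `reset_every` iterations so leaked simulator memory is reclaimed.

    `construct_envs` and `train_one_iteration` are hypothetical stand-ins
    for the repo's env construction and PPO/Neural-SLAM update step.
    """
    envs = construct_envs(args)  # builds the VectorEnv of Habitat workers
    obs = envs.reset()
    for it in range(num_iterations):
        obs = train_one_iteration(envs, obs)
        if (it + 1) % reset_every == 0:
            envs.close()  # terminate worker processes, freeing their RAM
            envs = construct_envs(args)
            obs = envs.reset()
    envs.close()
```

The reset interval is a trade-off: rebuilding envs is slow, so set it just low enough that memory never reaches the OOM threshold.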

@frankthecr0c (Author)

@ZhuFengdaaa Thank you for the information. I'll try both your suggestion and the habitat-sim update (in two different and isolated environments).

@GracefulMan

> Try the solution here. I had the same problem and observed that it was due to running out of CPU memory. I fixed it by closing the envs every few iterations and creating a new instance.

Thank you for your reply, I will try to solve this problem using the method you provided!

@frankthecr0c (Author)

@ZhuFengdaaa Following your suggestions, we are investigating the origin of the problem. As mentioned in your linked issue, we analyzed the output of the dmesg command after the error; it points to virtual memory (RAM). During the latest run we also monitored GPU memory, and usage never reached critical values.
I'll keep you updated. In the next few days we plan to upgrade habitat-sim.
Thank you

[20192.302390] [ 3865] 1000 3865 3860896 643675 9486336 122973 0 python
[20192.302391] [ 3887] 1000 3887 915700 89989 1896448 2 0 tensorboard
[20192.302392] [ 4424] 1000 4424 215421 3829 348160 0 0 gnome-calendar
[20192.302393] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=3823,uid=1000
[20192.302422] Out of memory: Killed process 3823 (python) total-vm:64489336kB, anon-rss:43841244kB, file-rss:68236kB, shmem-rss:2056kB, UID:1000 pgtables:96844kB oom_score_adj:0
[20193.242442] oom_reaper: reaped process 3823 (python), now anon-rss:0kB, file-rss:74732kB, shmem-rss:2056kB
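
To catch this earlier, resident memory can be logged from inside the training loop; a small sketch using psutil (an extra dependency, not part of the repo):

```python
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_memory(step):
    """Print this process's resident set size and system-wide available RAM."""
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    avail_gb = psutil.virtual_memory().available / 1024 ** 3
    print(f"[step {step}] rss={rss_gb:.2f} GiB, available={avail_gb:.2f} GiB")
```

Note this only covers the main process; each ForkServerProcess worker has its own footprint, so system-wide available memory is the more reliable early-warning signal.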

@devendrachaplot (Owner)

A way to reduce the memory footprint during training is to shrink the memory used for training the Neural SLAM module via the slam_memory_size argument, e.g. --slam_memory_size 50000.
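
To see why this helps: the SLAM module trains from a bounded buffer of past samples whose capacity is slam_memory_size, so shrinking it directly caps how many samples sit in RAM. An illustrative sketch of such a bounded buffer (assumed FIFO here; not the repo's actual implementation):

```python
import random
from collections import deque

class FIFOMemory:
    """Illustrative bounded replay buffer: `memory_size` caps peak RAM
    because the oldest samples are evicted once the deque is full."""

    def __init__(self, memory_size):
        self.buffer = deque(maxlen=memory_size)  # e.g. 50000 instead of 100000

    def push(self, sample):
        self.buffer.append(sample)  # drops the oldest sample when at capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```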

Please update the thread if you found a solution.
