"EOFError" during training for both Matterport and Gibson datasets #26

Closed

frankthecr0c opened this issue Jan 18, 2021 · 6 comments

frankthecr0c commented Jan 18, 2021

Hi all, I successfully installed Neural-SLAM, and I can run the test code; everything seems to work.
Since I want to train the network with different datasets, I downloaded Matterport3D and Gibson.
Following your tutorial, with minor changes, I can run the training code with both datasets, but at a certain (random) episode I receive the following error message:

File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
self.run()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/process.py", line 99, in run
self._target(*self._args, **self._kwargs)
File "/home/lince/Documenti/MythicalSeg/EnvGibsonSingle/ds_study_neuralslam_chaplot/Neural-SLAM/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
command, data = connection_read_fn()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 250, in recv
buf = self._recv_bytes()
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/lince/anaconda3/envs/NeuralSlamChapl/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
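
For reference, an EOFError from connection.recv() in a worker just means the other end of the pipe was closed, which typically happens when the main process dies. A minimal reproduction of the mechanism, independent of Habitat:

```python
import multiprocessing as mp

def worker(conn):
    try:
        conn.recv()  # same call path as vector_env's connection_read_fn
    except EOFError:
        print("worker: EOFError -- the other end of the pipe is gone")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # child gets only the pipe end passed to it
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=worker, args=(child_conn,))
    p.start()
    child_conn.close()   # parent drops its copy of the child's end
    parent_conn.close()  # simulates the main process being killed (e.g. by the OOM killer)
    p.join()
```

So the worker raising EOFError is usually a symptom, not the cause; the interesting question is why the main process went away.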

Since it happens with both datasets, I investigated memory consumption (my computer has two Nvidia GeForce RTX 2080 Ti GPUs): the memory used during training stayed at acceptable levels, even when the error occurred.

I tried different configurations; the last one I used for Gibson was: --exp_name gibson_orig --save_periodic 2500 --slam_memory_size 100000

Do you have any suggestions?

I really appreciate any help you can provide.

@GracefulMan

I also get the same error when I use the Gibson dataset to train the model. I don't know how to solve it.

I0126 20:40:37.990522  8736 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990520  8739 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990571  8737 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.990518  8738 Simulator.cpp:42] Deconstructing Simulator
I0126 20:40:37.994117  8736 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994128  8739 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994138  8736 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994138  8739 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994135  8737 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994141  8736 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994143  8739 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994148  8737 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994151  8737 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:37.994186  8738 SemanticScene.h:40] Deconstructing SemanticScene
I0126 20:40:37.994195  8738 SceneManager.h:24] Deconstructing SceneManager
I0126 20:40:37.994202  8738 SceneGraph.h:20] Deconstructing SceneGraph
I0126 20:40:38.046005  8737 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.075238  8737 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.085269  8736 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.085384  8736 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.087087  8739 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.087186  8739 WindowlessContext.cpp:240] Deconstructing GL context
I0126 20:40:38.092942  8738 Renderer.cpp:38] Deconstructing Renderer
I0126 20:40:38.093048  8738 WindowlessContext.cpp:240] Deconstructing GL context
Process ForkServerProcess-4:
Process ForkServerProcess-3:
Process ForkServerProcess-2:
Process ForkServerProcess-1:
Killed
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
Traceback (most recent call last):
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
EOFError
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/media/mhy/data/neural_slam/navigation_via_reinforcement_learning/env/habitat/habitat_api/habitat/core/vector_env.py", line 208, in _worker_env
    command, data = connection_read_fn()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/mhy/anaconda3/envs/neural_slam/lib/python3.7/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError

ZhuFengdaaa commented Feb 2, 2021

Try the solution here. I had the same problem and observed that it was due to running out of CPU memory. I fixed it by closing the envs every few iterations and creating a new instance.
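
A sketch of that workaround: periodically close() the vectorized envs and rebuild them so the worker processes release their memory. Here construct_envs and train_one_iteration are hypothetical stand-ins for the repo's env builder and update step:

```python
def train_with_periodic_env_reset(args, num_iterations, reset_every=500):
    """Workaround for the OOM kill: rebuild the vectorized envs every
    `reset_every` iterations so leaked simulator memory is reclaimed.

    `construct_envs` and `train_one_iteration` are hypothetical stand-ins
    for the repo's env construction and PPO/Neural-SLAM update step.
    """
    envs = construct_envs(args)  # builds the VectorEnv of Habitat workers
    obs = envs.reset()
    for it in range(num_iterations):
        obs = train_one_iteration(envs, obs)
        if (it + 1) % reset_every == 0:
            envs.close()  # terminate worker processes, freeing their RAM
            envs = construct_envs(args)
            obs = envs.reset()
    envs.close()
```

The reset interval is a trade-off: rebuilding envs is slow, so set it just low enough that memory never reaches the OOM threshold.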

@frankthecr0c (Author)

@ZhuFengdaaa Thank you for the information. I'll try both your suggestion and the habitat-sim update (in two different and isolated environments).

@GracefulMan

> Try the solution here. I had the same problem and observed that it was due to running out of CPU memory. I fixed it by closing the envs every few iterations and creating a new instance.

Thank you for your reply, I will try to solve this problem using the method you provided!

@frankthecr0c (Author)

@ZhuFengdaaa Following your suggestions, we are investigating the origin of the problem. As mentioned in your linked issue, we analyzed the output of the dmesg command after the error; it points to virtual memory (RAM). During the latest run we also monitored GPU memory, and usage never reached critical values.
I'll keep you updated. In the next few days we plan to upgrade habitat-sim.
Thank you

[20192.302390] [ 3865] 1000 3865 3860896 643675 9486336 122973 0 python
[20192.302391] [ 3887] 1000 3887 915700 89989 1896448 2 0 tensorboard
[20192.302392] [ 4424] 1000 4424 215421 3829 348160 0 0 gnome-calendar
[20192.302393] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/user@1000.service,task=python,pid=3823,uid=1000
[20192.302422] Out of memory: Killed process 3823 (python) total-vm:64489336kB, anon-rss:43841244kB, file-rss:68236kB, shmem-rss:2056kB, UID:1000 pgtables:96844kB oom_score_adj:0
[20193.242442] oom_reaper: reaped process 3823 (python), now anon-rss:0kB, file-rss:74732kB, shmem-rss:2056kB
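
To catch this earlier, resident memory can be logged from inside the training loop; a small sketch using psutil (an extra dependency, not part of the repo):

```python
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_memory(step):
    """Print this process's resident set size and system-wide available RAM."""
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    avail_gb = psutil.virtual_memory().available / 1024 ** 3
    print(f"[step {step}] rss={rss_gb:.2f} GiB, available={avail_gb:.2f} GiB")
```

Note this only covers the main process; each ForkServerProcess worker has its own footprint, so system-wide available memory is the more reliable early-warning signal.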

@devendrachaplot (Owner)

A way to reduce the memory footprint during training is to shrink the memory used for training the Neural SLAM module via the slam_memory_size argument, e.g. --slam_memory_size 50000.
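
To see why this helps: the SLAM module trains from a bounded buffer of past samples whose capacity is slam_memory_size, so shrinking it directly caps how many samples sit in RAM. An illustrative sketch of such a bounded buffer (assumed FIFO here; not the repo's actual implementation):

```python
import random
from collections import deque

class FIFOMemory:
    """Illustrative bounded replay buffer: `memory_size` caps peak RAM
    because the oldest samples are evicted once the deque is full."""

    def __init__(self, memory_size):
        self.buffer = deque(maxlen=memory_size)  # e.g. 50000 instead of 100000

    def push(self, sample):
        self.buffer.append(sample)  # drops the oldest sample when at capacity

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```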

Please update the thread if you found a solution.
