Run tasks on Docker failed. No module named 'megaverse.extension' #15

Open · GoingMyWay opened this issue Jul 6, 2021 · 14 comments

@GoingMyWay

GoingMyWay commented Jul 6, 2021

I used Dockerfile.base to create an image, and then inside the Docker container I ran the following command, which failed.

(sample-factory) I have no name!@738e10851140:/workspace$ python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
Traceback (most recent call last):
  File "/miniconda/envs/sample-factory/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/envs/sample-factory/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/megaverse/megaverse_rl/train.py", line 11, in <module>
    from megaverse_rl.megaverse_utils import register_env
  File "/workspace/megaverse/megaverse_rl/megaverse_utils.py", line 4, in <module>
    from megaverse.megaverse_env import MegaverseEnv, make_env_multitask
  File "/workspace/megaverse/megaverse/megaverse_env.py", line 8, in <module>
    from megaverse.extension.megaverse import MegaverseGym, set_megaverse_log_level
ModuleNotFoundError: No module named 'megaverse.extension'
@GoingMyWay GoingMyWay changed the title Run tasks on Docker failed. Run tasks on Docker failed. No module named 'megaverse.extension' Jul 6, 2021
@alex-petrenko
Owner

@GoingMyWay thank you for reporting!
@BoyuanLong can you please take a look?

@BoyuanLong
Collaborator

@GoingMyWay Could you share the directory structures (in megaverse/megaverse and megaverse/megaverse/extension specifically) and the commit id for megaverse?

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

@GoingMyWay Could you share the directory structures (in megaverse/megaverse and megaverse/megaverse/extension specifically) and the commit id for megaverse?

Hi, inside the Docker container, it is

(sample-factory) I have no name!@738e10851140:/workspace/megaverse/megaverse$ ls -l
total 16
-rw-rw-r-- 1 1004 1008    0 Jul  6 02:47 __init__.py
drwxr-xr-x 2 1004 1008 4096 Jul  6 04:07 __pycache__
-rw-rw-r-- 1 1004 1008 6282 Jul  6 02:47 megaverse_env.py
drwxrwxr-x 2 1004 1008 4096 Jul  6 02:47 tests

and

(sample-factory) I have no name!@738e10851140:/workspace/megaverse$ ls -l megaverse/extension
ls: cannot access 'megaverse/extension': No such file or directory

The commit id

(sample-factory) I have no name!@738e10851140:/workspace/megaverse$ git log
commit f5d0b3ff61d39e669be826d8a1b60331c60fc40e (HEAD -> master, origin/master, origin/HEAD)
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Thu Jul 1 01:22:08 2021 -0700

    Instructions

commit 1affd2c4caf8b436215d4f194846d60286fb2951
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:42:02 2021 -0700

    Update README.md

commit 22d84af6241092f1806eb6bf404fdcc9dd61090f
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:41:45 2021 -0700

    Update README.md

commit 9a16cf9d89088c8d0ba3352299c43a4de0278464
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:40:56 2021 -0700

    Update README.md

@BoyuanLong
Collaborator

@GoingMyWay It seems that setup.py (run via pip install -e .) didn't complete properly.

In /workspace/megaverse, could you run git submodule update --init --recursive (if you haven't initialized the submodules yet) and python setup.py develop, and share the output?
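
For reference, a minimal sketch of that sequence (assuming the checkout lives at /workspace/megaverse, as in the logs above; the final import check simply mirrors the failing import from the traceback):

cd /workspace/megaverse
git submodule update --init --recursive   # pull in the native/C++ dependencies
python setup.py develop                   # build the extension and install the package in-place
python -c "from megaverse.extension.megaverse import MegaverseGym; print('extension OK')"   # should no longer raise ModuleNotFoundError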

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

Output of python setup.py develop

https://drive.google.com/file/d/13zosZBq9Kl254Rqnw22x9DmMgHq6rp1L/view?usp=sharing

and output of

python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
(sample-factory) root@0eec227bf950:/workspace/megaverse# python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
[2021-07-07 05:53:59,795][04708] Default env families supported: ['doom_*', 'atari_*', 'dmlab_*', 'mujoco_*', 'MiniGrid*']
[2021-07-07 05:53:59,795][04708] Env registry entry created: megaverse_
[2021-07-07 05:54:00,442][04708] Saved parameter configuration for experiment test_cli not found!
[2021-07-07 05:54:00,443][04708] Starting experiment from scratch!
[2021-07-07 05:54:02,598][04708] Queried available GPUs: 0,1,2,3,4,5,6,7

[2021-07-07 05:54:02,599][04708] Using scenario towerbuilding
[2021-07-07 05:54:02,613][04708] Using a total of 302 trajectory buffers
[2021-07-07 05:54:02,614][04708] Allocating shared memory for trajectories
Bus error (core dumped)

GPU Info

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                  Off |
| N/A   35C    P0    34W / 250W |   1445MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   30C    P0    22W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

@GoingMyWay It seems that setup.py (run via pip install -e .) didn't complete properly.

In /workspace/megaverse, could you run git submodule update --init --recursive (if you haven't initialized the submodules yet) and python setup.py develop, and share the output?

Hi @BoyuanLong, I updated the output. Thanks in advance.

@BoyuanLong
Collaborator

@GoingMyWay Cool. Thanks!

This problem is caused by not giving the Docker container enough shared memory to run the task. Could you add --shm-size 8G (or higher) to your docker run command? We've just updated the README to make this clearer.
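
For example, an invocation along these lines (the image name is a placeholder and --gpus all assumes the NVIDIA container toolkit is installed; --shm-size is the essential addition here):

docker run -it --gpus all --shm-size=8g <your-megaverse-image> bash   # 8G or higher, as suggested above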

@GoingMyWay
Author

@GoingMyWay Cool. Thanks!

This problem is caused by not giving the Docker container enough shared memory to run the task. Could you add --shm-size 8G (or higher) to your docker run command? We've just updated the README to make this clearer.

Thanks. I added it and it is running now. BTW, this repo is very cool; will you release more baselines for RL and MARL?

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

I ran the command

python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli

and the GPU and CPU usage is not high.

GPUs

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                  Off |
| N/A   34C    P0    34W / 250W |   4760MiB / 16160MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   29C    P0    22W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
|  GPU   PID     USER    GPU MEM  %CPU  %MEM      TIME  COMMAND                                                                                               |
|    0 15322     root    1105MiB   3.3   1.1     05:42  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |
|    0 15357     root    1105MiB   3.5   1.1     05:35  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |
|    0 15358     root    1105MiB   3.5   1.1     05:35  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |

CPUs and RAM

[screenshot: CPU and RAM usage]

@BoyuanLong
Collaborator

@GoingMyWay
Re: Resource usage
It could be normal, but some sanity checks are:

  1. Give the container full access to the resources
  2. Let it run for a while and compare the results and stats with the paper
  3. Try some other environments

Re: Baselines
Glad to hear that! In the short term, we are not planning to add new baselines besides the ones in the paper, but hopefully in the future! Meanwhile, contributions are welcome! Feel free to submit any PR that you think is cool or necessary.
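
Regarding point 1 above, a few quick checks from inside the container can confirm what it actually sees (standard tools, nothing Megaverse-specific):

df -h /dev/shm   # shared memory available to the container (should reflect --shm-size)
nproc            # number of CPU cores visible to the container
nvidia-smi       # GPUs visible to the container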

@GoingMyWay
Author

@GoingMyWay
Re: Resource usage
It could be normal, but some sanity checks are:

  1. Give the container full access to the resources
  2. Let it run for a while and compare the results and stats with the paper
  3. Try some other environments

Re: Baselines
Glad to hear that! In the short term, we are not planning to add new baselines besides the ones in the paper, but hopefully in the future! Meanwhile, contributions are welcome! Feel free to submit any PR that you think is cool or necessary.

Thanks. I hope the docs will be updated soon. The scenarios in this repo have great research potential and could be very beneficial to the RL and MARL community if they are easy for users to use.

@GoingMyWay
Author

Hi, when running the tasks, there is an exception:

[2021-07-07 10:40:36,827][08441] Visible devices: 1
[2021-07-07 10:40:36,828][08441] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/workspace/sample-factory/sample_factory/algorithms/appo/actor_worker.py", line 890, in _run
    self._handle_reset()
  File "/workspace/sample-factory/sample_factory/algorithms/appo/actor_worker.py", line 783, in _handle_reset
    for split_idx, env_runner in enumerate(self.env_runners):
TypeError: 'NoneType' object is not iterable
[2021-07-07 10:40:36,840][08439] Visible devices: 1

Is this a serious issue?

@erikwijmans
Collaborator

Likely it is. IIRC that happens when something crashes during environment setup; there is likely another error from the same worker higher up in the log.
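
If the output was saved to a file, one way to dig for that earlier error (the log file name here is hypothetical; the bracketed PID, e.g. 08441, matches the worker tag in the lines quoted above):

grep -n "\[08441\]" train.log                      # everything that particular rollout worker printed
grep -niE "error|exception|traceback" train.log    # any earlier failures across all workers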

@alex-petrenko
Owner

@erikwijmans is right. The loop is trying to iterate over self.env_runners, which is None. It is, of course, not supposed to be None. Very likely something happened earlier in the log, i.e. a crash during environment construction.

One reason this could happen is, for example, a crash in one of the Megaverse constructors.
If you can find the original error message, please post it; it'd be very helpful!
Sometimes the initialization can crash when you're trying to create too many Vulkan contexts. Try decreasing the total number of Megaverse processes you're creating; you can compensate by increasing megaverse_num_envs_per_instance to keep the total number of simulated environments high.
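
For example, relative to the original command, something along these lines (illustrative values only; --megaverse_num_envs_per_instance is the parameter named above, assuming it follows the same CLI flag convention as the other megaverse_* options, with the remaining flags kept as in the original command):

python -m megaverse_rl.train --env=megaverse_TowerBuilding --experiment=test_cli --num_workers=4 --num_envs_per_worker=2 --megaverse_num_envs_per_instance=8 ...   # fewer Megaverse processes, more envs per instance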
