Run tasks on Docker failed. No module named 'megaverse.extension' #15

Open · GoingMyWay opened this issue Jul 6, 2021 · 14 comments

@GoingMyWay

GoingMyWay commented Jul 6, 2021

I used Dockerfile.base to create an image, and then inside the Docker container I ran the following command, which failed.

(sample-factory) I have no name!@738e10851140:/workspace$ python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
Traceback (most recent call last):
  File "/miniconda/envs/sample-factory/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/miniconda/envs/sample-factory/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/workspace/megaverse/megaverse_rl/train.py", line 11, in <module>
    from megaverse_rl.megaverse_utils import register_env
  File "/workspace/megaverse/megaverse_rl/megaverse_utils.py", line 4, in <module>
    from megaverse.megaverse_env import MegaverseEnv, make_env_multitask
  File "/workspace/megaverse/megaverse/megaverse_env.py", line 8, in <module>
    from megaverse.extension.megaverse import MegaverseGym, set_megaverse_log_level
ModuleNotFoundError: No module named 'megaverse.extension'
@GoingMyWay GoingMyWay changed the title Run tasks on Docker failed. Run tasks on Docker failed. No module named 'megaverse.extension' Jul 6, 2021
@alex-petrenko
Owner

@GoingMyWay thank you for reporting!
@BoyuanLong can you please take a look?

@BoyuanLong
Collaborator

@GoingMyWay Could you share the directory structures (in megaverse/megaverse and megaverse/megaverse/extension specifically) and the commit id for megaverse?

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

@GoingMyWay Could you share the directory structures (in megaverse/megaverse and megaverse/megaverse/extension specifically) and the commit id for megaverse?

Hi, inside the Docker container, it is

(sample-factory) I have no name!@738e10851140:/workspace/megaverse/megaverse$ ls -l
total 16
-rw-rw-r-- 1 1004 1008    0 Jul  6 02:47 __init__.py
drwxr-xr-x 2 1004 1008 4096 Jul  6 04:07 __pycache__
-rw-rw-r-- 1 1004 1008 6282 Jul  6 02:47 megaverse_env.py
drwxrwxr-x 2 1004 1008 4096 Jul  6 02:47 tests

and

(sample-factory) I have no name!@738e10851140:/workspace/megaverse$ ls -l megaverse/extension
ls: cannot access 'megaverse/extension': No such file or directory

The commit id

(sample-factory) I have no name!@738e10851140:/workspace/megaverse$ git log
commit f5d0b3ff61d39e669be826d8a1b60331c60fc40e (HEAD -> master, origin/master, origin/HEAD)
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Thu Jul 1 01:22:08 2021 -0700

    Instructions

commit 1affd2c4caf8b436215d4f194846d60286fb2951
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:42:02 2021 -0700

    Update README.md

commit 22d84af6241092f1806eb6bf404fdcc9dd61090f
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:41:45 2021 -0700

    Update README.md

commit 9a16cf9d89088c8d0ba3352299c43a4de0278464
Author: Aleksei Petrenko <petrenko@usc.edu>
Date:   Wed Jun 30 17:40:56 2021 -0700

    Update README.md

@BoyuanLong
Collaborator

@GoingMyWay It seems that setup.py (run via pip install -e .) didn't complete properly.

In /workspace/megaverse, could you run git submodule update --init --recursive (if you haven't initialized the submodules yet) and python setup.py develop, and share the output?
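
For reference, a minimal sketch of that sequence (assuming the checkout lives at /workspace/megaverse, as in the logs above; the final import check simply mirrors the failing import from the traceback):

cd /workspace/megaverse
git submodule update --init --recursive   # pull in the native/C++ dependencies
python setup.py develop                   # build the extension and install the package in-place
python -c "from megaverse.extension.megaverse import MegaverseGym; print('extension OK')"   # should no longer raise ModuleNotFoundError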

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

Output of python setup.py develop

https://drive.google.com/file/d/13zosZBq9Kl254Rqnw22x9DmMgHq6rp1L/view?usp=sharing

and output of

python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
(sample-factory) root@0eec227bf950:/workspace/megaverse# python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli
[2021-07-07 05:53:59,795][04708] Default env families supported: ['doom_*', 'atari_*', 'dmlab_*', 'mujoco_*', 'MiniGrid*']
[2021-07-07 05:53:59,795][04708] Env registry entry created: megaverse_
[2021-07-07 05:54:00,442][04708] Saved parameter configuration for experiment test_cli not found!
[2021-07-07 05:54:00,443][04708] Starting experiment from scratch!
[2021-07-07 05:54:02,598][04708] Queried available GPUs: 0,1,2,3,4,5,6,7

[2021-07-07 05:54:02,599][04708] Using scenario towerbuilding
[2021-07-07 05:54:02,613][04708] Using a total of 302 trajectory buffers
[2021-07-07 05:54:02,614][04708] Allocating shared memory for trajectories
Bus error (core dumped)

GPU Info

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                  Off |
| N/A   35C    P0    34W / 250W |   1445MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   30C    P0    22W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

@GoingMyWay It seems that setup.py (run via pip install -e .) didn't complete properly.

In /workspace/megaverse, could you run git submodule update --init --recursive (if you haven't initialized the submodules yet) and python setup.py develop, and share the output?

Hi @BoyuanLong, I updated the output. Thanks in advance.

@BoyuanLong
Collaborator

@GoingMyWay Cool. Thanks!

This problem is caused by not giving the Docker container enough shared memory to run the task. Could you add --shm-size 8G (or higher) to your docker run command? We've just updated the README to make this clearer.
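
For example, an invocation along these lines (the image name is a placeholder and --gpus all assumes the NVIDIA container toolkit is installed; --shm-size is the essential addition here):

docker run -it --gpus all --shm-size=8g <your-megaverse-image> bash   # 8G or higher, as suggested above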

@GoingMyWay
Author

@GoingMyWay Cool. Thanks!

This problem is caused by not giving the Docker container enough shared memory to run the task. Could you add --shm-size 8G (or higher) to your docker run command? We've just updated the README to make this clearer.

Thanks. I added it and it is running now. BTW, this repo is very cool; will you release more baselines for RL and MARL?

@GoingMyWay
Author

GoingMyWay commented Jul 7, 2021

I ran the command

python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=APPO --gamma=0.997 --use_rnn=True --rnn_num_layers=2 --num_workers=12 --num_envs_per_worker=2 --ppo_epochs=1 --rollout=32 --recurrence=32 --batch_size=2048 --actor_worker_gpus 0 --num_policies=1 --with_pbt=False --max_grad_norm=0.0 --exploration_loss=symmetric_kl --exploration_loss_coeff=0.001 --megaverse_num_simulation_threads=1 --megaverse_use_vulkan=False --policy_workers_per_policy=2 --learner_main_loop_num_cores=1 --reward_clip=30 --env=megaverse_TowerBuilding --experiment=test_cli

and the GPU and CPU usage is not high.

GPUs

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:04:00.0 Off |                  Off |
| N/A   34C    P0    34W / 250W |   4760MiB / 16160MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-PCIE...  On   | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0    25W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-PCIE...  On   | 00000000:08:00.0 Off |                  Off |
| N/A   33C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-PCIE...  On   | 00000000:09:00.0 Off |                  Off |
| N/A   34C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-PCIE...  On   | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-PCIE...  On   | 00000000:86:00.0 Off |                  Off |
| N/A   36C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-PCIE...  On   | 00000000:89:00.0 Off |                  Off |
| N/A   29C    P0    22W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-PCIE...  On   | 00000000:8A:00.0 Off |                  Off |
| N/A   32C    P0    24W / 250W |      0MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
|  GPU   PID     USER    GPU MEM  %CPU  %MEM      TIME  COMMAND                                                                                               |
|    0 15322     root    1105MiB   3.3   1.1     05:42  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |
|    0 15357     root    1105MiB   3.5   1.1     05:35  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |
|    0 15358     root    1105MiB   3.5   1.1     05:35  python -m megaverse_rl.train --train_for_seconds=360000000 --train_for_env_steps=2000000000 --algo=A  |

CPUs and RAM

[screenshot: CPU and RAM usage]

@BoyuanLong
Collaborator

@GoingMyWay
Re: Resource usage
It could be normal, but some sanity checks are:

  1. Give the container full access to the resources
  2. Let it run for a while and compare the results and stats with the paper
  3. Try some other environments

Re: Baselines
Glad to hear that! In the short term, we are not planning to add new baselines besides the ones in the paper, but hopefully in the future! Meanwhile, contributions are welcome! Feel free to submit any PR that you think is cool or necessary.
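
Regarding point 1 above, a few quick checks from inside the container can confirm what it actually sees (standard tools, nothing Megaverse-specific):

df -h /dev/shm   # shared memory available to the container (should reflect --shm-size)
nproc            # number of CPU cores visible to the container
nvidia-smi       # GPUs visible to the container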

@GoingMyWay
Author

@GoingMyWay
Re: Resource usage
It could be normal, but some sanity checks are:

  1. Give the container full access to the resources
  2. Let it run for a while and compare the results and stats with the paper
  3. Try some other environments

Re: Baselines
Glad to hear that! In the short term, we are not planning to add new baselines besides the ones in the paper, but hopefully in the future! Meanwhile, contributions are welcome! Feel free to submit any PR that you think is cool or necessary.

Thanks. I hope the docs will be updated soon. The scenarios in this repo have great research potential and could be very beneficial to the RL and MARL community if they are easy for users to use.

@GoingMyWay
Author

Hi, when running the tasks, there is an exception:

[2021-07-07 10:40:36,827][08441] Visible devices: 1
[2021-07-07 10:40:36,828][08441] Unknown exception in rollout worker
Traceback (most recent call last):
  File "/workspace/sample-factory/sample_factory/algorithms/appo/actor_worker.py", line 890, in _run
    self._handle_reset()
  File "/workspace/sample-factory/sample_factory/algorithms/appo/actor_worker.py", line 783, in _handle_reset
    for split_idx, env_runner in enumerate(self.env_runners):
TypeError: 'NoneType' object is not iterable
[2021-07-07 10:40:36,840][08439] Visible devices: 1

Is this a serious issue?

@erikwijmans
Collaborator

Likely it is. IIRC that happens when something crashes during environment setup; there is likely another error from the same worker higher up in the log.
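
If the output was saved to a file, one way to dig for that earlier error (the log file name here is hypothetical; the bracketed PID, e.g. 08441, matches the worker tag in the lines quoted above):

grep -n "\[08441\]" train.log                      # everything that particular rollout worker printed
grep -niE "error|exception|traceback" train.log    # any earlier failures across all workers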

@alex-petrenko
Owner

@erikwijmans is right. The loop is trying to iterate over self.env_runners, which is None. It is, of course, not supposed to be None. Very likely something happened earlier in the log, i.e. a crash during environment construction.

One reason this could happen is, for example, a crash in one of the Megaverse constructors.
If you can find the original error message, please post it; it'd be very helpful!
Sometimes the initialization can crash when you're trying to create too many Vulkan contexts. Try decreasing the total number of Megaverse processes you're creating; you can compensate by increasing megaverse_num_envs_per_instance to keep the total number of simulated environments high.
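
For example, relative to the original command, something along these lines (illustrative values only; --megaverse_num_envs_per_instance is the parameter named above, assuming it follows the same CLI flag convention as the other megaverse_* options, with the remaining flags kept as in the original command):

python -m megaverse_rl.train --env=megaverse_TowerBuilding --experiment=test_cli --num_workers=4 --num_envs_per_worker=2 --megaverse_num_envs_per_instance=8 ...   # fewer Megaverse processes, more envs per instance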
