Test failed #48

Open
herveyrobot opened this issue Jul 21, 2023 · 7 comments

@herveyrobot

herveyrobot commented Jul 21, 2023

I can't run the test code. From the ai2thor-rearrangement directory I ran:

allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py

Thanks for any answers.

[07/20 19:14:38 INFO:] Running with args Namespace(experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', eval=False, config_kwargs=None, extra_tag='', output_dir='rearrange_out', save_dir_fmt=<SaveDirFormat.FLAT: 'FLAT'>, seed=None, experiment_base='.', checkpoint=None, infer_output_dir=False, approx_ckpt_step_interval=None, restart_pipeline=False, deterministic_cudnn=False, max_sampler_processes_per_worker=None, deterministic_agents=False, log_level='info', disable_tensorboard=False, disable_config_saving=False, collect_valid_results=False, valid_on_initial_weights=False, test_expert=False, distributed_ip_and_port='127.0.0.1:0', machine_id=0, callbacks='', enable_crash_recovery=False, test_date=None, approx_ckpt_steps_count=None, skip_checkpoints=0) [main.py: 452]
fatal: not a git repository (or any of the parent directories): .git
[07/20 19:14:39 WARNING:] Failed to get a git diff of the current project. Is it possible that /root/ai2thor-rearrangement is not under version control? [runner.py: 892]
[07/20 19:14:39 INFO:] Config files saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2023-07-20_19-14-39 [runner.py: 935]
[07/20 19:14:39 INFO:] Using 1 train workers on devices (device(type='cpu'),) [runner.py: 317]
[07/20 19:14:39 INFO:] Using local worker ids [0] (total 1 workers in machine 0) [runner.py: 326]
[07/20 19:14:39 INFO:] Started 1 train processes [runner.py: 595]
[07/20 19:14:39 INFO:] No processes allocated to validation, no validation will be run. [runner.py: 626]
[07/20 19:14:41 INFO:] train 0 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7f8809b19670>, 'callback_sensors': [], 'results_queue': <multiprocessing.queues.Queue object at 0x7f8809b196d0>, 'checkpoints_queue': None, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2023-07-20_19-14-39', 'seed': 1118467761, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.SpawnContext object at 0x7f86d3c559a0>, 'num_workers': 1, 'device': device(type='cpu'), 'distributed_ip': '127.0.0.1', 'distributed_port': 0, 'max_sampler_processes_per_worker': None, 'save_ckpt_after_every_pipeline_stage': True, 'initial_model_state_dict': '[SUPPRESSED]', 'first_local_worker_id': 0, 'distributed_preemption_threshold': 0.7, 'try_restart_after_task_error': False, 'mode': 'train', 'worker_id': 0} [runner.py: 416]
[07/20 19:14:41 ERROR:] [train worker 0] Encountered TypeError, exiting. [engine.py: 1858]
[07/20 19:14:41 ERROR:] Traceback (most recent call last):
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 292, in stagewise_task_sampler_args
x_displays = get_open_x_displays(throw_error_if_empty=True)
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact_plugins/ithor_plugin/ithor_util.py", line 88, in get_open_x_displays
raise IOError(
OSError: Could not find any open X-displays on which to run AI2-THOR processes. Please see the AI2-THOR installation instructions at https://allenact.org/installation/installation-framework/#installation-of-ithor-ithor-plugin for information as to how to start such displays.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1850, in train
self.run_pipeline(valid_on_initial_weights=valid_on_initial_weights)
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 1506, in run_pipeline
self.initialize_storage_and_viz(
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 483, in initialize_storage_and_viz
observations = self.vector_tasks.get_observations()
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 327, in vector_tasks
sampler_fn_args=self.get_sampler_fn_args(seeds),
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 382, in get_sampler_fn_args
return [
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/engine.py", line 383, in
fn(
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 374, in train_task_sampler_args
**cls.stagewise_task_sampler_args(
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 308, in stagewise_task_sampler_args
[d != torch.device("cpu") and d >= 0 for d in devices]
TypeError: 'NoneType' object is not iterable
[engine.py: 1861]
[07/20 19:14:41 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
[07/20 19:14:41 ERROR:] Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1468]
Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[07/20 19:14:41 INFO:] Terminating train 0 [runner.py: 1543]
[07/20 19:14:41 INFO:] Joining train 0 [runner.py: 1543]
[07/20 19:14:41 INFO:] Closed train 0 [runner.py: 1543]

@herveyrobot
Author

[07/20 19:56:19 ERROR:] [train worker 0] Encountered TypeError, exiting. [engine.py: 1858]
[07/20 19:56:19 ERROR:] Traceback (most recent call last):
File "/root/ai2thor-rearrangement/baseline_configs/rearrange_base.py", line 292, in stagewise_task_sampler_args
x_displays = get_open_x_displays(throw_error_if_empty=True)
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact_plugins/ithor_plugin/ithor_util.py", line 88, in get_open_x_displays
raise IOError(
OSError: Could not find any open X-displays on which to run AI2-THOR processes. Please see the AI2-THOR installation instructions at https://allenact.org/installation/installation-framework/#installation-of-ithor-ithor-plugin for information as to how to start such displays.

[07/20 19:56:19 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
[07/20 19:56:19 ERROR:] Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1468]
Traceback (most recent call last):
File "/opt/miniconda3/envs/rearrange/lib/python3.9/site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated

@nbqu

nbqu commented Jul 21, 2023

Unfortunately I have the same problem as the author. I'm also trying to run this in a Docker environment.
I got a similar error message. Here is what I tried:

  1. Run scripts/startx.py from allenact
    I downloaded the script and ran it to start an X server. The xorg package was not installed in the Docker image, so I installed it with apt-get install xorg, but that didn't help.

X.Org X Server 1.20.13
X Protocol Version 11, Revision 0
Build Operating System: linux Ubuntu
Current Operating System: Linux 2ff19c400e05 5.15.0-69-generic #76-Ubuntu SMP Fri Mar 17 17:19:29 UTC 2023 x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-69-generic root=UUID=e60fe97a-2dcc-41fd-8ab7-c8b75a10503d ro splash quiet biosdevname=0 net.ifnames=0 nouveau.noaccel=1 rdblacklist=nouveau rd.driver.blacklist=nouveau nouveau.modeset=0 vt.handoff=7
Build Date: 29 March 2023 12:53:02PM
xorg-server 2:1.20.13-1ubuntu1~20.04.8 (For technical support please see http://www.ubuntu.com/support)
Current version of pixman: 0.38.4
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Fri Jul 21 05:19:13 2023
(++) Using config file: "/tmp/tmpm2oq_3l6"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE)
Fatal server error:
(EE) parse_vt_settings: Cannot open /dev/tty0 (No such file or directory)
(EE)
(EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.

Based on the error, I re-ran the container with --device /dev/tty0 added, but the same error occurs.

  2. Run ai2thor-xorg from ai2thor
    This script seems to behave similarly to the allenact one, and it didn't work for me either. I got the same error message.

I'm running this container on a headless remote GPU cluster. Could you suggest something I can try?

@herveyrobot
Author

Did you solve it?

@Lucaweihs
Contributor

Hi @herveyrobot and @nbqu,

The example.py code was written with the assumption that it would be run on a Mac, sorry about that! As the error message notes:

[d != torch.device("cpu") and d >= 0 for d in devices]
TypeError: 'NoneType' object is not iterable
[engine.py: 1861]

the devices parameter passed to the TwoPhaseRGBBaseExperimentConfig.stagewise_task_sampler_args function is None by default, which is fine when you're not using Linux but causes issues otherwise. Can you try changing:

task_sampler_params = TwoPhaseRGBBaseExperimentConfig.stagewise_task_sampler_args(
    stage="train", process_ind=0, total_processes=1,
)

to

task_sampler_params = TwoPhaseRGBBaseExperimentConfig.stagewise_task_sampler_args(
    stage="train", process_ind=0, total_processes=1, devices=[0]
)

and trying again?
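
For readers following along, here is a minimal sketch of why a None value for devices produces this TypeError and why passing an explicit list avoids it. The function names (gpu_flags_unguarded, gpu_flags_guarded) are hypothetical stand-ins for the list comprehension shown in the traceback, not code from rearrange_base.py, and plain integers replace torch devices so the example runs without PyTorch.

```python
from typing import List, Optional, Sequence


def gpu_flags_unguarded(devices: Optional[Sequence[int]]) -> List[bool]:
    # Same shape as the failing comprehension: iterating over None raises TypeError.
    return [d >= 0 for d in devices]


def gpu_flags_guarded(devices: Optional[Sequence[int]]) -> List[bool]:
    # Assumption: treating a missing devices argument as "no devices" is acceptable.
    if devices is None:
        devices = []
    return [d >= 0 for d in devices]


try:
    gpu_flags_unguarded(None)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

print(error_message)            # the "not iterable" message from the traceback
print(gpu_flags_guarded(None))  # no devices, so no flags
print(gpu_flags_guarded([0]))   # one non-negative device id
```

Passing devices=[0] as suggested above corresponds to the last call: the comprehension receives a real list and never touches None.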

@herveyrobot
Author

I have the same error on Ubuntu 18.04.

@Wallong

Wallong commented Aug 12, 2023

Hi, I had the same problem. The error shows that the Docker container can't find an X display, so I added -v /tmp/.X11-unix:/tmp/.X11-unix -e DISPLAY=:0 to my docker run command, and that works for me.
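
A quick way to check whether a container can see any X display sockets is sketched below. It assumes the usual convention that X servers expose Unix sockets named X0, X1, and so on under /tmp/.X11-unix. The helper list_x_display_numbers is hypothetical and is not how allenact's get_open_x_displays is implemented.

```python
import os
import re
from typing import List


def list_x_display_numbers(socket_dir: str = "/tmp/.X11-unix") -> List[int]:
    """Return the display numbers of entries named X<number> in socket_dir."""
    if not os.path.isdir(socket_dir):
        return []
    numbers = []
    for name in os.listdir(socket_dir):
        match = re.fullmatch(r"X(\d+)", name)
        if match:
            numbers.append(int(match.group(1)))
    return sorted(numbers)


if __name__ == "__main__":
    # DISPLAY is what X clients read to decide which display to connect to.
    print("DISPLAY =", os.environ.get("DISPLAY"))
    print("X display sockets found:", list_x_display_numbers())
```

If the list is empty inside the container but not on the host, the socket directory is probably not mounted, which is what the -v /tmp/.X11-unix:/tmp/.X11-unix option addresses.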

@Lucaweihs
Contributor

Hi @Wallong,

Thanks for pointing out this fix! Can I ask what your entire docker run command looks like?
