xdpyinfo: unable to open display ":0.1". #7

Closed · nnsriram97 opened this issue Apr 2, 2021 · 11 comments

@nnsriram97
Hi,

I am facing an issue while trying to run the baseline models with allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py.

xdpyinfo:  unable to open display ":0.1".
Process ForkServerProcess-2:1:
Traceback (most recent call last):
.
.
.
AssertionError: Invalid DISPLAY :0.1 - cannot find X server with xdpyinfo
04/01 17:30:22 ERROR: Encountered Exception. Terminating train worker 1	[engine.py: 1319]

Any suggestions on how to solve this? I can run python example.py successfully, though.

@Lucaweihs (Contributor) commented Apr 2, 2021

Hi @nnsriram97,

Can you give some more information about your machine specs (e.g. # GPUs, operating system)?

The way the current code is set up to run an experiment is as follows (a rough sketch illustrating this follows the list):

  1. Count the number of GPUs on your machine by running torch.cuda.device_count()
  2. For each of the above GPUs, assume there is an x-display running on :0.0, :0.1, ..., :0.NUM_GPUS_ON_YOUR_MACHINE_MINUS_ONE (these x-displays are required for running AI2-THOR on Linux machines).
  3. Set up THOR processes on each of the above x-displays.
  4. Train the agent using the above GPUs for model inference/backprop and THOR simulation.
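
For concreteness, here is a rough sketch of what steps 1–2 amount to (illustrative only, not the repository's actual code):

# Sketch of the assumption above: one x-display ":0.<gpu index>" per CUDA device.
import torch

num_gpus = torch.cuda.device_count()
expected_displays = [":0.{}".format(i) for i in range(num_gpus)]
print(expected_displays)  # e.g. [':0.0', ':0.1'] on a 2-GPU machine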

From your error message it looks like it can't find the x-display for your 1st GPU. We have a script in AllenAct that will automatically start x-displays on each of your GPUs; see our installation instructions and the script itself (you might have to close any display that's already open on :0.0, or edit the script so it doesn't start a new display there). Alternatively, if you already have a display running on :0.0 and don't want to start new ones, you could have all the THOR processes run on a single GPU (you might run out of GPU memory in this case). To do this, simply modify the lines here to be x_display = ":0.0".

If you want to temporarily use a smaller number of training processes (e.g. 1, for debugging and checking that things work), you can simply change the line here to nprocesses = 1.
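
To make the two workarounds concrete, here is a hedged sketch of what the edited lines might look like (the real lines live in the experiment configuration, e.g. rearrange_base.py, and may differ slightly):

# (a) Pin every THOR process to a single x-display (as suggested above; depending
#     on the config, the string may be written with or without the leading colon):
x_display = ":0.0"

# (b) Temporarily run a single training process while debugging:
nprocesses = 1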

@nnsriram97 (Author)

Hi @Lucaweihs,

I have two GTX 1080 Tis (CUDA 8.0, NVIDIA driver 460) on Ubuntu 16.04 with a display attached. I can successfully run the training code with x_display = "0.0" and nprocesses = 1, but I cannot run it with the default settings.

With x_display = "0.0" and nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1, the run starts but then stops suddenly (as you mentioned, it is probably a memory issue). With nprocesses = 15, or any number below 20, I can see the simulator output without hanging. Output of nvidia-smi while it's running:

|    0   N/A  N/A      5120      C   Train-0                           889MiB |
|    0   N/A  N/A      5532      G   ...3c3596803c491c3da8d43eb2c       70MiB |
|    0   N/A  N/A      5533      G   ...3c3596803c491c3da8d43eb2c       72MiB |
.
.
.

|    0   N/A  N/A      6000      G   ...3c3596803c491c3da8d43eb2c       32MiB |
|    1   N/A  N/A      5121      C   Train-1                           889MiB |
|    1   N/A  N/A      5122      C   Valid-0                           889MiB |
+-----------------------------------------------------------------------------+

Using x_display = "0.{}".format(devices[process_ind % len(devices)]) does not work and rasises the issue as mentioned before. I also tried launching it on a headless server through slurm but got a similar issue xdpyinfo: unable to open display ":0.0".

Does not being able to launch an x-display on 0.1 mean only one GPU is being used? I ask because I see Train-1 running on GPU 1. If that's the case, can you suggest ways to run the code utilizing both GPUs with maximum compute? Also, is it possible to run jobs on a headless server without sudo access?

@Lucaweihs (Contributor)

It's strange that the xdpyinfo problem persists if an X-display is set up on :0.1 on GPU 1. Just to double check, can you run DISPLAY=:0.1 glxgears? If everything is running appropriately, you should (after ~5 seconds) see something like this:

$ DISPLAY=:0.1 glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
111939 frames in 5.0 seconds = 22387.727 FPS
114639 frames in 5.0 seconds = 22927.768 FPS

and nvidia-smi should show some small memory usage on the GPU (~4MB).

Does not being able to launch x_display on 0.1 mean only one GPU is being used?

Thankfully no, you're still using both GPUs for inference/backprop, but all of the THOR instances will be using a single GPU, which can be a bit slower / use up valuable GPU memory.

Also is it possible to run jobs on a headless server without sudo access?

This has been a problem that the AI2-THOR team has been trying to resolve for a while. The short answer: not yet, you'll need your system administrator to set up the x-displays if you don't have sudo access yourself.

@nnsriram97 (Author)

Running DISPLAY=:0.1 glxgears throws Error: couldn't open display :0.1, while DISPLAY=:0.0 glxgears runs without any issue.

@Lucaweihs (Contributor) commented Apr 5, 2021

Given that glxgears doesn't run, this suggests to me that you likely don't have an X-server running on :0.1. Do you have sudo access to start such a server? Recall that AllenAct has instructions and a script for starting x-displays on all GPUs:

We have a script in AllenAct that will automatically start x-displays on each of your GPUs, see our installation instructions and the script itself (you might have to close any display that's already open on :0.0 or edit the script to not start a new display there).
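
If it helps with debugging, here is a small, hedged sketch (not part of AllenAct or AI2-THOR) that checks which per-GPU displays actually respond to xdpyinfo, mirroring the check behind the "cannot find X server with xdpyinfo" error:

# Probe each expected x-display with xdpyinfo; a non-zero exit (or a missing
# xdpyinfo binary) means the display is not reachable.
import subprocess
import torch

def display_is_reachable(display):
    try:
        subprocess.run(
            ["xdpyinfo", "-display", display],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=True,
        )
        return True
    except (subprocess.CalledProcessError, FileNotFoundError):
        return False

for i in range(torch.cuda.device_count()):
    display = ":0.{}".format(i)
    print(display, "OK" if display_is_reachable(display) else "NOT reachable")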

@nnsriram97 (Author)

Thanks for pointing me to the script. I have sudo access, but I had a monitor attached to my PC with Xorg running for the display. Stopping the display manager with sudo service lightdm stop and then running startx.py worked. Here's a log of nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9917      G   /usr/lib/xorg/Xorg                 41MiB |
|    0   N/A  N/A     10005      C   Train-0                          1299MiB |
|    0   N/A  N/A     11210      G   ...3c3596803c491c3da8d43eb2c       67MiB |
.
.
.
|    0   N/A  N/A     13459      G   ...3c3596803c491c3da8d43eb2c       67MiB |
|    1   N/A  N/A      9917      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A     10006      C   Train-1                           889MiB |
|    1   N/A  N/A     10007      C   Valid-0                           889MiB |
+-----------------------------------------------------------------------------+

But I see all the THOR instances running on GPU 0 while Train-1 and Valid-0 run on GPU 1. Is that normal?

Also, can you suggest how to debug/view the simulator output for some particular instance of training?

@Lucaweihs (Contributor)

Interesting, after doing the above you should see AI2-THOR processes on both GPUs. Can you confirm that:

  • DISPLAY=:0.0 glxgears and DISPLAY=:0.1 glxgears both work for you (you should also see the glxgears process using gpu memory when using nvidia-smi on the appropriate gpu)?
  • you've changed x_display = "0.0" back to x_display = "0.{}".format(devices[process_ind % len(devices)]) within the configuration file?

Also:

  • Can you tell me which version of ai2thor you have installed? I.e. the output of pip list | grep ai2thor. To be safe it might be good to update to the latest version: pip install --upgrade ai2thor.
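
As an alternative to the pip command, the installed version can also be checked from within Python using only the standard library (Python 3.8+):

# Print the installed ai2thor version (standard library, Python 3.8+).
from importlib.metadata import version

print(version("ai2thor"))  # e.g. 2.7.4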

When running things on headless servers I often like to double check that the ai2thor processes are simulating by using a VNC. To do this yourself you can run the following (you'll need to sudo apt install x11vnc wm2):

export DISPLAY=:0.0
nohup x11vnc -noxdamage -display :0.0 -nopw -once -xrandr -noxrecord -forever -grabalways --httpport 5900&
wm2&

which will start the VNC server on the remote machine. You can then connect to it locally by installing a VNC viewer (e.g. https://www.realvnc.com/en/connect/download/viewer/) and setting up a new connection (cmd+N on Mac) with properties that look something like the screenshot below (note that 5900 is the http port specified in the above block):
[Screenshot: VNC Viewer connection properties (Screen Shot 2021-04-06 at 1.35.30 PM)]

Lucaweihs self-assigned this Apr 6, 2021
@nnsriram97 (Author)

  • DISPLAY=:0.0 glxgears and DISPLAY=:0.1 glxgears both work for you (you should also see the glxgears process using gpu memory when using nvidia-smi on the appropriate gpu)

I can successfully run the above commands, and I see some GPU memory being used by glxgears.

  • you've changed x_display = "0.0" back to x_display = "0.{}".format(devices[process_ind % len(devices)]) within the configuration file?

Yes, I had changed it back to the default in rearrange_base.py.

  • Can you tell me which version of ai2thor you have installed? I.e. the output of pip list | grep ai2thor. To be safe it might be good to update to the latest version: pip install --upgrade ai2thor.

I had ai2thor 2.7.2 installed but have upgraded to 2.7.4. Thanks!

Please refer to issue #3 for status related to running the code.

@Lucaweihs (Contributor)

Closing this as I believe things are training at a reasonable FPS for you now; let me know if not!

@nnsriram97 (Author) commented Jun 2, 2021

[06/01 17:46:37 INFO:] Running with args Namespace(approx_ckpt_step_interval=None, approx_ckpt_steps_count=None, checkpoint=None, config_kwargs=None, deterministic_agents=False, deterministic_cudnn=False, disable_config_saving=False, disable_tensorboard=False, eval=False, experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', experiment_base='.', extra_tag='', log_level='info', max_sampler_processes_per_worker=None, output_dir='rearrange_out', restart_pipeline=False, seed=None, skip_checkpoints=0, test_date=None)	[main.py: 352]
[06/01 17:46:38 INFO:] Git diff saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37[runner.py: 544]
[06/01 17:46:38 INFO:] Config files saved to rearrange_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37	[runner.py: 592]
[06/01 17:46:38 INFO:] Using 2 train workers on devices (device(type='cuda', index=0), device(type='cuda', index=1))	[runner.py: 205]
[06/01 17:46:38 INFO:] Started 2 train processes	[runner.py: 364]
[06/01 17:46:38 INFO:] Using 1 valid workers on devices (device(type='cuda', index=1),)	[runner.py: 205]
[06/01 17:46:38 INFO:] Started 1 valid processes	[runner.py: 390]
[06/01 17:46:39 INFO:] train 1 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88750>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88d90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c7d0>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cc50>, 'num_workers': 2, 'device': device(type='cuda', index=1), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 1}	[runner.py: 258]
[06/01 17:46:39 INFO:] valid 0 args {'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7b88690>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7b88cd0>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee118c750>, 'seed': 12345, 'deterministic_cudnn': False, 'deterministic_agents': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee118cb90>, 'device': device(type='cuda', index=1), 'max_sampler_processes_per_worker': None, 'mode': 'valid', 'worker_id': 0}	[runner.py: 273]
[06/01 17:46:39 INFO:] train 0 args {'experiment_name': 'OnePhaseRGBResNetDagger_40proc', 'config': <baseline_configs.one_phase.one_phase_rgb_resnet_dagger.OnePhaseRGBResNetDaggerExperimentConfig object at 0x7efee7ba7650>, 'results_queue': <multiprocessing.queues.Queue object at 0x7efee7ba7c90>, 'checkpoints_queue': <multiprocessing.queues.Queue object at 0x7efee11a5d50>, 'checkpoints_dir': 'rearrange_out/checkpoints/OnePhaseRGBResNetDagger_40proc/2021-06-01_17-46-37', 'seed': 1470811490, 'deterministic_cudnn': False, 'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7efee11abad0>, 'num_workers': 2, 'device': device(type='cuda', index=0), 'distributed_port': 51435, 'max_sampler_processes_per_worker': None, 'initial_model_state_dict': '[SUPRESSED]', 'mode': 'train', 'worker_id': 0}	[runner.py: 258]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 0	[engine.py: 1326]
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating train worker 1	[engine.py: 1326]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
  File "/allenact/algorithms/onpolicy_sync/engine.py", line 1312, in train
    else cast(ActorCriticModel, self.actor_critic.module),
.
.
.
    raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
	[engine.py: 1329]
.
.
.
    raise error.DisplayConnectionError(self.display_name, r.reason)
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
.
.
.
Xlib.error.DisplayConnectionError: Can't connect to display ":1002": b'No protocol specified\n'
[06/01 17:46:41 ERROR:] Encountered Exception. Terminating runner.	[runner.py: 936]
[06/01 17:46:41 ERROR:] Traceback (most recent call last):
  File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
    package[1] - 1
Exception: Train worker 1 abnormally terminated
	[runner.py: 937]
Traceback (most recent call last):
  File "...site-packages/allenact/algorithms/onpolicy_sync/runner.py", line 899, in log
    package[1] - 1
Exception: Train worker 1 abnormally terminated
[06/01 17:46:41 INFO:] Closing train 0	[runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 0	[runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 0	[runner.py: 1012]
[06/01 17:46:41 INFO:] Joining train 1	[runner.py: 1012]
[06/01 17:46:41 INFO:] Closed train 1	[runner.py: 1012]
[06/01 17:46:41 INFO:] Closing valid 0	[runner.py: 1012]
[06/01 17:46:41 INFO:] Joining valid 0	[runner.py: 1012]
[06/01 17:46:41 INFO:] KeyboardInterrupt. Terminating valid worker 0	[engine.py: 1596]
[06/01 17:46:41 INFO:] Closed valid 0	[runner.py: 1012]

After updating to the latest versions of the rearrangement repo and AllenAct, running the baseline models throws an error. My system details: Ubuntu 16.04, display attached, 2 GPUs, and glxgears runs successfully on DISPLAY=:0.0.

Update

Issue solved. Running ls /tmp/.X11-unix/ gave X0 X1002 as output. Assigning open_display_strs = ['0'] in ithor_util.py solved the issue by making it use only the attached display.
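
For reference, a rough sketch of the idea behind that workaround (a hypothetical snippet, not the actual contents of ithor_util.py): list the X sockets under /tmp/.X11-unix and keep only the attached display.

# Discover X displays from their sockets and keep only ":0", ignoring entries
# such as X1002 that do not correspond to a usable display.
import os

discovered = sorted(
    name[1:]  # "X0" -> "0", "X1002" -> "1002"
    for name in os.listdir("/tmp/.X11-unix")
    if name.startswith("X")
)
print(discovered)  # e.g. ['0', '1002']

open_display_strs = [d for d in discovered if d == "0"]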

@Lucaweihs (Contributor)

@nnsriram97 glad to hear you found a solution! We tried to make this "easier" for people by automatically discovering the x-display, but it looks like this didn't like the X1002 display. Any idea what X1002 might be from? This would help us avoid this in the future.
