Only one valid platform is required to run AI2-THOR #374

Closed
YYDS-cc opened this issue Aug 31, 2023 · 10 comments

@YYDS-cc

YYDS-cc commented Aug 31, 2023

When I run the command

PYTHONPATH=. python allenact/main.py training_a_pointnav_model -o storage/robothor-pointnav-rgb-resnet-resnet -b projects/tutorials

on a remote server with an attached display, I get the following error:

Exception: The following builds were found, but had missing dependencies. Only one valid platform is required to run AI2-THOR.
Platform Linux64 failed validation with the following errors: Invalid display: :0.0. Failed to connect Can't connect to display ":0.0": b'No protocol specified\n'
Linux64 requires a X11 server to be running with GLX. The following valid displays were found :13.0

How can I solve this issue?
Please help.
Thanks!
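
For reference, the error text above already names a display that passed validation (:13.0), while the one that was tried (:0.0) was rejected. A minimal sketch, assuming the message keeps the wording shown above, of pulling the valid display names out of such a message. The function name and regular expression are illustrative and not part of AI2-THOR:

```python
import re
from typing import List

# Matches X display names such as ":0.0" or ":13.0" (screen suffix optional).
DISPLAY_RE = re.compile(r":\d+(?:\.\d+)?")


def valid_displays_from_error(message: str) -> List[str]:
    """Return the display names listed after 'valid displays were found'.

    Hypothetical helper: it relies only on the wording of the error
    message quoted in this thread, not on any AI2-THOR API.
    """
    marker = "valid displays were found"
    _, found, tail = message.partition(marker)
    if not found:
        return []
    # Only look at the remainder of the line that contains the marker.
    first_line = tail.split("\n", 1)[0]
    return DISPLAY_RE.findall(first_line)
```

A display found this way could then be tried as the x_display value for the THOR controller; where that value is set depends on the experiment config.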

@YYDS-cc
Author

YYDS-cc commented Aug 31, 2023

Additionally, when I run
python main.py object_nav_ithor_ppo_one_object -b projects/tutorials -s 12345
the monitor goes black momentarily. I know this happens in order to open the search window, but after the monitor comes back, the terminal output is no longer updated.
I have also run
sudo python scripts/startx.py &
but it doesn't appear to do anything.

@jordis-ai2
Collaborator

Hi @YYDS-cc,

Given your setup, I think it would be worth trying THOR in headless mode. To do that, pass a gpu_device instead of an x_display (using the CloudRendering platform). You can see an example here:

Let us know if this unblocks you!

@YYDS-cc
Author

YYDS-cc commented Sep 1, 2023

Hi @jordis-ai2,
I tried setting headless to True, but it doesn't work.

I also tried commenting out this code, but it still doesn't work.

Did I change the code in the wrong place?

@jordis-ai2
Collaborator

jordis-ai2 commented Sep 1, 2023

I think I need to see the output you get when using headless mode. Can you copy it here?

@YYDS-cc
Author

YYDS-cc commented Sep 1, 2023

[09/01 17:24:13 INFO:] Running with args Namespace(approx_ckpt_step_interval=None, ... ,[main.py: 452]
[09/01 17:24:18 INFO:] Git diff saved to experiment_output/used_configs/ObjectNavThorPPO/2023-09-01_17-24-15 [runner.py: 890]
[09/01 17:24:18 INFO:] Config files saved to experiment_output/used_configs/ObjectNavThorPPO/2023-09-01_17-24-15 [runner.py: 935]
[09/01 17:24:18 INFO:] Using 1 train workers on devices (device(type='cuda', index=0),) [runner.py: 317]
[09/01 17:24:19 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:19 INFO:] Using local worker ids [0] (total 1 workers in machine 0) [runner.py: 326]
[09/01 17:24:19 INFO:] Started 1 train processes [runner.py: 595]
[09/01 17:24:19 INFO:] Using 1 valid workers on devices (device(type='cuda', index=1),) [runner.py: 317]
[09/01 17:24:19 INFO:] Started 1 valid processes [runner.py: 622]
[09/01 17:24:21 INFO:] valid 0 args [...][runner.py: 433]
[09/01 17:24:21 INFO:] train 0 args [...] [runner.py: 416]
[09/01 17:24:22 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:22 INFO:] there are 1 belief models: ['single_belief'] [visual_nav_models.py: 116]
[09/01 17:24:29 INFO:] Starting 0-th VectorSampledTask worker with args [...]
[09/01 17:24:31 INFO:] Starting 0-th SingleProcessVectorSampledTasks generator with args [...]
[09/01 17:24:31 INFO:] Starting 1-th VectorSampledTask worker with args [...]
[09/01 17:24:33 INFO:] Starting 0-th SingleProcessVectorSampledTasks generator with args [...]
[09/01 17:29:33 ERROR:] [train worker 0 ] Encountered TimeoutError , exiting. [engine.py: 1858]
File "/allenact/allenact/algorithms/onpolicy_sync/vector_sampled_tasks.py", line 272, in read_with_timeout
raise TimeoutError(
TimeoutError: Did not receive output from 'VectorSampledTask' worker for 300 seconds.
[engine.py: 1861]
[09/01 17:29:34 ERROR:] Encountered Exception. Terminating runner. [runner.py: 1467]
[09/01 17:29:34 ERROR:] Traceback (most recent call last):
File "/allenact/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[runner.py: 1468]
Traceback (most recent call last):
File "/allenact/allenact/algorithms/onpolicy_sync/runner.py", line 1434, in log_and_close
raise Exception(
Exception: Train worker 0 abnormally terminated
[09/01 17:29:34 INFO:] Terminating train 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Terminating valid 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Termination signal sent to worker Train-0. Worker Train-0 is already closed, exiting. [runner.py: 348]
[09/01 17:29:34 INFO:] Joining train 0 [runner.py: 1543]
[09/01 17:29:34 INFO:] Termination signal sent to worker Valid-0. Forcing worker Valid-0 to close and exiting. [runner.py: 353]
[09/01 17:29:35 INFO:] Closed train 0 [runner.py: 1543]
[09/01 17:29:35 INFO:] Joining valid 0 [runner.py: 1543]
[09/01 17:29:35 INFO:] Closed valid 0 [runner.py: 1543]

@jordis-ai2
Collaborator

If you run export ALLENACT_DEBUG_VST_TIMEOUT=1000 before the command you currently use to start your experiment, does it still fail (just after a longer wait)?
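
One thing worth ruling out is whether the exported variable is visible to the Python process at all, since a variable exported in one shell is not seen by a process started from another. A minimal sketch, not AllenAct's own code, of reading such a variable with a fallback. The helper name and the 300-second default (taken from the log above) are assumptions for illustration:

```python
import os
from typing import Mapping, Optional


def timeout_from_env(
    name: str = "ALLENACT_DEBUG_VST_TIMEOUT",
    default: float = 300.0,
    env: Optional[Mapping[str, str]] = None,
) -> float:
    """Read a timeout in seconds from the environment.

    Hypothetical helper: falls back to `default` when the variable is
    unset, empty, or not a number.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return float(raw)
    except ValueError:
        return default
```

Printing timeout_from_env() from the same shell that launches the experiment shows whether the export took effect there.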

@YYDS-cc
Author

YYDS-cc commented Sep 1, 2023

Changing the waiting time doesn't help.
In fact, export ALLENACT_DEBUG_VST_TIMEOUT=1000 doesn't change the waiting time; it is still 300 seconds.
So I changed the timeout directly in

for space in read_fn(timeout_to_use=5 * self.read_timeout if self.read_timeout is not None else None) # type: ignore

and I still get the same error; only the waiting time has changed.

@jordis-ai2
Collaborator

I assume you have already tried starting a standalone THOR controller to check that everything is installed correctly, but in case you haven't, can you try running a script like this?

from ai2thor.platform import CloudRendering
from ai2thor.controller import Controller
import cv2

# Start a headless controller on GPU 0 using the CloudRendering platform
c = Controller(platform=CloudRendering, gpu_device=0)
# Frames are RGB; reverse the channel order to BGR for OpenCV before saving
cv2.imwrite("/path/to/debug_output_image.png", c.last_event.frame[:, :, ::-1])
c.stop()

@YYDS-cc
Author

YYDS-cc commented Sep 1, 2023

That script installed the thor-CloudRendering build and then hit a new error, the same one I got earlier when running the PointNav task with the command PYTHONPATH=. python allenact/main.py training_a_pointnav_model -o storage/robothor-pointnav-rgb-resnet-resnet -b projects/tutorials.

Error: RuntimeError: vulkaninfo failed to run, please ask your administrator to install vulkaninfo (e.g. on Ubuntu systems this requires running sudo apt install vulkan-tools).

But when I run sudo apt install vulkan-tools, the server can't locate the package vulkan-tools.
Running sudo apt-get update first doesn't help.

I installed the same environment on my PC (Ubuntu 18.04) following the tutorial, and both the PointNav and ObjectNav tasks run without problems.

@jordis-ai2
Collaborator

https://packages.ubuntu.com/search?keywords=vulkan-tools has a list of packages for different Ubuntu versions. It's possible that third parties provide vulkan-tools for other/older versions.
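
To look up the right entry in that list, it helps to know which Ubuntu release the server runs and whether vulkaninfo is already on the PATH. A minimal sketch using only the Python standard library. Both helper names are illustrative, and the parser assumes the usual KEY=value layout of /etc/os-release:

```python
import shutil
from typing import Dict


def has_vulkaninfo() -> bool:
    """Return True if a vulkaninfo executable is found on the PATH."""
    return shutil.which("vulkaninfo") is not None


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=value lines (as in /etc/os-release) into a dict.

    Blank lines, comments, and lines without '=' are skipped; surrounding
    quotes around values are removed.
    """
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip().strip("\"'")
    return info
```

Reading /etc/os-release on the server and passing its contents to parse_os_release gives the VERSION_ID to compare against the package list.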

It sounds like this is out-of-scope for AllenAct, so I'm closing the issue.
