Training got stuck after creating Vector Sample Tasks #309

npmhung · 2021-09-07T20:58:22Z

As mentioned in the title, when I ran the following command:

python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnetgru_ddppo.py

the training process just got stuck at the following step forever:

This never happens on my personal desktop.

I couldn't figure out what the potential problem is.

Server configuration:
Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
48 CPU(s)
2x Tesla K80

Lucaweihs · 2021-09-07T21:43:23Z

Hi @npmhung,

Just to double check, can you try starting AI2-THOR instances on each of your x-displays and confirming that everything works as expected:

from ai2thor.controller import Controller
c = Controller(x_display="0.0")
c.step("RotateRight")
print(f"For display 0.0, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()

c = Controller(x_display="0.1")
c.step("RotateRight")
print(f"For display 0.1, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()

The above should print success messages for both displays. If that doesn't work then you should double check that you have started the x-display (this can be done by running sudo ai2thor-xorg start).

If the above isn't the problem, can you try reducing the number of training processes to 1 and seeing if training still hangs?

One last thing: assuming your questions were answered for issue #308, can you close it?

npmhung · 2021-09-07T21:48:02Z

I got the following message:

For display 0.0, action was successful: True.
For display 0.1, action was successful: True.

npmhung · 2021-09-07T21:53:53Z

Yes, it still hangs with 1 process.

Lucaweihs · 2021-09-07T21:57:07Z

Ok great. In the image you linked, there is a key/value pair with the key named "env_args". Can you copy the dictionary there into a new variable and try the following:

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

Lucaweihs · 2021-09-07T22:05:46Z

I did a bit of debugging and it seems that AI2-THOR does not like it when you use odd integers in the height/width of the window. Can you change the window size from 300x225 to 304x228 and try again?
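For reference, 304x228 is the smallest size at or above 300x225 that keeps the same 4:3 aspect ratio while making both dimensions even. A minimal sketch of that arithmetic follows; `next_even_size` is a hypothetical helper written for illustration and is not part of AI2-THOR or AllenAct.

```python
from math import gcd


def next_even_size(width, height):
    """Smallest (width, height) >= the input with the same aspect ratio and both values even."""
    g = gcd(width, height)
    unit_w, unit_h = width // g, height // g  # reduced aspect ratio, e.g. 4:3
    k = g  # current scale factor; increase until both dimensions are even
    while (unit_w * k) % 2 or (unit_h * k) % 2:
        k += 1
    return unit_w * k, unit_h * k


print(next_even_size(300, 225))
```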

npmhung · 2021-09-07T22:07:14Z

Ok, I will try that.

Also, FYI, the line "env = RoboThorEnvironment(**env_args)" hangs on my server.

npmhung · 2021-09-07T22:12:34Z

I tried as you suggested. However, changing the resolution didn't help at all in my case.

Lucaweihs · 2021-09-07T22:16:12Z

Can you double check that this is also the case with env = RoboThorEnvironment(**env_args), i.e.

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env_args["height"] = 228
env_args["width"] = 304
env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

npmhung · 2021-09-07T22:23:32Z

This line "env = RoboThorEnvironment(**env_args)" still hangs on my side.

Lucaweihs · 2021-09-08T16:09:40Z

The env = RoboThorEnvironment(**env_args) definitely seems to be the issue. Can you confirm that:

env = RoboThorEnvironment()
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

works?

If so, could you try the following:

new_env_args = {}
for key, val in env_args.items():
    print(f"Trying to add key {key} with value {val}")
    new_env_args[key] = val
    env = RoboThorEnvironment(**new_env_args)
    env.step(action="RotateRight")
    assert env.last_event.metadata['lastActionSuccess']

    print(f"Env successfully started with env_args == {new_env_args}")
    env.stop()

This should eventually hang and tell us what is causing the issue.
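If waiting for the loop to hang by hand is inconvenient, the same key-by-key search can be given a timeout. The sketch below is only an illustration: `first_failing_key` and `make_env` are hypothetical names, and it assumes `make_env` is any callable that accepts the environment arguments as keywords. Each attempt runs in a daemon thread, so a hung attempt is reported but not killed.

```python
import threading


def first_failing_key(make_env, env_args, timeout_s=120.0):
    """Add env_args one key at a time and report the first key that breaks make_env.

    Returns (key, reason), where reason is "timeout" or the repr of the raised
    exception, or (None, "ok") if every prefix of env_args works.
    """
    partial_args = {}
    for key, val in env_args.items():
        partial_args[key] = val
        done = threading.Event()
        outcome = {}

        def attempt(args=dict(partial_args), outcome=outcome, done=done):
            try:
                make_env(**args)
            except Exception as exc:  # report failures instead of waiting for the timeout
                outcome["error"] = exc
            finally:
                done.set()

        # Daemon thread: a hung call cannot be killed, but it will not block interpreter exit.
        threading.Thread(target=attempt, daemon=True).start()
        if not done.wait(timeout_s):
            return key, "timeout"
        if "error" in outcome:
            return key, repr(outcome["error"])
    return None, "ok"
```

In practice `make_env` would be a small wrapper that builds the RoboThorEnvironment, takes one step, and stops it. Running each attempt in a separate process instead of a thread would be more robust, since a hung process can be terminated.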

npmhung · 2021-09-08T20:11:10Z

I ran your code, and the key commit_id is causing the issue. Only after removing that key did the code finish successfully.

Lucaweihs · 2021-09-08T20:28:21Z

I see, this commit id is what determines the AI2-THOR build that should be used. It's unlikely but possible that this file was corrupted while downloading. Can you try just starting an AI2-THOR controller with this commit id:

from ai2thor.controller import Controller
Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

and seeing if that hangs for you?

If so, I'd suggest deleting the

thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.lock

directory/file within the ~/.ai2thor/releases directory and then trying to run Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e") again (it should download an ~500mb file before starting).

npmhung · 2021-09-08T20:40:02Z

It couldn't start downloading and got stuck, even after deleting those 2 files.

Could it be that the server's firewall is preventing me from downloading?

npmhung · 2021-09-08T20:49:01Z

I can try to upload those files from my desktop if that could help?

Lucaweihs · 2021-09-08T20:54:44Z

It's possible that the firewall is causing issues. Here's a potential workaround:

cd ~/.ai2thor/releases
wget http://s3-us-west-2.amazonaws.com/ai2-thor-public/builds/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
unzip thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip -d thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e

This should manually download the files and unzip them into the correct location. If you can't download the zip file with wget then you'll probably have to download it locally and upload it.

Lucaweihs · 2021-09-08T20:55:31Z

You might also want to check the md5 hash of the zip file to make sure it was downloaded correctly:

$ md5sum thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
dfc4dc0f7bfdb2254221ae35fc712363  thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
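If md5sum is not available on the machine, the same check can be done from Python with the standard library. This is a minimal sketch; `md5_of_file` is a hypothetical helper name, and the result should be compared against the hash shown above.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip")` run from within ~/.ai2thor/releases should return the value printed by md5sum.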

npmhung · 2021-09-08T21:34:40Z

I downloaded the file and checked its md5 hash. The data is correct, but the Controller still couldn't load it.

Lucaweihs · 2021-09-08T22:06:15Z

This is quite odd, especially as you're able to successfully run other builds. Can you try running

cd ~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e  -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300

and then logging into the server from another terminal window and checking that

nvidia-smi

shows around 34mb of memory being used by the AI2-THOR unity process? Here's what it looks like for me:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4624      G   /usr/lib/xorg/Xorg                 58MiB |
|    0   N/A  N/A     14800      G   ...b766ffb45d455c211329af17e       34MiB |

The process with name ...b766ffb45d455c211329af17e is the Unity process.

@ekolve - would you have any ideas what might be causing the issue?

ekolve · 2021-09-08T22:18:55Z

One thing you could try checking is whether there are any other python processes running that may have ai2thor loaded. There are a few places where we take exclusive locks to ensure that only one process downloads a release as well as when we prune releases. The block of code that gets locked is very small and should never fail or hang, but it's worth checking. If there are other processes, try killing them. Also, could you report what version of ai2thor you are running by running this:

import ai2thor
print(ai2thor.__version__)

npmhung · 2021-09-09T06:03:53Z

@Lucaweihs
The DISPLAY command shows up in nvidia-smi just as in the output you showed me.

@ekolve
I'm using ai2thor 3.3.5

Lucaweihs · 2021-09-09T15:51:13Z

This is a bit of an extreme measure, but could you:

Delete the ~/.ai2thor directory.
Delete and then reinstall the virtual environment you're using, passing the --no-cache-dir option when you pip install ai2thor.
Run

from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

and check if it does/doesn't hang.

My best guess (suggested by @jordis-ai2) is that perhaps the initial call to use a window display with height 225 is being cached somewhere and is causing the problem. If the above still hangs, can you try running

from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="dd25cb479958e915e2ed1282062345b0f81dc4e2")

to see if that also hangs?

ekolve · 2021-09-09T17:16:46Z

In addition to @Lucaweihs's instructions, if it does hang, press Ctrl-C to interrupt the process. You should hopefully get a Python stack trace showing where the process was hung.
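If interrupting by hand is awkward (for example in a batch job), Python's standard faulthandler module can dump the stack trace automatically after a timeout. A minimal sketch, assuming a two-minute wait is long enough to call it a hang; `traceback_on_hang` is a hypothetical helper name, not part of ai2thor.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def traceback_on_hang(timeout_s, file=sys.stderr):
    """Dump the stack of every thread to `file` if the body runs longer than timeout_s."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
    try:
        yield
    finally:
        # Cancel the pending dump once the body finishes (or raises).
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the Controller(...) call in `with traceback_on_hang(120):` should then print where the process is stuck without needing to interrupt it. Note that `file` must be a real file with a file descriptor, such as sys.stderr or an opened log file.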

Lucaweihs · 2022-01-24T21:16:54Z

I'm going to close this issue. Please feel free to reopen if you're still having trouble.

Lucaweihs self-assigned this Sep 7, 2021

Lucaweihs closed this as completed Jan 24, 2022

7hinkDifferent mentioned this issue Aug 16, 2023

Training Process Hangs when Training with GPU allenai/ai2thor-rearrangement#43

Open
