Training got stuck after creating Vector Sample Tasks #309

Closed
npmhung opened this issue Sep 7, 2021 · 23 comments

@npmhung

npmhung commented Sep 7, 2021

As mentioned in the title, when I tried to run the following command:

python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnetgru_ddppo.py

the training process just got stuck at the following step forever:

image

This never happens on my personal desktop.

I couldn't figure out what the potential problem is.

Server configuration:
Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
48 CPU(s)
2x Tesla K80

@Lucaweihs Lucaweihs self-assigned this Sep 7, 2021
@Lucaweihs
Collaborator

Lucaweihs commented Sep 7, 2021

Hi @npmhung,

Just to double check, can you try starting AI2-THOR instances on each of your x-displays and confirming that everything works as expected:

from ai2thor.controller import Controller
c = Controller(x_display="0.0")
c.step("RotateRight")
print(f"For display 0.0, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()

c = Controller(x_display="0.1")
c.step("RotateRight")
print(f"For display 0.1, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()

The above should print success messages for both displays. If that doesn't work then you should double check that you have started the x-display (this can be done by running sudo ai2thor-xorg start).

If the above isn't the problem, can you try reducing the number of training processes to 1 and seeing if training still hangs?

One last thing: assuming your questions were answered for issue #308, can you close it?

@npmhung
Author

npmhung commented Sep 7, 2021

I got the following message:

For display 0.0, action was successful: True.
For display 0.1, action was successful: True.

@npmhung
Author

npmhung commented Sep 7, 2021

Yes, it still hangs with 1 process.

@Lucaweihs
Collaborator

Lucaweihs commented Sep 7, 2021

Ok great. In the image you linked, there is a key/value pair with the key named "env_args". Can you copy the dictionary there into a new variable and try the following:

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

@Lucaweihs
Collaborator

Lucaweihs commented Sep 7, 2021

I did a bit of debugging and it seems that AI2-THOR does not like it when you use odd integers in the height/width of the window. Can you change the window size from 300x225 to 304x228 and try again?
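
For reference, the change from 300x225 to 304x228 matches the next size that keeps the same 4:3 aspect ratio while making both dimensions even. A minimal sketch of that arithmetic follows; the helper name `next_even_size` is hypothetical and is not part of AI2-THOR or AllenAct.

```python
from math import gcd
from typing import Tuple


def next_even_size(width: int, height: int) -> Tuple[int, int]:
    """Smallest size >= (width, height) with the same aspect ratio and even sides."""
    g = gcd(width, height)
    ratio_w, ratio_h = width // g, height // g  # reduced ratio, e.g. 4:3
    scale = g
    # Any even scale makes both sides even, so this loop always terminates.
    while (ratio_w * scale) % 2 or (ratio_h * scale) % 2:
        scale += 1
    return ratio_w * scale, ratio_h * scale


print(next_even_size(300, 225))
```

Sizes that are already even in both dimensions are returned unchanged.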

@npmhung
Author

npmhung commented Sep 7, 2021

Ok, I will try that.

Besides, fyi, this line "env = RoboThorEnvironment(**env_args)" hangs on my server.

@npmhung
Author

npmhung commented Sep 7, 2021

I tried as you suggested. However, changing the resolution didn't help at all in my case.

@Lucaweihs
Collaborator

Can you double check that this is also the case with env = RoboThorEnvironment(**env_args), i.e.

from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image

env_args["height"] = 228
env_args["width"] = 304
env = RoboThorEnvironment(**env_args)

env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

@npmhung
Author

npmhung commented Sep 7, 2021

This line "env = RoboThorEnvironment(**env_args)" still hangs on my side.

@Lucaweihs
Collaborator

The env = RoboThorEnvironment(**env_args) call definitely seems to be the issue. Can you confirm that:

env = RoboThorEnvironment()
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")

works?

If so, could you try the following:

new_env_args = {}
for key, val in env_args.items():
    print(f"Trying to add key {key} with value {val}")
    new_env_args[key] = val
    env = RoboThorEnvironment(**new_env_args)
    env.step(action="RotateRight")
    assert env.last_event.metadata['lastActionSuccess']

    print(f"Env successfully started with env_args == {new_env_args}")
    env.stop()

This should eventually hang and tell us what is causing the issue.
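
Since the loop above simply stops at the first hang, someone has to watch it. One way to detect a hang automatically is to run each attempt in a separate Python process with a timeout. The sketch below uses only the standard library; the helper name `starts_within` and the 120-second timeout in the example are illustrative assumptions, not part of AllenAct or AI2-THOR.

```python
import subprocess
import sys


def starts_within(snippet: str, timeout_s: float) -> bool:
    """Run `snippet` in a fresh Python process.

    Returns True if it exits with status 0 before the timeout, and
    False if it hangs past `timeout_s` seconds or exits with an error.
    """
    try:
        result = subprocess.run([sys.executable, "-c", snippet], timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return False
    return result.returncode == 0


# Example (needs allenact_plugins installed, so it is left commented out):
# snippet = (
#     "from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment\n"
#     "env = RoboThorEnvironment()\n"
#     "env.stop()\n"
# )
# print(starts_within(snippet, timeout_s=120))
```

Only the direct child process is killed on timeout, so a Unity process started by that child may be left running and may need to be cleaned up by hand.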

@npmhung
Author

npmhung commented Sep 8, 2021

I ran your code, and the key commit_id is causing the issue. The code finished successfully only after I removed that key.

@Lucaweihs
Collaborator

I see, this commit id is what determines the AI2-THOR build that should be used. It's unlikely but possible that this file was corrupted while downloading. Can you try just starting an AI2-THOR controller with this commit id:

from ai2thor.controller import Controller
Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

and seeing if that hangs for you?

If so, I'd suggest deleting the

thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.lock

directory/file within the ~/.ai2thor/releases directory and then trying to run Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e") again (it should download an ~500mb file before starting).

@npmhung
Author

npmhung commented Sep 8, 2021

It couldn't start downloading and got stuck even after deleting those 2 files.

Could it be that the server's firewall is preventing me from downloading?

@npmhung
Author

npmhung commented Sep 8, 2021

I can try to upload those files from my desktop if that could help?

@Lucaweihs
Collaborator

It's possible that the firewall is causing issues. Here's a potential workaround:

cd ~/.ai2thor/releases
wget http://s3-us-west-2.amazonaws.com/ai2-thor-public/builds/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
unzip thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip -d thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e

This should manually download the files and unzip them into the correct location. If you can't download the zip file with wget then you'll probably have to download it locally and upload it.

@Lucaweihs
Collaborator

You might also want to check the md5 hash of the zip file to make sure it was downloaded correctly:

$ md5sum thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
dfc4dc0f7bfdb2254221ae35fc712363  thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
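
If md5sum is not available on the machine, the same check can be done from Python with the standard hashlib module. This is a minimal sketch; the helper name `md5_of_file` is hypothetical, and the expected hash is the one quoted above.

```python
import hashlib

# Hash quoted in this thread for the bad5bc2b... Linux64 build zip.
EXPECTED_MD5 = "dfc4dc0f7bfdb2254221ae35fc712363"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example (assumes the zip is in the current directory, so it is left commented out):
# name = "thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip"
# print(md5_of_file(name) == EXPECTED_MD5)
```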

@npmhung
Author

npmhung commented Sep 8, 2021

I downloaded the file and checked the md5 hash sum. The data is correct, but the Controller still couldn't load it.

@Lucaweihs
Collaborator

This is quite odd, especially as you're able to successfully run other builds. Can you try running

cd ~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e  -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300

and then logging into the server from another terminal window and checking that

nvidia-smi

shows around 34mb of memory being used by the AI2-THOR unity process? Here's what it looks like for me:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4624      G   /usr/lib/xorg/Xorg                 58MiB |
|    0   N/A  N/A     14800      G   ...b766ffb45d455c211329af17e       34MiB |

The process named ...b766ffb45d455c211329af17e is the Unity process.

@ekolve - would you have any ideas what might be causing the issue?

@ekolve

ekolve commented Sep 8, 2021

One thing you could try checking is whether there are any other python processes running that may have ai2thor loaded. There are a few places where we take exclusive locks to ensure that only one process downloads a release as well as when we prune releases. The block of code that gets locked is very small and should never fail or hang, but it's worth checking. If there are other processes, try killing them. Also, could you report what version of ai2thor you are running by running this:

import ai2thor
print(ai2thor.__version__)
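
Besides looking for stray python processes, it may help to test directly whether a lock file is currently held. The sketch below assumes the exclusive locks mentioned above are advisory flock-style locks on the .lock files under ~/.ai2thor/releases; that is an assumption, not something confirmed in this thread. It only works on Unix-like systems, and `lock_is_held` is a hypothetical helper name.

```python
import fcntl


def lock_is_held(lock_path: str) -> bool:
    """Return True if an exclusive flock on `lock_path` cannot be taken right now."""
    with open(lock_path, "rb") as f:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # We got the lock, so nobody else held it; release it again.
        fcntl.flock(f, fcntl.LOCK_UN)
        return False


# Example (path taken from this thread, left commented out):
# import os
# path = os.path.expanduser(
#     "~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.lock"
# )
# print(lock_is_held(path))
```

A result of True would suggest that another process still holds the lock. A result of False does not rule out other causes of the hang.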

@npmhung
Author

npmhung commented Sep 9, 2021

@Lucaweihs
The process started by the DISPLAY command shows up in nvidia-smi just like what you showed me.

@ekolve
I'm using ai2thor 3.3.5

@Lucaweihs
Collaborator

This is a bit of an extreme measure, but could you:

  1. Delete the ~/.ai2thor directory.
  2. Delete and then reinstall the virtual environment you're using, passing the --no-cache-dir option when you pip install ai2thor.
  3. Run
from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")

and check if it does/doesn't hang.

My best guess (suggested by @jordis-ai2 ) is that perhaps the initial call to use a window display with height 225 is being cached somewhere and is causing the problem. If the above still hangs can you try running

from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="dd25cb479958e915e2ed1282062345b0f81dc4e2")

to see if that also hangs?

@ekolve

ekolve commented Sep 9, 2021

In addition to @Lucaweihs's instructions, if it does hang, press CTRL-C to interrupt the process. You should hopefully get a stack trace from python showing where the process was hung.
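
If pressing CTRL-C is inconvenient, for example in a non-interactive job, Python's standard faulthandler module can print the stack of every thread once a timeout passes. The sketch below shows one way to wrap a call that might hang; `dump_traceback_if_stuck` is a hypothetical helper name, and the 60-second timeout in the example is an arbitrary choice.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def dump_traceback_if_stuck(timeout_s: float, file=None):
    """Dump all thread stacks to `file` if the block runs longer than `timeout_s`."""
    target = sys.stderr if file is None else file
    # exit=False: only print the traceback, do not terminate the process.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=target)
    try:
        yield
    finally:
        # Disarm the timer once the block finishes (or raises).
        faulthandler.cancel_dump_traceback_later()


# Example (needs ai2thor installed, so it is left commented out):
# from ai2thor.controller import Controller
# with dump_traceback_if_stuck(60):
#     Controller(x_display="0.0", commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")
```

The traceback is written without interrupting the program, so the process still has to be stopped manually if it never returns.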

@Lucaweihs
Collaborator

I'm going to close this issue. Please feel free to reopen if you're still having trouble.
