Training got stuck after creating Vector Sample Tasks #309
Comments
Hi @npmhung, Just to double check, can you try starting AI2-THOR instances on each of your x-displays and confirming that everything works as expected:
from ai2thor.controller import Controller
c = Controller(x_display="0.0")
c.step("RotateRight")
print(f"For display 0.0, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
c.stop()
c = Controller(x_display="0.1")
c.step("RotateRight")
print(f"For display 0.1, action was successful: {c.last_event.metadata['lastActionSuccess']}.")
The above should print success messages for both displays. If that doesn't work, then you should double check that you have started the x-display (this can be done by running
If the above isn't the problem, can you try reducing the number of training processes to 1 and seeing if training still hangs?
One last thing: assuming your questions were answered for issue #308, can you close it? |
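The two-display check above can be generalized to any number of x-displays. This is only a sketch: `check_displays` and its `make_controller` parameter are hypothetical names introduced here, and the only AI2-THOR calls assumed are the ones already used in the snippet above (`Controller(x_display=...)`, `step`, `last_event.metadata`, `stop`).

```python
from typing import Any, Callable, Dict, Iterable


def check_displays(
    displays: Iterable[str], make_controller: Callable[..., Any]
) -> Dict[str, bool]:
    """Start one controller per x-display and record whether a test action succeeds."""
    results: Dict[str, bool] = {}
    for display in displays:
        controller = make_controller(x_display=display)
        try:
            controller.step("RotateRight")
            results[display] = bool(
                controller.last_event.metadata["lastActionSuccess"]
            )
        finally:
            # Always shut the instance down so it does not linger on the display.
            controller.stop()
    return results
```

With AI2-THOR installed, passing `ai2thor.controller.Controller` as `make_controller` and `["0.0", "0.1"]` as `displays` should reproduce the check above.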
I got the following message: For display 0.0, action was successful: True. |
Yes, it still hangs with 1 process. |
Ok great, in the image you linked, there is a key/value pair with the key named
from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image
env = RoboThorEnvironment(**env_args)
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.") |
I did a bit of debugging and it seems that AI2-THOR does not like it when you use odd integers in the height/width of the window. Can you change the window size from 300x225 to 304x228 and try again? |
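For reference, the 304x228 suggestion is the smallest size at or above 300x225 that keeps the 4:3 aspect ratio while making both dimensions even. A minimal sketch of that calculation follows; `next_even_size` is a hypothetical helper, not part of AI2-THOR or AllenAct, and it assumes positive integer dimensions.

```python
from fractions import Fraction
from typing import Tuple


def next_even_size(width: int, height: int) -> Tuple[int, int]:
    """Smallest (width, height) >= the input with the same aspect ratio and even sides."""
    ratio = Fraction(width, height)  # exact aspect ratio, e.g. 4/3
    h = height
    while True:
        w = ratio * h
        # Accept only an even height whose matching width is an even integer.
        if h % 2 == 0 and w.denominator == 1 and w.numerator % 2 == 0:
            return int(w), h
        h += 1


print(next_even_size(300, 225))  # (304, 228)
```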
Ok, I will try that. Also, FYI, the line "env = RoboThorEnvironment(**env_args)" hangs on my server. |
I tried as you suggested. However, changing the resolution didn't help at all in my case. |
Can you double check that this is also the case with
from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment
env_args = ... # The dictionary from the image
env_args["height"] = 228
env_args["width"] = 304
env = RoboThorEnvironment(**env_args)
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.") |
This line "env = RoboThorEnvironment(**env_args)" still hangs on my side. |
So the following works?
env = RoboThorEnvironment()
env.step(action="RotateRight")
print(f"Action was successful: {env.last_event.metadata['lastActionSuccess']}.")
If so, could you try the following:
new_env_args = {}
for key, val in env_args.items():
    # Add one key at a time; the last key printed before the hang is the culprit.
    print(f"Trying to add key {key} with value {val}")
    new_env_args[key] = val
    env = RoboThorEnvironment(**new_env_args)
    env.step(action="RotateRight")
    assert env.last_event.metadata['lastActionSuccess']
    print(f"Env successfully started with env_args == {new_env_args}")
    env.stop()
This should eventually hang and tell us what is causing the issue. |
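The loop above depends on someone watching for the last printed key before the hang. A variation that reports the offending key on its own is sketched below. It is an assumption-laden sketch, not something from AllenAct: `first_hanging_key` and `try_start` are hypothetical names, and the start attempt runs in a daemon thread, so a genuinely hung call keeps running in the background until the script exits.

```python
import threading
from typing import Any, Callable, Dict, Optional


def first_hanging_key(
    full_kwargs: Dict[str, Any],
    try_start: Callable[[Dict[str, Any]], None],
    timeout_s: float = 120.0,
) -> Optional[str]:
    """Add keys one at a time; return the first key after which try_start times out."""
    partial: Dict[str, Any] = {}
    for key, val in full_kwargs.items():
        partial[key] = val
        # Pass a copy so later additions cannot affect a still-running attempt.
        worker = threading.Thread(
            target=try_start, args=(dict(partial),), daemon=True
        )
        worker.start()
        worker.join(timeout_s)
        if worker.is_alive():
            # The attempt did not finish in time: treat this key as the culprit.
            return key
    # An exception inside try_start ends the thread and is not counted as a hang.
    return None
```

Here `try_start` would wrap the body of the loop above (construct `RoboThorEnvironment(**kwargs)`, take one step, then call `stop()`).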
I ran your code, and the key commit_id is causing the issue. Only after removing that key did the code finish successfully. |
I see, this commit id is what determines the AI2-THOR build that should be used. It's unlikely but possible that this file was corrupted while downloading. Can you try just starting an AI2-THOR controller with this commit id:
from ai2thor.controller import Controller
Controller(commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")
and seeing if that hangs for you? If so, I'd suggest deleting the
thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.lock
directory/file within the |
It couldn't start downloading and got stuck even after deleting those 2 files. Could it be that the server's firewall is preventing me from downloading? |
I can try to upload those files from my desktop if that could help? |
It's possible that the firewall is causing issues. Here's a potential workaround:
cd ~/.ai2thor/releases
wget http://s3-us-west-2.amazonaws.com/ai2-thor-public/builds/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
unzip thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip -d thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
This should manually download the files and unzip them into the correct location. If you can't download the zip file with |
You might also want to check the md5 hash of the zip file to make sure it was downloaded correctly:
$ md5sum thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip
dfc4dc0f7bfdb2254221ae35fc712363 thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e.zip |
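If `md5sum` is not available on the machine, the same check can be done from Python with the standard library. This is a minimal sketch; `md5_of_file` is a hypothetical helper name, and the file is read in chunks so a large build archive does not have to fit in memory.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For the zip above, the result would be compared against `dfc4dc0f7bfdb2254221ae35fc712363`.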
I downloaded the file and checked the md5 hash. The data is correct, but the Controller still couldn't load it. |
This is quite odd, especially as you're able to successfully run other builds. Can you try running
cd ~/.ai2thor/releases/thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e
DISPLAY=:0.0 ./thor-Linux64-bad5bc2b250615cb766ffb45d455c211329af17e -screen-fullscreen 0 -screen-quality 1 -screen-width 300 -screen-height 300
and then logging into the server from another terminal window and checking that nvidia-smi shows around 34 MiB of memory being used by the AI2-THOR Unity process? Here's what it looks like for me:
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4624      G   /usr/lib/xorg/Xorg                 58MiB |
|    0   N/A  N/A     14800      G   ...b766ffb45d455c211329af17e       34MiB |
the process with name
@ekolve - would you have any ideas what might be causing the issue? |
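To check the process table without reading it by eye, the rows can be pulled out of captured `nvidia-smi` text with a small parser. This is a rough sketch that assumes the row layout shown in the table above (it will miss process names containing spaces, and other driver versions may format the table differently); `parse_process_rows` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# One process row: GPU, GI, CI, PID, Type, Process name, memory in MiB.
_ROW = re.compile(
    r"^\|\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB\s+\|$"
)


def parse_process_rows(table_text: str) -> List[Tuple[int, str, int]]:
    """Return (pid, process_name, memory_mib) for each process row in the text."""
    rows = []
    for line in table_text.splitlines():
        match = _ROW.match(line.strip())
        if match:
            rows.append((int(match.group(4)), match.group(6), int(match.group(7))))
    return rows
```

Filtering the result for a name ending in the tail of the commit id (`b766ffb45d455c211329af17e` in the table above) would pick out the AI2-THOR process and its memory usage.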
One thing you could try checking is whether there are any other python processes running that may have ai2thor loaded. There are a few places where we take exclusive locks to ensure that only one process downloads a release, as well as when we prune releases. The block of code that gets locked is very small and should never fail or hang, but it's worth checking. If there are other processes, try killing them. Also, could you report what version of ai2thor you are running by running this:
|
@Lucaweihs @ekolve |
This is a bit of an extreme measure, but could you:
from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="bad5bc2b250615cb766ffb45d455c211329af17e")
and check if it does/doesn't hang. My best guess (suggested by @jordis-ai2) is that perhaps the initial call to use a window display with height 225 is being cached somewhere and is causing the problem. If the above still hangs, can you try running
from ai2thor.controller import Controller
Controller(x_display="0.0", commit_id="dd25cb479958e915e2ed1282062345b0f81dc4e2")
to see if that also hangs? |
In addition to @Lucaweihs's instructions, if it does hang, press CTRL-C to interrupt the process. You should then get a Python stack trace showing where the process was hung. |
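If pressing CTRL-C is inconvenient (for example, when the script runs unattended on the server), Python's standard `faulthandler` module can dump the stack of every thread after a timeout. The wrapper below is a minimal sketch; `call_with_hang_dump` is a hypothetical name, and the callable passed in stands for whatever call is suspected of hanging, such as the `Controller(...)` constructor above.

```python
import faulthandler
import sys
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_hang_dump(fn: Callable[[], T], timeout_s: float, file=None) -> T:
    """Run fn(); if it is still running after timeout_s, dump all thread stacks."""
    # faulthandler writes to the file's descriptor, so `file` must be a real file.
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, file=target)
    try:
        return fn()
    finally:
        # Cancel the pending dump if fn() returned (or raised) before the timeout.
        faulthandler.cancel_dump_traceback_later()
```

The dump does not interrupt the call; it only records where each thread was at the timeout, so the process still has to be stopped by hand if it never returns.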
I'm going to close this issue. Please feel free to reopen if you're still having trouble. |
As mentioned in the title, when I tried to run the following command:
python main.py projects/objectnav_baselines/experiments/robothor/objectnav_robothor_rgb_resnetgru_ddppo.py
the training process just got stuck at the following step forever:
This never happens on my personal desktop.
I couldn't figure out what the potential problem is.
Server configuration:
Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz
48 CPU(s)
2x Tesla K80