
Inference using AllenAct gets stuck at some point #3

Closed
senadkurtisi opened this issue Feb 25, 2021 · 15 comments

@senadkurtisi

Executing the "Pretrained Phase-1" segment of the README gets stuck at some point.

This is the exact command:
allenact baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py -c pretrained_model_ckpts/exp_OnePhaseRGBResNetDagger_40proc__time_2021-02-07_11-25-27__stage_00__steps_000075001830.pt -t 2021-02-07_11-25-27

I have attached the screenshot of the stopping point.
capture1

I am using an Ubuntu virtual machine (VMware). It isn't connected to the GPU, so maybe that is the issue, since ai2thor needs to start a GUI to render the scenes?

Is it possible to use the environment and train the models without the GUI?
If that is the case, would it be possible to run the scripts using Windows Subsystem for Linux (WSL2)?

@Lucaweihs
Contributor

Hi @senadkurtisi,

Thanks for reaching out!

Unfortunately, for the moment, you will need a GPU to be able to launch AI2-THOR (the Unity game engine requires one to render the images given to the agent). I've had a chat with @ekolve about how we could get things working for you, and I think the main takeaway is: WSL looks promising, but we have been targeting Unix systems and so don't really have enough experience with Windows to give exact steps for getting things working (so this would likely require some tinkering on your end).

As a rough outline, we believe you would need to follow roughly the following steps:

  1. Install the WSL CUDA driver.
  2. Follow online guides for getting WSL GUIs working (e.g. see this one).
  3. Verify that the GUIs work, e.g. you can install glxgears (a small, self-contained program that just renders some gears turning) and verify that it runs.
  4. Check that AI2-THOR now starts (a minimal smoke test is sketched just after this list).
  5. Check that the example.py script runs.
  6. Finally, try running inference again.
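
For step 4, a rough sketch of such a smoke test (this assumes ai2thor is installed in the WSL Python environment and that DISPLAY points at the working X server from steps 2-3):

from ai2thor.controller import Controller

# Start AI2-THOR (downloads and launches the Unity build on first use), take one action,
# and confirm that an image actually comes back.
controller = Controller()
event = controller.step(action="RotateRight")   # any action forces a render
print("Rendered frame shape:", event.frame.shape)
controller.stop()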

If you have any questions along the way please let us know! If you manage to get things working, it would also be very helpful if you could post your steps (I imagine others would prefer not to have to dual-boot Ubuntu if possible!).

@senadkurtisi
Author

Hi @Lucaweihs

Thank you for looking into our problem!
Currently I am unable to test this out, but I or one of my teammates will get back to you as soon as possible (probably in this thread)!

@mattdeitke
Member

I think I can add a tad bit of context, since I use Windows as my main OS and have tried a bunch of ways to get it working with current AI2-THOR builds, although none are particularly satisfactory.


Around 3-4 months ago I tried the WSL CUDA driver route. Even now, it is still a beta release and requires you to install a beta version of the Windows OS (i.e., a Windows Insider Preview build). It's probably not obvious, but once you install a beta version of Windows, there is no way to go back to the stable version, before the next stable version has been released, without completely restoring the computer and reinstalling Windows.

At least back then, there were several known issues from NVIDIA with the beta WSL CUDA driver that did not allow Unity windows to open, whereas simply opening an app like glxgears -- which they promote as working -- is much easier and much less useful. I haven't really kept up with the progress since then, because I ended up having several issues with Windows' Insider build and had to completely restore and reinstall the old stable version.


There is another alternative that sorta works, but is terribly slow, which is to download an entire virtual desktop environment that runs through WSL2's GUI setup. Then from that virtual desktop, you can basically open up any app like you would on Ubuntu. However, it's substantially slower than working with WSL2 or Windows directly. (My ballpark estimate would be that it slowed AI2-THOR down by 20x+.) I do not recommend this route at all.


(This is not your use case, but if you were developing the back-end of AI2-THOR in Unity, there is a way to connect the Python controller with Unity so that it doesn't have to create and install a build locally: allenai/ai2thor#428 (comment). However, this only allows you to use 1 AI2-THOR Controller at a time, which is a serious bottleneck in terms of training anything.)


If you try the WSL2 route again now and it ends up working, that would definitely be exciting! Although, it would definitely surprise me. I personally have dual booted, SSH'd into separate computers, or used Google Colab whenever I need to run a local AI2-THOR build, which is kinda a pain :(

Also, there are some workarounds for using AI2-THOR without using a GPU. For instance, you can get AI2-THOR working on Google Colab just by using its CPU, without a dedicated GPU:
https://colab.research.google.com/drive/1VyvpUahrlakrlwebuuFZl73ioqCuVF33?usp=sharing

It uses an X virtual framebuffer (XVFB) to read off images from AI2-THOR. That said, Colab doesn't currently support using directories, and training anything rearrangement related in Colab would be extremely tough.
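
For the curious, a rough sketch of the Xvfb approach outside of Colab (this assumes a Linux machine with the xvfb package installed; the display number :99 is an arbitrary choice assumed to be free):

import os
import subprocess
import time

# Start an X virtual framebuffer and point X clients (including the Unity build) at it.
xvfb = subprocess.Popen(["Xvfb", ":99", "-screen", "0", "1024x768x24"])
time.sleep(2)                  # give the virtual framebuffer a moment to come up
os.environ["DISPLAY"] = ":99"

from ai2thor.controller import Controller

controller = Controller()      # renders via the framebuffer: GPU-free but slow
print(controller.step(action="RotateRight").frame.shape)
controller.stop()
xvfb.terminate()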

@senadkurtisi
Author

Hi,
I did not have any luck with setting things up following those steps.
We have a couple of Linux machines, so it's not that important to us. Nevertheless, thank you for your help.

Should I close the issue?

@Lucaweihs
Contributor

Sorry that the above steps didn't work; hopefully AI2-THOR will have a Windows solution soon :)! I'll close this issue for now, but let me know if you ever have any issues when using your Linux machines.

@nnsriram97

I am facing a similar issue when trying to run on Ubuntu 16.04 with 2 GPUs. I see the execution getting stuck at some point both when training from scratch and when doing inference with the phase-1 pre-trained model. I've been running the scripts for a day now and I see neither checkpoints nor tensorboard outputs during training, nor output metrics for testing...

Here's a screenshot of training:

Screen Shot 2021-04-06 at 12 54 26 PM

Here's the output from my htop. I'm running both the training and inference scripts (pre-trained model) right now, but I see only Train-0 using 100% of one CPU core, and the other processes (Train-1, Valid-0, Test-0, ...) not utilizing any memory at all.
Screen Shot 2021-04-06 at 12 59 51 PM

Any suggestions on what should be done?

@Lucaweihs Lucaweihs reopened this Apr 6, 2021
@Lucaweihs
Contributor

Hi @nnsriram97, just double checking given our discussion on the other issue: have you managed to get training running at all (e.g. when running with a smaller number of processes for debugging)? In general you should start seeing logs within 5-10 minutes, so not seeing anything after a day is definitely a sign of something being off. Generally I wouldn't encourage running training and inference at once (having so many THOR processes open at once can cause issues).

One other (possibly related) note: when cancelling distributed training with ctrl+C, depending on the OS / machine setup, there may be some straggling THOR/Training processes that don't close cleanly (e.g. they'll still be visible when running nvidia-smi or ps aux) which can interfere with future runs. When I see this happening I'll run something like the following

ps aux | grep Train- | awk '{print $2}' | xargs kill -9 # the -9 option here is pretty aggressive, might be worth removing it initially

which will grab all of the process IDs of processes having Train- in their name and then pass those IDs to kill. If you want to also kill the THOR processes, I'd run something like

ps aux | grep 43eb2c | awk '{print $2}' | xargs kill -9

which is similar to the above but instead kills processes with 43eb2c in their name (this is part of the THOR build's hash for rearrangement).
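
For anyone who prefers doing this from Python, a rough equivalent using psutil (assuming it is installed; the substrings below are just the same ones used in the shell one-liners above):

import psutil

def kill_matching(substring):
    # Find processes whose name or command line contains `substring` and force-kill them.
    for proc in psutil.process_iter(["name", "cmdline"]):
        try:
            haystack = (proc.info["name"] or "") + " " + " ".join(proc.info["cmdline"] or [])
            if substring in haystack:
                proc.kill()  # like kill -9; proc.terminate() is the gentler option
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass

kill_matching("Train-")  # straggling training workers
kill_matching("43eb2c")  # THOR build processes (part of the rearrangement build's hash)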

@nnsriram97

Method 1, with a physical display attached: I can successfully run the code with x_display = 0.0 and nprocesses=1 (up to a max of slightly >15) with a monitor attached, instead of trying to start the X server through startx.py. Attached is a gif of one of the processes running during training.

Apr-06-2021 17-30-43

Also, I can view the logs printed on the screen and on tensorboard when running in this configuration.

Method 2, with the physical display disabled: After the above experiment, I disabled the monitor's Xorg via sudo service lightdm stop and ran startx.py, which runs Xorg on both GPUs (nvidia-smi confirms this), and glxgears runs successfully on both displays (0.0 and 0.1).

I limited nprocesses=2 for debugging while keeping x_display at its default, x_display = "0.{}".format(devices[process_ind % len(devices)]). I can now see Train-0 on GPU 0 and Valid-0 on GPU 1, but Train-0 uses 100% of one CPU core, contrary to what happened when the physical display was attached (the above experiment). I waited for half an hour and don't see any logs on tensorboard or anything on the screen. The execution seems to be stuck after Found Path: ……. I see a static image when trying to view the remote display through NoMachine (screenshot attached below).

Screen Shot 2021-04-06 at 5 59 46 PM

Any idea what's happening?

Thank you for the above suggestions! I generally clear the running processes before trying to rerun the experiment.

@nnsriram97

Update:

Limiting to x_display=0.0 and nprocesses<=18 works fine on my system, even without a physical display attached. Increasing to nprocesses>18 shows simulator output but hangs after a while.

Changing x_display to x_display = "0.{}".format(devices[process_ind % len(devices)]) does not work for any number of nprocesses. Here I've limited nprocesses=2 for debugging purposes. Shown below is the output of nvidia-smi.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       507      C   Train-0                           903MiB |
|    0   N/A  N/A       695      G   ...3c3596803c491c3da8d43eb2c       54MiB |
|    0   N/A  N/A       771      G   ...3c3596803c491c3da8d43eb2c       54MiB |
|    0   N/A  N/A      4785      G   /usr/lib/xorg/Xorg                 22MiB |
|    1   N/A  N/A       508      C   Train-1                           889MiB |
|    1   N/A  N/A       509      C   Valid-0                           889MiB |
|    1   N/A  N/A      4785      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A      4818      C   /usr/NX/bin/nxnode.bin            152MiB |
+-----------------------------------------------------------------------------+

As you can see, the THOR Unity instance is only running on GPU 0 (x_display 0.0) and there's no THOR instance running on GPU 1. For every process I normally observe two Found path:... lines being printed, but in this case only 3 are printed while there should be 4 since nprocesses=2. I believe the code hangs when trying to render simulator output on 0.1. Any suggestions on how to resolve this?

Also, do you think running THOR instances on display 0.1 would increase the number of processes that I can run? I see that, even with x_display=0.0, Train-0 and Train-1 are distributed to run on multiple GPUs.

@Lucaweihs
Contributor

Lucaweihs commented Apr 7, 2021

(Pinging @ekolve in case he has seen anything like this before.)

Interesting, yes it definitely seems that the AI2-THOR processes are having trouble starting on 0.1. To verify that this is an issue with AI2-THOR (and not some multiprocessing issue) can you run the following:

from ai2thor.controller import Controller
c = Controller(commit_id="62bba7e2537fb6aaf2ed19125b9508c8b99bced3", x_display="0.1")

and then check with nvidia-smi whether or not the THOR process is taking up any memory on GPU 1?
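
If the controller does come up, a quick way to confirm that it is actually rendering (continuing the two-line snippet above; the action below is just an arbitrary one that forces a render):

event = c.step(action="RotateRight")
print(event.frame.shape)  # a (height, width, 3) numpy array if rendering succeeded
c.stop()                  # shut the Unity process down cleanly afterwards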

Do you think running thor instances on display 0.1 will increase the number of processes that I can run? I see even with x_display=0.0 train-0 and train-1 are distributed to run on multiple GPUs

Every system is a bit different so it's hard to say for certain. That said, from my experience, the main downside of having all the ai2thor processes on a single GPU is generally that the memory-use imbalance can limit the size of one's model. If you are getting reasonable FPS numbers with ~18 processes, though, then I wouldn't worry about it; on a similar 2-GPU workstation, here's what I see:

$ allenact baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py -o tmp_out
04/07 10:00:53 INFO: Running with args Namespace(approx_ckpt_steps_count=None, checkpoint=None, deterministic_agents=False, deterministic_cudnn=False, disable_config_saving=False, disable_tensorboard=False, experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', experiment_base='/home/lucaw/allenact-rearrangement/ai2thor-rearrangement', extra_tag='', gp=None, log_level='info', max_sampler_processes_per_worker=None, output_dir='tmp_out', restart_pipeline=False, seed=None, skip_checkpoints=0, test_date=None) [main.py: 266]
04/07 10:00:55 INFO: Git diff saved to tmp_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-04-07_10-00-54  [runner.py: 497]
04/07 10:00:55 INFO: Config files saved to tmp_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-04-07_10-00-54      [runner.py: 534]
04/07 10:00:55 INFO: Using 2 train workers on devices (device(type='cuda', index=0), device(type='cuda', index=1))      [runner.py: 154]
04/07 10:00:55 INFO: Started 2 train processes  [runner.py: 313]
04/07 10:00:55 INFO: Using 1 valid workers on devices (device(type='cuda', index=1),)   [runner.py: 154]
04/07 10:00:55 INFO: Started 1 valid processes  [runner.py: 339]
...
... Lots of logs describing the processes being started
...
04/07 10:08:31 INFO: train 20480 steps 0 offpolicy: total_loss 3.32 lr 0.0003 imitation_loss/expert_cross_entropy 3.32 unshuffle/change_energy 1.89 unshuffle/end_energy 0.146 unshuffle/energy_prop 0.0833 unshuffle/ep_length 28.5 unshuffle/num_broken 0 unshuffle/num_changed 2.23 unshuffle/num_fixed 2.08 unshuffle/num_initially_misplaced 2.24 unshuffle/num_misplaced 0.168 unshuffle/num_newly_misplaced 0.00308 unshuffle/prop_fixed 0.917 unshuffle/prop_fixed_strict 0.917 unshuffle/prop_misplaced 0.0842 unshuffle/reward 1.65 unshuffle/start_energy 1.94 unshuffle/success 0.851 teacher_ratio/enforced 1 teacher_ratio/sampled 1 elapsed_time 456s    [runner.py: 645]
04/07 10:14:13 INFO: train 40960 steps 0 offpolicy: total_loss 2.48 lr 0.0003 imitation_loss/expert_cross_entropy 2.48 unshuffle/change_energy 1.94 unshuffle/end_energy 0.162 unshuffle/energy_prop 0.0729 unshuffle/ep_length 34.6 unshuffle/num_broken 0.00169 unshuffle/num_changed 2.29 unshuffle/num_fixed 2.17 unshuffle/num_initially_misplaced 2.35 unshuffle/num_misplaced 0.185 unshuffle/num_newly_misplaced 0.00508 unshuffle/prop_fixed 0.927 unshuffle/prop_fixed_strict 0.923 unshuffle/prop_misplaced 0.0748 unshuffle/reward 1.73 unshuffle/start_energy 2.02 unshuffle/success 0.854 teacher_ratio/enforced 1 teacher_ratio/sampled 1 elapsed_time 343s approx_fps 59.8      [runner.py: 645]

so an FPS of about 60 at the start of training (in general this will increase during training as caches are populated and episode lengths get longer). A few subtle notes:

  • Above metrics are super high (success ~85%) because the imitation learning baselines start with the agent following the expert's actions (annealed to not taking expert actions after the first few million steps, see the paper for details).
  • The IL baselines are just generally slower than the pure RL baselines (as computing the expert action at every step is somewhat costly).

Final thought: I'd recommend grabbing the latest changes on the main branch (note that master was renamed to main) as it uses a version of THOR which is more deterministic (useful when running evaluations).

@nnsriram97

from ai2thor.controller import Controller
c = Controller(commit_id="62bba7e2537fb6aaf2ed19125b9508c8b99bced3", x_display="0.1")

Running the above, I see the following THOR instances in nvidia-smi for a few seconds and then they disappear. The code also hangs after printing Found path: ... once.

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       507      C   Train-0                           903MiB |
|    0   N/A  N/A      4785      G   /usr/lib/xorg/Xorg                 20MiB |
|    0   N/A  N/A      6128      G   ...af2ed19125b9508c8b99bced3        2MiB |
|    1   N/A  N/A      4785      G   /usr/lib/xorg/Xorg                  8MiB |
|    1   N/A  N/A      6128      G   ...af2ed19125b9508c8b99bced3        2MiB |
+-----------------------------------------------------------------------------+  

As you can see, one of the instances is being initialized on display 0.0 while the other is on display 0.1. Could that be an issue?

Every system is a bit different so it's hard to say for certain. That said, from my experience, generally the main downside of having all the ai2thor processes on a single GPU is that the memory use imbalance can limit the size of one's model. If you are getting reasonable FPS numbers with ~18 processes though then I wouldn't worry about it, on a similar 2 GPU workstation, here's what I see:

Thanks for your thoughts. I do see an initial frame rate of around 60 fps and success ~86%. I'll continue with my experiments, since being able to run on display 0.1 doesn't seem likely to provide any significant boost.

@Lucaweihs
Contributor

@nnsriram97 I'm going to close this issue for now as you seem to be getting reasonable FPS rates. Please do follow up if anything else seems awry!

@StOnEGiggity

Hi, I am facing a similar issue on a single node of a cluster. There is only one GPU. I ran DISPLAY=:0.0 glxgears and everything seems OK. However, when I try to train from scratch with the command from the README.md, the training processes seem to be stuck at the beginning.
1

I also tried the above method of changing x_display and num_process, but it doesn't work. Because I have no sudo privileges, I cannot run startx.py directly. Do you have any suggestions? Thank you very much.

@Lucaweihs
Contributor

Hi @StOnEGiggity,

Could you send a message in the #rearrangement-challenge channel of the Ask PRIOR Slack? These problems are frequently machine-specific and it helps to have some faster back-and-forth about possible causes.

@StOnEGiggity

Hi, thanks for your reply. I found that the program got stuck before the log should have printed "Found Path:". So I removed the original data from ~/.ai2thor, and the program now seems normal. Although I haven't found the actual reason, I guess it could come from my personal environment settings.
