Inference using AllenAct gets stuck at some point #3
Comments
Hi @senadkurtisi, thanks for reaching out! Unfortunately, for the moment you will need a GPU to be able to launch AI2-THOR (the Unity game engine requires one to render the images given to the agent). I've had a chat with @ekolve about how we could get things working for you, and I think the main takeaway is: WSL looks promising, but we have been targeting Unix systems and so don't really have enough experience with Windows to give exact steps for how to get things working (this would likely require some tinkering on your end). As a rough outline, we believe you would need to follow roughly the following steps:
If you have any questions along the way, please let us know! If you do manage to get things working, it would be very helpful if you could post your steps (I imagine others would prefer to avoid dual-booting Ubuntu if possible!).
Hi @Lucaweihs, thank you for looking into our problem!
I think I can add a tad bit of context, since I use Windows as my main OS and have tried a bunch of ways to get it working with current AI2-THOR builds, although none are particularly satisfactory.

Around 3-4 months ago I tried the WSL CUDA driver route. Even now it is still a beta release and requires you to install a beta version of Windows (i.e., a Windows Insider Preview build). It's probably not obvious, but once you install a beta version of Windows, there is no way to go back to the stable version, before the next stable version has been released, without completely restoring the computer and reinstalling Windows. At least back then, there were several known issues from NVIDIA with the beta WSL CUDA driver that did not allow Unity windows to open, even though simpler apps would open fine.

There is another alternative that sorta works, but is terribly slow, which is to download an entire virtual desktop environment that runs through WSL2's GUI setup. From that virtual desktop you can basically open up any app like you would on Ubuntu. However, it's substantially slower than working with WSL2 or Windows directly. (My ballpark estimate would be that it slowed AI2-THOR down by 20x+.) I do not recommend this route at all.

(This is not your use case, but if you develop on the back-end of AI2-THOR using Unity, there is a way to connect the Python controller with Unity where it doesn't have to create a build and install locally: allenai/ai2thor#428 (comment). However, this only allows you to use 1 AI2-THOR Controller at a time, which is a serious bottleneck in terms of training anything.)

If you try the WSL2 route again now and it ends up working, this would definitely be exciting! Although, it would definitely surprise me. I personally have dual-booted, ssh'd into separate computers, or used Google Colab whenever I need to run a local AI2-THOR build, which is kinda a pain :(

Also, there are some workarounds for using AI2-THOR without a GPU. For instance, you can get AI2-THOR working on Google Colab just by using its CPU, without a dedicated GPU: it uses an X virtual framebuffer (XVFB) to read off images from AI2-THOR. That said, Colab doesn't currently support using directories, and training anything rearrangement related in Colab would be extremely tough.
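A minimal sketch of that XVFB approach (assuming the third-party `pyvirtualdisplay` package and the system `Xvfb` binary, neither of which is named in this thread):

```python
from pyvirtualdisplay import Display
from ai2thor.controller import Controller

# Start a virtual framebuffer so AI2-THOR has an X display to render into,
# even on a machine (or Colab instance) with no GPU or physical screen.
display = Display(visible=False, size=(300, 300))
display.start()

c = Controller()                 # renders on the CPU into the framebuffer
event = c.step(action="MoveAhead")
print(event.frame.shape)         # the RGB image read off of XVFB

c.stop()
display.stop()
```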
Hi, should I close the issue?
Sorry that the above steps didn't work; hopefully AI2-THOR will have a Windows solution soon! :) I'll close this issue for now, but let me know if you ever have any issues when using your Linux machines.
Hi @nnsriram97, just double-checking given our discussion on the other issue: have you managed to get training running at all (e.g., when running with a smaller number of processes for debugging)? In general you should start seeing logs within 5-10 minutes, so not seeing anything after a day is definitely a sign of something being off. Generally I wouldn't encourage running both training and inference at once (having so many THOR processes open at once can cause issues). One other (possibly related) note: when cancelling distributed training, processes can be left behind; it helps to run a command that grabs the process ids of everything with the THOR process name in its command line and kills them, plus a similar command that instead kills leftover training processes.
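The exact commands did not survive in this comment; below is a sketch of the kind of cleanup being described, assuming "thor" and "allenact" as the substrings to match against process command lines:

```python
import os
import signal
import subprocess

for pattern in ["thor", "allenact"]:
    # `pgrep -f` prints the PIDs of processes whose full command line
    # contains the pattern, one PID per line.
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    for pid in result.stdout.split():
        if int(pid) != os.getpid():        # don't kill this script itself
            os.kill(int(pid), signal.SIGTERM)
```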
Method 1 (with a physical display attached): I can successfully run the code in this configuration, and I can view the logs printed on the screen and on tensorboard.

Method 2 (with the physical display disabled): after the above experiment, I disabled the monitor's Xorg and re-ran, and in this configuration things get stuck. Any idea what's happening? Thank you for the above suggestions! I generally clear the running processes before trying to rerun the experiment.
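For reference, one hedged sketch of running with no physical display at all is to wrap the training command in `xvfb-run` (the XVFB workaround mentioned earlier in the thread); note this falls back to CPU rendering, trading speed for not needing Xorg on a GPU:

```python
import subprocess

# xvfb-run starts a throwaway X virtual framebuffer and points the wrapped
# command at it; --auto-servernum picks a free display number. The allenact
# invocation mirrors the baseline command used elsewhere in this thread.
subprocess.run([
    "xvfb-run", "--auto-servernum",
    "allenact", "baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py",
    "-o", "tmp_out",
])
```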
Update: I tried limiting and changing a few settings as suggested. As you can see from the nvidia-smi output, the thor Unity instance is only running on GPU 0.
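A small sketch for reproducing this kind of check (assumes an NVIDIA driver is installed; matching on the substring "thor" is an assumption about the Unity process name):

```python
import subprocess

# Print only the nvidia-smi process-table rows that mention "thor", so you
# can read off the GPU index each Unity instance is attached to.
out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
for line in out.splitlines():
    if "thor" in line.lower():
        print(line)
```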
(Pinging @ekolve in case he has seen anything like this before.) Interesting, yes, it definitely seems that the AI2-THOR processes are having trouble starting on your second GPU. As a sanity check, can you try manually starting a controller on that display:

```python
from ai2thor.controller import Controller

c = Controller(commit_id="62bba7e2537fb6aaf2ed19125b9508c8b99bced3", x_display="0.1")
```

and then check with `nvidia-smi` whether the new Unity process lands on the second GPU?
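Expanded into a self-contained check, with a hedged step action and programmatic `nvidia-smi` dump added (these additions are illustrative, not from the original comment):

```python
import subprocess
from ai2thor.controller import Controller

# Start a controller bound to X display 0, screen 1 (i.e., the second GPU
# if your Xorg config maps one screen per GPU).
c = Controller(
    commit_id="62bba7e2537fb6aaf2ed19125b9508c8b99bced3",
    x_display="0.1",
)
event = c.step(action="RotateRight")
print("step succeeded:", event.metadata["lastActionSuccess"])

# With the controller still alive, dump nvidia-smi to see which GPU the
# new Unity process attached to.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
c.stop()
```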
Every system is a bit different, so it's hard to say for certain. That said, from my experience, generally the main downside of having all the ai2thor processes on a single GPU is that the memory-use imbalance can limit the size of one's model. If you are getting reasonable FPS numbers with ~18 processes, though, then I wouldn't worry about it; on a similar 2-GPU workstation, here's what I see:

```
$ allenact baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py -o tmp_out
04/07 10:00:53 INFO: Running with args Namespace(approx_ckpt_steps_count=None, checkpoint=None, deterministic_agents=False, deterministic_cudnn=False, disable_config_saving=False, disable_tensorboard=False, experiment='baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py', experiment_base='/home/lucaw/allenact-rearrangement/ai2thor-rearrangement', extra_tag='', gp=None, log_level='info', max_sampler_processes_per_worker=None, output_dir='tmp_out', restart_pipeline=False, seed=None, skip_checkpoints=0, test_date=None) [main.py: 266]
04/07 10:00:55 INFO: Git diff saved to tmp_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-04-07_10-00-54 [runner.py: 497]
04/07 10:00:55 INFO: Config files saved to tmp_out/used_configs/OnePhaseRGBResNetDagger_40proc/2021-04-07_10-00-54 [runner.py: 534]
04/07 10:00:55 INFO: Using 2 train workers on devices (device(type='cuda', index=0), device(type='cuda', index=1)) [runner.py: 154]
04/07 10:00:55 INFO: Started 2 train processes [runner.py: 313]
04/07 10:00:55 INFO: Using 1 valid workers on devices (device(type='cuda', index=1),) [runner.py: 154]
04/07 10:00:55 INFO: Started 1 valid processes [runner.py: 339]
...
... Lots of logs describing the processes being started
...
04/07 10:08:31 INFO: train 20480 steps 0 offpolicy: total_loss 3.32 lr 0.0003 imitation_loss/expert_cross_entropy 3.32 unshuffle/change_energy 1.89 unshuffle/end_energy 0.146 unshuffle/energy_prop 0.0833 unshuffle/ep_length 28.5 unshuffle/num_broken 0 unshuffle/num_changed 2.23 unshuffle/num_fixed 2.08 unshuffle/num_initially_misplaced 2.24 unshuffle/num_misplaced 0.168 unshuffle/num_newly_misplaced 0.00308 unshuffle/prop_fixed 0.917 unshuffle/prop_fixed_strict 0.917 unshuffle/prop_misplaced 0.0842 unshuffle/reward 1.65 unshuffle/start_energy 1.94 unshuffle/success 0.851 teacher_ratio/enforced 1 teacher_ratio/sampled 1 elapsed_time 456s [runner.py: 645]
04/07 10:14:13 INFO: train 40960 steps 0 offpolicy: total_loss 2.48 lr 0.0003 imitation_loss/expert_cross_entropy 2.48 unshuffle/change_energy 1.94 unshuffle/end_energy 0.162 unshuffle/energy_prop 0.0729 unshuffle/ep_length 34.6 unshuffle/num_broken 0.00169 unshuffle/num_changed 2.29 unshuffle/num_fixed 2.17 unshuffle/num_initially_misplaced 2.35 unshuffle/num_misplaced 0.185 unshuffle/num_newly_misplaced 0.00508 unshuffle/prop_fixed 0.927 unshuffle/prop_fixed_strict 0.923 unshuffle/prop_misplaced 0.0748 unshuffle/reward 1.73 unshuffle/start_energy 2.02 unshuffle/success 0.854 teacher_ratio/enforced 1 teacher_ratio/sampled 1 elapsed_time 343s approx_fps 59.8 [runner.py: 645]
```

So that's an FPS of about 60 at the start of training (in general this will increase during training as caches are populated and episode lengths get longer). A few subtle notes:
Final thought: I'd recommend grabbing the latest changes on the repository's default branch.
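As a quick arithmetic check of the throughput reported in the log above:

```python
# Between the two training log lines above, the step counter advances from
# 20480 to 40960, and the second line reports elapsed_time 343s.
steps = 40960 - 20480
elapsed_seconds = 343
print(steps / elapsed_seconds)  # ~59.7 FPS, matching the logged approx_fps 59.8
```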
Running the above, I see the following thor instances in nvidia-smi:

As seen, one of the instances is being initialized on the second GPU.
Thanks for your thoughts. I do see an initial frame rate of around 60 FPS and success ~86%. I'll continue with my experiments if being able to run on display 0.1 does not provide any significant boost.
@nnsriram97 I'm going to close this issue for now as you seem to be getting reasonable FPS rates. Please do follow up if anything else seems awry!
Hi @StOnEGiggity, can you send a message in the #rearrangement-challenge channel of the Ask PRIOR Slack? These problems are frequently machine-specific, and it helps to have some faster back-and-forth about possible problems.
Hi, thanks for your reply. I found that the program was stuck before the point where the log should print "Find Path:". So I removed the original data from ~/.ai2thor, and the program now seems normal. Although I haven't found the actual cause, I guess it could come from my personal environment settings.
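A small sketch of the fix described above (assumes the default cache location; removed builds are simply re-downloaded the next time a Controller starts):

```python
import shutil
from pathlib import Path

# AI2-THOR caches downloaded builds and other data under ~/.ai2thor;
# removing the directory forces a clean re-download on the next launch.
cache_dir = Path.home() / ".ai2thor"
if cache_dir.exists():
    shutil.rmtree(cache_dir)
```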
Executing the "Pretrained Phase-1" segment of the README gets stuck at some point.
This is the exact command:

```
allenact baseline_configs/one_phase/one_phase_rgb_resnet_dagger.py -c pretrained_model_ckpts/exp_OnePhaseRGBResNetDagger_40proc__time_2021-02-07_11-25-27__stage_00__steps_000075001830.pt -t 2021-02-07_11-25-27
```
I have attached the screenshot of the stopping point.
I am using an Ubuntu virtual machine (VMware). It isn't connected to a GPU, so maybe that is the issue, since ai2thor needs to start a GUI to visualize the scenes?

Is it possible to use the environment and train the models without the GUI? If that is the case, then maybe it would also be possible to run the scripts using the Windows Subsystem for Linux (WSL2)?