Training Process Hangs when Training with GPU #43
I also met the same problem, did you find any solution? @Waynel65
Hi @Waynel65, @sqiangcao99, sorry for the delay. This issue looks like it may have to do with a problem in starting the AI2-THOR process on the Linux machine. Can you confirm that one of the following two ways of starting an AI2-THOR controller works on your machine?

**Using an X-display (not recommended)**

To check if you have an X-display running on your GPUs, you can check that the `nvidia-smi` process list shows an Xorg process on every GPU, e.g.:

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     54324      G   /usr/lib/xorg/Xorg                  8MiB |
+-----------------------------------------------------------------------------+
```

If these are not started, you can use the `ai2thor-xorg` script to start them; it will start Xorg processes targeting `"0.{gpu_index}"` for each GPU on the machine. Once an X-display is available, confirm that the following runs:

```python
from ai2thor.controller import Controller

c = Controller(x_display="0.0")  # Assumes your x-display is targeting "0.0"
print(c.step("RotateRight").metadata["lastActionSuccess"])
```

**Using CloudRendering (recommended)**

Assuming you have Vulkan installed (installed by default on most machines), you can run:

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])
```
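A minimal sketch combining the two options above, assuming a single GPU at index 0 and an X-display following the `"0.{gpu_index}"` convention described earlier; it prefers CloudRendering and falls back to the X-display:

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering


def start_controller(gpu_index: int = 0) -> Controller:
    """Try CloudRendering first; fall back to an X-display if it fails."""
    try:
        # Preferred: Vulkan-based CloudRendering, no X server required.
        return Controller(platform=CloudRendering, gpu_device=gpu_index)
    except Exception as exc:
        print(f"CloudRendering failed ({exc!r}); falling back to x_display.")
        # Assumes ai2thor-xorg started an X server targeting "0.{gpu_index}".
        return Controller(x_display=f"0.{gpu_index}")


c = start_controller()
print(c.step("RotateRight").metadata["lastActionSuccess"])
c.stop()
```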
Hi @Lucaweihs, but the problem in the above image still exists.
@LiYaling-heb, can you give a bit more context regarding:
@sqiangcao99 I just pushed instructions for training with Docker, see here. Let me know if you have any trouble building the Docker image using the provided Dockerfile.
Hi @Lucaweihs. I also encountered the same problem as @Waynel65 and @sqiangcao99 (please let me know if you have found any solution).
Hi @7hinkDifferent, I just created a branch with one small commit (87fa224) which changes things so that we're training with just a single process and logging metrics after every rollout. Can you try seeing if this runs for you?
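A rough sketch of the kind of change described above, assuming the base experiment config in `rearrange_base.py` is named `RearrangeBaseExperimentConfig` and exposes `num_train_processes` as in the report below; the import path, class name, and details of the actual commit are assumptions:

```python
# Hypothetical override forcing a single training process for debugging; the
# import path and class name are assumptions based on rearrange_base.py.
from baseline_configs.rearrange_base import RearrangeBaseExperimentConfig


class SingleProcessDebugConfig(RearrangeBaseExperimentConfig):
    @classmethod
    def num_train_processes(cls) -> int:
        # One sampler process regardless of GPU availability.
        return 1
```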
Hi @Lucaweihs, some info about my computer:

I noticed some related issues (ObjectNav with RoboTHOR), but the solutions didn't work out (for both RoomR/iTHOR and ObjectNav/RoboTHOR). Listed below with my output.

**TimeOut error when attempting to run pre-trained RoboThor model checkpoint**

No window pops up when running for the first time, when the build (f0825767cd50d69f666c7f282e54abfe58f1e917) needs to be downloaded. When I run it again, everything goes well. It is strange if I specified the

Yes, I am using x11 instead of wayland.

**Training got stuck after creating Vector Sample Tasks**
I ran the following code successfully and, thanks to your effort, commit_id is no longer a problem.

```python
from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment

env_args = {
    'width': 400, 'height': 300,
    'commit_id': 'bad5bc2b250615cb766ffb45d455c211329af17e',
    'stochastic': True, 'continuousMode': True, 'applyActionNoise': True,
    'rotateStepDegrees': 30.0, 'visibilityDistance': 1.0, 'gridSize': 0.25,
    'snapToGrid': False, 'agentMode': 'locobot',
    'fieldOfView': 63.453048374758716, 'include_private_scenes': False,
    'renderDepthImage': False, 'x_display': '1.0',
}
env = RoboThorEnvironment(**env_args)
env.step(action="RotateRight")
print(env.last_event.metadata["lastActionSuccess"])
env.stop()
```
The `nvidia-smi` process list shows:

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1803      G   /usr/lib/xorg/Xorg                          1378MiB |
|    0   N/A  N/A      1930      G   /usr/bin/gnome-shell                         436MiB |
|    0   N/A  N/A      2596      G   ...irefox/2908/usr/lib/firefox/firefox       663MiB |
|    0   N/A  N/A      3382      G   ...sion,SpareRendererForSitePerProcess      1138MiB |
+---------------------------------------------------------------------------------------+
```

Lastly, I think maybe I can try headless mode. Is there a convenient way to start the train and eval processes in headless mode? Really appreciate your time and patience!
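One possible way to wire up headless rendering, sketched under the assumption that the `thor_controller_kwargs` entry visible in the sampler args of the report below is forwarded to the AI2-THOR `Controller`; filling it with `CloudRendering` would avoid the X-display entirely (GPU index 0 is assumed):

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

# Hypothetical kwargs mirroring the 'thor_controller_kwargs' field in the sampler
# args logged during training; CloudRendering renders via Vulkan, no X server.
thor_controller_kwargs = {"platform": CloudRendering, "gpu_device": 0}

c = Controller(**thor_controller_kwargs)
print(c.step("RotateRight").metadata["lastActionSuccess"])
c.stop()
```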
Hi, I am a CS student exploring this fascinating challenge.
I have been having this issue whenever I try to start a training process with:

```
allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py
```
Basically everything works just fine until the log prints:
```
Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7ff393316940>, 'force_cache_reset': False, 'epochs': inf, 'stage': 'train', 'allowed_scenes': ['FloorPlan315', 'FloorPlan316', 'FloorPlan317', 'FloorPlan318', 'FloorPlan319', 'FloorPlan320', 'FloorPlan401', 'FloorPlan402', 'FloorPlan403', 'FloorPlan404', 'FloorPlan405', 'FloorPlan406', 'FloorPlan407', 'FloorPlan408', 'FloorPlan409', 'FloorPlan410', 'FloorPlan411', 'FloorPlan412', 'FloorPlan413', 'FloorPlan414', 'FloorPlan415', 'FloorPlan416', 'FloorPlan417', 'FloorPlan418', 'FloorPlan419', 'FloorPlan420'], 'scene_to_allowed_rearrange_inds': None, 'seed': 49768732449262694267218046632734480669, 'x_display': '1.0', 'thor_controller_kwargs': {'gpu_device': None, 'platform': None}, 'sensors': [<rearrange.sensors.RGBRearrangeSensor object at 0x7ff43cca3d90>, <rearrange.sensors.UnshuffledRGBRearrangeSensor object at 0x7ff39cad9e20>, <allenact.base_abstractions.sensor.ExpertActionSensor object at 0x7ff3933167c0>]}. [vector_sampled_tasks.py: 1102]
```
Then the program hangs for a while and then crashes.
This issue seems to only arise when I am training with a GPU (i.e., everything works as expected if the device only has a CPU available).
I should also mention that I modified the number of processes in the `machine_params` class method in `rearrange_base.py` from this:

```python
nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1
```

to this:

```python
nprocesses = 3 if torch.cuda.is_available() else 1
```
Without making this change, it would spawn so many workers that my machine crashed (I would appreciate it if you could let me know what the cause of this might be as well).
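A minimal sketch of one alternative to hardcoding 3, assuming the goal is simply to keep the worker count within what the machine can handle; the cap value is arbitrary and not a recommendation from the maintainers:

```python
import multiprocessing as mp

import torch

# Hypothetical replacement for the hardcoded value: never spawn more workers
# than there are CPU cores, and cap the total at a small machine-dependent limit.
MAX_TRAIN_PROCESSES = 3  # assumed cap for this machine

nprocesses = (
    max(1, min(MAX_TRAIN_PROCESSES, mp.cpu_count()))
    if torch.cuda.is_available()
    else 1
)
```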
I am running CUDA 11.7 and torch 1.13.1+cu117. I have also updated to the newest release of this codebase. It would be great if I could get some pointers as to where I should look. Please let me know if more information is needed.