Training Process Hangs when Training with GPU #43
I also met the same problem, did you find any solution? @Waynel65
Hi @Waynel65, @sqiangcao99, sorry for the delay. This issue looks like it may have to do with a problem in starting the AI2-THOR process on the Linux machine. Can you confirm that one of the following two ways of starting an AI2-THOR controller works on your machine?

**Using an X-display (not recommended)**

To check if you have an X-display running on your GPUs, you can check that the `nvidia-smi` process list shows an Xorg process on every GPU, e.g.:

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     54324      G   /usr/lib/xorg/Xorg                  8MiB |
+-----------------------------------------------------------------------------+
```

If these are not started, you can use the `ai2thor-xorg` script to start them; it will start Xorg processes targeting `"0.{gpu_index}"` for each GPU on the machine. Once an X-display is available, confirm that the following runs:

```python
from ai2thor.controller import Controller

c = Controller(x_display="0.0")  # Assumes your x-display is targeting "0.0"
print(c.step("RotateRight").metadata["lastActionSuccess"])
```

**Using CloudRendering (recommended)**

Assuming you have Vulkan installed (installed by default on most machines), you can run:

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

c = Controller(platform=CloudRendering, gpu_device=0)
print(c.step("RotateRight").metadata["lastActionSuccess"])
```
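A minimal sketch combining the two options above, assuming a single GPU at index 0 and an X-display following the `"0.{gpu_index}"` convention described earlier; it prefers CloudRendering and falls back to the X-display:

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering


def start_controller(gpu_index: int = 0) -> Controller:
    """Try CloudRendering first; fall back to an X-display if it fails."""
    try:
        # Preferred: Vulkan-based CloudRendering, no X server required.
        return Controller(platform=CloudRendering, gpu_device=gpu_index)
    except Exception as exc:
        print(f"CloudRendering failed ({exc!r}); falling back to x_display.")
        # Assumes ai2thor-xorg started an X server targeting "0.{gpu_index}".
        return Controller(x_display=f"0.{gpu_index}")


c = start_controller()
print(c.step("RotateRight").metadata["lastActionSuccess"])
c.stop()
```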
Hi @Lucaweihs, but the problem in the above image still exists.
@LiYaling-heb, can you give a bit more context regarding:
@sqiangcao99 I just pushed instructions for training with Docker, see here. Let me know if you have any trouble building the Docker image using the provided Dockerfile.
Hi @Lucaweihs. I also encountered the same problem as @Waynel65 and @sqiangcao99 (please let me know if you have found any solution).
Hi @7hinkDifferent, I just created a branch with one small commit (87fa224) which changes things so that we're training with just a single process and logging metrics after every rollout. Can you try seeing if this runs for you?
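A rough sketch of the kind of change described above, assuming the base experiment config in `rearrange_base.py` is named `RearrangeBaseExperimentConfig` and exposes `num_train_processes` as in the report below; the import path, class name, and details of the actual commit are assumptions:

```python
# Hypothetical override forcing a single training process for debugging; the
# import path and class name are assumptions based on rearrange_base.py.
from baseline_configs.rearrange_base import RearrangeBaseExperimentConfig


class SingleProcessDebugConfig(RearrangeBaseExperimentConfig):
    @classmethod
    def num_train_processes(cls) -> int:
        # One sampler process regardless of GPU availability.
        return 1
```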
Hi @Lucaweihs, some info about my computer:

I noticed some related issues (ObjectNav with RoboTHOR), but the solutions didn't work out (for both RoomR/iTHOR and ObjectNav/RoboTHOR). Listed below with my output.

**TimeOut error when attempting to run pre-trained RoboThor model checkpoint**

No window pops up when running for the first time, when the build (f0825767cd50d69f666c7f282e54abfe58f1e917) needs to be downloaded. When I run it again, everything goes well. It is strange if I specified the

Yes, I am using x11 instead of wayland.

**Training got stuck after creating Vector Sample Tasks**
I ran the following code successfully and, thanks to your effort, commit_id is no longer a problem.

```python
from allenact_plugins.robothor_plugin.robothor_environment import RoboThorEnvironment

env_args = {
    'width': 400, 'height': 300,
    'commit_id': 'bad5bc2b250615cb766ffb45d455c211329af17e',
    'stochastic': True, 'continuousMode': True, 'applyActionNoise': True,
    'rotateStepDegrees': 30.0, 'visibilityDistance': 1.0, 'gridSize': 0.25,
    'snapToGrid': False, 'agentMode': 'locobot',
    'fieldOfView': 63.453048374758716, 'include_private_scenes': False,
    'renderDepthImage': False, 'x_display': '1.0',
}
env = RoboThorEnvironment(**env_args)
env.step(action="RotateRight")
print(env.last_event.metadata["lastActionSuccess"])
env.stop()
```
The `nvidia-smi` process list shows:

```
+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1803      G   /usr/lib/xorg/Xorg                          1378MiB |
|    0   N/A  N/A      1930      G   /usr/bin/gnome-shell                         436MiB |
|    0   N/A  N/A      2596      G   ...irefox/2908/usr/lib/firefox/firefox       663MiB |
|    0   N/A  N/A      3382      G   ...sion,SpareRendererForSitePerProcess      1138MiB |
+---------------------------------------------------------------------------------------+
```

Lastly, I think maybe I can try headless mode. Is there a convenient way to start the train and eval processes in headless mode? Really appreciate your time and patience!
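One possible way to wire up headless rendering, sketched under the assumption that the `thor_controller_kwargs` entry visible in the sampler args of the report below is forwarded to the AI2-THOR `Controller`; filling it with `CloudRendering` would avoid the X-display entirely (GPU index 0 is assumed):

```python
from ai2thor.controller import Controller
from ai2thor.platform import CloudRendering

# Hypothetical kwargs mirroring the 'thor_controller_kwargs' field in the sampler
# args logged during training; CloudRendering renders via Vulkan, no X server.
thor_controller_kwargs = {"platform": CloudRendering, "gpu_device": 0}

c = Controller(**thor_controller_kwargs)
print(c.step("RotateRight").metadata["lastActionSuccess"])
c.stop()
```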
Hi, I am a CS student exploring this fascinating challenge.
I have been having this issue whenever I try to start a training process with:

```
allenact -o rearrange_out -b . baseline_configs/one_phase/one_phase_rgb_clipresnet50_dagger.py
```
Basically everything works just fine until the log prints:
```
Starting 0-th SingleProcessVectorSampledTasks generator with args {'mp_ctx': <multiprocessing.context.ForkServerContext object at 0x7ff393316940>, 'force_cache_reset': False, 'epochs': inf, 'stage': 'train', 'allowed_scenes': ['FloorPlan315', 'FloorPlan316', 'FloorPlan317', 'FloorPlan318', 'FloorPlan319', 'FloorPlan320', 'FloorPlan401', 'FloorPlan402', 'FloorPlan403', 'FloorPlan404', 'FloorPlan405', 'FloorPlan406', 'FloorPlan407', 'FloorPlan408', 'FloorPlan409', 'FloorPlan410', 'FloorPlan411', 'FloorPlan412', 'FloorPlan413', 'FloorPlan414', 'FloorPlan415', 'FloorPlan416', 'FloorPlan417', 'FloorPlan418', 'FloorPlan419', 'FloorPlan420'], 'scene_to_allowed_rearrange_inds': None, 'seed': 49768732449262694267218046632734480669, 'x_display': '1.0', 'thor_controller_kwargs': {'gpu_device': None, 'platform': None}, 'sensors': [<rearrange.sensors.RGBRearrangeSensor object at 0x7ff43cca3d90>, <rearrange.sensors.UnshuffledRGBRearrangeSensor object at 0x7ff39cad9e20>, <allenact.base_abstractions.sensor.ExpertActionSensor object at 0x7ff3933167c0>]}. [vector_sampled_tasks.py: 1102]
```
Then the program hangs for a while and then crashes.
This issue seems to only arise when I am training with a GPU (i.e., everything works as expected if the device only has a CPU available).
I should also mention that I modified the number of processes in the `machine_params` class method in `rearrange_base.py` from this:

```python
nprocesses = cls.num_train_processes() if torch.cuda.is_available() else 1
```

to this:

```python
nprocesses = 3 if torch.cuda.is_available() else 1
```
Without making this change, it would spawn so many workers that my machine crashed (I would appreciate it if you could let me know what the cause of this might be as well).
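A minimal sketch of one alternative to hardcoding 3, assuming the goal is simply to keep the worker count within what the machine can handle; the cap value is arbitrary and not a recommendation from the maintainers:

```python
import multiprocessing as mp

import torch

# Hypothetical replacement for the hardcoded value: never spawn more workers
# than there are CPU cores, and cap the total at a small machine-dependent limit.
MAX_TRAIN_PROCESSES = 3  # assumed cap for this machine

nprocesses = (
    max(1, min(MAX_TRAIN_PROCESSES, mp.cpu_count()))
    if torch.cuda.is_available()
    else 1
)
```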
I am running CUDA 11.7 and torch 1.13.1+cu117. I have also updated to the newest release of this codebase. It would be great if I could get some pointers as to where I should look. Please let me know if more information is needed.