torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #237

Open
ghost opened this issue May 19, 2023 · 25 comments

ghost commented May 19, 2023

When I run this command:
torchrun --nproc-per-node 1 --master_port 25641 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml

this error occurs. How can I fix it?

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 81571) of binary: /home/tiger/miniconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
  File "/home/tiger/miniconda3/envs/minigpt4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/tiger/miniconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-19_16:43:27
  host      : n136-117-136.byted.org
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 81571)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
========================================================================================================================
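The last line of the report points to the PyTorch elastic error docs. A minimal sketch of what that page describes, assuming train.py exposes a main()-style entry point (the actual function name in this repo may differ): wrapping the entry point with the record decorator makes the child's real exception and traceback show up in the torchrun error summary instead of <N/A>.

from torch.distributed.elastic.multiprocessing.errors import record

@record  # write the child's exception and traceback to an error file that torchrun can report
def main():
    ...  # existing training logic of train.py goes here

if __name__ == "__main__":
    main()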
@yuanlisky

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 198787) of binary: /home/ocr/anaconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
  File "/home/ocr/anaconda3/envs/minigpt4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-19_17:21:14
  host      : ai2
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 198787)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 198787

same error

@abhijeetGithu

The error you mentioned earlier, torch.distributed.elastic.multiprocessing.errors.ChildFailedError, typically occurs when one of the child processes launched by torchrun encounters an error and fails to execute properly.
It is difficult to pinpoint the exact cause of the error. However, here are a few possible reasons and solutions you can consider:
Resource allocation: Ensure that your system has enough resources (e.g., CPU, GPU, memory) to accommodate the requested number of child processes.

Data or code issues: Check if there are any data-related issues, such as corrupted or incompatible data. Also, review your code for any potential issues that could cause errors during training. Make sure your code is compatible with the version of PyTorch and other dependencies you are using.

Debugging the child process: Try to gather more information about the error in the child process. You can modify your code to catch and print out the specific error message or traceback for the failed child process (a minimal sketch follows after this list). This will help you narrow down the issue and provide more context for troubleshooting.

Updating PyTorch and dependencies: Make sure you are using the latest version of PyTorch and related dependencies. Check for any updates or bug fixes that may address the issue you're facing. It's also a good practice to ensure that all the dependencies in your environment are compatible with each other.

Check for known issues or bugs: Search online forums, issue trackers, or the official PyTorch documentation for any known issues related to the torch.distributed.elastic.multiprocessing module. It's possible that the error you're encountering is a known issue with an existing solution or workaround.
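For the "debugging the child process" point, a minimal sketch of catching and printing the per-rank error before torchrun collapses it into ChildFailedError; run_training here is a hypothetical stand-in for the script's real training entry point.

import os
import traceback

def run_training():
    # hypothetical stand-in for the real training loop
    raise RuntimeError("simulated failure")

def main():
    try:
        run_training()
    except Exception:
        rank = os.environ.get("RANK", "?")  # set by torchrun for each child process
        print(f"[rank {rank}] training process crashed:")
        traceback.print_exc()
        raise  # re-raise so torchrun still records the failure

if __name__ == "__main__":
    main()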

@wujiahongPKU

I have the same problem. I am using a V100 to finetune the second stage with the 7B model.

@ishitaverma

Is there any solution for this? I am facing the same issue.

@ghost
Author

ghost commented Jun 1, 2023 via email

@ishitaverma

ishitaverma commented Jun 1, 2023 via email

@ghost
Author

ghost commented Jun 5, 2023 via email

@BruceZhou95

Is there any solution for this? I am facing the same issue.

Maybe you can downgrade to a lower version of torch. It worked for me.

@IronSpiderMan

I found what seems to be the same issue on the Hugging Face forum; it happens because RAM is not sufficient.
https://discuss.huggingface.co/t/torch-distributed-elastic-multiprocessing-errors-childfailederror/28242

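If the exit code is -9 (SIGKILL), as in the exitcode: -9 reports in this thread, one way to check the insufficient-RAM explanation is to look at available memory and at recent OOM-killer entries in the kernel log. A rough sketch, assuming a Linux host where dmesg is readable:

import subprocess

def print_memory_status():
    # Total and currently available RAM from /proc/meminfo (values are in kB).
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(("MemTotal", "MemAvailable")):
                print(line.strip())
    # Recent OOM-killer messages, if any; reading the kernel log may need privileges.
    try:
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        oom = [l for l in log.splitlines() if "Out of memory" in l or "oom-kill" in l]
        print("\n".join(oom[-5:]) if oom else "no OOM-killer entries found")
    except OSError:
        print("could not read the kernel log")

if __name__ == "__main__":
    print_memory_status()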

@AnustupOCR

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 0 (pid: 198787) of binary: /home/ocr/anaconda3/envs/minigpt4/bin/python
Traceback (most recent call last):
  File "/home/ocr/anaconda3/envs/minigpt4/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ocr/anaconda3/envs/minigpt4/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
train.py FAILED
-------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-05-19_17:21:14
  host      : ai2
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 198787)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 198787

same error

Hi, I am having the same error while trying to train TrOCR on a multi-GPU, single-node setup.
My problem is not the RAM, as I have 1.8 TB of available memory, but I still face this error.
I would also like to point out that this particular error in the quoted reply is not the same as the original one: the exit code here is -9, as opposed to 1 in the original.
I am also getting -9 in my case, and I am not able to find any reason behind it.
The error is thrown randomly at the start of some epoch.
Please help me with any possible solutions if you can.

@HWH-2000

HWH-2000 commented Aug 6, 2023

@DengNingyuan

I solved this problem by changing the version of torch. I hit this issue when using torch 2.0; after I changed the torch version to align with environment.yml, the problem was solved.

@THUVAARAGAN

I am using torch version 2.0.1, but I got the same torch.distributed.elastic.multiprocessing.errors.ChildFailedError.
Any suggestions for this error?

@djaym7

djaym7 commented Sep 22, 2023

raise ChildFailedError(

torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

./run.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2023-09-22_18:11:04
host : xxx
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 1775061)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 1775061

same error

@nibrasrakib

Does anyone have a clue about this error? I am facing the same issue.

@joslefaure

Decreasing the batch size worked for me

@jaouiwassim

I had exit code 1.
I found out that I was running my code in an incorrect environment; I had defined everything in Anaconda before.
conda activate nameEnvironment
Then, to install PyTorch with GPU support:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
instead of pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
That resolved the problem for me.
Get the suitable installation command here:
https://pytorch.org/get-started/locally/
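A quick way to catch this kind of wrong-environment mistake before launching torchrun is to check which interpreter is actually running and whether its torch build can see CUDA. A small, generic sketch (nothing project-specific assumed):

import sys
import torch

print("python interpreter:", sys.executable)          # should point into the intended conda env
print("torch version     :", torch.__version__)
print("built with CUDA   :", torch.version.cuda)      # None means a CPU-only build
print("CUDA available    :", torch.cuda.is_available())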

@Edenzzzz

The error you mentioned earlier, torch.distributed.elastic.multiprocessing.errors.ChildFailedError, typically occurs when one of the child processes launched by torchrun encounters an error and fails to execute properly. [...]

This looks like a ChatGPT-generated answer.

@wangsang123

Cause

I also encountered this problem. Is there any solution to this problem?

@ammaryasirnaich

Cause

I also encountered this problem. Is there any solution to this problem?

Can you share the error that you are getting on the console?
Usually this error is caused by running out of resources (GPU or memory) during training or inference. It can also happen if the GPUs are not accessible in the cluster.
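To rule out the "GPUs not accessible" case, a short sketch that prints what the training process can actually see; run it inside the same allocation or container as the job:

import os
import torch

print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("visible GPU count   :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))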

@adipill04

adipill04 commented Apr 22, 2024

It can also happen if the GPUs are not accessible in the cluster.

I know I'm not the original poster of the comment, but this is what I am getting. Any idea what exit code -6 indicates in this case? I am using 250 GB of memory and 500 GB of disk to run the training job, so I wouldn't think it has to do with resource allocation.

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 16) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.0', 'console_scripts', 'torchrun')())
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
===================================================
./dinov2/train/train.py FAILED
---------------------------------------------------
Failures:
[1]:
  time      : 2024-04-22_19:55:40
  host      : host.edu
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 17)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 17
---------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-22_19:55:40
  host      : host.edu
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 16)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 16
===================================================

@PDD0911-HCMUS

PDD0911-HCMUS commented Apr 25, 2024

In my case I just modified the command line that runs the process, like this:
python -m torch.distributed.launch --nproc_per_node=2 --use_env main.py
Please try adding "--use_env" before the Python file of your process.
Hope this helps everybody.
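For context on --use_env: with that flag, torch.distributed.launch exports LOCAL_RANK (along with RANK and WORLD_SIZE) as environment variables instead of passing a --local_rank argument, so the script is expected to read it roughly like the sketch below (the real main.py may handle this differently):

import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))  # set by the launcher when --use_env is passed
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)
print(f"process bound to local rank {local_rank}")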

@tami64

tami64 commented May 7, 2024

I had the same problem with the following sample.

To train a Swin Transformer on ImageNet from scratch, run:

python -m torch.distributed.launch --nproc_per_node <num-gpus> --master_port 12345 main.py \
--cfg <config-file> --data-path <imagenet-path> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <tag>]

I solved it by removing the "torch.distributed.launch --nproc_per_node <num-gpus> --master_port 12345" part.

So:
To train a Swin Transformer on ImageNet from scratch, run:
python main.py \
--cfg <config-file> --data-path <imagenet-path> [--batch-size <batch-size-per-gpu> --output <output-directory> --tag <tag>]

@sennnnn

sennnnn commented May 13, 2024

Putting import torch before import transformers helped me solve this problem.
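Spelled out, that workaround is just an import-order change at the top of the training script; it is a reported workaround here, not a documented fix, so whether it helps will depend on the environment:

import torch          # import torch first ...
import transformers   # ... then transformers and the rest of the stack

print("torch", torch.__version__, "| transformers", transformers.__version__)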
