torch.distributed.elastic.multiprocessing.errors.ChildFailedError: #237
Comments
same error |
The error you mentioned, torch.distributed.elastic.multiprocessing.errors.ChildFailedError, typically occurs when one of the child processes launched by torchrun encounters an error and fails to execute properly. Some things to check:
- Data or code issues: check for data-related problems, such as corrupted or incompatible data, and review your code for anything that could raise an error during training. Make sure your code is compatible with the version of PyTorch and other dependencies you are using.
- Debugging the child process: try to gather more information about the error in the child process. You can modify your code to catch and print the specific error message or traceback from the failed worker (see the sketch below this comment); this narrows down the issue and provides more context for troubleshooting.
- Updating PyTorch and dependencies: make sure you are using a recent version of PyTorch and related dependencies, check for updates or bug fixes that may address the issue, and ensure all the packages in your environment are compatible with each other.
- Check for known issues or bugs: search online forums, issue trackers, or the official PyTorch documentation for known issues related to the torch.distributed.elastic.multiprocessing module. The error you're encountering may already have a documented solution or workaround.
|
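As a concrete illustration of the "debugging the child process" advice above: PyTorch's elastic launcher can surface the failing worker's own traceback if the script's entry point is wrapped with the `record` decorator. A minimal sketch, assuming the entry point is a plain `main()` function (the body below is a placeholder, not this project's actual training code):

```python
# Minimal sketch: wrapping the entry point with @record makes torchrun report
# the worker's Python traceback in the failure summary instead of only the
# opaque ChildFailedError. The main() body is a placeholder for illustration.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    # ... parse the config, build the model, and run the training loop here ...
    raise RuntimeError("example failure inside a worker process")

if __name__ == "__main__":
    main()
```

Launch the script through torchrun as usual; the root-cause traceback then shows up alongside the exit code in the error report.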
I have the same problem. I am using a V100 to finetune the second stage with the 7B model. |
Is there any solution for this? I am facing the same issue. |
There are no errors from DDP itself. No matter what error actually occurs, this error report always comes from DDP, so you should check for the real error above these reports.
From: "Ishita ***@***.***>
Date: Thu, Jun 1, 2023, 05:36
Subject: [External] Re: [Vision-CAIR/MiniGPT-4]
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: (Issue
#237)
To: ***@***.***>
Cc: ***@***.***>, "Author"<
***@***.***>
Is there any solution for this? I am facing the same issue.
—
Reply to this email directly, view it on GitHub
<#237 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A7ZMO7AXBZZPNLT6AEO2BSLXI62VXANCNFSM6AAAAAAYHPSNQM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Thanks!! So how can we find the exact cause of the error? There's no traceback.
|
I have a traceback, just scroll up.
From: "Ishita ***@***.***>
Date: Fri, Jun 2, 2023, 00:50
Subject: [External] Re: [Vision-CAIR/MiniGPT-4]
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: (Issue
#237)
To: ***@***.***>
Cc: ***@***.***>, "Author"<
***@***.***>
Thanks!! So how can we find the exact cause of the error? There's no
traceback.
On Wed, May 31, 2023 at 9:00 PM chengjiaxiang ***@***.***> wrote:
There are no errors of ddp. No matter what errors occur, this error
repostr
is always in the ddp.
So you should check the real error above these error reports.
From: "Ishita ***@***.***>
Date: Thu, Jun 1, 2023, 05:36
Subject: [External] Re: [Vision-CAIR/MiniGPT-4]
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: (Issue
#237)
To: ***@***.***>
Cc: ***@***.***>, "Author"<
***@***.***>
Is there any solution for this? I am facing the same issue.
—
Reply to this email directly, view it on GitHub
<
>,
or unsubscribe
<
https://github.com/notifications/unsubscribe-auth/A7ZMO7AXBZZPNLT6AEO2BSLXI62VXANCNFSM6AAAAAAYHPSNQM
>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
—
Reply to this email directly, view it on GitHub
<
,
or unsubscribe
<
https://github.com/notifications/unsubscribe-auth/AC3TOFCOMWACR4QI5VLSSVDXJAHWHANCNFSM6AAAAAAYHPSNQM
.
You are receiving this because you commented.Message ID:
***@***.***>
—
Reply to this email directly, view it on GitHub
<#237 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/A7ZMO7GY37WZFMXHRGUCFMLXJDB6NANCNFSM6AAAAAAYHPSNQM>
.
You are receiving this because you authored the thread.Message ID:
***@***.***>
|
Maybe you can downgrade to a lower version of torch. That worked for me. |
I found the same issue reported on Hugging Face; it happens because RAM is not sufficient. |
Hi, I am having the same error while trying to train TrOCR on a multi-GPU, single-node setup. |
Probably increasing shared memory (/dev/shm) will solve this problem. |
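If you suspect the RAM or shared-memory theories above, a quick Linux-only check of what is actually available before training might look like this (a sketch, not part of the project's code):

```python
# Quick Linux-only check of the resources the "not enough RAM / shm" comments
# point at: free space in /dev/shm (used by DataLoader workers for shared
# memory) and the system's currently available RAM.
import shutil

shm = shutil.disk_usage("/dev/shm")
print(f"/dev/shm free: {shm.free / 2**30:.1f} GiB of {shm.total / 2**30:.1f} GiB")

with open("/proc/meminfo") as f:
    meminfo = dict(line.split(":", 1) for line in f)
print("MemAvailable:", meminfo["MemAvailable"].strip())
```

Note that inside Docker, /dev/shm defaults to 64 MB unless the container is started with a larger --shm-size, which is a common reason for DataLoader workers dying.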
I solved this problem by changing the version of torch. When I used torch 2.0 I hit this issue; after I changed the torch version to align with environment.yml, the problem went away. |
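If you go the version-alignment route, a quick sanity check that the environment you launch from actually resolves to the torch build you expect (and that it can see the GPUs) is something like:

```python
# Print the torch version, the CUDA version it was built against, and whether
# GPUs are visible, to compare against what environment.yml pins.
import torch

print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(), "device count:", torch.cuda.device_count())
```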
I am using torch version 2.0.1 but I got the same error:
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
./run.py FAILED
Failures:
|
Anyone have any clue about this error? I am facing the same. |
Decreasing the batch size worked for me |
I had |
chatgpt |
I also encountered this problem. Is there any solution to this problem? |
Can you share the error that you are getting on the console? |
I know I'm not the original poster of the comment, but this is what I am getting. Any idea what exit code -6 indicates in this case? I am using 250 GB of memory and 500 GB of disk to run the training job, so I wouldn't think it has to do with resource allocation.
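For what it's worth, a negative exit code from torchrun means the worker was killed by a signal rather than exiting on its own; the signal number is the absolute value, which can be decoded like this:

```python
# Decode a negative worker exit code into the signal that terminated it.
# -6 maps to SIGABRT (typically an abort() from native code, e.g. a CUDA or
# NCCL assertion); -9 would be SIGKILL, the signal the OOM killer sends.
import signal

exitcode = -6  # the value reported in the torchrun failure summary
print(signal.Signals(-exitcode).name)  # -> SIGABRT
```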
|
In my case I just modified the command line to run the process like this: |
I had the same problem with the following sample. To train a Swin Transformer on ImageNet from scratch, the instructions say to run:
python -m torch.distributed.launch --nproc_per_node --master_port 12345 main.py \
I solved it by removing "torch.distributed.launch --nproc_per_node --master_port 12345", so the command becomes:
python main.py \
|
When I run this command:
torchrun --nproc-per-node 1 --master_port 25641 train.py --cfg-path train_configs/minigpt4_stage2_finetune.yaml
this error occurs. How can I fix it?