LISA cannot run in a multi-GPU setting #1474

Open
AgentLLM opened this issue Apr 2, 2024 · 6 comments
Labels
bug (Something isn't working)

Comments

AgentLLM commented Apr 2, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

LISA should run in a multi-GPU setting.

Current behaviour

LISA only runs on a single GPU. Switching to a multi-GPU setup produces the error below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
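
For context, the check that fails here is DDP's unused-parameter detection: LISA freezes most layers on every step, so their parameters never receive gradients, which stock DDP treats as an error. A minimal sketch (plain PyTorch, not axolotl's internal wiring) of where the `find_unused_parameters=True` flag the message refers to would be passed:

# Minimal sketch: plain PyTorch DDP setup with unused-parameter detection enabled.
# The model here is a stand-in; the point is the find_unused_parameters flag.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")               # one process per GPU
model = torch.nn.Linear(8, 8).cuda()          # stand-in for the real model
ddp_model = DDP(
    model,
    device_ids=[torch.cuda.current_device()],
    find_unused_parameters=True,              # tolerate layers frozen by LISA
)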

Steps to reproduce

Run LISA training with the multi-GPU accelerate config shown under Config yaml below.

Config yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 2

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.9

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
AgentLLM added the bug label Apr 2, 2024
winglian (Collaborator) commented Apr 4, 2024

Can you try setting ddp_find_unused_parameters: true?
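
For reference, a hypothetical sketch of what this option corresponds to when driving the transformers Trainer directly (the assumption being that axolotl forwards ddp_find_unused_parameters to the TrainingArguments field of the same name):

# Sketch only: the equivalent knob on transformers' TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                    # hypothetical output directory
    ddp_find_unused_parameters=True,     # passed through to DDP's find_unused_parameters
)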

AgentLLM (Author) commented Apr 4, 2024

> Can you try setting ddp_find_unused_parameters: true?

I added ddp_find_unused_parameters: true to lisa.yaml and still hit the same bug.

winglian (Collaborator) commented Apr 4, 2024

Are you using FSDP or DeepSpeed?

winglian (Collaborator) commented Apr 5, 2024

It seems this might be some DDP-specific issue. I've tried a few things, like setting a deterministic seed for the random layer picker and adding ddp_find_unused_parameters: true, to no avail.
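
The "deterministic seed" idea above amounts to making every DDP rank unfreeze the same layers on the same step. A hypothetical sketch of such a layer picker (not axolotl's actual LISA implementation):

# Hypothetical helper: derive the active-layer choice from the global step so
# that every rank selects the same layers and DDP's gradient buckets stay in sync.
import random

def pick_active_layers(num_layers: int, num_active: int, global_step: int) -> list[int]:
    rng = random.Random(1234 + global_step)   # same seed on every rank
    return sorted(rng.sample(range(num_layers), num_active))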

AgentLLM (Author) commented Apr 5, 2024

> Are you using FSDP or DeepSpeed?

I'm not sure which one I am using. This is my first time using your LLM framework, and I've only added 'ddp_find_unused_parameters: true' to the 'lisa.yaml' file without making any other changes.

lhl (Contributor) commented Apr 7, 2024

By the way, here's a discussion of DeepSpeed issues with LISA: OptimalScale/LMFlow#726, and a potential workaround: OptimalScale/LMFlow#726 (comment)
