LISA cannot run in a multi-GPU setting #1474

Open
AgentLLM opened this issue Apr 2, 2024 · 6 comments
Labels
bug (Something isn't working)

Comments

AgentLLM commented Apr 2, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

LISA should run in a multi-GPU setting.

Current behaviour

LISA only runs on a single GPU. Switching to a multi-GPU setup produces the error below.

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
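
For context, the check that fails here is DDP's unused-parameter detection: LISA freezes most layers on every step, so their parameters never receive gradients, which stock DDP treats as an error. A minimal sketch (plain PyTorch, not axolotl's internal wiring) of where the `find_unused_parameters=True` flag the message refers to would be passed:

# Minimal sketch: plain PyTorch DDP setup with unused-parameter detection enabled.
# The model here is a stand-in; the point is the find_unused_parameters flag.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")               # one process per GPU
model = torch.nn.Linear(8, 8).cuda()          # stand-in for the real model
ddp_model = DDP(
    model,
    device_ids=[torch.cuda.current_device()],
    find_unused_parameters=True,              # tolerate layers frozen by LISA
)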

Steps to reproduce

Run LISA training with the multi-GPU accelerate config shown under Config yaml below.

Config yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 
machine_rank: 0
mixed_precision: bf16
num_machines: 1
num_processes: 2

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.9

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
AgentLLM added the bug label Apr 2, 2024
winglian (Collaborator) commented Apr 4, 2024

Can you try setting ddp_find_unused_parameters: true?
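
For reference, a hypothetical sketch of what this option corresponds to when driving the transformers Trainer directly (the assumption being that axolotl forwards ddp_find_unused_parameters to the TrainingArguments field of the same name):

# Sketch only: the equivalent knob on transformers' TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",                    # hypothetical output directory
    ddp_find_unused_parameters=True,     # passed through to DDP's find_unused_parameters
)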

AgentLLM (Author) commented Apr 4, 2024

> Can you try setting ddp_find_unused_parameters: true?

I added ddp_find_unused_parameters: true to lisa.yaml and still hit the same bug.

winglian (Collaborator) commented Apr 4, 2024

Are you using FSDP or DeepSpeed?

winglian (Collaborator) commented Apr 5, 2024

It seems this might be some DDP-specific issue. I've tried a few things, like setting a deterministic seed for the random layer picker and adding ddp_find_unused_parameters: true, to no avail.
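
The "deterministic seed" idea above amounts to making every DDP rank unfreeze the same layers on the same step. A hypothetical sketch of such a layer picker (not axolotl's actual LISA implementation):

# Hypothetical helper: derive the active-layer choice from the global step so
# that every rank selects the same layers and DDP's gradient buckets stay in sync.
import random

def pick_active_layers(num_layers: int, num_active: int, global_step: int) -> list[int]:
    rng = random.Random(1234 + global_step)   # same seed on every rank
    return sorted(rng.sample(range(num_layers), num_active))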

AgentLLM (Author) commented Apr 5, 2024

> Are you using FSDP or DeepSpeed?

I'm not sure which one I am using. This is my first time using your LLM framework, and I've only added 'ddp_find_unused_parameters: true' to the 'lisa.yaml' file without making any other changes.

lhl (Contributor) commented Apr 7, 2024

By the way, here's a discussion of DeepSpeed issues with LISA: OptimalScale/LMFlow#726, and a potential workaround: OptimalScale/LMFlow#726 (comment)
