How to run it in multi-GPU setting without slurm #14

Open
ShawnKing98 opened this issue Dec 3, 2023 · 2 comments
ShawnKing98 commented Dec 3, 2023

Hi,

I tried to run the fine_tune.py script on my lab's server, which is a normal 4-GPU Ubuntu workstation without Slurm support. When I ran it without a distributed training setup, everything was fine, but when I switched to a multi-GPU setting I couldn't get it to work. I have tried the following approaches, and none of them worked:

  1. accelerate config followed by accelerate launch fine_tune.py --py_args, which gave me the following error while initializing the accelerator object:

    ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable RANK expected, but not set

  2. torchrun fine_tune.py --py_args, which gave me the same error as method 1.

  3. Writing another shell script that calls the fine_tune_pascal.sh script four times, passing in a different $SLURM_ARRAY_TASK_ID each time. This does not seem to be the correct approach: every process claimed to be the main process, so I suspect they were all producing duplicate outputs. (See the sketch after this list for the environment variables that the env:// rendezvous expects.)
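
For reference on methods 1 and 2: my understanding is that the env:// rendezvous in torch.distributed reads its configuration from environment variables that a launcher such as torchrun or accelerate launch normally exports for every worker process. Here is a minimal, self-contained sketch of that handshake, using single-process values and the gloo backend so it runs anywhere; the values are illustrative, not taken from this repo:

    import os
    import torch.distributed as dist

    # A launcher normally sets these per worker; single-process values are
    # used here so the sketch runs standalone.
    os.environ.setdefault("RANK", "0")        # global rank of this process
    os.environ.setdefault("WORLD_SIZE", "1")  # total number of processes
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # env:// reads the four variables above from the environment; a missing
    # RANK raises exactly the ValueError quoted in method 1.
    dist.init_process_group(backend="gloo", init_method="env://")
    print(f"initialized rank {dist.get_rank()} of {dist.get_world_size()}")
    dist.destroy_process_group()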

Could you help me out with this? I'm pretty sure my accelerate library setup is okay, since I'm able to run the official toy example. Is it because the code inside the if __name__ == "__main__": block is not wrapped in a main() function, as instructed by the Hugging Face Accelerate documentation? Should I wrap it myself?
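
For what it's worth, the layout the Accelerate documentation recommends looks roughly like the sketch below; the training body is a placeholder, and fine_tune.py's real arguments and loop are assumed from this repo:

    from accelerate import Accelerator

    def main():
        # One Accelerator per process; the launcher decides how many
        # processes exist and which GPU each one sees.
        accelerator = Accelerator()
        # ... build model / optimizer / dataloaders here, then:
        # model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
        if accelerator.is_main_process:
            print(f"running with {accelerator.num_processes} process(es)")

    if __name__ == "__main__":
        main()

With that structure, a launch such as accelerate launch --multi_gpu --num_processes 4 fine_tune.py --py_args (or torchrun --nproc_per_node=4 fine_tune.py --py_args) should start one process per GPU with RANK and the related variables already set.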

ShawnKing98 (Author) commented:

I'd appreciate it if someone could look at this and give me some suggestions.

ShawnKing98 (Author) commented:

> Have you solved that??

Unfortunately, no.
