[BUG] Backend 'gpu' failed to initialize: FAILED_PRECONDITION: No visible GPU devices. #452
Comments
@zyc-bit Are you using Slurm? @TarzanZhao, did you see a similar issue when you tried to install Alpa on the Slurm cluster?
@zhisbug Thanks for the reply.
And I set
It seems this is an XLA + Slurm issue. XLA has trouble loading CUDA dynamic libraries on Slurm. To confirm that, could you try running some JAX/XLA program without Alpa and see if it works in your environment?
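Independent of JAX or Alpa, one quick way to check whether the CUDA shared libraries are even loadable inside the job's environment is to try dlopen-ing them from the Python standard library. This is a minimal diagnostic sketch (the library names are common defaults, not something Alpa prescribes):

```python
import ctypes

def check_cuda_libs(libs=("libcudart.so", "libcuda.so.1", "libcublas.so")):
    """Try to dlopen common CUDA shared libraries and report which load.

    A library that fails to load here will also be invisible to XLA,
    which is one way 'No visible GPU devices' can arise under Slurm.
    """
    results = {}
    for name in libs:
        try:
            ctypes.CDLL(name)
            results[name] = True
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for lib, ok in check_cuda_libs().items():
        print(f"{lib}: {'loaded' if ok else 'NOT FOUND'}")
```

Running this inside the Slurm job and comparing against a login node can show whether `LD_LIBRARY_PATH` is being set differently in the two environments.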
I have not seen this error before, after searching my personal notes.
@zhisbug I ran a simple JAX program, and it reported:
So the problem really has nothing to do with Alpa. It looks like XLA has trouble loading CUDA dynamic libraries on Slurm. But from the
I think your JAX/jaxlib versions are correct (as long as you followed our installation guide). When you request a job, Slurm might have trouble finding the right CUDA path. Do you know the administrator who manages that Slurm cluster? Each Slurm cluster has different ways of installing CUDA. A second way to debug: instead of asking Slurm to run the job, could you ask for an interactive bash session, try to launch manually in that bash, and see if that helps you locate the correct CUDA paths?
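To compare what the batch job and an interactive session each see, a small stdlib-only script can dump the environment's CUDA-related paths. This is an illustrative sketch (the `/usr/local/cuda*` glob is a common default install location, not guaranteed on every cluster):

```python
import glob
import os
import shutil

def cuda_path_report():
    """Collect where this process's environment points for CUDA.

    Run this once under `srun` and once in an interactive shell;
    any difference (especially in LD_LIBRARY_PATH) is a likely
    cause of XLA failing to find the GPU under Slurm.
    """
    return {
        "PATH": os.environ.get("PATH", ""),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH", ""),
        "CUDA_HOME": os.environ.get("CUDA_HOME", ""),
        "nvcc": shutil.which("nvcc"),
        "libcudart_candidates": glob.glob("/usr/local/cuda*/lib64/libcudart.so*"),
    }

if __name__ == "__main__":
    for key, value in cuda_path_report().items():
        print(f"{key}: {value}")
```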
@zhisbug
When you do step 3 of this guide: https://alpa-projects.github.io/install/from_source.html#install-from-source, jaxlib is compiled and installed. If you are testing with a JAX-only environment (without Alpa), make sure to follow https://github.com/google/jax#pip-installation-gpu-cuda to install jaxlib. jaxlib is required to run JAX, with or without Alpa; it is the backend of JAX, and you need the CUDA version of jaxlib. No problem, feel free to keep us updated on your exploration.
@TarzanZhao
@zyc-bit In my AWS environment, I use
@zhuohan123 Thanks a lot for replying. This really helped me. Thank you again.

@zhisbug I solved the problem of not finding the GPU that I mentioned above. The reason is that when compiling jaxlib, users like me who use Slurm must compile on
The solution above can be used as a reference for other Slurm users.

And now I have run into a new problem, which may be Ray-related. (Please forgive me for asking this question here; I googled it but did not find a relevant Ray answer.) If this question does not fit under this issue, please let me know. If it is not a problem caused by Alpa, I will ask in the Ray project or community.
One possibility: it seems that your srun is not requesting a sufficient number of CPU threads for Ray to work.
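As an illustration of that possibility, a Slurm job launch can reserve more CPU threads per task explicitly. The exact counts below are hypothetical and depend on your cluster and workload:

```
# Hypothetical resource request: Ray runs several helper processes
# (raylet, GCS, workers), so request more than one CPU per task.
srun --gres=gpu:1 --cpus-per-task=8 python tests/test_install.py
```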
@zhisbug Thank you for your reply, and I apologize for my delayed response.
I found the solution. The error was caused by the proxy settings of the cluster. So Slurm cluster users should check their own proxy settings before using Alpa on their cluster. Maybe you can remind Slurm users of this in your docs.
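For other Slurm users who hit this, one per-process workaround (an illustrative sketch, not an official Alpa recommendation) is to clear the proxy variables from the environment before starting Ray, so local connections are not routed through the cluster proxy:

```python
import os

# Proxy variables that commonly interfere with local Ray/gRPC connections.
PROXY_VARS = (
    "http_proxy", "https_proxy", "all_proxy",
    "HTTP_PROXY", "HTTPS_PROXY", "ALL_PROXY",
)

def clear_proxy_env():
    """Remove proxy settings from this process's environment.

    Returns the names of the variables that were actually set,
    so the change can be logged or undone.
    """
    cleared = []
    for var in PROXY_VARS:
        if os.environ.pop(var, None) is not None:
            cleared.append(var)
    return cleared

if __name__ == "__main__":
    print("cleared:", clear_proxy_env())
```

An alternative is to keep the proxy but add the node's own addresses to `no_proxy`, if the cluster requires a proxy for outbound traffic.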
Hi,
I don't know if it's appropriate to file this as a bug, but it has been bugging me for a long time and I haven't been able to fix it.
I'm operating on a cluster. Ray sees my GPU, but Alpa doesn't. I followed the installation documentation (Install Alpa), and I confirmed that I used --enable_cuda when compiling jax-alpa. When running
tests/test_install.py
errors are reported; see the error log attached below for more details.

System information and environment
(conda list and pip list show the Alpa version is 0.0.0)

To Reproduce
I ran
and my test_install.sh is:
Log
and my Environment Variables are: