Cuda/Pytorch/Installation Issues #172

Cweb118 · 2022-07-21T17:15:58Z

Hello! So I have been struggling with a strange issue that I hope you or someone would be able to help me with. Let me start by providing some information:

OS: Ubuntu 20.04.4
GPU: NVIDIA RTX A6000
NVIDIA-SMI/Driver Version: 470.129.06
CUDA Version: 11.4

So I am not sure if this is a problem with how I am attempting to install openfold, or if something else is going on. Essentially, after cloning the repo, the first thing I do is run scripts/install_third_party_dependencies.sh. This creates an environment called openfold_venv; however, this environment does not seem to contain many of the required packages (e.g. torch is absent). Following this with scripts/activate_environment.sh seems to fail. I have alternatively tried conda env create -f environment.yml, which sets up an environment in a different location. Either way, after setting up the environment I end up with one of the following issues, either during python setup.py install or during inference:

"The detected CUDA version (10.1) mismatches the version that was used to compile
PyTorch (11.2). Please make sure to use the same CUDA versions." (despite torch.version.cuda returning 11.3)
"RuntimeError: CUDA error: no kernel image is available for execution on the device"

These occur on clean installs with no conda or cudatoolkit installed anywhere else on the machine, so it is rather puzzling. As I said, I am not sure if this is due to performing the install sequence incorrectly, but I have tried several different solutions and they all seem to circle back to one of these errors.

I apologize as I know this is rather vague, but if you can offer any sort of guidance it would be greatly appreciated!

gahdritz · 2022-07-21T18:54:15Z

Try uninstalling PyTorch from your conda environment and then manually reinstall it using the instructions on the website here: https://pytorch.org/get-started/locally/. LMK if that helps.

Cweb118 · 2022-07-25T17:26:57Z

Ok so I got a fresh install set up and tried to use the pytorch installation for Cuda 11.3. I received the following error:

`(openfold_venv) cweber@Geiger:~/Desktop/openfold$ python3 setup.py install

running install

/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py:37: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  setuptools.SetuptoolsDeprecationWarning,

/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/easy_install.py:159: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  EasyInstallDeprecationWarning,

running bdist_egg

running egg_info

writing openfold.egg-info/PKG-INFO

writing dependency_links to openfold.egg-info/dependency_links.txt

writing top-level names to openfold.egg-info/top_level.txt

reading manifest file 'openfold.egg-info/SOURCES.txt'

adding license file 'LICENSE'

writing manifest file 'openfold.egg-info/SOURCES.txt'

installing library code to build/bdist.linux-x86_64/egg

running install_lib

running build_py

running build_ext

Traceback (most recent call last):

  File "setup.py", line 95, in <module>
    'Programming Language :: Python :: 3.7,' 

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
    return distutils.core.setup(**attrs)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py", line 74, in run
    self.do_egg_install()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install.py", line 116, in do_egg_install
    self.run_command('bdist_egg')

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/install_lib.py", line 11, in run
    self.build()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/command/install_lib.py", line 107, in build
    self.run_command('build_ext')

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/cmd.py", line 313, in run_command
    self.distribution.run_command(command)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 79, in run
    _build_ext.run(self)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/distutils/command/build_ext.py", line 340, in run
    self.build_extensions()

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 434, in build_extensions
    self._check_cuda_version(compiler_name, compiler_version)

  File "/home/cweber/Desktop/openfold/lib/conda/envs/openfold_venv/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 812, in _check_cuda_version
    raise RuntimeError(CUDA_MISMATCH_MESSAGE.format(cuda_str_version, torch.version.cuda))

RuntimeError: 
The detected CUDA version (10.1) mismatches the version that was used to compile
PyTorch (11.3). Please make sure to use the same CUDA versions.`

gahdritz · 2022-07-26T18:48:33Z

Are you sure that your CUDA version isn't actually 10.1?

Cweb118 · 2022-07-26T18:56:09Z

Unfortunately, yes. Running torch.version.cuda returns 11.3 and there is no evidence of 10.1 on this machine.

gahdritz · 2022-07-26T19:01:10Z

Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.

bing-song · 2022-07-26T20:56:04Z

I saw the same problem, namely "RuntimeError: CUDA error: no kernel image is available for execution on the device", when I ran run_pretrained_openfold.py.

I am using an Azure GPU and followed the installation instructions (Linux). No code has been changed.

Cweb118 · 2022-07-26T21:55:50Z

> Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.

Not that I can see anywhere. After purging the machine of all conda stuff to try and avoid this conflict, the only things I have done are run the install_third_party_dependencies.sh and the torch command you shared. Searching through the packages in the lib does not return any results for versions of cudatoolkit other than for 11.3

lzhangUT · 2022-07-29T07:49:05Z

Any solution to this issue, @gahdritz?
I have been running into the same issue and tried to install CUDA and the whole stack. Following the instructions on GitHub, I was able to run precomputed_alignments_mmseqs.py and got the alignments as expected. Then I tried to run run_pretrained_openfold.py; my script and the error look like this:

It seems to start running model inference, but then runs into this CUDA/Python issue. I also tried to conda install pytorch again; still the same issue.
Then I checked my CUDA:

and PyTorch; both seem fine.

I am completely stuck on my projects here; any help would be appreciated.

gahdritz · 2022-07-29T16:14:39Z

Could you send the output of pip freeze?

Next, could you try downgrading to torch 1.10.1 and re-running python3 setup.py install? We recently upgraded to torch 1.12.0, so that might be the cause (granted, I can't reproduce this even with torch 1.12.0).

bing-song · 2022-08-01T23:01:51Z

Since this problem affects many people, I would like to share my detailed investigation; hopefully it will help to fix the problem soon.

I tested on two GPUs (Tesla K80 and Tesla M60) with Microsoft Azure Machine Learning Studio. I observed the same problem on both GPUs. Here is the GPU info.

I tested both the main and v1.0.0 branches. The issues are different. However, both issues have been reported in this thread and in #161. On v1.0.0, I observed something similar to #161. On main, I observed what people reported here.

I tested with and without a Docker container. The results are the same.

Here is Python and PyTorch information.

For v1.0.0

For main

The command-line inputs with Docker containers are

The error message for main is

There is no error message for v1.0.0, since both the relaxed and unrelaxed pdb files were produced. However, the pdb files are garbage, as shown in the following image.

@gahdritz Let me know if you need more information and how I can help to fix this problem.

gahdritz · 2022-08-02T02:17:32Z

Interesting---this seems to be the first time this is happening on non-Pascal GPUs.

I still can't reproduce this, @bing-song, so I'll need some extra help here, if you don't mind. In openfold/utils/kernel/attention_core.py, on the newest version of OF, would you mind printing both attention_logits.device and v.device right before line 53, where it crashes?

Thanks btw for putting this all together!

bing-song · 2022-08-02T16:11:25Z

@gahdritz Here are the prints that I added around line 53 on the main branch.

Here is the output (not sure why END is not printed). The device cuda:0 is the correct one.

gahdritz · 2022-08-02T16:23:13Z

What happens if you put torch.cuda.synchronize() right before that matmul, below the custom kernel call?

So strange that the kernel executes multiple times without crashing...

bing-song · 2022-08-02T16:42:41Z

It is the same. Here is the code that includes more prints. I added prints on the matmul on line 38 also. That is fine.

bing-song · 2022-08-02T16:55:13Z

@gahdritz Here is the fasta file for this test.

7s0c_A_unpacked_A
TNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVI
RGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEI
YQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFE

If you have a fasta file that you want me to try on my machine, let me know.

gahdritz · 2022-08-02T17:10:50Z

I don't think this has to do with any particular input sequence, since I can't reproduce this on my machine. One last thing, if you don't mind: could you try running it with that CUDA flag (CUDA_LAUNCH_BLOCKING) mentioned in the error message set to 1?
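For anyone following along, one way to set that flag is sketched below. It assumes the script is launched from Python via a subprocess; `blocking_env` is a hypothetical helper name, not part of OpenFold or PyTorch. The flag has to be in the environment before CUDA is initialized, which is why it is set on the child process rather than after the run has started.

```python
import os
from typing import Dict, Optional


def blocking_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `base` defaults to the current process environment. The original mapping
    is never modified.
    """
    env = dict(os.environ if base is None else base)
    # With this set, kernel launches block, so a failing kernel is reported
    # at the call that launched it rather than at some later CUDA call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env
```

The result can be passed as `env=blocking_env()` to `subprocess.run` when launching the inference script. Prefixing the shell command with `CUDA_LAUNCH_BLOCKING=1` has the same effect.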

bing-song · 2022-08-02T17:26:59Z

After setting CUDA_LAUNCH_BLOCKING=1, I did not see more debug info in the output.

Here is the output.

bing-song · 2022-08-02T18:03:46Z

@gahdritz I am thinking about opening SSH access to my Azure GPU server for you to debug. Do you think this would help?

gahdritz · 2022-08-02T18:05:52Z

Yeah that would be great actually.

bing-song · 2022-08-02T20:54:55Z

@gahdritz Can you let me know how to give you the ssh login info?

gahdritz · 2022-08-02T21:11:41Z

Send it to my Gmail, which is just my GitHub username.

gahdritz · 2022-08-03T04:27:59Z

I think I resolved this in 6c89015. @lzhangUT, @bing-song, @epenning could you verify that the inference script works on your systems now?

bing-song · 2022-08-03T16:44:47Z

@gahdritz Just tried. I checked out the openfold main branch, rebuilt the Docker container, and ran inference with the precomputed MSA alignment data. I get exactly the same error.

gahdritz · 2022-08-03T16:45:33Z

Could you try without docker?

bing-song · 2022-08-03T20:17:30Z

@gahdritz, it works well without Docker. The predicted structure looks good compared with the electron microscopy structure.

gahdritz · 2022-08-03T20:25:21Z

Excellent. I've since pushed a fix that should work for Docker. Could you give it a try? If that still doesn't work, could you change compute_capability, _ to compute_capability, error on line 55 of setup.py and print error?

bing-song · 2022-08-03T20:55:00Z

Tried Docker. It still does not work. The same error message.

If I change line 55 as follows:

the setup does not go through. Here is the error output.

gahdritz · 2022-08-03T20:56:51Z

You did the edit slightly wrong---you should replace compute_capability, _ with compute_capability, error and then print error, not replace it with compute_capability.

bing-song · 2022-08-03T21:31:04Z

Not sure what you mean. Do you want the code like this?

gahdritz · 2022-08-03T21:33:38Z

Yes exactly.

bing-song · 2022-08-03T21:48:49Z

With this change, I got the same error message, namely,

gahdritz · 2022-08-03T22:49:33Z

Yes, but did you manage to capture the output of the print(error) line during the Docker build? I think that should tell us what's going wrong on your system.

bing-song · 2022-08-03T23:15:02Z

Yes, I captured the output. Here it is: ERROR: no CUDA-capable device is detected.

We know this is not true, since we can run on the GPU and get good results without Docker.

gahdritz · 2022-08-03T23:30:28Z

Right. Hm. The GPU must not be visible at that stage of the container's construction for whatever reason. As a sanity check, could you enter the resulting container, delete openfold/build and re-run python3 setup.py install? I suspect that the model will work as intended after that.

bing-song · 2022-08-03T23:57:37Z

Yes, that works and makes sense. However, it is not a fix.

As I understand it, GPU information is not available during the Docker image build. It is only available when a container is created, which is why you need docker run --gpus ...

gahdritz · 2022-08-04T00:00:22Z

Yes this is unfortunate. Maybe the approach I took of dynamically determining the right GPU architectures to compile for fundamentally doesn't work in this case. Is there any alternative to hard-coding in a bunch of additional architectures, slowing down the build for everyone else? Perhaps I could look for a GPU, and, if one is found, remove other architectures from a long, hardcoded list that would be used otherwise. I need to think about this.
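A minimal sketch of that idea, not the code that ended up in setup.py: if a GPU is detected at build time, compile only for its compute capability; otherwise fall back to a hardcoded list. The names `DEFAULT_COMPUTE_CAPABILITIES`, `select_compute_capabilities`, and `gencode_flags` are hypothetical, and the list contents are only an assumption for illustration.

```python
from typing import List, Optional, Tuple

Capability = Tuple[int, int]

# Hypothetical fallback list, used when no GPU is visible (e.g. during
# `docker build`). Which architectures belong here is a project decision.
DEFAULT_COMPUTE_CAPABILITIES: List[Capability] = [
    (3, 7), (5, 2), (6, 1), (7, 0), (8, 0), (8, 6),
]


def select_compute_capabilities(
    detected: Optional[Capability],
    defaults: List[Capability] = DEFAULT_COMPUTE_CAPABILITIES,
) -> List[Capability]:
    """Compile only for the detected GPU, or for every default if none is seen."""
    if detected is not None:
        return [detected]
    return list(defaults)


def gencode_flags(capabilities: List[Capability]) -> List[str]:
    """Turn (major, minor) pairs into nvcc -gencode arguments."""
    flags: List[str] = []
    for major, minor in capabilities:
        arch = f"{major}{minor}"
        flags += ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    return flags
```

With a visible K80, `gencode_flags(select_compute_capabilities((3, 7)))` yields a single `-gencode` pair, so local builds stay fast. With `None`, every architecture in the fallback list is compiled, which is slower but does not depend on a GPU being visible at build time.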

gahdritz · 2022-08-04T21:56:03Z

Ok @bing-song I did the thing in the previous comment. Check out f3814c9. It should now compile kernels for 3.7 and other CC's by default.

lzhangUT · 2022-08-05T05:30:56Z

> I think I resolved this in 6c89015. @lzhangUT, @bing-song, @epenning could you verify that the inference script works on your systems now?

I changed the VM from Tesla P40 to V100, and now inference works fine.

bing-song · 2022-08-08T17:38:59Z

@gahdritz I confirmed that the installation is working on both Docker and ENV for Azure K80.

jonathanking · 2022-10-18T21:12:06Z

> Not even in your virtual environment anywhere? The error message does confirm that it knows that your torch CUDA is 11.3.
>
> Not that I can see anywhere. After purging the machine of all conda stuff to try and avoid this conflict, the only things I have done are run the install_third_party_dependencies.sh and the torch command you shared. Searching through the packages in the lib does not return any results for versions of cudatoolkit other than for 11.3

I just want to follow up on this comment by @Cweb118, since I had the exact same issue (version mismatch) but not the issues brought up by others in this thread. My locally installed nvcc did not match what the torch installed in the openfold_venv conda environment was expecting. I used conda to install a newer version of nvcc via conda install cudatoolkit-dev -c conda-forge, which got rid of the mismatch error experienced by myself and @Cweb118. The issue is that when pytorch is installed via conda, nvcc itself is not installed, so a locally or previously installed version of nvcc can cause a version mismatch error.
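A quick way to check for this mismatch before running setup.py is sketched below. It assumes the nvcc found on PATH is the one the extension build picks up, which may not hold if CUDA_HOME points elsewhere. The helper names are hypothetical, and the sketch compares the version strings directly rather than reproducing PyTorch's own check.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def nvcc_release_on_path() -> Optional[str]:
    """Return the release of the first nvcc on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


def versions_match(nvcc_release: Optional[str], torch_cuda: Optional[str]) -> bool:
    """True only when both versions are known and identical."""
    return nvcc_release is not None and nvcc_release == torch_cuda


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda
    except ImportError:
        torch_cuda = None  # PyTorch not installed in this environment
    found = nvcc_release_on_path()
    print(f"nvcc on PATH: {found}; torch built with CUDA: {torch_cuda}")
    print("match" if versions_match(found, torch_cuda) else "mismatch or missing")
```

On a machine like the one described above, this would report an nvcc release of 10.1 next to a torch CUDA version of 11.3, pointing at a system-wide nvcc rather than anything inside the conda environment.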

gahdritz mentioned this issue Aug 4, 2022

Get low lddt score while running inference. #189

Closed

epenning mentioned this issue Aug 29, 2022

Unusual predicted structures from pretrained OpenFold on Pascal GPU #161

Closed

gahdritz closed this as completed Aug 30, 2022
