Failure on Nvidia devices with compute capability 8.6 #55

Closed
miesav opened this issue Jul 25, 2021 · 9 comments
Labels: cuda, docker (An issue with Docker), error report (Something isn't working)

Comments


miesav commented Jul 25, 2021

Just an FYI: running on an Ampere A10 (compute capability 8.6) fails with the error below.

I0725 13:13:22.066373 139761799472960 run_docker.py:200] 2021-07-25 03:13:22.065889: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:235] Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 8.6
I0725 13:13:22.066596 139761799472960 run_docker.py:200] 2021-07-25 03:13:22.065923: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:238] Used ptxas at ptxas
I0725 13:13:22.066839 139761799472960 run_docker.py:200] 2021-07-25 03:13:22.066559: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:625] failed to get PTX kernel "shift_right_logical_3" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
I0725 13:13:22.067006 139761799472960 run_docker.py:200] 2021-07-25 03:13:22.066620: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Internal: Could not find the corresponding function
I0725 13:13:22.068676 139761799472960 run_docker.py:200] Traceback (most recent call last):
I0725 13:13:22.068786 139761799472960 run_docker.py:200] File "/app/alphafold/run_alphafold.py", line 303, in <module>
I0725 13:13:22.068876 139761799472960 run_docker.py:200] app.run(main)
I0725 13:13:22.068992 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0725 13:13:22.069080 139761799472960 run_docker.py:200] _run_main(main, args)
I0725 13:13:22.069165 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0725 13:13:22.069255 139761799472960 run_docker.py:200] sys.exit(main(argv))
I0725 13:13:22.069342 139761799472960 run_docker.py:200] File "/app/alphafold/run_alphafold.py", line 285, in main
I0725 13:13:22.069428 139761799472960 run_docker.py:200] random_seed=random_seed)
I0725 13:13:22.069509 139761799472960 run_docker.py:200] File "/app/alphafold/run_alphafold.py", line 149, in predict_structure
I0725 13:13:22.069588 139761799472960 run_docker.py:200] prediction_result = model_runner.predict(processed_feature_dict)
I0725 13:13:22.069675 139761799472960 run_docker.py:200] File "/app/alphafold/alphafold/model/model.py", line 134, in predict
I0725 13:13:22.069755 139761799472960 run_docker.py:200] result = self.apply(self.params, jax.random.PRNGKey(0), feat)
I0725 13:13:22.069834 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/_src/random.py", line 75, in PRNGKey
I0725 13:13:22.069914 139761799472960 run_docker.py:200] k1 = convert(lax.shift_right_logical(seed_arr, lax._const(seed_arr, 32)))
I0725 13:13:22.070003 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 382, in shift_right_logical
I0725 13:13:22.070081 139761799472960 run_docker.py:200] return shift_right_logical_p.bind(x, y)
I0725 13:13:22.070159 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 264, in bind
I0725 13:13:22.070236 139761799472960 run_docker.py:200] out = top_trace.process_primitive(self, tracers, params)
I0725 13:13:22.070315 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 604, in process_primitive
I0725 13:13:22.070394 139761799472960 run_docker.py:200] return primitive.impl(*tracers, **params)
I0725 13:13:22.070472 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 262, in apply_primitive
I0725 13:13:22.070549 139761799472960 run_docker.py:200] return compiled_fun(*args)
I0725 13:13:22.070631 139761799472960 run_docker.py:200] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 378, in _execute_compiled_primitive
I0725 13:13:22.070705 139761799472960 run_docker.py:200] out_bufs = compiled.execute(input_bufs)
I0725 13:13:22.070770 139761799472960 run_docker.py:200] RuntimeError: Internal: Could not find the corresponding function
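
The line that matters in the log above is the `ptxas does not support CC 8.6` warning; the later `CUDA_ERROR_NOT_FOUND` and `RuntimeError` follow from it. As a rough sketch for spotting this in a long log, the snippet below scans text for that warning and reports the compute capability it names. The function name `unsupported_compute_capabilities` is hypothetical, and the pattern is only known to match the warning wording shown in this thread.

```python
import re

# Matches the stream_executor warning quoted in this thread, e.g.
# "... ptxas does not support CC 8.6". Other wordings are not handled.
_PTXAS_WARNING = re.compile(r"ptxas does not support CC (\d+)\.(\d+)")


def unsupported_compute_capabilities(log_text):
    """Return the sorted, de-duplicated (major, minor) pairs that ptxas rejected."""
    found = {
        (int(match.group(1)), int(match.group(2)))
        for match in _PTXAS_WARNING.finditer(log_text)
    }
    return sorted(found)


# The warning line from the report above, with the run_docker.py prefix dropped.
sample = (
    "W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:235] "
    "Falling back to the CUDA driver for PTX compilation; "
    "ptxas does not support CC 8.6"
)
print(unsupported_compute_capabilities(sample))  # [(8, 6)]
```

An empty list means the warning did not appear, in which case the failure probably has a different cause.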

Author

miesav commented Jul 25, 2021

Commits c1bf7af and 7f0b5ad solve this issue with their Dockerfile changes. Thanks a lot, @chrisroat.

@RodenLuo

Hi, I think I'm hitting a similar error on an NVIDIA RTX A6000. After pulling the latest code, I still get the error. Could you please check whether it can be fixed? Thanks. I've attached the last few lines of the log below; please let me know if more info is needed.

EEAQITYSWKKDSSPVEGSTNVYTVDTSSVGSQTIEVTATVTAADYNPVTVTKTGNVTVTAKVAPEPEGELPYVHPLPHRSSAYIWCGWWVMDEIQKMTEEGKDWKTDDPDSKYYLHRYTLQKMMKDYPEVDVQESRNGYIIHKTALETGIIYTYP, template: VLEVDTQGTVVCSLDGLFPVSEAQVHLALGDQRLNPTVTYGNDSFSAKASVSVTAEDEGTQRLTCAVILGNQSQETLQTVTIYSFPAPNVILTKPEVSEGTEVTVKCEAHPRAKVTLNGVPAQPLGPRAQLLLKATPEDNGRSFSCSATLEVAGQLIHKNQTRELRVLYGPRLDERDCPGNWTWPENSQQTPMCQAWGNPLPELKCLKDGTFPLPIGESVTVTRDLEGTYLCRAR
I0729 16:15:52.818250 23060273533120 run_docker.py:193] I0729 13:15:52.817516 140379006277376 templates.py:271] Found an exact template match 1z7z_I.
I0729 16:15:53.303489 23060273533120 run_docker.py:193] I0729 13:15:53.302619 140379006277376 run_alphafold.py:141] Running model model_1
I0729 16:15:55.840626 23060273533120 run_docker.py:193] 2021-07-29 13:15:55.839955: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
I0729 16:15:55.840924 23060273533120 run_docker.py:193] 2021-07-29 13:15:55.840622: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
I0729 16:15:55.841116 23060273533120 run_docker.py:193] Skipping registering GPU devices...
I0729 16:16:00.499923 23060273533120 run_docker.py:193] I0729 13:16:00.498721 140379006277376 model.py:132] Running predict with shape(feat) = {'aatype': (32, 376), 'residue_index': (32, 376), 'seq_length': (32,), 'template_aatype': (32, 4, 376), 'template_all_atom_masks': (32, 4, 376, 37), 'template_all_atom_positions': (32, 4, 376, 37, 3), 'template_sum_probs': (32, 4, 1), 'is_distillation': (32,), 'seq_mask': (32, 376), 'msa_mask': (32, 508, 376), 'msa_row_mask': (32, 508), 'random_crop_to_size_seed': (32, 2), 'template_mask': (32, 4), 'template_pseudo_beta': (32, 4, 376, 3), 'template_pseudo_beta_mask': (32, 4, 376), 'atom14_atom_exists': (32, 376, 14), 'residx_atom14_to_atom37': (32, 376, 14), 'residx_atom37_to_atom14': (32, 376, 37), 'atom37_atom_exists': (32, 376, 37), 'extra_msa': (32, 5120, 376), 'extra_msa_mask': (32, 5120, 376), 'extra_msa_row_mask': (32, 5120), 'bert_mask': (32, 508, 376), 'true_msa': (32, 508, 376), 'extra_has_deletion': (32, 5120, 376), 'extra_deletion_value': (32, 5120, 376), 'msa_feat': (32, 508, 376, 49), 'target_feat': (32, 376, 22)}
I0729 16:16:01.432996 23060273533120 run_docker.py:193] 2021-07-29 13:16:01.432244: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:235] Falling back to the CUDA driver for PTX compilation; ptxas does not support CC 8.6
I0729 16:16:01.433186 23060273533120 run_docker.py:193] 2021-07-29 13:16:01.432302: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:238] Used ptxas at ptxas
I0729 16:16:01.435479 23060273533120 run_docker.py:193] 2021-07-29 13:16:01.434778: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:625] failed to get PTX kernel "shift_right_logical_3" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
I0729 16:16:01.435665 23060273533120 run_docker.py:193] 2021-07-29 13:16:01.434875: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Internal: Could not find the corresponding function
I0729 16:16:01.440179 23060273533120 run_docker.py:193] Traceback (most recent call last):
I0729 16:16:01.440402 23060273533120 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 302, in <module>
I0729 16:16:01.440572 23060273533120 run_docker.py:193] app.run(main)
I0729 16:16:01.440726 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
I0729 16:16:01.440872 23060273533120 run_docker.py:193] _run_main(main, args)
I0729 16:16:01.441013 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
I0729 16:16:01.441150 23060273533120 run_docker.py:193] sys.exit(main(argv))
I0729 16:16:01.441284 23060273533120 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 284, in main
I0729 16:16:01.441416 23060273533120 run_docker.py:193] random_seed=random_seed)
I0729 16:16:01.441550 23060273533120 run_docker.py:193] File "/app/alphafold/run_alphafold.py", line 148, in predict_structure
I0729 16:16:01.441683 23060273533120 run_docker.py:193] prediction_result = model_runner.predict(processed_feature_dict)
I0729 16:16:01.441817 23060273533120 run_docker.py:193] File "/app/alphafold/alphafold/model/model.py", line 133, in predict
I0729 16:16:01.441951 23060273533120 run_docker.py:193] result = self.apply(self.params, jax.random.PRNGKey(0), feat)
I0729 16:16:01.442085 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/_src/random.py", line 75, in PRNGKey
I0729 16:16:01.442246 23060273533120 run_docker.py:193] k1 = convert(lax.shift_right_logical(seed_arr, lax._const(seed_arr, 32)))
I0729 16:16:01.442380 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 382, in shift_right_logical
I0729 16:16:01.442511 23060273533120 run_docker.py:193] return shift_right_logical_p.bind(x, y)
I0729 16:16:01.442642 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 264, in bind
I0729 16:16:01.442774 23060273533120 run_docker.py:193] out = top_trace.process_primitive(self, tracers, params)
I0729 16:16:01.442907 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 604, in process_primitive
I0729 16:16:01.443041 23060273533120 run_docker.py:193] return primitive.impl(*tracers, **params)
I0729 16:16:01.443177 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 262, in apply_primitive
I0729 16:16:01.443312 23060273533120 run_docker.py:193] return compiled_fun(*args)
I0729 16:16:01.443447 23060273533120 run_docker.py:193] File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 378, in _execute_compiled_primitive
I0729 16:16:01.443582 23060273533120 run_docker.py:193] out_bufs = compiled.execute(input_bufs)
I0729 16:16:01.443713 23060273533120 run_docker.py:193] RuntimeError: Internal: Could not find the corresponding function

@amrhamedp

I have the same problem with Ampere cards. Can you please fix it?

Collaborator

tfgg commented Jul 30, 2021

Could you try changing https://github.com/deepmind/alphafold/blob/main/docker/Dockerfile#L15 to CUDA 11.1? According to https://en.wikipedia.org/wiki/CUDA, 11.1+ is required for CC 8.6 (some of the more recent Ampere series but not the A100).

We will be looking at upgrading the CUDA version in this repository, but this requires careful benchmarking to ensure that there are no performance or accuracy regressions, so it will be faster for you to try this locally for now.
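
To make the version requirement concrete, here is a minimal sketch of the check being described: given a CUDA toolkit version and a GPU's compute capability, decide whether the toolkit can target that GPU. The table holds only the two Ampere entries discussed in this thread (CC 8.0 from CUDA 11.0, CC 8.6 from CUDA 11.1) and is not a complete compatibility matrix. The names `MIN_CUDA_FOR_CC`, `parse_version` and `toolkit_supports` are hypothetical helpers, not part of AlphaFold or CUDA.

```python
# Minimum CUDA toolkit release able to target each compute capability.
# Deliberately partial: only the Ampere variants mentioned in this thread.
MIN_CUDA_FOR_CC = {
    (8, 0): (11, 0),  # GA100 parts such as the A100
    (8, 6): (11, 1),  # GA102 parts such as the A10, RTX A6000, RTX 3090
}


def parse_version(text):
    """Turn a string like '11.1' or '8.6' into a comparable (major, minor) tuple."""
    major, minor = text.split(".")[:2]
    return int(major), int(minor)


def toolkit_supports(cuda_version, compute_capability):
    """Return True if the given CUDA toolkit can compile for the given CC.

    Raises KeyError for compute capabilities missing from the partial table.
    """
    required = MIN_CUDA_FOR_CC[parse_version(compute_capability)]
    return parse_version(cuda_version) >= required


print(toolkit_supports("11.0", "8.0"))  # True
print(toolkit_supports("11.0", "8.6"))  # False
print(toolkit_supports("11.1", "8.6"))  # True
```

The middle case is the one hit in this issue: a CUDA 11.0 image running on a CC 8.6 card.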

Author

miesav commented Jul 30, 2021

Thank you, @tfgg. I experimented a bit with the Dockerfile initially. I can confirm the current master works with Nvidia GPUs based on the GA100 chip (A100, A30). I upgraded the CUDA toolkit to 11.2 for GPUs based on the GA102 chip (A10, A40, RTX AX000), and everything has worked fine so far.

tfgg mentioned this issue Jul 30, 2021

RodenLuo commented Aug 2, 2021

Thank you! I confirm CUDA 11.1 works with NVIDIA RTX A6000. Changing the following line

https://github.com/deepmind/alphafold/blob/b88f8dacef5d94e4d3d49613d08523feb20caec1/docker/Dockerfile#L15

to ARG CUDA=11.1 and then rebuilding the Docker image solves the error I reported above.
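
For anyone who wants to script that one-line change rather than edit the file by hand, a small sketch follows. It rewrites the first `ARG CUDA=...` line in a Dockerfile's text. The function name `set_cuda_version` is hypothetical, and the `example` string is an illustrative stand-in, not a copy of the repository's Dockerfile.

```python
import re

# Matches a whole line of the form "ARG CUDA=<anything>".
# A bare "ARG CUDA" re-declaration (no "=") is left untouched.
_CUDA_ARG_LINE = re.compile(r"^ARG CUDA=.*$", flags=re.MULTILINE)


def set_cuda_version(dockerfile_text, version):
    """Return dockerfile_text with the first 'ARG CUDA=' line set to `version`."""
    new_text, count = _CUDA_ARG_LINE.subn(
        lambda _match: f"ARG CUDA={version}", dockerfile_text, count=1
    )
    if count == 0:
        raise ValueError("no 'ARG CUDA=' line found")
    return new_text


# Illustrative input only; the real docker/Dockerfile has more content.
example = "ARG CUDA=11.0\nFROM some-base-image\nARG CUDA\n"
print(set_cuda_version(example, "11.1"))
```

To apply it, read `docker/Dockerfile`, pass its contents through the function, write the result back, and then rebuild the image as described in the repository README.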

@amrhamedp

I believe 11.2 has compatibility issues. 11.1 works.


jucastil commented Aug 9, 2021

I got rid of the error as well. Thanks! I'm running Docker on CentOS 7.9.2009 (Core) with an NVIDIA GeForce RTX 3090.

@Augustin-Zidek
Collaborator

This was fixed in 57a2455.
