
CUDA_ERROR_ILLEGAL_ADDRESS error with AlphaFold multimer 2.3.0 #667

Closed

sz-1002 opened this issue Dec 26, 2022 · 7 comments

sz-1002 commented Dec 26, 2022

Hi!

I am trying to run AlphaFold-Multimer 2.3.0 and encountered this error: Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered (see details below). Could you help me resolve it? Thank you very much!

Machine spec etc.:

  • Ubuntu 20.04 LTS
  • GPU: NVIDIA RTX 3090
  • CUDA version: tried both 11.1.1 and 11.4.0 (both gave the same error)
  • Total length of the protein complex: ~2,200 aa

When I searched for related errors online, two solutions are generally proposed: (1) switch to a newer CUDA version (https://github.com/deepmind/dm-haiku/issues/204), or (2) disable unified memory (https://github.com/deepmind/alphafold/issues/406).

I tried using CUDA 11.4.0 instead of 11.1.1 by changing the following lines in the Dockerfile, but the same error persists.

ARG CUDA=11.1.1
  → ARG CUDA=11.4.0
FROM nvidia/cuda:${CUDA}-cudnn8-runtime-ubuntu18.04
  → FROM nvidia/cuda:${CUDA}-cudnn8-runtime-ubuntu20.04
conda install -y -c conda-forge cudatoolkit==${CUDA_VERSION}
  → conda install -y -c "nvidia/label/cuda-11.4.0" cuda-toolkit
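
A quick way to confirm which CUDA runtime and jax build actually ended up inside the rebuilt image (a sketch only; the image tag is assumed to be alphafold, as in the README build command):

    # Open a shell in the rebuilt image (entrypoint overridden so the pipeline does not start).
    docker run --rm -it --gpus all --entrypoint /bin/bash alphafold
    # Inside the container:
    conda list | grep -i cuda                                                    # CUDA packages installed via conda
    python3 -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__)"  # jax/jaxlib versions in use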

As for (2), disabling unified memory, I am worried that this would give me an out-of-memory error given the size of the protein.
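
For reference, the unified-memory behavior is controlled by two environment variables documented in the AlphaFold README (run_docker.py passes the same values into the container); a rough sketch of the settings in question, not lines copied from the repository:

    export TF_FORCE_UNIFIED_MEMORY=1           # set to 0 (or unset) to disable unified memory
    export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  # lets JAX oversubscribe GPU memory via unified memory

With unified memory disabled, a ~2,200-residue complex would have to fit entirely in the 24 GB of a single RTX 3090, which is the out-of-memory concern above.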

Not sure if this is relevant, but this is a recent problem: predictions for this and other similarly sized complexes worked fine before (I was using v2.2.0, and I wonder if this is an issue with e.g. the version of jax or jaxlib).

Thank you very much!

Error message:

I1226 15:42:37.167795 140027647279936 run_docker.py:255] I1226 06:42:37.167222 140635718940480 amber_minimize.py:407] Minimizing protein, attempt 1 of 100.
I1226 15:42:39.806318 140027647279936 run_docker.py:255] I1226 06:42:39.805861 140635718940480 amber_minimize.py:68] Restraining 17790 / 35357 particles.
I1226 15:45:15.867727 140027647279936 run_docker.py:255] I1226 06:45:15.866998 140635718940480 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I1226 15:45:42.889597 140027647279936 run_docker.py:255] 2022-12-26 06:45:42.889173: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I1226 15:45:42.899475 140027647279936 run_docker.py:255] Traceback (most recent call last):
I1226 15:45:42.899577 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 432, in <module>
I1226 15:45:42.899646 140027647279936 run_docker.py:255] app.run(main)
I1226 15:45:42.899709 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
I1226 15:45:42.899771 140027647279936 run_docker.py:255] _run_main(main, args)
I1226 15:45:42.899834 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
I1226 15:45:42.899896 140027647279936 run_docker.py:255] sys.exit(main(argv))
I1226 15:45:42.899957 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 408, in main
I1226 15:45:42.900018 140027647279936 run_docker.py:255] predict_structure(
I1226 15:45:42.900109 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 243, in predict_structure
I1226 15:45:42.900174 140027647279936 run_docker.py:255] relaxed_pdb_str, _, violations = amber_relaxer.process(
I1226 15:45:42.900234 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
I1226 15:45:42.900292 140027647279936 run_docker.py:255] out = amber_minimize.run_pipeline(
I1226 15:45:42.900353 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 489, in run_pipeline
I1226 15:45:42.900412 140027647279936 run_docker.py:255] ret.update(get_violation_metrics(prot))
I1226 15:45:42.900472 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 357, in get_violation_metrics
I1226 15:45:42.900531 140027647279936 run_docker.py:255] structural_violations, struct_metrics = find_violations(prot)
I1226 15:45:42.900591 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 339, in find_violations
I1226 15:45:42.900651 140027647279936 run_docker.py:255] violations = folding.find_structural_violations(
I1226 15:45:42.900712 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/model/folding.py", line 761, in find_structural_violations
I1226 15:45:42.900774 140027647279936 run_docker.py:255] between_residue_clashes = all_atom.between_residue_clash_loss(
I1226 15:45:42.900835 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/model/all_atom.py", line 783, in between_residue_clash_loss
I1226 15:45:42.900898 140027647279936 run_docker.py:255] dists = jnp.sqrt(1e-10 + jnp.sum(
I1226 15:45:42.900959 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/numpy/reductions.py", line 216, in sum
I1226 15:45:42.901019 140027647279936 run_docker.py:255] return _reduce_sum(a, axis=_ensure_optional_axes(axis), dtype=dtype, out=out,
I1226 15:45:42.901078 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
I1226 15:45:42.901138 140027647279936 run_docker.py:255] return fun(*args, **kwargs)
I1226 15:45:42.901199 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/api.py", line 623, in cache_miss
I1226 15:45:42.901261 140027647279936 run_docker.py:255] out_flat = call_bind_continuation(execute(*args_flat))
I1226 15:45:42.901322 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
I1226 15:45:42.901383 140027647279936 run_docker.py:255] out_flat = compiled.execute(in_flat)
I1226 15:45:42.901446 140027647279936 run_docker.py:255] jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I1226 15:45:42.901507 140027647279936 run_docker.py:255]
I1226 15:45:42.901568 140027647279936 run_docker.py:255] The stack trace below excludes JAX-internal frames.
I1226 15:45:42.901637 140027647279936 run_docker.py:255] The preceding is the original exception that occurred, unmodified.
@leiterenato

Hi @sz-1002
I am having the same issue on the relaxation step.
#660

I tried with a newer version of CUDA (11.6) and had the same issue.


sz-1002 commented Dec 27, 2022

Hi @leiterenato
Thanks for letting me know! Hopefully there will be a solution for this.

@joshabramson
Collaborator

A different (non-relax) model issue seems to have been resolved by using CUDA 11.8: #646 (comment).

If that doesn't work, you can try use_gpu_relax=False, or turn off relaxation entirely with run_relax=False. We will attempt to address the problem more fully in the new year.
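
A rough sketch of what those flags look like on the command line (paths are placeholders and other required flags are omitted; the flag names are the ones referenced above, as defined in v2.3.0's run_alphafold.py):

    # Sketch only: relax on CPU instead of the GPU to sidestep the CUBIN load failure.
    python3 run_alphafold.py \
      --fasta_paths=/path/to/complex.fasta \
      --model_preset=multimer \
      --data_dir=/path/to/databases \
      --output_dir=/path/to/output \
      --use_gpu_relax=false
    # Or skip the relaxation step entirely:
    #   --run_relax=false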

@joshabramson
Collaborator

We think this is due to the jax version change from 0.3.17 to 0.3.25. We don't want to revert the jax version, though, so we are looking for workarounds.

@alexanderimanicowenrivers
Collaborator

The latest fix should solve this issue. Thank you for your patience.


sz-1002 commented Jan 16, 2023

Fixed in AlphaFold v2.3.1. Thank you!

sz-1002 closed this as completed Jan 16, 2023
@ChrisLou-bioinfo

Thank you!
