
CUDA_ERROR_ILLEGAL_ADDRESS error with AlphaFold multimer 2.3.0 #667

Closed

sz-1002 opened this issue Dec 26, 2022 · 7 comments

sz-1002 commented Dec 26, 2022

Hi!

I am trying to run AlphaFold-Multimer 2.3.0 and encountered this error: Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered (see details below). Could you help me resolve it? Thank you very much!

Machine spec etc.:

  • Ubuntu 20.04 LTS
  • GPU: NVIDIA RTX 3090
  • CUDA version: tried both 11.1.1 and 11.4.0 (both gave the same error)
  • Total length of the protein complex: ~2,200 aa

When I searched for related errors online, two solutions are generally proposed: (1) switch to a newer CUDA version (https://github.com/deepmind/dm-haiku/issues/204), or (2) disable unified memory (https://github.com/deepmind/alphafold/issues/406).

I tried using CUDA 11.4.0 instead of 11.1.1 by changing the following lines in the Dockerfile, but the same error persists.

ARG CUDA=11.1.1
  → ARG CUDA=11.4.0
FROM nvidia/cuda:${CUDA}-cudnn8-runtime-ubuntu18.04
  → FROM nvidia/cuda:${CUDA}-cudnn8-runtime-ubuntu20.04
conda install -y -c conda-forge cudatoolkit==${CUDA_VERSION}
  → conda install -y -c "nvidia/label/cuda-11.4.0" cuda-toolkit
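
A quick way to confirm which CUDA runtime and jax build actually ended up inside the rebuilt image (a sketch only; the image tag is assumed to be alphafold, as in the README build command):

    # Open a shell in the rebuilt image (entrypoint overridden so the pipeline does not start).
    docker run --rm -it --gpus all --entrypoint /bin/bash alphafold
    # Inside the container:
    conda list | grep -i cuda                                                    # CUDA packages installed via conda
    python3 -c "import jax, jaxlib; print(jax.__version__, jaxlib.__version__)"  # jax/jaxlib versions in use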

As for (2), disabling unified memory, I am worried that this would give me an out-of-memory error given the size of the protein.
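
For reference, the unified-memory behavior is controlled by two environment variables documented in the AlphaFold README (run_docker.py passes the same values into the container); a rough sketch of the settings in question, not lines copied from the repository:

    export TF_FORCE_UNIFIED_MEMORY=1           # set to 0 (or unset) to disable unified memory
    export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  # lets JAX oversubscribe GPU memory via unified memory

With unified memory disabled, a ~2,200-residue complex would have to fit entirely in the 24 GB of a single RTX 3090, which is the out-of-memory concern above.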

Not sure if this is relevant, but this is a recent problem: predictions for this and other similarly sized complexes worked fine before (I was using v2.2.0, and I wonder if this is an issue with e.g. the version of jax or jaxlib).

Thank you very much!

Error message:

I1226 15:42:37.167795 140027647279936 run_docker.py:255] I1226 06:42:37.167222 140635718940480 amber_minimize.py:407] Minimizing protein, attempt 1 of 100.
I1226 15:42:39.806318 140027647279936 run_docker.py:255] I1226 06:42:39.805861 140635718940480 amber_minimize.py:68] Restraining 17790 / 35357 particles.
I1226 15:45:15.867727 140027647279936 run_docker.py:255] I1226 06:45:15.866998 140635718940480 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I1226 15:45:42.889597 140027647279936 run_docker.py:255] 2022-12-26 06:45:42.889173: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I1226 15:45:42.899475 140027647279936 run_docker.py:255] Traceback (most recent call last):
I1226 15:45:42.899577 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 432, in <module>
I1226 15:45:42.899646 140027647279936 run_docker.py:255] app.run(main)
I1226 15:45:42.899709 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
I1226 15:45:42.899771 140027647279936 run_docker.py:255] _run_main(main, args)
I1226 15:45:42.899834 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
I1226 15:45:42.899896 140027647279936 run_docker.py:255] sys.exit(main(argv))
I1226 15:45:42.899957 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 408, in main
I1226 15:45:42.900018 140027647279936 run_docker.py:255] predict_structure(
I1226 15:45:42.900109 140027647279936 run_docker.py:255] File "/app/alphafold/run_alphafold.py", line 243, in predict_structure
I1226 15:45:42.900174 140027647279936 run_docker.py:255] relaxed_pdb_str, _, violations = amber_relaxer.process(
I1226 15:45:42.900234 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
I1226 15:45:42.900292 140027647279936 run_docker.py:255] out = amber_minimize.run_pipeline(
I1226 15:45:42.900353 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 489, in run_pipeline
I1226 15:45:42.900412 140027647279936 run_docker.py:255] ret.update(get_violation_metrics(prot))
I1226 15:45:42.900472 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 357, in get_violation_metrics
I1226 15:45:42.900531 140027647279936 run_docker.py:255] structural_violations, struct_metrics = find_violations(prot)
I1226 15:45:42.900591 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/relax/amber_minimize.py", line 339, in find_violations
I1226 15:45:42.900651 140027647279936 run_docker.py:255] violations = folding.find_structural_violations(
I1226 15:45:42.900712 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/model/folding.py", line 761, in find_structural_violations
I1226 15:45:42.900774 140027647279936 run_docker.py:255] between_residue_clashes = all_atom.between_residue_clash_loss(
I1226 15:45:42.900835 140027647279936 run_docker.py:255] File "/app/alphafold/alphafold/model/all_atom.py", line 783, in between_residue_clash_loss
I1226 15:45:42.900898 140027647279936 run_docker.py:255] dists = jnp.sqrt(1e-10 + jnp.sum(
I1226 15:45:42.900959 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/numpy/reductions.py", line 216, in sum
I1226 15:45:42.901019 140027647279936 run_docker.py:255] return _reduce_sum(a, axis=_ensure_optional_axes(axis), dtype=dtype, out=out,
I1226 15:45:42.901078 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
I1226 15:45:42.901138 140027647279936 run_docker.py:255] return fun(*args, **kwargs)
I1226 15:45:42.901199 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/api.py", line 623, in cache_miss
I1226 15:45:42.901261 140027647279936 run_docker.py:255] out_flat = call_bind_continuation(execute(*args_flat))
I1226 15:45:42.901322 140027647279936 run_docker.py:255] File "/opt/conda/lib/python3.8/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
I1226 15:45:42.901383 140027647279936 run_docker.py:255] out_flat = compiled.execute(in_flat)
I1226 15:45:42.901446 140027647279936 run_docker.py:255] jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
I1226 15:45:42.901507 140027647279936 run_docker.py:255]
I1226 15:45:42.901568 140027647279936 run_docker.py:255] The stack trace below excludes JAX-internal frames.
I1226 15:45:42.901637 140027647279936 run_docker.py:255] The preceding is the original exception that occurred, unmodified.
@leiterenato

Hi @sz-1002
I am having the same issue on the relaxation step.
#660

I tried with a newer version of CUDA (11.6) and had the same issue.


sz-1002 commented Dec 27, 2022

Hi @leiterenato
Thanks for letting me know! Hopefully there will be a solution for this.

@joshabramson
Collaborator

A different (non-relax) model issue seems to have been resolved by using CUDA 11.8: #646 (comment).

If that doesn't work, you can try use_gpu_relax=False, or turn off relaxation entirely with run_relax=False. We will attempt to address the problem more fully in the new year.
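
A rough sketch of what those flags look like on the command line (paths are placeholders and other required flags are omitted; the flag names are the ones referenced above, as defined in v2.3.0's run_alphafold.py):

    # Sketch only: relax on CPU instead of the GPU to sidestep the CUBIN load failure.
    python3 run_alphafold.py \
      --fasta_paths=/path/to/complex.fasta \
      --model_preset=multimer \
      --data_dir=/path/to/databases \
      --output_dir=/path/to/output \
      --use_gpu_relax=false
    # Or skip the relaxation step entirely:
    #   --run_relax=false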

@joshabramson
Collaborator

We think this is due to the jax version change from 0.3.17 to 0.3.25. We don't want to revert the jax version, though, so we are looking for workarounds.

@alexanderimanicowenrivers
Collaborator

The latest fix should solve this issue. Thank you for your patience.


sz-1002 commented Jan 16, 2023

Fixed in AlphaFold v2.3.1. Thank you!

sz-1002 closed this as completed Jan 16, 2023
@ChrisLou-bioinfo

Thank you!
