
CUDA_ERROR_OUT_OF_MEMORY: out of memory error #50

Closed
yujifujeida opened this issue Jan 11, 2022 · 1 comment

Comments

@yujifujeida

yujifujeida commented Jan 11, 2022

Hi Yoshitaka,

I'm trying LocalColabFold on Windows 11 with WSL2, but it doesn't work.
Errors:

2022-01-12 06:00:26,993 Running colabfold 1.2.0 (ee8e17402a3ce8aa67669d3ea22958ef99b808d9)
2022-01-12 06:00:27,001 Found 8 citations for tools or databases
2022-01-12 06:00:30.889012: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 1073741824 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:30.948473: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 966367744 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.001558: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 869731072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.058133: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 782758144 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.115851: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 704482304 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.169997: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 634034176 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.222446: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 570630912 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.276942: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 513568000 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.331319: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 462211328 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.386146: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 415990272 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.442584: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 374391296 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.498272: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 336952320 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.554731: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 303257088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.609158: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 272931584 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.671022: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 245638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.729073: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 221074944 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.786261: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 198967552 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.844108: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 179070976 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.895323: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 161164032 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:31.950303: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 145047808 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.003178: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 130543104 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.053485: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 117488896 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.105073: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 105740032 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.163281: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 95166208 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.216956: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 85649664 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.266061: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 77084928 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.500940: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 2147483648 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:32.512066: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 2147483648 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:42.538494: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 2147483648 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:42.551197: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:771] failed to alloc 2147483648 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2022-01-12 06:00:42.551242: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 12.00MiB (rounded to 12582912)requested by op
2022-01-12 06:00:42.551569: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:474] ****************************************************************************************************
2022-01-12 06:00:42.551647: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2085] Execution of replica 0 failed: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 12582912 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:   12.00MiB
              constant allocation:         0B
        maybe_live_out allocation:   12.00MiB
     preallocated temp allocation:         0B
                 total allocation:   24.00MiB
              total fragmentation:         0B (0.00%)

allocation 0: 0x5576515bd320, size 12582912, output shape is |f32[48,512,128]|, maybe-live-out:
 value: <1 copy @0> (size=12582912,offset=0): f32[48,512,128]{2,1,0}
contains:<1 copy @0>
 positions:
  copy
 uses:
 from instruction:%copy = f32[48,512,128]{2,1,0} copy(f32[48,512,128]{2,1,0} %parameter.1)

allocation 1: 0x5576515bd3d0, size 12582912, parameter 0, shape |f32[48,512,128]| at ShapeIndex {}:
 value: <0 parameter.1 @0> (size=12582912,offset=0): f32[48,512,128]{2,1,0}
contains:<0 parameter.1 @0>
 positions:
  parameter.1
 uses:
  copy, operand 0
 from instruction:%parameter.1 = f32[48,512,128]{2,1,0} parameter(0)


Traceback (most recent call last):
  File "/home/yuji/colabfold_batch/colabfold-conda/bin/colabfold_batch", line 8, in <module>
    sys.exit(main())
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 1318, in main
    zip_results=args.zip,
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/batch.py", line 980, in run
    rank_by=rank_by,
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/colabfold/alphafold/models.py", line 82, in load_models_and_params
    model_name=model_name + model_suffix, data_dir=str(data_dir)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/alphafold/model/data.py", line 37, in get_model_haiku_params
    return utils.flat_params_to_haiku(params)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/alphafold/model/utils.py", line 80, in flat_params_to_haiku
    hk_params[scope][name] = jnp.array(array)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/_src/numpy/lax_numpy.py", line 3597, in array
    out = lax._convert_element_type(out, dtype, weak_type=weak_type)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/_src/lax/lax.py", line 481, in _convert_element_type
    weak_type=new_weak_type)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/core.py", line 272, in bind
    out = top_trace.process_primitive(self, tracers, params)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/core.py", line 624, in process_primitive
    return primitive.impl(*tracers, **params)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 418, in apply_primitive
    return compiled_fun(*args)
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 442, in <lambda>
    return lambda *args, **kw: compiled(*args, **kw)[0]
  File "/home/yuji/colabfold_batch/colabfold-conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 1100, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
RuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 12582912 bytes.
BufferAssignment OOM Debugging.
BufferAssignment stats:
             parameter allocation:   12.00MiB
              constant allocation:         0B
        maybe_live_out allocation:   12.00MiB
     preallocated temp allocation:         0B
                 total allocation:   24.00MiB
              total fragmentation:         0B (0.00%)

allocation 0: 0x5576515bd320, size 12582912, output shape is |f32[48,512,128]|, maybe-live-out:
 value: <1 copy @0> (size=12582912,offset=0): f32[48,512,128]{2,1,0}
contains:<1 copy @0>
 positions:
  copy
 uses:
 from instruction:%copy = f32[48,512,128]{2,1,0} copy(f32[48,512,128]{2,1,0} %parameter.1)

allocation 1: 0x5576515bd3d0, size 12582912, parameter 0, shape |f32[48,512,128]| at ShapeIndex {}:
 value: <0 parameter.1 @0> (size=12582912,offset=0): f32[48,512,128]{2,1,0}
contains:<0 parameter.1 @0>
 positions:
  parameter.1
 uses:
  copy, operand 0
 from instruction:%parameter.1 = f32[48,512,128]{2,1,0} parameter(0)

The output of nvcc --version and nvidia-smi is below.

yuji@DESKTOP-B10GKOG:~$ nvidia-smi
Wed Jan 12 05:57:10 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 511.04.01    Driver Version: 511.09       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
|  0%   37C    P8    10W / 170W |    521MiB / 12288MiB |      3%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
yuji@DESKTOP-B10GKOG:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

I'm sorry to bother you when you're busy, but I would appreciate your guidance.

@yujifujeida
Author

export TF_FORCE_UNIFIED_MEMORY="1"
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
export XLA_PYTHON_CLIENT_ALLOCATOR="platform"
export TF_FORCE_GPU_ALLOW_GROWTH="true"

This works, thanks.
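For anyone hitting the same WSL2 OOM, a minimal sketch of how these variables could be set before running colabfold_batch (the input/output paths are placeholders, and the exact effect of each variable is only summarized here):

```shell
#!/bin/sh
# Enable CUDA unified memory so JAX/XLA can spill GPU allocations to host RAM.
export TF_FORCE_UNIFIED_MEMORY="1"
# Let the XLA client address up to 4x the physical GPU memory (spilling to RAM).
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
# Allocate on demand instead of preallocating a large GPU memory pool.
export XLA_PYTHON_CLIENT_ALLOCATOR="platform"
# Let TensorFlow grow its GPU allocations incrementally.
export TF_FORCE_GPU_ALLOW_GROWTH="true"

# Placeholder invocation: replace input.fasta and outdir with real paths.
# colabfold_batch input.fasta outdir
echo "unified-memory settings exported"
```

Adding the export lines to ~/.bashrc makes them persist across WSL2 sessions.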
