What happened?
I am currently running some tests on an A100 and llama.cpp crashes when CUDA graphs are enabled. Here are the repro steps:
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 ..
make -j
./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 124
system_info: n_threads = 124 (n_threads_batch = 124) / 124 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 1
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
I believe the meaning of life is the way we act in which we perceive our personality and the way of life experience and how many. You need to make this.
How to the one you the most of the way and the one and this one the one the other this in the other this in this is this.
The
The one in this
the this in this and this.
The
and this in this in this is this is the one the oneThe is this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in
/home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:106: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:2476
cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3036247]
[New LWP 3036251]
[New LWP 3036303]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f0016020a73 in ggml_abort () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#2 0x00007f0016105876 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#3 0x00007f0016106d5b in ggml_backend_cuda_synchronize(ggml_backend*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#4 0x00007f001606b60d in ggml_backend_sched_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#5 0x00007f001cf29534 in llama_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#6 0x00007f001cf2d539 in llama_get_logits_ith () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#7 0x000055d200455ef8 in gpt_sampler_sample(gpt_sampler*, llama_context*, int, bool) ()
#8 0x000055d2003db119 in main ()
[Inferior 1 (process 3036246) detached]
Aborted (core dumped)
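For context on where the abort happens: kernel launches are asynchronous, so an illegal memory access inside a launched graph is only reported at a later CUDA call, which is why the backtrace ends in ggml_backend_cuda_synchronize / cudaStreamSynchronize rather than at the faulting kernel. A minimal standalone sketch of that behavior (not llama.cpp code; assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags):

// Standalone sketch, not llama.cpp code; assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags.
// A faulting kernel recorded into a CUDA graph launches "successfully"; the illegal
// memory access is only reported by the next cudaStreamSynchronize on that stream.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void faulting_kernel(float *dst) {
    dst[0] = 1.0f;  // dst is a null pointer here -> illegal memory access at run time
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Record a single (deliberately faulting) kernel launch into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    faulting_kernel<<<1, 1, 0, stream>>>(nullptr);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&instance, graph, 0);

    // The launch itself returns cudaSuccess; the fault only surfaces at synchronize,
    // which mirrors the ggml_backend_cuda_synchronize frame in the backtrace above.
    printf("launch:      %s\n", cudaGetErrorString(cudaGraphLaunch(instance, stream)));
    printf("synchronize: %s\n", cudaGetErrorString(cudaStreamSynchronize(stream)));
    return 0;
}

Running the repro under compute-sanitizer should attribute the fault to the actual kernel, but the backtrace above only shows the synchronization point.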
Build info:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
build: 3870 (841713e1) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   54C    P0   100W / 400W |   8337MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
The model is F16 LLaMA 3.1. The command crashes consistently, though at a different point each time after startup.
It stops crashing if I add GGML_CUDA_DISABLE_GRAPHS=1.
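For reference, that is the same command as above, just prefixed with the environment variable:
GGML_CUDA_DISABLE_GRAPHS=1 ./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096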
@agray3 Do you have any idea what the issue might be? Do you observe the same crash on an A100?
Name and Version
version: 3870 (841713e)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response