What happened?
I am currently running some tests on an A100 and llama.cpp crashes when CUDA graphs are enabled. Here are the repro steps:
cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80 ..
make -j
./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CPU buffer size = 1002.00 MiB
llm_load_tensors: CUDA0 buffer size = 14315.02 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 512.00 MiB
llama_new_context_with_model: KV self size = 512.00 MiB, K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 296.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 124
system_info: n_threads = 124 (n_threads_batch = 124) / 124 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 1
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
I believe the meaning of life is the way we act in which we perceive our personality and the way of life experience and how many. You need to make this.
How to the one you the most of the way and the one and this one the one the other this in the other this in this is this.
The
The one in this
the this in this and this.
The
and this in this in this is this is the one the oneThe is this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in this in
/home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:106: CUDA error
CUDA error: an illegal memory access was encountered
current device: 0, in function ggml_backend_cuda_synchronize at /home/ggerganov/development/llama.cpp/ggml/src/ggml-cuda.cu:2476
cudaStreamSynchronize(cuda_ctx->stream())
[New LWP 3036247]
[New LWP 3036251]
[New LWP 3036303]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007f0015b87c7f in __GI___wait4 (pid=3036530, stat_loc=0x7ffe83b62044, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x00007f0016020a73 in ggml_abort () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#2 0x00007f0016105876 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#3 0x00007f0016106d5b in ggml_backend_cuda_synchronize(ggml_backend*) () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#4 0x00007f001606b60d in ggml_backend_sched_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/ggml/src/libggml.so
#5 0x00007f001cf29534 in llama_synchronize () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#6 0x00007f001cf2d539 in llama_get_logits_ith () from /home/ggerganov/development/llama.cpp/build-cuda-2/src/libllama.so
#7 0x000055d200455ef8 in gpt_sampler_sample(gpt_sampler*, llama_context*, int, bool) ()
#8 0x000055d2003db119 in main ()
[Inferior 1 (process 3036246) detached]
Aborted (core dumped)
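For context on where the abort happens: kernel launches are asynchronous, so an illegal memory access inside a launched graph is only reported at a later CUDA call, which is why the backtrace ends in ggml_backend_cuda_synchronize / cudaStreamSynchronize rather than at the faulting kernel. A minimal standalone sketch of that behavior (not llama.cpp code; assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags):

// Standalone sketch, not llama.cpp code; assumes CUDA 11.4+ for cudaGraphInstantiateWithFlags.
// A faulting kernel recorded into a CUDA graph launches "successfully"; the illegal
// memory access is only reported by the next cudaStreamSynchronize on that stream.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void faulting_kernel(float *dst) {
    dst[0] = 1.0f;  // dst is a null pointer here -> illegal memory access at run time
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaGraphExec_t instance;

    // Record a single (deliberately faulting) kernel launch into a graph.
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    faulting_kernel<<<1, 1, 0, stream>>>(nullptr);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiateWithFlags(&instance, graph, 0);

    // The launch itself returns cudaSuccess; the fault only surfaces at synchronize,
    // which mirrors the ggml_backend_cuda_synchronize frame in the backtrace above.
    printf("launch:      %s\n", cudaGetErrorString(cudaGraphLaunch(instance, stream)));
    printf("synchronize: %s\n", cudaGetErrorString(cudaStreamSynchronize(stream)));
    return 0;
}

Running the repro under compute-sanitizer should attribute the fault to the actual kernel, but the backtrace above only shows the synchronization point.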
Build info:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0, VMM: yes
build: 3870 (841713e1) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   54C    P0   100W / 400W |   8337MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
The model is F16 LLaMA 3.1. The command crashes consistently, though at a different point each time after startup.
It stops crashing if I add GGML_CUDA_DISABLE_GRAPHS=1.
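For reference, that is the same command as above, just prefixed with the environment variable:
GGML_CUDA_DISABLE_GRAPHS=1 ./bin/llama-cli -m ../models/llama-3.1-8b/ggml-model-f16.gguf -p "I believe the meaning of life is" -n -1 -s 2 -ngl 99 -c 4096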
@agray3 Do you have any idea what the issue might be? Do you observe the same crash on an A100?
Name and Version
version: 3870 (841713e)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux
Relevant log output
No response