Using commit a09f919, compiled with:
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2 make -j
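(For context on the two DMMV flags: they set compile-time launch dimensions for the dequantize-mul-mat-vec CUDA kernels. Below is a simplified float sketch of how such constants typically shape a kernel launch; the names and structure are assumptions, not the actual ggml-cuda.cu code.)

```cuda
#include <cuda_runtime.h>

// Sketch (assumption, not ggml-cuda.cu): the build flags become
// compile-time constants that fix the thread-block geometry.
#define GGML_CUDA_DMMV_X 64   // threads striding across each row's columns
#define GGML_CUDA_DMMV_Y 2    // matrix rows handled per thread block

__global__ void mat_vec(const float * x, const float * y, float * dst,
                        int ncols, int nrows) {
    const int row = blockIdx.x * GGML_CUDA_DMMV_Y + threadIdx.y;
    if (row >= nrows) return;

    float sum = 0.0f;
    // each thread covers every GGML_CUDA_DMMV_X-th column of its row
    for (int col = threadIdx.x; col < ncols; col += GGML_CUDA_DMMV_X) {
        sum += x[row * ncols + col] * y[col];
    }
    // naive reduction via atomics (dst must be zeroed first; the real
    // kernels use warp-level reductions instead)
    atomicAdd(&dst[row], sum);
}

void launch_mat_vec(const float * x, const float * y, float * dst,
                    int ncols, int nrows, cudaStream_t stream) {
    const dim3 block(GGML_CUDA_DMMV_X, GGML_CUDA_DMMV_Y, 1);
    const int  grid = (nrows + GGML_CUDA_DMMV_Y - 1) / GGML_CUDA_DMMV_Y;
    mat_vec<<<grid, block, 0, stream>>>(x, y, dst, ncols, nrows);
}
```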
Running with the following command on 4x A40 (48 GB):
./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048
I get the following error:
CUDA error 1 at ggml-cuda.cu:2292: invalid argument
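The message format suggests a standard CUDA error-check macro; here is a minimal sketch of that pattern (an assumption about what ggml-cuda.cu does, not the actual source) that reproduces the same output shape:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Sketch of the usual check-every-call pattern; ggml-cuda.cu presumably
// uses something equivalent at line 2292.
#define CUDA_CHECK(call)                                                       \
    do {                                                                       \
        cudaError_t err_ = (call);                                             \
        if (err_ != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                    \
                    (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_)); \
            exit(1);                                                           \
        }                                                                      \
    } while (0)

int main(void) {
    // cudaErrorInvalidValue is error code 1 and stringifies to
    // "invalid argument"; a NULL memcpy triggers it deliberately.
    CUDA_CHECK(cudaMemcpy(NULL, NULL, 1, cudaMemcpyDeviceToHost));
    return 0;
}
```

Note that in the full output below the error is printed after llama_print_timings, so the bad argument is presumably hit during cleanup rather than during generation itself.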
Output:
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A40) as main device
llama_model_load_internal: mem required = 2165.28 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11314 MB
....................................................................................................
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
the meaning of life is to enjoy it.
Author: Epicurus
62. “Life is a journey, and like all journeys, it must come to an end. But the memories we make along the way will live on forever.”
63. “The biggest adventure you can ever take is to live the life of your dreams.”
Author: Oprah Winfrey
64. “Life is like a camera, focus on the good times, develop from the negatives, and keep shooting.”
Author: Unknown (often attributed to Tommy De Senna)
65. “The purpose of our lives is to be happy.”
Author: Dalai Lama [end of text]
llama_print_timings: load time = 2831.64 ms
llama_print_timings: sample time = 70.88 ms / 144 runs ( 0.49 ms per token)
llama_print_timings: prompt eval time = 184.13 ms / 6 tokens ( 30.69 ms per token)
llama_print_timings: eval time = 6122.05 ms / 143 runs ( 42.81 ms per token)
llama_print_timings: total time = 6421.17 ms
CUDA error 1 at ggml-cuda.cu:2292: invalid argument
Also, prompt eval time with a long prompt became much longer: ~12 ms/token vs ~3 ms/token a few days ago.
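Since this box has 4 GPUs, one way to narrow the crash down (untested here) would be to pin the process to a single device and see whether the error persists:

```sh
# Untested idea: restrict llama.cpp to one GPU to check whether the
# "invalid argument" only appears in the multi-GPU path.
CUDA_VISIBLE_DEVICES=0 ./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048
```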