CUDA runtime error and slow eval #1885

@huichen

Description

Using commit a09f919, compiled with:

make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2 make -j

Running on 4x A40-48G with the command:

./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048

Got this error:

CUDA error 1 at ggml-cuda.cu:2292: invalid argument

Output:

llama_model_load_internal: format     = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =    0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A40) as main device
llama_model_load_internal: mem required  = 2165.28 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11314 MB
....................................................................................................
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0


 the meaning of life is to enjoy it.
Author: Epicurus
62. “Life is a journey, and like all journeys, it must come to an end. But the memories we make along the way will live on forever.”
63. “The biggest adventure you can ever take is to live the life of your dreams.”
Author: Oprah Winfrey
64. “Life is like a camera, focus on the good times, develop from the negatives, and keep shooting.”
Author: Unknown (often attributed to Tommy De Senna)
65. “The purpose of our lives is to be happy.”
Author: Dalai Lama [end of text]

llama_print_timings:        load time =  2831.64 ms
llama_print_timings:      sample time =    70.88 ms /   144 runs   (    0.49 ms per token)
llama_print_timings: prompt eval time =   184.13 ms /     6 tokens (   30.69 ms per token)
llama_print_timings:        eval time =  6122.05 ms /   143 runs   (   42.81 ms per token)
llama_print_timings:       total time =  6421.17 ms
CUDA error 1 at ggml-cuda.cu:2292: invalid argument

Also, prompt eval time with a long prompt became much longer: ~12 ms/token vs. ~3 ms/token a few days ago.
