Using commit a09f919, compiled with:
make clean && LLAMA_CUBLAS=1 LLAMA_CUDA_DMMV_X=64 LLAMA_CUDA_DMMV_Y=2 make -j
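(For context on the two DMMV flags: they set compile-time launch dimensions for the dequantize-mul-mat-vec CUDA kernels. Below is a simplified float sketch of how such constants typically shape a kernel launch; the names and structure are assumptions, not the actual ggml-cuda.cu code.)

```cuda
#include <cuda_runtime.h>

// Sketch (assumption, not ggml-cuda.cu): the build flags become
// compile-time constants that fix the thread-block geometry.
#define GGML_CUDA_DMMV_X 64   // threads striding across each row's columns
#define GGML_CUDA_DMMV_Y 2    // matrix rows handled per thread block

__global__ void mat_vec(const float * x, const float * y, float * dst,
                        int ncols, int nrows) {
    const int row = blockIdx.x * GGML_CUDA_DMMV_Y + threadIdx.y;
    if (row >= nrows) return;

    float sum = 0.0f;
    // each thread covers every GGML_CUDA_DMMV_X-th column of its row
    for (int col = threadIdx.x; col < ncols; col += GGML_CUDA_DMMV_X) {
        sum += x[row * ncols + col] * y[col];
    }
    // naive reduction via atomics (dst must be zeroed first; the real
    // kernels use warp-level reductions instead)
    atomicAdd(&dst[row], sum);
}

void launch_mat_vec(const float * x, const float * y, float * dst,
                    int ncols, int nrows, cudaStream_t stream) {
    const dim3 block(GGML_CUDA_DMMV_X, GGML_CUDA_DMMV_Y, 1);
    const int  grid = (nrows + GGML_CUDA_DMMV_Y - 1) / GGML_CUDA_DMMV_Y;
    mat_vec<<<grid, block, 0, stream>>>(x, y, dst, ncols, nrows);
}
```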
Running with the following command on 4x A40 (48 GB):
./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048
I get the following error:
CUDA error 1 at ggml-cuda.cu:2292: invalid argument
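The message format suggests a standard CUDA error-check macro; here is a minimal sketch of that pattern (an assumption about what ggml-cuda.cu does, not the actual source) that reproduces the same output shape:

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Sketch of the usual check-every-call pattern; ggml-cuda.cu presumably
// uses something equivalent at line 2292.
#define CUDA_CHECK(call)                                                       \
    do {                                                                       \
        cudaError_t err_ = (call);                                             \
        if (err_ != cudaSuccess) {                                             \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",                    \
                    (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_)); \
            exit(1);                                                           \
        }                                                                      \
    } while (0)

int main(void) {
    // cudaErrorInvalidValue is error code 1 and stringifies to
    // "invalid argument"; a NULL memcpy triggers it deliberately.
    CUDA_CHECK(cudaMemcpy(NULL, NULL, 1, cudaMemcpyDeviceToHost));
    return 0;
}
```

Note that in the full output below the error is printed after llama_print_timings, so the bad argument is presumably hit during cleanup rather than during generation itself.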
Output:
llama_model_load_internal: format = ggjt v2 (pre #1508)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 9 (mostly Q5_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.09 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA A40) as main device
llama_model_load_internal: mem required = 2165.28 MB (+ 1608.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 40 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 43/43 layers to GPU
llama_model_load_internal: total VRAM used: 11314 MB
....................................................................................................
llama_init_from_file: kv self size = 1600.00 MB
system_info: n_threads = 8 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = -1, n_keep = 0
the meaning of life is to enjoy it.
Author: Epicurus
62. “Life is a journey, and like all journeys, it must come to an end. But the memories we make along the way will live on forever.”
63. “The biggest adventure you can ever take is to live the life of your dreams.”
Author: Oprah Winfrey
64. “Life is like a camera, focus on the good times, develop from the negatives, and keep shooting.”
Author: Unknown (often attributed to Tommy De Senna)
65. “The purpose of our lives is to be happy.”
Author: Dalai Lama [end of text]
llama_print_timings: load time = 2831.64 ms
llama_print_timings: sample time = 70.88 ms / 144 runs ( 0.49 ms per token)
llama_print_timings: prompt eval time = 184.13 ms / 6 tokens ( 30.69 ms per token)
llama_print_timings: eval time = 6122.05 ms / 143 runs ( 42.81 ms per token)
llama_print_timings: total time = 6421.17 ms
CUDA error 1 at ggml-cuda.cu:2292: invalid argument
Also, prompt eval time with a long prompt became much longer: ~12 ms/token vs ~3 ms/token a few days ago.
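Since this box has 4 GPUs, one way to narrow the crash down (untested here) would be to pin the process to a single device and see whether the error persists:

```sh
# Untested idea: restrict llama.cpp to one GPU to check whether the
# "invalid argument" only appears in the multi-GPU path.
CUDA_VISIBLE_DEVICES=0 ./main -m ggml-vic13b-q5_1.bin -ngl 1000 -p "the meaning of life is" -t 8 -c 2048
```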