Token generation broken on CUDA when offload_kqv is false #4991

Closed
abetlen opened this issue Jan 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@abetlen
Collaborator

abetlen commented Jan 17, 2024

Originally spotted by @iamlemec in abetlen/llama-cpp-python#1089; reproduced with llama.cpp by passing --no_kv_offload to ./main. The bug causes the model to generate repeated #'s instead of a valid completion.
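
For reference, a minimal repro sketch through llama-cpp-python (where it was first spotted); the model path is a placeholder and offload_kqv=False is the setting that triggers it:

# Sketch only: any CUDA-offloaded GGUF model can stand in for the placeholder path.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,     # offload all layers to CUDA
    offload_kqv=False,   # keep the KV cache off the GPU -- this triggers the bug
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])  # on affected builds this is a run of '#' characters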

@JohannesGaessler JohannesGaessler added bug Something isn't working and removed bug-unconfirmed labels Jan 17, 2024
@JohannesGaessler
Collaborator

I can reproduce the bug. It seems to have been introduced in #4766 and only occurs from the second eval onwards. So ./perplexity with defaults will not detect this bug, but ./perplexity -b 256 will.

cc: @slaren
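
To illustrate the "second eval onwards" point, a sketch (assuming the llama-cpp-python API and a placeholder model path) that feeds a prompt in two batches so that a second decode call is issued:

# Sketch: split the prompt so the second eval/decode call hits the broken path.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    offload_kqv=False,
    logits_all=True,   # keep logits for every evaluated token
)

tokens = llm.tokenize(b"The quick brown fox jumps over the lazy dog")
llm.reset()
llm.eval(tokens[: len(tokens) // 2])   # first eval: output still looks sane
llm.eval(tokens[len(tokens) // 2 :])   # second eval: logits go wrong on affected builds
print(llm.scores[llm.n_tokens - 1][:5])  # spot-check a few logits of the last token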

@Artefact2
Collaborator

Duplicate of #4983?

@ggerganov
Owner

Yup, will be looking into this. If anyone has additional insights, such as which models/params work and which do not, that would be helpful.

@thiner

thiner commented Jan 19, 2024

@ggerganov I tried with the model and params below, and it ran into the generation error.

command:
            - python3
            - '-m'
            - llama_cpp.server
          args:
            - '--model'
            - openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf
            - '--chat_format'
            - openbuddy
            - '--tensor_split'
            - '0.5'
            - '0.5'
            - '--n_gpu_layers'
            - '-1'
            - '--n_threads'
            - '16'
            - '--interrupt_requests'
            - 'false'
            - '--n_ctx'
            - '32500'
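
For anyone reproducing this outside the server, roughly the same settings through the Python API look like the sketch below (the split/context values are copied from the args above, not verified, and interrupt_requests is a server-only option):

# Rough Python-API equivalent of the server args above (sketch, not verified).
from llama_cpp import Llama

llm = Llama(
    model_path="openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf",
    chat_format="openbuddy",
    tensor_split=[0.5, 0.5],  # split the weights across two GPUs
    n_gpu_layers=-1,
    n_threads=16,
    n_ctx=32500,
)
# interrupt_requests has no Llama() counterpart; it only affects the server.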

@iamlemec
Collaborator

iamlemec commented Jan 19, 2024

Hopefully some useful info: this occurs on every type of model I've tried (llama, mistral, mixtral, even llava), regardless of quantization.

When running a modified simple program with offload_kqv = false, very short prompts work. At some point you hit a length where the logit value becomes different from the offload_kqv = true case. For any prompt length strictly higher than this cutoff, the logit value is simply nan. The cutoff value varies from model to model: for llama-7b and llama-13b it's 16, and for mistral it's 4.

Looking at different values of n_gpu_layers with longer prompts and offload_kqv = false: if the model is fully offloaded you get nan; with anything less than full offload you get 0.0 for the most part, unless ngl=1, in which case you get a non-zero but incorrect number.
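
A sketch of the kind of comparison described above, assuming the llama-cpp-python API and a placeholder model path: evaluate the same prompt with the KV cache offloaded and not, then compare the last-token logits.

# Sketch: compare last-token logits with offload_kqv on vs. off.
import numpy as np
from llama_cpp import Llama

def last_logits(offload_kqv: bool, prompt: bytes) -> np.ndarray:
    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,
        offload_kqv=offload_kqv,
        logits_all=True,
    )
    llm.eval(llm.tokenize(prompt))
    return np.array(llm.scores[llm.n_tokens - 1])

prompt = b"Once upon a time, in a land far away, there lived a"
good = last_logits(True, prompt)
bad = last_logits(False, prompt)
print("nan with offload_kqv=False:", bool(np.isnan(bad).any()))
print("max abs logit diff:", float(np.abs(good - bad).max()))  # nan here also signals the bug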

@slaren
Collaborator

slaren commented Jan 20, 2024

This should have been fixed in #5049 (already merged); please let me know if you find any other issues.

@abetlen
Collaborator Author

abetlen commented Jan 25, 2024

Forgot to close this, it works, thanks for the fix!

@abetlen abetlen closed this as completed Jan 25, 2024