Token generation broken on CUDA when offload_kqv is false #4991

Closed
abetlen opened this issue Jan 17, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@abetlen
Collaborator

abetlen commented Jan 17, 2024

Originally spotted by @iamlemec in abetlen/llama-cpp-python#1089; reproduced with llama.cpp by passing --no_kv_offload to ./main. The bug causes the model to generate repeated #'s instead of a valid completion.
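
For reference, a minimal repro sketch through llama-cpp-python (where it was first spotted); the model path is a placeholder and offload_kqv=False is the setting that triggers it:

# Sketch only: any CUDA-offloaded GGUF model can stand in for the placeholder path.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,     # offload all layers to CUDA
    offload_kqv=False,   # keep the KV cache off the GPU -- this triggers the bug
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])  # on affected builds this is a run of '#' characters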

@JohannesGaessler JohannesGaessler added bug Something isn't working and removed bug-unconfirmed labels Jan 17, 2024
@JohannesGaessler
Collaborator

I can reproduce the bug. It seems to have been introduced in #4766 and only occurs from the second eval onwards. So ./perplexity with defaults will not detect this bug, but ./perplexity -b 256 will.

cc: @slaren
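
To illustrate the "second eval onwards" point, a sketch (assuming the llama-cpp-python API and a placeholder model path) that feeds a prompt in two batches so that a second decode call is issued:

# Sketch: split the prompt so the second eval/decode call hits the broken path.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,
    offload_kqv=False,
    logits_all=True,   # keep logits for every evaluated token
)

tokens = llm.tokenize(b"The quick brown fox jumps over the lazy dog")
llm.reset()
llm.eval(tokens[: len(tokens) // 2])   # first eval: output still looks sane
llm.eval(tokens[len(tokens) // 2 :])   # second eval: logits go wrong on affected builds
print(llm.scores[llm.n_tokens - 1][:5])  # spot-check a few logits of the last token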

@Artefact2
Collaborator

Duplicate of #4983?

@ggerganov
Owner

Yup, will be looking into this. If anyone has additional insights, such as which models/params work and which do not, that would be helpful.

@thiner

thiner commented Jan 19, 2024

@ggerganov I tried with the model and params below, and it ran into the generation error.

command:
            - python3
            - '-m'
            - llama_cpp.server
          args:
            - '--model'
            - openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf
            - '--chat_format'
            - openbuddy
            - '--tensor_split'
            - '0.5'
            - '0.5'
            - '--n_gpu_layers'
            - '-1'
            - '--n_threads'
            - '16'
            - '--interrupt_requests'
            - 'false'
            - '--n_ctx'
            - '32500'
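
For anyone reproducing this outside the server, roughly the same settings through the Python API look like the sketch below (the split/context values are copied from the args above, not verified, and interrupt_requests is a server-only option):

# Rough Python-API equivalent of the server args above (sketch, not verified).
from llama_cpp import Llama

llm = Llama(
    model_path="openbuddy-mixtral-7bx8-v16.3-32k.Q5_K_M.gguf",
    chat_format="openbuddy",
    tensor_split=[0.5, 0.5],  # split the weights across two GPUs
    n_gpu_layers=-1,
    n_threads=16,
    n_ctx=32500,
)
# interrupt_requests has no Llama() counterpart; it only affects the server.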

@iamlemec
Collaborator

iamlemec commented Jan 19, 2024

Hopefully some useful info: this occurs on every type of model I've tried (llama, mistral, mixtral, even llava), regardless of quantization.

When running a modified simple program with offload_kqv = false, very short prompts work. At some point you hit a length where the logit value becomes different from the offload_kqv = true case. For any prompt length strictly higher than this cutoff, the logit value is simply nan. The cutoff value varies from model to model: for llama-7b and llama-13b it's 16, and for mistral it's 4.

Looking at different values of n_gpu_layers with longer prompts and offload_kqv = false: if the model is fully offloaded you get nan; with anything less than full offload you get 0.0 for the most part, unless ngl=1, in which case you get a non-zero but incorrect number.
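
A sketch of the kind of comparison described above, assuming the llama-cpp-python API and a placeholder model path: evaluate the same prompt with the KV cache offloaded and not, then compare the last-token logits.

# Sketch: compare last-token logits with offload_kqv on vs. off.
import numpy as np
from llama_cpp import Llama

def last_logits(offload_kqv: bool, prompt: bytes) -> np.ndarray:
    llm = Llama(
        model_path="./models/llama-2-7b.Q4_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,
        offload_kqv=offload_kqv,
        logits_all=True,
    )
    llm.eval(llm.tokenize(prompt))
    return np.array(llm.scores[llm.n_tokens - 1])

prompt = b"Once upon a time, in a land far away, there lived a"
good = last_logits(True, prompt)
bad = last_logits(False, prompt)
print("nan with offload_kqv=False:", bool(np.isnan(bad).any()))
print("max abs logit diff:", float(np.abs(good - bad).max()))  # nan here also signals the bug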

@slaren
Collaborator

slaren commented Jan 20, 2024

This should have been fixed in #5049 (already merged); please let me know if you find any other issues.

@abetlen
Collaborator Author

abetlen commented Jan 25, 2024

Forgot to close this, it works, thanks for the fix!

@abetlen abetlen closed this as completed Jan 25, 2024