Labels
CUDA (Related to the CUDA backend)
bug (Something isn't working)
medium severity (Used to report medium severity bugs in llama.cpp, e.g. Malfunctioning Features but still useable)
Description
Name and Version
$ llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 7122 (21d31e081)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
$ llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
version: 7122 (21d31e081)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
(does NOT happen on CPU)
Hardware
RTX 5090
Models
IBM Granite 4.0-h-1b, several quants (including BF16 GGUF)
Problem description & steps to reproduce
When I run inference at a 32k context (via a simple script that generates a needle-in-a-haystack (NIAH) text), the model returns "???????" or "G".
The same script at a 16k context works fine with the model.
The same script with the same llama.cpp build and Granite-4-h-micro (3.5B) works fine at a 32k context. The same script with a CPU build of llama.cpp and Granite-4-h-1b also works fine at a 32k context.
Inferring on similar content with Transformers and Granite-4-h-1b (the original safetensors version) works as well. A minimal sketch of the repro script follows below.
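For reference, here is a hedged sketch of the kind of NIAH repro script used; it is not the exact script from the report. It assumes llama-server is already running locally on port 8080 with a 32k context, and the needle text, filler text, and tokens-per-word heuristic are all illustrative placeholders.

```python
# Hypothetical NIAH repro sketch. Assumes llama-server was started
# with something like:
#   llama-server -m granite-4.0-h-1b.gguf -c 32768 -ngl 99 --port 8080
# The needle, filler, and sizing heuristic below are illustrative.
import requests

NEEDLE = "The secret passphrase is BLUE-HARBOR-42."
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(approx_tokens: int) -> str:
    # Rough heuristic (assumption): ~1.3 tokens per filler word.
    n_words = int(approx_tokens / 1.3)
    words = (FILLER * (n_words // 9 + 1)).split()[:n_words]
    # Bury the needle roughly in the middle of the haystack.
    mid = len(words) // 2
    return " ".join(words[:mid]) + "\n" + NEEDLE + "\n" + " ".join(words[mid:])

prompt = (
    build_haystack(30000)  # leave headroom below the 32k context limit
    + "\n\nWhat is the secret passphrase mentioned above?"
)

# llama-server exposes an OpenAI-compatible chat completions endpoint.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
# On the CUDA build at 32k the output is "???????" or "G";
# the CPU build answers with the passphrase.
```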
First Bad Commit
No response
Relevant log output
(not sure what to paste - no error message is displayed)