Name and Version
./llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-alderlake.so
version: 6692 (ca71fb9)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
NVIDIA GeForce RTX 4090
Models
ggml-org/granite-docling-258M-GGUF
Problem description & steps to reproduce
First of all, it's fantastic to have docling accessible with llama.cpp! Thank you!
Unfortunately, it's not really working for me. I know this release is just a few hours old, but I thought I should let you know about my experience.
The F16 model generates pure garbage ([[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[....]) and Q8 doesn't do much better either (e.g. "Body refers to nagging as the fall of the left trunk, the right trunk, and the middle trunk being clear." is what it comes up with when processing https://huggingface.co/ibm-granite/granite-docling-258M/blob/main/assets/new_arxiv.png).
Are there any specific settings I need to get it to work properly? Here are mine:
docker run --gpus all -v /home/user/models_test/gguf:/models -p 5050:5050 local/llama.cpp:server-cuda -m /models/granite-docling-258M-f16.gguf --mmproj /models/mmproj-granite-docling-258M-f16.gguf --n-gpu-layers 999 -c 8192 --host 0.0.0.0 --port 5050 --temp 0.6 --top-p 0.9 --top-k 1000 --min-p 0.01
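For reference, the request goes to llama-server's OpenAI-compatible /v1/chat/completions endpoint. A minimal reproduction sketch along these lines (the prompt text and file name are illustrative; llama-server accepts images as base64 data URIs in image_url content parts when an --mmproj is loaded):

# send the test page plus a text prompt to the running container
curl http://localhost:5050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "Convert this page to docling."},
        {"type": "image_url",
         "image_url": {"url": "data:image/png;base64,'"$(base64 -w0 new_arxiv.png)"'"}}
      ]
    }]
  }'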
First Bad Commit
No response
Relevant log output
common_sampler_types_from_names: unable to match sampler by name 'edkypmxt'
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 0 | task 2 | processing task
slot update_slots: id 0 | task 2 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 1126
slot update_slots: id 0 | task 2 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 259, n_tokens = 259, progress = 0.230018
slot update_slots: id 0 | task 2 | n_past = 259, memory_seq_rm [259, end)
srv process_chun: processing image...
srv process_chun: image processed in 252 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 325, n_tokens = 2, progress = 0.288632
slot update_slots: id 0 | task 2 | n_past = 325, memory_seq_rm [325, end)
srv process_chun: processing image...
srv process_chun: image processed in 59 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 391, n_tokens = 2, progress = 0.347247
slot update_slots: id 0 | task 2 | n_past = 391, memory_seq_rm [391, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 458, n_tokens = 3, progress = 0.406750
slot update_slots: id 0 | task 2 | n_past = 458, memory_seq_rm [458, end)
srv process_chun: processing image...
srv process_chun: image processed in 14 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 524, n_tokens = 2, progress = 0.465364
slot update_slots: id 0 | task 2 | n_past = 524, memory_seq_rm [524, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 590, n_tokens = 2, progress = 0.523979
slot update_slots: id 0 | task 2 | n_past = 590, memory_seq_rm [590, end)
srv process_chun: processing image...
srv process_chun: image processed in 14 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 657, n_tokens = 3, progress = 0.583481
slot update_slots: id 0 | task 2 | n_past = 657, memory_seq_rm [657, end)
srv process_chun: processing image...
srv process_chun: image processed in 15 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 723, n_tokens = 2, progress = 0.642096
slot update_slots: id 0 | task 2 | n_past = 723, memory_seq_rm [723, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 789, n_tokens = 2, progress = 0.700710
slot update_slots: id 0 | task 2 | n_past = 789, memory_seq_rm [789, end)
srv process_chun: processing image...
srv process_chun: image processed in 12 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 856, n_tokens = 3, progress = 0.760213
slot update_slots: id 0 | task 2 | n_past = 856, memory_seq_rm [856, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 922, n_tokens = 2, progress = 0.818828
slot update_slots: id 0 | task 2 | n_past = 922, memory_seq_rm [922, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 988, n_tokens = 2, progress = 0.877442
slot update_slots: id 0 | task 2 | n_past = 988, memory_seq_rm [988, end)
srv process_chun: processing image...
srv process_chun: image processed in 13 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 1055, n_tokens = 3, progress = 0.936945
slot update_slots: id 0 | task 2 | n_past = 1055, memory_seq_rm [1055, end)
srv process_chun: processing image...
srv process_chun: image processed in 12 ms
slot update_slots: id 0 | task 2 | prompt processing progress, n_past = 1126, n_tokens = 7, progress = 1.000000
slot update_slots: id 0 | task 2 | prompt done, n_past = 1126, n_tokens = 7
srv log_server_r: request: GET /slots 172.17.0.1 200
slot process_toke: id 0 | task 2 | n_predict (-1) is set for infinite generation. Limiting generated tokens to n_ctx_train (8192) to avoid EOS-less generation infinite loop
slot release: id 0 | task 2 | stop processing: n_past = 8191, truncated = 1
slot print_timing: id 0 | task 2 |
prompt eval time = 716.27 ms / 1126 tokens ( 0.64 ms per token, 1572.04 tokens per second)
eval time = 16400.78 ms / 7066 tokens ( 2.32 ms per token, 430.83 tokens per second)
total time = 17117.05 ms / 8192 tokens