
Eval bug: IBM Granite Docling goes in loop. #16678

@engrtipusultan

Description

Name and Version

bash  llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 6800 (0398752)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

bash  hostnamectl
Operating System: Ubuntu 24.04.3 LTS
Kernel: Linux 6.14.0-33-generic
Architecture: x86-64
Hardware Vendor: GMKtec
Hardware Model: M5 PLUS
Firmware Version: M5 PLUS 1.03

Device Name AMD Radeon Graphics
PCI (domain:bus:dev.func) 0000:03:00.0
DeviceID:RevID 0x15E7.0xC1
OpenGL Driver Version Mesa 25.2.5 - kisak-mesa PPA
gfx_target_version gfx90c

GPU Type APU
Family Raven (RV)
ASIC Name Renoir
Chip Class GFX9
Shader Engine (SE) 1
Shader Array (SA/SH) per SE 1
CU per SA 8
Total CU 8
RenderBackendPlus (RB+) 2 (16 ROPs)
Peak Pixel Fill-Rate 32 GP/s
GPU Clock 200-2000 MHz
Peak FP32 2048 GFLOPS

VRAM Type DDR4
VRAM Bit Width 128-bit
VRAM Size 16384 MiB
Memory Clock 400-1333 MHz
ResizableBAR Enabled
ECC Memory Not Supported

bash  vulkaninfo --summary
WARNING: [Loader Message] Code 0 : terminator_CreateInstance: Received return code -3 from call to vkCreateInstance in ICD /usr/lib/x86_64-linux-gnu/libvulkan_dzn.so. Skipping this driver.
==========
VULKANINFO
==========

Vulkan Instance Version: 1.4.313


Devices:
========
GPU0:
	apiVersion         = 1.4.318
	driverVersion      = 25.2.5
	vendorID           = 0x1002
	deviceID           = 0x15e7
	deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
	deviceName         = AMD Radeon Graphics (RADV RENOIR)
	driverID           = DRIVER_ID_MESA_RADV
	driverName         = radv
	driverInfo         = Mesa 25.2.5 - kisak-mesa PPA
	conformanceVersion = 1.4.0.0
	deviceUUID         = 00000000-0300-0000-0000-000000000000
	driverUUID         = 414d442d-4d45-5341-2d44-525600000000
GPU1:
	apiVersion         = 1.4.318
	driverVersion      = 25.2.5
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 20.1.8, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 25.2.5 - kisak-mesa PPA (LLVM 20.1.8)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3235-2e32-2e35-202d206b6900
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

Models

https://huggingface.co/ggml-org/granite-docling-258M-GGUF

- name: granite-docling
  model_path: /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf
  model_params: |
    --port 8888
    --api-key 12345
    --mmproj /home/tipu/AI/models/ggml-org/Granite_docling/mmproj-granite-docling-258M-f16.gguf
    --n-predict -1
    --ctx-size 16384
    --n-gpu-layers 99
    --jinja
    --repeat-penalty 1.0
    --temp 0.0
    --top-k 0
    --top-p 1.0
    --alias granite-docling
    --mlock
    --seed -1
    --swa-full
    --no-escape
    --no-mmap

Problem description & steps to reproduce

When I run llama-server with the configuration above and try to perform OCR on an invoice image, the output gets stuck in a loop.
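The request is a standard OpenAI-compatible chat completion with the page image passed as a base64 data URI. My exact client does not matter; here is a minimal curl sketch of the kind of request I send (using the port and API key from the config above; invoice.png stands in for the attached test page, and the prompt text is only illustrative):

# minimal reproduction sketch, not a verbatim copy of my client
IMG=$(base64 -w0 invoice.png)    # invoice.png = the attached test page
curl http://127.0.0.1:8888/v1/chat/completions \
  -H "Authorization: Bearer 12345" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-docling",
        "temperature": 0,
        "messages": [{
          "role": "user",
          "content": [
            {"type": "text", "text": "Convert this page to docling."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}}
          ]
        }]
      }'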

Starting model: granite-docling
Model path: /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf
Changing directory to: /home/tipu/Applications/llamacpp/
Executing: ./llama-server -m /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf --port 8888 --api-key 12345 --mmproj /home/tipu/AI/models/ggml-org/Granite_docling/mmproj-granite-docling-258M-f16.gguf --n-predict -1 --ctx-size 16384 --n-gpu-layers 99 --jinja --repeat-penalty 1.0 --temp 0.0 --top-k 0 --top-p 1.0 --alias granite-docling --mlock --seed -1 --swa-full --no-escape --no-mmap
Press Ctrl+C to stop the server and return to the menu.

load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
build: 6800 (0398752d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8888, http threads: 15
main: loading model
srv    load_model: loading model '/home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV RENOIR)) (0000:03:00.0) - 37487 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 272 tensors from /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 164M
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                      general.dataset.count u32              = 4
llama_model_loader: - kv   5:                     general.dataset.0.name str              = SynthCodeNet
llama_model_loader: - kv   6:             general.dataset.0.organization str              = Ds4Sd
llama_model_loader: - kv   7:                 general.dataset.0.repo_url str              = https://huggingface.co/ds4sd/SynthCod...
llama_model_loader: - kv   8:                     general.dataset.1.name str              = SynthFormulaNet
llama_model_loader: - kv   9:             general.dataset.1.organization str              = Ds4Sd
llama_model_loader: - kv  10:                 general.dataset.1.repo_url str              = https://huggingface.co/ds4sd/SynthFor...
llama_model_loader: - kv  11:                     general.dataset.2.name str              = SynthChartNet
llama_model_loader: - kv  12:             general.dataset.2.organization str              = Ds4Sd
llama_model_loader: - kv  13:                 general.dataset.2.repo_url str              = https://huggingface.co/ds4sd/SynthCha...
llama_model_loader: - kv  14:                     general.dataset.3.name str              = DoclingMatix
llama_model_loader: - kv  15:             general.dataset.3.organization str              = HuggingFaceM4
llama_model_loader: - kv  16:                 general.dataset.3.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  17:                               general.tags arr[str,14]      = ["text-generation", "documents", "cod...
llama_model_loader: - kv  18:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  19:                          llama.block_count u32              = 30
llama_model_loader: - kv  20:                       llama.context_length u32              = 8192
llama_model_loader: - kv  21:                     llama.embedding_length u32              = 576
llama_model_loader: - kv  22:                  llama.feed_forward_length u32              = 1536
llama_model_loader: - kv  23:                 llama.attention.head_count u32              = 9
llama_model_loader: - kv  24:              llama.attention.head_count_kv u32              = 3
llama_model_loader: - kv  25:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  26:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  27:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  28:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  29:                          general.file_type u32              = 1
llama_model_loader: - kv  30:                           llama.vocab_size u32              = 100352
llama_model_loader: - kv  31:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = granite-docling
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 100264
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 100338
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 100257
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- for message in messages -%}\n{{- '...
llama_model_loader: - kv  44:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type  f16:  211 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 312.88 MiB (16.00 BPW) 
load: printing all EOG tokens:
load:   - 100257 ('<|end_of_text|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 576
print_info: n_layer          = 30
print_info: n_head           = 9
print_info: n_head_kv        = 3
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 192
print_info: n_embd_v_gqa     = 192
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1536
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 256M
print_info: model params     = 164.01 M
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100264 '<|start_of_role|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100338 '<|unk|>'
print_info: PAD token        = 100257 '<|end_of_text|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 31/31 layers to GPU
load_tensors:      Vulkan0 model buffer size =   312.88 MiB
load_tensors:  Vulkan_Host model buffer size =   110.25 MiB
.................................................
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) > n_ctx_train (8192) -- possible training context overflow
llama_context: Vulkan_Host  output buffer size =     0.38 MiB
llama_kv_cache:    Vulkan0 KV buffer size =   360.00 MiB
llama_kv_cache: size =  360.00 MiB ( 16384 cells,  30 layers,  1/1 seqs), K (f16):  180.00 MiB, V (f16):  180.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:    Vulkan0 compute buffer size =   197.12 MiB
llama_context: Vulkan_Host compute buffer size =    33.14 MiB
llama_context: graph nodes  = 937
llama_context: graph splits = 2
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         36

clip_model_loader: has vision encoder
clip_ctx: CLIP using Vulkan0 backend
load_hparams: projector:          idefics3
load_hparams: n_embd:             768
load_hparams: n_head:             12
load_hparams: n_ff:               3072
load_hparams: n_layer:            12
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     576

--- vision hparams ---
load_hparams: image_size:         512
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0

load_hparams: model size:         181.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:    Vulkan0 compute buffer size =    60.00 MiB
alloc_compute_meta:        CPU compute buffer size =     3.00 MiB
srv    load_model: loaded multimodal model, '/home/tipu/AI/models/ggml-org/Granite_docling/mmproj-granite-docling-258M-f16.gguf'
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 16384
srv          init: prompt cache is enabled, size limit: 8192 MiB
srv          init: use `--cache-ram 0` to disable the prompt cache
srv          init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: thinking = 0
main: model loaded
main: chat template, chat_template: {%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '<image>' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
, example_format: '<|start_of_role|>system<|end_of_role|>You are a helpful assistant<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Hello<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>Hi there<|end_of_text|>
<|start_of_role|>user<|end_of_role|>How are you?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>'
main: server is listening on http://127.0.0.1:8888 - starting the main loop
srv  update_slots: all slots are idle

Output:
<loc_36>loc_34>loc_463>loc_40>Invoice Number: 001
<loc_36>loc_43>loc_463>loc_50>Invoice Number: 001
<loc_36>loc_57>loc_463><loc_67>Invoice Number: 001
<loc_36>loc_63><loc_463><loc_76>Invoice Number: 001
<loc_36><loc_70><loc_463><loc_81>Invoice Number: 001
<loc_36><loc_84><loc_463><loc_104>Invoice Number: 001 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 2025 20

Output from Nanonets:
Invoice

Invoice Number: INV-20250609
Date: June 9, 2025

Bill To: Souvik Mandal
123 Business Street
Kolkata, India

Item Description Quantity Unit Price Total
001 Consulting Services 10 ₹2000 ₹20,000
002 Design Work 5 ₹1500 ₹7,500
Grand Total ₹27,500

Thank you for your business!
Payment was received on June 7, 2025

Image

First Bad Commit

Not sure.
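I have not bisected. If there is an older build where this model did not loop, a rough sketch of how I could narrow it down (assuming a CMake Vulkan build of llama.cpp; the "good" commit below is a placeholder, not something I have verified):

# hedged bisect sketch - <known-good-commit> is a placeholder
git bisect start
git bisect bad 0398752d                # build 6800, where the looping occurs
git bisect good <known-good-commit>    # an earlier commit where output terminated normally
# at each bisect step, rebuild and re-run the same OCR request:
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
./build/bin/llama-server -m granite-docling-258M-f16.gguf --mmproj mmproj-granite-docling-258M-f16.gguf --jinja --temp 0.0 --top-k 0 --top-p 1.0
# then mark the step: git bisect good (output stops) or git bisect bad (output loops)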

Relevant log output

bash  llama-server -m /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf --port 8888 --api-key 12345 --mmproj /home/tipu/AI/models/ggml-org/Granite_docling/mmproj-granite-docling-258M-f16.gguf --n-predict -1 --ctx-size 16384 --n-gpu-layers 99 --jinja --repeat-penalty 1.0 --temp 0.0 --top-k 0 --top-p 1.0 --alias granite-docling --mlock --seed -1 --swa-full --no-escape --no-mmap -lv 1 
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
build: 6800 (0398752d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8888, http threads: 15
main: loading model
srv    load_model: loading model '/home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV RENOIR)) (0000:03:00.0) - 36602 MiB free
llama_model_loader: loaded meta data with 45 key-value pairs and 272 tensors from /home/tipu/AI/models/ggml-org/Granite_docling/granite-docling-258M-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 164M
llama_model_loader: - kv   3:                            general.license str              = apache-2.0
llama_model_loader: - kv   4:                      general.dataset.count u32              = 4
llama_model_loader: - kv   5:                     general.dataset.0.name str              = SynthCodeNet
llama_model_loader: - kv   6:             general.dataset.0.organization str              = Ds4Sd
llama_model_loader: - kv   7:                 general.dataset.0.repo_url str              = https://huggingface.co/ds4sd/SynthCod...
llama_model_loader: - kv   8:                     general.dataset.1.name str              = SynthFormulaNet
llama_model_loader: - kv   9:             general.dataset.1.organization str              = Ds4Sd
llama_model_loader: - kv  10:                 general.dataset.1.repo_url str              = https://huggingface.co/ds4sd/SynthFor...
llama_model_loader: - kv  11:                     general.dataset.2.name str              = SynthChartNet
llama_model_loader: - kv  12:             general.dataset.2.organization str              = Ds4Sd
llama_model_loader: - kv  13:                 general.dataset.2.repo_url str              = https://huggingface.co/ds4sd/SynthCha...
llama_model_loader: - kv  14:                     general.dataset.3.name str              = DoclingMatix
llama_model_loader: - kv  15:             general.dataset.3.organization str              = HuggingFaceM4
llama_model_loader: - kv  16:                 general.dataset.3.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  17:                               general.tags arr[str,14]      = ["text-generation", "documents", "cod...
llama_model_loader: - kv  18:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  19:                          llama.block_count u32              = 30
llama_model_loader: - kv  20:                       llama.context_length u32              = 8192
llama_model_loader: - kv  21:                     llama.embedding_length u32              = 576
llama_model_loader: - kv  22:                  llama.feed_forward_length u32              = 1536
llama_model_loader: - kv  23:                 llama.attention.head_count u32              = 9
llama_model_loader: - kv  24:              llama.attention.head_count_kv u32              = 3
llama_model_loader: - kv  25:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  26:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  27:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  28:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  29:                          general.file_type u32              = 1
llama_model_loader: - kv  30:                           llama.vocab_size u32              = 100352
llama_model_loader: - kv  31:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = granite-docling
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.bos_token_id u32              = 100264
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  40:            tokenizer.ggml.unknown_token_id u32              = 100338
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 100257
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- for message in messages -%}\n{{- '...
llama_model_loader: - kv  44:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - type  f32:   61 tensors
llama_model_loader: - type  f16:  211 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 312.88 MiB (16.00 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 100351 '</code>' is not marked as EOG
load: control token: 100345 '<row_3_col_4>' is not marked as EOG
load: control token: 100344 '<row_3_col_3>' is not marked as EOG
load: control token: 100342 '<row_3_col_1>' is not marked as EOG
load: control token: 100340 '<global-img>' is not marked as EOG
load: control token: 100339 '<fake_token_around_image>' is not marked as EOG
load: control token: 100337 '<rhed>' is not marked as EOG
load: control token: 100336 '<ched>' is not marked as EOG
load: control token: 100335 '<nl>' is not marked as EOG
load: control token: 100333 '<ucel>' is not marked as EOG
load: control token: 100332 '<lcel>' is not marked as EOG
load: control token: 100329 '<rec_' is not marked as EOG
load: control token: 100324 '<unordered_list>' is not marked as EOG
load: control token: 100320 '<references>' is not marked as EOG
load: control token: 100319 '</paragraph>' is not marked as EOG
load: control token: 100317 '</text>' is not marked as EOG
load: control token: 100312 '<chart>' is not marked as EOG
load: control token: 100310 '</value_' is not marked as EOG
load: control token: 100309 '<value_' is not marked as EOG
load: control token: 100305 '<key_value_region>' is not marked as EOG
load: control token: 100304 '</form>' is not marked as EOG
load: control token: 100303 '<form>' is not marked as EOG
load: control token: 100300 '</checkbox_selected>' is not marked as EOG
load: control token: 100298 '</otsl>' is not marked as EOG
load: control token: 100296 '</section_header_level_6>' is not marked as EOG
load: control token: 100293 '<section_header_level_5>' is not marked as EOG
load: control token: 100291 '<section_header_level_4>' is not marked as EOG
load: control token: 100290 '</section_header_level_3>' is not marked as EOG
load: control token: 100288 '</section_header_level_2>' is not marked as EOG
load: control token: 100287 '<section_header_level_2>' is not marked as EOG
load: control token: 100286 '</section_header_level_1>' is not marked as EOG
load: control token: 100285 '<section_header_level_1>' is not marked as EOG
load: control token: 100284 '</picture>' is not marked as EOG
load: control token: 100283 '<picture>' is not marked as EOG
load: control token: 100281 '<page_header>' is not marked as EOG
load: control token: 100277 '<list_item>' is not marked as EOG
load: control token: 100276 '</formula>' is not marked as EOG
load: control token: 100274 '</footnote>' is not marked as EOG
load: control token: 100273 '<footnote>' is not marked as EOG
load: control token: 100272 '</caption>' is not marked as EOG
load: control token: 100269 '<title>' is not marked as EOG
load: control token: 100268 '<row_2_col_3>' is not marked as EOG
load: control token: 100266 '</title>' is not marked as EOG
load: control token: 100265 '<|end_of_role|>' is not marked as EOG
load: control token: 100263 '<row_2_col_1>' is not marked as EOG
load: control token: 100262 '<row_1_col_4>' is not marked as EOG
load: control token: 100256 '<|pad|>' is not marked as EOG
load: control token: 100343 '<row_3_col_2>' is not marked as EOG
load: control token: 100315 '<smiles>' is not marked as EOG
load: control token: 100347 '<row_4_col_2>' is not marked as EOG
load: control token: 100316 '</smiles>' is not marked as EOG
load: control token: 100321 '</references>' is not marked as EOG
load: control token: 100326 '<group>' is not marked as EOG
load: control token: 100275 '<formula>' is not marked as EOG
load: control token: 100292 '</section_header_level_4>' is not marked as EOG
load: control token: 100259 '<row_1_col_2>' is not marked as EOG
load: control token: 100294 '</section_header_level_5>' is not marked as EOG
load: control token: 100301 '<checkbox_unselected>' is not marked as EOG
load: control token: 100279 '<page_footer>' is not marked as EOG
load: control token: 100270 '<image>' is not marked as EOG
load: control token: 100295 '<section_header_level_6>' is not marked as EOG
load: control token: 100261 '<row_1_col_3>' is not marked as EOG
load: control token: 100334 '<xcel>' is not marked as EOG
load: control token: 100264 '<|start_of_role|>' is not marked as EOG
load: control token: 100348 '<row_4_col_3>' is not marked as EOG
load: control token: 100327 '<doctag>' is not marked as EOG
load: control token: 100306 '</key_value_region>' is not marked as EOG
load: control token: 100330 '<fcel>' is not marked as EOG
load: control token: 100271 '<caption>' is not marked as EOG
load: control token: 100308 '</key_' is not marked as EOG
load: control token: 100278 '</list_item>' is not marked as EOG
load: control token: 100258 '<row_1_col_1>' is not marked as EOG
load: control token: 100331 '<ecel>' is not marked as EOG
load: control token: 100289 '<section_header_level_3>' is not marked as EOG
load: control token: 100323 '</ordered_list>' is not marked as EOG
load: control token: 100280 '</page_footer>' is not marked as EOG
load: control token: 100338 '<|unk|>' is not marked as EOG
load: control token: 100346 '<row_4_col_1>' is not marked as EOG
load: control token: 100297 '<otsl>' is not marked as EOG
load: control token: 100311 '<link_' is not marked as EOG
load: control token: 100307 '<key_' is not marked as EOG
load: control token: 100328 '</doctag>' is not marked as EOG
load: control token: 100267 '<row_2_col_2>' is not marked as EOG
load: control token: 100260 '<text>' is not marked as EOG
load: control token: 100341 '<row_2_col_4>' is not marked as EOG
load: control token: 100318 '<paragraph>' is not marked as EOG
load: control token: 100314 '<page_break>' is not marked as EOG
load: control token: 100299 '<checkbox_selected>' is not marked as EOG
load: control token: 100313 '</chart>' is not marked as EOG
load: control token: 100282 '</page_header>' is not marked as EOG
load: control token: 100322 '<ordered_list>' is not marked as EOG
load: control token: 100302 '</checkbox_unselected>' is not marked as EOG
load: control token: 100325 '</unordered_list>' is not marked as EOG
load: control token: 100349 '<row_4_col_4>' is not marked as EOG
load: control token: 100350 '<code>' is not marked as EOG
load: printing all EOG tokens:
load:   - 100257 ('<|end_of_text|>')
load: special tokens cache size = 96
load: token to piece cache size = 0.6152 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 576
print_info: n_layer          = 30
print_info: n_head           = 9
print_info: n_head_kv        = 3
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 192
print_info: n_embd_v_gqa     = 192
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 1536
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: model type       = 256M
print_info: model params     = 164.01 M
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 100352
print_info: n_merges         = 100000
print_info: BOS token        = 100264 '<|start_of_role|>'
print_info: EOS token        = 100257 '<|end_of_text|>'
print_info: EOT token        = 100257 '<|end_of_text|>'
print_info: UNK token        = 100338 '<|unk|>'
print_info: PAD token        = 100257 '<|end_of_text|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 100257 '<|end_of_text|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: layer   0 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   1 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   2 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   3 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   4 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   5 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   6 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   7 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   8 assigned to device Vulkan0, is_swa = 0
load_tensors: layer   9 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  10 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  11 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  12 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  13 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  14 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  15 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  16 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  17 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  18 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  19 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  20 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  21 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  22 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  23 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  24 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  25 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  26 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  27 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  28 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  29 assigned to device Vulkan0, is_swa = 0
load_tensors: layer  30 assigned to device Vulkan0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate.weight
create_tensor: loading tensor blk.28.ffn_down.weight
create_tensor: loading tensor blk.28.ffn_up.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_v.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate.weight
create_tensor: loading tensor blk.29.ffn_down.weight
create_tensor: loading tensor blk.29.ffn_up.weight
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 31/31 layers to GPU
load_tensors:      Vulkan0 model buffer size =   312.88 MiB
load_tensors:  Vulkan_Host model buffer size =   110.25 MiB
load_all_data: device Vulkan0 does not support async, host buffers or events
................................................load_all_data: buffer type Vulkan_Host is not the default buffer type for device Vulkan0 for async uploads
.
llama_init_from_model: model default pooling_type is [0], but [-1] was specified
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (16384) > n_ctx_train (8192) -- possible training context overflow
set_abort_callback: call
llama_context: Vulkan_Host  output buffer size =     0.38 MiB
create_memory: n_ctx = 16384 (padded)
llama_kv_cache: layer   0: dev = Vulkan0
llama_kv_cache: layer   1: dev = Vulkan0
llama_kv_cache: layer   2: dev = Vulkan0
llama_kv_cache: layer   3: dev = Vulkan0
llama_kv_cache: layer   4: dev = Vulkan0
llama_kv_cache: layer   5: dev = Vulkan0
llama_kv_cache: layer   6: dev = Vulkan0
llama_kv_cache: layer   7: dev = Vulkan0
llama_kv_cache: layer   8: dev = Vulkan0
llama_kv_cache: layer   9: dev = Vulkan0
llama_kv_cache: layer  10: dev = Vulkan0
llama_kv_cache: layer  11: dev = Vulkan0
llama_kv_cache: layer  12: dev = Vulkan0
llama_kv_cache: layer  13: dev = Vulkan0
llama_kv_cache: layer  14: dev = Vulkan0
llama_kv_cache: layer  15: dev = Vulkan0
llama_kv_cache: layer  16: dev = Vulkan0
llama_kv_cache: layer  17: dev = Vulkan0
llama_kv_cache: layer  18: dev = Vulkan0
llama_kv_cache: layer  19: dev = Vulkan0
llama_kv_cache: layer  20: dev = Vulkan0
llama_kv_cache: layer  21: dev = Vulkan0
llama_kv_cache: layer  22: dev = Vulkan0
llama_kv_cache: layer  23: dev = Vulkan0
llama_kv_cache: layer  24: dev = Vulkan0
llama_kv_cache: layer  25: dev = Vulkan0
llama_kv_cache: layer  26: dev = Vulkan0
llama_kv_cache: layer  27: dev = Vulkan0
llama_kv_cache: layer  28: dev = Vulkan0
llama_kv_cache: layer  29: dev = Vulkan0
llama_kv_cache:    Vulkan0 KV buffer size =   360.00 MiB
llama_kv_cache: size =  360.00 MiB ( 16384 cells,  30 layers,  1/1 seqs), K (f16):  180.00 MiB, V (f16):  180.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2184
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:    Vulkan0 compute buffer size =   197.12 MiB
llama_context: Vulkan_Host compute buffer size =    33.14 MiB
llama_context: graph nodes  = 937
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: added <|end_of_text|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         36

clip_model_loader: has vision encoder
clip_model_loader: tensor[0]: n_dims = 2, name = mm.model.fc.weight, tensor_size=14155776, offset=0, shape:[12288, 576, 1, 1], type = f16
clip_model_loader: tensor[1]: n_dims = 1, name = v.patch_embd.bias, tensor_size=3072, offset=14155776, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[2]: n_dims = 4, name = v.patch_embd.weight, tensor_size=2359296, offset=14158848, shape:[16, 16, 3, 768], type = f32
clip_model_loader: tensor[3]: n_dims = 2, name = v.position_embd.weight, tensor_size=3145728, offset=16518144, shape:[768, 1024, 1, 1], type = f32
clip_model_loader: tensor[4]: n_dims = 1, name = v.blk.0.ln1.bias, tensor_size=3072, offset=19663872, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[5]: n_dims = 1, name = v.blk.0.ln1.weight, tensor_size=3072, offset=19666944, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[6]: n_dims = 1, name = v.blk.0.ln2.bias, tensor_size=3072, offset=19670016, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[7]: n_dims = 1, name = v.blk.0.ln2.weight, tensor_size=3072, offset=19673088, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[8]: n_dims = 1, name = v.blk.0.ffn_up.bias, tensor_size=12288, offset=19676160, shape:[3072, 1, 1, 1], type = f32
clip_model_loader: tensor[9]: n_dims = 2, name = v.blk.0.ffn_up.weight, tensor_size=4718592, offset=19688448, shape:[768, 3072, 1, 1], type = f16
clip_model_loader: tensor[10]: n_dims = 1, name = v.blk.0.ffn_down.bias, tensor_size=3072, offset=24407040, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[11]: n_dims = 2, name = v.blk.0.ffn_down.weight, tensor_size=4718592, offset=24410112, shape:[3072, 768, 1, 1], type = f16
clip_model_loader: tensor[12]: n_dims = 1, name = v.blk.0.attn_k.bias, tensor_size=3072, offset=29128704, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[13]: n_dims = 2, name = v.blk.0.attn_k.weight, tensor_size=1179648, offset=29131776, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[14]: n_dims = 1, name = v.blk.0.attn_out.bias, tensor_size=3072, offset=30311424, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[15]: n_dims = 2, name = v.blk.0.attn_out.weight, tensor_size=1179648, offset=30314496, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[16]: n_dims = 1, name = v.blk.0.attn_q.bias, tensor_size=3072, offset=31494144, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[17]: n_dims = 2, name = v.blk.0.attn_q.weight, tensor_size=1179648, offset=31497216, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[18]: n_dims = 1, name = v.blk.0.attn_v.bias, tensor_size=3072, offset=32676864, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[19]: n_dims = 2, name = v.blk.0.attn_v.weight, tensor_size=1179648, offset=32679936, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[20]: n_dims = 1, name = v.blk.1.ln1.bias, tensor_size=3072, offset=33859584, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[21]: n_dims = 1, name = v.blk.1.ln1.weight, tensor_size=3072, offset=33862656, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[22]: n_dims = 1, name = v.blk.1.ln2.bias, tensor_size=3072, offset=33865728, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[23]: n_dims = 1, name = v.blk.1.ln2.weight, tensor_size=3072, offset=33868800, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[24]: n_dims = 1, name = v.blk.1.ffn_up.bias, tensor_size=12288, offset=33871872, shape:[3072, 1, 1, 1], type = f32
clip_model_loader: tensor[25]: n_dims = 2, name = v.blk.1.ffn_up.weight, tensor_size=4718592, offset=33884160, shape:[768, 3072, 1, 1], type = f16
clip_model_loader: tensor[26]: n_dims = 1, name = v.blk.1.ffn_down.bias, tensor_size=3072, offset=38602752, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[27]: n_dims = 2, name = v.blk.1.ffn_down.weight, tensor_size=4718592, offset=38605824, shape:[3072, 768, 1, 1], type = f16
clip_model_loader: tensor[28]: n_dims = 1, name = v.blk.1.attn_k.bias, tensor_size=3072, offset=43324416, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[29]: n_dims = 2, name = v.blk.1.attn_k.weight, tensor_size=1179648, offset=43327488, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[30]: n_dims = 1, name = v.blk.1.attn_out.bias, tensor_size=3072, offset=44507136, shape:[768, 1, 1, 1], type = f32
clip_model_loader: tensor[31]: n_dims = 2, name = v.blk.1.attn_out.weight, tensor_size=1179648, offset=44510208, shape:[768, 768, 1, 1], type = f16
clip_model_loader: tensor[32]: n_dims = 1, name = v.blk.1.attn_q.bias, tensor_size=3072,
