CodeFormulaV2 --> GGUF+mmproj --> custom_code_formula_config #3502

vishalbelsare · 2026-05-26T16:24:52Z

vishalbelsare
May 26, 2026

Operating under tight VRAM constraints on a laptop, and using llama-swap to swap an embedding model, re-ranking model in/out of VRAM, I cannot (comfortably) use the CUDA version of docling-serve, because I don't know of a way to swap out the docling models from VRAM and allow llama-swap to take over again. i.e. it would be lovely if llama-swap could serve all the models or some system which would allow a flow like that.

On this line, I used llama.cpp's convert tool to create a GGUF (and a mmproj) file from docling-project/CodeFormulaV2 at huggingface. I added this model to the list which llama-swap can serve.

Now, I think, I can get it to act reasonably, but I want to know (1) whether this is proper or there's some other more elegant solution (2) even if I use this hackish GGUF what are the pitfalls I don't know about and (3) if this is acceptable, then the proper parameters which should be used to map the behavior of the factory baked-in CodeFormulaV2 to the GGUF version of CodeFormulaV2 which I am using by passing :

"code_formula_custom_config": "{"engine_options": {"engine_type": "api_openai", "url": "http://localhost:8090/v1/chat/completions\", "params": {"model": "codeformulav2"}, "timeout": 120.0, "concurrency": 4}, "model_spec": {"name": "CodeFormulaV2", "default_repo_id": "docling-project/CodeFormulaV2", "prompt": "", "response_format": "plaintext"}, "extract_code": true, "extract_formulas": true}"

and having this in my llama-swap config.yaml :

"codeformulav2":
ttl: 30
concurrencyLimit: 4
cmd: >
  llama-server --port ${PORT}
    -m /models/cache/models--docling-project--CodeFormulaV2/CodeFormulaV2-f16.gguf
    --mmproj /models/cache/models--docling-project--CodeFormulaV2/CodeFormulaV2-mmproj-f16.gguf
    --mmproj-offload
    --alias codeformulav2
    --ctx-size 16384
    # --n-predict 1024
    --parallel 4
    --jinja
    --chat-template-file /models/cache/models--docling-project--CodeFormulaV2/chat_template.jinja
    --temp 0.0 --top-p 1.0 --repeat-penalty 1.0
    -ngl 99
    --special
    --cont-batching

@vishalbelsare · 2026-05-26T16:50:30Z

dosubot[bot]
Bot May 26, 2026

Hi @vishalbelsare! I'm Dosu and I’m helping the docling team.

@vishalbelsare This is a creative approach for VRAM-constrained setups, and it's broadly valid — Docling's API engine type is designed to work with any OpenAI-compatible endpoint, including llama-server hosting GGUF models. Here's what I can share on your three questions:

1. Is this proper / is there a more elegant solution?

Your approach is the recommended workaround for resource-constrained environments. Docling doesn't natively support GGUF, but the engine_type: "api" (or "api_openai") configuration routes inference to any OpenAI-compatible server — including llama-server with GGUF models. Both api and api_openai engine types use the same underlying ApiVlmEngine implementation [1], so either works. Using "api" is slightly more appropriate for non-OpenAI endpoints.

2. Pitfalls to watch for:

Prompting: CodeFormulaV2 uses extremely minimal single-token prompts — literally <code> for code and <formula> for formulas, sent as the user text in a chat message alongside the image [2]. This is not a natural language instruction; it's a special token the model was trained on. Make sure your llama-server's chat template preserves these tokens faithfully and doesn't strip or reformat them.
Stop strings: The default config uses stop strings ["</doctag>", "<end_of_utterance>"] [3]. You'll want to ensure llama-server respects these (or configure them via the stop parameter in requests).
Image encoding: The API engine sends images as base64-encoded image_url in the OpenAI chat format [4]. llama-server should handle this with --mmproj, which it appears you've already configured.
Response format: CodeFormulaV2 outputs PLAINTEXT [5] — plain code or LaTeX text, not structured markup. Ensure you're not inadvertently expecting or parsing structured output.
Scale factor: The default scale: 2.0 [6] means Docling upscales cropped code/formula images 2x before sending them to the model. This increases image resolution (and thus token count for the vision encoder). Keep this in mind for context window sizing.

3. Correct parameters to align with factory defaults:

Your config looks mostly right. Here are the factory defaults you should match:

{
  "engine_options": {
    "engine_type": "api",
    "url": "http://localhost:8090/v1/chat/completions",
    "params": {
      "model": "codeformulav2",
      "temperature": 0.0,
      "max_tokens": 2048
    },
    "timeout": 120.0,
    "concurrency": 4
  },
  "model_spec": {
    "name": "CodeFormulaV2",
    "default_repo_id": "docling-project/CodeFormulaV2",
    "prompt": "",
    "response_format": "plaintext",
    "stop_strings": ["</doctag>", "<end_of_utterance>"]
  },
  "scale": 2.0,
  "extract_code": true,
  "extract_formulas": true
}

Key differences from your current config:

Use "engine_type": "api" rather than "api_openai" (more appropriate for local llama-server)
Add "temperature": 0.0 and "max_tokens": 2048 in params to match local defaults [7]
Include stop_strings in model_spec

For your llama-server config, your flags look solid. The --temp 0.0 --top-p 1.0 --repeat-penalty 1.0 correctly replicates the deterministic greedy decoding used by default [8]. The --ctx-size 16384 is generous — the model's default max generation is 2048 tokens, so this leaves plenty of room for the image tokens + prompt.

One note: --jinja and --chat-template-file are critical — make sure the chat template you're using properly formats the <code> / <formula> prompts with the image. If you extracted it from the HF repo it should be fine.

To reply, just mention @dosu.

Share context across your team and agents. Try Dosu.

2 replies

vishalbelsare May 26, 2026
Author

Opus 4.7 to @dosu : )

Let me fetch that discussion to see what's there.Good — that's your discussion, and Dosu (the AI bot, not a human maintainer) has weighed in. Let me read its answer carefully against what we verified from the source code, because there are a few places where Dosu and the actual code disagree.

What Dosu got right

Approach is valid in principle — Docling's API engine path does route to any OpenAI-compatible endpoint. llama-server with GGUF + mmproj is a legitimate target.
Prompts are <code> / <formula> — confirmed by code_formula_vlm_model.py line 154 (we verified this directly).
Response format is plaintext — confirmed.
Greedy decoding (temp 0.0, top_p 1.0) — matches the hardcoded temperature=0.0 in VlmEngineInput.
--jinja + --chat-template-file is critical — correct, since the GGUF doesn't reliably embed the Idefics3 chat template.
Image transport is base64 image_url in OpenAI format — that's what api_image_request.py does.

Where Dosu is wrong or misleading

1. stop_strings: ["</doctag>", "<end_of_utterance>"] — wrong.

Dosu cites stage_model_specs.py:L1474 as the source. But we read code_formula_vlm_model.py directly — the stage post-processes output with hardcoded string truncation on <end_of_utterance> (and a few cleanup tokens like </code>, </formula>, the bbox prefix). It does not send stop_strings through to the API engine for the CodeFormulaV2 path. And </doctag> is a SmolDocling / Granite-Docling token, not a CodeFormulaV2 token — CodeFormulaV2 outputs <_lang_>code</code> for code and raw LaTeX for formulas. Including </doctag> is harmless (it'll never appear in output) but it's noise, and Dosu is reading from a config table without checking what the stage actually does with it.

The factory VlmModelSpec default for stop_strings is [] (we verified from the pipeline_options JSON schema dump). The stage handles termination via the model's EOS token (<end_of_utterance>, ID 100338) at the tokenizer level, not via string stops.

2. temperature: 0.0 and max_tokens: 2048 in params — partially wrong / risky.

The ApiVlmEngine already injects temperature and max_tokens from the VlmEngineInput (which the stage builds with temperature=0.0, max_new_tokens=2048 hardcoded — verified at lines 252–253 of code_formula_vlm_model.py). Per DeepWiki's writeup of api_openai_compatible_engine.py:L128-L147, runtime overrides come from VlmEngineInput, then user params are merged. Two issues with putting them in params:

Redundant — the engine already provides them per-request.
Potentially conflicting — if a future engine version uses your params.max_tokens as a ceiling and the stage requests max_new_tokens=2048 but you set params.max_tokens=2048, you're at the boundary with no margin.

Better to leave both out of params and let the engine inject them. That's also what your original config did correctly.

3. "api" vs "api_openai" — Dosu is wrong here.

Dosu says use "api" because it's "more appropriate for non-OpenAI endpoints." This is backwards. The VlmEngineType enum (verified from the JSON schema) lists both, and per the docling source (factory.py L118-L130 which Dosu itself cited), both route to ApiVlmEngine. But api_openai is specifically the variant for OpenAI-compatible chat completions endpoints. Plain api is a generic/legacy entry. llama-server is OpenAI-compatible — it speaks /v1/chat/completions with the OpenAI message schema. api_openai is the more precise choice.

If both work identically in the current code, this is a wash in practice — but Dosu's reasoning ("more appropriate for non-OpenAI") is inverted.

4. scale: 2.0 in your config — has no effect on cropping.

We verified that code_formula_vlm_model.py and code_formula_model.py both set images_scale = 1.67 as a class attribute (line 73 of vlm version, line 80 of legacy). This is what BaseItemAndImageEnrichmentModel reads to upscale the crop before passing it in. Your options.scale = 2.0 is a Pydantic field that does not override that class constant. Dosu cites stage_model_specs.py:L1485 to defend 2.0, but that's the spec-level field; the actual cropping uses the hardcoded 1.67.

The Pydantic scale: 2.0 from the spec might be passed to the engine for some other purpose (e.g., as a hint), but it does not affect Docling's image upscaling step. Leaving it in your config is harmless but the default of 2.0 in the spec is itself somewhat misleading — the model was trained at 120 DPI which is 1.67 × 72 = 120, hence the runtime constant.

Net answer

Your earlier config (the one I gave you with api_openai, no temperature/max_tokens in params, no stop_strings, no scale) is closer to actual runtime behavior than what Dosu recommends. Dosu's response has the right vibes but is reading the Pydantic spec tables and treating them as ground truth, without checking that the stage actually uses those fields. The stage hardcodes the things that matter and ignores most of model_spec.

To restate cleanly, the minimal config that maps to actual runtime behavior is:

"code_formula_custom_config": "{\"engine_options\": {\"engine_type\": \"api_openai\", \"url\": \"http://localhost:8090/v1/chat/completions\", \"params\": {\"model\": \"codeformulav2\"}, \"timeout\": 120.0, \"concurrency\": 4}, \"model_spec\": {\"name\": \"CodeFormulaV2\", \"default_repo_id\": \"docling-project/CodeFormulaV2\", \"prompt\": \"\", \"response_format\": \"plaintext\"}, \"extract_code\": true, \"extract_formulas\": true}"

If you want to be belt-and-suspenders defensive and add stop_strings: ["<end_of_utterance>"] (the actual CodeFormulaV2 EOS string, not the doctag) so llama-server stops generating in case the EOS token somehow gets mis-handled — that's a reasonable hedge. But adding </doctag> is just cargo-culting an unrelated token from the SmolDocling/Granite-Docling preset config.

One real concern Dosu raised that's worth taking seriously: chat template fidelity. Your extracted Idefics3-style template uses <|start_of_role|> and <end_of_utterance> framing. If llama-server's Jinja renderer transforms or misinterprets these special tokens during template application, output will degrade silently. Worth verifying by sending a known crop through your endpoint and comparing the raw response to what the HF transformers version of the model produces on the same input. You did this implicitly with your smoke tests (got bbox-prefixed <_Python_>print("Hello World.") for code and bare LaTeX for formula), which is the right shape — but if you ever see output without the bbox prefix or without the <_lang_> tag, the template's broken.

The discussion's "Unanswered" — no human maintainer has replied. Dosu's a bot. Worth waiting a few days to see if a docling team member weighs in with corrections, but for now your verified-from-source config is more accurate than Dosu's table-derived one.

dosubot[bot] Bot May 26, 2026

@vishalbelsare Fair critique — you clearly went deeper into the source than my initial response did. Let me address this honestly.

You're correct on the substantive points:

stop_strings: The stage does not pass stop strings to the API engine. VlmEngineInput is created without populating stop_strings, so it defaults to [] [1]. The post-processing logic in _post_process() handles truncation at <end_of_utterance> internally [2]. The ["</doctag>", "<end_of_utterance>"] I cited lives in the preset spec definition [3], but the runtime stage ignores it. Including </doctag> was indeed cargo-culting from the Granite-Docling preset.
temperature/max_tokens in params: The ApiVlmEngine injects these from VlmEngineInput first, then user params override them via api_params.update(self.merged_params) [4]. So putting them in params is redundant at best and potentially conflicting at worst. Your config correctly omits them.
scale: 2.0 is a dead field: CodeFormulaVlmModel hardcodes images_scale = 1.67 as a class attribute [5] and never overrides it from the options, unlike PictureDescriptionBaseModel which does self.images_scale = options.scale [6]. This looks like a bug — the options field exists but isn't wired up.
api vs api_openai: Both route to the same ApiVlmEngine class [7]. The difference is that api_openai pre-fills url to https://api.openai.com/v1/chat/completions [8]. Since you're providing your own URL anyway, both work identically in practice — but you're right that api_openai is the more semantically precise choice for an OpenAI-compatible chat completions endpoint.

Your minimal config is the correct one. The only defensive addition worth considering is passing "stop": ["<end_of_utterance>"] server-side in llama-server config (not in Docling's config), so generation terminates early without relying solely on the EOS token ID mapping being correct in the GGUF tokenizer. But that's belt-and-suspenders, not required.

I should have traced the runtime path more carefully rather than treating the Pydantic spec fields as ground truth. Appreciate the thorough correction.

To reply, just mention @dosu.

Docs are dead. Just use Dosu.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CodeFormulaV2 --> GGUF+mmproj --> custom_code_formula_config #3502

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

CodeFormulaV2 --> GGUF+mmproj --> custom_code_formula_config #3502

Uh oh!

vishalbelsare May 26, 2026

Replies: 1 comment · 2 replies

Uh oh!

dosubot[bot] Bot May 26, 2026

Uh oh!

vishalbelsare May 26, 2026 Author

What Dosu got right

Where Dosu is wrong or misleading

Net answer

Uh oh!

dosubot[bot] Bot May 26, 2026

vishalbelsare
May 26, 2026

Replies: 1 comment 2 replies

dosubot[bot]
Bot May 26, 2026

vishalbelsare May 26, 2026
Author