CodeFormulaV2 --> GGUF+mmproj --> custom_code_formula_config #3502
Replies: 1 comment 2 replies
-
|
Hi @vishalbelsare! I'm Dosu and I’m helping the docling team. @vishalbelsare This is a creative approach for VRAM-constrained setups, and it's broadly valid — Docling's API engine type is designed to work with any OpenAI-compatible endpoint, including llama-server hosting GGUF models. Here's what I can share on your three questions: 1. Is this proper / is there a more elegant solution? Your approach is the recommended workaround for resource-constrained environments. Docling doesn't natively support GGUF, but the 2. Pitfalls to watch for:
3. Correct parameters to align with factory defaults: Your config looks mostly right. Here are the factory defaults you should match: {
"engine_options": {
"engine_type": "api",
"url": "http://localhost:8090/v1/chat/completions",
"params": {
"model": "codeformulav2",
"temperature": 0.0,
"max_tokens": 2048
},
"timeout": 120.0,
"concurrency": 4
},
"model_spec": {
"name": "CodeFormulaV2",
"default_repo_id": "docling-project/CodeFormulaV2",
"prompt": "",
"response_format": "plaintext",
"stop_strings": ["</doctag>", "<end_of_utterance>"]
},
"scale": 2.0,
"extract_code": true,
"extract_formulas": true
}Key differences from your current config:
For your llama-server config, your flags look solid. The One note: To reply, just mention @dosu. Share context across your team and agents. Try Dosu. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Operating under tight VRAM constraints on a laptop, and using llama-swap to swap an embedding model, re-ranking model in/out of VRAM, I cannot (comfortably) use the CUDA version of docling-serve, because I don't know of a way to swap out the docling models from VRAM and allow llama-swap to take over again. i.e. it would be lovely if llama-swap could serve all the models or some system which would allow a flow like that.
On this line, I used llama.cpp's convert tool to create a GGUF (and a mmproj) file from docling-project/CodeFormulaV2 at huggingface. I added this model to the list which llama-swap can serve.
Now, I think, I can get it to act reasonably, but I want to know (1) whether this is proper or there's some other more elegant solution (2) even if I use this hackish GGUF what are the pitfalls I don't know about and (3) if this is acceptable, then the proper parameters which should be used to map the behavior of the factory baked-in CodeFormulaV2 to the GGUF version of CodeFormulaV2 which I am using by passing :
"code_formula_custom_config": "{"engine_options": {"engine_type": "api_openai", "url": "http://localhost:8090/v1/chat/completions\", "params": {"model": "codeformulav2"}, "timeout": 120.0, "concurrency": 4}, "model_spec": {"name": "CodeFormulaV2", "default_repo_id": "docling-project/CodeFormulaV2", "prompt": "", "response_format": "plaintext"}, "extract_code": true, "extract_formulas": true}"
and having this in my llama-swap config.yaml :
Beta Was this translation helpful? Give feedback.
All reactions