Prerequisites
Expected Behavior
Llama(model_path=...) should successfully load a GGUF whose embedded tokenizer.chat_template contains HuggingFace's {% generation %} / {% endgeneration %} Jinja tags (e.g. SmolLM3, and any future HF-shipped model adopting the same template extension), even when the caller intends to pass an explicit chat_format override.
Current Behavior
Llama.__init__ raises jinja2.exceptions.TemplateSyntaxError: Encountered unknown tag 'generation' before the model is usable. The error fires during Jinja2ChatFormatter.__init__, which eagerly compiles every chat template found in GGUF metadata regardless of whether the caller will use it.
The {% generation %} tag is a HuggingFace transformers chat-template extension that marks training-time generation spans for loss masking. It has no inference-time meaning, but jinja2's default environment doesn't recognize it.
Environment and Context
- Hardware: Apple M1 Pro, 16 GB
- OS: macOS 14.6 (Darwin 23.6.0)
- Python: 3.12.9
- llama-cpp-python: main (commit current as of 2025-05-21)
- jinja2: 3.x
Failure Information
jinja2.exceptions.TemplateSyntaxError: Encountered unknown tag 'generation'.
Jinja was looking for the following tags: 'elif' or 'else' or 'endif'.
The innermost block that needs to be closed is 'if'.
Steps to Reproduce
pip install llama-cpp-python
huggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf
python -c "from llama_cpp import Llama; Llama(model_path='./HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf', chat_format='chatml')"
The chat_format='chatml' override is intentionally provided to show the failure occurs even when the embedded template would be bypassed: the template is compiled at init regardless.
Related
A complete fix is available; will open a PR shortly.
Prerequisites
Expected Behavior
Llama(model_path=...)should successfully load a GGUF whose embeddedtokenizer.chat_templatecontains HuggingFace's{% generation %}/{% endgeneration %}Jinja tags (e.g. SmolLM3, and any future HF-shipped model adopting the same template extension), even when the caller intends to pass an explicitchat_formatoverride.Current Behavior
Llama.__init__raisesjinja2.exceptions.TemplateSyntaxError: Encountered unknown tag 'generation'before the model is usable. The error fires duringJinja2ChatFormatter.__init__, which eagerly compiles every chat template found in GGUF metadata regardless of whether the caller will use it.The
{% generation %}tag is a HuggingFace transformers chat-template extension that marks training-time generation spans for loss masking. It has no inference-time meaning, but jinja2's default environment doesn't recognize it.Environment and Context
Failure Information
Steps to Reproduce
pip install llama-cpp-pythonhuggingface-cli download bartowski/HuggingFaceTB_SmolLM3-3B-GGUF HuggingFaceTB_SmolLM3-3B-Q4_K_M.ggufpython -c "from llama_cpp import Llama; Llama(model_path='./HuggingFaceTB_SmolLM3-3B-Q4_K_M.gguf', chat_format='chatml')"The
chat_format='chatml'override is intentionally provided to show the failure occurs even when the embedded template would be bypassed: the template is compiled at init regardless.Related
nodesimport and incompleteparse()body, plus seven months of no reviewer activity.A complete fix is available; will open a PR shortly.