fix(hf): filter model_options for apply_chat_template via Jinja AST allowlist#1168
Open
planetf1 wants to merge 7 commits into
Open
fix(hf): filter model_options for apply_chat_template via Jinja AST allowlist#1168planetf1 wants to merge 7 commits into
planetf1 wants to merge 7 commits into
Conversation
11 tasks
…ve-computing#1154) `temperature`, `max_new_tokens`, and `do_sample` were splatted directly into the Jinja template variable namespace via `apply_chat_template`. HF templates silently ignore unknown kwargs today, but a future template that references any of those names would be silently shadowed. Adds `_filter_generate_only_options` (the inverse of the existing `_filter_chat_template_only_options`) and applies it at both `apply_chat_template` call sites in `LocalHFBackend`. Adds 12 unit tests in `test_huggingface_filter_options.py` covering each key individually and the full filter→backend-specific chain, plus a qualitative e2e regression guard in `test_huggingface.py`. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…ilter fix - Reword _filter_generate_only_options docstring: replace "Inverse of" with "Companion to" (it is a deny-list, not a mathematical complement); add a note that model_options arrive post-_simplify_and_merge so user string keys are already normalised to sentinels - Clarify inline comment on MAX_NEW_TOKENS: the sentinel is renamed *downstream* by _make_backend_specific_and_remove, not in this method - Update e2e test docstring to be honest: the qualitative test guards pipeline non-regression; filter correctness is covered by unit tests in test_huggingface_filter_options.py - Expand test fixture docstring to document the __new__ invariant and the single attribute that must be set - Expand module docstring to clarify that torch must be importable even though no GPU is needed to run the tests Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The initial fix covered only temperature, max_new_tokens, and do_sample. Expand the denylist to include all commonly-used HuggingFace GenerationConfig options that must not reach the Jinja template namespace: - Sampling: top_k, top_p, typical_p, repetition_penalty, no_repeat_ngram_size, length_penalty, num_beams, num_beam_groups, diversity_penalty, penalty_alpha, early_stopping - Length: min_new_tokens - Sequence count: num_return_sequences - Special token IDs: pad_token_id, bos_token_id, eos_token_id, forced_bos_token_id, forced_eos_token_id Adds 19 parametrised unit tests (one per new key) and an integration test that verifies all template-only keys survive unchanged when every expanded denylist key is present in the same options dict. Closes generative-computing#1170 (no longer needed as a follow-up). Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The previous approach maintained a hand-written denylist of GenerationConfig option names to strip before apply_chat_template. This is fundamentally incomplete: transformers has ~60+ GenerationConfig parameters and the list requires manual updates as HuggingFace adds new ones. Replace it with a self-maintaining solution: parse the tokenizer's Jinja2 template with jinja2.meta.find_undeclared_variables to compute the exact set of kwargs the template can consume. Only keys in that set are forwarded to apply_chat_template. Generation-only options are silently dropped not because they are named in a list, but because no chat template references them. _HF_INTERNAL_TEMPLATE_VARS (5 entries) excludes variables HuggingFace injects automatically — named params to render_jinja_template (messages, tools, add_generation_prompt) and Jinja env globals from _compile_jinja_template (raise_exception, strftime_now). Sourced by reading the transformers source directly; see the constant's docstring for the relevant function references. The allowlist is computed lazily and cached on the instance (cached_property), so the Jinja parse happens once per backend lifetime. Tests: - test_hf_internal_template_vars_contents: strict equality check on the 5-item exclusion set — fails immediately if transformers changes what it injects - allowlist tests: Jinja AST parsing, exclusion of HF internals, absence of generate-only option names in a realistic template, caching, empty template - _filter_for_chat_template tests: sentinel renaming, generate-only drop, template-var pass-through, empty input, unknown key drop - _filter_chat_template_only_options: regression guard (pre-existing method) Closes generative-computing#1154 Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Adds comprehensive unit tests for the new _chat_template_allowlist cached property and _filter_for_chat_template method introduced in the preceding commit, plus tokenizer-only integration tests against the real Granite 3.3 8B template (no GPU required; skipped automatically when the tokenizer is not in the local HF cache). Unit tests cover: - _HF_INTERNAL_TEMPLATE_VARS exact-set assertion (load-bearing upgrade guard) - allowlist includes template variables / excludes HF-internal / generate-only vars - empty allowlist for missing chat_template - cached_property returns same object on repeated access - _filter_for_chat_template: passes referenced vars, drops generate-only, renames Mellea sentinels, handles empty input, drops unknown keys - _filter_chat_template_only_options regression guard (pre-existing method) Integration tests (marked huggingface, skip when tokenizer not cached): - Granite allowlist includes 'think' and 'guardian_config' - Granite allowlist excludes generate-only options - Granite allowlist excludes HF-internal vars Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
_FakeTokenizer() is not a PreTrainedTokenizer, so assigning it to b._tokenizer caused a mypy [assignment] error in CI. Use object.__setattr__ instead — its value param is typed Any in typeshed so mypy accepts the assignment without suppression. Assisted-by: Claude Code Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…variants
Three robustness fixes to the Jinja AST allowlist:
1. Register jinja2.ext.loopcontrols so templates using {% break %} or
{% continue %} parse without TemplateSyntaxError.
2. Wrap env.parse() in try/except jinja2.TemplateSyntaxError — templates
that use unsupported extensions such as {% generation %} / {% endgeneration %}
(DeepSeek-R1, Qwen3, transformers >= 4.47) return frozenset() instead of
crashing during inference.
3. Guard against non-string chat_template (None, list-of-alternates, dict) with
an isinstance check before attempting to parse.
Move the inline 'import jinja2.meta' to module level (already done for jinja2).
Add a comment at the KV-cache apply_chat_template call site explaining why
add_generation_prompt is intentionally absent there (it is HF-internal and
the KV-cache path formats context, not a generation turn).
Three new unit tests cover each robustness case:
- test_allowlist_non_string_template_returns_empty
- test_allowlist_graceful_on_generation_tag
- test_allowlist_break_continue_tags_parsed_correctly
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
3346b68 to
bc01569
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The problem
When you pass generation options like
temperature,max_new_tokens, ordo_sampletoLocalHFBackend, those options were being forwarded into HuggingFace'sapply_chat_template. The method accepts arbitrary**kwargsand passes them straight into the Jinja template's variable namespace.Standard HF templates silently ignore variables they don't recognise, so nothing breaks for today's models. But this is a latent correctness hazard: any model whose Jinja template happens to define a variable named
temperature,num_beams, or any otherGenerationConfigparameter would silently receive the caller's generation setting instead of the template author's intended value. It also makes the calling contract ofmodel_optionsopaque — there's no principled boundary between what goes toapply_chat_templateand what goes togenerate().How other backends handle this
For comparison, the OpenAI backend uses
inspect.signature(Completions.create)to build a self-maintaining filter: it forwards only kwargs that appear in the API client's call signature. The HF backend's equivalent interface is the Jinja template itself — template variables are the "declared parameters" forapply_chat_template.The fix
Rather than maintaining a hand-written denylist of
GenerationConfigparameter names (which can never be complete — transformers has 60+ generation parameters), this PR uses the Jinja2 template as the source of truth.At both
apply_chat_templatecall sites, a new_filter_for_chat_templatemethod is applied. It:jinja2.meta.find_undeclared_variablesto get the exact set of variables the template references._HF_INTERNAL_TEMPLATE_VARS— the 5 variables HuggingFace injects automatically (see below).Generation-only options like
temperatureare dropped not because they appear on any list, but because no standard HF chat template ever references them as Jinja variables. The allowlist is computed once per backend instance (viafunctools.cached_property) and adapts automatically as models evolve.The exclusion set — please review carefully
_HF_INTERNAL_TEMPLATE_VARSis the small set of variables that HuggingFace injects into the template namespace unconditionally. These must be excluded from the allowlist to avoid"got multiple values for keyword argument"TypeErrors or silent replacement of injected callables.messagesrender_jinja_templatenamed paramtoolsrender_jinja_templatenamed paramtools=convert_tools_to_json(...)explicitlyadd_generation_promptrender_jinja_templatenamed paramadd_generation_prompt=Trueat the standard-generation call siteraise_exception_compile_jinja_templateJinja env globalfind_undeclared_variablescannot see env globals; treats this as undeclaredstrftime_now_compile_jinja_templateJinja env globalRelevant transformers source locations (verified against installed version):
PreTrainedTokenizerBase.apply_chat_template— therender_jinja_template(...)calltransformers.utils.chat_template_utils._compile_jinja_template— thejinja_env.globals["raise_exception"]andjinja_env.globals["strftime_now"]assignmentstest_hf_internal_template_vars_contentsfails immediately if the constant changes without a corresponding test update — making the exclusion set an explicit, auditable contract.Behaviour changes reviewers should be aware of
add_generation_prompton the KV-cache path: becauseadd_generation_promptis in_HF_INTERNAL_TEMPLATE_VARS, any user-supplied value is now dropped rather than forwarded. This is correct: the KV-cache path formats context (not a generation turn), so HF's default ofFalseis the right behaviour. A comment at the call site documents this explicitly.Graceful degradation for unsupported template extensions: templates using
{% generation %} / {% endgeneration %}(DeepSeek-R1, Qwen3, transformers ≥ 4.47) or unusual Jinja extensions that cannot be parsed by a plainjinja2.Environmentreturnfrozenset()— the filter forwards nothing toapply_chat_templaterather than crashing.{% break %} / {% continue %}are handled correctly viajinja2.ext.loopcontrols.Non-string
chat_template: some tokenizers storechat_templateas a list of named template dicts. The filter now guards against this with anisinstance(str)check and returnsfrozenset()for non-string values, leavingapply_chat_templateto handle that format itself.Changes
mellea/backends/huggingface.py_HF_INTERNAL_TEMPLATE_VARS(5 entries, documented above)_chat_template_allowlistcached property — parses the Jinja AST once per backend instance, withloopcontrolsextension registered and graceful fallback on parse failure_filter_for_chat_templatemethod — renames sentinels then filters by allowlistapply_chat_templatecall sites updated to use_filter_for_chat_template_filter_generate_only_optionsremoved (the 21-key denylist is gone)jinja2andjinja2.metaimported at module leveltest/backends/test_huggingface_filter_options.py— 21 unit tests (no GPU needed):_HF_INTERNAL_TEMPLATE_VARScontents{% generation %}tag (graceful degradation),{% break %}(loopcontrols), non-string template (isinstance guard)_filter_for_chat_template: sentinel renaming, generate-only drop, template-var pass-through, empty input, unknown key drop_filter_chat_template_only_options: regression guard on pre-existing methodhuggingface): Granite tokenizer allowlist includesthink+guardian_config, excludes generate-only keys and HF-internal varsTesting
The e2e test
test_generate_only_options_do_not_break_generationis@pytest.mark.qualitativeand requires the real model fixture.Where this fits
Standalone bugfix. No other phases.
Closes #1154