Skip to content

fix(hf): filter model_options for apply_chat_template via Jinja AST allowlist#1168

Open
planetf1 wants to merge 7 commits into
generative-computing:mainfrom
planetf1:worktree-issue-1154
Open

fix(hf): filter model_options for apply_chat_template via Jinja AST allowlist#1168
planetf1 wants to merge 7 commits into
generative-computing:mainfrom
planetf1:worktree-issue-1154

Conversation

@planetf1
Copy link
Copy Markdown
Contributor

@planetf1 planetf1 commented May 27, 2026

The problem

When you pass generation options like temperature, max_new_tokens, or do_sample to LocalHFBackend, those options were being forwarded into HuggingFace's apply_chat_template. The method accepts arbitrary **kwargs and passes them straight into the Jinja template's variable namespace.

Standard HF templates silently ignore variables they don't recognise, so nothing breaks for today's models. But this is a latent correctness hazard: any model whose Jinja template happens to define a variable named temperature, num_beams, or any other GenerationConfig parameter would silently receive the caller's generation setting instead of the template author's intended value. It also makes the calling contract of model_options opaque — there's no principled boundary between what goes to apply_chat_template and what goes to generate().

How other backends handle this

For comparison, the OpenAI backend uses inspect.signature(Completions.create) to build a self-maintaining filter: it forwards only kwargs that appear in the API client's call signature. The HF backend's equivalent interface is the Jinja template itself — template variables are the "declared parameters" for apply_chat_template.

The fix

Rather than maintaining a hand-written denylist of GenerationConfig parameter names (which can never be complete — transformers has 60+ generation parameters), this PR uses the Jinja2 template as the source of truth.

At both apply_chat_template call sites, a new _filter_for_chat_template method is applied. It:

  1. Parses the tokenizer's Jinja2 template with jinja2.meta.find_undeclared_variables to get the exact set of variables the template references.
  2. Subtracts _HF_INTERNAL_TEMPLATE_VARS — the 5 variables HuggingFace injects automatically (see below).
  3. Forwards only the keys that are in the resulting allowlist.

Generation-only options like temperature are dropped not because they appear on any list, but because no standard HF chat template ever references them as Jinja variables. The allowlist is computed once per backend instance (via functools.cached_property) and adapts automatically as models evolve.

The exclusion set — please review carefully

_HF_INTERNAL_TEMPLATE_VARS is the small set of variables that HuggingFace injects into the template namespace unconditionally. These must be excluded from the allowlist to avoid "got multiple values for keyword argument" TypeErrors or silent replacement of injected callables.

Variable Source Why excluded
messages render_jinja_template named param always injected as the first positional arg
tools render_jinja_template named param our call sites pass tools=convert_tools_to_json(...) explicitly
add_generation_prompt render_jinja_template named param passed as add_generation_prompt=True at the standard-generation call site
raise_exception _compile_jinja_template Jinja env global find_undeclared_variables cannot see env globals; treats this as undeclared
strftime_now _compile_jinja_template Jinja env global same as above

Relevant transformers source locations (verified against installed version):

  • Named params: PreTrainedTokenizerBase.apply_chat_template — the render_jinja_template(...) call
  • Jinja globals: transformers.utils.chat_template_utils._compile_jinja_template — the jinja_env.globals["raise_exception"] and jinja_env.globals["strftime_now"] assignments

⚠️ If transformers is upgraded, verify that this set is still accurate. The test test_hf_internal_template_vars_contents fails immediately if the constant changes without a corresponding test update — making the exclusion set an explicit, auditable contract.

Behaviour changes reviewers should be aware of

  1. add_generation_prompt on the KV-cache path: because add_generation_prompt is in _HF_INTERNAL_TEMPLATE_VARS, any user-supplied value is now dropped rather than forwarded. This is correct: the KV-cache path formats context (not a generation turn), so HF's default of False is the right behaviour. A comment at the call site documents this explicitly.

  2. Graceful degradation for unsupported template extensions: templates using {% generation %} / {% endgeneration %} (DeepSeek-R1, Qwen3, transformers ≥ 4.47) or unusual Jinja extensions that cannot be parsed by a plain jinja2.Environment return frozenset() — the filter forwards nothing to apply_chat_template rather than crashing. {% break %} / {% continue %} are handled correctly via jinja2.ext.loopcontrols.

  3. Non-string chat_template: some tokenizers store chat_template as a list of named template dicts. The filter now guards against this with an isinstance(str) check and returns frozenset() for non-string values, leaving apply_chat_template to handle that format itself.

Changes

  • mellea/backends/huggingface.py

    • New module-level constant _HF_INTERNAL_TEMPLATE_VARS (5 entries, documented above)
    • New _chat_template_allowlist cached property — parses the Jinja AST once per backend instance, with loopcontrols extension registered and graceful fallback on parse failure
    • New _filter_for_chat_template method — renames sentinels then filters by allowlist
    • Both apply_chat_template call sites updated to use _filter_for_chat_template
    • _filter_generate_only_options removed (the 21-key denylist is gone)
    • jinja2 and jinja2.meta imported at module level
  • test/backends/test_huggingface_filter_options.py — 21 unit tests (no GPU needed):

    • Strict equality check on _HF_INTERNAL_TEMPLATE_VARS contents
    • Allowlist: Jinja AST parsing, exclusion of all 5 HF-internal vars, absence of generate-only names in a realistic template, caching, empty template
    • Robustness: {% generation %} tag (graceful degradation), {% break %} (loopcontrols), non-string template (isinstance guard)
    • _filter_for_chat_template: sentinel renaming, generate-only drop, template-var pass-through, empty input, unknown key drop
    • _filter_chat_template_only_options: regression guard on pre-existing method
    • Integration tests (marked huggingface): Granite tokenizer allowlist includes think + guardian_config, excludes generate-only keys and HF-internal vars

Testing

uv run pytest test/backends/test_huggingface_filter_options.py -m "not huggingface" -v
# 18 passed

The e2e test test_generate_only_options_do_not_break_generation is @pytest.mark.qualitative and requires the real model fixture.

Where this fits

Standalone bugfix. No other phases.

Closes #1154

@github-actions github-actions Bot added the bug Something isn't working label May 27, 2026
@planetf1 planetf1 marked this pull request as ready for review May 27, 2026 14:09
@planetf1 planetf1 requested a review from a team as a code owner May 27, 2026 14:09
@planetf1 planetf1 marked this pull request as draft May 27, 2026 14:20
@planetf1 planetf1 mentioned this pull request May 27, 2026
11 tasks
@planetf1 planetf1 changed the title fix: strip generate-only options before apply_chat_template fix(hf): filter model_options for apply_chat_template via Jinja AST allowlist May 27, 2026
@planetf1 planetf1 marked this pull request as ready for review May 27, 2026 15:35
planetf1 added 7 commits May 28, 2026 23:53
…ve-computing#1154)

`temperature`, `max_new_tokens`, and `do_sample` were splatted directly
into the Jinja template variable namespace via `apply_chat_template`.
HF templates silently ignore unknown kwargs today, but a future template
that references any of those names would be silently shadowed.

Adds `_filter_generate_only_options` (the inverse of the existing
`_filter_chat_template_only_options`) and applies it at both
`apply_chat_template` call sites in `LocalHFBackend`.

Adds 12 unit tests in `test_huggingface_filter_options.py` covering
each key individually and the full filter→backend-specific chain, plus
a qualitative e2e regression guard in `test_huggingface.py`.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…ilter fix

- Reword _filter_generate_only_options docstring: replace "Inverse of"
  with "Companion to" (it is a deny-list, not a mathematical complement);
  add a note that model_options arrive post-_simplify_and_merge so user
  string keys are already normalised to sentinels
- Clarify inline comment on MAX_NEW_TOKENS: the sentinel is renamed
  *downstream* by _make_backend_specific_and_remove, not in this method
- Update e2e test docstring to be honest: the qualitative test guards
  pipeline non-regression; filter correctness is covered by unit tests
  in test_huggingface_filter_options.py
- Expand test fixture docstring to document the __new__ invariant and
  the single attribute that must be set
- Expand module docstring to clarify that torch must be importable even
  though no GPU is needed to run the tests

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The initial fix covered only temperature, max_new_tokens, and do_sample.
Expand the denylist to include all commonly-used HuggingFace GenerationConfig
options that must not reach the Jinja template namespace:

- Sampling: top_k, top_p, typical_p, repetition_penalty, no_repeat_ngram_size,
  length_penalty, num_beams, num_beam_groups, diversity_penalty, penalty_alpha,
  early_stopping
- Length: min_new_tokens
- Sequence count: num_return_sequences
- Special token IDs: pad_token_id, bos_token_id, eos_token_id,
  forced_bos_token_id, forced_eos_token_id

Adds 19 parametrised unit tests (one per new key) and an integration test
that verifies all template-only keys survive unchanged when every expanded
denylist key is present in the same options dict.

Closes generative-computing#1170 (no longer needed as a follow-up).

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
The previous approach maintained a hand-written denylist of GenerationConfig
option names to strip before apply_chat_template. This is fundamentally
incomplete: transformers has ~60+ GenerationConfig parameters and the list
requires manual updates as HuggingFace adds new ones.

Replace it with a self-maintaining solution: parse the tokenizer's Jinja2
template with jinja2.meta.find_undeclared_variables to compute the exact set
of kwargs the template can consume. Only keys in that set are forwarded to
apply_chat_template. Generation-only options are silently dropped not because
they are named in a list, but because no chat template references them.

_HF_INTERNAL_TEMPLATE_VARS (5 entries) excludes variables HuggingFace injects
automatically — named params to render_jinja_template (messages, tools,
add_generation_prompt) and Jinja env globals from _compile_jinja_template
(raise_exception, strftime_now). Sourced by reading the transformers source
directly; see the constant's docstring for the relevant function references.

The allowlist is computed lazily and cached on the instance (cached_property),
so the Jinja parse happens once per backend lifetime.

Tests:
- test_hf_internal_template_vars_contents: strict equality check on the 5-item
  exclusion set — fails immediately if transformers changes what it injects
- allowlist tests: Jinja AST parsing, exclusion of HF internals, absence of
  generate-only option names in a realistic template, caching, empty template
- _filter_for_chat_template tests: sentinel renaming, generate-only drop,
  template-var pass-through, empty input, unknown key drop
- _filter_chat_template_only_options: regression guard (pre-existing method)

Closes generative-computing#1154

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Adds comprehensive unit tests for the new _chat_template_allowlist
cached property and _filter_for_chat_template method introduced in the
preceding commit, plus tokenizer-only integration tests against the real
Granite 3.3 8B template (no GPU required; skipped automatically when the
tokenizer is not in the local HF cache).

Unit tests cover:
- _HF_INTERNAL_TEMPLATE_VARS exact-set assertion (load-bearing upgrade guard)
- allowlist includes template variables / excludes HF-internal / generate-only vars
- empty allowlist for missing chat_template
- cached_property returns same object on repeated access
- _filter_for_chat_template: passes referenced vars, drops generate-only,
  renames Mellea sentinels, handles empty input, drops unknown keys
- _filter_chat_template_only_options regression guard (pre-existing method)

Integration tests (marked huggingface, skip when tokenizer not cached):
- Granite allowlist includes 'think' and 'guardian_config'
- Granite allowlist excludes generate-only options
- Granite allowlist excludes HF-internal vars

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
_FakeTokenizer() is not a PreTrainedTokenizer, so assigning it to
b._tokenizer caused a mypy [assignment] error in CI. Use
object.__setattr__ instead — its value param is typed Any in typeshed
so mypy accepts the assignment without suppression.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
…variants

Three robustness fixes to the Jinja AST allowlist:

1. Register jinja2.ext.loopcontrols so templates using {% break %} or
   {% continue %} parse without TemplateSyntaxError.
2. Wrap env.parse() in try/except jinja2.TemplateSyntaxError — templates
   that use unsupported extensions such as {% generation %} / {% endgeneration %}
   (DeepSeek-R1, Qwen3, transformers >= 4.47) return frozenset() instead of
   crashing during inference.
3. Guard against non-string chat_template (None, list-of-alternates, dict) with
   an isinstance check before attempting to parse.

Move the inline 'import jinja2.meta' to module level (already done for jinja2).

Add a comment at the KV-cache apply_chat_template call site explaining why
add_generation_prompt is intentionally absent there (it is HF-internal and
the KV-cache path formats context, not a generation turn).

Three new unit tests cover each robustness case:
- test_allowlist_non_string_template_returns_empty
- test_allowlist_graceful_on_generation_tag
- test_allowlist_break_continue_tags_parsed_correctly

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 force-pushed the worktree-issue-1154 branch from 3346b68 to bc01569 Compare May 28, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

apply_chat_template receives generate-only options from _make_backend_specific_and_remove

1 participant