
test: add unit tests for granite formatters (#812)#818

Merged
planetf1 merged 8 commits into generative-computing:main from planetf1:test/granite-formatter-unit-tests
Apr 14, 2026

Conversation

@planetf1
Contributor

@planetf1 planetf1 commented Apr 10, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

198 unit tests for granite formatters (3.2, 3.3, intrinsics). No GPU, network, or model downloads — runs in ~35s.

Why these tests exist

mellea/formatters/granite/ has 2347 statements at 52% coverage. However, the only existing tests are gated behind e2e + huggingface + require_gpu(min_vram_gb=12) and never run in CI. They also take time and resources to run locally, which impacts developers.

This PR adds fast regression detection for the parsing and serialisation logic. The new tests won't necessarily pick up new issues, but they will help prevent us from unintentionally breaking the code.

End result: 82% coverage from unit testing alone.

What's covered

| File | Tests | Covers |
| --- | --- | --- |
| test_granite3_shared.py | 19 | find_substring_in_text, create_dict, parse_hallucinations_text, span helpers |
| test_granite32_output.py | 25 | Citation parsing (`<co>N</co>`), model output splitting, validation, full transform pipeline |
| test_granite33_output.py | 22 | JSON-delimited citations/hallucinations, `<think>`/`<response>` extraction, controls cleanup (#173) |
| test_granite32_input.py | 33 | System message decision tree (tools × docs × thinking × controls), sanitise, transform |
| test_granite33_input.py | 29 | Same matrix, available_tools role, per-document document_id role formatting |
| test_intrinsics_canned_output.py | 52 | 5 full-pipeline regression tests + 47 auto-discovered schema validations of all JSON fixtures |
| test_openai_compat.py | 18 | All input JSON fixtures pass OpenAI SDK request validation; rewritten inputs likewise |
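To give a flavour of what the output-parsing tests exercise, here is a minimal hypothetical sketch of `<co>N</co>` citation-marker extraction. The function name and regex are illustrative only; the real granite32 parser is more involved.

```python
import re

# <co>N</co> markers embed the index of the cited document in model output.
CO_TAG = re.compile(r"<co>(\d+)</co>")

def extract_citation_ids(text: str) -> tuple[str, list[int]]:
    """Return the text with citation markers stripped, plus the cited
    document indices in order of appearance. Illustrative sketch only."""
    ids = [int(m) for m in CO_TAG.findall(text)]
    return CO_TAG.sub("", text), ids
```

The unit tests then assert on both the cleaned text and the recovered indices, rather than only checking that parsing does not raise.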

Note for reviewers

These tests cover the Granite 3.2/3.3 formatters. Granite 4.x currently reuses the 3.x parsers — are there any gaps in parsing behaviour for 4.x that these tests should cover? CC @frreiss

Production bug fix: nltk_check() missing LookupError

Writing these unit tests exposed a production bug in nltk_check() (mellea/formatters/granite/base/optional.py and util.py).

Problem: nltk_check() only caught ImportError (package missing) but not LookupError (package installed, punkt_tab data not downloaded). Since nltk is always present as a transitive dep of rouge_score, the ImportError path was effectively dead code — users got an unhelpful ValueError instead of install instructions.

Fixes:

  • nltk_check() now catches both ImportError and LookupError, matching the original intent
  • nltk>=3.9 declared as explicit core dependency (citation parsing is part of the granite formatter, not optional)
  • punkt_tab download added to quality.yml — same pattern as Ollama model pulls. CC @avinash2692

Note: punkt_tab is NLTK tokenizer data, separate from the pip package — no way to solve via dependency declarations alone. Current approach: clear error message for users + explicit CI setup (matches Haystack and NLTK's own CI). Alternative would be auto-downloading at runtime, as Unstructured does, but that makes unexpected network calls.
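For illustration, a minimal sketch of the corrected check; the actual nltk_check() in optional.py differs in detail, and the message text here is hypothetical:

```python
from contextlib import contextmanager

@contextmanager
def nltk_check(feature_name: str):
    """Sketch: surface actionable install instructions for both failure
    modes — nltk not installed (ImportError) and nltk installed but the
    punkt_tab tokenizer data not downloaded (LookupError)."""
    try:
        yield
    except (ImportError, LookupError) as err:
        raise ImportError(
            f"{feature_name} requires nltk and its punkt_tab data. "
            "Install with: pip install 'nltk>=3.9' and run "
            "python -m nltk.downloader punkt_tab"
        ) from err
```

The key point is the `(ImportError, LookupError)` tuple: previously only ImportError was caught, so the data-missing case fell through with an unhelpful error.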

Code review cleanups

  • granite_io → mellea in error messages: nltk_check() error messages referenced granite_io (copy-paste from upstream). Fixed to reference mellea.
  • Consolidated duplicate nltk_check: was duplicated in both optional.py and util.py. Consolidated into optional.py as single source of truth — consumers in granite32/output.py and granite33/output.py now import from there.
  • Removed unused imports in test_granite32_output.py
  • Fixed assertion logic in test_balanced_tags_no_warnings (or → and)

Design decisions

  • Tests import constants rather than hardcoding magic strings — auto-adapt when constants change
  • Factory helpers centralise ChatCompletion construction — one place to update if models change
  • Fixture validation tests (47) and OpenAI compat tests auto-discover JSON files via glob — zero maintenance when files change
  • Canned output regression tests replay saved model outputs through the full transform pipeline to catch parsing regressions without calling a model. Use local YAML configs with float-tolerant comparison
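The auto-discovery and fail-loudly pattern described above can be sketched as follows; the helper name and directory layout are hypothetical, not the actual test code:

```python
import json
from pathlib import Path

def discover_fixtures(testdata: Path) -> list[Path]:
    """Collect every JSON fixture under testdata; fail loudly if the
    directory is empty so a misplaced path can't silently collect
    zero test cases."""
    files = sorted(testdata.glob("**/*.json"))
    assert files, f"no JSON fixtures found under {testdata}"
    return files

# Usage in a pytest module — each discovered file becomes its own test
# case, so adding a fixture requires zero test-code changes:
#   @pytest.mark.parametrize("path", discover_fixtures(TESTDATA),
#                            ids=lambda p: p.name)
#   def test_fixture_is_valid_json(path):
#       json.loads(path.read_text())
```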

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage where code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@planetf1 planetf1 changed the title test: add 199 unit tests for granite formatters (#812) test: add unit tests for granite formatters (#812) Apr 10, 2026
@planetf1 planetf1 requested review from frreiss and jakelorocco April 10, 2026 12:35
@planetf1 planetf1 marked this pull request as ready for review April 10, 2026 12:36
@planetf1 planetf1 requested a review from a team as a code owner April 10, 2026 12:36
@planetf1
Contributor Author

This is the first PR of a few I have considered creating to try and improve our unit test coverage.

In some cases we get coverage from examples or e2e tests, which is still important, but unit tests let us run more efficiently in more environments, with more specific detection of issues. It's important to note, of course, that tests written after/from the implementation code may make similar assumptions; for this reason they may not find more issues, but they do help, particularly in spotting regressions where we change some code and break something else.

@planetf1 planetf1 requested a review from a team as a code owner April 10, 2026 13:42
@planetf1 planetf1 requested a review from ajbozarth April 10, 2026 13:42
@planetf1
Contributor Author

planetf1 commented Apr 10, 2026

CI failure in the previous run was caused by NLTK punkt_tab data not being available — 3 tests that exercise the citation/hallucination parsing pipeline hit LookupError from nltk.sent_tokenize().

Root cause: nltk_check() only caught ImportError, not LookupError. This is a production bug — users with citations enabled would get the same failure.

Fixed in 6d851e6 and 4b917bf:

  • nltk_check() now catches LookupError and shows install instructions (matching original intent)
  • nltk>=3.9 declared as explicit core dependency — it was only present transitively via rouge_score but is directly used by granite citation parsing
  • quality.yml downloads punkt_tab data in CI (same pattern as Ollama model pulls)

CC @avinash2692 for the CI change.
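For reference, the CI step could look roughly like this; the step name and surrounding workflow structure are hypothetical, only the downloader command is standard NLTK usage:

```yaml
# Hypothetical quality.yml fragment; mirrors the existing Ollama model-pull pattern.
- name: Download NLTK punkt_tab data
  run: python -m nltk.downloader punkt_tab
```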

@planetf1 planetf1 requested a review from avinash2692 April 10, 2026 13:43
…#812)

Unit tests for Granite 3.2 and 3.3 input/output processors, shared
utilities, IntrinsicsResultProcessor canned data regression, and
OpenAI SDK compatibility. No GPU, network, or model downloads required.

- test_granite3_shared.py: find_substring_in_text, create_dict,
  parse_hallucinations_text, hallucination/citation span helpers
- test_granite32_output.py: citation parsing, model output splitting,
  validation, transform pipeline
- test_granite33_output.py: JSON-delimited citations/hallucinations,
  think/response extraction, controls cleanup
- test_granite32_input.py: system message matrix, sanitize, transform
- test_granite33_input.py: available_tools role, per-document roles
- test_intrinsics_canned_output.py: canned model outputs through
  IntrinsicsResultProcessor, Pydantic schema validation of fixtures
- test_openai_compat.py: ChatCompletion round-trip through OpenAI SDK

Closes generative-computing#812
nltk_check() only caught ImportError (package not installed) but missed
LookupError (package installed, data not downloaded). Users with nltk
present via transitive deps (rouge_score) got an unhelpful ValueError
instead of install instructions when punkt_tab was missing.

Also adds punkt_tab download to CI workflow — same pattern as Ollama
model pulls.
nltk is required by granite citation/hallucination parsing
(nltk.sent_tokenize) but was only present as a transitive dependency
of rouge_score. Pin >=3.9 for punkt_tab support (security fix over
pickle-based punkt).
@planetf1 planetf1 force-pushed the test/granite-formatter-unit-tests branch from 4b917bf to fbaca46 Compare April 10, 2026 14:12
- Fix granite_io → mellea in nltk_check error message (copy-paste from upstream)
- Consolidate duplicate nltk_check into optional.py (single source of truth)
- Remove unused imports in test_granite32_output.py
- Fix assertion logic in test_balanced_tags_no_warnings (or → and)
@planetf1
Contributor Author

Code review cleanups applied in 5a0c241:

  • granite_io → mellea in nltk_check() error messages — pre-existing copy-paste from upstream granite_io, now references mellea
  • Consolidated duplicate nltk_check() — was identical in both optional.py and util.py. Now lives only in optional.py; granite32/output.py and granite33/output.py import from there
  • Removed 4 unused imports in test_granite32_output.py
  • Fixed assertion logic in test_balanced_tags_no_warnings (or → and); the original logic was always true when no warning was present

All 198 tests pass, all pre-commit hooks clean.

- Fix invalid TCP port 98765 → 127.0.0.1:1 in OpenAI compat test
- Add assertions that testdata directories are non-empty (fail loudly
  instead of silently collecting zero test cases)
- Add unit tests for nltk_check LookupError handling
@jakelorocco
Contributor

I didn't see this issue / PR until after the work was done; but is there a reason not to just import the tests from the granite common library? https://github.com/ibm-granite/granite-common/tree/main/tests/granite_common

@planetf1
Contributor Author

I didn't see this issue / PR until after the work was done; but is there a reason not to just import the tests from the granite common library? https://github.com/ibm-granite/granite-common/tree/main/tests/granite_common

I wasn't aware of them. I'll take a look early next week and see how this relates. Thanks.

@planetf1 planetf1 marked this pull request as draft April 10, 2026 18:03
@planetf1
Contributor Author

Moving to draft while I review the linked repo.

Contributor

@ajbozarth ajbozarth left a comment


I read through the code changes and they look good, but have not taken a detailed look at the new test code. I had Claude do a review and it found the following:


A few things worth considering:

nltk as a hard dependency
nltk>=3.9 is now declared as a core dep in pyproject.toml. Previously the nltk_check() context manager implied it was optional. If citation parsing is always available for Granite users this is fine, but worth a conscious decision — it adds ~4MB to every install even for users who never use citations.

Weak caplog assertions
Several tests match on substrings like "different number" in caplog.text.lower() without asserting log level. These would survive message rewording silently. Consider assert any(r.levelname == "WARNING" and "different number" in r.message.lower() for r in caplog.records) or similar.

Missing edge cases for malformed model output
The output parser tests don't cover:

  • Unclosed citation tags (e.g. <co>1 without </co>)
  • Duplicate citation IDs with conflicting doc indices

Low risk since these match actual model output, but worth a note if the formatter is ever extended.

On the granite-common question
These tests cover Mellea's adapter layer (Granite32InputProcessor, VLLMExtraBody/chat_template_kwargs handling, Mellea-specific validation rules) rather than the core Granite format — so they're not redundant with granite-common. That said, it might be worth checking whether any base format behavior tests could be inherited to reduce future divergence risk.


I also wonder if the new dep should be in optional deps instead. It also seems like we could reuse some of the upstream granite tests, even if we still need to extend them.

@planetf1
Contributor Author

@jakelorocco had a quick look, and I see where the copy was done (and the divergence). Some of the tests can be adapted.

Overlap with PR #818

| Area | Overlap | Viability / Action |
| --- | --- | --- |
| Granite 3.2/3.3 output parsing — we have 7 test classes each testing private functions; they test via top-level processor | Partial — same code paths, different granularity | Skip — we already cover this better at the unit level |
| Granite 3.2/3.3 input processing — we test internal logic; they validate against HF tokenizer golden truth | Partial — complementary approaches | Cherry-pick tokenizer validation tests — need import rewrites, @pytest.mark.slow, skip-if-no-model, module-scoped HF fixtures |
| Shared output functions (find_substring_in_text, create_dict, etc.) | None — ours are unique, not tested upstream | N/A |
| IntrinsicsRewriter canned output — same golden-file pattern, similar fixtures | High — test code is redundant | Copy missing testdata (answer_relevance_*, gpt_oss_* golden files) and add scenarios to our existing fixtures; skip porting test code |
| OpenAI SDK compat — same fake-client technique, expect APIConnectionError | High — fully redundant | Skip — already covered |
| Base types (ToolCall, messages, ChatCompletion, GraniteChatCompletion validation) | None — we don't have these | Good candidate — pure Pydantic unit tests, only needs granite_common → mellea.formatters.granite import rewrites; fills a real coverage gap |
| Retrievers (Elasticsearch, InMemoryRetriever, embeddings) | None — out of PR scope | Separate PR — needs import rewrites, mocked ES, model download for embeddings (mark slow) |
| E2E with real models (test_run_model, test_run_transformers) | None — needs GPU/model downloads | Not viable as unit tests — could become slow + qualitative integration tests in a future PR |
| Meta: orphan file check — ensures every testdata file is referenced by a test | None — we don't have this | Easy win — trivial meta-test, prevents silent test rot, no external deps |
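The orphan-file check in the last row is cheap to sketch; the helper name and layout below are hypothetical, and the upstream test is structured differently:

```python
from pathlib import Path

def find_orphan_fixtures(testdata: Path, test_dir: Path) -> list[Path]:
    """Flag testdata files whose filename never appears in any test
    source, i.e. fixtures that nothing references any more."""
    sources = "\n".join(p.read_text() for p in sorted(test_dir.glob("test_*.py")))
    return [f for f in sorted(testdata.rglob("*"))
            if f.is_file() and f.name not in sources]
```

A single test asserting this returns an empty list prevents silent fixture rot with no external dependencies.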

There's good stuff there, but some of the content here is valuable too; the common goal is to ensure we have unit test coverage.

I'm happy to work through some of that porting/consolidation if it's useful. The question is whether we want to merge this first and then adapt, or whether you'd prefer to do it in one PR.

@ajbozarth
Contributor

Also ran uv run pytest locally and results look ok, only two unrelated failures in examples that I'm looking into separately:

2 failed, 1289 passed, 59 skipped, 22 deselected, 2 xfailed, 2 xpassed, 152 warnings in 1955.14s (0:32:35)

@planetf1
Contributor Author

@ajbozarth

The weak caplog assertions are addressed in 5a0c241.

On nltk:

  • The nltk dependency was already there. The difference is that it was previously implicit, i.e. pulled in as a transitive dependency via rouge_score. If you look at uv.lock it's been there all along, so there is no change to the install footprint.
  • Making it explicit adds clarity (in fact I think your comment demonstrates that): we have code that directly uses nltk, so best practice is to depend on it directly, not hide it behind transitivity.
  • This came up because for nltk to actually work there is a further requirement: an explicit data download step. In fact the code was previously failing silently because it did not catch the right exception, so the feature did not work.
  • The changes in this PR therefore ensure this is all clear and detectable; when I tried to run the tests I fell into the same hidden trap.

On testing more generally:
@ajbozarth I note the points about missing edge cases. @jakelorocco my plan is to refine this testing based on the comments above; I'd propose to do that in an additional PR and get this batch in first. Is that ok with you? (And Alex, I'd add the extra edge cases there too.)

@planetf1 planetf1 marked this pull request as ready for review April 13, 2026 08:47
@planetf1
Contributor Author

Moving back to ready. After evaluation, I'd propose to handle the remaining items in a follow-up PR.

@planetf1
Contributor Author

#827 opened with the proposed follow-up.

@planetf1 planetf1 requested review from ajbozarth and psschwei April 13, 2026 08:50
Member

@psschwei psschwei left a comment


LGTM
One minor comment (and one out of scope nice-to-have)
Since others are also reviewing, will let one of them give the approval

- Replace stale granite_io reference with mellea in optional.py docstring
- Use record-level caplog assertions in granite32 output tests for
  consistency with granite33 tests
- Fix stale granite_io reference in util.py import_optional docstring
- Tighten test_missing_colon_skipped assertion to verify empty result
Contributor

@ajbozarth ajbozarth left a comment


LGTM, left a note on #827 on the follow up work.

Edit: I also reran uv run pytest test/formatters/ locally and all passed:

290 passed, 1 xfailed, 2 warnings in 463.93s (0:07:43)

@planetf1 planetf1 added this pull request to the merge queue Apr 14, 2026
Merged via the queue into generative-computing:main with commit 2dcda8b Apr 14, 2026
10 checks passed
@planetf1 planetf1 deleted the test/granite-formatter-unit-tests branch April 14, 2026 08:48


Successfully merging this pull request may close these issues.

test: add unit tests for granite formatters (3.2, 3.3, intrinsics)
