
test: add unit tests for granite formatters (#812)#818

Merged
planetf1 merged 8 commits into generative-computing:main from planetf1:test/granite-formatter-unit-tests
Apr 14, 2026

Conversation

@planetf1
Contributor

@planetf1 planetf1 commented Apr 10, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

198 unit tests for granite formatters (3.2, 3.3, intrinsics). No GPU, network, or model downloads — runs in ~35s.

Why these tests exist

mellea/formatters/granite/ has 2347 statements at 52% coverage. However, the only existing tests are gated behind e2e + huggingface + require_gpu(min_vram_gb=12) and never run in CI. They also take time and resources to run locally, which impacts developers.

This PR adds fast regression detection for the parsing and serialisation logic. The new tests won't necessarily pick up new issues, but they will help prevent us from unintentionally breaking the code.

End result: 82% coverage from unit testing alone.

What's covered

| File | Tests | Covers |
| --- | --- | --- |
| test_granite3_shared.py | 19 | find_substring_in_text, create_dict, parse_hallucinations_text, span helpers |
| test_granite32_output.py | 25 | Citation parsing (`<co>N</co>`), model output splitting, validation, full transform pipeline |
| test_granite33_output.py | 22 | JSON-delimited citations/hallucinations, `<think>`/`<response>` extraction, controls cleanup (#173) |
| test_granite32_input.py | 33 | System message decision tree (tools × docs × thinking × controls), sanitise, transform |
| test_granite33_input.py | 29 | Same matrix, available_tools role, per-document document_id role formatting |
| test_intrinsics_canned_output.py | 52 | 5 full-pipeline regression tests + 47 auto-discovered schema validations of all JSON fixtures |
| test_openai_compat.py | 18 | All input JSON fixtures pass OpenAI SDK request validation; rewritten inputs likewise |
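To give a flavour of what the output-parsing tests exercise, here is a minimal hypothetical sketch of `<co>N</co>` citation-marker extraction. The function name and regex are illustrative only; the real granite32 parser is more involved.

```python
import re

# <co>N</co> markers embed the index of the cited document in model output.
CO_TAG = re.compile(r"<co>(\d+)</co>")

def extract_citation_ids(text: str) -> tuple[str, list[int]]:
    """Return the text with citation markers stripped, plus the cited
    document indices in order of appearance. Illustrative sketch only."""
    ids = [int(m) for m in CO_TAG.findall(text)]
    return CO_TAG.sub("", text), ids
```

The unit tests then assert on both the cleaned text and the recovered indices, rather than only checking that parsing does not raise.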

Note for reviewers

These tests cover the Granite 3.2/3.3 formatters. Granite 4.x currently reuses the 3.x parsers — are there any gaps in parsing behaviour for 4.x that these tests should cover? CC @frreiss

Production bug fix: nltk_check() missing LookupError

Writing these unit tests exposed a production bug in nltk_check() (mellea/formatters/granite/base/optional.py and util.py).

Problem: nltk_check() only caught ImportError (package missing) but not LookupError (package installed, punkt_tab data not downloaded). Since nltk is always present as a transitive dep of rouge_score, the ImportError path was effectively dead code — users got an unhelpful ValueError instead of install instructions.

Fixes:

  • nltk_check() now catches both ImportError and LookupError, matching the original intent
  • nltk>=3.9 declared as explicit core dependency (citation parsing is part of the granite formatter, not optional)
  • punkt_tab download added to quality.yml — same pattern as Ollama model pulls. CC @avinash2692

Note: punkt_tab is NLTK tokenizer data, separate from the pip package — no way to solve via dependency declarations alone. Current approach: clear error message for users + explicit CI setup (matches Haystack and NLTK's own CI). Alternative would be auto-downloading at runtime, as Unstructured does, but that makes unexpected network calls.
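For illustration, a minimal sketch of the corrected check; the actual nltk_check() in optional.py differs in detail, and the message text here is hypothetical:

```python
from contextlib import contextmanager

@contextmanager
def nltk_check(feature_name: str):
    """Sketch: surface actionable install instructions for both failure
    modes — nltk not installed (ImportError) and nltk installed but the
    punkt_tab tokenizer data not downloaded (LookupError)."""
    try:
        yield
    except (ImportError, LookupError) as err:
        raise ImportError(
            f"{feature_name} requires nltk and its punkt_tab data. "
            "Install with: pip install 'nltk>=3.9' and run "
            "python -m nltk.downloader punkt_tab"
        ) from err
```

The key point is the `(ImportError, LookupError)` tuple: previously only ImportError was caught, so the data-missing case fell through with an unhelpful error.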

Code review cleanups

  • granite_io → mellea in error messages: nltk_check() error messages referenced granite_io (copy-paste from upstream). Fixed to reference mellea.
  • Consolidated duplicate nltk_check: was duplicated in both optional.py and util.py. Consolidated into optional.py as single source of truth — consumers in granite32/output.py and granite33/output.py now import from there.
  • Removed unused imports in test_granite32_output.py
  • Fixed assertion logic in test_balanced_tags_no_warnings (or → and)

Design decisions

  • Tests import constants rather than hardcoding magic strings — auto-adapt when constants change
  • Factory helpers centralise ChatCompletion construction — one place to update if models change
  • Fixture validation tests (47) and OpenAI compat tests auto-discover JSON files via glob — zero maintenance when files change
  • Canned output regression tests replay saved model outputs through the full transform pipeline to catch parsing regressions without calling a model. Use local YAML configs with float-tolerant comparison
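The auto-discovery and fail-loudly pattern described above can be sketched as follows; the helper name and directory layout are hypothetical, not the actual test code:

```python
import json
from pathlib import Path

def discover_fixtures(testdata: Path) -> list[Path]:
    """Collect every JSON fixture under testdata; fail loudly if the
    directory is empty so a misplaced path can't silently collect
    zero test cases."""
    files = sorted(testdata.glob("**/*.json"))
    assert files, f"no JSON fixtures found under {testdata}"
    return files

# Usage in a pytest module — each discovered file becomes its own test
# case, so adding a fixture requires zero test-code changes:
#   @pytest.mark.parametrize("path", discover_fixtures(TESTDATA),
#                            ids=lambda p: p.name)
#   def test_fixture_is_valid_json(path):
#       json.loads(path.read_text())
```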

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage where code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@planetf1 planetf1 changed the title test: add 199 unit tests for granite formatters (#812) test: add unit tests for granite formatters (#812) Apr 10, 2026
@planetf1 planetf1 requested review from frreiss and jakelorocco April 10, 2026 12:35
@planetf1 planetf1 marked this pull request as ready for review April 10, 2026 12:36
@planetf1 planetf1 requested a review from a team as a code owner April 10, 2026 12:36
@planetf1
Contributor Author

This is the first PR of a few I have considered creating to try and improve our unit test coverage.

In some cases we get coverage from examples or e2e tests, which is still important, but unit tests let us run more efficiently in more environments, with more specific detection of issues. It's important to note, of course, that tests written after/from the implementation code may make similar assumptions; for this reason they may not find more issues, but they do help, particularly in spotting regressions where we change some code and break something else.

@planetf1 planetf1 requested a review from a team as a code owner April 10, 2026 13:42
@planetf1 planetf1 requested a review from ajbozarth April 10, 2026 13:42
@planetf1
Contributor Author

planetf1 commented Apr 10, 2026

CI failure in the previous run was caused by NLTK punkt_tab data not being available — 3 tests that exercise the citation/hallucination parsing pipeline hit LookupError from nltk.sent_tokenize().

Root cause: nltk_check() only caught ImportError, not LookupError. This is a production bug — users with citations enabled would get the same failure.

Fixed in 6d851e6 and 4b917bf:

  • nltk_check() now catches LookupError and shows install instructions (matching original intent)
  • nltk>=3.9 declared as explicit core dependency — it was only present transitively via rouge_score but is directly used by granite citation parsing
  • quality.yml downloads punkt_tab data in CI (same pattern as Ollama model pulls)

CC @avinash2692 for the CI change.
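For reference, the CI step could look roughly like this; the step name and surrounding workflow structure are hypothetical, only the downloader command is standard NLTK usage:

```yaml
# Hypothetical quality.yml fragment; mirrors the existing Ollama model-pull pattern.
- name: Download NLTK punkt_tab data
  run: python -m nltk.downloader punkt_tab
```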

@planetf1 planetf1 requested a review from avinash2692 April 10, 2026 13:43
…#812)

Unit tests for Granite 3.2 and 3.3 input/output processors, shared
utilities, IntrinsicsResultProcessor canned data regression, and
OpenAI SDK compatibility. No GPU, network, or model downloads required.

- test_granite3_shared.py: find_substring_in_text, create_dict,
  parse_hallucinations_text, hallucination/citation span helpers
- test_granite32_output.py: citation parsing, model output splitting,
  validation, transform pipeline
- test_granite33_output.py: JSON-delimited citations/hallucinations,
  think/response extraction, controls cleanup
- test_granite32_input.py: system message matrix, sanitize, transform
- test_granite33_input.py: available_tools role, per-document roles
- test_intrinsics_canned_output.py: canned model outputs through
  IntrinsicsResultProcessor, Pydantic schema validation of fixtures
- test_openai_compat.py: ChatCompletion round-trip through OpenAI SDK

Closes generative-computing#812
nltk_check() only caught ImportError (package not installed) but missed
LookupError (package installed, data not downloaded). Users with nltk
present via transitive deps (rouge_score) got an unhelpful ValueError
instead of install instructions when punkt_tab was missing.

Also adds punkt_tab download to CI workflow — same pattern as Ollama
model pulls.
nltk is required by granite citation/hallucination parsing
(nltk.sent_tokenize) but was only present as a transitive dependency
of rouge_score. Pin >=3.9 for punkt_tab support (security fix over
pickle-based punkt).
@planetf1 planetf1 force-pushed the test/granite-formatter-unit-tests branch from 4b917bf to fbaca46 Compare April 10, 2026 14:12
- Fix granite_io → mellea in nltk_check error message (copy-paste from upstream)
- Consolidate duplicate nltk_check into optional.py (single source of truth)
- Remove unused imports in test_granite32_output.py
- Fix assertion logic in test_balanced_tags_no_warnings (or → and)
@planetf1
Contributor Author

Code review cleanups applied in 5a0c241:

  • granite_io → mellea in nltk_check() error messages — pre-existing copy-paste from upstream granite_io, now references mellea
  • Consolidated duplicate nltk_check() — was identical in both optional.py and util.py. Now lives only in optional.py; granite32/output.py and granite33/output.py import from there
  • Removed 4 unused imports in test_granite32_output.py
  • Fixed assertion logic in test_balanced_tags_no_warnings (or → and); the original logic was always true when no warning was present

All 198 tests pass, all pre-commit hooks clean.

- Fix invalid TCP port 98765 → 127.0.0.1:1 in OpenAI compat test
- Add assertions that testdata directories are non-empty (fail loudly
  instead of silently collecting zero test cases)
- Add unit tests for nltk_check LookupError handling
@jakelorocco
Contributor

I didn't see this issue / PR until after the work was done; but is there a reason not to just import the tests from the granite common library? https://github.com/ibm-granite/granite-common/tree/main/tests/granite_common

@planetf1
Contributor Author

I didn't see this issue / PR until after the work was done; but is there a reason not to just import the tests from the granite common library? https://github.com/ibm-granite/granite-common/tree/main/tests/granite_common

I wasn't aware of them. I'll take a look early next week and see how this relates. Thanks.

@planetf1 planetf1 marked this pull request as draft April 10, 2026 18:03
@planetf1
Contributor Author

Moving to draft while I review the linked repo.

Contributor

@ajbozarth ajbozarth left a comment


I read through the code changes and they look good, but have not taken a detailed look at the new test code. I had Claude do a review and it found the following:


A few things worth considering:

nltk as a hard dependency
nltk>=3.9 is now declared as a core dep in pyproject.toml. Previously the nltk_check() context manager implied it was optional. If citation parsing is always available for Granite users this is fine, but worth a conscious decision — it adds ~4MB to every install even for users who never use citations.

Weak caplog assertions
Several tests match on substrings like "different number" in caplog.text.lower() without asserting log level. These would survive message rewording silently. Consider assert any(r.levelname == "WARNING" and "different number" in r.message.lower() for r in caplog.records) or similar.

Missing edge cases for malformed model output
The output parser tests don't cover:

  • Unclosed citation tags (e.g. <co>1 without </co>)
  • Duplicate citation IDs with conflicting doc indices

Low risk since these match actual model output, but worth a note if the formatter is ever extended.

On the granite-common question
These tests cover Mellea's adapter layer (Granite32InputProcessor, VLLMExtraBody/chat_template_kwargs handling, Mellea-specific validation rules) rather than the core Granite format — so they're not redundant with granite-common. That said, it might be worth checking whether any base format behavior tests could be inherited to reduce future divergence risk.


I also wonder if the new dep should be in optional deps instead. It also seems like we could reuse some of the upstream granite tests, even if we still need to extend them.

@planetf1
Contributor Author

@jakelorocco had a quick look, and I see where the copy was done (and the divergence). Some of the tests can be adapted.

Overlap with PR #818

| Area | Overlap | Viability / Action |
| --- | --- | --- |
| Granite 3.2/3.3 output parsing — we have 7 test classes each testing private functions; they test via top-level processor | Partial — same code paths, different granularity | Skip — we already cover this better at the unit level |
| Granite 3.2/3.3 input processing — we test internal logic; they validate against HF tokenizer golden truth | Partial — complementary approaches | Cherry-pick tokenizer validation tests — need import rewrites, @pytest.mark.slow, skip-if-no-model, module-scoped HF fixtures |
| Shared output functions (find_substring_in_text, create_dict, etc.) | None — ours are unique, not tested upstream | N/A |
| IntrinsicsRewriter canned output — same golden-file pattern, similar fixtures | High — test code is redundant | Copy missing testdata (answer_relevance_*, gpt_oss_* golden files) and add scenarios to our existing fixtures; skip porting test code |
| OpenAI SDK compat — same fake-client technique, expect APIConnectionError | High — fully redundant | Skip — already covered |
| Base types (ToolCall, messages, ChatCompletion, GraniteChatCompletion validation) | None — we don't have these | Good candidate — pure Pydantic unit tests, only needs granite_common → mellea.formatters.granite import rewrites; fills a real coverage gap |
| Retrievers (Elasticsearch, InMemoryRetriever, embeddings) | None — out of PR scope | Separate PR — needs import rewrites, mocked ES, model download for embeddings (mark slow) |
| E2E with real models (test_run_model, test_run_transformers) | None — needs GPU/model downloads | Not viable as unit tests — could become slow + qualitative integration tests in a future PR |
| Meta: orphan file check — ensures every testdata file is referenced by a test | None — we don't have this | Easy win — trivial meta-test, prevents silent test rot, no external deps |
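The orphan-file check in the last row is cheap to sketch; the helper name and layout below are hypothetical, and the upstream test is structured differently:

```python
from pathlib import Path

def find_orphan_fixtures(testdata: Path, test_dir: Path) -> list[Path]:
    """Flag testdata files whose filename never appears in any test
    source, i.e. fixtures that nothing references any more."""
    sources = "\n".join(p.read_text() for p in sorted(test_dir.glob("test_*.py")))
    return [f for f in sorted(testdata.rglob("*"))
            if f.is_file() and f.name not in sources]
```

A single test asserting this returns an empty list prevents silent fixture rot with no external dependencies.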

There's good stuff there, but some of the content here is valuable too; the common goal is to ensure we have unit test coverage.

I'm happy to work through some of that porting/consolidation if it's useful. The question is whether we want to merge this first and then adapt, or whether you'd prefer to do it in one PR.

@ajbozarth
Contributor

Also ran uv run pytest locally and results look ok, only two unrelated failures in examples that I'm looking into separately:

2 failed, 1289 passed, 59 skipped, 22 deselected, 2 xfailed, 2 xpassed, 152 warnings in 1955.14s (0:32:35)

@planetf1
Contributor Author

@ajbozarth

The weak caplog assertions are addressed in 5a0c241.

On nltk:

  • The nltk dependency was already there. The difference is that it was previously implicit, i.e. pulled in as a transitive dependency via rouge_score. If you look at uv.lock it's been there all along, so there is no change to the install footprint.
  • Making it explicit adds clarity (in fact I think your comment demonstrates that): we have code that directly uses nltk, so best practice is to depend on it directly, not hide it behind transitivity.
  • This came up because for nltk to actually work there is a further requirement: an explicit data download step. In fact the code was previously failing silently because it did not catch the right exception, so the feature did not work.
  • The changes in this PR therefore ensure this is all clear and detectable; when I tried to run the tests I fell into the same hidden trap.

On testing more generally:
@ajbozarth I note the points about missing edge cases. @jakelorocco my plan is to refine this testing based on the comments above; I'd propose to do that in an additional PR and get this batch in first. Is that ok with you? (And Alex, I'd add the extra edge cases there too.)

@planetf1 planetf1 marked this pull request as ready for review April 13, 2026 08:47
@planetf1
Contributor Author

Moving back to ready. After evaluation, I'd propose to handle the remaining items in a follow-up PR.

@planetf1
Contributor Author

#827 opened with the proposed follow-up.

@planetf1 planetf1 requested review from ajbozarth and psschwei April 13, 2026 08:50
Member

@psschwei psschwei left a comment


LGTM
One minor comment (and one out of scope nice-to-have)
Since others are also reviewing, will let one of them give the approval

- Replace stale granite_io reference with mellea in optional.py docstring
- Use record-level caplog assertions in granite32 output tests for
  consistency with granite33 tests
- Fix stale granite_io reference in util.py import_optional docstring
- Tighten test_missing_colon_skipped assertion to verify empty result
Contributor

@ajbozarth ajbozarth left a comment


LGTM, left a note on #827 on the follow up work.

Edit: I also reran uv run pytest test/formatters/ locally and all passed:

290 passed, 1 xfailed, 2 warnings in 463.93s (0:07:43)

@planetf1 planetf1 added this pull request to the merge queue Apr 14, 2026
Merged via the queue into generative-computing:main with commit 2dcda8b Apr 14, 2026
10 checks passed
@planetf1 planetf1 deleted the test/granite-formatter-unit-tests branch April 14, 2026 08:48


Successfully merging this pull request may close these issues.

test: add unit tests for granite formatters (3.2, 3.3, intrinsics)
