test: add unit tests for granite formatters (#812)#818
Conversation
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
|
This is the first PR of a few I have considered creating to try and improve our unit test coverage. In some cases we get coverage from either examples, or e2e tests - which is still important, but we can potentially run tests more efficiently in more environments, with more specific detection of issues with more unit test. It's important to note of course that tests written after/from implementation code may make similar assumptions - for this reason they may not find more issues, but they do help particularly in spotting regressions where we change some code and break something else. |
|
CI failure in the previous run was caused by NLTK Root cause:
CC @avinash2692 for the CI change. |
…#812) Unit tests for Granite 3.2 and 3.3 input/output processors, shared utilities, IntrinsicsResultProcessor canned data regression, and OpenAI SDK compatibility. No GPU, network, or model downloads required. - test_granite3_shared.py: find_substring_in_text, create_dict, parse_hallucinations_text, hallucination/citation span helpers - test_granite32_output.py: citation parsing, model output splitting, validation, transform pipeline - test_granite33_output.py: JSON-delimited citations/hallucinations, think/response extraction, controls cleanup - test_granite32_input.py: system message matrix, sanitize, transform - test_granite33_input.py: available_tools role, per-document roles - test_intrinsics_canned_output.py: canned model outputs through IntrinsicsResultProcessor, Pydantic schema validation of fixtures - test_openai_compat.py: ChatCompletion round-trip through OpenAI SDK Closes generative-computing#812
nltk_check() only caught ImportError (package not installed) but missed LookupError (package installed, data not downloaded). Users with nltk present via transitive deps (rouge_score) got an unhelpful ValueError instead of install instructions when punkt_tab was missing. Also adds punkt_tab download to CI workflow — same pattern as Ollama model pulls.
nltk is required by granite citation/hallucination parsing (nltk.sent_tokenize) but was only present as a transitive dependency of rouge_score. Pin >=3.9 for punkt_tab support (security fix over pickle-based punkt).
4b917bf to
fbaca46
Compare
- Fix granite_io → mellea in nltk_check error message (copy-paste from upstream) - Consolidate duplicate nltk_check into optional.py (single source of truth) - Remove unused imports in test_granite32_output.py - Fix assertion logic in test_balanced_tags_no_warnings (or → and)
|
Code review cleanups applied in 5a0c241:
All 198 tests pass, all pre-commit hooks clean. |
- Fix invalid TCP port 98765 → 127.0.0.1:1 in OpenAI compat test - Add assertions that testdata directories are non-empty (fail loudly instead of silently collecting zero test cases) - Add unit tests for nltk_check LookupError handling
|
I didn't see this issue / PR until after the work was done; but is there a reason not to just import the tests from the granite common library? https://github.com/ibm-granite/granite-common/tree/main/tests/granite_common |
Not being aware of them - I'll take a look early next week and see how this relates. thanks |
|
Moving to draft to review repo link |
ajbozarth
left a comment
There was a problem hiding this comment.
I read through the code changes and they look good, but have not taken a detailed look at the new test code. I had Claude do a review and it found the following:
A few things worth considering:
nltk as a hard dependency
nltk>=3.9 is now declared as a core dep in pyproject.toml. Previously the nltk_check() context manager implied it was optional. If citation parsing is always available for Granite users this is fine, but worth a conscious decision — it adds ~4MB to every install even for users who never use citations.
Weak caplog assertions
Several tests match on substrings like "different number" in caplog.text.lower() without asserting log level. These would survive message rewording silently. Consider assert any(r.levelname == "WARNING" and "different number" in r.message.lower() for r in caplog.records) or similar.
Missing edge cases for malformed model output
The output parser tests don't cover:
- Unclosed citation tags (e.g.
<co>1without</co>) - Duplicate citation IDs with conflicting doc indices
Low risk since these match actual model output, but worth a note if the formatter is ever extended.
On the granite-common question
These tests cover Mellea's adapter layer (Granite32InputProcessor, VLLMExtraBody/chat_template_kwargs handling, Mellea-specific validation rules) rather than the core Granite format — so they're not redundant with granite-common. That said, it might be worth checking whether any base format behavior tests could be inherited to reduce future divergence risk.
I also wonder if the new dep should be in optional deps instead. Also it seems like we could reuse some of the upstream granite tests even if we still need to extend them
|
@jakelorocco had a quick look - and see where the copy was done (for divergence). Some of the tests can be adapted Overlap with PR #818
There's good stuff there - but some of the content here is valuable too - common goal is to ensure we have unit test coverage. I'm happy to work through some of that porting/consolidation if it's useful? The question becomes whether we want to merge this first, then adapt, or if you'd prefer to do it in one PR? |
|
Also ran 2 failed, 1289 passed, 59 skipped, 22 deselected, 2 xfailed, 2 xpassed, 152 warnings in 1955.14s (0:32:35) |
|
The weak caplog assertions are in 5a0c241 On ntlk:
On testing more generally: |
|
Moving back to ready. After evaluation I'd propose to do followups in a followup PR. |
|
#827 opened with proposed followup |
psschwei
left a comment
There was a problem hiding this comment.
LGTM
One minor comment (and one out of scope nice-to-have)
Since others are also reviewing, will let one of them give the approval
- Replace stale granite_io reference with mellea in optional.py docstring - Use record-level caplog assertions in granite32 output tests for consistency with granite33 tests
- Fix stale granite_io reference in util.py import_optional docstring - Tighten test_missing_colon_skipped assertion to verify empty result
2dcda8b
Misc PR
Type of PR
Description
198 unit tests for granite formatters (3.2, 3.3, intrinsics). No GPU, network, or model downloads — runs in ~35s.
Why these tests exist
mellea/formatters/granite/has 2347 statements at 52% coverage. However The only existing tests are gated behinde2e+huggingface+require_gpu(min_vram_gb=12)and never run in CI. They also take time and resource to run locally which impacts developers.This PR adds fast regression detection for the parsing and serialisation logic. The new tests won't pick up new issues necessarily, but will help prevent us from unintentionally breaking the code.
End result is 82% from unit testing only
What's covered
test_granite3_shared.pyfind_substring_in_text,create_dict,parse_hallucinations_text, span helperstest_granite32_output.py<co>N</co>), model output splitting, validation, full transform pipelinetest_granite33_output.py<think>/<response>extraction, controls cleanup (#173)test_granite32_input.pytest_granite33_input.pyavailable_toolsrole, per-documentdocument_idrole formattingtest_intrinsics_canned_output.pytest_openai_compat.pyNote for reviewers
These tests cover the Granite 3.2/3.3 formatters. Granite 4.x currently reuses the 3.x parsers — are there any gaps in parsing behaviour for 4.x that these tests should cover? CC @frreiss
Production bug fix:
nltk_check()missingLookupErrorWriting these unit tests exposed a production bug in
nltk_check()(mellea/formatters/granite/base/optional.pyandutil.py).Problem:
nltk_check()only caughtImportError(package missing) but notLookupError(package installed,punkt_tabdata not downloaded). Sincenltkis always present as a transitive dep ofrouge_score, theImportErrorpath was effectively dead code — users got an unhelpfulValueErrorinstead of install instructions.Fixes:
nltk_check()now catches bothImportErrorandLookupError, matching the original intentnltk>=3.9declared as explicit core dependency (citation parsing is part of the granite formatter, not optional)punkt_tabdownload added toquality.yml— same pattern as Ollama model pulls. CC @avinash2692Note:
punkt_tabis NLTK tokenizer data, separate from the pip package — no way to solve via dependency declarations alone. Current approach: clear error message for users + explicit CI setup (matches Haystack and NLTK's own CI). Alternative would be auto-downloading at runtime, as Unstructured does, but that makes unexpected network calls.Code review cleanups
granite_io→melleain error messages:nltk_check()error messages referencedgranite_io(copy-paste from upstream). Fixed to referencemellea.nltk_check: was duplicated in bothoptional.pyandutil.py. Consolidated intooptional.pyas single source of truth — consumers ingranite32/output.pyandgranite33/output.pynow import from there.test_granite32_output.pytest_balanced_tags_no_warnings(or→and)Design decisions
ChatCompletionconstruction — one place to update if models changeTesting