feat: (m decomp) M Decompose Readme and Docstring Updates#767
Merged
csbobby merged 19 commits intogenerative-computing:mainfrom Apr 1, 2026
Merged
feat: (m decomp) M Decompose Readme and Docstring Updates#767csbobby merged 19 commits intogenerative-computing:mainfrom
csbobby merged 19 commits intogenerative-computing:mainfrom
Conversation
Contributor
|
The PR description has been updated. Please fill out the template for your PR to be reviewed. |
jakelorocco
approved these changes
Mar 31, 2026
Contributor
jakelorocco
left a comment
There was a problem hiding this comment.
doc updates look good to me; you have a few linting/formatting errors preventing a merge
added 2 commits
March 31, 2026 09:48
Contributor
Author
Thank you. Done. |
…ve-computing#727, generative-computing#728) (generative-computing#742) * test: add granularity marker taxonomy infrastructure (generative-computing#727) Register unit/integration/e2e markers in conftest and pyproject.toml. Add unit auto-apply hook in pytest_collection_modifyitems. Deprecate llm marker (synonym for e2e). Remove dead plugins marker. Rewrite MARKERS_GUIDE.md as authoritative marker reference. Sync AGENTS.md Section 3 with new taxonomy. * test: add audit-markers skill for test classification (generative-computing#728) Skill classifies tests as unit/integration/e2e/qualitative using general heuristics (Part 1) and project-specific rules (Part 2). Includes fixture chain tracing guidance, backend detection heuristics, and example file handling. References MARKERS_GUIDE.md for tables. * chore: add CLAUDE.md and agent skills infrastructure Add CLAUDE.md referencing AGENTS.md for project directives. Add skill-author meta-skill for cross-compatible skill creation. The audit-markers skill was added in the previous commit. * test: improve audit-markers skill quality and add resource predicates Resolve 8 quality issues from dry-run review of the audit-markers skill: - Add behavioural signal detection tables and Step 0 triage procedure for scaling to full-repo audits (grep for backend behaviour, not just existing markers) - Clarify unit/integration boundary with scope-of-mocks rule - Allow module-level qualitative when every function qualifies - Replace resource marker inference with predicate factory pattern - Make llm→e2e rule explicit for # pytest: comments in examples - Redesign report format: 3-tier output (summary table, issues-only detail, batch groups) instead of per-function listing - Remove stale infrastructure note (conftest hook already exists) Add test/predicates.py with reusable skipif decorators: require_gpu, require_ram, require_gpu_isolation, require_api_key, require_package, require_ollama, require_python. Update skill-author with dry-run review step and 4 new authoring guidelines (variable scope, category boundaries, temporal assertions, qualifying absolutes). Refs: generative-computing#727, generative-computing#728 * chore: remove issue references from audit-markers skill Epic/issue numbers are task context, not permanent skill knowledge. * docs: align MARKERS_GUIDE.md with predicate factory pattern MARKERS_GUIDE.md documented legacy resource markers (requires_gpu, etc.) as the active convention while SKILL.md instructed migration to predicates — a direct conflict that would cause the audit agent to stall or produce incorrect edits. - Replace resource markers section with predicate-first documentation - Move legacy markers to deprecated subsection (conftest still handles them) - Update common patterns example to use predicate imports - Add test/predicates.py to related files - Add explicit dry-run enforcement to SKILL.md Step 4 Refs: generative-computing#727, generative-computing#728 * fix: validate_skill.py schema mismatch and brittle YAML parsing Two bugs: - Required `version` at root level but skill-author guide nests it under `metadata` — guaranteed failure on valid skills - Naive `content.split('---')` breaks on markdown horizontal rules Fix: use yaml.safe_load_all for robust frontmatter extraction, check `name`/`description` at root and `version` under `metadata.version`. * fix: migrate deprecated llm markers to e2e, add backend registry, update audit-markers skill - Replace all `pytest.mark.llm` with `pytest.mark.e2e` across 34 test files and 87 example files (comment-based markers) - Add `BACKEND_MARKERS` data-driven registry in test/conftest.py as single source of truth for backend marker registration - Register `bedrock` backend marker in conftest.py, pyproject.toml, MARKERS_GUIDE.md, and add missing marker to test_bedrock.py - Reclassify test_alora_train.py as integration (was unit); add importorskip for peft dependency - Add missing `e2e` tier markers to test_tracing.py and test_tracing_backend.py - Update audit-markers skill: report-first default, predicate migration as fix (not recommendation), backend registry gap detection * feat: add estimate-vram skill and fix MPS VRAM detection - New /estimate-vram agent skill that analyses test files to determine correct require_gpu(min_vram_gb=N) and require_ram(min_gb=N) values by tracing model IDs and looking up parameter counts dynamically - Fix _gpu_vram_gb() in test/predicates.py to use torch.mps.recommended_max_memory() on macOS MPS instead of returning 0 - Fix get_system_capabilities() in test/conftest.py with same MPS path - Update test/README.md with predicates table and legacy marker deprecation - Add /estimate-vram cross-reference in audit-markers skill * refactor: fold estimate-vram into audit-markers skill VRAM estimation is only useful during marker audits, not standalone. Move the model-tracing and VRAM computation procedure into the audit-markers resource gating section and delete the separate skill. * docs: drop isolation refs and fix RAM guidance in markers docs requires_heavy_ram and requires_gpu_isolation are deprecated with no replacement — models load into VRAM not system RAM, and GPU isolation is now automatic. require_ram() stays available for genuinely RAM-bound tests but has no current use case. * docs: add legacy marker guidance for example files in audit-markers skill * refactor: remove require_ollama() predicate — redundant with backend marker The ollama backend marker + conftest auto-skip already handles Ollama availability. No other backend has a dedicated predicate — consistent to let the marker system handle it. * refactor: replace requires_heavy_ram gate with huggingface backend marker in examples conftest The legacy requires_heavy_ram marker (blanket 48 GB RAM threshold) conflated VRAM with system RAM. Replace both the collection-time and runtime skip logic to gate on the huggingface backend marker instead, which accurately checks GPU availability. * refactor: replace ad-hoc bedrock skipif with require_api_key predicate * refactor: migrate legacy resource markers to predicates Replace deprecated pytest markers with typed predicate functions from test/predicates.py across all test files and example files: - requires_gpu → require_gpu(min_vram_gb=N) with per-model VRAM estimates - requires_heavy_ram → removed (conflated VRAM with RAM; no replacement needed) - requires_gpu_isolation → removed (GPU isolation is now automatic) - requires_api_key → require_api_key("VAR1", "VAR2", ...) with explicit env vars Also removes spurious requires_gpu from ollama-backed tests (test_genslot, test_think_budget_forcing, test_component_typing) and adds missing integration marker to test_hook_call_sites. VRAM estimates computed from model parameter counts using bf16 formula (params_B × 2 × 1.2, rounded up to next even GB): - granite-3.3-8b: 20 GB, Mistral-7B: 18 GB, granite-4.0-micro (3B): 8 GB - Qwen3-0.6B: 4 GB (conservative for vLLM KV cache headroom) - granite-4.0-h-micro (3B): 8 GB, alora training (3B): 12 GB * test: skip collection gracefully when optional backend deps are missing Add pytest.importorskip() / pytest.importorskip() guards to 14 test files that previously aborted the entire test run with a ModuleNotFoundError when optional extras were not installed: - torch / llguidance (mellea[hf]): test_huggingface, test_huggingface_tools, test_alora_train_integration, test_intrinsics_formatters, test_core, test_guardian, test_rag, test_spans - litellm (mellea[litellm]): test_litellm_ollama, test_litellm_watsonx - ibm_watsonx_ai (mellea[watsonx]): test_watsonx - docling / docling_core (mellea[mify]): test_tool_calls, test_richdocument, test_transform With these guards, `uv run pytest` runs all collectable tests and reports skipped files with a clear reason instead of aborting at first ImportError. * test: refine integration marker definition and apply audit fixes Expand integration to cover SDK-boundary tests (OTel InMemoryMetricReader, InMemorySpanExporter, LoggingHandler) — tests that assert against a real third-party SDK contract, not just multi-component wiring. Updates SKILL.md and MARKERS_GUIDE.md with new definition, indicators, tie-breaker, and SDK-boundary signal tables. Applied fixes: - test/telemetry/test_{metrics,metrics_token,logging}.py: add integration marker - test/telemetry/test_metrics_backend.py: add openai marker to OTel+OpenAI test, remove redundant inline skip already covered by require_api_key predicate - test/cli/test_alora_train.py: add integration to test_imports_work (real LoraConfig) - test/formatters/granite/test_intrinsics_formatters.py: remove unregistered block_network marker - test/stdlib/components/docs/test_richdocument.py: add integration pytestmark + e2e/huggingface/qualitative on skipped generation test - test/backends/test_openai_ollama.py: note inherited module marker limitation - docs/examples/plugins/testing_plugins.py: add # pytest: unit * test: add importorskip guards and optional-dep skip logic for examples - test/plugins/test_payloads.py: importorskip("cpex") — skip module when mellea[hooks] not installed instead of failing mid-test with ImportError - test/telemetry/test_metrics_plugins.py: same cpex guard - docs/examples/conftest.py: extend _check_optional_imports to cover docling, pandas, cpex (mellea.plugins imports), and litellm; also call the check from pytest_pycollect_makemodule so directly-specified files are guarded too - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * fix: convert example import errors to skips; add cpex importorskip guards Replace per-dep import checks in examples conftest with a runtime approach: ExampleModule (a pytest.Module subclass) is now returned by pytest_pycollect_makemodule for all runnable example files, preventing pytest's default collector from importing them directly. Import errors in the subprocess are caught in ExampleItem.runtest() and converted to skips, so no optional dependency needs to be encoded in conftest. Remove _check_optional_imports entirely — it was hand-maintained and would need updating for every new optional dep. Also: - test/plugins/test_payloads.py: importorskip("cpex") - test/telemetry/test_metrics_plugins.py: importorskip("cpex") - docs/examples/image_text_models/README.md: add Prerequisites section listing models to pull (granite3.2-vision, qwen2.5vl:7b) * test: skip OTel-dependent tests when opentelemetry not installed Locally running without mellea[telemetry] caused three tests to fail with assertion errors rather than skip cleanly. Add importorskip at module level for test_tracing.py and a skipif decorator for the single OTel-gated test in test_astream_exception_propagation.py. * fix: use conservative heuristic for Apple Silicon GPU memory detection Metal's recommendedMaxWorkingSetSize is a static device property (~75% of total RAM) that ignores current system load. Replace it with min(total * 0.75, total - 16) so that desktop/IDE memory usage is accounted for. Also removes the torch dependency for GPU detection on Apple Silicon — sysctl hw.memsize is used directly. CUDA path on Linux is unchanged. * test: add training memory signals to audit-markers skill; bump alora VRAM gate Training tests need ~2x the base model inference memory (activations, optimizer states, gradient temporaries). The skill now detects training signals (train_model, Trainer, epochs=) and checks that require_gpu min_vram_gb uses the 2x rule. Bump test_alora_train_integration from min_vram_gb=12 to 20 (3B bfloat16: ~6 GB inference, ~12 GB training peak + headroom) so it skips correctly on 32 GB Apple Silicon under typical load. * fix: cache system capabilities result in examples conftest get_system_capabilities() was caching the function reference, not the result — causing the Ollama socket check (1s timeout) and full capability detection to re-run for every example file during collection (~102 times). Cache the result dict instead so detection runs exactly once. * fix: cache get_system_capabilities() result in test/conftest.py The function was called once per test in pytest_runtest_setup (325+ calls) and once at collection in pytest_collection_modifyitems, each time re-running the Ollama socket check (1s timeout when down), sysctl subprocess, and psutil query. Cache the result after the first call. * fix: flush MPS memory pool in intrinsic test fixture teardown torch.cuda.empty_cache() is a no-op on Apple Silicon MPS, leaving the MPS allocator pool occupied after each module fixture tears down. The next module then loads a fresh model into an already-pressured pool, causing the process RSS to grow unboundedly across modules. Both calls are now guarded so CUDA and MPS runs each get the correct flush. * fix: load LocalHFBackend model in config dtype to prevent float32 upcasting AutoModelForCausalLM.from_pretrained without torch_dtype may load weights in float32 on CPU before moving to MPS/CUDA, doubling peak memory briefly and leaving float32 remnants in the allocator pool. torch_dtype="auto" respects the model config (bfloat16 for Granite) for both the CPU load and the device transfer. * test: remove --isolate-heavy process isolation and bump intrinsic VRAM gates - Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721 - Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py - Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak - Add adapter accumulation signals to audit-markers skill - Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references * test: migrate legacy markers in test_intrinsics_formatters.py Replace deprecated @pytest.mark.llm, @pytest.mark.requires_gpu, @pytest.mark.requires_heavy_ram, @pytest.mark.requires_gpu_isolation with @pytest.mark.e2e and @require_gpu(min_vram_gb=12) to align with the new marker taxonomy (generative-computing#727/generative-computing#728). VRAM gate set to 12 GB matching the 3B-parameter model loaded across the parametrized test cases. * test: add integration marker to test_dependency_isolation.py * docs: document OLLAMA_KEEP_ALIVE=1m as memory optimisation for unordered test runs * fix: suppress mypy name-defined for torch.Tensor after importorskip change * fix: ruff format huggingface.py from_pretrained args * fix: ruff format test_watsonx.py and test_huggingface_tools.py * refactor: remove requires_gpu, requires_heavy_ram, requires_gpu_isolation markers and handlers * refactor: remove --ignore-*-check override flags from conftest * refactor: remove requires_api_key marker; fix api backend group to match watsonx+bedrock markers * fix: address review Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> * test: mark test_image_block_in_instruction as qualitative * chore: commit .claude/settings.json with skillLocations for skill discovery * docs: broaden audit-markers skill description to cover diagnostic use cases * docs: add diagnostic mode to audit-markers skill for troubleshooting skip/resource issues --------- Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com> Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Readme and Doc Strings PR
Type of PR
Description
Testing