Conversation
- Add session-scoped `shared_vllm_backend` fixture using Granite 4 Micro
- Update test_vllm.py and test_vllm_tools.py to use shared backend
- Fall back to module-scoped backends when `--isolate-heavy` flag is set
- Both modules now use consistent Granite 4 Micro model
- Enhance CUDA OOM error message with actionable solutions
- Maintain backward compatibility with existing isolation mechanism

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit: this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
@planetf1: Hmm, this is a little weird. You should never run out of memory on an 80GB GPU for the tests that we are running (with the aggressive cleanup in place). Do you have a stack trace of the skips/failures that I can look at?

Caused by the MPS flag. My second run correctly had this off - it affects CUDA isolation.
@ajbozarth I think the test you saw fail is flaky. It's marked qualitative, and there's probably a race with ollama not handling requests in parallel. Maybe worth checking if there's already an issue and, if not, opening one.
|
I ran inside of and got
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
|
Finally did some local testing on my Mac (M1 Max, 32GB):
845 passed, 81 skipped, 3 deselected, 2 xfailed, 1 xpassed, 112 warnings in 1130.18s (0:18:50)
2 failed, 843 passed, 81 skipped, 3 deselected, 2 xfailed, 1 xpassed, 111 warnings in 1119.48s (0:18:39)

(The two failures are common qualitative failures.) I also ran these after removing the marks. @avinash2692 I hope you don't mind, but I pushed a commit removing those marks and updated the description to auto-close those issues.
@ajbozarth there's a ruff format error that needs fixing.

Odd that pre-commit didn't catch that. @avinash2692 I've started another benchmark, so you'll need to pull and fix it, sorry.
```python
# Cleanup GPU memory
base_model.cpu()
del model_with_adapter
del base_model
import gc

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
```
Should we be cleaning up after the second test as well?
I think so, let me have a look at it.
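One way to make the cleanup run after every test that uses the model, rather than only where it is done inline, is a yield-style fixture. A minimal sketch of the pattern, where `make_model_fixture`, `load_model`, and `cleanup` are hypothetical names, not mellea's actual API:

```python
import gc


def make_model_fixture(load_model, cleanup):
    """Generator usable as a pytest fixture body: yields the loaded model
    to the test, then always runs cleanup, even if the test fails."""
    model = load_model()
    try:
        yield model
    finally:
        cleanup(model)  # e.g. model.cpu(); torch.cuda.empty_cache() on CUDA
        gc.collect()
```

Driving the generator by hand (as pytest would) shows the teardown ordering: the model is yielded first, and the cleanup callback fires exactly once when the generator is exhausted.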
```python
# 1. Clear the LRU cache (holds DynamicCache KV tensors on GPU)
if hasattr(backend, "_cache") and hasattr(backend._cache, "cache"):
```
I feel like we should just grab the backend by class; I'm not the biggest fan of these hasattr calls when we know the types of backends we will be processing.
Fair enough. This is just general-purpose enough to clear any CUDA memory if we decide to reuse the LRU cache in another backend.

Also, I'd like to keep this more generic because I don't want to assume whether the user does or does not have hf installed in their mellea installation.
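The two options being weighed can be sketched side by side; `HFBackendStub` is an illustrative stand-in class, not mellea's real backend:

```python
# Duck-typed check (the approach in the diff): works for any backend that
# later grows an LRU cache, at the cost of attribute probing.
def clear_cache_duck(backend):
    if hasattr(backend, "_cache") and hasattr(backend._cache, "cache"):
        backend._cache.cache.clear()
        return True
    return False


class HFBackendStub:
    """Stand-in for a concrete backend type holding an LRU cache."""

    def __init__(self):
        self._cache = type("LRU", (), {"cache": {}})()


# Class-based alternative raised in the review: explicit, but couples the
# cleanup code to a concrete backend type (and its import).
def clear_cache_typed(backend):
    if isinstance(backend, HFBackendStub):
        backend._cache.cache.clear()
        return True
    return False
```

The duck-typed version degrades gracefully (returns `False` on unrelated objects), which is the "don't assume hf is installed" property argued for above; the typed version fails closed on anything it doesn't know.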
```python
try:
    import torch

    if torch.cuda.is_available():
        free_before, total = torch.cuda.mem_get_info()
        logger.info(
            f" GPU before cleanup: {free_before / 1024**3:.1f}GB free "
            f"/ {total / 1024**3:.1f}GB total"
        )
    else:
        free_before = 0
```
Meta question: is this backend cleanup stuff we should just be doing in the `__del__` functions of each backend? Like, if I spin up a bunch of huggingface backends in my own code, shouldn't this garbage collection code be executed in those cases as well?
hmm, maybe? Right now, the reason this exists for hf backends is because we do tend to hold on to GPU memory in our LRUCache. This could be included in the teardown for hf (and vllm) backends, but we might need to figure out where that would happen in the execution of a mellea program (at the end of a session? at the end of the program?)
> but we might need to figure out where that would happen in the execution of a mellea program (at the end of a session? at the end of the program?)

could we put it in a function and let the user (or test) decide when to call it?
++ on Alex's point - it seems like we should call it at the end of a session, and we should ensure we document the behaviour (maybe it's obvious). But given the size of these models, being able to do this explicitly may be needed?
I'm happy to let the user call it, but the issue is that we will be introducing complexity again by asking the user to do GPU memory management. Maybe this is a helper function on the LocalhfBackend that can be used at the discretion of the user.
> Maybe this is a helper function in the LocalhfBackend that can be used based on the discretion of the user.

This was what I was thinking
```python
# Reorder tests by backend if requested
if config.getoption("--group-by-backend", default=False):
    logger = FancyLogger.get_logger()
    logger.info("Grouping tests by backend (--group-by-backend enabled)")

    # Group items by backend
    grouped_items = []
    seen = set()

    for group_name in BACKEND_GROUP_ORDER:
        marker = BACKEND_GROUPS[group_name]["marker"]
        group_tests = [
            item
            for item in items
            if item.get_closest_marker(marker) and id(item) not in seen
        ]

        if group_tests:
            logger.info(
                f"Backend group '{group_name}': {len(group_tests)} tests "
                f"({BACKEND_GROUPS[group_name]['description']})"
            )
            grouped_items.extend(group_tests)
            for item in group_tests:
                seen.add(id(item))

    # Add tests without backend markers at the end
    unmarked = [item for item in items if id(item) not in seen]
    if unmarked:
        logger.info(f"Unmarked tests: {len(unmarked)} tests")
        grouped_items.extend(unmarked)

    # Reorder in place
    items[:] = grouped_items
    logger.info(f"Total tests reordered: {len(items)}")
```
If we do switch to using session fixtures for the backends like we do for vllm, this code becomes unnecessary.
hmm, why do you say so? We do still need to group the tests so that we can tear down backends between fixtures. But maybe I'm missing something here.
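The grouping logic in the diff amounts to a stable sort on group rank, which may make its behavior easier to see; the group names and the `(test_name, markers)` tuples below are illustrative stand-ins for the real pytest items:

```python
# Illustrative group order; the real BACKEND_GROUP_ORDER lives in conftest.py.
BACKEND_GROUP_ORDER = ["ollama", "huggingface", "vllm"]


def group_rank(markers):
    """Rank of the first matching backend group; unmarked tests sort last."""
    for rank, name in enumerate(BACKEND_GROUP_ORDER):
        if name in markers:
            return rank
    return len(BACKEND_GROUP_ORDER)


def reorder(items):
    # items: list of (test_name, set_of_markers). sorted() is stable, so the
    # relative order of tests within each backend group is preserved, exactly
    # as the seen-set loop in the diff preserves it.
    return sorted(items, key=lambda item: group_rank(item[1]))
```

Keeping all tests of one backend adjacent is what makes a single between-group teardown point possible at all, whichever of the two implementations is used.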
```python
if prev_group in ("vllm", "openai_vllm"):
    try:
        shared_backend_defs = (
            item.session._fixturemanager._arg2fixturedefs.get(
                "shared_vllm_backend"
            )
        )
        if shared_backend_defs:
            backend_instance = shared_backend_defs[-1].cached_result[0]
            if backend_instance is not None:
                cleanup_gpu_backend(
                    backend_instance, "shared-vllm-transition"
                )
```
Don't the individual fixtures call this as well? Why call it again here?
I don't think this happens for vllm tests, because the backend is at session level and so the teardown also happens at session level.
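That scope mismatch is the crux: a session-scoped fixture's teardown runs only once, at the very end of the session, so freeing the GPU between backend groups has to happen outside the fixture. A toy model of the sequencing, with all names hypothetical:

```python
log = []


def shared_vllm_backend():
    """Session-scoped fixture body: created once, torn down at session end."""
    backend = "vllm-backend"
    yield backend
    log.append("session-teardown")  # too late for a between-group cleanup


def on_group_transition(prev_group, backend):
    # Mirrors the conftest hook: explicitly free GPU memory when the run
    # leaves the vllm group, long before the session fixture tears down.
    if prev_group in ("vllm", "openai_vllm"):
        log.append(f"cleanup:{backend}")
```

Walking through it in order, the explicit transition cleanup fires while the fixture is still alive, and the fixture's own teardown only appends its entry once the session (generator) is exhausted.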
I'm doing some test classification in #742 which, whilst not colliding at a code level (I hope), does effectively depend on getting this change in. So as a general point, if we think this improves things, even if there are follow-ups, I'd err on the side of merging. But I will comment on the discussions above.
jakelorocco
left a comment
Looks good to me; I think there might be additional improvements that could be made, but we should try to get the nightly tests running. Thanks @avinash2692
…M gates

- Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(), and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721
- Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py
- Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans — correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was wrong and masked by the now-fixed MPS pool leak
- Add adapter accumulation signals to audit-markers skill
- Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references
CI: memory management in tests
Type of PR
Issues that it fixes
Problem
Running the full test suite on a single GPU causes OOM errors when vLLM tests start after HuggingFace tests. This could be because the 8B HF model stays resident in GPU memory: the existing cleanup (`gc.collect()` + `empty_cache()`) cannot free tensors held by indirect references — LRU caches, PEFT adapter hooks, accelerate dispatch hooks, and class-level `_cached_blocks`. There are also redundancies in the backends, and this simplifies some of it.

Solution
`cleanup_gpu_backend()` — a unified cleanup function that calls `model.cpu()` to forcefully move tensors off GPU, then clears all GPU-resident state. This replaces the previous `cleanup_vllm_backend()` and `aggressive_gpu_cleanup()`, which relied solely on `gc.collect()`. Cleanup logs before/after GPU memory for easy debugging.
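In outline, the unified cleanup plausibly looks something like the sketch below. This is a reconstruction from the description, not the PR's actual code; the `model` and `_cache` attribute names are assumptions:

```python
import gc
import logging

logger = logging.getLogger(__name__)


def cleanup_gpu_backend(backend, label=""):
    """Move the model off GPU, drop cached state, then collect garbage."""
    model = getattr(backend, "model", None)
    if model is not None and hasattr(model, "cpu"):
        model.cpu()  # forcefully move tensors off the GPU
    cache = getattr(backend, "_cache", None)
    if cache is not None and hasattr(cache, "clear"):
        cache.clear()  # LRU cache can pin KV tensors on GPU
    gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            free, total = torch.cuda.mem_get_info()
            logger.info(
                "[%s] GPU after cleanup: %.1fGB free / %.1fGB total",
                label, free / 1024**3, total / 1024**3,
            )
    except ImportError:
        pass  # CPU-only environment: nothing to free
```

The `model.cpu()` step is the important difference from `gc.collect()`-only cleanup: it moves tensors off the device even while indirect references (adapter hooks, dispatch hooks, caches) keep the Python objects alive.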
A new pytest option `--group-by-backend` that groups tests based on the backend marker. This gives us an opportunity to

Files changed
- `test/conftest.py`: Added `cleanup_gpu_backend()`. Removed redundant cleanup functions. Ungated `memory_cleaner()`. Replaced `get_device_properties()` with `nvidia-smi` to prevent CUDA fork errors. Added between-group GPU cleanup.
- `test/backends/test_huggingface.py`: `return` → `yield` + `cleanup_gpu_backend()`
- `test/backends/test_huggingface_tools.py`:
- `test/backends/test_vllm.py`: (`return` inside generator), updated cleanup
- `test/backends/test_vllm_tools.py`:
- `test/telemetry/test_metrics_backend.py`: replaced `del` with `cleanup_gpu_backend()`
- `test/stdlib/components/intrinsic/test_rag.py`:
- `test/stdlib/test_spans.py`: `cleanup_gpu_backend()` on teardown
- `test/cli/test_alora_train_integration.py`: `model.cpu()` + cleanup after GPU usage
- `test/backends/test_openai_vllm.py`: `vllm serve` subprocess output — skip reasons now show actual errors
- `mellea/backends/vllm.py`:
- `test/scripts/run_tests_with_ollama.sh`:

How to run
Testing