
ci: memory management in tests#721

Merged
avinash2692 merged 36 commits into main from ci/625-memory-management-in-tests
Mar 26, 2026

Conversation

Member

@avinash2692 avinash2692 commented Mar 23, 2026

CI: memory management in tests

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Issues that it fixes

Problem

Running the full test suite on a single GPU causes OOM errors when vLLM tests start after the HuggingFace tests. This is likely because the 8B HF model stays resident in GPU memory: the existing cleanup (gc.collect() + empty_cache()) cannot free tensors that are still held by indirect references such as LRU caches, PEFT adapter hooks, accelerate dispatch hooks, and class-level _cached_blocks. The backends also contain redundant cleanup code, and this PR simplifies some of it.

Solution

  • Introduce cleanup_gpu_backend() — a unified cleanup function that calls model.cpu() to forcefully move tensors off GPU, then clears all GPU-resident state. This replaces the previous cleanup_vllm_backend() and aggressive_gpu_cleanup() which relied solely on gc.collect().

The cleanup logs GPU memory before and after, for easy debugging:

Cleaning up huggingface backend GPU memory...
  GPU before cleanup: 62.0GB free / 79.2GB total
  Cleared LRU cache
  Removed accelerate dispatch hooks
  GPU after cleanup: 78.1GB free / 79.2GB total (reclaimed 16.1GB)
  • There is a new pytest option, --group-by-backend, that groups tests by their backend marker. This gives us an opportunity to
    • have a unified vllm backend that is shared between tests
    • have aggressive cleanup of GPU memory between grouped backends
    • eliminate the need for process isolation (the CUDA lock still exists, but an 80 GB GPU has enough memory to run all the tests)
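As a sketch of what such a unified cleanup can look like (illustrative only; attribute names like `model` and `_cache` are assumptions, not the PR's exact implementation):

```python
import gc


def cleanup_gpu_backend(backend, label: str = "") -> None:
    """Illustrative unified cleanup: move weights off GPU, drop caches, collect.

    Attribute names (`model`, `_cache`) are assumptions for this sketch.
    """
    # Move model weights to CPU first, so empty_cache() can reclaim VRAM even
    # while indirect references (hooks, caches) keep the tensors alive.
    model = getattr(backend, "model", None)
    if model is not None and hasattr(model, "cpu"):
        model.cpu()

    # Drop LRU-cached KV tensors that would otherwise stay GPU-resident.
    cache = getattr(backend, "_cache", None)
    if cache is not None and hasattr(cache, "cache"):
        cache.cache.clear()

    gc.collect()
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch may be absent; the Python-side cleanup above still ran
```

The key ordering is model.cpu() before empty_cache(): the allocator can only return blocks whose tensors no longer live on the device.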

Files changed

File What changed
test/conftest.py Added cleanup_gpu_backend(). Removed redundant cleanup functions. Ungated memory_cleaner(). Replaced get_device_properties() with nvidia-smi to prevent CUDA fork errors. Added between-group GPU cleanup.
test/backends/test_huggingface.py return → yield + cleanup_gpu_backend()
test/backends/test_huggingface_tools.py Same
test/backends/test_vllm.py Fixed yield bug (return inside generator), updated cleanup
test/backends/test_vllm_tools.py Same
test/telemetry/test_metrics_backend.py Replaced bare del with cleanup_gpu_backend()
test/stdlib/components/intrinsic/test_rag.py Same
test/stdlib/test_spans.py Added cleanup_gpu_backend() on teardown
test/cli/test_alora_train_integration.py Added model.cpu() + cleanup after GPU usage
test/backends/test_openai_vllm.py Captured vllm serve subprocess output — skip reasons now show actual errors
mellea/backends/vllm.py Updated error messages for fork/exclusive_process errors. Early bail-out for non-OOM failures.
test/scripts/run_tests_with_ollama.sh New. End-to-end test runner — downloads ollama (no sudo), starts server, pulls/warms models, runs pytest, shuts down.
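The "Fixed yield bug" noted above for test_vllm.py is the classic pytest pitfall: a `return` inside a fixture that was meant to be a generator hands back the backend but makes any teardown code below it unreachable. A minimal sketch of the old and fixed pattern (names are illustrative; the decorator is shown in a comment so the driver below can exercise the generator directly):

```python
teardown_log: list[str] = []


def vllm_backend_fixture():
    # In the real test this function carries @pytest.fixture(scope="module").
    backend = {"name": "shared_vllm"}  # stand-in for an expensive GPU backend

    # BUG (old code): `return backend` here would hand the backend to tests,
    # but everything below the return would be dead code; no teardown ever ran.

    yield backend  # FIX: yield pauses here while the tests run

    # Teardown runs when the fixture finalizes, e.g. cleanup_gpu_backend(backend).
    teardown_log.append("cleaned up " + backend["name"])
```

Calling next() twice on the generator mirrors what pytest does at setup and at teardown.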

How to run

# End-to-end script with ollama startup and teardown
./test/scripts/run_tests_with_ollama.sh --group-by-backend --timeout=1200 -v -rs -s
 
# GPU tests only (no ollama)
uv run pytest test/ --group-by-backend -v

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and github automation passes (a maintainer will kick off the github automation when the rest of the PR is populated)

avinash2692 and others added 26 commits March 12, 2026 14:56
- Add session-scoped shared_vllm_backend fixture using Granite 4 Micro
- Update test_vllm.py and test_vllm_tools.py to use shared backend
- Fall back to module-scoped backends when --isolate-heavy flag is set
- Both modules now use consistent Granite 4 Micro model
- Enhance CUDA OOM error message with actionable solutions
- Maintains backward compatibility with existing isolation mechanism

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@avinash2692 avinash2692 requested a review from a team as a code owner March 23, 2026 15:32
@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@mergify

mergify bot commented Mar 23, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@avinash2692 avinash2692 changed the title from "Ci/625 memory management in tests" to "test: memory management in tests" Mar 23, 2026
@avinash2692
Member Author

I got:

Result: 1 failed, 832 passed, 37 skipped, 2 xfailed, 1 xpassed — 25:50 total

The failing test was test/backends/test_openai_ollama.py::test_chat_stream

Additionally, the vLLM tests failed (skipped) with OOM: it allocated a 60 GB KV cache, then ran out during test warmup. I think there was still another vLLM server running, plus Ollama.

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 784.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 449.06 MiB is free. Process 1986221 has 58.06 MiB memory in use. Process 1989725 has 1.09 GiB memory in use. Process 2000937 has 5.33 GiB memory in use. Including non-PyTorch memory, this process has 72.27 GiB memory in use. Of the allocated memory 69.51 GiB is allocated by PyTorch, with 222.00 MiB allocated in private pools (e.g., CUDA Graphs), and 272.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

I need to double-check the job submission params - my final submission was with:

bsub -J "mellea_-_all_tests" -q normal -G grp_runtime -cwd "/proj/dmfexp/eiger/users/jonesn/mellea-d" -o "/proj/dmfexp/eiger/users/jonesn/mellea-d/job_logs/mellea_-_all_tests_%J.stdout" -e "/proj/dmfexp/eiger/users/jonesn/mellea-d/job_logs/mellea_-_all_tests_%J.stderr" -gpu "num=1:mode=shared:j_exclusive=yes:mps=yes" "test/scripts/run_tests_with_ollama.sh --group-by-backend --timeout=1200 -v -rs -s"

i.e. I was still using MPS, as I'd needed before, though I don't think it would cause an issue. Will correct and rerun.

@planetf1 : Hmm, this is a little weird. You should never run out of memory on an 80 GB GPU for the tests that we are running (with the aggressive cleanup in place). Do you have a stack trace of the skips/failures that I can look at?

@planetf1
Contributor

I got:

@planetf1 : Hmm, this is a little weird. You should never run out of memory on an 80 GB GPU for the tests that we are running (with the aggressive cleanup in place). Do you have a stack trace of the skips/failures that I can look at?

Caused by the MPS flag. My second run correctly had this off; it affects CUDA isolation.

@planetf1
Contributor

@ajbozarth I think the test you saw fail is flaky. It's marked qualitative, and there's probably a race with ollama not handling parallel requests. It may be worth checking whether an issue already exists and opening one if not.

@ajbozarth
Contributor

I ran

bash ./test/scripts/run_tests_with_ollama.sh --group-by-backend --timeout=1200 -v -rs -s

inside of

bsub -Is -n 1 -G grp_preemptable -q preemptable -gpu "num=1/task:mode=shared:mps=no:j_exclusive=yes:gvendor=nvidia" /bin/bash

and got

==================================== ERRORS ====================================
_______________________ ERROR at setup of test_think_big _______________________

gh_run = 0

    @pytest.fixture(scope="module")
    def m_session(gh_run):
        """Start default Mellea's session."""
        if gh_run == 1:  # on github
            m = start_session(
                "ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
            )
        else:
>           m = start_session("ollama", model_id=MODEL_ID)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

test/stdlib/sampling/test_think_budget_forcing.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/stdlib/session.py:241: in start_session
    backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <mellea.backends.ollama.OllamaModelBackend object at 0x14f9c678ba70>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None

    def __init__(
        self,
        model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
        formatter: ChatFormatter | None = None,
        base_url: str | None = None,
        model_options: dict | None = None,
    ):
        """Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
        super().__init__(
            model_id=model_id,
            formatter=(
                formatter
                if formatter is not None
                else TemplateFormatter(model_id=model_id)
            ),
            model_options=model_options,
        )
        # Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
        self._get_ollama_model_id()
    
        # Setup the client and ensure that we have the model available.
        self._base_url = base_url
        self._client = ollama.Client(base_url)
    
        self._client_cache = ClientCache(2)
    
        # Call once to set up an async client and prepopulate the cache.
        _ = self._async_client
    
        if not self._check_ollama_server():
            err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
            FancyLogger.get_logger().error(err)
            raise Exception(err)
        if not self._pull_ollama_model():
            err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
            FancyLogger.get_logger().error(err)
>           raise Exception(err)
E           Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library

mellea/backends/ollama.py:97: Exception
------------------------------ Captured log setup ------------------------------
ERROR    fancy_logger:ollama.py:96 could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library
_____________________ ERROR at setup of test_think_little ______________________

gh_run = 0

    @pytest.fixture(scope="module")
    def m_session(gh_run):
        """Start default Mellea's session."""
        if gh_run == 1:  # on github
            m = start_session(
                "ollama", model_id=MODEL_ID, model_options={ModelOption.MAX_NEW_TOKENS: 5}
            )
        else:
>           m = start_session("ollama", model_id=MODEL_ID)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

test/stdlib/sampling/test_think_budget_forcing.py:31: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
mellea/stdlib/session.py:241: in start_session
    backend = backend_class(model_id, model_options=model_options, **backend_kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <mellea.backends.ollama.OllamaModelBackend object at 0x14f9c678ba70>
model_id = ModelIdentifier(hf_model_name='openai/gpt-oss-20b', ollama_name='gpt-oss:20b', watsonx_name=None, mlx_name=None, openai_name=None, bedrock_name='openai.gpt-oss-20b', hf_tokenizer_name=None)
formatter = None, base_url = None, model_options = None

    def __init__(
        self,
        model_id: str | ModelIdentifier = model_ids.IBM_GRANITE_4_MICRO_3B,
        formatter: ChatFormatter | None = None,
        base_url: str | None = None,
        model_options: dict | None = None,
    ):
        """Initialize an Ollama backend, connecting to the server and pulling the model if needed."""
        super().__init__(
            model_id=model_id,
            formatter=(
                formatter
                if formatter is not None
                else TemplateFormatter(model_id=model_id)
            ),
            model_options=model_options,
        )
        # Run the ollama model id accessor early, so that an Assertion fails immediately if we cannot find an ollama model id for the provided ModelIdentifier.
        self._get_ollama_model_id()
    
        # Setup the client and ensure that we have the model available.
        self._base_url = base_url
        self._client = ollama.Client(base_url)
    
        self._client_cache = ClientCache(2)
    
        # Call once to set up an async client and prepopulate the cache.
        _ = self._async_client
    
        if not self._check_ollama_server():
            err = f"could not create OllamaModelBackend: ollama server not running at {base_url}"
            FancyLogger.get_logger().error(err)
            raise Exception(err)
        if not self._pull_ollama_model():
            err = f"could not create OllamaModelBackend: {self._get_ollama_model_id()} could not be pulled from ollama library"
            FancyLogger.get_logger().error(err)
>           raise Exception(err)
E           Exception: could not create OllamaModelBackend: gpt-oss:20b could not be pulled from ollama library

mellea/backends/ollama.py:97: Exception
=========================== short test summary info ============================
SKIPPED [1] test/backends/test_openai_vllm.py:149: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:156: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:178: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:186: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:196: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:229: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [1] test/backends/test_openai_vllm.py:241: vLLM process not available: vLLM server exited before startup (code 1).
SKIPPED [17] test/conftest.py:786: Skipping test: watsonx API key not found in environment
SKIPPED [1] test/backends/test_bedrock.py:27: Skipping Bedrock backend tests if $AWS_BEARER_TOKEN_BEDROCK is not set.
SKIPPED [1] test/plugins/test_manager.py:33: must pass --disable-default-mellea-plugins for this test
SKIPPED [1] test/plugins/test_manager.py:45: must pass --disable-default-mellea-plugins for this test
SKIPPED [1] test/stdlib/components/docs/test_richdocument.py:100: unconditional skip
SKIPPED [1] test/stdlib/requirements/test_reqlib_python.py:216: Sandbox tests require llm-sandbox[docker] and Docker to be available
SKIPPED [1] test/stdlib/requirements/test_reqlib_python.py:227: Sandbox tests require llm-sandbox[docker] and Docker to be available
SKIPPED [1] test/stdlib/requirements/test_reqlib_python.py:240: Sandbox tests require llm-sandbox[docker] and Docker to be available
SKIPPED [1] test/telemetry/test_tracing_backend.py:67: Telemetry not initialized
SKIPPED [1] test/telemetry/test_tracing_backend.py:113: Telemetry not initialized
SKIPPED [1] test/telemetry/test_tracing_backend.py:158: Telemetry not initialized
SKIPPED [1] test/telemetry/test_tracing_backend.py:209: Telemetry not initialized
SKIPPED [1] test/telemetry/test_tracing_backend.py:245: Telemetry not initialized
SKIPPED [1] test/telemetry/test_tracing_backend.py:279: Telemetry not initialized
= 831 passed, 37 skipped, 2 xfailed, 1 xpassed, 131 warnings, 2 errors in 1497.04s (0:24:57) =

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
@ajbozarth
Contributor

Finally did some local testing on my Mac (M1 Max, 32GB):

uv run pytest -v:

845 passed, 81 skipped, 3 deselected, 2 xfailed, 1 xpassed, 112 warnings in 1130.18s (0:18:50)

uv run pytest --group-by-backend -v:

2 failed, 843 passed, 81 skipped, 3 deselected, 2 xfailed, 1 xpassed, 111 warnings in 1119.48s (0:18:39)

(the two failures are common qualitative failures)

I also ran these after removing the pytest.mark.requires_heavy_ram I added in #623, so this does fix #620 and #630

@avinash2692 I hope you don't mind but I pushed a commit removing those marks and updated the description to auto close those issues

@planetf1
Contributor

@ajbozarth there's a ruff format error that needs fixing.

@ajbozarth
Contributor

@ajbozarth there's a ruff format error that needs fixing.

Odd, pre-commit didn't catch that. @avinash2692 I've started another benchmark, so you'll need to pull and fix it, sorry.

@avinash2692 avinash2692 changed the title from "test: memory management in tests" to "CI: memory management in tests" Mar 24, 2026
@avinash2692 avinash2692 changed the title from "CI: memory management in tests" to "ci: memory management in tests" Mar 24, 2026
Comment on lines +290 to +299
# Cleanup GPU memory
base_model.cpu()
del model_with_adapter
del base_model
import gc

gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

Contributor

Should we be cleaning up after the second test as well?

Member Author

I think so, let me have a look at it.

Comment on lines +482 to +483
# 1. Clear the LRU cache (holds DynamicCache KV tensors on GPU)
if hasattr(backend, "_cache") and hasattr(backend._cache, "cache"):
Contributor

I feel like we should just grab the backend by class; I'm not the biggest fan of these hasattr calls when we know the types of backends we will be processing.

Member Author

Fair enough. This is just general purpose enough to clear any CUDA memory if we decide to reuse the LRU cache in another backend.

Also, I would like to keep this more generic because I don't want to assume that the user does/does not have hf installed in their mellea installation.

Comment on lines +470 to +481
try:
    import torch

    if torch.cuda.is_available():
        free_before, total = torch.cuda.mem_get_info()
        logger.info(
            f"  GPU before cleanup: {free_before / 1024**3:.1f}GB free "
            f"/ {total / 1024**3:.1f}GB total"
        )
    else:
        free_before = 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meta question: is this backend cleanup stuff we should just be doing in the __del__ functions of each backend? Like if I spin up a bunch of huggingface backends in my own code, shouldn't this garbage collection code be executed in those cases as well?

Member Author

Hmm, maybe? Right now, the reason this exists for hf backends is because we do tend to hold on to GPU memory in our LRUCache. This could be included in the teardown for hf (and vllm) backends, but we might need to figure out where that would happen in the execution of a mellea program (at the end of a session? at the end of the program?)

Contributor

but we might need to figure out where that would happen in the execution of a mellea program (at the end of a session? at the end of the program?)

could we put it in a function and let the user (or test) decide when to call it?

Contributor

++ on Alex's point: it seems like we should call it at the end of a session, and we should make sure we document the behaviour (maybe it's obvious). But given the size of these models, being able to do this explicitly may be needed.

Member Author

I'm happy to let the user call it, but the issue is that we would again be introducing complexity by asking the user to do GPU memory management. Maybe this is a helper function in the LocalHFBackend that can be used at the discretion of the user.

Contributor

Maybe this is a helper function in the LocalHFBackend that can be used at the discretion of the user.

This was what I was thinking
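The user-callable helper being discussed could look roughly like this (a sketch only; `LocalHFBackendSketch` and the method name `release_gpu_memory` are hypothetical, not an API this PR adds):

```python
import gc


class LocalHFBackendSketch:
    """Minimal stand-in for LocalHFBackend; `release_gpu_memory` is hypothetical."""

    def __init__(self):
        # Stands in for the LRU cache that holds DynamicCache KV tensors on GPU.
        self._cache: dict = {}

    def release_gpu_memory(self) -> None:
        """Free GPU-resident state at the caller's discretion.

        Keeps GPU memory management opt-in rather than implicit teardown.
        """
        self._cache.clear()
        gc.collect()
        try:
            import torch

            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # hf/torch may not be installed; nothing GPU-side to free
```

A user (or a test fixture) would call backend.release_gpu_memory() whenever they are done with the model, instead of relying on interpreter shutdown.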

Comment on lines +664 to +697
# Reorder tests by backend if requested
if config.getoption("--group-by-backend", default=False):
    logger = FancyLogger.get_logger()
    logger.info("Grouping tests by backend (--group-by-backend enabled)")

    # Group items by backend
    grouped_items = []
    seen = set()

    for group_name in BACKEND_GROUP_ORDER:
        marker = BACKEND_GROUPS[group_name]["marker"]
        group_tests = [
            item
            for item in items
            if item.get_closest_marker(marker) and id(item) not in seen
        ]

        if group_tests:
            logger.info(
                f"Backend group '{group_name}': {len(group_tests)} tests ({BACKEND_GROUPS[group_name]['description']})"
            )
            grouped_items.extend(group_tests)
            for item in group_tests:
                seen.add(id(item))

    # Add tests without backend markers at the end
    unmarked = [item for item in items if id(item) not in seen]
    if unmarked:
        logger.info(f"Unmarked tests: {len(unmarked)} tests")
        grouped_items.extend(unmarked)

    # Reorder in place
    items[:] = grouped_items
    logger.info(f"Total tests reordered: {len(items)}")
Contributor

If we do switch to using session fixtures for the backends like we do for vllm, this code becomes unnecessary.

Member Author

Hmm, why do you say so? We do still need to group the tests so that we can tear down backends between fixtures. But maybe I'm missing something here.

Comment on lines +736 to +748
if prev_group in ("vllm", "openai_vllm"):
    try:
        shared_backend_defs = (
            item.session._fixturemanager._arg2fixturedefs.get(
                "shared_vllm_backend"
            )
        )
        if shared_backend_defs:
            backend_instance = shared_backend_defs[-1].cached_result[0]
            if backend_instance is not None:
                cleanup_gpu_backend(
                    backend_instance, "shared-vllm-transition"
                )
Contributor

Don't the individual fixtures call this as well? Why call it again here?

Member Author

I don't think this happens for vllm tests because the backend is session-scoped, so the teardown also happens at session level.

@planetf1
Contributor

I'm doing some test classification in #742 which, whilst not colliding at a code level (I hope), does effectively depend on getting this change in. So as a general point, if we think this improves things, even if there are follow-ups, I'd err on merging. But I will comment on the discussions above.

Contributor

@jakelorocco jakelorocco left a comment

looks good to me; I think there might be additional improvements that could be done but we should try to get nightly tests running; thanks @avinash2692

@avinash2692 avinash2692 enabled auto-merge March 26, 2026 16:22
@avinash2692 avinash2692 added this pull request to the merge queue Mar 26, 2026
Merged via the queue into main with commit 19fd5c8 Mar 26, 2026
8 checks passed
planetf1 added a commit to planetf1/mellea that referenced this pull request Mar 27, 2026
…M gates

- Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(),
  and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721
- Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py
- Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans —
  correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was
  wrong and masked by the now-fixed MPS pool leak
- Add adapter accumulation signals to audit-markers skill
- Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references
planetf1 added a commit to planetf1/mellea that referenced this pull request Mar 28, 2026
…M gates

- Remove --isolate-heavy flag, _run_heavy_modules_isolated(), pytest_collection_finish(),
  and require_gpu_isolation() predicate — superseded by cleanup_gpu_backend() from PR generative-computing#721
- Remove dead requires_gpu/requires_api_key branches from docs/examples/conftest.py
- Bump min_vram_gb from 8 → 12 on test_guardian, test_core, test_rag, test_spans —
  correct gate for 3B base model (6 GB) + adapters + inference overhead; 8 GB was
  wrong and masked by the now-fixed MPS pool leak
- Add adapter accumulation signals to audit-markers skill
- Update AGENTS.md, test/README.md, MARKERS_GUIDE.md to remove --isolate-heavy references


Development

Successfully merging this pull request may close these issues.

  • creating multiple LocalHFBackend in pytest caused a memory leak
  • test_huggingface_token_metrics_integration is too heavy

4 participants