fix: flush MPS cache in alora test GPU cleanup (#790) #800
planetf1 merged 2 commits into generative-computing:main from
Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
jakelorocco left a comment:
Should we extract this into a function and then use it everywhere we currently repeat this pattern? For instance, we do a very similar garbage-collection dance in pytest_runtest_setup in test/conftest.py, but without the MPS cleanup, while other cleanup functions in that same file do include the MPS-style cleanup.
Or are there reasons we might not want to do MPS cleanup every time it's available?
Consolidate the duplicated gc.collect + CUDA/MPS cache flush pattern into a single flush_device_caches() function in test/conftest.py (sketched below):
- Replaces 4 inline flush sites with a single call
- Adds MPS support to sites that previously only handled CUDA (the pytest_runtest_setup backend transitions and the memory_cleaner fixture)
- Fixes a bug where gc.collect() was conditional on CUDA availability in pytest_runtest_setup (it now runs unconditionally)
- Adds torch.mps.synchronize() for parity with the CUDA synchronize()
- Enriches cleanup_gpu_backend() VRAM logging: device-aware reporting for both CUDA (free/total/allocated/reserved/fragmentation) and MPS (allocated/max), with reclaimed bytes on both paths
- Removes unused shutil/sys imports from test_alora_train_integration
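A minimal sketch of what the consolidated helper and the device-aware report could look like, assuming only standard torch.cuda / torch.mps APIs; the snapshot helper's name and exact fields are illustrative, not copied from the repository (for instance, torch.mps exposes current/driver allocation counters rather than a literal "max" figure):

```python
import gc

import torch


def flush_device_caches() -> None:
    """gc.collect() plus a cache flush on every accelerator backend present."""
    gc.collect()  # unconditional, so CPU-only runs still drop dead references
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for in-flight kernels before freeing
        torch.cuda.empty_cache()  # return cached blocks to the CUDA allocator
    if torch.backends.mps.is_available():
        torch.mps.synchronize()   # parity with the CUDA synchronize() above
        torch.mps.empty_cache()   # release cached MPS memory back to the OS


def device_memory_snapshot() -> str:
    """Illustrative device-aware memory report, in the spirit of the
    enriched cleanup_gpu_backend() logging described above."""
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        allocated = torch.cuda.memory_allocated()
        reserved = torch.cuda.memory_reserved()
        # Fragmentation: reserved-but-unallocated bytes as a share of reserved.
        frag = (reserved - allocated) / reserved if reserved else 0.0
        return (f"CUDA free={free} total={total} allocated={allocated} "
                f"reserved={reserved} fragmentation={frag:.1%}")
    if torch.backends.mps.is_available():
        return (f"MPS allocated={torch.mps.current_allocated_memory()} "
                f"driver={torch.mps.driver_allocated_memory()}")
    return "no accelerator available"
```

Reclaimed bytes would then fall out of comparing two snapshots taken before and after the flush.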
force-pushed from a3e95ae to 29922a3
@jakelorocco Done. I think the cleanup is safe everywhere; extracted it into a shared flush_device_caches() helper in test/conftest.py.
Have also run on LSF as well as locally (macOS). Just one qualitative failure, in test/backends/test_ollama.py: one of the 4 raw prompts ("What is 2+2?") returned an empty string, while the other 3 got valid responses. So it's an LLM quality failure rather than a code issue; happy with the test run.
839eead Flush MPS cache in alora test GPU cleanup
Type of PR
Description
PR #765 added CUDA GPU cleanup to the alora integration tests, but the
MPS equivalent was missing. On Apple Silicon with MPS-capable PyTorch,
GPU memory isn't reclaimed between tests.
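A minimal sketch of the resulting teardown idiom, assuming the existing CUDA block follows the common gc + empty_cache pattern; the surrounding test body is omitted and only the cleanup step is shown:

```python
import gc

import torch

# End-of-test cleanup: the CUDA branch dates from PR #765; the MPS
# branch is the addition this PR describes.
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()
if torch.backends.mps.is_available():
    torch.mps.empty_cache()  # without this, MPS memory lingers between tests
```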
Changes in this PR:
- Call torch.mps.empty_cache() after the CUDA cleanup blocks in both test_alora_training_integration and test_lora_training_integration
- Mirrors the existing pattern in test/conftest.py:373-374

Testing