fix: hf metrics tests run out of memory #623
ajbozarth merged 2 commits into generative-computing:main from
Conversation
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
Would this be useful for other HF tests (in addition to the telemetry ones)?

For any test that uses
When you were digging into this, did you figure out why this is happening? I know we have an explicit garbage collection method used when CICD=1: https://github.com/generative-computing/mellea/blob/main/test/conftest.py#L528. Were you able to run the other huggingface tests when CICD=0 before this? And only this one new test was causing issues? I don't understand what the difference is between instantiating multiple hf backends in a single test file vs. across several tests that would cause different issues.
This fix copies what other Hugging Face tests do by loading the model only once and making sure to clean it up after the tests are done. Otherwise, as the pytest suite goes on, memory fills up more and more; this is also why it passes fine when only running the telemetry tests.

I looked into this to double-check that no other tests were running into this, and only my test pushed the RAM over the edge. It does seem like there were already other tests with this issue; I'll address those as well and push an update.
After some more testing, this doesn't actually solve the problem. The core issue is that when a LocalHFBackend is created it loads the model into memory, and that model stays there until the process exits. No form of Python garbage collection can solve this. AFAIK it can only be solved with the isolation that #605 is working on; I would argue that that PR needs to be updated to isolate any such test.

As it stands, this PR does add good practice, but it doesn't solve #620. We can still merge it, but I'll be removing the auto-close.
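The claim that garbage collection can't help here follows from how CPython reclaims memory: an object stays alive as long as anything references it, and a backend cached at module or session level is referenced for the life of the process. A minimal stdlib illustration (not mellea code):

```python
import gc
import weakref


class Model:
    """Stand-in for a multi-gigabyte HF model."""


backend_model = Model()           # long-lived reference, like a cached backend
ref = weakref.ref(backend_model)  # lets us observe when the object is freed

gc.collect()
alive_after_gc = ref() is not None    # True: gc cannot free a referenced object

backend_model = None                  # only dropping the last reference...
gc.collect()
alive_after_drop = ref() is not None  # ...lets the memory go
```

Since the test process itself holds the reference, the only way to guarantee the memory is returned to the OS mid-suite is to load the model in a separate process that exits when the test is done — which is the isolation approach #605 pursues.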
Add @pytest.mark.requires_heavy_ram to tests that instantiate LocalHFBackend to address memory leak issues when running these tests in pytest. This ensures tests are skipped on systems without sufficient RAM.

Changes:
- test/telemetry/test_metrics_backend.py: Added marker to test_huggingface_token_metrics_integration
- test/stdlib/test_spans.py: Added marker to module-level pytestmark

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
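The `requires_heavy_ram` marker is defined in mellea's own conftest; a hypothetical way to gate tests on physical RAM looks like the following. The 16 GiB threshold and the `total_ram_bytes` helper are assumptions for illustration, not the project's real logic:

```python
import os

import pytest

MIN_RAM_BYTES = 16 * 1024**3  # assumed threshold for "heavy" HF tests


def total_ram_bytes() -> int:
    """Best-effort physical RAM lookup; returns 0 where sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return 0


# pytest.mark.skipif evaluates the condition once, at collection time
requires_heavy_ram = pytest.mark.skipif(
    total_ram_bytes() < MIN_RAM_BYTES,
    reason="loads a full HF model into memory; needs a machine with ample RAM",
)


@requires_heavy_ram
def test_huggingface_token_metrics_integration():
    ...  # would instantiate LocalHFBackend here
```

Applying the same marker via a module-level `pytestmark = requires_heavy_ram` covers every test in a file at once, which matches the test_spans.py change above.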
5411760
Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
* fix(vllm): implement shared backend to prevent GPU OOM errors
  - Add session-scoped shared_vllm_backend fixture using Granite 4 Micro
  - Update test_vllm.py and test_vllm_tools.py to use shared backend
  - Fall back to module-scoped backends when --isolate-heavy flag is set
  - Both modules now use consistent Granite 4 Micro model
  - Enhance CUDA OOM error message with actionable solutions
  - Maintains backward compatibility with existing isolation mechanism
  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* reduce vllm GPU allocation for tests
* implement backend test grouping via reordering
* add gpu cleanup between backend groups
* delay vllm backend creation until after openai vllm group
* adding explicit served model name for vllm openai test
* fix: rag intrinsics are not for the hybrid model (I think)
* testing a fix in tests for all the gpu issues
* more gpu cleaning
* adding docs tooling to mypy exclude
* removing kv cache also from GPU in cleanup for tests
* moving test order around and also fixing a fixture bug
* rolling back some changes from exclusive process
* some changes to the error message in vllm and also conftest cleaning
* adding an end-to-end script for tests with ollama
* adding a port finder (just in case)
* adding direct download of ollama binary from github
* warm starting ollama
* adding cuda paths for ollama
* some extra checks for vllm and teardown
* making group by backend default
* making the script executable
* test: remove heavy ram pytest marks added in #623
  Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
* ruff formatting
* small changes to script and adding cleanup to guardian and core
* making log dir more easy to set
* increasing ollama startup to 2 mins
* adding pytest-json-report

---------

Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-authored-by: AVINASH BALAKRISHNAN <avinashbala@p5-r03-n2.bluevela.rmf.ibm.com>
Co-authored-by: Alex Bozarth <ajbozart@us.ibm.com>
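The "gpu cleanup between backend groups" step in the commit list above can be sketched as a best-effort helper that collects garbage and empties the CUDA allocator cache. This is an assumption about the approach, not the PR's actual conftest code; torch availability is treated as optional so the sketch degrades gracefully on CPU-only machines:

```python
import gc


def free_gpu_memory() -> bool:
    """Best-effort GPU memory release; returns True only if a CUDA cache was cleared."""
    gc.collect()  # drop unreferenced Python objects first
    try:
        import torch  # treated as an optional dependency in this sketch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached allocator blocks to the driver
        torch.cuda.ipc_collect()  # clean up CUDA IPC shared-memory handles
        return True
    return False
```

Calling a helper like this between backend groups keeps one group's cached allocations from pushing the next group into an OOM, though it cannot free memory still referenced by a live backend object.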
Type of PR
Misc PR
Description
Previously, the huggingface tests all fell under the high GPU/RAM usage flags and were run in isolation, so we were not aware that by default no cleanup happens: each test loads a model into memory and leaves it there until the whole test run finishes.
This adds a fixture that loads the model only once and cleans it up after the test file finishes.
Testing