
fix: huggingface memory leak#544

Merged
nrfulton merged 15 commits intomainfrom
fix/378-hf-memory-leak
Feb 17, 2026

Conversation

@avinash2692
Member

@avinash2692 avinash2692 commented Feb 13, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Problem

Each instruct() call triggers 4+ model.generate() calls (main generation + LLM-as-a-Judge validation + retries). GPU memory grew continuously because:

  1. KV caches and output tensors were stored in mot._meta["hf_output"] and never cleaned up
  2. LRU cache eviction only removed Python references, not GPU memory
  3. use_cache and output_scores were always True, creating large tensors even when unused

Changes

mellea/backends/cache.py

  • Added an on_evict callback to SimpleLRUCache, called when entries are evicted so that their resources can be freed
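To make the eviction hook concrete, here is a minimal sketch of an LRU cache with an on_evict callback; the class internals are illustrative assumptions, not the actual SimpleLRUCache implementation in this PR.

```python
from collections import OrderedDict
from typing import Any, Callable


class SimpleLRUCache:
    """Sketch of an LRU cache that notifies a callback when entries are evicted."""

    def __init__(self, capacity: int, on_evict: Callable[[Any], None] | None = None):
        self.capacity = capacity
        self.on_evict = on_evict  # invoked with the evicted value so its resources can be freed
        self._data: OrderedDict[Any, Any] = OrderedDict()

    def put(self, key: Any, value: Any) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        # Evict least-recently-used entries beyond capacity; with capacity=0,
        # every entry is evicted (and cleaned up) immediately after insertion.
        while len(self._data) > self.capacity:
            _, evicted = self._data.popitem(last=False)
            if self.on_evict is not None:
                self.on_evict(evicted)

    def get(self, key: Any) -> Any | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]
```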

mellea/backends/huggingface.py

  • Added _cleanup_kv_cache() function that properly frees GPU memory (gc.collect() + torch.cuda.empty_cache())
  • Wired up cleanup callback in backend constructor
  • Added return_scores parameter (default False) to avoid storing large logit tensors
  • Pass use_cache=self._use_caches to model.generate() so KV caches aren't created when disabled
  • Store KV cache in LRU separately, clear past_key_values from hf_output
  • When use_caches=False: clear hf_output from mot._meta after processing and call torch.cuda.empty_cache()
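A rough sketch of the cleanup and generation flow these bullets describe; the helper name generate_once and the inline cleanup are hypothetical simplifications (the real backend routes the KV cache through the LRU cache rather than freeing it in place).

```python
import gc

import torch


def _cleanup_kv_cache(evicted_entry) -> None:
    """Free GPU memory backing an evicted KV cache (sketch)."""
    # Dropping the Python reference alone only returns the tensors to the CUDA
    # caching allocator; gc.collect() + empty_cache() hand the memory back to the GPU.
    del evicted_entry
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def generate_once(model, tokenizer, prompt: str,
                  use_caches: bool = False, return_scores: bool = False) -> str:
    """One generation call with the memory-related flags described above."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=use_caches,         # no KV-cache tensors are created when disabled
        output_scores=return_scores,  # skip per-step logit tensors by default
        return_dict_in_generate=True,
    )
    text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    if not use_caches:
        # Nothing worth keeping around: drop the raw output and release GPU memory now.
        del outputs
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return text
```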

Result

  • use_caches=False: No GPU memory growth over iterations
  • use_caches=True: Memory plateaus at LRU capacity with proper cleanup on eviction
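One hypothetical way to spot-check these claims is to log allocated/reserved GPU memory across repeated generations and confirm the curve flattens. In the sketch below, run_one_instruct is a stand-in for whatever triggers an instruct()/generate call, and a CUDA device is assumed.

```python
import torch


def check_memory_plateau(run_one_instruct, iterations: int = 20) -> None:
    """Print GPU memory readings after each generation to confirm they plateau."""
    readings = []
    for i in range(iterations):
        run_one_instruct()
        torch.cuda.synchronize()
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        readings.append(allocated)
        print(f"iter {i:02d}: allocated={allocated:.1f} MiB  reserved={reserved:.1f} MiB")
    # With use_caches=False the later readings should stay flat; with
    # use_caches=True they should plateau once the LRU cache is full.
    print(f"first={readings[0]:.1f} MiB  last={readings[-1]:.1f} MiB")
```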

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@mergify

mergify Bot commented Feb 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@nrfulton nrfulton self-requested a review February 13, 2026 16:29
@nrfulton nrfulton requested a review from a team as a code owner February 13, 2026 16:29
Member

@nrfulton nrfulton left a comment


Thanks for the root cause analysis.

Left some thoughts on the interface decisions and guarding against edge cases.

Comment thread mellea/backends/huggingface.py Outdated
custom_config (Optional[TransformersTorchConfig]): Overrides loading from the `model_id`. If set, then the specified tokenizer/model/device will be used instead of auto-loading from the model_id.
default_to_constraint_checking_alora: If set to False then aloras will be deactivated. This is primarily for performance benchmarking and debugging.
model_options (Optional[dict]): Default model options.
return_scores (bool): If True, return output logits from model.generate(). Default False to save GPU memory.
Member


Should we remove this and instead just ensure there's always a cache? Cache eviction is exactly solving this problem. It seems like LRUCache(0) should be the way to do this?

Member Author


So do you mean always setting this to True in model.generate? That could work, and we could just default to SimpleLRUCache(0) (though one could envision setting this to the window_size of the context). I think this also simplifies things for the user with a default behavior.

My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate

Member


So do you mean always setting this to True in model.generate?

Yeah, unless it's set in model options I suppose.

that could work and we could just default to SimpleLRUCache(0) as a default

Or SimpleLRUCache(x) where x is determined based on the available memory and the max KV cache size for model_id (not to max out memory use, but to leave sufficient headroom). For now, choosing n=3 seems pretty reasonable. We should add to the docs somewhere the stack trace you get when you're OOM, with a suggestion to set SimpleLRUCache(0), so that debugging that issue is easier. Even better if we can systematically catch the exact exception and give the suggestion there, but idk if that's possible to do in a robust fashion. You'd know better than me.

I think this also simplifies things for the user with a default behavior.

That's the goal.

My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate

Reasonable counter-argument. But we already have model_options, so why give this pride-of-place? On its face, my first comment gives that reason. But note we would still have to handle this arg appearing both in __init__ and possibly in any of the model_options. So, even still, I'm in favor of removing the constructor arg.
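As a sketch of the "catch the exact exception" idea above: recent PyTorch releases expose torch.cuda.OutOfMemoryError (older versions only raise a plain RuntimeError), so a thin wrapper around the generation call could attach the SimpleLRUCache(0) hint. This is illustrative, not part of the PR.

```python
import torch


def generate_with_oom_hint(run_generation):
    """Run a generation callable and re-raise CUDA OOM with a debugging hint (sketch)."""
    try:
        return run_generation()
    except torch.cuda.OutOfMemoryError as err:
        # Surface the suggestion from the discussion above alongside the original error.
        raise RuntimeError(
            "CUDA out of memory during generation. Consider constructing the "
            "HuggingFace backend with SimpleLRUCache(0) so KV caches are not "
            "retained between calls, or lowering the cache capacity."
        ) from err
```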

Comment thread mellea/backends/huggingface.py Outdated
Comment on lines +287 to +292
self._return_scores = return_scores
self._cache = (
    cache
    if cache is not None
    else SimpleLRUCache(3, on_evict=_cleanup_kv_cache)
)
Member


Especially because of this I don't really see the point of return_scores.

Open to being persuaded, strong opinion loosely held n'at.
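A tiny demonstration of how the constructor default above plays out, reusing the SimpleLRUCache sketch from earlier (strings stand in for KV-cache tensors):

```python
freed = []

cache = SimpleLRUCache(3, on_evict=freed.append)  # the PR's default capacity
for i in range(5):
    cache.put(f"turn-{i}", f"kv-{i}")
print(freed)                 # ['kv-0', 'kv-1'] -> oldest caches evicted, callback fired
print(cache.get("turn-4"))   # 'kv-4'           -> recent caches still retained

zero = SimpleLRUCache(0, on_evict=freed.append)
zero.put("turn-0", "kv-0")   # evicted immediately: nothing is ever retained
```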

)

self.cache_put(mot.value, cache_info)
cache_key = id(mot.value)
Member


Okay. So now it's REALLY important that we make it to and through at least this code point in post_process, right? Should we update the finally block in core to at least give a warning if the cache doesn't have the shape we expect immediately after generation?

Member Author


Not necessary. If use_cache is set to False, the default behaviour here is to just return the output and discard everything else. But I think scores are left in limbo here, so it might make sense, per your earlier suggestion, to keep return_scores True when use_cache==True: we then populate the cache with the kv_cache and scores, and we can set the cache len=0 for the default case?

Member


ok, sgtm modulo choice for len.

I'm also open to setting len=0 for now and we can revisit once we actually have the block attention mechanism working on the main code paths, at which point we should do the adaptive thing mentioned in my other comment.

@nrfulton nrfulton self-requested a review February 16, 2026 18:39
@anpendyal

Hi, I'm still having issues with the memory leak.
I used the following command to submit the job:
bsub -q normal \
  -J kvcache_mass_oom_fix \
  -gpu "num=1" \
  -cwd /dccstor/nathan-ckpts/anooshka \
  -o /dccstor/nathan-ckpts/anooshka/logs/out.%J.txt \
  -e /dccstor/nathan-ckpts/anooshka/logs/err.%J.txt \
  bash -lc 'export PATH="$HOME/.local/bin:$PATH"; mkdir -p /dccstor/nathan-ckpts/anooshka/logs; micromamba run -p /u/apendyal/mamba_envs/mellea_py310 \
    python -u /dccstor/nathan-ckpts/anooshka/kv_cache_dataset.py \
    --out_root /dccstor/nathan-ckpts/anooshka/caselaw_kv/mass \
    --log "/dccstor/nathan-ckpts/anooshka/logs/caselaw_kv.${LSB_JOBID}.log"'

My job ended up finishing early due to this error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.55 GiB is free. Including non-PyTorch memory, this process has 77.69 GiB memory in use. Of the allocated memory 68.07 GiB is allocated by PyTorch, and 9.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

One thing I noticed is that this job was able to get through twice as many cases as before I used the version of Mellea fixing the memory leak.

Also, I kept logs for every case that I ran and those logs contain information like: gpu_alloc and gpu_reserved.

I can send my complete code/logs if needed!

@avinash2692
Member Author


Hmm, interesting. Could you share the logs with me if possible? Also, what's the exact script that you're trying to run?

@avinash2692
Member Author

Had a chat with @anpendyal and this seems to be a problem in her script and not this bug.

@nrfulton nrfulton added this pull request to the merge queue Feb 17, 2026
Merged via the queue into main with commit 2f74853 Feb 17, 2026
4 checks passed
@avinash2692 avinash2692 deleted the fix/378-hf-memory-leak branch February 17, 2026 20:05
