
fix: huggingface memory leak#544

Merged
nrfulton merged 15 commits intomainfrom
fix/378-hf-memory-leak
Feb 17, 2026

Conversation

@avinash2692
Member

@avinash2692 avinash2692 commented Feb 13, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

Problem

Each instruct() call triggers 4+ model.generate() calls (main generation + LLM-as-a-Judge validation + retries). GPU memory grew continuously because:

  1. KV caches and output tensors were stored in mot._meta["hf_output"] and never cleaned up
  2. LRU cache eviction only removed Python references, not GPU memory
  3. use_cache and output_scores were always True, creating large tensors even when unused

Changes

mellea/backends/cache.py

  • Added an on_evict callback to SimpleLRUCache, called when entries are evicted so that their resources can be freed
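To make the eviction hook concrete, here is a minimal sketch of an LRU cache with an on_evict callback; the class internals are illustrative assumptions, not the actual SimpleLRUCache implementation in this PR.

```python
from collections import OrderedDict
from typing import Any, Callable


class SimpleLRUCache:
    """Sketch of an LRU cache that notifies a callback when entries are evicted."""

    def __init__(self, capacity: int, on_evict: Callable[[Any], None] | None = None):
        self.capacity = capacity
        self.on_evict = on_evict  # invoked with the evicted value so its resources can be freed
        self._data: OrderedDict[Any, Any] = OrderedDict()

    def put(self, key: Any, value: Any) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        # Evict least-recently-used entries beyond capacity; with capacity=0,
        # every entry is evicted (and cleaned up) immediately after insertion.
        while len(self._data) > self.capacity:
            _, evicted = self._data.popitem(last=False)
            if self.on_evict is not None:
                self.on_evict(evicted)

    def get(self, key: Any) -> Any | None:
        if key not in self._data:
            return None
        self._data.move_to_end(key)
        return self._data[key]
```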

mellea/backends/huggingface.py

  • Added _cleanup_kv_cache() function that properly frees GPU memory (gc.collect() + torch.cuda.empty_cache())
  • Wired up cleanup callback in backend constructor
  • Added return_scores parameter (default False) to avoid storing large logit tensors
  • Pass use_cache=self._use_caches to model.generate() so KV caches aren't created when disabled
  • Store KV cache in LRU separately, clear past_key_values from hf_output
  • When use_caches=False: clear hf_output from mot._meta after processing and call torch.cuda.empty_cache()
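A rough sketch of the cleanup and generation flow these bullets describe; the helper name generate_once and the inline cleanup are hypothetical simplifications (the real backend routes the KV cache through the LRU cache rather than freeing it in place).

```python
import gc

import torch


def _cleanup_kv_cache(evicted_entry) -> None:
    """Free GPU memory backing an evicted KV cache (sketch)."""
    # Dropping the Python reference alone only returns the tensors to the CUDA
    # caching allocator; gc.collect() + empty_cache() hand the memory back to the GPU.
    del evicted_entry
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def generate_once(model, tokenizer, prompt: str,
                  use_caches: bool = False, return_scores: bool = False) -> str:
    """One generation call with the memory-related flags described above."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        use_cache=use_caches,         # no KV-cache tensors are created when disabled
        output_scores=return_scores,  # skip per-step logit tensors by default
        return_dict_in_generate=True,
    )
    text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    if not use_caches:
        # Nothing worth keeping around: drop the raw output and release GPU memory now.
        del outputs
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    return text
```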

Result

  • use_caches=False: No GPU memory growth over iterations
  • use_caches=True: Memory plateaus at LRU capacity with proper cleanup on eviction
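One hypothetical way to spot-check these claims is to log allocated/reserved GPU memory across repeated generations and confirm the curve flattens. In the sketch below, run_one_instruct is a stand-in for whatever triggers an instruct()/generate call, and a CUDA device is assumed.

```python
import torch


def check_memory_plateau(run_one_instruct, iterations: int = 20) -> None:
    """Print GPU memory readings after each generation to confirm they plateau."""
    readings = []
    for i in range(iterations):
        run_one_instruct()
        torch.cuda.synchronize()
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        readings.append(allocated)
        print(f"iter {i:02d}: allocated={allocated:.1f} MiB  reserved={reserved:.1f} MiB")
    # With use_caches=False the later readings should stay flat; with
    # use_caches=True they should plateau once the LRU cache is full.
    print(f"first={readings[0]:.1f} MiB  last={readings[-1]:.1f} MiB")
```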

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

@mergify

mergify Bot commented Feb 13, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert|release)(?:\(.+\))?:

@nrfulton nrfulton self-requested a review February 13, 2026 16:29
@nrfulton nrfulton requested a review from a team as a code owner February 13, 2026 16:29
Member

@nrfulton nrfulton left a comment


Thanks for the root cause analysis.

Left some thoughts on the interface decisions and guarding against edge cases.

Comment thread mellea/backends/huggingface.py Outdated
custom_config (Optional[TransformersTorchConfig]): Overrides loading from the `model_id`. If set, then the specified tokenizer/model/device will be used instead of auto-loading from the model_id.
default_to_constraint_checking_alora: If set to False then aloras will be deactivated. This is primarily for performance benchmarking and debugging.
model_options (Optional[dict]): Default model options.
return_scores (bool): If True, return output logits from model.generate(). Default False to save GPU memory.
Member


Should we remove this and instead just ensure there's always a cache? Cache eviction is exactly solving this problem. It seems like LRUCache(0) should be the way to do this?

Member Author


So do you mean always setting this to True in model.generate? That could work, and we could just default to SimpleLRUCache(0) (though one could envision setting this to the window_size of the context). I think this also simplifies things for the user with a default behavior.

My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate

Member


So do you mean always setting this to True in model.generate?

Yeah, unless it's set in model options I suppose.

that could work and we could just default to SimpleLRUCache(0) as a default

Or SimpleLRUCache(x) where x is determined based on the available memory and the max KV cache size for model_id (not to max out memory use, but to leave sufficient headroom). For now, choosing n=3 seems pretty reasonable. We should add to the docs somewhere the stack trace you get when you're OOM, with a suggestion to set SimpleLRUCache(0), so that debugging that issue is easier. Even better if we can systematically catch the exact exception and give the suggestion there, but idk if that's possible to do in a robust fashion. You'd know better than me.

I think this also simplifies things for the user with a default behavior.

That's the goal.

My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate

Reasonable counter-argument. But we already have model_options, so why give this pride-of-place? On its face, my first comment gives that reason. But note we would still have to handle this arg appearing both in __init__ and possibly in any of the model_options. So, even still, I'm in favor of removing the constructor arg.
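As a sketch of the "catch the exact exception" idea above: recent PyTorch releases expose torch.cuda.OutOfMemoryError (older versions only raise a plain RuntimeError), so a thin wrapper around the generation call could attach the SimpleLRUCache(0) hint. This is illustrative, not part of the PR.

```python
import torch


def generate_with_oom_hint(run_generation):
    """Run a generation callable and re-raise CUDA OOM with a debugging hint (sketch)."""
    try:
        return run_generation()
    except torch.cuda.OutOfMemoryError as err:
        # Surface the suggestion from the discussion above alongside the original error.
        raise RuntimeError(
            "CUDA out of memory during generation. Consider constructing the "
            "HuggingFace backend with SimpleLRUCache(0) so KV caches are not "
            "retained between calls, or lowering the cache capacity."
        ) from err
```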

Comment thread mellea/backends/huggingface.py Outdated
Comment on lines +287 to +292
self._return_scores = return_scores
self._cache = (
    cache
    if cache is not None
    else SimpleLRUCache(3, on_evict=_cleanup_kv_cache)
)
Member


Especially because of this I don't really see the point of return_scores.

Open to being persuaded, strong opinion loosely held n'at.
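A tiny demonstration of how the constructor default above plays out, reusing the SimpleLRUCache sketch from earlier (strings stand in for KV-cache tensors):

```python
freed = []

cache = SimpleLRUCache(3, on_evict=freed.append)  # the PR's default capacity
for i in range(5):
    cache.put(f"turn-{i}", f"kv-{i}")
print(freed)                 # ['kv-0', 'kv-1'] -> oldest caches evicted, callback fired
print(cache.get("turn-4"))   # 'kv-4'           -> recent caches still retained

zero = SimpleLRUCache(0, on_evict=freed.append)
zero.put("turn-0", "kv-0")   # evicted immediately: nothing is ever retained
```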

)

self.cache_put(mot.value, cache_info)
cache_key = id(mot.value)
Member


Okay. So now it's REALLY important that we make it to and through at least this code point in post_process, right? Should we update the finally block in core to at least give a warning if the cache doesn't have the shape we expect immediately after generation?

Member Author


Not necessary. If use_cache is set to False, the default behaviour here is to just return the output and discard everything else. But I think scores are left in limbo here, so it might make sense, per your earlier suggestion, to keep return_scores True when use_cache==True: we then populate the cache with the kv_cache and scores, and we can set the cache len=0 for the default case?

Member


ok, sgtm modulo choice for len.

I'm also open to setting len=0 for now and we can revisit once we actually have the block attention mechanism working on the main code paths, at which point we should do the adaptive thing mentioned in my other comment.

@nrfulton nrfulton self-requested a review February 16, 2026 18:39
@anpendyal

Hi, I'm still having issues with the memory leak.
I used the following command to submit the job:
bsub -q normal \
  -J kvcache_mass_oom_fix \
  -gpu "num=1" \
  -cwd /dccstor/nathan-ckpts/anooshka \
  -o /dccstor/nathan-ckpts/anooshka/logs/out.%J.txt \
  -e /dccstor/nathan-ckpts/anooshka/logs/err.%J.txt \
  bash -lc 'export PATH="$HOME/.local/bin:$PATH"; mkdir -p /dccstor/nathan-ckpts/anooshka/logs; micromamba run -p /u/apendyal/mamba_envs/mellea_py310 \
    python -u /dccstor/nathan-ckpts/anooshka/kv_cache_dataset.py \
    --out_root /dccstor/nathan-ckpts/anooshka/caselaw_kv/mass \
    --log "/dccstor/nathan-ckpts/anooshka/logs/caselaw_kv.${LSB_JOBID}.log"'

My job ended up finishing early due to this error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.12 GiB. GPU 0 has a total capacity of 79.25 GiB of which 1.55 GiB is free. Including non-PyTorch memory, this process has 77.69 GiB memory in use. Of the allocated memory 68.07 GiB is allocated by PyTorch, and 9.13 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

One thing I noticed is that this job was able to get through twice as many cases as before I used the version of Mellea fixing the memory leak.

Also, I kept logs for every case that I ran and those logs contain information like: gpu_alloc and gpu_reserved.

I can send my complete code/logs if needed!

@avinash2692
Member Author


Hmm, interesting. Could you share the logs with me if possible? Also, what's the exact script that you're trying to run?

@avinash2692
Member Author

Had a chat with @anpendyal and this seems to be a problem in her script and not this bug.

@nrfulton nrfulton added this pull request to the merge queue Feb 17, 2026
Merged via the queue into main with commit 2f74853 Feb 17, 2026
4 checks passed
@avinash2692 avinash2692 deleted the fix/378-hf-memory-leak branch February 17, 2026 20:05
