Conversation
The PR description has been updated. Please fill out the template for your PR to be reviewed.
nrfulton left a comment
Thanks for the root cause analysis.
Left some thoughts on the interface decisions and guarding against edge cases.
custom_config (Optional[TransformersTorchConfig]): Overrides loading from the `model_id`. If set, then the specified tokenizer/model/device will be used instead of auto-loading from the model_id.
default_to_constraint_checking_alora: If set to False then aloras will be deactivated. This is primarily for performance benchmarking and debugging.
model_options (Optional[dict]): Default model options.
return_scores (bool): If True, return output logits from model.generate(). Default False to save GPU memory.
Should we remove this and instead just ensure there's always a cache? Cache eviction solves exactly this problem. It seems like LRUCache(0) should be the way to do this?
So do you mean always setting this to True in model.generate? That could work and we could just default to SimpleLRUCache(0) as a default (though one could envision this being set to the window_size of the context). I think this also simplifies things for the user with a default behavior.
My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate
> So do you mean always setting this to True in model.generate?

Yeah, unless it's set in model options I suppose.
> that could work and we could just default to `SimpleLRUCache(0)` as a default
Or SimpleLRUCache(x) where x is determined based on the available memory and the max KV cache size for model_id (not to max out memory use, but to leave sufficient headroom). For now, choosing n=3 seems pretty reasonable. We should add to the docs somewhere the stack trace you get when you're OOM, with a suggestion to set SimpleLRUCache(0), so that debugging that issue is easier. Even better if we can systematically catch the exact exception and give the suggestion there, but idk if that's possible to do in a robust fashion. You'd know better than me.
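A rough sketch of the "catch the exact exception" idea, assuming the backend can wrap its generate calls; `_generate_with_oom_hint` is a hypothetical helper name, and `torch.cuda.OutOfMemoryError` is the exception recent PyTorch versions raise on CUDA allocation failure:

```python
import torch

def _generate_with_oom_hint(model, **generate_kwargs):
    """Hypothetical wrapper: surface a cache-sizing hint on CUDA OOM."""
    try:
        return model.generate(**generate_kwargs)
    except torch.cuda.OutOfMemoryError as e:
        # Re-raise with guidance so users hitting OOM from retained KV caches
        # know to shrink (or disable) the backend cache.
        raise torch.cuda.OutOfMemoryError(
            f"{e}\nHint: if GPU memory is being held by cached KV states from "
            "earlier generations, try constructing the backend with "
            "cache=SimpleLRUCache(0) to disable cross-call caching."
        ) from e
```

One caveat: older torch versions raise a plain RuntimeError for CUDA OOM, so matching on the error message may still be needed there.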
> I think this also simplifies things for the user with a default behavior.
That's the goal.
> My only reason for keeping this here is for folks familiar with the huggingface ecosystem to be able to pass args to generate
Reasonable counter-argument. But we already have model_options, so why give this pride of place? On its face, my first comment gives that reason. But note we would still have to handle this arg appearing both in __init__ and possibly in any of the model_options. So even still, I'm in favor of removing the constructor arg.
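To make the "both places" concern concrete, a minimal sketch of the reconciliation the backend would need if the flag stayed in both `__init__` and `model_options` (names are illustrative, not the actual backend code):

```python
def _resolve_return_scores(ctor_default: bool, model_options: dict | None) -> bool:
    # Per-call model_options wins over the constructor default; dropping the
    # constructor arg removes this reconciliation entirely.
    if model_options and "return_scores" in model_options:
        return bool(model_options["return_scores"])
    return ctor_default
```

If the flag lives only in model_options, that lookup becomes the single source of truth.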
self._return_scores = return_scores
self._cache = (
    cache
    if cache is not None
    else SimpleLRUCache(3, on_evict=_cleanup_kv_cache)
)
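For readers skimming the thread, this is roughly the eviction idea being relied on here: a minimal sketch of an LRU cache with an on_evict hook, illustrative only and not the actual `SimpleLRUCache` in `mellea/backends/cache.py`:

```python
from collections import OrderedDict
from typing import Any, Callable, Optional

class LRUCacheSketch:
    """Illustrative LRU cache with an eviction hook (not the real SimpleLRUCache)."""

    def __init__(self, capacity: int, on_evict: Optional[Callable[[Any], None]] = None):
        self._capacity = capacity
        self._on_evict = on_evict
        self._items = OrderedDict()  # key -> value, oldest entries first

    def put(self, key: Any, value: Any) -> None:
        if self._capacity <= 0:
            # A zero-capacity cache keeps nothing: evict immediately.
            if self._on_evict is not None:
                self._on_evict(value)
            return
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self._capacity:
            _, evicted = self._items.popitem(last=False)
            if self._on_evict is not None:
                self._on_evict(evicted)

    def get(self, key: Any) -> Optional[Any]:
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]
```

The on_evict hook is where a KV-cache entry would actually be released, e.g. by dropping the tensor references (and optionally calling torch.cuda.empty_cache()) so the GPU memory can be reclaimed.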
Especially because of this I don't really see the point of return_scores.
Open to being persuaded, strong opinion loosely held n'at.
)

self.cache_put(mot.value, cache_info)
cache_key = id(mot.value)
Okay. So now it's REALLY important that we make it to and through at least this code point in post_process, right? Should we update the finally block in core to at least give a warning if the cache doesn't have the shape we expect immediately after generation?
Not necessary. If use_cache is set to False, the default behaviour here is to just return the output and discard everything else. But I think scores are left in limbo here. So, following your earlier suggestion, it might make sense to keep return_scores True when use_cache==True; we then populate the cache with the kv_cache and scores, and we can set the cache len=0 for the default case?
ok, sgtm modulo choice for len.
I'm also open to setting len=0 for now and we can revisit once we actually have the block attention mechanism working on the main code paths, at which point we should do the adaptive thing mentioned in my other comment.
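To make the agreed flow concrete, a sketch under these assumptions: use_cache here is the flag passed to HF's model.generate(), the helper and cache names are made up for illustration, and .scores/.past_key_values are only populated on recent transformers versions when return_dict_in_generate=True:

```python
def _generate_and_maybe_cache(model, cache, inputs, use_cache: bool, **gen_kwargs):
    out = model.generate(
        **inputs,
        use_cache=use_cache,
        return_dict_in_generate=use_cache,
        output_scores=use_cache,
        **gen_kwargs,
    )
    if use_cache:
        # Keep the KV state and the scores together so neither is left in limbo,
        # keyed by object identity as in the PR.
        cache.put(id(out), {"scores": out.scores, "past_key_values": out.past_key_values})
        return out.sequences
    # With return_dict_in_generate=False, generate() returns just the token ids.
    return out
```

With the len=0 default discussed above, the put() becomes an immediate evict, so the default path still discards everything.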
Hi, I'm still having issues with the memory leak. My job ended up finishing early due to this error:

One thing I noticed is that this job was able to get through twice as many cases as before I used the version of Mellea fixing the memory leak. Also, I kept logs for every case that I ran, and those logs contain information like:

I can send my complete code/logs if needed!
Hmm, interesting. Could you share the logs with me if possible? Also, what's the exact script that you're trying to run?
Had a chat with @anpendyal and this seems to be a problem in her script and not this bug.
Type of PR
Misc PR

Description
`LocalHFBackend` seems to have a memory blow up if you repeatedly call `Instruct` #378: This is the main issue that this PR solves.

Problem
Each instruct() call triggers 4+ model.generate() calls (main generation + LLM-as-a-Judge validation + retries). GPU memory grew continuously because:

Changes
`mellea/backends/cache.py`
`mellea/backends/huggingface.py`

Result

Testing