[Feature Proposal] Automatic Pump caching #120
Comments
I like this idea. One use case I've been running into lately is using multiple models with the same pump. It would be handy if a pump object could take an input data dictionary and only compute the keys that are missing. This would be a bit simpler than relying on an external file-backed cache, but would still cover a lot of scenarios.
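The "only compute the keys that are missing" idea can be sketched as below. This is a minimal standalone illustration, not the actual pumpp API; the operator table and names are hypothetical.

```python
def transform_missing(ops, data=None):
    """Run each named operator only if its output key is absent from `data`."""
    data = dict() if data is None else data
    for name, compute in ops.items():
        if name not in data:  # skip keys that were already computed
            data[name] = compute()
    return data

# Hypothetical operators standing in for pumpp feature extractors.
ops = {'cqt': lambda: 'cqt-features', 'chord_tag': lambda: 'chord-targets'}
first = transform_missing(ops)               # computes both keys
second = transform_missing(ops, data=first)  # reuses the existing dict, no recomputation
```

Holding on to the returned dict lets several models share one pump's features without recomputing them.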
Something like this is along the lines of what I was thinking. Is this about what you want?

```python
class Pump:
    def transform(self, audio_f=None, jam=None, y=None, sr=None, crop=False,
                  data=None, refresh=False):
        if self.cache_dir and audio_f:
            cache_id = hashlib.md5(audio_f.encode('utf-8')).hexdigest()  # or however
            cache_file = os.path.join(self.cache_dir, cache_id + '.h5')
        if self.cache_dir and audio_f and not refresh:
            data = crema.utils.load_h5(cache_file, data=data)  # skip keys already in `data`
        else:
            data = dict() if data is None else data
        ...
        for operator in self.ops:
            if operator.name in data:
                continue
            ...
        if self.cache_dir and audio_f:
            crema.utils.save_h5(cache_file, **data)
        return data
```
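For reference, the cache-file naming in the sketch above boils down to a small self-contained helper. The function name here is hypothetical; only the md5-of-path scheme comes from the sketch.

```python
import hashlib
import os

def cache_file_for(cache_dir, audio_f):
    """Derive a stable cache path by hashing the audio file path."""
    cache_id = hashlib.md5(audio_f.encode('utf-8')).hexdigest()
    return os.path.join(cache_dir, cache_id + '.h5')
```

Because the ID is a hash of the path, the same audio file always maps to the same cache file, and distinct paths are effectively guaranteed distinct files.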
I'm not sure what you mean by this — do you mean always computing features on the fly? I think there's room for both. As long as you hold on to the data dict reference, you'll have the computed features, and the next time you need that file you can load it from HDF5. This lets you train a model the same way with or without running crema.
My 2c: I'm not sure what Brian's scenario is for pump sharing, but I think I understand Bea's position and proposal. However, I have a somewhat different view on how to handle the use cases Bea mentioned. While Bea asks for folding the file-handling capabilities into pump to clean up prepare/train, I would actually call for a completely pump-agnostic train.py script, and modify pump accordingly. One other consideration: the less there is in pumpp, the less can go wrong and the fewer things there are to maintain. Anyway, just my personal thoughts.
That's a fair position, and I see where you're coming from. I'm just looking to reduce the boilerplate needed to set up a model and reuse as much as possible. Ideally, I'd like to have a file that defines some config. To your point about separating pumpp and training, what would be your plan for removing that dependency? I rather like pumpp's ability to generate its own Keras imports. And since we use pumpp samplers for training, I don't know if it's realistic for them to be independent.
Just had a chat with @tomxi, so I'll try to clarify my thoughts here.

Does that make sense?
Yeah, that sounds good. And #2 is an interesting idea.
@beastrees I don't think I want to propose any changes at the moment beyond the ones you're already working on. Cheers!
Allow a user to pass

```python
Pump(*ops, cache_dir='my/cached/dataset/')
```

which will automatically save transform results to an HDF5 file and load them if they already exist. If no `cache_dir` is set, it will behave normally. The cache ID could be hashed from the audio file path and ignored when only PCM is passed.

This isn't very important, but it would be convenient and would simplify logic and file handling.
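The proposed `cache_dir` behavior can be sketched end-to-end as a small wrapper. Everything here is hypothetical illustration: a real implementation would live inside `Pump.transform` and store HDF5 via crema.utils, whereas this sketch uses pickle so it is self-contained.

```python
import hashlib
import os
import pickle

def transform_cached(transform, audio_f, cache_dir=None):
    """Call `transform(audio_f)`, caching the result on disk when cache_dir is set."""
    if cache_dir is None:
        return transform(audio_f)  # no cache_dir: behave normally
    cache_id = hashlib.md5(audio_f.encode('utf-8')).hexdigest()
    cache_file = os.path.join(cache_dir, cache_id + '.pkl')
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as fh:  # cache hit: load instead of recomputing
            return pickle.load(fh)
    data = transform(audio_f)  # cache miss: compute, then save for next time
    with open(cache_file, 'wb') as fh:
        pickle.dump(data, fh)
    return data
```

On the second call with the same path, the result comes straight from disk and the underlying transform is never invoked.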