
[Feature Proposal] Automatic Pump caching #120

Closed

beasteers opened this issue Aug 31, 2019 · 7 comments

Comments

@beasteers (Contributor)

Allow a user to pass Pump(*ops, cache_dir='my/cached/dataset/'), which will automatically save transform results to an hdf5 file and load them if they exist.

If no cache_dir is set, it will behave normally.

The cache ID could be hashed from the audio file path; caching would be skipped when only PCM data is passed.

This isn’t very important, but it would be convenient and would simplify logic and file handling.
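
For concreteness, usage might look something like this. A minimal sketch: the cache_dir keyword is the proposal here, not an existing pumpp parameter, and the operator and paths are arbitrary placeholders.

```python
import pumpp

# An arbitrary feature operator; any set of ops would do
mel = pumpp.feature.Mel(name='mel', sr=44100, hop_length=512)

# Hypothetical: cache_dir is the proposed new parameter
pump = pumpp.Pump(mel, cache_dir='my/cached/dataset/')

# First call computes features and saves them to hdf5; later calls
# with the same audio path would load the cached results instead
data = pump.transform(audio_f='some_song.ogg')
```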


@bmcfee (Owner) commented Aug 31, 2019

I like this idea. One use case I've been running into lately is using multiple models with the same pump. It would be handy if a pump object could take an input data dictionary, and only compute the keys that are missing. This would be a bit simpler than relying on an external file-backed cache, but still cover a lot of scenarios.

@beasteers (Contributor, Author) commented Aug 31, 2019

Something like this is along the lines of what I was thinking. Is this about what you want?

```python
import os
import hashlib

import crema.utils


class Pump:
    def transform(self, audio_f=None, jam=None, y=None, sr=None, crop=False,
                  data=None, refresh=False):
        if self.cache_dir and audio_f:
            cache_id = hashlib.md5(audio_f.encode('utf-8')).hexdigest()  # or however
            cache_file = os.path.join(self.cache_dir, cache_id + '.h5')
            if not refresh and os.path.exists(cache_file):
                data = crema.utils.load_h5(cache_file, data=data)  # skip keys already in `data`
        data = dict() if data is None else data

        ...

        for operator in self.ops:
            if operator.name in data:
                continue

            ...

        if self.cache_dir and audio_f:
            crema.utils.save_h5(cache_file, **data)
        return data
```

> This would be a bit simpler than relying on an external file-backed cache

I'm not sure what you mean by this. Do you mean always computing features on the fly?

I think there's room for both. As long as you hold on to the data dict reference, you'll have the computed features in memory, and the next time you need that file you can load it from hdf5.

This lets you train a model the same way with or without running the crema prepare.py script.
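
For example, assuming the hypothetical data= and cache_dir parameters from the sketch above:

```python
# Within one session: hold on to the returned dict and pass it back in,
# so only missing keys get computed
data = pump.transform(audio_f='song.ogg')
data = pump.transform(audio_f='song.ogg', data=data)  # nothing recomputed

# In a later session: the hdf5 cache written on the first call
# gets loaded instead of recomputing features
data = pump.transform(audio_f='song.ogg')
```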

@tomxi (Contributor) commented Sep 7, 2019

My 2c: I'm not sure what Brian's scenario is for pump sharing, but I think I understand Bea's position and proposal.

However, I kinda have a different view on how to handle the use cases Bea mentioned.
Currently the crema training scripts are divided into a prepare.py step, where audio and jams are pumped through pumps to become numpy arrays, and a train step, where these arrays are consumed by some ML model.
I welcome this divide, and in fact I think it's not divided enough: the only places where the train.py step accesses info in the pump that built the arrays are 1) to figure out the frame rate of the arrays, and 2) to build the appropriate input layer for a specific downstream ML package.

So while Bea asks for folding file-handling capabilities into pump to clean up prepare/train, I would actually call for a completely pump-agnostic train.py script, and modify pump accordingly... a sketch of the current coupling is below.
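
To make those two touch points concrete, a rough sketch. The operator name 'mel' is made up; layers() is pumpp's existing Keras helper, and the frame-rate arithmetic is my own paraphrase.

```python
# 1) frame rate of the pumped arrays, recoverable from operator settings
frames_per_second = pump['mel'].sr / pump['mel'].hop_length

# 2) input layers for the downstream ML package
inputs = pump.layers()  # dict of Keras Input layers keyed by field name
```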

One other consideration is... the less there is in pumpp, the less can go wrong and the fewer things there are to maintain?

Anyway, just my personal thoughts...

@beasteers (Contributor, Author)
That's a fair position, and I see where you're coming from. I'm just looking to reduce the boilerplate needed to set up a model and reuse as much as possible. Ideally, I'd like to have a file that defines some config, get_pump(), construct_model(), and evaluate_model(), and then crema could handle generic augment/prepare/sampling/training with room for overriding behavior, but idk. Maybe something like this would be better suited for crema ($ crema.augment stretch=..., $ crema.prepare input=...).

To your point about separating pumpp and training, what would be your plan for removing that dependency? I rather like pumpp's ability to generate its own Keras input layers. And seeing as we use pumpp samplers for training, I don't know if it's realistic for them to be independent.
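
For reference, the sampler dependency I mean looks roughly like this; the parameter values are arbitrary.

```python
# Build a sampler over the pump's fields: 8 patches of 16 frames each
sampler = pump.sampler(n_samples=8, duration=16)

# Draw training patches from a transformed data dict
for patch in sampler(data):
    ...  # feed the patch to the model
```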

@bmcfee (Owner) commented Sep 9, 2019

Just had a chat with @tomxi, so I'll try to clarify my thoughts here.

  1. I think a simpler pattern would be to accept a data= parameter, which can be a dict (or None by default), and skip over computing fields already present in data. That's as far as I think pumpp should go with caching. If you have to compute new outputs, they get stored in data[somekey] and the entire object is returned just like it currently is.
  2. If you want a file-backed cache, this could be implemented more generally by making a class that implements the dict API but can pull data from a file (e.g. h5) or write through. This would be generally useful outside of pumpp, I think, and wouldn't require pumpp to do any file I/O; a sketch follows below.

Does that make sense?
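
A minimal sketch of what the dict-API cache in point 2 might look like, using h5py directly. The class name and the write-through-on-set behavior are my own choices here, not an existing pumpp or crema API.

```python
import collections.abc

import h5py


class H5DictCache(collections.abc.MutableMapping):
    """Minimal dict-like cache that writes values through to an hdf5 file.

    Note: pumpp field names contain '/', which h5py treats as nested
    groups; a real implementation would need to account for that.
    """

    def __init__(self, filename):
        self.filename = filename
        # Touch the file so reads on a fresh cache don't fail
        with h5py.File(self.filename, 'a'):
            pass

    def __getitem__(self, key):
        with h5py.File(self.filename, 'r') as hf:
            return hf[key][()]  # load the full array into memory

    def __setitem__(self, key, value):
        with h5py.File(self.filename, 'a') as hf:
            if key in hf:
                del hf[key]
            hf.create_dataset(key, data=value)

    def __delitem__(self, key):
        with h5py.File(self.filename, 'a') as hf:
            del hf[key]

    def __iter__(self):
        with h5py.File(self.filename, 'r') as hf:
            yield from list(hf)

    def __len__(self):
        with h5py.File(self.filename, 'r') as hf:
            return len(hf)
```

With something like this, pump.transform(..., data=H5DictCache('song.h5')) would give file-backed caching without pumpp itself doing any file I/O.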

@beasteers (Contributor, Author) commented Sep 9, 2019

Yeah, that sounds good. And #2 is an interesting idea.

@tomxi (Contributor) commented Sep 9, 2019

@beasteers
Yeah, actually I like the generated Keras input layer function as well. And I don't see a clean way of decoupling pump and training.

I don’t think I want to propose any changes at the moment besides the ones you are already working on.

Cheers!
