
Modifying a task via scheduler plugin #1384

Closed
limx0 opened this issue Sep 11, 2017 · 4 comments



limx0 commented Sep 11, 2017

I am trying to modify a task by using a scheduler plugin, but I am unsure whether this is possible. @mrocklin, you mentioned in dask/dask#2119 (comment) that this could be done, although that was for the dask scheduler, not distributed.

My use case is this: I would like to combine the distributed scheduler with joblib.Memory to build a data pipeline with smart caching. Joblib does this well by saving a copy of the source code of the function as well as its inputs. I would like to extend this notion by invalidating any child node whose parent is to be recomputed.

Now, my first thought would be to do something like:

import os

import cloudpickle
from distributed.diagnostics.plugin import SchedulerPlugin


class MemoryCachePlugin(SchedulerPlugin):

    def update_graph(self, scheduler, dsk=None, keys=None, restrictions=None, **kwargs):
        tasks = kwargs['tasks']

        def node_invalidated(node):
            # is_memoize_func and cache_is_valid are joblib-cache helpers (not shown)
            func = cloudpickle.loads(tasks[node]['function'])
            args = cloudpickle.loads(tasks[node]['args'])
            return not (is_memoize_func(func) and cache_is_valid(func, *args))

        for key, task in tasks.items():
            # Find parents of this task by checking for task keys among its args
            parent_nodes = [arg for arg in cloudpickle.loads(task['args']) if arg in tasks]

            if any(node_invalidated(node) for node in parent_nodes):
                # A parent of this node needs to be recomputed, so invalidate
                # this node's cache by removing its joblib cache file.
                task_func = cloudpickle.loads(task['function'])
                task_args = cloudpickle.loads(task['args'])
                cache_file = get_cache_file(task_func, *task_args)  # helper, not shown
                os.remove(cache_file)

I would like to simply remove the cache file that joblib uses; however, at the update_graph stage the task_args are only references to other tasks, not the actual result values.

My next thought was that I could mark the function with some sort of force_recompute flag, but it appears (please correct me if I am wrong) that modifying the functions in kwargs has no effect on the tasks that are sent to the workers. Is this conclusion correct?
edit: this turned out to be an issue with cloudpickle hard-coding which attributes to pickle

What is the most suitable way for me to achieve the above?

@mrocklin (Member)

Modifying tasks as they arrive would be tricky. You would have to rearrange update_graph to run the update_graph plugin callbacks before taking any actions, though this seems possible.

Another option would be to do this on the client side by overriding collections_to_dsk. This is the last point of modification before the tasks make it up to the scheduler. It's a nice choke-point to modify things.
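For illustration, a minimal sketch of that client-side override (the rewrite_task hook is hypothetical, and the exact collections_to_dsk signature may vary between versions):

from distributed import Client


class CachingClient(Client):

    def collections_to_dsk(self, collections, *args, **kwargs):
        # Build the low-level graph as the base Client would, then
        # rewrite it before the tasks are sent up to the scheduler.
        dsk = super().collections_to_dsk(collections, *args, **kwargs)
        return {key: rewrite_task(key, task) for key, task in dsk.items()}


def rewrite_task(key, task):
    # Hypothetical hook: inspect the task and invalidate any stale
    # joblib cache entries (e.g. remove their cache files) here.
    return task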

jakirkham (Member) commented Feb 22, 2018

Given that PR ( dask/dask#2748 ) adds dask.base.collections_to_dsk and PR ( #930 ) reduces Client.collections_to_dsk to calling dask.base.collections_to_dsk, I wonder whether it would make sense to allow collections_to_dsk to be overridden using something like dask.set_options, to make this easier to tap into.

Edit: Maybe PR ( dask/dask#3196 ) would make this possible.

@jakirkham (Member)

This can be done pretty easily these days by adding to the "optimizations" in dask.config.
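For reference, a minimal sketch of that route (invalidate_stale_cache is a hypothetical placeholder; registered optimizations are called with the graph and the requested keys and must return a graph):

import dask


def invalidate_stale_cache(dsk, keys, **kwargs):
    # Hypothetical rewriting pass: walk the tasks, detect stale joblib
    # cache entries, and remove their cache files before execution.
    return dsk


# Optimizations registered in the config run whenever a collection's
# graph is built, i.e. before tasks reach the scheduler.
dask.config.set(optimizations=[invalidate_stale_cache])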


lorenzolucido commented Nov 12, 2018

I have the exact same use case. @jakirkham, do you mind sharing an example of how this can be achieved on a simple custom graph (using dask.config)? Thanks!

Edit: In fact, I am wondering whether @mrocklin's streamz library would nowadays be a good fit for this, with the parameters simply being sources, i.e. when you change a parameter node you simply emit the new value.
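For what it's worth, a toy sketch of that streamz idea, where a parameter is a source and downstream nodes recompute whenever a new value is emitted (the doubling step is just a stand-in for a real pipeline):

from streamz import Stream

param = Stream()                      # a parameter node is simply a source
result = param.map(lambda x: x * 2)   # stand-in for the real computation
result.sink(print)

param.emit(10)   # prints 20
param.emit(21)   # re-emitting recomputes downstream nodes; prints 42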
