
Filesystem persistence and using dask as a make/snakemake alternative #2119

Closed
nbren12 opened this issue Mar 23, 2017 · 43 comments

@nbren12
Contributor

nbren12 commented Mar 23, 2017

I brought up this issue/feature request in an issue at the cachey repo, and @mrocklin suggested that I open a new issue here.

It would be useful to add a persistent file-system caching mechanism to dask, to complement its in-memory caching ability. While the main advantage of in-memory caching is to speed up computations by avoiding redundant outputs, file-system based caching is more about maintaining persistence across sessions and potentially saving interesting intermediate outputs. This is particularly important when developing computational pipelines, because new code frequently has bugs that cause crashes, and it is also useful for maintaining persistence between ipython or jupyter notebook sessions. For example, suppose I am developing a workflow in a typical exploratory fashion, where A is some dask object:

B = correct_but_expensive_function(A)
C = sketchy_under_development_function(B)
C.to_hdf5("done.h5")

The correct_but_expensive_function could successfully finish after 5 minutes of work, but the sketchy_under_development_function might use up all the memory on my linux desktop and force me to kill the process, if I am lucky, or hard-reset the computer, if I am unlucky.

There are many tools out there which are used to automate scientific workflows by having the user specify some sort of dependency graph between files. These tools are especially popular in bioinformatics, but are broadly applicable in many disciplines. Some popular tools include make, luigi, Snakemake, nextflow, and many others. I like these tools a lot, but it can be tedious to break up a lengthy python script into pieces and specify the dependency graph by hand. This effort seems especially redundant because dask already builds a computational graph behind the scenes.

It would be nice to be able to mark nodes of a given dask graph that should be cached and automatically reloaded from disk or recomputed depending on certain conditions. In theory this could provide a very convenient Make-like replacement which does not require manually specifying file names and allows one to use python data structures. Moreover, some simple wrappers around dask.delayed could be written that allow users to run external command line utilities on the intermediate outputs.

Possible syntax

Some possible syntax for this could look like the following:

cache = FileSystemCache(...)
c = cache.saved(a + b)
d = c**10

The conditions for reloading or rerunning the a+b computation should depend on the type of cache object. Some useful reloading/recomputing conditions could be things like:

  1. file modification time (like make; see the sketch after this list)
  2. argument memoization
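
For condition (1), a minimal sketch of a make-style staleness check (plain filesystem paths; the names are illustrative, not an existing dask API) might be:

import os

def is_stale(output_path, *input_paths):
    """Recompute if the output is missing or older than any of its inputs."""
    if not os.path.exists(output_path):
        return True
    out_mtime = os.path.getmtime(output_path)
    return any(os.path.getmtime(p) > out_mtime for p in input_paths)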

Another possible syntax, which is slightly more verbose, could be a dict-like cache object:

cache['c'] = a + b
d = cache['c']**10

Automatic memoization of functions using some heuristic like Cachey does could be useful, but I am personally okay with manually indicating which steps should be saved to disk.

@mrocklin
Member

I suspect that you could implement this today with a scheduler plugin and an on-disk mutable mapping like shelve or chest.

The plugin would look at a dask graph and its on-disk mutable mapping and replace any keys in the dask graph with the values from the mutable mapping (or functions that obtain those values). It would then cull (dask.optimize.cull) the graph to remove excess nodes that no longer need to be there.

This is what the opportunistic caching plugin does already (dask.diagnostics.Cache) except that it also includes heuristics for what to store and, by default, it uses an in-memory dict (though this can be changed).
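
For concreteness, a rough sketch of that rewrite, assuming a shelve-like on-disk mapping named store (this is only an illustration, not the actual Cache plugin implementation):

from dask.optimization import cull  # lives in dask.optimize in older versions

def rewrite_with_cache(dsk, keys, store):
    """Substitute cached results into the graph, then drop tasks that are no longer needed."""
    dsk = dict(dsk)
    for key in list(dsk):
        if str(key) in store:
            # Replace the task with the literal value loaded from the on-disk mapping.
            dsk[key] = store[str(key)]
    dsk, _ = cull(dsk, keys)
    return dsk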

Is this enough information to get started, @nbren12?

@nbren12
Contributor Author

nbren12 commented Mar 24, 2017

Thanks for the suggestions @mrocklin. I am not very familiar with the dask internals, but I am willing to take a stab at it, since this would be very useful to me. I will take some time to understand how the opportunistic caching plugin works.

Ultimately, it would also be very useful to let the user specify the exact file type (and possibly path) that they want the data to be stored as on disk. This would then enable including calls to command line tools as part of the graph, like in the following image.

Is this something that could be handled using delayed?

@mrocklin
Member

mrocklin commented Mar 24, 2017 via email

@nbren12
Contributor Author

nbren12 commented Mar 24, 2017

Thanks.

Re: marking and caching individual nodes, I implemented a simple Callback with a save method that takes a dask.array object and adds the dask keys corresponding to its chunks to an internal list of objects to be saved. It gets called like this:

import numpy as np
import dask.array as da

cache = Saved()

np.random.seed(0)
r = np.random.rand(100, 100)

a = da.from_array(r, chunks=(50, 50))
b = a**2
c = cache.save(b + a)  # just returns a + b, while storing its keys
d = c**10

with cache:
    d.compute()

Unfortunately, it seems like dask.optimize is removing the nodes corresponding to c behind the scenes. Is there some easy way to prevent this?

@nbren12
Contributor Author

nbren12 commented Mar 24, 2017

Okay. I think I can make cache.save modify the dask graph of a+b to add some opaque function call like lambda x: x, which dask.optimize won't be able to inline. Does that make sense?
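
Roughly, that would mean wrapping the saved key's task in an opaque identity call; a sketch, where dsk and key stand for the graph and key tracked by cache.save:

def _identity(x):
    return x

# Wrap the existing task so the optimizer cannot inline it away.
dsk[key] = (_identity, dsk[key])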

@nbren12
Contributor Author

nbren12 commented Mar 24, 2017

Actually, that doesn't work because optimize.fuse still removes the nodes I want to store to disk.

@mrocklin
Member

You can avoid optimization by adding the keyword optimize_graph=False to your compute call.
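
With the earlier example, that would be:

with cache:
    d.compute(optimize_graph=False)  # run the graph as-is so the saved keys survive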

Docstring:

Compute several dask collections at once.

Parameters
----------
args : object
    Any number of objects. If it is a dask object, it's computed and the
    result is returned. By default, python builtin collections are also
    traversed to look for dask objects (for more information see the
    ``traverse`` keyword). Non-dask arguments are passed through unchanged.
traverse : bool, optional
    By default dask traverses builtin python collections looking for dask
    objects passed to ``compute``. For large collections this can be
    expensive. If none of the arguments contain any dask objects, set
    ``traverse=False`` to avoid doing this traversal.
get : callable, optional
    A scheduler ``get`` function to use. If not provided, the default is
    to check the global settings first, and then fall back to defaults for
    the collections.
optimize_graph : bool, optional
    If True [default], the optimizations for each collection are applied
    before computation. Otherwise the graph is run as is. This can be
    useful for debugging.
kwargs
    Extra keywords to forward to the scheduler ``get`` function.

Examples
--------
>>> import dask.array as da
>>> a = da.arange(10, chunks=2).sum()
>>> b = da.arange(10, chunks=2).mean()
>>> compute(a, b)
(45, 4.5)

By default, dask objects inside python collections will also be computed:

>>> compute({'a': a, 'b': b, 'c': 1})  # doctest: +SKIP
({'a': 45, 'b': 4.5, 'c': 1},)

@nbren12
Contributor Author

nbren12 commented Mar 30, 2017

Thanks. I did see that keyword, and have been working on a prototype. I can't decide if it is better to implement persistence at the low-level dask graph level, or to use dask.array's store method. Ideally, I would like to be able to do things like change the number of chunks of a dask array without having to recompute the whole graph.

@nbren12
Contributor Author

nbren12 commented Apr 6, 2017

@mrocklin I have been thinking some more about how to integrate native dask operations with external command line utilities. Stringing together several external calls using dask.delayed should be pretty straightforward, but reading the output data back into dask is less so. To do this, I think there needs to be some way to store/read a dask array from a delayed object (like a filename). For example, if f is a function returning a filename, it would be nice to be able to load data like this: da.from_array(delayed(f)(*args), ...). Do you think this would be difficult to implement?
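
For reference, the closest existing route is probably da.from_delayed, which wraps a delayed value as a single-chunk array given a known shape and dtype; a rough sketch, where run_external_tool and load_from_file are hypothetical helpers:

import numpy as np
import dask.array as da
from dask import delayed

@delayed
def run_external_tool():
    # call the command line utility and return the name of the file it wrote
    return "output.npy"

@delayed
def load_from_file(path):
    # read the file produced by the external tool into a NumPy array
    return np.load(path)

filename = run_external_tool()
arr = da.from_delayed(load_from_file(filename), shape=(100, 100), dtype=float)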

With this functionality, it would be possible to implement a really slick interface for combining command line tools with dask.array operations. I have something like the following in mind

import subprocess

# start with dask arrays a and b

@shell(chunks_shape='preserve')
def f(input, output):
    subprocess.check_call(["external_command", input, output])
    return output

# some python operations...
c = a + b

# The @shell decorator automatically saves c to disk, runs f, and loads the output
d = f(c)

@mrocklin
Member

mrocklin commented Apr 6, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 6, 2017

Thank you!

Unfortunately, the docs say that "the dask array will consist of a single chunk", which won't scale for large datasets, but I guess I could concatenate the individual chunks manually to construct the full array.

What about a delayed store to a delayed object (like an hdf5 file which has yet to be created)? I could imagine this need arising if all the workers cannot see the same filesystem, and need to be able to dump a dask array to disk for processing with some external utility.

@mrocklin
Member

mrocklin commented Apr 6, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 6, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 7, 2017

OK, I made a small modification to dask.array.insert_to_ooc which allows storing a dask array to a delayed task, and I also added a simple test. Would you be open to a PR?

@mrocklin
Member

mrocklin commented Apr 7, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 7, 2017

Indeed I have, but it doesn't work when the output array is a delayed value, so I had to modify insert_to_ooc to make it work. Here is the commit on my fork.

@nbren12
Contributor Author

nbren12 commented Apr 7, 2017

The use case I am thinking of is this

import h5py
from dask import delayed

@delayed
def create_on_disk_temporary():
    f = h5py.File(...)
    v = f.create_dataset(...)
    return v

delayed_store = x.store(create_on_disk_temporary(), compute=False)

Currently, the store method is not fully lazy because the output array needs to be created when the graph is constructed. My modification allows the output dataset to be created when the graph is executed.

@mrocklin
Member

mrocklin commented Apr 7, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 7, 2017

I thought some more about it, and it might be better for the delayed to return a generator which yields the storage array, and then performs any necessary cleanup like closing files, etc. This is a bit trickier to implement though.

@mrocklin
Member

mrocklin commented Apr 7, 2017 via email

@nbren12
Contributor Author

nbren12 commented Apr 7, 2017

OK, my current code should work then. I guess it would be possible for users to keep track of the opened files in some sort of queue, and close them if necessary.

@eserie

eserie commented Apr 20, 2017

Hello,
I'm jumping into the discussion maybe a little late, but I want to share a question/proposition.
I am quite a beginner with dask, so take this as a very naive proposition; I would understand if it is not relevant at all!

Would it make sense to have a persistence mechanism for Delayed collections using an option similar to dask_key_name?
I imagine something like:

delayed(f)(dask_serializer=serializer, *args, **kwargs)

where serializer is an object with a method serializer.dump(key, value).
If the serializer object can also report the state of the data and load it back (with methods like serializer.load(key) and serializer.is_computed(key)), it could allow reloading the data instead of computing it again.
But maybe the drawback is that this introduces a notion of serializer, which may not be desirable in the core of dask.
I played with this in a couple of files on my github repo. It is done in a slightly different spirit than Delayed, but it might be adapted for the core Delayed object.
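
To make that dump/load/is_computed protocol concrete, a toy file-backed serializer might look like this (purely illustrative, pickle-based):

import os
import pickle

class PickleSerializer:
    """Toy serializer implementing the dump/load/is_computed protocol sketched above."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.directory, "%s.pkl" % key)

    def dump(self, key, value):
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def load(self, key):
        with open(self._path(key), "rb") as f:
            return pickle.load(f)

    def is_computed(self, key):
        return os.path.exists(self._path(key))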

@limx0
Contributor

limx0 commented Sep 7, 2017

+1 I would love to see this functionality in dask.

The below is a brain dump; I hope it doesn't drag on too much (skip to the end if you don't want the context). @mrocklin, following up on python-streamz/streamz#14: when building a data pipeline in the financial modelling / trading space, my typical workflow is something like

  1. Create a simple DAG with some largely static input data (say stock prices)
  2. Develop some computed data (calculate stock returns or some other metric)
  3. Run some evaluation metric
  4. Repeat

I've used Luigi pretty extensively in the past, and also joblibs memory.cache, and my thoughts are:

Luigi
Pros:

  • Pretty stable
  • Linking dependencies is very clear and disk storage can be intuitive (custom defined filenames and lots of custom plugins for persistence)
  • Error handling aggregation is very good, easy to see systematic bugs across a grid of parameters (can easily see if we are missing data for a particular reporting date, for example)

Cons:

  • The scheduler isn't great and it isn't really designed to run sub-1-minute operations.
  • Changes to the data in an upstream node do not trigger the re-computation of downstream nodes
  • Lots of boilerplate for simple operations

joblib.cache
Pros:

  • Speed
  • intelligent hashing of functions & inputs. I believe that joblib actually hashes the whole function, and any changes to the function code invalidate the cache.

Cons:

  • No easy way to specify dependencies / a pipeline
  • Filenames are a hash rather than something intuitive - which probably can't really be avoided but makes debugging slightly more complex.

I think I would ideally like to have the following for my specific pipeline needs:

  • Reasonable fast scheduler (diagnostics a bonus)
  • DAG capabilities
  • Persistence to disk
  • "Flow-on" computation; if a node cache is invalidated, child nodes are also invalidated.
  • Aggregated exception handling / visibility

So finally, my actual questions:

  1. How to add a disk-based cache into dask. I believe this is answered above - some sort of joblib.memory scheduler plugin.
  2. How best to go about adding in some more scheduler visualisation re task status, particularly errors?
  3. How would I go about passing information down the DAG if a parent node's cache is invalidated?

Edit:

  • It looks like the scheduler plugin gets all of the information I require for (3); it's probably just a question of how best to implement the invalidation

@mrocklin
Member

mrocklin commented Sep 7, 2017

how to add a disk based cache into dask. I believe this is answered above - some sort of joblib.memory scheduler plugin.

What is your objective for a disk-based cache? Are you worried about failure? Or do you want to permanently store every intermediate value?

How best to go about adding in some more scheduler visualisation re task status, particularly errors.

You might want to take a look at the current visualization code in the dask.distributed scheduler here: https://github.com/dask/distributed/tree/master/distributed/bokeh

How would I go about passing information down the DAG if a parent nodes cache is invalidated?

That probably depends on how you decide to build your DAG system (which as I understand it is separate from Dask's task graph).

I hope that whatever you build I get to see eventually :) It would be interesting to see your situation more concretely.

@limx0
Contributor

limx0 commented Sep 7, 2017

What is your objective for a disk-based cache? Are you worried about failure? Or do you want to permanently store every intermediate value?

Permanently store intermediate values. Typically this data is all time series, so this would actually be intermediate DataFrames. We have multiple complex DAGs (sharing certain components), so the ability to just rerun all graphs and intelligently fill in gaps, or invalidate changes (changes to operations/functions could come from end users), would be amazing. We typically run a DAG, then fit a model and evaluate, so the ability to make some changes (perhaps to multiple operations/functions) and rerun the entire DAG for model evaluation is very powerful.

That probably depends on how you decide to build your DAG system (which as I understand it is separate from Dask's task graph).

In the past it has been separate but I would much prefer it to be a dask task graph. Assuming it is in dask, how would I go about invalidating children?

@nbren12
Contributor Author

nbren12 commented Sep 8, 2017

FWIW, since I opened this issue I have been using snakemake for this kind of thing, and for each step of the workflow I may or may not use dask. While it seems redundant to use two separate DAG systems---an inner one for computations using dask, and an outer dependency graph between input/output files using snakemake---it might be a lot of work to implement some of the conveniences of snakemake (e.g. job submission, re-runnable workflows, make-style modification-time-based rerunning) using dask. @mrocklin Do you think these kinds of things are within the scope of this project? Or should dask only be used within the individual steps of a luigi/make/snakemake workflow?

@limx0
Contributor

limx0 commented Sep 8, 2017

@nbren12 what has been your experience with snakemake thus far? What do you see as the downside? The appeal of dask IMO is the low task overhead and the intelligent scheduler. These two things are somewhat important, which is why I am even considering trying to fit dask to my problem rather than just using Luigi (or snakemake etc).

It may be that this work is outside of the scope of dask/distributed and the answer lies somewhere between snakemake and dask but it is not clear to me yet.

@jakirkham
Member

Have been giving this issue a lot of thought, as it is of particular interest for my research across a variety of use cases, generally in the realm of data lineage and data versioning. Namely: what operations have we performed on a piece of data through a workflow, what were the various intermediate results in that process, how do those results relate to each other, and how can others reproduce my work? If we want to get very technical, the lineage goes all the way back to data acquisition, but I think that part gets very workflow-specific and out of scope for Dask.

Something that we did very recently was add support for reusing results after they have been stored ( #2980 ). This allows one to store a computational result and then reload it later. This comes in two forms: first, a persist-to-disk solution, which triggers a computation that is written out immediately and can be chained into later computations; second, a lazy storing solution, which allows chaining computations and storage operations until some final compute step triggers the full run. Neither of these exactly equals caching, but I believe it should be possible to build something on top of that functionality that does reuse the store for caching.

One way of tackling this problem would be to build off the fact that Dask creates a sort of Merkle DAG where each node gets a unique name. The only thing we need to ensure is that the first node of that DAG (loading the input data) is deterministically generated from the input data and not randomly assigned. This is easy enough for anyone to do by overriding the name keyword for the Dask Array. With this constraint in place, we should get the same names on nodes downstream as long as we perform the same operations. Using the naming constraint, it is possible to create some form of MutableMapping, which will use the name as a key and store the result in each value. This MutableMapping can be anything we decide to implement (e.g. an HDF5 Group, Zarr Group, a directory we write multiple files in, etc.).

Now all we need is some way to replace Dask nodes with values from this MutableMapping cache or add them to the cache if they are missing. A good way to tackle this problem is an array plugin. This allows us to run a function every time a Dask Array is constructed. We can enable the plugin all the time or for selected blocks of code using a with statement. The plugin ends up being pretty simple. If we find the expected name of the Dask Array provided in the cache, construct a Dask Array that loads this data and returns it. If we don't find the Dask Array, we can use the persist to disk store solution to trigger an immediate computation and write out the result as we proceed. Finally to handle persistence across sessions, we initialize the MutableMapping on each start up to point back at the same store.

It should be pretty simple to include an outline of this solution in Dask so that it is pretty easy to pick up and reuse for different types of MutableMapping if that is of interest.
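
A minimal sketch of the name-keyed cache idea, using a zarr store on disk and a manual wrapper rather than the array plugin machinery (the paths and names are placeholders, not a proposed API):

import dask.array as da
import numpy as np
import zarr

CACHE_PATH = "cache.zarr"  # hypothetical on-disk store
group = zarr.open_group(CACHE_PATH, mode="a")

def cached(arr):
    """Reuse a stored result keyed by the array's (deterministic) dask name."""
    key = arr.name
    if key not in group:
        # First time this name is seen: compute and write the result to the store.
        da.to_zarr(arr, CACHE_PATH, component=key)
    # Either way, hand back an array whose chunks are read from disk.
    return da.from_zarr(CACHE_PATH, component=key)

# A deterministic input name makes downstream task names reproducible across sessions.
x = da.from_array(np.arange(1_000_000), chunks=100_000, name="input-data")
y = cached((x + 1) ** 2)
print(y.sum().compute())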

@nbren12
Contributor Author

nbren12 commented Jan 16, 2018

@jakirkham I am glad others have been thinking about this issue! I think you have outlined a plausible way to implement this idea for dask.arrays. Do you think a custom callback would be better than an array plugin here? Also, I am concerned about the robustness of inspecting the dask graph based on names.

I have lately been thinking about making dask play well with luigi target objects. Luigi already has a wide array of targets that represent different types of persistent data stores (e.g. database, file system, etc.). I think we can write a simple API for this that looks like dask.delayed. For example,

from luigi import LocalTarget

a = ... # dask object with name 'a'

tgt = LocalTarget("some/file/path")
b = targeted(a, tgt, reader=..., writer=...)

with TargetedCallback():
    b.compute()

Behind the scenes, the graph of b could look like this:

{'b': (load_if_exists_or_call, 'a', tgt, reader, writer),
 'a': ...}

Then, the callback would search the graph for load_if_exists_or_call and replace it with {'b': (reader, tgt)} if the target exists. If the target doesn't exist the replacement would be {'b': (write_and_return, writer, tgt, 'a')}.

This approach should work with arbitrary dask objects as long as reader/writer functions are given that work with the dask object. For instance, this could easily work immediately with dask.delayed functions that return Target objects. This would provide the same functionality as Makefiles or luigi, but would use the dask scheduler.
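
A rough sketch of the two helpers implied above (the signatures are hypothetical; it assumes luigi-style targets with an exists() method):

def write_and_return(writer, target, value):
    """Write the computed value to the target and pass it through unchanged."""
    writer(value, target)
    return value

def load_if_exists_or_call(value, target, reader, writer):
    """Used when no callback has rewritten the graph: read if present, otherwise write."""
    if target.exists():
        return reader(target)
    return write_and_return(writer, target, value)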

@nbren12
Contributor Author

nbren12 commented Jan 17, 2018

It is probably simpler to just move the graph editing into targeted, which would remove the need for the callback. On the other hand, deciding which computations to perform based on modification times, like Makefiles do, would require the whole graph.

nbren12 added a commit to nbren12/dask.targeted that referenced this issue Jan 17, 2018
This function follows the discussion laid out on
dask/dask#2119.

This function simply changes the dask graph of the returned value, and does not
need to be used with a callback. However, this is something that would probably
be worth adding, since it could show exactly the dependencies which need to be run.
@nbren12
Contributor Author

nbren12 commented Jan 17, 2018

I just made a little github project which follows the API I suggested above. I am calling it dask.targeted for now, but I am open to any naming suggestions. So far I have the basic API I outlined above working. Ultimately, it might be nice to use multiple dispatch to allow a syntax like targeted(a, tgt) that picks the appropriate reader and writer functions based on the types of a and tgt.

@jakirkham
Member

Sorry for the slow reply. Have had a lot of meetings of late.

Do you think a custom callback would be better than an array plugin here?

The main advantage of a custom callback would be to enable substitution of values in the Dask graph with ones from the cache without engagement from the developer/user.

Also, I am concerned about the robustness of inspecting the dask graph based on names.

What concerns come to mind?

I have lately been thinking about making dask play well with luigi target objects.

Can certainly see the value in that. Luigi does have a nice API. At least for me, I'd like to tackle this without adding Luigi as a dependency and think it should be possible.

Then, the callback would search the graph for load_if_exists_or_call and replace it with {'b': (reader, tgt)} if the target exists.

So I think this is really where the plugin shines. Namely it separates the concerns of operating on the graph from the graph's contents. It also avoids editing the graph per se as it can just replace a node in the graph with its contents loaded from disk.

@nbren12
Contributor Author

nbren12 commented Jan 23, 2018

No problem.

What concerns come to mind?

I don't really see how we can generate keys in a deterministic way when the functions are not pure. The only feasible option I can imagine is to have the user manually specify a good identifier for a persistent resource.

tackle this without adding Luigi as a dependency...

I agree. A MutableMapping interface could be sufficient, but I think luigi targets could really add a lot. For example, you can do things like write to a remote file over ssh very easily. Maybe it is possible to implement a MutableMapping front end to arbitrary luigi targets.

It also avoids editing the graph per se as it can just replace a node in the graph with its contents loaded from disk.

I think an argument can be made either way about which approach is cleaner, since plugins edit the graph at construction time, while callbacks "optimize" the graph at evaluation time. However, callbacks are probably more flexible because they can work with any dask object, and not just arrays. In addition, we can do things like analyze the whole graph and print useful logging information at evaluation time. FWIW I think the callback I wrote is fairly straightforward.

@jakirkham
Member

As another option to consider, @mrocklin suggested to me last week that one could override a Client's collections_to_dsk method. This would allow one to make changes once Dask collections are submitted for computation, so it avoids having to add any logic to the graph directly. Issue ( dask/distributed#1384 ) should help clarify this suggestion a bit.
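
A very rough sketch of that idea, assuming one subclasses the distributed Client and intercepts the collections-to-graph conversion (not a vetted implementation):

from distributed import Client

class CachingClient(Client):
    @staticmethod
    def collections_to_dsk(collections, *args, **kwargs):
        dsk = Client.collections_to_dsk(collections, *args, **kwargs)
        # ... substitute cached keys / load stored values here before submission ...
        return dsk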

@jakirkham
Member

Wanted to share an Array Cache that can be used with Dask Distributed. Much like what has been discussed here, the interest in this cache is to store intermediate values that have long-term value while allowing for their retrieval. As there is a lot of value in being able to name the intermediate value placed in the cache, found the MutableMapping interface to be a nice fit. Have come up with a simple solution shown in this Gist.

@syagev

syagev commented May 1, 2019

Is this issue still being considered? I think it's priceless.

As mentioned multiple times in the docs, dask with a distributed scheduler is great for running workloads on a single machine as well. Persisting to disk can also be very useful when RAM is limited but computation time is long, and you want transparent caching to disk without having to explicitly write out a parquet file or the like.

@lsorber

lsorber commented May 4, 2019

Graphchain might be of interest to this thread. It is a caching optimiser for dask graphs. Some of its features:

  1. Caches dask computations to any PyFilesystem FS URL, including for example the OS filesystem, to Memory, and S3.
  2. Cache keys are based on a chain of hashes (hence the name graphchain) so that we can identify a cached result almost immediately. This is different from a joblib.Memory-style approach of hashing a computation's inputs (which can be very expensive when these inputs are large numpy arrays or pandas DataFrames).
  3. A result is only cached if it is expected that it will save time compared to just computing the computation (which depends on the latency and bandwidth characteristics of the cache's PyFilesystem).

It can be used as a dask graph optimiser (e.g., with dask.config.set(delayed_optimize=graphchain.optimize):), or with the built-in get convenience function (i.e., graphchain.get(dsk, keys, location='s3://mybucket/__graphchain_cache__')).

There are two years' worth of comments here, so I'm not sure if it fits all the use cases described, but if you have any feature requests we'd be happy to take a look!
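
Based on the description above, a minimal usage sketch of the optimizer route might look like this (the toy computation is a placeholder, and no explicit cache location is configured here):

import dask
import graphchain
from dask import delayed

@delayed
def expensive(x):
    return x ** 2

result = expensive(3) + expensive(4)

with dask.config.set(delayed_optimize=graphchain.optimize):
    print(result.compute())  # cached results are reused on subsequent runs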

@zulissi

zulissi commented Dec 14, 2020

This looks a little stale, but including a note here since I spent all weekend trying to do similar things!

As long as your distributed nodes have access to a shared filesystem, you can use joblib.Memory cached functions directly with dask. The cached function gets pickled/sent as normal, and the workers apply the normal Memory logic when calling the function. For example:

import dask.bag as db

mybag = db.from_sequence(long_list)
results = mybag.map(expensive_function)

could be cached at the individual function call level with

import dask.bag as db
from joblib import Memory

mybag = db.from_sequence(long_list)
memory = Memory('./cachedir')
results = mybag.map(memory.cache(expensive_function))

and as long as all of the workers see the same cachedir in their working directory it works out. This scaled very nicely (>10k tasks, >200 workers) since the cache logic was distributed and solved most of the problems I was facing.

I was able to get graphchain to work for simpler cases, but for map operations on partitions it looked like

  • The resulting caches were specific to individual partitions, so if you change partitions or datasets it might have to recompute, and
  • The graphchain.optimize logic got very time-consuming for large numbers of partitions (>1000 tasks)

Hope this helps someone, and this might be related to dask/distributed#3811

@jakirkham
Member

It might be worth filing an issue with the graphchain team about the things it struggled with (if you haven't already).

@nbren12
Contributor Author

nbren12 commented Dec 15, 2020

After some thought, I now think that dask and larger workflow schedulers occupy different niches. Yes, tools like Dask and Airflow both represent tasks as a DAG, but the similarity ends there. Tools like Airflow schedule and observe individual long-running tasks, whereas I think dask is focused on coordinating a swarm of tiny tasks.

That said, the ability to move data to disk for global shuffles is a really nice feature of Spark.

@zulissi

zulissi commented Dec 15, 2020

I think that makes Prefect particularly interesting as it uses dask for the scheduling of larger workflows.

That said, there's additional engineering overhead in understanding and using Prefect or larger schedulers, but joblib+dask is pretty easy to explain/use!

@nbren12
Contributor Author

nbren12 commented Dec 15, 2020

Agreed. joblib + dask will work really well for non-distributed computations.

@jsignell
Member

It seems like this discussion has reached a reasonable conclusion. Please feel free to open a documentation pull request if there is a place where you think this conversation can be distilled. I am closing this issue now.
