Consolidating all tasks that write to a file on a single worker #2163

Open
shoyer opened this issue Aug 6, 2018 · 5 comments
Comments

@shoyer
Member

shoyer commented Aug 6, 2018

A common pattern when using xarray with dask is to have a large number of tasks writing to a smaller number of files, e.g., an xarray.Dataset consisting of a handful of dask arrays gets stored into a single netCDF file.

This works pretty well with the non-distributed version of dask, but doing it with dask-distributed presents two challenges:

  1. There's significant overhead associated with opening/closing netCDF files (comparable to the cost of writing a single chunk), so we'd really prefer to avoid doing so for every write task -- yet this is what we currently do. I haven't measured this directly, but I have a strong suspicion this is part of why xarray can be so slow at writing netCDF files with dask-distributed.
  2. We need to coordinate a bunch of distributed locks to ensure that we don't try to write to the same file from multiple processes at the same time. These are tricky to reason about (see fix distributed writes pydata/xarray#1793 and xarray.backends refactor pydata/xarray#2261) and unfortunately we still don't have it right yet in xarray -- we only ever got this working with netCDF4-Python, not h5netcdf or scipy.

It would be nice if we could simply consolidate all tasks that involve writing to a single file onto a single worker. This would avoid the necessity to reopen files, pass around open files between processes or worry about distributed locks.

Does dask-distributed have any sort of existing machinery that would facilitate this? In particular, I wonder if this could be a good use-case for actors (#2133).
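For context, the closest existing machinery I'm aware of is the `workers=` keyword on `Client.compute`/`Client.persist`, which restricts where tasks may run. A rough sketch (the scheduler and worker addresses are placeholders, and this pins the whole collection rather than just the write tasks):

```python
import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler:8786")        # placeholder address

arr = da.random.random((20000, 20000), chunks=(2000, 2000))
writes = arr.sum()                             # stand-in for a graph that writes to one file

# Restrict this graph to a single worker, so all of the "write" tasks
# run in one process and could, in principle, share an open file handle.
future = client.compute(writes, workers=["tcp://worker-1:34567"])
result = future.result()
```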

@mrocklin
Member

mrocklin commented Aug 6, 2018

Actors are experimental and may be removed at any time. I don't recommend that XArray depend on them. They're also advanced technology, and probably bring along problems that are hard to foresee. That being said, yes, they would be a possible solution here to manage otherwise uncomfortable state.

I wonder how much of this problem could be removed by consolidating data beforehand into a single task or into a chain of dependent tasks? Are files going to be much larger than an individual task? Would creating artificial dependencies between tasks help in some way?

Do you have more information about what is wrong with distributed locking? This approach seems simplest to me if it is cheap.
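For concreteness, the locking approach under discussion looks roughly like this; `write_block` and its arguments are made-up stand-ins for whatever the backend actually does:

```python
from dask.distributed import Lock

def write_block(block, filename, region):
    lock = Lock(filename)      # one cluster-wide lock per output file
    with lock:
        # open the file, write `block` into `region`, then close it again
        ...

# Every write task acquires the lock, so only one process touches the
# file at a time -- at the cost of reopening the file for each chunk.
```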

@shoyer
Member Author

shoyer commented Aug 6, 2018

Are files going to be much larger than an individual task?

Often, yes. It's pretty common to encounter netCDF files consisting of 1-20 arrays with total size in the 200MB-10GB range. This is solidly in the "medium data" range where streaming computation is valuable.

We could encourage changing best practices to write smaller files, but users will be surprised/disappointed if switching to dask-distributed suddenly means they can't write netCDF files that don't fit in memory on a single node.

This probably does make sense when using netCDF backends like scipy that don't (yet?) support writes without loading the entire file into memory.

Would creating artificial dependencies between tasks help in some way?

Yes, I think this could also work nicely, at least to resolve any need for locking. The downside is that we would need a priori knowledge of the proper task ordering to handle streaming computation use cases.
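For illustration, a rough sketch of the artificial-dependencies idea with dask.delayed; `write_chunk` and the toy `chunks` list are hypothetical:

```python
import numpy as np
import dask

@dask.delayed
def write_chunk(previous, chunk, filename, region):
    # `previous` exists only to force ordering; its value is ignored.
    ...  # open the file, write `chunk` into `region`, close the file
    return region

chunks = [np.random.random(1000) for _ in range(4)]   # toy data

token = None
for i, chunk in enumerate(chunks):
    token = write_chunk(token, chunk, "out.nc", region=i)

token.compute()   # the writes now run strictly one after another
```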

Do you have more information about what is wrong with distributed locking? This approach seems simplest to me if it is cheap.

I'm pretty sure that with more futzing/refactoring I could get locks and reopening files for every operation working. The overhead could be minimized with appropriate (automatic?) rechunking.

Maybe this is the better way to go.
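For example, the rechunking step could be as simple as coarsening the chunks before writing, so each open/write/close cycle amortizes over more data (the chunk sizes below are arbitrary placeholders):

```python
import dask.array as da

arr = da.random.random((40000, 40000), chunks=(1000, 1000))   # many small chunks
coarse = arr.rechunk((10000, 10000))                          # fewer, larger writes
```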

@mrocklin
Member

mrocklin commented Aug 6, 2018 via email

@jakirkham
Member

If you do go the copying direction, this may be helpful.

@jakirkham
Member

Other things to consider: I had been playing with the idea of sending the whole graph over to a worker to run (dask/dask#3275). Maybe something with Variable makes sense. We could also revisit locking for each write (dask/dask#3179). Maybe something involving per-worker resources could let us force tasks onto workers with the needed resource. That might work well, but we will probably have to do a few passes to get syntax that succinctly captures the intent.
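A rough sketch of the per-worker-resources idea; the `FILEWRITE` resource name and the addresses are made up:

```python
# Start one worker that advertises the resource:
#   dask-worker tcp://scheduler:8786 --resources "FILEWRITE=1"
import dask.array as da
from dask.distributed import Client

client = Client("tcp://scheduler:8786")                  # placeholder address
writes = da.random.random((10000, 10000), chunks=(1000, 1000)).sum()

# Ask the scheduler to place these tasks only on workers that have
# the FILEWRITE resource, i.e. the single worker started above.
future = client.compute(writes, resources={"FILEWRITE": 1})
```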
