Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explain intermediate storage #4025

Merged
merged 10 commits into from May 1, 2019
39 changes: 39 additions & 0 deletions docs/source/array-creation.rst
Expand Up @@ -352,6 +352,45 @@ Files may be retrieved by running `da.from_tiledb` with the same URI, and any
necessary arguments.


Intermediate storage
--------------------

.. autosummary::
store

In some cases, one may wish to store an intermediate result in long term
storage. This differs from ``persist``, which is mainly used to manage
intermediate results within Dask that don't necessarily have longevity.
Also it differs from storing final results as these mark the end of the Dask
graph. So lack anyway to be reused in other graphs without reloading the data.
jakirkham marked this conversation as resolved.
Show resolved Hide resolved
Intermediate storage is mainly useful in cases where the data is needed
outside of Dask (e.g. on disk, in a database, in the cloud, etc.). It can
be useful as a checkpoint for long running or flaky computations.
jakirkham marked this conversation as resolved.
Show resolved Hide resolved

The intermediate storage use case differs from the typical storage use case as
a Dask Array is returned to the user that represents the result of that
storage operation. This is typically done by setting the ``store`` function's
``return_stored`` flag to ``True``. The user can then decide whether the
jakirkham marked this conversation as resolved.
Show resolved Hide resolved
storage operation happens immediately (by setting the ``compute`` flag to
``True``) or later (by setting the ``compute`` flag to ``False``). In all
other ways, this behaves the same as a normal call to ``store``. Some examples
are shown below.

.. code-block:: Python

>>> import dask.array as da
>>> import zarr as zr
>>> c = (2, 2)
>>> d = da.ones((10, 11), chunks=c)
>>> z1 = zr.open_array('lazy.zarr', shape=d.shape, dtype=d.dtype, chunks=c)
>>> z2 = zr.open_array('eager.zarr', shape=d.shape, dtype=d.dtype, chunks=c)
>>> d1 = d.store(z1, compute=False, return_stored=True)
>>> d2 = d.store(z2, compute=True, return_stored=True)

This can be combined with any other storage strategies either noted above, in
the docs or for any specialized storage types.


Plugins
=======

Expand Down