Often we want to do a bit of custom work with dask.delayed
(for example
for complex data ingest), then leverage the algorithms in dask.array
or
dask.dataframe
, and then switch back to custom work. To this end, all
collections support from_delayed
functions and to_delayed
methods.
As an example, consider the case where we store tabular data in a custom format
not known by dask.dataframe
. This format is naturally broken apart into
pieces and we have a function that reads one piece into a Pandas DataFrame.
We use dask.delayed
to lazily read these files into Pandas DataFrames,
use dd.from_delayed
to wrap these pieces up into a single
dask.dataframe
, use the complex algorithms within dask.dataframe
(groupby, join, etc..) and then switch back to delayed to save our results
back to the custom format.
import dask.dataframe as dd
from dask.delayed import delayed
from my_custom_library import load, save
filenames = ...
dfs = [delayed(load)(fn) for fn in filenames]
df = dd.from_delayed(dfs)
df = ... # do work with dask.dataframe
dfs = df.to_delayed()
writes = [delayed(save)(df, fn) for df, fn in zip(dfs, filenames)]
dd.compute(*writes)
Data science is often complex, dask.delayed
provides a release valve for
users to manage this complexity on their own, and solve the last mile problem
for custom formats and complex situations.