
Add from_map function to Dask-DataFrame#8911

Merged
rjzamora merged 36 commits into dask:main from rjzamora:from_map
May 4, 2022

Conversation

@rjzamora
Member

@rjzamora rjzamora commented Apr 11, 2022

This is an exploratory PR to add a new from_map function to dask.dataframe. The purpose of this function is to mimic the built-in Python map function, but for DataFrame-collection generation. For example, a simple Parquet read could be written as:

ddf = dd.from_map(pd.read_parquet, file_list, columns=["a"], engine="fastparquet")

This allows the user to easily do things like interact with the pyarrow.dataset API directly:

import pyarrow.dataset as ds

ddf = from_map(lambda x: x.to_table().to_pandas(), ds.dataset("tmpdir").get_fragments())

(Note that the pd.read_parquet function is mapped onto every element of file_list to generate a DataFrame collection.)

The motivation for such an API is to replace from_delayed as the "suggested" mechanism for custom DataFrame generation. Although #8852 is expected to improve the performance of from_delayed, creating a distinct Delayed object for every new partition will still lead to ugliness at scale (and is unnecessary when the same callable can be used to generate every partition anyway).

Proposed API/Docstring

@insert_meta_param_description
def from_map(
    func,
    *iterables,
    args=None,
    meta=None,
    divisions=None,
    label=None,
    token=None,
    enforce_metadata=True,
    **kwargs,
):
    """Create a DataFrame collection from a custom function map

    Parameters
    ----------
    func : callable
        Function used to create each partition. If ``func`` satisfies the
        ``DataFrameIOFunction`` protocol, column projection will be enabled.
    *iterables : Iterable objects
        Iterable objects to map to each output partition. All iterables must
        be the same length. This length determines the number of partitions
        in the output collection (only one element of each iterable will
        be passed to ``func`` for each partition).
    args : list or tuple, optional
        Positional arguments to broadcast to each output partition. Note
        that these arguments will always be passed to ``func`` after the
        ``iterables`` positional arguments.
    $META
    divisions : tuple, str, optional
        Partition boundaries along the index.
        For tuple, see https://docs.dask.org/en/latest/dataframe-design.html#partitions
        For the string 'sorted', the mapped values will be computed to
        find the index values; this assumes that the indexes are
        mutually sorted.
        If None, index information will not be used.
    label : str, optional
        String to use as the function-name label in the output
        collection-key names.
    token : str, optional
        String to use as the "token" in the output collection-key names.
    enforce_metadata : bool, default True
        Whether to enforce at runtime that the structure of the DataFrame
        produced by ``func`` actually matches the structure of ``meta``.
        This will rename and reorder columns for each partition,
        and will raise an error if this doesn't work or types don't match.
    **kwargs :
        Keyword arguments to broadcast to each output partition. The
        same arguments will be passed to ``func`` for every output
        partition.

    Examples
    --------
    >>> import pandas as pd
    >>> import dask.dataframe as dd
    >>> func = lambda x, size=0: pd.Series([x] * size)
    >>> inputs = ["A", "B"]
    >>> dd.from_map(func, inputs, size=2).compute()
    0    A
    1    A
    0    B
    1    B
    dtype: object

    See Also
    --------
    dask.dataframe.from_delayed
    dask.layers.DataFrameIOLayer
    """

Other Alternatives

I also considered expanding the existing map_partitions function to handle the special case that an existing DataFrame object is not included in the args. However, I suspect that we lose more by complicating that API than we do by adding a new function with a simpler scope.

@ian-r-rose - Any thoughts on this? If others are supportive of this idea (or something similar), I will add tests and revise, etc.

@douglasdavis
Member

douglasdavis commented Apr 12, 2022

I like this! Independent of the extended functionality you mentioned with the pyarrow example, just abstracting away the pattern of layer --> highlevelgraph --> new_object:

layer = DataFrameIOLayer(callable, inputs, ...)
hlg = HighLevelGraph(layer, ...)
return new_dd_object(hlg, ...)

into

return from_map(callable, inputs, ...)

seems like a good improvement IMO. I've taken some inspiration and started playing with something similar at https://github.com/ContinuumIO/dask-awkward/pull/49

And I agree with your thought about extending map_partitions: the separation between this new case (prefixed with from_) and the case where a DataFrame collection already exists (and we operate on its partitions via map_partitions) seems natural.

@rjzamora
Member Author

rjzamora commented Apr 13, 2022

Thanks for the feedback @douglasdavis !

I completely agree that an API like this is a worthwhile improvement, so I will continue pushing on this PR.

I suppose it may also make sense to consider whether such an API would be useful in dask.array (cc @jrbourbeau). If so, it probably makes sense to ensure that a similar function signature will work for an array-based collection as well.

@rjzamora rjzamora marked this pull request as ready for review April 20, 2022 16:09
@rjzamora rjzamora changed the title [WIP] Add from_map function to Dask-DataFrame Add from_map function to Dask-DataFrame Apr 20, 2022
@rjzamora
Member Author

Any thoughts on this new API proposal (or interested reviewers)? :) cc @dask/dataframe

Member

@douglasdavis douglasdavis left a comment


Added a few nit-picky comments

@douglasdavis
Member

Something that is opaque to me: in from_map the HighLevelGraph.from_collections(graph, name, dependencies=[]) call always sets dependencies to an empty list. How is this possible with something like read_csv which uses dask.bytes to create delayed blocks of bytes? It doesn't seem to be an issue but it's something I'd like to better understand!
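My own reading of why dependencies=[] is valid here (an assumption on my part, not verified against the PR): each element of the iterables is embedded directly in its task, so the IO layer has no edges into other graph layers. A minimal pure-Python sketch of such a graph:

```python
# Hypothetical low-level task graph for a two-partition IO layer.
# Every task closes over its own input element (a filename here, but
# it could just as well be a delayed block of bytes treated as data).
name = "from-map-demo"
inputs = ["file0.csv", "file1.csv"]  # placeholder inputs

graph = {(name, i): (str.upper, element) for i, element in enumerate(inputs)}

# No task references a key from another layer, so the layer is a root
# of the graph and an empty dependencies list is consistent.
assert all(isinstance(task[1], str) for task in graph.values())
```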

Contributor

@bryanwweber bryanwweber left a comment


Thanks @rjzamora! This looks like a nice generalization. I saw a small typo and had a question as well.

@jsignell
Member

I am still reading through, but I am strongly in favor of adding a new top-level from_map function (another option would be to name it from_partitions). I tried to do something like this for arrays in #6294 but I was too attached to the map_blocks interface. I like how this PR deviates from map_partitions to really serve this particular use case well.

@jrbourbeau
Member

I am still reading through, but I am strongly in favor of adding a new top-level from_map function (another option would be to name it from_partitions)

Same here. I'll throw from_function in the pool of possible names too -- it seems to convey the intent of this method (at least in my mind). FWIW there's also a similar np.fromfunction method which it might be nice to be close-ish to when choosing a name.

@rjzamora
Member Author

another option would be to name it from_partitions

Same here. I'll throw from_function in the pool of possible names too -- it seems to convey the intent of this method (at least in my mind). FWIW there's also a similar np.fromfunction method which it might be nice to be close-ish to when choosing a name.

Thanks for taking a look at this @jsignell and @jrbourbeau !

I'm very open to other alternatives to the from_map name. The from_partitions name feels a bit off to me since we are actually mapping to partitions. The from_function name was originally on my list, but I actually leaned away from it because of the numpy function which seemed quite different to me (but I may just be misunderstanding what that function does). My impression was that the numpy function doesn't really allow you to "map" anything to the function besides the indices, whereas the from_map function I am proposing is much more similar to python's map function.
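A quick sketch of the distinction being drawn (a toy example of mine, not from the thread): np.fromfunction only hands the callable coordinate grids, while Python's map, like the proposed from_map, hands it arbitrary elements.

```python
import numpy as np

# np.fromfunction: the callable receives index/coordinate arrays and
# nothing else -- the "inputs" are always the grid positions.
arr = np.fromfunction(lambda i, j: i + j, (2, 3))
# arr == [[0, 1, 2], [1, 2, 3]]

# builtin map (the model for from_map): the callable receives the
# elements of an arbitrary iterable, e.g. filenames or fragments.
parts = list(map(str.upper, ["a", "b"]))
# parts == ["A", "B"]
```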

    )
    if project_after_read:
-       return df[self.columns]
+       return df[self._columns]
Member


I'm a little confused about all these columns changes. Is the goal to have self.columns always be a list?

Also I think this should be self.columns since self._columns can theoretically be None right?

Member Author


I'm a little confused about all these columns changes. Is the goal to have self.columns always be a list?

These changes just make it easier to conform to the new DataFrameIOFunction Protocol for column-projection. The subtle differences between self.columns and self.full_columns for csv made this tricky before.

Also I think this should be self.columns since self._columns can theoretically be None right?

Yep - Good call!
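For readers unfamiliar with the protocol being referenced, here is a hedged sketch of an IO callable supporting column projection. The method names (columns, project_columns) follow my reading of the DataFrameIOFunction protocol mentioned above; the real interface may differ in detail.

```python
import pandas as pd

class ReadPart:
    """Hypothetical IO function implementing column projection."""

    def __init__(self, columns=None):
        self._columns = columns

    @property
    def columns(self):
        # May be None, meaning "all columns" (the subtlety discussed above)
        return self._columns

    def project_columns(self, columns):
        # Return a NEW IO function that reads only ``columns``
        return ReadPart(columns=columns)

    def __call__(self, data):
        df = pd.DataFrame(data)
        return df[self.columns] if self.columns is not None else df

# The optimizer could swap in a projected version of the function:
func = ReadPart().project_columns(["a"])
df = func({"a": [1], "b": [2]})  # only column "a" survives
```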

Function used to create each partition. If ``func`` satisfies the
``DataFrameIOFunction`` protocol, column projection will be enabled.
*iterables : Iterable objects
Iterable objects to map to each output partition. all iterables must
Member


typo:

Suggested change
Iterable objects to map to each output partition. all iterables must
Iterable objects to map to each output partition. All iterables must

Also I was initially a little confused about the hierarchy of this arg. My current understanding is that iterables is a list of arbitrary length where each item within must have len = npartitions. So an example of an iterable would be a list of filenames, right? I think this is what the description says, but it wasn't immediately obvious to me 🤷

Member Author


I believe your understanding is correct, but I did find it difficult to come up with good wording here.

    # Input validation
    if not callable(func):
        raise ValueError("`func` argument must be `callable`")
    lengths = set()
Member


is this equivalent to npartitions? That seems like it would be a slightly more legible var name.

Member


I guess it's a set of the npartitions that the iterables suggest. I'm fine with this name, but it did take me a bit to understand the logic of this validation section.
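A minimal sketch (my own, not the PR's code) of the validation pattern under discussion: the set collapses to a single element exactly when all iterables agree on length, and that one length is the implied npartitions.

```python
def validate(*iterables):
    # Collect every iterable's length; agreement means a one-element set.
    lengths = {len(it) for it in iterables}
    if len(lengths) != 1:
        raise ValueError("All iterables must have the same length")
    return lengths.pop()  # the implied number of partitions

npartitions = validate(["a.csv", "b.csv"], [0, 1])  # both have length 2
```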

@jsignell
Member

Thinking it over again, I like from_map as the name.

For the array side, I think the biggest difference is that we need to know the block location for each output block.

@github-actions github-actions bot added the documentation (Improve or add to documentation) label Apr 29, 2022
Member

@jsignell jsignell left a comment


I'm good with this as is!

@rjzamora
Member Author

rjzamora commented May 3, 2022

Thanks to everyone who provided feedback here! (@jsignell @jrbourbeau @ian-r-rose @bryanwweber @douglasdavis )

I'd like to merge this EOD today (so it sits in main for a bit before the next release), so please do let me know if there are any remaining suggestions/concerns :)


Labels

dataframe, documentation (Improve or add to documentation), io


6 participants