Add from_map function to Dask-DataFrame #8911
Conversation
I like this! Independent of the extended functionality you mentioned with the pyarrow example, just abstracting away the pattern of layer --> highlevelgraph --> new_object:

```python
layer = DataFrameIOLayer(callable, inputs, ...)
hlg = HighLevelGraph(layer, ...)
return new_dd_object(hlg, ...)
```

into

```python
return from_map(callable, inputs, ...)
```

seems like a good improvement IMO. I've taken some inspiration and started playing with something similar at https://github.com/ContinuumIO/dask-awkward/pull/49

And I agree with your thought about extending …
Thanks for the feedback @douglasdavis ! I completely agree that an API like this is a worthwhile improvement, so I will continue pushing on this PR.
I suppose it may also make sense to consider if such an API would be useful in …
Any thoughts on this new API proposal (or interested reviewers)? :) cc @dask/dataframe
douglasdavis left a comment
Added a few nit-picky comments
Something that is opaque to me: in …
bryanwweber left a comment
Thanks @rjzamora! This looks like a nice generalization. I saw a small typo and had a question as well.
I am still reading through, but I am strongly in favor of adding a new level …
Same here. I'll throw …
Thanks for taking a look at this @jsignell and @jrbourbeau ! I'm very open to other alternatives to the …
dask/dataframe/io/csv.py
Outdated
```diff
 )
 if project_after_read:
-    return df[self.columns]
+    return df[self._columns]
```
I'm a little confused about all these columns changes. Is the goal to have self.columns always be a list?
Also I think this should be self.columns since self._columns can theoretically be None right?
> I'm a little confused about all these columns changes. Is the goal to have self.columns always be a list?
These changes just make it easier to conform to the new DataFrameIOFunction Protocol for column-projection. The subtle differences between self.columns and self.full_columns for csv made this tricky before.
> Also I think this should be self.columns since self._columns can theoretically be None right?
Yep - Good call!
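For readers following along: my rough mental model of the column-projection pattern being discussed could be sketched like this. This is a hypothetical sketch, not the PR's actual code; the `CSVReader` class name and the `read_csv` usage are mine:

```python
import pandas as pd


class CSVReader:
    """Hypothetical IO callable following the column-projection pattern
    discussed above: a ``columns`` property plus a ``project_columns``
    method that rebuilds the callable with fewer columns."""

    def __init__(self, columns):
        # Always store a list, per the csv.py changes discussed above
        self._columns = list(columns)

    @property
    def columns(self):
        return self._columns

    def project_columns(self, columns):
        # Return a new callable of the same type that will only
        # read the projected subset of columns
        return CSVReader(columns)

    def __call__(self, path):
        return pd.read_csv(path, usecols=self.columns)
```

With this shape, the optimizer can call `project_columns` on the IO function instead of reading everything and slicing afterwards.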
dask/dataframe/io/io.py
Outdated
```
    Function used to create each partition. If ``func`` satisfies the
    ``DataFrameIOFunction`` protocol, column projection will be enabled.
*iterables : Iterable objects
    Iterable objects to map to each output partition. all iterables must
```
typo:

```diff
-    Iterable objects to map to each output partition. all iterables must
+    Iterable objects to map to each output partition. All iterables must
```
Also, I was initially a little confused about the hierarchy of this arg. My current understanding is that iterables is a list of arbitrary length where each item within must have len = npartitions. So an example of an iterable would be a list of filenames, right? I think this is what the description says, but it wasn't immediately obvious to me 🤷
I believe your understanding is correct, but I did find it difficult to come up with good wording here.
```python
# Input validation
if not callable(func):
    raise ValueError("`func` argument must be `callable`")
lengths = set()
```
Is this equivalent to npartitions? That seems like it would be a slightly more legible var name.
I guess it's a set of the npartitions that the iterables suggest. I'm fine with this name, but it did take me a bit to understand the logic of this validation section.
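If it helps future readers, my mental model of this validation section is roughly the following. This is my own sketch of the logic, not the PR's exact code:

```python
def check_lengths(iterables):
    """Collect the length of every sized iterable; they must all agree,
    since each iterable contributes one element per output partition."""
    lengths = {len(it) for it in iterables if hasattr(it, "__len__")}
    if len(lengths) > 1:
        raise ValueError("All iterables must have the same length")
    # The common length (if any) is the number of output partitions
    return lengths.pop() if lengths else None
```

The set ends up containing the candidate npartitions values, and more than one distinct value means the iterables disagree.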
Thinking it over again, I like … For the array side, I think the biggest difference is that we need to know the block location for each output block.
Thanks to everyone who provided feedback here! (@jsignell @jrbourbeau @ian-r-rose @bryanwweber @douglasdavis ) I'd like to merge this EOD today (so it sits in main for a bit before the next release), so please do let me know if there are any remaining suggestions/concerns :)
This is an exploratory PR to add a new `from_map` function to `dask.dataframe`. The purpose of this function is to mimic the standard Python `map` function, but for DataFrame-collection generation. For example, this would allow a simple Parquet read to be written as: …

This allows the user to easily do things like interact with the `pyarrow.dataset` API directly: …

(Note that the `pd.read_parquet` function is mapped onto every element of `file_list` to generate a DataFrame collection.)

The motivation for such an API is to replace `from_delayed` as the "suggested" mechanism for custom DataFrame generation. Although #8852 is expected to improve the performance of `from_delayed`, creating a distinct `Delayed` object for every new partition will still lead to ugliness at scale (and is unnecessary when the same `callable` can be used to generate every partition anyway).

Proposed API/Docstring
Other Alternatives
I also considered expanding the existing `map_partitions` function to handle the special case that an existing DataFrame object is not included in the `args`. However, I suspect that we lose more by complicating that API than we do by adding a new function with a simpler scope.

@ian-r-rose - Any thoughts on this? If others are supportive of this idea (or something similar), I will add tests and revise, etc.