Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pass place-holder metadata to map_partitions in ACA code path #8643

Merged
merged 4 commits into from Feb 2, 2022

Conversation

rjzamora
Copy link
Member

@rjzamora rjzamora commented Jan 31, 2022

The introduction of map_partitions into ACA in #8468 was very convenient, but does not seem to work for all cases covered by ACA. This PR revises the change in that PR to use the lower-level partitionwise_graph and blockwise APIs always define the meta argument to map_partitions (even though it is not the "correct" metadata) to avoid any "metadata emulation" logic within.

cc @gjoseph92 (In case you have other suggestions)

@gjoseph92
Copy link
Collaborator

@rjzamora instead of bypassing map_partitions entirely, why don't we just pass in a (incorrect) meta so one doesn't have to be inferred?

diff --git a/dask/dataframe/core.py b/dask/dataframe/core.py
index 7350252f..f5f58fe2 100644
--- a/dask/dataframe/core.py
+++ b/dask/dataframe/core.py
@@ -5878,6 +5878,7 @@ def apply_concat_apply(
         chunk,
         *args,
         token=chunk_name,
+        meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
         enforce_metadata=False,
         transform_divisions=False,
         align_dataframes=False,
@@ -5893,6 +5894,7 @@ def apply_concat_apply(
             split_out_setup_kwargs,
             ignore_index,
             token="split-%s" % token_key,
+            meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
             enforce_metadata=False,
             transform_divisions=False,
             align_dataframes=False,

@rjzamora
Copy link
Member Author

instead of bypassing map_partitions entirely, why don't we just pass in a (incorrect) meta so one doesn't have to be inferred?

I can experiment again later, but I believe I was still running into problems in one place if not the other. In the end, map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion. Unfortunately, this is just not true in many cases where ACA is used.

@gjoseph92
Copy link
Collaborator

Hm. With my diff on main, your test on this branch passes. I'd be curious to see what the other problems are.

map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion

I think of it the opposite way: once you turn off meta emulation, enforce_metadata=False, transform_divisions=False, align_dataframes=False, map_partitions is nothing more than syntactic sugar for a basic blockwise. It's equivalent, just nicer to write and read.

@rjzamora
Copy link
Member Author

@gjoseph92 - You are right. Your suggestion does work (not sure what variation of this I tried earlier). My latest commit uses this approach (d2442db).

I agree that it is best to use map_partitions if/when possible, but I don't really agree that it is nothing more than syntactic sugar for blockwise. The logic is certainly doing exactly what you say, but I feel that the goal of map_partitions is a bit different than what we want at a "functional level." The fact that we need to lie to the function to get what we need is pretty much my entire point. The purpose of map_partitions is to generate a new DataFrame collection by adding a new Blockwise layer to the HLG. We want the new Blockwise layer, but we don't want a new DataFrame collection (at least not a valid one). Therefore, using map_partitions is convenient, but still technically a hack.

@gjoseph92
Copy link
Collaborator

We want the new Blockwise layer, but we don't want a new DataFrame collection (at least not a valid one). Therefore, using map_partitions is convenient, but still technically a hack.

Great point. I do agree that it's a hack—a hack around the difficult-to-use interface for blockwise and HLGs. There's no intermediate representation to be able to say, "just treat this thing as a collection of tasks, and map a function over all of them". You can either say, "produce a new DataFrame from this collection of DataFrames", or "build another task graph that depends on the tasks in this graph". Having something in between would be quite nice.

Stylistically, I do find map_partitions easier to read, but I see your point. Do you think the code makes more sense dropping down the the graph level?

@rjzamora
Copy link
Member Author

Stylistically, I do find map_partitions easier to read, but I see your point. Do you think the code makes more sense dropping down the the graph level?

No no - I think I like "readable" code a bit more than I dislike the hackiness of using map_partitions. I guess the only real solution to all my concerns is to establish something "in the middle" (like you said). Not yet sure exactly how to accomplish this, because the convenience of map_partitions comes from its DataFrame relationship, and I am effectively complaining that out result is not a DataFrame collection anymore.

dask/dataframe/core.py Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

Checking out this branch, I can confirm that the tests of dask-geopandas pass with this change.

@jsignell
Copy link
Member

jsignell commented Feb 1, 2022

I have been creeping on this discussion and I tend to agree that map_partitions feels more readable to me. I had always assumed that was just because I am more familiar with it, but if you both find it more readable then I guess there is more to it than that. I am kind of anti adding another mechanism for creating this kind of graph, but I was wondering if just accessing the partitions iterator directly (or blocks) for array, might be a way to use blockwise in a more readable way without having to trick map_partitions

@rjzamora
Copy link
Member Author

rjzamora commented Feb 1, 2022

I have been creeping on this discussion and I tend to agree that map_partitions feels more readable to me. I had always assumed that was just because I am more familiar with it, but if you both find it more readable then I guess there is more to it than that. I am kind of anti adding another mechanism for creating this kind of graph, but I was wondering if just accessing the partitions iterator directly (or blocks) for array, might be a way to use blockwise in a more readable way without having to trick map_partitions

Thanks for the feedback @jsignel! Your perspective is valuable here. I don’t think I can imagine a way that iterating over partitions would capture the same fusion behavior while being more readable, but you may be right. The only other option I can imagine (without needing to make significant changes), is to (1) add an option to DataFrame.map_partitions to output a dask.bag.Bag collection, and (2) to revise Bag.map_partitions to use Blockwise. With those two changes in place, we could just use map_partitions without “lying” about the underlying metadata at all. [EDIT: If we did decide to do something like this - I suggest we leave it for a distinct/dedicated PR]

@rjzamora rjzamora changed the title Remove map_partitions from ACA code path Pass place-holder metadata to map_partitions in ACA code path Feb 1, 2022
@rjzamora
Copy link
Member Author

rjzamora commented Feb 1, 2022

For the sake of discussion/exploration, I implemented a rough POC of the Bag-based solution here: #8646

@jsignell
Copy link
Member

jsignell commented Feb 1, 2022

I think we should merge this while we carry on the conversation about the best approach.

@rjzamora rjzamora merged commit 1107a91 into dask:main Feb 2, 2022
@rjzamora rjzamora deleted the aca-no-map-partitions branch February 2, 2022 03:21
@ian-r-rose
Copy link
Collaborator

I'm coming to this late, and still ruminating over it. But at least right now, I kind of prefer how this was the first way, and in particular agree with @rjzamora that

map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion

To @gjoseph92's point:

I think of it the opposite way: once you turn off meta emulation, enforce_metadata=False, transform_divisions=False, align_dataframes=False, map_partitions is nothing more than syntactic sugar for a basic blockwise. It's equivalent, just nicer to write and read.

This is the type of reasoning that would make me quite nervous about doing any serious refactoring or even maintenance of map_partitions or blockwise. In order to know that that combination of kwargs is syntactic sugar for blockwise, I'd basically need to read and fully understand the implementation. In general, lying to the function to trigger an implementation detail seems destined to break.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants