Pass place-holder metadata to map_partitions in ACA code path #8643

rjzamora · 2022-01-31T18:48:59Z

The introduction of map_partitions into ACA in #8468 was very convenient, but does not seem to work for all cases covered by ACA. This PR revises the change in that PR to ~~use the lower-level partitionwise_graph and blockwise APIs~~ always define the meta argument to map_partitions (even though it is not the "correct" metadata) to avoid any "metadata emulation" logic within.

Closes Dictionary outputs (dtype "object") being interpreted as strings #8636
Closes regression in a custom groupby.agg when split_out is not None #8611
Tests added / passed
Passes pre-commit run --all-files

cc @gjoseph92 (In case you have other suggestions)

gjoseph92 · 2022-01-31T20:11:43Z

@rjzamora instead of bypassing map_partitions entirely, why don't we just pass in a (incorrect) meta so one doesn't have to be inferred?

diff --git a/dask/dataframe/core.py b/dask/dataframe/core.py
index 7350252f..f5f58fe2 100644
--- a/dask/dataframe/core.py
+++ b/dask/dataframe/core.py
@@ -5878,6 +5878,7 @@ def apply_concat_apply(
         chunk,
         *args,
         token=chunk_name,
+        meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
         enforce_metadata=False,
         transform_divisions=False,
         align_dataframes=False,
@@ -5893,6 +5894,7 @@ def apply_concat_apply(
             split_out_setup_kwargs,
             ignore_index,
             token="split-%s" % token_key,
+            meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
             enforce_metadata=False,
             transform_divisions=False,
             align_dataframes=False,

rjzamora · 2022-01-31T20:18:41Z

instead of bypassing map_partitions entirely, why don't we just pass in a (incorrect) meta so one doesn't have to be inferred?

I can experiment again later, but I believe I was still running into problems in one place if not the other. In the end, map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion. Unfortunately, this is just not true in many cases where ACA is used.

gjoseph92 · 2022-01-31T20:32:27Z

Hm. With my diff on main, your test on this branch passes. I'd be curious to see what the other problems are.

map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion

I think of it the opposite way: once you turn off meta emulation, enforce_metadata=False, transform_divisions=False, align_dataframes=False, map_partitions is nothing more than syntactic sugar for a basic blockwise. It's equivalent, just nicer to write and read.

rjzamora · 2022-01-31T21:07:47Z

@gjoseph92 - You are right. Your suggestion does work (not sure what variation of this I tried earlier). My latest commit uses this approach (d2442db).

I agree that it is best to use map_partitions if/when possible, but I don't really agree that it is nothing more than syntactic sugar for blockwise. The logic is certainly doing exactly what you say, but I feel that the goal of map_partitions is a bit different than what we want at a "functional level." The fact that we need to lie to the function to get what we need is pretty much my entire point. The purpose of map_partitions is to generate a new DataFrame collection by adding a new Blockwise layer to the HLG. We want the new Blockwise layer, but we don't want a new DataFrame collection (at least not a valid one). Therefore, using map_partitions is convenient, but still technically a hack.

gjoseph92 · 2022-01-31T21:53:06Z

We want the new Blockwise layer, but we don't want a new DataFrame collection (at least not a valid one). Therefore, using map_partitions is convenient, but still technically a hack.

Great point. I do agree that it's a hack—a hack around the difficult-to-use interface for blockwise and HLGs. There's no intermediate representation to be able to say, "just treat this thing as a collection of tasks, and map a function over all of them". You can either say, "produce a new DataFrame from this collection of DataFrames", or "build another task graph that depends on the tasks in this graph". Having something in between would be quite nice.

Stylistically, I do find map_partitions easier to read, but I see your point. Do you think the code makes more sense dropping down the the graph level?

rjzamora · 2022-01-31T22:52:37Z

Stylistically, I do find map_partitions easier to read, but I see your point. Do you think the code makes more sense dropping down the the graph level?

No no - I think I like "readable" code a bit more than I dislike the hackiness of using map_partitions. I guess the only real solution to all my concerns is to establish something "in the middle" (like you said). Not yet sure exactly how to accomplish this, because the convenience of map_partitions comes from its DataFrame relationship, and I am effectively complaining that out result is not a DataFrame collection anymore.

dask/dataframe/core.py

jorisvandenbossche · 2022-02-01T09:01:45Z

Checking out this branch, I can confirm that the tests of dask-geopandas pass with this change.

jsignell · 2022-02-01T14:57:56Z

I have been creeping on this discussion and I tend to agree that map_partitions feels more readable to me. I had always assumed that was just because I am more familiar with it, but if you both find it more readable then I guess there is more to it than that. I am kind of anti adding another mechanism for creating this kind of graph, but I was wondering if just accessing the partitions iterator directly (or blocks) for array, might be a way to use blockwise in a more readable way without having to trick map_partitions

rjzamora · 2022-02-01T15:55:31Z

I have been creeping on this discussion and I tend to agree that map_partitions feels more readable to me. I had always assumed that was just because I am more familiar with it, but if you both find it more readable then I guess there is more to it than that. I am kind of anti adding another mechanism for creating this kind of graph, but I was wondering if just accessing the partitions iterator directly (or blocks) for array, might be a way to use blockwise in a more readable way without having to trick map_partitions

Thanks for the feedback @jsignel! Your perspective is valuable here. I don’t think I can imagine a way that iterating over partitions would capture the same fusion behavior while being more readable, but you may be right. The only other option I can imagine (without needing to make significant changes), is to (1) add an option to DataFrame.map_partitions to output a dask.bag.Bag collection, and (2) to revise Bag.map_partitions to use Blockwise. With those two changes in place, we could just use map_partitions without “lying” about the underlying metadata at all. [EDIT: If we did decide to do something like this - I suggest we leave it for a distinct/dedicated PR]

rjzamora · 2022-02-01T19:19:29Z

For the sake of discussion/exploration, I implemented a rough POC of the Bag-based solution here: #8646

jsignell · 2022-02-01T21:44:57Z

I think we should merge this while we carry on the conversation about the best approach.

ian-r-rose · 2022-02-08T00:53:02Z

I'm coming to this late, and still ruminating over it. But at least right now, I kind of prefer how this was the first way, and in particular agree with @rjzamora that

map_partitions is just not intended for anything other than a partition-wise DataFrame collection to DataFrame collection operartion

To @gjoseph92's point:

I think of it the opposite way: once you turn off meta emulation, enforce_metadata=False, transform_divisions=False, align_dataframes=False, map_partitions is nothing more than syntactic sugar for a basic blockwise. It's equivalent, just nicer to write and read.

This is the type of reasoning that would make me quite nervous about doing any serious refactoring or even maintenance of map_partitions or blockwise. In order to know that that combination of kwargs is syntactic sugar for blockwise, I'd basically need to read and fully understand the implementation. In general, lying to the function to trigger an implementation detail seems destined to break.

rjzamora added 2 commits January 31, 2022 07:13

simple fix

8a91f6f

avoid using map_partitions altogether in ACA

0ef06c7

github-actions bot added the dataframe label Jan 31, 2022

rjzamora mentioned this pull request Jan 31, 2022

Avoid invalid metadata emulation in DataFrame ACA #8641

Closed

3 tasks

kee[ using map_partitions, but lie about meta

d2442db

gjoseph92 approved these changes Jan 31, 2022

View reviewed changes

dask/dataframe/core.py Show resolved Hide resolved

add comments

9608de3

rjzamora changed the title ~~Remove map_partitions from ACA code path~~ Pass place-holder metadata to map_partitions in ACA code path Feb 1, 2022

rjzamora mentioned this pull request Feb 1, 2022

Move Bag.map_partitions to Blockwise #8646

Merged

rjzamora merged commit 1107a91 into dask:main Feb 2, 2022

rjzamora deleted the aca-no-map-partitions branch February 2, 2022 03:21

martinfleis mentioned this pull request Feb 3, 2022

Failing tests with geopandas and dask main geopandas/dask-geopandas#135

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pass place-holder metadata to map_partitions in ACA code path #8643

Pass place-holder metadata to map_partitions in ACA code path #8643

rjzamora commented Jan 31, 2022 •

edited

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

jorisvandenbossche commented Feb 1, 2022

jsignell commented Feb 1, 2022

rjzamora commented Feb 1, 2022 •

edited

rjzamora commented Feb 1, 2022

jsignell commented Feb 1, 2022

ian-r-rose commented Feb 8, 2022

Pass place-holder metadata to map_partitions in ACA code path #8643

Pass place-holder metadata to map_partitions in ACA code path #8643

Conversation

rjzamora commented Jan 31, 2022 • edited

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

gjoseph92 commented Jan 31, 2022

rjzamora commented Jan 31, 2022

jorisvandenbossche commented Feb 1, 2022

jsignell commented Feb 1, 2022

rjzamora commented Feb 1, 2022 • edited

rjzamora commented Feb 1, 2022

jsignell commented Feb 1, 2022

ian-r-rose commented Feb 8, 2022

rjzamora commented Jan 31, 2022 •

edited

rjzamora commented Feb 1, 2022 •

edited