Pass place-holder metadata to map_partitions in ACA code path #8643
Conversation
@rjzamora instead of bypassing …

```diff
diff --git a/dask/dataframe/core.py b/dask/dataframe/core.py
index 7350252f..f5f58fe2 100644
--- a/dask/dataframe/core.py
+++ b/dask/dataframe/core.py
@@ -5878,6 +5878,7 @@ def apply_concat_apply(
         chunk,
         *args,
         token=chunk_name,
+        meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
         enforce_metadata=False,
         transform_divisions=False,
         align_dataframes=False,
@@ -5893,6 +5894,7 @@ def apply_concat_apply(
         split_out_setup_kwargs,
         ignore_index,
         token="split-%s" % token_key,
+        meta=dfs[0],  # NOTE: incorrect; just pass it to prevent inferring meta
         enforce_metadata=False,
         transform_divisions=False,
         align_dataframes=False,
```
I can experiment again later, but I believe I was still running into problems in one place if not the other. In the end, …
Hm. With my diff on main, your test on this branch passes. I'd be curious to see what the other problems are.
I think of it the opposite way: once you turn off meta emulation, …
@gjoseph92 - You are right. Your suggestion does work (not sure what variation of this I tried earlier). My latest commit uses this approach (d2442db). I agree that it is best to use map_partitions if/when possible, but I don't really agree that it is nothing more than syntactic sugar for blockwise. The logic is certainly doing exactly what you say, but I feel that the goal of …
Great point. I do agree that it's a hack—a hack around the difficult-to-use interface for blockwise and HLGs. There's no intermediate representation to be able to say, "just treat this thing as a collection of tasks, and map a function over all of them". You can either say, "produce a new DataFrame from this collection of DataFrames", or "build another task graph that depends on the tasks in this graph". Having something in between would be quite nice. Stylistically, I do find …
No no - I think I like "readable" code a bit more than I dislike the hackiness of using …
Checking out this branch, I can confirm that the dask-geopandas tests pass with this change.
I have been creeping on this discussion and I tend to agree that …
Thanks for the feedback @jsignell! Your perspective is valuable here. I don’t think I can imagine a way that iterating over partitions would capture the same fusion behavior while being more readable, but you may be right. The only other option I can imagine (without needing to make significant changes) is to (1) add an option to …
For the sake of discussion/exploration, I implemented a rough POC of the Bag-based solution here: #8646 |
I think we should merge this while we carry on the conversation about the best approach. |
I'm coming to this late, and still ruminating over it. But at least right now, I kind of prefer how this was done the first way, and in particular agree with @rjzamora that …
To @gjoseph92's point:
This is the type of reasoning that would make me quite nervous about doing any serious refactoring or even maintenance of …
The introduction of `map_partitions` into ACA in #8468 was very convenient, but does not seem to work for all cases covered by ACA. This PR revises the change in that PR to ~~use the lower-level `partitionwise_graph` and `blockwise` APIs~~ always define the `meta` argument to `map_partitions` (even though it is not the "correct" metadata) to avoid any "metadata emulation" logic within.
cc @gjoseph92 (In case you have other suggestions)