Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add align_dataframes to map_partitions to allow passing a dataframe as an arg directly #6628

Merged
merged 7 commits into from Nov 5, 2021

Conversation

jsignell
Copy link
Member

@jsignell jsignell commented Sep 11, 2020

When dataframes are passed to map_partitions, by default they are aligned according to the partitions of the first dataframe. This can be surprising when the user intended to broadcast them instead. This PR adds align_dataframes as a map_partitions kwarg (just like align_arrays in map_blocks) so you can work around this when necessary (without going the hacky route of using kwargs when you intend to broadcast).

Not sure if this is a good idea or not since map_paritions accepts arbitrary user kwargs.

@jsignell
Copy link
Member Author

@dask/maintenance this is ready for review.

Base automatically changed from master to main March 8, 2021 20:19
@jsignell jsignell changed the title Add enforce_alignment to map_partitions to allow passing a dataframe as an arg directly Add enforce_alignment to map_partitions to allow passing a dataframe as an arg directly Nov 3, 2021
Copy link
Collaborator

@gjoseph92 gjoseph92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM implementation-wise, just have some ideas for making the docs more descriptive. Thanks @jsignell, I'd love to see this!

applying the function.
pandas) will be repartitioned to align (if necessary) before
applying the function (see ``enforce_alignment`` to control).
enforce_metadata : bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're at it, could you document transform_divisions as well?

dask/dataframe/core.py Outdated Show resolved Hide resolved
Whether or not to enforce the structure of the metadata at runtime.
This will rename and reorder columns for each partition,
and will raise an error if this doesn't work or types don't match.
enforce_alignment : bool
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I'd slightly prefer something like align_dataframes or align_inputs instead, since it's less "enforcement" in the way that enforce_metadata is runtime enforcement. Plus, that matches with align_arrays in Array.map_blocks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that makes total sense as a rename. I was never very happy with this name.

dask/dataframe/core.py Outdated Show resolved Hide resolved
@jsignell
Copy link
Member Author

jsignell commented Nov 4, 2021

@gjoseph92 is it enough for you to have this switch or are you hoping to change the default behavior?

@gjoseph92
Copy link
Collaborator

I think we should have this switch either way. It's also handy when you have unknown divisions, but happen to know when it's okay to skip alignment anyway.

I still think broadcasting single-partition DataFrames would be more intuitive, but surely that will break other things, so with the docs this is maybe sufficient for now.

@jsignell
Copy link
Member Author

jsignell commented Nov 4, 2021

ok great! I'll fix this up and get it in.

@jsignell jsignell changed the title Add enforce_alignment to map_partitions to allow passing a dataframe as an arg directly Add align_dataframes to map_partitions to allow passing a dataframe as an arg directly Nov 5, 2021
@jsignell jsignell merged commit 5b38dda into dask:main Nov 5, 2021
@jsignell jsignell deleted the enforce-alignment branch November 5, 2021 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Behaviour of map_partitions with a Pandas dataframe as argument - In this case uncorrect merge
2 participants