Add align_dataframes to map_partitions to allow passing a dataframe as an arg directly #6628
@dask/maintenance this is ready for review.
LGTM implementation-wise, just have some ideas for making the docs more descriptive. Thanks @jsignell, I'd love to see this!
dask/dataframe/core.py
Outdated
applying the function.
pandas) will be repartitioned to align (if necessary) before
applying the function (see ``enforce_alignment`` to control).
enforce_metadata : bool
While you're at it, could you document transform_divisions as well?
dask/dataframe/core.py
Outdated
Whether or not to enforce the structure of the metadata at runtime.
This will rename and reorder columns for each partition,
and will raise an error if this doesn't work or types don't match.
enforce_alignment : bool
nit: I'd slightly prefer something like align_dataframes or align_inputs instead, since it's less "enforcement" in the way that enforce_metadata is runtime enforcement. Plus, that matches with align_arrays in Array.map_blocks.
yeah that makes total sense as a rename. I was never very happy with this name.
@gjoseph92 is it enough for you to have this switch or are you hoping to change the default behavior?
I think we should have this switch either way. It's also handy when you have unknown divisions, but happen to know when it's okay to skip alignment anyway. I still think broadcasting single-partition DataFrames would be more intuitive, but surely that will break other things, so with the docs this is maybe sufficient for now.
ok great! I'll fix this up and get it in.
Co-authored-by: Gabe Joseph <gjoseph92@gmail.com>
…into enforce-alignment
Title changed from "Add enforce_alignment to map_partitions to allow passing a dataframe as an arg directly" to "Add align_dataframes to map_partitions to allow passing a dataframe as an arg directly"
black dask / flake8 dask
When dataframes are passed to map_partitions, by default they are aligned according to the partitions of the first dataframe. This can be surprising when the user intended to broadcast them instead. This PR adds align_dataframes as a map_partitions kwarg (just like align_arrays in map_blocks) so you can work around this when necessary (without going the hacky route of using kwargs when you intend to broadcast).

Not sure if this is a good idea or not since map_partitions accepts arbitrary user kwargs.
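
For readers skimming the thread, here is a toy, pure-Python sketch of the difference between aligning and broadcasting a second input. It is not dask's implementation: partitions are modeled as plain lists of (index, value) rows, and split_by_divisions and toy_map_partitions are hypothetical helpers invented for this illustration. Only the align_dataframes keyword name comes from the PR.

```python
from bisect import bisect_right


def split_by_divisions(rows, divisions):
    """Split (index, value) rows into len(divisions) - 1 partitions (hypothetical helper)."""
    nparts = len(divisions) - 1
    parts = [[] for _ in range(nparts)]
    for idx, value in rows:
        # The last division is treated as an inclusive upper bound, hence the min().
        pos = min(bisect_right(divisions, idx) - 1, nparts - 1)
        parts[pos].append((idx, value))
    return parts


def toy_map_partitions(func, big_parts, other_rows, divisions, align_dataframes=True):
    """Toy model only: apply func to each partition paired with the second input."""
    if align_dataframes:
        # Aligned: the second input is split along the same divisions,
        # so each call sees only the rows that fall in its index range.
        other_parts = split_by_divisions(other_rows, divisions)
    else:
        # Broadcast: every call sees the whole second input.
        other_parts = [other_rows] * len(big_parts)
    return [func(part, other) for part, other in zip(big_parts, other_parts)]


divisions = [0, 3, 5]
big_parts = split_by_divisions([(i, i * 10) for i in range(6)], divisions)
lookup = [(0, "a"), (4, "b")]  # small input the user wants every partition to see


def sizes(part, other):
    # Report how many rows of each input a single call receives.
    return (len(part), len(other))


aligned = toy_map_partitions(sizes, big_parts, lookup, divisions)
broadcast = toy_map_partitions(sizes, big_parts, lookup, divisions, align_dataframes=False)
print(aligned)    # [(3, 1), (3, 1)]
print(broadcast)  # [(3, 2), (3, 2)]
```

In the aligned case each call receives only one row of the small input, which is the surprising behavior described above. With the new kwarg, the analogous dask call would pass the second dataframe as a positional arg together with align_dataframes=False.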