[WIP] MultiIndex #8153
def _collapse(partition):
    return pd.Series(
        list(partition.itertuples(index=False, name=None)),
        index=partition.index,
        name=tuple(partition.columns),
    )
Oooo - It may make sense to precede this PR with a simpler PR to support multi-column sort_values using this trick.
@charlesbluca - Note that this approach is not as performant as direct DataFrame.quantiles/DataFrame.searchsorted support in pandas, but it should "unblock" multi-column sorting :)
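For readers following along, here is a minimal pandas-only sketch of how the tuple-collapsing trick could drive a multi-column sort. It is not the code from this PR or from dask's sort_values: the example frame, the single hand-picked division, and the use of the standard-library bisect module in place of quantiles/searchsorted are all assumptions made for illustration.

```python
import bisect

import pandas as pd


def _collapse(partition):
    # Same helper as in the diff: one tuple per row, keyed by the original index.
    return pd.Series(
        list(partition.itertuples(index=False, name=None)),
        index=partition.index,
        name=tuple(partition.columns),
    )


# Hypothetical data; "a" alone has too many ties to split the rows evenly.
df = pd.DataFrame({"a": [2, 1, 2, 1, 3, 1], "b": [1, 2, 0, 1, 0, 0]})
keys = _collapse(df[["a", "b"]])

# Assumption: use the median tuple as the only division (two output partitions).
# Python tuples compare lexicographically, which matches sorting by ["a", "b"].
sorted_keys = sorted(keys)
divisions = [sorted_keys[len(sorted_keys) // 2]]

# Assign each row to an output partition by bisecting on the tuple key.
partition_ids = pd.Series(
    [bisect.bisect_right(divisions, key) for key in keys],
    index=df.index,
)

# Sort inside each partition, then concatenate the partitions in order.
parts = [
    df.loc[partition_ids == i].sort_values(["a", "b"])
    for i in range(len(divisions) + 1)
]
result = pd.concat(parts)
```

Because every row is turned into a Python tuple and compared in the interpreter, this is expected to be much slower than a vectorized single-column partitioning, which is consistent with the performance caveat above.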
Yeah this looks nice - thanks for the heads up! I can start up a WIP using this in sort_values
Cool yeah! I expect this to take a while to work out. There are still some open questions about how things should behave. So anything that can come out of here and be useful is great!
@charlesbluca - I started exploring this a bit in this branch (couldn't help myself). It is quite slow compared to 0th column partitioning, but does seem to work for cases where multiple columns are required for sufficient repartitioning.
black dask / flake8 dask / isort dask
This first commit is pulled from @TomAugspurger's original branch: TomAugspurger@0e741e1
My plan is to try to keep moving forward with that work and raise NotImplementedError all over the place.