Specify stable sort in indexing #14907
I think this is technically a bug and therefore needs a change log?
I don't really have to backport this to 5.0.x because we pinned numpy<1.25 there for other reasons. But if this bug is only exposed by new numpy and not strictly caused by it, then we can backport. Let me know. Thanks!
@taldcroft - agreed that a stable sort is preferable, especially as we have had it and likely people rely on it.
I'd backport to 5.3 only - for anyone upgrading to numpy 1.25 it is more like a numpy-dev bug. (in principle, one could do a |
So is this really a numpy-dev bug, or a change in numpy behavior exposed a bug in our package? If this bug is upstream, we should report back. Thanks!
This is not a bug in numpy-dev. The default quicksort algorithm is not stable, so it is just an accident that our current tests are passing. I.e. we have a small table and just by luck the sort ends up being stable.
With this patch I can confirm that the sort is stable and that within each group the |
It is also worth noting that our docs do not say anything about row stability with grouping, and there have not been any issues opened. But I'm fine with making that promise and putting it in the docs and back-porting. Also I checked pandas and their |
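To illustrate the stability point above, here is a minimal sketch (not taken from the astropy code) of how `kind="stable"` in `numpy.argsort` keeps rows with equal keys in their original order. The helper name `stable_group_order` and the sample keys are made up for illustration. The default `quicksort` kind makes no such guarantee, even when it happens to give the same answer on small inputs.

```python
import numpy as np


def stable_group_order(keys):
    """Return indices that sort ``keys`` while keeping ties in input order.

    Hypothetical helper for illustration only.
    """
    keys = np.asarray(keys)
    return np.argsort(keys, kind="stable")


# Grouping keys with repeated values, as in a grouped table.
keys = np.array([2, 1, 2, 0, 1, 2, 0, 1])
order = stable_group_order(keys)
print(order)  # [3 6 1 4 7 0 2 5]

# Within each group the original row indices stay in increasing order.
for value in np.unique(keys):
    rows = order[keys[order] == value]
    assert np.all(np.diff(rows) > 0)
```

With a stable sort the within-group row order is a guarantee rather than a coincidence of the input size.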
OK, if it was never stable in the first place, then backporting is more optional; still feels worth doing for 5.2, but not really for 5.0.
axis : int, optional
    Axis along which to sort. Default is -1, which means sort along the last
    axis.
kind : 'stable', optional
Is "stable" the only acceptable option now? If not, please list other options.
Reading the next line, this argument is actually ignored, but since `lexsort` is used the `kind` will always be equivalent to `"stable"`. So I don't know what to do, really.
The driver was that the call in the indexing code to `col.argsort(kind="stable")` is being done for `Time` objects. I suppose that code could specifically look for `Time` objects and change the calling args, but that seemed more ugly.
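A rough sketch of the pattern discussed in this comment: an `argsort` that accepts `kind` only for signature compatibility and always sorts with `numpy.lexsort`, which is a stable indirect sort. The class `TwoPartValue` and its attributes `hi` and `lo` are hypothetical stand-ins and are not astropy's `Time` implementation.

```python
import numpy as np


class TwoPartValue:
    """Hypothetical value stored as two float arrays (illustration only)."""

    def __init__(self, hi, lo):
        self.hi = np.asarray(hi, dtype=float)
        self.lo = np.asarray(lo, dtype=float)

    def argsort(self, axis=-1, kind="stable"):
        # ``kind`` is accepted so callers can pass kind="stable" uniformly,
        # but it is ignored: np.lexsort is always stable.
        # np.lexsort treats the LAST key as the primary one, so this sorts
        # by ``hi`` first and breaks ties with ``lo``.
        return np.lexsort((self.lo, self.hi), axis=axis)


vals = TwoPartValue(hi=[2.0, 1.0, 2.0, 1.0], lo=[0.5, 0.25, 0.5, 0.0])
idx = vals.argsort(kind="stable")
print(idx)  # [3 1 0 2]
```

Elements 0 and 2 compare equal on both keys, and they come out in their original order, so generic calling code can pass `kind="stable"` without special-casing the type.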
I am confused. Why even add that keyword if it is ignored? Can you just chuck in a `**kwargs` if the API must match some global API requirements?
OK, I can do `**kwargs`, makes sense. But maybe let @mhvk weigh in before doing that.
At first, I wondered whether we should just let the defaults for `Column.argsort` and `Table.argsort` become stable, but looking further, for `Time` we made a bit of an effort to make sure the various `ndarray`-like methods have signatures similar to those of `ndarray`. E.g., we include `out=None` even though we don't allow passing in an output `Time`. So, I think on balance I'd suggest just sticking with what you have here.
BTW looks like numpy 1.25 is impending so we should wrap this up sooner than later.
Seeing the discussion about |
This sounds like a good idea. Then we could index on any column that implements I was a little concerned about performance of |
@taldcroft / @mhvk, any chance we can wrap this up soon?
@pllim - thanks for reminding me. I'm busy today but should be able to get to this tomorrow morning.
@mhvk - As I dug into this a bit it reminded me of doing construction on an old house, digging through layers of time and seeing problems with each new turn... Basically there are at least 3 different implementations of sorting columns and none of them seem quite right.
It seems like the ideal way to argsort a list of table columns is to call All of this is a good idea, but the scope is much bigger and will take a little time since it gets into core routines and there are potential performance concerns. So what about just applying this current PR as a band-aid so we can get CI passing again? Then a bigger follow-up PR to clean up / unify table sorting. |
OK, I think that is a good plan - let's merge this PR and deal with sorting more generally for 6.0!
Thanks @mhvk. I'll open a new issue with basically the above comment and milestone for 6.0.
Thanks, all!
…907-on-v5.3.x Backport PR #14907 on branch v5.3.x (Specify stable sort in indexing)
Description
This pull request is an attempt to fix #14882, which might be due to numpy/numpy#22315. The failure only happens on AVX512-based processors with numpy 1.25+, but it would appear that these conditions apply to GitHub CI actions.
One concern with this PR is that changing from the default `quicksort` to `stable` makes creating a table index slower.

Fixes #14882
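The performance concern in the description can be checked with a rough timing sketch like the one below. It only compares `numpy.argsort` with `kind="quicksort"` and `kind="stable"` on a synthetic integer key column. The array size, the helper name `time_argsort`, and the use of `timeit` are assumptions for illustration; this does not measure astropy's actual index-creation path, and results will vary with dtype, data, and hardware.

```python
import timeit

import numpy as np

# Synthetic key column with many repeated values (assumed workload).
rng = np.random.default_rng(0)
keys = rng.integers(0, 1000, size=200_000)


def time_argsort(kind, repeat=5):
    """Best-of-``repeat`` wall time in seconds for one argsort call."""
    timings = timeit.repeat(
        lambda: np.argsort(keys, kind=kind), number=1, repeat=repeat
    )
    return min(timings)


t_quick = time_argsort("quicksort")
t_stable = time_argsort("stable")
print(f"quicksort: {t_quick:.4f} s, stable: {t_stable:.4f} s")
```

No expected numbers are given here on purpose; the point is only to measure the difference on the data at hand before drawing conclusions.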