Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify stable sort in indexing #14907

Merged
merged 1 commit into from Jun 14, 2023
Merged

Conversation

taldcroft
Copy link
Member

Description

This pull request is an attempt to fix #14882, which might be due to numpy/numpy#22315. This only happens on AVX512-based processors with numpy 1.25+, but it would appear that these apply to GitHub CI actions.

One concern with this PR is that changing from the default quicksort to stable makes creating a table index slower.

Fixes #14882

@github-actions
Copy link

github-actions bot commented Jun 3, 2023

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions
Copy link

github-actions bot commented Jun 3, 2023

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@taldcroft
Copy link
Member Author

@pllim - this seems to have done the trick, tests are passing now.

@mhvk - I don't know if table guarantees a stable sort for grouping / indexing, but maybe it's a good idea?

@pllim pllim added this to the v5.0.7 milestone Jun 3, 2023
@pllim pllim added Bug numpy-dev 💤 backport-v5.0.x on-merge: backport to v5.0.x 💤 backport-v5.3.x on-merge: backport to v5.3.x labels Jun 3, 2023
@pllim
Copy link
Member

pllim commented Jun 3, 2023

I think this is technically a bug and therefore needs a change log?

@pllim
Copy link
Member

pllim commented Jun 3, 2023

I don't really have to backport this to 5.0.x because we pinned numpy<1.25 there for other reasons. But if this bug is only exposed by new numpy and not strictly caused by it, then we can backport. Let me know. Thanks!

@mhvk
Copy link
Contributor

mhvk commented Jun 3, 2023

@taldcroft - agreed that a stable sort is preferable, especially as we have had it and likely people rely on it.

@taldcroft
Copy link
Member Author

taldcroft commented Jun 4, 2023

I vote to not backport since it might change results and possibly break code for a bug release. I'll make this into a complete PR now, the draft was just to see if it works.

@mhvk
Copy link
Contributor

mhvk commented Jun 4, 2023

I'd backport to 5.3 only - for anyone upgrading to numpy 1.25 it is more like a numpy-dev bug. (in principle, one could do a NUMPY_LT_1_25 but that seems overkill).

@pllim pllim removed the 💤 backport-v5.0.x on-merge: backport to v5.0.x label Jun 4, 2023
@pllim pllim modified the milestones: v5.0.7, v5.3.1 Jun 4, 2023
@pllim
Copy link
Member

pllim commented Jun 4, 2023

So is this really a numpy-dev bug, or a change in numpy behavior exposed a bug in our package? If this bug is upstream, we should report back. Thanks!

@taldcroft
Copy link
Member Author

This is not a bug in numpy-dev. The default quicksort algorithm is not stable, so it is just an accident that our current tests are passing. I.e. we have a small table and just by luck the sort ends up being stable.

# Astropy 5.2
In [27]: a = np.random.randint(0, 100, 1000)
In [28]: b = np.arange(len(a))
In [29]: t = Table([a, b], names=["a", "b"])
In [30]: t.add_index("a")
In [31]: tg = t.group_by("a")
In [32]: tg
Out[32]: 
<Table length=1000>
  a     b  
int64 int64
----- -----
    0    25
    0   627
    0   155  <== Not stable
    0   808
    0   944
    0   642
  ...   ...
   98   202
   98   452
   99   335
   99   973
   99   767
   99   154
   99   340

With this patch I can confirm that the sort is stable and that within each group the b column is sorted.

@taldcroft
Copy link
Member Author

It is also worth noting that our docs do not say anything about row stability with grouping, and there have not been any issues opened. But I'm fine with making that promise and putting it in the docs and back-porting. Also I checked pandas and their groupby is stable.

@taldcroft taldcroft changed the title WIP: Specify stable sort in indexing Specify stable sort in indexing Jun 5, 2023
@taldcroft taldcroft marked this pull request as ready for review June 5, 2023 21:53
@mhvk
Copy link
Contributor

mhvk commented Jun 6, 2023

OK, if it was never stable in the first place, then backporting is more optional; still feels worth doing for 5.2, but not really for 5.0.

axis : int, optional
Axis along which to sort. Default is -1, which means sort along the last
axis.
kind : 'stable', optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "stable" the only acceptable option now? If not, please list other options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the next line, this argument is actually ignored, but since lexsort is used the kind will always be equivalent to "stable". So I don't know what to do, really.

The driver was that the call in the indexing code to col.argsort(kind="stable") is being done for Time objects. I suppose that code could specifically look for Time objects and change the calling args, but that seemed more ugly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. Why even add that keyword if it is ignored? Can you just chuck in a **kwargs if the API must match some global API requirements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can do **kwargs, makes sense. But maybe let @mhvk weigh in before doing that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first, I wondered whether we should just let the defaults for Column.argsort and Table.argsort become stable, but looking further, for Time we made a bit of an effort to make sure the various ndarray-like methods have signatures similar to those of ndarray. E.g., we include out=None even though we don't allow passing in an output Time. So, I think on balance I'd suggest just sticking with what you have here.

@pllim
Copy link
Member

pllim commented Jun 6, 2023

BTW looks like numpy 1.25 is impending so we should wrap this up sooner than later.

@taldcroft taldcroft requested a review from mhvk June 7, 2023 10:23
@mhvk
Copy link
Contributor

mhvk commented Jun 7, 2023

Seeing the discussion about argsort, I wonder a little about the approach more generally. Would it be better to rely on .info.get_sortable_arrays() and sort using those? Then one is no longer dependent on whether a given mixin column provides argsort.

@taldcroft
Copy link
Member Author

Would it be better to rely on .info.get_sortable_arrays() and sort using those?

This sounds like a good idea. Then we could index on any column that implements get_sortable_arrays. That makes me think of SkyCoord, which doesn't have a unique way to sort but could still be sorted uniquely (if that makes sense).

I was a little concerned about performance of np.argsort vs np.lexsort, but it looks to be the same. I'll have a look at doing that.

@pllim
Copy link
Member

pllim commented Jun 13, 2023

@taldcroft / @mhvk , any chance we can wrap this up soon? numpy wants to release as soon as this weekend, so we might only have a few more days before this problem breaks all the CI jobs. 😬 🙏

@taldcroft
Copy link
Member Author

@pllim - thanks for reminding me. I'm busy today but should be able to get to this tomorrow morning.

@taldcroft
Copy link
Member Author

@mhvk - As I dug into this a bit it reminded me of doing construction on an old house, digging through layers of time and seeing problems with each new turn... Basically there are at least 3 different implementations of sorting columns and none of them seem quite right.

  • Table.argsort() makes a numpy structured array using .as_array() and calls argsort. So for a Time column this will make an object column with each row as a new Time object. Bad. And Table.argsort exposes a kind kwarg which will cause an exception when sorting the table on a Time column.

    astropy/astropy/table/table.py

    Lines 3444 to 3445 in aa2c05d

    data = self.as_array(names=keys)
    else:
  • table/groups.py uses represent_mixins_as_columns to flatten the sorting table. This is one way to do it but for Time this gives columns with jd1 and jd2 instead of using the info.get_sortable_arrays() machinery.
    if not table_index and isinstance(table_keys, Table):
  • table/index.py does something different as well, making a list of columns to sort lexically, including a special-case for Time that repeates the get_sortable_arrays code.
    new_columns = []

It seems like the ideal way to argsort a list of table columns is to call info.get_sortable_arrays on each one and assemble all the results into a list of arrays which are then passed to np.lexsort. But one problem is that this (once again) would render moot the kind argument of Table.argsort (and Table.sort).

All of this is a good idea, but the scope is much bigger and will take a little time since it gets into core routines and there are potential performance concerns.

So what about just applying this current PR as a band-aid so we can get CI passing again? Then a bigger follow-up PR to clean up / unify table sorting.

@mhvk
Copy link
Contributor

mhvk commented Jun 14, 2023

OK, I think that is a good plan - let's merge this PR and deal with sorting more generally for 6.0!

@mhvk mhvk merged commit 196e3c2 into astropy:main Jun 14, 2023
23 of 25 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Jun 14, 2023
@taldcroft
Copy link
Member Author

Thanks @mhvk. I'll open a new issue with basically the above comment and milestone for 6.0.

@pllim
Copy link
Member

pllim commented Jun 14, 2023

Thanks, all!

pllim added a commit that referenced this pull request Jun 14, 2023
…907-on-v5.3.x

Backport PR #14907 on branch v5.3.x (Specify stable sort in indexing)
@taldcroft taldcroft deleted the table-stable-indexing branch June 20, 2023 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TST: test_table_group_by[True] and test_group_by_masked[True] failed with numpy 1.25rc1
3 participants