Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify stable sort in indexing #14907

Merged
merged 1 commit into from
Jun 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 2 additions & 2 deletions astropy/table/index.py
Original file line number Diff line number Diff line change
Expand Up @@ -94,7 +94,7 @@
raise ValueError("Cannot create index without at least one column")
elif len(columns) == 1:
col = columns[0]
row_index = Column(col.argsort())
row_index = Column(col.argsort(kind="stable"))
data = Table([col[row_index]])
else:
num_rows = len(columns[0])
Expand All @@ -117,7 +117,7 @@
try:
lines = table[np.lexsort(sort_columns)]
except TypeError: # arbitrary mixins might not work with lexsort
lines = table[table.argsort()]
lines = table[table.argsort(kind="stable")]

Check warning on line 120 in astropy/table/index.py

View check run for this annotation

Codecov / codecov/patch

astropy/table/index.py#L120

Added line #L120 was not covered by tests
data = lines[lines.colnames[:-1]]
row_index = lines[lines.colnames[-1]]

Expand Down
20 changes: 20 additions & 0 deletions astropy/table/tests/test_groups.py
Original file line number Diff line number Diff line change
Expand Up @@ -690,3 +690,23 @@ def test_group_mixins_unsupported(col):
tg = t.group_by("a")
with pytest.warns(AstropyUserWarning, match="Cannot aggregate column 'mix'"):
tg.groups.aggregate(np.sum)


@pytest.mark.parametrize("add_index", [False, True])
def test_group_stable_sort(add_index):
"""Test that group_by preserves the order of the table.

This table has 5 groups with an average of 200 rows per group, so it is not
statistically possible that the groups will be in order by chance.

This tests explicitly the case where grouping is done via the index sort.
See: https://github.com/astropy/astropy/issues/14882
"""
a = np.random.randint(0, 5, 1000)
b = np.arange(len(a))
t = Table([a, b], names=["a", "b"])
if add_index:
t.add_index("a")
tg = t.group_by("a")
for grp in tg.groups:
assert np.all(grp["b"] == np.sort(grp["b"]))
25 changes: 20 additions & 5 deletions astropy/time/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -1441,13 +1441,28 @@ def argmax(self, axis=None, out=None):

return dt.argmax(axis, out)

def argsort(self, axis=-1):
def argsort(self, axis=-1, kind="stable"):
"""Returns the indices that would sort the time array.

This is similar to :meth:`~numpy.ndarray.argsort`, but adapted to ensure
that the full precision given by the two doubles ``jd1`` and ``jd2``
is used, and that corresponding attributes are copied. Internally,
it uses :func:`~numpy.lexsort`, and hence no sort method can be chosen.
This is similar to :meth:`~numpy.ndarray.argsort`, but adapted to ensure that
the full precision given by the two doubles ``jd1`` and ``jd2`` is used, and
that corresponding attributes are copied. Internally, it uses
:func:`~numpy.lexsort`, and hence no sort method can be chosen.

Parameters
----------
axis : int, optional
Axis along which to sort. Default is -1, which means sort along the last
axis.
kind : 'stable', optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is "stable" the only acceptable option now? If not, please list other options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the next line, this argument is actually ignored, but since lexsort is used the kind will always be equivalent to "stable". So I don't know what to do, really.

The driver was that the call in the indexing code to col.argsort(kind="stable") is being done for Time objects. I suppose that code could specifically look for Time objects and change the calling args, but that seemed more ugly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused. Why even add that keyword if it is ignored? Can you just chuck in a **kwargs if the API must match some global API requirements?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can do **kwargs, makes sense. But maybe let @mhvk weigh in before doing that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first, I wondered whether we should just let the defaults for Column.argsort and Table.argsort become stable, but looking further, for Time we made a bit of an effort to make sure the various ndarray-like methods have signatures similar to those of ndarray. E.g., we include out=None even though we don't allow passing in an output Time. So, I think on balance I'd suggest just sticking with what you have here.

Sorting is done with :func:`~numpy.lexsort` so this argument is ignored, but
kept for compatibility with :func:`~numpy.argsort`. The sorting is stable,
meaning that the order of equal elements is preserved.

Returns
-------
indices : ndarray
An array of indices that sort the time array.
"""
# For procedure, see comment on argmin.
jd1, jd2 = self.jd1, self.jd2
Expand Down
3 changes: 3 additions & 0 deletions docs/changes/table/14907.bugfix.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Fix a bug where table indexes were not using a stable sort order. This was causing the
order of rows within groups to not match the original table order when an indexed table
was grouped.
7 changes: 7 additions & 0 deletions docs/table/operations.rst
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,13 @@ values and the indices of the group boundaries for those key values. The groups
here correspond to the row slices ``0:4``, ``4:7``, and ``7:10`` in the
``obs_by_name`` table.

The output grouped table has two important properties:

- The groups in the order of the lexically sorted key values (``M101``, ``M31``,
``M82`` in our example).
- The rows within each group are in the same order as they appear in the
original table.

The initial argument (``keys``) for the :func:`~astropy.table.Table.group_by`
function can take a number of input data types:

Expand Down