Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to preserve input order on table joins #11619

Closed
maxnoe opened this issue Apr 22, 2021 · 7 comments · Fixed by #16361
Closed

Add option to preserve input order on table joins #11619

maxnoe opened this issue Apr 22, 2021 · 7 comments · Fixed by #16361

Comments

@maxnoe
Copy link
Member

maxnoe commented Apr 22, 2021

Description

joins sort the resulting table by the join key. This is not documented in the actual API documentation of join here:
https://docs.astropy.org/en/stable/api/astropy.table.join.html

and just mentioned in passing here:
https://docs.astropy.org/en/stable/table/operations.html#join

This is undesired in many circumstances. There should be an option to preserve the input order, at least for left/right joins.

Additional context

Example:

from astropy.table import Table, join
import numpy as np
import astropy.units as u

rng = np.random.default_rng(0)

N_EVENTS = 10
N_TELESCOPES = 3
events = Table({
    'event_id': np.arange(N_EVENTS),
    'tel_id': rng.integers(0, 3, N_EVENTS),
})
telescopes = Table({
    'tel_id': np.arange(N_TELESCOPES),
    'x': rng.uniform(-100, 100, N_TELESCOPES) * u.m,
    'y': rng.uniform(-100, 100, N_TELESCOPES) * u.m,
})


events = join(events, telescopes, keys='tel_id', join_type='left')

print(events)

results in:

event_id tel_id         x                  y        
                        m                  m        
-------- ------ ------------------ -----------------
       3      0  82.55111545554433 8.724998293084568
       4      0  82.55111545554433 8.724998293084568
       5      0  82.55111545554433 8.724998293084568
       6      0  82.55111545554433 8.724998293084568
       7      0  82.55111545554433 8.724998293084568
       8      0  82.55111545554433 8.724998293084568
       1      1 21.327155153435967 87.01448475755365
       2      1 21.327155153435967 87.01448475755365
       0      2 45.899312196799684 63.17071082430644
       9      2 45.899312196799684 63.17071082430644
@taldcroft
Copy link
Member

Fair enough. As a workaround you can do something like this:

events['__index__'] = np.arange(len(events))
events_tel = join(events, telescopes, keys='tel_id', join_type='left')
events_tel.sort('__index__')
del events['__index__']

If this were implemented, my guess for the most natural ordering is:

  • inner : left side (you can always swap left/right on input if desired)
  • left : left side
  • right: right side
  • outer : no unambiguous ordering is possible so do nothing

Does sound reasonable? Maybe a new keep_order parameter?

If you'd like to take a crack at this, have a look at

def setdiff(table1, table2, keys=None):
for an example of putting in a temporary index to capture the sort order.

@maxnoe
Copy link
Member Author

maxnoe commented Apr 22, 2021

@taldcroft https://github.com/fact-project/aict-tools/pull/156/files ;)

@astrobatty
Copy link

What is status of this issue? I just faced this problem and I would be really happy to see it implemented.

In the meantime, I urge you to update the API documentation and specifically mention that the original order won't be preserved and the result will be sorted.

@neutrinoceros
Copy link
Contributor

@astrobatty Thanks for reviving this issue. I've started working on the implementation of the keep_order argument suggested by @taldcroft. The hope is that the patch, once polished, can land in astropy 6.1

@neutrinoceros
Copy link
Contributor

Actually, thinking more about this, I think we can do better than a boolean keep_order argument. Here are my arguments against it:

  • keep_order=True is not 100% clear in the case of the inner join: as mentioned we'd have to hard code a preference for one side, and the only way to reverse the preference from the user perspective would be to flip the first two arguments, which isn't very intuitive.
  • keep_order=False might turn out to be an outwright "lie" in cases where the join keys are already ordered (so there wouldn't be any observable difference with keep_order=True)
  • a boolean argument isn't easily extensible without changing its type somehow, so it's harder to change our minds about it later
  • as mentioned, keep_order=True would still be ambiguous with join_type="outer"

I think a better solution would be to expose a sort_by: str argument that would accept "left", "right" or "keys" (which would be the default value). This would neatly avoid all four problems I just mentioned. I also don't think it would add much complexity with respect to the simpler keep_order argument. In fact it would merely expose the complexity we need in all cases. What do you think @taldcroft ?

@taldcroft
Copy link
Member

taldcroft commented Feb 2, 2024

@neutrinoceros - that sounds like a good solution.

The one thing I often do is look for precedent in pandas. Although astropy is not even close to API compatible, it can reduce friction to be similar where possible. Pandas appears to have a single bool sort keyword that looks similar to keep_order: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html. In the case of pandas they define the default sort order separately for each join type (defined in the how keyword).

Your idea might be more clear and better in the end, but have a look at pandas and see what you think.

@neutrinoceros
Copy link
Contributor

Ok, will do ! I don't think it matters to much at this point as I think I'll implement the bulk of the feature the same way in both cases, it's really just about of how we expose it, so we can discuss the exact API in my future PR !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants