Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support table join on different keys #11954

Merged
merged 4 commits into from
Jul 19, 2021

Conversation

taldcroft
Copy link
Member

@taldcroft taldcroft commented Jul 17, 2021

Description

This pull request is to allow doing a table join on table columns that have different names. It also allows doing a join on arbitrary column-like data which serve as the join values. The API is inspired by left_on and right_on in pandas merge.

This also introduces a small API change by requiring keyword-only args after left, right, keys, join_type. I think that after this it is unlikely that much external code is using positional args. The next arg was previously uniq_col_name.

Fixes #3752

Checklist for package maintainer(s)

This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label.
  • Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label.
  • Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.

@taldcroft taldcroft added table API change PRs and issues that change an existing API, possibly requiring a deprecation period labels Jul 17, 2021
@taldcroft taldcroft added this to the v5.0 milestone Jul 17, 2021
@github-actions
Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@taldcroft taldcroft marked this pull request as ready for review July 17, 2021 20:48
@taldcroft taldcroft requested a review from mhvk July 18, 2021 04:22
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very; my comments are tiny.

One alternative for the API: to avoid having any API change and the checks on keys not being used when keys_left and keys_right are, you could also let the keys argument be a dict with two entries, 'left' and 'right'. Or, if it is a list, allow it to have entries that are 2-element tuples (i.e., the example would be either keys={'left': 'name', 'right': 'obs_id'} or keys=[('name', 'obs_id')] Note that I'm not convinced either is better - just thought I'd pass them by.

Since I do think this is all OK except for the typo in the docstring, I'll approve now.

@@ -354,6 +355,12 @@ def join(left, right, keys=None, join_type='inner',
Default is to use all columns which are common to both tables.
join_type : str
Join type ('inner' | 'outer' | 'left' | 'right' | 'cartesian'), default is 'inner'
keys_left : str or list of str of list of column-like, optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

second "of" should be "or list of column-like"

right_orig = right
left, right, keys = _join_keys_left_right(
left, right, keys, keys_left, keys_right, join_funcs)

# If we have a single key, put it in a tuple
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really directly related, but this comment should be a few lines down.

astropy/table/operations.py Show resolved Hide resolved
@taldcroft
Copy link
Member Author

One alternative for the API: to avoid having any API change and the checks on keys ...

I also thought about the exact same API alternatives. In the end the precedent from pandas seems like a good thing, but I also think this API is more readable. Supplying a tuple of the left and right keys is definitely confusing (if you haven't studied the docs), so that one is easy to reject. The dict is OK, but I think at the end having distinct keywords is easier to read.

This will also give a better error message if someone runs code written for astropy 5.0 on astropy < 5.0. Right now if you do that it will effectively run as if you had done keys = list(keys) at the top and try to join on 'left and 'right' columns.

@mhvk
Copy link
Contributor

mhvk commented Jul 19, 2021

Yes, I figured you had thought it through already! So, will merge now. Nice little addition!

@mhvk mhvk merged commit 81c1d64 into astropy:main Jul 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API change PRs and issues that change an existing API, possibly requiring a deprecation period Enhancement table
Projects
None yet
Development

Successfully merging this pull request may close these issues.

astropy.table.join on different colnames
2 participants