Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG [table]: fix a bug where column names would be lost when instanciating Table from a list of Row objects #15735

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

neutrinoceros
Copy link
Contributor

Description

Fixes #5923
This is a somewhat inelegant minimal patch; I'll be happy to iterate over it to improve quality if requested !

ping @hamogu @taldcroft

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

Copy link

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@pllim pllim added this to the v6.1.0 milestone Dec 14, 2023
@pllim pllim added the Bug label Dec 14, 2023
@@ -593,7 +620,7 @@ class Table:
Copy the input data. If the input is a Table the ``meta`` is always
copied regardless of the ``copy`` parameter.
Default is True.
rows : numpy ndarray, list of list, optional
rows : numpy ndarray, list of list, sequence of ``Row``, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addition of new accepted type as input also feel like a feature, so I didn't mark this as backport but Table maintainers can always change that if they want.

Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initializing from a sequence of Row was always intended to work, so this is just documenting that correctly (and fixing a bug).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, @taldcroft , should we backport?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes on the backport in concept, though to be honest I don't know where it will be backported. I thought there is only 6.0.1 which is the next bugfix release forward from 6.0.0.

Also, of course, I haven't reviewed the code yet!

@neutrinoceros neutrinoceros marked this pull request as ready for review December 14, 2023 17:43
@neutrinoceros neutrinoceros changed the title BUG: fix a bug where column names would be lost when instanciating Table from a list of Row objects BUG [table]: fix a bug where column names would be lost when instanciating Table from a list of Row objects Dec 14, 2023
@neutrinoceros
Copy link
Contributor Author

There's one failure I didn't catch locally. I'll get back to fix it in the morning, most likely.

@neutrinoceros neutrinoceros marked this pull request as draft December 14, 2023 17:48
return names

for row in rows[1:]:
if not isinstance(row, Row) or len(row.colnames) != len(names):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to reviewers: I'm assuming that passing rows of different sizes would raise an exception sooner or later during initialization, but I think this function should be as resilient as possible, so raising that exception is not its responsibility.

t1["a"] = [1, 2, 3]
t1["b"] = [2.0, 3.0, 4.0]

rows = [row for row in t1] # noqa: C416
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a noqa comment here to keep this intention of that line very clear. Otherwise, the linter would push for
rows = list(t1), which is equivalent, but a bit more opaque in my opinion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could also create a list of Rows manually. Here I just took the example from @astrofrog for easier comparison between the test and its linked issue.

@neutrinoceros neutrinoceros marked this pull request as ready for review December 15, 2023 10:23
@hamogu
Copy link
Member

hamogu commented Dec 15, 2023

We should backport.

@hamogu
Copy link
Member

hamogu commented Dec 15, 2023

Since I don't review many PRs, I don't know what I need tp do about that though. Set the backport label?

@pllim pllim modified the milestones: v6.1.0, v6.0.1 Dec 15, 2023
@pllim pllim added the 💤 backport-v6.0.x on-merge: backport to v6.0.x label Dec 15, 2023
@pllim
Copy link
Member

pllim commented Dec 15, 2023

Change the milestone and set the backport label, yes. I already did it here. Thanks for the feedback!

@taldcroft
Copy link
Member

@neutrinoceros - I did a little digging and noticed a solution that has a much smaller footprint, in fact it reduces the line count. Basically this relies on Row being mostly dict-like, though sadly list(row) gives the values not the keys. (If we could go back 11 years...).

This also fixes some old inelegant code for getting the column order which was written way before Python had an ordered dict. It's very easy now to get the column order as simply the order in which they appear in successive rows. The previous alphabetical sort was not very nice.

One slight unfortunate thing about this solution is that it could change the column ordering in code so it would count as an API change for the unusual circumstance where the first row does not have all the columns. So I think this would need to wait for the next release, but the original issue is 6 years old so it can wait a little longer.

diff --git a/astropy/table/table.py b/astropy/table/table.py
index ac5a06d791..65631d9b0a 100644
--- a/astropy/table/table.py
+++ b/astropy/table/table.py
@@ -205,9 +205,9 @@ def _get_names_from_list_of_dict(rows):
     if rows is None:
         return None
 
-    names = set()
+    names = dict()
     for row in rows:
-        if not isinstance(row, Mapping):
+        if not isinstance(row, (Mapping, Row)):
             return None
         names.update(row)
     return list(names)
@@ -1169,17 +1169,7 @@ class Table:
         MISSING = object()
 
         # Gather column names that exist in the input `data`.
-        names_from_data = set()
-        for row in data:
-            names_from_data.update(row)
-
-        if set(data[0].keys()) == names_from_data:
-            names_from_data = list(data[0].keys())
-        else:
-            names_from_data = sorted(names_from_data)
-
-        # Note: if set(data[0].keys()) != names_from_data, this will give an
-        # exception later, so NO need to catch here.
+        names_from_data = _get_names_from_list_of_dict(data)
 
         # Convert list of dict into dict of list (cols), keep track of missing
         # indexes and put in MISSING placeholders in the `cols` lists.

@neutrinoceros neutrinoceros force-pushed the table/bug/table_from_rows branch 2 times, most recently from 715ae5a to 3bea311 Compare December 18, 2023 20:45
@neutrinoceros
Copy link
Contributor Author

@taldcroft thank you so much, I agree your solution is much more elegant and it still passes the test, so I went ahead and replaced mine with it (with you as a "co-author") !

@taldcroft taldcroft added the API change PRs and issues that change an existing API, possibly requiring a deprecation period label Dec 19, 2023
@taldcroft taldcroft removed this from the v6.0.1 milestone Dec 19, 2023
@pllim pllim marked this pull request as draft December 19, 2023 21:58
@pllim
Copy link
Member

pllim commented Dec 19, 2023

As per Simon's recommendation, I moved milestone to 7.0, added label to request What's New entry, and turned this into draft so we don't merge before v6.1.

@pllim
Copy link
Member

pllim commented Dec 19, 2023

Re: #15735 (comment)

@saimn , I have not been super careful about API change and v6.1. Do we have to revert anything in https://github.com/astropy/astropy/pulls?q=is%3Apr+is%3Amerged+label%3A%22API+change%22+milestone%3Av6.1.0

@hamogu
Copy link
Member

hamogu commented Dec 19, 2023

I think pushing this to 7.0 is going too far. See #15735 (comment)
This was always intended to work -in my view that fact that it did not is a bug that's fixed here. In that sense it's similar to #15774 - it breaks stuff if you relied on undocumented, uncodified behavior that the was not intended. In the same vain, I don't think a what'-s new is justified for this bug fix.

@taldcroft ?

@neutrinoceros
Copy link
Contributor Author

it breaks stuff if you relied on undocumented, uncodified behavior that the was not intended.

@hamogu not exactly: the behaviour that's changed here was documented, even though it wasn't ideal (and it's a side effect from @taldcroft's patch that I adopted, not it's primary goal)
In that sense, I agree that it makes sense to push it to 6.1 in this kind of minor breakage was acceptable (it sounds like they're not) or 7.0 otherwise.

My own patch was backward compatible, albeit a lot less elegant than @taldcroft , so I think I can propose a way forward that should be satisfactory to all parties:

  • reinstate my own backward-compatible patch that just fixes the targeted bug (including a regression test) and move back this PR to 6.0.1
  • open a separate PR with @taldcroft's cleanup patch, aimed at 7.0

@saimn I am not sure how to go about adding a deprecation warning in this case: ideally users should have some sort of control to opt-in the new behaviour before it becomes the default, so they're able to dodge the warning once they are aware of it. In this case in do not see how to provide such an interface without adding a flag to Table.__init__'s signature, but the flag itself would then need to be deprecated too. All in all, is it really worth it ? I think there's a case to be had that a clean (documented !) break in 7.0 would help a lot with maintainability.

@saimn
Copy link
Contributor

saimn commented Dec 20, 2023

@hamogu - From what I understand, and @neutrinoceros confirms it above, there is both a bugfix (supporting list of Row) and an API change (ordering of the columns, which affects also instantiating from a list of dict). If the second part can be avoided then the bugfix can even go to 6.0.1, and maybe the API change can be delayed to 7.0, with a warning if possible.

@neutrinoceros - Not sure for the warning, it also depends if it makes sense to have an option to keep the old behavior. One option is to issue the warning when columns are sorted, stating that the default behavior will change in 7.0, and people can filter the warning if they want. Or add an option to choose the sort behavior which would allow to force the old behavior if it make sense to do so ?

@neutrinoceros
Copy link
Contributor Author

Or add an option to choose the sort behavior which would allow to force the old behavior if it make sense to do so ?

What's great about @taldcroft's patch is that it decreases complexity. If we were to support multiple behaviours then complexity would go back up again, and I'm honestly not sure it's worth it.

@taldcroft
Copy link
Member

taldcroft commented Dec 20, 2023

This is exactly the sort of situation I was thinking about in discussions for APE-21 which resulted in the statement about being pragmatic. Historically, API changes like this have been common in minor releases. Although we justified them by pointing to the LTS, in practice the project recognized that very few people were actually using the LTS.

Breakage could only occur for a pretty unusual situation:

  1. Initializing a table from a list of dict where the first row does not contain all of the columns.
  2. Downstream code that depends on the column ordering.

The pragmatic part is accepting that this situation can occur but it is unlikely. Not only does (1) need to be true, but it needs to always be true for all input data sets. Basically the current code is somewhat randomly changing the column ordering based on an arbitrary test.

It is true that we could test for situation (1), issue a warning and sort the columns alphabetically. But then users in that situation need to modify their code anyway to suppress the warning. And if they want to permanently keep alphabetical ordering they have to add code to do that sorting so it still works after 7.0, and then they are left with orphaned code to suppress the warning. Users that want the more natural ordering now would be out of luck unless we add some new configuration option.

This is just a LOT of complexity for a very little gain.

@hamogu hamogu modified the milestones: v7.0.0, v6.1.0 Dec 20, 2023
@hamogu
Copy link
Member

hamogu commented Dec 20, 2023

Like #15774 this sounds like space-bar heating: https://xkcd.com/1172/. I don't think we have to backport, but I suggest a 6.1 milestone. We also don't have to do the extra work of breaking this up into two PRs.

@taldcroft
Copy link
Member

In my day job (Chandra Flight Director dealing directly with NASA) I deal with process control authority, A LOT. In this case, there is a delegation of authority:

  • CoCo has ultimate authority for the release contents.
  • CoCo does not have time to deal with each PR, so they delegate to the release managers.
  • Release managers do not have time or (in many cases) the domain knowledge to assess each PR, so they delegate to the package maintainers.
  • Package maintainers normally have authority to make a decision about the impact of an API change and the target milestone.

Only in unusual circumstances would the release managers or the CoCo override the recommendation of the package maintainer(s). Of course a healthy discussion is always encouraged!

@saimn
Copy link
Contributor

saimn commented Dec 20, 2023

Not sure I understand where this discussion is going but, sure in the end it's up to the maintainers to decide.
But on the other side the change of philosophy of the release cycle is recent and it's easy to forget (e.g. #14977) that we decided to avoid breaking changes in minor versions. It's sometimes difficult to avoid, or sometimes justified to break something in order to fix a bug, fine.
Here the first version was backward compatible, and creating a Table from a list of dict is pretty common I guess, so it seems reasonable to ask if we can find another way to proceed.

@neutrinoceros
Copy link
Contributor Author

It's not up to me to decide and ultimately I'm happy to re-iterate as long as it takes to get everyone happy but I think that going with the patch as is and aiming it as 6.1 is a reasonable way forward and indeed falls into the "pragmatic" kind of low impact breakage that APE 21 defines as acceptable for a minor release.

@taldcroft
Copy link
Member

taldcroft commented Dec 20, 2023

@saimn - agreed to try avoiding API changes, I'm definitely 100% behind not breaking code unnecessarily. That said, in this case we have talked it through and I believe the complexity that is needed to maintain perfect back-compatibility outweighs the potential impact. Creating Tables from list of dict is common, but having both (1) and (2) from above is not.

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neutrinoceros - this is converging! I made a number of suggestions on the docs. The code and tests look fine. I think the CI failures are just doc tests (at least from the first one) but you should check them all.

docs/table/construct_table.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
docs/whatsnew/6.1.rst Outdated Show resolved Hide resolved
@neutrinoceros
Copy link
Contributor Author

rebased to resolve a merge conflict

@astrofrog astrofrog modified the milestones: v6.1.0, v7.0.0 Apr 4, 2024
neutrinoceros and others added 2 commits April 30, 2024 12:15
…ble from a list of Row objects

Co-authored-by: Tom Aldcroft <taldcroft@gmail.com>
@neutrinoceros
Copy link
Contributor Author

rebased again to resolve conflicts. This time I also squashed my commits to minimize the number of conflicts I need to resolve in a single rebase.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API change PRs and issues that change an existing API, possibly requiring a deprecation period Bug table whatsnew-needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Column names not preserved when initializing Table from a list of Rows
6 participants