Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix a bug where columns with dtype=object wouldn't be properly deep-copied using copy.deepcopy #15871

Merged
merged 2 commits into from Mar 27, 2024

Conversation

neutrinoceros
Copy link
Contributor

Description

Fixes #13435

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

Copy link

github-actions bot commented Jan 12, 2024

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

@github-actions github-actions bot added the table label Jan 12, 2024
Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

Copy link
Contributor Author

@neutrinoceros neutrinoceros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to reviewers: both tests fail on main and are fixed with the same patch. To some extent, their redundancy is intentional, but I could drop their parametrization (or drop it for one of them) since it's a bit heavy handed: only 1 out of 4 scenarios was actually fixed, and the other 3 already worked as intended.

@pllim pllim added this to the v6.1.0 milestone Jan 12, 2024
@pllim pllim added the Bug label Jan 12, 2024
@neutrinoceros neutrinoceros marked this pull request as ready for review January 12, 2024 14:33
@neutrinoceros
Copy link
Contributor Author

@pllim Do you see a reason not to backport this to 6.0.x ?

@pllim
Copy link
Member

pllim commented Jan 12, 2024

It is a behavior change that is subtle. Some people might be unknowingly using this old behavior so backporting would break them. But if @taldcroft et al. think the risk is low and benefit of backporting outweighs the risk, then they are free to change the milestone. Hope this clarifies the matter!

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm wondering, though, if this goes a bit too far? I think Column as an ndarray subclass should work the same way as ndarray.

My own sense is that we rely on users being able to do deepcopy(...) and get the right result. For Column, that is the case already, but for Table, that needs an adjustment of __deepcopy__. And then we can mention in the Table docstring under copy that if one has object columns and wants those copied too, then one should use copy.deepcopy(table) (and also under Column.copy and Table.copy.

@taldcroft - what do you think?

assert c2 is not c1
assert c2[0] is not c1[0]

c3 = table.Column(c1, copy=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if we go with my suggestion, for this case one would have assert c3[0] is c1[0] for the object column.

@pytest.mark.parametrize(
"data",
[
np.array([1], dtype=np.int32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe keep the test focussed just on object? I'd do np.array([1], dtype=object) and [object()] only.

@pytest.mark.parametrize(
"data",
[
np.array([1], dtype=np.int32),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here too, I'd parametrize just some examples of object.

p.s. The test itself would be independent of my suggested change.

@neutrinoceros
Copy link
Contributor Author

Thanks for your input, I didn't realize that Column indeed already behaves like its parent class so the fix needs to happen some place else. I played around with the patch you suggested in the issue but it breaks how metadata and indices gets passed to copies. I've readjusted the test cases to the now clarified spec (locally), and I'll push when I manage to pass all tests locally.

@neutrinoceros neutrinoceros marked this pull request as draft January 13, 2024 10:04
@neutrinoceros
Copy link
Contributor Author

Locally I still see 2 failing tests. Specifically it seems to be the ones that exercise copying a Table that has an index column, is masked, and utilise Table.group_by. As there's a lot of features intertwined here, I could use a hand to debug it.

@@ -3654,6 +3654,9 @@ def copy(self, copy_data=True):
deepcopied regardless of the value for ``copy_data``.
"""
out = self.__class__(self, copy=copy_data)
if copy_data:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I still think we should only adjust __deepcopy__ - right now, Table.copy() does the same thing as dict.copy() (which makes a shallow copy of the values = the data), and I think that's fine.

That will probably remove all errors, but obviously means one has to add extra tests...

But perhaps good to get @taldcroft's opinion first!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My mistake! If I understand correctly what you're suggesting is more or less this patch ?

diff --git a/astropy/table/table.py b/astropy/table/table.py
index 2c6e6c4a40..ca00fbdb69 100644
--- a/astropy/table/table.py
+++ b/astropy/table/table.py
@@ -3663,7 +3663,10 @@ class Table:
         return out
 
     def __deepcopy__(self, memo=None):
-        return self.copy(True)
+        out = self.copy(True)
+        for name in out.colnames:
+            out[name] = deepcopy(out[name])
+        return out
 
     def __copy__(self):
         return self.copy(False)

There's probably some duplicated work with this version (but not more than with the current state of this PR), but I also note that it also hits the exact same 2 failures that I still don't quite get.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's what I meant.

And let me if I can figure out the test...

Copy link
Contributor

@mhvk mhvk Jan 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot say I necessarily understand it completely, but the following does work:

        out = self.copy(copy_data=False)
        for name in out.colnames:
            out.columns.__setitem__(name, deepcopy(self[name]), validated=True)
        return out

Note that the strange line is just the very final line of replace_column - the rest is not needed since in this case we know the data is OK. An alternative that works as well is,

        out = self.copy(copy_data=False)
        out._replace_cols({name: deepcopy(self[name]) for name in self.colnames})
        return out

Maybe @taldcroft can advice on what would be the best procedure!?

EDIT: I use copy_data=False since we are making a deep copy of the data right after.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much ! I pushed the first approach just so everyone can see that it indeed passes all tests, but happy to change it again if Tom advises we use a different approach.

@neutrinoceros neutrinoceros marked this pull request as ready for review January 16, 2024 10:24
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! From my side, this looks all good, but would like @taldcroft's opinion as well.

@neutrinoceros
Copy link
Contributor Author

Let's gently remind @taldcroft about this one :)

@taldcroft
Copy link
Member

Sorry, I entirely missed this. I'll have a look but I'm stepping away for the afternoon now. One quick thing is with self.copy(False) I'm not sure about .meta getting a deep copy.

@taldcroft
Copy link
Member

To answer my question, despite the documentation meta is not actually deepcopied unless copy=True.

>>> meta = {1: object()}
>>> t1 = Table([[1]], meta=meta)
>>> t2_nc = Table(t1, copy=False)
>>> t2_c = Table(t1, copy=True)
>>> t1.meta[1] is t2_nc.meta[1]
True
>>> t1.meta[1] is t2_c.meta[1]
False

This stems from #8404. It would appear in that PR that we entirely failed to update related docstrings about the behavior of .meta and copy. Uff.

out = self.copy(False)
for name in out.colnames:
out.columns.__setitem__(name, deepcopy(self[name]), validated=True)
return out
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here you just need to add:

out.meta = deepcopy(self.meta)

)
def test_deepcopy_object_column(data):
# see https://github.com/astropy/astropy/issues/13435
c1 = table.Column(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are here, you should define meta = {1: object()} and then ensure that the copy for meta is deep as well.

)
def test_deepcopy_object_column(data):
# see https://github.com/astropy/astropy/issues/13435
t1 = Table({"a": data})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for Column, add a meta attribute and check the deepcopy.

@taldcroft
Copy link
Member

We should take this opportunity to clean up the docs. In the Table and QTable docstrings update the copy kwarg to:

    copy : bool, optional
        Copy the input column data and make a deep copy of the input meta.
        Default is True.

Update the copy() docstring to:

        copy_data : bool
            If `True` (the default), copy the underlying data array and make
            a deep copy of the ``meta`` attribute. Otherwise, use the same 
            data array and make a shallow (key-only) copy of ``meta``.

@neutrinoceros
Copy link
Contributor Author

Thanks @taldcroft, I've added your suggestions to the batch !
Btw, I've kept the branch history as additive as possible on this one to simplifying reviews, so please squash-merge !

@neutrinoceros
Copy link
Contributor Author

ah, got a conflict... too bad, I'll just rebase and rewrite the history after all.

@neutrinoceros
Copy link
Contributor Author

(for reference, the conflict was with #16002)

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neutrinoceros - sorry for the long delay. This looks good now and I see that @mhvk also approved. It looks like this needs a rebase to get CI picking up the latest. I'll set it to squash-merge once CI passes.

@taldcroft taldcroft enabled auto-merge (squash) March 27, 2024 16:05
auto-merge was automatically disabled March 27, 2024 16:12

Head branch was pushed to by a user without write access

@neutrinoceros
Copy link
Contributor Author

rebased !

@taldcroft taldcroft merged commit 69dc2ac into astropy:main Mar 27, 2024
26 of 27 checks passed
@taldcroft
Copy link
Member

Done thanks!!

@neutrinoceros neutrinoceros deleted the table/bug/object_col_deepcopy branch March 27, 2024 17:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deepcopy of table with object column returns shallow copy
4 participants