Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix an unintended exception being raised when attempting to compare two unequal Table instances. #15845

Merged
merged 7 commits into from Jan 11, 2024

Conversation

neutrinoceros
Copy link
Contributor

Description

Will fix #13421

The essence of the patch is to wrap existing logic in a try/except block as discussed in the issue, but I also tried to reduce code duplication, which resulted in a larger diff.

The test is also incomplete at the moment (see embedded comment). I'll come back to it when I'm confident that the current state survives CI.

  • By checking this box, the PR author has requested that maintainers do NOT use the "Squash and Merge" button. Maintainers should respect this when possible; however, the final decision is at the discretion of the maintainer that merges the PR.

@github-actions github-actions bot added the table label Jan 10, 2024
Copy link

github-actions bot commented Jan 10, 2024

Thank you for your contribution to Astropy! 🌌 This checklist is meant to remind the package maintainers who will review this pull request of some common things to look for.

  • Do the proposed changes actually accomplish desired goals?
  • Do the proposed changes follow the Astropy coding guidelines?
  • Are tests added/updated as required? If so, do they follow the Astropy testing guidelines?
  • Are docs added/updated as required? If so, do they follow the Astropy documentation guidelines?
  • Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see instructions for rebase and squash.
  • Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the "Extra CI" label. Codestyle issues can be fixed by the bot.
  • Is a change log needed? If yes, did the change log check pass? If no, add the "no-changelog-entry-needed" label. If this is a manual backport, use the "skip-changelog-checks" label unless special changelog handling is necessary.
  • Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
  • Is a milestone set? Milestone must be set but we cannot check for it on Actions; do not let the green checkmark fool you.
  • At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate "backport-X.Y.x" label(s) before merge.

Copy link

👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?

@neutrinoceros neutrinoceros changed the title TST: add a regression test for bug 13421 BUG: fix comparing unequal tables Jan 10, 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might want to add a few more words on what behavior was actually fixed. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better now ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original example in #13421 is a little confusing because it has two different things, the length and the column names. As noted earlier comparing a length=2 and length=3 table will fail in a different way. So maybe this test should only change one thing at a time. Or maybe test the three cases:

  1. Data same, names different
  2. Data different, names same
  3. Data different, names different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done !

@pllim pllim added this to the v6.0.1 milestone Jan 10, 2024
@pllim pllim added Bug 💤 backport-v6.0.x on-merge: backport to v6.0.x labels Jan 10, 2024
@neutrinoceros neutrinoceros changed the title BUG: fix comparing unequal tables BUG: fix an unintended exception being raised when attempting to compare two unequal Table instances. Jan 11, 2024
Comment on lines +3686 to +3694
eq = self.__eq__(other)
if isinstance(eq, bool):
# bitwise operators on bool values not reliable (e.g. `bool(~True) == True`)
# and are deprecated in Python 3.12
# see https://github.com/python/cpython/pull/103487
return not eq
else:
return ~eq
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fixes a secondary buglet that I discovered with the test I added. I made an attempt at separating it into its own PR, but I couldn't find a way to test it without fixing the first bug too. I could however move it to a follow up PR if requested.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

@neutrinoceros neutrinoceros marked this pull request as ready for review January 11, 2024 09:31
Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@neutrinoceros - overall this looks nice! I have a number of comments, some nit-picky. 😄

Comment on lines +3686 to +3694
eq = self.__eq__(other)
if isinstance(eq, bool):
# bitwise operators on bool values not reliable (e.g. `bool(~True) == True`)
# and are deprecated in Python 3.12
# see https://github.com/python/cpython/pull/103487
return not eq
else:
return ~eq
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

try:
self_is_masked = self.has_masked_columns
other_is_masked = isinstance(other, np.ma.MaskedArray)
if (self_is_masked + other_is_masked) == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of relying on the int value of bool. Basically this is the XOR, which doesn't exist in Python but for bools can be written as below:

Suggested change
if (self_is_masked + other_is_masked) == 1:
# One table is masked and the other is not
if self_is_masked != other_is_masked:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, that's cleaner. Thank you !

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that if (self_is_masked + other_is_masked) == 1: isn't the best code, but interpreting True as 1 and False as 0 is safe.

>>> issubclass(bool, int)
True

I don't know if the XOR operator ^ should be preferred over != here, but it does exist:

>>> for x in (True, False):
...     for y in (True, False):
...         print(f"{x = }, {y = }, {x ^ y = }")
... 
x = True, y = True, x ^ y = False
x = True, y = False, x ^ y = True
x = False, y = True, x ^ y = True
x = False, y = False, x ^ y = False

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh ! I think I learned about it a while back but probably never encountered it in production. It's indeed a perfect use case, but I'm still hesitant to use it since it's so rarely seen that it might affect readability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it is "safe", but I've always thought of it as an accidental implementation detail not a feature. As we get more into typing it just feels wrong. There is no argument that the strict rules say that bool + bool => int, but it feels ugly.

That's funny about the ^ XOR. I wasn't sure so I googled it and got to a stackoverflow page that said it doesn't exist. Go figure, and I learned something new today. Definitely using that is better than !=.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't debate further, I'm too happy that I get to use something that feels new 😄

self_is_masked = self.has_masked_columns
other_is_masked = isinstance(other, np.ma.MaskedArray)
if (self_is_masked + other_is_masked) == 1:
# remap variables to a and b where a is masked and b isn't
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice refactor to remove duplication!

Comment on lines 3739 to 3742
except TypeError:
# numpy may complain that structured array are not comparable
# see https://github.com/astropy/astropy/issues/13421
return False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also the potential for a ValueError if the tables are not broadcastable (this was also previously a warning). E.g.

In [48]: np.array([1, 2]) == np.array([1, 2, 3])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[48], line 1
----> 1 np.array([1, 2]) == np.array([1, 2, 3])

ValueError: operands could not be broadcast together with shapes (2,) (3,) 

Even though it is a bit of duplication, you should try/except for (TypeError, ValueError) around only the two lines which actually do the comparison. Otherwise those broad catches could be masking real errors in all the other (non-trivial) code.

If there isn't a test for this case you should add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinking. I'm on it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

false_mask = np.zeros(1, dtype=[(n, bool) for n in other.dtype.names])
result = (self.as_array() == other.data) & (other.mask == false_mask)
false_mask = np.zeros(1, dtype=[(n, bool) for n in a.dtype.names])
return (a.data == b) & (a.mask == false_mask)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you revert to setting result and returning that at the end? Overall I prefer this pattern and try to avoid early return if it doesn't require contorted logic or a giant if block.

This is one thing in astropy that each subpackage has its own style and it helps to maintain that style FWIW. 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, no problem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original example in #13421 is a little confusing because it has two different things, the length and the column names. As noted earlier comparing a length=2 and length=3 table will fail in a different way. So maybe this test should only change one thing at a time. Or maybe test the three cases:

  1. Data same, names different
  2. Data different, names same
  3. Data different, names different

@taldcroft
Copy link
Member

Another thing is that this is a good opportunity to update the narrative docs: https://docs.astropy.org/en/stable/table/access_table.html#table-equality

What is said there is incomplete or not quite true any more: "This is the same as the behavior of numpy structured arrays.". That used to be correct, but in modern numpy comparing structured arrays with different columns or not-broadcastable will give an exception. So for == it should be noted that it returns False if the arrays have different columns or are not broadcastable, unlike comparing structured arrays.

It might also be worth adding a note that both == and .values_equal() report equality after broadcasting. Honestly I find this a bit confusing so I might not be alone. It matches numpy so maybe it makes sense, but a user might expect that if t1 has 100 rows and t2 has 1 rows that t1 == t2 would be False. Indeed if t2 has 2 rows then t1 == t2 is False.

@neutrinoceros
Copy link
Contributor Author

So for == it should be noted that it returns False if the arrays have different columns or are not broadcastable, unlike comparing structured arrays.

@taldcroft Actually Table._rows_equal's docstring explicitly states that it "Returns a 1-D boolean numpy array", so I used np.array([False]) as the default return value instead of False (this was also necessary to avoid conditional branches in the now generalised test). Let me know if I should change it. I'll update narrative docs depending on your reply :)

@neutrinoceros
Copy link
Contributor Author

switching to draft until I get old deps CI green again

@neutrinoceros neutrinoceros marked this pull request as draft January 11, 2024 14:07
@taldcroft
Copy link
Member

So for == it should be noted that it returns False if the arrays have different columns or are not broadcastable, unlike comparing structured arrays.

@taldcroft Actually Table._rows_equal's docstring explicitly states that it "Returns a 1-D boolean numpy array", so I used np.array([False]) as the default return value instead of False (this was also necessary to avoid conditional branches in the now generalised test). Let me know if I should change it. I'll update narrative docs depending on your reply :)

I also noticed that and was originally going to say your update to __ne__ was not necessary if there can never be a scalar bool. But in fact the == has always returned a scalar bool in certain cases, following the old behavior of numpy. So I think the answer is to fix the docstrings and documentation to match the existing legacy behavior. As we saw in #13421 this behavior is expected by users who mostly don't read the docs 😄.

@neutrinoceros
Copy link
Contributor Author

Ok, this makes our new test significantly more complicated so let's see how this performs. @taldcroft I'd appreciate if you could review the current state before I commit changes to documentation. Thanks !

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking great now, thanks! Just a couple of minor suggestions.

self_is_masked = self.has_masked_columns
other_is_masked = isinstance(other, np.ma.MaskedArray)

whitelist = (TypeError, ValueError if not NUMPY_LT_1_25 else DeprecationWarning)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use whitelist now, in favor of allowlist / blocklist or other equivalents. Here you might just be more descriptive and call this allowed_numpy_exceptions. This will probably also have a good side effect of formatting so that it is bit more clear:

allowed_numpy_exceptions = (
    TypeError,
    ValueError if not NUMPY_LT_1_25 else DeprecationWarning
)

except (TypeError, ValueError):
# dtypes are not comparable or arrays can't be broadcasted:
# a simple bool should be returned
assert not t1 == t2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are here what about explicitly checking the return type in all 8 asserts. I think this can be done compactly with this but you should check.

        assert not (cmp := t1 == t2) and isinstance(cmp, bool)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically true, but it is not needed: the expression t1 == t2 raises a DeprecationWarning on current numpy (treated as an error), so the test would already fail if the return type didn't match the expectation.

@neutrinoceros
Copy link
Contributor Author

Alright, I think I've taken all your suggestions into account, including docs updates. Also, CI should be green now, so undrafting !

@neutrinoceros neutrinoceros marked this pull request as ready for review January 11, 2024 15:29
Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Two suggestions, one you probably should not implement, the other (for the tests), maybe something to think about: I personally like tests that are explicit about what is expected. But absolutely fine without this too, so approving.

@@ -3683,15 +3684,26 @@ def __eq__(self, other):
return self._rows_equal(other)

def __ne__(self, other):
return ~self.__eq__(other)
eq = self.__eq__(other)
if isinstance(eq, bool):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do return self.__eq__(other) == False, but arguably trying to be too clever...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this would probably be more performant (isinstance is infamously slow). If this function is considered performance-critical in any way I think it's worth considering, otherwise I think it just hurts readability (I know it's hard to resist the call of golfing sometimes !)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, readability is why I felt I was trying to be too clever. Let's stick with what you have!

@@ -2424,3 +2424,63 @@ def test_mixin_join_regression():
t12 = table.join(t1, t2, keys=("index", "flux1", "flux2"), join_type="outer")

assert len(t12) == 6


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the test would be clearer if it gave the result in the parametrization rather than do a check to see what would be expected. I.e., "t1, t2, eq"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked myself the same question as I wrote the test, but I found that the try/except pattern expressed the documented behaviour more clearly than what could feel like a list of ad-hoc expectations. Honestly it seems arguable both ways. Maybe @taldcroft has stronger opinions that could help forming a decision ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reasonable. As said, I am absolutely fine with just merging the PR as is! (in another PR, @taldcroft pointed out the old adage of perfect being the enemy of good enough)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I looked at this I had the same idea of "t1, t2, eq" being a little more clear, but then I reminded myself of perfect and good. I think this is quite good. 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having said that @neutrinoceros - I generally agree with @mhvk's sentiment so that is something to keep in mind going forward.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try !

@mhvk
Copy link
Contributor

mhvk commented Jan 11, 2024

Ah, one more point: in MaskedArray, equality tests for structured arrays explicitly ignore masked entries. This is similar to (np.array([0, 1, 2]) == np.ma.MaskedArray([0, 1, 1], mask=[False, False, True])).all() being True. For this PR, I think it makes sense to just follow what was there and fix the actual bug, but it may be worth thinking a bit whether this is actually the behaviour that is wanted.

@neutrinoceros
Copy link
Contributor Author

I my opinion this is the most reasonable choice of behaviour. As a user I think I'd be unpleasantly surprised if masked values were taken into account here.

@mhvk
Copy link
Contributor

mhvk commented Jan 11, 2024

As a user I think I'd be unpleasantly surprised if masked values were taken into account here.

The thing is that for the table comparison, masked values are explicitly counted as always unequal to unmasked ones, a bit as if they were NaN. Which is a reasonable choice, but perhaps one worth thinking more about (if so, in a separate issue!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this count as behavior change that should not be backported? If so, please update milestone and remove backport label. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@taldcroft should answer here but my understanding is that we're just making the intended behaviour actually work and align out-of-dat documentation, so I think it's okay to backport.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pllim - I agree with what @neutrinoceros said, with the conclusion that this should be backported.

Copy link
Member

@taldcroft taldcroft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@taldcroft taldcroft merged commit c714978 into astropy:main Jan 11, 2024
25 of 26 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/astropy that referenced this pull request Jan 11, 2024
…sed when attempting to compare two unequal ``Table`` instances.
@neutrinoceros neutrinoceros deleted the table/bug/table_comparison branch January 11, 2024 20:57
saimn pushed a commit that referenced this pull request Mar 25, 2024
…n attempting to compare two unequal ``Table`` instances.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug table 💤 backport-v6.0.x on-merge: backport to v6.0.x
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QTable equality broken with numpy 1.23
5 participants