Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span `__hash__` and `__eq__` are incompatible #4965

Closed
Evpok opened this issue Feb 2, 2020 · 4 comments · Fixed by #5005
Closed

Span `__hash__` and `__eq__` are incompatible #4965

Evpok opened this issue Feb 2, 2020 · 4 comments · Fixed by #5005
Labels

Comments

@Evpok
Copy link

@Evpok Evpok commented Feb 2, 2020

From __hash__ doc

The only required property is that objects which compare equal have the same hash value

While right now

import spacy
nlp = spacy.load("fr_core_news_sm")
d = nlp("Le succès de General Electric n'est dépassé que par celui de France Télécom")

Gives

>>> d[3:5] == d.ents[0]
True
>>> hash(d[3:5]) == hash(d.ents[0])
False

I understand the rationale here: it is convenient to be able to check that Spans effectively cover the same spans, while Spans bearing different annotations are conceptually different, but it makes functions that depend on hash fail in counter-intuitive ways such as

>>> d[3:5] in d.ents
True
>>> d[3:5] in set(d.ents)
False

Or the specific issue that made me notice this

>>> ent_dict = {e: e.label_ for e in d.ents}
>>> ent_dict[d[3:5]]
KeyError: General Electric
@adrianeboyd

This comment has been minimized.

Copy link
Collaborator

@adrianeboyd adrianeboyd commented Feb 3, 2020

That's a very charitable explanation for everything! I think it's more likely an oversight that the comparison functions weren't updated when more attributes were added.

Since it's deprecated, we might want to replace the __richcmp__ function with the typical python __le__, etc. functions and consider all the span attributes in the hash and comparisons, like the newer kb_id, too. In any case, __hash__ and __eq__ should definitely do the same thing here.

Off the top of my head I'm not totally sure how to compare labels or kb ids, e.g.:

Span(doc, 0, 5, label=1) < Span(doc, 0, 5, label=2) == ???

But __eq__ should be easy to get right.

@Evpok

This comment has been minimized.

Copy link
Author

@Evpok Evpok commented Feb 3, 2020

But __eq__ should be easy to get right.

In that case, is there an easy replacement to check that two Span well… span the same tokens?

@adrianeboyd

This comment has been minimized.

Copy link
Collaborator

@adrianeboyd adrianeboyd commented Feb 3, 2020

I take what I said about __eq__ back a bit. There's quite a bit to compare if you really want to compare Span objects properly. From the init:

Doc doc, int start, int end, label=0, vector=None, vector_norm=None, kb_id=0

I don't think there's an easy replacement, but if __eq__ and __hash__ don't compare everything, it's pretty weird behavior for an object. You can even currently compare spans from different docs and they're equal if the offsets are the same, which is definitely not what anybody should want.

I think you might have to just write a utility function to compare the offsets if that's what you're interested in. I'd have to think about whether there's an elegant way to add this to the library.

@Evpok

This comment has been minimized.

Copy link
Author

@Evpok Evpok commented Feb 3, 2020

I think you might have to just write a utility function to compare the offsets if that's what you're interested in

Not cross-document, even if that might make some sense if you process the same input and can guarantee that the tokenization is the same but that seems to specific. So I think either a token-by-token comparison or checking that the parent Doc is the same and that the offsets match?

@adrianeboyd adrianeboyd mentioned this issue Feb 12, 2020
3 of 3 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Linked pull requests

Successfully merging a pull request may close this issue.

2 participants
You can’t perform that action at this time.