Add coref_resolved method to CorefPredictor #2296
```python
else:
    resolved[coref[0]] = mention_span.text + final_token.whitespace_
    for i in range(coref[0] + 1, coref[1] + 1):
        resolved[i] = ""
```
What is this doing exactly?
Since `resolved` is a list of the document's tokens with their trailing whitespace, line 96 replaces the first coreference token with the mention it refers to, while lines 97-98 go through and mask out all subsequent tokens with "", so that they are ultimately eliminated in the `"".join` that follows. This procedure is necessary for replacing multi-word coreferences with a single mention, e.g. "the country" with "China" in the example. This logic is borrowed from lines 262-271 here.
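The replace-and-mask step can be sketched as a small standalone function (a toy illustration, not the PR's actual code; `resolve_span` and its arguments are hypothetical names):

```python
def resolve_span(tokens, coref, mention_text, trailing_ws):
    """Replace the inclusive token span coref=(start, end) with a single
    mention string, masking out the now-redundant trailing tokens."""
    resolved = list(tokens)  # tokens already carry their whitespace
    resolved[coref[0]] = mention_text + trailing_ws
    for i in range(coref[0] + 1, coref[1] + 1):
        resolved[i] = ""  # masked tokens vanish in the final join
    return "".join(resolved)

# Multi-word coreference "the country" replaced by the single mention "China":
resolve_span(["I ", "like ", "the ", "country ", "."], (2, 3), "China", " ")
# -> "I like China ."
```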
Ah--that makes sense. I recommend turning part of your explanation into a comment.
Sounds good, yeah I'll add that comment. `resolved.remove(i)` would work as well, but it's less efficient (quadratic vs. linear), since each removal shifts the remaining elements.
```python
# Correctly formats possessive mentions with 's endings.
# These include my, his, her, as well as person's, computer's, etc.
if final_token.tag_ in ["PRP$", "POS"]:
    resolved[coref[0]] = mention_span.text + "'s" + final_token.whitespace_
```
Perhaps spaCy avoids reconstructing a sentence because the possessive suffix depends on the resolved word, e.g. "Michael's" (singular) but "zebras'" (plural). Your logic seems like a reasonable compromise, but I would expect a clear disclaimer.
Yes, that is one area where mistakes in grammar might occur, producing "zebras's" instead of "zebras'". You're right that the spaCy neuralcoref extension doesn't attempt this reconstruction, and instead replaces possessive pronouns with their main mention directly, e.g. "the country's" becomes "China". I can see how certain applications would benefit from maintaining the possessive form, while others would do fine without it. More `if..else`-style code could be added to guarantee grammatical accuracy in these cases, but it's probably stretching the logic too far and opening up further sources of error.
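The plural caveat can be made concrete with a couple of toy helpers (hypothetical names; the plural-aware variant is exactly the kind of extra branching the discussion decided against):

```python
def possessive_naive(mention):
    # Mirrors the compromise in this PR: always append "'s".
    return mention + "'s"

def possessive_plural_aware(mention):
    # Extra if..else logic: regular plurals ending in "s" take a bare
    # apostrophe. Still imperfect, e.g. for singular names like "James".
    if mention.endswith("s"):
        return mention + "'"
    return mention + "'s"

possessive_naive("zebras")          # -> "zebras's" (ungrammatical)
possessive_plural_aware("zebras")   # -> "zebras'"
possessive_plural_aware("Michael")  # -> "Michael's"
```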
Yep--it would be stretching the logic too far. I recommend just adding a disclaimer about the issue in the comments of this method.
@shellshock1911 thanks for the examples in the description. I think we should have them as test cases if this is to be merged.
These are the examples I was using for testing; they could be used as test cases here as well:
Testing these shows that the method can accurately handle both multi-word main mentions and coreferences, as well as both personal and possessive pronouns. Mistakes in resolution, such as in the third example, result from errors in the underlying model's prediction, and so are out of the scope of this contribution.
@shellshock1911 I'd like to merge this but we need two things:
Then we're good to go!
@schmmd - Awesome, I've made changes to fix issues with the linting and types, but I kept failing my unit tests in the build system, although they pass locally. I looked into it more and realized that the serialized models provided in
@shellshock1911 it looks like there's still a linting issue. For your test, I probably wouldn't call a predictor in the test case. I would probably have only one or two examples and provide hard-coded predictions as inputs. The test would then check if resolving the example gives the text you intended. That way your test will run faster and will be independent of the actual coreference models.
@shellshock1911 to more directly answer your question--yes, the test fixture models are used to test interfaces and not correctness. But for your unit test, I would avoid using any models and instead hard-code predictions.
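A sketch of the suggested test style: hard-code a prediction dict (the `document`/`clusters` keys imitate the shape of a coref prediction) and assert on the resolved text. `resolve` here is a toy whitespace-joined stand-in, not the method from this PR:

```python
# Hard-coded prediction in the shape of a coref predictor's output:
# "document" is the tokenized text, "clusters" lists inclusive token-index
# spans, with the first span in each cluster taken as the main mention.
prediction = {
    "document": "Charlie wants to buy a game so he can play it with friends".split(),
    "clusters": [[[0, 0], [7, 7]], [[4, 5], [10, 10]]],
}

def resolve(prediction):
    # Toy stand-in for the real method: replace each later span in a
    # cluster with the cluster's first (main) mention, then re-join.
    tokens = list(prediction["document"])
    resolved = list(tokens)
    for cluster in prediction["clusters"]:
        main_start, main_end = cluster[0]
        mention = " ".join(tokens[main_start:main_end + 1])
        for start, end in cluster[1:]:
            resolved[start] = mention
            for i in range(start + 1, end + 1):
                resolved[i] = ""
    return " ".join(t for t in resolved if t)

assert resolve(prediction) == (
    "Charlie wants to buy a game so Charlie can play a game with friends"
)
```

This style keeps the test fast and independent of any serialized model, which is exactly the concern raised above.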
Thanks for approving this PR!
Thanks for the hard work! I'm not going to merge it right now as I'm just about to head out for a week's vacation--but someone else may, or I will as soon as I'm back!
Any thoughts on this? Changes in the upstream introduced a linting issue, but I resolved that, and now the build is good 😄.
@schmmd, looks like you forgot about this - feel free to be brave and click merge if the tests pass =).
Thanks guys! I do still think this would be a useful utility for the coreference predictor. In recent weeks, I've been running a news article processing pipeline with this fork of AllenNLP, where this in-place resolution method has been applied to millions of articles without apparent error, i.e. no crashes. It'd be awesome to have it in the upstream, and I'm sure others could benefit as well, since it automates the coreference resolution step of pre-processing in a larger NLP pipeline.
Thanks @shellshock1911! Sorry this fell through the cracks and took so long to get merged. |
* Add coref_resolved method to CorefPredictor
* Resolves coreferences by producing a document that has had its coreferences substituted with their main mentions
* Remove redundant Doc import
Resolves coreferences by producing a document that has had
its coreferences substituted with their main mentions
Ex:
Personal
"Charlie wants to buy a game, so he can play it with friends."
-->
"Charlie wants to buy a game, so Charlie can play a game
with friends."
Possessive
"Stocks also got a boost after China took steps to encourage
bank lending and stimulate the country's flagging economy."
-->
"Stocks also got a boost after China took steps to encourage
bank lending and stimulate China's flagging economy."
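The possessive example above can be traced with a toy whitespace-tokenized resolver (a sketch only; names and the crude tokenization are illustrative, and the real method works over spaCy tokens):

```python
def toy_resolve(tokens, clusters):
    """tokens: whitespace-split text with "'s" as its own token.
    clusters: [[main_span, *coref_spans]] of inclusive token indices."""
    resolved = list(tokens)
    for cluster in clusters:
        start0, end0 = cluster[0]
        mention = " ".join(tokens[start0:end0 + 1])
        for start, end in cluster[1:]:
            if tokens[end] == "'s":
                # Possessive branch: "the country 's" -> "China's"
                resolved[start] = mention + "'s"
            else:
                resolved[start] = mention
            for i in range(start + 1, end + 1):
                resolved[i] = ""
    return " ".join(t for t in resolved if t)

tokens = ("Stocks also got a boost after China took steps to encourage "
          "bank lending and stimulate the country 's flagging economy .").split()
clusters = [[[6, 6], [15, 17]]]  # "China" <- "the country 's"
print(toy_resolve(tokens, clusters))
# -> "Stocks also got a boost after China took steps to encourage
#     bank lending and stimulate China's flagging economy ."
```

(The stray space before the final period is an artifact of the toy whitespace join; the PR's version preserves each token's original whitespace instead.)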