
Add stanza constituency output #78

Open
BramVanroy opened this issue Nov 8, 2021 · 2 comments


@BramVanroy
Contributor

BramVanroy commented Nov 8, 2021

Since release v1.3.0, stanza has a constituency parser for English. Support for more languages will follow. It would be great if we could access the constituency parse from within the spaCy wrapper too.

At first I thought I'd create a separate package for this that uses spacy_stanza under the hood and registers a custom component that adds the constituency parse. However, that implies one of two things: either copying most of spacy_stanza, subclassing StanzaTokenizer, and writing to Underscore objects in __call__; or creating only a custom component that, after receiving a Doc, parses its text again with stanza to get the constituency parse. Neither option is ideal, so I hope that you are open to incorporating such functionality in spacy_stanza directly.

The stanza constituency parser adds a constituency object (a Tree) to every sentence. Things that may be considered:

  • Subclass the stanza constituency Tree so that we can use spaCy Tokens as node labels. This would allow us to navigate the spaCy sentence Span object via the tree
  • Register ._.constituency for every sentence Span and every Span that is a constituent (a subclassed stanza Tree)
  • Register ._.constituency for every Token, which would be its subtree in the full tree with itself as the node (a subclassed stanza Tree)
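
To make the span idea a bit more concrete, here is a minimal sketch in plain Python. It does not use stanza or spaCy: `Node` is a hypothetical stand-in for a constituency tree node (I am only assuming a label plus a list of children, with words as leaves), and `align` and `constituent_spans` are illustrative helper names, not existing API. The half-open token spans it produces are the kind of information one would need in order to attach a constituent to a spaCy Span.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical stand-in for a constituency tree node (not stanza's Tree)."""
    label: str
    children: List["Node"] = field(default_factory=list)
    # Filled in by align(): half-open token span [start, end) covered by this node.
    start: Optional[int] = None
    end: Optional[int] = None

    def is_leaf(self) -> bool:
        return not self.children


def align(node: Node, tokens: List[str], offset: int = 0) -> int:
    """Assign token spans to every node, left to right; return the next free index."""
    node.start = offset
    if node.is_leaf():
        # A leaf must match the token at the current position.
        if offset >= len(tokens) or tokens[offset] != node.label:
            raise ValueError(f"tree leaf {node.label!r} does not match token {offset}")
        node.end = offset + 1
        return node.end
    for child in node.children:
        offset = align(child, tokens, offset)
    node.end = offset
    return offset


def constituent_spans(node: Node) -> List[Tuple[int, int, str]]:
    """Collect (start, end, label) for every non-leaf node, in pre-order."""
    if node.is_leaf():
        return []
    spans = [(node.start, node.end, node.label)]
    for child in node.children:
        spans.extend(constituent_spans(child))
    return spans


# (S (NP (PRP She)) (VP (VBZ sleeps)))
root = Node("S", [
    Node("NP", [Node("PRP", [Node("She")])]),
    Node("VP", [Node("VBZ", [Node("sleeps")])]),
])
tokens = ["She", "sleeps"]
consumed = align(root, tokens)
spans = constituent_spans(root)
```

In a real integration, each `(start, end)` pair would index into the sentence's tokens, so that the matching Span could carry the subtree. This assumes stanza and spaCy agree on tokenization, which holds when spacy_stanza builds the Doc from stanza's own tokens.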

If you agree I can work on this from time to time.

@adrianeboyd
Contributor

Sorry for not getting back to you about this sooner. I think my main concern would be that it sounds like it's going to be relatively hard to use this annotation from a spaCy Doc. I haven't looked into how they store the constituency trees in detail, but using plain stanza with its original data structures sounds like it might be better from a usability perspective? What do you think are the advantages of having this in a spaCy Doc?

@BramVanroy
Contributor Author

A tree is an iterable of subtrees, with Words as the terminals and linguistic categories as the intermediate nodes. From that perspective, I was thinking of having a similar Tree structure in the spacy_stanza API that uses spaCy Tokens instead. You'd still be able to traverse the constituency tree as per the stanza API, but the terminals that you get out of it are spaCy Tokens. I might be biased, but this would be useful in my own research, where I want to use constituency trees on the one hand and spaCy's extensibility for my own components on the other.
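
Roughly, the shape I have in mind looks like the sketch below. Everything in it is hypothetical: `Tok` stands in for a spaCy Token (only an index and text), and `TokenTree` with its `leaves` and `subtrees` methods is an illustrative name, not something that exists in stanza or spacy_stanza. The point is only that iterating a node gives its children, and that the terminals are token objects rather than strings.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass(frozen=True)
class Tok:
    """Hypothetical stand-in for a spaCy Token (index and text only)."""
    i: int
    text: str


class TokenTree:
    """Hypothetical tree: category labels on inner nodes, Tok objects as terminals."""

    def __init__(self, label: str, children: List[Union["TokenTree", Tok]]):
        self.label = label
        self.children = children

    def __iter__(self) -> Iterator[Union["TokenTree", Tok]]:
        # Iterating a tree yields its direct children (subtrees or tokens).
        return iter(self.children)

    def leaves(self) -> Iterator[Tok]:
        """Yield the terminal tokens in sentence order."""
        for child in self:
            if isinstance(child, Tok):
                yield child
            else:
                yield from child.leaves()

    def subtrees(self) -> Iterator["TokenTree"]:
        """Yield this node and every nested constituent, in pre-order."""
        yield self
        for child in self:
            if isinstance(child, TokenTree):
                yield from child.subtrees()


# (S (NP (PRP She)) (VP (VBZ sleeps)))
she, sleeps = Tok(0, "She"), Tok(1, "sleeps")
tree = TokenTree("S", [
    TokenTree("NP", [TokenTree("PRP", [she])]),
    TokenTree("VP", [TokenTree("VBZ", [sleeps])]),
])
leaf_texts = [tok.text for tok in tree.leaves()]
labels = [sub.label for sub in tree.subtrees()]
```

With real spaCy Tokens as terminals, anything reachable from a Token (its lemma, custom extensions, and so on) would be available while walking the constituency tree.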

As always, if you don't think this would be useful for the wider user base, then we can close this topic.
