Add stanza constituency output #78

BramVanroy · 2021-11-08T08:31:27Z

Since release v1.3.0, stanza has a constituency parser for English. Support for more languages will follow. It would be great if we could access the constituency parse from within the spaCy wrapper too.

At first I thought I'd create a separate package for this that uses spacy_stanza under the hood and registers a custom component that adds the constituency parse. However, that implies either (1) copying most of spacy_stanza, subclassing StanzaTokenizer, and writing to Underscore objects in __call__, or (2) creating only a custom component that, after receiving a Doc, parses its text again with stanza to get the constituency parse. Neither of these is ideal, so I would hope that you are open to incorporating such functionality in spacy_stanza directly.

The stanza constituency parser adds a constituency object (a Tree) to every sentence. Things that may be considered:

- Subclassing the stanza constituency Tree so that we can use spaCy Tokens as node labels. This would allow us to navigate the spaCy sentence Span object via the tree.
- Registering ._.constituency for every sentence span and every span that is a constituent (a sub-classed stanza Tree).
- Registering ._.constituency for every Token, which would be its subtree in the full tree with itself as the node (a sub-classed stanza Tree).
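To make the `._.constituency` idea concrete, here is a minimal sketch of registering a custom extension attribute on sentence spans. It uses only spaCy's public extension API (`Span.set_extension`, `Span.has_extension`) on a blank English pipeline with a `sentencizer`. The bracketed string stored on each sentence is a placeholder and an assumption of this sketch: in the proposal it would be a (sub-classed) stanza Tree, and nothing here reflects how spacy_stanza actually works.

```python
import spacy
from spacy.tokens import Span

# Register the custom attribute once; the default is None until a parser fills it in.
if not Span.has_extension("constituency"):
    Span.set_extension("constituency", default=None)

# Blank pipeline plus rule-based sentence splitting, so no model download is needed.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("The cat sleeps. Dogs bark.")

for sent in doc.sents:
    # Placeholder value (assumption): a flat bracketing of the sentence tokens.
    # The proposal would store a Tree built from stanza's constituency output here.
    sent._.constituency = "(ROOT " + " ".join(tok.text for tok in sent) + ")"

first_sentence = next(iter(doc.sents))
print(first_sentence._.constituency)
```

The value is kept in the Doc's user data, keyed by the span's offsets, so it can be read back from any later `doc.sents` iteration over the same Doc.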

If you agree I can work on this from time to time.

adrianeboyd · 2021-12-17T10:14:00Z

Sorry for not getting back to you about this sooner. I think my main concern would be that it sounds like it's going to be relatively hard to use this annotation from a spaCy Doc. I haven't looked into how they store the constituency trees in detail, but using plain stanza with its original data structures sounds like it might be better from a usability perspective? What do you think are the advantages of having this in a spaCy Doc?

BramVanroy · 2021-12-17T12:19:44Z

A tree is an iterable of subtrees with ultimately Words as terminals and linguistic categories as intermediate nodes. From that perspective, I was thinking of having a similar Tree structure in the spacy_stanza API that used spaCy Tokens instead. You'd still be able to traverse the constituency tree as per stanza API but the terminals that you get out of it are spaCy tokens. I might be biased, but this would be useful in my own research where I want to use constituency trees on the one hand as well as spaCy's extensibility for my own components.
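The structure described above can be sketched in plain Python. This is an illustration only, not stanza's `Tree` class or any spacy_stanza API: `TokenTree`, `FakeToken`, and the helper functions are hypothetical names, and `FakeToken` stands in for a spaCy Token so the example runs without any models installed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token (text plus index in the Doc)."""
    text: str
    i: int


@dataclass
class TokenTree:
    """Hypothetical constituency node: a label, child subtrees, and a token on terminals."""
    label: str
    children: List["TokenTree"] = field(default_factory=list)
    token: Optional[FakeToken] = None  # only set on terminal nodes

    def is_leaf(self) -> bool:
        return not self.children

    def __iter__(self) -> Iterator["TokenTree"]:
        # A tree is an iterable of its direct subtrees.
        return iter(self.children)

    def leaves(self) -> Iterator[FakeToken]:
        # Depth-first, left-to-right walk that yields the tokens at the terminals.
        if self.is_leaf():
            if self.token is not None:
                yield self.token
        else:
            for child in self.children:
                yield from child.leaves()

    def span_indices(self) -> Tuple[int, int]:
        # (start, end) token offsets covered by this constituent, end exclusive.
        # Assumes the node covers at least one token.
        indices = [tok.i for tok in self.leaves()]
        return indices[0], indices[-1] + 1


def terminal(tok: FakeToken) -> TokenTree:
    return TokenTree(label=tok.text, token=tok)


def preterminal(tag: str, tok: FakeToken) -> TokenTree:
    return TokenTree(label=tag, children=[terminal(tok)])


the, cat, sleeps = FakeToken("The", 0), FakeToken("cat", 1), FakeToken("sleeps", 2)

# (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))
tree = TokenTree("S", [
    TokenTree("NP", [preterminal("DT", the), preterminal("NN", cat)]),
    TokenTree("VP", [preterminal("VBZ", sleeps)]),
])

np_node = tree.children[0]
print([child.label for child in tree])
print([tok.text for tok in np_node.leaves()], np_node.span_indices())
```

With real spaCy Tokens at the terminals, the offsets from `span_indices()` could be used to slice the Doc and recover the Span for any constituent, which is the kind of navigation the proposal has in mind.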

As always, if this does not seem useful for the wider user-base to you, then we can close this topic.
