Make numbered figures, chapters etc. flat in dep #69

Closed
amir-zeldes opened this issue Oct 23, 2020 · 11 comments

Comments

@amir-zeldes
Owner

Currently phrases like "Figure 1" have many analyses:

  • nummod
  • dep
  • flat
  • amod

Choose one for consistency (prob. flat?)

https://corpling.uis.georgetown.edu/annis/#_q=ImVudGl0aWVzIg&_c=R1VN&cl=5&cr=5&s=0&l=10

@amir-zeldes
Owner Author

Going with dep due to cases like "Chapters 10-15", which don't work as flat.

@amir-zeldes
Owner Author

Note that these are nummod in EWT, which I don't think is right... adding @sebschu @nschneid - maybe change in EWT as well? Known lemmas in this construction:

  • section
  • part
  • chapter
  • volume
  • method
  • table
  • figure
  • listing (like a numbered code listing, occurs in GUM academic)
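A minimal sketch of how the competing analyses could be tallied, assuming tab-separated CoNLL-U input. The sample sentences, the function names and the word-order test (the number follows its head) are illustrative assumptions, not part of the GUM build scripts.

```python
from collections import Counter

# Lemmas listed in this thread.
LEMMAS = {"section", "part", "chapter", "volume", "method", "table", "figure", "listing"}

# Hypothetical sample; columns are written with spaces for readability and
# converted to the tab-separated CoNLL-U layout below.
RAW = """
1 See see VERB VB _ 0 root _ _
2 Figure figure NOUN NN _ 1 obj _ _
3 1 1 NUM CD _ 2 nummod _ _

1 Table table NOUN NN _ 3 nsubj _ _
2 2 2 NUM CD _ 1 dep _ _
3 shows show VERB VBZ _ 0 root _ _
4 3 3 NUM CD _ 5 nummod _ _
5 houses house NOUN NNS _ 3 obj _ _
"""
SAMPLE = "\n".join("\t".join(line.split()) for line in RAW.splitlines())


def parse_conllu(text):
    """Yield each sentence as a list of token dicts (basic token rows only)."""
    sent = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if sent:
                yield sent
                sent = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        sent.append({"id": int(cols[0]), "lemma": cols[2], "upos": cols[3],
                     "head": int(cols[6]), "deprel": cols[7]})
    if sent:
        yield sent


def tally_label_deprels(text):
    """Count the deprels of NUM tokens that follow a head with a listed lemma."""
    counts = Counter()
    for sent in parse_conllu(text):
        by_id = {tok["id"]: tok for tok in sent}
        for tok in sent:
            head = by_id.get(tok["head"])
            if (head is not None and tok["upos"] == "NUM"
                    and tok["id"] > head["id"]
                    and head["lemma"].lower() in LEMMAS):
                counts[tok["deprel"]] += 1
    return counts


print(tally_label_deprels(SAMPLE))
```

On the sample, "Figure 1" and "Table 2" are counted (one nummod, one dep), while the counting use "3 houses" is skipped because the number precedes its head.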

@nschneid
Contributor

Recall UniversalDependencies/docs/issues/654: nummod seems to be the policy for now.

What to do with non-quantity modifier numbers also falls under the broader discussion of nominal-nominal relations, I suppose (#71).

@amir-zeldes
Owner Author

Does anyone really like that? I'm mainly seeing you and other people being sympathetic to changing nummod in cases where nothing is being counted. Otherwise what does it really contribute beyond the NUM tag which is already available from POS?

@amir-zeldes
Owner Author

TBC if we endorse this guideline, we are saying that syntactically "3 houses" and "house 3" are the same construction in UD

amir-zeldes added a commit to UniversalDependencies/UD_English-GUM that referenced this issue Oct 30, 2020
  * Totally reviewed entity and coreference information
  * Added discourse dependency annotations
  * Moved Typo from MISC to FEATS
  * Issues addressed:
    * amir-zeldes/gum#71
    * amir-zeldes/gum#69
    * amir-zeldes/gum#66
    * amir-zeldes/gum#65
    * UniversalDependencies/UD_English-EWT#101
    * UniversalDependencies/UD_English-EWT#99
    * #5
    * #4
amir-zeldes added a commit to UniversalDependencies/UD_English-GUMReddit that referenced this issue Oct 31, 2020
  * Totally reviewed entity and coreference information
  * Added discourse dependency annotations
  * Moved Typo from MISC to FEATS
  * Issues addressed:
    * amir-zeldes/gum#71
    * amir-zeldes/gum#69
    * amir-zeldes/gum#66
    * amir-zeldes/gum#65
    * UniversalDependencies/UD_English-EWT#101
    * UniversalDependencies/UD_English-EWT#99
    * #5
    * #4
@nschneid
Contributor

nschneid commented Nov 5, 2020

> TBC if we endorse this guideline, we are saying that syntactically "3 houses" and "house 3" are the same construction in UD

Well, there are lots of constructions that UD lumps under the same deprel even though a finer-grained annotation scheme might distinguish them. Sometimes subtypes help with the distinction, as in nmod:poss and acl:relcl.

We can say that the two most important things UD cares about are the types of things being related (e.g. clause vs. nominal), and which is the head. At this level both "3 houses" and "house 3" are similar. But obviously the word order is semantically significant, and morphosyntactically (agreement) they are different. I would like to see some sort of distinction that involves a clear limitation on the scope of nummod but I'm not sure typologically what the criteria should be: limited to quantity-like numeric modifiers (and possibly extension of similar morphosyntax to other uses)?

BTW, I'm not sure SD would distinguish them either. The SD guidelines say:

> num: numeric modifier
>
> A numeric modifier of a noun is any number phrase that serves to modify the meaning of the noun.

...which would seem to fit both.

@nschneid
Contributor

nschneid commented Nov 5, 2020

Or, we could say that assigning even a temporary label to uniquely identify an instance of an entity makes it like a proper name rather than a "free" use of a number, so flat should apply. Then:

figure 1 = flat(figure, 1)—portrays "figure 1" as the figure's official name (at least in the context)

figure number 1 = compound(figure, number), flat(number, 1)? (Or however "French actor Ulliel" is handled for the relationship between "figure" and "number": #71.) This portrays "number 1" as the figure's official name and the word "figure" as a descriptor.

@amir-zeldes
Owner Author

The problem with flat is cases like this:

Chapters 3 - 5

If you want 3 to govern 5, then 3 can't be flat, since that label can't have children. Because of this, I finally went with dep for the current UD release (already in UD dev for the freeze).
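The constraint can be sketched as a small check. The attachments below are one hypothetical analysis of "Chapters 3 - 5" (5 attached to 3 as nmod, the dash as case), and `flat_with_children` is an illustrative helper, not an existing validator function.

```python
def flat_with_children(tokens):
    """Return the forms of tokens labeled 'flat' that govern another token.

    tokens: list of (id, form, head, deprel) tuples for one sentence.
    """
    governing_ids = {head for _, _, head, _ in tokens}
    return [form for tid, form, _, rel in tokens
            if rel == "flat" and tid in governing_ids]


# Hypothetical analysis with 3 as a flat dependent that governs 5.
as_flat = [
    (1, "Chapters", 0, "root"),
    (2, "3", 1, "flat"),
    (3, "-", 4, "case"),
    (4, "5", 2, "nmod"),
]

# Same tree with 3 relabeled as dep.
as_dep = [
    (1, "Chapters", 0, "root"),
    (2, "3", 1, "dep"),
    (3, "-", 4, "case"),
    (4, "5", 2, "nmod"),
]

print(flat_with_children(as_flat))  # ['3']
print(flat_with_children(as_dep))   # []
```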

As for "house 3", if we use the naive definition of nummod as "any modifier that is a number", we'll end up with a different analysis for "house A", since "A" is not a number (and not tagged NUM).
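To make the contrast concrete, here is a rough sketch comparing a rule keyed on the modifier's POS tag with one keyed on word order. Both functions are hypothetical simplifications; the NOUN tag for "A" is an assumption (any non-NUM tag gives the same result), and real annotation would also restrict the head to nouns in this construction.

```python
def pos_keyed(mod_upos):
    """Naive rule: any modifier tagged NUM is nummod; otherwise the rule is silent."""
    return "nummod" if mod_upos == "NUM" else None


def order_keyed(mod_upos, follows_head):
    """Simplified alternative: a label after the head is dep, a preceding number is nummod."""
    if follows_head:
        return "dep"
    return "nummod" if mod_upos == "NUM" else None


# (phrase, UPOS of the modifier, does the modifier follow the head?)
examples = [
    ("3 houses", "NUM", False),
    ("house 3", "NUM", True),
    ("house A", "NOUN", True),  # tag assumed for illustration
]

for phrase, upos, follows in examples:
    print(phrase, "| POS-keyed:", pos_keyed(upos),
          "| order-keyed:", order_keyed(upos, follows))
```

Under the POS-keyed rule "house 3" and "house A" come apart, while the order-keyed rule groups them and keeps "3 houses" separate.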

@nschneid
Contributor

nschneid commented Nov 5, 2020

Good point, it's a "chapter X", and if flat dependents are unable to contain compositional subtrees it's a problem. Same with "house 3" or "house A", where the label may or may not be a numeric form.

If I had to choose a label other than dep, then, I'd probably go with left-headed compound. But it might be worth having a subtype as this is really not the garden variety compound construction.

@amir-zeldes
Owner Author

In principle a new subtype might be optimal, but I hesitate about suggesting subtypes since most parsers target them as separate labels (for example Stanza's pretrained models output distinct subtypes). Introducing very sparsely attested subtypes could really mess with automatic parsers trained on the data, so my preference has been to add things into FEATS (which is admittedly predicted by NLP tools, but separately) or even MISC. For now I can live with 'dep', which has the advantage of being a garbage can category we can sift through in the future and reassign.
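The sparsity concern can be illustrated with a quick frequency check. This is only a sketch: the label counts are invented, `compound:label` is a hypothetical subtype name, and the threshold of 50 is arbitrary.

```python
from collections import Counter


def rare_subtypes(deprels, min_count=50):
    """Return subtyped labels (those containing ':') seen fewer than min_count times."""
    counts = Counter(deprels)
    return sorted(label for label, n in counts.items()
                  if ":" in label and n < min_count)


# Invented distribution; 'compound:label' stands in for a sparsely attested subtype.
labels = (["nsubj"] * 300 + ["nmod:poss"] * 120
          + ["acl:relcl"] * 80 + ["compound:label"] * 7)

print(rare_subtypes(labels))  # ['compound:label']
```

A parser that treats each subtype as its own class would have only those few training instances for the new label.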

@amir-zeldes
Owner Author

Handled consistently as dep in 6.2.0


2 participants