Attachment of list item enumerators #518

nschneid · 2024-04-27T17:13:35Z

Sequential markers like "1.", "(a)", and so forth lacked a good policy for how they should attach, but this was just clarified as discourse: UniversalDependencies/docs#1027

I will update EWT, where they are currently nummod. I tried several approaches to query these—sentence-initial nummods, nummods modifying a non-nominal, etc. The approach that worked best was to query for nummods with ".", ")", "]" immediately after the number:

https://universal.grew.fr/?custom=662d31829073f

This excludes NUM-headed nummods, which are area codes in telephone numbers (this should be fixed separately).
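For anyone who wants to approximate this query offline, here is a minimal Python sketch. It is not the Grew query linked above: the function names and the sample rows are hypothetical, and it only mirrors the description (a `nummod` immediately followed by ".", ")" or "]", skipping NUM-headed cases).

```python
CLOSERS = {".", ")", "]"}

def parse_sentence(block):
    """Parse one CoNLL-U sentence into token dicts (plain integer IDs only)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        # Skip comments, malformed lines, multiword ranges (1-2), empty nodes (1.1).
        if line.startswith("#") or len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append({"id": int(cols[0]), "form": cols[1], "upos": cols[3],
                       "xpos": cols[4], "head": int(cols[6]), "deprel": cols[7]})
    return tokens

def enumerator_candidates(tokens):
    """Return nummod tokens directly followed by a closer, unless NUM-headed."""
    by_id = {t["id"]: t for t in tokens}
    hits = []
    for i, tok in enumerate(tokens[:-1]):
        if tok["deprel"] != "nummod":
            continue
        if tokens[i + 1]["form"] not in CLOSERS:
            continue
        head = by_id.get(tok["head"])
        if head is not None and head["upos"] == "NUM":
            continue  # e.g. area codes in telephone numbers
        hits.append(tok)
    return hits

# Hypothetical fragment modeled on the "(a) Schedule" example in this thread.
SAMPLE = "\n".join([
    "1\t(\t(\tPUNCT\t-LRB-\t_\t2\tpunct\t_\tSpaceAfter=No",
    "2\ta\ta\tNUM\tLS\t_\t4\tnummod\t_\tSpaceAfter=No",
    "3\t)\t)\tPUNCT\t-RRB-\t_\t2\tpunct\t_\t_",
    "4\tSchedule\tSchedule\tNOUN\tNN\t_\t0\troot\t_\t_",
])

print([t["form"] for t in enumerator_candidates(parse_sentence(SAMPLE))])
```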

In GUM they are dep. Because GUM has more genres than EWT I would guess the punctuation associated with enumerators (if any) will be more varied. But the LS tag can also help identify them.

nschneid · 2024-04-27T17:23:23Z

Actually, there are 4 cases in EWT where it has a following ")" but no LS, and these are cross-references so should not be discourse. It appears all instances of non-root LS should have their deprel changed to discourse.
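The rule in the last sentence could be sketched roughly as below. This is an illustration only, not the script used for the actual commit: `relabel_ls` is a hypothetical name, and updating the matching edge in the enhanced DEPS column is an assumption about what a consistent change would need.

```python
def relabel_ls(line):
    """Change the deprel of a non-root LS token line to 'discourse'."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    if line.startswith("#") or len(cols) != 10:
        return line  # comment, blank, or malformed line: leave untouched
    xpos, head, deprel = cols[4], cols[6], cols[7]
    if xpos != "LS" or deprel in ("root", "discourse"):
        return line
    old_edge = f"{head}:{deprel}"
    cols[7] = "discourse"
    # Assumption: mirror the change on the enhanced edge that copies the basic one.
    cols[8] = "|".join(f"{head}:discourse" if edge == old_edge else edge
                       for edge in cols[8].split("|"))
    return "\t".join(cols)

before = "13\ta\ta\tNUM\tLS\t_\t15\tnummod\t15:nummod\tSpaceAfter=No"
print(relabel_ls(before))
```

Because the filter keys on the LS tag rather than on the following ")", cross-references without LS are left as they are.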

AngledLuffa · 2024-04-27T22:58:11Z

Sounds good, but would you add a few words on the appropriate UPOS tagging? In EWT we get the tags 1_NUM )_PUNCT, whereas in GUM that becomes a single token with the X tag:

# sent_id = GUM_interview_herrick-48
# text = You either 1) sacrifice on breadth
3       1)      1)      X       LS

# sent_id = GUM_academic_replication-12
# text = The severe concerns underpinning the alleged crisis have several dimensions relating to: (a) the (small) amount
14      (a)     (a)     X       LS

AngledLuffa · 2024-04-27T23:07:18Z

Also, the tag on a) might not necessarily be NUM, although that's still how it's done in EWT:

# sent_id = email-enronsent36_02-0033
# newpar id = email-enronsent36_02-p0005
# text = Attached for your review is a blacklined version of the: (a) Schedule and (b) Paragraph 13 to the ISDA Master Agreement.
12      (       (       PUNCT   -LRB-   _       13      punct   13:punct        SpaceAfter=No
13      a       a       NUM     LS      _       15      nummod  15:nummod       SpaceAfter=No
14      )       )       PUNCT   -RRB-   _       13      punct   13:punct        _

nschneid · 2024-04-27T23:30:21Z

Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

nschneid · 2024-04-28T00:51:55Z

Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that.

AngledLuffa · 2024-04-28T01:00:29Z

Should be tokenized though, no? That's the standard in PTB and EWT.

nschneid · 2024-04-28T01:16:52Z

GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO.

AngledLuffa · 2024-04-28T01:38:05Z

Because of the coref misc annotations or because it's part of multiple annotation layers outside UD? I don't think it would be an impossible task to retokenize and it would make things more consistent

nschneid · 2024-04-28T01:53:00Z

I think @amir-zeldes is happy with the GUM tokenization of list item markers as it is easier on annotators (who do it manually and then don't have to go through the effort of attaching punctuation in the tree). For EWT I don't want to mess with LDC tokenization as it will break compatibility with Penn trees.

amir-zeldes · 2024-04-29T14:40:15Z

Yeah, I basically think the decision to tokenize parts of a marker like "a.)" is wrong. It's confusing to me, leads to unmatched brackets and ambiguous period tokens, and you only end up reattaching them as punct for no real gain that I can see. Are there stats on what other corpora/languages do with these?

As for POS, I'd be willing to consider NUM for things that are numeric. Is there a regex you have in mind for what to include in that? I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks, little pointing hands and the like; I think those should be SYM. (X was a legacy thing; I can't remember what we were imitating there.)
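No regex was actually proposed in this thread. Purely as an illustration of what such a pattern might look like for GUM-style single-token markers, here is a hypothetical sketch. The pattern, the group names, and the choice to treat a lone "i", "v" or "x" as a letter rather than a Roman numeral are all assumptions.

```python
import re

# Optional opening bracket, then digits, a multi-character Roman numeral, or a
# single letter, then an optional period and optional closing bracket. The
# lookahead requires at least one closing delimiter so bare "3" or "a" is rejected.
ENUM_RE = re.compile(
    r"^(?=.*[.)\]])[(\[]?"
    r"(?:(?P<digit>\d+)|(?P<roman>[ivxIVX]{2,})|(?P<letter>[A-Za-z]))"
    r"\.?[)\]]?$"
)

def classify_enumerator(form):
    """Return 'digit', 'roman', or 'letter' for an enumerator-like token, else None."""
    match = ENUM_RE.match(form)
    return match.lastgroup if match else None

for form in ["1)", "(a)", "a.)", "iv.", "3", "•"]:
    print(form, classify_enumerator(form))
```

Bullets and other symbols fall through to `None`, which keeps them out of any NUM retagging.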

nschneid · 2024-04-29T14:54:39Z

> I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks

You mean bullets? The guidelines say PUNCT for those as they are not pronounced, and given the disagreement about this among the core group the conclusion was to stick with the status quo.

amir-zeldes · 2024-04-29T20:35:14Z

UPOS is not super important for me so I wouldn't fight for that too much, but I think "discourse" for numerical LS but "punct" for symbols is wrong/potentially confusing for parsing models, so I don't want to implement that. I don't suppose anyone wants to use PUNCT/discourse for bullets?

nschneid · 2024-04-29T20:55:47Z

> I don't suppose anyone wants to use PUNCT/discourse for bullets?

Nope.

There's no perfect solution that everybody likes but it's better to have a solution.

amir-zeldes · 2024-04-30T15:33:53Z

No question, just still seems wrong to me. So are we doing discourse for this release already?

nschneid · 2024-04-30T17:24:59Z

Yes

amir-zeldes · 2024-04-30T17:55:09Z

OK, so if upos for non-bullets is NUM a la Chris, what is the NumForm for things like (a)? I assume NumType is Ord right?

nschneid · 2024-04-30T19:36:07Z

Ord seems odd because it usually corresponds to a suffix, but in any case I'm not going to mess with whatever is in EWT.

amir-zeldes · 2024-04-30T20:04:38Z

Looks like EWT has it as upos NUM with Card + Digit for numerical ones like "(1)", and upos NUM also for "(A)", but with no NumType or NumForm... That doesn't seem right/would ruin the current state in GUM where NUM guarantees that we have a NumType and NumForm.

I'm happy to change all LS that are not bullets and have some kind of ordering meaning to NUM, but then I think they should have a NumType and NumForm - would you be OK adding that to EWT?
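The guarantee described here (every NUM token carries both NumType and NumForm) can be checked mechanically. The following is a minimal sketch with hypothetical function names and made-up sample rows, not a script from either treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict ('_' means no features)."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

def num_missing_feats(conllu_text):
    """List (id, form, missing_features) for NUM tokens lacking NumForm or NumType."""
    problems = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 10:
            continue
        if cols[3] != "NUM":
            continue
        feats = parse_feats(cols[5])
        missing = [name for name in ("NumForm", "NumType") if name not in feats]
        if missing:
            problems.append((cols[0], cols[1], missing))
    return problems

# Hypothetical rows: a numeric enumerator with features, a letter one without.
SAMPLE = "\n".join([
    "1\t1\t1\tNUM\tLS\tNumForm=Digit|NumType=Card\t3\tdiscourse\t_\t_",
    "2\tA\tA\tNUM\tLS\t_\t3\tdiscourse\t_\t_",
    "3\titem\titem\tNOUN\tNN\tNumber=Sing\t0\troot\t_\t_",
])

print(num_missing_feats(SAMPLE))
```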

nschneid · 2024-04-30T20:36:39Z

Looks like this is #465, and we were waiting to try to figure out a complete solution to LS. :) I'll comment there.

rueter · 2024-05-01T07:11:29Z

> Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

It is interesting that ordinal numerals (whose function is to indicate sequential order) are tagged as ADJ. Hence, I see no reason that sequence indicators should be labeled as quantifiers, but, as usual, I am probably not seeing the whole picture. Can you elaborate?

nschneid · 2024-05-01T13:05:33Z

True, but we generally try to keep a uniform UPOS even where a word has a slightly different function (at least if the form and meaning of the word itself is the same). "3 books" and "3) books" show different functions but they draw on a shared concept of 'three'. Ordinals are actually spelled differently ("third", "3rd") so there is less pressure to keep the same UPOS I suppose.

nschneid added a commit that referenced this issue Apr 27, 2024

list item enumerators (LS) deprel: nummod->discourse (#518)

eb3f090

nschneid added the done except for GUM label Apr 27, 2024

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Apr 27, 2024

Change the dependency relation of list items to discourse instead of …

98aaa2a

…nummod, as described in UniversalDependencies/UD_English-EWT#518

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Apr 28, 2024

Change the dependency relation of list items to discourse instead of …

835e708

…nummod, as described in UniversalDependencies/UD_English-EWT#518

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue May 19, 2024

Change the dependency relation of list items to discourse instead of …

2858276

…nummod, as described in UniversalDependencies/UD_English-EWT#518
