Attachment of list item enumerators #518

Open
1 task done
nschneid opened this issue Apr 27, 2024 · 21 comments

Comments

@nschneid
Contributor

nschneid commented Apr 27, 2024

Sequential markers like "1.", "(a)", and so forth previously lacked a clear attachment policy, but it has just been clarified that they attach as discourse: UniversalDependencies/docs#1027

I will update EWT, where they are currently nummod. I tried several approaches to query these—sentence-initial nummods, nummods modifying a non-nominal, etc. The approach that worked best was to query for nummods with ".", ")", "]" immediately after the number:

This excludes NUM-headed nummods, which are area codes in telephone numbers (this should be fixed separately).

In GUM they are dep. Because GUM has more genres than EWT I would guess the punctuation associated with enumerators (if any) will be more varied. But the LS tag can also help identify them.
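
The query itself is not reproduced above. As a rough sketch only, the search described (a nummod immediately followed by ".", ")", or "]", skipping nummods whose head is itself NUM) could be approximated in plain Python over CoNLL-U text. The function names `parse_sentences` and `enumerator_candidates` are hypothetical, and this is not necessarily how the EWT query was actually run.

```python
CLOSERS = {".", ")", "]"}  # punctuation expected right after an enumerator


def parse_sentences(conllu_text):
    """Yield each sentence as a list of 10-column token rows.

    Comments, multiword-token ranges (e.g. 3-4) and empty nodes (e.g. 3.1)
    are skipped.
    """
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        sentence.append(cols)
    if sentence:
        yield sentence


def enumerator_candidates(sentence):
    """Return nummod tokens whose next token is ".", ")" or "]".

    nummods headed by a NUM token (e.g. telephone area codes) are excluded.
    """
    by_id = {row[0]: row for row in sentence}
    hits = []
    for row, nxt in zip(sentence, sentence[1:]):
        if row[7] != "nummod" or nxt[1] not in CLOSERS:
            continue
        head_row = by_id.get(row[6])
        if head_row is not None and head_row[3] == "NUM":
            continue  # NUM-headed nummod: handled separately
        hits.append(row)
    return hits
```

Each hit is the full token row, so the ID, FORM, and XPOS columns are available for manual review of the candidates.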

@nschneid
Contributor Author

Actually, there are 4 cases in EWT where the nummod has a following ")" but no LS tag; these are cross-references, so they should not be discourse. It appears that all instances of non-root LS should have their deprel changed to discourse.
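
A minimal sketch of that relabeling, assuming standard 10-column CoNLL-U input: every token whose XPOS is LS and whose deprel is not root gets deprel discourse, and the matching entry in the enhanced DEPS column is updated alongside it. The function name `relabel_ls_as_discourse` is hypothetical, and this is not the script actually used for EWT.

```python
def relabel_ls_as_discourse(conllu_text):
    """Set DEPREL to "discourse" for every non-root token with XPOS "LS"."""
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and cols[0].isdigit()
        if is_token and cols[4] == "LS" and cols[7] != "root":
            head, old_deprel = cols[6], cols[7]
            cols[7] = "discourse"
            if cols[8] != "_":
                # Mirror the change in the enhanced dependency for the same head.
                old_edge = f"{head}:{old_deprel}"
                new_edge = f"{head}:discourse"
                cols[8] = "|".join(
                    new_edge if dep == old_edge else dep
                    for dep in cols[8].split("|")
                )
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)
```

Comment lines and tokens without the LS tag pass through unchanged, so the cross-reference cases mentioned above (no LS) would be left as they are.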

@AngledLuffa
Contributor

Sounds good, but would you add a few words on the appropriate UPOS tagging? In EWT we get the tags 1_NUM )_PUNCT, whereas in GUM that becomes a single token with the X tag:

# sent_id = GUM_interview_herrick-48
# text = You either 1) sacrifice on breadth
3       1)      1)      X       LS 
# sent_id = GUM_academic_replication-12
# text = The severe concerns underpinning the alleged crisis have several dimensions relating to: (a) the (small) amount
14      (a)     (a)     X       LS

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Apr 27, 2024
@AngledLuffa
Contributor

Also, the tag on a) might not necessarily be NUM, although that's still how it's done in EWT:

# sent_id = email-enronsent36_02-0033
# newpar id = email-enronsent36_02-p0005
# text = Attached for your review is a blacklined version of the: (a) Schedule and (b) Paragraph 13 to the ISDA Master Agreement.
12      (       (       PUNCT   -LRB-   _       13      punct   13:punct        SpaceAfter=No
13      a       a       NUM     LS      _       15      nummod  15:nummod       SpaceAfter=No
14      )       )       PUNCT   -RRB-   _       13      punct   13:punct        _

@nschneid
Contributor Author

Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

@nschneid
Contributor Author

Oh I see you asked about "(1)" as well. I would definitely advocate NUM for that.

@AngledLuffa
Contributor

AngledLuffa commented Apr 28, 2024 via email

@nschneid
Contributor Author

GUM doesn't. The tokenization varies by treebank and it would be impractical to change the tokenization IMO.

@AngledLuffa
Contributor

AngledLuffa commented Apr 28, 2024 via email

@nschneid
Contributor Author

I think @amir-zeldes is happy with the GUM tokenization of list item markers as it is easier on annotators (who do it manually and then don't have to go through the effort of attaching punctuation in the tree). For EWT I don't want to mess with LDC tokenization as it will break compatibility with Penn trees.

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue Apr 28, 2024
@amir-zeldes
Contributor

Yeah, I basically think the decision to tokenize parts of a marker like "a.)" is wrong: it's confusing to me, it leads to unmatched brackets and ambiguous period tokens, and you only end up reattaching them as punct for no real gain that I can see. Are there stats on what other corpora/languages do with these?

As for POS, I'd be willing to consider NUM for things that are numeric. Is there a regex you have in mind for what to include in that? I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks, little pointing hands and the like; I think those should be SYM (X was a legacy thing, I can't remember what we were imitating there).
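
No regex is proposed in the thread. Purely as an illustration of what one might look like for single-token markers of the GUM kind, the pattern below accepts digits, a single letter, or a roman numeral, with an optional opening bracket and a required closing ".", ")", "]" or ".)". The names `ENUMERATOR`, `is_enumerator`, and `enumerator_body` are hypothetical, and the coverage is an assumption rather than an agreed definition.

```python
import re

# Hypothetical pattern for single-token list enumerators such as
# "1.", "1)", "(1)", "(a)", "(A)", "a.)" or "iv."
ENUMERATOR = re.compile(
    r"""
    [(\[]?                                          # optional opening bracket
    (?P<body>\d+|[A-Za-z]|[ivxlcdm]+|[IVXLCDM]+)    # digits, one letter, or roman numeral
    (?:\.\)|[.)\]])                                 # closing period, bracket, or both
    """,
    re.VERBOSE,
)


def is_enumerator(form):
    """True if the whole token form looks like a list enumerator."""
    return ENUMERATOR.fullmatch(form) is not None


def enumerator_body(form):
    """Return the ordering part of the marker (e.g. "a" for "(a)"), or None."""
    match = ENUMERATOR.fullmatch(form)
    return match.group("body") if match else None
```

Because a closing character is required, a bare "3" is not matched, and bullets such as "*" are excluded. The roman-numeral branch will also accept ordinary words made only of those letters followed by a period (for example "mix."), so the XPOS tag LS would still be needed as a filter.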

@nschneid
Contributor Author

> I'm pretty strongly against PUNCT for non-ordinal marker tokens, like asterisks

You mean bullets? The guidelines say PUNCT for those as they are not pronounced, and given the disagreement about this among the core group the conclusion was to stick with the status quo.

@amir-zeldes
Contributor

UPOS is not super important for me, so I wouldn't fight for that too much, but I think "discourse" for numerical LS but "punct" for symbols is wrong and potentially confusing for parsing models, so I don't want to implement that. I don't suppose anyone wants to use PUNCT/discourse for bullets?

@nschneid
Contributor Author

> I don't suppose anyone wants to use PUNCT/discourse for bullets?

Nope.

There's no perfect solution that everybody likes but it's better to have a solution.

@amir-zeldes
Contributor

No question, just still seems wrong to me. So are we doing discourse for this release already?

@nschneid
Contributor Author

Yes

@amir-zeldes
Contributor

OK, so if UPOS for non-bullets is NUM a la Chris, what is the NumForm for things like (a)? I assume NumType is Ord, right?

@nschneid
Contributor Author

Ord seems odd because it usually corresponds to a suffix, but in any case I'm not going to mess with whatever is in EWT.

@amir-zeldes
Contributor

Looks like EWT has it as upos NUM with Card + Digit for numerical ones like "(1)", and upos NUM also for "(A)", but with no NumType or NumForm... That doesn't seem right/would ruin the current state in GUM where NUM guarantees that we have a NumType and NumForm.

I'm happy to change all LS that are not bullets and have some kind of ordering meaning to NUM, but then I think they should have a NumType and NumForm - would you be OK adding that to EWT?
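
The invariant mentioned for GUM (every NUM token carries both NumType and NumForm) is easy to check mechanically. The following is a minimal sketch assuming standard 10-column CoNLL-U input. The helper names `parse_feats` and `num_missing_feats` are hypothetical, and this is not the validation either treebank actually runs.

```python
def parse_feats(feats):
    """Turn a FEATS string like "NumForm=Digit|NumType=Card" into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def num_missing_feats(conllu_text, required=("NumType", "NumForm")):
    """List (token id, form, missing features) for NUM tokens lacking any required feature."""
    problems = []
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # comment, blank line, multiword range, or empty node
        if cols[3] != "NUM":
            continue
        feats = parse_feats(cols[5])
        missing = [name for name in required if name not in feats]
        if missing:
            problems.append((cols[0], cols[1], missing))
    return problems
```

Run over a file, a token like the EWT "a" in "(a)" (UPOS NUM with empty FEATS) would be reported as missing both features, while a "(1)" token with Card and Digit would pass.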

@nschneid
Contributor Author

Looks like this is #465, and we were waiting to try to figure out a complete solution to LS. :) I'll comment there.

@rueter

rueter commented May 1, 2024

> Chris said he would keep NUM for "(a)" etc. (it functions like a number in indicating sequential order). I think X is wrong because it suggests it's somehow not a real word of English. I wouldn't mind a broadened SYM, but the group did not come to an agreement on the UPOS, so SYM remains officially restricted to non-alphanumerics.

It is interesting that ordinal numerals (whose function is to indicate sequential order) are tagged as ADJ. Hence, I see no reason that sequence indicators should be labeled as quantifiers, but, as usual, I am probably not seeing the whole picture. Can you elaborate?

@nschneid
Contributor Author

nschneid commented May 1, 2024

True, but we generally try to keep a uniform UPOS even where a word has a slightly different function (at least if the form and meaning of the word itself are the same). "3 books" and "3) books" show different functions, but they draw on a shared concept of 'three'. Ordinals are actually spelled differently ("third", "3rd"), so there is less pressure to keep the same UPOS, I suppose.

AngledLuffa added a commit to stanfordnlp/CoreNLP that referenced this issue May 19, 2024