Tokenization of space-separated ellipsis. #988

Open
rhdunn opened this issue Nov 4, 2023 · 7 comments

@rhdunn

rhdunn commented Nov 4, 2023

There are generally 3 ways to specify an ellipsis in text:

  1. as a sequence of 3 (or more) full-stop/period characters without spaces between them, e.g. ...;
  2. as a sequence of 3 (or more) full-stop/period characters with spaces between them, e.g. . . .;
  3. as a Unicode ellipsis character, e.g. … (U+2026).
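
A minimal sketch of how the three forms listed above could be recognized with one pattern, assuming Python's standard `re` module. The pattern and the `find_ellipses` name are illustrative only and are not taken from EWT or any UD tool; mixed spacing such as `.. .` is deliberately not handled.

```python
import re

# Hypothetical pattern covering the three forms described above.
ELLIPSIS = re.compile(
    r"\u2026"            # form 3: the Unicode ellipsis character
    r"|\.(?: \.){2,}"    # form 2: three or more dots separated by single spaces
    r"|\.{3,}"           # form 1: three or more adjacent dots
)

def find_ellipses(text):
    """Return (start, end, matched_text) for each ellipsis-like span in text."""
    return [(m.start(), m.end(), m.group()) for m in ELLIPSIS.finditer(text)]

if __name__ == "__main__":
    for span in find_ellipses("Wait... what . . . now\u2026"):
        print(span)
```

Whether each matched span should then become one token or several is exactly the question raised in (a) below; the sketch only locates the spans.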

a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

My preference is yes, as they are linguistically a single punctuation token and can be substituted for any of the other forms while remaining equivalent.

b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence?

Linguistically, I would say it is the latter, but that makes it difficult to tokenize in a single pass (although you have that issue with abbreviations such as "Miss. Austen wrote English fiction.").
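
To make the second option in (b) concrete, here is a rough sketch that splits a sentence-final run of n dots (n >= 4) into an ellipsis of n-1 dots plus a final period. The `split_final_dots` helper is hypothetical, and it assumes the input has already been segmented into sentences, which is the part that is hard to do in a single pass.

```python
import re

# Hypothetical pattern: 4+ adjacent dots, or 4+ dots separated by single
# spaces, at the very end of an already-segmented sentence.
FINAL_DOTS = re.compile(r"(?:\.{4,}|\.(?: \.){3,})$")

def split_final_dots(sentence):
    """Return (prefix, ellipsis, final_period), or None if no such run exists."""
    m = FINAL_DOTS.search(sentence)
    if m is None:
        return None
    run = m.group()
    # Drop the last dot (and any space before it) to get the n-1 dot ellipsis.
    ellipsis = run[:-1].rstrip()
    return sentence[:m.start()], ellipsis, "."

if __name__ == "__main__":
    print(split_final_dots("I wonder...."))
    print(split_final_dots("I wonder . . . ."))
```

The sketch cannot tell whether the last dot really closes the sentence; it only shows what the n-1 plus 1 split would produce.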

NOTE: EWT has several single-token ellipses that are labelled as SYM+NFP instead of PUNCT+. or PUNCT+, like the other ellipsis tokens.

@nschneid
Contributor

nschneid commented Nov 4, 2023

Thanks for pointing these out.

> a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

The UD tokenization/word segmentation policy is very strict about prohibiting spaces within syntactic words (i.e. units that have a dependency relation). As far as I'm aware, the only exceptions are 1) languages like Vietnamese where spaces indicate syllable rather than word boundaries, and 2) spaces within numerals for readability like "1 000 000" (where other orthographies might use commas or periods as separators). IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

> b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence?

I don't necessarily have a strong opinion on this, and the policy may vary depending on whether it is a well-edited genre. Considering that EWT consists of web text, I wouldn't expect the use of three vs. four (or more) dots to be completely standard. We see things like

  • Moral of the story: Don't drink Coke..........drink Pepsi!

How many ellipses and/or periods is that? For EWT, at least, I'm happy with the current, simple approach that lumps them all together as one PUNCT token.

> NOTE: EWT has several single-token ellipses that are labelled as SYM+NFP instead of PUNCT+. or PUNCT+, like the other ellipsis tokens.

I'm just seeing these 5 which are standalone "sentences". Maybe this should be addressed as part of UniversalDependencies/UD_English-EWT#415.


As a general matter, I'd say UD is concerned with morphosyntax proper and less developed when it comes to issues like punctuation. If there are simple ways to make the analysis of punctuation cleaner/more consistent, then great, but we are cautious about departing from standards assumed by tokenizers—it will cause problems for parsers.

@amir-zeldes
Contributor

Another option for a) is to simply use goeswith (spelled apart, should've been spelled together)
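
For illustration, a sketch of what a goeswith analysis of a spaced ellipsis might look like as CoNLL-U rows. The `goeswith_rows` helper is hypothetical. Attaching the later parts to the first part follows the goeswith relation as I understand it; giving them UPOS X and putting the joined form in the first part's LEMMA column are assumptions that should be checked against the current UD guidelines.

```python
def goeswith_rows(parts, first_id, head_id, upos="PUNCT", deprel="punct"):
    """Build 10-column CoNLL-U rows for a token spelled apart, e.g. [".", ".", "."].

    Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC.
    """
    rows = []
    for offset, form in enumerate(parts):
        if offset == 0:
            # Assumption: the first part carries the analysis of the whole token.
            cols = [first_id, form, "".join(parts), upos, "_", "_",
                    head_id, deprel, "_", "_"]
        else:
            # Assumption: later parts get UPOS X and attach to the first part.
            cols = [first_id + offset, form, "_", "X", "_", "_",
                    first_id, "goeswith", "_", "_"]
        rows.append("\t".join(str(c) for c in cols))
    return rows

if __name__ == "__main__":
    # Hypothetical sentence position: the ellipsis starts at token 5, head is token 2.
    print("\n".join(goeswith_rows([".", ".", "."], first_id=5, head_id=2)))
```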

@arademaker
Contributor

> IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

why? I like the idea of “. . .” as single token.

@amir-zeldes
Contributor

> why? I like the idea of “. . .” as single token.

I think UD's way of saying "this is a single token even though there's a space in the string" is essentially goeswith.

@sylvainkahane
Contributor

But goeswith is used when there is a misspelling, no? Do we consider that “. . .” is a misspelling?

@martinpopel
Member

> Do we consider that “. . .” is a misspelling?

Yes, according to most typography guidelines, including a Czech one and an English one. That said, CMOS used to recommend using "three periods plus two nonbreaking spaces" - which could result in the same visual output as the Unicode ellipsis symbol if you hack the kerning rules (some fonts include these hacked kerning rules because their users could not use Unicode).

BTW: The Czech guideline says that in case of omitted characters, we can use as many dots as there are omitted characters, e.g. "Soviet cosmonaut G......" (Gagarin).

Seeing this issue (and similar recent issues) makes me feel good: UD treebanks seem to be so consistently annotated in the important aspects that we can invest our time in such nitpicking. And then I look into the data and the feeling disappears.

@amir-zeldes
Contributor

> And then I look into the data and the feeling disappears.

I feel your pain 😂

> Do we consider that “. . .” is a misspelling?

If a student submitted a paper draft with that to me, I would correct it.

@dan-zeman modified the milestones: v2.14, v2.15 (May 15, 2024)