Tokenization of space-separated ellipsis. #988

rhdunn · 2023-11-04T17:10:26Z

There are generally 3 ways to specify an ellipsis in text:

1. as a sequence of 3 (or more) full-stop/period characters without spaces between them, e.g. ...;
2. as a sequence of 3 (or more) full-stop/period characters with spaces between them, e.g. . . .;
3. as a Unicode ellipsis character, e.g. ….

a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

My preference is yes, as they are linguistically a single punctuation token and can be substituted for any of the other forms while remaining equivalent.
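To make the single-token option concrete, here is a minimal sketch of a whitespace tokenizer that keeps all three forms together. The `ELLIPSIS` pattern and the `tokenize` function are hypothetical names for illustration only; this is not how EWT or any UD tool tokenizes.

```python
import re

# Hypothetical pattern (an assumption, not taken from any UD tool):
# three or more full stops, each optionally preceded by one space or
# no-break space, or the Unicode ellipsis character U+2026.
ELLIPSIS = re.compile(r"\.(?:[ \u00A0]?\.){2,}|\u2026")

def tokenize(text):
    """Split on whitespace, but keep any ellipsis form as a single token."""
    tokens = []
    pos = 0
    for match in ELLIPSIS.finditer(text):
        # Ordinary whitespace splitting for the text before the ellipsis.
        tokens.extend(text[pos:match.start()].split())
        tokens.append(match.group())
        pos = match.end()
    tokens.extend(text[pos:].split())
    return tokens
```

With this sketch, "Wait . . . what" and "Wait ... what" both yield three tokens, with the ellipsis as the middle one. Other punctuation is deliberately left attached to words to keep the example short.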

b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis, as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence?

Linguistically, I would say it is the latter, but that makes it difficult to tokenize in a single pass (although you have that issue with abbreviations such as "Miss. Austen wrote English fiction.").
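The n-1 split can be sketched as a post-processing step once the sentence boundary is already known, which is exactly the part that is hard to do in a single pass. The function name `split_final_dots` is hypothetical, and the sketch only covers the unspaced form (case 1).

```python
import re

# Assumption: the input is one already-segmented sentence, so a run of
# four or more dots at the very end can be treated as ellipsis + period.
TRAILING_DOTS = re.compile(r"\.{4,}$")

def split_final_dots(sentence):
    """Return (body, ellipsis, final_period).

    The last two items are empty strings when the sentence does not end
    in a run of four or more dots.
    """
    match = TRAILING_DOTS.search(sentence)
    if match is None:
        return sentence, "", ""
    run = match.group()
    # The ellipsis gets n-1 dots; the last dot ends the sentence.
    return sentence[:match.start()], run[:-1], run[-1]
```

For example, "And then...." would be split into the body "And then", an ellipsis of three dots, and a final period, while "Done..." is returned unchanged.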

NOTE: EWT has several single-token ellipses that are labelled as SYM+NFP instead of as PUNCT+. or PUNCT+, like the other ellipsis tokens.


nschneid · 2023-11-04T17:48:05Z

Thanks for pointing these out.

> a) For the second case, the ellipsis is tokenized in EWT as 3 (or more) separate tokens. This is consistent with a space-based tokenizer, but is inconsistent with tokenizing the other cases as a single ellipsis token. -- Q: Should these be a single token?

The UD tokenization/word segmentation policy is very strict about prohibiting spaces within syntactic words (i.e. units that have a dependency relation). As far as I'm aware, the only exceptions are 1) languages like Vietnamese where spaces indicate syllable rather than word boundaries, and 2) spaces within numerals for readability like "1 000 000" (where other orthographies might use commas or periods as separators). IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

> b) For cases 1 and 2, where there are 4 (or more) . characters at the end of a sentence, should this be a single ellipsis as is currently annotated, or an ellipsis of n-1 . characters and a separate . token to end the sentence.

I don't necessarily have a strong opinion on this, and the policy may vary depending on whether it is a well-edited genre. Considering that EWT consists of web text, I wouldn't expect the use of three vs. four (or more) dots to be completely standard. We see things like

Moral of the story: Don't drink Coke..........drink Pepsi!

How many ellipses and/or periods is that? For EWT, at least, I'm happy with the current, simple approach that lumps them all together as one PUNCT token.

> NOTE: EWT has several single token ellipsis that are labelled as SYM+NFP instead of as PUNCT+. or PUNCT+, like the other ellipsis tokens.

I'm just seeing these 5 which are standalone "sentences". Maybe this should be addressed as part of UniversalDependencies/UD_English-EWT#415.

As a general matter, I'd say UD is concerned with morphosyntax proper and less developed when it comes to issues like punctuation. If there are simple ways to make the analysis of punctuation cleaner/more consistent, then great, but we are cautious about departing from standards assumed by tokenizers—it will cause problems for parsers.

amir-zeldes · 2023-11-04T17:50:55Z

Another option for a) is to simply use goeswith (spelled apart, should've been spelled together).
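A rough sketch of what that could look like in CoNLL-U, built with a hypothetical helper `goeswith_rows`. It assumes, as I understand the UD guidelines, that the first part carries the analysis of the whole token and that later parts attach to it via goeswith with UPOS X. The head ID, the `punct` relation, and the lemma are placeholder assumptions, not values taken from EWT.

```python
def goeswith_rows(parts, first_id, head_id, deprel="punct", lemma="..."):
    """Build simplified 10-column CoNLL-U rows for a token written apart.

    parts    -- the space-separated pieces, e.g. [".", ".", "."]
    first_id -- token ID of the first piece
    head_id  -- ID of the word the whole ellipsis attaches to (placeholder)
    """
    rows = []
    for offset, form in enumerate(parts):
        tok_id = first_id + offset
        if offset == 0:
            # First piece: analysed as the whole punctuation token.
            cols = [tok_id, form, lemma, "PUNCT", "_", "_", head_id, deprel, "_", "_"]
        else:
            # Later pieces: attached to the first piece via goeswith.
            cols = [tok_id, form, "_", "X", "_", "_", first_id, "goeswith", "_", "_"]
        rows.append("\t".join(str(c) for c in cols))
    return rows
```

Calling `goeswith_rows([".", ".", "."], first_id=5, head_id=3)` gives three rows, with tokens 6 and 7 both pointing at token 5. XPOS and FEATS are left as underscores here for brevity.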

arademaker · 2023-11-04T18:37:49Z

> IMO, some variation in the spacing of punctuation marks is not important enough to warrant an additional exception.

Why? I like the idea of “. . .” as a single token.

amir-zeldes · 2023-11-06T15:12:50Z

> why? I like the idea of “. . .” as single token.

I think goeswith is essentially UD's way of saying "this is a single token" while there's a space in the string.

sylvainkahane · 2023-11-06T15:45:04Z

But goeswith is used when there is a misspelling, no? Do we consider that “. . .” is a misspelling?

martinpopel · 2023-11-06T17:29:59Z

> Do we consider that “. . .” is a misspelling?

Yes, according to most typography guidelines, including a Czech one and an English one. That said, CMOS used to recommend using "three periods plus two nonbreaking spaces" - which could result in the same visual output as the Unicode ellipsis symbol if you hack the kerning rules (some fonts include these hacked kerning rules because their users could not use Unicode).

BTW: The Czech guideline says that in case of omitted characters, we can use as many dots as there are omitted characters, e.g. "Soviet cosmonaut G......" (Gagarin).

Seeing this issue (and similar recent issues) makes me feel good: UD treebanks seem to be so consistently annotated in the important aspects, so we can invest our time into such nitpicking. And then I look into the data and the feeling disappears.

amir-zeldes · 2023-11-07T01:44:10Z

> And then I look into the data and the feeling disappears.

I feel your pain 😂

> Do we consider that “. . .” is a misspelling?

If a student submitted a paper draft with that to me, I would correct it.

dan-zeman added this to the v2.14 milestone Nov 5, 2023

dan-zeman added the question, tokenization, and universal labels Nov 5, 2023

dan-zeman modified the milestones: v2.14, v2.15 May 15, 2024
