Abbreviation at the end of sentence #410

martinpopel · 2017-02-01T00:28:51Z

The documentation says that abbreviations like "e. g." can be annotated as multitoken words, but it gives no further information on tokenization. As I understand it, such an abbreviation could also be annotated as one token and one word (especially if it is written without a space).

In particular, I am interested in how to deal with abbreviations at the end of a sentence, where two periods would be expected but most (if not all) languages write only one. For example, in UD_Bulgarian, "г." is an abbreviation for "year". It often ends the sentence, and the v1.4 annotation has two periods although only one is written in the raw text. How should we annotate it?

A) status quo (with wrong text, but at least it is valid according to validate.py)

# text = 1948 г..
1  1948  1948  ADJ
2  г.    г.    NOUN
3  .     .     PUNCT

B) form=г and lemma=г.

# text = 1948 г.
1  1948  1948  ADJ
2  г     г.    NOUN
3  .     .     PUNCT

C) multi-word token

# text = 1948 г.
1    1948  1948  ADJ
2-3  г.    _     _
2    г.    г.    NOUN
3    .     .     PUNCT

D) no PUNCT

# text = 1948 г.
1  1948  1948  ADJ
2  г.    г.    NOUN


martinpopel · 2017-02-01T00:32:10Z

See also related comment #112 (comment) and issue #181

dan-zeman · 2017-02-01T08:34:41Z

As it does not involve two or more real syntactic words (i.e. something more than just a word and a symbol), I vote against C.

I also think A is bad because one does not expect tokenization (the low-level one, as opposed to word segmentation) to introduce a punctuation symbol that did not come in the input.

Whether it is better to give higher priority to the sentence-terminating function (and make the period a separate token, as in B) or to prefer the abbreviation-marking function (and keep г. as one token, as in D), I don't know. B would be easier if the tokenizer always cut off periods, even from abbreviations in the middle of the sentence. As I wrote in #112, some treebanks have this but they seem to be a minority. Otherwise, D is probably easier because the tokenizer could use a list of common abbreviations and would not have to care about sentence boundaries.

Lemma of abbreviations is a separate issue.
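
As a rough illustration of the abbreviation-list approach behind option D, the sketch below keeps listed abbreviations as single tokens and splits trailing punctuation off everything else, without looking at sentence boundaries. The contents of `ABBREVIATIONS` and the `tokenize` function are assumptions made for this example; a real tokenizer would need a proper per-language list and more careful punctuation handling.

```python
# Hypothetical abbreviation list; a real tokenizer would load one per language.
ABBREVIATIONS = {"г.", "т.б.", "e.g."}

# Punctuation that is split off when it ends a whitespace-separated chunk.
TRAILING_PUNCT = ".,:;!?"


def tokenize(text):
    """Whitespace-split, keep known abbreviations whole (option D),
    and split one trailing punctuation mark off other chunks."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS or len(chunk) == 1:
            # Known abbreviation or a single character: keep as is.
            tokens.append(chunk)
        elif chunk[-1] in TRAILING_PUNCT:
            # Ordinary word followed by punctuation: two tokens.
            tokens.extend([chunk[:-1], chunk[-1]])
        else:
            tokens.append(chunk)
    return tokens


print(tokenize("1948 г."))    # abbreviation ends the sentence
print(tokenize("Това е 1948."))  # ordinary sentence-final period
```

The same code path handles "г." in the middle and at the end of a sentence, which is the simplification argued for above: the tokenizer never has to decide whether a period also terminates the sentence.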

jnivre · 2017-02-01T08:52:36Z

This is partly a philosophical question, because we first have to agree on whether there are one or two dots. C implements the view that "г." + "." is contracted to "г.", while B implements the view that abbreviation-final "." is omitted before sentence-final ".", and D implements the view that sentence-final "." is omitted after abbreviation-final ".". I personally find C most convincing theoretically, but I agree with Dan that it seems overkill in some sense. A seems to be the worst solution, because it misrepresents the original text (I assume). For what it's worth, UD Swedish uses D.

martinpopel · 2017-02-13T00:47:43Z

OK, it seems most people (and existing treebanks) prefer D.
I will convert UD_Bulgarian to this style using Udapi and close this issue.
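
For readers who want to see what a conversion from style A to style D involves, here is a simplified sketch in plain Python. It is not Udapi code and not necessarily how the conversion was actually done. The function `convert_a_to_d` and the `(form, lemma, upos)` row layout are hypothetical, and the sketch ignores HEAD and DEPREL, which a real conversion would have to update when the final token is removed. It also assumes the raw text really had only one period.

```python
def convert_a_to_d(text, rows):
    """Turn a style-A sentence into style D.

    text: the value of the '# text' comment.
    rows: list of (form, lemma, upos) tuples, one per token.
    Returns a (text, rows) pair; input that does not look like
    style A is returned unchanged.
    """
    looks_like_a = (
        len(rows) >= 2
        and rows[-1][0] == "."          # last token is a lone period
        and rows[-1][2] == "PUNCT"
        and rows[-2][0].endswith(".")   # previous token already ends in a period
        and text.endswith("..")         # text carries the doubled period
    )
    if looks_like_a:
        # Drop the extra PUNCT token and the extra period in the text.
        return text[:-1], rows[:-1]
    return text, rows


style_a = [("1948", "1948", "ADJ"), ("г.", "г.", "NOUN"), (".", ".", "PUNCT")]
new_text, new_rows = convert_a_to_d("1948 г..", style_a)
print(new_text)
print(new_rows)
```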

ftyers · 2017-02-13T08:37:53Z

I realise that this is closed, but for the record, I prefer C, and this is what we do in Kazakh:

# sent_id = Иран.tagged.txt:4:71
# text = Халқының ұлттық құрамы: парсылар (51%), әзірбайжандар (27%), күрдтер (5%), арабтар, түрікмендер, белуджилер, армяндар, еврейлер, т.б.
1       Халқының        халық   NOUN    n       Case=Gen|Number[psor]=Plur,Sing|Person[psor]=3  3       nmod:poss       _       _
2       ұлттық  ұлттық  ADJ     adj     _       3       amod    _       _
3       құрамы  құрам   NOUN    n       Case=Nom|Number[psor]=Plur,Sing|Person[psor]=3  30      nsubj   _       _
4       :       :       PUNCT   sent    _       3       punct   _       _
5       парсылар        парсы   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
6       (       (       PUNCT   lpar    _       7       punct   _       _
7       51%     51      NUM     num     Case=Nom        5       appos   _       _
8       )       )       PUNCT   rpar    _       7       punct   _       _
9       ,       ,       PUNCT   cm      _       30      punct   _       _
10      әзірбайжандар   әзірбайжан      NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
11      (       (       PUNCT   lpar    _       12      punct   _       _
12      27%     27      NUM     num     Case=Nom        10      appos   _       _
13      )       )       PUNCT   rpar    _       12      punct   _       _
14      ,       ,       PUNCT   cm      _       30      punct   _       _
15      күрдтер күрд    NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
16      (       (       PUNCT   lpar    _       17      punct   _       _
17      5%      5       NUM     num     Case=Nom        15      appos   _       _
18      )       )       PUNCT   rpar    _       17      punct   _       _
19      ,       ,       PUNCT   cm      _       30      punct   _       _
20      арабтар араб    NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
21      ,       ,       PUNCT   cm      _       30      punct   _       _
22      түрікмендер     түрікмен        NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
23      ,       ,       PUNCT   cm      _       30      punct   _       _
24      белуджилер      белуджи NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
25      ,       ,       PUNCT   cm      _       30      punct   _       _
26      армяндар        армян   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
27      ,       ,       PUNCT   cm      _       30      punct   _       _
28      еврейлер        еврей   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
29      ,       ,       PUNCT   cm      _       30      punct   _       _
30-31   т.б.    _       _       _       _       _       _       _       _
30      т.б.    т.б.    NOUN    abbr    _       0       root    _       _
31      .       .       PUNCT   sent    _       30      punct   _       _

martinpopel · 2017-02-18T15:09:10Z

Reopening this issue, as there is no agreement yet and no guidelines in the documentation.
It seems that @sebschu prefers B in UD_English.

sebschu · 2017-02-18T23:26:36Z

I don't have a strong opinion about this, as I don't think this decision has any practical consequences. (For PCFG parsing, the sentence-final period is often important for getting the right parse, but I don't think this matters much for most dependency parsers.) The LDC Web Treebank does not seem to be very consistent in that regard, so I'd be happy to change this to variant D in the future.

martinpopel added standard needed tokenization labels Feb 1, 2017

dan-zeman added the universal label Feb 1, 2017

martinpopel closed this as completed Feb 13, 2017

martinpopel reopened this Feb 18, 2017

dan-zeman added this to the v2.2 milestone Apr 24, 2018

dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018

dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019

dan-zeman closed this as completed Nov 9, 2019
