Abbreviation at the end of sentence #410

Closed
martinpopel opened this issue Feb 1, 2017 · 7 comments

Comments

@martinpopel
Member

The documentation says abbreviations like e. g. can be annotated as multitoken words, but it gives no further information on their tokenization. As I understand it, such an abbreviation could also be annotated as one token and one word (esp. if it is written without a space).

In particular, I am interested in how to deal with abbreviations at the end of a sentence, where two periods should appear, but in most/all languages only one period is written. For example, in UD_Bulgarian, "г." is an abbreviation of "year"; it often ends the sentence, and the v1.4 annotation has two periods, although only one is written in the raw text. How should we annotate it?

A) status quo (with wrong text, but at least it is valid according to validate.py)

# text = 1948 г..
1  1948  1948  ADJ
2  г.    г.    NOUN
3  .     .     PUNCT

B) form=г and lemma=г.

# text = 1948 г.
1  1948  1948  ADJ
2  г     г.    NOUN
3  .     .     PUNCT

C) multi-word token

# text = 1948 г.
1    1948  1948  ADJ
2-3  г.    _    _
2    г.    г.    NOUN
3    .     .     PUNCT

D) no PUNCT

# text = 1948 г.
1  1948  1948  ADJ
2  г.    г.    NOUN

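
One way to compare the four options is to check whether the surface forms, glued back together, reproduce the raw text "1948 г.". The following is only a minimal sketch: the function names are made up for illustration, the rows are reduced to (ID, FORM) pairs, whitespace is ignored instead of reading SpaceAfter, and empty nodes are not handled.

```python
def surface_string(rows):
    """Concatenate surface forms. A multi-word token range such as "2-3"
    contributes its own form and hides the words it covers."""
    parts = []
    skip_until = 0  # highest word ID covered by the last range line seen
    for tok_id, form in rows:
        if "-" in tok_id:
            _, end = (int(x) for x in tok_id.split("-"))
            parts.append(form)
            skip_until = end
        elif int(tok_id) > skip_until:
            parts.append(form)
    return "".join(parts)


def matches_text(text, rows):
    """Compare against the raw text with all whitespace removed."""
    return surface_string(rows) == "".join(text.split())


RAW = "1948 г."  # what is actually written in the raw text

option_a = [("1", "1948"), ("2", "г."), ("3", ".")]
option_b = [("1", "1948"), ("2", "г"), ("3", ".")]
option_c = [("1", "1948"), ("2-3", "г."), ("2", "г."), ("3", ".")]
option_d = [("1", "1948"), ("2", "г.")]

for name, rows in [("A", option_a), ("B", option_b),
                   ("C", option_c), ("D", option_d)]:
    print(name, matches_text(RAW, rows))
```

Under these assumptions only A fails the check, because its forms spell out a second period that the raw text does not contain.
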
@martinpopel
Member Author

See also related comment #112 (comment) and issue #181

@dan-zeman
Member

As it does not involve two or more real syntactic words (i.e. something more than just a word and a symbol), I vote against C.

I also think A is bad because one does not expect tokenization (the low-level one, as opposed to word segmentation) to introduce a punctuation symbol that did not come in the input.

Whether it is better to give higher priority to the sentence-terminating function (and make the period a separate token, as in B) or to prefer the abbreviation-marking function (and keep г. as one token, as in D), I don't know. B would be easier if the tokenizer always cut off periods, even from abbreviations in the middle of the sentence. As I wrote in #112, some treebanks have this but they seem to be a minority. Otherwise, D is probably easier because the tokenizer could use a list of common abbreviations and would not have to care about sentence boundaries.
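
The tokenizer trade-off between B and D can be sketched roughly as follows. This is an illustration under stated assumptions, not any treebank's actual tokenizer: the abbreviation list, the function name and the `split_abbreviation_period` flag are hypothetical, and input is split on whitespace only.

```python
# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"г.", "e.g.", "т.б."}


def tokenize(sentence, split_abbreviation_period=False):
    """Whitespace tokenizer that separates a trailing period.

    split_abbreviation_period=False mimics style D: a listed abbreviation
    keeps its period, and no extra period is ever introduced.
    split_abbreviation_period=True mimics style B: the period is always
    cut off, even from abbreviations.
    """
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS and not split_abbreviation_period:
            tokens.append(chunk)
        elif len(chunk) > 1 and chunk.endswith("."):
            tokens.extend([chunk[:-1], "."])
        else:
            tokens.append(chunk)
    return tokens


print(tokenize("1948 г."))                                  # style D
print(tokenize("1948 г.", split_abbreviation_period=True))  # style B
```

In the D-style branch the tokenizer only consults the list and never needs to know whether the abbreviation ends the sentence, which is the simplification mentioned above.
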

Lemma of abbreviations is a separate issue.

@jnivre
Contributor

jnivre commented Feb 1, 2017

This is partly a philosophical question, because we first have to agree on whether there are one or two dots. C implements the view that "г." + "." is contracted to "г.", while B implements the view that abbreviation-final "." is omitted before sentence-final "." and D implements the view that sentence-final "." is omitted after abbreviation-final "." I personally find C most convincing theoretically, but I agree with Dan that it seems overkill in some sense. A seems to be the worst solution, because it misrepresents the original text (I assume). For what it's worth, UD Swedish uses D.

@martinpopel
Member Author

OK, it seems most people (and existing treebanks) prefer D.
I will convert UD_Bulgarian to this style using Udapi and close this issue.
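
The core of such an A-to-D conversion can be sketched in plain Python. This is not the Udapi code used for the conversion (which is not shown in this thread); the function name is hypothetical, and the sketch works on a bare list of forms, so it ignores what a real conversion must also do: reattach any dependents of the removed node, renumber IDs, and correct the `# text` line.

```python
def to_style_d(forms):
    """Drop a sentence-final "." token when the token before it is an
    abbreviation that already ends in a period (style A -> style D)."""
    if len(forms) >= 2 and forms[-1] == ".":
        previous = forms[-2]
        # The previous token must end in "." and contain something other
        # than periods, so that "..." or a lone "." is left alone.
        if previous.endswith(".") and previous.rstrip(".") != "":
            return forms[:-1]
    return list(forms)


print(to_style_d(["1948", "г.", "."]))
```

A sentence whose last word is not an abbreviation, such as `["1948", "."]`, comes back unchanged.
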

@ftyers
Contributor

ftyers commented Feb 13, 2017

I realise that this is closed, but for the record, I prefer C, and this is what we do in Kazakh:

# sent_id = Иран.tagged.txt:4:71
# text = Халқының ұлттық құрамы: парсылар (51%), әзірбайжандар (27%), күрдтер (5%), арабтар, түрікмендер, белуджилер, армяндар, еврейлер, т.б.
1       Халқының        халық   NOUN    n       Case=Gen|Number[psor]=Plur,Sing|Person[psor]=3  3       nmod:poss       _       _
2       ұлттық  ұлттық  ADJ     adj     _       3       amod    _       _
3       құрамы  құрам   NOUN    n       Case=Nom|Number[psor]=Plur,Sing|Person[psor]=3  30      nsubj   _       _
4       :       :       PUNCT   sent    _       3       punct   _       _
5       парсылар        парсы   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
6       (       (       PUNCT   lpar    _       7       punct   _       _
7       51%     51      NUM     num     Case=Nom        5       appos   _       _
8       )       )       PUNCT   rpar    _       7       punct   _       _
9       ,       ,       PUNCT   cm      _       30      punct   _       _
10      әзірбайжандар   әзірбайжан      NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
11      (       (       PUNCT   lpar    _       12      punct   _       _
12      27%     27      NUM     num     Case=Nom        10      appos   _       _
13      )       )       PUNCT   rpar    _       12      punct   _       _
14      ,       ,       PUNCT   cm      _       30      punct   _       _
15      күрдтер күрд    NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
16      (       (       PUNCT   lpar    _       17      punct   _       _
17      5%      5       NUM     num     Case=Nom        15      appos   _       _
18      )       )       PUNCT   rpar    _       17      punct   _       _
19      ,       ,       PUNCT   cm      _       30      punct   _       _
20      арабтар араб    NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
21      ,       ,       PUNCT   cm      _       30      punct   _       _
22      түрікмендер     түрікмен        NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
23      ,       ,       PUNCT   cm      _       30      punct   _       _
24      белуджилер      белуджи NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
25      ,       ,       PUNCT   cm      _       30      punct   _       _
26      армяндар        армян   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
27      ,       ,       PUNCT   cm      _       30      punct   _       _
28      еврейлер        еврей   NOUN    n       Case=Nom|Number=Plur    30      conj    _       _
29      ,       ,       PUNCT   cm      _       30      punct   _       _
30-31   т.б.    _       _       _       _       _       _       _       _
30      т.б.    т.б.    NOUN    abbr    _       0       root    _       _
31      .       .       PUNCT   sent    _       30      punct   _       _

@martinpopel
Member Author

Reopening this issue, as there is no agreement yet and no guidelines in the documentation.
It seems that @sebschu prefers B in UD_English.

@martinpopel martinpopel reopened this Feb 18, 2017
@sebschu
Member

sebschu commented Feb 18, 2017

I don't have a strong opinion about this, as I don't think this decision has any practical consequences (for PCFG parsing, the sentence-final period is often important to get the right parse, but I don't think this matters much for most dependency parsers). The LDC Web Treebank does not seem to be very consistent in that regard, so I'd be happy to change this to variant D in the future.

@dan-zeman dan-zeman added this to the v2.2 milestone Apr 24, 2018
@dan-zeman dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018
@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019