Abbreviation at the end of sentence #410
See also the related comment in #112 and issue #181.
As it does not involve two or more real syntactic words (i.e. something more than just a word and a symbol), I vote against C. I also think A is bad because one does not expect tokenization (the low-level one, as opposed to word segmentation) to introduce a punctuation symbol that did not come in the input. Whether it is better to give higher priority to the sentence-terminating function (and make the period a separate token, as in B) or to prefer the abbreviation-marking function (and keep г. as one token, as in D), I don't know. B would be easier if the tokenizer always cut off periods, even from abbreviations in the middle of the sentence. As I wrote in #112, some treebanks have this but they seem to be a minority. Otherwise, D is probably easier because the tokenizer could use a list of common abbreviations and would not have to care about sentence boundaries. Lemma of abbreviations is a separate issue.
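To make the trade-off between B and D concrete, here is a minimal sketch of a tokenizer driven by an abbreviation list. It is only an illustration: the `ABBREVIATIONS` set, the `tokenize` function and its `split_abbrev_period` flag are hypothetical names, not part of any UD tool, and the Bulgarian phrase in the usage note is an invented example.

```python
# Hypothetical abbreviation list; a real tokenizer would load a
# language-specific one.
ABBREVIATIONS = {"г.", "e.g.", "etc."}

PUNCTUATION = ".,;:!?"


def tokenize(sentence, split_abbrev_period=False):
    """Split on whitespace, then detach trailing punctuation.

    split_abbrev_period=False approximates variant D: a listed abbreviation
    keeps its period wherever it occurs, so the tokenizer never needs to
    know where the sentence ends.
    split_abbrev_period=True approximates variant B: periods are always cut
    off, even from abbreviations in the middle of a sentence.
    """
    tokens = []
    for chunk in sentence.split():
        if chunk in ABBREVIATIONS and not split_abbrev_period:
            tokens.append(chunk)
            continue
        # Peel trailing punctuation characters off one at a time.
        trailing = []
        while chunk and chunk[-1] in PUNCTUATION:
            trailing.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(trailing))
    return tokens
```

With this sketch, `tokenize("през 1990 г.")` keeps `г.` as the last token (variant D), while `tokenize("през 1990 г.", split_abbrev_period=True)` yields `г` followed by a separate `.` (variant B). In neither mode does the tokenizer emit a period that was not in the input, which is the objection raised against A.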
This is partly a philosophical question, because we first have to agree on whether there exists one or two dots. C implements the view that "r." + "." is contracted to "r.", while B implements the view that abbreviation-final "." is omitted after sentence-final "." and D implements the view that sentence-final "." is omitted after abbreviation-final ".". I personally find C most convincing theoretically, but I agree with Dan that it seems like overkill in some sense. A seems to be the worst solution, because it misrepresents the original text (I assume). For what it's worth, UD Swedish uses D.
OK, it seems most people (and existing treebanks) prefer D.
I realise that this is closed, but for the record, I prefer C, and this is what we do in Kazakh:
I don't have a strong opinion about this, as I don't think this decision has any practical consequences (for PCFG parsing, the sentence-final period is often important to get the right parse, but I don't think this matters much for most dependency parsers). The LDC Web Treebank does not seem to be very consistent in that regard, so I'd be happy to change this to variant D in the future.
The documentation says abbreviations like e. g. can be annotated as multitoken words, but there is no more info on the tokenization. As I understand it, such an abbreviation could also be annotated as one token and one word (esp. if it is written without a space).
In particular, I am interested in how to deal with abbreviations at the end of sentence, where two periods should appear, but in most/all languages only one period is written. For example in UD_Bulgarian, "г." is an abbreviation of "year", it often ends the sentence and the v1.4 annotation has two periods, although only one is written in raw text. How should we annotate it?
A) status quo (with wrong text, but at least it is valid according to validate.py)
B) form=г and lemma=г.
C) multi-word token
D) no PUNCT
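As a rough illustration of how the four options differ, the sketch below lists the ID and FORM values each one would produce for a sentence-final abbreviation. The function `final_abbrev_rows` is a hypothetical helper written for this comparison, not something from validate.py or the UD tools, and the rows for C follow the reading that the surface token г. covers two syntactic words, г. and the period.

```python
def final_abbrev_rows(abbrev, variant, start_id=1):
    """Return (ID, FORM) pairs for a sentence-final abbreviation like 'г.'.

    Only the ID and FORM columns are modelled; lemmas, tags and the
    sentence-level text comment are left out of this sketch.
    """
    stem = abbrev.rstrip(".")
    i = start_id
    if variant == "A":
        # Status quo: two periods in the annotation, one in the raw text.
        return [(str(i), abbrev), (str(i + 1), ".")]
    if variant == "B":
        # Period becomes a separate token; the abbreviation's form loses it.
        return [(str(i), stem), (str(i + 1), ".")]
    if variant == "C":
        # Multi-word token: a range line with the surface form,
        # followed by the two syntactic words it stands for.
        return [(f"{i}-{i + 1}", abbrev), (str(i), abbrev), (str(i + 1), ".")]
    if variant == "D":
        # Single token keeping its period; no separate punctuation token.
        return [(str(i), abbrev)]
    raise ValueError(f"unknown variant: {variant}")
```

For example, `final_abbrev_rows("г.", "C", start_id=3)` gives a range line `3-4` with form `г.` followed by word 3 (`г.`) and word 4 (`.`), whereas variant D gives only word 3.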