[puupankki.conllu] мың миллион миллиард млн млрд трлн are inconsistent #24

Open
IlnarSelimcan opened this issue Oct 13, 2020 · 0 comments

General context: #17

Actually several related issues:

  1. мың and миллиард are NUM num everywhere, while миллион in some cases is NUM num, and in others NOUN n.

  2. млрд. and трлн. are NOUN abbr everywhere, while млн. is in some cases tagged as NUM num, and in others as NOUN abbr.

  3. (a)

4	2	2	NUM	num	NumType=Card	5	compound	_	_
5	миллиард	миллиард	NUM	num	NumType=Card	6	compound	_	_
6	300	300	NUM	num	NumType=Card	7	compound	_	_
7	миллион	миллион	NUM	num	NumType=Card	8	nummod	_	_
8	теңгеден	теңге	NOUN	n	Case=Abl	10	nmod	_	_
9	астам	астам	ADJ	adj	_	10	amod	_	_
10	қаржы	қаржы	NOUN	n	Case=Nom	11	obj	_	_

vs (b)

3	4,3	4,3	NUM	num	NumType=Card	4	nummod	_	_
4	мыңнан	мың	NUM	num	Case=Abl|NumType=Card,Ord	6	nmod	_	_
5	астам	астам	ADJ	adj	_	6	amod	_	_
6	шақырымды	шақырым	NOUN	n	Case=Acc	7	obj	_	_
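
To see how widespread inconsistencies 1 and 2 are, the tag pairs per lemma can be counted directly from the file. The following is only a minimal sketch: the function names are hypothetical, and it assumes that the abbreviations are lemmatised with the trailing period (млн., млрд., трлн.), which should be checked against puupankki.conllu.

```python
from collections import defaultdict

# Lemmas discussed in this issue. ASSUMPTION: abbreviations keep the period in the lemma column.
TARGETS = {"мың", "миллион", "миллиард", "триллион", "млн.", "млрд.", "трлн."}

def tag_variants(lines, targets=TARGETS):
    """Map each target lemma to {(UPOS, XPOS): count} as seen in CoNLL-U lines."""
    seen = defaultdict(lambda: defaultdict(int))
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a regular token line
        lemma, upos, xpos = cols[2], cols[3], cols[4]
        if lemma in targets:
            seen[lemma][(upos, xpos)] += 1
    return {lemma: dict(tags) for lemma, tags in seen.items()}

def inconsistent(variants):
    """Keep only the lemmas that occur with more than one (UPOS, XPOS) pair."""
    return {lemma: tags for lemma, tags in variants.items() if len(tags) > 1}
```

Calling `inconsistent(tag_variants(open("puupankki.conllu", encoding="utf-8")))` should then list only the lemmas that still need harmonising.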

Hereby I suggest:

  • to tag all of мың, миллион, миллиард, триллион, млн., млрд. and трлн. as NUM num. For the latter three, apertium-kaz & co can be modified to output <abbr> as a secondary tag, i.e. млн\.? --> <num><abbr>. Since there are abbreviated nouns, abbreviated numerals etc., I think it makes sense to make <abbr> a secondary tag for known abbreviations, especially in the context of UD annotation:

[quote https://universaldependencies.org/u/pos/all.html#sym-symbol]

Strings that consists entirely of alphanumeric characters are not symbols but they may be proper nouns: 130XE, DC10; others may be tagged PROPN (rather than SYM) even if they contain special characters: DC-10. Similarly, abbreviations for single words are not symbols but are assigned the part of speech of the full form. For example, Mr. (mister), kg (kilogram), km (kilometer), Dr (Doctor) should be tagged nouns. Acronyms for proper names such as UN and NATO should be tagged as proper nouns.

[unquote]

but also, generally speaking, knowing the POS of the unabbreviated form is considered helpful for applications.

UPDATE: note that in UD there is the Abbr feature: https://universaldependencies.org/u/feat/Abbr.html

  • to handle all numerical constructions like the above as compounds (i.e. as done in 3a). In other words, a flat chain of compound relations, with the rightmost element being the head of the chain and receiving nummod, nmod or whatever relation the context requires.
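
Both suggestions could be applied and checked with a small script. This is a rough sketch, not a tested conversion for the treebank: `retag` and `chain_violations` are hypothetical names, the choice to mark abbreviations with the UD feature Abbr=Yes follows the UPDATE above, other features are left untouched, and "chain" is interpreted as in 3a (each numeral attached to the next one as compound).

```python
FULL_FORMS = {"мың", "миллион", "миллиард", "триллион"}
ABBR_STEMS = {"млн", "млрд", "трлн"}  # compared after stripping a trailing period

def retag(line):
    """Return the token line retagged as NUM num; abbreviations also get Abbr=Yes."""
    line = line.rstrip("\n")
    cols = line.split("\t")
    if len(cols) != 10:
        return line  # comment, blank or malformed line: leave as is
    lemma = cols[2]
    is_abbr = lemma.rstrip(".") in ABBR_STEMS
    if lemma not in FULL_FORMS and not is_abbr:
        return line
    cols[3], cols[4] = "NUM", "num"
    if is_abbr:
        feats = [] if cols[5] == "_" else cols[5].split("|")
        if "Abbr=Yes" not in feats:
            feats.append("Abbr=Yes")
        # UD expects features sorted alphabetically, ignoring case
        cols[5] = "|".join(sorted(feats, key=str.lower))
    return "\t".join(cols)

def chain_violations(rows):
    """rows: 10-column token lists of one sentence, in order.

    Return the IDs of NUM tokens that are directly followed by another NUM
    token but are not attached to it with the compound relation.
    """
    bad = []
    for cur, nxt in zip(rows, rows[1:]):
        if cur[3] == "NUM" and nxt[3] == "NUM":
            if cur[6] != nxt[0] or cur[7] != "compound":
                bad.append(cur[0])
    return bad
```

On the excerpts above, `chain_violations` would accept 3a and flag token 3 of 3b, which is attached to мыңнан as nummod rather than compound.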