0.5 is a Card, not Frac? #884

Closed
AngledLuffa opened this issue Jul 23, 2022 · 24 comments

Comments

@AngledLuffa

It seems the UD standard is to label things like 0.5 as Card, not Frac.

@amir-zeldes
Contributor

This seems so counterintuitive to me that I would like to suggest this should be discussed again. What do you think @dan-zeman @nschneid ?

I don't see why "one" and "1" should have the same NumType but "half" and "0.5" shouldn't.

@nschneid
Contributor

Because "0.5" can be read as "zero point five" or just "point five". It is only through mathematical knowledge (not linguistic knowledge) that we know this is equivalent to 1/2 = one half.

@nschneid
Contributor

I actually wouldn't object to adding NumType=Dec for numbers with a decimal point. But that would result in "5.0" and "5" being different NumTypes. That real numbers with only zeros after the decimal point are also integers is mathematical, not linguistic, knowledge.

@dan-zeman
Member

dan-zeman commented Jul 24, 2022

I don't have much to add here. I'd be fine with using just NumType=Card, wouldn't need Frac. Frac is a Czech-specific artefact of conversion from PDT, where it is used for words like polovina "half", třetina "third", čtvrtina "quarter/fourth" etc. These inflect exactly like feminine nouns but are categorized under numerals. The feature value is not used for other numerals that represent non-integer numbers. If you say tři čtvrtiny "three quarters", the first word will be regular NUM NumType=Card, the second word will be NUM NumType=Frac (http://hdl.handle.net/11346/PMLTQ-BSCI). If you write 3/4 or 0.75, it will be simply a cardinal number. There seems to be an inconsistency in polovina "half" because I just found out that it's actually a NOUN (http://hdl.handle.net/11346/PMLTQ-RJ6K). And the other word for "half", půl, is NUM NumType=Card (or NOUN) (http://hdl.handle.net/11346/PMLTQ-BWJU); it does not behave morphologically like those -ina words. Similarly, čtvrt, which is another way of saying a "quarter", is also NUM NumType=Card, or a NOUN (http://hdl.handle.net/11346/PMLTQ-4W4C).

To sum up, I suppose that the original idea was that NumType=Frac is used to denote a subclass of cardinals which have specific morphosyntactic properties and which also happen to be one possible way of denoting fractions. And the idea was not applied consistently (because it should also be applied to polovina).

@nschneid
Contributor

In English I suppose there is value in distinguishing the two senses of "third", the fractional one and the ordinal one, both of which are separate from cardinal "three". Corresponding to cardinal "two" are the fractional "half" and ordinal "second".

A simple rule could be that NumType is strictly for the morphological paradigm, so in the case of words formed from digits without overt morphological marking, NumType=Card applies regardless of what is represented mathematically.
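A minimal sketch of the rule proposed above, in Python. The function name `numtype_by_form`, the tiny word lists, and the regular expressions are illustrative assumptions, not part of the UD guidelines or of any UD tool. The only point is that a digit token without overt morphological marking gets Card regardless of its mathematical value, while a word like "third" cannot be decided from form alone.

```python
import re
from typing import Optional

# Hypothetical, deliberately tiny lexicons for illustration only.
# "third" is left out on purpose: its form is shared by the ordinal
# and the fractional reading.
CARDINAL_WORDS = {"one", "two", "three"}
ORDINAL_WORDS = {"first", "second"}
FRACTION_WORDS = {"half", "halves", "quarter", "quarters"}

# Digits with an optional decimal or slash part (e.g. "5", "0.5", "3/4", ".5").
BARE_DIGITS = re.compile(r"\d+(?:[.,/]\d+)?|[.,]\d+")
# Digits followed by an overt English ordinal suffix (e.g. "2nd", "3rd").
DIGITS_WITH_ORD_SUFFIX = re.compile(r"\d+(?:st|nd|rd|th)")


def numtype_by_form(token: str) -> Optional[str]:
    """Return a NumType value based only on overt form, or None if undecided."""
    lowered = token.lower()
    if DIGITS_WITH_ORD_SUFFIX.fullmatch(lowered):
        return "Ord"   # overt ordinal marking on a digit token
    if BARE_DIGITS.fullmatch(lowered):
        return "Card"  # unmarked digits: Card, whatever the value
    if lowered in CARDINAL_WORDS:
        return "Card"
    if lowered in ORDINAL_WORDS:
        return "Ord"
    if lowered in FRACTION_WORDS:
        return "Frac"
    return None        # e.g. "third": form alone does not decide
```

Under this sketch, "0.5", "5" and "3/4" all come out as Card, "2nd" as Ord, and "half" as Frac.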

@dan-zeman dan-zeman transferred this issue from UniversalDependencies/UD_English-GUM Jul 25, 2022
@dan-zeman dan-zeman added this to the v2.11 milestone Jul 25, 2022
@dan-zeman
Member

This was originally an issue in UD_English-GUM. I moved it to docs because it has become a general question of how certain values of the NumType feature should be applied and interpreted.

@dan-zeman dan-zeman changed the title from "0.5 is a Card" to "0.5 is a Card, not Frac?" Jul 25, 2022
@dan-zeman
Member

dan-zeman commented Jul 25, 2022

Note: NumType=Frac is currently used in 16 UD languages.

This Teitok query returns various occurrences across languages but note that the query operates on UD 2.7.

@Stormur
Contributor

Stormur commented Jul 25, 2022

> In English I suppose there is value in distinguishing the two senses of "third", the fractional one and the ordinal one, both of which are separate from cardinal "three". Corresponding to cardinal "two" are the fractional "half" and ordinal "second".

I think that third should always be an ordinal, as it in fact is. The same goes for the Italian equivalent terzo; the motivation is that fractions are always expressed as "the third part of a unity", and so on. German Drittel, by contrast, is a candidate for Frac, distinct from dritte/r/s.

I see a more general problem with symbolic expressions, which I recently stumbled upon with Roman numerals. One problem here, for example, is that a Roman numeral like XIX (in digits: 19) might be used for a cardinal or an ordinal, and maybe in general for any numeric expression. So I cannot sensibly choose a NumType (nor give any morphological analysis, since nothing is there), and have resorted to just using NumForm=Roman (as redundant as it may be, since it just makes explicit a regular expression...) with nothing else. Similarly, the notion of cardinal or otherwise cannot really be applied to a number like 5 or 5.4536457 or π. Let's not even go into mathematical distinctions, which in my opinion are not at all pertinent to UD's annotation.

I wonder if we shouldn't just use SYM as the part of speech for all such symbolic transcriptions. These are very different from abbreviations. This could be combined with a possible complementarity with NumForm, e.g.: a numeric SYM always needs to be tagged for NumForm; or the feature NumForm (else than Word, which becomes redundant) excludes any other morphologic feature and other ones like NumType.
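To make the two alternative constraints above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `looks_numeric`, `check_symbolic_numeral`, and the simplistic Roman-numeral pattern are names and heuristics invented for illustration, and nothing like this exists in the UD validator as far as this thread shows.

```python
import re

# Very rough test for Roman numerals; an assumption for illustration only.
ROMAN = re.compile(r"[IVXLCDM]+")


def looks_numeric(form):
    """Guess whether a token is a symbolic number (digits or Roman numeral)."""
    return any(ch.isdigit() for ch in form) or bool(ROMAN.fullmatch(form))


def check_symbolic_numeral(form, upos, feats):
    """Return the violations of the two proposed constraints as a list of strings.

    feats is a plain dict of feature name to value, e.g. {"NumForm": "Digit"}.
    """
    problems = []
    num_form = feats.get("NumForm")
    # Constraint 1: a numeric SYM always needs to be tagged for NumForm.
    if upos == "SYM" and looks_numeric(form) and num_form is None:
        problems.append("numeric SYM without NumForm")
    # Constraint 2: NumForm other than Word excludes any other feature,
    # including NumType.
    if num_form is not None and num_form != "Word":
        extra = sorted(name for name in feats if name != "NumForm")
        if extra:
            problems.append(
                "NumForm=%s combined with: %s" % (num_form, ", ".join(extra))
            )
    return problems
```

For example, `check_symbolic_numeral("XIX", "SYM", {"NumForm": "Roman"})` reports nothing, while the same token with an added NumType would be flagged under the second constraint.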

@nschneid
Contributor

nschneid commented Jul 25, 2022

Maybe the word "Cardinal" is confusing to people. I think it has been interpreted in the guidelines as "unmarked numeric" (whether expressed as a word or with digits or Roman numerals), as opposed to an expressly ordinal, fractional, or multiplicative form. But as @dan-zeman's treebank query shows, there are a few treebanks which use Frac for decimal numbers, sometimes including "%" as part of the token.

I don't see how this is relevant to the choice of NUM vs. SYM as the tag. NumType and NumForm are not limited to NUM; they also apply to ordinal ADJs, for example. (If we were revising the UPOS tagset I would have questions about whether we really need NUM, SYM, and PROPN, but we are stuck with it.)

@Stormur
Contributor

Stormur commented Jul 25, 2022

> I don't see how this is relevant to the choice of NUM vs. SYM as the tag. NumType and NumForm are not limited to NUM; they also apply to ordinal ADJs, for example. (If we were revising the UPOS tagset I would have questions about whether we really need NUM, SYM, and PROPN, but we are stuck with it.)

It is relevant because we are not dealing with a lexical representation of numbers, but a symbolic one, so it becomes very problematic to associate lexical features like NumType with it, let alone to decide which value to assign: it is simply an ill-posed problem.

How should one label something like 2. intended as second? Not as ADJ, I think. This issue is rather delicate because we have to tackle the liminal interaction between language and its written representation.

PS: don't even get me started on PROPN... 😛 But here I am just suggesting switching to a better part of speech for some phenomena.

@nschneid
Contributor

I see, yes, it is unfortunate that there is some mixing of morphological and orthographic considerations. But I think saying that "two" is NUM while "2" is SYM will confuse people more, not less.

> How should one label something like 2. intended as second? Not as ADJ, I think.

Is the "." explicitly indicating that it is read as "second" as opposed to "two" (analogous to the suffix in 2nd)? If so then I think it's perfectly fine to call it ADJ and NumType=Ord, because there is explicit ordinal marking.

@amir-zeldes
Contributor

amir-zeldes commented Jul 25, 2022

> In English I suppose there is value in distinguishing the two senses of "third", the fractional one and the ordinal one

If we agree on this (and I see a thumbs up from @dan-zeman too), then I am even more confused about why "1/3" in Arabic numerals would not be Frac as well. That orthography, which uniquely represents and distinguishes the fractional sense from the ordinal one (1/3 cannot mean "the third one"), is all the more associated with Frac IMO, and "0.33333...." is just a different mathematical notation for the same thing.

If the criterion for being Card is "having Arabic numerals", then I think it should be a NumForm (an orthographic characteristic), not a NumType. But don't get me wrong, that's not what I'm advocating: I think "1/3", "0.3333" and "third" (in the fractional sense) should all get NumType=Frac and the appropriate NumForm (Word, Digit etc.). The historical facts about the PDT's conversion are interesting, but I don't think they should motivate guidelines for other languages.

@AngledLuffa
Author

> I think that third should always be an ordinal, as it in fact is.

That's not true in the case of "One third of the population ..." or other sentences where it represents a fraction

@Stormur
Contributor

Stormur commented Jul 25, 2022

> > I think that third should always be an ordinal, as it in fact is.
>
> That's not true in the case of "One third of the population ..." or other sentences where it represents a fraction

And the representation of a fraction is achieved by means of a (substantivised) ordinal. The two forms are the same thing, we are just giving different contextual interpretations.

@nschneid
Contributor

I see the theoretical argument that there is a syntactic construction that expresses fractions using the ordinal forms of words...but this doesn't work for "half" in "one half" or "three halves". As a practical measure it seems reasonable to say that there are two slots in the paradigm with syncretism between the fractional and ordinal for most values.

@Stormur
Contributor

Stormur commented Jul 25, 2022

> I see the theoretical argument that there is a syntactic construction that expresses fractions using the ordinal forms of words...but this doesn't work for "half" in "one half" or "three halves". As a practical measure it seems reasonable to say that there are two slots in the paradigm with syncretism between the fractional and ordinal for most values.

I agree that half indeed appears as a specific fractional numeral (to continue the parallel with Italian: mezzo), and it might well be the only such form in English. Lower numbers often show peculiar patterns, especially a fraction as significant as 1/2.

The syncretism would be total apart from this single case, as far as I know, and this advocates for not making a distinction!

@amir-zeldes
Contributor

I understood this discussion to be about the universal category Frac, so while it's interesting, I don't think it's necessarily important what syncretisms English has. As @Stormur pointed out, in German all fractions have a form which is synchronically distinct from the ordinal (though originally it is a compound with an ordinal stem as the modifier), so there are definitely languages where a distinction makes sense across the board. The mathematical notations, even in English, are also not ambiguous (1/4, 0.25 etc.).

@AngledLuffa
Author

> The syncretism would be total apart from this single case, as far as I know, and this advocates for not making a distinction!

There's also "quarter" and arguably "percent". Still, I think those are two different meanings of third. If you order a hamburger and they ask if you want "1/4, 1/3, or 1/2 lb patty" and you respond "the third", it's pretty clear which one you want

@amir-zeldes
Contributor

> they ask if you want "1/4, 1/3, or 1/2 lb patty" and you respond "the third"

Actually I think that's ambiguous :) But that ambiguity supports there being two different meanings...

@AngledLuffa
Author

AngledLuffa commented Jul 25, 2022 via email

@dan-zeman
Member

> or the feature NumForm (else than Word, which becomes redundant) excludes any other morphologic feature and other ones like NumType.

In some languages, ordinals expressed in digits are different from cardinals: 3 = "three", 3. = "third". Czech is one of those languages but unfortunately the tokenization used in the Czech data separates the period from the digit, so we don't really have a token that could get NumForm=Digit|NumType=Ord.

@dan-zeman
Member

> The historical facts about the PDT's conversion are interesting, but I don't think they should motivate guidelines for other languages.

Agreed. But I presented them because I am not convinced that we need to distinguish Frac from Card in every language, just because every language has to express fractions somehow. And I suspect some people may have interpreted Frac this way.

@Stormur
Contributor

Stormur commented Jul 26, 2022

> > The syncretism would be total apart from this single case, as far as I know, and this advocates for not making a distinction!
>
> There's also "quarter" and arguably "percent". Still, I think those are two different meanings of third. If you order a hamburger and they ask if you want "1/4, 1/3, or 1/2 lb patty" and you respond "the third", it's pretty clear which one you want

Yes, quarter looks like that, nice! But I am not so sure about percent; it seems rather distributive, or something different altogether. Anyway, whether such a term is ambiguous or not is determined by the context: both the real world and the words that accompany it. So when we have det(third,the), it is probably intended in the ordinal sense, while a nummod(third,x) very probably indicates a fraction. But the word stays the same. I would follow the principle: entia non sunt multiplicanda praeter necessitatem 'entities are not to be multiplied beyond necessity'. And this would actually help an annotation algorithm a lot.
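The contextual heuristic in the previous paragraph could be sketched as follows in Python. The function name `reading_of_third` and the input shape (a list of `(deprel, form)` pairs for the dependents of the token) are assumptions made for illustration. Letting nummod take precedence over det when both are present, as in "the two thirds", is likewise an assumption that the comment above does not address.

```python
from typing import List, Optional, Tuple


def reading_of_third(dependents: List[Tuple[str, str]]) -> Optional[str]:
    """Guess the reading of a token like "third" from its dependents.

    dependents holds (deprel, form) pairs, e.g. [("det", "the")].
    Returns "fraction", "ordinal", or None when the context gives no cue.
    """
    deprels = {deprel for deprel, _form in dependents}
    if "nummod" in deprels:
        # nummod(third, x): very probably a fraction ("one third", "two thirds").
        return "fraction"
    if "det" in deprels:
        # det(third, the): probably the ordinal sense ("the third").
        return "ordinal"
    return None
```

The word form itself is never consulted, which matches the claim that the word stays the same and only the context differs.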

> > or the feature NumForm (else than Word, which becomes redundant) excludes any other morphologic feature and other ones like NumType.
>
> In some languages, ordinals expressed in digits are different from cardinals: 3 = "three", 3. = "third". Czech is one of those languages but unfortunately the tokenization used in the Czech data separates the period from the digit, so we don't really have a token that could get NumForm=Digit|NumType=Ord.

I would still lean towards seeing this as a symbolic, conventional representation beyond the lexicon. Maybe we could just be content with a relation of amod. Anyway, a lemmatisation problem arises here, too: do we want to assign third to 3.? Would this make sense? What I meant before is that here we do not simply have an abbreviation, i.e. a subset of the string of a form, but a totally different representation, which seems to have different properties beyond language.

@dan-zeman
Member

> do we want to assign third to 3.?

No, unless we also say that the lemma of 3 is three, which we don't do in Czech.
