Roman numeral lemmas: upper or lower case? #987

rhdunn · 2023-11-03T18:00:10Z

Looking at the English treebanks for Roman numerals like I and IX, EWT and GUM preserve the casing while PUD normalizes the forms to lower case.

What should be used here? I would have thought that a lower case lemma would be used, as is done for nouns and other tokens.


dan-zeman · 2023-11-03T18:05:55Z

This could be standardized across (at least) European languages but the possible obstacle is that language-specific lemmatization customs differ.

Lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this. And if we accept that the canonical form (lemma) can contain uppercase characters, one may still question whether uppercase or lowercase is the prototypical form of Roman numerals. For me it is uppercase.

rhdunn · 2023-11-03T18:10:58Z

There are several examples in EWT where the Roman numeral form is lower case and the lemma is the same. Those and the PUD cases would need changing to use an uppercase lemma.
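
A minimal sketch of what such a change could look like on CoNLL-U lines, if uppercase were chosen. The function name `uppercase_roman_lemmas` and the restriction to tokens tagged `NUM` are assumptions made for illustration (the restriction keeps words such as the pronoun "I" or "mix" from being touched); this is not the script used in any treebank.

```python
import re

# Well-formed Roman numerals up to 4999, matched case-insensitively.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)

def uppercase_roman_lemmas(conllu_lines):
    """Return the lines with LEMMA uppercased for Roman numeral NUM tokens."""
    out = []
    for line in conllu_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # CoNLL-U token lines have 10 tab-separated columns:
        # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
        if len(cols) == 10 and cols[3] == "NUM" and ROMAN.match(cols[1]):
            cols[2] = cols[1].upper()
            line = "\t".join(cols)
        out.append(line)
    return out
```

Comment lines and blank lines pass through unchanged because they do not split into 10 columns.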

amir-zeldes · 2023-11-03T18:37:36Z

I would be open to normalizing to one form, and would have also expected all caps. Actually looking at GUM and GENTLE, they only have uppercase occurrences, so we wouldn't need to change anything there. Did you see a lower case one?

rhdunn · 2023-11-03T18:48:21Z

As far as I can tell, only EWT has lower case Roman numeral forms in sentences. The others all have upper case. EWT is keeping the case between the upper/lowercase forms and lemmas unchanged, while PUD is normalizing uppercase forms to lowercase.
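
For anyone who wants to check this on a treebank file, here is a rough sketch that counts (form casing, lemma casing) pairs for Roman numeral tokens in CoNLL-U lines. The helper names `case_of` and `roman_lemma_casing` are made up for this example, and limiting the check to tokens tagged `NUM` is an assumption that will miss numerals tagged otherwise.

```python
import re
from collections import Counter

# Well-formed Roman numerals up to 4999, matched case-insensitively.
ROMAN = re.compile(
    r"^(?=[MDCLXVI])M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)

def case_of(text):
    """Classify a string as 'upper', 'lower', or 'mixed'."""
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    return "mixed"

def roman_lemma_casing(conllu_lines):
    """Count (form casing, lemma casing) pairs for Roman numeral NUM tokens."""
    counts = Counter()
    for line in conllu_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue  # comment, blank, or malformed line
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "NUM" and ROMAN.match(form):
            counts[(case_of(form), case_of(lemma))] += 1
    return counts
```

A treebank that keeps the casing would show only `("upper", "upper")` and `("lower", "lower")` pairs, while one that lowercases lemmas would show `("upper", "lower")`.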

AngledLuffa · 2023-11-03T19:22:21Z

I can certainly do whatever to PUD. No strong preference on the result, but I do like having lemmas be consistent between the treebanks.

nschneid · 2023-11-03T21:15:49Z

Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii" for a subsection), I would suggest not normalizing Roman numeral case in the lemma. Except to make it consistent across the characters ("IIi" is probably a typo meaning "III"; and "charles ii" is presumably a nonstandard way of writing "Charles II"). If we wanted to normalize in general, maybe a MISC feature with the decimal form of the number?
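
To make the two ideas concrete, here is a small sketch: keep the case of the form in the lemma unless the form mixes cases, and compute the decimal value separately. The names `roman_lemma` and `roman_to_int` are invented for this example, resolving mixed-case forms to uppercase is an assumption (it matches the "IIi" example but is only one possible rule), and how the value would be stored in MISC is left open.

```python
# Values of the individual Roman numeral symbols.
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_lemma(form):
    """Keep consistent casing as-is; resolve mixed casing to uppercase (assumption)."""
    if form.isupper() or form.islower():
        return form
    return form.upper()

def roman_to_int(numeral):
    """Convert a well-formed Roman numeral (any casing) to its decimal value."""
    total = 0
    largest = 0
    # Read right to left: a symbol smaller than one already seen is subtractive.
    for ch in reversed(numeral.upper()):
        value = VALUES[ch]
        if value < largest:
            total -= value
        else:
            total += value
            largest = value
    return total
```

With this rule "III" and "iii" keep distinct lemmas but share the decimal value 3, and "IIi" is lemmatized as "III".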

Stormur · 2023-11-09T12:30:33Z

> Lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this.

I think this has extremely good reasons, better than keeping a capitalisation, and incidentally it is what is done in (at least three) Latin treebanks.

But probably Roman numerals are slightly different, in that they are factually symbols. We should just choose whether to keep them lower- or uppercase. Since the former form is used, I would probably lean towards that for more uniformity.

> Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii"

I do not think there is any difference at all between those "3s". I am not convinced we should let "orthographic tricks" percolate into lemmatisation; that is the major point.

> in general, maybe a MISC feature with the decimal form of the number

This I think is needed in general for numeric values and probably, to make it include all possible values, the only current solution seems to be to put it into MISC, as discussed recently.

dan-zeman added this to the v2.14 milestone Nov 3, 2023

dan-zeman added standard needed lemmatization labels Nov 3, 2023

dan-zeman added a commit to UniversalDependencies/UD_English-PUD that referenced this issue Nov 3, 2023

Roman numerals lemmatized to uppercase (UniversalDependencies/docs#987).

d1ab9ec

dan-zeman closed this as completed May 15, 2024
