Roman numeral lemmas: upper or lower case? #987

Closed
rhdunn opened this issue Nov 3, 2023 · 7 comments

@rhdunn

rhdunn commented Nov 3, 2023

Looking at the English treebanks for Roman numerals like I and IX, EWT and GUM preserve the casing while PUD normalizes the forms to lower case.

What should be used here? I would have thought that a lower case lemma would be used, as is done for nouns and other tokens.

@dan-zeman
Member

This could be standardized across (at least) European languages, but a possible obstacle is that language-specific lemmatization customs differ.

A lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this. And if we accept that the canonical form (lemma) can contain uppercase characters, one may still question whether uppercase or lowercase is the prototypical form of Roman numerals. For me it is uppercase.

@rhdunn
Author

rhdunn commented Nov 3, 2023

There are several examples in EWT where the Roman numeral form is lower case and the lemma is the same. Those and the PUD cases would need changing to use an uppercase lemma.

@amir-zeldes
Contributor

I would be open to normalizing to one form, and would have also expected all caps. Actually looking at GUM and GENTLE, they only have uppercase occurrences, so we wouldn't need to change anything there. Did you see a lower case one?

@rhdunn
Author

rhdunn commented Nov 3, 2023

As far as I can tell, only EWT has lower case Roman numeral forms in sentences. The others all have upper case. EWT keeps the lemma in the same case as the form, whether upper or lower, while PUD normalizes uppercase forms to lowercase lemmas.
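For anyone who wants to check this on their own copy of a treebank, here is a minimal sketch of the kind of audit described above. It is not the script anyone in this thread used. It assumes standard 10-column CoNLL-U input and, as a simplifying assumption, treats a token as a Roman numeral only if its UPOS is NUM and its form matches a strict Roman numeral pattern; the function names are made up for illustration.

```python
import re
from collections import Counter

# Strict Roman numeral pattern (1..3999), matched case-insensitively.
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)


def classify_case(text):
    """Return 'upper', 'lower', or 'mixed' for a token string."""
    if text.isupper():
        return "upper"
    if text.islower():
        return "lower"
    return "mixed"


def audit_roman_lemmas(conllu_lines):
    """Count (form case, lemma case) pairs for Roman numeral tokens.

    Assumption: a Roman numeral is a token tagged NUM whose FORM matches
    ROMAN_RE. Treebanks that tag such tokens differently will be missed.
    """
    counts = Counter()
    for line in conllu_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        tok_id, form, lemma, upos = cols[0], cols[1], cols[2], cols[3]
        if not tok_id.isdigit():
            continue  # skip multiword token ranges and empty nodes
        if upos != "NUM" or not ROMAN_RE.match(form):
            continue
        counts[(classify_case(form), classify_case(lemma))] += 1
    return counts
```

Feeding it the lines of a `.conllu` file (for example `audit_roman_lemmas(open(path, encoding="utf-8"))`) gives a quick picture: a treebank that preserves case would show only `("upper", "upper")` and `("lower", "lower")` pairs, while one that lowercases lemmas would show `("upper", "lower")`.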

dan-zeman added a commit to UniversalDependencies/UD_English-PUD that referenced this issue Nov 3, 2023
@AngledLuffa

AngledLuffa commented Nov 3, 2023 via email

@nschneid
Contributor

nschneid commented Nov 3, 2023

Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii" for a subsection), I would suggest not normalizing Roman numeral case in the lemma. Except to make it consistent across the characters ("IIi" is probably a typo meaning "III"; and "charles ii" is presumably a nonstandard way of writing "Charles II"). If we wanted to normalize in general, maybe a MISC feature with the decimal form of the number?
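To make the suggestion concrete, here is a rough sketch of both parts: repairing only mixed-case numerals, and recording the decimal value in MISC. Several things here are assumptions rather than anything UD defines: the MISC key name `NumValue` is hypothetical, the helper names are made up, and resolving mixed case by majority vote (ties going to uppercase) is just one possible reading of "make it consistent across the characters".

```python
import re

# Strict Roman numeral pattern (1..3999), applied to the uppercased string.
ROMAN_RE = re.compile(
    r"^(?=[MDCLXVI])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral (any casing) to its decimal value."""
    upper = numeral.upper()
    if not ROMAN_RE.match(upper):
        raise ValueError(f"not a well-formed Roman numeral: {numeral!r}")
    total = 0
    for i, ch in enumerate(upper):
        value = VALUES[ch]
        # A smaller value before a larger one is subtracted (IV, IX, XL, ...).
        if i + 1 < len(upper) and VALUES[upper[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def roman_lemma(form):
    """Keep consistent casing as-is; repair mixed casing only.

    Assumption: mixed case is resolved by majority, ties go to uppercase.
    """
    if form.isupper() or form.islower():
        return form
    upper_count = sum(ch.isupper() for ch in form)
    return form.upper() if upper_count * 2 >= len(form) else form.lower()


def misc_with_value(misc, form, key="NumValue"):
    """Append key=<decimal value> to a CoNLL-U MISC string.

    The key name 'NumValue' is hypothetical, not an established UD attribute.
    """
    entry = f"{key}={roman_to_int(form)}"
    return entry if misc in ("", "_") else f"{misc}|{entry}"
```

With this, "III" and "iii" keep distinct lemmas but would both carry the same decimal value in MISC, so the casing distinction survives while the numeric value stays queryable.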

@Stormur
Contributor

Stormur commented Nov 9, 2023

> Lemma definitely does not have to be all-lowercase if the base form is normally written capitalized. For example, London should not be lemmatized london. But I know that some treebanks do even this.

I think there are extremely good reasons for this, better than those for keeping capitalisation, and incidentally it is what is done in (at least three) Latin treebanks.

But Roman numerals are probably slightly different, in that they are in fact symbols. We should just choose whether to keep them lower- or uppercase. Since the former (lowercase) form is used, I would probably lean towards that for more uniformity.

> Because casing sometimes disambiguates referents (an outline or legal document might use "III" for a large section and "iii"

I do not think there is any difference at all between those "3s". I am not convinced we should let "orthographic tricks" percolate into lemmatisation; that is the major point.

> in general, maybe a MISC feature with the decimal form of the number

I think this is needed in general for numeric values, and to make it cover all possible values, the only current solution seems to be to put it into MISC, as discussed recently.
