Gender and Number #723

Closed
perrier54 opened this issue Aug 26, 2020 · 19 comments

Comments

@perrier54
Contributor

perrier54 commented Aug 26, 2020

In the annotation of UD_French-GSD, I have not marked the features Gender and Number when they are determined not by the word form but by the context.

Examples:
relative pronoun qui (who): no Gender, no Number
determiner les (the): no Gender
adjective libre (free): no Gender

I have made an exception for nouns, which I consider to always have a gender and a number that can be determined from the context. Without this exception, corpus annotation would become difficult.

Examples:
elle est ministre (she is a minister) - ministre[Gender=Fem,Number=Sing]
il est ministre (he is a minister) - ministre [Gender=Masc,Number=Sing]
les souris dansent (the mice are dancing) - souris[Gender=Fem,Number=Plur]
la souris danse (the mouse is dancing) - souris[Gender=Fem,Number=Sing]

When gender cannot be determined by word form or context, it is not marked.

Examples:
il a quatre enfants (he has four children) - enfants[Number=Plur]
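
As a rough illustration of how the bracketed annotations above would look once stored in a treebank, here is a minimal sketch that parses CoNLL-U style FEATS strings (features joined with `|`, `_` for none). The helper name `parse_feats` and the `EXAMPLES` list are hypothetical, and only Gender and Number are shown; real tokens would carry other features as well.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# (form, FEATS) pairs mirroring the noun examples in this comment.
# Only Gender and Number are listed; this is a simplification.
EXAMPLES = [
    ("ministre", "Gender=Fem|Number=Sing"),   # elle est ministre
    ("ministre", "Gender=Masc|Number=Sing"),  # il est ministre
    ("souris", "Gender=Fem|Number=Plur"),     # les souris dansent
    ("enfants", "Number=Plur"),               # gender not recoverable
]

for form, feats in EXAMPLES:
    parsed = parse_feats(feats)
    print(form, parsed.get("Gender", "(no Gender)"), parsed.get("Number"))
```

With this representation, "not marked" simply means the key is absent from the dictionary, which is what the last example shows for enfants.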

dan-zeman added this to the v2.7 milestone Aug 26, 2020
@jheinecke
Contributor

jheinecke commented Sep 10, 2020

What are the guidelines for the article l'? I found cases with and without Gender, even when the context is clear (fr-ud-test_00004:9, fr-ud-test_00098:41, fr-ud-test_00107:2 and others).
I have written a small script to check some basic linguistic consistency (agreement, absent features, fixed expressions), which also found some other issues. I'll push it to GitHub soon.
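
The script mentioned here is not shown in the thread. Purely as an illustration of what an agreement check of this kind might look like, here is a minimal sketch that flags DET and ADJ tokens whose Gender or Number conflicts with that of their NOUN head. The token dictionaries, the function name `agreement_conflicts`, and the sample sentence are all assumptions; a feature missing on either side is treated as compatible, so an l' without Gender is not reported.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string ('_' means no features) into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def agreement_conflicts(tokens, features=("Gender", "Number")):
    """Return (dependent form, head form, feature) for each DET/ADJ that
    carries a value different from its NOUN head for the same feature."""
    by_id = {tok["id"]: tok for tok in tokens}
    conflicts = []
    for tok in tokens:
        if tok["upos"] not in ("DET", "ADJ"):
            continue
        head = by_id.get(tok["head"])
        if head is None or head["upos"] != "NOUN":
            continue
        dep_feats = parse_feats(tok["feats"])
        head_feats = parse_feats(head["feats"])
        for name in features:
            if name in dep_feats and name in head_feats:
                if dep_feats[name] != head_feats[name]:
                    conflicts.append((tok["form"], head["form"], name))
    return conflicts


# "la souris danse" with a deliberately wrong Gender on the article.
SENTENCE = [
    {"id": 1, "form": "la", "upos": "DET", "feats": "Gender=Masc|Number=Sing", "head": 2},
    {"id": 2, "form": "souris", "upos": "NOUN", "feats": "Gender=Fem|Number=Sing", "head": 3},
    {"id": 3, "form": "danse", "upos": "VERB", "feats": "Number=Sing|Person=3", "head": 0},
]

print(agreement_conflicts(SENTENCE))
```

This only covers dependents attached directly to a noun. Predicative adjectives, which in UD are not dependents of the noun they agree with, would need a separate rule.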

@perrier54
Contributor Author

According to my principles, the article l' has no gender. In the version of UD_French-GSD on GitHub, there are still some inconsistencies, which are corrected in the current version.
I have also written patterns for GREW to check the agreement in number and gender in French UD corpora. These patterns are designed for the SUD format but they are easily adaptable to UD because there is no change of features from SUD to UD.

@Stormur
Contributor

Stormur commented Sep 10, 2020

I think that, according to the UD guidelines°, in a language like French (and in Romance languages in general, as in Italian, where l' also occurs) all articles should be assigned a Gender. The first consideration is that the l' in question is just the graphic form of either le or la, both of which have a clear gender.

But besides this, I find the criterion of assigning gender and number only to some words that "inherently" possess them quite incoherent. The fact is that French apparently still makes the difference (maybe principally graphic, but not only... fou/folle), so it should be that either every word has the features Gender and Number, or none.

In any case, was the original post asking for comments on this? Mine is that a possible "absence" or "bleaching" of the gender/number distinction should, if it exists, emerge from the observation that the majority of forms no longer make the distinction (so, for example, we would detect that les never makes this difference); it should not be excluded a priori, which makes such analyses problematic. But of course, there are some words which by their nature do not activate this distinction and "never" have, like qui or je.

° I am referring to this passage about the Common gender, which can be well adapted for the case in question:

Note that it could also be expressed as a combined value Gender=Fem,Masc. Nevertheless we keep Com also as a separate value. Combined feature values should only be used in exceptional, undecided cases, not for something that occurs systematically in the grammar. Language-specific extensions to these guidelines should determine whether the Com value is appropriate for a particular language.

Note further that the Com value is not intended for cases where we just cannot derive the gender from the word itself (without seeing the context), while the language actually distinguishes Masc and Fem. For example, in Spanish, nouns distinguish two genders, masculine and feminine, and every noun can be classified as either Masc or Fem. Adjectives are supposed to agree with nouns in gender (and number), which they typically achieve by alternating -o / -a. But then there are adjectives such as grande or feliz that have only one form for both genders. So we cannot tell whether they are masculine or feminine unless we see the context. Yet they are either masculine or feminine (feminine in una ciudad grande, masculine in un puerto grande). Therefore in Spanish we should not tag grande with Gender=Com. Instead, we should either drop the gender feature entirely (suggesting that this word does not inflect for gender) or tag individual instances of grande as either masculine or feminine, depending on context.

@sylvainkahane
Contributor

One solution is to consider that features are associated with lemmas and not with forms. The lemma le has Gender because it has masculine and feminine forms le and la in the singular, and therefore we could keep Gender in the plural les even if it is neutralized, as well as in l' used in front of vowels.

For adjectives, we have clearly invariable lemmas in French, such as cool or fun, borrowed from English. The adjective marron 'brown' does not inflect in Gender because it is a conversion of the noun marron 'chestnut'. I find it reasonable to consider that other "true" French adjectives such as fragile or rouge 'red', which have the same forms in masculine and feminine, do not have Gender, even if the majority of adjectives have it. Some adjectives such as heureux 'happy' have two feminine forms (heureuse, heureuses) but neutralize Number in the masculine, so they should also have Number in the masculine if we consider that it is a property of the lemma.

@jheinecke
Contributor

Personally I'd prefer to see Gender and Number on every NOUN and consequently on every DET (at least articles) and ADJ, even if it is not shown in the surface form (as with l' and marron), since all nouns in French have a Gender and there is Gender/Number agreement in French between adjectives and nouns.
Last but not least, an important downstream application is the training of parsers, and the more regular the training data is, the better the (C/M)LAS will be (I have not yet tested this, though).

@Stormur
Contributor

Stormur commented Sep 11, 2020

> For adjectives, we have clearly invariable lemmas in French, such as cool or fun, borrowed from English. The adjective marron 'brown' does not inflect in Gender because it is a conversion of the noun marron 'chestnut'. I find it reasonable to consider that other "true" French adjectives such as fragile or rouge 'red', which have the same forms in masculine and feminine, do not have Gender, even if the majority of adjectives have it. Some adjectives such as heureux 'happy' have two feminine forms (heureuse, heureuses) but neutralize Number in the masculine, so they should also have Number in the masculine if we consider that it is a property of the lemma.

cool or fun appear as members of a quite easily defined class that could effectively not be marked for gender/number (and probably this is reflected in the orthography, too, e.g. there are no *coole, *cools), but that could also be singled out as such with e.g. Foreign=Yes.
Also, we might question whether words like marron are adjectives at all: there is the same phenomenon at least in Italian, with color terms such as viola 'purple'. These are nouns (viola = the flower viola) used as modifiers which still behave as apposed nouns, e.g. it is not acceptable to say case viole 'purple houses' instead of the correct case viola, so I do not deem it unthinkable to label them as NOUNs which we often register used as nmods.
So both of the aforementioned categories are in many respects really different from invariable "true" adjectives like fragile, and probably one cannot apply the same techniques to all of them.

As @jheinecke also says, it seems that French still recognizes gender and number in its system as a whole. If some word classes or single forms are losing this distinction, it is something that should emerge a posteriori, but, in my opinion, for coherence all members of the classes NOUN, DET and ADJ should keep these features for now.

@sylvainkahane
Contributor

So we have three possible solutions:

  1. Attributes are attached to word-forms: two forms of the same lemma (le and les) can have different attributes.

  2. Attributes are attached to lemmas: two lemmas of the same POS (fragile and grand) can have different attributes, but two forms of the same lemma must have the same attributes.

  3. Attributes are attached to POS: two lemmas of the same POS must have the same attributes, with very few exceptions such as Foreign=Yes.

How do we decide? What are the general principles that help us to decide?
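
To make the three options concrete, here is a minimal sketch that inspects which of Gender and Number are present on each token and reports whether that inventory is uniform per lemma (as option 2 requires) and per POS (as option 3 requires). The function name `attachment_report`, the tuple layout, and both toy samples are assumptions made for illustration; the Foreign=Yes exceptions are not modelled.

```python
from collections import defaultdict


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string ('_' means no features) into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def attachment_report(tokens, tracked=("Gender", "Number")):
    """Return (uniform_per_lemma, uniform_per_upos) for the tracked features.

    tokens is an iterable of (upos, lemma, form, feats) tuples.
    """
    per_lemma = defaultdict(set)  # (upos, lemma) -> distinct feature-name sets
    per_upos = defaultdict(set)   # upos -> distinct feature-name sets
    for upos, lemma, form, feats in tokens:
        names = frozenset(n for n in parse_feats(feats) if n in tracked)
        per_lemma[(upos, lemma)].add(names)
        per_upos[upos].add(names)
    uniform_per_lemma = all(len(v) == 1 for v in per_lemma.values())
    uniform_per_upos = all(len(v) == 1 for v in per_upos.values())
    return uniform_per_lemma, uniform_per_upos


# Option 1: les drops Gender, so two forms of the lemma le differ.
FORM_LEVEL = [
    ("DET", "le", "le", "Gender=Masc|Number=Sing"),
    ("DET", "le", "les", "Number=Plur"),
]

# Option 2: every form of le keeps Gender, but fragile and grand differ.
LEMMA_LEVEL = [
    ("DET", "le", "le", "Gender=Masc|Number=Sing"),
    ("DET", "le", "les", "Gender=Fem|Number=Plur"),
    ("ADJ", "fragile", "fragile", "Number=Sing"),
    ("ADJ", "grand", "grande", "Gender=Fem|Number=Sing"),
]

print(attachment_report(FORM_LEVEL))   # (False, False)
print(attachment_report(LEMMA_LEVEL))  # (True, False)
```

A treebank following option 3 would return `(True, True)`. Running a check like this on real data only describes what the annotation currently does; it does not settle which option is linguistically right.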

@dan-zeman
Member

Gender is normally a feature of the lexeme for nouns and a feature of the word form for words that show morphological agreement with nouns (I would say that this is the case of French adjectives and determiners). The lexeme is represented by the lemma together with the UPOS tag (if the part of speech changes but the lemma still looks the same, the set of features may change). Similarly, it is possible in UD that two lexemes have the same-looking lemma because we do not allow disambiguating numeric suffixes in the LEMMA column.

@jheinecke
Contributor

In the case of articles and adjectives, Gender and Number are rather features of the form, and sometimes only context can decide. If the syntactic context is unambiguous, they should be annotated in every case.
VERBs are a similar case: they have Number and Person features if VerbForm=Fin. Forms which are ambiguous out of context (like mange, 1st or 3rd Person at least for Indicative Present) are nevertheless marked Person=1 or Person=3 in French-GSD and Sequoia.

@sylvainkahane
Contributor

I think it would be too bad to close this issue without giving some general principles to decide at which level the features are attached. I can agree with @dan-zeman's and @jheinecke's solutions but I don't see clearly enough what motivates them. I will try to give some clues:

  • in French, nouns trigger the agreement of adjectives and articles for Gender, so Gender must be attached to every noun. For Number, it's less clear that it's the noun form that triggers the agreement of the article, especially because in French Number appears in the written form of nouns but is not pronounced, and in spoken French the number of NPs is given by the determiner. But OK, for the sake of simplicity, to decide to attach Number to nouns.

  • for adjectives, you propose to consider that Gender is attached to forms. Is it because adjectives do not trigger agreement on other words?

@dan-zeman Another remark about lexemes: in French, a lemma can correspond to two lexemes with two different genders, such as une moule 'a mussel' and un moule 'a mold'. In Wolof and other African languages with multiple noun classes, a lemma can correspond to many lexemes of different classes. The lexeme is better represented by the lemma together with the POS and additional features such as Gender. (Of course, due to homonymy, it is not sufficient.)

@Stormur
Contributor

Stormur commented Sep 15, 2020

Continuing what @dan-zeman has said (even if it may sound obvious), in UD we have to distinguish between so-called "lexical" and "morphological" features, but such "lexicality" can change according to the word class, the POS (which I consider by itself just a "shortcut" to refer to some given morphosyntactic features). So, as Dan hinted at, I'd say that the Gender is "inherent", lexical for NOUNs, but "variable", morphological for ADJs and DETs (at least in Romance languages, and this is exactly a way to define these classes in those languages). For me, the main observation/criterion is that a variable element "adapts" its Gender to the inherently gendered element, whereas the contrary never happens (in the words of @sylvainkahane , yes, "adjectives do not trigger agreement on other words"). The result is that we will have to register a Gender in both cases, but in the first one this choice will depend on the lexeme, in the second case on the word form. I think that the choice is in general "lexical" (i.e. linked to the lexeme) for PRONs, but, contrary to NOUNs, this does not mean that a Gender is always present there (e.g. je).

Just for the sake of comparison, let's take a completely different language like Mongolian. There, agreement does not exist. Gender is totally absent and can only be considered, if at all, a lexical property: we have pairs like улаан 'red' and улаагчин 'red (for female animals)', but there is no other trace thereof anywhere else. Number can be marked morphologically on nearly all nominal elements (e.g. монгол vs. монголчууд 'Mongol(s)', та vs. та нар 'you (polite, sg. vs. pl.)'), but it is nearly always a marked choice and it can be omitted, depending on the context. So, here I see a language where, for all word classes, we have an exclusively form-dependent Number and an exclusively lexeme-dependent (extremely limited) Gender. Agreement has no role and indeed, both features have a "sporadic" nature.

So in the end I think that we can see the outlines of clear criteria by which we determine the distribution of the Gender/Number (and other) features.


Anyway, I agree with @sylvainkahane that the presence of Number is much more doubtful in French. The article les bears it, but beside that, it does not seem other than just a graphical convention. I would be in favor of not marking it anywhere for NOUNs, ADJs and DETs (apart from some motivated exceptions like les).

As for ambiguity, from what I have understood, we should not worry about that in UD. In the end, the distinction between true homonymies is a semantic fact, whereas two elements might be otherwise morphosyntactically indistinguishable.

@dan-zeman
Member

If Number is distinguished in writing, I would be in favor of annotating it, even though the distinction is not pronounced. Many NLP applications work with written data and it would be a pity if we cannot train models that will learn the number from the text.

@Stormur
Contributor

Stormur commented Sep 15, 2020

> If Number is distinguished in writing, I would be in favor of annotating it, even though the distinction is not pronounced. Many NLP applications work with written data and it would be a pity if we cannot train models that will learn the number from the text.

Would it be feasible (or sensible) to annotate somehow the difference between an effective linguistic distinction and a merely graphical one? Like for example Number[written] instead of the simple Number?

@dan-zeman
Member

> Would it be feasible (or sensible) to annotate somehow the difference between an effective linguistic distinction and a merely graphical one? Like for example Number[written] instead of the simple Number?

It looks weird to me but possible it is :-)

@jheinecke
Contributor

> Would it be feasible (or sensible) to annotate somehow the difference between an effective linguistic distinction and a merely graphical one? Like for example Number[written] instead of the simple Number?

I wouldn't do it. Imagine a case where different (standard language) speakers pronounce a plural form differently: one speaker pronounces the plural morpheme and the other doesn't. You would get all the problems of phonology into UD treebanks. And apart from that, you would get irregularities which learning programmes have to tackle, e.g. in French enfant and enfants are written differently but pronounced the same way, whereas in bœuf and bœufs you add just the same s but here it is pronounced differently (/böf/ vs /bö/; sorry, no IPA keyboard at hand). For bœufs you'd have Number=Plur but for enfants you would put Number[written]=Plur. Not to speak of os, which is written the same in Sing/Plur but pronounced differently (/os/ vs /o:/).

@sylvainkahane What I meant with "in the case of articles and adjectives, Gender and Number are rather features of the form" is that their lemma does not have a Gender; only the form (le, la, verte) is marked for Gender.

@perrier54
Contributor Author

My proposal aimed at facilitating the annotation of corpora by avoiding the use of context to determine certain feature values. Nevertheless, I recognize its inconsistency on some points: I proposed to mark the number of nouns in a systematic way, but this number is sometimes derived from the context.

What I take from the above discussion for French is to mark the number and gender of forms for all nouns, determiners and adjectives, except for some adjectives, as mentioned by @sylvainkahane.

In some cases, the derivation of the number or gender of an adjective is not direct, as in the example "l'exercice que je trouve facile (the exercise that I find easy)". The adjective facile refers to exercice, which is masculine.

Some features are attached to the lemmas (noun gender for example) and others to the word forms (noun number for example), but the distinction in French is systematic and it is not interesting to mark it.
Moreover, I don't think it is interesting to make a distinction between written and oral features, at least for written texts, in which the features are derived from the graphic form of words.
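
The conclusion above (Gender and Number on all nouns, determiners and adjectives, except for some adjectives) could be turned into a review list along the following lines. This is a minimal sketch under stated assumptions: the `EXEMPT_ADJ` set is a hypothetical stand-in built from adjectives cited earlier in the thread, not an agreed list, and the hits are candidates for manual review rather than errors, since a noun such as enfants may legitimately lack Gender.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string ('_' means no features) into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Features expected per UPOS under the conclusion reached in this comment.
EXPECTED = {
    "NOUN": {"Gender", "Number"},
    "DET": {"Gender", "Number"},
    "ADJ": {"Gender", "Number"},
}

# Hypothetical exemption list; which adjectives belong here is still open.
EXEMPT_ADJ = {"cool", "fun", "marron"}


def missing_features(tokens):
    """Return (form, sorted missing feature names) for tokens to review.

    tokens is an iterable of (upos, lemma, form, feats) tuples.
    """
    review = []
    for upos, lemma, form, feats in tokens:
        if upos == "ADJ" and lemma in EXEMPT_ADJ:
            continue
        missing = EXPECTED.get(upos, set()) - parse_feats(feats).keys()
        if missing:
            review.append((form, sorted(missing)))
    return review


SAMPLE = [
    ("DET", "le", "l'", "Number=Sing"),
    ("NOUN", "exercice", "exercice", "Gender=Masc|Number=Sing"),
    ("ADJ", "facile", "facile", "Number=Sing"),
    ("ADJ", "cool", "cool", "_"),
]

print(missing_features(SAMPLE))
```

In the sample, l' and facile are reported as lacking Gender, while cool is skipped because of the exemption list.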

@dseddah

dseddah commented Sep 16, 2020

> Anyway, I agree with @sylvainkahane that the presence of Number is much more doubtful in French. The article les bears it, but beside that, it does not seem other than just a graphical convention. I would be in favor of not marking it anywhere for NOUNs, ADJs and DETs (apart from some motivated exceptions like les).

I see the point, but it's worth noting that there are nouns that are number-marked, like journal/journaux*, bocal/bocaux** (with the -al suffix, or some with the -ail one, which become -aux in the plural), so how about them?

*newspaper
**jar

@Stormur
Contributor

Stormur commented Sep 16, 2020

> > Anyway, I agree with @sylvainkahane that the presence of Number is much more doubtful in French. The article les bears it, but beside that, it does not seem other than just a graphical convention. I would be in favor of not marking it anywhere for NOUNs, ADJs and DETs (apart from some motivated exceptions like les).
>
> I see the point, but it's worth noting that there are nouns that are number-marked, like journal/journaux*, bocal/bocaux** (with the -al suffix, or some with the -ail one, which become -aux in the plural), so how about them?
>
> *newspaper
> **jar

Well, I got confused and forgot about them, which just nails down the necessity of marking the Number in French (apart from other very interesting and subtle considerations like that about bœuf) 😄

@dseddah

dseddah commented Sep 16, 2020

or "yeux" :)


6 participants