Gender and Number #723
What are the guidelines for the article l'? I found cases with and without
According to my principles, article |
I think that, according to the UD guidelines°, in a language like French (and in Romance languages in general, as in Italian, where l' also occurs) all articles should be assigned a But beside this, I find the criterion of assigning gender and number only to some words that "inherently" possess it quite incoherent. The fact is that French apparently still makes the difference (maybe principally graphic, but not only... fou/folle), so it should be that either every word has the features In any case, was the original post asking for some comments about this? Mine is that a possible "absence" or "bleaching" of the gender/number distinction should, if it is the case, emerge from the observation that the majority of forms no longer make the distinction (so, for example, we would detect that les never makes this difference); it should not, however, be excluded a priori, which makes such analyses problematic. But of course, there are some words which by their nature do not activate this distinction and "never" have, like qui or je. ° I am referring to this passage about the
One solution is to consider that features are associated with lemmas and not with forms. The lemma le has For adjectives, we have clearly invariable lemmas in French, such as cool or fun, borrowed from English. The adjective marron 'brown' does not inflect in
Personally I'd prefer to see Gender and Number on every NOUN and, in consequence, on every DET (at least articles) and ADJ, even if it is not shown in the surface form (as with l' and marron), since all nouns in French have a Gender, and there is Gender/Number agreement in French between adjectives and nouns.
cool or fun appear as members of a quite easily defined class that could effectively not be marked for gender/number (and probably this is reflected in the orthography, too, e.g. there are no *coole, *cools), but that could also be singled out as such with e.g. As also @jheinecke says, it seems that French still recognizes gender and number in its system as a whole. If some word classes or single forms are losing this distinction, it is something that should emerge a posteriori, but, in my opinion, for coherence all members of the classes
So we have three possible solutions:
How do we decide? What are the general principles that help us to decide? |
Gender is normally a feature of the lexeme for nouns and a feature of the word form for words that show morphological agreement with nouns (I would say that this is the case of French adjectives and determiners). The lexeme is represented by the lemma together with the UPOS tag (if the part of speech changes but the lemma still looks the same, the set of features may change). Similarly, it is possible in UD that two lexemes have the same-looking lemma because we do not allow disambiguating numeric suffixes in the LEMMA column. |
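To make the lexeme idea above concrete, here is a minimal sketch in Python. It keys lexical features on the pair (lemma, UPOS), so that two lexemes sharing a lemma can carry different feature sets. The `Lexeme` type, the `LEXICAL_FEATS` table and the example word ferme are illustrative assumptions, not part of any UD tool or treebank.

```python
from typing import Dict, NamedTuple


class Lexeme(NamedTuple):
    """A lexeme identified by lemma plus UPOS tag (hypothetical helper type)."""
    lemma: str
    upos: str


# Hypothetical toy lexicon: features that belong to the lexeme itself.
LEXICAL_FEATS: Dict[Lexeme, Dict[str, str]] = {
    # 'farm': for a noun, Gender is a property of the lexeme.
    Lexeme("ferme", "NOUN"): {"Gender": "Fem"},
    # 'firm': for an adjective, Gender is not lexical; it comes from agreement.
    Lexeme("ferme", "ADJ"): {},
}


def lexical_feats(lemma: str, upos: str) -> Dict[str, str]:
    """Return the lexeme-level features, or an empty dict if none are known."""
    return LEXICAL_FEATS.get(Lexeme(lemma, upos), {})


print(lexical_feats("ferme", "NOUN"))
print(lexical_feats("ferme", "ADJ"))
```

The same lemma thus yields Gender=Fem as a lexical feature when it is a NOUN and no lexical gender when it is an ADJ, whose Gender would be recorded on the word form instead.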
In the case of articles and adjectives, Gender and Number are rather features of the form, and sometimes only context can decide. If the syntactic context is non-ambiguous, they should be annotated in every case.
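As a rough illustration of "let the context decide", the sketch below copies Gender and Number from a head noun to its determiner and adjectival modifiers when their own form leaves those features unspecified. It is a simplification under stated assumptions: tokens are plain dicts (not a real CoNLL-U library), only `det` and `amod` dependents of a NOUN are handled, and the function names `fill_from_head` and `format_feats` are hypothetical.

```python
# Dependents assumed to agree with their head noun (simplified assumption).
AGREEING = {("DET", "det"), ("ADJ", "amod")}
AGREEMENT_FEATS = ("Gender", "Number")


def fill_from_head(tokens):
    """Fill in missing Gender/Number on DET/ADJ from the governing NOUN."""
    by_id = {tok["id"]: tok for tok in tokens}
    for tok in tokens:
        if (tok["upos"], tok["deprel"]) not in AGREEING:
            continue
        head = by_id.get(tok["head"])
        if head is None or head["upos"] != "NOUN":
            continue
        for feat in AGREEMENT_FEATS:
            # Never overwrite a value that the form already provides.
            if feat not in tok["feats"] and feat in head["feats"]:
                tok["feats"][feat] = head["feats"][feat]
    return tokens


def format_feats(feats):
    """Serialize features in the CoNLL-U FEATS style, '_' when empty."""
    return "|".join(f"{k}={v}" for k, v in sorted(feats.items())) or "_"


# "l'école verte": the form l' alone does not show the gender.
sentence = [
    {"id": 1, "form": "l'", "upos": "DET", "head": 2, "deprel": "det",
     "feats": {"Definite": "Def", "PronType": "Art"}},
    {"id": 2, "form": "école", "upos": "NOUN", "head": 0, "deprel": "root",
     "feats": {"Gender": "Fem", "Number": "Sing"}},
    {"id": 3, "form": "verte", "upos": "ADJ", "head": 2, "deprel": "amod",
     "feats": {"Gender": "Fem", "Number": "Sing"}},
]

fill_from_head(sentence)
print(format_feats(sentence[0]["feats"]))
```

Cases such as coordination, predicative adjectives or nouns with no gender annotation would need more care than this sketch provides.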
I think it would be too bad to close this issue without giving some general principles to decide at which level the features are attached. I can agree with @dan-zeman's and @jheinecke's solutions but I don't see clearly enough what motivates them. I try to give some clues:
@dan-zeman Another remark about lexemes: in French, a lemma can correspond to two lexemes with two different genders, such as une moule 'a mussel' and un moule 'a mold'. In Wolof and other African languages with multiple noun classes, a lemma can correspond to many lexemes of different classes. The lexeme is better represented by the lemma together with the POS and additional features such as
Continuing what @dan-zeman has said (even if it may sound obvious), in UD we have to distinguish between so-called "lexical" and "morphological" features, but such "lexicality" can change according to the word class, the POS (which I consider by itself just a "shortcut" to refer to some given morphosyntactic features). So, as Dan hinted at, I'd say that the Just for the sake of comparison, let's take a completely different language like Mongolian. There, agreement does not exist. So in the end I think that we can see the outlines of clear criteria by which we determine the distribution of the Anyway, I agree with @sylvainkahane that the presence of As for ambiguity, from what I have understood, we should not worry about that in UD. In the end, the distinction between true homonymies is a semantic fact, whereas two elements might be otherwise morphosyntactically indistinguishable.
If |
Would it be feasible (or sensible) to annotate somehow the difference between an effective linguistic distinction and a merely graphical one? Like for example |
It looks weird to me but possible it is :-) |
I wouldn't do it. Imagine a case where different (standard language) speakers pronounce a plural form differently: one speaker pronounces the plural morpheme and the other doesn't. You would get all the problems of phonology into UD treebanks. And apart from that, you would get irregularities which learning programmes have to tackle, e.g. in French enfant and enfants are written differently but pronounced in the same way, whereas in bœuf and bœufs you add just the same s but here it is pronounced differently (/böf/ vs /bö/; sorry, no IPA keyboard at hand). For bœufs you'd have @sylvainkahane What I meant with "in the case of articles and adjectives, Gender and Number are rather features of the form" is that their lemma does not have a Gender; only the form (le, la, verte) is marked for Gender.
My proposal aimed at facilitating the annotation of corpora: avoid using context to determine certain feature values. Nevertheless, I recognize its inconsistency on some points: I proposed to mark the number of nouns in a systematic way, but this number is sometimes derived from the context. What I take from the above discussion for French is to mark the number and gender of forms for all nouns, determiners and adjectives, except for some adjectives, as mentioned by @sylvainkahane. In some cases, the derivation of the number or gender of an adjective is not direct as in the example: Some features are attached to the lemmas (noun gender for example) and others to the word forms (noun number for example), but the distinction in French is systematic and it is not interesting to mark it.
I see the point but it's worth noting that there are cases of nouns that are number-marked, like journal/journaux*, bocal/bocaux** (nouns with the -al suffix, and some with the -ail one, become -aux in the plural), so how about them? *newspaper
Well, I got confused and forgot about them, which just nails down the necessity of marking the
or "yeux" :) |
In the annotation of UD_French-GSD, I have not marked the features Gender and Number when they are not determined by the word form, but by the context. I have made an exception for nouns, which I consider to always have a gender and a number, which can be determined by the context. Without this exception, annotation of corpora would be made difficult.
When gender cannot be determined by word form or context, it is not marked.
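One possible reading of "determined by the word form" can be sketched as follows: a feature counts as form-determined only if every analysis of that form agrees on its value. The `ANALYSES` table is a hypothetical toy listing for the forms of the article le, not data taken from UD_French-GSD, and the function name is an assumption.

```python
# Hypothetical analyses for the forms of the definite article (lemma "le").
ANALYSES = {
    "le": [{"Gender": "Masc", "Number": "Sing"}],
    "la": [{"Gender": "Fem", "Number": "Sing"}],
    "l'": [{"Gender": "Masc", "Number": "Sing"},
           {"Gender": "Fem", "Number": "Sing"}],
    "les": [{"Gender": "Masc", "Number": "Plur"},
            {"Gender": "Fem", "Number": "Plur"}],
}


def form_determined(form):
    """Keep only the features on which all analyses of the form agree."""
    analyses = ANALYSES[form]
    first = analyses[0]
    return {
        feat: value
        for feat, value in first.items()
        if all(a.get(feat) == value for a in analyses)
    }


for form in ("le", "la", "l'", "les"):
    print(form, form_determined(form))
```

Under this reading, l' and les would carry Number but no Gender unless the context is consulted, which is the point on which the approaches discussed in this thread differ.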