Annotating Turkish "be" verb: i- #639
Comments
Since this is a guidelines question pertaining to the whole language (or even the language family, rather than just a bug report in a particular treebank), I am going to move the issue to the general issue tracker of the docs repository.
My opinion here is that these are clitics and should be treated as such, even at the expense of "minimality". I would ask @jonorthwash, @MemduhG and @coltekin to weigh in too.
As far as I know, all Turkish (Turkic?) treebanks are, so far, consistent about not segmenting the copular suffix -i after a verb. If there are differences among Turkish treebanks, they must be errors rather than deliberate decisions. Even though these are historically separate words, and one can still observe the non-suffix version (in marked usage), I think it is better not to split them, for two reasons. First, these suffixes interact with the other TAME suffixes on the verb: a proper TAME assignment to the verb is only possible if one considers the combination of both suffixes. Second, segmenting introduces a (non-trivial) step in processing, and we should avoid it if we can, for the sake of both human annotators and automatic tools. On a related note, the -i suffixes attached to nouns are segmented (relatively consistently) in all Turkish treebanks. Whether this is better (for some reason) or even necessary is another question. My initial position was in favor of splitting these morphemes, based on the fact that in certain constructions they may result in conflicting values for morphological features and require duplicate subject relations (particularly if the copula follows a nominalized verb). However, the way we currently annotate these cases does not really require splitting the copula off. Furthermore, I am not very happy with splitting the copular suffix when it is overt, but not doing so when it is null (Person=3,Number=Sing). I think we still need more discussion on these constructions.
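To make the two options discussed above concrete, here is a minimal Python sketch that renders one word either as a single CoNLL-U token or as a multiword token with the copular clitic split off. The example word `hastaydı` ("was sick"), its segmentation into `hasta` + `ydı`, the lemma `i` for the clitic, and the helper names are all illustrative assumptions rather than entries taken from any treebank. Only the ID, FORM and LEMMA columns are filled in; everything else is left as `_`.

```python
def conllu_line(tok_id, form, lemma="_"):
    """Return a 10-column CoNLL-U line with only ID, FORM and LEMMA filled."""
    return "\t".join([str(tok_id), form, lemma] + ["_"] * 7)


def unsplit(tok_id, form, lemma):
    """Option 1: keep the suffixed word as one syntactic word."""
    return [conllu_line(tok_id, form, lemma)]


def split_clitic(tok_id, form, parts):
    """Option 2: emit a multiword-token range line, then one line per part.

    `parts` is a list of (form, lemma) pairs (hypothetical segmentation).
    """
    last = tok_id + len(parts) - 1
    lines = [conllu_line(f"{tok_id}-{last}", form)]
    for offset, (part_form, part_lemma) in enumerate(parts):
        lines.append(conllu_line(tok_id + offset, part_form, part_lemma))
    return lines


if __name__ == "__main__":
    # Assumed example: "hastaydı" analysed as hasta + (y)dı, copula lemma "i".
    print("\n".join(unsplit(1, "hastaydı", "hasta")))
    print()
    print("\n".join(split_clitic(1, "hastaydı", [("hasta", "hasta"), ("ydı", "i")])))
```

In the split variant the surface form is preserved on the range line (`1-2`), so the choice between the two options affects the number of syntactic words and where the features live, not the recoverability of the original text.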
This current approach makes sense to me for the reasons you outlined. But I wonder what to do for other Turkic languages where tenses with copulas sometimes are mandatorily separated by a space, e.g. Kazakh:
This is an issue we brought up in our paper (see §4.4), and I'm still not sure what the right answer is.
I could go either way on this. The UD guidelines do specify, though, that empty lemmas should be avoided (and to a lesser extent sublemmas)?
By the way, the issue of how to analyse forms of verbs like
It is kind of a relief that nobody has a correct or unified answer. There is one thing that I have not been able to wrap my head around. Do we have to treat the use of the -i verb with nominals separately from its use with verbs? I believe a unified approach towards all of the -i forms would benefit NLP more. I also believe that one can split the -i verbs into two groups: those that can be used as a separate word and those that cannot. Even though both groups behave exactly the same prosodically and morphologically (stress assignment, vowel harmony blocking, derivation blocking), I believe there is no good reason to separate them syntactically if the parser never encounters them as a separate word. I agree with the reasons @coltekin gave. However, I also believe there is no single -DI morpheme in Turkish. The different -DI morphemes are syntactically different (as are others, such as -MIŞ), and Cinque (2002) shows that the different -DI, -MIŞ, etc. behave differently and that there is a systematic semantic correspondence. I feel stuck here because of two reasons:
I am also interested in other languages that show similar behavior. I lack the typological information regarding the issue. Apart from Turkic languages, do you know of any other languages that have a similar phenomenon? References: Edit: link of the article
@utkuturk, @jonorthwash, @ftyers, Erzya and Moksha share the issue of morphological versus lexical copula, where the lexical variant may be associated with the facilitation of emphasis. This is perhaps analogous to the distinction in my monolect of English, where "I'll" expresses desire/intent and "I WILL" expresses emphatic future factuality, versus "I will", which distinguishes me from others who may do something. The distribution is not entirely complementary, i.e. the expression of future is covered by the lexical copula улемс in non-past in both languages; this is the only explicitly future-oriented verb in the two languages. In the present, morphological marking is attested on the root, be that a noun, adjective, spatial adverbial or non-finite. In the past, however, both the lexical verbs ульнемс (Erzya) and улемс (Moksha) and the etymological -Оль- morphology come into play. Since there are no instances where features can be confused, is it mandatory that we split the etymological copula from the root? The descriptions in question are 3.1 vs 3.2 and 4.1 vs 4.2 for predicational copula, but they can be applied to equative, identificational, locative, existential, possessive and quantitative roots as well.
In Erzya, nearly half of the instances of the copula "formative" now annotated as such are arguably misannotations. The issue is that I have zealously marked the third person singular present, which has a ZERO morpheme. Now, the native participant speakers/linguists have raised the question of why I have split their monolithic words on etymological grounds. Since the features can all be expressed without confusion in column 6, this should not cause a problem. What would be your suggestions:
@rueter I agree with the previous comment by @jonorthwash: I would not go for an etymological split, since most suffixes can be traced back to free etymological relatives, for example the Turkish person markers. And since they do not create confusion, I would prefer 3.1 and 4.1 to 3.2 and 4.2. As for the Turkish case (reporting from my monolect and experience), the Turkish lexical copula/auxiliary does not create a pragmatic or semantic difference. However, it may tell us something about the generation of the speaker or the register of the text. For our purposes, we decomposed them as a copula (when attached to a noun) or as an aux (when attached to a verb). Our only criterion was whether or not the variation is detectable by an everyday user rather than a linguist. This eliminates any issue with the etymological discussion. And it would help to have data where it is specified that
Hello,
While annotating another treebank, I had an extremely hard time deciding on the level of representation of the Turkish i (be) verb. Should morphology handle it, or syntax? I believe that neither in IMST nor in PUD is there "one" way to deal with the verb i in places like "-ken" meaning when or where, "-se" meaning indicative conditional, and "-miş", "-dı" and other TAM-2 markers, which are clitics (see Nakipoğlu & Yumrutaş, 2009). Their stress pattern is different, their semantic information is different, and being TAM-1 or TAM-2 gives them different aspect and tense readings.
I could not decide whether I should split all of these clitics off as another node and treat them the way English treats the "have" or "be" auxiliaries, which would end up forsaking minimality, or treat all of them as a morphological part of the word, which would end up forsaking the representation of the different levels of contribution of the different -mış, -se, -dı, and others. I would love to read about what has been done in other languages, or argued for other languages' morphological aspect systems.
I would love to have a unified analysis of all i verbs, but I would also appreciate a way to systematically distinguish a group of these clitics and mark only those as a different syntactic node. Transparency of the morpheme and its "detachability" may be one clue, but that is not very systematic and it is not "fool-proof".