Annotating Turkish "be" verb: i- #639

Closed
utkuturk opened this issue Jul 8, 2019 · 8 comments


@utkuturk
Member

utkuturk commented Jul 8, 2019

Hello,

While annotating another treebank, I had an extremely hard time deciding on the level of representation of the Turkish verb i- (be). Should morphology handle it, or syntax? I believe that neither IMST nor PUD has a single way of dealing with i- in places like "-ken" (meaning when or where), "-se" (the indicative conditional), and "-miş", "-dı" and other TAM-2 markers, which are clitics (see Nakipoğlu & Yumrutaş, 2009). Their stress patterns are different, their semantic contribution is different, and TAM-1 and TAM-2 markers have different aspect and tense readings.

I could not decide whether I should split all of these clitics off as separate nodes and treat them the way English treats its "have" or "be" auxiliaries, which would mean forsaking minimality, or treat all of them as a morphological part of the word, which would mean forsaking the representation of the different levels of contribution among the different -mış, -se, -dı, and others. I would love to read about what has been done for other languages, or argued for other languages' morphological aspect systems.

I would love to have a unified analysis of all i- verbs, but I would also appreciate a way to systematically distinguish a subset of these clitics and mark only those as separate syntactic nodes. Transparency of the morpheme and its "detachability" may be one clue, but that criterion is neither fully systematic nor "fool-proof".

@dan-zeman
Member

Since this is a guidelines question pertaining to the whole language (or even language family), rather than just a bug report in a particular treebank, I am going to move the issue to the general issue tracker of the docs repository.

@dan-zeman dan-zeman transferred this issue from UniversalDependencies/UD_Turkish-IMST Jul 9, 2019
@dan-zeman dan-zeman added this to the v2.4 milestone Jul 9, 2019
@ftyers
Contributor

ftyers commented Jul 9, 2019

My opinion here is that these are clitics and should be treated as such, even at the expense of "minimality". I would ask @jonorthwash, @MemduhG and @coltekin to weigh in too.

@coltekin
Member

coltekin commented Jul 9, 2019

As far as I know, all Turkish (Turkic?) treebanks are, so far, consistent about not segmenting the copular suffix -i after a verb. If there are differences among the Turkish treebanks, they must be errors rather than deliberate decisions.

Even though these are historically separate words, and one can still observe the non-suffix version (in marked usage), I think it is better not to split them, for two reasons. First, these suffixes interact with the other TAME suffixes on the verb: a proper TAME assignment to the verb is only possible if one considers the combination of both suffixes. Second, segmenting introduces a (non-trivial) step in processing, which we should avoid if we can, both for the sake of human annotators and automatic tools.

On a related note, the -i suffixes attached to nouns are segmented (relatively consistently) in all Turkish treebanks. Whether this is better (for some reason) or even necessary is another question. My initial position was in favor of splitting these morphemes, based on the fact that in certain constructions they may result in conflicting values for morphological features and require duplicate subject relations (particularly if the copula follows a nominalized verb). However, the way we currently annotate these cases does not really require splitting the copula off. Furthermore, I am not very happy with splitting the copular suffixes when there is an overt copular suffix, but not doing so when it is null (Person=3,Number=Sing). I think we still need more discussion on these constructions.

@jonorthwash
Contributor

> All Turkish (Turkic?) treebanks are, so far, consistent about not segmenting the copular suffix -i after a verb
>
> the -i suffixes attached to nouns are segmented (relatively consistently) on all Turkish treebanks

This current approach makes sense to me for the reasons you outlined. But I wonder what to do for other Turkic languages where tenses with copulas sometimes are mandatorily separated by a space, e.g. Kazakh:

| kaz | tur equiv | notes |
| --- | --- | --- |
| ішкенмін | içmişim | kaz is like tur: historical copula is always bound |
| ішкен едім | içmiştim (içmiş idim) | kaz is not like tur: historical copula is always separate |

This is an issue we brought up in our paper (see §4.4), and I'm still not sure what the right answer is.

> I am not very happy with splitting the copular suffixes when there is an overt copular suffix, but not doing so when it is null (Person=3,Number=Sing). I think we still need more discussion on these constructions.

I could go either way on this. The UD guidelines do specify, though, that empty lemmas should be avoided (and to a lesser extent sublemmas)?

@jonorthwash
Contributor

By the way, the issue of how to analyse forms of verbs like ішкен or içmiş (when followed by a copula) is a topic brought up in a paper @ftyers and I are publishing in a proceedings shortly. We don't solve the problem, but we entertain a few possibilities:

  1. Purely etymological approach: verbal noun or adjective with a copular verb
    Doesn't work well (at least for UD) for the reasons @coltekin outlined above, and also because we then end up declaring most finite verb forms like this. Even -DI is thought by some to have arisen like this.
  2. Grammaticalised auxiliary construction: infinitive followed by a copular verb as an auxiliary (as in fra "je suis allé" or deu "ich bin gegangen")
    Doesn't work well because these verbal noun/adjective suffixes (-GAн, -mIş) would then be infinitives too, but restricted to use with the copula-derived auxiliary, and the copula-derived auxiliary would be restricted to use with verb forms with VN- and VAdj-derived infinitive suffixes.
  3. (Mostly) grammaticalised tense construction: the whole sequence is a single tense form.
    Doesn't work well because, depending on the language, you can conjoin pieces of them (suspended affixation), maybe sometimes have things come between them (da/de in Turkish??)

@utkuturk
Member Author

utkuturk commented Jul 16, 2019

It is something of a relief that nobody has a correct or unified answer. There is one thing that I have not been able to wrap my head around: do we have to treat the use of the -i verb with nominals separately from its use with verbs? I believe a unified approach to all of the -i forms would benefit NLP more. I also believe one could split the -i verbs into those that can be used as a separate word and those that cannot. Even though both groups behave exactly the same prosodically and morphologically (stress assignment, vowel harmony blocking, derivation blocking), I believe there is no good use in separating them syntactically if the parser never encounters them as a separate word.

I agree with the reasons @coltekin gave. However, I also believe there is no single -DI morpheme in Turkish. The different -DI morphemes are syntactically different (as are others, such as -MIŞ), and Cinque (2001) shows that the different -DI, -MIŞ, etc. behave differently and that there is a systematic semantic correspondence.

I feel stuck here for two reasons:

  1. I do not want to overwhelm the treebank with empty lemmas.
  2. I do want to encode the difference, so that the properties and behavior of the structure are predictable from the treebank itself, without any additional input to differentiate stress, intonation, and meaning differences, which vary widely, as shown in Cinque (2001).

I am also interested in other languages that show similar behavior. I lack the typological information on this issue. Apart from the Turkic languages, do you know of any other languages that have a similar phenomenon?

References:
Cinque, G. (2001). A note on mood, modality, tense and aspect affixes in Turkish. The verb in Turkish, 44, 47-59.

Edit: link of the article

@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 5, 2019
@dan-zeman dan-zeman modified the milestones: v2.5, v2.6 Nov 9, 2019
@rueter
Contributor

rueter commented Apr 20, 2020

@utkuturk, @jonorthwash, @ftyers, Erzya and Moksha share the issue of morphological versus lexical copula. The lexical variant may be associated with the facilitation of emphasis, perhaps analogous to the distinction in my monolect of English, where I'll expresses desire/intent, and i WILL expresses emphatic future factual vs. I will, which distinguishes me from others who may do something.

The distribution is not entirely complementary, i.e. the expression of future is covered by the lexical copula улемс in non-past, in both languages -- this is the only explicitly future oriented verb in the two languages. In the present, morphological marking is attested on the root, be that a noun, adjective, spatial adverbial or non-finite. In the past, however, both the lexical verbs ульнемс (Erzya) and улемс (Moksha) and the etymological -Оль- morphology come into play.

Since there are no instances where features can be confused, is it mandatory that we split the etymological copula from the root? The descriptions in question are 3.1 vs 3.2 and 4.1 vs 4.2 for the predicational copula, but they can be applied to equative, identificational, locative, existential, possessive, and quantitative roots as well.

# sent_id = subj-root-cop:lex.1
# text = кудось од ульнесь
# text_en = the house was new...
# comment = this will most likely be followed by development of being a new house.
1	кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	2	nsubj:cop	_	_
2	од	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing	0	root	_	_
3	ульнесь	ульнемс	AUX	_	Mood=Ind|Number[subj]=Sing|Person[subj]=3|Tense=Past|Valency=1	2	cop	_	_

# sent_id = subj-cop:lex-root.2
# text = Кудось ульнесь од
# text_en = the house was new...
# comment = this will most likely be followed by something...
1	Кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	3	nsubj:cop	_	_
2	ульнесь	ульнемс	AUX	_	Mood=Ind|Number[subj]=Sing|Person[subj]=3|Tense=Past|Valency=1	3	cop	_	_
3	од	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing	0	root	_	_

# sent_id = root-subj.3.1
# text = Одоль кудось
# text_en = and/but the house was new.
# comment = this will most likely be a statement in contrast to what has been previously stated NEW
1	Одоль	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing|Number[subj]=Sing|Person[subj]=3|Tense=Past	0	root	_	_
2	кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	1	nsubj:cop	_	_

# sent_id = subj-root.4.1
# text = Кудось одоль
# text_en = and/but the house was new.
# comment = not sure how to distinguish this one
1	Кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	2	nsubj:cop	_	_
2	одоль	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing|Number[subj]=Sing|Person[subj]=3|Tense=Past	0	root	_	_

# sent_id = root-subj.3.2
# text = Одоль кудось
# text_en = and/but the house was new.
# comment = this will most likely be a statement in contrast to what has been previously stated NEW
1-2	Одоль	_	_	_	_	_	_	_	_
1	Од	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing	0	root	_	_
2	оль	оль	AUX	_	Number[subj]=Sing|Person[subj]=3|Tense=Past	1	cop	_	_
3	кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	1	nsubj:cop	_	_

# sent_id = subj-root.4.2
# text = Кудось одоль
# text_en = and/but the house was new.
# comment = not sure how to distinguish this one
1	Кудось	кудо	NOUN	_	Case=Nom|Definite=Def|Number=Sing	2	nsubj:cop	_	_
2-3	одоль	_	_	_	_	_	_	_	_
2	од	од	ADJ	_	Case=Nom|Definite=Ind|Number=Sing	0	root	_	_
3	оль	оль	AUX	_	Number[subj]=Sing|Person[subj]=3|Tense=Past	2	cop	_	_

In Erzya, nearly half of the instances of the copula "formative" now annotated as such are probably misannotations. The issue is that I have zealously marked third person singular present, which has a ZERO morpheme. Now the native-speaker participants/linguists have raised the question of why I have split their monolithic words on etymological pretenses. Since the features can all be expressed without confusion in column 6, leaving the words unsplit should not cause a problem.

What would your suggestions be?
(1) prefer 3.1 and 4.1 to 3.2 and 4.2?
(2) if the preference is for an etymological split, however, should the third person singular and plural present also be split off, given that the singular form is ZERO and the plural is equivalent to the indefinite nominative plural marker?
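
To make the 3.1 vs 3.2 contrast concrete, here is a minimal Python sketch of how a split analysis could be folded back into an unsplit one by moving the copula's features into column 6 of the host word. The helper names (`parse_feats`, `format_feats`, `merge_bound_copula`) are hypothetical and not part of any UD tool, and the sketch assumes that every non-initial word inside a multiword token is an AUX attached as cop to the first word. It raises an error when host and copula carry conflicting values for the same feature, which is the one case where merging would lose information.

```python
from typing import Dict, List

# The split (3.2) and unsplit (3.1) analyses of "Одоль кудось", copied from above.
SPLIT = (
    "1-2\tОдоль\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tОд\tод\tADJ\t_\tCase=Nom|Definite=Ind|Number=Sing\t0\troot\t_\t_\n"
    "2\tоль\tоль\tAUX\t_\tNumber[subj]=Sing|Person[subj]=3|Tense=Past\t1\tcop\t_\t_\n"
    "3\tкудось\tкудо\tNOUN\t_\tCase=Nom|Definite=Def|Number=Sing\t1\tnsubj:cop\t_\t_\n"
)
UNSPLIT = (
    "1\tОдоль\tод\tADJ\t_\tCase=Nom|Definite=Ind|Number=Sing|"
    "Number[subj]=Sing|Person[subj]=3|Tense=Past\t0\troot\t_\t_\n"
    "2\tкудось\tкудо\tNOUN\t_\tCase=Nom|Definite=Def|Number=Sing\t1\tnsubj:cop\t_\t_\n"
)


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def format_feats(feats: Dict[str, str]) -> str:
    """Serialise features sorted case-insensitively by name."""
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


def merge_bound_copula(conllu: str) -> List[List[str]]:
    """Fold copulas inside multiword tokens into their host word (assumption:
    the host is the first word of the range and the rest are AUX/cop)."""
    rows = [ln.split("\t") for ln in conllu.splitlines()
            if ln and not ln.startswith("#")]
    words = {int(r[0]): r for r in rows if r[0].isdigit()}
    absorbed: Dict[int, int] = {}  # copula id -> host id
    for r in rows:
        if "-" not in r[0]:
            continue
        start, end = (int(x) for x in r[0].split("-"))
        host = words[start]
        for i in range(start + 1, end + 1):
            cop = words[i]
            if cop[3] != "AUX" or cop[7] != "cop" or int(cop[6]) != start:
                raise ValueError(f"word {i} is not a copula on word {start}")
            host_feats, cop_feats = parse_feats(host[5]), parse_feats(cop[5])
            clash = [k for k in cop_feats
                     if k in host_feats and host_feats[k] != cop_feats[k]]
            if clash:
                raise ValueError(f"conflicting features: {clash}")
            host_feats.update(cop_feats)
            host[5] = format_feats(host_feats)
            absorbed[i] = start
        host[1] = r[1]  # the surface form of the whole token
    kept = [i for i in sorted(words) if i not in absorbed]
    new_id = {old: new for new, old in enumerate(kept, start=1)}
    new_id[0] = 0
    for cop_id, host_id in absorbed.items():
        new_id[cop_id] = new_id[host_id]  # dependents of the copula move to the host
    merged = []
    for i in kept:
        w = list(words[i])
        w[0], w[6] = str(new_id[i]), str(new_id[int(w[6])])
        merged.append(w)
    return merged


if __name__ == "__main__":
    for word in merge_bound_copula(SPLIT):
        print("\t".join(word))
```

Run on the 3.2 example, the sketch reproduces the two word lines of 3.1. It ignores comment lines, the DEPS and MISC columns, and any language-specific decisions, so it is only meant to illustrate that the two representations carry the same information when no features clash.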

@dan-zeman dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020
@utkuturk
Member Author

@rueter I agree with the previous comment by @jonorthwash: I would not go for an etymological split, since most suffixes can be traced back to free etymological relatives, for example the Turkish person markers. And since the features do not create confusion, I would prefer 3.1 and 4.1 to 3.2 and 4.2.

As for the Turkish case (reporting from my monolect and experiences), the Turkish lexical copula/auxiliary does not create a pragmatic or semantic difference. However, it may say something about the generation of the speaker or the register of the text.

For our purposes, we decomposed them as copula (when attached to a noun) or as aux (when attached to a verb). Our only criterion was whether or not the variation is detectable by an everyday user rather than a linguist. This eliminates any issue with the etymological discussion. It would also help to have data in which it is specified that verb=aux and verb aux (or noun=copula and noun copula) are actually the same.

@dan-zeman dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@dan-zeman dan-zeman modified the milestones: v2.11, v2.13 May 31, 2023