Definition of word #377

Open
spyysalo opened this Issue Dec 15, 2016 · 46 comments

10 participants

Member

spyysalo commented Dec 15, 2016

Recent discussions have suggested that the UD documentation could benefit from a more detailed definition of "word". We can use this issue to discuss the existing definition and possible improvements.

( @jnivre @dan-zeman @ftyers , others? )

Member

spyysalo commented Dec 15, 2016

Currently, http://universaldependencies.org/u/overview/tokenization.html reads in part

The UD annotation is based on a lexicalist view of syntax, which means that dependency relations hold between words [...] there is no attempt at segmenting words into morphemes. [...] the basic units of annotation are syntactic words (not phonological or orthographic words)

Member

spyysalo commented Dec 15, 2016

Joakim's Coling keynote included the following related heuristic:

What is a word?

  • Single part-of-speech tag
  • Real syntactic relation

Member

spyysalo commented Dec 15, 2016

Greg Pringle has written at length on the topic in the context of UD Japanese at http://www.cjvlang.com/Spicks/udjapanese.html . One specifically relevant part:

Linguists later fine-tuned the definition of words with further distributional criteria:

'Positional mobility (syntagmatic mobility)': the word is free to be used at different places in the sentence. For example, John will go can be transformed into Will John go?, indicating that will, John, and go are three separate words.

'Internal stability (internal immutability)': In contrast with the positional mobility that the word enjoys, morphemes within a word are fixed in order. For example, played is a stable unit that does not permit rearrangement as ed-play.

'Uninterruptability': it is not possible to insert anything between the morphemes of a word. For instance, it is not possible to insert anything between play and -ed (e.g., play-be-ed).

Contributor

jnivre commented Dec 15, 2016

This is a great initiative. However, Pringle's characterisation does not quite work for syntactic words, because it seems to exclude clitics. This paper is very relevant as well:
http://coltekin.net/cagri/papers/coltekin2016turcling.pdf

Member

spyysalo commented Dec 15, 2016

Thanks! @coltekin proposes the following criteria

The term inflectional group (IG) in Turkish natural language processing literature refers to a sub-word unit. [...] the unit has been a de facto standard for representing words in Turkish NLP. [...] The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true.
a. Parts of the word may have potentially conflicting inflectional features.
b. Parts of the word may participate in different syntactic relations.

Contributor

jnivre commented Dec 15, 2016

Clause a. only works for languages and words that inflect, but it can be generalized to postags, I think. Fusions are a case in point. Something like French "au" does not have conflicting inflectional features, because the preposition "à" does not inflect, but it has conflicting postags, ADP vs. DET.
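To make the combined test concrete, here is a minimal sketch of Çöltekin's two clauses plus the part-of-speech generalization suggested above. The `Part` class, its `head` field, and the function names are hypothetical and only serve the illustration; the function checks a candidate segmentation that someone has already proposed and does not decide how to segment.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Part:
    """One candidate sub-unit of a surface word (hypothetical structure)."""
    form: str
    upos: str
    feats: dict = field(default_factory=dict)
    head: Optional[str] = None  # assumed: form of the word this part depends on


def conflicting_feats(a: dict, b: dict) -> bool:
    # Two feature sets conflict if they share a feature with different values.
    return any(k in b and b[k] != v for k, v in a.items())


def should_split(parts: list) -> bool:
    """Return True if any pair of parts meets one of the splitting criteria."""
    for i, a in enumerate(parts):
        for b in parts[i + 1:]:
            if conflicting_feats(a.feats, b.feats):   # clause a
                return True
            if a.upos != b.upos:                      # clause a, generalized to POS tags
                return True
            if a.head is not None and b.head is not None and a.head != b.head:
                return True                           # clause b
    return False


# French "au": the preposition does not inflect, but the POS tags differ.
au = [Part("à", "ADP"), Part("le", "DET", {"Gender": "Masc", "Number": "Sing"})]
print(should_split(au))
```

Note that the POS clause is only as good as the candidate analysis fed into it: given join/VERB + ed/AUX it would also say "split", so it cannot replace the arguments about whether such an analysis is reasonable in the first place.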

dan-zeman added this to the lg-specific v2 milestone Dec 16, 2016

Member

dan-zeman commented Dec 16, 2016

Wrt clitics, I was wondering how we would explain the difference between clitics attached to a verb (Spanish dámelo, vámonos) and morphemes that encode agreement of the verb with one or more arguments (and which could be interpreted as hidden pronouns when the respective argument does not appear as an overt nominal). Spanish vamos itself could be said to encode nosotros vamos, and head-marking languages like Basque may even encode agreement with a second and third argument, but we are quite clear about UD not analyzing these as separate words ("do not annotate things that are not there"). While I agree that we should not annotate dropped subjects (or arguments in general), I would love to hear a sound rule that says why vámonos is different.

One clear difference I can think of right now is exactly the possibility (in the agreement version) that the argument also appears as a separate word. However, I am not sure that a guideline based on this would work in languages that have clitic doubling.

Contributor

jnivre commented Dec 16, 2016

I agree that the distinction between pronominal clitics and inflectional endings that mark agreement is a bit fuzzy and that the former could develop into the latter historically. For the Spanish case, I would appeal to language-internal arguments about systematicity. Subject agreement is present in all verbs in Spanish, even with a separate overt subject, but "object agreement" is not. Therefore, when a morpheme referring to the object is glued to the verb, we avoid exceptions if we separate it from its host. How does that sound?
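For illustration, separating the object clitics from their host could be written out in CoNLL-U as a multiword token. The sketch below is only that: `mwt_lines` is a hypothetical helper, the split of dámelo into da + me + lo and the tags are assumptions for the example, and LEMMA, FEATS, HEAD and DEPREL are left as `_` rather than guessed.

```python
def mwt_lines(surface, words, start_id=1):
    """Build CoNLL-U lines for one multiword token.

    surface: the orthographic token, e.g. "dámelo"
    words:   list of (form, upos) pairs for the syntactic words
    """
    if len(words) < 2:
        raise ValueError("a multiword token needs at least two words")
    end_id = start_id + len(words) - 1
    # Range line: ID and FORM are filled, the other eight columns stay "_".
    lines = ["\t".join([f"{start_id}-{end_id}", surface] + ["_"] * 8)]
    for offset, (form, upos) in enumerate(words):
        # Word line: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
        cols = [str(start_id + offset), form, "_", upos, "_", "_", "_", "_", "_", "_"]
        lines.append("\t".join(cols))
    return lines


# Assumed segmentation for the example: da + me + lo.
for line in mwt_lines("dámelo", [("da", "VERB"), ("me", "PRON"), ("lo", "PRON")]):
    print(line)
```

The subject agreement marker stays inside the verb form, while the object pronouns get their own word lines, which is the asymmetry the systematicity argument is meant to justify.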

Member

dan-zeman commented Dec 16, 2016

On a more general note: I think that the default practice in UD is that we take orthographic words (in languages where they exist) as units. Only when we see good reasons to treat certain cases as contractions of multiple syntactic words, we split them; but if they work reasonably well as one word, we keep them together even though a parallel expression in a related language is rendered as multiple words. And there is some flexibility to these decisions, as always.

Conversely, when we have a good reason to say that one syntactic word spans several orthographic words, we connect them using the fixed relation, except for the very limited cases where we are allowed to include a space in a word form.
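As a small sketch of the converse case: when several orthographic words are treated as one fixed expression, the later words attach to the first one with the fixed relation. The helper name `fixed_arcs` and the word IDs are made up for the example.

```python
def fixed_arcs(first_id, length):
    """Return (dependent_id, head_id, deprel) triples for a fixed expression.

    All words after the first attach to the first word with "fixed".
    """
    return [(first_id + k, first_id, "fixed") for k in range(1, length)]


# A three-word fixed expression starting at word 3 (IDs are illustrative).
print(fixed_arcs(3, 3))
```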

Member

dan-zeman commented Dec 16, 2016

@jnivre : sounds good to me, thanks. If this discussion results in a set of guidelines for UD-wordness, we should include the Spanish example with this explanation.

Member

spyysalo commented Dec 16, 2016

Stupid question: what's the rule against analyzing e.g. joined as join/VERB ed/AUX? Would it work the same if English didn't use space?

Member

dan-zeman commented Dec 16, 2016

@spyysalo : irregular verbs? If the English writing system did not use space, it would still not be clear where the auxiliary begins in threw.

Member

spyysalo commented Dec 16, 2016

Nice :-) How might that look as a general rule? Are there other arguments that could apply?

Contributor

manning commented Dec 16, 2016

There is a lot of literature in linguistics on distinguishing clitics from morphology. Among others, Arnold Zwicky has written a lot on this issue. Of course, as @jnivre notes, the results are not always crystal clear due to grammaticalization. Some have argued that Romance "clitics" should really be treated as morphology. So I suspect that @dan-zeman's "On a more general note" is really part of the answer for us. But if you'd like to read more about how words, clitics, and morphology have been argued to be distinguishable, here's a little reading list:

Member

dan-zeman commented Dec 16, 2016

Awesome, thanks!

Member

spyysalo commented Dec 16, 2016

@manning: thanks, that's great! I'll dig in after recovering from Coling :-)

The reason I'm asking this particular question now is the proposed UD Japanese analysis of e.g. 食べた tabeta “ate” as 食べ/VERB た/AUX (with aux dependency: http://universaldependencies.org/ja/overview/syntax.html#special-clausal-dependents). 食べた strikes me as equally obviously a single-word past form verb as the English joined, and I'd like to know what arguments in the UD framework would support keeping this as one word.

Contributor

amir-zeldes commented Dec 16, 2016

@spyysalo I think what you're looking for is what this paper calls "MUW" (Middle Unit Word):

http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf

Member

spyysalo commented Dec 17, 2016

@amir-zeldes: this work is certainly highly relevant (and discussed at length in http://www.cjvlang.com/Spicks/udjapanese.html, referenced above), but unfortunately even the longest "LUW" (Long Unit Word) in the paper splits 食べた into 食べ/VERB and た/AUX (see Figure 1).

Contributor

amir-zeldes commented Dec 17, 2016

You're right, sorry for the confusion! I agree, it's odd for this to be a token while other agglutinative morphemes (e.g. potential) are not.

Contributor

kanayamah commented Jan 4, 2017

@spyysalo, one strong reason to split 食べた (ate) into 食べ/VERB and た/AUX is that other elements, such as a polite marker, can be inserted between these two words (食べました - 食べ/VERB まし/AUX た/AUX). Considering the English perfect and the French passé composé, I think it is quite natural to represent a past form with two words.
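The form/TAG notation used in this thread is easy to work with mechanically. As a rough sketch (the function names are made up, and joining forms without spaces is an assumption that fits Japanese but not space-delimited languages), the two segmentations can be parsed and checked against their surface strings like this:

```python
def parse_tagged(analysis):
    """Parse 'form/UPOS form/UPOS ...' into a list of (form, upos) pairs."""
    words = []
    for item in analysis.split():
        # rpartition splits on the last "/", so a form containing "/" survives.
        form, _, upos = item.rpartition("/")
        words.append((form, upos))
    return words


def surface(words):
    """Concatenate the forms with no separator (assumed for Japanese)."""
    return "".join(form for form, _ in words)


plain = parse_tagged("食べ/VERB た/AUX")
polite = parse_tagged("食べ/VERB まし/AUX た/AUX")
print(surface(plain), surface(polite))
```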

Member

dan-zeman commented Jan 4, 2017

@kanayamah, considering the simple past in English as well as in a good many other languages, it is also quite natural to represent a past form with one word :-) Is the range of "words" that you can insert before た wider than just the politeness morpheme? Like in English "I have just eaten", "I have not eaten", "I have already eaten" (and the inserted thing is always indisputably an independent word).

Multiple affixes can attach to a stem in morphologically rich languages, but that does not necessarily warrant wordness of these affixes. For instance, the Czech adjective zvláštn-í "strange" ends with the morpheme í, which contributes the Case and Number features, and you can insert the comparative morpheme ějš before that (zvláštn-ějš-í "stranger"), but it does not imply that either ějš or í is an independent word.

Contributor

amir-zeldes commented Jan 4, 2017

@kanayamah I think one of the reasons some people are uncomfortable with this but have fewer problems with French is that the French auxiliary can stand by itself: it is white-space separated, it looks identical to either the verb 'be' or 'have' as used outside of the past tense construction, and it can be separated from the lexical verb by adverbs: 'il est déjà arrivé'. The morpheme た behaves more like an inflectional marker: its position is completely predictable and it cannot be used independently.

Member

spyysalo commented Jan 12, 2017

I agree with the points raised by @dan-zeman and @amir-zeldes. Specifically regarding the argument given by @kanayamah, I don't think the occurrence of ます between 食べ and た in 食べました is a reason to analyze た as an independent (syntactic) word rather than as a morpheme, as we can consider either 食べました as a word or た as an inflectional morpheme on ます.

Member

spyysalo commented Jan 12, 2017

@kanayamah : to clarify my concern a bit: the currently proposed UD Japanese representation implies the radical claim that Japanese has (effectively) no inflectional morphology. To the best of my knowledge, this is a departure from both traditional and previous theoretical linguistic analyses of Japanese verb behavior.

Some of the published UD Japanese material suggests that you agree that the proposed (SUW) analysis segments into morphemes rather than syntactic words, e.g. http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf:

Short Unit Word (SUW): SUW is a minimal language unit that has a morphological function.

This kind of construction [...] is considered to be morphological (a word formation), rather than a syntactic relation. [regarding the derivation of e.g. かわいさ and 子どもっぽい]

However, the UD representation does not segment words into morphemes (see http://universaldependencies.org/u/overview/tokenization.html).

Are you using UD to represent the morphological structure of Japanese verbs (etc.), or does UD Japanese claim that Japanese verbs have no morphological structure?

Contributor

miyao-yusuke commented Jan 12, 2017

@dan-zeman @spyysalo Many other expressions (aspect, modality, passive, causative, benefactive, etc.) can appear between 食べ and た. Their composition is very systematic (no exceptions like "threw"), and some expressions can appear multiple times (e.g. double negation) and sometimes can be coordinated. I don't see any reason not to segment them.

@spyysalo Modern syntactic theories consider these expressions as independent tokens. For example, the following book presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens and given category S\S.

Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee [Formal theory of Japanese grammar: The system of conjugation, syntactic structure, and semantic composition]. Japanese Frontier Series 24. Kurosio Publishers.

Anyway, I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

Member

spyysalo commented Jan 12, 2017

@miyao-yusuke : thanks for the quick response, your input is much appreciated! Some comments:

> I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

Very much agreed! In the absence of more qualified volunteers, I've been trying to draft a rough proposal on this, but the more I read on attempts to define "word" cross-linguistically (e.g. Haspelmath 2011), the less confident I am...

> no exceptions like "threw"

Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

> Modern syntactic theories consider these expressions as independent tokens. For example, [Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee] presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens

Thank you for the pointer! I could unfortunately not find a copy of 日本語文法の形式理論 (I take it) online, but will try to follow up on this. A couple of clarifying questions:

  • Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?

  • Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

Also, regarding the above statement from Tanaka et al. 2016, do you agree with

> For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.

i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

Contributor

miyao-yusuke commented Jan 13, 2017

> Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

The popular analysis is that する inflects to さ, し, せ, する, すれ, etc., and function words/morphemes attach to one of these forms; e.g. し + ない, ている, た, よう, etc. くる and いく are similar.
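To show what this analysis amounts to, here is a minimal sketch covering only the four combinations listed above, all of which attach to し. The table and function names are hypothetical, and the other forms of する (さ, せ, すれ) are deliberately left out because the comment does not say what attaches to them.

```python
# Attaching element -> form of する it attaches to (only the cases listed above).
ATTACHES_TO = {"ない": "し", "ている": "し", "た": "し", "よう": "し"}


def compose(suffix):
    """Build the surface string for する plus one attaching element."""
    return ATTACHES_TO[suffix] + suffix


def segment(surface_form):
    """Split a surface string back into (inflected form, attaching element)."""
    for suffix, stem in ATTACHES_TO.items():
        if surface_form == stem + suffix:
            return stem, suffix
    return None  # not covered by this small table


print(compose("た"), segment("しない"))
```

Under this view した is not an irregular unit but し followed by た, so the irregularity sits in the inflection of する and not in how た attaches.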

> Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?

They avoid defining "word". The important message here is that the distinction between word and suffix/morpheme is neither very clear nor very meaningful in Japanese, and treating everything as a "syntactic unit" can explain various complications and interactions with syntax.

> Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

We are not following this specific theory; we follow the definition of SUW. However, I think we share the fundamental idea. Function words/morphemes of Japanese cannot be clearly classified as either word or suffix/morpheme. Some tend to behave more like words, while others are closer to suffixes, but most are somewhere in between. As far as I know, linguistics has not reached a clear conclusion on this so far.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one. We simply selected one of them (mainly due to practical reasons).

For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.
i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

These specific examples might look like derivational morphemes (actually we had some discussions in the Japanese UD team about these constructions). However, they also have many characteristics of "wordness" (actually, many of the "wordness" criteria of Zwicky, Haspelmath, etc. can apply to them).

@jnivre

jnivre Jan 13, 2017

Contributor

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. I still think we can use the paper by Cagri on principles for Turkish as a starting point, but it obviously needs to be refined and extended to cover typologically different languages. My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes, but this does obviously not carry a lot of weight.

@spyysalo

spyysalo Jan 13, 2017

Member

@miyao-yusuke :

The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.

I appreciate that, and I understand the merits of segmenting down to morphemes and representing morphological and syntactic structures in a unified fashion.

However, UD does not (currently :-)) address word formation, and I'm concerned that not defining a word unit above the morpheme level will have a negative impact on the value of UD Japanese annotations for cross-linguistic use cases.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one.

I'm very happy to hear that, and I hope we can establish a good argument in support of a choice soon!

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. [...] My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes

+1 on both!

@rtsarfaty

rtsarfaty Jan 13, 2017

Contributor

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups.

Great idea! Happy to join the working group and represent the Semitic angle.

@mojgan-seraji

mojgan-seraji Jan 13, 2017

Contributor

I would also be happy to join the working group and represent the Indo-European languages written in Perso-Arabic script. Word segmentation is a major challenge in languages like Persian, where the same "word" can be written in several different styles.

@spyysalo

spyysalo Jan 13, 2017

Member

I'd be happy to represent the Finnic language family :-)

@jnivre

jnivre Jan 20, 2017

Contributor

Following up on the discussion about UD Japanese, it seems clear that the segmentation of derivational morphology is not consistent with general UD principles, nor with practice in other UD treebanks, so I definitely think that should be changed. Similarly, when it comes to verbal morphology, all the descriptions so far point to this being a regular and systematic case of agglutination, similar to the situation in Turkish and Finnish, so it seems that an analysis using morphological features is most compelling there as well.
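
To make the feature-based alternative concrete, here is a minimal Python sketch of folding a segmented past marker into the verb and recording it as a morphological feature instead. The `Token` class, the `SUFFIX_FEATS` table and the AUX tag on た are assumptions made for illustration only; they are not taken from the UD Japanese treebank or its conversion tools.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str
    upos: str
    feats: dict = field(default_factory=dict)


# Hypothetical mapping from a segmented suffix to a UD feature (illustration only).
SUFFIX_FEATS = {"た": ("Tense", "Past")}


def merge_suffixes(tokens):
    """Fold suffix tokens listed in SUFFIX_FEATS into the preceding token."""
    merged = []
    for tok in tokens:
        if merged and tok.form in SUFFIX_FEATS:
            name, value = SUFFIX_FEATS[tok.form]
            prev = merged[-1]
            # The suffix disappears as a token and survives as a feature.
            merged[-1] = Token(prev.form + tok.form, prev.upos,
                               {**prev.feats, name: value})
        else:
            merged.append(tok)
    return merged


# Segmented analysis: verb stem plus a separate past marker (tags assumed).
segmented = [Token("食べ", "VERB"), Token("た", "AUX")]
word_level = merge_suffixes(segmented)

for tok in word_level:
    feats = "|".join(f"{k}={v}" for k, v in tok.feats.items()) or "_"
    print(tok.form, tok.upos, feats)
```

A form-only lookup like this is deliberately naive; a real conversion would need to look at context and handle chains of suffixes.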

@spyysalo

spyysalo Jan 20, 2017

Member

@jnivre : thank you for addressing this issue!

There is a lot of discussion of inflectional morphology above but less of derivational, so to illustrate, one specific problem with the proposed treatment of derivational morphology is shown in Tanaka et al. 2016 (page 4):

the suffix さ sa changes an adjective into a noun as in (5), and っぽい ppoi changes a noun into an adjective as in (6). [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.

mark(かわい, さ)
かわい ... さ
ADJ PART

mark(子ども, っぽい)
子ども ... っぽい
NOUN PART

As discussed above, UD does not involve morphological segmentation. The proposed representation has the obvious problem that as the derived forms (かわいさ and 子どもっぽい) do not appear as words, it is not possible to capture their derived parts of speech (NOUN and ADJ, resp.) anywhere.

The derivation is directly analogous to e.g. the derivation of "childish" or "childlike" from "child" in English. In UD, derived forms like these are words, annotated with their relevant parts of speech.

If this is not clearly stated in the documentation at the moment, we could perhaps formulate a rule along the following lines: In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.
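
As a rough illustration of that rule, the following Python sketch collapses the two segmented examples above into single words that carry their derived parts of speech. The `DERIVED_POS` table and the `to_word_level` helper are hypothetical and cover only the two suffixes discussed here; this is a sketch under those assumptions, not a proposed conversion procedure.

```python
# Hypothetical table: derivational suffix -> POS of the derived word,
# based only on the two examples from Tanaka et al. 2016 quoted above.
DERIVED_POS = {"さ": "NOUN", "っぽい": "ADJ"}


def to_word_level(segments):
    """Collapse (form, upos) segments into word-level (form, upos) pairs."""
    words = []
    for form, upos in segments:
        if words and form in DERIVED_POS:
            base_form, _ = words[-1]
            # The derived word replaces the base; the base POS is not kept.
            words[-1] = (base_form + form, DERIVED_POS[form])
        else:
            words.append((form, upos))
    return words


print(to_word_level([("かわい", "ADJ"), ("さ", "PART")]))
print(to_word_level([("子ども", "NOUN"), ("っぽい", "PART")]))
```

In the segmented input, neither NOUN for かわいさ nor ADJ for 子どもっぽい appears anywhere; in the collapsed output, each derived form is one word with one tag. Matching on the surface form alone is an oversimplification, since a string like さ can also occur in other functions.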

@miyao-yusuke

miyao-yusuke Jan 20, 2017

Contributor

In UD, forms created by derivational morphological processes should be represented as words tagged with their derived parts of speech. The root word and morphemes involved in the derivation are not separately represented in the UD analysis.

This statement requires a definition of "derivational morphological processes". さ and っぽい differ from -ish and -like because they can be attached to any phrase or any clause.

My intention in the previous discussions is the boundary between morphology and syntax is often unclear, and we need a clear definition.

@spyysalo

spyysalo Jan 20, 2017

Member

My intention in the previous discussions is the boundary between morphology and syntax is often unclear, and we need a clear definition.

Understood. We're hoping to establish cross-linguistically acceptable UD guidelines in a working group on the topic, and it would be great to have someone representing Japanese. Would you mind participating?

@miyao-yusuke

miyao-yusuke Jan 20, 2017

Contributor

happy to join of course.

@jnivre

jnivre Jan 20, 2017

Contributor

The main point is that "changing a noun into an adjective" is not a syntactic relation, so there is no appropriate UD relation to put on an arc between these two elements. This is a clear indication that this should be treated as word formation, not syntax, and it is exactly parallel to the argument against the old "DERIV" links in the Turkish treebank.

@miyao-yusuke

miyao-yusuke Jan 20, 2017

Contributor

I see. Maybe the original explanation in Tanaka et al. 2016 was not very accurate. さ and っぽい do not only change a POS but they add some meaning, in a similar way to auxiliary verbs (they also change a POS in some cases).

@spyysalo

spyysalo Jan 20, 2017

Member

@miyao-yusuke :

happy to join of course.

Great, thanks!

I'll put together a brief document summarizing some of the proposals so far and will get back to everyone interested in the topic next week.

@spyysalo

spyysalo Jan 20, 2017

Member

さ and っぽい do not only change a POS but they add some meaning

This does also hold for "-ish" and "-like" in English.

(Maybe of interest: Duncan, "A Freezing Approach to the Ish-Construction in English")

@miyao-yusuke

miyao-yusuke Jan 20, 2017

Contributor

Sure, but I mean they are not purely functional. Maybe they can be considered like "child-like" or "child like".

@jnivre

jnivre Jan 20, 2017

Contributor

But what is the syntactic relation?

@miyao-yusuke

miyao-yusuke Jan 20, 2017

Contributor

That's the problem, but it should be similar to "child like".

@jmnybl

jmnybl Jan 20, 2017

Contributor

Question on (Finnish) derivation lemmas: We tag derived forms according to their real POS tag ("childish" is tagged as ADJ, not NOUN), but sometimes, if the Finnish morphological analyzer is not able to produce a lemma for the final derived form, we keep the original form in the lemma field (here, for example, it would be the NOUN "child"). I assume that this is not the correct approach, and we plan to correct it in v2 so that the lemma is always the derived form. Does this sound correct?
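
One way such cases could be located is sketched below in Python: flag tokens whose lemma is known to belong to a different part of speech than the token's UPOS. The `LEMMA_POS` mini-lexicon and the `suspicious_lemmas` helper are hypothetical, and the Finnish pair lapsi ("child") / lapsellinen ("childish") is used only as an illustration; a real check would need a full lexicon and would have to allow for lemmas that legitimately belong to more than one POS.

```python
# Hypothetical mini-lexicon mapping lemmas to a POS (illustration only).
LEMMA_POS = {"lapsi": "NOUN", "lapsellinen": "ADJ"}


def suspicious_lemmas(tokens):
    """Return tokens whose lemma is listed with a POS different from their UPOS."""
    return [
        t for t in tokens
        if t["lemma"] in LEMMA_POS and LEMMA_POS[t["lemma"]] != t["upos"]
    ]


tokens = [
    # Base lemma kept on a derived adjective: should be flagged.
    {"form": "lapsellinen", "lemma": "lapsi", "upos": "ADJ"},
    # Lemma is the derived form: consistent with the UPOS.
    {"form": "lapsellinen", "lemma": "lapsellinen", "upos": "ADJ"},
]

flagged = suspicious_lemmas(tokens)
for t in flagged:
    print(t["form"], t["lemma"], t["upos"])
```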

@jnivre

jnivre Jan 20, 2017

Contributor

@jmnybl Yes, this sounds completely correct to me.

@dan-zeman modified the milestones: lg-specific v2, v2.2 (Apr 24, 2018)
