Definition of word #377

Open
spyysalo opened this Issue Dec 15, 2016 · 33 comments

@spyysalo
Member

Recent discussions have suggested that the UD documentation could benefit from a more detailed definition of "word". We can use this issue to discuss the existing definition and possible improvements.

( @jnivre @dan-zeman @ftyers , others? )

@spyysalo
Member

Currently, http://universaldependencies.org/u/overview/tokenization.html reads in part

The UD annotation is based on a lexicalist view of syntax, which means that dependency relations hold between words [...] there is no attempt at segmenting words into morphemes. [...] the basic units of annotation are syntactic words (not phonological or orthographic words)

@spyysalo
Member

Joakim's Coling keynote included the following related heuristic:

What is a word?

  • Single part-of-speech tag
  • Real syntactic relation
@spyysalo
Member

Greg Pringle has written at length on the topic in the context of UD Japanese at http://www.cjvlang.com/Spicks/udjapanese.html . One specifically relevant part:

Linguists later fine-tuned the definition of words with further distributional criteria:

'Positional mobility (syntagmatic mobility)': the word is free to be used at different places in the sentence. For example, John will go can be transformed into Will John go?, indicating that will, John, and go are three separate words.

'Internal stability (internal immutability)': In contrast with the positional mobility that the word enjoys, morphemes within a word are fixed in order. For example, played is a stable unit that does not permit rearrangement as ed-play.

'Uninterruptability': it is not possible to insert anything between the morphemes of a word. For instance, it is not possible to insert anything between play and -ed (e.g., play-be-ed).

@jnivre
Contributor
jnivre commented Dec 15, 2016

This is a great initiative. However, Pringle's characterisation does not quite work for syntactic words, because it seems to exclude clitics. This paper is very relevant as well:
http://coltekin.net/cagri/papers/coltekin2016turcling.pdf

@spyysalo
Member

Thanks! @coltekin proposes the following criteria

The term inflectional group (IG) in Turkish natural language processing literature refers to a sub-word unit. [...] the unit has been a de facto standard for representing words in Turkish NLP. [...] The current paper proposes tokenizing a surface word into multiple IGs only in case one of the following is true.
a. Parts of the word may have potentially conflicting inflectional features.
b. Parts of the word may participate in different syntactic relations.

@jnivre
Contributor
jnivre commented Dec 15, 2016

Clause a. only works for languages and words that inflect, but it can be generalized to postags, I think. Fusions are a case in point. Something like French "au" does not have conflicting inflectional features, because the preposition "à" does not inflect, but it has conflicting postags, ADP vs. DET.
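The two criteria, together with the POS-tag generalization suggested here, could be sketched as a simple predicate. This is only an illustration: the `Part` type and the `should_split` helper are hypothetical names, not part of any UD tool, and comparing dependency labels is a crude stand-in for "participating in different syntactic relations".

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One candidate sub-unit of a surface word (hypothetical type)."""
    form: str
    upos: str
    feats: dict = field(default_factory=dict)
    deprel: str = ""


def conflicting_feats(a, b):
    # True if both parts carry the same feature name with different values.
    return any(k in b.feats and b.feats[k] != v for k, v in a.feats.items())


def should_split(parts):
    """Return True if any pair of parts meets one of the splitting criteria."""
    for i, a in enumerate(parts):
        for b in parts[i + 1:]:
            if conflicting_feats(a, b):   # criterion a: conflicting features
                return True
            if a.upos != b.upos:          # generalization of a: conflicting POS tags
                return True
            if a.deprel and b.deprel and a.deprel != b.deprel:
                return True               # criterion b (approximated by labels)
    return False


# French "au": no conflicting inflectional features, but ADP vs. DET.
print(should_split([Part("à", "ADP"), Part("le", "DET", {"Number": "Sing"})]))
```

The result depends entirely on the candidate analysis that is passed in, so a predicate like this can only check a proposed segmentation, not discover one.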

@dan-zeman dan-zeman added this to the lg-specific v2 milestone Dec 16, 2016
@dan-zeman
Member

Wrt clitics, I was wondering how we would explain the difference between clitics attached to a verb (Spanish dámelo, vámonos) and morphemes that encode agreement of the verb with one or more arguments (and which could be interpreted as hidden pronouns when the respective argument does not appear as an overt nominal). Spanish vamos itself could be said to encode nosotros vamos, and head-marking languages like Basque may even encode agreement with a second and third argument, but we are quite clear about UD not analyzing these as separate words ("do not annotate things that are not there"). While I agree that we should not annotate dropped subjects (or arguments in general), I would love to hear a sound rule that says why vámonos is different.

One clear difference I can think of right now is exactly the possibility (in the agreement version) that the argument also appears as a separate word. However, I am not sure that a guideline based on this would work in languages that have clitic doubling.

@jnivre
Contributor
jnivre commented Dec 16, 2016

I agree that the distinction between pronominal clitics and inflectional endings that mark agreement is a bit fuzzy and that the former could develop into the latter historically. For the Spanish case, I would appeal to language-internal arguments about systematicity. Subject agreement is present in all verbs in Spanish, even with a separate overt subject, but "object agreement" is not. Therefore, when a morpheme referring to the object is glued to the verb, we avoid exceptions if we separate it from its host. How does that sound?

@dan-zeman
Member

On a more general note: I think that the default practice in UD is that we take orthographic words (in languages where they exist) as units. Only when we see good reasons to treat certain cases as contractions of multiple syntactic words, we split them; but if they work reasonably well as one word, we keep them together even though a parallel expression in a related language is rendered as multiple words. And there is some flexibility to these decisions, as always.

Conversely, when we have a good reason to say that one syntactic word spans several orthographic words, we connect them using the fixed relation, except for the very limited cases where we are allowed to include a space in a word form.
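For the splitting case, CoNLL-U records a contraction as a multiword token: a range line (e.g. `1-2`) gives the orthographic form, followed by one line per syntactic word. A minimal sketch of reading that mapping back is shown below; the two-token fragment and the `surface_to_words` helper are made up for illustration, and the annotation values in the sample are assumptions rather than data from a treebank.

```python
# Illustrative CoNLL-U fragment for French "au marché" (10 tab-separated columns).
SAMPLE = "\n".join([
    "1-2\tau\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tà\tà\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tle\tle\tDET\t_\tDefinite=Def|Gender=Masc|Number=Sing|PronType=Art\t3\tdet\t_\t_",
    "3\tmarché\tmarché\tNOUN\t_\tGender=Masc|Number=Sing\t0\troot\t_\t_",
])


def surface_to_words(conllu):
    """Map each orthographic token to the syntactic words it covers."""
    result = []
    covered_until = 0  # last word ID covered by the current multiword token
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        tok_id, form = cols[0], cols[1]
        if "-" in tok_id:
            # Range line: starts a multiword token spanning several words.
            covered_until = int(tok_id.split("-")[1])
            result.append((form, []))
        elif "." in tok_id:
            continue  # empty nodes are not part of the surface string
        elif int(tok_id) <= covered_until:
            result[-1][1].append(form)  # word inside the current range
        else:
            result.append((form, [form]))  # ordinary one-word token
    return result


print(surface_to_words(SAMPLE))
# [('au', ['à', 'le']), ('marché', ['marché'])]
```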

@dan-zeman
Member

@jnivre : sounds good to me, thanks. If this discussion results in a set of guidelines for UD-wordness, we should include the Spanish example with this explanation.

@spyysalo
Member

Stupid question: what's the rule against analyzing e.g. joined as join/VERB ed/AUX? Would it work the same if English didn't use space?

@dan-zeman
Member

@spyysalo : irregular verbs? If the English writing system did not use space, it would still not be clear where the auxiliary begins in threw.

@spyysalo
Member

Nice :-) How might that look as a general rule? Are there other arguments that could apply?

@manning
Contributor
manning commented Dec 16, 2016

There is a lot of literature in linguistics on distinguishing clitics from morphology. Among others, Arnold Zwicky has written a lot on this issue. Of course, as @jnivre notes, the results are not always crystal clear due to grammaticalization. There is an argument that some have pursued that Romance "clitics" should really be treated as morphology. So I suspect that @dan-zeman's "On a more general note" is really part of the answer for us. But if you'd like to read more about how words, clitics, and morphology have been argued to be distinguishable, here's a little reading list:

@dan-zeman
Member

Awesome, thanks!

@spyysalo
Member

@manning: thanks, that's great! I'll dig in after recovering from Coling :-)

The reason I'm asking this particular question now is the proposed UD Japanese analysis of e.g. 食べた tabeta “ate” as 食べ/VERB た/AUX (with aux dependency: http://universaldependencies.org/ja/overview/syntax.html#special-clausal-dependents). 食べた strikes me as equally obviously a single-word past form verb as the English joined, and I'd like to know what arguments in the UD framework would support keeping this as one word.

@amir-zeldes
Contributor

@spyysalo I think what you're looking for is what this paper calls "MUW" (Middle Unit Word):

http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf

@spyysalo
Member

@amir-zeldes: this work is certainly highly relevant (and discussed at length in http://www.cjvlang.com/Spicks/udjapanese.html, referenced above), but unfortunately even the longest "LUW" (Long Unit Word) in the paper splits 食べた into 食べ/VERB and た/AUX (see Figure 1).

@amir-zeldes
Contributor

You're right, sorry for the confusion! I agree, it's odd for this to be a token while other agglutinative morphemes (e.g. potential) are not.

@kanayamah
Contributor

@spyysalo, one strong reason to split 食べた (ate) into 食べ/VERB and た/AUX is that other elements such as a polite marker can be inserted between these two words (食べました - 食べ/VERB まし/AUX た/AUX). Considering English perfective and French passé composé, I think it is quite natural to represent a past form with two words.

@dan-zeman
Member

@kanayamah, considering simple past in English as well as in a good many other languages, it is also quite natural to represent a past form with one word :-) Is the range of "words" that you can insert before た wider than just the politeness morpheme? Like in English "I have just eaten", "I have not eaten", "I have already eaten" (and the inserted thing is always indisputably an independent word).

Multiple affixes can attach to a stem in morphologically rich languages, but it does not necessarily warrant wordness of these affixes. For instance, the Czech adjective zvláštn-í "strange" ends with the morpheme í, which contributes the Case and Number features, and you can insert the comparative morpheme ejš before that (zvláštn-ějš-í "stranger") but it does not imply that either ejš or í are independent words.

@amir-zeldes
Contributor

@kanayamah I think one of the reasons some people are uncomfortable with this but have fewer problems with French is that the French auxiliary can stand by itself: it is white-space separated, it looks identical to either the verb 'be' or 'have' as used outside of the past tense construction, and it can be separated from the lexical verb by adverbs: 'il est déjà arrivé'. The morpheme た behaves more like an inflectional marker: its position is completely predictable and it cannot be used independently.

@spyysalo
Member

I agree with the points raised by @dan-zeman and @amir-zeldes. Specifically regarding the argument given by @kanayamah, I don't think the occurrence of ます between 食べ and た in 食べました is a reason to analyze た as an independent (syntactic) word rather than as a morpheme, as we can either consider 食べました as a word or た as an inflectional morpheme on ます.

@spyysalo
Member

@kanayamah : to clarify my concern a bit: the currently proposed UD Japanese representation implies the radical claim that Japanese has (effectively) no inflectional morphology. To the best of my knowledge, this is a departure from both traditional and previous theoretical linguistic analyses of Japanese verb behavior.

Some of the published UD Japanese material suggests that you agree that the proposed (SUW) analysis segments into morphemes rather than syntactic words, e.g. http://www.lrec-conf.org/proceedings/lrec2016/pdf/122_Paper.pdf:

Short Unit Word (SUW): SUW is a minimal language unit that has a morphological function.

This kind of construction [...] is considered to be morphological (a word formation), rather than a syntactic relation. [regarding the derivation of e.g. かわいさ and 子どもっぽい]

However, the UD representation does not segment words into morphemes (see http://universaldependencies.org/u/overview/tokenization.html).

Are you using UD to represent the morphological structure of Japanese verbs (etc.), or does UD Japanese claim that Japanese verbs have no morphological structure?

@miyao-yusuke
Contributor

@dan-zeman @spyysalo Many other expressions (aspect, modality, passive, causative, benefactive, etc.) can appear between 食べ and た. Their composition is very systematic (no exceptions like "threw"), and some expressions can appear multiple times (e.g. double negation) and sometimes can be coordinated. I don't see any reason not to segment them.

@spyysalo Modern syntactic theories consider these expressions as independent tokens. For example, the following book presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens and given category S\S.

Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee [Formal theory of Japanese grammar: The system of conjugation, syntactic structure, and semantic composition]. Japanese Frontier Series 24. Kurosio Publishers.

Anyway, I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

@spyysalo
Member

@miyao-yusuke : thanks for the quick response, your input is much appreciated! Some comments:

I think we need a clear definition of "word" in the UD sense, which does not depend on the existence of spaces.

Very much agreed! In the absence of more qualified volunteers, I've been trying to draft a rough proposal on this, but the more I read on attempts to define "word" cross-linguistically (e.g. Haspelmath 2011), the less confident I am...

no exceptions like "threw"

Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

Modern syntactic theories consider these expressions as independent tokens. For example, [Bekki, Daisuke, 2010. Nihongo Bunpoo no Keesiki Riron: Katsuyoo taikee, Toogo koozoo, Imi goosee] presents a comprehensive theory of Japanese syntax based on CCG, in which verb-following expressions are considered as independent tokens

Thank you for the pointer! I could unfortunately not find a copy of 日本語文法の形式理論 (I take it) online, but will try to follow up on this. A couple of clarifying questions:

  • Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?

  • Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

Also, regarding the above statement from Tanaka et al. 2016, do you agree with

For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.

i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

@miyao-yusuke
Contributor

Wouldn't する → した, くる → きた, and いく → いった qualify as irregular past forms?

A popular analysis is that する inflects to さ, し, せ, する, すれ, etc., and function words/morphemes attach to one of these forms; e.g. し + ない, ている, た, よう, etc. くる and いく are similar.

Is it OK to equate "token" with (syntactic) "word" here? That is, would it be correct to say Modern syntactic theories consider expressions such as た in 食べた as independent syntactic words?

They avoid defining "word". The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.

Would it be correct to say that UD Japanese follows the approach of Bekki (2010) and rejects the view that Japanese verbs have past forms, analyzing past expression formation as a syntactic (rather than morphological) process?

We are not following this specific theory, but follow the definition of SUW. However, I think we share the fundamental idea. Function words/morphemes of Japanese cannot be clearly classified into word or suffix/morpheme. Some tend to behave more like words, while others are closer to suffixes, but most are somewhere in between. As far as I know, linguistics has not reached a clear conclusion on this so far.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one. We simply selected one of them (mainly due to practical reasons).

For example, the suffix さ sa changes an adjective into a noun [...] and っぽい ppoi changes a noun into an adjective [...] This kind of construction is very generative, and it is considered to be morphological (a word formation), rather than a syntactic relation.
i.e. to analyze the formation of e.g. かわいさ and 子どもっぽい as morphological rather than syntactic processes?

These specific examples might look like derivational morphemes (actually we had some discussions in the Japanese UD team about these constructions). However, they also have many characteristics of "wordness" (actually, many of the "wordness" criteria of Zwicky, Haspelmath, etc. can apply to them).

@jnivre
Contributor
jnivre commented Jan 13, 2017

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. I still think we can use the paper by Cagri on principles for Turkish as a starting point, but it obviously needs to be refined and extended to cover typologically different languages. My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes, but this does obviously not carry a lot of weight.

@spyysalo
Member

@miyao-yusuke :

The important message here is that the distinction between word and suffix/morpheme is not very clear nor meaningful in Japanese, and considering everything as "syntactic unit" can explain various complications and interactions with syntax.

I appreciate that, and I understand the merits of segmenting down to morphemes and representing morphological and syntactic structures in a unified fashion.

However, UD does not (currently :-)) address word formation, and I'm concerned that not defining a word unit above the morpheme level will have a negative impact on the value of UD Japanese annotations for cross-linguistic use cases.

I'm not objecting to the inflection-based analysis. I mean both (and other possibilities in between) are acceptable and I cannot find any clear reason to reject one.

I'm very happy to hear that, and I hope we can establish a good argument in support of a choice soon!

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups. [...] My own superficial impression of the Japanese situation is that we need a compromise solution, where some of the contested items are regarded as words and others as bound morphemes

+1 on both!

@rtsarfaty
Contributor

@jnivre :

I think we should form a working group on universal guidelines for word segmentation, with representation from diverse language groups.

Great idea! Happy to join the working group and represent the Semitic angle.

@mojgan-seraji
Contributor

I would also be happy to join the working group and represent the Indo-European languages written in Perso-Arabic script. Word segmentation is a major challenge in languages like Persian, where the same "word" can be written in several different styles.

@spyysalo
Member

I'd be happy to represent the Finnic language family :-)

@jnivre
Contributor
jnivre commented Jan 20, 2017

Following up on the discussion about UD Japanese, it seems clear that the segmentation of derivational morphology is not consistent with general UD principles, nor with practice in other UD treebanks, so I definitely think that should be changed. Similarly, when it comes to verbal morphology, all the descriptions so far point to this being a regular and systematic case of agglutination, similar to the situation in Turkish and Finnish, so it seems that an analysis using morphological features is most compelling there as well.
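To make the contrast concrete, the feature-based alternative can be sketched as folding the segmented suffixes back into the verb and recording them as morphological features. This is a toy illustration under stated assumptions: `SUFFIX_FEATS` and `merge_suffixes` are hypothetical, and the mapping of た to `Tense=Past` and まし to `Polite=Form` is assumed for the example rather than taken from the UD Japanese guidelines.

```python
# Hypothetical mapping from bound morphemes to UD-style features.
SUFFIX_FEATS = {
    "た": {"Tense": "Past"},
    "まし": {"Polite": "Form"},
}


def merge_suffixes(units):
    """Fold known AUX suffixes into the preceding word as features.

    units: list of (form, upos) pairs from a morpheme-level segmentation.
    Returns a list of (form, upos, feats) triples.
    """
    words = []
    for form, upos in units:
        if upos == "AUX" and form in SUFFIX_FEATS and words:
            prev_form, prev_upos, prev_feats = words[-1]
            # Attach the suffix to its host and record its contribution.
            words[-1] = (prev_form + form, prev_upos,
                         {**prev_feats, **SUFFIX_FEATS[form]})
        else:
            words.append((form, upos, {}))
    return words


print(merge_suffixes([("食べ", "VERB"), ("まし", "AUX"), ("た", "AUX")]))
# [('食べました', 'VERB', {'Polite': 'Form', 'Tense': 'Past'})]
```

A real conversion would need to decide, item by item, which of the contested units count as words and which as bound morphemes; the lookup table here simply stands in for that decision.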
