Support more than one morpheme separator #6

Closed · fmatter opened this issue Apr 17, 2020 · 13 comments
@fmatter commented Apr 17, 2020

pyigt seems to allow only a single morpheme separator (morpheme_separator, which must be a string). The LGR specify the equals sign = for clitics (p. 2) and the tilde ~ for reduplication (p. 8). I have a fairly large IGT corpus (already in CLDF format) that contains all three separators. Was the possibility of multiple morpheme separators excluded on purpose?

LinguList changed the title from “What about different morpheme separators?” to “Support more than one morpheme separator” on Apr 18, 2020
@LinguList commented Apr 18, 2020

No, not on purpose; the first corpus we had used only the simple morpheme separator. It should be easy to adjust the code to add additional separators, and I think this should definitely be done (see my change of title).

As far as the general design goes, however, I now find multiple morpheme separators quite problematic, since they add semantic information about the following or preceding segments, so the information is not contained in the segment itself.

Information on the type of a specific form in an utterance should instead be contained in the gloss. Design-wise this is much more transparent, although I understand that common practice differs here.

A question, @fmatter: do you have the semantic distinction between clitics and the like represented at the level of the glosses as well, or would you need to add it from the phrase/utterance/form part?

LinguList self-assigned this on Apr 18, 2020
@LinguList commented Apr 18, 2020

@xrotwang, if we go from a simple split to re.split, this should be fine, right? The only problem is escaping symbols like the tilde (I always forget what needs to be escaped in Python and what does not). It would mean, though, that we only support the pure splitting, not the idea of putting semantics into the separators.
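
For illustration, a minimal sketch of what such a re.split-based splitter could look like; the separator list and function name here are illustrative, not pyigt's actual API:

```python
import re

# Hypothetical separator list: "-" is the plain morpheme separator,
# "=" (clitics) and "~" (reduplication) come from the LGR.
MORPHEME_SEPARATORS = ["-", "=", "~"]

def split_morphemes(word):
    # re.escape makes each separator safe to use in a regular
    # expression, so nothing has to be escaped by hand.
    pattern = "|".join(re.escape(s) for s in MORPHEME_SEPARATORS)
    return re.split(pattern, word)

print(split_morphemes("populus=que"))  # ['populus', 'que']
print(split_morphemes("bi~bili"))      # ['bi', 'bili'] (a reduplicated form)
```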

@fmatter commented Apr 18, 2020

> As far as the general design goes, however, I now find multiple morpheme separators quite problematic, since they add semantic information about the following or preceding segments

Well, that's why they're there in the first place -- to add semantic information. The = tells us that the element in question is syntactically free but phonologically bound. The ~ tells us that the element copies segmental material from the base it attaches to (which will also elude pyigt's morpheme recognition method).

> A question, @fmatter: do you have the semantic distinction between clitics and the like represented at the level of the glosses as well, or would you need to add it from the phrase/utterance/form part?

I don't quite understand -- do you refer to the glossing line? Yes, the same separators are used there, too.

@LinguList commented Apr 18, 2020

The information is ambiguous, however, if you use = and the like: a computer cannot tell whether it applies to the element before or after it. Or is it cross-linguistically the case that = applies only to the following element? There are proclitics and enclitics cross-linguistically, and I assume some language will even allow placing them inside a word. How do I know which part is the clitic if this is not language-specific?

In general, semantic information is better not encoded in the segmentation symbol. Computationally this is a bad idea, though I know it is typical of qualitative practice, and I know it is desirable to find a way to lift these data. But if the data are inherently ambiguous, specific actions or pre-processing will be needed, and a general solution may not be possible.

So there are two ways to handle the problem:

  1. If information on clitics is also present in the glosses themselves, we simply treat a list of symbols as equivalent for splitting a word/phrase into morphemes. This is the easy solution we can quickly add.
  2. If information on whether a form is a clitic is not given in the glosses, dataset-specific pre-processing is required, since language-specific information needs to be taken into account (maybe even post-annotation), given that the notion of a segmentation marker with semantics is inherently ambiguous. Pre-processing can be done individually, even manually, but also computationally, e.g., by searching for clitics, assuming they are enclitics, and adding the semantic information to the corresponding gloss (see the sketch below). After this pre-processing stage, the data could be handled readily by pyigt.
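
As a rough illustration of point 2, assuming for simplicity that every =-marked form is an enclitic and that the same separators appear in the gloss line; the function and tag names are hypothetical, not part of pyigt:

```python
def mark_enclitics(text_line, gloss_line):
    """Append .CLITIC to glosses of =-attached forms, assuming enclitics."""
    new_glosses = []
    for word, gloss in zip(text_line.split(), gloss_line.split()):
        if "=" in word:
            head, *clitics = gloss.split("=")
            # Tag everything after "=" as a clitic in the gloss itself.
            gloss = "=".join([head] + [c + ".CLITIC" for c in clitics])
        new_glosses.append(gloss)
    return " ".join(new_glosses)

print(mark_enclitics("senatus populus=que romanus",
                     "senate people=and roman"))
# senate people=and.CLITIC roman
```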

@LinguList commented Apr 18, 2020

> I don't quite understand -- do you refer to the glossing line? Yes, the same separators are used there, too.

Okay, but I mean: assume I went through your data and replaced every = with -, throughout the text. Would you still know whether a word-form part (morpheme) is a clitic or a reduplicated part?

Imagine the -que clitic in Latin, and the following text:

Line   w1       w2           w3
text   senatus  populus=que  romanus
gloss  senate   people=and   roman

You could also imagine glossing it like this:

Line   w1       w2                 w3
text   senatus  populus=que        romanus
gloss  senate   people=and.CLITIC  roman

I am not an expert in glossing, but you see what I mean, right?

@LinguList commented Apr 18, 2020

In the first case of my example, you would not know automatically whether populus or que is the clitic. In the second case, however, this information would be in the gloss and.CLITIC, as opposed to simple and.

@fmatter commented Apr 18, 2020

>> I don't quite understand -- do you refer to the glossing line? Yes, the same separators are used there, too.
>
> Okay, but I mean: assume I went through your data and replaced every = with -, throughout the text. Would you still know whether a word-form part (morpheme) is a clitic or a reduplicated part?
>
> Imagine the -que clitic in Latin, and the following text:
>
> Line   w1       w2           w3
> text   senatus  populus=que  romanus
> gloss  senate   people=and   roman
>
> You could also imagine glossing it like this:
>
> Line   w1       w2                 w3
> text   senatus  populus=que        romanus
> gloss  senate   people=and.CLITIC  roman
>
> I am not an expert in glossing, but you see what I mean, right?

Ah, that was the second possible interpretation of "in the gloss". No, that information is not explicitly encoded, just as there is no explicit information about wordhood in the glosses. Most people will simply use =. You do sometimes see explicitly coded reduplication, but that is not LGR-conformant (and, frankly, it makes no sense to gloss something as "RED" -- I can see that it is reduplicated, but what does it mean?).

The Leipzig Glossing Rules are intended for human use, which can complicate things for computers. If pyigt is intended to be LGR-compatible (and human-friendly), things like clitics, reduplicated elements, infixes, and so on should somehow be accommodated.

> The information is ambiguous, however, if you use = and the like: a computer cannot tell whether it applies to the element before or after it. Or is it cross-linguistically the case that = applies only to the following element? There are proclitics and enclitics cross-linguistically, and I assume some language will even allow placing them inside a word. How do I know which part is the clitic if this is not language-specific?

I've run into that problem many times as well. Besides, does the same not apply to - and affixes? If we have a string X-Y-Z, how do we know which parts are prefixes and which are suffixes?

> If information on whether a form is a clitic is not given in the glosses, dataset-specific pre-processing is required, since language-specific information needs to be taken into account (maybe even post-annotation), given that the notion of a segmentation marker with semantics is inherently ambiguous. Pre-processing can be done individually, even manually, but also computationally, e.g., by searching for clitics, assuming they are enclitics, and adding the semantic information to the corresponding gloss. After this pre-processing stage, the data could be handled readily by pyigt.

Right!

@LinguList commented Apr 18, 2020

Okay, I guess we are on the same page then: pyigt SHOULD support the Leipzig Glossing Rules, and other practices too if there is need or interest, but we cannot do magic if human annotation is inherently ambiguous. So we would support solution 1 automatically (expanding the splitter elements to allow for more than one). For case 2, we would add a disclaimer that scholars may need to go through the data again and post-annotate this in the concordance (for example). I think that is also okay, as the concordance ideally serves to enhance the data, and it would allow one to qualify que, glossed as and in Latin, as an enclitic. The same would hold for affixes.

@fmatter commented Apr 18, 2020

So you would go through the concordance and assign morpheme types like prefix or enclitic? Would this enhanced concordance file also contain roots? You already differentiate between grammatical and lexical morphemes, right?

@LinguList commented Apr 18, 2020

That's what I would consider crucial: to extract specifically the lexical ones, as those are what interest me as a historical linguist. But other people would like to extract the grammatical ones, yes, and I think the best IGT should indicate what is a lexical root, what is a grammatical form, etc. Ideally, it would even recognize two roots as the same when they are glossed differently (!), which is not trivial...

@xrotwang commented Apr 18, 2020 via email

@fmatter commented Apr 18, 2020

@xrotwang The issue is: as soon as you distinguish different morpheme separators according to their semantics, you need a way to assign those semantics to the morphemes the separators relate to. And that is not straightforward, as discussed above. Even a simple - would then have to be interpreted in a certain way, and pyigt would have to decide how the separated morphemes are affected by these semantics (is it a prefix? a root?).

For the moment, I think a good solution would be to simply allow more morpheme separators (since these are described by the LGR and widely used in existing corpora) but ignore their semantics.

Ignoring the semantics is not a problem, because the current solution of using only - also ignores them: it merely separates morphemes, with no consequences for interpretation as prefix, suffix, etc. By additionally allowing =, ~, and < > (but not distinguishing them functionally from -), you would not lose any functionality currently present. You would, however, gain compatibility with the LGR.
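
A small sketch of what such semantics-agnostic splitting might look like; keeping the separators via a capturing group means no information is lost for later post-processing. The pattern and function are illustrative, not pyigt's actual API:

```python
import re

# All LGR separators treated alike; "<" and ">" (infixes) are
# simplified here to plain separators.
LGR_SEPARATORS = re.compile(r"([-=~<>])")

def split_word(word):
    parts = LGR_SEPARATORS.split(word)
    morphemes = parts[0::2]   # the morphemes themselves
    separators = parts[1::2]  # separators[i] followed morphemes[i]
    return morphemes, separators

print(split_word("populus=que"))  # (['populus', 'que'], ['='])
print(split_word("b<um>ili"))     # (['b', 'um', 'ili'], ['<', '>'])
```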

@fmatter commented Apr 18, 2020

> That's what I would consider crucial: to extract specifically the lexical ones, as those are what interest me as a historical linguist. But other people would like to extract the grammatical ones, yes, and I think the best IGT should indicate what is a lexical root, what is a grammatical form, etc. Ideally, it would even recognize two roots as the same when they are glossed differently (!), which is not trivial...

An approach that might work, then, is to identify lexical morphemes the way you do now, but then decide on the prefix or suffix status (or proclitic and enclitic status...) of grammatical morphemes by checking their position in the phonological word relative to the morphemes identified as lexical.

Two issues with that approach:

a) some grammatical morphemes might be analyzed as lexical (because they are not glossed in caps) -- on the other hand, that already happens now.

b) it runs into problems with polysynthetic languages, or simply languages with phonological words where grammatical morphemes are not necessarily on the outside, as in this example from Kwaza:

cari-hỹ-ta'dy=jã-ki
shoot-NOM-EXCL=be-DEC

A semantic interpretation of morpheme separators would have to:

a) identify 'shoot' and 'be' as lexical (which can be done based on lowercase);

b) identify -NOM, -EXCL, and -DEC as affixes (which can be done based on uppercase and being preceded by lexical items);

c) identify 'be' (technically including its suffixes) as a clitic -- OR it could just ignore that clitic, since it connects two lexical elements! The latter solution would probably be better. If, on the other hand, we find a pattern like lake=LOC, then the locative morpheme should be categorized as an enclitic (a rough sketch of this heuristic follows below).
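
To make the heuristic concrete, here is a rough sketch under the assumptions above (lowercase glosses are lexical, uppercase ones grammatical, and =-attached grammatical morphemes are enclitics); all names are illustrative, and none of this is pyigt's actual API:

```python
import re

def classify(gloss_word):
    """Assign a morpheme type to each gloss in a '-'/'='-segmented word."""
    parts = re.split(r"([-=])", gloss_word)
    glosses, seps = parts[0::2], parts[1::2]
    is_lexical = [g.islower() for g in glosses]
    types = []
    for i, (gloss, lexical) in enumerate(zip(glosses, is_lexical)):
        if lexical:
            # Lexical elements stay roots even when "="-attached: this
            # implements the "ignore clitics between lexical elements"
            # option from point c) above.
            types.append("root")
        elif i > 0 and seps[i - 1] == "=":
            types.append("enclitic")
        elif any(is_lexical[:i]):
            types.append("suffix")
        else:
            types.append("prefix")
    return list(zip(glosses, types))

print(classify("shoot-NOM-EXCL=be-DEC"))
# [('shoot', 'root'), ('NOM', 'suffix'), ('EXCL', 'suffix'),
#  ('be', 'root'), ('DEC', 'suffix')]
print(classify("lake=LOC"))
# [('lake', 'root'), ('LOC', 'enclitic')]
```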
