Support more than one morpheme separator #6
No, not on purpose; the first corpus we had contained only the simple morpheme separator. It should be easy to adjust the code to add additional separators, and I think this should definitely be done (see my change of title). From a design perspective, however, I now find multiple morpheme separators utterly problematic, since they add semantic information about following or preceding segments, so the information is not contained in the segment itself. The information on the type of a specific form in an utterance should rather be contained in the gloss. Design-wise this is much more transparent, although I understand that the practice here is quite different. A question, @fmatter: do you have the semantic distinction between clitics and the like represented on the level of the glosses as well, or would you need to add it from the phrase/utterance/form part?
@xrotwang, if we go from a simple split to …?
Well, that's why they're there in the first place -- to add semantic information. The …
I don't quite understand -- do you refer to the glossing line?

Yes, the same separators are used there, too.
The information is, however, ambiguous even so: the separator marks a boundary, but not which of the adjacent segments it characterizes. In general, the semantic information would better not be displayed in the segmentation symbol; this is computationally a bad idea, though typical of qualitative practice. I know it is desirable to find a way to lift these data, but if the data are inherently ambiguous, one will need specific actions or pre-processing to handle the problem, and a general solution may not be possible. So there are two ways to handle the problem:

1. expand the splitter elements to allow for more than one separator, ignoring their semantics (this can be done automatically);
2. also recover the semantics, which scholars would have to post-annotate, e.g. in the glosses or in a concordance.
Okay, but I mean: assume I would go through your data and replace all special separators with the simple one. Imagine the -que clitic in Latin, and the following text:

populus-que
people-and

You could also imagine glossing it like this:

populus-que
people-and.CLITIC

I am not an expert in glossing, but you see what I mean, right?
In the first case of my example, you would not know automatically whether populus or que is the clitic. But in the second case, this information would be in the gloss and.CLITIC, as opposed to the simple 'and'.
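To make the contrast concrete, here is a minimal Python sketch (not part of pyigt; the .CLITIC decoration is just the hypothetical gloss convention from the example above) that reads clitic-hood off the gloss rather than off the separator:

```python
def morpheme_types(word, gloss, separator="-"):
    """Pair each morpheme with a type read off its gloss, if any."""
    pairs = zip(word.split(separator), gloss.split(separator))
    return [
        (form, g, "clitic" if g.endswith(".CLITIC") else "unspecified")
        for form, g in pairs
    ]

print(morpheme_types("populus-que", "people-and"))
# -> [('populus', 'people', 'unspecified'), ('que', 'and', 'unspecified')]
print(morpheme_types("populus-que", "people-and.CLITIC"))
# -> [('populus', 'people', 'unspecified'), ('que', 'and.CLITIC', 'clitic')]
```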
Ah, that was the second possible interpretation of "in the gloss". No, that information is not explicitly encoded, just like there is no explicit information about wordhood in the glosses. Most people will simply use 'and'. The Leipzig Glossing Rules are intended for human use, which can complicate things for computers. If …
I've run into that problem many times as well. In addition, does that not equally apply to the plain '-'?
Right!
Okay, I guess then we are on the same page: pyigt SHOULD support the Leipzig Glossing Rules and also other practices if there is the need / interest, but we cannot do magic if human annotation is inherently ambiguous. So we would go for supporting solution 1 automatically (expanding the splitter elements to allow for more than one), while for 2 we would add a disclaimer that scholars may need to go through the data again and post-annotate this in the concordance (for example). This is, I think, also okay, as the concordance ideally serves to enhance the data, and it would allow one to qualify que glossed as 'and' in Latin as an enclitic. And the same would hold for affixes.
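A small sketch of what such post-annotation could look like; the concordance structure below is purely illustrative and not something pyigt provides:

```python
# Hypothetical concordance mapping (form, gloss) pairs to manually assigned
# morpheme types; the automatic split itself stays semantics-free.
concordance = {
    ("que", "and"): "enclitic",
}

def morpheme_type(form, gloss):
    """Look up a post-annotated morpheme type, if any."""
    return concordance.get((form, gloss), "unannotated")

print(morpheme_type("que", "and"))         # -> 'enclitic'
print(morpheme_type("populus", "people"))  # -> 'unannotated'
```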
So you would go through the concordance and assign morpheme types like …?
That's what I would consider crucial: to extract specifically the lexical ones, as those are interesting for me as a historical linguist, etc. But other people would like to extract the grammatical ones, yes, and I think the best IGT should indicate what is a lexical root, what is a grammatical form, etc., and even recognize that two roots are the same when they are glossed differently (!), which is not trivial...
I could imagine some sort of "LGR dialect spec" as config for pyigt, declaring morpheme separators, what semantics they carry, etc. Ideally this should be metadata on the CLDF examples component.
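A rough idea of what such a dialect spec might look like; the mapping below is invented for illustration and does not correspond to an existing pyigt option or CLDF schema:

```python
import re

# Invented "LGR dialect spec": which characters separate morphemes and what
# semantics they are declared to carry.
DIALECT = {
    "-": "affix boundary",
    "=": "clitic boundary",
    "~": "reduplication boundary",
}

def split_with_semantics(word, dialect=DIALECT):
    """Split a word but keep each separator, so its declared meaning survives."""
    pattern = "([" + re.escape("".join(dialect)) + "])"
    return re.split(pattern, word)

print(split_with_semantics("populus=que"))
# -> ['populus', '=', 'que']
```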
@xrotwang The issue is: as soon as you're distinguishing different morpheme separators according to semantics, you will need a way to assign these semantics to the morphemes the separators relate to. And that is not straightforward, as we discussed above, even in simple cases. For the moment, I think it would be a good solution to just allow for more morpheme separators (since these are described by the LGR and widely used in existing corpora), but ignore the semantics. Ignoring the semantics is not a problem, because the current solution of just using a single separator does not capture them either.
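A sketch of that interim solution, assuming a hypothetical plural `morpheme_separators` option alongside pyigt's documented single-string `morpheme_separator`: split on any of the LGR separators but discard the distinction between them:

```python
import re

def split_morphemes(word, morpheme_separators="-=~"):
    """Split on any of the given separator characters, ignoring which one matched."""
    pattern = "[" + re.escape(morpheme_separators) + "]"
    return [m for m in re.split(pattern, word) if m]

print(split_morphemes("populus=que"))  # -> ['populus', 'que']
print(split_morphemes("wee~wee-de"))   # -> ['wee', 'wee', 'de'] (invented form)
```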
An approach that might work then is to identify lexical morphemes the way you do now, but then decide on the prefix or suffix status (or proclitic and enclitic status…) of grammatical morphemes by checking their position in the phonological word relative to the morphemes identified as lexical. Two issues with that approach: a) some grammatical morphemes might be analyzed as lexical (because they're not glossed with caps) -- on the other hand, that already happens now; b) it runs into problems with polysynthetic languages, or just languages with phonological words where it is not necessarily the case that grammatical morphemes are on the outside, like in this example from Kwaza:

cari-hỹ-ta'dy=jã-ki

A semantic interpretation of morpheme separators would have to a) identify 'shoot' and 'be' as lexical (can be done based on lowercase) …
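A rough Python sketch of that heuristic (not an existing pyigt feature): glosses written entirely in lowercase count as lexical, and grammatical morphemes are classified by the separator attached to them and by whether a lexical root still follows. Since the gloss line of the Kwaza example is not reproduced above, the word and gloss at the bottom are invented:

```python
import re

# Which separator maps to which pre-/post-root category.
SEP_TYPES = {
    "-": ("prefix", "suffix"),
    "=": ("proclitic", "enclitic"),
    "~": ("reduplicant", "reduplicant"),
}

def classify(word, gloss):
    """Guess a type for each morpheme; assumes at least one lexical (lowercase) gloss."""
    forms = re.split(r"[-=~]", word)
    seps = re.findall(r"[-=~]", word)   # seps[i] sits between forms[i] and forms[i+1]
    glosses = re.split(r"[-=~]", gloss)
    lexical = [g.islower() for g in glosses]
    out = []
    for i, (form, g) in enumerate(zip(forms, glosses)):
        if lexical[i]:
            out.append((form, g, "lexical"))
        elif any(lexical[i + 1:]):      # a lexical root still follows: pre-positioned
            out.append((form, g, SEP_TYPES[seps[i]][0]))
        else:                           # only grammatical material follows: post-positioned
            out.append((form, g, SEP_TYPES[seps[i - 1]][1]))
    return out

# Invented example, not actual Kwaza data:
print(classify("ta-na-kupa=ra", "3SG-PST-speak=Q"))
# -> [('ta', '3SG', 'prefix'), ('na', 'PST', 'prefix'),
#     ('kupa', 'speak', 'lexical'), ('ra', 'Q', 'enclitic')]
```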
pyigt only seems to allow for a single morpheme separator (morpheme_separator, which must be a string). The LGR specify the equals sign = for clitics (p. 2) and the tilde ~ for reduplication (p. 8). I have a fairly large IGT corpus (already in CLDF format) which contains all three separators. Was the possibility of multiple morpheme separators excluded on purpose?
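For illustration, a plain-Python sketch of what gets lost with a single separator; the form is invented and this is not pyigt's internal code:

```python
word = "wee~wee=de-t"    # invented form containing all three LGR separators

# With a single separator, only the '-' boundary is found; the clitic ('=') and
# reduplication ('~') boundaries stay glued to the first chunk.
print(word.split("-"))   # -> ['wee~wee=de', 't']
```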