This repository has been archived by the owner on Oct 22, 2019. It is now read-only.

Final fixes #123

Merged
merged 15 commits into from
Nov 27, 2018

tresoldi
Contributor

This PR relates to most of the stuff discussed in #121

In detail:

  • I have added the missing linguolabial as a value for feature place of consonants, as well as the most important sounds to the catalog.
    There are some issues for discussion here, as per IPA all linguolabial
    consonants need a diacritic (i.e., there is no linguolabial consonant with
    its own, diacritic-less representation), which in turn makes things a bit
    complex when setting an alias. As such, no stuff like U032B (combining arches) was
    implemented: the only diacritic for linguolabial place of articulation is
    the standard IPA U033C (the seagull).
  • I have merged the "centralization", "retraction", and "advancement" features
    into a single feature "relative_articulation" (possibly not the best name),
    as we were allowing for things like "advanced retracted centralized open front
    vowel". This is now fixed.
  • The above feature of "relative_articulation" can now be applied to consonants (so we support things like and ŋ˗).
  • Consonants can now have "mid-long" as "duration", so we allow for things
    like .
  • Laminal fricatives are there (such as and ).
  • Grapheme ł (Polish letter) is now normalized to ɬ (voiceless alveolar lateral fricative).
  • I've added (voiced retroflex implosive), (voiced labio-dental stop),
    and (voiceless labio-dental stop) to BIPA.
  • I've renamed feature value "labialized-velar" to "labio-velar" and
    "labialized-palatal" to "labio-palatal" in CLTS; all the transcription data
    and systems were updated (using sed from the command line, I've checked many
    times and don't think there are any false positives or negatives in the
    replacements)
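
The effect of merging the three features can be illustrated with a minimal sketch. It is not the pyclts implementation: the `Vowel` class, the value names, and the `name` property are hypothetical, and only the feature name `relative_articulation` comes from the description above. The point is that a single slot cannot hold contradictory values at the same time.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical value set; the actual CLTS values may be named differently.
RELATIVE_ARTICULATION = {"advanced", "retracted", "centralized"}


@dataclass(frozen=True)
class Vowel:
    height: str
    backness: str
    # One slot instead of three independent features, so at most one
    # value can be set at a time.
    relative_articulation: Optional[str] = None

    def __post_init__(self):
        value = self.relative_articulation
        if value is not None and value not in RELATIVE_ARTICULATION:
            raise ValueError(f"unknown relative_articulation: {value!r}")

    @property
    def name(self) -> str:
        # Build a feature-based name, skipping the slot when it is unset.
        parts = [self.relative_articulation, self.height, self.backness, "vowel"]
        return " ".join(p for p in parts if p)
```

With three separate boolean-like features, "advanced retracted centralized open front vowel" could be constructed; with one slot, such a combination has no representation and is rejected.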

Things I didn't implement from the issue:

  • The dot for syllabicity would lead to confusion and it has not been standard
    IPA for quite some time; its implementation, if really necessary, should be
    part of specific TranscriptionData and TranscriptionSystems, not CLTS/BIPA.
  • I didn't add ◌͇ (U0347) as a diacritic for alveolar, as it is not IPA (once
    more, it should be included in ad-hoc transcription data, not in BIPA), and diacritics for place of articulation belong in the catalog of sounds rather than in that of diacritics (as place of articulation is one of the essential features).
  • I didn't add uvularization of vowels (cases such as ʌʶ); while I believe
    there is room for them, I agree with @cormacanderson that as a general feature
    it is questionable, and we should discuss this in more detail; one potential
    source can be found here,
    but many more references are presented by the most basic Google search (mostly when describing dialects, and
    I couldn't find a clear-cut case where it is phonemic).
  • I didn't change ɗ from dental to alveolar (even though alveolar as a default
    place of articulation, requiring a diacritic for the dental, makes sense to me);
    we need to discuss this in more detail, especially considering that the
    transcription data we are linking to seem to use it as dental.
  • I didn't add roundness to approximants, as this would involve adding the
    feature to all consonants; while I don't oppose this from an articulatory point
    of view, it should be further discussed (if we go for the feature only for
    approximants, things are more complicated, as we'd need to either set
    approximants as a different sound type from consonants or to change the code
    in order to implement the limitation).
  • I made no changes to the position of the voiceless diacritic (above or below
    the glyph), as it is currently not possible to have a one-fits-all solution;
    any apparently simple change can result in unintended consequences (the easiest
    solution is probably to just default to one position and manually list the
    alternatives in the sound catalog, but once more this is something to be
    discussed and agreed upon).
  • While, as @cormacanderson, I am very in favor of adding qp and db digraphs
    for labio-dental affricates, there is no rush in doing so and it is
    probably a good idea to only include them in a second release (Cormac is
    also waiting for an answer from Anne-Maria). They are not
    formally IPA, but my opinion is that they would fit very well in BIPA considering
    that [i] there is no independent glyph for those sounds, [ii] the graphical
    solution is very good, and [iii] the symbols have been in use for quite some
    time
  • I didn't add ı (dotless i U0131) as an alias of ɯ U026F: this is really
    a matter of Turkish orthography and not phonological transcription, and if
    necessary should be part of a Turkish orthographic profile.
  • I didn't touch triphthongs and all other complex clusters, as CLTS only
    supports two-sound clusters by design. I can understand objections to that,
    but this should really be first discussed with @LinguList .
  • Some redundant/tautological information (such as syllabic vowels and
    nasalized nasal stops) is part of the design of CLTS; this can be changed
    by checking for redundant features, but it is not something that can be
    implemented with five minutes of coding. In any case, @LinguList should be
    part of this discussion.
  • I didn't add ḱ and ɡ́ as aliases of palatal stops, as this should be
    part of transcription data / orthographic profiles dealing with PIE, not
    BIPA.
  • I tend to agree with @cormacanderson that rounding and roundedness should
    not be different features, which would mean adding a continuum of
    unrounded, less-rounded, rounded, and more-rounded values. However,
    I didn't change that as it would break some datasets such as Eurasian (which
    might be problematic anyway, with its "more-rounded unrounded" vowels which,
    from a quick inspection, likely come from problems in parsing the diacritic
    for "more-rounded" as one for "rounded", see the case of
    Bulgarian
    in their website) and it is something that should be investigated (for example,
    should we just take this as aliases for protrusion and compression, or
    endo- and exolabial?).
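
Checking for redundant features, as mentioned in the item on tautological information above, could be sketched roughly as follows. This is an illustration under assumptions, not pyclts code: the rule table, the feature names, and the `redundant_features` helper are all hypothetical.

```python
# Hypothetical rules: a (feature, value) pair on the left already implies
# the (feature, value) pair on the right, which is therefore redundant.
IMPLIED = {
    ("type", "vowel"): ("syllabicity", "syllabic"),
    ("manner", "nasal"): ("nasalization", "nasalized"),
}


def redundant_features(sound: dict) -> list:
    """Return the (feature, value) pairs in `sound` that are already implied."""
    found = []
    for (feat, val), (r_feat, r_val) in IMPLIED.items():
        if sound.get(feat) == val and sound.get(r_feat) == r_val:
            found.append((r_feat, r_val))
    return found
```

A real check would need the full set of implications agreed upon by the maintainers, which is why this is more than a quick change.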

Many things from the issue need to be discussed, perhaps individually;
among those:

  • Tone contours starting with zero (such as ⁰²)
  • Palatalized vowels as aliases for diphthongs (such as ɛʲ, and also long
    such as oːʷ)
  • Aspirated vowels (such as ɔʰ)

As I said, most if not all of the other issues are related to individual
transcription systems/data; I've added to CLTS/BIPA those that I found important
and necessary, but the remaining ones should probably be kept in their
specific contexts.

@codecov-io

codecov-io commented Nov 27, 2018

Codecov Report

Merging #123 into master will not change coverage.
The diff coverage is 100%.


@@           Coverage Diff           @@
##           master     #123   +/-   ##
=======================================
  Coverage   99.56%   99.56%           
=======================================
  Files           8        8           
  Lines         696      696           
=======================================
  Hits          693      693           
  Misses          3        3
Impacted Files Coverage Δ
src/pyclts/models.py 100% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update acb4323...68f0b4a. Read the comment docs.

@tresoldi
Contributor Author

I've run the local commands for release and the checks, but please note that most of the steps for releasing (such as bumping version number, preparing for PyPI etc.) are not in place yet.

I can take care of that once the changes are approved.

@LinguList
Collaborator

While, as @cormacanderson, I am very in favor of adding qp and db digraphs
for labio-dental affricates, there is no rush in doing so and it is
probably a good idea to only include them in a second release (Cormac is
also waiting for an answer from Anne-Maria). They are not
formally IPA, but my opinion is that they would fit very well in BIPA considering
that [i] there is no independent glyph for those sounds, [ii] the graphical
solution is very good, and [iii] the symbols have been in use for quite some
time

I am against that, for reasons of the parsing procedure: we can handle them only if they are a single consonant; if they qualify as a cluster, we can't.

I'm happy with this PR, but also note that there is no way to have diphthongs aliased, as a diphthong is again a derived sound, so there is no way to define it. Here, you need to go to the data in sources/ and manually override, and for the future, this is the preferred way, also for @cormacanderson to propose corrections that are beyond problems of our bipa-parser.

@LinguList LinguList merged commit 64cd806 into cldf-clts:master Nov 27, 2018
@tresoldi
Contributor Author

Sorry, my comment was not clear: the ȹ and ȸ digraphs are used for labiodental plosives, which would make it easier to annotate labiodental affricates with stuff like ȹs.

As for the diphthongs, one more thing to discuss.

@cormacanderson
Collaborator

Thanks very much for this @tresoldi. I have a few comments.

  1. I don't see the logic of roundedness vs rounding. As far as I can see this is simple duplication. The relevant diacritics are defined twice here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv for vowels, i.e. line 83 and line 86; line 84 and 85. We have looked at these and they aren't different unicode points, just simple duplicates. There is no reason for this, but it's just a simple oversight, unless I am misunderstanding something. I suggest we merge with a single feature "roundedness", with two possible values, "more-rounded" and "less-rounded".
  2. I'm inclined to consider the use of ɗ for dental not alveolar as the problem of the transcription dataset, and that BIPA should reflect the correct usage.
  3. We should add http://graphemica.com/%C9%9D U+025D, i.e. ɝ
  4. For the next release, I will do another pass through, but would note here already that the treatment of diphthongs, including ej, eʲ, etc. and triphthongs should be prioritised (possibly also consonants such as tswʰ?). There's also more that might need to be done with clicks.

@LinguList
Collaborator

  1. I agree, be careful though, as rounding is a vowel main feature (or is it roundedness?), so if we drop one, this should be the one that is NOT a main feature of vowels (check vowels.tsv to confirm).
  2. no opinion here, as long as the bipa form is less complicated.
  3. if so, we need to add the composed version (E + retroflexation). Or is there a difference to the thingy with schwa?
  4. triphthongs won't be accepted, as they don't make sense and can be decomposed into segments most of the time; if you want to add them, we need to add a new class of sounds (probably less problematic, but not that trivial). Diphthongs like ej etc. should be handled in the transcription data by providing valid counterparts in bipa (ei or e + i with the little thing under it indicating glide).

@tresoldi
Contributor Author

tresoldi commented Nov 28, 2018

  1. I agree that roundness would better be a continuum, and in any case we should not have, as we do now, stuff like "more-rounded unrounded blablabla vowel". These are found especially in the Eurasian transcription data, and are due to the more-rounded value being a different feature which is appended to the base vowel. The problem here is that people use this "more-rounded" diacritic with unrounded vowels, when they actually just mean "rounded". The solution would not be quick enough to fix this by yesterday's noon, as it would break some datasets...

  2. I would in theory, but the official IPA chart has it in the middle of the coronals, just under the click which, as far as I know, everyone considers alveolar and not dental. If we want the BIPA to be a superset of IPA, we are kinda stretching here...

  3. It is already there:

In [3]: bipa['rhotacized unrounded open-mid central vowel'].s                                                   
Out[3]: 'ɜ˞'

As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

  4. I would be very, very timidly in favor of adding triphthongs as a class in order to have people adopting CLTS, even if I agree with Mattis that it does not make much sense from our point of view. The lingpy pipeline, for example, would surely need to split them up, just like complex consonantal clusters. I didn't want to touch this as it needs more discussion; @cormacanderson , could you provide some examples where in your opinion it makes much more sense to treat a subsequence this way?
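
The pre-composed versus composed issue from point 3 can be sketched as a normalization step. This is a minimal illustration, not the CLTS normalization code: the `NORMALIZE` table and the `normalize` helper are hypothetical, and it assumes that the `'ɜ˞'` shown above is U+025C followed by U+02DE. An explicit table is used on the assumption that standard Unicode normalization does not relate the two forms.

```python
# Hypothetical normalization table: pre-composed grapheme -> composed sequence.
# U+025D (ɝ) is mapped to U+025C (ɜ) followed by U+02DE (˞, rhotic hook).
NORMALIZE = {
    "\u025d": "\u025c\u02de",
}


def normalize(text: str) -> str:
    """Replace each pre-composed grapheme with its composed counterpart."""
    return "".join(NORMALIZE.get(char, char) for char in text)
```

After such a step, a lookup of the pre-composed grapheme would reach the same catalog entry as the composed one.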

@LinguList
Collaborator

As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

Yes, this is an example for normalization.

In general, we need to always be aware of where to fix problems. We have the following:

  1. in the original transcriptiondata (preferred way, as most errors now are there, see folder sources where you can easily fix by putting a correct bipa-value in the left-most cell)
  2. in the normalization, or by manipulating bipa's consonants, vowels, diacritics, etc.
  3. in the deep code of clts

Keeping in mind where a problem needs to be fixed will help in the future, we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, neither handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.
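
An explicit list of accepted consonant combinations could take a shape like the following. This is only a sketch of the idea: the `ACCEPTED_CLUSTERS` set, the manner labels, and the `is_accepted_cluster` helper are hypothetical, and the two patterns are just the examples named above.

```python
# Hypothetical whitelist of accepted two-sound cluster patterns,
# expressed as (manner of first sound, manner of second sound).
ACCEPTED_CLUSTERS = {
    ("nasal", "stop"),
    ("stop", "nasal"),
}


def is_accepted_cluster(first_manner: str, second_manner: str) -> bool:
    """Accept a cluster only if its manner pattern is explicitly listed."""
    return (first_manner, second_manner) in ACCEPTED_CLUSTERS
```

Anything not on the list is rejected, so clusters would no longer be produced for arbitrary pairs of consonants.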

@LinguList
Collaborator

IMPORTANT: less rounded and more rounded as "roundedness" should be given the preference, "rounding" is a main feature of a sound, and per definitionem, they can't be modified via diacritic, unless you add FULL CHARACTERS WITH DIACRITICS in vowels.tsv! This is a no-discussion, there is a workaround, and since the character is duplicated, roundedness should be deleted. These lines are anyway ignored in the code by now.
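
Duplicated diacritics of this kind could be found with a small check over the TSV file. The sketch below is an assumption-heavy illustration: the column names `GRAPHEME` and `FEATURE` and the `duplicated_graphemes` helper are hypothetical and may not match the actual layout of diacritics.tsv.

```python
import csv
import io
from collections import defaultdict


def duplicated_graphemes(tsv_text: str) -> dict:
    """Map each grapheme defined under more than one feature to those features."""
    features = defaultdict(set)
    # Column names are assumed for illustration only.
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        features[row["GRAPHEME"]].add(row["FEATURE"])
    return {g: sorted(f) for g, f in features.items() if len(f) > 1}
```

Running such a check in the test suite would flag a diacritic that is listed under both "rounding" and "roundedness" before it reaches a release.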

@tresoldi
Contributor Author

Keeping in mind where a problem needs to be fixed will help in the future, we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, neither handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.

I fully agree here. Triphthongs should probably only have two patterns: with a trailing schwa or with a central vocoid between approximants. For consonant clusters, I would only really accept sibilants+plosives+liquids, but I trust Cormac might convince me here. 😉

No matter what, the priority should however be adding more normalizations and other transcription systems. Now that I know the code in more detail it should not take me too long to do a PR with my unified feature system, which would be my priority in terms of CLTS innovations.
