Final fixes #123

tresoldi · 2018-11-27T10:43:34Z

This PR relates to most of the stuff discussed in #121

In detail:

I have added the missing linguolabial as a value for the feature place of consonants, as well as the most important sounds to the catalog.
There are some issues for discussion here, as per IPA all linguolabial
consonants need a diacritic (i.e., there is no linguolabial consonant with
its own, diacritic-less representation), which in turn makes things a bit
complex when setting an alias. As such, no stuff like U032B (combining arches) was
implemented: the only diacritic for linguolabial place of articulation is
the standard IPA U033C (the seagull).
I have merged the "centralization", "retraction", and "advancement" features
into a single feature "relative_articulation" (possibly not the best name),
as we were allowing for things like "advanced retracted centralized open front
vowel". This is now fixed.
The above feature of "relative_articulation" can now be applied to consonants (so we support things like t̟ and ŋ˗).
Consonants can now have "mid-long" as "duration", so we allow for things
like mˑ.
Laminal fricatives are there (such as s̻ and z̻).
Grapheme ł (Polish letter) is now normalized to ɬ (voiceless alveolar lateral fricative).
I've added ᶑ (voiced retroflex implosive), b̪ (voiced labio-dental stop),
and p̪ (voiceless labio-dental stop) to BIPA.
I've renamed feature value "labialized-velar" to "labio-velar" and
"labialized-palatal" to "labio-palatal" in CLTS; all the transcription data
and systems were updated (using sed from the command line, I've checked many
times and don't think there are any false positives or negatives in the
replacements)
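To illustrate why merging the three features helps, here is a minimal sketch that models a sound as a mapping with one slot per feature. The names `RELATIVE_ARTICULATION` and `set_relative_articulation` are hypothetical and are not pyclts code; the only point is that a single "relative_articulation" slot holds at most one value, so a combination like "advanced retracted centralized" can no longer be expressed.

```python
# Hypothetical model, not the pyclts implementation: one slot per feature.
RELATIVE_ARTICULATION = ("advanced", "retracted", "centralized")


def set_relative_articulation(sound, value):
    """Return a copy of `sound` carrying exactly one relative_articulation value."""
    if value not in RELATIVE_ARTICULATION:
        raise ValueError(f"unknown relative_articulation value: {value!r}")
    updated = dict(sound)
    updated["relative_articulation"] = value  # overwrites any earlier value
    return updated


vowel = {"height": "open", "centrality": "front"}
vowel = set_relative_articulation(vowel, "advanced")
vowel = set_relative_articulation(vowel, "retracted")
print(vowel["relative_articulation"])  # retracted
```

With three separate features, both calls would have left their mark on the sound; with one feature, the second value simply replaces the first.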

Things I didn't implement from the issue:

The dot for syllabicity would lead to confusion and it has not been standard
IPA for quite some time; its implementation, if really necessary, should be
part of specific TranscriptionData and TranscriptionSystems, not CLTS/BIPA.
I didn't add ◌͇ (U0347) as a diacritic for alveolar, as it is not IPA (once
more, it should be included in ad-hoc transcription data, not in BIPA), and diacritics for place of articulation belong in the catalog of sounds rather than in that of diacritics (as place of articulation is one of the essential features).
I didn't add uvularization of vowels (cases such as ʌʶ); while I believe
there is room for them, I agree with @cormacanderson that it is questionable
as a general feature, and we should discuss this in more detail; one potential
source can be found here,
but many more references turn up in the most basic Google search (mostly in descriptions of dialects;
I couldn't find a clear-cut case where it is phonemic).
I didn't change ɗ from dental to alveolar (even though alveolar as a default
place of articulation, requiring a diacritic for the dental, makes sense to me);
we need to discuss this in more detail, especially considering that the
transcription data we are linking to seem to use it as dental.
I didn't add roundness to approximants, as this would involve adding the
feature to all consonants; while I don't oppose this from an articulatory point
of view, it should be further discussed (if we go for the feature only for
approximants, things are more complicated, as we'd need to either set
approximants as a different sound type from consonants or to change the code
in order to implement the limitation).
I made no changes to the position of the voiceless diacritic (above or below
the glyph), as it is currently not possible to have a one-fits-all solution;
any apparently simple change can result in unintended consequences (the easiest
solution is probably to just default to one position and manually list the
alternatives in the sound catalog, but once more this is something to be
discussed and agreed upon).
While, like @cormacanderson, I am very much in favor of adding qp and db digraphs
for labio-dental affricates, there is no rush in doing so and it is
probably a good idea to only include them in a second release (Cormac is
also waiting for an answer from Anne-Maria). They are not
formally IPA, but my opinion is that they would fit very well in BIPA considering
that [i] there is no independent glyph for those sounds, [ii] the graphical
solution is very good, and [iii] the symbols have been in use for quite some
time.
I didn't add ı (dotless i U0131) as an alias of ɯ U026F: this is really
a matter of Turkish orthography and not phonological transcription, and if
necessary should be part of a Turkish orthographic profile.
I didn't touch triphthongs and all other complex clusters, as CLTS only
supports two-sound clusters by design. I can understand objections to that,
but this should really be first discussed with @LinguList .
Some redundant/tautological information (such as syllabic vowels and
nasalized nasal stops) is part of the design of CLTS; this can be changed
by checking for redundant features, but it is not something that can be
implemented with five minutes of coding. In any case, @LinguList should be
part of this discussion.
I didn't add ḱ and ɡ́ as aliases of palatal stops, as this should be
part of transcription data / orthographic profiles dealing with PIE, not
BIPA.
I tend to agree with @cormacanderson that rounding and roundedness should
not be different features, which would mean adding a continuum of
unrounded, less-rounded, rounded, and more-rounded values. However,
I didn't change that as it would break some datasets such as Eurasian (which
might be problematic anyway, with its "more-rounded unrounded" vowels which,
from a quick inspection, likely come from problems in parsing the diacritic
for "more-rounded" as one for "rounded", see the case of
Bulgarian
on their website) and it is something that should be investigated (for example,
should we just take these as aliases for protrusion and compression, or
endo- and exolabial?).
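The check for redundant features mentioned in the list above (syllabic vowels, nasalized nasal stops) could look roughly like the following. This is only a sketch: the rule table `REDUNDANT`, the dictionary keys used as feature names, and the function `redundant_features` are assumptions for illustration and do not mirror the actual CLTS data model.

```python
# Hypothetical sketch: flag feature values that are already implied by the base sound.
# Each rule: (condition on the base sound, feature, value that is tautological).
REDUNDANT = [
    ({"type": "vowel"}, "syllabicity", "syllabic"),
    ({"type": "consonant", "manner": "nasal"}, "nasalization", "nasalized"),
]


def redundant_features(sound):
    """Return the (feature, value) pairs of `sound` that add no information."""
    found = []
    for condition, feature, value in REDUNDANT:
        matches = all(sound.get(key) == val for key, val in condition.items())
        if matches and sound.get(feature) == value:
            found.append((feature, value))
    return found


nasalized_m = {"type": "consonant", "manner": "nasal", "nasalization": "nasalized"}
nasalized_a = {"type": "vowel", "nasalization": "nasalized"}
print(redundant_features(nasalized_m))  # [('nasalization', 'nasalized')]
print(redundant_features(nasalized_a))  # []
```

Whether such a check should reject the sound, warn, or silently drop the feature is exactly the design question that still needs discussion.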

Many items from the issue need to be discussed, perhaps individually;
among those:

Tone contours starting with zero (such as ⁰²)
Palatalized vowels as aliases for diphthongs (such as ɛʲ, and also long
such as oːʷ)
Aspirated vowels (such as ɔʰ)

As I said, most if not all of the other issues are related to individual
transcription systems/data; I've added to CLTS/BIPA those that I found important
and necessary, but the remaining ones should probably be kept in their
specific contexts.

codecov-io · 2018-11-27T10:46:50Z

Codecov Report

Merging #123 into master will not change coverage.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master     #123   +/-   ##
=======================================
  Coverage   99.56%   99.56%           
=======================================
  Files           8        8           
  Lines         696      696           
=======================================
  Hits          693      693           
  Misses          3        3

Impacted Files	Coverage Δ
src/pyclts/models.py	`100% <100%> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update acb4323...68f0b4a.

tresoldi · 2018-11-27T10:52:39Z

I've run the local commands for release and the checks, but please note that most of the steps for releasing (such as bumping version number, preparing for PyPI etc.) are not in place yet.

I can take care of that once the changes are approved.

LinguList · 2018-11-27T10:54:59Z

> While, as @cormacanderson, I am very in favor of adding qp and db digraphs
> for labio-dental affricates, there is no rush in doing so and it is
> probably a good idea to only include them in a second release (Cormac is
> also waiting for an answer from Anne-Maria). They are not
> formally IPA, but my opinion is that they would fit very well in BIPA considering
> that [i] there is no independent glyph for those sounds, [ii] the graphical
> solution is very good, and [iii] the symbols have been in use for quite some
> time

I am against that, for reasons of the parsing procedure: we can only handle them if they count as a single consonant, not if they qualify as a cluster.

I'm happy with this PR, but also note that there is no way to have diphthongs aliased, as a diphthong is again a derived sound, so there is no way to define it. Here, you need to go to the data in sources/ and manually override; for the future, this is the preferred way, also for @cormacanderson to propose corrections that go beyond problems of our bipa-parser.

tresoldi · 2018-11-27T12:02:08Z

Sorry, my comment was not clear: the ȹ and ȸ digraphs are used for labiodental plosives, which would make it easier to annotate labiodental affricates with stuff like ȹs.

As for the diphthongs, that is one more thing to discuss.

cormacanderson · 2018-11-27T19:49:11Z

Thanks very much for this @tresoldi. I have a few comments.

1. I don't see the logic of roundedness vs rounding. As far as I can see this is simple duplication. The relevant diacritics are defined twice here https://github.com/cldf/clts/blob/master/src/pyclts/transcriptionsystems/bipa/diacritics.tsv for vowels, i.e. line 83 and line 86; line 84 and 85. We have looked at these and they aren't different unicode points, just simple duplicates. There is no reason for this, but it's just a simple oversight, unless I am misunderstanding something. I suggest we merge with a single feature "roundedness", with two possible values, "more-rounded" and "less-rounded".
2. I'm inclined to consider the use of ɗ for dental not alveolar as the problem of the transcription dataset, and that BIPA should reflect the correct usage.
3. We should add http://graphemica.com/%C9%9D U+025D, i.e. ɝ
4. For the next release, I will do another pass through, but would note here already that the treatment of diphthongs, including ej, eʲ, etc. and triphthongs should be prioritised (possibly also consonants such as tswʰ?). There's also more that might need to be done with clicks.

LinguList · 2018-11-27T20:57:59Z

1. I agree, but be careful: rounding is a main feature of vowels (or is it roundedness?), so if we drop one, it should be the one that is NOT a main feature of vowels (check vowels.tsv to confirm).
2. No opinion here, as long as the bipa form is less complicated.
3. If so, we need to add the composed version (E + retroflexion). Or is there a difference to the thingy with schwa?
4. Triphthongs won't be accepted, as they don't make sense and can be decomposed into segments most of the time; if you want to add them, we need to add a new class of sounds (probably less problematic, but not that trivial). Diphthongs like ej etc. should be handled in the transcription data by providing valid counterparts in bipa (ei or e + i with the little thing under it indicating glide).

tresoldi · 2018-11-28T07:07:19Z

I agree that roundness would better be a continuum, and in any case we should not have, as we do now, stuff like "more-rounded unrounded blablabla vowel". These are found especially in the Eurasian transcription data, and are due to the more-rounded value being a different feature which is appended to the base vowel. The problem here is that people use this "more-rounded" diacritic with unrounded vowels, when they actually just mean "rounded". The solution would not be quick enough to fix this by yesterday's noon, as it would break some datasets...
I would in theory, but the official IPA chart has it in the middle of the coronals, just under the click which, as far as I know, everyone considers alveolar and not dental. If we want the BIPA to be a superset of IPA, we are kinda stretching here...
It is already there:

In [3]: bipa['rhotacized unrounded open-mid central vowel'].s                                                   
Out[3]: 'ɜ˞'

As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

I would be very, very timidly in favor of adding triphthongs as a class in order to have people adopting CLTS, even if I agree with Mattis that it does not make much sense from our point of view. The lingpy pipeline, for example, would surely need to split them up, just like complex consonantal clusters. I didn't want to touch this as it needs more discussion; @cormacanderson, could you provide some examples where in your opinion it makes much more sense to treat a subsequence this way?

LinguList · 2018-11-28T07:20:43Z

> As usual this is a Unicode problem of pre-composed vs. composed. It should be part of the normalization, but we'd better wait for the next version (this was my fault, I looked at the list of problems and parsed the grapheme in my mind, it didn't occur to me that it could be a pre-composed one...)

Yes, this is an example for normalization.
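A minimal sketch of what such a normalization step could look like for the ɝ case, using only the standard library. As far as I can tell, U+025D has no canonical decomposition, so Unicode NFD alone leaves it untouched and an explicit mapping is needed. The table `PRECOMPOSED` and the function `normalize` are hypothetical names, not the pyclts normalization code.

```python
import unicodedata

# Hypothetical normalization table: pre-composed grapheme -> decomposed BIPA form.
# Only the single pair discussed in this thread is listed.
PRECOMPOSED = {
    "\u025d": "\u025c\u02de",  # ɝ -> ɜ + ˞ (modifier letter rhotic hook)
}


def normalize(text):
    """Apply NFD, then replace pre-composed graphemes listed in PRECOMPOSED."""
    text = unicodedata.normalize("NFD", text)
    return "".join(PRECOMPOSED.get(char, char) for char in text)


print(normalize("\u025d") == "\u025c\u02de")  # True
```

After this step, the pre-composed grapheme resolves to the same string that the name lookup shown earlier returns.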

In general, we need to always be aware of where to fix problems. We have the following:

in the original transcriptiondata (preferred way, as most errors now are there, see folder sources where you can easily fix by putting a correct bipa-value in the left-most cell)
in the normalization, or by manipulating bipa's consonants, vowels, diacritics, etc.
in the deep code of clts

Keeping in mind where a problem needs to be fixed will help in the future; we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, nor for handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.
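An explicit list of accepted combinations could be as simple as a set of manner pairs that the cluster parser consults. The sketch below is an assumption about how that might be organized: `ACCEPTED_CLUSTERS`, `is_accepted_cluster`, and the dictionary layout of the sounds are all hypothetical and not part of the clts code.

```python
# Hypothetical whitelist of accepted two-consonant clusters, keyed by manner.
ACCEPTED_CLUSTERS = {
    ("nasal", "stop"),
    ("stop", "nasal"),
}


def is_accepted_cluster(first, second):
    """Check a pair of consonants (dicts with a 'manner' key) against the whitelist."""
    return (first["manner"], second["manner"]) in ACCEPTED_CLUSTERS


m = {"grapheme": "m", "manner": "nasal"}
b = {"grapheme": "b", "manner": "stop"}
s = {"grapheme": "s", "manner": "fricative"}

print(is_accepted_cluster(m, b))  # True
print(is_accepted_cluster(s, b))  # False
```

Anything not on the list would then be rejected instead of being generated in an ad-hoc way.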

LinguList · 2018-11-28T07:24:37Z

IMPORTANT: less rounded and more rounded as "roundedness" should be given the preference, "rounding" is a main feature of a sound, and per definitionem, they can't be modified via diacritic, unless you add FULL CHARACTERS WITH DIACRITICS in vowels.tsv! This is a no-discussion, there is a workaround, and since the character is duplicated, roundedness should be deleted. These lines are anyway ignored in the code by now.
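For spotting this kind of duplication mechanically, a small script can group the diacritics by grapheme and report every grapheme listed under more than one feature. The sketch below runs on made-up rows with a made-up column layout (`GRAPHEME`, `FEATURE`, `VALUE`); the real diacritics.tsv may be organized differently, and `duplicated_graphemes` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in a made-up column layout; the real diacritics.tsv may differ.
SAMPLE_TSV = (
    "GRAPHEME\tFEATURE\tVALUE\n"
    "\u0339\trounding\tmore-rounded\n"
    "\u031c\trounding\tless-rounded\n"
    "\u0339\troundedness\tmore-rounded\n"
    "\u031c\troundedness\tless-rounded\n"
    "\u0303\tnasalization\tnasalized\n"
)


def duplicated_graphemes(tsv_text):
    """Map each grapheme listed under more than one feature to those features."""
    features = defaultdict(set)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        features[row["GRAPHEME"]].add(row["FEATURE"])
    return {g: sorted(f) for g, f in features.items() if len(f) > 1}


for grapheme, names in duplicated_graphemes(SAMPLE_TSV).items():
    print(f"U+{ord(grapheme):04X}: {', '.join(names)}")
```

On the sample rows, the two half-ring diacritics are reported under both features, while the nasalization diacritic is not reported.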

tresoldi · 2018-11-28T07:28:04Z

> Keeping in mind where a problem needs to be fixed will help in the future; we'll have to adjust our labels accordingly. I won't move a finger for solving problems like "ej" from the code, nor for handling triphthongs, but if any of you wants to adjust the code accordingly here, feel free to do so. I think, however, it's more important to make an explicit list of accepted consonant combinations for clusters (listing all nasal+stop, stop+nasal, etc. whatever you want), as those are currently produced in an erratic fashion.

I fully agree here. Triphthongs should probably only have two patterns: with a trailing schwa or with a central vocoid between approximants. For consonant clusters, I would only really accept sibilants+plosives+liquids, but I trust Cormac might convince me here. 😉

No matter what, the priority should however be adding more normalizations and other transcription systems. Now that I know the code in more detail it should not take me too long to do a PR with my unified feature system, which would be my priority in terms of CLTS innovations.

Tiago Tresoldi added 14 commits November 26, 2018 17:07

Adding mid-long for consonants.

9c2891e

Removing ultra-long and ultra-short new features for consonants (reserving for vowels).

7406f7c

Added raising features for consonants.

082b2ac

Added the voiced retroflex implosive grapheme

2e91299

Fixed problem with Polish ɬ

bfe3626

Fixing missing changes for mid-long consonants

f82a648

Fixing missing changes for raising consonants

8a98978

Unifying centralization, retraction, and advancement in a single vowel feature

b35aaa3

Updating app_data and dumps

4367bc3

Adding relative articulation to all consonants

cf2eb36

Added labio-dental stops to BIPA

24f2242

Renaming labialized-velar and labialized-palatal feature values

6d57f7d

Adding linguolabial consonants

cd10daf

Merge remote-tracking branch 'upstream/master'

f29dab2

tresoldi requested review from xrotwang, LinguList and cormacanderson November 27, 2018 10:43

Preparing for release

68f0b4a

LinguList merged commit 64cd806 into cldf-clts:master Nov 27, 2018

tresoldi mentioned this pull request Nov 27, 2018

Report of Sound Comparisons CLTS errors #120

Closed

2 tasks

LinguList mentioned this pull request Nov 28, 2018

Before final release: delete roundedness from diacritics #124

Closed

tresoldi mentioned this pull request Nov 28, 2018

Minor changes for "official" release #125

Merged

tresoldi mentioned this pull request Nov 12, 2020

ȸ and ȹ ligatures cldf-clts/clts#55

Closed
