German transliteration issues #64

trenslow · 2021-01-19T09:24:12Z

Hello,

I came across what I believe to be a bug in German transliteration of the grapheme 's'. This occurs when using the 'deu-Latn' and the 'deu-Latn-nar' dictionaries. Take for example the word 'sehr':

In [14]: epi1.transliterate('sehr')
Out[14]: 't͡seːə'
In [16]: epi3.transliterate('sehr')
Out[16]: 't͡seːɐ'

Here epi1 was initialized with the 'deu-Latn' dictionary and epi3 with the 'deu-Latn-nar' dictionary.

In both cases I would expect the 's' in 'sehr' to be transliterated with [z]. I know that [s] is also possible in this case when dealing with southern German dialects, and I see this transliteration when using the 'deu-Latn-np' dictionary. However, after consulting all my sources, I don't see a case where this can be transliterated as [t͡s].

Another example would be the word 'Stock':

In [20]: epi1.transliterate('Stock')
Out[20]: 'stok'
In [21]: epi3.transliterate('Stock')
Out[21]: 'stok'

In the case of the 'deu-Latn' example, I can understand why this may be transliterated as [s], but at least with the narrow transliteration I would expect [ʃ]. As far as I know, [s] only occurs in this environment in northern German dialects.

Would you mind investigating this with me? So far I've looked at tens of examples (I'm transliterating a large corpus), and it seems to happen across the board, with no exceptions. I also made sure that I pip installed the latest version of Epitran.

dmort27 · 2021-01-19T12:52:42Z

These clearly are bugs—bugs which should have been caught by my tests. Let me look into this.

dmort27 · 2021-01-19T13:09:13Z

These are both due, to some degree, to the same problem (a rule introduced by a PR I should have vetted more carefully). I think I have it fixed, but I have to do more testing.

dmort27 · 2021-01-19T14:28:47Z

I have uploaded a new version of Epitran to PyPI. I have fixed the bugs you mentioned. However, I believe that there are other bugs in the German modules (dealing, for example, with vowel length). If you are willing to check this out, I will try to fix them.

trenslow · 2021-01-19T16:59:01Z

I just checked through a decent amount of examples and it looks like the /s/ is fixed for the environments I mentioned above.

I went through some examples of vowel length. It's OK when there's an /h/ in the orthography, but there is a bug when it comes to the letter ß. Here are a couple of examples (epi2 was instantiated with 'deu-Latn-nar'):

In [21]: epi2.transliterate('Busse')
Out[21]: 'busə'
In [22]: epi2.transliterate('Buße')
Out[22]: 'busə'

In [13]: epi2.transliterate('Massen')
Out[13]: 'masən'
In [14]: epi2.transliterate('Maßen')
Out[14]: 'masən'

Here in both pairs, the second should have a long vowel.

Here are a couple of examples in the other direction:

In [25]: epi2.transliterate('so')
Out[25]: 'zoː'
In [26]: epi2.transliterate('also')
Out[26]: 'alsoː'
In [30]: epi2.transliterate('nanu')
Out[30]: 'naːnuː'

In the first two, I'd expect the final vowel to be short and in the last one I would expect the first vowel to be short. I'd also expect the /s/ to be transcribed as [z] in the second example.

Thanks for all your help so far. Let me know if you need some more examples and I can go digging!

trenslow · 2021-01-20T10:10:49Z

A couple more examples that might be helpful:

In [33]: epi2.transliterate('kreativ')
Out[33]: 'kʁeaːtif'
In [34]: epi2.transliterate('sozial')
Out[34]: 'zoːt͡sial'
In [37]: epi2.transliterate('Lokomotive')
Out[37]: 'loːkomoːtifə'

In these examples, the last vowel is the one that should be long, and the others short.

In [35]: epi2.transliterate('platzen')
Out[35]: 'plaːt͡sən'
In [36]: epi2.transliterate('knöpfen')
Out[36]: 'knøːp͡fən'
In [38]: epi2.transliterate('Knochen')
Out[38]: 'knoːxən'
In [40]: epi2.transliterate('stricken')
Out[40]: 'ʃtʁiːkən'

In these examples, all the initial vowels should be short, as they are followed by a consonant cluster. This follows the same rule that vowels before double written consonants are always short, but German doesn't double /z/, /k/, /ch/, /pf/ in orthography, instead opting for /tz/, /ck/, /ch/ and /pf/.
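
The spelling cue described above can be sketched roughly as follows. This is only an illustration of the heuristic, not how Epitran's rule files work: the function name `first_vowel_is_short` and the cluster list are assumptions made for this example, and 'ch' in particular is not a reliable cue on its own (compare 'Buch', which has a long vowel).

```python
import re

# Assumption for illustration: spellings that German uses in place of a
# doubled consonant letter, as listed in the comment above.
SHORTENING_CLUSTERS = ("tz", "ck", "ch", "pf")
VOWELS = "aeiouäöü"


def first_vowel_is_short(word: str) -> bool:
    """Guess from spelling alone whether the first vowel of `word` is short."""
    w = word.lower()
    match = re.search(f"[{VOWELS}]", w)
    if match is None:
        return False
    rest = w[match.end():]
    # A doubled consonant letter right after the vowel, e.g. 'ss' in 'Busse'.
    if len(rest) >= 2 and rest[0] == rest[1] and rest[0] not in VOWELS:
        return True
    # One of the cluster spellings right after the vowel, e.g. 'tz' in 'platzen'.
    return rest.startswith(SHORTENING_CLUSTERS)
```

Under these assumptions the function returns True for 'platzen', 'knöpfen', 'Knochen', 'stricken', 'Busse' and 'Massen', and False for 'Buße' and 'Maßen'. It says nothing about stress or about vowels later in the word.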

dmort27 · 2021-01-20T13:46:05Z

Thank you. This is helpful. Is vowel length in German better stated as lengthening or shortening?

dmort27 · 2021-01-21T14:49:31Z

The most helpful way of sharing this information would be in terms of tests: <input, correct_output> pairs. From what I gather, the right pairs for the examples you provided would be as follows:

Busse → busə
Buße → buːsə
Massen → masən
Maßen → maːsən
so → zo # I'm confused about so and also; the vowel is in an open syllable, so shouldn't it be long?
also → alzo
nanu → nanuː
kreativ → kʁeatiːf # These, too, are a little confusing. Can you give me a rule?
sozial → zot͡siaːl
Lokomotive → lokomotiːfə
platzen → plat͡sən # The following all make sense to me
knöpfen → knøp͡fən
Knochen → knoxən
stricken → ʃtʁikən

Any more examples you could provide would be good, as well as rules that describe why vowels are long or short in a particular context. Thanks!
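
A minimal sketch of how such <input, correct_output> pairs could be checked automatically. The helper name `find_mismatches` and the `EXPECTED` list are assumptions for illustration, not part of Epitran's test suite. Only the pairs that were not in question above are included, and the comparison is done after Unicode NFC normalization so that differently encoded but equivalent strings are not reported as failures.

```python
import unicodedata
from typing import Callable, Iterable, List, Tuple

# Pairs taken from the list above (the ß/ss and consonant-cluster cases only).
EXPECTED: List[Tuple[str, str]] = [
    ("Busse", "busə"),
    ("Buße", "buːsə"),
    ("Massen", "masən"),
    ("Maßen", "maːsən"),
    ("platzen", "plat͡sən"),
    ("knöpfen", "knøp͡fən"),
    ("Knochen", "knoxən"),
    ("stricken", "ʃtʁikən"),
]


def find_mismatches(
    transliterate: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every pair that does not match."""
    failures = []
    for word, expected in pairs:
        actual = transliterate(word)
        # Compare NFC-normalized forms; report the original strings.
        if unicodedata.normalize("NFC", actual) != unicodedata.normalize("NFC", expected):
            failures.append((word, expected, actual))
    return failures
```

With Epitran installed, passing `epitran.Epitran('deu-Latn-nar').transliterate` as the first argument would list the words that are still transcribed differently from the expected forms.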

trenslow · 2021-01-22T09:55:51Z

> Thank you. This is helpful. Is vowel length in German better stated as lengthening or shortening?

In the literature they talk more often about a tense/lax distinction, which is conflated with the long/short distinction, as short, tense vowels occur so infrequently. You'd then have tense(long) vowels as the default, with a 'laxing'(shortening) process triggered by the different orthographic contexts.

> so → zo # I'm confused about so and also; the vowel is in an open syllable, so shouldn't it be long?

You're right that the vowel in so should be long. My mistake there. But the /o/ should definitely be short in also. This makes the rule generalization a little harder. The more I think about it, the more exceptions to the rules there seem to be. My gut feeling tells me that since the stress falls on the /a/, the /o/ is not 'allowed' to be long.

This same rule could then apply to kreativ, sozial and Lokomotive, since the stress falls on the last syllable. This seems to align with what this document is saying.

Now that I think about it, a lot of exceptions to the rules could be explained by the frequency of the word's occurrence in daily speech, but I guess that transliteration logic is out of the scope of Epitran.

As I continue with my research and come across more interesting cases, I'll report back asap.

dmort27 · 2021-02-02T14:41:08Z

Sorry to have dropped this. It seems as if the situation in German is not unlike that of English: there are tense and lax vowels, the tense vowels are long and the lax vowels are short, except that the correlation is imperfect in German. Is this correct? I was working with sources that described the German distinction in terms of length rather than vowel quality, but I'd be willing to change how this works in Epitran if you can point me to the literature I should follow.

trenslow · 2021-03-31T11:51:22Z

Hi @dmort27, sorry for the late response. I got caught up with other topics and am only now making my way back to German transliteration.

According to sections 1.3 and 1.4 in the document in the last comment I made, it seems like the tense/lax distinction is often conflated with the long/short distinction, which is different than English because you can have a long, lax vowel and a short, tense vowel (if my memory isn't failing me.). In the document, there can be short, tense vowels, but no long, lax vowels in German (section 1.4). What's interesting to me is that all the examples of short, tense vowels they give are words of foreign origin, but I can't seem to find a rule anywhere saying this is the case across the board.

I also stumbled across an interesting post here which would explain the cases where the short-vowel-before-double-consonant rule doesn't apply. You can see it in the second comment in the link. Even though I'm not sure how he came to the conclusion that the 'd' in Mond is a suffix, the idea that if a syllable's coda is long, then the nucleus is short and vice-versa is a better rule than simply relying on the orthography.
