German transliteration issues #64

Open
trenslow opened this issue Jan 19, 2021 · 10 comments

@trenslow
Contributor

Hello,

I came across what I believe to be a bug in German transliteration of the grapheme 's'. This occurs when using the 'deu-Latn' and the 'deu-Latn-nar' dictionaries. Take for example the word 'sehr':

In [14]: epi1.transliterate('sehr')
Out[14]: 't͡seːə'
In [16]: epi3.transliterate('sehr')
Out[16]: 't͡seːɐ'

Here epi1 was initialized with the 'deu-Latn' dictionary and epi3 with the 'deu-Latn-nar' dictionary.

In both cases I would expect the 's' in 'sehr' to be transliterated with [z]. I know that [s] is also possible in this case when dealing with southern German dialects, and I see this transliteration when using the 'deu-Latn-np' dictionary. However, after consulting all my sources, I don't see a case where this can be transliterated as [t͡s].

Another example would be the word 'Stock':

In [20]: epi1.transliterate('Stock')
Out[20]: 'stok'
In [21]: epi3.transliterate('Stock')
Out[21]: 'stok'

In the case of the 'deu-Latn' example, I can understand why this may be transliterated as [s], but at least with the narrow transliteration I would expect [ʃ]. As far as I know, [s] only occurs in this environment in northern German dialects.

Would you mind investigating this with me? So far I've looked at tens of examples (I'm transliterating a large corpus), and it seems to happen across the board, with no exceptions. I also made sure that I pip installed the latest version of Epitran.

@dmort27
Owner

dmort27 commented Jan 19, 2021

These clearly are bugs—bugs which should have been caught by my tests. Let me look into this.

@dmort27
Owner

dmort27 commented Jan 19, 2021

These are both due, to some degree, to the same problem (a rule introduced by a PR I should have vetted more carefully). I think I have it fixed, but I have to do more testing.

@dmort27
Owner

dmort27 commented Jan 19, 2021

I have uploaded a new version of Epitran to PyPI. I have fixed the bugs you mentioned. However, I believe that there are other bugs in the German modules (dealing, for example, with vowel length). If you are willing to check this out, I will try to fix them.

@trenslow
Contributor Author

I just checked through a decent number of examples and it looks like the /s/ is fixed for the environments I mentioned above.

I went through some examples for vowel length. It's OK when there's an 'h' in the orthography, but there is a bug when it comes to the letter 'ß'. Here are a couple of examples (epi2 was instantiated with 'deu-Latn-nar'):

In [21]: epi2.transliterate('Busse')
Out[21]: 'busə'
In [22]: epi2.transliterate('Buße')
Out[22]: 'busə'

In [13]: epi2.transliterate('Massen')
Out[13]: 'masən'
In [14]: epi2.transliterate('Maßen')
Out[14]: 'masən'

Here in both pairs, the second should have a long vowel.
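
One way to surface collapsed contrasts like these without a gold transcription might be to group spellings by their output and flag any output shared by more than one spelling. The sketch below is hypothetical: `find_collisions` is not part of Epitran, and `stub` only replays the outputs quoted above so that the example runs without the library installed. In practice one would presumably pass the `transliterate` method of an Epitran instance instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def find_collisions(transliterate: Callable[[str], str],
                    words: Iterable[str]) -> Dict[str, List[str]]:
    """Group words by transliteration; keep outputs shared by 2+ spellings."""
    by_output: Dict[str, List[str]] = defaultdict(list)
    for word in words:
        by_output[transliterate(word)].append(word)
    return {out: ws for out, ws in by_output.items() if len(ws) > 1}


# Assumption: this table stands in for epi2.transliterate and simply replays
# the outputs reported in this comment.
REPORTED = {
    "Busse": "busə",
    "Buße": "busə",
    "Massen": "masən",
    "Maßen": "masən",
}


def stub(word: str) -> str:
    return REPORTED[word]


collisions = find_collisions(stub, REPORTED)
for output, spellings in collisions.items():
    print(output, "<-", spellings)
```

Genuine homophones would also show up here, so anything flagged still needs a manual check.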

Here are a couple of examples in the other direction:

In [25]: epi2.transliterate('so')
Out[25]: 'zoː'
In [26]: epi2.transliterate('also')
Out[26]: 'alsoː'
In [30]: epi2.transliterate('nanu')
Out[30]: 'naːnuː'

In the first two, I'd expect the final vowel to be short and in the last one I would expect the first vowel to be short. I'd also expect the /s/ to be transcribed as [z] in the second example.

Thanks for all your help so far. Let me know if you need some more examples and I can go digging!

@trenslow
Contributor Author

A couple more examples that might be helpful:

In [33]: epi2.transliterate('kreativ')
Out[33]: 'kʁeaːtif'
In [34]: epi2.transliterate('sozial')
Out[34]: 'zoːt͡sial'
In [37]: epi2.transliterate('Lokomotive')
Out[37]: 'loːkomoːtifə'

In these examples, the last vowel is the one that should be long, and the others short.

In [35]: epi2.transliterate('platzen')
Out[35]: 'plaːt͡sən'
In [36]: epi2.transliterate('knöpfen')
Out[36]: 'knøːp͡fən'
In [38]: epi2.transliterate('Knochen')
Out[38]: 'knoːxən'
In [40]: epi2.transliterate('stricken')
Out[40]: 'ʃtʁiːkən'

In these examples, all the initial vowels should be short, as they are followed by a consonant cluster. This follows the same rule by which vowels before doubled consonant letters are always short: German orthography doesn't double 'z', 'k', 'ch', or 'pf', instead writing 'tz', 'ck', 'ch', and 'pf'.
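
As a rough illustration of the spelling rule described above, here is a minimal sketch of a predicate that checks whether a vowel letter is followed by a doubled consonant letter or by one of 'tz', 'ck', 'ch', 'pf'. The function name, the vowel set, and the cluster list are assumptions made for this example; this is not how Epitran implements its German rules.

```python
import unicodedata

# Letter sequences that, per the comment above, play the role of a doubled
# consonant in German spelling. Illustrative only, not exhaustive.
CLUSTERS = ("tz", "ck", "ch", "pf")
VOWELS = set("aeiouyäöü")


def short_by_spelling(word: str, i: int) -> bool:
    """Return True if the vowel letter at index i is followed by a doubled
    consonant letter or by one of CLUSTERS."""
    # NFC keeps umlauts as single code points so that indices line up.
    w = unicodedata.normalize("NFC", word).lower()
    if w[i] not in VOWELS:
        raise ValueError(f"{w[i]!r} at index {i} is not a vowel letter")
    rest = w[i + 1:]
    if rest.startswith(CLUSTERS):
        return True
    # Doubled consonant letter, e.g. 'ss' in 'Busse'.
    return len(rest) >= 2 and rest[0] == rest[1] and rest[0] not in VOWELS


for word, idx in [("platzen", 2), ("Busse", 1), ("Buße", 1), ("so", 1)]:
    print(word, short_by_spelling(word, idx))
```

This is only a heuristic: 'ch', for instance, also follows long vowels, as in 'Buch', so a real rule set would need exceptions.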

@dmort27
Owner

dmort27 commented Jan 20, 2021

Thank you. This is helpful. Is vowel length in German better stated as lengthening or shortening?

@dmort27
Owner

dmort27 commented Jan 21, 2021

The most helpful way of sharing this information would be in terms of tests: <input, correct_output> pairs. From what I gather, the right pairs for the examples you provided would be as follows:

Busse → busə
Buße → buːsə
Massen → masən
Maßen → maːsən
so → zo # I'm confused about so and also. The vowel is in an open syllable, so shouldn't it be long?
also → alzo
nanu → nanuː
kreativ → kʁeatiːf # These, too, are a little confusing. Can you give me a rule?
sozial → zot͡siaːl
Lokomotive → lokomotiːfə
platzen → plat͡sən # The following all make sense to me
knöpfen → knøp͡fən
Knochen → knoxən
stricken → ʃtʁikən

Any more examples you could provide would be good, as well as rules that describe why vowels are long or short in a particular context. Thanks!
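
A minimal sketch of how such pairs could be checked automatically is shown below. Everything here is an assumption for illustration: `failing_cases` is a hypothetical helper, not part of Epitran's test suite, and `OBSERVED` replays the 'deu-Latn-nar' outputs reported earlier in this thread so that the example runs without the library. The pairs marked as unclear above are left out.

```python
import unicodedata
from typing import Callable, List, Tuple

# <input, correct_output> pairs copied from the list above.
CASES: List[Tuple[str, str]] = [
    ("Busse", "busə"),
    ("Buße", "buːsə"),
    ("Massen", "masən"),
    ("Maßen", "maːsən"),
    ("platzen", "plat͡sən"),
    ("knöpfen", "knøp͡fən"),
    ("Knochen", "knoxən"),
    ("stricken", "ʃtʁikən"),
]


def failing_cases(transliterate: Callable[[str], str],
                  cases: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every pair that does not match."""
    failures = []
    for word, expected in cases:
        actual = transliterate(word)
        # Compare in NFD so combining marks such as the tie bar match
        # regardless of how either string was typed.
        if unicodedata.normalize("NFD", actual) != unicodedata.normalize("NFD", expected):
            failures.append((word, expected, actual))
    return failures


# Assumption: stand-in for epi2.transliterate, replaying the reported outputs.
OBSERVED = {
    "Busse": "busə", "Buße": "busə", "Massen": "masən", "Maßen": "masən",
    "platzen": "plaːt͡sən", "knöpfen": "knøːp͡fən",
    "Knochen": "knoːxən", "stricken": "ʃtʁiːkən",
}

failures = failing_cases(OBSERVED.__getitem__, CASES)
for word, expected, actual in failures:
    print(f"{word}: expected {expected}, got {actual}")
```

With Epitran installed, one would presumably pass `epitran.Epitran('deu-Latn-nar').transliterate` in place of the stand-in.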

@trenslow
Contributor Author

> Thank you. This is helpful. Is vowel length in German better stated as lengthening or shortening?

The literature more often talks about a tense/lax distinction, which is conflated with the long/short distinction because short, tense vowels occur so infrequently. You'd then have tense (long) vowels as the default, with a 'laxing' (shortening) process triggered by the different orthographic contexts.

> so → zo # I'm confused about so and also. The vowel is in an open syllable, so shouldn't it be long?

You're right that the vowel in 'so' should be long; my mistake there. But the /o/ should definitely be short in 'also'. This makes the rule generalization a little harder. The more I think about it, the more exceptions to the rules there seem to be. My gut feeling tells me that since the stress falls on the /a/, the /o/ is not 'allowed' to be long.

This same rule could then apply to kreativ, sozial and Lokomotive, since the stress falls on the last syllable. This seems to align with what this document is saying.

Now that I think about it, a lot of exceptions to the rules could be explained by the frequency of the word's occurrence in daily speech, but I guess that transliteration logic is out of the scope of Epitran.

As I continue with my research and come across more interesting cases, I'll report back asap.

@dmort27
Owner

dmort27 commented Feb 2, 2021

Sorry to have dropped this. It seems as if the situation in German is not unlike that of English (there are tense and lax vowels; the tense vowels are long and the lax vowels are short), except that the correlation is imperfect in German. Is this correct? I was working with sources that described the German distinction in terms of length rather than vowel quality, but I'd be willing to change how this works in Epitran if you can point me to the literature I should follow.

@trenslow
Contributor Author

Hi @dmort27, sorry for the late response. I got caught up with other topics and am only now making my way back to German transliteration.

According to sections 1.3 and 1.4 of the document in my last comment, the tense/lax distinction is often conflated with the long/short distinction. This differs from English, where you can have both a long, lax vowel and a short, tense vowel (if my memory isn't failing me). According to the document, German can have short, tense vowels but no long, lax vowels (section 1.4). What's interesting to me is that all the examples of short, tense vowels they give are words of foreign origin, but I can't find a rule anywhere saying this holds across the board.

I also stumbled across an interesting post here, which would explain the cases where the short-vowel-before-double-consonant rule doesn't apply. You can see it in the second comment in the link. Even though I'm not sure how he came to the conclusion that the 'd' in Mond is a suffix, the idea that a long syllable coda implies a short nucleus, and vice versa, is a better rule than simply relying on the orthography.
