<div align=right>
LAP | Computational Morphology | Final project<br>
</div>


<h1 align=center>G2P: TURKISH</h1>
(Bersun Şipal, Inés Martínez, María López)

# Introduction

The goal of this final project is to create a cascade of rules that maps orthographic strings in Turkish (```Input```) to strings that represent their pronunciation (```Output```), based on the Brazilian Portuguese G2P system developed in class.

By convention, the input string appears on top and the output string on the bottom, as shown in the following example:

```
Input: gece
Output: ɟe̞d͡ʒe̞
```

The purpose of this task is to automate the conversion from orthography to IPA (International Phonetic Alphabet).

To this end, we researched and analyzed the fundamental characteristics of the Turkish language in order to identify all the aspects that needed to be expressed as grammar rules. This analysis showed that we needed not only a grammar but also a syllabifier, so that the rules could be applied correctly. Taking the word *benim* as an example, the G2P works as follows:

```
Input: benim

---> Syllabifier: be+nim
---> Grammar: be̞+nim
---> Cleaner: be̞nim

Output: be̞nim
```

Finally, an extensive list of words was used to verify that all the cases mentioned above are correctly represented when the program is applied.

# The Turkish language

The Turkish alphabet has 29 letters (8 vowels, 21 consonants). Seven of these were modified from the standard Latin alphabet to represent Turkish sounds more accurately: ç, ğ, ı, i (with the dotted capital İ), ö, ş and ü. Conversely, three Latin letters are not used in Turkish: q, w and x.

```
Vowels: a e ı i o ö u ü
Consonants: b c ç d f g ğ h j k l m n p r s ş t v y z
```
The Turkish language is characterized by vowel harmony: the vowels of a suffix change to match the vowels of the root word. There are two types of vowel harmony, depending on whether a vowel is front or back and on whether it is rounded or unrounded.


```
Front or soft vowels: e, i, ö, ü
Back or hard vowels: a, ı, o, u
```

#### **Vowel harmony**

*   **With front or soft vowels:** *Ev* (house) has the front vowel e, so its plural takes the suffix -ler, which also has a front vowel: *evler* (houses).


*   **With back or hard vowels:** *Kapı* (door) has the back vowel ı, so its plural takes the suffix -lar, which also has a back vowel: *kapılar* (doors).
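
The two-way harmony above can be sketched in a few lines of plain Python. This is only an illustration, separate from the G2P itself: the helper names `last_vowel` and `plural` are hypothetical, input is assumed to be lowercase, and exceptions such as loanwords are ignored.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def last_vowel(word):
    """Return the last vowel of a lowercase word, or None if it has none."""
    for ch in reversed(word):
        if ch in FRONT_VOWELS or ch in BACK_VOWELS:
            return ch
    return None


def plural(word):
    """Attach -ler after a front vowel and -lar after a back vowel."""
    vowel = last_vowel(word)
    if vowel is None:
        raise ValueError(f"no vowel found in {word!r}")
    return word + ("ler" if vowel in FRONT_VOWELS else "lar")


print(plural("ev"))    # evler
print(plural("kapı"))  # kapılar
```

Only the last vowel of the stem is inspected, which is enough for the two examples given here.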

There is also a second layer of vowel harmony, known as four-way vowel harmony. In certain suffixes, the suffix vowel matches the rounding of the last vowel in the root word in addition to its front/back value.

```
Rounded vowels: o, ö, u, ü
Unrounded vowels: a, e, ı, i
```
In this case, the Turkish language uses:
*   An **ü** for the harmony in **front-rounded vowels**. For example, *üzüm* (grape) with the accusative suffix, *üzümü*.
*   A **u** for the harmony in **back-rounded vowels**. For example, *kutu* (box) with the accusative suffix, *kutuyu*.
*   An **i** for the harmony in **front-unrounded vowels**. For example, *kedi* (cat) with the accusative suffix, *kediyi*.
*   An **ı** for the harmony in **back-unrounded vowels**. For example, *araba* (car) with the accusative suffix, *arabayı*.
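
As a rough sketch of four-way harmony, the accusative forms listed above can be reproduced with a lookup from the last stem vowel to the suffix vowel. The names `HARMONY` and `accusative` are hypothetical, and the sketch assumes lowercase input; the buffer consonant y after vowel-final stems is taken from the examples (*kutuyu*, *kediyi*, *arabayı*), while consonant alternations and irregular stems are not handled.

```python
# Last stem vowel -> harmonizing high vowel of the suffix.
HARMONY = {
    "e": "i", "i": "i",  # front unrounded
    "ö": "ü", "ü": "ü",  # front rounded
    "a": "ı", "ı": "ı",  # back unrounded
    "o": "u", "u": "u",  # back rounded
}


def accusative(word):
    """Append the accusative suffix with four-way vowel harmony."""
    vowels = [ch for ch in word if ch in HARMONY]
    if not vowels:
        raise ValueError(f"no vowel found in {word!r}")
    suffix_vowel = HARMONY[vowels[-1]]
    # Stems that end in a vowel take the buffer consonant y first.
    buffer = "y" if word[-1] in HARMONY else ""
    return word + buffer + suffix_vowel


for stem in ("üzüm", "kutu", "kedi", "araba"):
    print(stem, "->", accusative(stem))
```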

#### **Agglutination**
Another characteristic of Turkish is its extensive agglutination: words are built by chaining morphemes together. For instance:
```
ev+de+ki+nin+ki+ler+de+ki
```

This construction consists of the word *ev* (house) followed by a sequence of morphemes: locative case, relativizer, genitive case, relativizer, plural, locative case and relativizer. It can be translated as *those belonging to the people at home*.
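
The segmentation can also be written out as data. The short sketch below pairs each morpheme with the gloss given in the text; the abbreviations (`LOC`, `REL`, `GEN`, `PL`) are labels chosen here for illustration.

```python
segmented = "ev+de+ki+nin+ki+ler+de+ki"
glosses = ["house", "LOC", "REL", "GEN", "REL", "PL", "LOC", "REL"]

morphemes = segmented.split("+")
pairs = list(zip(morphemes, glosses))

# Print one morpheme per line next to its gloss.
for morpheme, gloss in pairs:
    print(f"{morpheme:<4} {gloss}")

# Removing the separators gives the surface word.
surface = "".join(morphemes)
print(surface)
```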


# G2P

## Facts to be modeled



The following G2P is based on the Istanbul variety of Turkish, known as *İstanbul Türkçesi*, since the language has several dialects. The transducer generates IPA pronunciations from input strings written in lowercase *İstanbul Türkçesi* orthography.

The output strings produced by our grammar are written in IPA and therefore there are some special characters:

```
d͡ʒ → voiced postalveolar affricate, like the phoneme spelled j in English “jam”
t͡ʃ → voiceless postalveolar affricate, like the phoneme spelled ch in English “teacher”
d̪ → voiced dental plosive, like the phoneme spelled d in Spanish “dedo”
æ → near-open front unrounded vowel, like the phoneme spelled a in English “cat”
e̞ → mid front unrounded vowel, similar to the phoneme spelled e in Spanish “mesa”
ɟ → voiced palatal plosive, close to y in Spanish “ya” in some accents
ɯ → close back unrounded vowel, similar to i in South African English “pill”
ʒ → voiced postalveolar fricative, like the s in English “measure”
ɫ → velarized alveolar lateral approximant, like the l in English “full”
o̞ → mid back rounded vowel, similar to the two vowels o in Spanish “lobo”
ø̞ → mid front rounded vowel, similar to the phoneme spelled eu in French “peur”
ɾ → voiced alveolar tap, like the r in Spanish “pero”
ʃ → voiceless postalveolar fricative, like the phoneme spelled sh in English “ship”
ʋ → voiced labiodental approximant, similar to w in German “was”
```


The mapping from orthography to pronunciation covers the following alternations:

The orthographical c is always pronounced /d͡ʒ/.

```
Input: cem
Output: d͡ʒæm
```
The orthographical ç is always pronounced /t͡ʃ/.
```
Input: çam
Output: t͡ʃam
```

The orthographical d is always pronounced /d̪/.
```
Input: demir
Output: d̪e̞miɾ
```

The e, in contrast, can be open or close, depending on the following consonant and on the syllable structure.

The orthographical e is pronounced ‘open’ as /æ/ when it appears before l, m, n, r or z in the same syllable.
```
Input: el
Output: æl
```

The ‘close’ e is pronounced elsewhere as /e̞/.
```
Input: ev
Output: e̞v
```

Syllabification is essential in cases such as the following, where an n follows an e but belongs to the next syllable, so the e remains close.
```
Input: benim
Output: be̞nim
```
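
The interaction between syllabification and the two e rules can be sketched in plain Python. This is only an illustration with regular expressions, not the pyfoma cascade itself: the helper names `syllabify` and `e_rule` are hypothetical, and the two boundary patterns are assumed to mirror the `s1` and `s2` rules from the code section.

```python
import re

V = "aeıioöuü"
C = "bcçdfgğhjklmnprsştvyz"


def syllabify(word):
    """Insert '+' at syllable boundaries."""
    # Boundary between two adjacent consonants (mirrors s1).
    word = re.sub(f"(?<=[{C}])(?=[{C}])", "+", word)
    # Boundary after a vowel followed by consonant + vowel (mirrors s2).
    return re.sub(f"(?<=[{V}])(?=[{C}][{V}])", "+", word)


def e_rule(syllabified):
    """Open e before l, m, n, r, z in the same syllable; close e elsewhere."""
    # A '+' between e and the consonant blocks the open variant.
    opened = re.sub(r"e(?=[lmnrz])", "æ", syllabified)
    # Every remaining e becomes e plus the lowering diacritic U+031E.
    return opened.replace("e", "e\u031e")


for word in ("el", "ev", "benim"):
    marked = syllabify(word)
    print(word, "->", marked, "->", e_rule(marked).replace("+", ""))
```

In *benim* the boundary falls between e and n (`be+nim`), so the lookahead for a following n fails and the e stays close.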

The orthographical g is pronounced /ɟ/ before front vowels and /g/ before back vowels.
```
Input: gemi
Output: ɟe̞mi

Input: garaj
Output: gaɾaʒ
```
The orthographical ğ never occurs initially, and its pronunciation may vary between Turkish dialects. In İstanbul Türkçesi it lengthens the preceding vowel, which is indicated with ```ː```. When the same vowel follows the ğ, it is absorbed into the long vowel.

```
Input: uğur
Output: uːɾ
```
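
A minimal plain-Python sketch of this behaviour, assuming it mirrors the `softgrule` and `long_*` rules in the code section: ğ becomes the length mark, and a vowel identical to the one before the syllable boundary is dropped. The function name `soft_g` is hypothetical, and the input is expected to be already syllabified with `+`.

```python
import re

LONG = "\u02d0"  # the IPA length mark ː
VOWELS = "aeıioöuü"


def soft_g(syllabified):
    """Replace ğ with the length mark and merge identical vowels around it."""
    lengthened = syllabified.replace("ğ", LONG)
    # Same vowel on both sides of '+ː': keep the first, drop the second.
    return re.sub(rf"([{VOWELS}])\+{LONG}\1", r"\1+" + LONG, lengthened)


for word in ("u+ğur", "dağ", "so+ğuk"):
    print(word, "->", soft_g(word).replace("+", ""))
```

Only the ğ step is covered here, so *uğur* comes out as `uːr`; the remaining rules, such as r to ɾ, are applied by the full grammar.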

The orthographical ı is always pronounced /ɯ/.
```
Input: ısı
Output: ɯsɯ
```
The orthographical j is always pronounced /ʒ/.
```
Input: jeoloji
Output: ʒe̞o̞ɫo̞ʒi
```
The orthographical k is pronounced /c/ before a front vowel, or directly after one in the same syllable, and /k/ elsewhere.
```
Input: kedi
Output: ce̞d̪i

Input: kapı
Output: kapɯ
```
The orthographical l is pronounced /l/ when it is preceded or followed by a front vowel, and /ɫ/ when it is preceded or followed by a back vowel.
```
Input: el
Output: æl

Input: al
Output: aɫ
```
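
The context-dependent consonants g, k and l can be illustrated together in plain Python. The sketch below is a simplified stand-in for the corresponding rules in the code section (`grule`, `krule1`, `krule2`, `lrule1`, `lrule2`), under the assumption that it runs on syllabified orthography before any vowel has been rewritten; the function name `apply_context_rules` is hypothetical.

```python
import re

FRONT = "eiöü"
BACK = "aıou"


def apply_context_rules(syllabified):
    """Rewrite g, k and l according to the neighbouring vowel."""
    # g -> ɟ before a front vowel.
    out = re.sub(f"g(?=[{FRONT}])", "ɟ", syllabified)
    # k -> c before a front vowel.
    out = re.sub(f"k(?=[{FRONT}])", "c", out)
    # k -> c directly after a front vowel (no '+' in between).
    out = re.sub(f"(?<=[{FRONT}])k", "c", out)
    # l -> ɫ directly after or before a back vowel.
    return re.sub(f"(?<=[{BACK}])l|l(?=[{BACK}])", "ɫ", out)


for word in ("ge+mi", "ga+raj", "ke+di", "ka+pı", "el", "al"):
    print(word, "->", apply_context_rules(word))
```

Because k is rewritten to c, an orthographic c has to be converted to d͡ʒ before this step; the cascade in the code section orders `crule` before the k rules for that reason.
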
The orthographical o is always pronounced /o̞/.
```
Input: ordu
Output: o̞ɾd̪u
```
The orthographical ö is always pronounced /ø̞/.
```
Input: ördek
Output: ø̞ɾd̪e̞c
```
The orthographical r is always pronounced /ɾ/.
```
Input: radyo
Output: ɾad̪jo̞
```
The orthographical ş is always pronounced /ʃ/.
```
Input: şapşal
Output: ʃapʃaɫ
```
The orthographical ü is always pronounced /y/.
```
Input: üzüm
Output: yzym
```
The orthographical v is always pronounced /ʋ/.
```
Input: vazo
Output: ʋazo̞
```
The orthographical y is always pronounced /j/.
```
Input: yol
Output: jo̞ɫ
```
The orthographical n is pronounced /ŋ/ before k or g, as in *renk*, which is pronounced ɾæŋk.

## Test data

This list covers all the cases discussed above and has been used as our test data. It contains 35 words, each with its expected phonetic representation.

```
avcı         aʋd͡ʒɯ       
bekle        be̞cle̞      
cam          d͡ʒam              
dağ          d̪aː         
demir        d̪e̞miɾ      
donmak       d̪o̞nmak     
erkek        æɾce̞c       
eğlence      e̞ːlænd͡ʒe̞  
garaj        gaɾaʒ        
gemi         ɟe̞mi        
gizem        ɟizæm        
jalüze       ʒalyze̞      
kadın        kad̪ɯn       
kapı         kapɯ         
kar          kaɾ          
kaygan       kajgan       
kediler      ce̞d̪ilæɾ    
kelime       ce̞lime̞     
kör          cø̞ɾ         
kül          cyl          
küçük        cyt͡ʃyc      
kıl          kɯɫ          
ocak         o̞d͡ʒak      
rengarenk    ɾæŋgaɾæŋk    
renk         ɾæŋk         
soğuk        so̞ːuk       
sığınak      sɯːnak       
vali         ʋali         
yılan        jɯɫan        
çirkin       t͡ʃiɾcin     
çiçek        t͡ʃit͡ʃe̞c   
ölmek        ø̞lme̞c      
ılık         ɯɫɯk         
şehir        ʃe̞hiɾ       
şeker        ʃe̞cæɾ 
```
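
A two-column listing like the one above lends itself to an automated check. The sketch below is one possible way to do it in plain Python: `parse_pairs`, `mismatches` and `toy_g2p` are hypothetical names, and `toy_g2p` is a deliberately incomplete stand-in (it only rewrites r) so that the report contains a failure.

```python
def parse_pairs(listing):
    """Turn a two-column 'word  transcription' listing into (word, ipa) pairs."""
    pairs = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def mismatches(g2p, pairs):
    """Return (word, expected, actual) for every word the g2p gets wrong."""
    return [(w, exp, g2p(w)) for w, exp in pairs if g2p(w) != exp]


def toy_g2p(word):
    """Stand-in converter that only knows the r rule."""
    return word.replace("r", "\u027e")  # r -> ɾ


# Two entries from the test data, written with explicit code points.
sample = "cam          d\u0361\u0292am\nkar          ka\u027e\n"

pairs = parse_pairs(sample)
for word, expected, actual in mismatches(toy_g2p, pairs):
    print(f"{word}: expected {expected}, got {actual}")
```

Passing the real transducer instead of `toy_g2p` would only require a function that maps one word to one output string.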

## Code

In [1]:
!pip install pyfoma



In [2]:
from pyfoma import *

fsts = {}
fsts['V'] = FST.re("a|e|ı|i|o|ö|u|ü") # All the vowels
fsts['fV'] = FST.re("i|e|ü|ö") # Front vowels
fsts['bV'] = FST.re("ı|u|a|o") # Back vowels
fsts['C'] = FST.re("b|c|ç|d|f|g|ğ|h|j|k|l|m|n|p|r|s|ş|t|v|y|z") # All the consonants

# Syllabifier
fsts['s1'] = FST.re("$^rewrite('':'+' / $C _ $C)", fsts)
fsts['s2'] = FST.re("$^rewrite('':'+' / $V _ $C $V)",fsts)

# Grammar rules
fsts['crule'] = FST.re("$^rewrite(c:'d͡ʒ')")
fsts['çrule'] = FST.re("$^rewrite(ç:'t͡ʃ')")
fsts['drule'] = FST.re("$^rewrite(d:'d̪')")
fsts['erule1'] = FST.re("$^rewrite(e:'æ' / _ (l|m|n|r|z))")
fsts['erule2'] = FST.re("$^rewrite(e:'e̞')")
fsts['grule'] = FST.re("$^rewrite(g:'ɟ' / _ $fV)", fsts)
fsts['softgrule'] = FST.re("$^rewrite(ğ:'ː')")
fsts['long_a'] = FST.re("$^rewrite(a:'' / a '+' 'ː' _)", fsts)
fsts['long_e'] = FST.re("$^rewrite(e:'' / e '+' 'ː' _)", fsts)
fsts['long_ı'] = FST.re("$^rewrite(ı:'' / ı '+' 'ː' _)", fsts)
fsts['long_i'] = FST.re("$^rewrite(i:'' / i '+' 'ː' _)", fsts)
fsts['long_ö'] = FST.re("$^rewrite(ö:'' / ö '+' 'ː' _)", fsts)
fsts['long_o'] = FST.re("$^rewrite(o:'' / o '+' 'ː' _)", fsts)
fsts['long_ü'] = FST.re("$^rewrite(ü:'' / ü '+' 'ː' _)", fsts)
fsts['long_u'] = FST.re("$^rewrite(u:'' / u '+' 'ː' _)", fsts)
fsts['ırule'] = FST.re("$^rewrite(ı:'ɯ')")
fsts['jrule'] = FST.re("$^rewrite(j:'ʒ')")
fsts['krule1'] = FST.re("$^rewrite(k:c / _ $fV)", fsts)
fsts['krule2'] = FST.re("$^rewrite(k:c / $fV _)", fsts)
fsts['lrule1'] = FST.re("$^rewrite(l:'ɫ' / $bV _)", fsts)
fsts['lrule2'] = FST.re("$^rewrite(l:'ɫ' / _ $bV)", fsts)
fsts['orule'] = FST.re("$^rewrite(o:'o̞')")
fsts['örule'] = FST.re("$^rewrite(ö:'ø̞')")
fsts['rrule'] = FST.re("$^rewrite(r:'ɾ')")
fsts['şrule'] = FST.re("$^rewrite(ş:'ʃ')")
fsts['ürule'] = FST.re("$^rewrite(ü:y)")
fsts['vrule'] = FST.re("$^rewrite(v:ʋ)")
fsts['yrule'] = FST.re("$^rewrite(y:j)")
fsts['nkgrule'] = FST.re("$^rewrite(n:'ŋ' / _ '+'(k|g))", fsts)

# Syllabifier
fsts['syllabifier'] = FST.re("$s1 @ $s2", fsts)

# Grammar
# $grule runs before the vowel rules: its context $fV must see the
# orthographic vowels, before e, ö and ü are rewritten (gemi --> ɟe̞mi).
fsts['grammar'] = FST.re(
    "$softgrule @ $long_a @ $long_e @ $long_ı @ $long_i @ "
    "$long_ö @ $long_o @ $long_ü @ $long_u @ "
    "$crule @ $grule @ $krule1 @ $krule2 @ $lrule1 @ $lrule2 @ $çrule @ $erule1 @ "
    "$drule @ $erule2 @ $ırule @ $jrule @ $yrule @ $orule @ $örule @ $rrule @ "
    "$şrule @ $ürule @ $vrule @ $nkgrule",
    fsts
)

# Cleaner
fsts['cleaner'] = FST.re("$^rewrite('+':'')")

# Full G2P
fsts['full'] = FST.re("$syllabifier @ $grammar @ $cleaner", fsts)

# Testing


In [3]:
# Syllabifier test
list(fsts['syllabifier'].generate("evdekininkilerdeki"))

['ev+de+ki+nin+ki+ler+de+ki']

In [4]:
# G2P small test
list(fsts['full'].generate("sığınak")) # Example: sığınak --> sı+ğı+nak --> sɯ+ː+nak --> sɯːnak

['sɯːnak']

In [5]:
# G2P large-scale automated testing
inputs = ["kaygan", "çirkin", "kelime", "yılan", "demir", "garaj", "küçük", "avcı", "sığınak", "dağ", "eğlence", "soğuk", "kapı", "renk", "şehir", "donmak", "gemi", "erkek", "kadın", "kül", "jalüze", "bekle", "rengarenk", "ocak", "kıl", "şeker", "cam", "ılık", "kör", "vali", "çiçek", "kediler", "kar", "ölmek", "gizem"]
fsts['inputs'] = FST.re('|'.join(inputs))
fsts['test'] = FST.re("$inputs @ $full", fsts)
print(Paradigm(fsts['test'], ".*"))

avcı         aʋd͡ʒɯ       
bekle        be̞cle̞      
cam          d͡ʒam        
dağ          d̪aː         
demir        d̪e̞miɾ      
donmak       d̪o̞nmak     
erkek        æɾce̞c       
eğlence      e̞ːlænd͡ʒe̞  
garaj        gaɾaʒ        
gemi         ɟe̞mi        
gizem        ɟizæm        
jalüze       ʒalyze̞      
kadın        kad̪ɯn       
kapı         kapɯ         
kar          kaɾ          
kaygan       kajgan       
kediler      ce̞d̪ilæɾ    
kelime       ce̞lime̞     
kör          cø̞ɾ         
kül          cyl          
küçük        cyt͡ʃyc      
kıl          kɯɫ          
ocak         o̞d͡ʒak      
rengarenk    ɾæŋgaɾæŋk    
renk         ɾæŋk         
soğuk        so̞ːuk       
sığınak      sɯːnak       
vali         ʋali         
yılan        jɯɫan        
çirkin       t͡ʃiɾcin     
çiçek        t͡ʃit͡ʃe̞c   
ölmek        ø̞lme̞c      
ılık         ɯɫɯk         
şehir        ʃe̞hiɾ       
şeker        ʃe̞cæɾ       

