# Overview of final language-specific code - Turkish

**LNG 409 Fall 2024**

Authors: Hee Joong Choi, Yuzhang Fu, Xin You



This notebook walks through our final code, `transducer_tur.py` (also provided as `transducer_tur.ipynb`), which implements vowel harmony correction for Turkish noun plural and verb progressive suffixes using finite-state transducers (FSTs) via the `pynini` library. We describe the changes we made, explain the logic of the code, and report the performance of the transducers.

## Set up

To run this code, install pynini, a library for building and running weighted finite-state transducers.

```Python
!pip install pynini
```

**Note**: Throughout the project, we were only able to import the pynini library on Google Colab. We therefore suggest uploading `transducer_tur.ipynb` along with `data/tur.out` and `data/tur.dev` to Google Drive and running the code there. We have also included `transducer_tur.py` in our GitHub repository for reference.

## Defined functions

### Vowel categorization

We have created two helper functions that categorize vowels in Turkish words:

`n_vowel_categorize(lemma)` determines the frontness and roundness of the last vowel in a lemma, where **'ö, ü'** are front rounded vowels, **'e, i'** are front unrounded vowels, **'o, u'** are back rounded vowels, and **'a, ı'** are back unrounded vowels. If the lemma contains none of these vowels, the function implicitly returns `None`.

```python
def n_vowel_categorize(lemma):
    for char in reversed(lemma):
        if char in "öü":
            return {"frontness": True, "roundness": True}
        elif char in "ei":
            return {"frontness": True, "roundness": False}
        elif char in "ou":
            return {"frontness": False, "roundness": True}
        elif char in "aı":
            return {"frontness": False, "roundness": False}
```
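As a quick check of the four-way split, the same categorization can be sketched as a lookup table. The names `VOWEL_FEATURES` and `last_vowel_features` are hypothetical and not part of `transducer_tur.py`; the lemmas are taken from the sample output later in this notebook.

```python
# Hypothetical table-driven variant of n_vowel_categorize (illustration only).
# Each vowel maps to (frontness, roundness).
VOWEL_FEATURES = {
    "ö": (True, True), "ü": (True, True),      # front rounded
    "e": (True, False), "i": (True, False),    # front unrounded
    "o": (False, True), "u": (False, True),    # back rounded
    "a": (False, False), "ı": (False, False),  # back unrounded
}

def last_vowel_features(lemma):
    """Return frontness/roundness of the last vowel, or None if there is none."""
    for char in reversed(lemma):
        if char in VOWEL_FEATURES:
            front, rounded = VOWEL_FEATURES[char]
            return {"frontness": front, "roundness": rounded}
    return None

for lemma in ["otel", "asimptot", "küçüklük", "anahtarlık"]:
    print(lemma, last_vowel_features(lemma))
```

Returning `None` explicitly makes the no-vowel case visible to the caller, which would otherwise fail when indexing the result.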

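`v_vowel_categorize(lemma)` finds the second-to-last vowel in the lemma and categorizes it with the same rules as above. Turkish verb lemmas end in the infinitive suffix *-mek/-mak*, so the second-to-last vowel of the lemma is the last vowel of the stem, which is the vowel that governs the progressive suffix.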

```python
def v_vowel_categorize(lemma):
    vowels = "öüeiouaı"
    n = 0
    for char in reversed(lemma):
        if char in vowels:
            n += 1
            if n == 2:
                if char in "öü":
                    return {"frontness": True, "roundness": True}
                elif char in "ei":
                    return {"frontness": True, "roundness": False}
                elif char in "ou":
                    return {"frontness": False, "roundness": True}
                elif char in "aı":
                    return {"frontness": False, "roundness": False}
```
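To see which vowel the function actually inspects, the search can be sketched on its own. The helper name `nth_vowel_from_end` is hypothetical; the lemma comes from the sample output in the verb section.

```python
VOWELS = "öüeiouaı"

def nth_vowel_from_end(lemma, n=2):
    """Return the n-th vowel counting from the end of the lemma, or None."""
    seen = 0
    for char in reversed(lemma):
        if char in VOWELS:
            seen += 1
            if seen == n:
                return char
    return None  # fewer than n vowels in the lemma

print(nth_vowel_from_end("Türkçeleştirmek"))       # i  (last vowel of the stem)
print(nth_vowel_from_end("Türkçeleştirmek", n=1))  # e  (vowel of the -mek suffix)
```

A lemma with fewer than two vowels yields `None` here, which mirrors the case where `v_vowel_categorize` falls off the end of its loop without returning a dictionary.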

## Plural Suffix Correction for Nouns

### Construction of transducers
```python
import pynini

turkish_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz " # Added " " (space) to include compounds like "sözdizimsel tuzlerde"
sigma = pynini.union(*turkish_alphabet).closure()
```
The Turkish alphabet is defined with an added space character to handle multi-word compounds. `sigma` is the closure over that alphabet, so it accepts any string built from these characters, including the empty string.

```python
plural_correction_rule_a = pynini.cdrewrite(
        pynini.cross("e", "a"),
        "l",
        "r",
        sigma)

plural_correction_rule_e = pynini.cdrewrite(
        pynini.cross("a", "e"),
        "l",
        "r",
        sigma)
```
Two rewrite rules are defined:
1. **`plural_correction_rule_a`**: Converts *"e"* to *"a"* when the suffix should be *"lar"* (used for back vowels).
2. **`plural_correction_rule_e`**: Converts *"a"* to *"e"* when the suffix should be *"ler"* (used for front vowels).
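For readers without a working pynini install, the effect of the two rules can be approximated with plain regular expressions. This is an analogue using Python's `re` module, not the FST implementation itself, and the function names `plural_to_lar` and `plural_to_ler` are hypothetical. The inputs are taken from the sample output below.

```python
import re

def plural_to_lar(word):
    """Analogue of plural_correction_rule_a: rewrite e -> a between "l" and "r"."""
    return re.sub(r"(?<=l)e(?=r)", "a", word)

def plural_to_ler(word):
    """Analogue of plural_correction_rule_e: rewrite a -> e between "l" and "r"."""
    return re.sub(r"(?<=l)a(?=r)", "e", word)

print(plural_to_lar("sözdizimsel tuzler"))  # sözdizimsel tuzlar
print(plural_to_ler("küçüklüklar"))         # küçüklükler
print(plural_to_ler("otellarını"))          # otellerını
```

Like the `cdrewrite` rules, these substitutions fire at every `l_r` site in the string, so a stem that happens to contain `ler` or `lar` would be rewritten as well.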

### Code for noun processing
```python
n_n_pl = 0 # counter for total number of words
c_n_pl = 0 # counter for correct words

for i, line in enumerate(tur_out):
  lemma, msd, inflected = line[:3]
  correct = dev[i]

  if msd.startswith("N") and '(PL' in msd:
    v_cat = n_vowel_categorize(lemma)
```
First, we filter for plural nouns: only entries whose `msd` starts with `N` (noun) and contains `(PL` (plural) are processed. Then we call `n_vowel_categorize` to determine the frontness of the lemma's last vowel.

```python
    if v_cat["frontness"] == False and "ler" in inflected:
      output = inflected @ plural_correction_rule_a
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_n_pl += 1
      if correct == correction:
        c_n_pl += 1

    elif v_cat["frontness"] == True and "lar" in inflected:
      output = inflected @ plural_correction_rule_e
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_n_pl += 1
      if correct == correction:
        c_n_pl += 1

    else:
      correction = inflected

# Print the summary once, after the loop has finished
print("--------------------------------")
print("Number of words identified:", n_n_pl)
print("Number of words corrected:", c_n_pl)
```
Next, we apply the plural suffix rules: `plural_correction_rule_a` corrects `ler` to `lar` for back-vowel lemmas, and `plural_correction_rule_e` corrects `lar` to `ler` for front-vowel lemmas. Finally, we compare the corrected form with the expected form (`correct`) and track the number of processed cases (`n_n_pl`) and successful corrections (`c_n_pl`).
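The branching in the loop reduces to a small decision: given the lemma's frontness and the inflected form, which suffix vowel (if any) should be enforced. A minimal sketch of that decision, with the hypothetical helper name `choose_plural_fix`:

```python
def choose_plural_fix(is_front, inflected):
    """Return the vowel the plural suffix should get, or None if no fix applies."""
    if not is_front and "ler" in inflected:
        return "a"  # back-vowel lemma with a front suffix: ler -> lar
    if is_front and "lar" in inflected:
        return "e"  # front-vowel lemma with a back suffix: lar -> ler
    return None     # form is left unchanged

print(choose_plural_fix(False, "sözdizimsel tuzler"))  # a
print(choose_plural_fix(True, "küçüklüklar"))          # e
print(choose_plural_fix(True, "oteller"))              # None
```

Note that the test is a substring check, as in the notebook code, so it does not distinguish a suffix from a stem that contains the same letters.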

### Sample output
The output shows each lemma, its original inflected form, the corrected form, and whether the correction matches the expected form (`True` or `False`).

Example:
```
anahtarlık anahtarlıklerimize → anahtarlıklarimize False
asimptot asimptotlerimizden → asimptotlarimizden False
küçüklük küçüklüklar → küçüklükler True
otel otellarını → otellerını False
sözdizimsel tuz sözdizimsel tuzler → sözdizimsel tuzlar True
```

At the end, it prints the total processed cases (`n_n_pl`) and successful corrections (`c_n_pl`):
```
Number of words identified: 31
Number of words corrected: 2
```

**2 out of 31** cases were corrected successfully.

**Reason for low success rate**
- While the plural suffix (lar/ler) is corrected, subsequent suffixes (e.g., possessive or case suffixes) remain unprocessed. These additional suffixes also require adjustments to align with vowel harmony rules, which the current implementation does not address.
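The leftover disharmony can be made visible with a small diagnostic that lists the vowels after the plural suffix whose frontness disagrees with it. This is a sketch only: `disharmonic_vowels` is a hypothetical helper, it checks frontness but not rounding, and it assumes the first occurrence of the suffix string is the plural suffix.

```python
FRONT = set("eiöü")
BACK = set("aıou")

def disharmonic_vowels(word, suffix):
    """List vowels after the first `suffix` whose frontness differs from the suffix vowel."""
    start = word.find(suffix)
    if start == -1:
        return []
    suffix_is_front = suffix[1] in FRONT  # "ler" -> True, "lar" -> False
    tail = word[start + len(suffix):]
    return [ch for ch in tail
            if (ch in FRONT or ch in BACK) and (ch in FRONT) != suffix_is_front]

print(disharmonic_vowels("anahtarlıklarimize", "lar"))  # ['i', 'i', 'e']
print(disharmonic_vowels("otellerını", "ler"))          # ['ı', 'ı']
print(disharmonic_vowels("küçüklükler", "ler"))         # []
```

The two forms marked `False` in the sample output still carry vowels from the original, wrong harmony class, while the form marked `True` has nothing after the plural suffix.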

## Progressive Suffix Correction for Verbs

### Construction of transducers
```python
turkish_alphabet = "abcçdefgğhıijklmnoöprsştuüvyzT " # Added " " (space) and "T" to cover multi-word compounds and capitalized words
sigma = pynini.union(*turkish_alphabet).closure()
vowel = pynini.union(*"öüeiouaı") # define vowels group
consonant = pynini.union(*"bcçdfgğhjklmnprsştvz")  # define consonants group
```
Again, we start by defining the Turkish alphabet, adding a space to handle multi-word compounds and the uppercase letter `T` for capitalized inputs. We also define `vowel` and `consonant` as unions over the vowel and consonant characters, for use in the rewrite rules.
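Because the rewrite rules are built over `sigma`, an input containing a character outside the alphabet is expected to produce no output path when composed with a rule. A simple pre-check can flag such inputs before composition. The helper name `chars_outside_alphabet` and the second test string are hypothetical.

```python
turkish_alphabet = "abcçdefgğhıijklmnoöprsştuüvyzT "

def chars_outside_alphabet(text, alphabet=turkish_alphabet):
    """Return the sorted list of characters in `text` that the alphabet does not cover."""
    return sorted(set(text) - set(alphabet))

print(chars_outside_alphabet("Türkçeleştiriyor muydu"))  # []
print(chars_outside_alphabet("Ankara'da"))               # ["'", 'A']
```

The first input is fully covered only because `T` and the space were added to the alphabet.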

```python
# Progressive suffix context for adding ü
prog_adding_rule_ü = pynini.cdrewrite(
      pynini.cross("", "ü"),
      consonant,    # Left context: a consonant
      "yor",    # Right context: before "yor"
      sigma)

# Progressive suffix context for adding i
prog_adding_rule_i = pynini.cdrewrite(
      pynini.cross("", "i"),
      consonant,    # Left context: a consonant
      "yor",    # Right context: before "yor"
      sigma)

# Progressive suffix context for correcting back to ü
prog_correction_rule_ü = pynini.cdrewrite(
        pynini.cross(vowel, "ü"),
        "",    # Left context: anything
        "yor",    # Right context: before "yor" (equal to immediately followed by 'yor')
        sigma)

# Progressive suffix context for correcting back to i
prog_correction_rule_i = pynini.cdrewrite(
        pynini.cross(vowel, "i"),
        "",    # Left context: anything
        "yor",    # Right context: before "yor" (equal to immediately followed by 'yor')
        sigma)
```
Two rewrite rules **correct an improperly formed progressive suffix**:
1. **`prog_correction_rule_ü`**: Replaces any vowel with *"ü"* when it is immediately followed by *"yor"*. It is used when the stem vowel is front and rounded.
2. **`prog_correction_rule_i`**: Replaces any vowel with *"i"* in the same context. It is used when the stem vowel is front and unrounded.

Additionally, two rewrite rules **add a missing vowel** so that the suffix follows Turkish vowel harmony:
1. **`prog_adding_rule_ü`**: Inserts *"ü"* when the preceding character is a consonant and the following context is *"yor"*. It is used when the stem vowel is front and rounded.
2. **`prog_adding_rule_i`**: Inserts *"i"* in the same context. It is used when the stem vowel is front and unrounded.
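The combined effect of an adding rule followed by a correction rule can be approximated with regular expressions. This is an analogue using Python's `re` module rather than the pynini transducers, and `fix_progressive` is a hypothetical name. The inputs come from the sample output below.

```python
import re

CONSONANTS = "bcçdfgğhjklmnprsştvz"  # same set as `consonant` above
VOWELS = "öüeiouaı"                  # same set as `vowel` above

def fix_progressive(inflected, target):
    """Insert `target` between a consonant and "yor", then normalize the vowel before "yor"."""
    # Step 1 (adding rule): zero-width match between a consonant and "yor"
    added = re.sub(rf"(?<=[{CONSONANTS}])(?=yor)", target, inflected)
    # Step 2 (correction rule): any vowel directly before "yor" becomes `target`
    return re.sub(rf"[{VOWELS}](?=yor)", target, added)

print(fix_progressive("Türkçeleştiryor olmamalıymışım", "i"))  # Türkçeleştiriyor olmamalıymışım
print(fix_progressive("Türkçeleştirüyor muydu", "i"))          # Türkçeleştiriyor muydu
print(fix_progressive("Türkçeleştirmiyormuşuz", "i"))          # Türkçeleştirmiyormuşuz
```

Running the insertion first matters: once a vowel sits before *"yor"*, the correction step handles both the newly inserted vowel and any wrong vowel that was already there.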


### Code for verb processing
```python
n_v_prog = 0
c_v_prog = 0

# Process
for i, line in enumerate(tur_out):
  lemma, msd, inflected = line[:3]
  correct = dev[i]

  if msd.startswith("V") and 'PROG' in msd:
    v_cat = v_vowel_categorize(lemma)
```
First, we filter for progressive verbs: only entries whose `msd` starts with `V` (verb) and contains `PROG` (progressive) are processed. Then we call `v_vowel_categorize` to determine the frontness and roundness of the lemma's second-to-last vowel.

```python
    # first, adding missing vowel ü, then correcting
    if v_cat["frontness"] == True and v_cat["roundness"] == True:
      output = inflected @ prog_adding_rule_ü
      output = output @ prog_correction_rule_ü
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_v_prog += 1
      if correct == correction:
        c_v_prog += 1

    # first, adding missing vowel i, then correcting
    elif v_cat["frontness"] == True and v_cat["roundness"] == False:
      output = inflected @ prog_adding_rule_i
      output = output @ prog_correction_rule_i
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_v_prog += 1
      if correct == correction:
        c_v_prog += 1

    else:
      correction = inflected

print("--------------------------------")
print("Number of words identified:", n_v_prog)
print("Number of words corrected:", c_v_prog)
```
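Next, we apply the progressive suffix rules, which insert or correct *"i"* or *"ü"* using the four transducers defined above. Lemmas whose relevant vowel is a back vowel fall through to the `else` branch and are left unchanged. Finally, we compare the corrected form with the expected form (`correct`) and track the number of processed cases (`n_v_prog`) and successful corrections (`c_v_prog`).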
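The two branches amount to a mapping from the vowel category to the vowel placed before *"yor"*. A minimal sketch of that mapping, with the hypothetical names `TARGET_VOWEL` and `progressive_target`, which also guards against a missing category:

```python
# Only the two front-vowel categories are rewritten in the notebook code.
TARGET_VOWEL = {
    (True, True): "ü",   # front rounded
    (True, False): "i",  # front unrounded
}

def progressive_target(v_cat):
    """Return the vowel to place before "yor", or None when the form is left unchanged."""
    if v_cat is None:  # lemma had fewer than two vowels
        return None
    return TARGET_VOWEL.get((v_cat["frontness"], v_cat["roundness"]))

print(progressive_target({"frontness": True, "roundness": False}))  # i
print(progressive_target({"frontness": False, "roundness": True}))  # None
```

The `None` guard is an addition in this sketch: in the loop above, a lemma for which `v_vowel_categorize` returns nothing would fail at `v_cat["frontness"]`.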

### Sample output
Same as above, the output shows each lemma, its original inflected form, the corrected form, and whether the correction matches the expected form (`True` or `False`).

Example:
```
Türkçeleştirmek Türkçeleştirüyor muydu → Türkçeleştiriyor muydu True
Türkçeleştirmek Türkçeleştirmiyormuşuz → Türkçeleştirmiyormuşuz True
Türkçeleştirmek Türkçeleştiriyormuş → Türkçeleştiriyormuş True
Türkçeleştirmek Türkçeleştiryor olmamalıymışım → Türkçeleştiriyor olmamalıymışım True
Türkçeleştirmek Türkçeleştiriyor olmalıymışsın → Türkçeleştiriyor olmalıymışsın True
```

At the end, it prints the total processed cases (`n_v_prog`) and successful corrections (`c_v_prog`):
```
Number of words identified: 108
Number of words corrected: 98
```

**98 out of 108** cases were corrected successfully. 

**Analysis of the high accuracy**
- The transducers cover most of the errors found in progressive verb forms. This shows the importance of tailoring transducers to the kinds of mistakes that actually occur in the inflected words. As a next step, we will go back and extend the plural suffix correction rules for nouns to improve their accuracy.
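For a side-by-side comparison, the two reported counts can be turned into accuracy figures. The counts are the ones printed above; the helper name `accuracy` is ours.

```python
def accuracy(correct, total):
    """Fraction of processed cases whose correction matched the expected form."""
    return correct / total

results = {"noun plural": (2, 31), "verb progressive": (98, 108)}
for name, (correct, total) in results.items():
    print(f"{name}: {correct}/{total} = {accuracy(correct, total):.1%}")
# noun plural: 2/31 = 6.5%
# verb progressive: 98/108 = 90.7%
```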