# Overview of final language-specific code - Turkish

**LNG 409 Fall 2024**

Authors: Hee Joong Choi, Yuzhang Fu, Xin You


In [None]:
import pynini

This notebook walks through our final code, **transducer_tur**, which implements vowel harmony correction for Turkish noun plural and verb progressive suffixes using finite-state transducers (FSTs) via the `pynini` library. We describe the tweaks we made, the logic of the code, and the performance of the transducers.

## Set up

To run this code, install pynini, a library for building and running weighted finite-state transducers.

```Python
!pip install pynini
```

**Note**: Throughout the project, we were only able to import the pynini library on Google Colab. We therefore suggest uploading `transducer_tur.ipynb` along with `data/tur.out` and `data/tur.dev` to Google Drive and running the code there. We have also included `trasducer_tur.py` in our GitHub repository for reference.

## Defined functions

### Vowel categorization

We have created two helper functions that categorize vowels in Turkish words:

`n_vowel_categorize(lemma)` determines the frontness and roundness of the last vowel in a lemma, where **'ö, ü'** are front rounded vowels, **'e, i'** are front unrounded vowels, **'o, u'** are back rounded vowels, and **'a, ı'** are back unrounded vowels. If the lemma contains no vowel, the function returns `None`.

```python
def n_vowel_categorize(lemma):
    for char in reversed(lemma):
        if char in "öü":
            return {"frontness": True, "roundness": True}
        elif char in "ei":
            return {"frontness": True, "roundness": False}
        elif char in "ou":
            return {"frontness": False, "roundness": True}
        elif char in "aı":
            return {"frontness": False, "roundness": False}
```
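As a quick check of the categorization above, here is a minimal sketch that expresses the same mapping as a lookup table. The names `VOWEL_FEATURES` and `last_vowel_features` are hypothetical and are not part of `transducer_tur`; the behaviour is intended to match `n_vowel_categorize`, including the `None` result for a lemma without vowels.

```python
# Hypothetical lookup table (not in transducer_tur): vowel -> (frontness, roundness)
VOWEL_FEATURES = {
    "ö": (True, True), "ü": (True, True),      # front rounded
    "e": (True, False), "i": (True, False),    # front unrounded
    "o": (False, True), "u": (False, True),    # back rounded
    "a": (False, False), "ı": (False, False),  # back unrounded
}


def last_vowel_features(lemma):
    """Return the features of the last vowel in lemma, or None if it has no vowel."""
    for char in reversed(lemma):
        if char in VOWEL_FEATURES:
            front, rounded = VOWEL_FEATURES[char]
            return {"frontness": front, "roundness": rounded}
    return None


# Lemmas taken from the sample output later in this notebook
for lemma in ["otel", "küçüklük", "asimptot", "anahtarlık"]:
    print(lemma, last_vowel_features(lemma))
```

A table keeps the four vowel classes in one place, which may make it easier to reuse the same data for both the noun and the verb helper.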

`v_vowel_categorize(lemma)` applies the same categorization to the second-to-last vowel in the lemma, which the progressive suffix rules rely on. It returns `None` if the lemma has fewer than two vowels.

```python
def v_vowel_categorize(lemma):
    vowels = "öüeiouaı"
    n = 0
    for char in reversed(lemma):
        if char in vowels:
            n += 1
            if n == 2:
                if char in "öü":
                    return {"frontness": True, "roundness": True}
                elif char in "ei":
                    return {"frontness": True, "roundness": False}
                elif char in "ou":
                    return {"frontness": False, "roundness": True}
                elif char in "aı":
                    return {"frontness": False, "roundness": False}
```
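To make the difference between the two helpers concrete, the sketch below returns the vowel that each one would inspect. `nth_last_vowel` is a hypothetical helper written for this illustration, and `okumak`, `gelmek`, and `al` are example inputs chosen by us, not entries quoted from the data files.

```python
VOWELS = "öüeiouaı"


def nth_last_vowel(lemma, n):
    """Return the n-th vowel counting from the end of lemma, or None if there are fewer."""
    seen = 0
    for char in reversed(lemma):
        if char in VOWELS:
            seen += 1
            if seen == n:
                return char
    return None


# n=1 is the vowel n_vowel_categorize looks at; n=2 is the one v_vowel_categorize looks at
for lemma in ["okumak", "gelmek", "al"]:
    print(lemma, nth_last_vowel(lemma, 1), nth_last_vowel(lemma, 2))
```

Note that a lemma with a single vowel has no second-to-last vowel, so any code that indexes the result of `v_vowel_categorize` would need to handle `None` first.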

## Plural Suffix Correction for Nouns

### Construction of transducers
```python
turkish_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz " # Added " "(space) to include compounds like "sözdizimsel tuzlerde"
sigma = pynini.union(*turkish_alphabet).closure()
```
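The Turkish alphabet is defined as the lowercase letters plus a space character, so that multi-word compounds can be handled. `sigma` is the closure of the union of these characters, so it accepts any string built from them, not only valid Turkish words.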
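Because the rewrite rules are compiled over `sigma`, an input containing a character outside this alphabet (an uppercase letter, a hyphen, or a loanword letter such as `x`) is expected to produce no output path when composed with a rule, in which case taking `paths[0]` would fail. We have not tested this on the data; the sketch below is only a plain-Python pre-check, and `outside_alphabet` is a hypothetical helper that does not exist in `transducer_tur`.

```python
turkish_alphabet = "abcçdefgğhıijklmnoöprsştuüvyz "


def outside_alphabet(word):
    """Return the sorted characters of word that are not in turkish_alphabet."""
    return sorted(set(word) - set(turkish_alphabet))


print(outside_alphabet("sözdizimsel tuz"))  # every character is covered
print(outside_alphabet("Ankara"))           # the uppercase letter is not in the alphabet
```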

```python
plural_correction_rule_a = pynini.cdrewrite(
        pynini.cross("e", "a"),
        "l",
        "r",
        sigma)

plural_correction_rule_e = pynini.cdrewrite(
        pynini.cross("a", "e"),
        "l",
        "r",
        sigma)
```
Two rewrite rules are defined, both restricted to the context between `l` and `r`:
1. **`plural_correction_rule_a`**: Rewrites "e" as "a", turning `ler` into `lar` (the form required after back vowels).
2. **`plural_correction_rule_e`**: Rewrites "a" as "e", turning `lar` into `ler` (the form required after front vowels).

These rules ensure suffix alignment with Turkish vowel harmony principles.
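The effect of the two rules can be previewed without `pynini`. The sketch below is a regular-expression stand-in, not the transducers themselves: it assumes that `cdrewrite` is used with its default settings, so that every matching position is rewritten. `rule_a` and `rule_e` are hypothetical names.

```python
import re


def rule_a(word):
    """Regex stand-in for plural_correction_rule_a: e -> a between l and r."""
    return re.sub(r"(?<=l)e(?=r)", "a", word)


def rule_e(word):
    """Regex stand-in for plural_correction_rule_e: a -> e between l and r."""
    return re.sub(r"(?<=l)a(?=r)", "e", word)


print(rule_a("sözdizimsel tuzler"))
print(rule_e("küçüklüklar"))
print(rule_a("belerler"))  # made-up string: both l_r contexts are rewritten
```

The last call shows a consequence of the context being just `l _ r`: the rule is not anchored to the plural suffix, so a stem that happens to contain `ler` or `lar` would be rewritten as well.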


### Code for noun processing
```python
n_n_pl = 0 # counter for total number of words
c_n_pl = 0 # counter for correct words

for i, line in enumerate(tur_out):
  lemma, msd, inflected = line[:3]
  correct = dev[i]

  if msd.startswith("N") and '(PL' in msd:
    v_cat = n_vowel_categorize(lemma)
```
First, we filter the entries, processing only those whose `msd` tag starts with `N` (noun) and contains `(PL` (plural). Then we use the `n_vowel_categorize` function to determine the frontness of the lemma's last vowel.

```python
    if not v_cat["frontness"] and "ler" in inflected:
      output = inflected @ plural_correction_rule_a
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_n_pl += 1
      if correct == correction:
        c_n_pl += 1

    elif v_cat["frontness"] and "lar" in inflected:
      output = inflected @ plural_correction_rule_e
      paths = list(output.paths().ostrings())
      correction = paths[0]
      print(lemma, inflected, "→", correction, correct == correction)
      n_n_pl += 1
      if correct == correction:
        c_n_pl += 1

    else:
      correction = inflected

# Print the summary once, after the loop has finished
print("--------------------------------")
print("Number of words identified:", n_n_pl)
print("Number of words corrected:", c_n_pl)
```
Next, we apply the plural suffix rules: `plural_correction_rule_a` corrects `ler` to `lar` when the lemma's last vowel is a back vowel, and `plural_correction_rule_e` corrects `lar` to `ler` when it is a front vowel. Finally, we compare the corrected form with the expected form (`correct`) and track the number of corrections attempted (`n_n_pl`) and the number that match the expected form (`c_n_pl`).
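The two branches of the loop differ only in the rule they apply, so the decision can be factored into one function. The following is a minimal sketch under stated assumptions: it uses regular expressions as a stand-in for the `pynini` rules, `last_vowel_is_front` and `correct_plural` are hypothetical names, and the expected form `otellerini` is our assumption of the standard form rather than a value copied from `data/tur.dev`.

```python
import re

FRONT, BACK = "öüei", "ouaı"


def last_vowel_is_front(lemma):
    """Return True/False for the frontness of the last vowel, or None if there is none."""
    for char in reversed(lemma):
        if char in FRONT:
            return True
        if char in BACK:
            return False
    return None


def correct_plural(lemma, inflected):
    """Return (attempted, form): whether a rule fired, and the resulting form."""
    front = last_vowel_is_front(lemma)
    if front is False and "ler" in inflected:
        return True, re.sub(r"(?<=l)e(?=r)", "a", inflected)
    if front is True and "lar" in inflected:
        return True, re.sub(r"(?<=l)a(?=r)", "e", inflected)
    return False, inflected


# (lemma, inflected, expected); the third expected form is an assumption, see above
examples = [
    ("küçüklük", "küçüklüklar", "küçüklükler"),
    ("sözdizimsel tuz", "sözdizimsel tuzler", "sözdizimsel tuzlar"),
    ("otel", "otellarını", "otellerini"),
]

attempted = matched = 0
for lemma, inflected, expected in examples:
    fired, form = correct_plural(lemma, inflected)
    if fired:
        attempted += 1
        matched += form == expected
print("Number of words identified:", attempted)
print("Number of words corrected:", matched)
```

Checking `front is False` rather than `not front` keeps a lemma without vowels (`None`) out of both branches.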

### Sample output
The output shows each lemma, its original inflected form, the corrected form, and whether the correction matches the expected form (`True` or `False`).

Example:
```
anahtarlık anahtarlıklerimize → anahtarlıklarimize False
asimptot asimptotlerimizden → asimptotlarimizden False
küçüklük küçüklüklar → küçüklükler True
otel otellarını → otellerını False
sözdizimsel tuz sözdizimsel tuzler → sözdizimsel tuzlar True
```

At the end, it prints the number of cases processed (`n_n_pl`) and the number of successful corrections (`c_n_pl`):
```
Number of words identified: 31
Number of words corrected: 2
```

Only **2 out of 31** cases were corrected successfully.
- Reason for low success rate: The rules correct the plural suffix (`lar`/`ler`) but leave subsequent suffixes (e.g., possessive or case suffixes) unprocessed. These suffixes must also follow vowel harmony, which the current implementation does not address. In the sample above, `otellarını` becomes `otellerını`, where the final `ını` still carries back vowels after a front-vowel stem.
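The failure pattern described above can be made visible with a small diagnostic. This is a sketch for inspection only, not a fix: `harmony_mismatches` is a hypothetical helper, it looks at frontness alone, and it assumes the plural suffix is the first occurrence of `ler` or `lar` in the word.

```python
FRONT, BACK = "öüei", "ouaı"


def harmony_mismatches(word, plural):
    """List the vowels after the plural suffix whose frontness differs from the suffix vowel."""
    start = word.find(plural)  # assumption: first occurrence is the plural suffix
    if start == -1:
        return []
    suffix_is_front = plural[1] in FRONT  # "ler" -> True, "lar" -> False
    tail = word[start + len(plural):]
    return [
        ch for ch in tail
        if (ch in FRONT and not suffix_is_front) or (ch in BACK and suffix_is_front)
    ]


# Corrected forms taken from the sample output above
print(harmony_mismatches("otellerını", "ler"))
print(harmony_mismatches("anahtarlıklarimize", "lar"))
print(harmony_mismatches("küçüklükler", "ler"))
```

Forms with an empty result are the ones where the plural suffix is final, which is consistent with the two successful cases in the sample.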

## Progressive Suffix Correction for Verbs

### Construction of transducers

### Code for verb processing