# Turkish Diacritisation | YZV 405E NLP Term Project

Author: Bora Boyacıoğlu

Student ID: 150200310

## Step 4: Rule-Based Algorithm

In this notebook, my aim is to develop a rule-based algorithm that finds the words which actually need to be replaced. This should substantially improve the accuracy, as most words do not require any change at all, and most of the ones that do have only one correspondence. In the code, `acronyms` is the name of the mapping from an ASCII form to its diacritised variants. The logic is as follows:

1. A word that has no diacritised variant in the vocabulary does not change. *(e.g. bilgisayar)*
2. If a word has only one correspondence, look it up in the vocabulary and replace it with that one. *(e.g. sinif $\rightarrow$ sınıf)*
3. If a word has more than one diacritised variant, list every possible variant. *(e.g. aci $\rightarrow$ {acı, açı})*
    * Create a pool of candidates for each such word.
    * 3.1. Give the sentence to the model. If the predicted sentence contains any variant of that word, use it.
    * 3.2. If not, replace the word with the most probable variant.
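
The three rules above can be sketched as a single lookup function. This is only a minimal illustration, not the final implementation: `resolve_word` and `predicted_tokens` are hypothetical names, the counts are made up, and the table is assumed to have the same nested shape as the `probs` dictionary built later in this notebook.

```python
def resolve_word(token, probs, predicted_tokens=()):
    """Pick a diacritised form for one ASCII token (hypothetical helper)."""
    candidates = probs.get(token)

    # Rule 1: no known diacritised variant, keep the token unchanged.
    if not candidates:
        return token

    # Rule 2: exactly one variant, replace directly.
    if len(candidates) == 1:
        return next(iter(candidates))

    # Rule 3.1: prefer a variant that the model also predicted.
    for predicted in predicted_tokens:
        if predicted in candidates:
            return predicted

    # Rule 3.2: fall back to the variant with the highest count.
    return max(candidates, key=candidates.get)


# Toy table with made-up counts (assumption, not real train statistics).
probs = {
    "sinif": {"sınıf": 1},
    "aci": {"acı": 12, "açı": 7},
}

print(resolve_word("bilgisayar", probs))           # rule 1
print(resolve_word("sinif", probs))                # rule 2
print(resolve_word("aci", probs, ["dik", "açı"]))  # rule 3.1
print(resolve_word("aci", probs))                  # rule 3.2
```

Keeping the model out of the function and passing only its predicted tokens makes the rule order easy to test in isolation.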

Import necessary libraries.

In [1]:
import json

from IPython.display import Markdown as md
from unidecode import unidecode

from dataset import DiacritizationDataset
from utils.main_utils import *

In [2]:
%load_ext autoreload
%autoreload 2

### Creating a Vocabulary

I will be using two vocabularies:
1. Turkish Dictionary **[1]**
2. Train Data

#### Load the Turkish Dictionary

The file appears to be malformed JSON, so I extract each word manually by splitting every line.

In [3]:
# Define Turkish character transformation.
tr_maps = str.maketrans({'â': 'a', 'î': 'i', 'û': 'ü'})

# Define the normalization and tokenization function.
nt = lambda x: tokenize_text(normalize_str(x.translate(tr_maps)))

In [4]:
# Load the dataset.
with open('data/ext/gts.json', encoding='utf-8') as f:
    data = f.read().splitlines()
    words1 = []
    
    for line in data:
        # Skip lines which may cause problems, if any.
        if '"madde":"' not in line:
            continue
        
        # Get the word.
        madde = line.split('"madde":"')[1].split('"')[0]
        
        # Normalize and tokenize the word.
        madde = nt(madde)
                
        words1.extend(madde)
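
To show the splitting on a single line, here is a small sketch. The sample line is hypothetical (the real file may contain other fields), and the regular expression is only an alternative I am assuming to be equivalent for simple values without escaped quotes.

```python
import re

# Hypothetical line in the shape the loader expects; the real file may differ.
line = '{"id":"1","madde":"açı"}'

# Approach used above: cut after the key, then stop at the next quote.
by_split = line.split('"madde":"')[1].split('"')[0]

# Alternative with a regular expression capturing everything up to the next quote.
match = re.search(r'"madde":"([^"]*)"', line)
by_regex = match.group(1) if match else None

print(by_split, by_regex)
```

Both approaches would cut a value short if it contained an escaped quote, so they are only suitable for plain headwords.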

In [5]:
print("Length:", len(words1))

Length: 148270


#### Load the Train Data

This load differs from the first part: I only apply the early preprocessing steps (filtering, normalising, and tokenising), because the vocabulary is created differently here.

In [6]:
# Load the dataset.
train_data = DiacritizationDataset('data/train.csv', type='train')

# Normalize the train data.
normalize(train_data)

# Tokenize the train data.
tokenize(train_data)

Normalizing text 100.00%
Tokenizing... 100.00%


In [7]:
# Get the words.
words2 = []

for sent in train_data.diacritized:
    words2.extend(sent)

In [8]:
print("Length:", len(words2))

Length: 689839


#### Combine the Vocabularies

In [9]:
# Get the unique words.
words = list(set(words1 + words2))

In [10]:
j1 = len(words)
print("1. Length:", j1)

1. Length: 150119


### Making the Combinations

In [11]:
# Build the acronyms.
acronyms = {}

for word in words:
    undiacritized = unidecode(word)
    
    # If the word only contains ASCII characters, skip it.
    if undiacritized == word:
        continue
    
    # Add the undiacritized word to the acronyms.
    if undiacritized not in acronyms:
        acronyms[undiacritized] = [word]
    else:
        acronyms[undiacritized].append(word)
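
As a small illustration of the grouping above, the sketch below runs the same idea on a toy vocabulary. It uses a hand-written translation table as a stand-in for `unidecode` (an assumption, covering only the Turkish-specific lowercase letters), and `group_by_ascii` is a hypothetical name.

```python
from collections import defaultdict

# Stand-in for `unidecode`, covering only the Turkish-specific lowercase letters.
TO_ASCII = str.maketrans("çğıöşü", "cgiosu")


def group_by_ascii(vocabulary):
    """Map each ASCII form to the diacritised words that collapse onto it."""
    groups = defaultdict(list)
    for word in vocabulary:
        key = word.translate(TO_ASCII)
        # Words that are already plain ASCII are skipped, as in the cell above.
        if key != word:
            groups[key].append(word)
    return dict(groups)


toy_vocab = ["acı", "açı", "sınıf", "bilgisayar"]
print(group_by_ascii(toy_vocab))
# {'aci': ['acı', 'açı'], 'sinif': ['sınıf']}
```

Here `aci` ends up with two variants and therefore needs a prediction, while `sinif` has a single variant and `bilgisayar` is not stored at all.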

In [12]:
j2 = len(acronyms)
print("2. Acronyms:", j2)

2. Acronyms: 85075


In [13]:
# Count the ASCII forms which have more than one acronym.
# The loop variable is named `variants` so the vocabulary list `words` is not overwritten.
plural = 0
for variants in acronyms.values():
    if len(variants) > 1:
        plural += 1

In [14]:
j3 = plural
print("3. Words with more than one acronym:", j3)

3. Words with more than one acronym: 833


In [15]:
# Count the total acronyms.
total_acronyms = 0
total_plural = 0
for variants in acronyms.values():
    total_acronyms += len(variants)
    if len(variants) > 1:
        total_plural += len(variants)

In [16]:
j4 = total_plural
print("4. Total plural acronym count:", j4)

4. Total plural acronym count: 1702


As we can see, the vocabulary contains 150119 unique words **(1)**, and 85075 distinct ASCII forms **(2)** map to at least one diacritised word. This corresponds to $56.67\%$ of the vocabulary size. Out of these 85075 forms, only 833 **(3)** map to more than one diacritised word, which is $0.98\%$ of them. These ambiguous forms cover 1702 diacritised words **(4)** in total, out of 150119. To conclude, only $1.13\%$ of the words require a prediction to choose the right form.
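
The percentages can be reproduced from the four printed counts. The variable names below are my own; in the notebook the same values are stored in `j1` to `j4`.

```python
unique_words = 150119    # (1) unique words in the vocabulary
ascii_forms = 85075      # (2) ASCII forms with at least one diacritised word
ambiguous_forms = 833    # (3) ASCII forms with more than one diacritised word
ambiguous_words = 1702   # (4) diacritised words under those ambiguous forms

share_diacritised = 100 * ascii_forms / unique_words
share_ambiguous = 100 * ambiguous_forms / ascii_forms
share_needing_prediction = 100 * ambiguous_words / unique_words

print(f"{share_diacritised:.2f}%")         # 56.67%
print(f"{share_ambiguous:.2f}%")           # 0.98%
print(f"{share_needing_prediction:.2f}%")  # 1.13%
```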

### Counting the Probabilities

Now, the goal is to count the occurrences of each acronym in the Train Data. I only take the train data into account because I need the chance of a word occurring in real text. The first source is, as its name says, a dictionary, which lists words without reflecting how often they are used.

In [17]:
# Count the acronyms.
from collections import Counter

# Count every train token in a single pass instead of scanning the list per word.
train_counts = Counter(words2)

probs = {}
index = 0
for acronym, variants in acronyms.items():
    if len(variants) == 1:
        # Unambiguous form: no counting needed.
        probs[acronym] = {variants[0]: 1}
        index += 1
    else:
        # Ambiguous form: store the train count of each variant (0 if unseen).
        probs[acronym] = {}
        for word in variants:
            probs[acronym][word] = train_counts[word]
            index += 1
    
    print(f'Counting... {100 * index / total_acronyms:.2f}%', end='\r')

Counting... 100.00%

#### Save the Probabilities

Lastly, save the counts into a nested JSON dictionary. They are raw occurrence counts rather than normalised probabilities.

In [18]:
# Save the probabilities.
with open('data/comb/probs.json', 'w', encoding='utf-8') as f:
    json.dump(probs, f, ensure_ascii=False)
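
As a sketch of how the saved file might be consumed later, the example below writes a toy table with made-up counts to a temporary file, reads it back, and picks the most frequent variant for each ASCII form. The file location and the `best` dictionary are assumptions for illustration only.

```python
import json
import os
import tempfile

# Toy counts with made-up numbers, in the same nested shape as `probs`.
toy_probs = {"aci": {"acı": 12, "açı": 7}, "sinif": {"sınıf": 1}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probs.json")

    # Write with ensure_ascii=False so Turkish letters stay readable in the file.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(toy_probs, f, ensure_ascii=False)

    # Read the file back with the same encoding.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Most frequent variant per ASCII form.
best = {key: max(counts, key=counts.get) for key, counts in loaded.items()}
print(best)  # {'aci': 'acı', 'sinif': 'sınıf'}
```

Reading with an explicit `encoding='utf-8'` matters here because the file was written with `ensure_ascii=False`.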

### References

**[1] Turkish Dictionary:**

MIT License

Copyright (c) 2021 Kemal Ogun Isik

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

GitHub @ guncel-turkce-sozluk (https://github.com/ogun/guncel-turkce-sozluk)