# Turkish Diacritization with Deep Learning

**Emirberk Almacı / 150220751**                
 **Furkan Öztürk / 150200312**


This project uses deep learning to diacritize Turkish text, that is, to restore the diacritical marks (ç, ğ, ı, ö, ş, ü) that are missing from ASCII-only input. In Turkish these marks can change the meaning of a word: for example, *oldu* means "it happened" while *öldü* means "he/she died". Accurate diacritization is therefore essential for interpreting text correctly. The goal is a deep learning model that restores these marks accurately, thereby improving text understanding and analysis.

## Requirements

In [1]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset,DataLoader, TensorDataset, RandomSampler, SequentialSampler

## Preprocessing

The words in the data set are collected in a single list.

In [2]:
# Read the train data
df = pd.read_csv('train.csv')

# Create a list to hold all words
all_words = []

# For each sentence in the data
for sentence in df['Sentence']:
    # Split the sentence into words
    tokens = sentence.split()
    # Add each word to the all_words list
    for word in tokens:
        all_words.append(word)

# Keep only unique words in the list
all_words = list(set(all_words))

print(len(all_words))

95845


The frequency of every character in the data set is examined. The data contains some characters that are not Turkish letters.

In [3]:
# Concatenate all sentences into a single string
all_sentence = ''.join(df['Sentence'])

# Get unique characters from the concatenated string
unique_characters = set(all_sentence)

# Create an empty dictionary to store character counts
character_count = {}

# Count occurrences of each character in the concatenated string
for char in unique_characters:
    character_count[char] = all_sentence.count(char)

# Sort the dictionary items by value (character counts) in descending order
sorted_character_count = sorted(character_count.items(), key=lambda x: x[1], reverse=True)

print("Number of occurrences of each unique character:")
# Print the sorted character counts
for char, count in sorted_character_count:
    print(f"{char}: {count}")  # Print sorted character counts

Number of occurrences of each unique character:
 : 771171
a: 493191
e: 389428
i: 364900
n: 316712
r: 289752
l: 281941
ı: 202301
k: 194806
d: 177560
m: 151554
t: 149938
s: 140532
y: 132729
u: 132662
o: 110084
b: 103331
ü: 77913
ş: 68650
z: 60695
g: 54272
ç: 45404
v: 44372
ğ: 43411
c: 42446
h: 42224
p: 37039
ö: 33894
f: 21628
.: 16701
,: 15097
1: 9429
0: 8092
2: 5164
/: 4000
j: 3676
:: 3445
5: 3443
): 3012
<: 2992
>: 2979
q: 2965
(: 2938
9: 2756
-: 2357
ä: 2233
4: 2179
6: 2170
3: 2052
’: 2016
': 1806
±: 1694
8: 1657
w: 1473
7: 1306
?: 1243
ã: 1195
°: 1187
;: 1014
: 965
â: 944
¼: 653
å: 591
x: 575
!: 400
: 399

: 399
î: 290
§: 285
»: 280
": 245
¶: 174
=: 165
–: 157
%: 108
û: 95
[: 83
`: 79
é: 68
…: 66
*: 66
]: 63
: 59
: 59
|: 55
á: 50
•: 34
: 31
«: 31
: 30
­: 28
~: 28
: 22
@: 18
M: 18
´: 15
&: 14
¤: 14
_: 12
ó: 11
½: 11
: 9
ê: 9
š: 6
”: 6
“: 6
$: 6
}: 5
′: 5
{: 4
—: 4
è: 4
ï: 4
ñ: 3
#: 3
‘: 2
®: 2
ô: 2
\: 2
ß: 2
ú: 2
: 1
œ: 1
″: 1
ø: 1
∙: 1
¦: 1
ē: 1
ì: 1
ţ: 1
¢: 1
¹: 1
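
The counts above are obtained by calling `str.count` once per unique character, which rescans the whole text each time. As a sketch of an equivalent single-pass alternative, `collections.Counter` produces the same kind of table. The `sentences` list below is a hypothetical stand-in for `df['Sentence']`, not data from the project.

```python
from collections import Counter

# Hypothetical stand-in for df['Sentence']; not taken from the data set
sentences = ["üye girişi", "son güncelleme"]

# Count every character in a single pass over the concatenated text
character_count = Counter("".join(sentences))

# most_common() returns (character, count) pairs sorted by descending count
for char, count in character_count.most_common(5):
    print(f"{char}: {count}")
```

On the full data set this would replace both the counting loop and the explicit sort.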


The choice of character set greatly affects the performance of the model. For this reason, a separate index value is assigned only to each letter, and all other characters share a single index value.

In [4]:
turkish_letters = ["a", "b", "c", "ç", "d", "e", "f", "g", "ğ", "h", "ı", "i", "j", "k", "l",
                   "m", "n", "o", "ö", "p", "r", "s", "ş", "t", "u", "ü", "v", "y", "z", "w",
                   "q", "x", "î", "û", "â"]

punctuation = [".", ",", "!", "?", ";", ":", "'", "(", ")", "°", "/", "[", "]", "{", "}", ">", "<", "...", "="]
numbers = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]

valid_char = turkish_letters

# Create a dictionary mapping characters to indices
char_to_index = {char: i + 1 for i, char in enumerate(valid_char)}

# Create a dictionary mapping indices to characters
index_to_char = {i: char for char, i in char_to_index.items()}

In [5]:
all_words[:10]

['kazandıktan',
 'kullanılırsa',
 "cabriolet'de",
 'yönlerinden',
 'saptık',
 'fahr',
 'sapmadın',
 'kıymetleştirme',
 'ödeyip',
 'müdahalesi']

Nearly all words are at most 20 characters long. For this reason, 20 was set as the limit and longer words were removed from the data set.

In [6]:
# Filter words longer than 20 characters
all_words = [word for word in all_words if len(word) <= 20]

# Print the number of remaining words
print("Number of words with length less than or equal to 20:", len(all_words))

Number of words with length less than or equal to 20: 95530


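Some words contain characters other than Turkish letters. These words are also removed from the data set. Because `turkish_letters` contains only lowercase letters, words with uppercase letters or punctuation are removed as well.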

In [7]:
# Filter words that contain only Turkish letters
all_words = [word for word in all_words if all(char in turkish_letters for char in word)]

# Print the number of remaining words
print("Number of words containing only Turkish letters:", len(all_words))

Number of words containing only Turkish letters: 88630


In [8]:
all_words[:10]

['kazandıktan',
 'kullanılırsa',
 'yönlerinden',
 'saptık',
 'fahr',
 'sapmadın',
 'kıymetleştirme',
 'ödeyip',
 'müdahalesi',
 'suyunda']

In addition to the data set provided for the project, a data set of Turkish words sourced from the internet was also used during model training.

In [9]:
def read_txt(file_path):
    # Create an empty list to store the data
    data = []
    # Open the file in read mode
    with open(file_path, 'r', encoding='utf-8') as file:
        # Read each line in the file
        for line in file:
            # Strip any leading or trailing whitespaces and append the line to the data list
            data.append(line.strip())
    # Return the data
    return data

# Define the file path
file_path = "random_data.txt"
# Call the read_txt function with the file path
random_data = read_txt(file_path)
# Print the first 10 lines of the data
print(random_data[:10])  # Print the first 10 lines

['pirelerinde', 'sevişmenle', 'topaklasan', 'odacılarla', 'hortlamamış', 'sakındılarsa', 'virajındayız', 'sapladılar', 'indirdiğinize', 'aranman']


A larger data set is created by combining the two data sets (the provided one and the additional one).

In [10]:
# Add words from random data as well
all_words = all_words + random_data

# Print the length of all words
print(len(all_words))

203332


To create the model inputs, each diacritic character in the correct words is replaced with its ASCII counterpart (for example, `ş` becomes `s`).

In [11]:
all_words_diac = all_words.copy()

turkish_diacritic_mapping = {
'ç': 'c',
'ğ': 'g',
'ı': 'i',
'ö': 'o',
'ş': 's',
'ü': 'u',
'Ç': 'C',
'Ğ': 'G',
'İ': 'I',
'Ö': 'O',
'Ş': 'S',
'Ü': 'U',
}

In [12]:
# Create a copy of all_words
all_words_diac = all_words.copy()

# Replace diacritic characters with their ASCII counterparts
for i, word in enumerate(all_words_diac):
    new_word = ""
    for char in word:
        new_word += turkish_diacritic_mapping.get(char, char)
    all_words_diac[i] = new_word

In [13]:
all_words_diac[:10]

['kazandiktan',
 'kullanilirsa',
 'yonlerinden',
 'saptik',
 'fahr',
 'sapmadin',
 'kiymetlestirme',
 'odeyip',
 'mudahalesi',
 'suyunda']
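
The same character-by-character replacement can also be expressed with `str.maketrans` and `str.translate` from the standard library. This is a minimal sketch of an equivalent approach, not the code that produced the output above. The mapping repeats the lowercase entries of `turkish_diacritic_mapping`, and `strip_diacritics` is a hypothetical helper name.

```python
# Lowercase entries of the mapping used above
turkish_diacritic_mapping = {"ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u"}

# Build the translation table once; it can be reused for every word
translation_table = str.maketrans(turkish_diacritic_mapping)

def strip_diacritics(word):
    """Return the word with Turkish diacritics replaced by their ASCII counterparts."""
    return word.translate(translation_table)

print([strip_diacritics(w) for w in ["kazandıktan", "yönlerinden", "ödeyip"]])
```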

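The model inputs are built so that roughly 70% of the words (the first 140,000 of 203,332) have their diacritics stripped, while the remaining 30% keep them. For the latter group the input and the target are identical, so the model also sees words that are already correct.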

In [14]:
train = all_words_diac[:140000] + all_words[140000:]
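
The split proportions follow directly from the sizes printed earlier in the notebook. The short calculation below is only an arithmetic check; the variable names are chosen for illustration.

```python
# Sizes taken from the notebook: 203332 words in total, the first 140000 stripped
total_words = 203332
stripped_count = 140000

stripped_fraction = stripped_count / total_words
kept_fraction = 1 - stripped_fraction

print(f"stripped inputs: {stripped_fraction:.3f}, unchanged inputs: {kept_fraction:.3f}")
```

Note that the split is positional: the slice takes the first 140,000 entries of `all_words` in their existing order, without shuffling.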

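The function `words2index()` converts words into index sequences using the `char_to_index` mapping. It iterates over each word in the input list and replaces each character with its index. A character that is not present in `char_to_index` falls back to `len(char_to_index)`. Note that this value equals the index of the last letter in the mapping (`â`), so it is not a distinct "unknown" index. Words from the provided data set never reach the fallback, since those containing other characters were removed earlier.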

The function returns a list of lists, where each inner list contains the indices of characters in a word. These lists are then used to create a padded tensor using PyTorch's `pad_sequence()` function, ensuring that all sequences have the same length for batch processing in neural networks. This padded tensor represents the input data for the model.

In [15]:
def words2index(all_words):
    # Convert words to indices
    all_words_index = []
    for word in all_words:
        index = []
        for char in word:
            if char in char_to_index.keys():
                index.append(char_to_index[char])
            else:
                index.append(len(char_to_index))
        all_words_index.append(index)
    return all_words_index

In [16]:
# Convert all words to indices and pad the sequences
all_words_index = words2index(all_words)
tensor_list = [torch.tensor(item) for item in all_words_index]
words_padded_tensor = pad_sequence(tensor_list, batch_first=True)

# Convert training data to indices and pad the sequences
train_index = words2index(train)
train_tensor = [torch.tensor(item) for item in train_index]
words_diac_padded_tensor = pad_sequence(train_tensor, batch_first=True)
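
To make the indexing and padding steps concrete, the sketch below applies them to two short words. The mapping `toy_char_to_index` is an assumption made for illustration; the notebook uses the full `char_to_index` built from `turkish_letters`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Toy mapping for illustration only; indices start at 1 as in the notebook
toy_char_to_index = {"a": 1, "k": 2, "s": 3, "t": 4}

words = ["kasa", "at"]
index_lists = [[toy_char_to_index[c] for c in word] for word in words]

# pad_sequence fills shorter sequences with 0 (its default padding_value)
padded = pad_sequence([torch.tensor(ix) for ix in index_lists], batch_first=True)

print(padded.shape)  # (number of words, length of the longest word)
print(padded)
```

Because letter indices start at 1, the value 0 stays free to mark padding positions.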

In [17]:
# Check if the tensors are equal
words_diac_padded_tensor == words_padded_tensor

tensor([[ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True, False,  True,  ...,  True,  True,  True],
        ...,
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True],
        [ True,  True,  True,  ...,  True,  True,  True]])
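
The element-wise comparison above is hard to read at this size. One possible way to summarize it, sketched here on toy tensors rather than the real data, is to count the positions and words where input and target differ.

```python
import torch

# Toy padded index tensors standing in for the input and target tensors
inputs = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])
targets = torch.tensor([[1, 7, 3, 0], [4, 5, 0, 0]])

# Boolean mask of positions where input and target differ
changed = inputs != targets

print("changed positions:", changed.sum().item())
print("words with at least one change:", changed.any(dim=1).sum().item())
```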

The model used in this project is a recurrent neural network built with PyTorch: an embedding layer followed by three stacked bidirectional LSTM blocks and a linear output layer. Recurrent networks are well suited to sequence prediction problems. The model takes a sequence of character indices as input and outputs, for each position, a score for every possible output character.

Each LSTM processes the input sequence one character at a time, maintaining an internal state that encodes information about the characters it has seen so far. Because the layers are bidirectional, the sequence is read in both directions, so the output at each position depends on the characters both before and after it.

The model is trained with a cross-entropy loss that measures the difference between the predicted and the actual character at each position. The model parameters are updated using the AdamW optimizer.
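
The tensor shapes involved can be followed in a reduced sketch with a single bidirectional LSTM. The sizes below are small, hypothetical values chosen for illustration and the inputs are random, so this shows only the data flow, not the trained model.

```python
import torch
import torch.nn as nn

# Small, hypothetical sizes chosen for illustration (the notebook uses larger ones)
vocab_size, embedding_dim, hidden_size, n_classes = 36, 8, 16, 36
batch_size, seq_len = 4, 5

embedding = nn.Embedding(vocab_size, embedding_dim)
lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True, bidirectional=True)
linear = nn.Linear(hidden_size * 2, n_classes)

# Random character indices standing in for a batch of padded words
inputs = torch.randint(0, vocab_size, (batch_size, seq_len))
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

lstm_out, _ = lstm(embedding(inputs))  # (batch, seq_len, 2 * hidden_size)
logits = linear(lstm_out)              # (batch, seq_len, n_classes)

# CrossEntropyLoss expects (N, n_classes) logits and (N,) targets,
# so the batch and sequence dimensions are flattened together
loss = nn.CrossEntropyLoss()(logits.reshape(-1, n_classes), targets.reshape(-1))
print(lstm_out.shape, logits.shape, loss.item())
```

The factor of 2 in the LSTM output size comes from concatenating the forward and backward hidden states, which is why the linear layer takes `hidden_size * 2` input features.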

In [18]:
class MyDataset(Dataset):
    def __init__(self, tensor1, tensor2):
        self.tensor1 = tensor1
        self.tensor2 = tensor2

    def __len__(self):
        return len(self.tensor1)

    def __getitem__(self, idx):
        return self.tensor1[idx], self.tensor2[idx]

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_size, num_layers, n_classes):
        super(RNN, self).__init__()

        # Embedding layer
        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        # Bidirectional LSTM layers
        self.lstm1 = nn.LSTM(embedding_dim, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=True)
        self.lstm2 = nn.LSTM(hidden_size * 2, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=True)
        self.lstm3 = nn.LSTM(hidden_size * 2, hidden_size, num_layers=num_layers, batch_first=True, bidirectional=True)

        # Linear layer
        self.linear = nn.Linear(hidden_size * 2, n_classes)

    def forward(self, sentences):
        # Embedding
        embeddings = self.embedding(sentences)

        # LSTM layers
        lstm_out1, _ = self.lstm1(embeddings)
        lstm_out2, _ = self.lstm2(lstm_out1)
        lstm_out3, _ = self.lstm3(lstm_out2)

        # Linear layer
        output = self.linear(lstm_out3)

        return output

In [19]:
# Hyperparameters
vocab_size = len(char_to_index) + 1
embedding_dim = 512
hidden_size = 256
num_layers = 2
n_classes = len(char_to_index) + 1

# Create dataset and data loader
dataset = MyDataset(words_diac_padded_tensor, words_padded_tensor)
train_loader = DataLoader(dataset, batch_size=64, shuffle=True)

In [20]:
# Initialize model
model = RNN(vocab_size, embedding_dim, hidden_size, num_layers, n_classes)

# Loss function
criterion = nn.CrossEntropyLoss()

# Optimizer
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Move model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

RNN(
  (embedding): Embedding(36, 512)
  (lstm1): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (lstm2): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (lstm3): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=36, bias=True)
)

In [21]:
# Select device
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("CUDA is available and being used.")
else:
    device = torch.device("cpu")
    print("CUDA is not available. CPU is being used.")

# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1, n_classes), targets.view(-1))
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {running_loss / len(train_loader)}")

CUDA is available and being used.
Epoch 1/20, Loss: 0.07776058879238422
Epoch 2/20, Loss: 0.01962048771701486
Epoch 3/20, Loss: 0.015858757008695924
Epoch 4/20, Loss: 0.013480453057738597
Epoch 5/20, Loss: 0.011813699869332627
Epoch 6/20, Loss: 0.010702121578938814
Epoch 7/20, Loss: 0.00982010654845469
Epoch 8/20, Loss: 0.00908491605163222
Epoch 9/20, Loss: 0.008565407686901547
Epoch 10/20, Loss: 0.008050132277946952
Epoch 11/20, Loss: 0.0075459962451679295
Epoch 12/20, Loss: 0.007197051038304749
Epoch 13/20, Loss: 0.006910586844934628
Epoch 14/20, Loss: 0.006597699947145142
Epoch 15/20, Loss: 0.0064644568533713705
Epoch 16/20, Loss: 0.006170431090188561
Epoch 17/20, Loss: 0.006093562593218892
Epoch 18/20, Loss: 0.00588511824925079
Epoch 19/20, Loss: 0.005759246986557682
Epoch 20/20, Loss: 0.005684267218555341
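
The log above reports only the training loss. As a complementary check, one could also measure character-level accuracy while ignoring padding positions. The helper below is a hypothetical sketch, not part of the original notebook; it assumes index 0 marks padding, as produced by `pad_sequence`.

```python
import torch

def char_accuracy(logits, targets, pad_index=0):
    """Fraction of non-padding positions where the argmax prediction matches the target."""
    predictions = logits.argmax(dim=-1)
    mask = targets != pad_index
    correct = (predictions == targets) & mask
    return correct.sum().item() / mask.sum().item()

# Toy example: 1 word, 3 positions (the last one is padding), 4 classes
logits = torch.tensor([[[0.1, 2.0, 0.3, 0.0],
                        [0.1, 0.2, 0.3, 3.0],
                        [5.0, 0.0, 0.0, 0.0]]])
targets = torch.tensor([[1, 2, 0]])
print(char_accuracy(logits, targets))
```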


#### Save the Model

Since training the model takes a long time and requires suitable hardware, the trained model is saved for later use.

In [None]:
# Save the trained model
torch.save(model.state_dict(), "turkish_diacritic_model.pth")
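
Saving a `state_dict` stores only the weights, so the same architecture has to be rebuilt before loading. The round trip is sketched below with a tiny stand-in module and a temporary file; the file name and the `nn.Linear` model are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in module; the notebook saves the RNN class defined above instead
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_model.pth")  # hypothetical file name
    torch.save(model.state_dict(), path)

    # Rebuild the same architecture, then load the saved weights into it;
    # map_location="cpu" lets weights saved on a GPU load on a CPU-only machine
    restored = nn.Linear(4, 2)
    restored.load_state_dict(torch.load(path, map_location="cpu"))

same_weights = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same_weights)
```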

### Test the Model

#### Load Pre-trained Model

In [32]:
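# Select the device first so the saved weights can be mapped onto it
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load the saved model; map_location allows GPU-trained weights to load on a CPU-only machine
model = RNN(vocab_size, embedding_dim, hidden_size, num_layers, n_classes)
model.load_state_dict(torch.load("turkish_diacritic_model.pth", map_location=device))
model.eval()

# Move model to device
model.to(device)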

RNN(
  (embedding): Embedding(36, 512)
  (lstm1): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (lstm2): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (lstm3): LSTM(512, 256, num_layers=2, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=36, bias=True)
)

Diacritized lowercase letters must be converted back to uppercase wherever the input was uppercase. This is done with the `upper_mapping` dictionary.
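
The reason a dedicated mapping is needed is that Python's built-in `str.upper()` is not locale-aware: it maps `i` to `I`, whereas the Turkish capital of `i` is `İ`. The sketch below illustrates this; `turkish_upper` is a hypothetical helper name, and the dictionary repeats the entries of `upper_mapping`.

```python
# Python's default upper-casing is not locale-aware
print("i".upper())  # dotless capital I, which in Turkish is the capital of "ı"
print("ı".upper())  # also maps to I

upper_mapping = {"i": "İ", "ç": "Ç", "ö": "Ö", "ü": "Ü", "ş": "Ş"}

def turkish_upper(char):
    """Upper-case one character, using the Turkish mapping where it exists."""
    return upper_mapping.get(char, char.upper())

print("".join(turkish_upper(c) for c in "istanbul"))
```

Only `i` strictly requires the special case; for the other entries `str.upper()` returns the same result as the mapping.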





In [22]:
upper_mapping = {"i": "İ",
                 "ç": "Ç",
                 "ö": "Ö",
                 "ü": "Ü",
                 "ş": "Ş"}

The `predict()` function splits a sentence into words and each word into characters, converts every character to its index, and passes the indices to the model. The model returns a predicted index for each input position, and the output word is built by mapping these indices back to characters. Words containing characters outside `turkish_letters` are passed through in lowercase without being given to the model.

In [23]:
def predict(model, input_sentence, char_to_index, index_to_char):
    model.eval()
    output_sentence = ""

    with torch.no_grad():
        for word in input_sentence.split():
            # Create a list of booleans indicating whether each character is uppercase
            is_uppercase = [char.isupper() for char in word]
            word = word.lower()

            if not all(char in turkish_letters for char in word):
                output_sentence += word + " "
                continue

            input_indices = []
            for char in word:
                if char in char_to_index.keys():
                    input_indices.append(char_to_index[char])
                else:
                    input_indices.append(len(char_to_index))

            # Convert input word to indices
            input_tensor = torch.LongTensor(input_indices).unsqueeze(0)

            # Move input tensor to device
            input_tensor = input_tensor.to(device)

            # Forward pass
            output = model(input_tensor)

            # Get predicted classes
            _, predicted = torch.max(output, 2)

            # Convert predicted indices to characters
            predicted_word = ''.join([index_to_char[idx.item()] for idx in predicted[0]])

            # Convert characters to uppercase where necessary
            predicted_word_upper = ""
            for char, is_upper in zip(predicted_word, is_uppercase):
                if is_upper:
                    # Use the Turkish-specific mapping where defined, otherwise str.upper()
                    char = upper_mapping.get(char, char.upper())
                predicted_word_upper += char

            output_sentence += predicted_word_upper + " "

    return output_sentence.strip()

# Example usage
input_sentence = "Istanbul zavalli adami oracikta astilar "
output_sentence = predict(model, input_sentence, char_to_index, index_to_char)
print("Predicted sentence:", output_sentence)

Predicted sentence: İstanbul zavallı adamı oracıkta astılar
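
The decoding step inside `predict()` (taking the highest-scoring class at each position and mapping it back to a character) can be seen in isolation in the sketch below. The scores and the small `toy_index_to_char` vocabulary are made up for illustration and do not come from the trained model.

```python
import torch

# Toy vocabulary for illustration; index 0 is left for padding as in the notebook
toy_index_to_char = {1: "a", 2: "ı", 3: "ş", 4: "t"}

# Hypothetical scores for one word of 3 characters over 5 classes (0..4)
output = torch.tensor([[[0.0, 2.5, 0.2, 0.1, 0.3],
                        [0.0, 0.1, 0.2, 3.0, 0.3],
                        [0.0, 0.1, 1.9, 0.1, 0.2]]])

# torch.max over the class dimension returns (values, indices)
_, predicted = torch.max(output, 2)

decoded = "".join(toy_index_to_char[idx.item()] for idx in predicted[0])
print(decoded)
```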


Using the `predict()` function, the model is applied to every sentence in the test set, and each sentence is replaced by its diacritized prediction.

In [24]:
df_test = pd.read_csv("test.csv")
df_test

Unnamed: 0,ID,Sentence
0,0,tr ekonomi ve politika haberleri turkiye nin ...
1,1,uye girisi
2,2,son guncelleme 12:12
3,3,Imrali Mit gorusmesi ihtiyac duyuldukca oluyor
4,4,Suriye deki silahli selefi muhalifler yeni ku...
...,...,...
1152,1152,Yuregir Adana ilimize ait sirin bir ilcedir
1153,1153,yuze guluculugun at oynattigi bir aydinlar ort...
1154,1154,zavalli adami oracikta astilar ve hic kimse se...
1155,1155,zengin cocuklarina ariz munasebetsizlikler fak...


In [25]:
def predict_on_sentences(df, model, char_to_index, index_to_char):
    # Replace each sentence with its diacritized prediction
    df['Sentence'] = df['Sentence'].apply(lambda x: predict(model, x, char_to_index, index_to_char))
    return df

df_test = pd.read_csv("test.csv")
df_test = predict_on_sentences(df_test.copy(), model, char_to_index, index_to_char)

In [26]:
df_test

Unnamed: 0,ID,Sentence
0,0,tr ekonomi ve politika haberleri türkiye nin e...
1,1,üye girişi
2,2,son güncelleme 12:12
3,3,İmrali Mit görüşmesi ihtiyaç duyuldukça oluyor
4,4,Suriye deki silahlı selefi muhalifler yeni kur...
...,...,...
1152,1152,Yüreğir Adana ilimize ait şirin bır ilçedir
1153,1153,yüze gülücülüğün at oynattığı bır aydınlar ort...
1154,1154,zavallı adamı oracıkta astılar ve hiç kimse se...
1155,1155,zengin çocuklarına arız münasebetsizlikler fak...


To preserve the processed data for further analysis and testing, the DataFrame is saved as a CSV file.

In [None]:
df_test.to_csv("result.csv", index=False)