# Turkish Diacritisation | YZV 405E NLP Term Project

Author: Bora Boyacıoğlu

Student ID: 150200310

## Step 1: Data Preprocessing

Import the dataset class and the helper functions used throughout this step.

In [1]:
from dataset import DiacritizationDataset
from utils.utils import *

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
def print_example(texts: list[str]):
    # texts = [train undiacritized, train diacritized, test undiacritized, test diacritized]
    print('\033[1mTrain Data Example:\033[0m')
    print('Undiacritized:', texts[0])
    print('Diacritized:', texts[1])

    print('\n\033[1mTest Data Example:\033[0m')
    print('Undiacritized:', texts[2])
    print('Diacritized:', texts[3])

### Defining Datasets

We have two datasets: `train` and `test`. The `train` dataset is used to train the model and the `test` dataset to evaluate it. First, load both with the `DiacritizationDataset` class.

In [4]:
# Train dataset.
train_data = DiacritizationDataset('data/train.csv', type='train')

# Test dataset.
test_data = DiacritizationDataset('data/test.csv', type='test')

In [5]:
print(f"Length: {len(train_data)}\t(Train)\n"
      f"        {len(test_data)}\t(Test)")

Length: 57839	(Train)
        1176	(Test)


Note that I split some very long sentences. In addition, the 15480th entry in the `train` dataset was a sequence of lines stored as one massive sentence, so I split it line by line. Because of these adjustments, the lengths above differ slightly from those of the original datasets.

In [6]:
print_example([train_data.get(0, 'und'), train_data.get(0, 'd'), test_data.get(0, 'und'), test_data.get(0, 'd')])

Train Data Example:
Undiacritized: sinif  havuz ve acik deniz calismalariyla  tum dunyada gecerli  basarili bir standart olusturmustur . 
Diacritized: sınıf  havuz ve açık deniz çalışmalarıyla  tüm dünyada geçerli  başarılı bir standart oluşturmuştur . 

Test Data Example:
Undiacritized:  tr ekonomi ve politika haberleri turkiye nin en cesur gazetesi radikal de uye ol
Diacritized: None
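
In the training example above, the two versions differ only in the Turkish-specific letters. The sketch below illustrates that relationship by mapping each of those letters to its closest ASCII letter. It is only an illustration: `strip_diacritics` and `TURKISH_TO_ASCII` are hypothetical names, and nothing here is taken from how the dataset was actually produced.

```python
# Hypothetical helper, not part of the project's code.
# Maps Turkish-specific letters to their closest ASCII counterparts.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(text: str) -> str:
    """Return `text` with Turkish-specific letters replaced by ASCII letters."""
    return text.translate(TURKISH_TO_ASCII)


undiacritized = strip_diacritics("sınıf havuz ve açık deniz çalışmalarıyla")
print(undiacritized)
```

The model's task is the inverse of this mapping, which is ambiguous: an undiacritized `i` may correspond to either `i` or `ı`, so context is needed to pick the right letter.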


### Preprocessing

Normalize the text by converting it to lowercase and removing any special characters.

In [7]:
# Normalize the train data.
normalize(train_data)

# Normalize the test data.
normalize(test_data)

Normalizing text 100.00%
Normalizing text 100.00%


In [8]:
print_example([train_data.get(0, 'und'), train_data.get(0, 'd'), test_data.get(0, 'und'), test_data.get(0, 'd')])

Train Data Example:
Undiacritized: sinif havuz ve acik deniz calismalariyla tum dunyada gecerli basarili bir standart olusturmustur 
Diacritized: sınıf havuz ve açık deniz çalışmalarıyla tüm dünyada geçerli başarılı bir standart oluşturmuştur 

Test Data Example:
Undiacritized:  tr ekonomi ve politika haberleri turkiye nin en cesur gazetesi radikal de uye ol
Diacritized: None
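
As a rough illustration of this normalization step, the sketch below lowercases a string, replaces punctuation with spaces, and collapses repeated whitespace. `normalize_text` is a hypothetical stand-in, not the project's `normalize` function, whose exact rules are not shown here (for example, the output above keeps leading and trailing spaces, while this sketch strips them). The `turkish` flag is also an assumption: Python's `str.lower()` does not follow Turkish casing rules for `I` and `İ`, so the sketch handles them explicitly when asked to.

```python
import re


def normalize_text(text: str, turkish: bool = True) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (illustrative only)."""
    if turkish:
        # Turkish casing: dotted İ lowers to i, dotless I lowers to ı.
        text = text.replace("İ", "i").replace("I", "ı")
    text = text.lower()
    # Replace anything that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("sinif  havuz ve acik deniz . "))
```

For undiacritized input, `turkish=False` may be the better choice, since mapping `I` to `ı` would introduce a diacritic-bearing letter into text that is supposed to contain none.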


Then, tokenize the text by splitting it into words. We use spaCy for tokenization.

In [9]:
# Tokenize the train data.
tokenize(train_data)

# Tokenize the test data.
tokenize(test_data)

Tokenizing... 100.00%
Tokenizing... 100.00%


In [10]:
print_example([train_data.get(0, 'und'), train_data.get(0, 'd'), test_data.get(0, 'und'), test_data.get(0, 'd')])

Train Data Example:
Undiacritized: ['sinif', 'havuz', 've', 'acik', 'deniz', 'calismalariyla', 'tum', 'dunyada', 'gecerli', 'basarili', 'bir', 'standart', 'olusturmustur']
Diacritized: ['sınıf', 'havuz', 've', 'açık', 'deniz', 'çalışmalarıyla', 'tüm', 'dünyada', 'geçerli', 'başarılı', 'bir', 'standart', 'oluşturmuştur']

Test Data Example:
Undiacritized: ['tr', 'ekonomi', 've', 'politika', 'haberleri', 'turkiye', 'nin', 'en', 'cesur', 'gazetesi', 'radikal', 'de', 'uye', 'ol']
Diacritized: None


### Preparing for Training

First, create the vocabulary.

In [11]:
# Build word to index and index to word mappings.
vocab = train_data.build_vocab(test_data)
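
A vocabulary of this kind is typically a pair of dictionaries: one from word to index and one from index back to word. The sketch below shows one way to build them from several tokenized corpora. It is not the project's `build_vocab` method: the function name, the order of the special tokens, and the presence of an `<unk>` token are all assumptions made for illustration.

```python
# Special tokens reserved at the start of the vocabulary (assumed order).
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_vocab(*corpora):
    """Build word-to-index and index-to-word mappings from tokenized corpora."""
    word2idx = {token: i for i, token in enumerate(SPECIALS)}
    for corpus in corpora:
        for sentence in corpus:
            for word in sentence:
                # Assign the next free index to each unseen word.
                if word not in word2idx:
                    word2idx[word] = len(word2idx)
    idx2word = {i: word for word, i in word2idx.items()}
    return word2idx, idx2word


train_tokens = [["sinif", "havuz"]]
test_tokens = [["tr", "havuz"]]
word2idx, idx2word = build_vocab(train_tokens, test_tokens)
print(len(word2idx))
```

Passing both corpora, as the cell above does with `test_data`, means that words appearing only in the test set still receive their own index.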

Then, pad all sentences to a common length.

In [12]:
# Pad the datasets.
max_len = train_data.pad()
test_data.pad(max_len)

# Print the maximum length.
print(f'Maximum Length: {max_len}')

Maximum Length: 118
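
The sketch below shows one plausible way to implement this padding: each sentence is wrapped in `<sos>` and `<eos>` markers and then filled with `<pad>` tokens up to `max_len + 2`. The `+ 2` is an inference from the pad counts printed at the end of this section (13 words, 2 markers, and 105 pads give 120 tokens), not something the code confirms. `pad_sentences` is a hypothetical function, and truncating over-long sentences is an assumption.

```python
def pad_sentences(sentences, max_len=None):
    """Frame each sentence with <sos>/<eos> and pad it to max_len + 2 tokens."""
    if max_len is None:
        # Use the longest sentence when no length is given.
        max_len = max(len(sentence) for sentence in sentences)
    padded = []
    for sentence in sentences:
        # Assumption: sentences longer than max_len are truncated.
        framed = ["<sos>"] + sentence[:max_len] + ["<eos>"]
        padded.append(framed + ["<pad>"] * (max_len + 2 - len(framed)))
    return padded, max_len


train_padded, max_len = pad_sentences([["a", "b", "c"], ["d"]])
# Reuse the training length so both sets share one shape.
test_padded, _ = pad_sentences([["e", "f"]], max_len)
```

Reusing the length computed on the training set, as `test_data.pad(max_len)` does above, keeps both datasets at the same sequence length.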


Using the vocabulary, convert the words into indices.

In [13]:
# Convert the tokens to their vocabulary indices.
train_data.to_indices()
test_data.to_indices()
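
As a minimal sketch of this conversion, each token is looked up in the word-to-index mapping. The function name, the toy vocabulary, and the fallback to an `<unk>` index are assumptions for illustration; the project's `to_indices` method may behave differently.

```python
def tokens_to_indices(sentence, word2idx, unk="<unk>"):
    """Map each token to its vocabulary index, falling back to the <unk> index."""
    unk_id = word2idx[unk]
    return [word2idx.get(token, unk_id) for token in sentence]


# Toy vocabulary for demonstration only.
toy_vocab = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "<unk>": 3, "sinif": 4, "havuz": 5}
indices = tokens_to_indices(["<sos>", "sinif", "havuz", "<eos>", "<pad>"], toy_vocab)
print(indices)
```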

### Saving the Preprocessed Data

In [14]:
# Save the train data.
train_data.save_data('data/train_data.pkl')

# Save the test data.
test_data.save_data('data/test_data.pkl')

# Save the vocab.
train_data.save_vocab('data/vocab.pkl')
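
The `.pkl` extension suggests that the data is serialized with Python's `pickle` module, although the `save_data` and `save_vocab` implementations are not shown here. Under that assumption, a save and load round trip could look like the sketch below, where `save_pickle` and `load_pickle` are hypothetical helpers.

```python
import os
import pickle
import tempfile


def save_pickle(obj, path):
    """Serialize `obj` to `path` with pickle."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Load a pickled object from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round trip through a temporary directory to avoid touching real data.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.pkl")
    save_pickle({"<pad>": 0, "<sos>": 1, "<eos>": 2}, vocab_path)
    restored = load_pickle(vocab_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.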

In [15]:
print_example([
    untokenize(train_data, 0, 'und', detailed=True),
    untokenize(train_data, 0, 'd', detailed=True),
    untokenize(test_data, 0, 'und', detailed=True),
    None
])

Train Data Example:
Undiacritized: <sos> sinif havuz ve acik deniz calismalariyla tum dunyada gecerli basarili bir standart olusturmustur <eos> <pad> (105)...
Diacritized: <sos> sınıf havuz ve açık deniz çalışmalarıyla tüm dünyada geçerli başarılı bir standart oluşturmuştur <eos> <pad> (105)...

Test Data Example:
Undiacritized: <sos> tr ekonomi ve politika haberleri turkiye nin en cesur gazetesi radikal de uye ol <eos> <pad> (104)...
Diacritized: None
