TurkishDeasciifier

In Turkish natural language processing, normalized text is crucial for well-designed models and high-accuracy results. One of the main steps in text normalization is diacritic resolution, which converts non-normalized, asciified text into normalized, deasciified text (for example, "calisma" becomes "çalışma"). This project provides a recurrent neural network (RNN) based model for the diacritic resolution task.

Dependencies

To run this project, the necessary libraries are:

DyNet
NLTK
NumPy
codecs (part of the Python standard library)

Setup

Clone or download the project to your local machine.

Data

Under the /data directory, you can find the train and test data.

data_input.txt and data_output.txt are for training.
data_input_test.txt and data_output_test.txt are for testing.
Each train and test file contains one sentence per line.
To use your own train or test files, place them under this directory with the file names specified above.
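Before training on your own files, it can help to confirm that the input and output files line up. The following is a minimal sketch, not part of the project: the helper names read_lines and find_misaligned_lines are hypothetical, and the per-line length check assumes the asciified text was produced by one-to-one character replacement, so each input line has the same length as its output line.

```python
def read_lines(path):
    """Read a UTF-8 text file and return its lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def find_misaligned_lines(input_lines, output_lines):
    """Return 1-based line numbers whose input and output lengths differ.

    Assumption: asciification replaces each character with exactly one
    character, so aligned sentence pairs always have equal length.
    """
    if len(input_lines) != len(output_lines):
        raise ValueError(
            "line count mismatch: %d input vs %d output"
            % (len(input_lines), len(output_lines))
        )
    misaligned = []
    pairs = zip(input_lines, output_lines)
    for number, (ascii_line, original_line) in enumerate(pairs, start=1):
        if len(ascii_line) != len(original_line):
            misaligned.append(number)
    return misaligned
```

For example, find_misaligned_lines(read_lines("data/data_input.txt"), read_lines("data/data_output.txt")) would return an empty list when every sentence pair is aligned.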

Under the /models directory, you can find the pretrained RNN diacritic resolution model.

char2int and int2char are prebuilt index arrays; load these variables when running the project.

A load function is predefined for these variables. It loads the char2int.p and int2char.p files.
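As a rough illustration of how such index files can be used, the sketch below assumes that the .p files are Python pickles, that char2int maps characters to integers, and that int2char can be indexed by integer. None of this is confirmed by the project; load_index, encode, and decode are hypothetical names and are not the project's predefined load function.

```python
import pickle


def load_index(path):
    """Load a pickled index object (assumed format of the .p files).

    Only unpickle files from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)


def encode(text, char2int):
    """Map each character to its integer index."""
    return [char2int[ch] for ch in text]


def decode(ids, int2char):
    """Map integer indices back to a string."""
    return "".join(int2char[i] for i in ids)
```

With the files in the repository root, usage would look like char2int = load_index("char2int.p") followed by encode("merhaba", char2int), provided every character appears in the index.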

How to Create Train Dataset

You can easily create your own dataset by obtaining a large amount of well-written Turkish text from the web or other media. Asciify the obtained text by replacing Turkish diacritic letters with their ASCII forms. For example:

Replace all "ş" with "s"
Replace all "ü" with "u"
...

You then have two different text files:

Ascii text: use this as the input file
Original text: use this as the output file
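The asciification step can be sketched as follows. This is a minimal sketch, not the project's own script: the names asciify and build_pair_files are hypothetical, and the full letter mapping is an assumption based on the standard Turkish alphabet, since only the ş and ü replacements are listed explicitly here.

```python
# Assumed mapping: only "ş" -> "s" and "ü" -> "u" are given explicitly;
# the remaining pairs follow the standard Turkish alphabet.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def asciify(text):
    """Replace Turkish diacritic letters with their ASCII forms."""
    return text.translate(TURKISH_TO_ASCII)


def build_pair_files(source_path, input_path, output_path):
    """Write an asciified input file and an original output file.

    Each non-empty line of the source text is written once to each
    file, so the two files stay aligned line by line.
    """
    with open(source_path, encoding="utf-8") as src, \
            open(input_path, "w", encoding="utf-8") as ascii_file, \
            open(output_path, "w", encoding="utf-8") as original_file:
        for line in src:
            line = line.strip()
            if not line:
                continue
            ascii_file.write(asciify(line) + "\n")
            original_file.write(line + "\n")
```

Calling build_pair_files("corpus.txt", "data/data_input.txt", "data/data_output.txt") would produce a training pair from a hypothetical corpus.txt, one sentence per line.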

Train

python deasciifierRNN.py train

Test

python deasciifierRNN.py test

Article

You can review the article: diacritic-restoration-recurrent.pdf
