Lemmatization in Turkish Language

Turkish is a morphologically complex language in which the stem of a word cannot always be obtained by applying a fixed set of rules. In natural language processing, the stem of a word often has to be determined by the meaning the word carries.

For example, the following words

göz (eye)
gözlük (eyeglasses)
gözlükçü (optician)
gözlem (observation)

have the root form "göz". But if you reduce them all to "göz", you lose the original meaning of each word. For this reason, a lemmatizer seems the better option for finding the stems of Turkish words.

Turkish Lemmatizer uses the "Longest Matched Stemming" algorithm. In addition, it handles some of the [Turkish word formation rules](Turkish Word Formation) that are commonly seen in Turkish words.

These are:

Ünsüz Yumuşaması (consonant softening)
Ünlü Daralması (vowel narrowing)
Ünlü Düşmesi (vowel dropping)
Ünsüz Düşmesi (consonant dropping)
Pekiştirme (intensification)
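
As an illustration of the first rule, consonant softening turns a final p, ç, t or k into b, c, d or ğ/g when a vowel-initial suffix follows (kitap + ı becomes kitabı). The sketch below shows one way a stem lookup could undo that change. It is not the library's actual code: the names `HARD_COUNTERPART`, `stem_candidates` and `match_stem` are hypothetical, and the stem set is a toy example.

```python
# Assumed mapping from a softened final consonant back to its hard form.
HARD_COUNTERPART = {"b": "p", "c": "ç", "d": "t", "ğ": "k", "g": "k"}


def stem_candidates(fragment):
    """Return the fragment itself plus, if it ends in a softened
    consonant, the variant with that consonant hardened again."""
    candidates = [fragment]
    if fragment and fragment[-1] in HARD_COUNTERPART:
        candidates.append(fragment[:-1] + HARD_COUNTERPART[fragment[-1]])
    return candidates


def match_stem(fragment, stems):
    """Return the first candidate found in the stem set, or None."""
    for candidate in stem_candidates(fragment):
        if candidate in stems:
            return candidate
    return None


stems = {"kitap", "ağaç", "çocuk"}  # toy stem list for the example
print(match_stem("kitab", stems))   # kitap
print(match_stem("ağac", stems))    # ağaç
print(match_stem("çocuğ", stems))   # çocuk
```

The other rules could be handled in a similar way, by generating extra candidate forms for a fragment before looking it up in the stem list.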

In this lemmatization library, the performance of the lemmatization process depends entirely on the supplied stem list.
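
To make that dependence concrete, here is a minimal sketch of longest-match stemming over a supplied stem list. It is an assumption about how such a lookup could work, not the library's implementation: the function name `longest_matching_stem` is hypothetical, the input is assumed to be lowercase already, and the word formation rules are left out.

```python
def longest_matching_stem(word, stems):
    """Return the longest prefix of `word` that appears in `stems`.

    Falls back to the word itself when no prefix matches.
    """
    # Try prefixes from longest to shortest so the longest stem wins.
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in stems:
            return prefix
    return word


rich_stems = {"göz", "gözlük", "gözlükçü", "gözlem"}
poor_stems = {"göz"}

# "gözlükçüler" means "opticians".
print(longest_matching_stem("gözlükçüler", rich_stems))  # gözlükçü
print(longest_matching_stem("gözlükçüler", poor_stems))  # göz
```

With the richer stem list the word keeps its meaning ("optician"), while the smaller list collapses it to "göz" ("eye"), which is the loss of meaning described above.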
