Lemmatization in Turkish Language

baturman edited this page Nov 30, 2013 · 8 revisions

Turkish is a morphologically complex language, so the stem of a word cannot always be obtained by following a fixed set of rules. In natural language processing, the stem of a word is better determined by the meaning the word carries.

For example, the following words

  • göz (eye)
  • gözlük (eyeglasses)
  • gözlükçü (optician)
  • gözlem (observation)

all share the root form "göz". But if you reduce them all to "göz", you lose the distinct meaning of each word. For this reason, a lemmatizer is a better option than a purely rule-based stemmer for finding the stems of Turkish words.

Turkish Lemmatizer uses the "Longest Matched Stemming" algorithm. In addition, it handles some of the [Turkish word formation rules](Turkish Word Formation) that are commonly seen in Turkish words.

Those are:

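  1. Ünsüz Yumuşaması (consonant softening), e.g. kitap → kitabı
  2. Ünlü Daralması (vowel narrowing), e.g. başla- → başlıyor
  3. Ünlü Düşmesi (vowel dropping), e.g. burun → burnu
  4. Ünsüz Düşmesi (consonant dropping), e.g. küçük → küçücük
  5. Pekiştirme (intensification), e.g. beyaz → bembeyaz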
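
As a rough illustration of how longest-match stemming can be combined with one of these rules, here is a minimal Python sketch. It is not the library's actual code: the function name `longest_match`, the tiny `STEMS` set, and the `HARDEN` table are all hypothetical, and only ünsüz yumuşaması (consonant softening) is handled. Input is assumed to be lowercase already.

```python
# Hypothetical stem list for illustration; a real list would be far larger.
STEMS = {"göz", "gözlük", "gözlükçü", "gözlem", "kitap", "ağaç"}

# Consonant softening: a stem-final p, ç, t, k may surface as b, c, d, ğ
# before a vowel-initial suffix. This table maps the soft form back.
HARDEN = {"b": "p", "c": "ç", "d": "t", "ğ": "k"}


def longest_match(word, stems=STEMS):
    """Return the longest known stem that `word` starts with, or None."""
    # Try prefixes from longest to shortest so the longest stem wins.
    for end in range(len(word), 0, -1):
        candidate = word[:end]
        if candidate in stems:
            return candidate
        # Undo a possible consonant softening on the last letter and retry.
        last = candidate[-1]
        hardened = candidate[:-1] + HARDEN.get(last, last)
        if hardened in stems:
            return hardened
    return None


if __name__ == "__main__":
    for w in ["gözlükçüler", "gözlemler", "kitabı", "ağacı"]:
        print(w, "->", longest_match(w))
```

Because prefixes are tried from longest to shortest, "gözlükçüler" resolves to "gözlükçü" rather than "göz", which keeps the meaning that a shorter stem would lose.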

In this lemmatization library, lemmatization performance depends entirely on the supplied stem list.
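
The effect of the stem list can be seen in a small Python sketch. The helper `longest_stem` and both stem lists below are hypothetical and only stand in for the library's behavior: the same word yields a different result depending on which stems are available.

```python
def longest_stem(word, stems):
    """Return the longest entry of `stems` that `word` starts with.

    Falls back to the word itself when no stem matches.
    """
    matches = [s for s in stems if word.startswith(s)]
    return max(matches, key=len) if matches else word


# Two hypothetical stem lists: a sparse one and a richer one.
small_list = {"göz"}
large_list = {"göz", "gözlük", "gözlükçü", "gözlem"}

if __name__ == "__main__":
    for word in ["gözlükçüler", "gözlemler"]:
        print(word, longest_stem(word, small_list), longest_stem(word, large_list))
```

With the sparse list both words collapse to "göz"; with the richer list they resolve to "gözlükçü" and "gözlem" respectively.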