Ahmet A. Akın edited this page Aug 26, 2021 · 57 revisions

What is Zemberek-NLP

This is a project that provides some basic tools for Turkish NLP applications.

What is the state of the project?

This project is in maintenance mode.

Should I use Zemberek-NLP?

For most NLP tasks, Zemberek will probably not provide state-of-the-art results compared with modern NLP tools. However, it can still be useful for preprocessing Turkish text (tokenization, sentence segmentation, lemmatization, etc.) and for creating a baseline for some applications.

What is the difference between Zemberek2 and Zemberek-NLP?

Zemberek2, a morphological analysis and generation tool, is no longer maintained. Zemberek-NLP was developed almost from scratch and shares almost no code with Zemberek2. Zemberek2 had several shortcomings, such as overly strict parsing, incompatible formatting, a weak dictionary, complex code, slowness, and no disambiguation. Hopefully Zemberek-NLP will address all those issues.

What is "Morphological Analysis"?

Morphological analysis is used for finding the meaningful parts (morphemes) of a word, such as root words and suffixes. For example, "kalemlerimden (from my pencils)" is analyzed as follows in Zemberek:

[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl

[kalem:Noun] → lemma and root POS.
kalem:Noun   → Stem. This may differ from the lemma.
ler:A3pl     → Plural suffix `A3pl` with form `ler`
im:P1sg      → First person singular possessive suffix `P1sg` with form `im`
den:Abl      → Ablative suffix `Abl` (from) with form `den`

Usually the actual letters of a morpheme, such as den in the example above, are called the surface form, and the suffix name that represents it, such as Abl, is called the lexical form. Finding morphemes computationally is not easy for Turkish: such a system requires knowledge of complex phonetic rules and hand-crafted suffix sequence rules (morphotactics).
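To make the surface form / lexical form distinction concrete, the following is a minimal sketch that splits an analysis string of the shape shown above into its lemma, root POS, stem and suffixes. It only parses the printed text format from this page; it does not call Zemberek, and the names `parse_analysis`, `Analysis` and `Morpheme` are hypothetical helpers introduced for illustration.

```python
import re
from typing import List, NamedTuple


class Morpheme(NamedTuple):
    surface: str  # actual letters, e.g. "den"; empty if the morpheme has no letters
    lexical: str  # morpheme name, e.g. "Abl"


class Analysis(NamedTuple):
    lemma: str
    pos: str
    stem: str
    suffixes: List[Morpheme]


def parse_analysis(text: str) -> Analysis:
    """Parse a string such as '[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl'."""
    match = re.fullmatch(r"\[([^:\]]+):([^\]]+)\]\s+(\S+)", text.strip())
    if match is None:
        raise ValueError(f"not an analysis string: {text!r}")
    lemma, pos, body = match.groups()
    parts = body.split("+")
    # The first part is the stem with its POS, e.g. "kalem:Noun".
    stem = parts[0].split(":", 1)[0]
    suffixes = []
    for part in parts[1:]:
        # "ler:A3pl" -> ("ler", "A3pl"); a bare "A3sg" has an empty surface form.
        surface, _, lexical = part.rpartition(":")
        suffixes.append(Morpheme(surface, lexical))
    return Analysis(lemma, pos, stem, suffixes)


analysis = parse_analysis("[kalem:Noun] kalem:Noun+ler:A3pl+im:P1sg+den:Abl")
for morpheme in analysis.suffixes:
    print(f"surface={morpheme.surface!r} lexical={morpheme.lexical!r}")
```

Keeping the two forms in separate fields lets later steps compare analyses by their lexical forms even when vowel harmony changes the surface letters.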

What is "Morphological Disambiguation"?

Many Turkish words are highly ambiguous. A single word can have 2 to 10 correct analyses with different stems and suffixes depending on the context. For example, the word "yarın" can be interpreted as follows:

[yar:Noun] yar:Noun+A3sg+ın:Gen           (cliff's)
[yar:Noun] yar:Noun+A3sg+ın:P2sg          (your cliff)
[yarı:Noun] yarı:Noun+A3sg+n:P2sg         (your half)
[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg (your half, root is adjective)
[yarın:Adv] yarın:Adv                     (during tomorrow)
[yarın:Noun,Time] yarın:Noun+A3sg         (tomorrow) This is the most common.
[yarmak:Verb] yar:Verb+Imp+ın:A2pl        (split!)

To resolve ambiguity, a simple machine learning mechanism is trained with hand-tagged sentences. It uses context words and their analyses to determine the correct result.

Why does disambiguation fail in some cases?

Like all data-driven statistical systems, the disambiguation mechanism may produce wrong results. Besides, morphological disambiguation is a hard problem. Algorithms need a lot of training data to generate good models, and generating training data is a time-consuming process.

However, as of 0.14.0, the performance of the disambiguation mechanism is much improved. If possible, switch to version 0.14.0 or later. We expect to improve the quality in further versions by adding more data.

Can I use Zemberek-NLP with Python?

There are two ways to use Zemberek-NLP with Python. One is to access it natively from Python with jpype. @ozturkberkay's Git repository Zemberek-Python-Examples provides working Zemberek examples.

The second alternative is to use Zemberek-NLP's own gRPC server. For this you need the full Zemberek jar file, which you run with

 java -jar zemberek-full.jar StartGrpcServer

After that, you can access it with the provided Python files as explained here.

Can I use it as a stemmer-lemmatizer?

Yes, it is trivial to access stems and lemmas from the parse result. However, correct stemming requires good disambiguation.

Can I add a new dictionary item programmatically?
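The sketch below illustrates why disambiguation matters for stemming. It takes the candidate analyses of "yarın" from the disambiguation answer, as plain strings, and extracts the lemma and stem of each one. It does not call Zemberek; `lemma_of`, `stem_of` and `unique_lemmas` are hypothetical helpers that only read the printed format.

```python
import re

# Candidate analyses of "yarın", copied from the disambiguation answer on this page.
CANDIDATES = [
    "[yar:Noun] yar:Noun+A3sg+ın:Gen",
    "[yar:Noun] yar:Noun+A3sg+ın:P2sg",
    "[yarı:Noun] yarı:Noun+A3sg+n:P2sg",
    "[yarı:Adj] yarı:Adj|Zero→Noun+A3sg+n:P2sg",
    "[yarın:Adv] yarın:Adv",
    "[yarın:Noun,Time] yarın:Noun+A3sg",
    "[yarmak:Verb] yar:Verb+Imp+ın:A2pl",
]


def lemma_of(analysis: str) -> str:
    """Return the dictionary form inside the leading [lemma:POS] bracket."""
    match = re.match(r"\[([^:\]]+):", analysis)
    if match is None:
        raise ValueError(f"not an analysis string: {analysis!r}")
    return match.group(1)


def stem_of(analysis: str) -> str:
    """Return the surface letters of the first morpheme after the bracket."""
    body = analysis.split("] ", 1)[1]
    return body.split("+", 1)[0].split(":", 1)[0]


def unique_lemmas(analyses):
    """Lemmas in order of first appearance, without duplicates."""
    return list(dict.fromkeys(lemma_of(a) for a in analyses))


# One surface word, several possible lemmas: a stemmer must pick one analysis first.
print(unique_lemmas(CANDIDATES))
```

Without context there is no way to choose among these lemmas, which is why a stemmer built on the analyzer is only as good as the disambiguation step in front of it.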

Yes.

Can I generate words?

Yes.

Where is word suggestion functionality?

After version 0.11.0, a simple spelling functionality is available in the normalization module.

Where is deasciifier functionality?

Currently Zemberek-NLP does not offer deasciifier functionality directly. However, the TurkishMorphology class can be configured to ignore diacritics during analysis. There are several applications available on the internet that use Deniz Yuret's deasciifier algorithm.
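As a rough illustration of what ignoring diacritics involves, the sketch below folds Turkish-specific letters to their ASCII look-alikes and finds every vocabulary word that matches an ASCII-typed input. This is neither Zemberek's implementation nor Deniz Yuret's algorithm; `fold`, `candidates` and the three-word vocabulary are assumptions made for the example.

```python
# Map Turkish-specific letters to the ASCII letters people type instead.
ASCII_FOLD = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def fold(text: str) -> str:
    """Replace Turkish-specific letters with their ASCII look-alikes."""
    return text.translate(ASCII_FOLD)


def candidates(ascii_word: str, vocabulary):
    """Return every vocabulary word whose folded form equals the ASCII input."""
    return [word for word in vocabulary if fold(word) == ascii_word]


# Hypothetical tiny vocabulary; a real system would use a full dictionary.
VOCABULARY = ["acı", "açı", "kalem"]

# "aci" matches both "acı" (pain) and "açı" (angle).
print(candidates("aci", VOCABULARY))
```

The example shows why deasciification is more than a lookup: a single ASCII spelling can map back to several valid Turkish words, so context is needed to choose one.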

Can I detect languages?

Yes. Use the lang-id module for this. There are also alternatives like language-detector. Keep in mind that this module is for detecting the language of text with a reasonable character count (usually more than 20 characters). It is usually not suitable for detecting the language of individual words.

Why is the code in English?

Zemberek2 code was completely in Turkish. It was one of the points that made it attractive for newcomers. However, we wanted Zemberek-NLP to be used in the global NLP community and academia, and therefore used English in the code. Not that it worked out that way, but we still stick to that decision.

But feel free to use Turkish in the issues section.

What about Libre Office or Lucene-Solr extensions?

We do not have extensions for external applications for now, but it is easy to write a Turkish stemmer or lemmatizer. There is already a Lucene-Solr Turkish Analysis project available that uses different NLP tools.

There is a LibreOffice spell checker extension available.

Can I use it in Android?

It is possible in theory, but we have not tried it. The library is more suitable for server or desktop usage in its current state.

Why don't you use an FST tool?

Most Turkish morphological parsing tools use an FST (finite-state transducer). Oflazer, Sak and Çöltekin use this approach. An FST greatly simplifies the parser and is very fast. However, we did not go that route because:

  • Good FST tools were not available for Java.
  • Some FST tools were too low level.
  • You cannot modify the search graph at run-time if you use an FST tool.

Zemberek takes a different approach and uses a graph that is created programmatically. It is slower, but building the graph in code gives more flexibility.
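The run-time flexibility mentioned above can be pictured with a toy example. The sketch below builds a tiny suffix-transition graph in code, checks a suffix sequence against it, and then adds a transition while the program is running. It is a deliberately simplified illustration, not Zemberek's actual graph or API; the class `SuffixGraph` and its methods are hypothetical, and the transitions cover only the morpheme names from the "kalemlerimden" example.

```python
class SuffixGraph:
    """Toy morphotactics graph: nodes are morpheme names, edges are allowed successors."""

    def __init__(self):
        self._edges = {}

    def add_transition(self, source: str, target: str) -> None:
        """Allow `target` to follow `source`; can be called at any time."""
        self._edges.setdefault(source, set()).add(target)

    def accepts(self, start: str, morphemes) -> bool:
        """Return True if every step in the sequence follows an existing edge."""
        state = start
        for morpheme in morphemes:
            if morpheme not in self._edges.get(state, set()):
                return False
            state = morpheme
        return True


graph = SuffixGraph()
# Transitions needed for kalem+ler+im+den (Noun -> A3pl -> P1sg -> Abl).
for source, target in [("Noun", "A3pl"), ("A3pl", "P1sg"), ("P1sg", "Abl")]:
    graph.add_transition(source, target)

# Noun -> A3pl -> Abl is not known yet, so the sequence is rejected.
before = graph.accepts("Noun", ["A3pl", "Abl"])

# Extend the graph at run-time, with no recompilation step.
graph.add_transition("A3pl", "Abl")
after = graph.accepts("Noun", ["A3pl", "Abl"])

print(before, after)
```

A compiled FST would typically need to be rebuilt to reflect such a change, whereas a graph held as ordinary objects can be edited in place, at the cost of slower traversal.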

Can I use Zemberek-NLP in a commercial product?

Yes. As long as you abide by the distribution requirements of the Apache 2.0 license, you can use Zemberek source or binaries even in closed-source commercial projects.

What are the alternatives?

There are many tools for Turkish NLP available. Some are:

  • Kemal Oflazer's command line parser
  • Olcay Taner Yıldız's NLP Toolkit
  • Haşim Sak's morphological parser and disambiguator
  • Çağrı Çöltekin's TRmorph
  • Ali Ok's trnltk-java
  • ITU Turkish NLP pipeline
  • TS Corpus, which provides a variety of Turkish linguistic corpora and online NLP tools
  • Harun Reşit Zafer's nuve
  • Deniz Yuret's deasciifier and disambiguator
  • ODTÜ-Sabancı Treebank
  • Weka, OpenNLP, NLTK, Stanford NLP and many recent neural network based tools (TensorFlow, PyTorch etc.), which can be trained for Turkish

I want to know about Turkish Morphology

There are many books available on Turkish grammar. There is also slightly outdated documentation, written from the perspective of the Zemberek developers, available here.

Why the word "Zemberek"?

Zemberek means the mainspring of a watch in Turkish. Etymologically, it comes from the Persian word "zanbūrak زنبورك", meaning "little bee". Long ago @mdakin picked this word because it sounds funny/interesting.

Who are the one-eyed mouse and hamster in the avatars?

They are Danger Mouse and Penfold from the animated series Danger Mouse (Tehlikeli Fare).