This is the new home of the Zemberek project. Zemberek-NLP is a Natural Language Processing library that provides basic NLP tools for processing Turkish text.
The latest version is 0.9.3 (November 22nd, 2016).
Please read the FAQ for common questions.
Add this repository to your pom.xml file:
<repositories>
  <repository>
    <id>ahmetaa-repo</id>
    <name>ahmetaa Maven Repo on Github</name>
    <url>https://raw.github.com/ahmetaa/maven-repo/master</url>
  </repository>
</repositories>
And the dependencies (for example, morphology):
<dependencies>
  <dependency>
    <groupId>zemberek-nlp</groupId>
    <artifactId>morphology</artifactId>
    <version>0.9.3</version>
  </dependency>
</dependencies>
The Google Docs page has the versions and the separate module and dependent jars.
Turkish-nlp-examples contains a Maven Java project with small usage examples.
Known Issues and Limitations
- The project requires Java 8.
- Currently, the word and sentence parse modules generate the parse graph on each initialization, so every run takes a few seconds. We will fix this in the next version with fast serialization of the parse graph.
- Morphological parsing does not work for some obvious and frequent words.
- Morphological disambiguation is less accurate than expected (not very usable).
- Morphological generation may not work for some obvious Stem-Suffix combinations.
- Please see the issues section for further issues, and feel free to create new ones.
Core classes such as special Collection classes, Hash functions and helpers.
Turkish morphological parsing, disambiguation and generation. Morphology Documentation
Turkish tokenization and sentence boundary detection. So far, only rule-based algorithms are available.
Turkish syllabification and hyphenation.
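To give a rough idea of what rule-based syllabification involves, here is a minimal, self-contained sketch. It is not Zemberek's implementation and does not use the library's API; the class and method names (SyllableSketch, syllabify) are hypothetical. It assumes the usual rule for Turkish words: every syllable has exactly one vowel, and a vowel takes at most one consonant directly before it as its onset. The sketch scans the word from the end and handles only plain words, with no punctuation, apostrophes, or digits.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; not part of Zemberek.
final class SyllableSketch {

    // Turkish vowels, lower and upper case. Non-ASCII letters are written as
    // Unicode escapes (dotless i, o-umlaut, u-umlaut, dotted capital I, and the
    // circumflexed a, i, u) so the file compiles under any source encoding.
    private static final String VOWELS =
            "aeiou\u0131\u00f6\u00fc\u00e2\u00ee\u00fb"
          + "AEIOU\u0130\u00d6\u00dc\u00c2\u00ce\u00db";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    // Splits a single word into syllables, scanning from the last letter.
    static List<String> syllabify(String word) {
        List<String> syllables = new ArrayList<>();
        int end = word.length();
        while (end > 0) {
            // Find the last vowel before 'end'.
            int i = end - 1;
            while (i >= 0 && !isVowel(word.charAt(i))) {
                i--;
            }
            if (i < 0) {
                // Only consonants remain: attach them to the leftmost syllable
                // found so far, or keep them as-is if the word has no vowel.
                String rest = word.substring(0, end);
                if (syllables.isEmpty()) {
                    syllables.add(rest);
                } else {
                    int leftmost = syllables.size() - 1;
                    syllables.set(leftmost, rest + syllables.get(leftmost));
                }
                break;
            }
            // A syllable starts at the vowel, or one letter earlier when that
            // letter is a consonant (the onset).
            int start = (i > 0 && !isVowel(word.charAt(i - 1))) ? i - 1 : i;
            syllables.add(word.substring(start, end));
            end = start;
        }
        // Syllables were collected right to left.
        Collections.reverse(syllables);
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(syllabify("kitaplar")); // prints [ki, tap, lar]
        System.out.println(syllabify("saat"));     // prints [sa, at]
    }
}
```

Hyphenation at a line break can then pick any boundary between two of the returned syllables. Loanwords and abbreviations may need extra rules that this sketch leaves out.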
Please refer to the contributors.txt file.
Portions of this code have been developed in Tübitak BİLGEM's Speech and Language Technologies Laboratory.