Turkish NLP libraries
Latest commit e52cc88 Jan 18, 2017 @ahmetaa cleanup

Zemberek-NLP

This is the new home of the Zemberek project. Zemberek-NLP is a Natural Language Processing library that provides basic NLP tools for processing Turkish text.

The latest version is 0.9.3 (November 22, 2016).

FAQ

Please read the FAQ for common questions.

Usage

Maven

Add the following repository to your pom.xml file:

<repositories>
    <repository>
        <id>ahmetaa-repo</id>
        <name>ahmetaa Maven Repo on Github</name>
        <url>https://raw.github.com/ahmetaa/maven-repo/master</url>
    </repository>
</repositories>

Then add the dependencies (for example, morphology):

<dependencies>
    <dependency>
        <groupId>zemberek-nlp</groupId>
        <artifactId>morphology</artifactId>
        <version>0.9.3</version>
    </dependency>
</dependencies>

Jar distributions

The Google Docs page lists the available versions, with separate jars for each module and for its dependencies.

Examples

Turkish-nlp-examples contains a Maven Java project with small usage examples.

Known Issues and Limitations

  • Project requires Java 8.
  • Currently, the word and sentence parse modules generate the parse graph at each initialization, so every run of the system takes a few seconds to start. We will fix this in the next version with fast serialization of the parse graph.
  • Morphological parsing does not work for some obvious and frequent words.
  • Morphological disambiguation is less accurate than expected (not very usable).
  • Morphological generation may not work for some obvious stem-suffix combinations.
  • Please see the issues section for further issues, and feel free to create new ones.
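
Because the parse graph is rebuilt at each initialization, it helps to create the parser once and reuse it for the lifetime of the application. The following is a minimal sketch of that pattern. `SlowParser` and `ParserHolder` are hypothetical stand-ins invented for illustration and are not part of the Zemberek API; replace `SlowParser` with whichever parser class you actually use.

```java
import java.util.concurrent.atomic.AtomicInteger;

class ReuseDemo {
    public static void main(String[] args) {
        // Every call goes through the holder, so the costly constructor runs only once.
        for (String word : new String[] {"kalem", "kitap", "ev"}) {
            System.out.println(ParserHolder.get().parse(word));
        }
        System.out.println("parsers built: " + SlowParser.BUILD_COUNT.get());
    }
}

/** Hypothetical stand-in for a parser whose construction is expensive. */
final class SlowParser {
    static final AtomicInteger BUILD_COUNT = new AtomicInteger();

    SlowParser() {
        // Assumption: this is where a real parser would build its parse graph.
        BUILD_COUNT.incrementAndGet();
    }

    String parse(String word) {
        // Placeholder result, not a real morphological analysis.
        return "parsed:" + word;
    }
}

/** Builds the parser on first use and returns the same instance afterwards. */
final class ParserHolder {
    private ParserHolder() {}

    // Initialization-on-demand holder: the JVM loads Lazy (and runs the
    // constructor) only when get() is first called, and does so thread-safely.
    private static final class Lazy {
        static final SlowParser INSTANCE = new SlowParser();
    }

    static SlowParser get() {
        return Lazy.INSTANCE;
    }
}
```

With this arrangement the startup cost is paid once per JVM process rather than once per call site. It does not remove the initial delay itself.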

Modules

Core

Core classes such as special Collection classes, hash functions, and helpers.

Morphology

Turkish morphological parsing, disambiguation and generation. See the Morphology Documentation.

Tokenization

Turkish tokenization and sentence boundary detection. So far, only rule-based algorithms are provided.
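
To give a feel for what rule-based sentence boundary detection looks like, here is a small sketch that uses the JDK's own `java.text.BreakIterator` with a Turkish locale. It does not use Zemberek's tokenization module and makes no claim about how that module works. The class and method names (`SentenceSplitDemo`, `splitSentences`) are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitDemo {

    /** Splits text into sentences using the JDK's built-in boundary rules. */
    static List<String> splitSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.forLanguageTag("tr"));
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Boundaries include trailing whitespace, so trim each piece.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "Bugün hava güzel. Yarın yağmur yağacak mı? Evet!";
        for (String sentence : splitSentences(text)) {
            System.out.println(sentence);
        }
    }
}
```

Generic rules like these can split incorrectly after abbreviations that end with a period, which is one reason language-specific rule sets are useful.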

Hyphenation

Turkish syllabification and hyphenation.
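
As an illustration of the idea behind syllabification, the sketch below applies the common textbook rule for Turkish: every syllable contains exactly one vowel, and a consonant directly before a vowel starts the new syllable. This is a simplified stand-alone example and not the algorithm used by the hyphenation module. The class and method names are assumptions made for the example, and loanwords with consonant clusters (for example "kontrol") may be split differently from the accepted form.

```java
import java.util.ArrayList;
import java.util.List;

class TurkishSyllables {
    private static final String VOWELS = "aeıioöuüâîûAEIİOÖUÜÂÎÛ";

    static boolean isVowel(char c) {
        return VOWELS.indexOf(c) >= 0;
    }

    /** Splits a single word into syllables using the one-vowel-per-syllable rule. */
    static List<String> syllabify(String word) {
        List<String> syllables = new ArrayList<>();
        if (word.isEmpty()) {
            return syllables;
        }
        int start = 0;
        boolean seenVowel = false;
        for (int i = 0; i < word.length(); i++) {
            if (!isVowel(word.charAt(i))) {
                continue;
            }
            if (seenVowel) {
                // The boundary falls before the consonant that precedes this vowel,
                // or directly before the vowel when two vowels are adjacent.
                int boundary = isVowel(word.charAt(i - 1)) ? i : i - 1;
                syllables.add(word.substring(start, boundary));
                start = boundary;
            }
            seenVowel = true;
        }
        syllables.add(word.substring(start));
        return syllables;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"kitaplar", "türkçe", "saat", "araba"}) {
            System.out.println(word + " -> " + String.join("-", syllabify(word)));
        }
    }
}
```

A hyphenation point can then be chosen at any of the syllable boundaries this returns.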

Language modelling

Language model compression

Acknowledgements

Please refer to the contributors.txt file.

Portions of this code have been developed in Tübitak BİLGEM's Speech and Language Technologies Laboratory.