Jitar HMM part of speech tagger



A simple Trigram HMM part-of-speech tagger


Jitar is a simple part-of-speech tagger, based on a trigram Hidden Markov Model (HMM). It (partly) implements the ideas set forth in [1]. Jitar is written in Java, so it should be easy to use in other Java programs, or languages that run on the JVM.
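To give a feel for what "trigram HMM" means here, the sketch below estimates the probability of a tag given the two preceding tags, smoothed by linear interpolation of unigram, bigram, and trigram relative frequencies as described in [1]. It is a toy illustration, not Jitar's actual code: the class name, method names, and the fixed lambda weights are all assumptions made for this example (the TnT paper estimates the weights from the training data).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy tag-trigram model with linear interpolation. Hypothetical names; not Jitar's API. */
final class InterpolatedTrigramSketch {
    private final Map<String, Integer> unigrams = new HashMap<>();
    private final Map<String, Integer> bigrams = new HashMap<>();
    private final Map<String, Integer> trigrams = new HashMap<>();
    private int totalTags = 0;

    // Interpolation weights; assumed to be non-negative and to sum to 1.
    private final double lambda1, lambda2, lambda3;

    InterpolatedTrigramSketch(double lambda1, double lambda2, double lambda3) {
        this.lambda1 = lambda1;
        this.lambda2 = lambda2;
        this.lambda3 = lambda3;
    }

    /** Counts the unigrams, bigrams, and trigrams in one tag sequence. */
    void addSequence(String... tags) {
        for (int i = 0; i < tags.length; i++) {
            unigrams.merge(tags[i], 1, Integer::sum);
            totalTags++;
            if (i >= 1) {
                bigrams.merge(tags[i - 1] + " " + tags[i], 1, Integer::sum);
            }
            if (i >= 2) {
                trigrams.merge(tags[i - 2] + " " + tags[i - 1] + " " + tags[i], 1, Integer::sum);
            }
        }
    }

    /** Relative frequency; defined as 0 when the context was never seen. */
    private static double ratio(int numerator, int denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    /** P(t3 | t1, t2) = l1 * P(t3) + l2 * P(t3 | t2) + l3 * P(t3 | t1, t2). */
    double probability(String t1, String t2, String t3) {
        double pUni = ratio(unigrams.getOrDefault(t3, 0), totalTags);
        double pBi = ratio(bigrams.getOrDefault(t2 + " " + t3, 0),
                unigrams.getOrDefault(t2, 0));
        double pTri = ratio(trigrams.getOrDefault(t1 + " " + t2 + " " + t3, 0),
                bigrams.getOrDefault(t1 + " " + t2, 0));
        return lambda1 * pUni + lambda2 * pBi + lambda3 * pTri;
    }

    public static void main(String[] args) {
        InterpolatedTrigramSketch model = new InterpolatedTrigramSketch(0.2, 0.3, 0.5);
        model.addSequence("DT", "NN", "VB", "DT", "NN");
        System.out.println(model.probability("DT", "NN", "VB"));
    }
}
```

Because the unigram term is always available, a trigram that never occurred in training still receives a non-zero probability as long as its final tag was seen.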


The Jitar API will be highly unstable for the first few versions!


The latest Jitar version can be downloaded from the releases page. The binary distribution includes a couple of handy scripts to use Jitar.

If you would like to use Jitar in your own software, add it as a dependency.




sbt:

libraryDependencies += "eu.danieldk.nlp.jitar" % "jitar" % "0.3.0"

Gradle:

compile 'eu.danieldk.nlp.jitar:jitar:0.3.0'


A model can be trained from a corpus that is annotated with part-of-speech tags, such as the Brown corpus. The training program creates the model from such a corpus:

bin/train brown my_brown_corpus my_corpus.model

Replace brown by conll if you are using a corpus in CoNLL format.
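For readers unfamiliar with the Brown format: each line holds one sentence, and each token is written as word/tag. The sketch below splits such a line into word and tag pairs. It only illustrates the input format and is not Jitar's corpus reader; the class and method names are hypothetical, and splitting on the last slash is an assumption made so that words containing a slash keep it.

```java
import java.util.ArrayList;
import java.util.List;

/** Parses one Brown-style line of word/tag tokens. Hypothetical helper; not Jitar's reader. */
final class BrownLineSketch {
    /** A word paired with its part-of-speech tag. */
    static final class TaggedWord {
        final String word;
        final String tag;

        TaggedWord(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    static List<TaggedWord> parseLine(String line) {
        List<TaggedWord> result = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // blank line
            }
            // Split on the last slash, so "1/2/cd" yields word "1/2" and tag "cd".
            int sep = token.lastIndexOf('/');
            if (sep <= 0 || sep == token.length() - 1) {
                throw new IllegalArgumentException("Not in word/tag form: " + token);
            }
            result.add(new TaggedWord(token.substring(0, sep), token.substring(sep + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (TaggedWord tw : parseLine("The/at cat/nn sat/vbd ./.")) {
            System.out.println(tw.word + "\t" + tw.tag);
        }
    }
}
```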


Usually, you will want to call the tagger from your own program, but we have included a simple command line tagger as a sample. This tagger reads pretokenized sentences from the standard input (one sentence per line), and will print the best scoring tag sequence to the standard output. For example:

$ echo "The cat is on the mat ." | bin/tag model
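The input handling described above (one pretokenized sentence per line, tokens separated by whitespace, one tag per token on output) can be sketched as follows. This does not use Jitar's API: the tagger is modelled as a plain function from tokens to tags, and the placeholder that labels every token UNK is only there to make the example run. In a real program you would plug in a tagger backed by a trained model.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;

/** Line-oriented tagging loop with a stand-in tagger. Hypothetical names; not Jitar's API. */
final class TagLoopSketch {
    /** Splits a pretokenized sentence on whitespace and joins the resulting tags with spaces. */
    static String tagLine(String line, Function<List<String>, List<String>> tagger) {
        List<String> tokens = Arrays.asList(line.trim().split("\\s+"));
        return String.join(" ", tagger.apply(tokens));
    }

    public static void main(String[] args) {
        // Placeholder: a real tagger would return the best-scoring tag sequence.
        Function<List<String>, List<String>> placeholder =
                tokens -> Collections.nCopies(tokens.size(), "UNK");

        BufferedReader stdin = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        stdin.lines()
                .filter(line -> !line.trim().isEmpty()) // skip blank lines
                .forEach(line -> System.out.println(tagLine(line, placeholder)));
    }
}
```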

Release plan

For version 0.y.z, there might be API breakage. The plan is to offer API stability for a given x in x.y.z when x >= 1.

0.4.0 (Planned)

  • Use Dictomaton to store the lexicon and suffixes for unknown words.


  • Fix a bug where the start/end markers could be used when handling unknown tokens (typically an unseen punctuation character). This change does not require retraining.
  • Add a utility jitar-tag-conllx to tag files that are in the CoNLL-X format. This preserves all other columns.
  • Compute interpolated scores only once.


  • Add a capitalization marking to tags (as per the TnT paper). This gives an improvement of around 0.2% on German and English.
  • Add a separate unknown word distribution for words containing a dash. This provides a modest improvement for English and German.
  • API simplification (no more need to use/specify start and end markers).

0.2.0 (Never released)

  • Java-style corpus readers.
  • Unified training and tagging data structures.
  • Add a utility for N-fold cross-validation.
  • Add more unit tests.


  • Release in the Maven Central Repository.
  • Convenient shell-script wrappers for training/tagging/evaluation.


Daniël de Kok <me@danieldk.eu>


  • "What's up with the name?"

    This is a Java port of a C++ tagger that I previously wrote, named Sitar. Jitar is not an abbreviation. If you do like abbreviations, let's make it "JavaIsh TAgging Redux" :).

  • "Can I use Jitar, or parts thereof, in closed-source software?"

    Sure, as long as you follow the terms of the Apache License version 2.0, including section 4b.

  • "Do I really have to add a readable attribution notice to my product?"

    Yes! If this is really a problem for you or your company, contact me to see if we can make a special arrangement.

[1] TnT - a statistical part-of-speech tagger, Thorsten Brants, 2000