This is an unofficial fork of the Lapos tagger, based on version 0.1.2. The official source is available here.
The goal of this fork is to add Unicode support for use in the Classical Language Toolkit (CLTK). Once Unicode support is working, the CLTK hopes that these changes will be merged upstream.
There are two branches: master for Linux and apple for Mac OS (some changes were made for Clang, see below).
For full instructions, see README. The CLTK's Latin model (based on the Perseus treebanks) was made with the following command:
$ ./lapos-learn -m ./model latin_training_set.pos
Note: You can download this training set with curl -O https://raw.githubusercontent.com/cltk/latin_treebank_perseus/master/latin_training_set.pos
To run the tagger, use echo to pass one sentence at a time:
$ echo "He opened the window." | ./lapos -t -m ./model_wsj02-21
He/PRP opened/VBD the/DT window/NN ./.
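To tag several sentences, the one-sentence-at-a-time pattern can be wrapped in a loop. The following is a minimal sketch, not part of Lapos itself: the function name tag_each_line and the file name sentences.txt are hypothetical, and the model path ./model is assumed to be the one produced by the training command above.

```shell
#!/bin/sh
# tag_each_line (hypothetical helper): read sentences from stdin, one per
# line, and pipe each one separately to the command given as arguments.
tag_each_line() {
    while IFS= read -r sentence || [ -n "$sentence" ]; do
        # Skip blank lines so the tagger never receives an empty sentence.
        [ -z "$sentence" ] && continue
        printf '%s\n' "$sentence" | "$@"
    done
}

# Assumed usage (sentences.txt is a placeholder file name):
#   tag_each_line ./lapos -t -m ./model < sentences.txt
```

Passing the tagger command as arguments keeps the loop independent of any particular model path or option set.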
To compile with Clang, a few changes need to be made, namely removing tr1 from, e.g., <tr1/unordered_map> and std::tr1::unordered_map.
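One way to apply this kind of change mechanically is a sed filter. This is only a sketch under the assumption that every TR1 use in the sources follows the two patterns named above. The function name strip_tr1 is hypothetical, and this is not necessarily how the changes in this fork were made.

```shell
#!/bin/sh
# strip_tr1 (hypothetical helper): rewrite C++ source read from stdin,
# dropping the tr1/ header prefix and the std::tr1:: namespace qualifier.
strip_tr1() {
    sed -e 's|<tr1/|<|g' -e 's|std::tr1::|std::|g'
}

# Assumed usage on a working copy (review the result before compiling):
#   for f in *.cpp *.h; do
#       strip_tr1 < "$f" > "$f.tmp" && mv "$f.tmp" "$f"
#   done
```

Because the non-TR1 std::unordered_map is a C++11 feature, the compiler may also need to be told to use C++11 or later.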
We also increased the maximum number of tags from 50 to 2000 (in crf.h, by commenting out enum { MAX_LABEL_TYPES = 50 }; and uncommenting const static int MAX_LABEL_TYPES = 2000;). We also removed the unnecessary empty-input-line warning in crf.cpp ("warning: empty sentence").
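Before building, it can be useful to confirm which tag limit is active in the header. The sketch below assumes the inactive definition is commented out with // at the start of its line. The function name active_max_label_types is hypothetical.

```shell
#!/bin/sh
# active_max_label_types (hypothetical helper): read a header from stdin,
# ignore lines that start with a // comment, and print the numeric value
# assigned to MAX_LABEL_TYPES on the remaining lines.
active_max_label_types() {
    grep -v '^[[:space:]]*//' |
        sed -n 's/.*MAX_LABEL_TYPES[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Assumed usage:
#   active_max_label_types < crf.h
```

If this prints more than one number, more than one definition is active and the header should be checked by hand.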
Lapos was created by Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. For all technical details, see README, and for the license, see LICENSE.