Fork of the Lookahead Part-Of-Speech (Lapos) Tagger



About

This is an unofficial fork of the Lapos tagger, based on version 0.1.2. The official source is available here.

The goal of this fork is to add Unicode support for use in the Classical Language Toolkit (CLTK). The CLTK hopes that, once this work is complete, the changes will be merged upstream.

Build

There are two branches: master for Linux and apple for Mac OS (some changes were made for Clang; see below).

Use

For full instructions, see README. The CLTK's Latin model (based on Perseus treebanks) was made with the following command:

$ ./lapos-learn -m ./model latin_training_set.pos

Note: You can get this training set with curl -O https://raw.githubusercontent.com/cltk/latin_treebank_perseus/master/latin_training_set.pos.

To tag text, use echo to pipe in one sentence at a time:

$ echo "He opened the window." | ./lapos -t -m ./model_wsj02-21
He/PRP opened/VBD the/DT window/NN ./.
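
The output above is one line of word/TAG tokens, so downstream code usually needs to split it back into pairs. The following is a minimal sketch of one way to do that; `parse_tagged_line` is a hypothetical helper, not part of Lapos, and it assumes that tokens are separated by whitespace and that the tag follows the last `/` in each token.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Lapos): split one line of tagger output,
// e.g. "He/PRP opened/VBD", into (word, tag) pairs.
// Assumption: tokens are whitespace-separated and the tag follows the LAST '/',
// so a token such as "./." yields word "." and tag ".".
std::vector<std::pair<std::string, std::string>>
parse_tagged_line(const std::string& line) {
    std::vector<std::pair<std::string, std::string>> result;
    std::istringstream in(line);
    std::string token;
    while (in >> token) {
        const std::string::size_type slash = token.rfind('/');
        if (slash == std::string::npos) {
            // No tag separator found: keep the token and leave the tag empty.
            result.emplace_back(token, "");
        } else {
            result.emplace_back(token.substr(0, slash), token.substr(slash + 1));
        }
    }
    return result;
}
```

Splitting on the last slash rather than the first keeps words that themselves contain a slash intact, on the assumption that tags never contain one.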

Changes

To compile with Clang, a few changes need to be made, namely removing tr1 from, e.g., <tr1/unordered_map> and std::tr1::unordered_map.
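
As an illustration of the tr1 removal, here is a small sketch of what code looks like once it uses the standard C++11 header instead of the TR1 one. The function `label_id` is a hypothetical example, not code from this repository, and the sketch assumes a compiler running in C++11 mode or later.

```cpp
// Before (TR1 spelling):
//   #include <tr1/unordered_map>
//   std::tr1::unordered_map<std::string, int> ids;
// After (standard C++11 spelling):
#include <string>
#include <unordered_map>

// Hypothetical helper (not from the Lapos source): return the integer id for
// a label, assigning the next free id the first time the label is seen.
int label_id(std::unordered_map<std::string, int>& ids, const std::string& label) {
    const auto it = ids.find(label);
    if (it != ids.end()) {
        return it->second;
    }
    const int id = static_cast<int>(ids.size());
    ids.emplace(label, id);
    return id;
}
```

Only the header name and the namespace change; the member functions used here (`find`, `end`, `size`) have the same names in both versions.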

We also increased the maximum number of tags from 50 to 2000 (in crf.h, by commenting out enum { MAX_LABEL_TYPES = 50 }; and uncommenting const static int MAX_LABEL_TYPES = 2000;). We also removed the unnecessary empty-input-line warning in crf.cpp ("warning: empty sentence").
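
The tag limit matters when a training set uses many distinct labels. The sketch below shows the changed constant as described above, together with a hypothetical helper, `fits_label_limit`, that checks whether a list of tags stays within the limit. The helper is an assumption for illustration only and does not appear in crf.h.

```cpp
#include <set>
#include <string>
#include <vector>

// enum { MAX_LABEL_TYPES = 50 };          // original limit, commented out
const static int MAX_LABEL_TYPES = 2000;   // limit used in this fork

// Hypothetical helper (not part of Lapos): report whether the number of
// distinct tags in `tags` is within `limit`.
bool fits_label_limit(const std::vector<std::string>& tags,
                      int limit = MAX_LABEL_TYPES) {
    const std::set<std::string> distinct(tags.begin(), tags.end());
    return static_cast<int>(distinct.size()) <= limit;
}
```

A check of this kind could be run over the tags of a training file before calling lapos-learn, to confirm that the tagset fits within the compiled-in limit.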

License

Lapos was created by Yoshimasa Tsuruoka, Yusuke Miyao, and Jun'ichi Kazama. For all technical details, see README; for the license, see LICENSE.