XIADA is a statistical POS tagger based on Markov models and written in Ruby. It handles XML documents natively, so XML documents can be tagged directly without prior conversion.
At present, the tagger includes three different custom configurations:
-
galician_xiada
To tag and lemmatize Galician written texts.
-
galician_xiada_oral
To tag and lemmatize Galician spoken transcriptions.
-
spanish_eslora
To tag and lemmatize Spanish spoken transcriptions.
The galician_xiada corpora and configurations come from the CORGA project, while the spanish_eslora ones come from the ESLORA project. You can find more information about authorship and licensing in the corresponding directories.
-
Install Ruby (version > 2.5.0):
Our preferred way is through rbenv/ruby-build.
-
Install bundler:
gem install bundler
-
Install libpgdm3 and sqlite3:
In Debian stable:
sudo apt-get install libsqlite3-0 libsqlite3-dev sqlite3 sqlite3-dev
-
Clone the repo:
git clone git@github.com:crpih/xiada.git
-
Install required gems:
Enter the repo root directory (from now on repo_root_directory) and run:
bundle install
To train the tagger, enter repo_root_directory and run:
cd training/bin
make galician_xiada
cd training/bin
make spanish_eslora
This command will generate different training databases in repo_root_directory/training/databases
(it will take several minutes to finish).
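Once training finishes, a quick way to confirm that the databases were generated is to list them. The following is a minimal sketch, assuming the layout implied by the example commands in this README (one subdirectory per profile containing `.db` files); the helper name `training_databases` is hypothetical and not part of XIADA.

```ruby
# Sketch: list the training databases found under a databases directory.
# Assumption: databases live in <databases_dir>/<profile>/<name>.db, as the
# example tagger commands in this README suggest.
def training_databases(databases_dir)
  Dir.glob(File.join(databases_dir, "*", "*.db")).sort
end

# Example usage, run from repo_root_directory:
#   training_databases("training/databases").each { |path| puts path }
```

An empty result would suggest that the make target did not complete.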
To check that everything works, run from repo_root_directory:
bundle exec rake test
And, finally, the tagger can be launched in several ways. Here is an example:
First, the XIADA_PROFILE environment variable must be set to the profile you want to use:
export XIADA_PROFILE="galician_xiada"
export XIADA_PROFILE="galician_xiada_oral"
export XIADA_PROFILE="spanish_eslora"
An XML file (named, for example, input.xml) like one of these can be created, replacing the sentence content as needed:
<?xml version="1.0" encoding="UTF-8"?>
<documento>
<oración>Esta é unha oración de exemplo para probar.</oración>
<oración>Esta é outra oración</oración>
</documento>
<?xml version="1.0" encoding="UTF-8"?>
<documento>
<fragmento>Esta é unha oración de exemplo para probar.</fragmento>
<fragmento>Esta é outra oración</fragmento>
</documento>
<?xml version="1.0" encoding="UTF-8"?>
<documento>
<oración>Esta es una oración de ejemplo para probar.</oración>
<oración>Esta es otra oración.</oración>
</documento>
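Input files like the ones above can also be generated from a list of sentences. This is a minimal sketch, assuming only the structure shown in the examples (a `documento` root with one element per sentence); the helper names `escape_xml_text` and `build_input_xml` are hypothetical and not part of XIADA.

```ruby
# Sketch: build a tagger input document from plain sentences.
# The root and sentence element names follow the examples in this README.

# Escapes the characters that are not allowed in XML text content.
def escape_xml_text(text)
  escapes = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }
  text.gsub(/[&<>]/, escapes)
end

# Returns the XML document as a string, one sentence element per line.
def build_input_xml(sentences, element: "oración")
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<documento>"]
  sentences.each do |sentence|
    lines << "<#{element}>#{escape_xml_text(sentence)}</#{element}>"
  end
  lines << "</documento>"
  lines.join("\n") + "\n"
end

# Example usage:
#   File.write("input.xml", build_input_xml(["Esta é outra oración"]))
```

Pass `element: "fragmento"` to produce the second variant shown above.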
And the command to tag the file could be:
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db
ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db
The output will be sent to STDOUT, so you can redirect it to another xml file:
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db > output.xml
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db > output.xml
ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db > output.xml
Or, since the tagger output is not indented and is hard to read, we usually pipe it through the xmllint program:
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db | xmllint --format - > output.xml
ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db | xmllint --format - > output.xml
ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db | xmllint --format - > output.xml
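Since the three invocations differ only in the profile name and the database file, they can be wrapped in a small script. The following is a sketch under stated assumptions: the paths are copied from the commands above, the script is run from repo_root_directory, and the helper names (`profile_paths`, `tagger_command`, `tag_file`) are hypothetical and not part of XIADA.

```ruby
require "open3"

# Returns the xml_values file and database for a profile.
# Paths are copied from the example commands in this README.
def profile_paths(profile)
  databases = {
    "galician_xiada" => "training/databases/galician_xiada/training_galician_xiada_escrita.db",
    "galician_xiada_oral" => "training/databases/galician_xiada_oral/training_galician_xiada_oral.db",
    "spanish_eslora" => "training/databases/spanish_eslora/training_spanish_eslora.db"
  }
  database = databases.fetch(profile) do
    raise ArgumentError, "unknown profile: #{profile}"
  end
  { xml_values: "running/#{profile}/xml_values.txt", database: database }
end

# Builds the argument list for the tagger, mirroring the commands above.
def tagger_command(profile, input_file)
  paths = profile_paths(profile)
  ["ruby", "running/bin/xiada_tagger.rb", "-v",
   "-x", paths[:xml_values], "-f", input_file, paths[:database]]
end

# Runs the tagger with XIADA_PROFILE set and returns its standard output.
def tag_file(profile, input_file)
  env = { "XIADA_PROFILE" => profile }
  output, status = Open3.capture2(env, *tagger_command(profile, input_file))
  raise "tagger exited with status #{status.exitstatus}" unless status.success?
  output
end

# Example usage:
#   File.write("output.xml", tag_file("galician_xiada", "input.xml"))
```

Passing the command as an argument list rather than a single string avoids shell quoting problems with file names that contain spaces.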
Build image with:
DOCKER_BUILDKIT=1 docker build --ssh default -t xiada_tagger-eslora:latest .
DOCKER_BUILDKIT=1 docker build --ssh default -t xiada_tagger-corga:latest .
Existing training databases are copied inside the image.
Port 4000 is exposed.
XIADA_PROFILE
must be defined to run the container.
bundle exec rake test
bundle exec ruby -I"lib:test" test/regression/tagger/xiada_tagger_test.rb --name="/spanish_eslora/"
Specify the test file, and the test name as a regular expression. For example, to execute only the ESLORA regression test for the 1.xml file:
bundle exec ruby -I"lib:test" test/regression/tagger/xiada_tagger_test.rb --name="/spanish_eslora.*_1.xml/"
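To see which tests a given pattern would select, the matching can be reproduced in a few lines. This is a sketch, assuming the test runner follows the usual Minitest/test-unit convention of treating a `--name` value wrapped in slashes as a regular expression and any other value as an exact name. The test names used in the example are hypothetical; the real ones are defined in test/regression/tagger/xiada_tagger_test.rb.

```ruby
# Sketch: select test names the way a --name filter would.
# A value like "/foo.*bar/" is treated as a regular expression;
# anything else must equal the test name exactly.
def select_tests(names, name_option)
  pattern = name_option[%r{\A/(.*)/\z}, 1]
  matcher = pattern ? Regexp.new(pattern) : name_option
  names.select { |name| matcher === name }
end

# Example usage with hypothetical test names:
#   names = ["test_spanish_eslora_1.xml", "test_spanish_eslora_11.xml"]
#   select_tests(names, "/spanish_eslora.*_1.xml/")
```

Note that the dot in `_1.xml` is unescaped and therefore matches any character; write `_1\.xml` if a literal dot is required.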
Define the environment variable XIADA_SAVE_RESULT=1
and execute tests.