XIADA: Tagger/Lemmatizer for Galician Language

XIADA is a statistical POS tagger based on Markov models and written in Ruby. It handles XML documents natively, which makes tagging XML documents straightforward.

At present, the tagger includes three different custom configurations:

galician_xiada

To tag and lemmatize written Galician texts.

galician_xiada_oral

To tag and lemmatize transcriptions of spoken Galician.

spanish_eslora

To tag and lemmatize transcriptions of spoken Spanish.

The galician_xiada corpora and configurations come from the CORGA project, while the spanish_eslora ones come from the ESLORA project. You can find more information about authorship and licensing in the corresponding directories.

Project home page

http://corpus.cirp.gal/xiada

INSTALL

Install Ruby (version > 2.5.0):

Our preferred way is through rbenv/ruby-build.

Install bundler:
```
 gem install bundler
```

Install libpgdm3 and sqlite3:

In Debian stable:

 sudo apt-get install libsqlite3-0 libsqlite3-dev sqlite3 sqlite3-dev

Clone the repo:

 git clone git@github.com:crpih/xiada.git

Install required gems:

Enter the repo root directory (from now on, repo_root_directory) and run:
```
 bundle install
```

TRAIN

To train the tagger, enter repo_root_directory and run:

for Galician XIADA...

cd training/bin
make galician_xiada

for Spanish ESLORA...

cd training/bin
make spanish_eslora

This command generates the training databases in repo_root_directory/training/databases. It takes several minutes to finish.

CHECK

To check that everything works, run from repo_root_directory:

bundle exec rake test

RUN

Finally, the tagger can be launched in several ways. Here is an example:

Tag sentences inside an XML document

First, set the XIADA_PROFILE environment variable:

for written Galician XIADA...

export XIADA_PROFILE="galician_xiada"

for spoken Galician XIADA...

export XIADA_PROFILE="galician_xiada_oral"

for spoken Spanish ESLORA...

export XIADA_PROFILE="spanish_eslora"

Create an XML file (named, for example, input.xml) like the following, replacing the sentence content as needed:

for written Galician XIADA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <oración>Esta é unha oración de exemplo para probar.</oración>
  <oración>Esta é outra oración</oración>
</documento>

for spoken Galician XIADA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <fragmento>Esta é unha oración de exemplo para probar.</fragmento>
  <fragmento>Esta é outra oración</fragmento>
</documento>

for spoken Spanish ESLORA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <oración>Esta es una oración de ejemplo para probar.</oración>
  <oración>Esta es otra oración.</oración>
</documento>
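For more than a handful of sentences, the input file can be generated instead of written by hand. The following is a minimal sketch, not part of XIADA: `xml_escape` and `build_input_xml` are hypothetical helper names, and the element names are taken from the examples above.

```ruby
# Hypothetical helpers (not part of XIADA): wrap plain sentences in the
# <documento> layout shown in the examples above.

# Escape the characters that are special in XML text content.
# "&" is replaced first so the other entities are not escaped twice.
def xml_escape(text)
  text.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# element is "oración" in the written Galician and spoken Spanish examples
# above, and "fragmento" in the spoken Galician one.
def build_input_xml(sentences, element: "oración")
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<documento>"]
  sentences.each do |sentence|
    lines << "  <#{element}>#{xml_escape(sentence)}</#{element}>"
  end
  lines << "</documento>"
  lines.join("\n") + "\n"
end

if __FILE__ == $PROGRAM_NAME
  # Print the document; redirect STDOUT to input.xml to save it.
  puts build_input_xml(["Esta é unha oración de exemplo para probar.",
                        "Esta é outra oración"])
end
```

Pass `element: "fragmento"` to produce the spoken Galician layout.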

The command to tag the file would then be:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db
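The three invocations above differ only in the profile directory and the database file, so the command line can be derived from XIADA_PROFILE. The following is a minimal sketch, assuming it is run from repo_root_directory. The `DATABASES` table and the `tagger_command` helper are hypothetical names, not part of XIADA; the paths are copied from the commands above.

```ruby
# Hypothetical helper (not part of XIADA): builds the tagger command line
# for a given profile.

# Database file for each profile, copied from the commands above.
DATABASES = {
  "galician_xiada"      => "training/databases/galician_xiada/training_galician_xiada_escrita.db",
  "galician_xiada_oral" => "training/databases/galician_xiada_oral/training_galician_xiada_oral.db",
  "spanish_eslora"      => "training/databases/spanish_eslora/training_spanish_eslora.db"
}.freeze

# Returns the command as an array of arguments, suitable for Kernel#system.
def tagger_command(profile, input_file)
  database = DATABASES.fetch(profile) do
    raise ArgumentError, "unknown profile: #{profile.inspect}"
  end
  ["ruby", "running/bin/xiada_tagger.rb", "-v",
   "-x", "running/#{profile}/xml_values.txt",
   "-f", input_file, database]
end

if __FILE__ == $PROGRAM_NAME
  # Falling back to "galician_xiada" is an assumption made for this sketch.
  profile = ENV.fetch("XIADA_PROFILE", "galician_xiada")
  puts tagger_command(profile, "input.xml").join(" ")
end
```

Returning an argument array rather than a single string lets the caller pass it to `system(*command)` without shell quoting issues.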

The output is sent to STDOUT, so you can redirect it to another XML file:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db > output.xml

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db > output.xml

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db > output.xml

Alternatively, since the tagger output is not indented, we usually pipe it through the xmllint program:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db | xmllint --format - > output.xml

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db | xmllint --format - > output.xml

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db | xmllint --format - > output.xml

Build docker image

Build image with:

DOCKER_BUILDKIT=1 docker build --ssh default -t xiada_tagger-eslora:latest .

DOCKER_BUILDKIT=1 docker build --ssh default -t xiada_tagger-corga:latest .

Existing training databases are copied inside the image. Port 4000 is exposed. XIADA_PROFILE must be defined to run the container.

Testing

Execute all tests

bundle exec rake test

Execute all tests for one profile:

bundle exec ruby -I"lib:test" test/regression/tagger/xiada_tagger_test.rb --name="/spanish_eslora/"

Execute a single test

Specify the file and the test name as a regular expression. For example, to execute only the ESLORA regression test for the 1.xml file:

bundle exec ruby -I"lib:test" test/regression/tagger/xiada_tagger_test.rb --name="/spanish_eslora.*_1.xml/"
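To preview which tests a pattern would select, the same regular expression can be tried in plain Ruby. This is only a sketch: the test names below are hypothetical, and the real ones are defined in test/regression/tagger/xiada_tagger_test.rb.

```ruby
# Hypothetical test names, used only to illustrate the pattern.
names = [
  "test_galician_xiada_1.xml",
  "test_spanish_eslora_1.xml",
  "test_spanish_eslora_2.xml",
  "test_spanish_eslora_10.xml"
]

# Same pattern as in the command above, with the dot escaped so that it
# matches only a literal "." rather than any character.
pattern = /spanish_eslora.*_1\.xml/

# Array#grep keeps the names that match the pattern.
selected = names.grep(pattern)
puts selected
```

With these names, only `test_spanish_eslora_1.xml` is selected; `test_spanish_eslora_10.xml` is not, because `_1` must be followed directly by `.xml`.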

Save test results for reference

Set the environment variable XIADA_SAVE_RESULT=1 and execute the tests.
