XIADA: Tagger/Lemmatizer for Galician Language

XIADA is a statistical POS tagger based on Markov models and written in Ruby. It handles XML documents natively, so XML documents can be tagged directly without a conversion step.

At present, the tagger includes three different custom configurations:

  1. galician_xiada

    To tag and lemmatize Galician written texts.

  2. galician_xiada_oral

    To tag and lemmatize Galician spoken transcriptions.

  3. spanish_eslora

    To tag and lemmatize Spanish spoken transcriptions.

The galician_xiada corpora and configurations come from the CORGA project, while the spanish_eslora ones come from the ESLORA project. You can find more information about authorship and licensing in the corresponding directories.

Project home page

http://corpus.cirp.gal/xiada

INSTALL

  1. Install Ruby (version > 2.0):

    Our preferred way is through rbenv/ruby-build.

  2. Install bundler:

     gem install bundler
    
  3. Install libgdbm3 and sqlite3:

    On Debian stable:

     sudo apt-get install libgdbm3 libgdbm-dev libsqlite3-0 libsqlite3-dev sqlite3 sqlite3-dev
    
  4. Clone the repo:

     git clone git@github.com:crpih/xiada.git
    
  5. Install required gems:

    Enter the repo root directory (from now on, repo_root_directory) and run:

     bundle install
    

TRAIN

To train the tagger, enter repo_root_directory and run:

for Galician XIADA...

cd training/bin
make galician_xiada

for Spanish ESLORA...

cd training/bin
make spanish_eslora

This command will generate different training databases in repo_root_directory/training/databases (it will take several minutes to finish).
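After training, it can be useful to confirm that the databases were actually produced. The following is a minimal sketch, not part of XIADA itself: the constant and method names are hypothetical, and the database paths are simply copied from the tagging commands in the RUN section. It assumes it is run from repo_root_directory.

```ruby
# Expected database paths, copied from the tagging commands in this README.
# EXPECTED_DATABASES and missing_databases are illustrative names only.
EXPECTED_DATABASES = {
  'galician_xiada'      => 'training/databases/galician_xiada/training_galician_xiada_escrita.db',
  'galician_xiada_oral' => 'training/databases/galician_xiada_oral/training_galician_xiada_oral.db',
  'spanish_eslora'      => 'training/databases/spanish_eslora/training_spanish_eslora.db'
}.freeze

# Returns the names of the profiles whose database file does not exist
# under the given repository root.
def missing_databases(repo_root)
  EXPECTED_DATABASES.reject do |_profile, relative_path|
    File.file?(File.join(repo_root, relative_path))
  end.keys
end

# Report the state of the current directory (expected to be the repo root).
missing = missing_databases(Dir.pwd)
if missing.empty?
  puts 'All training databases found.'
else
  puts "Not trained yet: #{missing.join(', ')}"
end
```

A profile listed as not trained yet only means that its database file is absent; the sketch does not inspect the contents of existing files.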

CHECK

To check that everything is working, run the following from repo_root_directory:

bundle exec rake test

RUN

Finally, the tagger can be launched in several ways. Here is one example:

Tag sentences inside an XML document

First, the XIADA_PROFILE environment variable must be set:

for written Galician XIADA...

export XIADA_PROFILE="galician_xiada"

for spoken Galician XIADA...

export XIADA_PROFILE="galician_xiada_oral"

for spoken Spanish ESLORA...

export XIADA_PROFILE="spanish_eslora"
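A misspelled profile name is easy to miss, so a wrapper script may want to check the variable before launching the tagger. The sketch below is an external pre-check under stated assumptions: `KNOWN_PROFILES` and `xiada_profile` are hypothetical names, the list contains only the three profiles named in this README, and nothing here reflects how the tagger itself reads the variable.

```ruby
# The three profiles listed in this README (illustrative constant name).
KNOWN_PROFILES = %w[galician_xiada galician_xiada_oral spanish_eslora].freeze

# Returns the profile name found in env['XIADA_PROFILE'], or raises
# ArgumentError when it is unset, empty, or not one of KNOWN_PROFILES.
# env defaults to the process environment; a plain Hash also works.
def xiada_profile(env = ENV)
  profile = env['XIADA_PROFILE']
  raise ArgumentError, 'XIADA_PROFILE is not set' if profile.nil? || profile.empty?

  unless KNOWN_PROFILES.include?(profile)
    raise ArgumentError,
          "Unknown XIADA_PROFILE #{profile.inspect}; expected one of: #{KNOWN_PROFILES.join(', ')}"
  end
  profile
end

# Report the result for the current environment without aborting.
begin
  puts "Using profile: #{xiada_profile}"
rescue ArgumentError => e
  puts "Profile check failed: #{e.message}"
end
```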

Next, create an XML file (named, for example, input.xml) like one of the following, replacing the sentence content as needed:

for written Galician XIADA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <oración>Esta é unha oración de exemplo para probar.</oración>
  <oración>Esta é outra oración</oración>
</documento>

for spoken Galician XIADA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <fragmento>Esta é unha oración de exemplo para probar.</fragmento>
  <fragmento>Esta é outra oración</fragmento>
</documento>

for spoken Spanish ESLORA...

<?xml version="1.0" encoding="UTF-8"?>
<documento>
  <oración>Esta es una oración de ejemplo para probar.</oración>
  <oración>Esta es otra oración.</oración>
</documento>
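When the sentences come from another program, the input file can be generated instead of written by hand. The following sketch builds a document in the same shape as the examples above. The element names per profile are taken from those examples only (the authoritative names presumably live in each profile's xml_values.txt), and `SENTENCE_ELEMENT`, `escape_xml`, and `build_input_xml` are hypothetical helpers, not part of XIADA.

```ruby
# Sentence-level element used in this README's examples for each profile.
# Assumption: these mirror the examples above, not the tagger's configuration.
SENTENCE_ELEMENT = {
  'galician_xiada'      => 'oración',
  'galician_xiada_oral' => 'fragmento',
  'spanish_eslora'      => 'oración'
}.freeze

# Escapes the characters that are not allowed as-is in XML text content.
# The ampersand is replaced first so later entities are not double-escaped.
def escape_xml(text)
  text.gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
end

# Builds the input document as a String: one element per sentence,
# wrapped in <documento>. Raises KeyError for an unknown profile.
def build_input_xml(profile, sentences)
  element = SENTENCE_ELEMENT.fetch(profile)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', '<documento>']
  sentences.each do |sentence|
    lines << "  <#{element}>#{escape_xml(sentence)}</#{element}>"
  end
  lines << '</documento>'
  lines.join("\n") + "\n"
end

puts build_input_xml('galician_xiada',
                     ['Esta é unha oración de exemplo para probar.',
                      'Esta é outra oración'])
```

Passing the returned string to `File.write('input.xml', xml)` would save it under the file name used in the commands that follow.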

The file can then be tagged with one of these commands:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db 

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db 

The output is written to STDOUT, so you can redirect it to another XML file:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db > output.xml

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db > output.xml

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db > output.xml 

Because the tagger output is not indented and is hard to read, we usually pipe it through the xmllint program:

for written Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada/xml_values.txt -f input.xml training/databases/galician_xiada/training_galician_xiada_escrita.db | xmllint --format - > output.xml

for spoken Galician XIADA...

ruby running/bin/xiada_tagger.rb -v -x running/galician_xiada_oral/xml_values.txt -f input.xml training/databases/galician_xiada_oral/training_galician_xiada_oral.db | xmllint --format - > output.xml

for spoken Spanish ESLORA...

ruby running/bin/xiada_tagger.rb -v -x running/spanish_eslora/xml_values.txt -f input.xml training/databases/spanish_eslora/training_spanish_eslora.db | xmllint --format - > output.xml
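The three command variants above differ only in the profile-specific paths, so they can be derived from a single table. The sketch below is one possible wrapper, written under the assumption that it runs from repo_root_directory and that xmllint is on the PATH. `PROFILE_FILES`, `tagger_command`, and `tag_file` are hypothetical names; the flags and paths are copied unchanged from the commands above.

```ruby
require 'open3'
require 'shellwords'

# Profile-specific files, copied from the commands in this README.
PROFILE_FILES = {
  'galician_xiada' => {
    xml_values: 'running/galician_xiada/xml_values.txt',
    database: 'training/databases/galician_xiada/training_galician_xiada_escrita.db'
  },
  'galician_xiada_oral' => {
    xml_values: 'running/galician_xiada_oral/xml_values.txt',
    database: 'training/databases/galician_xiada_oral/training_galician_xiada_oral.db'
  },
  'spanish_eslora' => {
    xml_values: 'running/spanish_eslora/xml_values.txt',
    database: 'training/databases/spanish_eslora/training_spanish_eslora.db'
  }
}.freeze

# Returns the tagger invocation as an argument array (no shell quoting needed).
def tagger_command(profile, input_file)
  files = PROFILE_FILES.fetch(profile)
  ['ruby', 'running/bin/xiada_tagger.rb', '-v',
   '-x', files[:xml_values], '-f', input_file, files[:database]]
end

# Runs the tagger with XIADA_PROFILE set, pipes its output through
# `xmllint --format -`, and writes the result to output_file.
# Returns the exit statuses of both processes.
def tag_file(profile, input_file, output_file)
  env = { 'XIADA_PROFILE' => profile }
  Open3.pipeline([env, *tagger_command(profile, input_file)],
                 ['xmllint', '--format', '-'],
                 out: output_file)
end

# Print the command that would be run, without executing anything.
puts Shellwords.join(tagger_command('galician_xiada', 'input.xml'))
```

Calling `tag_file('galician_xiada', 'input.xml', 'output.xml')` would then run the full pipeline; checking that every returned status reports `success?` is a reasonable way to detect a failed run.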
