XdgDisambiguatrParsingSpanish

Alex Rudnick edited this page Nov 22, 2013 · 1 revision

Training MaltParser on the Ancora corpus of Spanish: http://clic.ub.edu/corpus/en

AnCora is freely available once you register on the wiki: register, tell them who you are, and then you can download the corpus.

Training MaltParser on AnCora has been done at least once before: http://www.sepln.org/revistaSEPLN/revista/44/articulos/revista4418.pdf

Next actions:

  • The problem here is that we have all these nice files in the AnCora format, but MaltParser wants them in the CoNLL format. That's OK.
  • We're just going to have to write a script to convert from one to the other. It should just be a matter of splitting each line on tabs and rearranging the fields.
  • CoNLL2009-ST-Spanish-trial.txt starts with the same sentence as 3LB-CAST_111_C-2.txt from the AnCora dependency corpus, so the two files can be compared to work out the field mapping.
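
The "split on tabs and rearrange the fields" step could look roughly like the sketch below. It is only an illustration under assumptions: the real column order of the AnCora dependency files is not written down on this page, so the mapping `CONLL09_TO_CONLLX` assumes CoNLL-2009-style input columns and targets the 10-column CoNLL-X layout. The names `convert_line` and `convert_lines` are made up for this example; adjust the mapping after comparing the two files mentioned above.

```python
# Sketch only: rearrange tab-separated token lines into the 10-column
# CoNLL-X layout. The mapping below is an ASSUMPTION (CoNLL-2009-style
# input); check it against the real AnCora files before using it.

# For each CoNLL-X output column, the index of the input field to copy,
# or None to emit the "_" placeholder.
# CoNLL-X columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
CONLL09_TO_CONLLX = [0, 1, 2, 4, 4, 6, 8, 10, None, None]


def convert_line(line, mapping=CONLL09_TO_CONLLX):
    """Convert one token line; blank lines (sentence breaks) stay blank."""
    line = line.rstrip("\n")
    if not line.strip():
        return ""
    fields = line.split("\t")
    out = []
    for idx in mapping:
        # Missing or empty fields become "_", the CoNLL placeholder.
        if idx is None or idx >= len(fields) or fields[idx] == "":
            out.append("_")
        else:
            out.append(fields[idx])
    return "\t".join(out)


def convert_lines(lines, mapping=CONLL09_TO_CONLLX):
    """Convert an iterable of lines, preserving sentence boundaries."""
    return [convert_line(line, mapping) for line in lines]
```

Passing a different `mapping` list is all that should be needed if the AnCora columns turn out to be in another order.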

You have registered as a user on http://clic.ub.edu/corpus, which gives you access to the AnCora corpora downloads.

Each time you publish a work in which these corpora have been used, you must mention their origin and add the following reference:
Taulé, M., M.A. Martí, M. Recasens (2009). “AnCora: Multilevel Annotated Corpora for Catalan and Spanish”, Proceedings of 6th International Conference on Language Resources and Evaluation.