dali-ambiguity/dali-preprocessing-pipeline

 
 


DALI PIPELINE

This is the pre-processing pipeline for the Dali (Disagreements and Language Interpretation) project. The tool comprises a pre-processing pipeline and a converter, both developed at Queen Mary University of London. It is partially described in Section 3 of the following paper:

Crowdsourcing and Aggregating Nested Markable Annotations
Chris Madge, Juntao Yu, Jon Chamberlain, Udo Kruschwitz, Silviu Paun and Massimo Poesio
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019

Pre-processing Pipeline

The pipeline takes Gutenberg, Wikipedia or DaliDoc documents as input and processes them with a sentence splitter, a tokenizer, a part-of-speech (PoS) tagger, a dependency parser and a mention detector. It then writes the processed documents in one of the supported formats (Masxml, Masxml_PD, MMAX, SGF, CONLL12 and Dali). Sentence splitting and tokenization use the Stanford pipeline. The source of the PoS tags depends on the method you choose: with the Mate parser, the tags are generated by the parser itself; with the Dozat parser or DaliTagger (the neural mention detector), they are produced by the Stanford tagger.
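The relationship between the chosen mention detector and the source of the PoS tags can be summarised as a small lookup. The Python sketch below is illustrative only: `POS_SOURCE` and `pos_source` are hypothetical names that do not exist in the tool, and the keys simply mirror the `-mate`, `-dozat` and `-dalitagger` options.

```python
# Illustrative only: POS_SOURCE and pos_source() are hypothetical names,
# not part of the Dali pipeline itself.
POS_SOURCE = {
    "mate": "Mate parser",            # PoS tags are generated by the parser
    "dozat": "Stanford tagger",       # PoS tags are annotated by the Stanford tagger
    "dalitagger": "Stanford tagger",  # same for the neural mention detector
}


def pos_source(detector: str) -> str:
    """Return which component supplies PoS tags for a mention detector."""
    try:
        return POS_SOURCE[detector]
    except KeyError:
        raise ValueError(f"unknown mention detector: {detector!r}") from None
```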

The mention detection part of the pipeline contains two main approaches:

Rule-based mention detector

The first approach is a rule-based mention detector built on a dependency parser, which we refer to as DEP MD. Two parsers are supported: Mate and Dozat.

The Mate parser is fully included in dalipipeline-all-inclusive.jar. Pre-trained models can also be downloaded from this link.

If you'd like to use the Dozat parser, please follow the instructions at this link and download the model and the word embeddings. Please test the parser first to make sure it works.

Neural network based mention detector

The second approach is a neural-network-based mention detector (NN MD). The pretrained models can be downloaded here. For more details, see the README.md inside the DaliTagger folder.

The NN MD has also been used for mention detection in the shared task hosted at the First Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC). Results are presented in the following paper:

Anaphora Resolution with the ARRAU Corpus
Massimo Poesio, Yulia Grishina, Varada Kolhatkar, Nafise Moosavi, Ina Roesiger, Adam Roussel, Fabian Simonjetz, Alexandra Uma, Olga Uryupina, Juntao Yu, Heike Zinsmeister
In Proceedings of the First Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC), 2018

The usage of the pipeline

Usage: java -Xmx5g -cp dalipipeline.jar dali.main.Pipeline [Options]

| Option | Description | Default |
| --- | --- | --- |
| --help, -h | Help | |
| -g \| -w \| -d | The type of the input document. -g: Gutenberg; -w: Wikipedia; -d: DaliDoc | -d |
| -mate \| -dozat \| -dalitagger | The mention detector to use. -mate: DEP MD based on Bohnet and Nivre (2012); -dozat: DEP MD based on Dozat and Manning (2016); -dalitagger: NN MD | -dalitagger |
| -startFile | With the Dozat parser, the location of network.py; with the NN MD, the path to test_pipe.py | DaliTagger/test_pipe.py |
| -input <dir> | The directory that contains the documents to be processed | input/ |
| -output <dir> | The directory in which to write the processed documents | output/ |
| -tmodel <file> | The location of the part-of-speech tagger model | models/english-bidirectional-distsim.tagger |
| -pmodel <dir> | The location of the parsing model of the Mate parser, the -save_dir of the Dozat parser, or the model (prefix) location of the NN MD | DaliTagger/models/model- |
| -xsl <file> | The location of the xsl file required by the SGF converter | models/MASXML2SGF.xsl |
| -dali -masxml -masxmlpd -sgf -conll -mmax | The output format of the documents | -masxml |
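As a rough illustration of how the options above combine, the following Python sketch assembles a pipeline command line. It is not part of the tool: `build_pipeline_command` and its parameter names are hypothetical, and it assumes that `-startFile` and `-pmodel` each take a path as the next argument and that the order of options does not matter. Options that are not passed fall back to the defaults listed in the table.

```python
from typing import List, Optional

# Hypothetical helper, not shipped with the tool: it only assembles the
# command line from the documented options and does not run anything.
DOC_TYPES = {"gutenberg": "-g", "wikipedia": "-w", "dalidoc": "-d"}
DETECTORS = {"mate", "dozat", "dalitagger"}
OUTPUT_FORMATS = {"dali", "masxml", "masxmlpd", "sgf", "conll", "mmax"}


def build_pipeline_command(
    input_dir: str = "input/",
    output_dir: str = "output/",
    doc_type: str = "dalidoc",
    detector: str = "dalitagger",
    output_format: str = "masxml",
    jar: str = "dalipipeline.jar",
    start_file: Optional[str] = None,
    pmodel: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for dali.main.Pipeline."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown document type: {doc_type!r}")
    if detector not in DETECTORS:
        raise ValueError(f"unknown mention detector: {detector!r}")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"unknown output format: {output_format!r}")

    cmd = [
        "java", "-Xmx5g", "-cp", jar, "dali.main.Pipeline",
        DOC_TYPES[doc_type],
        f"-{detector}",
        "-input", input_dir,
        "-output", output_dir,
        f"-{output_format}",
    ]
    # Assumption: both options are followed by a path argument.
    if start_file is not None:
        cmd += ["-startFile", start_file]
    if pmodel is not None:
        cmd += ["-pmodel", pmodel]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it; once the jar and models
    # are in place, the list can be passed to subprocess.run(cmd, check=True).
    print(" ".join(build_pipeline_command(detector="mate", output_format="conll")))
```

Building the command as a list rather than a single string avoids shell quoting problems when directory names contain spaces.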

Converter

The converter converts between the formats supported by this tool. It currently supports Dali (a binary file generated by the Dali pipeline), Masxml, CoNLL 2012, MMAX and SGF (output only).

The usage of the converter

Usage: java -cp dalipipeline.jar dali.main.Converter [Options]

| Option | Description | Default |
| --- | --- | --- |
| --help, -h | Help | |
| -inFormat [dali \| masxml \| masxmlpd \| conll \| mmax] | The input file format | dali |
| -input <dir> | The directory that contains the documents to be converted | input/ |
| -output <dir> | The directory in which to write the converted documents | output/ |
| -xsl <file> | The location of the xsl file required by the SGF converter | models/MASXML2SGF.xsl |
| -dali -masxml -masxmlpd -sgf -conll -mmax | The output format of the documents | -masxml |
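The converter options can be combined in the same way. The Python sketch below is a hypothetical helper (`build_converter_command` is not part of the tool) that assembles a converter command line and rejects SGF as an input format, since SGF is output only. It assumes that `-inFormat` takes its value as the next argument, and it passes `-xsl` only when SGF output is requested.

```python
from typing import List

# Hypothetical helper, not shipped with the tool. The format names come
# from the converter options; SGF is supported as an output format only.
INPUT_FORMATS = {"dali", "masxml", "masxmlpd", "conll", "mmax"}
OUTPUT_FORMATS = INPUT_FORMATS | {"sgf"}


def build_converter_command(
    in_format: str = "dali",
    out_format: str = "masxml",
    input_dir: str = "input/",
    output_dir: str = "output/",
    jar: str = "dalipipeline.jar",
    xsl: str = "models/MASXML2SGF.xsl",
) -> List[str]:
    """Assemble the argument list for dali.main.Converter."""
    if in_format not in INPUT_FORMATS:
        raise ValueError(f"unsupported input format: {in_format!r}")
    if out_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output format: {out_format!r}")

    cmd = [
        "java", "-cp", jar, "dali.main.Converter",
        "-inFormat", in_format,  # assumption: value follows the flag
        "-input", input_dir,
        "-output", output_dir,
        f"-{out_format}",
    ]
    # The xsl file is only needed by the SGF converter.
    if out_format == "sgf":
        cmd += ["-xsl", xsl]
    return cmd


if __name__ == "__main__":
    # Print the command rather than running it.
    print(" ".join(build_converter_command("conll", "sgf")))
```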
