This software contains four rule-based dependency parsers, one each for English, Spanish, Galician, and Portuguese, as well as MetaRomance, a multilingual parser suited for Romance languages. The parsers are implemented in Perl and stored in the parsers folder. They were generated from dependency grammars, which are stored in the grammars folder.
The software also contains a compiler (compi-beta.rb), implemented in Ruby, which generates Perl parsers from DepPattern grammars. To write formal grammars with DepPattern, see the formal grammar tutorial. The software also provides the PoS tagger of Linguakit, likewise developed by our group.
Requirements: GNU/Linux, Perl, and Ruby (for the grammar compiler).
You only need to download the repository. Either download DepPattern-master.zip and unzip it:
unzip DepPattern-master.zip
or clone the repository with git:
git clone https://github.com/citiususc/DepPattern.git
Run ./deppattern --help to see the available options:
usage: deppattern <lang> [--help|-h] [-m|--meta-romance] [-g|--grammar]
       [-ng|--no-iteration-grammar] [-p|--parser] [-f|--file] [-a] [-fa] [-c]
required positional arguments:
  <lang>    Choose the language
            Choices: [en, es, gl, pt], case insensitive
optional named arguments:
  --help, -h                               show this help message and exit
  -m, --meta-romance                       MetaRomance Parser
  -g, --grammar <grammar>                  Path of the file grammar (with iterations)
  -ng, --no-iteration-grammar <grammar>    Path of the file grammar (without iterations)
  -p, --parser <parser>                    Path of the parser, or name of the parser generated from grammar (i.e. metaromance)
  -f, --file <file>                        Path of the file input (default stdin)
  -a                                       Simple dependency analysis
  -fa                                      Full dependency analysis
  -conll                                   Full dependency analysis with CoNLL format
  -c                                       Tagged text with syntactic information (for correction rules)
On Windows, the deppattern.bat command uses the same syntax. You must install Perl and Ruby for Windows and specify the paths to the corresponding interpreters, perl and ruby.
Return a syntactic analysis for Portuguese in -a format:
./deppattern pt -f tests/test-pt -a
Return a syntactic analysis for English in -conll format:
./deppattern en -f tests/test-en -conll
Generate a parser (parser.perl) from the English grammar using the compiler:
./deppattern en -g grammars/grammar-devel-en/grammar-en.dp
Return a syntactic analysis using the Spanish grammar, with -conll format:
./deppattern es -f tests/test-es -g grammars/grammar-devel-es/grammar-es.dp -conll
You can also pipe the input text to the command:
echo "Mary is eating fish." | ./deppattern en -a
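The invocations above can also be assembled from a script. The following is a minimal Python sketch, not part of the package: the function names build_command and parse_text are hypothetical, and it only uses the flags listed in the help text (-m, -f, -g, -a, -fa, -conll, -c). It assumes the script is run from the repository root, where ./deppattern lives.

```python
import subprocess

# Language codes and output options taken from the --help text.
VALID_LANGS = {"en", "es", "gl", "pt"}
VALID_OUTPUTS = {"-a", "-fa", "-conll", "-c"}


def build_command(lang, output="-a", input_file=None, grammar=None,
                  meta_romance=False):
    """Assemble a deppattern argument list (hypothetical helper)."""
    lang = lang.lower()  # the help text says the code is case insensitive
    if lang not in VALID_LANGS:
        raise ValueError(f"unsupported language: {lang}")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"unsupported output option: {output}")
    cmd = ["./deppattern", lang]
    if meta_romance:
        cmd.append("-m")
    if input_file is not None:
        cmd += ["-f", input_file]
    if grammar is not None:
        cmd += ["-g", grammar]
    cmd.append(output)
    return cmd


def parse_text(text, lang, output="-a"):
    """Send text to deppattern on stdin and return its standard output."""
    result = subprocess.run(
        build_command(lang, output),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if deppattern exits with an error
    )
    return result.stdout
```

For instance, parse_text("Mary is eating fish.", "en") would run the same pipeline as the echo example.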
Each grammar directory must contain the following files:
- the grammar (the name of the file is chosen by the user)
- tagset.conf
- dependencies.conf
- lexical_classes.conf
For more details, see the DepPattern tutorial.
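Before compiling a grammar, it can help to confirm that its directory is complete. The check below is a small Python sketch under the assumption that the three .conf files sit directly in the grammar directory next to the grammar file; the function name missing_grammar_files is hypothetical.

```python
from pathlib import Path

# Configuration files that every grammar directory must contain.
REQUIRED_CONFIGS = ("tagset.conf", "dependencies.conf", "lexical_classes.conf")


def missing_grammar_files(grammar_dir, grammar_name):
    """Return the required file names that are absent from grammar_dir."""
    directory = Path(grammar_dir)
    expected = (grammar_name,) + REQUIRED_CONFIGS
    return [name for name in expected if not (directory / name).is_file()]
```

An empty list means the directory has the grammar file and all three configuration files.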
One of the parsers provided by the package is MetaRomance, a parser for Romance languages that uses Universal Dependencies. It was one of the systems that participated in the CoNLL-2017 Shared Task on multilingual dependency parsing. If the input text is in Portuguese, the command to run MetaRomance is the following:
./deppattern pt -m -f tests/test-pt -a
For more information, see:
Garcia, Marcos and Pablo Gamallo (2017) "A rule-based system for cross-lingual parsing of Romance languages with Universal Dependencies", CoNLL-2017, Vancouver, Canada.
The input file must be in plain text format and encoded in UTF-8.
The file containing the grammar must be in plain text format. Below, you'll find a toy example of a grammar with 4 dependency-based rules:
AdjnR: NOUN ADJ
Agr: number, genre
%
SpecL: DT NOUN
Agr: number, genre
%
SubjL: NOUN [ADV]* VERB
Agr: number
%
DobjR: VERB [ADV]* NOUN
%
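To make the layout of the toy grammar explicit, here is a short Python sketch that splits it into rules. It only covers what the example shows (a "Name: pattern" line, an optional "Agr:" line, and a "%" terminator) and is not the DepPattern compiler; the function name parse_toy_grammar and the dictionary keys are hypothetical.

```python
def parse_toy_grammar(text):
    """Split '%'-terminated rule blocks into dictionaries."""
    rules = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "%":
            # A '%' line closes the rule being read.
            if current is not None:
                rules.append(current)
            current = None
            continue
        key, _, value = line.partition(":")
        if key.strip() == "Agr":
            if current is None:
                raise ValueError("Agr line found before any rule")
            current["agreement"] = [a.strip() for a in value.split(",")]
        else:
            # Any other line starts a new rule: name, then the tag pattern.
            current = {
                "name": key.strip(),
                "pattern": value.split(),
                "agreement": [],
            }
    if current is not None:
        rules.append(current)  # tolerate a final rule without '%'
    return rules
```

Applied to the four rules above, this yields four dictionaries; the third one has the pattern NOUN, [ADV]*, VERB and agreement on number.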
For more details, see the tutorial stored in the doc directory.
Option -a makes deppattern generate an output based on triples. Each analysed sentence consists of two elements:
- A line containing the PoS-tagged lemmas of the sentence. This line begins with the tag SENT. The set of tags used here is listed in the file TagSet.txt. Each lemma is identified by its position in the sentence; in the example below, positions start at 0.
- All dependency triples identified by the grammar. A triple consists of:
(relation;head_lemma;dependent_lemma)
For instance, the sentence "I am a man." generates the following output:
SENT::<I_PRO_0_<number:0|lemma:I|possessor:0|case:0|genre:0|person:0|politeness:0|type:P|token:I|> am_VERB_1_<number:0|mode:0|lemma:be|genre:0|tense:0|person:0|type:S|token:am|> a_DT_2_<number:0|lemma:a|possessor:0|genre:0|person:0|type:0|token:a|> man_NOUN_3_<number:S|lemma:man|genre:0|person:3|type:C|token:man|> ._SENT>
(Subj;be_VERB_1;I_PN_0)
(Spec;man_NOM_3;a_DT_2)
(Dobj;be_VERB_1;man_NOM_3)
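The triple lines above are easy to read back into a program. The following Python sketch assumes each triple is on its own line in the form (relation;lemma_TAG_position;lemma_TAG_position), as in the example; the names Node, Triple, and parse_triple are hypothetical and not part of DepPattern.

```python
import re
from typing import NamedTuple


class Node(NamedTuple):
    lemma: str
    tag: str
    position: int


class Triple(NamedTuple):
    relation: str
    head: Node
    dependent: Node


# Three ';'-separated fields wrapped in parentheses.
TRIPLE_RE = re.compile(r"^\(([^;]+);([^;]+);([^;]+)\)$")


def parse_node(text):
    """Split 'be_VERB_1' into lemma, tag, and position."""
    # Split from the right so that lemmas containing '_' stay intact.
    lemma, tag, position = text.rsplit("_", 2)
    return Node(lemma, tag, int(position))


def parse_triple(line):
    """Return a Triple, or None if the line is not a triple (e.g. a SENT line)."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None
    relation, head, dependent = match.groups()
    return Triple(relation, parse_node(head), parse_node(dependent))
```

This sketch would misread a lemma that itself contains a semicolon; handling that case is left out for brevity.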
The set of dependency relationships used by the 5 grammars can be consulted and modified in the corresponding configuration file: grammars/grammar-devel-LING/dependencies.conf.
Morpho-syntactic information is provided by the PoS tagger, which is also included in Linguakit.
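The morpho-syntactic features appear in the SENT line as name:value pairs separated by "|" inside angle brackets. As an illustration, the Python sketch below reads a SENT line into a list of tokens with their features. It is based only on the single example shown above, so the regular expression and the names parse_features and parse_sent_line are assumptions rather than a specification of the format.

```python
import re

SENT_PREFIX = "SENT::<"
# word_TAG_position_<features>, as in 'am_VERB_1_<number:0|...|token:am|>'
TOKEN_RE = re.compile(r"(\S+?)_([A-Z]+)_(\d+)_<([^>]*)>")


def parse_features(block):
    """Turn 'number:0|lemma:be|' (optionally wrapped in <...>) into a dict."""
    inner = block.strip()
    if inner.startswith("<") and inner.endswith(">"):
        inner = inner[1:-1]
    features = {}
    for item in inner.split("|"):
        if not item:
            continue  # skip the empty piece after the trailing '|'
        name, _, value = item.partition(":")
        features[name] = value
    return features


def parse_sent_line(line):
    """Return one dict per numbered token of a SENT line."""
    body = line.strip()
    if body.startswith(SENT_PREFIX):
        body = body[len(SENT_PREFIX):]
    tokens = []
    for word, tag, position, feats in TOKEN_RE.findall(body):
        tokens.append({
            "word": word,
            "tag": tag,
            "position": int(position),
            "features": parse_features(feats),
        })
    return tokens
```

The final sentence marker (._SENT) carries no position or features in the example, so this sketch skips it.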
Option -fa gives a full representation of the output triples. Each triple is associated with two pieces of information: the morpho-syntactic features of the head and those of the dependent.
Option -c generates a file with the same content as the input (a tagged text) but with some corrections proposed by the grammar. This option is useful for identifying and correcting regular errors of PoS taggers using grammatical rules.
With option -conll, it is also possible to get the output in the format defined by CoNLL-X, inspired by Lin (1998). This format was adopted by the evaluation tasks defined in CoNLL.
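A reader for this output might look like the Python sketch below. It assumes that deppattern writes the standard CoNLL-X layout (ten tab-separated columns per token, with a blank line between sentences), which has not been checked against the tool; the function name read_conllx is hypothetical.

```python
# Column names of the standard CoNLL-X format.
CONLLX_FIELDS = (
    "id", "form", "lemma", "cpostag", "postag",
    "feats", "head", "deprel", "phead", "pdeprel",
)


def read_conllx(text):
    """Group CoNLL-X token lines into sentences (lists of dicts)."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split("\t")
        if len(columns) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLX_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences
```

All values are kept as strings; converting id and head to integers is a natural next step once the actual output has been inspected.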
Pablo Gamallo, Isaac González, Marcos Garcia, César Piñeiro, Grupo ProLNat@GE, CiTIUS, University of Santiago de Compostela, Galiza.
Gamallo, P., González, I. 2011. A Grammatical Formalism Based on Patterns of Part-of-Speech Tags, International Journal of Corpus Linguistics, 16(1), 45-71. ISSN 1384-6655
Gamallo, P. 2015. Dependency Parsing with Compression Rules, The 14th International Conference on Parsing Technologies (IWPT-2015), p. 107-117, Bilbao. ISBN 978-1-941643-98-3
Gamallo, P., González, I. 2012. DepPattern: A Multilingual Dependency Parser, Demo Session of the International Conference on Computational Processing of the Portuguese Language (PROPOR 2012), April 17-20, Coimbra, Portugal.