# Dependency syntactic analysis

EstNLTK provides wrappers for two syntactic analysers: [MaltParser](http://www.maltparser.org/) and [VISLCG3 based syntactic analyser of Estonian](https://github.com/EstSyntax/EstCG).

MaltParser based syntactic analysis is distributed with EstNLTK and can be applied by default. VISLCG3 based syntactic analysis requires that VISLCG3 be installed on the system first.

Both analysers use a common syntactic analysis tagset, which is described in https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf.

## `VislTagger`: VISLCG3 syntactic analysis

VISL CG3 is a rule-based syntactic parser that has thousands of Estonian-specific handcrafted rules for tagging syntactic functions and dependencies. However, the parser needs more information than the standard morphological analyser provides (e.g. pronoun types, verb subcategorization, etc.), and its input needs to be in a different format.

Therefore, to use VislTagger, we first need to add layer `morph_extended` to our Text object. This layer contains more detailed morphological analysis than the standard `morph_analysis` layer.


In [1]:
from estnltk import Text

In [2]:
text = Text('Ta on ise tee esimesel poolel.')
text.tag_layer(['morph_extended'])

text
Ta on ise tee esimesel poolel.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,7
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,7
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,7


If we compare the standard `morph_analysis` layer with `morph_extended`, we can see that `morph_extended` has more refined labels under the 'form' attribute, as well as features like pronoun type, punctuation type, letter case, etc. Those extra features are needed because they are used in VislCG3 grammar rules. 

In addition, the `morph_extended` layer is more ambiguous than the standard `morph_analysis` layer because it is more detailed. For example, in our example sentence the word 'on' gets 6 different analyses. As the first step of VislTagger is morphological disambiguation, the extra analyses are removed and do not propagate to the syntactic analysis layer.

In [3]:
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Ta,Ta,tema,tema,['tema'],0,,sg n,P
on,on,olema,ole,['ole'],0,,b,V
,on,olema,ole,['ole'],0,,vad,V
ise,ise,ise,ise,['ise'],0,,sg n,P
,ise,ise,ise,['ise'],0,,pl n,P
tee,tee,tee,tee,['tee'],0,,sg n,S
esimesel,esimesel,esimene,esimene,['esimene'],l,,sg ad,O
poolel,poolel,pool,pool,['pool'],l,,sg ad,S
.,.,.,.,['.'],,,,Z


In [4]:
text.morph_extended

layer name,attributes,parent,enveloping,ambiguous,span count
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,7

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech,punctuation_type,pronoun_type,letter_case,fin,verb_extension_suffix,subcat
Ta,Ta,tema,tema,['tema'],0,,sg nom,P,,['ps3'],cap,,[],
on,on,olema,ole,['ole'],0,,mod indic pres ps3 sg ps af,V,,,,True,[],['Intr']
,on,olema,ole,['ole'],0,,aux indic pres ps3 sg ps af,V,,,,True,[],['Intr']
,on,olema,ole,['ole'],0,,main indic pres ps3 sg ps af,V,,,,True,[],['Intr']
,on,olema,ole,['ole'],0,,mod indic pres ps3 pl ps af,V,,,,True,[],['Intr']
,on,olema,ole,['ole'],0,,aux indic pres ps3 pl ps af,V,,,,True,[],['Intr']
,on,olema,ole,['ole'],0,,main indic pres ps3 pl ps af,V,,,,True,[],['Intr']
ise,ise,ise,ise,['ise'],0,,sg nom,P,,"['pos', 'det', 'refl']",,,[],
,ise,ise,ise,['ise'],0,,pl nom,P,,"['pos', 'det', 'refl']",,,[],
tee,tee,tee,tee,['tee'],0,,com sg nom,S,,,,,[],


In order to use VISLCG3 based syntactic analysis, the VISLCG3 parser must be installed into the system. The information about the parser is distributed in the [Constraint Grammar's Google Group](https://groups.google.com/forum/#!forum/constraint-grammar), and this is also the place to look for the most compact guide about getting & installing the [latest version of the parser](https://groups.google.com/forum/#!msg/constraint-grammar/hXsbzyyhIVI/nHXRnOomf9wJ).

By default, EstNLTK expects that the directory containing the VISLCG3 parser's executable (`vislcg3` on UNIX, `vislcg3.exe` on Windows) is accessible via the system's PATH environment variable. If this requirement is satisfied, EstNLTK should be able to execute the parser, and we can parse our example sentence as follows:

In [5]:
from estnltk.taggers import VislTagger

visl_tagger = VislTagger()
visl_tagger.tag(text)

text.visl

layer name,attributes,parent,enveloping,ambiguous,span count
visl,"id, lemma, ending, partofspeech, subtype, mood, tense, voice, person, inf_form, number, case, polarity, number_format, capitalized, finiteness, subcat, clause_boundary, deprel, head",morph_extended,,True,7

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head
Ta,1,tema,0,P,pers,_,_,_,ps3,_,sg,nom,_,_,cap,_,_,_,@SUBJ,2
on,2,ole,0,V,main,indic,pres,ps,ps3,_,sg,_,af,_,_,_,_,_,@FMV,0
ise,3,ise,0,P,"['pos', 'det', 'refl']",_,_,_,_,_,sg,nom,_,_,_,_,_,_,@ADVL,2
tee,4,tee,0,S,com,_,_,_,_,_,sg,nom,_,_,_,_,_,_,@PRD,2
esimesel,5,esimene,l,N,ord,_,_,_,_,_,sg,ad,_,l,_,_,_,_,@AN>,6
poolel,6,pool,l,S,com,_,_,_,_,_,sg,ad,_,_,_,_,_,_,"['@<NN', '@ADVL']",2
.,7,.,_,Z,Fst,_,_,_,_,_,_,_,_,_,_,_,_,CLB,_,6


The parser assigns each word a syntactic label (`deprel`) and a syntactic head (`head`), which is the `id` of its governing word in the sentence. For example, '@SUBJ' stands for subject; see the [documentation](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf) for details. NB! As can be seen from the example, word ids start from 1, not 0. A `head` value of 0 marks the current word as the root node of the tree.
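The `id`/`head` convention can be tried out without EstNLTK. The following is a minimal sketch that copies the `id`, word and `head` columns of the table above into plain Python tuples and follows the `head` links from a word up to the root. The names `ROWS` and `path_to_root` are made up for this illustration and are not part of EstNLTK.

```python
# (id, word, head) triples copied from the visl layer output above.
ROWS = [
    (1, 'Ta', 2),
    (2, 'on', 0),
    (3, 'ise', 2),
    (4, 'tee', 2),
    (5, 'esimesel', 6),
    (6, 'poolel', 2),
    (7, '.', 6),
]


def path_to_root(rows, word_id):
    """Return the words visited when following head links up to the root."""
    by_id = {row_id: (word, head) for row_id, word, head in rows}
    path = []
    seen = set()
    current = word_id
    while current != 0:          # head 0 means "no governing word"
        if current in seen:      # guard against cyclic head links
            raise ValueError('cycle in head links')
        seen.add(current)
        word, head = by_id[current]
        path.append(word)
        current = head
    return path


# The full stop is attached to 'poolel', which is attached to the root 'on'.
print(path_to_root(ROWS, 7))
```

Because ids start from 1, the sketch looks words up by `id` rather than by list position.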

VislCG3 is based on the [Constraint Grammar](http://visl.sdu.dk/constraint_grammar.html) formalism: it first adds all syntactic labels and then removes the ones that are not suitable based on constraints. As a result, some syntactic labels can remain ambiguous. In our example sentence, the word 'poolel' gets both the analysis of a complement (@<NN) and that of an adverbial (@ADVL). Despite this, each word still has only one syntactic head.

Note: By default, VislTagger makes post-corrections on the original VISLCG3 parser output and removes self-references (that is: situations where word's `id` equals `head`). You can use constructor parameter `fix_selfreferences` to turn off these post-corrections.
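To make the notion of a self-reference concrete, here is a small sketch that only detects such rows in plain `(id, head)` pairs. It does not show how VislTagger corrects them, which is not described here. The helper name `find_self_references` and the sample data are assumptions made for this illustration.

```python
def find_self_references(rows):
    """Return the ids of rows whose head points back to the row itself."""
    return [row_id for row_id, head in rows if row_id == head]


# (id, head) pairs; the last pair is a made-up self-reference.
example = [(1, 2), (2, 0), (3, 3)]
print(find_self_references(example))
```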

## `SyntaxDependencyRetagger` 

`SyntaxDependencyRetagger` adds `parent_span` and `children` attributes to the syntax layer.

Here the syntax layer is a layer that has at least `id` and `head` attributes.

The `parent_span` and `children` attributes help to navigate from a span to the parent and children of that span.

If `parent_span` or `children` attribute already exists in the layer then the values are updated. Therefore, to update the dependencies in the syntax layer, first update the values of `head` attributes and then run this retagger.
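Conceptually, both attributes can be derived from `id` and `head` alone. The sketch below shows that derivation on plain tuples. It is not the retagger's implementation, and it returns ids instead of spans. The names `ROWS` and `link_parents_and_children` are hypothetical.

```python
from collections import defaultdict

# (id, word, head) triples copied from the visl layer output above.
ROWS = [
    (1, 'Ta', 2),
    (2, 'on', 0),
    (3, 'ise', 2),
    (4, 'tee', 2),
    (5, 'esimesel', 6),
    (6, 'poolel', 2),
    (7, '.', 6),
]


def link_parents_and_children(rows):
    """Map each id to its parent id (None for the root) and to its child ids."""
    ids = {row_id for row_id, _, _ in rows}
    parent = {}
    children = defaultdict(list)
    for row_id, _, head in rows:
        # head 0 (or any id missing from the sentence) means "no parent"
        parent[row_id] = head if head in ids else None
        if head in ids:
            children[head].append(row_id)
    return parent, dict(children)


parent, children = link_parents_and_children(ROWS)
print(parent[6], children[6])   # parent and children of 'poolel'
```

If a `head` value changes, the mapping has to be recomputed, which is why the retagger is run again after editing `head` attributes.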

In [6]:
from estnltk.taggers import SyntaxDependencyRetagger

SyntaxDependencyRetagger('visl').retag(text)

text.visl

layer name,attributes,parent,enveloping,ambiguous,span count
visl,"id, lemma, ending, partofspeech, subtype, mood, tense, voice, person, inf_form, number, case, polarity, number_format, capitalized, finiteness, subcat, clause_boundary, deprel, head, parent_span, children",morph_extended,,True,7

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
Ta,1,tema,0,P,pers,_,_,_,ps3,_,sg,nom,_,_,cap,_,_,_,@SUBJ,2,"Span('on', [{'id': 2, 'lemma': 'ole', 'ending': '0', 'partofspeech': 'V', 'subty ..., type: <class 'estnltk.layer.span.Span'>",()
on,2,ole,0,V,main,indic,pres,ps,ps3,_,sg,_,af,_,_,_,_,_,@FMV,0,,"(""Span('Ta', [{'id': 1, 'lemma': 'tema', 'ending': '0', 'partofspeech': 'P', 'su ..., type: <class 'tuple'>, length: 4"
ise,3,ise,0,P,"['pos', 'det', 'refl']",_,_,_,_,_,sg,nom,_,_,_,_,_,_,@ADVL,2,"Span('on', [{'id': 2, 'lemma': 'ole', 'ending': '0', 'partofspeech': 'V', 'subty ..., type: <class 'estnltk.layer.span.Span'>",()
tee,4,tee,0,S,com,_,_,_,_,_,sg,nom,_,_,_,_,_,_,@PRD,2,"Span('on', [{'id': 2, 'lemma': 'ole', 'ending': '0', 'partofspeech': 'V', 'subty ..., type: <class 'estnltk.layer.span.Span'>",()
esimesel,5,esimene,l,N,ord,_,_,_,_,_,sg,ad,_,l,_,_,_,_,@AN>,6,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'ending': 'l', 'partofspeech': 'S', ' ..., type: <class 'estnltk.layer.span.Span'>",()
poolel,6,pool,l,S,com,_,_,_,_,_,sg,ad,_,_,_,_,_,_,"['@<NN', '@ADVL']",2,"Span('on', [{'id': 2, 'lemma': 'ole', 'ending': '0', 'partofspeech': 'V', 'subty ..., type: <class 'estnltk.layer.span.Span'>","(""Span('esimesel', [{'id': 5, 'lemma': 'esimene', 'ending': 'l', 'partofspeech': ..., type: <class 'tuple'>, length: 2"
.,7,.,_,Z,Fst,_,_,_,_,_,_,_,_,_,_,_,_,CLB,_,6,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'ending': 'l', 'partofspeech': 'S', ' ..., type: <class 'estnltk.layer.span.Span'>",()


The span `poolel` has a parent span and two child spans:

In [7]:
span = text.visl[5]
span

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
poolel,6,pool,l,S,com,_,_,_,_,_,sg,ad,_,_,_,_,_,_,"['@<NN', '@ADVL']",2,"Span('on', [{'id': 2, 'lemma': 'ole', 'ending': '0', 'partofspeech': 'V', 'subty ..., type: <class 'estnltk.layer.span.Span'>","(""Span('esimesel', [{'id': 5, 'lemma': 'esimene', 'ending': 'l', 'partofspeech': ..., type: <class 'tuple'>, length: 2"


To get the parent span write

In [8]:
span.annotations[0].parent_span

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
on,2,ole,0,V,main,indic,pres,ps,ps3,_,sg,_,af,_,_,_,_,_,@FMV,0,,"(""Span('Ta', [{'id': 1, 'lemma': 'tema', 'ending': '0', 'partofspeech': 'P', 'su ..., type: <class 'tuple'>, length: 4"


To iterate over all children write

In [9]:
for child in span.annotations[0].children:
    display(child)

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
esimesel,5,esimene,l,N,ord,_,_,_,_,_,sg,ad,_,l,_,_,_,_,@AN>,6,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'ending': 'l', 'partofspeech': 'S', ' ..., type: <class 'estnltk.layer.span.Span'>",()


text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
.,7,.,_,Z,Fst,_,_,_,_,_,_,_,_,_,_,_,_,CLB,_,6,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'ending': 'l', 'partofspeech': 'S', ' ..., type: <class 'estnltk.layer.span.Span'>",()


## MaltParser

[MaltParser](http://www.maltparser.org) is a data-driven parser that has been trained on the [Estonian Dependency Treebank](https://github.com/EstSyntax/EDT).
To use MaltParser, you need:
  * Java SE Runtime Environment (version >= 1.8) installed on the system;
  * `java` available via the PATH environment variable.
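Whether the required executables are reachable via PATH can be checked from Python before running any tagger. The sketch below uses only the standard library; the helper name `find_executables` is an assumption made for this example.

```python
import shutil


def find_executables(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# 'java' is needed by MaltParser, 'vislcg3' by the VISLCG3 based tools.
for name, path in find_executables(['java', 'vislcg3']).items():
    print(name, '->', path or 'not found on PATH')
```

A `None` result only tells us that the executable was not found. It does not check the Java version.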


### ConllMorphTagger (pre-processing)

MaltParser's optimal data format for morphological analysis differs from the VislCG3 format, so we first need to convert the data.
These conversions are a reimplementation of the pre-processing implemented by Kaili Müürisep in https://github.com/EstSyntax/EstMalt .
Conversions are done by ConllMorphTagger, which takes the `sentences` and `morph_extended` layers as inputs and adds a `conll_morph` layer to our data.
The process also requires the VISLCG3 parser (VISLCG3's executable must be accessible from PATH).

In [10]:
from estnltk.taggers import ConllMorphTagger

tagger = ConllMorphTagger(output_layer='conll_morph',  # default: 'conll_morph'
                          morph_extended_layer='morph_extended'  # default: 'morph_extended'
                          )
tagger

name,output layer,output attributes,input layers
ConllMorphTagger,conll_morph,"('id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc')","('sentences', 'morph_extended')"

0,1
no_visl,False


In [11]:
tagger.tag(text)

text.conll_morph

layer name,attributes,parent,enveloping,ambiguous,span count
conll_morph,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",morph_extended,,True,7

text,id,form,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Ta,1,Ta,tema,P,Ppers,ps3|sg|nom,_,_,_,_
on,2,on,ole,V,V,indic|pres|ps3|sg,_,_,_,_
ise,3,ise,ise,P,P,pos|det|refl|sg|nom,_,_,_,_
tee,4,tee,tee,S,S,sg|nom,_,_,_,_
esimesel,5,esimesel,esimene,N,A,ord|sg|ad|l,_,_,_,_
poolel,6,poolel,pool,S,S,sg|ad,_,_,_,_
.,7,.,.,Z,Z,Fst,_,_,_,_


This way, we get a standard CONLL format that can also be converted into a string as follows.

In [12]:
from estnltk.taggers.syntax.conll_morph_to_str import conll_to_str

print(conll_to_str(text))

1	Ta	tema	P	Ppers	ps3|sg|nom	_	_	_	_	
2	on	ole	V	V	indic|pres|ps3|sg	_	_	_	_	
3	ise	ise	P	P	pos|det|refl|sg|nom	_	_	_	_	
4	tee	tee	S	S	sg|nom	_	_	_	_	
5	esimesel	esimene	N	A	ord|sg|ad|l	_	_	_	_	
6	poolel	pool	S	S	sg|ad	_	_	_	_	
7	.	.	Z	Z	Fst	_	_	_	_	




This is useful when we want to use MaltParser without the EstNLTK interface. However, we can also use MaltParser inside EstNLTK on our Text object. There are two ways to do it.

### MaltParserTagger

If text has layers `words`, `sentences` and `conll_morph`, then you can use MaltParserTagger to analyse it with EstNLTK's MaltParser. This produces `maltparser_syntax` layer:

In [13]:
from estnltk.taggers import MaltParserTagger

maltparser_tagger = MaltParserTagger()

maltparser_tagger.tag( text )

text.maltparser_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
maltparser_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ta,1,tema,P,Ppers,"{'ps3': '', 'sg': '', 'nom': ''}",2,@SUBJ,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>",()
on,2,ole,V,V,"{'indic': '', 'pres': '', 'ps3': '', 'sg': ''}",0,ROOT,,,,"(""Span('Ta', [{'id': 1, 'lemma': 'tema', 'upostag': 'P', 'xpostag': 'Ppers', 'fe ..., type: <class 'tuple'>, length: 3"
ise,3,ise,P,P,"{'pos': '', 'det': '', 'refl': '', 'sg': '', 'nom': ''}",4,@NN>,,,"Span('tee', [{'id': 4, 'lemma': 'tee', 'upostag': 'S', 'xpostag': 'S', 'feats': ..., type: <class 'estnltk.layer.span.Span'>",()
tee,4,tee,S,S,"{'sg': '', 'nom': ''}",2,@PRD,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>","(""Span('ise', [{'id': 3, 'lemma': 'ise', 'upostag': 'P', 'xpostag': 'P', 'feats' ..., type: <class 'tuple'>, length: 1"
esimesel,5,esimene,N,A,"{'ord': '', 'sg': '', 'ad': '', 'l': ''}",6,@AN>,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk.layer.span.Span'>",()
poolel,6,pool,S,S,"{'sg': '', 'ad': ''}",2,@ADVL,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>","(""Span('esimesel', [{'id': 5, 'lemma': 'esimene', 'upostag': 'N', 'xpostag': 'A' ..., type: <class 'tuple'>, length: 2"
.,7,.,Z,Z,{'Fst': ''},6,@Punc,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk.layer.span.Span'>",()


Internally, MaltParserTagger also applies SyntaxDependencyRetagger to add `parent_span` and `children` attributes to spans.
You can disable this behaviour with the flag `add_parent_and_children`:


      maltparser_tagger = MaltParserTagger( add_parent_and_children=False )

### MaltParser class (for raw CONLL output)

Alternatively, if you want to get the standard CONLL-format syntactic analyses, you can use `MaltParser` class:

In [14]:
from estnltk.taggers.syntax.maltparser_tagger.maltparser import MaltParser

parser = MaltParser()
initial_output = parser.parse_text(text, return_type = 'conllu_lines')

In [15]:
print( '\n'.join( initial_output ) )

1	Ta	tema	P	Ppers	ps3|sg|nom	2	@SUBJ	_	_
2	on	ole	V	V	indic|pres|ps3|sg	0	ROOT	_	_
3	ise	ise	P	P	pos|det|refl|sg|nom	4	@NN>	_	_
4	tee	tee	S	S	sg|nom	2	@PRD	_	_
5	esimesel	esimene	N	A	ord|sg|ad|l	6	@AN>	_	_
6	poolel	pool	S	S	sg|ad	2	@ADVL	_	_
7	.	.	Z	Z	Fst	6	@Punc	_	_



Note that this way of parsing does not add an output layer to the Text object.

Still, you can write the `initial_output` to a file, and then import it as a Text object as described in the subsection "Working with ConLL-X files":

In [16]:
with open("example.conll", "w") as fout:
    fout.write('\n'.join( initial_output))

Internally, MaltParserTagger also uses the `MaltParser` class, so both ways of parsing produce the same results.
So, if you want a layer with MaltParser's results, MaltParserTagger is the more straightforward way of obtaining it.
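The raw CoNLL lines can also be inspected directly in Python. The following sketch splits one tab-separated line into the ten columns listed for the `conll_morph` layer. The names `COLUMNS` and `parse_conll_line` are hypothetical, and representing `feats` as a list of strings is a simplification chosen for this example.

```python
# Column names follow the attribute list shown for the conll_morph layer.
COLUMNS = ('id', 'form', 'lemma', 'upostag', 'xpostag',
           'feats', 'head', 'deprel', 'deps', 'misc')


def parse_conll_line(line):
    """Turn one tab-separated CoNLL line into a dict keyed by column name."""
    # Tolerate a trailing tab or newline after the last column.
    fields = line.rstrip('\t\r\n').split('\t')
    if len(fields) != len(COLUMNS):
        raise ValueError('expected %d columns, got %d'
                         % (len(COLUMNS), len(fields)))
    row = dict(zip(COLUMNS, fields))
    row['id'] = int(row['id'])
    # Lines that have not been parsed yet carry '_' in the head column.
    row['head'] = None if row['head'] == '_' else int(row['head'])
    row['feats'] = [] if row['feats'] == '_' else row['feats'].split('|')
    return row


lines = [
    '1\tTa\ttema\tP\tPpers\tps3|sg|nom\t2\t@SUBJ\t_\t_',
    '2\ton\tole\tV\tV\tindic|pres|ps3|sg\t0\tROOT\t_\t_',
]
for row in map(parse_conll_line, lines):
    print(row['form'], row['head'], row['deprel'], row['feats'])
```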

## UDPipe Tagger

[UDPipe](http://ufal.mff.cuni.cz/udpipe) is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL files. EstNLTK provides two UDPipe models for syntactic analysis: one for the CoNLL-U format and one for the CoNLL-X format.

In order to use UDPipe based syntactic analysis, UDPipe must be installed on the system. Installation instructions can be found on the [UDPipe Installation page](http://ufal.mff.cuni.cz/udpipe/install).
 
EstNLTK expects that the directory containing the UDPipe executable (`udpipe` on UNIX, `udpipe.exe` on Windows) is accessible via the system's PATH environment variable. UDPipe executables can be downloaded [here](https://github.com/ufal/udpipe/releases). If this requirement is satisfied, EstNLTK should be able to execute UDPipe and we can parse our example sentence.

If the text object has `sentences` and `conll_morph` layers, then it is possible to use the UDPipeTagger for the syntactic analysis. This will produce `udpipe_syntax` layer. 

In [17]:
from estnltk.taggers import UDPipeTagger
from estnltk import Text
from estnltk.taggers import ConllMorphTagger
import warnings
warnings.filterwarnings('ignore')

text = Text("Tema tahtis saada rikkaks ja kuulsaks ja elada kodanlikku elu .")
text.analyse('all')

conll_morph = ConllMorphTagger() # adding conll_morph layer 
conll_morph.tag(text)

udpipe_tagger = UDPipeTagger() # by default it uses conll_morph as an input layer
udpipe_tagger.tag(text)
text.udpipe_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
udpipe_syntax,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",conll_morph,,True,11

text,id,form,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Tema,1,Tema,tema,P,Ppers,"OrderedDict([('ps3', ''), ('sg', ''), ('nom', '')])",2,@SUBJ,_,_
tahtis,2,tahtis,taht,V,V,"OrderedDict([('indic', ''), ('impf', ''), ('ps3', ''), ('sg', '')])",0,root,_,_
saada,3,saada,saa,V,Vinf,"OrderedDict([('inf', '')])",2,@OBJ,_,_
rikkaks,4,rikkaks,rikas,A,A,"OrderedDict([('sg', ''), ('tr', '')])",3,@ADVL,_,_
ja,5,ja,ja,J,Jc,OrderedDict(),6,@J,_,_
kuulsaks,6,kuulsaks,kuulus,A,A,"OrderedDict([('sg', ''), ('tr', '')])",4,@ADVL,_,_
ja,7,ja,ja,J,Jc,OrderedDict(),8,@J,_,_
elada,8,elada,ela,V,Vinf,"OrderedDict([('inf', '')])",3,@OBJ,_,_
kodanlikku,9,kodanlikku,kodanlik,A,A,"OrderedDict([('sg', ''), ('part', '')])",10,@AN>,_,_
elu,10,elu,elu,S,S,"OrderedDict([('sg', ''), ('part', '')])",8,@OBJ,_,_


It is also possible to add syntactic analysis to a text that has a layer in CoNLL-U format. We can read a file in CoNLL-U format and add syntactic analysis to it using UDPipeTagger. To do that, the `version` parameter has to be specified for the tagger.

In [18]:
from estnltk.converters.conll_importer import conll_to_text

text = conll_to_text(file='example.conllu', syntax_layer='conllu_morph') 

udpipe_tagger = UDPipeTagger(input_syntax_layer='conllu_morph', version='conllu') # default version is conllx

udpipe_tagger.tag(text)
text.udpipe_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
udpipe_syntax,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",conllu_morph,,True,22

text,id,form,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Aga,1,Aga,aga,CCONJ,J,OrderedDict(),5,cc,_,_
ma,2,ma,mina,PRON,P,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])",5,nsubj,_,_
peaksin,3,peaksin,pidama,AUX,V,"OrderedDict([('Mood', 'Cnd'), ('Number', 'Sing'), ('Person', '1'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6",5,aux,_,_
vist,4,vist,vist,ADV,D,OrderedDict(),5,advmod,_,_
pihtima,5,pihtima,pihtima,VERB,V,"OrderedDict([('Case', 'Ill'), ('VerbForm', 'Sup'), ('Voice', 'Act')])",0,root,_,_
",",6,",",",",PUNCT,Z,OrderedDict(),10,punct,_,_
et,7,et,et,SCONJ,J,OrderedDict(),10,mark,_,_
ma,8,ma,mina,PRON,P,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])",10,nsubj,_,_
ei,9,ei,ei,AUX,V,"OrderedDict([('Polarity', 'Neg')])",10,aux,_,_
tea,10,tea,teadma,VERB,V,"OrderedDict([('Connegative', 'Yes'), ('Mood', 'Ind'), ('Tense', 'Pres'), ('VerbF ..., type: <class 'collections.OrderedDict'>, length: 5",5,ccomp,_,_


## StanzaSyntaxTagger

To use StanzaSyntaxTagger, [Stanza](https://github.com/stanfordnlp/stanza) 1.1 must be installed.
Stanza has been trained on the Estonian Dependency Treebank.
In order to analyse the text, suitable models have to be downloaded first from https://entu.keeleressursid.ee/public-document/entity-9791. The folders *pretrain* and *depparse* must be extracted to the subdirectory *stanza_resources/et* under the root directory defining StanzaSyntaxTagger.


EstNLTK's StanzaSyntaxTagger can be used in three ways: based on the `sentences`, `morph_analysis` or `morph_extended` layer. The argument `input_type` has to be 'sentences', 'morph_analysis' or 'morph_extended' accordingly, and in the latter two cases the correct morphology layer name has to be passed to the `input_morph_layer` argument (default: 'morph_analysis').

StanzaSyntaxTagger produces `stanza_syntax` layer. 

In [19]:
from estnltk.taggers.syntax.stanza_tagger.stanza_tagger import StanzaSyntaxTagger

text = Text("Tema tahtis saada rikkaks ja kuulsaks ja elada kodanlikku elu .")
text.analyse('all')

stanza_tagger = StanzaSyntaxTagger(input_type='morph_analysis')

stanza_tagger.tag( text )

text.stanza_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
stanza_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc",morph_analysis,,False,11

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Tema,1,tema,P,P,"OrderedDict([('sg', 'sg'), ('n', 'n')])",2,nsubj,_,_
tahtis,2,tahtma,V,V,"OrderedDict([('s', 's')])",0,root,_,_
saada,3,saama,V,V,"OrderedDict([('da', 'da')])",2,xcomp,_,_
rikkaks,4,rikas,A,A,"OrderedDict([('sg', 'sg'), ('tr', 'tr')])",3,xcomp,_,_
ja,5,ja,J,J,OrderedDict(),6,cc,_,_
kuulsaks,6,kuulus,A,A,"OrderedDict([('sg', 'sg'), ('tr', 'tr')])",4,conj,_,_
ja,7,ja,J,J,OrderedDict(),8,cc,_,_
elada,8,elama,V,V,"OrderedDict([('da', 'da')])",3,conj,_,_
kodanlikku,9,kodanlik,A,A,"OrderedDict([('sg', 'sg'), ('p', 'p')])",10,amod,_,_
elu,10,elu,S,S,"OrderedDict([('sg', 'sg'), ('p', 'p')])",8,obj,_,_


When tagging is based on the `sentences` layer, the values of `upostag` and `feats` differ from those obtained with the morphological layers. Models for POS-tagging, lemmatization and dependency parsing are taken from [Stanza's available models](https://stanfordnlp.github.io/stanza/available_models.html) trained on UD v2.5.

In [20]:
stanza_tagger = StanzaSyntaxTagger(input_type='sentences', output_layer='stanza_syntax_sent', depparse_path='')

stanza_tagger.tag( text )

text.stanza_syntax_sent

layer name,attributes,parent,enveloping,ambiguous,span count
stanza_syntax_sent,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc",words,,False,11

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Tema,1,tema,PRON,P,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '3'), ('PronType', 'Prs')])",2,nsubj,_,_
tahtis,2,tahtma,VERB,V,"OrderedDict([('Mood', 'Ind'), ('Number', 'Sing'), ('Person', '3'), ('Tense', 'Pa ..., type: <class 'collections.OrderedDict'>, length: 6",0,root,_,_
saada,3,saama,VERB,V,"OrderedDict([('VerbForm', 'Inf')])",2,xcomp,_,_
rikkaks,4,rikas,ADJ,A,"OrderedDict([('Case', 'Tra'), ('Degree', 'Pos'), ('Number', 'Sing')])",3,xcomp,_,_
ja,5,ja,CCONJ,J,OrderedDict(),6,cc,_,_
kuulsaks,6,kuulus,ADJ,A,"OrderedDict([('Case', 'Tra'), ('Degree', 'Pos'), ('Number', 'Sing')])",4,conj,_,_
ja,7,ja,CCONJ,J,OrderedDict(),8,cc,_,_
elada,8,elama,VERB,V,"OrderedDict([('VerbForm', 'Inf')])",3,conj,_,_
kodanlikku,9,kodanlik,ADJ,A,"OrderedDict([('Case', 'Par'), ('Degree', 'Pos'), ('Number', 'Sing')])",10,amod,_,_
elu,10,elu,NOUN,S,"OrderedDict([('Case', 'Par'), ('Number', 'Sing')])",8,obj,_,_


Extra attributes for parent and children can be added with `add_parent_and_children` flag. Flags `mark_syntax_error` and `mark_agreement_error` can be used for pointing out possible errors in syntactic relations (see more under UDValidationRetagger and DeprelAgreementRetagger).  

## Working with ConLL-X files

EstNLTK enables us to import data from ConLL-X files as Text objects with one or more syntactic analysis layers. To see how this works, let's use the file `example.conll` that we wrote in the MaltParser section above:

In [21]:
from estnltk.converters.conll_importer import conll_to_text

# Reading data from the file 'example.conll'
text = conll_to_text(file='example.conll', syntax_layer='malt_1')

In [22]:
text

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,True,7
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


In [23]:
text.malt_1

layer name,attributes,parent,enveloping,ambiguous,span count
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ta,1,tema,P,Ppers,"{'ps3': '', 'sg': '', 'nom': ''}",2,@SUBJ,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>",()
on,2,ole,V,V,"{'indic': '', 'pres': '', 'ps3': '', 'sg': ''}",0,ROOT,,,,"(""Span('Ta', [{'id': 1, 'lemma': 'tema', 'upostag': 'P', 'xpostag': 'Ppers', 'fe ..., type: <class 'tuple'>, length: 3"
ise,3,ise,P,P,"{'pos': '', 'det': '', 'refl': '', 'sg': '', 'nom': ''}",4,@NN>,,,"Span('tee', [{'id': 4, 'lemma': 'tee', 'upostag': 'S', 'xpostag': 'S', 'feats': ..., type: <class 'estnltk.layer.span.Span'>",()
tee,4,tee,S,S,"{'sg': '', 'nom': ''}",2,@PRD,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>","(""Span('ise', [{'id': 3, 'lemma': 'ise', 'upostag': 'P', 'xpostag': 'P', 'feats' ..., type: <class 'tuple'>, length: 1"
esimesel,5,esimene,N,A,"{'ord': '', 'sg': '', 'ad': '', 'l': ''}",6,@AN>,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk.layer.span.Span'>",()
poolel,6,pool,S,S,"{'sg': '', 'ad': ''}",2,@ADVL,,,"Span('on', [{'id': 2, 'lemma': 'ole', 'upostag': 'V', 'xpostag': 'V', 'feats': { ..., type: <class 'estnltk.layer.span.Span'>","(""Span('esimesel', [{'id': 5, 'lemma': 'esimene', 'upostag': 'N', 'xpostag': 'A' ..., type: <class 'tuple'>, length: 2"
.,7,.,Z,Z,{'Fst': ''},6,@Punc,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk.layer.span.Span'>",()


We can also add more syntax layers from CONLL files. This is useful if we have parsed our data with different parsing models and/or have gold standard and want to compare the annotations.

In [24]:
# Let's create another file, although it will have the same content as the first one
with open("example2.conll", "w") as fout:
    fout.write('\n'.join( initial_output))

In [25]:
from estnltk.converters.conll_importer import add_layer_from_conll

# Adding the analysis in the second file to the Text object that we already created
add_layer_from_conll(file='example2.conll', text=text, syntax_layer='malt_2')

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,True,7
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7
malt_2,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


We can also iterate over sentences and words and compare their attributes. 

In [26]:
malt_1 = text.malt_1
malt_2 = text.malt_2

for sentence in text.sentences:
    for word in sentence:
        if malt_1.get(word).deprel == malt_2.get(word).deprel:
            print(malt_1.get(word).text, malt_2.get(word).text)

Ta Ta
on on
ise ise
tee tee
esimesel esimesel
poolel poolel
. .




## Labeled Attachment Score

Now that we have two syntactic analysis layers on our Text object, we probably want to compare them computationally as well. Labeled Attachment Score (LAS) is a standard evaluation metric in dependency syntax. It is the proportion of words that are assigned both the correct syntactic head and the correct dependency label, and it ranges from 0 to 1.

In [27]:
from estnltk.syntax.scoring import las_score

To see how the score reacts to a difference between the layers, let's change the second layer a little bit:

In [28]:
text.malt_2[3].deprel = '@X'

text.malt_1[3].deprel != text.malt_2[3].deprel

True

Now we can calculate LAS between the two syntax layers:

In [29]:
las_score(layer_a=text.malt_1,
          layer_b=text.malt_2
          )

0.8571428571428571

We can also just calculate the score for some parts of our Text object, e.g. to compare the first 4 spans or to compare the spans starting from the 5th:

In [30]:
las_score(text.malt_1, text.malt_2, 0, 4)

0.75

In [31]:
las_score(text.malt_1, text.malt_2, 4)

1.0
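The three scores above can be reproduced by hand. The sketch below is not the implementation of `las_score`. It only applies the definition of LAS to `(head, deprel)` pairs copied from the two layers, and it assumes that the optional start and end arguments behave like Python slice bounds, as the two calls above suggest. The names `las`, `layer_a_pairs` and `layer_b_pairs` are made up for this illustration.

```python
def las(pairs_a, pairs_b, start=0, end=None):
    """Share of positions where both head and deprel agree."""
    pairs_a = pairs_a[start:end]
    pairs_b = pairs_b[start:end]
    if len(pairs_a) != len(pairs_b):
        raise ValueError('layers must have the same number of words')
    if not pairs_a:
        raise ValueError('nothing to compare')
    matches = sum(1 for a, b in zip(pairs_a, pairs_b) if a == b)
    return matches / len(pairs_a)


# (head, deprel) pairs copied from the malt_1 layer.
layer_a_pairs = [(2, '@SUBJ'), (0, 'ROOT'), (4, '@NN>'), (2, '@PRD'),
                 (6, '@AN>'), (2, '@ADVL'), (6, '@Punc')]
# The second layer differs only in the label of the 4th word ('tee').
layer_b_pairs = list(layer_a_pairs)
layer_b_pairs[3] = (2, '@X')

print(las(layer_a_pairs, layer_b_pairs))        # 6 of 7 words agree
print(las(layer_a_pairs, layer_b_pairs, 0, 4))  # 3 of the first 4 agree
print(las(layer_a_pairs, layer_b_pairs, 4))     # all remaining words agree
```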

In addition, we can tag sliding LAS scores to be able to compare different text segments.

In [32]:
from estnltk.taggers import SyntaxLasTagger

tagger = SyntaxLasTagger('malt_1', 'malt_2', window=3)
tagger

name,output layer,output attributes,input layers
SyntaxLasTagger,las,"('deprel_sequence', 'score')","('malt_1', 'malt_2')"

0,1
window,3


In [33]:
# NBVAL_IGNORE_OUTPUT
tagger.tag(text)
text.las

0,1
aggregate_deprel_sequences,"{('@SUBJ',): [1.0], ('@SUBJ', 'ROOT'): [1.0], ('@SUBJ', 'ROOT', '@NN>'): [1.0], ('ROOT', '@NN>', '@PRD'): [0.6666666666666666], ('@NN>', '@PRD', '@AN>'): [0.6666666666666666], ('@PRD', '@AN>', '@ADVL'): [0.6666666666666666], ('@AN>', '@ADVL', '@Punc'): [1.0], ('@ADVL', '@Punc'): [1.0], ('@Punc',): [1.0]}"

layer name,attributes,parent,enveloping,ambiguous,span count
las,"deprel_sequence, score",,malt_1,False,9

text,deprel_sequence,score
['Ta'],"('@SUBJ',)",1.0
"['Ta', 'on']","('@SUBJ', 'ROOT')",1.0
"['Ta', 'on', 'ise']","('@SUBJ', 'ROOT', '@NN>')",1.0
"['on', 'ise', 'tee']","('ROOT', '@NN>', '@PRD')",0.6666666666666666
"['ise', 'tee', 'esimesel']","('@NN>', '@PRD', '@AN>')",0.6666666666666666
"['tee', 'esimesel', 'poolel']","('@PRD', '@AN>', '@ADVL')",0.6666666666666666
"['esimesel', 'poolel', '.']","('@AN>', '@ADVL', '@Punc')",1.0
"['poolel', '.']","('@ADVL', '@Punc')",1.0
['.'],"('@Punc',)",1.0


`SyntaxLasTagger` adds metadata to the output layer. The `aggregate_deprel_sequences` field of the layer's `meta` lists all LAS scores for every `deprel` sequence encountered.

In [34]:
text.las.meta

{'aggregate_deprel_sequences': {('@SUBJ',): [1.0],
  ('@SUBJ', 'ROOT'): [1.0],
  ('@SUBJ', 'ROOT', '@NN>'): [1.0],
  ('ROOT', '@NN>', '@PRD'): [0.6666666666666666],
  ('@NN>', '@PRD', '@AN>'): [0.6666666666666666],
  ('@PRD', '@AN>', '@ADVL'): [0.6666666666666666],
  ('@AN>', '@ADVL', '@Punc'): [1.0],
  ('@ADVL', '@Punc'): [1.0],
  ('@Punc',): [1.0]}}
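The window pattern visible in the `las` layer (nine windows for seven words, with shorter windows at both edges) can be mimicked in a few lines. This is a sketch of that pattern only, inferred from the output above, and not the code of `SyntaxLasTagger`. The function name `sliding_scores` and the 0/1 encoding of agreement are assumptions made for this example.

```python
def sliding_scores(matches, window):
    """Score every window position, including the shorter edge windows."""
    if not matches or window < 1:
        raise ValueError('need at least one word and a positive window size')
    n = len(matches)
    scores = []
    for end in range(1, n + window):
        start = max(0, end - window)
        segment = matches[start:min(end, n)]
        scores.append(sum(segment) / len(segment))
    return scores


# 1 where head and deprel agree between malt_1 and malt_2, 0 where they
# differ; only the 4th word ('tee') was changed.
matches = [1, 1, 1, 0, 1, 1, 1]
for score in sliding_scores(matches, window=3):
    print(score)
```

Only the three windows that contain 'tee' score below 1.0, which matches the `score` column of the layer.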