# Dependency syntactic analysis

EstNLTK provides wrappers for two syntactic analysers: [MaltParser](http://www.maltparser.org/) and [VISLCG3 based syntactic analyser of Estonian](https://github.com/EstSyntax/EstCG).

MaltParser-based syntactic analysis is distributed with EstNLTK and can be used by default. VISLCG3-based syntactic analysis requires VISLCG3 to be installed on the system first.

Both analysers use a common syntactic tagset, which is described in the [tagset documentation](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf).

## `VislTagger`: VISLCG3 syntactic analysis

VISL CG3 is a rule-based syntactic parser that has thousands of Estonian-specific handcrafted rules for tagging syntactic functions and dependencies. However, the parser needs more information than the standard morphological analyser provides (e.g. pronoun types and verb subcategorization), and it expects its input in a different format.

Therefore, to use VislTagger, we first need to add the `morph_extended` layer to our Text object. This layer contains a more detailed morphological analysis than the standard `morph_analysis` layer.


In [1]:
from estnltk import Text

In [2]:
text = Text('Ta on ise tee esimesel poolel.')
text.tag_layer(['morph_extended'])

text
Ta on ise tee esimesel poolel.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,7
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,7
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,7


If we compare the standard `morph_analysis` layer with `morph_extended`, we can see that `morph_extended` has more refined labels under the `form` attribute, as well as additional features such as pronoun type, punctuation type and letter case. These extra features are needed because the VislCG3 grammar rules use them.

In addition, because it is more detailed, the `morph_extended` layer is more ambiguous than the standard `morph_analysis` layer. For example, in our sentence the word 'on' gets 6 different analyses. The first step of VislTagger is morphological disambiguation, so the extra analyses are removed and do not propagate to the syntactic analysis layer.

In [3]:
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Ta,tema,tema,"('tema',)",0,,sg n,P
on,olema,ole,"('ole',)",0,,b,V
,olema,ole,"('ole',)",0,,vad,V
ise,ise,ise,"('ise',)",0,,pl n,P
,ise,ise,"('ise',)",0,,sg n,P
tee,tee,tee,"('tee',)",0,,sg n,S
esimesel,esimene,esimene,"('esimene',)",l,,sg ad,O
poolel,pool,pool,"('pool',)",l,,sg ad,S
.,.,.,"('.',)",,,,Z


In [4]:
text.morph_extended

layer name,attributes,parent,enveloping,ambiguous,span count
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech,punctuation_type,pronoun_type,letter_case,fin,verb_extension_suffix,subcat
Ta,tema,tema,"('tema',)",0,,sg nom,P,,"('ps3',)",cap,,(),
on,olema,ole,"('ole',)",0,,mod indic pres ps3 sg ps af,V,,,,True,(),"('Intr',)"
,olema,ole,"('ole',)",0,,aux indic pres ps3 sg ps af,V,,,,True,(),"('Intr',)"
,olema,ole,"('ole',)",0,,main indic pres ps3 sg ps af,V,,,,True,(),"('Intr',)"
,olema,ole,"('ole',)",0,,mod indic pres ps3 pl ps af,V,,,,True,(),"('Intr',)"
,olema,ole,"('ole',)",0,,aux indic pres ps3 pl ps af,V,,,,True,(),"('Intr',)"
,olema,ole,"('ole',)",0,,main indic pres ps3 pl ps af,V,,,,True,(),"('Intr',)"
ise,ise,ise,"('ise',)",0,,pl nom,P,,"('pos', 'det', 'refl')",,,(),
,ise,ise,"('ise',)",0,,sg nom,P,,"('pos', 'det', 'refl')",,,(),
tee,tee,tee,"('tee',)",0,,com sg nom,S,,,,,(),
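
The six analyses of 'on' in the table above are every combination of three verb readings (`mod`, `aux`, `main`) with two numbers (`sg`, `pl`). The following plain-Python sketch does not use EstNLTK; it hard-codes the values from the `form` column to make the source of the ambiguity explicit:

```python
from itertools import product

# Values copied from the 'form' column of the morph_extended output for 'on'.
verb_readings = ['mod', 'aux', 'main']
numbers = ['sg', 'pl']

# Each verb reading is combined with each number; the rest of the form is identical.
forms = [f'{reading} indic pres ps3 {number} ps af'
         for number, reading in product(numbers, verb_readings)]

for form in forms:
    print(form)
print(len(forms), 'analyses')
```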


To use VISLCG3-based syntactic analysis, the VISLCG3 parser must be installed on the system. Information about the parser is shared in the [Constraint Grammar Google Group](https://groups.google.com/forum/#!forum/constraint-grammar), which is also the place to find the most compact guide to getting and installing the [latest version of the parser](https://groups.google.com/forum/#!msg/constraint-grammar/hXsbzyyhIVI/nHXRnOomf9wJ).
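
EstNLTK locates the parser through the PATH environment variable, so after installation it can be useful to confirm that the executable is actually visible there. The following is a minimal sketch using only the Python standard library; it does not call EstNLTK and assumes the executable is named `vislcg3`:

```python
import shutil

# shutil.which searches the directories listed in PATH, much like a shell would.
# On Windows it also tries the extensions from PATHEXT, so 'vislcg3' can match 'vislcg3.exe'.
vislcg3_path = shutil.which('vislcg3')

if vislcg3_path is None:
    print('vislcg3 was not found on PATH')
else:
    print('vislcg3 found at', vislcg3_path)
```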

By default, EstNLTK expects the directory containing the VISLCG3 executable (`vislcg3` on UNIX, `vislcg3.exe` on Windows) to be listed in the system's PATH environment variable. If this requirement is satisfied, EstNLTK should be able to execute the parser, and we can parse our example sentence as follows:

In [5]:
from estnltk.taggers import VislTagger

visl_tagger = VislTagger()
visl_tagger.tag(text)

text.visl

layer name,attributes,parent,enveloping,ambiguous,span count
visl,"id, lemma, ending, partofspeech, subtype, mood, tense, voice, person, inf_form, number, case, polarity, number_format, capitalized, finiteness, subcat, clause_boundary, deprel, head",words,,True,7

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head
Ta,1,tema,0,P,pers,_,_,_,ps3,_,sg,nom,_,_,cap,_,_,_,@SUBJ,2
on,2,ole,0,V,main,indic,pres,ps,ps3,_,sg,_,af,_,_,_,_,_,@FMV,0
ise,3,ise,0,P,"['pos', 'det', 'refl']",_,_,_,_,_,sg,nom,_,_,_,_,_,_,@ADVL,2
tee,4,tee,0,S,com,_,_,_,_,_,sg,nom,_,_,_,_,_,_,@PRD,2
esimesel,5,esimene,l,N,ord,_,_,_,_,_,sg,ad,_,l,_,_,_,_,@AN>,6
poolel,6,pool,l,S,com,_,_,_,_,_,sg,ad,_,_,_,_,_,_,"['@<NN', '@ADVL']",2
.,7,.,_,Z,Fst,_,_,_,_,_,_,_,_,_,_,_,_,CLB,_,7


The parser assigns each word a syntactic label (`deprel`; e.g. '@SUBJ' stands for subject, see the [documentation](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf) for details) and a syntactic head (`head`), which is the id of its governing word in the sentence. NB! As can be seen from the example, word ids start from 1, not 0. A head value of 0 marks the current word as the root node of the tree.
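
To make the `id`/`head` convention concrete, here is a small plain-Python sketch that resolves each `head` value to its governing word. It does not use the EstNLTK API; the rows are copied by hand from the `visl` layer above:

```python
# (id, word, head) copied from the visl layer above.
rows = [
    (1, 'Ta', 2),
    (2, 'on', 0),
    (3, 'ise', 2),
    (4, 'tee', 2),
    (5, 'esimesel', 6),
    (6, 'poolel', 2),
    (7, '.', 7),
]

# Ids start from 1, so a lookup table is safer than indexing the list with `head`.
word_by_id = {word_id: word for word_id, word, _ in rows}

# head == 0 does not point to a word; it marks the root of the tree.
roots = [word for _, word, head in rows if head == 0]

# Every other word is mapped to the word that governs it.
governors = {word: word_by_id[head] for _, word, head in rows if head != 0}

print('root:', roots)
for word, governor in governors.items():
    print(f'{word} <- {governor}')
```

Note that in this output the final punctuation mark has its own id as `head`, so it resolves to itself.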

VislCG3 is based on the [Constraint Grammar](http://visl.sdu.dk/constraint_grammar.html) formalism: it first adds all possible syntactic labels and then removes the ones that the constraints rule out. As a result, some syntactic labels can remain ambiguous, as seen with the word 'poolel' in our example sentence, which gets both a complement (@<NN) and an adverbial (@ADVL) analysis. Despite this, each word still has only one syntactic head.
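
When post-processing the output, it can help to list the words whose label is still ambiguous. This is a plain-Python sketch over values copied from the `visl` layer above, not an EstNLTK call:

```python
# Word -> deprel, copied from the visl layer above.
# A list means that more than one label was left in place.
deprels = {
    'Ta': '@SUBJ',
    'on': '@FMV',
    'ise': '@ADVL',
    'tee': '@PRD',
    'esimesel': '@AN>',
    'poolel': ['@<NN', '@ADVL'],
    '.': '_',
}

ambiguous = {word: labels for word, labels in deprels.items()
             if isinstance(labels, list)}

print(ambiguous)
```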

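## `SyntaxDependencyRetagger`

As the output below shows, `SyntaxDependencyRetagger` adds two attributes to an existing syntax layer: `parent_span`, which holds the span of the word's head, and `children`, which holds the spans of the words it governs.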

In [6]:
from estnltk.taggers import SyntaxDependencyRetagger

SyntaxDependencyRetagger('visl').retag(text, check_output_consistency=False)

text.visl

layer name,attributes,parent,enveloping,ambiguous,span count
visl,"id, lemma, ending, partofspeech, subtype, mood, tense, voice, person, inf_form, number, case, polarity, number_format, capitalized, finiteness, subcat, clause_boundary, deprel, head, parent_span, children",words,,True,7

text,id,lemma,ending,partofspeech,subtype,mood,tense,voice,person,inf_form,number,case,polarity,number_format,capitalized,finiteness,subcat,clause_boundary,deprel,head,parent_span,children
Ta,1,tema,0,P,pers,_,_,_,ps3,_,sg,nom,_,_,cap,_,_,_,@SUBJ,2,"AS(start=3, end=5, text:'on')",()
on,2,ole,0,V,main,indic,pres,ps,ps3,_,sg,_,af,_,_,_,_,_,@FMV,0,,"(""AS(start=0, end=2, text:'Ta')"", ""AS(start=6, end=9, text:'ise')"", ""AS(start=10 ..., type: <class 'tuple'>, length: 4"
ise,3,ise,0,P,"['pos', 'det', 'refl']",_,_,_,_,_,sg,nom,_,_,_,_,_,_,@ADVL,2,"AS(start=3, end=5, text:'on')",()
tee,4,tee,0,S,com,_,_,_,_,_,sg,nom,_,_,_,_,_,_,@PRD,2,"AS(start=3, end=5, text:'on')",()
esimesel,5,esimene,l,N,ord,_,_,_,_,_,sg,ad,_,l,_,_,_,_,@AN>,6,"AS(start=23, end=29, text:'poolel')",()
poolel,6,pool,l,S,com,_,_,_,_,_,sg,ad,_,_,_,_,_,_,"['@<NN', '@ADVL']",2,"AS(start=3, end=5, text:'on')","(""AS(start=14, end=22, text:'esimesel')"",)"
.,7,.,_,Z,Fst,_,_,_,_,_,_,_,_,_,_,_,_,CLB,_,7,"AS(start=29, end=30, text:'.')","(""AS(start=29, end=30, text:'.')"",)"
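
The `children` column can be derived from `id` and `head` alone. The following plain-Python sketch is not the retagger's implementation; it reproduces the `children` column above from values copied from the table:

```python
from collections import defaultdict

# (id, word, head) copied from the visl layer above.
rows = [
    (1, 'Ta', 2),
    (2, 'on', 0),
    (3, 'ise', 2),
    (4, 'tee', 2),
    (5, 'esimesel', 6),
    (6, 'poolel', 2),
    (7, '.', 7),
]

word_by_id = {word_id: word for word_id, word, _ in rows}

# Invert the head relation: every word is appended to the child list of its head.
children = defaultdict(list)
for word_id, _, head in rows:
    if head != 0:  # head 0 marks the root, which has no parent
        children[head].append(word_id)

for word_id, word, _ in rows:
    child_words = [word_by_id[child_id] for child_id in children[word_id]]
    print(word, child_words)
```

Because the final punctuation mark has its own id as `head`, it ends up as its own child, just as in the table.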


## MaltParser

MaltParser is a data-driven parser that has been trained on the [Estonian Dependency Treebank](https://github.com/EstSyntax/EDT). As its optimal input format for morphological analysis differs from the VislCG3 format, we first need to use `ConllMorphTagger` to add a `conll_morph` layer to our Text object. This layer is based on the `morph_extended` layer.

In [7]:
from estnltk.taggers import ConllMorphTagger

tagger = ConllMorphTagger(output_layer='conll_morph',  # default: 'conll_morph'
                          morph_extended_layer='morph_extended'  # default: 'morph_extended'
                          )
tagger

name,output layer,output attributes,input layers
ConllMorphTagger,conll_morph,"('id', 'form', 'lemma', 'upostag', 'xpostag', 'feats', 'head', 'deprel', 'deps', 'misc')","('morph_extended',)"


In [8]:
tagger.tag(text)

text.conll_morph

layer name,attributes,parent,enveloping,ambiguous,span count
conll_morph,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",words,,True,7

text,id,form,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Ta,1,Ta,tema,P,Ppers,ps3|sg|nom,_,_,_,_
on,2,on,ole,V,V,indic|pres|ps3|sg,_,_,_,_
ise,3,ise,ise,P,P,pos|det|refl|sg|nom,_,_,_,_
tee,4,tee,tee,S,S,sg|nom,_,_,_,_
esimesel,5,esimesel,esimene,N,A,ord|sg|ad|l,_,_,_,_
poolel,6,poolel,pool,S,S,sg|ad,_,_,_,_
.,7,.,.,Z,Z,Fst,_,_,_,_


This gives us the morphological analysis in the standard CONLL format, which can also be converted into a string as follows.

In [9]:
from estnltk.taggers.syntax.conll_morph_to_str import conll_to_str

print(conll_to_str(text))

1	Ta	tema	P	Ppers	ps3|sg|nom	_	_	_	_	
2	on	ole	V	V	indic|pres|ps3|sg	_	_	_	_	
3	ise	ise	P	P	pos|det|refl|sg|nom	_	_	_	_	
4	tee	tee	S	S	sg|nom	_	_	_	_	
5	esimesel	esimene	N	A	ord|sg|ad|l	_	_	_	_	
6	poolel	pool	S	S	sg|ad	_	_	_	_	
7	.	.	Z	Z	Fst	_	_	_	_	




This is useful when we want to use MaltParser without the EstNLTK interface. However, we can also run MaltParser inside EstNLTK on our Text object and get the syntactic analysis in the standard CONLL format:

In [10]:
from estnltk.taggers.syntax.maltparser import MaltParser

parser = MaltParser()
initial_output = parser.parse_text(text, return_type='text')

In [11]:
print('\n'.join(initial_output))

1	Ta	tema	P	Ppers	ps3|sg|nom	2	@SUBJ	_	_
2	on	ole	V	V	indic|pres|ps3|sg	0	ROOT	_	_
3	ise	ise	P	P	pos|det|refl|sg|nom	4	@NN>	_	_
4	tee	tee	S	S	sg|nom	2	@PRD	_	_
5	esimesel	esimene	N	A	ord|sg|ad|l	6	@AN>	_	_
6	poolel	pool	S	S	sg|ad	2	@ADVL	_	_
7	.	.	Z	Z	Fst	6	@Punc	_	_
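
Each output line holds ten tab-separated fields, so it is easy to process outside EstNLTK as well. The following is a minimal plain-Python sketch: the field names are taken from the `ConllMorphTagger` output attributes shown earlier, and `parse_conll_line` is a hypothetical helper, not an EstNLTK function:

```python
# Lines copied from the MaltParser output above; fields are separated by tabs.
conll_lines = [
    '1\tTa\ttema\tP\tPpers\tps3|sg|nom\t2\t@SUBJ\t_\t_',
    '2\ton\tole\tV\tV\tindic|pres|ps3|sg\t0\tROOT\t_\t_',
    '3\tise\tise\tP\tP\tpos|det|refl|sg|nom\t4\t@NN>\t_\t_',
    '4\ttee\ttee\tS\tS\tsg|nom\t2\t@PRD\t_\t_',
    '5\tesimesel\tesimene\tN\tA\tord|sg|ad|l\t6\t@AN>\t_\t_',
    '6\tpoolel\tpool\tS\tS\tsg|ad\t2\t@ADVL\t_\t_',
    '7\t.\t.\tZ\tZ\tFst\t6\t@Punc\t_\t_',
]

FIELDS = ('id', 'form', 'lemma', 'upostag', 'xpostag',
          'feats', 'head', 'deprel', 'deps', 'misc')


def parse_conll_line(line):
    """Turn one CONLL line into a dict with integer id/head and a feature list."""
    token = dict(zip(FIELDS, line.rstrip('\n').split('\t')))
    token['id'] = int(token['id'])
    token['head'] = int(token['head'])
    # '_' marks an empty field; otherwise features are separated by '|'.
    token['feats'] = [] if token['feats'] == '_' else token['feats'].split('|')
    return token


tokens = [parse_conll_line(line) for line in conll_lines]

for token in tokens:
    print(token['id'], token['form'], token['head'], token['deprel'])
```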



If we want the analysis to be available as a layer of a Text object, we currently need to write it to a file first and then import that file, which creates a new Text object.

In [12]:
with open("example.conll", "w") as fout:
    fout.write('\n'.join( initial_output))

In [13]:
from estnltk.converters.conll_importer import conll_to_text

text = conll_to_text(file='example.conll', syntax_layer='malt_1')

In [14]:
text

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,False,7
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


In [15]:
text.malt_1

layer name,attributes,parent,enveloping,ambiguous,span count
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ta,1,tema,P,Ppers,ps3|sg|nom,2,@SUBJ,,,Span(on),()
on,2,ole,V,V,indic|pres|ps3|sg,0,ROOT,,,,"('Span(Ta)', 'Span(tee)', 'Span(poolel)')"
ise,3,ise,P,P,pos|det|refl|sg|nom,4,@NN>,,,Span(tee),()
tee,4,tee,S,S,sg|nom,2,@PRD,,,Span(on),"('Span(ise)',)"
esimesel,5,esimene,N,A,ord|sg|ad|l,6,@AN>,,,Span(poolel),()
poolel,6,pool,S,S,sg|ad,2,@ADVL,,,Span(on),"('Span(esimesel)', 'Span(.)')"
.,7,.,Z,Z,Fst,6,@Punc,,,Span(poolel),()


We can also add more syntax layers from CONLL files. This is useful if we have parsed our data with different parsing models and/or have a gold standard annotation and want to compare the annotations.

In [16]:
# Let's create another file, although it will have the same content as the first one
with open("example2.conll", "w") as fout:
    fout.write('\n'.join( initial_output))

In [17]:
from estnltk.converters.conll_importer import add_layer_from_conll

# Adding the analysis in the second file to the Text object that we already created
add_layer_from_conll(file='example2.conll', text=text, syntax_layer='malt_2')

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,False,7
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7
malt_2,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7




## Labeled Attachment Score

Now that we have two syntactic analysis layers on our Text object, we probably want to compare them computationally as well. Labeled Attachment Score (LAS) is a standard evaluation metric in dependency parsing. It is the proportion of words that are assigned both the correct syntactic head and the correct dependency label, and it varies between 0 and 1.
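
Under that definition, LAS can be computed in a few lines. The sketch below is a plain-Python illustration, not EstNLTK's own scoring code. The function name `labeled_attachment_score` and the second analysis are made up for the example, while the first analysis is copied from the MaltParser output above:

```python
def labeled_attachment_score(analysis_a, analysis_b):
    """Share of tokens whose (head, deprel) pair is identical in both analyses."""
    if len(analysis_a) != len(analysis_b):
        raise ValueError('both analyses must cover the same tokens')
    if not analysis_a:
        raise ValueError('cannot score an empty analysis')
    matches = sum(1 for a, b in zip(analysis_a, analysis_b) if a == b)
    return matches / len(analysis_a)


# (head, deprel) per token, copied from the MaltParser output above.
analysis_a = [(2, '@SUBJ'), (0, 'ROOT'), (4, '@NN>'), (2, '@PRD'),
              (6, '@AN>'), (2, '@ADVL'), (6, '@Punc')]

# Hypothetical second analysis: the third token keeps its label but gets a
# different head, which already counts as a mismatch.
analysis_b = list(analysis_a)
analysis_b[2] = (2, '@NN>')

score = labeled_attachment_score(analysis_a, analysis_b)
print(score)  # 6 of the 7 tokens agree
```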

In [18]:
from estnltk.syntax.scoring import las_score

To make the score more informative, let's change one label in the second layer:

In [19]:
text.malt_2[3].deprel = '@X'

text.malt_1[3].deprel != text.malt_2[3].deprel

True

Now we can calculate LAS between the two syntax layers:

In [20]:
las_score(layer_a=text.malt_1,
          layer_b=text.malt_2
          )

0.8571428571428571

We can also calculate the score for only a part of our Text object, e.g. for the first 4 spans or for the spans starting from the 5th:

In [21]:
las_score(text.malt_1, text.malt_2, 0, 4)

0.75

In [22]:
las_score(text.malt_1, text.malt_2, 4)

1.0

In addition, we can compute LAS over a sliding window and store the scores in a layer, which makes it possible to compare different text segments.

In [23]:
from estnltk.taggers.syntax.syntax_las_tagger import SyntaxLasTagger

tagger = SyntaxLasTagger('malt_1', 'malt_2', window=3)
tagger

name,output layer,output attributes,input layers
SyntaxLasTagger,las,"('deprel_sequence', 'score')","('malt_1', 'malt_2')"

0,1
window,3


In [24]:
tagger.tag(text)
text.las

layer name,attributes,parent,enveloping,ambiguous,span count
las,"deprel_sequence, score",,malt_1,False,9

text,deprel_sequence,score
['Ta'],"('@SUBJ',)",1.0
"['Ta', 'on']","('@SUBJ', 'ROOT')",1.0
"['Ta', 'on', 'ise']","('@SUBJ', 'ROOT', '@NN>')",1.0
"['on', 'ise', 'tee']","('ROOT', '@NN>', '@PRD')",0.6666666666666666
"['ise', 'tee', 'esimesel']","('@NN>', '@PRD', '@AN>')",0.6666666666666666
"['tee', 'esimesel', 'poolel']","('@PRD', '@AN>', '@ADVL')",0.6666666666666666
"['esimesel', 'poolel', '.']","('@AN>', '@ADVL', '@Punc')",1.0
"['poolel', '.']","('@ADVL', '@Punc')",1.0
['.'],"('@Punc',)",1.0
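
Judging from the output above, with `window=3` every window that overlaps the sentence gets a score, including the shorter ones at both ends. The sketch below is a plain-Python reconstruction of that behaviour, not the code of `SyntaxLasTagger`. `sliding_windows` is a hypothetical helper, and the agreement flags follow from the single label changed earlier:

```python
def sliding_windows(items, window):
    """Yield windows of up to `window` items, shorter at both ends of the sequence."""
    n = len(items)
    for end in range(1, n + window):
        start = max(0, end - window)
        yield items[start:min(end, n)]


words = ['Ta', 'on', 'ise', 'tee', 'esimesel', 'poolel', '.']
# True where malt_1 and malt_2 agree on both head and deprel; only 'tee' differs.
agrees = [True, True, True, False, True, True, True]

word_windows = list(sliding_windows(words, 3))
# The score of a window is the share of its tokens on which the layers agree.
scores = [sum(flags) / len(flags) for flags in sliding_windows(agrees, 3)]

for word_window, score in zip(word_windows, scores):
    print(word_window, score)
```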


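The metadata of the `las` layer also holds `aggregate_deprel_sequences`. Judging from the output below, it maps each dependency label sequence to the list of scores of the windows in which that sequence occurs: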

In [25]:
text.las.meta['aggregate_deprel_sequences']

{('@ADVL', '@Punc'): [1.0],
 ('@AN>', '@ADVL', '@Punc'): [1.0],
 ('@NN>', '@PRD', '@AN>'): [0.6666666666666666],
 ('@PRD', '@AN>', '@ADVL'): [0.6666666666666666],
 ('@Punc',): [1.0],
 ('@SUBJ',): [1.0],
 ('@SUBJ', 'ROOT'): [1.0],
 ('@SUBJ', 'ROOT', '@NN>'): [1.0],
 ('ROOT', '@NN>', '@PRD'): [0.6666666666666666]}
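
Judging from the output above, this mapping can be rebuilt by grouping the window scores by label sequence. The following is a minimal plain-Python sketch with values copied from the `las` layer; it is not the tagger's code, and the averaging at the end is an extra step assumed here to show one possible use:

```python
from collections import defaultdict
from statistics import mean

# (deprel_sequence, score) pairs copied from the las layer above.
spans = [
    (('@SUBJ',), 1.0),
    (('@SUBJ', 'ROOT'), 1.0),
    (('@SUBJ', 'ROOT', '@NN>'), 1.0),
    (('ROOT', '@NN>', '@PRD'), 0.6666666666666666),
    (('@NN>', '@PRD', '@AN>'), 0.6666666666666666),
    (('@PRD', '@AN>', '@ADVL'), 0.6666666666666666),
    (('@AN>', '@ADVL', '@Punc'), 1.0),
    (('@ADVL', '@Punc'), 1.0),
    (('@Punc',), 1.0),
]

# Collect all scores that belong to the same label sequence.
aggregated = defaultdict(list)
for sequence, score in spans:
    aggregated[sequence].append(score)

# With more text, a sequence can occur several times; averaging summarises it.
mean_scores = {sequence: mean(scores) for sequence, scores in aggregated.items()}

for sequence, value in sorted(mean_scores.items(), key=lambda item: item[1]):
    print(sequence, value)
```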