## <span style="color:purple">Utilities for syntactic analysis</span>

This tutorial describes EstNLTK's helpful utilities for syntactic analysis. Namely:

* `SyntaxDependencyRetagger`, which adds links between words that have syntactic relations, making it easier to navigate and query syntactic relations;
* Functions for working with CONLL files;
* Validation retaggers, which help to detect possible errors made by syntactic parsers;
* Syntax evaluation tools for calculating the Labeled Attachment Score (LAS).

## `SyntaxDependencyRetagger` 

`SyntaxDependencyRetagger` adds `parent_span` and `children` attributes to a syntax layer, which help to navigate dependency relations.

In a syntax layer, each word has the attributes `id` and `head`. `id` is the index of the word in the sentence, and `head` is the index of the word's parent in the sentence.
`SyntaxDependencyRetagger` makes this information explicit by adding to each word a link to its parent word (span) and links to all of its child spans.
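
To make the idea concrete, here is a minimal pure-Python sketch of how parent and child links can be derived from `id` and `head` values. It is not EstNLTK's implementation: the function name `link_dependencies` and the dictionary layout are assumptions made for illustration, and the `(word, id, head)` triples are copied from the example sentence used later in this tutorial.

```python
# Toy (word, id, head) triples for 'Ilus suur karvane kass nurrus punasel diivanil.'
rows = [
    ('Ilus', 1, 4), ('suur', 2, 4), ('karvane', 3, 4), ('kass', 4, 5),
    ('nurrus', 5, 0), ('punasel', 6, 7), ('diivanil', 7, 5), ('.', 8, 5),
]

def link_dependencies(rows):
    """Hypothetical helper: map each id to its word, parent word and child words."""
    by_id = {idx: word for word, idx, _ in rows}
    links = {idx: {'word': word, 'parent': None, 'children': []}
             for word, idx, _ in rows}
    for word, idx, head in rows:
        if head != 0:  # head == 0 marks the root, which has no parent
            links[idx]['parent'] = by_id[head]
            links[head]['children'].append(word)
    return links

links = link_dependencies(rows)
print(links[4])  # the entry for 'kass'
```

In this toy structure the entry for _'kass'_ points to _'nurrus'_ as its parent and lists three children, which mirrors what `parent_span` and `children` hold in the real layer.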

Note: Most of EstNLTK's syntactic analysis taggers have a flag `add_parent_and_children`, which makes the tagger run `SyntaxDependencyRetagger` on its output layer automatically.
If this flag is missing, or if you want to recalculate the `parent_span` and `children` values, you can run `SyntaxDependencyRetagger` manually.

Example:

In [1]:
# First, preprocess Text for MaltParserTagger
from estnltk import Text
from estnltk.taggers import ConllMorphTagger
from estnltk.taggers import MaltParserTagger

# create a preprocessing tagger
conll_tagger = ConllMorphTagger( output_layer='conll_morph', morph_extended_layer='morph_analysis', no_visl=True )
# create syntax tagger
maltparser_tagger = MaltParserTagger(input_type='morph_analysis', version='conllu', add_parent_and_children=False)

# create text and annotate with Maltparser
text = Text('Ilus suur karvane kass nurrus punasel diivanil.')
text.tag_layer('morph_analysis')
conll_tagger.tag( text )
maltparser_tagger.tag( text )

text
Ilus suur karvane kass nurrus punasel diivanil.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,8
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,8
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,8
conll_morph,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",morph_analysis,,True,8
maltparser_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc",,,False,8


In [2]:
# Add parent_span and children information
from estnltk.taggers import SyntaxDependencyRetagger

SyntaxDependencyRetagger('maltparser_syntax').retag(text)

text.maltparser_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
maltparser_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,8

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ilus,1,ilus,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()
suur,2,suur,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()
karvane,3,karvane,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()
kass,4,kass,S,S,"{'sg': '', 'n': ''}",5,nsubj,,,"Span('nurrus', [{'id': 5, 'lemma': 'nurruma', 'upostag': 'V', 'xpostag': 'V', 'f ..., type: <class 'estnltk_core.layer.span.Span'>","(""Span('Ilus', [{'id': 1, 'lemma': 'ilus', 'upostag': 'A', 'xpostag': 'A', 'feat ..., type: <class 'tuple'>, length: 3"
nurrus,5,nurruma,V,V,{'s': ''},0,root,,,,"(""Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'tuple'>, length: 3"
punasel,6,punane,A,A,"{'sg': '', 'ad': ''}",7,amod,,,"Span('diivanil', [{'id': 7, 'lemma': 'diivan', 'upostag': 'S', 'xpostag': 'S', ' ..., type: <class 'estnltk_core.layer.span.Span'>",()
diivanil,7,diivan,S,S,"{'sg': '', 'ad': ''}",5,obl,,,"Span('nurrus', [{'id': 5, 'lemma': 'nurruma', 'upostag': 'V', 'xpostag': 'V', 'f ..., type: <class 'estnltk_core.layer.span.Span'>","(""Span('punasel', [{'id': 6, 'lemma': 'punane', 'upostag': 'A', 'xpostag': 'A', ..., type: <class 'tuple'>, length: 1"
.,8,.,Z,Z,,5,punct,,,"Span('nurrus', [{'id': 5, 'lemma': 'nurruma', 'upostag': 'V', 'xpostag': 'V', 'f ..., type: <class 'estnltk_core.layer.span.Span'>",()


Note: If the `parent_span` or `children` attribute already exists in the syntax layer, its values are updated.  
Therefore, to update the dependencies in a syntax layer, first update the values of the `head` attributes and then run `SyntaxDependencyRetagger`.

#### Navigating in dependency relations

The span _'kass'_ has a parent span and three child spans:

In [3]:
span = text.maltparser_syntax[3]
span

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
kass,4,kass,S,S,"{'sg': '', 'n': ''}",5,nsubj,,,"Span('nurrus', [{'id': 5, 'lemma': 'nurruma', 'upostag': 'V', 'xpostag': 'V', 'f ..., type: <class 'estnltk_core.layer.span.Span'>","(""Span('Ilus', [{'id': 1, 'lemma': 'ilus', 'upostag': 'A', 'xpostag': 'A', 'feat ..., type: <class 'tuple'>, length: 3"


To get the **parent span**, write:

In [4]:
span.annotations[0].parent_span

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
nurrus,5,nurruma,V,V,{'s': ''},0,root,,,,"(""Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'tuple'>, length: 3"


To iterate over all **children**, write:

In [5]:
for child in span.annotations[0].children:
    display(child)

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ilus,1,ilus,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()


text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
suur,2,suur,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()


text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
karvane,3,karvane,A,A,"{'sg': '', 'n': ''}",4,amod,,,"Span('kass', [{'id': 4, 'lemma': 'kass', 'upostag': 'S', 'xpostag': 'S', 'feats' ..., type: <class 'estnltk_core.layer.span.Span'>",()
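
The same two attributes are enough to walk the whole tree. The sketch below shows two hypothetical helpers, `path_to_root` and `subtree`, that rely only on the `span.annotations[0].parent_span` and `span.annotations[0].children` access pattern shown above, and it assumes that the root's `parent_span` is `None`. To keep the example self-contained it runs on small stand-in objects built with `SimpleNamespace` rather than on real EstNLTK spans, so treat it as an untested outline for real layers.

```python
from types import SimpleNamespace

def path_to_root(span):
    """Follow parent_span links upward until a span without a parent is reached."""
    path = [span]
    parent = span.annotations[0].parent_span
    while parent is not None:
        path.append(parent)
        parent = parent.annotations[0].parent_span
    return path

def subtree(span):
    """Collect the span and all of its descendants, depth first."""
    collected = [span]
    for child in span.annotations[0].children:
        collected.extend(subtree(child))
    return collected

def fake_span(text, parent=None):
    """Stand-in for an EstNLTK span (illustration only)."""
    annotation = SimpleNamespace(parent_span=parent, children=())
    span = SimpleNamespace(text=text, annotations=[annotation])
    if parent is not None:
        parent.annotations[0].children += (span,)
    return span

nurrus = fake_span('nurrus')
kass = fake_span('kass', nurrus)
ilus = fake_span('Ilus', kass)
suur = fake_span('suur', kass)
karvane = fake_span('karvane', kass)

print([s.text for s in path_to_root(ilus)])
print([s.text for s in subtree(kass)])
```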


---

## Working with CONLL files

EstNLTK lets us import data from CONLL format files as Text objects with one or more syntactic analysis layers. 
To see how this works, let's import the annotations from the file `example.conll` (the file should be in the same folder as this tutorial):

In [6]:
from estnltk.converters.conll.conll_importer import conll_to_text

# Reading data from the file 'example.conll'
text = conll_to_text(file='example.conll', syntax_layer='imported_syntax')

You can specify the name of the syntax layer on import. 
When working with several syntactic analysis layers, keep in mind that each layer must have a unique name.

In [7]:
text

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,True,7
imported_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


In [8]:
text.imported_syntax

layer name,attributes,parent,enveloping,ambiguous,span count
imported_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7

text,id,lemma,upostag,xpostag,feats,head,deprel,deps,misc,parent_span,children
Ta,1,tema,P,P,"{'sg': '', 'n': ''}",2,@ADVL,,,"Span('on', [{'id': 2, 'lemma': 'olema', 'upostag': 'V', 'xpostag': 'V', 'feats': ..., type: <class 'estnltk_core.layer.span.Span'>",()
on,2,olema,V,V,{'b': ''},0,ROOT,,,,"(""Span('Ta', [{'id': 1, 'lemma': 'tema', 'upostag': 'P', 'xpostag': 'P', 'feats' ..., type: <class 'tuple'>, length: 3"
ise,3,ise,P,P,"{'pl': '', 'n': ''}",4,@NN>,,,"Span('tee', [{'id': 4, 'lemma': 'tee', 'upostag': 'S', 'xpostag': 'S', 'feats': ..., type: <class 'estnltk_core.layer.span.Span'>",()
tee,4,tee,S,S,"{'sg': '', 'n': ''}",2,@ADVL,,,"Span('on', [{'id': 2, 'lemma': 'olema', 'upostag': 'V', 'xpostag': 'V', 'feats': ..., type: <class 'estnltk_core.layer.span.Span'>","(""Span('ise', [{'id': 3, 'lemma': 'ise', 'upostag': 'P', 'xpostag': 'P', 'feats' ..., type: <class 'tuple'>, length: 1"
esimesel,5,esimene,O,O,"{'sg': '', 'ad': ''}",6,@DN>,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk_core.layer.span.Span'>",()
poolel,6,pool,S,S,"{'sg': '', 'ad': ''}",2,@ADVL,,,"Span('on', [{'id': 2, 'lemma': 'olema', 'upostag': 'V', 'xpostag': 'V', 'feats': ..., type: <class 'estnltk_core.layer.span.Span'>","(""Span('esimesel', [{'id': 5, 'lemma': 'esimene', 'upostag': 'O', 'xpostag': 'O' ..., type: <class 'tuple'>, length: 2"
.,7,.,Z,Z,,6,@Punc,,,"Span('poolel', [{'id': 6, 'lemma': 'pool', 'upostag': 'S', 'xpostag': 'S', 'feat ..., type: <class 'estnltk_core.layer.span.Span'>",()
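
Each row of the layer corresponds to one token line of the CONLL file, which holds ten tab-separated columns. The following sketch parses one such line by hand to show the mapping. It is only an illustration and not how `conll_to_text` works internally: `parse_conll_line` is a hypothetical helper, the column names are taken from the layer attributes listed in this tutorial, and the sample line is reconstructed from the table above (in particular, the way the `feats` column is written in the actual file is a guess).

```python
# Column names as they appear in the layer attributes of this tutorial
COLUMNS = ('id', 'form', 'lemma', 'upostag', 'xpostag',
           'feats', 'head', 'deprel', 'deps', 'misc')

def parse_conll_line(line):
    """Hypothetical helper: split one token line into a dict keyed by column name."""
    values = line.rstrip('\n').split('\t')
    if len(values) != len(COLUMNS):
        raise ValueError('expected %d columns, got %d' % (len(COLUMNS), len(values)))
    token = dict(zip(COLUMNS, values))
    # Plain integer ids only; this sketch ignores any other id notation
    token['id'] = int(token['id'])
    token['head'] = int(token['head'])
    return token

# Reconstructed sample line (assumption, see the note above)
sample = '1\tTa\ttema\tP\tP\tsg|n\t2\t@ADVL\t_\t_'
token = parse_conll_line(sample)
print(token['form'], token['head'], token['deprel'])
```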


We can also add more syntax layers from CONLL files. 
This is useful if we have parsed our data with different parsing models, or if we have a gold standard and want to compare the annotations against it.

In [9]:
from estnltk.converters.conll.conll_importer import add_layer_from_conll

# Adding the analysis from the second file to the Text object that we already created
add_layer_from_conll(file='example2.conll', text=text, syntax_layer='imported_syntax_2')

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,True,7
imported_syntax,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7
imported_syntax_2,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


We can also iterate over sentences and words and compare their attributes:

In [10]:
syntax_1 = text.imported_syntax
syntax_2 = text.imported_syntax_2

for sentence in text.sentences:
    for word in sentence:
        if syntax_1.get(word).deprel == syntax_2.get(word).deprel:
            print(syntax_1.get(word).text, syntax_2.get(word).text)

Ta Ta
on on
ise ise
tee tee
esimesel esimesel
poolel poolel
. .


All _deprel_ values were equal. This is the expected result, because the contents of the two files are identical.
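
When the layers do differ, it is usually more useful to list the disagreements than the matches. The sketch below outlines one way to do that with a hypothetical helper `find_disagreements`, which compares both `head` and `deprel`. It assumes that the two layers can be iterated in parallel and that each span exposes `text`, `head` and `deprel`. To stay self-contained it runs on stand-in objects, and the `'@SUBJ'` label in the second layer is an invented difference used purely for illustration.

```python
from types import SimpleNamespace

def find_disagreements(layer_a, layer_b):
    """Return (text, (head_a, deprel_a), (head_b, deprel_b)) for each differing word."""
    differences = []
    for span_a, span_b in zip(layer_a, layer_b):
        a = (span_a.head, span_a.deprel)
        b = (span_b.head, span_b.deprel)
        if a != b:
            differences.append((span_a.text, a, b))
    return differences

def fake_layer(rows):
    """Stand-in for a syntax layer: a list of objects with text, head and deprel."""
    return [SimpleNamespace(text=t, head=h, deprel=d) for t, h, d in rows]

layer_1 = fake_layer([('Ta', 2, '@ADVL'), ('on', 0, 'ROOT'), ('ise', 4, '@NN>')])
# Hypothetical second analysis in which the first word has a different label
layer_2 = fake_layer([('Ta', 2, '@SUBJ'), ('on', 0, 'ROOT'), ('ise', 4, '@NN>')])

differences = find_disagreements(layer_1, layer_2)
print(differences)
```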

**Note #1:** you can also use `conll_to_text` to import a syntax preprocessing layer (a layer that is usually created by `ConllMorphTagger`, as described in [this tutorial](01_syntax_preprocessing.ipynb)).

In the following example, the file `example.conllu` already has CoNLL-U format annotation. We read the file and add syntactic analysis with `UDPipeTagger` (note: running this example requires that `UDPipeTagger` has been properly installed, see [this tutorial](03_syntactic_analysis_with_udpipe.ipynb) for details):

In [11]:
from estnltk.taggers import UDPipeTagger
from estnltk.converters.conll.conll_importer import conll_to_text

# Import the file as a preprocessing layer named 'conllu_morph'
text = conll_to_text(file='example.conllu', syntax_layer='conllu_morph') 

udpipe_tagger = UDPipeTagger(input_syntax_layer='conllu_morph', version='conllu') # default version is conllx

udpipe_tagger.tag(text)

text.udpipe_syntax



layer name,attributes,parent,enveloping,ambiguous,span count
udpipe_syntax,"id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc",conllu_morph,,True,22

text,id,form,lemma,upostag,xpostag,feats,head,deprel,deps,misc
Aga,1,Aga,aga,CCONJ,J,OrderedDict(),5,cc,_,_
ma,2,ma,mina,PRON,P,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])",5,nsubj,_,_
peaksin,3,peaksin,pidama,AUX,V,"OrderedDict([('Mood', 'Cnd'), ('Number', 'Sing'), ('Person', '1'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6",5,aux,_,_
vist,4,vist,vist,ADV,D,OrderedDict(),5,advmod,_,_
pihtima,5,pihtima,pihtima,VERB,V,"OrderedDict([('Case', 'Ill'), ('VerbForm', 'Sup'), ('Voice', 'Act')])",0,root,_,_
",",6,",",",",PUNCT,Z,OrderedDict(),10,punct,_,_
et,7,et,et,SCONJ,J,OrderedDict(),10,mark,_,_
ma,8,ma,mina,PRON,P,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])",10,nsubj,_,_
ei,9,ei,ei,AUX,V,"OrderedDict([('Polarity', 'Neg')])",10,aux,_,_
tea,10,tea,teadma,VERB,V,"OrderedDict([('Connegative', 'Yes'), ('Mood', 'Ind'), ('Tense', 'Pres'), ('VerbF ..., type: <class 'collections.OrderedDict'>, length: 5",5,ccomp,_,_


**Note #2:** in addition to the function `conll_to_text`, EstNLTK also has the function `conll_to_texts_list`, which imports multiple Text objects from a single CONLL file. However, detecting document boundaries inside a CONLL file relies on a heuristic specifically designed for the [Estonian UD corpus](https://github.com/UniversalDependencies/UD_Estonian-EDT). For details, see [this tutorial](../../corpus_processing/importing_text_objects_from_corpora.ipynb).

---

## Validation Retaggers

Tutorials about `UDValidationRetagger` and `DeprelAgreementRetagger` can be found [here](03_syntactic_analysis_with_stanza.ipynb) (scroll below the Stanza tutorials).

---




## Evaluation: Labeled Attachment Score

Once we have two syntactic analysis layers on a Text object, we usually want to compare them quantitatively as well. Labeled Attachment Score (LAS) is a standard evaluation metric for dependency parsing. 
It is the proportion of words that are assigned both the correct syntactic head and the correct dependency label, so it ranges from 0 to 1.

Let's use syntax layers from files `example.conll` and `example2.conll` as an example:

In [12]:
from estnltk.converters.conll.conll_importer import conll_to_text
from estnltk.converters.conll.conll_importer import add_layer_from_conll

# Reading data from the file 'example.conll'
text = conll_to_text(file='example.conll', syntax_layer='malt_1')
# Adding the analysis in the second file to the Text object that we already created
add_layer_from_conll(file='example2.conll', text=text, syntax_layer='malt_2')

text
Ta on ise tee esimesel poolel .

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,,,,True,7
malt_1,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7
malt_2,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, parent_span, children",,,False,7


In [13]:
from estnltk.taggers.standard.syntax.scoring.scoring import las_score

Remember that both imported syntactic analysis layers were equal.
So, the initial calculation gives us a perfect score:

In [14]:
las_score(layer_a=text.malt_1,
          layer_b=text.malt_2)

1.0

To understand LAS better, let's change the second layer a little bit:

In [15]:
text.malt_2[3].deprel = '@X'

text.malt_1[3].deprel != text.malt_2[3].deprel

True

Now, let's calculate the score again:

In [16]:
las_score(layer_a=text.malt_1,
          layer_b=text.malt_2)

0.8571428571428571

We can also calculate the score for only part of the Text object, e.g. for the first 4 spans or for the spans starting from the 5th:

In [17]:
las_score(text.malt_1, text.malt_2, 0, 4)

0.75

In [18]:
las_score(text.malt_1, text.malt_2, 4)

1.0
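
The three scores above can be checked by hand. The sketch below is a plain-Python restatement of the LAS definition and not the code behind `las_score`: the helper `las` and its `start`/`end` parameters are assumptions for illustration, and the `(head, deprel)` pairs are copied from the layer table shown earlier, with the fourth label changed to `'@X'` in the second list.

```python
def las(pairs_a, pairs_b, start=0, end=None):
    """Share of positions where both head and deprel agree (illustrative helper)."""
    a = pairs_a[start:end]
    b = pairs_b[start:end]
    if len(a) != len(b) or not a:
        raise ValueError('layers must be non-empty and of equal length')
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

# (head, deprel) for: Ta on ise tee esimesel poolel .
malt_1 = [(2, '@ADVL'), (0, 'ROOT'), (4, '@NN>'), (2, '@ADVL'),
          (6, '@DN>'), (2, '@ADVL'), (6, '@Punc')]
malt_2 = list(malt_1)
malt_2[3] = (2, '@X')  # same edit as in the tutorial: the label of 'tee' is changed

print(las(malt_1, malt_2))        # 6 of 7 words agree
print(las(malt_1, malt_2, 0, 4))  # 3 of the first 4 words agree
print(las(malt_1, malt_2, 4))     # all 3 remaining words agree
```

Only one of the seven words differs, so the overall score is 6/7. The first four words contain that one mismatch (3/4), and the last three contain none.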

In addition, we can compute LAS scores over a sliding window, which makes it possible to compare different text segments.

In [19]:
from estnltk.taggers import SyntaxLasTagger

tagger = SyntaxLasTagger('malt_1', 'malt_2', window=3)
tagger

name,output layer,output attributes,input layers
SyntaxLasTagger,las,"('deprel_sequence', 'score')","('malt_1', 'malt_2')"

0,1
window,3


In [20]:
tagger.tag(text)
text.las

0,1
aggregate_deprel_sequences,"{('@ADVL',): [1.0], ('@ADVL', 'ROOT'): [1.0], ('@ADVL', 'ROOT', '@NN>'): [1.0], ('ROOT', '@NN>', '@ADVL'): [0.6666666666666666], ('@NN>', '@ADVL', '@DN>'): [0.6666666666666666], ('@ADVL', '@DN>', '@ADVL'): [0.6666666666666666], ('@DN>', '@ADVL', '@Punc'): [1.0], ('@ADVL', '@Punc'): [1.0], ('@Punc',): [1.0]}"

layer name,attributes,parent,enveloping,ambiguous,span count
las,"deprel_sequence, score",,malt_1,False,9

text,deprel_sequence,score
['Ta'],"('@ADVL',)",1.0
"['Ta', 'on']","('@ADVL', 'ROOT')",1.0
"['Ta', 'on', 'ise']","('@ADVL', 'ROOT', '@NN>')",1.0
"['on', 'ise', 'tee']","('ROOT', '@NN>', '@ADVL')",0.6666666666666666
"['ise', 'tee', 'esimesel']","('@NN>', '@ADVL', '@DN>')",0.6666666666666666
"['tee', 'esimesel', 'poolel']","('@ADVL', '@DN>', '@ADVL')",0.6666666666666666
"['esimesel', 'poolel', '.']","('@DN>', '@ADVL', '@Punc')",1.0
"['poolel', '.']","('@ADVL', '@Punc')",1.0
['.'],"('@Punc',)",1.0


`SyntaxLasTagger` adds metadata to the output layer. The `aggregate_deprel_sequences` field of the metadata lists all LAS scores for every `deprel` sequence encountered.

In [21]:
text.las.meta

{'aggregate_deprel_sequences': {('@ADVL',): [1.0],
  ('@ADVL', 'ROOT'): [1.0],
  ('@ADVL', 'ROOT', '@NN>'): [1.0],
  ('ROOT', '@NN>', '@ADVL'): [0.6666666666666666],
  ('@NN>', '@ADVL', '@DN>'): [0.6666666666666666],
  ('@ADVL', '@DN>', '@ADVL'): [0.6666666666666666],
  ('@DN>', '@ADVL', '@Punc'): [1.0],
  ('@ADVL', '@Punc'): [1.0],
  ('@Punc',): [1.0]}}
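
The windowing behaviour can be reconstructed from the output above: with `window=3` and 7 words there are 9 spans, because the window is also allowed to overhang the start and the end of the text. The sketch below reproduces the scores and the aggregate dictionary in plain Python. It is inferred from the observed output only and is not `SyntaxLasTagger`'s actual implementation; the function name `sliding_las` is hypothetical.

```python
from collections import defaultdict

def sliding_las(pairs_a, pairs_b, window=3):
    """Score every window position, including partial windows at both ends."""
    n = len(pairs_a)
    results = []
    aggregate = defaultdict(list)
    for first in range(-(window - 1), n):
        lo, hi = max(0, first), min(n, first + window)
        a, b = pairs_a[lo:hi], pairs_b[lo:hi]
        score = sum(x == y for x, y in zip(a, b)) / len(a)
        deprels = tuple(deprel for _, deprel in a)  # labels come from the first layer
        results.append((deprels, score))
        aggregate[deprels].append(score)
    return results, dict(aggregate)

# (head, deprel) for: Ta on ise tee esimesel poolel .
malt_1 = [(2, '@ADVL'), (0, 'ROOT'), (4, '@NN>'), (2, '@ADVL'),
          (6, '@DN>'), (2, '@ADVL'), (6, '@Punc')]
malt_2 = list(malt_1)
malt_2[3] = (2, '@X')  # the single difference introduced in the tutorial

results, aggregate = sliding_las(malt_1, malt_2, window=3)
for deprels, score in results:
    print(deprels, score)
```

Every window that contains _'tee'_ scores 2/3, and all other windows score 1.0, which matches the `las` layer shown above.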