## <span style="color:purple">Dependency syntactic analysis with Stanza</span>

[Stanza](https://stanfordnlp.github.io/stanza/) is a collection of linguistic analysis tools that includes a state-of-the-art neural dependency parser.
EstNLTK contains wrappers for Stanza's dependency parser, and provides parser models trained on the [Estonian Universal Dependencies Treebank 2.5](https://github.com/UniversalDependencies/UD_Estonian-EDT). 
To use these tools, you'll need to install:

* [EstNLTK-neural package](https://github.com/estnltk/estnltk/tree/main/estnltk_neural)  (v1.7.0+)
* [Stanza](https://github.com/stanfordnlp/stanza) (v1.1+)

EstNLTK-neural has two Stanza-based syntax taggers: `StanzaSyntaxTagger` and `StanzaSyntaxEnsembleTagger`. 
`StanzaSyntaxTagger` has models that work on different inputs, and it works faster; `StanzaSyntaxEnsembleTagger` is more accurate, but also slower due to the usage of an ensemble of models. 

The following table gives an overview of the two taggers and their corresponding models:

|             | StanzaSyntaxTagger(sentences) | StanzaSyntaxTagger(morph_analysis) | StanzaSyntaxTagger(morph_extended) | StanzaSyntaxEnsembleTagger |
| ----------- | ------------- | --------- | -------------- | -------- |
| **model name**  | stanza_depparse.pt <br> (with stanza's et pre-processors)| morph_analysis.pt | morph_extended.pt | ensemble_models/model_1.pt <br>.. <br>ensemble_models/model_10.pt<br> (10 models in total) |
| **needs to be downloaded?** | Yes | Yes | Yes | Yes |
| **required preprocessing** | `words`, `sentences` | `words`, `sentences`, `morph_analysis` | `words`, `sentences`, `morph_extended` | `words`, `sentences`, `morph_extended` |
| **tagger's <br>configuration** | input_type='sentences',<br> input_morph_layer=None | input_type='morph_analysis',<br> input_morph_layer='morph_analysis' | input_type='morph_extended',<br> input_morph_layer='morph_extended' |  |
| **dependency relations (_deprel_)** | [UD tags](https://universaldependencies.org/u/dep/index.html) | [UD tags](https://universaldependencies.org/u/dep/index.html) | [UD tags](https://universaldependencies.org/u/dep/index.html)  | [UD tags](https://universaldependencies.org/u/dep/index.html)  |
| ***upostag, xpostag, feats*** | [UD tags](https://universaldependencies.org/et/index.html#morphology) | [Vabamorf's tagset](https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/00_tables_of_morphological_categories.ipynb)  | [morph_extended tags](01_syntax_preprocessing.ipynb) | [morph_extended tags](01_syntax_preprocessing.ipynb) |
| **comment** | *Stanza's original et model, run on EstNLTK's tokenization*| *model trained on EstNLTK's 'morph_analysis' layer* | *model trained on EstNLTK's 'morph_extended' layer* | *models trained on EstNLTK's 'morph_extended' layer* |
| **accuracy<br> (LAS)**     | 83.8        | 85.75        | 85.87          | 86.43    |
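
The accuracy row reports the labeled attachment score (LAS): the percentage of tokens whose predicted head and dependency relation both match the gold annotation. The following is a minimal sketch of that computation on hypothetical toy data. The function name `attachment_scores` is introduced here for illustration only, and the evaluation behind the numbers in the table may differ in details (for example, in how punctuation or relation subtypes are handled).

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent.

    Both arguments are equally long lists of (head, deprel) pairs,
    one pair per token. This is an illustrative helper, not an EstNLTK API.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens with the correct head, regardless of the label.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS counts tokens where both the head and the label are correct.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Hypothetical toy annotations for a four-token sentence.
gold = [(2, "nsubj"), (0, "root"), (2, "xcomp"), (3, "xcomp")]
pred = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "xcomp")]

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.1f} LAS={las:.1f}")
```

In the toy data, the third token has the right head but the wrong label, and the fourth token has the wrong head, so LAS is lower than the unlabeled score (UAS).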

### StanzaSyntaxTagger

Before using `StanzaSyntaxTagger`, you'll need to get the models.
Models are not distributed with EstNLTK due to their large size. 
You need to download the models separately:

* If you create a new instance of `StanzaSyntaxTagger` and the models are missing, you'll be prompted with a question asking for permission to download the models;
* Alternatively, you can pre-download all models manually as a single package via the `download` function:

```python
from estnltk import download
download('stanzasyntaxtagger')
```

After pre-downloading, `StanzaSyntaxTagger` should be able to automatically detect the models.

EstNLTK's `StanzaSyntaxTagger` can be configured to run on three different inputs: `sentences`, `morph_analysis` or `morph_extended` layer. 
Correspondingly, the parameter `input_type` has to be set to 'sentences', 'morph_analysis' or 'morph_extended'.
In the latter two cases, the correct morphology layer name also has to be passed as the parameter `input_morph_layer` (default: 'morph_analysis').

Usage example:

```python
from estnltk import Text
from estnltk_neural.taggers import StanzaSyntaxTagger

text = Text("Tema tahtis saada rikkaks ja kuulsaks ja elada kodanlikku elu .")
text.tag_layer('morph_analysis')

stanza_tagger = StanzaSyntaxTagger(input_type='morph_analysis', input_morph_layer='morph_analysis')

stanza_tagger.tag( text )

text.stanza_syntax
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| stanza_syntax | id, lemma, upostag, xpostag, feats, head, deprel, deps, misc | morph_analysis | | False | 11 |

| text | id | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tema | 1 | tema | P | P | `OrderedDict([('sg', 'sg'), ('n', 'n')])` | 2 | nsubj | _ | _ |
| tahtis | 2 | tahtma | V | V | `OrderedDict([('s', 's')])` | 0 | root | _ | _ |
| saada | 3 | saama | V | V | `OrderedDict([('da', 'da')])` | 2 | xcomp | _ | _ |
| rikkaks | 4 | rikas | A | A | `OrderedDict([('sg', 'sg'), ('tr', 'tr')])` | 3 | xcomp | _ | _ |
| ja | 5 | ja | J | J | `OrderedDict()` | 6 | cc | _ | _ |
| kuulsaks | 6 | kuulus | A | A | `OrderedDict([('sg', 'sg'), ('tr', 'tr')])` | 4 | conj | _ | _ |
| ja | 7 | ja | J | J | `OrderedDict()` | 8 | cc | _ | _ |
| elada | 8 | elama | V | V | `OrderedDict([('da', 'da')])` | 3 | conj | _ | _ |
| kodanlikku | 9 | kodanlik | A | A | `OrderedDict([('sg', 'sg'), ('p', 'p')])` | 10 | amod | _ | _ |
| elu | 10 | elu | S | S | `OrderedDict([('sg', 'sg'), ('p', 'p')])` | 8 | obj | _ | _ |


**Output categories**. By default, `StanzaSyntaxTagger` uses [UD tags](https://universaldependencies.org/u/dep/index.html) for _deprel_ , and values of _lemma, upostag, xpostag, feats_ come from the input layer ('morph_analysis' or ['morph_extended'](01_syntax_preprocessing.ipynb)).

If `input_type='sentences'`, then `StanzaSyntaxTagger` only uses the tokenization from EstNLTK, and applies Stanza's original Estonian models (`processors='tokenize,pos,lemma,depparse'`) to the input text.
As a result, [UD tags](https://universaldependencies.org/et/index.html#morphology) are also used in _upostag_ and _feats_ attributes.
Example:

```python
# Use the original Stanza's Estonian models with EstNLTK's tokenization
stanza_tagger_sent = StanzaSyntaxTagger(input_type='sentences', output_layer='stanza_syntax_sent', depparse_path='')

stanza_tagger_sent.tag( text )

text.stanza_syntax_sent
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| stanza_syntax_sent | id, lemma, upostag, xpostag, feats, head, deprel, deps, misc | words | | False | 11 |

| text | id | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tema | 1 | tema | PRON | P | `OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '3'), ('PronType', 'Prs')])` | 2 | nsubj | _ | _ |
| tahtis | 2 | tahtma | VERB | V | `OrderedDict([('Mood', 'Ind'), ('Number', 'Sing'), ('Person', '3'), ('Tense', 'Pa ..., type: <class 'collections.OrderedDict'>, length: 6` | 0 | root | _ | _ |
| saada | 3 | saama | VERB | V | `OrderedDict([('VerbForm', 'Inf')])` | 2 | xcomp | _ | _ |
| rikkaks | 4 | rikas | ADJ | A | `OrderedDict([('Case', 'Tra'), ('Degree', 'Pos'), ('Number', 'Sing')])` | 3 | xcomp | _ | _ |
| ja | 5 | ja | CCONJ | J | `OrderedDict()` | 6 | cc | _ | _ |
| kuulsaks | 6 | kuulus | ADJ | A | `OrderedDict([('Case', 'Tra'), ('Degree', 'Pos'), ('Number', 'Sing')])` | 4 | conj | _ | _ |
| ja | 7 | ja | CCONJ | J | `OrderedDict()` | 8 | cc | _ | _ |
| elada | 8 | elama | VERB | V | `OrderedDict([('VerbForm', 'Inf')])` | 3 | conj | _ | _ |
| kodanlikku | 9 | kodanlik | ADJ | A | `OrderedDict([('Case', 'Par'), ('Degree', 'Pos'), ('Number', 'Sing')])` | 10 | amod | _ | _ |
| elu | 10 | elu | NOUN | S | `OrderedDict([('Case', 'Par'), ('Number', 'Sing')])` | 8 | obj | _ | _ |


**Providing your own models**: if you are developing your own models and want to test them with `StanzaSyntaxTagger`, you can pass the folder containing the models as the argument `resources_path`. 

**Extra flags.** Use the flag `add_parent_and_children` to add the extra attributes `parent_span` and `children` to the output layer, which make querying dependency relations easier. 
Use the flags `mark_syntax_error` and `mark_agreement_error` to switch on debugging modes that mark possible errors in syntactic relations (see more under `UDValidationRetagger` and `DeprelAgreementRetagger`).  
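
To illustrate what the `parent_span` and `children` attributes make convenient, the sketch below derives the same kind of parent/child links from the plain `id` and `head` values. It does not use EstNLTK: the rows are copied by hand from the ten tokens shown in the 'morph_analysis' example output above, and the helper `link_parents_and_children` is a hypothetical name. In the real layer, the attributes hold span objects rather than the ids and strings used here.

```python
from collections import defaultdict

# (id, text, head, deprel) rows copied from the example output above.
rows = [
    (1, "Tema", 2, "nsubj"),
    (2, "tahtis", 0, "root"),
    (3, "saada", 2, "xcomp"),
    (4, "rikkaks", 3, "xcomp"),
    (5, "ja", 6, "cc"),
    (6, "kuulsaks", 4, "conj"),
    (7, "ja", 8, "cc"),
    (8, "elada", 3, "conj"),
    (9, "kodanlikku", 10, "amod"),
    (10, "elu", 8, "obj"),
]


def link_parents_and_children(rows):
    """Map each token id to its parent's text and to the ids of its children."""
    text_by_id = {tok_id: text for tok_id, text, _, _ in rows}
    parent = {}
    children = defaultdict(list)
    for tok_id, _, head, _ in rows:
        # head == 0 marks the root, which has no parent.
        parent[tok_id] = text_by_id.get(head)
        if head != 0:
            children[head].append(tok_id)
    return parent, dict(children)


parent, children = link_parents_and_children(rows)
print(parent[10])    # parent of 'elu'
print(children[3])   # ids of the dependents of 'saada'
```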

### StanzaSyntaxEnsembleTagger 

`StanzaSyntaxEnsembleTagger` aggregates the predictions of 10 models to form one prediction.
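
How the tagger combines the predictions is not described here. As a generic illustration of ensemble aggregation, the sketch below applies per-token majority voting over `(head, deprel)` pairs. This is an assumption made for illustration, not a description of EstNLTK's implementation; the function `majority_vote` and the toy predictions are hypothetical. Note that naive per-token voting does not guarantee that the combined heads form a valid tree.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by per-token majority voting.

    `predictions` is a list with one entry per model; each entry is a list
    of (head, deprel) pairs, one pair per token. Illustrative only.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    n_tokens = len(predictions[0])
    if any(len(p) != n_tokens for p in predictions):
        raise ValueError("all models must predict the same number of tokens")
    combined = []
    for token_votes in zip(*predictions):
        # most_common() keeps first-seen order among equal counts,
        # so ties are resolved in favour of earlier models.
        combined.append(Counter(token_votes).most_common(1)[0][0])
    return combined


# Hypothetical predictions of three models for a three-token sentence.
model_1 = [(2, "nsubj"), (0, "root"), (2, "obj")]
model_2 = [(2, "nsubj"), (0, "root"), (2, "xcomp")]
model_3 = [(2, "obj"), (0, "root"), (2, "xcomp")]

result = majority_vote([model_1, model_2, model_3])
print(result)
```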

`StanzaSyntaxEnsembleTagger`'s models are not distributed with EstNLTK, and need to be downloaded separately:

* If you create a new instance of `StanzaSyntaxEnsembleTagger` and the models are missing, you'll be prompted with a question asking for permission to download the models;
* Alternatively, you can pre-download all models manually as a single package via the `download` function:

```python
from estnltk import download
download('stanzasyntaxensembletagger')
```

After pre-downloading, `StanzaSyntaxEnsembleTagger` should be able to automatically detect the models.

`StanzaSyntaxEnsembleTagger` uses the 'morph_extended' layer as input, and produces the `stanza_ensemble_syntax` layer as a result:

```python
from estnltk_neural.taggers import StanzaSyntaxEnsembleTagger

text.tag_layer('morph_extended')

ensembletagger = StanzaSyntaxEnsembleTagger(output_layer='stanza_ensemble_syntax',
                                            input_morph_layer='morph_extended')

ensembletagger.tag( text )

text.stanza_ensemble_syntax
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| stanza_ensemble_syntax | id, lemma, upostag, xpostag, feats, head, deprel, deps, misc | morph_extended | | False | 11 |

| text | id | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tema | 1 | tema | P | P | `OrderedDict([('sg', 'sg'), ('nom', 'nom')])` | 2 | nsubj | _ | _ |
| tahtis | 2 | tahtma | V | V | `OrderedDict([('main', 'main'), ('indic', 'indic'), ('impf', 'impf'), ('ps3', 'ps ..., type: <class 'collections.OrderedDict'>, length: 7` | 0 | root | _ | _ |
| saada | 3 | saama | V | V | `OrderedDict([('aux', 'aux'), ('inf', 'inf')])` | 2 | xcomp | _ | _ |
| rikkaks | 4 | rikas | A | A | `OrderedDict([('pos', 'pos'), ('sg', 'sg'), ('tr', 'tr')])` | 3 | xcomp | _ | _ |
| ja | 5 | ja | J | J | `OrderedDict([('sub', 'sub'), ('crd', 'crd')])` | 6 | cc | _ | _ |
| kuulsaks | 6 | kuulus | A | A | `OrderedDict([('pos', 'pos'), ('sg', 'sg'), ('tr', 'tr')])` | 4 | conj | _ | _ |
| ja | 7 | ja | J | J | `OrderedDict([('sub', 'sub'), ('crd', 'crd')])` | 8 | cc | _ | _ |
| elada | 8 | elama | V | V | `OrderedDict([('mod', 'mod'), ('inf', 'inf')])` | 3 | conj | _ | _ |
| kodanlikku | 9 | kodanlik | A | A | `OrderedDict([('pos', 'pos'), ('sg', 'sg'), ('part', 'part')])` | 10 | amod | _ | _ |
| elu | 10 | elu | S | S | `OrderedDict([('com', 'com'), ('sg', 'sg'), ('part', 'part')])` | 8 | obj | _ | _ |


**Output categories**. `StanzaSyntaxEnsembleTagger` uses [UD tags](https://universaldependencies.org/u/dep/index.html) for _deprel_, and values of _lemma, upostag, xpostag, feats_ come from the input ['morph_extended'](01_syntax_preprocessing.ipynb) layer.

**Providing your own models**: if you are developing your own models and want to test these with `StanzaSyntaxEnsembleTagger`, you can also pass a list of models' paths to the argument `model_paths` when initializing the tagger.

**Extra flags.** Use the flag `add_parent_and_children` to add the extra attributes `parent_span` and `children` to the output layer, which make querying dependency relations easier. 
Use the flags `mark_syntax_error` and `mark_agreement_error` to switch on debugging modes that mark possible errors in syntactic relations (see more under `UDValidationRetagger` and `DeprelAgreementRetagger`).  

## <span style="color:purple">Validation Retaggers</span>

Validation retaggers are meant to point out possible errors made by syntactic parsers. `UDValidationRetagger` and `DeprelAgreementRetagger` can be used through `StanzaSyntaxTagger` and `StanzaSyntaxEnsembleTagger` by setting the arguments `mark_syntax_error = True` and `mark_agreement_error = True` when creating the tagger.

The outputs of other parsers can also be evaluated, but in that case the validation retaggers must be created explicitly.

### UDValidationRetagger

`UDValidationRetagger` checks a syntax layer that has deprels in **UD format** (and preferably Vabamorf features) for common errors and inconsistencies. For that purpose, it uses the UD validation script from https://github.com/universaldependencies/tools/. Syntactic errors relating to the UPOS tag are ignored, because Vabamorf produces different POS tags.

The syntax layer must have the attributes `id`, `head`, `lemma`, `upostag`, `xpostag`, `feats`, `deprel`, `deps` and `misc`. The layer's name must be passed to the argument `output_layer` when initializing the retagger.

As a result, the attributes `syntax_error` and `error_message` are added to the layer. The value of `syntax_error` is True if any syntactic errors (such as non-projectivity or unsuitable children) were discovered, otherwise False. `error_message` describes the nature of the error.
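
To make the notion of non-projectivity concrete: a dependency tree is non-projective when two of its arcs cross if drawn above the sentence. The sketch below finds crossing arcs from a plain list of head indices. It only illustrates the concept and is not the check performed by the UD validation script; the function `crossing_arcs` and the second, four-token head list are hypothetical.

```python
from itertools import combinations


def crossing_arcs(heads):
    """Return pairs of arcs that cross each other.

    heads[i] is the 1-based head of token i + 1; 0 marks the root.
    Each arc is represented as a (left, right) pair of positions, and the
    arc from the artificial root position 0 is included in the check.
    """
    arcs = [tuple(sorted((head, dep))) for dep, head in enumerate(heads, start=1)]
    crossings = []
    for (a, b), (c, d) in combinations(arcs, 2):
        # Two arcs cross when exactly one endpoint of one arc lies
        # strictly inside the other arc.
        if a < c < b < d or c < a < d < b:
            crossings.append(((a, b), (c, d)))
    return crossings


# Heads of the ten tokens shown in the first StanzaSyntaxTagger example.
projective_heads = [2, 0, 2, 3, 6, 4, 8, 3, 10, 8]
# Hypothetical four-token tree in which the arcs 1-3 and 2-4 cross.
nonprojective_heads = [3, 4, 0, 3]

print(crossing_arcs(projective_heads))
print(crossing_arcs(nonprojective_heads))
```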

```python
from estnltk.taggers.standard.syntax.ud_validation.ud_validation_retagger import UDValidationRetagger

text = Text('Võid mõelda nii, et ise oledki seal ja mis sulle seal hea on, mis mitte.').tag_layer('sentences')

stanza_tagger_sent.tag( text )
validation_retagger = UDValidationRetagger(output_layer='stanza_syntax_sent')

validation_retagger.retag( text )

text.stanza_syntax_sent
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| stanza_syntax_sent | id, lemma, upostag, xpostag, feats, head, deprel, deps, misc, syntax_error, error_message | words | | False | 18 |

| text | id | lemma | upostag | xpostag | feats | head | deprel | deps | misc | syntax_error | error_message |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Võid | 1 | võima | AUX | V | `OrderedDict([('Mood', 'Ind'), ('Number', 'Sing'), ('Person', '2'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6` | 2 | aux | _ | _ | False | |
| mõelda | 2 | mõtlema | VERB | V | `OrderedDict([('VerbForm', 'Inf')])` | 0 | root | _ | _ | False | |
| nii | 3 | nii | ADV | D | `OrderedDict()` | 2 | advmod | _ | _ | False | |
| , | 4 | , | PUNCT | Z | `OrderedDict()` | 8 | punct | _ | _ | False | |
| et | 5 | et | SCONJ | J | `OrderedDict()` | 8 | mark | _ | _ | False | |
| ise | 6 | ise | ADV | D | `OrderedDict()` | 8 | advmod | _ | _ | False | |
| oledki | 7 | olema | AUX | V | `OrderedDict([('Mood', 'Ind'), ('Number', 'Sing'), ('Person', '2'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6` | 8 | cop | _ | _ | False | |
| seal | 8 | seal | ADV | D | `OrderedDict()` | 2 | ccomp | _ | _ | False | |
| ja | 9 | ja | CCONJ | J | `OrderedDict()` | 13 | cc | _ | _ | False | |
| mis | 10 | mis | PRON | P | `OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('PronType', 'Int,Rel')])` | 13 | nsubj:cop | _ | _ | False | |


### DeprelAgreementRetagger

`DeprelAgreementRetagger` detects arc labels that are inconsistent with the UD labelling rules defined for Estonian by the authors of the Estonian UD Treebank. 
The rules concern words in the translative or essive case that have a verb as their head.

`DeprelAgreementRetagger` assumes that the input syntax layer has the attributes `parent_span` and `children` (added by SyntaxDependencyRetagger).
The retagger adds the attribute `agreement_deprel` to the chosen `output_layer`. If an incorrect label is detected, the set of correct labels is given; otherwise the value is `None`.

```python
from estnltk.taggers.standard.syntax.ud_validation.deprel_agreement_retagger import DeprelAgreementRetagger

text = Text('Hispaanias oli kombeks anda jootrahaks 25 peseetat.').tag_layer('morph_analysis')

stanza_tagger = StanzaSyntaxTagger(input_type='morph_analysis', add_parent_and_children=True)

agreement_retagger = DeprelAgreementRetagger(output_layer='stanza_syntax')

stanza_tagger.tag( text )

agreement_retagger.retag( text )

text.stanza_syntax['text', 'id', 'head', 'deprel', 'agreement_deprel']
```

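| text | id | head | deprel | agreement_deprel |
| --- | --- | --- | --- | --- |
| Hispaanias | 1 | 3 | advcl | |
| oli | 2 | 3 | cop | |
| kombeks | 3 | 0 | root | |
| anda | 4 | 3 | csubj:cop | |
| jootrahaks | 5 | 4 | xcomp | {'obl'} |
| 25 | 6 | 7 | nummod | |
| peseetat | 7 | 4 | obj | |
| . | 8 | 3 | punct | |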
