## <span style="color:purple">Dependency syntactic analysis with UDPipe</span>

[UDPipe](http://ufal.mff.cuni.cz/udpipe) is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL files.
EstNLTK provides a wrapper for the UDPipe dependency parser, together with syntactic analysis models that output dependency relations in either UD or CG format.

### UDPipeTagger

#### Requirements
**UDPipe executable**. To use UDPipe-based syntactic analysis, UDPipe must be installed on the system. Installation instructions are available on the [UDPipe Installation page](http://ufal.mff.cuni.cz/udpipe/install).

EstNLTK expects the directory containing the UDPipe executable (`udpipe` on UNIX, `udpipe.exe` on Windows) to be listed in the system's PATH environment variable. UDPipe executables can be downloaded [here](https://github.com/ufal/udpipe/releases/tag/v1.2.0) (the currently supported version is 1.2.0).

You can check that UDPipe is available with the following command:

In [1]:
!udpipe --version

UDPipe version 1.2.0 (using UniLib 3.1.1,
MorphoDiTa 1.9.1-devel, Parsito 1.1.1-devel)
Copyright 2016 by Institute of Formal and Applied Linguistics, Faculty of
Mathematics and Physics, Charles University in Prague, Czech Republic.
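
The same availability check can be done from Python before a tagger is created. The following is a minimal sketch that uses only the standard library; the helper name `find_executable` is hypothetical and not part of EstNLTK.

```python
import shutil
from typing import Optional


def find_executable(name: str = "udpipe") -> Optional[str]:
    """Return the full path of `name` if it is found on PATH, else None."""
    # shutil.which follows the platform's lookup rules; on Windows it also
    # tries the extensions listed in PATHEXT, so "udpipe" can match "udpipe.exe".
    return shutil.which(name)


path = find_executable("udpipe")
if path is None:
    print("udpipe was not found on PATH; add its directory to PATH first.")
else:
    print(f"udpipe found at: {path}")
```

The check only confirms that an executable with that name is reachable; it does not verify the version, so `udpipe --version` remains the way to confirm that 1.2.0 is installed.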


**UDPipe's models**. As with other data-driven syntactic parsers, UDPipe's models are distributed separately and need to be downloaded: 

* If you create a new instance of `UDPipeTagger` and the models are missing, you'll be asked for permission to download them;
* Alternatively, you can pre-download all models manually via the `download` function:
```python
from estnltk import download
download('udpipetagger')
```

The following table gives an overview of EstNLTK's UDPipeTagger models:

|             | UD + morph_analysis | UD + morph_extended | UD + morph_extended <br> (with VislCG3 parser) | CG + morph_analysis | CG + morph_extended | CG + morph_extended <br> (with VislCG3 parser)  |
| ----------- | ------------- | --------- | -------------- | -------- | -------- | -------- |
| **Model name <br>(with relative path)** | `conllu\ud_morph_analysis.output` | `conllu\ud_morph_extended.output` | `model_1.output` | `conllx\model_2.output` | `conllx\model_3.output` | `model_0.output` |
| **Needs to be downloaded?** | Yes | Yes | Yes | Yes | Yes | Yes |
| **Required preprocessing** | `words`, `sentences`, `morph_analysis`, `conll_morph` (from `morph_analysis`) | `words`, `sentences`, `morph_extended`, `conll_morph` (from `morph_extended`) | `words`, `sentences`, `morph_extended`, `conll_morph` (from `morph_extended`; requires VislCG3 parser) | `words`, `sentences`, `morph_analysis`, `conll_morph` (from `morph_analysis`) | `words`, `sentences`, `morph_extended`, `conll_morph` (from `morph_extended`) | `words`, `sentences`, `morph_extended`, `conll_morph` (from `morph_extended`; requires VislCG3 parser) |
| **UDPipeTagger's <br>configuration** | `input_type='morph_analysis'`,<br> `version='conllu'` | `input_type='morph_extended'`,<br> `version='conllu'` | `input_type='visl_morph'`,<br> `version='conllu'` | `input_type='morph_analysis'`,<br> `version='conllx'` | `input_type='morph_extended'`,<br> `version='conllx'` | `input_type='visl_morph'`,<br> `version='conllx'` |
| **Dependency relations (_deprel_)** | [UD tags](https://universaldependencies.org/u/dep/index.html) | [UD tags](https://universaldependencies.org/u/dep/index.html) | [UD tags](https://universaldependencies.org/u/dep/index.html) | [CG tags](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf) |  [CG tags](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf) | [CG tags](https://korpused.keeleressursid.ee/syntaks/dokumendid/syntaksiliides_en.pdf) |
| ***upostag, xpostag, feats*** | [Vabamorf's tagset](https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/00_tables_of_morphological_categories.ipynb) | [morph_extended tags](01_syntax_preprocessing.ipynb) | [morph_extended tags](01_syntax_preprocessing.ipynb) | [Vabamorf's tagset](https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/00_tables_of_morphological_categories.ipynb) | [morph_extended tags](01_syntax_preprocessing.ipynb) | [morph_extended tags](01_syntax_preprocessing.ipynb) |
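
The configuration row of the table can also be read as a lookup from `(input_type, version)` to the properties of the selected model. The sketch below restates the table as a plain dictionary; the names `UDPIPE_CONFIGS` and `describe_config` are hypothetical helpers for illustration and are not part of EstNLTK.

```python
# (input_type, version) -> (deprel tagset,
#                           morphological layer that conll_morph is built from,
#                           whether the VislCG3 parser is required)
UDPIPE_CONFIGS = {
    ("morph_analysis", "conllu"): ("UD", "morph_analysis", False),
    ("morph_extended", "conllu"): ("UD", "morph_extended", False),
    ("visl_morph", "conllu"): ("UD", "morph_extended", True),
    ("morph_analysis", "conllx"): ("CG", "morph_analysis", False),
    ("morph_extended", "conllx"): ("CG", "morph_extended", False),
    ("visl_morph", "conllx"): ("CG", "morph_extended", True),
}


def describe_config(input_type: str, version: str) -> dict:
    """Look up what a given parameter combination implies, per the table."""
    key = (input_type, version)
    if key not in UDPIPE_CONFIGS:
        raise ValueError(f"Unsupported combination: {key}")
    deprel_tagset, morph_layer, needs_visl = UDPIPE_CONFIGS[key]
    return {
        "deprel_tagset": deprel_tagset,
        "morph_layer": morph_layer,
        "needs_vislcg3": needs_visl,
    }


print(describe_config("visl_morph", "conllx"))
```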


#### Preprocessing for UDPipeTagger

Preprocessing for UDPipeTagger is analogous to [MaltparserTagger's preprocessing](03_syntactic_analysis_with_maltparser.ipynb).
For models that do not require the VislCG3 parser, you can preprocess with `ConllMorphTagger` using the setting `no_visl=True`.
This converts the annotations of the preprocessing layer to CoNLL format fields, as described [here](01_syntax_preprocessing.ipynb).
Despite its name, the parameter `morph_extended_layer` accepts either input layer, so it can be used to switch between `morph_analysis` and `morph_extended`.

In [2]:
from estnltk.taggers import ConllMorphTagger

# Create the preprocessing tagger
conll_tagger = ConllMorphTagger(output_layer='conll_morph',             # default: 'conll_morph'
                                morph_extended_layer='morph_analysis',  # default: 'morph_extended'
                                no_visl=True)

# Create text and preprocess for UDPipeTagger syntax
from estnltk import Text
text = Text("Tema tahtis saada rikkaks ja kuulsaks ja elada kodanlikku elu .")
text.tag_layer('morph_analysis')
conll_tagger.tag(text)

text
Tema tahtis saada rikkaks ja kuulsaks ja elada kodanlikku elu .

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| sentences | | | words | False | 1 |
| tokens | | | | False | 11 |
| compound_tokens | type, normalized | | tokens | False | 0 |
| words | normalized_form | | | True | 11 |
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 11 |
| conll_morph | id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc | morph_analysis | | True | 11 |


For models that require the VislCG3 parser (`model_1` and `model_0`), use `ConllMorphTagger` with the setting `no_visl=False`, which is the default.
Note, however, that this requires the **VislCG3 parser to be installed on the system and accessible via the PATH environment variable**.
Information about the parser is distributed in the [Constraint Grammar's Google Group](https://groups.google.com/g/constraint-grammar), which is also the place to find the most compact guide to [getting & installing the parser](https://groups.google.com/g/constraint-grammar/c/fNMkpAb_g3U).
Once the parser is installed and available, you can create a suitable `ConllMorphTagger` as follows:

```python
# Initialize ConllMorphTagger with VislCG3 preprocessing (only works for 'morph_extended')
conll_tagger = ConllMorphTagger(morph_extended_layer='morph_extended')
```

In [3]:
import warnings

# Suppress warnings for the rest of the session. This hides, among others,
# the following repeated warnings:
#
# ..\lib\subprocess.py:935: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
#  self.stdin = io.open(p2cwrite, 'wb', bufsize)
# ..\lib\subprocess.py:941: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
#  self.stdout = io.open(c2pread, 'rb', bufsize)
# ..\lib\subprocess.py:946: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
#  self.stderr = io.open(errread, 'rb', bufsize)
warnings.filterwarnings("ignore")

#### Basic usage

You select UDPipeTagger's model with a combination of the parameters `version` and `input_type`.
`version='conllu'` selects a model with UD deprel tags, and `version='conllx'` (the default) selects a model with CG deprel tags.
`input_type` specifies the input preprocessing layer; possible values are `'morph_analysis'`, `'morph_extended'`, and `'visl_morph'` (`morph_extended` processed with the VislCG3 parser; the default).

In [4]:
from estnltk.taggers import UDPipeTagger

udpipe_tagger = UDPipeTagger(input_type='morph_analysis', version='conllx')

udpipe_tagger.tag(text)

text.udpipe_syntax

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| udpipe_syntax | id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc | conll_morph | | True | 11 |

| text | id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tema | 1 | Tema | tema | P | P | `OrderedDict([('sg', ''), ('n', '')])` | 2 | @SUBJ | _ | _ |
| tahtis | 2 | tahtis | tahtma | V | V | `OrderedDict([('s', '')])` | 0 | root | _ | _ |
| saada | 3 | saada | saama | V | V | `OrderedDict([('da', '')])` | 2 | @OBJ | _ | _ |
| rikkaks | 4 | rikkaks | rikas | A | A | `OrderedDict([('sg', ''), ('tr', '')])` | 3 | @ADVL | _ | _ |
| ja | 5 | ja | ja | J | J | `OrderedDict()` | 6 | @J | _ | _ |
| kuulsaks | 6 | kuulsaks | kuulus | A | A | `OrderedDict([('sg', ''), ('tr', '')])` | 4 | @ADVL | _ | _ |
| ja | 7 | ja | ja | J | J | `OrderedDict()` | 8 | @J | _ | _ |
| elada | 8 | elada | elama | V | V | `OrderedDict([('da', '')])` | 3 | @OBJ | _ | _ |
| kodanlikku | 9 | kodanlikku | kodanlik | A | A | `OrderedDict([('sg', ''), ('p', '')])` | 10 | @AN> | _ | _ |
| elu | 10 | elu | elu | S | S | `OrderedDict([('sg', ''), ('p', '')])` | 8 | @OBJ | _ | _ |


In [5]:
# Use UD dependency relations instead 
udpipe_tagger = UDPipeTagger(output_layer='udpipe_syntax_ud', input_type='morph_analysis', version='conllu')

udpipe_tagger.tag(text)

text.udpipe_syntax_ud

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| udpipe_syntax_ud | id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc | conll_morph | | True | 11 |

| text | id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tema | 1 | Tema | tema | P | P | `OrderedDict([('sg', ''), ('n', '')])` | 2 | nsubj | _ | _ |
| tahtis | 2 | tahtis | tahtma | V | V | `OrderedDict([('s', '')])` | 0 | root | _ | _ |
| saada | 3 | saada | saama | V | V | `OrderedDict([('da', '')])` | 2 | xcomp | _ | _ |
| rikkaks | 4 | rikkaks | rikas | A | A | `OrderedDict([('sg', ''), ('tr', '')])` | 3 | xcomp | _ | _ |
| ja | 5 | ja | ja | J | J | `OrderedDict()` | 6 | cc | _ | _ |
| kuulsaks | 6 | kuulsaks | kuulus | A | A | `OrderedDict([('sg', ''), ('tr', '')])` | 4 | conj | _ | _ |
| ja | 7 | ja | ja | J | J | `OrderedDict()` | 8 | cc | _ | _ |
| elada | 8 | elada | elama | V | V | `OrderedDict([('da', '')])` | 3 | conj | _ | _ |
| kodanlikku | 9 | kodanlikku | kodanlik | A | A | `OrderedDict([('sg', ''), ('p', '')])` | 10 | amod | _ | _ |
| elu | 10 | elu | elu | S | S | `OrderedDict([('sg', ''), ('p', '')])` | 8 | obj | _ | _ |


**Extra flags.** Use the flag `add_parent_and_children` to add the extra attributes `parent_span` and `children` to the output layer; these make dependency relations easier to query.
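
To illustrate the kind of query that parent and children links simplify, here is a plain-Python sketch that derives them from the `id` and `head` columns alone. It does not use EstNLTK: the rows are copied from the UD output above, and `build_tree` is a hypothetical helper, not the mechanism behind `add_parent_and_children`.

```python
from collections import defaultdict

# (id, form, head, deprel) copied from the UD output above
rows = [
    (1, "Tema", 2, "nsubj"),
    (2, "tahtis", 0, "root"),
    (3, "saada", 2, "xcomp"),
    (4, "rikkaks", 3, "xcomp"),
    (5, "ja", 6, "cc"),
    (6, "kuulsaks", 4, "conj"),
    (7, "ja", 8, "cc"),
    (8, "elada", 3, "conj"),
    (9, "kodanlikku", 10, "amod"),
    (10, "elu", 8, "obj"),
]


def build_tree(rows):
    """Return (forms, parent, children) dictionaries keyed by token id."""
    forms = {tok_id: form for tok_id, form, _, _ in rows}
    parent = {}
    children = defaultdict(list)
    for tok_id, _, head, _ in rows:
        parent[tok_id] = head          # head 0 marks the root
        if head != 0:
            children[head].append(tok_id)
    return forms, parent, dict(children)


forms, parent, children = build_tree(rows)
root_id = next(tok_id for tok_id, head in parent.items() if head == 0)
print(forms[root_id])                               # tahtis
print([forms[c] for c in children[root_id]])        # ['Tema', 'saada']
```

With the extra attributes enabled, lookups of this kind would presumably not need to be rebuilt by hand from `head` values.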

**Augmenting existing CoNLL-U format text.** It is also possible to add syntactic analysis to a text that already has a CoNLL-U format layer: read a file in CoNLL-U format and apply UDPipeTagger to it. To do that, specify the tagger's `version` explicitly.

In [6]:
from estnltk.converters.conll.conll_importer import conll_to_text

text = conll_to_text(file='example.conllu', syntax_layer='conllu_morph') 

udpipe_tagger = UDPipeTagger(input_syntax_layer='conllu_morph', version='conllu') # default version is conllx

udpipe_tagger.tag(text)
text.udpipe_syntax

| layer name | attributes | parent | enveloping | ambiguous | span count |
| --- | --- | --- | --- | --- | --- |
| udpipe_syntax | id, form, lemma, upostag, xpostag, feats, head, deprel, deps, misc | conllu_morph | | True | 22 |

| text | id | form | lemma | upostag | xpostag | feats | head | deprel | deps | misc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Aga | 1 | Aga | aga | CCONJ | J | `OrderedDict()` | 5 | cc | _ | _ |
| ma | 2 | ma | mina | PRON | P | `OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])` | 5 | nsubj | _ | _ |
| peaksin | 3 | peaksin | pidama | AUX | V | `OrderedDict([('Mood', 'Cnd'), ('Number', 'Sing'), ('Person', '1'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6` | 5 | aux | _ | _ |
| vist | 4 | vist | vist | ADV | D | `OrderedDict()` | 5 | advmod | _ | _ |
| pihtima | 5 | pihtima | pihtima | VERB | V | `OrderedDict([('Case', 'Ill'), ('VerbForm', 'Sup'), ('Voice', 'Act')])` | 0 | root | _ | _ |
| , | 6 | , | , | PUNCT | Z | `OrderedDict()` | 10 | punct | _ | _ |
| et | 7 | et | et | SCONJ | J | `OrderedDict()` | 10 | mark | _ | _ |
| ma | 8 | ma | mina | PRON | P | `OrderedDict([('Case', 'Nom'), ('Number', 'Sing'), ('Person', '1'), ('PronType', 'Prs')])` | 10 | nsubj | _ | _ |
| ei | 9 | ei | ei | AUX | V | `OrderedDict([('Polarity', 'Neg')])` | 10 | aux | _ | _ |
| tea | 10 | tea | teadma | VERB | V | `OrderedDict([('Connegative', 'Yes'), ('Mood', 'Ind'), ('Tense', 'Pres'), ('VerbF ..., type: <class 'collections.OrderedDict'>, length: 5` | 5 | ccomp | _ | _ |
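
The `feats` column in the outputs above is shown as an `OrderedDict`. In a CoNLL-U file the same information is stored as a single FEATS string of `Name=Value` pairs separated by `|`, with `_` for an empty set. The sketch below shows one way such a string could be mapped to the shape seen here; `parse_feats` is a hypothetical helper, not EstNLTK's own implementation, and treating bare tags such as `sg|n` as keys with empty values is an assumption based on the CG-style outputs earlier in this tutorial.

```python
from collections import OrderedDict


def parse_feats(feats: str) -> "OrderedDict[str, str]":
    """Convert a CoNLL FEATS string into an OrderedDict of name -> value."""
    result = OrderedDict()
    if feats in ("", "_"):          # "_" marks an empty feature set
        return result
    for item in feats.split("|"):
        # partition leaves value == "" when the item has no "="
        name, _, value = item.partition("=")
        result[name] = value
    return result


print(parse_feats("Case=Nom|Number=Sing|Person=1|PronType=Prs"))
print(parse_feats("sg|n"))
print(parse_feats("_"))
```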


---