## <span style="color:purple">Morphological analysis with UD (Universal Dependencies) categories</span>

By default, EstNLTK uses Vabamorf's morphological analysis categories, which are described [here](https://github.com/Filosoft/vabamorf/blob/master/doc/tagset.md). 
Vabamorf's categories can be converted to UD ([Universal Dependencies](https://universaldependencies.org/guidelines.html)) categories that are used in grammar annotation. 
Currently, this conversion has **limitations**: roughly 3% to 9% of words do not obtain the correct UD annotation, and the output is more ambiguous than Vabamorf's default analysis.
For details, see the section "Performance of the conversion" below.

A full overview and comparison of different Estonian morphological category systems can be found in [this document](https://cl.ut.ee/ressursid/morfo-systeemid/index.php?lang=en).

The conversion process is handled by `UDMorphConverter`:

In [1]:
from estnltk.text import Text
from estnltk.taggers import UDMorphConverter
ud_converter = UDMorphConverter()
ud_converter

name,output layer,output attributes,input layers
UDMorphConverter,ud_morph_analysis,"('id', 'lemma', 'upostag', 'xpostag', 'feats', 'misc')","('words', 'sentences', 'morph_analysis')"

0,1
remove_connegatives,True
generate_num_cases,True
add_deprel_attribs,False
adj_with_no_verb_feats_file,"C:\Programmid\Miniconda3\envs\py39_devel\lib\site-packages\estnltk-1.7.1-py3.9-w ..., type: <class 'str'>, length: 173"


In [2]:
# Create analysable text and add the layers required by the converter
text = Text('Rändur võttis istet.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD
ud_converter.tag( text )
text

text
Rändur võttis istet.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,4


In [3]:
# Small tweak: display at least 200 characters in html field output 
# (in order to see 'feats' at the full length)
from estnltk_core.common import OUTPUT_CONFIG
default_html_str_max_len = OUTPUT_CONFIG['html_str_max_len']
OUTPUT_CONFIG['html_str_max_len'] = 200

In [4]:
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,4

text,id,lemma,upostag,xpostag,feats,misc
Rändur,1,rändur,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",
võttis,2,võtma,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Past'), ('Mood', 'Ind'), ('VerbForm', 'Fin'), ('Number', 'Sing'), ('Person', '3')])",
istet,3,iste,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Par')])",
.,4,.,PUNCT,Z,OrderedDict(),


By default, `UDMorphConverter` only adds [CoNLL-U fields](https://universaldependencies.org/format.html) related to morphological information: `id`, `lemma`, `upostag`, `xpostag`, `feats`, `misc`.

Use the flag `add_deprel_attribs=True` to get the full set of CoNLL-U fields (`id`, `lemma`, `upostag`, `xpostag`, `feats`, `head`, `deprel`, `deps`, `misc`); however, the dependency syntax fields remain unfilled (they get `None` values).

```python
# add dependency syntax fields to the annotations (filled with None values)
ud_converter = UDMorphConverter(add_deprel_attribs=True)
```

### Conversion heuristics / options

#### Remove connegatives

The flag `remove_connegatives` can be used to remove verb annotations with `Connegative=Yes` when they are not preceded by words with `Polarity=Neg` in the sentence context. This heuristic is switched on by default:

In [5]:
# Create analysable text
text = Text('Rändur peaks kohe saabuma.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD
ud_converter = UDMorphConverter()
ud_converter.tag( text )
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,5

text,id,lemma,upostag,xpostag,feats,misc
Rändur,1,rändur,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",
peaks,2,pidama,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin')])",
,2,pidama,AUX,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin')])",
kohe,3,kohe,ADV,D,OrderedDict(),
saabuma,4,saabuma,VERB,V,"OrderedDict([('Voice', 'Act'), ('VerbForm', 'Sup'), ('Case', 'Ill')])",
.,5,.,PUNCT,Z,OrderedDict(),


In [6]:
# Create analysable text
text = Text('Rändur peaks kohe saabuma.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD (keep connegatives)
ud_converter = UDMorphConverter(remove_connegatives=False)
ud_converter.tag( text )
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,5

text,id,lemma,upostag,xpostag,feats,misc
Rändur,1,rändur,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",
peaks,2,pidama,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin')])",
,2,pidama,AUX,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin')])",
,2,pidama,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin'), ('Connegative', 'Yes')])",
,2,pidama,AUX,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Cnd'), ('VerbForm', 'Fin'), ('Connegative', 'Yes')])",
kohe,3,kohe,ADV,D,OrderedDict(),
saabuma,4,saabuma,VERB,V,"OrderedDict([('Voice', 'Act'), ('VerbForm', 'Sup'), ('Case', 'Ill')])",
.,5,.,PUNCT,Z,OrderedDict(),


Note: this is a heuristic, so it can also remove annotations erroneously. If a connegative word is preceded by a negative word that is not recognized as such (for example, the slang forms `'2ra'`, `'2i'` or `'äi'`), the connegative annotation is deleted even though it is correct.
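The heuristic can be illustrated on plain Python data. The following is a minimal sketch, not `UDMorphConverter`'s actual implementation: the function name, the list-of-dicts layout and the rule of never leaving a word without any reading are assumptions made for the illustration.

```python
def remove_unlicensed_connegatives(sentence):
    """Drop Connegative=Yes readings that no preceding negative word licenses.

    `sentence` is a list of words; each word is a list of candidate
    annotations; each annotation is a dict with a 'feats' dict.
    This layout is hypothetical and is not EstNLTK's layer structure.
    """
    result = []
    seen_negation = False
    for annotations in sentence:
        if seen_negation:
            kept = list(annotations)
        else:
            kept = [a for a in annotations
                    if a["feats"].get("Connegative") != "Yes"]
            # Assumption: never leave a word without any reading.
            if not kept:
                kept = list(annotations)
        result.append(kept)
        # A negative word licenses connegatives that follow it in the sentence.
        if any(a["feats"].get("Polarity") == "Neg" for a in annotations):
            seen_negation = True
    return result


# Illustrative readings (simplified feature sets)
noun = [{"upostag": "NOUN", "feats": {"Case": "Nom"}}]
negation = [{"upostag": "AUX", "feats": {"Polarity": "Neg"}}]
verb = [
    {"upostag": "VERB", "feats": {"Mood": "Cnd"}},
    {"upostag": "VERB", "feats": {"Mood": "Cnd", "Connegative": "Yes"}},
]

affirmative = remove_unlicensed_connegatives([noun, verb])
negated = remove_unlicensed_connegatives([noun, negation, verb])
print(len(affirmative[1]), len(negated[2]))
```

In the affirmative sentence the verb keeps one reading, while after a recognized negative word both readings survive. If the negative word carried no `Polarity=Neg` feature (as with an unrecognized slang form), the sketch would drop the connegative reading in exactly the erroneous way the note describes.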

#### Generate case/number information for numerics

The flag `generate_num_cases` exhaustively generates all cases for numeric tokens (words with the `upostag` value `NUM`) that lack case/number information. The only exception is Roman numerals, which do not receive case/number information. This option is switched on by default:

In [7]:
# Create analysable text
text = Text('Vanalinn kuulub 1997. aastast UNESCO maailmapärandisse.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD
ud_converter = UDMorphConverter()
ud_converter.tag( text )
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,7

text,id,lemma,upostag,xpostag,feats,misc
Vanalinn,1,vanalinn,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",
kuulub,2,kuuluma,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Ind'), ('VerbForm', 'Fin'), ('Number', 'Sing'), ('Person', '3')])",
1997.,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Nom')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Gen')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Par')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Ill')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Ine')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'Ela')])",
,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit'), ('Number', 'Sing'), ('Case', 'All')])",


In [8]:
# Create analysable text
text = Text('Vanalinn kuulub 1997. aastast UNESCO maailmapärandisse.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD (do not generate cases for numerics)
ud_converter = UDMorphConverter(generate_num_cases=False)
ud_converter.tag( text )
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,7

text,id,lemma,upostag,xpostag,feats,misc
Vanalinn,1,vanalinn,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",
kuulub,2,kuuluma,VERB,V,"OrderedDict([('Voice', 'Act'), ('Tense', 'Pres'), ('Mood', 'Ind'), ('VerbForm', 'Fin'), ('Number', 'Sing'), ('Person', '3')])",
1997.,3,1997.,NUM,O,"OrderedDict([('NumType', 'Ord'), ('NumForm', 'Digit')])",
aastast,4,aasta,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Ela')])",
UNESCO,5,Unesco,PROPN,H,"OrderedDict([('Number', 'Sing'), ('Case', 'Gen')])",
maailmapärandisse,6,maailmapärand,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Ill')])",
.,7,.,PUNCT,Z,OrderedDict(),
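The effect of `generate_num_cases` can be approximated with a small sketch on plain dictionaries. This is not the converter's code. The case list below contains only the values visible in the output shown earlier (the converter's full list is not reproduced here), and detecting Roman numerals via `NumForm=Roman` is an assumption.

```python
from copy import deepcopy

# Only the cases visible in the output above; the real list is longer.
ILLUSTRATIVE_CASES = ("Nom", "Gen", "Par", "Ill", "Ine", "Ela", "All")


def expand_numeral_cases(annotation, cases=ILLUSTRATIVE_CASES):
    """Return the annotation plus one singular variant per case.

    Annotations that are not NUM, already have a case, or are Roman
    numerals (assumed to be marked with NumForm=Roman) are returned as is.
    """
    feats = annotation["feats"]
    if annotation["upostag"] != "NUM" or "Case" in feats:
        return [annotation]
    if feats.get("NumForm") == "Roman":
        return [annotation]
    expanded = [annotation]  # the caseless reading is kept first
    for case in cases:
        variant = deepcopy(annotation)
        variant["feats"]["Number"] = "Sing"
        variant["feats"]["Case"] = case
        expanded.append(variant)
    return expanded


ordinal = {"lemma": "1997.", "upostag": "NUM",
           "feats": {"NumType": "Ord", "NumForm": "Digit"}}
for reading in expand_numeral_cases(ordinal):
    print(reading["feats"])
```

The sketch makes the cost of the option visible: every caseless numeral turns into one reading per case, which is why the setting increases ambiguity.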


#### Adding verb features to adjectives

By default, all adjectives ending with `'tud'`, `'nud'`, `'v'` or `'tav'` will receive corresponding verb participle features. Example:

In [9]:
# Create analysable text
text = Text('Totaalselt erinev suhtumine')
text.tag_layer(['words', 'sentences', 'morph_analysis'])
# Convert morph categories to UD (do not generate cases for numerics)
ud_converter = UDMorphConverter( generate_num_cases=False )
ud_converter.tag( text )
text['ud_morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, misc",morph_analysis,,True,3

text,id,lemma,upostag,xpostag,feats,misc
Totaalselt,1,totaalselt,ADV,D,OrderedDict(),
erinev,2,erinev,ADJ,A,"OrderedDict([('Degree', 'Pos'), ('Number', 'Sing'), ('Case', 'Nom'), ('Voice', 'Act'), ('Tense', 'Pres'), ('VerbForm', 'Part')])",
suhtumine,3,suhtumine,NOUN,S,"OrderedDict([('Number', 'Sing'), ('Case', 'Nom')])",


This setting is affected by `UDMorphConverter`'s parameter `adj_with_no_verb_feats_file`, which gives a path to a text file with a list of adjectives that should not obtain verb participle features. 
The file should be in "utf-8" encoding and should list adjective lemmas, each lemma on a new line. 
The file [adj_without_verb_feats.txt](https://github.com/estnltk/estnltk/blob/47af253c0f54e91646c05c7c408a02f00f6e0ff1/estnltk/estnltk/taggers/standard/morph_analysis/ud_conv_rules/adj_without_verb_feats.txt) is used as the default exclusion list.
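A custom exclusion file is easy to prepare and check before passing its path to `adj_with_no_verb_feats_file`. The sketch below only mirrors the described file format (UTF-8, one lemma per line). The helper names, the file name, the entry `tugev` and the decision to test the ending on the lemma are assumptions for illustration, not the converter's internals.

```python
import tempfile
from pathlib import Path

# Endings named in the text above
PARTICIPLE_ENDINGS = ("tud", "nud", "tav", "v")


def load_excluded_lemmas(path):
    """Read one adjective lemma per line from a UTF-8 text file."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def gets_participle_feats(lemma, excluded_lemmas):
    """True if the adjective lemma has a participle-like ending and is not excluded."""
    return lemma.endswith(PARTICIPLE_ENDINGS) and lemma not in excluded_lemmas


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical exclusion file with a single made-up entry
    path = Path(tmp) / "my_adj_without_verb_feats.txt"
    path.write_text("tugev\n", encoding="utf-8")
    excluded = load_excluded_lemmas(path)

print(gets_participle_feats("erinev", excluded))
print(gets_participle_feats("tugev", excluded))
```

With this made-up list, `erinev` would still receive participle features, while `tugev` would not.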

#### Dictionary-based conversion rules

`UDMorphConverter` has built-in conversion rules and additional dictionary-based conversion rules, which define how specific lemmas need to be converted.
Dictionary-based conversion rules are loaded from files and can be changed if needed. 
Default dictionary-based conversions can be found [here](https://github.com/estnltk/estnltk/tree/47af253c0f54e91646c05c7c408a02f00f6e0ff1/estnltk/estnltk/taggers/standard/morph_analysis/ud_conv_rules).
Rules are in \*.tab files, each mapping a Vabamorf lemma (and optionally a part of speech) to the appropriate upostags and feats. For instance:

```
mitte	D	ADV	Polarity=Neg
ega	D	ADV	Polarity=Neg
```

There can be multiple entries for a single Vabamorf lemma and part of speech: in that case, the conversion produces ambiguous annotations.

Use `UDMorphConverter`'s parameter `conversion_rules_dir` to change the directory from where \*.tab files will be loaded. 
Note that if you want to introduce new rules, it is advisable to copy the old ones and build new ones on top of them.
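Before pointing `conversion_rules_dir` at a modified copy, it can help to check that the edited rule lines still parse. The sketch below assumes the four-column, tab-separated layout seen in the example above (lemma, part of speech, upostag, feats). Other layouts, such as rules without a part of speech, are not handled, and the function is not part of EstNLTK.

```python
from collections import defaultdict


def parse_tab_rules(lines):
    """Group rule lines by (lemma, part of speech).

    Assumes four tab-separated columns per line, as in the example above.
    Several entries under one key mean the conversion is ambiguous.
    """
    rules = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 columns, got {len(fields)}")
        lemma, pos, upostag, feats = fields
        rules[(lemma, pos)].append((upostag, feats))
    return dict(rules)


example_lines = [
    "mitte\tD\tADV\tPolarity=Neg",
    "ega\tD\tADV\tPolarity=Neg",
    # Two made-up entries for a placeholder lemma, to show ambiguity
    "xyz\tD\tADV\tPolarity=Neg",
    "xyz\tD\tCCONJ\tPolarity=Neg",
]
rules = parse_tab_rules(example_lines)
print(rules[("mitte", "D")])
print(rules[("xyz", "D")])
```

The placeholder lemma `xyz` ends up with two target analyses, which corresponds to the ambiguous output described above.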

### Converting layer to CoNLL-U string

`UDMorphConverter`'s output layer can be converted to a CoNLL-U string. 
For that, you need to use the parameter `add_deprel_attribs=True` to add the full set of [CoNLL-U fields](https://universaldependencies.org/format.html) to the layer:

In [10]:
# Create analysable text
text = Text('Rändur peaks kohe saabuma. Siis saame asjas selgust.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])

# Convert morph categories to UD (add all conll fields)
ud_converter = UDMorphConverter(add_deprel_attribs=True)
ud_converter.tag( text )

text
Rändur peaks kohe saabuma. Siis saame asjas selgust.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,10
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,10
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,10
ud_morph_analysis,"id, lemma, upostag, xpostag, feats, head, deprel, deps, misc",morph_analysis,,True,10


Now we can use the function `layer_to_conll` to convert the `ud_morph_analysis` layer to a CoNLL-U format string:

In [11]:
from estnltk.converters.conll.conll_exporter import layer_to_conll
print( layer_to_conll(text, 'ud_morph_analysis') )

1	Rändur	rändur	NOUN	S	Number=Sing|Case=Nom	_	_	_	_
2	peaks	pidama	VERB	V	Voice=Act|Tense=Pres|Mood=Cnd|VerbForm=Fin	_	_	_	_
2	peaks	pidama	AUX	V	Voice=Act|Tense=Pres|Mood=Cnd|VerbForm=Fin	_	_	_	_
3	kohe	kohe	ADV	D	_	_	_	_	_
4	saabuma	saabuma	VERB	V	Voice=Act|VerbForm=Sup|Case=Ill	_	_	_	_
5	.	.	PUNCT	Z	_	_	_	_	_

1	Siis	siis	ADV	D	_	_	_	_	_
2	saame	saama	VERB	V	Voice=Act|Tense=Pres|Mood=Ind|VerbForm=Fin|Number=Plur|Person=1	_	_	_	_
2	saame	saama	AUX	V	Voice=Act|Tense=Pres|Mood=Ind|VerbForm=Fin|Number=Plur|Person=1	_	_	_	_
3	asjas	asi	NOUN	S	Number=Sing|Case=Ine	_	_	_	_
4	selgust	selgus	NOUN	S	Number=Sing|Case=Par	_	_	_	_
5	.	.	PUNCT	Z	_	_	_	_	_




Note that the conversion preserves ambiguities by default (there are multiple entries for ambiguous words).
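To make the row format concrete, here is a minimal sketch of how one annotation maps to one tab-separated line. It is not `layer_to_conll`'s implementation. The dictionary layout and function names are assumptions, and features are written in insertion order to match the output above.

```python
def feats_to_string(feats):
    """Join features as Name=Value pairs; an empty set becomes '_'."""
    if not feats:
        return "_"
    return "|".join(f"{name}={value}" for name, value in feats.items())


def to_conllu_row(form, annotation):
    """Build one 10-column row; missing or None fields become '_'."""
    columns = [
        annotation["id"],
        form,
        annotation["lemma"],
        annotation["upostag"],
        annotation["xpostag"],
        feats_to_string(annotation["feats"]),
        annotation.get("head"),
        annotation.get("deprel"),
        annotation.get("deps"),
        annotation.get("misc"),
    ]
    return "\t".join("_" if value in (None, "") else str(value)
                     for value in columns)


annotation = {
    "id": 1, "lemma": "rändur", "upostag": "NOUN", "xpostag": "S",
    "feats": {"Number": "Sing", "Case": "Nom"},
    "head": None, "deprel": None, "deps": None, "misc": None,
}
row = to_conllu_row("Rändur", annotation)
print(row)
```

An ambiguous word would simply produce several such rows with the same `id`, one per candidate annotation.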

### Performance of the conversion

Performance of `UDMorphConverter`'s morphological conversion has been measured on Estonian UD corpora: [UD_Estonian-EDT](https://github.com/UniversalDependencies/UD_Estonian-EDT/releases/tag/r2.10) and [UD_Estonian-EWT](https://github.com/UniversalDependencies/UD_Estonian-EWT/releases/tag/r2.10) (UD v2.10). 
Measurements were done on both corpora jointly.

Two measures were used: 
* **correct** -- the percentage of correctly converted words, including words obtaining ambiguous annotations;
* **ambiguous** -- the percentage of words remaining ambiguous after the conversion;
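
The two measures can be sketched as follows. This is an illustration of the definitions, not the evaluation script itself. In particular, counting a word as correct whenever the gold analysis is among its candidate analyses is an assumed reading of "including words obtaining ambiguous annotations".

```python
def conversion_scores(gold, predicted):
    """Return (correct %, ambiguous %) over aligned words.

    gold      -- one (upostag, feats) pair per word
    predicted -- one list of candidate (upostag, feats) pairs per word
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned word by word")
    if not gold:
        raise ValueError("cannot score an empty corpus")
    total = len(gold)
    # Assumption: correct means the gold analysis is among the candidates.
    correct = sum(1 for g, candidates in zip(gold, predicted) if g in candidates)
    ambiguous = sum(1 for candidates in predicted if len(candidates) > 1)
    return 100.0 * correct / total, 100.0 * ambiguous / total


# Toy data: four words, one ambiguous, one converted incorrectly
gold = [
    ("NOUN", "Number=Sing|Case=Nom"),
    ("AUX", "Mood=Cnd"),
    ("ADV", "_"),
    ("VERB", "VerbForm=Sup"),
]
predicted = [
    [("NOUN", "Number=Sing|Case=Nom")],
    [("VERB", "Mood=Cnd"), ("AUX", "Mood=Cnd")],
    [("ADV", "_")],
    [("NOUN", "_")],
]
correct_pct, ambiguous_pct = conversion_scores(gold, predicted)
print(correct_pct, ambiguous_pct)
```

On this toy data three of four words are correct and one of four is ambiguous, so the sketch reports 75.0 and 25.0.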


The following table shows conversion results under different settings.
Settings **A** and **B** use Vabamorf's morphological analysis without disambiguation (`VabamorfAnalyzer`), meaning that words obtain the maximum number of morphological interpretations.
Settings **C** and **D** use EstNLTK's default morphological analysis (`VabamorfTagger`), meaning that the number of morphological interpretations is reduced via disambiguation.

| Morph analysis and UDMorphConverter's settings                                                               | train       |             | dev         |             | test        |             |
| :----------------------------------------------------------------------------------------------------------- | ----------- | ----------- | ----------- | ----------- | ----------- | ----------- |
|                                                                                                              | **correct** | **ambiguous** | **correct** | **ambiguous** | **correct** | **ambiguous** |
| **A)** VabamorfAnalyzer(default settings), <br> UDMorphConverter(remove_connegatives=False, generate_num_cases=False) | 96.43%      | 45.17%      | 95.83%      | 45.19%      | 96.29%      | 45.13%      |
| **B)** VabamorfAnalyzer(default settings), <br> UDMorphConverter(remove_connegatives=True, generate_num_cases=True)   | 97.04%      | 46.48%      | 96.25%      | 46.86%      | 97.05%      | 46.20%      |
| **C)** VabamorfTagger(default settings), <br> UDMorphConverter(remove_connegatives=False, generate_num_cases=False)   | 92.55%      | 22.38%      | 92.17%      | 22.23%      | 92.32%      | 23.33%      |
| **D)** VabamorfTagger(default settings), <br> UDMorphConverter(remove_connegatives=True, generate_num_cases=True)     | 93.16%      | 23.73%      | 92.59%      | 23.90%      | 93.08%      | 24.45%      |

Source code for performing the evaluation can be found at: https://github.com/estnltk/estnltk-model-training/tree/main/ud_morph_tools/eval_ud_morph_conv

---

In [12]:
# Reverse tweak
OUTPUT_CONFIG['html_str_max_len'] = default_html_str_max_len