# Bert-based morphological tagger and disambiguator (BertMorphTagger)

EstNLTK contains a Bert-based morphological tagger for annotating part-of-speech and morphological form features in Estonian texts using [Vabamorf's tagset](00_tables_of_morphological_categories.ipynb).
The tagger can also be used as a disambiguator to resolve ambiguities of an existing morphological analysis layer that uses Vabamorf's tagset. 

## Prerequisites

*Note: to use the Bert-based morphological tagger, you need to install the [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural) and [sentencepiece](https://pypi.org/project/sentencepiece/) packages.*

Note that the Bert model required by the tagger is not distributed with the estnltk\_neural package. You can download the model in the following ways:
* If you create a new instance of `BertMorphTagger` and the model has not been downloaded yet, you will be asked for permission to download it;
* Alternatively, you can pre-download the model manually via the download function:

```python
from estnltk import download
download('bert_morph_tagging')
```

### Import modules

To use `BertMorphTagger`, import it from `estnltk_neural.taggers`:

In [1]:
import estnltk
from estnltk_neural.taggers import BertMorphTagger

## Using BertMorphTagger as a tagger

To create a tagger, call the `BertMorphTagger` constructor with your preferred parameter values.

Important parameters for the tagger are:
* `get_top_n_predictions`: Output the $n$ most probable morphological tags for each word. Defaults to **1**;
* `token_level`: If True, outputs morphological tags for each BERT token; otherwise, for each word in EstNLTK's `words` layer. Defaults to **False**;
* `split_pos_form`: If True, splits the predicted label into Vabamorf's `form` and `partofspeech` tags; otherwise, outputs the label itself, which is the two tags joined with `_`. Defaults to **True**.

In [25]:
# morph_tagger = BertMorphTagger(get_top_n_predictions=1, token_level=False, split_pos_form=True) # With defined parameters
morph_tagger = BertMorphTagger() # Without defined parameters


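`BertMorphTagger` relies on EstNLTK's `sentences` and `words` layers. If they are missing, `tag()` raises a `ValueError`. Calling `tag_layer('sentences')` creates both, because the `sentences` layer is built on top of the `words` layer.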

In [26]:
input_str = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."
text_obj = estnltk.Text(input_str)
# text_obj.tag_layer('sentences') # Without sentences and words layer
try:
    morph_tagger.tag(text_obj)
except ValueError as e:
    print(f"ValueError: {e}")

ValueError: missing input layer: 'sentences'


In [27]:
input_str = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."
text_obj = estnltk.Text(input_str)
text_obj.tag_layer('sentences') # With sentences and words layer
morph_tagger.tag(text_obj)

text
"A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,15
compound_tokens,"type, normalized",,tokens,False,1
words,normalized_form,,,True,11
bert_morph_tagging,"bert_tokens, form, partofspeech, probability",words,,True,11


The results are stored in the `bert_morph_tagging` layer:

In [28]:
text_obj['bert_morph_tagging']

layer name,attributes,parent,enveloping,ambiguous,span count
bert_morph_tagging,"bert_tokens, form, partofspeech, probability",words,,True,11

text,bert_tokens,form,partofspeech,probability
A. H. Tammsaare,"['▁A', '.', '▁H', '.', '▁Tammsaare']",sg n,H,0.99541
oli,['▁oli'],s,V,0.99992
eesti,['▁eesti'],,G,0.99577
kirjanik,['▁kirjanik'],sg n,S,0.9997
",","[',']",,Z,0.99995
esseist,"['▁es', 'se', 'ist']",sg n,S,0.99956
",","[',']",,Z,0.99995
kultuurifilosoof,"['▁kultuuri', 'filosoof']",sg n,S,0.99963
ja,['▁ja'],,J,0.99988
tõlkija,['▁tõlkija'],sg n,S,0.99926


BertMorphTagger generates four attributes:
* `bert_tokens`: List of BERT tokens that make up one word of the `words` layer;
* `form`: Morphological form of the word, such as the noun or verb form (following [Vabamorf's tagset](00_tables_of_morphological_categories.ipynb));
* `partofspeech`: Part of speech of the word (following [Vabamorf's tagset](00_tables_of_morphological_categories.ipynb));
* `probability`: Probability of the predicted tag.

Below is an example of one word from the `words` layer and the BERT tokens that correspond to it.

In [29]:
print(f"""Vabamorf's word: {text_obj['words'][0].text}
BertMorphTagger's tokens: {text_obj['bert_morph_tagging'][0]['bert_tokens'][0]}""")

Vabamorf's word: A. H. Tammsaare
BertMorphTagger's tokens: ['▁A', '.', '▁H', '.', '▁Tammsaare']


By default, BertMorphTagger outputs only the most probable part-of-speech/form tag for each word. You can use the parameter `get_top_n_predictions` to output more interpretations.

Example: setting `get_top_n_predictions` to 2 outputs the two most probable predictions for each word:

In [30]:
morph_tagger = BertMorphTagger(get_top_n_predictions=2)
input_str = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."
text_obj = estnltk.Text(input_str)
text_obj.tag_layer('sentences') # With sentences and words layer
morph_tagger.tag(text_obj)
text_obj['bert_morph_tagging']

layer name,attributes,parent,enveloping,ambiguous,span count
bert_morph_tagging,"bert_tokens, form, partofspeech, probability",words,,True,11

text,bert_tokens,form,partofspeech,probability
A. H. Tammsaare,"['▁A', '.', '▁H', '.', '▁Tammsaare']",sg n,H,0.99541
,"['▁A', '.', '▁H', '.', '▁Tammsaare']",sg g,H,0.00244
oli,['▁oli'],s,V,0.99992
,['▁oli'],sg n,S,0.0
eesti,['▁eesti'],,G,0.99577
,['▁eesti'],sg g,S,0.00042
kirjanik,['▁kirjanik'],sg n,S,0.9997
,['▁kirjanik'],sg n,A,3e-05
",","[',']",,Z,0.99995
,"[',']",,J,1e-05
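
When several predictions are requested, the lower-ranked ones often have probabilities close to zero. The following is a minimal sketch of how such alternatives could be filtered afterwards. It works on plain tuples copied by hand from the output above rather than on an EstNLTK layer, and both the helper `keep_plausible` and the threshold `min_probability` are hypothetical names chosen for illustration, not part of `BertMorphTagger`.

```python
# (form, partofspeech, probability) candidates, copied from the output above.
predictions = {
    "A. H. Tammsaare": [("sg n", "H", 0.99541), ("sg g", "H", 0.00244)],
    "oli": [("s", "V", 0.99992), ("sg n", "S", 0.0)],
    "eesti": [("", "G", 0.99577), ("sg g", "S", 0.00042)],
}


def keep_plausible(candidates, min_probability=0.001):
    """Keep the best candidate plus any others at or above the threshold."""
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    return [c for i, c in enumerate(ranked) if i == 0 or c[2] >= min_probability]


filtered = {word: keep_plausible(cands) for word, cands in predictions.items()}
for word, cands in filtered.items():
    print(word, cands)
```

The threshold value is arbitrary here; a suitable cut-off depends on how much ambiguity the downstream application can tolerate.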


The parameter `token_level` switches the tokenization of the output layer from EstNLTK's `words` layer (the default setting) to the original tokenization produced by the tokenizer of the Bert model.

Example: changing `token_level` parameter to **True**:

In [31]:
morph_tagger = BertMorphTagger(token_level=True)
input_str = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."
text_obj = estnltk.Text(input_str)
text_obj.tag_layer('sentences') # With sentences and words layer
morph_tagger.tag(text_obj)
text_obj['bert_morph_tagging']

layer name,attributes,parent,enveloping,ambiguous,span count
bert_morph_tagging,"bert_tokens, form, partofspeech, probability",words,,True,18

text,bert_tokens,form,partofspeech,probability
A,▁A,sg n,H,0.99541
.,.,,Z,0.99949
H,▁H,sg n,H,0.99685
.,.,,Z,0.99929
Tammsaare,▁Tammsaare,sg n,H,0.99937
oli,▁oli,s,V,0.99992
eesti,▁eesti,,G,0.99577
kirjanik,▁kirjanik,sg n,S,0.9997
",",",",,Z,0.99995
es,▁es,sg n,S,0.99956
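
In the token-level output, a word such as *esseist* is spread over several rows, one per BERT token. The sketch below illustrates the SentencePiece convention visible in the `bert_tokens` values shown earlier: the `▁` marker stands for preceding whitespace, so joining the pieces and turning the marker back into a space recovers the surface text. The helper `detokenize` is a hypothetical name used only for illustration; it is not how the tagger aligns tokens with words internally.

```python
WORD_START = "▁"  # marker placed before a token that follows whitespace


def detokenize(bert_tokens):
    """Join word pieces and turn whitespace markers back into spaces."""
    return "".join(bert_tokens).replace(WORD_START, " ").strip()


# Token lists copied from the `bert_tokens` attribute shown earlier.
examples = [
    ["▁A", ".", "▁H", ".", "▁Tammsaare"],
    ["▁es", "se", "ist"],
    ["▁kultuuri", "filosoof"],
]
for tokens in examples:
    print(tokens, "->", detokenize(tokens))
```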


The parameter `split_pos_form` switches between outputting `partofspeech` and `form` as separate attributes (the default setting) and outputting a single morphological tag (the attribute `morph_label`).

Example: changing `split_pos_form` parameter to **False**:

In [32]:
morph_tagger = BertMorphTagger(split_pos_form=False)
input_str = "A. H. Tammsaare oli eesti kirjanik, esseist, kultuurifilosoof ja tõlkija."
text_obj = estnltk.Text(input_str)
text_obj.tag_layer('sentences') # With sentences and words layer
morph_tagger.tag(text_obj)
text_obj['bert_morph_tagging']

layer name,attributes,parent,enveloping,ambiguous,span count
bert_morph_tagging,"bert_tokens, morph_label, probability",words,,True,11

text,bert_tokens,morph_label,probability
A. H. Tammsaare,"['▁A', '.', '▁H', '.', '▁Tammsaare']",sg n_H,0.99541
oli,['▁oli'],s_V,0.99992
eesti,['▁eesti'],G,0.99577
kirjanik,['▁kirjanik'],sg n_S,0.9997
",","[',']",Z,0.99995
esseist,"['▁es', 'se', 'ist']",sg n_S,0.99956
",","[',']",Z,0.99995
kultuurifilosoof,"['▁kultuuri', 'filosoof']",sg n_S,0.99963
ja,['▁ja'],J,0.99988
tõlkija,['▁tõlkija'],sg n_S,0.99926


As you can see above, the `form` and `partofspeech` tags are concatenated into a single `morph_label` attribute. When `form` is empty, the label consists of the part of speech alone.
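
If you work with `morph_label` values and later need the two parts separately, the label can be split at its last underscore. This is a minimal sketch based only on the label shapes visible in the output above (`sg n_H`, `s_V`, `G`); the helper `split_morph_label` is a hypothetical name, and the sketch assumes that a part-of-speech tag never contains `_`.

```python
def split_morph_label(morph_label):
    """Split a 'form_partofspeech' label; the form part may be empty."""
    form, _, partofspeech = morph_label.rpartition("_")
    return form, partofspeech


# Labels copied from the output above.
for label in ["sg n_H", "s_V", "G", "Z"]:
    print(repr(label), "->", split_morph_label(label))
```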

## Using BertMorphTagger as a disambiguator

If the flag `disambiguate` is set to `True`, the tagger can be used as a disambiguator (retagger) of an existing Vabamorf-based morphological analysis layer. For this, `output_layer` must also be set to the name of the morphological analysis layer, which then becomes an input layer of the tagger.

Example:

In [2]:
from estnltk import Text
from estnltk_neural.taggers import BertMorphTagger

In [3]:
# initialize BertMorphTagger as a disambiguator
morph_disambiguator = BertMorphTagger(output_layer='morph_analysis', disambiguate=True)

In [4]:
morph_disambiguator.input_layers

('sentences', 'words', 'morph_analysis')

In [5]:
# Use Vabamorf's analyzer to create an ambiguous morph analysis layer
from estnltk.taggers import VabamorfAnalyzer
vm_analyzer = VabamorfAnalyzer()
text = Text("Kärbes hulbib mees ja naeris puhub sädelevaid mulle.")
text.tag_layer( vm_analyzer.input_layers )
vm_analyzer.tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Kärbes,Kärbes,Kärbe,Kärbe,['Kärbe'],s,,sg in,H
,Kärbes,Kärbes,Kärbes,['Kärbes'],0,,sg n,H
,Kärbes,kärbes,kärbes,['kärbes'],0,,sg n,S
hulbib,hulbib,hulpima,hulpi,['hulpi'],b,,b,V
mees,mees,mees,mees,['mees'],0,,sg n,S
,mees,mesi,mesi,['mesi'],s,,sg in,S
ja,ja,ja,ja,['ja'],0,,,J
naeris,naeris,naerma,naer,['naer'],is,,s,V
,naeris,naeris,naeris,['naeris'],0,,sg n,S
,naeris,naeris,naeris,['naeris'],s,,sg in,S


In [6]:
# Use BertMorphTagger to disambiguate the layer
morph_disambiguator.retag( text )
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Kärbes,Kärbes,kärbes,kärbes,['kärbes'],0,,sg n,S
hulbib,hulbib,hulpima,hulpi,['hulpi'],b,,b,V
mees,mees,mees,mees,['mees'],0,,sg n,S
ja,ja,ja,ja,['ja'],0,,,J
naeris,naeris,naeris,naeris,['naeris'],0,,sg n,S
puhub,puhub,puhuma,puhu,['puhu'],b,,b,V
sädelevaid,sädelevaid,sädelev,sädelev,['sädelev'],id,,pl p,A
mulle,mulle,mina,mina,['mina'],lle,,sg all,P
.,.,.,.,['.'],,,,Z


One limitation: the disambiguator does not resolve lemma ambiguities (such as the ambiguity of the word _'teod'_ between the lemmas _'tegu'_ and _'tigu'_).
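
One way to picture this behaviour is as a filter that keeps only the analyses whose `form` and `partofspeech` match the predicted tag. The sketch below is a conceptual illustration under that assumption, not the actual implementation of `retag`. The helper `filter_analyses` and its fallback of keeping all analyses when nothing matches are hypothetical. The analyses for _'Kärbes'_ are copied from the output above, while the two analyses for _'teod'_ are written by hand for illustration.

```python
def filter_analyses(analyses, predicted_form, predicted_pos):
    """Keep analyses that agree with the predicted form and part of speech."""
    kept = [
        a for a in analyses
        if a["form"] == predicted_form and a["partofspeech"] == predicted_pos
    ]
    # Assumed fallback: if nothing matches, leave the analyses untouched.
    return kept or list(analyses)


karbes = [
    {"lemma": "Kärbe", "form": "sg in", "partofspeech": "H"},
    {"lemma": "Kärbes", "form": "sg n", "partofspeech": "H"},
    {"lemma": "kärbes", "form": "sg n", "partofspeech": "S"},
]
# Hand-written illustration: both lemmas share the same form and part of speech.
teod = [
    {"lemma": "tegu", "form": "pl n", "partofspeech": "S"},
    {"lemma": "tigu", "form": "pl n", "partofspeech": "S"},
]

karbes_kept = filter_analyses(karbes, "sg n", "S")
teod_kept = filter_analyses(teod, "pl n", "S")
print([a["lemma"] for a in karbes_kept])
print([a["lemma"] for a in teod_kept])
```

Because the predicted tag carries no lemma information, both analyses of _'teod'_ pass such a filter, which is why the lemma ambiguity remains.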

---