# GliLem lemmatizer and morphological disambiguator (GliLemTagger)

This tutorial introduces GliLem, a lemmatizer and morphological disambiguator created by [Dorkin and Sirts (2024)](https://arxiv.org/abs/2412.20597). 
GliLem enhances Vabamorf's lemmatizer with an external disambiguation module based on GliNER ([Zaratiana et al. (2023)](https://arxiv.org/abs/2311.08526)) to improve lemmatization accuracy.

## Prerequisites

*Note: you need to install the [estnltk_neural](https://github.com/estnltk/estnltk/tree/main/estnltk_neural) and [gliner](https://pypi.org/project/gliner/) packages to use `GliLemTagger`.*

The model required by the tagger is not distributed with the estnltk\_neural package. You can obtain it in either of the following ways:
* If you create a new instance of `GliLemTagger` and the model has not been downloaded yet, you'll be prompted for permission to download the model;
* Alternatively, you can pre-download the model manually via the download function:

```python
from estnltk import download
download('glilem')
```


### Usage example

In [1]:
from estnltk import Text
from estnltk_neural.taggers import GliLemTagger

In [2]:
glilem_tagger = GliLemTagger()

config.json not found in C:\Programmid\Miniconda3\envs\py310_devel\Lib\site-packages\estnltk\estnltk_resources\gliner\vabamorf_disambiguator_hf_2024-09-26




In [3]:
# Examine input layers
glilem_tagger.input_layers

('words', 'compound_tokens', 'sentences')

In [4]:
# Create example input text
text = Text('4. koha tõenäsus on täpselt 0, seda sõltumata lisakoha tulekust või mittetulekust.')
# Add required layers
text.tag_layer( glilem_tagger.input_layers )
# Tag with glilem
glilem_tagger.tag(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


text
"4. koha tõenäsus on täpselt 0, seda sõltumata lisakoha tulekust või mittetulekust."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,15
compound_tokens,"type, normalized",,tokens,False,1
words,normalized_form,,,True,14
glilem,"lemma, score, label, vabamorf_overwritten, is_input_token",words,,True,7


In [5]:
text['glilem']

layer name,attributes,parent,enveloping,ambiguous,span count
glilem,"lemma, score, label, vabamorf_overwritten, is_input_token",words,,True,7

text,lemma,score,label,vabamorf_overwritten,is_input_token
koha,koht,0.9546396732330322,↓0;d¦-+t,True,True
on,olema,0.9999995231628418,↓0;d¦-+l+e+m+a,False,True
seda,see,0.9980397820472716,↓0;d¦--+e,False,True
sõltumata,sõltuma,0.9897437691688538,↓0;d¦--,True,True
lisakoha,lisakoht,0.963400900363922,↓0;d¦-+t,False,True
tulekust,tulek,0.9831607937812804,↓0;d¦--,False,True
mittetulekust,mittetulek,0.9926141500473022,↓0;d¦-+n→--,False,True


In [6]:
text['glilem'].display()

Note that by default, the output layer only contains words whose `lemma` differs from the surface form (`text`). 
Layer attributes:
* `score` -- probability of the given lemma (assigned by GliNER);
* `label` -- how the lemma was derived from the surface form (for details about the encoding, see [Dorkin and Sirts (2024)](https://arxiv.org/abs/2412.20597), [Straka (2018)](https://aclanthology.org/K18-2020/));
* `vabamorf_overwritten` -- whether this lemma overwrites ambiguous lemmas output by Vabamorf, that is, whether Vabamorf's lemma was disambiguated. If `False`, this lemma corresponds to the unambiguous lemma output by Vabamorf;
* `is_input_token` -- whether this word/token corresponds to a word in the input `words` layer. GliNER's default tokenization may differ from the tokenization in the input `words` layer, so this flag indicates whether the tokens in the two layers match.
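
The `label` attribute can be read as a lemma rule in the style of Straka (2018): a casing script, a `;`, and an edit script in which `-` deletes a character, `+x` inserts the character `x`, and `→` copies a character. The sketch below is a minimal decoder written under that assumption. `apply_lemma_rule` is a hypothetical helper, not part of `estnltk_neural` or `gliner`, and it has only been checked against a few labels from the table above.

```python
def apply_lemma_rule(form: str, label: str) -> str:
    """Apply a lemma rule such as '↓0;d¦-+t' to a surface form.

    Assumes the rule layout described by Straka (2018); this is an
    illustrative decoder, not the code used inside GliLemTagger.
    """
    casing, rule = label.split(";", 1)

    def consumed(script: str) -> int:
        # Count how many characters of the form a script consumes.
        count, i = 0, 0
        while i < len(script):
            if script[i] in "→-":
                count += 1
            else:  # '+x' inserts x and consumes nothing
                i += 1
            i += 1
        return count

    def run(script: str, chunk: str) -> str:
        # Execute one edit script over the matching chunk of the form.
        out, pos, i = [], 0, 0
        while i < len(script):
            op = script[i]
            if op == "→":
                out.append(chunk[pos])
                pos += 1
            elif op == "-":
                pos += 1
            else:  # '+x'
                i += 1
                out.append(script[i])
            i += 1
        return "".join(out)

    if rule.startswith("a"):
        # Absolute rule: the lemma is spelled out in full.
        lemma = rule[1:]
    else:
        # 'd' rule: one script for the prefix, one for the suffix.
        prefix_script, suffix_script = rule[1:].split("¦")
        form = form.lower()
        n_pre, n_suf = consumed(prefix_script), consumed(suffix_script)
        split_at = len(form) - n_suf
        lemma = (run(prefix_script, form[:n_pre])
                 + form[n_pre:split_at]
                 + run(suffix_script, form[split_at:]))

    # Casing script: '↓0' keeps the lemma lowercase; other entries
    # switch case from the given offset onwards.
    for entry in casing.split("¦"):
        if entry == "↓0":
            continue
        case, offset = entry[0], int(entry[1:])
        tail = lemma[offset:]
        lemma = lemma[:offset] + (tail.upper() if case == "↑" else tail.lower())
    return lemma


print(apply_lemma_rule("koha", "↓0;d¦-+t"))
print(apply_lemma_rule("on", "↓0;d¦-+l+e+m+a"))
```

Under this reading, `↓0;d¦-+t` turns _koha_ into _koht_ by deleting the final _a_ and appending _t_. The label is the rule predicted by the model, so it may not always reproduce the value in the `lemma` column (for instance, applying the rule listed for _tulekust_ in this way gives `tuleku`).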

### Missing lemmas strategy

GliLem does not produce a lemma for every token: many words already appear in their lemma form and need no change during lemmatization. 
However, the GliLem model sometimes erroneously misses words that do need to be lemmatized. 
The parameter `missing_lemmas_strategy` (string) tells `GliLemTagger` what to do when a word does not obtain a lemma.
Possible values:
* `"discard"` (default) -- do not produce any spans for words with missing lemmas;
* `"none_values"` -- add spans filled in with `None` values for words with missing lemmas;
* `"vabamorf_lemmas"` -- add missing lemmas from the underlying Vabamorf lemmatizer. Note: the underlying Vabamorf lemmatizer uses the settings `compound=False`, `disambiguate=False`, `guess=True`, `slang_lex=False`, `propername=True`.
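
The effect of the three values can be pictured with a small toy model in plain Python. This is only a sketch of the behaviour described in the list above, not `GliLemTagger`'s implementation: `resolve_lemmas` and its arguments are hypothetical names, and the word and lemma data are hard-coded rather than produced by GliLem or Vabamorf.

```python
from typing import Dict, List, Optional, Tuple


def resolve_lemmas(
    words: List[str],
    predicted: Dict[str, str],
    fallback: Dict[str, List[str]],
    strategy: str = "discard",
) -> List[Tuple[str, List[Optional[str]]]]:
    """Toy model of missing_lemmas_strategy (illustrative only)."""
    if strategy not in ("discard", "none_values", "vabamorf_lemmas"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    result: List[Tuple[str, List[Optional[str]]]] = []
    for word in words:
        if word in predicted:
            # The model proposed a lemma for this word.
            result.append((word, [predicted[word]]))
        elif strategy == "none_values":
            # Keep the word, but mark its lemma as missing.
            result.append((word, [None]))
        elif strategy == "vabamorf_lemmas":
            # Fall back to all (undisambiguated) candidate lemmas.
            result.append((word, list(fallback.get(word, []))))
        # "discard": the word gets no entry at all.
    return result


# Hard-coded stand-ins for model output and fallback analyses.
words = ["Teod", "roomasid", "aeglaselt", "kõrtsu", "poole", "."]
predicted = {"Teod": "tegu", "roomasid": "roomama", "kõrtsu": "kõrts"}
fallback = {
    "aeglaselt": ["aeglane", "aeglaselt"],
    "poole": ["pool", "poole"],
    ".": ["."],
}

for strategy in ("discard", "none_values", "vabamorf_lemmas"):
    print(strategy, resolve_lemmas(words, predicted, fallback, strategy))
```

With `"discard"` only the words that received a prediction remain, while the other two strategies keep one entry per input word.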

In [7]:
# Initialize GliLem in a mode where missing lemmas will be obtained from Vabamorf
glilem_tagger_2 = GliLemTagger(missing_lemmas_strategy="vabamorf_lemmas")

config.json not found in C:\Programmid\Miniconda3\envs\py310_devel\Lib\site-packages\estnltk\estnltk_resources\gliner\vabamorf_disambiguator_hf_2024-09-26




In [8]:
# Create example input text
text = Text('Teod roomasid aeglaselt kõrtsu poole.')
# Add required layers
text.tag_layer( glilem_tagger_2.input_layers )
# Tag with glilem
glilem_tagger_2.tag(text)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


text
Teod roomasid aeglaselt kõrtsu poole.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,6
glilem,"lemma, score, label, vabamorf_overwritten, is_input_token",words,,True,6


In [9]:
text['glilem']

layer name,attributes,parent,enveloping,ambiguous,span count
glilem,"lemma, score, label, vabamorf_overwritten, is_input_token",words,,True,6

text,lemma,score,label,vabamorf_overwritten,is_input_token
Teod,tegu,0.569809079170227,↓0;d¦--+g+u,True,True
roomasid,roomama,0.9996744394302368,↓0;d¦---+m+a,True,True
aeglaselt,aeglane,,,False,True
,aeglaselt,,,False,True
kõrtsu,kõrts,0.8469443917274475,↓0;d¦-,True,True
poole,pool,,,False,True
,poole,,,False,True
.,.,,,False,True


In the example above, the words _aeglaselt_ and _poole_ did not obtain any lemmatization suggestions from GliLem, so their lemmas were obtained from Vabamorf. Because the underlying Vabamorf lemmatizer does not use disambiguation, these words can end up with several ambiguous lemmas (here, _aeglane_/_aeglaselt_ and _pool_/_poole_). 

### Using as a disambiguator

`GliLemTagger` can also be used as a disambiguator to resolve **lemma ambiguities** of an existing morphological analysis layer that uses Vabamorf's tagset. 
For this, the flag `disambiguate` needs to be set to `True`, and `output_layer` needs to be set to the name of the morphological analysis layer to be disambiguated.

Example:

In [10]:
from estnltk import Text
from estnltk_neural.taggers import GliLemTagger

In [11]:
# initialize GliLemTagger as a disambiguator
glilem_disambiguator = GliLemTagger(output_layer='morph_analysis', disambiguate=True)

config.json not found in C:\Programmid\Miniconda3\envs\py310_devel\Lib\site-packages\estnltk\estnltk_resources\gliner\vabamorf_disambiguator_hf_2024-09-26




In [12]:
glilem_disambiguator.input_layers

('words', 'compound_tokens', 'sentences', 'morph_analysis')

In [13]:
# Use Vabamorf's analyzer to create an ambiguous morph analysis layer
from estnltk.taggers import VabamorfAnalyzer
vm_analyzer = VabamorfAnalyzer()
text = Text("Mees peeti kinni.")
text.tag_layer( vm_analyzer.input_layers )
vm_analyzer.tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mees,Mees,Mees,Mees,['Mees'],0,,sg n,H
,Mees,Mee,Mee,['Mee'],s,,sg in,H
,Mees,Mesi,Mesi,['Mesi'],s,,sg in,H
,Mees,mees,mees,['mees'],0,,sg n,S
,Mees,mesi,mesi,['mesi'],s,,sg in,S
peeti,peeti,peet,peet,['peet'],0,,adt,S
,peeti,pidama,pida,['pida'],ti,,ti,V
,peeti,peet,peet,['peet'],0,,sg p,S
kinni,kinni,kinni,kinni,['kinni'],0,,,D
.,.,.,.,['.'],,,,Z


In [14]:
# Use GliLemTagger to disambiguate the layer
glilem_disambiguator.retag( text )
text['morph_analysis']

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mees,Mees,mees,mees,['mees'],0,,sg n,S
peeti,peeti,pidama,pida,['pida'],ti,,ti,V
kinni,kinni,kinni,kinni,['kinni'],0,,,D
.,.,.,.,['.'],,,,Z


As a limitation, this disambiguator does not resolve part-of-speech and form ambiguities (such as the ambiguity of the word _'üks'_ between pronoun and numeral interpretations, and the ambiguity of the word _'kõne'_ between nominative and genitive interpretations).
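
To see why lemma-only disambiguation leaves such cases open, consider the following plain-Python sketch. It copies a few analyses from the Vabamorf output above and keeps only those matching a chosen lemma. `keep_lemma` is a hypothetical helper that illustrates the idea; it is not how `GliLemTagger` is implemented.

```python
from typing import Dict, List

Analysis = Dict[str, str]


def keep_lemma(analyses: List[Analysis], chosen_lemma: str) -> List[Analysis]:
    """Keep only the analyses whose lemma equals the chosen lemma."""
    return [a for a in analyses if a["lemma"] == chosen_lemma]


# Analyses of 'Mees' and 'peeti', copied from the ambiguous layer above.
mees = [
    {"lemma": "Mees", "form": "sg n", "partofspeech": "H"},
    {"lemma": "Mee", "form": "sg in", "partofspeech": "H"},
    {"lemma": "Mesi", "form": "sg in", "partofspeech": "H"},
    {"lemma": "mees", "form": "sg n", "partofspeech": "S"},
    {"lemma": "mesi", "form": "sg in", "partofspeech": "S"},
]
peeti = [
    {"lemma": "peet", "form": "adt", "partofspeech": "S"},
    {"lemma": "pidama", "form": "ti", "partofspeech": "V"},
    {"lemma": "peet", "form": "sg p", "partofspeech": "S"},
]

# Choosing the lemma 'mees' or 'pidama' leaves a single analysis each.
print(keep_lemma(mees, "mees"))
print(keep_lemma(peeti, "pidama"))

# Choosing 'peet' would still leave two analyses that differ only in form.
print(keep_lemma(peeti, "peet"))
```

Whenever several analyses share the selected lemma, as with the two _peet_ readings here, the remaining form or part-of-speech ambiguity has to be resolved by other means.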