# Morphological analysis with premorph and postmorph

In [12]:
from estnltk.text import Text

Premorph and postmorph are tools that improve the morphological analysis produced by Vabamorf. Premorph normalizes the input before it is passed to Vabamorf; postmorph normalizes Vabamorf's output. By default, both premorph and postmorph are executed.
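The three stages can be pictured with a small stand-alone sketch. None of the functions below are part of EstNLTK: `toy_premorph`, `toy_analyse`, `toy_postmorph` and `run_pipeline` are hypothetical names, and the one-word lexicon is made up for illustration. The sketch only shows how an optional normalization step before the analyser and an optional correction step after it fit together.

```python
# Toy stand-ins only: none of these functions are part of EstNLTK.

def toy_premorph(word):
    """Normalize the input word (here: drop every hyphen)."""
    return word.replace('-', '')

def toy_analyse(word):
    """Stand-in for the analyser: knows only a one-word lexicon."""
    lexicon = {'lilli': [{'lemma': 'lill', 'form': 'pl p'}]}
    # Unknown words get a single "unknown" analysis.
    return lexicon.get(word, [{'lemma': word, 'form': '?'}])

def toy_postmorph(word, analyses):
    """Normalize the output (here: a placeholder that changes nothing)."""
    return analyses

def run_pipeline(word, premorph=toy_premorph, postmorph=toy_postmorph):
    """Run premorph -> analyser -> postmorph; pass None to skip a stage."""
    normalized = premorph(word) if premorph is not None else word
    analyses = toy_analyse(normalized)
    if postmorph is not None:
        analyses = postmorph(word, analyses)
    return analyses

print(run_pipeline('lil-li'))                 # premorph on
print(run_pipeline('lil-li', premorph=None))  # premorph off
```

With premorph on, the toy analyser sees "lilli" and finds it in its lexicon; with premorph off, it sees "lil-li" and falls back to the unknown analysis.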

In [136]:
# Let's take a sentence that contains an unnecessary hyphen and an incorrectly declined number.
t = Text('Tiit müüs lil-li 10e krooniga.')

In [137]:
t

Text(text="Tiit müüs lil-li 10e krooniga.")

In [138]:
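# Call tag_layer() without arguments to tag the default layers (these evidently include
# morf_analysis, which is available below). The output shows only the text, so nothing appears to have changed.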
t.tag_layer()

Text(text="Tiit müüs lil-li 10e krooniga.")

In [139]:
# But now we can ask for morphological analysis of the sentence
t.morf_analysis

SL[SL[Span(Tiit, {'ending': '0', 'root_tokens': ['Tiit'], 'root': 'Tiit', 'lemma': 'Tiit', 'clitic': '', 'partofspeech': 'H', 'form': 'sg n'})],
SL[Span(müüs, {'ending': 's', 'root_tokens': ['müü'], 'root': 'müü', 'lemma': 'müüma', 'clitic': '', 'partofspeech': 'V', 'form': 's'})],
SL[Span(lil-li, {'ending': 'i', 'root_tokens': ['lill'], 'root': 'lill', 'lemma': 'lill', 'clitic': '', 'partofspeech': 'S', 'form': 'pl p'})],
SL[Span(10e, {'ending': 'ile', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'tele', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'te', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl g'}),
Span(10e, {'ending': 'isse', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl ill'}),
Span(10e, {'ending': 'tesse', '

As we can see, the hyphen was removed and the correct form of the word "lilli" was found. For 10e, every form of 10 whose ending finishes with an 'e' is returned. This is not the desired behaviour, however.
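One way to picture why "10e" receives so many analyses is a suffix filter over a table of endings. The sketch below is an assumption about the general idea, not the actual logic of the rewriter: `TOY_NUMBER_ENDINGS` and `candidate_analyses` are hypothetical names, and the table is a tiny hand-made sample using endings and forms that appear in the outputs of this notebook.

```python
import re

# Toy table of (ending, form) pairs; it is NOT the rewriter's real data.
TOY_NUMBER_ENDINGS = [
    ('0', 'sg n'),
    ('ga', 'sg kom'),
    ('te', 'pl g'),
    ('ile', 'pl all'),
    ('tele', 'pl all'),
    ('isse', 'pl ill'),
]

def candidate_analyses(token):
    """Split a token such as '10e' into digits and a non-digit suffix,
    then return every table entry whose ending finishes with that suffix."""
    match = re.fullmatch(r'(\d+)(\D+)', token)
    if match is None:
        return []
    root, suffix = match.groups()
    return [
        {'root': root, 'ending': ending, 'form': form}
        for ending, form in TOY_NUMBER_ENDINGS
        if ending.endswith(suffix)
    ]

for analysis in candidate_analyses('10e'):
    print(analysis)
```

Because the suffix "e" is so short, it matches four of the six toy endings, which mirrors the ambiguity seen in the output above.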

In [140]:
# Can we decide whether we want disambiguation and guessing? 

The previous code actually uses the WordNormalizingTagger, VabamorfTagger and VabamorfCorrectionRewriter tools. So we can write it out as follows:

In [151]:
# Import the taggers and rewriter
from estnltk.taggers.premorph.premorf import WordNormalizingTagger
from estnltk.taggers.morf import VabamorfTagger
from estnltk.rewriting.postmorph.vabamorf_corrector import VabamorfCorrectionRewriter

An instance of VabamorfCorrectionRewriter is created that [does something and replaces something...]. This is the rewriter that is used for postmorph by default.
In our example, it processes the 10e token.
If we want, we can write our own rewriter and use that instead.
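As a rough illustration of what "our own rewriter" could look like, here is a stand-alone sketch. The interface is an assumption: it supposes that a rewriter is an object with a `rewrite` method that receives the list of analyses of one word as plain dicts and returns a new list. The interface that VabamorfTagger actually expects is not shown in this notebook and may differ, so the hypothetical class `KeepFirstAnalysisRewriter` is only applied to hand-written data here.

```python
class KeepFirstAnalysisRewriter:
    """Hypothetical rewriter: keeps only the first analysis of a word."""

    def rewrite(self, analyses):
        # `analyses` is assumed to be a list of attribute dicts for one word.
        return analyses[:1]

# Hand-written sample data modelled on the output for the token '10e'.
analyses_10e = [
    {'root': '10', 'ending': 'ile', 'form': 'pl all'},
    {'root': '10', 'ending': 'tele', 'form': 'pl all'},
    {'root': '10', 'ending': 'te', 'form': 'pl g'},
]

rewritten = KeepFirstAnalysisRewriter().rewrite(analyses_10e)
print(rewritten)
```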

In [231]:
# Create an instance of VabamorfCorrectionRewriter 
vabamorf_corrector = VabamorfCorrectionRewriter(replace=True)

In [153]:
# Let's take the same sentence as previously
t = Text('Tiit müüs lil-li 10e krooniga.')

In [154]:
# And tag the layer 'words' on it. The output shows only the text, so nothing appears to have changed
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [155]:
# But now we can ask for the words layer
t.words

SL[Span(Tiit, {}),
Span(müüs, {}),
Span(lil-li, {}),
Span(10e, {}),
Span(krooniga, {}),
Span(., {})]

Now we can normalize the sentence with WordNormalizingTagger, which takes care of the unnecessary hyphens in words and [maybe sth else?]. In our example, it normalizes the word "lilli". It can be replaced with our own tagger if we decide to write one.

In [156]:
# Normalize the sentence with WordNormalizingTagger
WordNormalizingTagger().tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [157]:
# And we can see the word that was changed
t.normalized

SL[Span(lil-li, {'normal': 'lilli'})]
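The effect on this sentence can be imitated with a few lines of plain Python. This is a sketch under a strong assumption: `toy_normalize` is a hypothetical function that removes every hyphen standing between two letters, which is certainly cruder than what WordNormalizingTagger does (legitimate hyphenated compounds would be damaged, for instance). Like the layer above, it records only the words that actually change.

```python
import re

def toy_normalize(word):
    """Remove hyphens that stand between two letters.
    Returns the normalized form, or None if the word is unchanged."""
    # [^\W\d_] matches any Unicode letter (word character, not digit, not '_').
    normal = re.sub(r'(?<=[^\W\d_])-(?=[^\W\d_])', '', word)
    return normal if normal != word else None

words = ['Tiit', 'müüs', 'lil-li', '10e', 'krooniga', '.']

# Keep only the words that were changed, as in the `normalized` layer above.
normalized = {}
for w in words:
    normal = toy_normalize(w)
    if normal is not None:
        normalized[w] = normal

print(normalized)
```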

Now we can run VabamorfTagger on the normalized layer produced by WordNormalizingTagger and ask it to apply the VabamorfCorrectionRewriter instance we created to the output of Vabamorf.

In [158]:
# Tag the text with VabamorfTagger using default premorph and postmorph
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=vabamorf_corrector).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [159]:
# The same output is received as in the first example
t.morf_analysis

SL[SL[Span(Tiit, {'ending': '0', 'root_tokens': ['Tiit'], 'root': 'Tiit', 'lemma': 'Tiit', 'clitic': '', 'partofspeech': 'H', 'form': 'sg n'})],
SL[Span(müüs, {'ending': 's', 'root_tokens': ['müü'], 'root': 'müü', 'lemma': 'müüma', 'clitic': '', 'partofspeech': 'V', 'form': 's'})],
SL[Span(lil-li, {'ending': 'i', 'root_tokens': ['lill'], 'root': 'lill', 'lemma': 'lill', 'clitic': '', 'partofspeech': 'S', 'form': 'pl p'})],
SL[Span(10e, {'ending': 'ile', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'tele', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'te', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl g'}),
Span(10e, {'ending': 'isse', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl ill'}),
Span(10e, {'ending': 'tesse', '

As mentioned, we can customize premorph or postmorph.

To turn off postmorph for the same example, we need to set postmorph_rewriter to None:

In [174]:
t = Text('Tiit müüs lil-li 10e krooniga.')

In [175]:
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [176]:
WordNormalizingTagger().tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [177]:
# postmorph_rewriter = None says that we don't want to apply the default rewriter
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=None).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [171]:
t.morf_analysis

SL[SL[Span(Tiit, {'ending': '0', 'root_tokens': ['Tiit'], 'root': 'Tiit', 'lemma': 'Tiit', 'clitic': '', 'partofspeech': 'H', 'form': 'sg n'})],
SL[Span(müüs, {'ending': 's', 'root_tokens': ['müü'], 'root': 'müü', 'lemma': 'müüma', 'clitic': '', 'partofspeech': 'V', 'form': 's'})],
SL[Span(lil-li, {'ending': 'i', 'root_tokens': ['lill'], 'root': 'lill', 'lemma': 'lill', 'clitic': '', 'partofspeech': 'S', 'form': 'pl p'})],
SL[Span(10e, {'ending': '0', 'root_tokens': ['10e'], 'root': '10e', 'lemma': '10e', 'clitic': '', 'partofspeech': 'Y', 'form': '?'})],
SL[Span(krooniga, {'ending': 'ga', 'root_tokens': ['kroon'], 'root': 'kroon', 'lemma': 'kroon', 'clitic': '', 'partofspeech': 'S', 'form': 'sg kom'})],
SL[Span(., {'ending': '', 'root_tokens': ['.'], 'root': '.', 'lemma': '.', 'clitic': '', 'partofspeech': 'Z', 'form': ''})]]

Now we get only one analysis for the token '10e', which, unfortunately, is not correct.

We can also turn off premorph:

In [198]:
t = Text('Tiit müüs lil-li 10e krooniga.')

In [199]:
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [200]:
# premorf_layer=None says that we don't want to use a normalized (premorph) layer as input
VabamorfTagger(premorf_layer=None, postmorph_rewriter=vabamorf_corrector).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [202]:
t.morf_analysis

SL[SL[Span(Tiit, {'ending': '0', 'root_tokens': ['Tiit'], 'root': 'Tiit', 'lemma': 'Tiit', 'clitic': '', 'partofspeech': 'H', 'form': 'sg n'})],
SL[Span(müüs, {'ending': 's', 'root_tokens': ['müü'], 'root': 'müü', 'lemma': 'müüma', 'clitic': '', 'partofspeech': 'V', 'form': 's'})],
SL[Span(lil-li, {'ending': '0', 'root_tokens': ['lil', 'li'], 'root': 'lil-li', 'lemma': 'lil-li', 'clitic': '', 'partofspeech': 'Y', 'form': '?'})],
SL[Span(10e, {'ending': 'ile', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'tele', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'te', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl g'}),
Span(10e, {'ending': 'isse', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl ill'}),
Span(10e, {'ending': 'tes

In [204]:
t

Text(text="Tiit müüs lil-li 10e krooniga.")

Here we can see that the word "lil-li" was not normalized and therefore didn't receive the correct analysis.

In [None]:
# Can't we switch off premorph and postmorph some easier way?
# Somehow without explicitly creating the VabamorfTagger?

## Use the `morf_analysis` layer to create a `corrected_morph` layer


In [232]:
# What is the purpose of this?

1. Create a text object.
2. Tag the `normalized` layer (this also tags the `words` layer, presumably because `normalized` depends on it).
3. Tag the `morf_analysis` layer with `VabamorfTagger`, with postmorph turned off.
4. Create a layer `_morph` that contains the data from the layers `morf_analysis` and `normalized`.
5. Rewrite the `_morph` layer and get the `corrected_morph` layer as a result.
6. Attach the `corrected_morph` layer to the text object.

Now `text.corrected_morph` is the same as `t.morf_analysis` in the first example where premorph and postmorph are executed.
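The rewrite step can be pictured with plain Python lists and dicts. The sketch below is an assumption about the general idea only: `toy_rewrite` and `toy_rule` are hypothetical, and they do not reproduce the behaviour of `Layer.rewrite` or `VabamorfCorrectionRewriter`. An ambiguous layer is modelled as a list of words, each holding a list of analyses; the rule sees the source attributes (including `word_normal`), and only the target attributes are kept in the result.

```python
def toy_rewrite(layer, source_attributes, target_attributes, rule):
    """Apply `rule` to the analyses of every word and keep only the
    target attributes in the new layer."""
    new_layer = []
    for analyses in layer:
        records = [{a: analysis[a] for a in source_attributes}
                   for analysis in analyses]
        rewritten = rule(records)
        new_layer.append([{a: rec[a] for a in target_attributes}
                          for rec in rewritten])
    return new_layer

def toy_rule(records):
    """Toy correction: mark tokens that start with a digit as numerals ('N')."""
    return [dict(r, partofspeech='N') if r['word_normal'][:1].isdigit() else r
            for r in records]

# Hand-written sample modelled on the analyses obtained with postmorph off.
toy_layer = [
    [{'lemma': 'lill', 'partofspeech': 'S', 'word_normal': 'lilli'}],
    [{'lemma': '10e', 'partofspeech': 'Y', 'word_normal': '10e'}],
]

corrected = toy_rewrite(toy_layer,
                        source_attributes=['lemma', 'partofspeech', 'word_normal'],
                        target_attributes=['lemma', 'partofspeech'],
                        rule=toy_rule)
print(corrected)
```

The helper attribute `word_normal` is available to the rule but is dropped from the output, which is the role the source and target attribute lists play in this sketch.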

In [205]:
# Import the classes needed to build and rewrite a layer by hand
from estnltk.text import Span, Layer
from estnltk.rewriting.postmorph.vabamorf_corrector import VabamorfCorrectionRewriter

In [212]:
text = Text('Tiit müüs lil-li 10e krooniga.')
text.tag_layer(['normalized'])
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=None).tag(text)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [213]:
# These are all the attributes of the morf_analysis layer? Why do I need to specify them here? 
morph_attributes = ['form', 'root_tokens', 'clitic', 'partofspeech', 'ending', 'root', 'lemma']

In [214]:
attributes = morph_attributes + ['word_normal']

In [215]:

_morph = Layer(name='words',
               parent='words',
               ambiguous=True,
               attributes=attributes
               )

In [216]:
for word, analyses in zip(text.words, text.morf_analysis):
    for analysis in analyses:
        span = _morph.add_span(Span(parent=word))
        for attr in morph_attributes:
            setattr(span, attr, getattr(analysis, attr))
        setattr(span, 'word_normal', word.normal or word.text)

In [217]:

postmorph_rewriter = VabamorfCorrectionRewriter()

In [218]:
corrected_morph = _morph.rewrite(source_attributes=attributes,
                                 target_attributes=morph_attributes, 
                                 rules=postmorph_rewriter,
                                 name='corrected_morph',
                                 ambiguous=True)

In [219]:
text['corrected_morph'] = corrected_morph

In [220]:
text.corrected_morph

SL[SL[Span(Tiit, {'ending': '0', 'root_tokens': ['Tiit'], 'root': 'Tiit', 'lemma': 'Tiit', 'clitic': '', 'partofspeech': 'H', 'form': 'sg n'})],
SL[Span(müüs, {'ending': 's', 'root_tokens': ['müü'], 'root': 'müü', 'lemma': 'müüma', 'clitic': '', 'partofspeech': 'V', 'form': 's'})],
SL[Span(lil-li, {'ending': 'i', 'root_tokens': ['lill'], 'root': 'lill', 'lemma': 'lill', 'clitic': '', 'partofspeech': 'S', 'form': 'pl p'})],
SL[Span(10e, {'ending': 'ile', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'tele', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl all'}),
Span(10e, {'ending': 'te', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl g'}),
Span(10e, {'ending': 'isse', 'root_tokens': ['10'], 'root': '10', 'lemma': '10', 'clitic': '', 'partofspeech': 'N', 'form': 'pl ill'}),
Span(10e, {'ending': 'tesse', '

In [222]:
text1 = Text('Tiit müüs lil-li 10e krooniga.')
text1.tag_layer(['normalized'])
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=postmorph_rewriter).tag(text1)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [229]:
str(text1.morf_analysis) == str(text.corrected_morph)

True

In [None]:
# In the end, we have the same output as from the first example but with a different name?
# No clue why we did this.