# Morphological analysis with premorph and postmorph

In [1]:
from estnltk.text import Text

Premorph and postmorph are tools that improve the morphological analysis produced by vabamorf. Premorph normalizes the input before it is passed to vabamorf, and postmorph corrects the output of vabamorf. By default, both premorph and postmorph are executed.

In [2]:
# Let's take a sentence that contains an unnecessary hyphen and an incorrectly declined number.
t = Text('Tiit müüs lil-li 10e krooniga.')

In [3]:
t

Text(text="Tiit müüs lil-li 10e krooniga.")

In [4]:
# Tag the default layers on the sentence. The printed output does not change, because it shows only the text.
t.tag_layer()

Text(text="Tiit müüs lil-li 10e krooniga.")

The default layer tagged by tag_layer() method is the morphological analysis layer. To perform morphological analysis, 'words', 'sentences' and 'normalized' layers are tagged on the text first.

We can see the existing layers through the `layers` attribute of the text object.

In [5]:
t.layers

{'morf_analysis': <estnltk.text.Layer at 0x7f8f8901b898>,
 'normalized': <estnltk.text.Layer at 0x7f8f8901b630>,
 'sentences': <estnltk.text.Layer at 0x7f8f8901bfd0>,
 'words': <estnltk.text.Layer at 0x7f8f8901b9e8>}

In [6]:
# And now we can ask for morphological analysis of the sentence
t.morf_analysis

SL[SL[Span(Tiit, {'partofspeech': 'H', 'root_tokens': ['Tiit'], 'ending': '0', 'form': 'sg n', 'clitic': '', 'root': 'Tiit', 'lemma': 'Tiit'})],
SL[Span(müüs, {'partofspeech': 'V', 'root_tokens': ['müü'], 'ending': 's', 'form': 's', 'clitic': '', 'root': 'müü', 'lemma': 'müüma'})],
SL[Span(lil-li, {'partofspeech': 'S', 'root_tokens': ['lill'], 'ending': 'i', 'form': 'pl p', 'clitic': '', 'root': 'lill', 'lemma': 'lill'})],
SL[Span(10e, {'partofspeech': 'N', 'root_tokens': ['10'], 'ending': '0', 'form': 'sg g', 'clitic': '', 'root': '10', 'lemma': '10'})],
SL[Span(krooniga, {'partofspeech': 'S', 'root_tokens': ['kroon'], 'ending': 'ga', 'form': 'sg kom', 'clitic': '', 'root': 'kroon', 'lemma': 'kroon'})],
SL[Span(., {'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '', 'form': '', 'clitic': '', 'root': '.', 'lemma': '.'})]]

As we can see, the hyphen in "lil-li" was ignored and the word received the correct analysis: lemma "lill", form 'pl p'. The token "10e" was analysed with part of speech 'N', lemma "10" and form 'sg g'.

In [7]:
#--------IMPORTANT-----------------#
# How to switch on/off disambiguation and/or guessing in vabamorf?
# Do I have to do this using VabamorfTagger or can I do it with tag_layer() method? How?

The previous code actually uses the WordNormalizingTagger, VabamorfTagger and VabamorfCorrectionRewriter tools. So we can write it out as follows:

In [8]:
# Import the taggers and rewriter
from estnltk.taggers.premorph.premorf import WordNormalizingTagger
from estnltk.taggers.morf import VabamorfTagger
from estnltk.rewriting.postmorph.vabamorf_corrector import VabamorfCorrectionRewriter

An instance of VabamorfCorrectionRewriter is created that fixes the analysis of tokens containing numbers. This is the rewriter that is used for postmorph by default.
In our example, it processes the 10e token.
If we want, we can write our own rewriter and use that instead.
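A custom rewriter could look roughly like the sketch below. It assumes, without confirmation from the estnltk source, that a postmorph rewriter is any object with a `rewrite(record)` method that receives the list of analyses of one token (plain dicts that also carry a `word_normal` key) and returns the corrected list. The class `NumberSuffixRewriter` and its single-entry suffix table are hypothetical and are not part of estnltk; the table only mirrors the `10e` result shown in this notebook.

```python
import re


class NumberSuffixRewriter:
    """Hypothetical postmorph rewriter; not part of estnltk."""

    # Illustrative only: maps a suffix to a form label. The single entry
    # mirrors the '10e' -> 'sg g' result shown in this notebook.
    SUFFIX_TO_FORM = {'e': 'sg g'}

    _number_with_suffix = re.compile(r'^(\d+)([a-z]+)$')

    def rewrite(self, record):
        # `record` is assumed to be a list of analysis dicts for one token.
        result = []
        for analysis in record:
            match = self._number_with_suffix.match(analysis['word_normal'])
            form = self.SUFFIX_TO_FORM.get(match.group(2)) if match else None
            corrected = dict(analysis)  # never modify the input in place
            if form is not None:
                digits = match.group(1)
                corrected.update(partofspeech='N', root=digits, lemma=digits,
                                 root_tokens=[digits], form=form)
            result.append(corrected)
        return result


unknown = [{'word_normal': '10e', 'partofspeech': 'Y', 'root_tokens': ['10e'],
            'ending': '0', 'form': '?', 'clitic': '', 'root': '10e', 'lemma': '10e'}]
fixed = NumberSuffixRewriter().rewrite(unknown)
print(fixed[0]['partofspeech'], fixed[0]['form'], fixed[0]['lemma'])  # N sg g 10
```

Analyses that the rule does not recognise are returned unchanged, so such a rewriter can be applied to every token of a text.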

In [9]:
# Create an instance of VabamorfCorrectionRewriter 
vabamorf_corrector = VabamorfCorrectionRewriter(replace=True)

In [10]:
# Let's take the same sentence as previously
t = Text('Tiit müüs lil-li 10e krooniga.')

In [11]:
# Tag the 'words' layer on it. The printed output does not change, because it shows only the text.
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [12]:
# But now we can ask for the words layer
t.words

SL[Span(Tiit, {}),
Span(müüs, {}),
Span(lil-li, {}),
Span(10e, {}),
Span(krooniga, {}),
Span(., {})]

Now we can normalize the sentence with WordNormalizingTagger, which takes care of the unnecessary hyphens in words and [maybe sth else?]. In our example, it normalizes the word "lil-li" to "lilli". It can be replaced with our own tagger if we decide to write one.
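To make the idea of hyphen normalization concrete, here is a toy sketch in plain Python. It is not the logic of WordNormalizingTagger: the regular expression and the function `normalize_word` are assumptions made for illustration only. Like the `normalized` layer, it reports only the words that actually change.

```python
import re

# Toy rule, not the actual WordNormalizingTagger logic: drop a hyphen that
# sits between two lowercase letters.
_INNER_HYPHEN = re.compile(r'(?<=[a-zäöüõšž])-(?=[a-zäöüõšž])')


def normalize_word(word):
    """Return the normalized form, or None if nothing changed."""
    normalized = _INNER_HYPHEN.sub('', word)
    return normalized if normalized != word else None


words = ['Tiit', 'müüs', 'lil-li', '10e', 'krooniga', '.']
changed = {w: normalize_word(w) for w in words if normalize_word(w) is not None}
print(changed)  # {'lil-li': 'lilli'}
```

A rule this simple would also join legitimately hyphenated compounds, so a real normalizer needs more careful conditions.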

In [13]:
# Normalize the sentence with WordNormalizingTagger
WordNormalizingTagger().tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [14]:
# And we can see the word that was changed
t.normalized

SL[Span(lil-li, {'normal': 'lilli'})]

Now we can use the VabamorfTagger on the normalized layer received from WordNormalizingTagger and ask for the created VabamorfCorrectionRewriter to be used after vabamorf.

In [15]:
# Tag the text with VabamorfTagger using default premorph and postmorph
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=vabamorf_corrector).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [16]:
# The same output is received as in the first example
t.morf_analysis

SL[SL[Span(Tiit, {'partofspeech': 'H', 'root_tokens': ['Tiit'], 'ending': '0', 'form': 'sg n', 'clitic': '', 'root': 'Tiit', 'lemma': 'Tiit'})],
SL[Span(müüs, {'partofspeech': 'V', 'root_tokens': ['müü'], 'ending': 's', 'form': 's', 'clitic': '', 'root': 'müü', 'lemma': 'müüma'})],
SL[Span(lil-li, {'partofspeech': 'S', 'root_tokens': ['lill'], 'ending': 'i', 'form': 'pl p', 'clitic': '', 'root': 'lill', 'lemma': 'lill'})],
SL[Span(10e, {'partofspeech': 'N', 'root_tokens': ['10'], 'ending': '0', 'form': 'sg g', 'clitic': '', 'root': '10', 'lemma': '10'})],
SL[Span(krooniga, {'partofspeech': 'S', 'root_tokens': ['kroon'], 'ending': 'ga', 'form': 'sg kom', 'clitic': '', 'root': 'kroon', 'lemma': 'kroon'})],
SL[Span(., {'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '', 'form': '', 'clitic': '', 'root': '.', 'lemma': '.'})]]

As mentioned, we can customize premorph or postmorph.

To turn off postmorph for the same example, we need to set postmorph_rewriter to None:

In [17]:
t = Text('Tiit müüs lil-li 10e krooniga.')

In [18]:
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [19]:
WordNormalizingTagger().tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [20]:
# postmorph_rewriter = None says that we don't want to apply the default rewriter
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=None).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [21]:
t.morf_analysis

SL[SL[Span(Tiit, {'partofspeech': 'H', 'root_tokens': ['Tiit'], 'ending': '0', 'form': 'sg n', 'clitic': '', 'root': 'Tiit', 'lemma': 'Tiit'})],
SL[Span(müüs, {'partofspeech': 'V', 'root_tokens': ['müü'], 'ending': 's', 'form': 's', 'clitic': '', 'root': 'müü', 'lemma': 'müüma'})],
SL[Span(lil-li, {'partofspeech': 'S', 'root_tokens': ['lill'], 'ending': 'i', 'form': 'pl p', 'clitic': '', 'root': 'lill', 'lemma': 'lill'})],
SL[Span(10e, {'partofspeech': 'Y', 'root_tokens': ['10e'], 'ending': '0', 'form': '?', 'clitic': '', 'root': '10e', 'lemma': '10e'})],
SL[Span(krooniga, {'partofspeech': 'S', 'root_tokens': ['kroon'], 'ending': 'ga', 'form': 'sg kom', 'clitic': '', 'root': 'kroon', 'lemma': 'kroon'})],
SL[Span(., {'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '', 'form': '', 'clitic': '', 'root': '.', 'lemma': '.'})]]

Without postmorph, the token '10e' is analysed with part of speech 'Y' and form '?', which, unfortunately, is not correct.
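When comparing configurations, it can help to list the tokens whose form came out as '?'. The helper below is a sketch over plain Python data: `tokens_with_undetermined_form` is a hypothetical name, and the input is an abridged, hand-copied version of the output above rather than an estnltk layer (converting a layer to such a structure is not shown here).

```python
def tokens_with_undetermined_form(analysed_tokens):
    """Return the texts of tokens that have at least one analysis with form '?'.

    analysed_tokens: list of (token_text, list_of_analysis_dicts) pairs.
    """
    return [text for text, analyses in analysed_tokens
            if any(a['form'] == '?' for a in analyses)]


# Abridged copy of the output above (only two attributes kept per analysis).
without_postmorph = [
    ('Tiit', [{'partofspeech': 'H', 'form': 'sg n'}]),
    ('müüs', [{'partofspeech': 'V', 'form': 's'}]),
    ('lil-li', [{'partofspeech': 'S', 'form': 'pl p'}]),
    ('10e', [{'partofspeech': 'Y', 'form': '?'}]),
    ('krooniga', [{'partofspeech': 'S', 'form': 'sg kom'}]),
    ('.', [{'partofspeech': 'Z', 'form': ''}]),
]
flagged = tokens_with_undetermined_form(without_postmorph)
print(flagged)  # ['10e']
```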

We can also turn off premorph:

In [22]:
t = Text('Tiit müüs lil-li 10e krooniga.')

In [23]:
t.tag_layer(['words'])

Text(text="Tiit müüs lil-li 10e krooniga.")

In [24]:
# premorf_layer=None says that we don't want to use a normalized layer, so premorph is skipped
VabamorfTagger(premorf_layer=None, postmorph_rewriter=vabamorf_corrector).tag(t)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [25]:
t.morf_analysis

SL[SL[Span(Tiit, {'partofspeech': 'H', 'root_tokens': ['Tiit'], 'ending': '0', 'form': 'sg n', 'clitic': '', 'root': 'Tiit', 'lemma': 'Tiit'})],
SL[Span(müüs, {'partofspeech': 'V', 'root_tokens': ['müü'], 'ending': 's', 'form': 's', 'clitic': '', 'root': 'müü', 'lemma': 'müüma'})],
SL[Span(lil-li, {'partofspeech': 'Y', 'root_tokens': ['lil', 'li'], 'ending': '0', 'form': '?', 'clitic': '', 'root': 'lil-li', 'lemma': 'lil-li'})],
SL[Span(10e, {'partofspeech': 'N', 'root_tokens': ['10'], 'ending': '0', 'form': 'sg g', 'clitic': '', 'root': '10', 'lemma': '10'})],
SL[Span(krooniga, {'partofspeech': 'S', 'root_tokens': ['kroon'], 'ending': 'ga', 'form': 'sg kom', 'clitic': '', 'root': 'kroon', 'lemma': 'kroon'})],
SL[Span(., {'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '', 'form': '', 'clitic': '', 'root': '.', 'lemma': '.'})]]

In [26]:
t

Text(text="Tiit müüs lil-li 10e krooniga.")

Here we can see that the word "lil-li" was not normalized and therefore didn't receive the correct analysis.

In [27]:
# Can't we switch off premorph and postmorph some easier way?
# Somehow without explicitly creating the VabamorfTagger?

## Use the `morf_analysis` layer to create a `corrected_morph` layer


In [28]:
# What is the purpose of this?
# According to Sven, this shows how to tag your own layer with a custom postmorph rewriter
# in addition to the default layer.
# The example code is too complicated  and undocumented - impossible to understand.
# An easier example would be nice that wouldn't try to contain everything.

1. Create a text object.
2. Tag the `normalized` layer (and also the `words` layer  #this happens somehow magically by tagging the 'normalized' layer?).
3. Tag the `morf_analysis` layer with VabamorfTagger, with postmorph turned off.
4. Create a layer `_morph` that contains the data from the layers `morf_analysis` and `normalized`.
5. Rewrite the `_morph` layer and get the `corrected_morph` layer as a result.
6. Attach the `corrected_morph` layer to the text object.

Now `text.corrected_morph` is the same as `t.morf_analysis` in the first example where premorph and postmorph are executed.
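The merging and rewriting steps listed above can be sketched with plain dictionaries, independently of estnltk. The functions `build_records` and `apply_rewriter` are hypothetical stand-ins for filling the `_morph` layer and calling its `rewrite` method; the sketch assumes that a rewriter maps the list of analyses of one token to a corrected list, and it uses an identity function in place of a real rewriter.

```python
MORPH_ATTRIBUTES = ['form', 'root_tokens', 'clitic', 'partofspeech',
                    'ending', 'root', 'lemma']


def build_records(words, analyses_per_word):
    """Attach each word's normalized form (or its text) to all of its analyses."""
    records = []
    for word, analyses in zip(words, analyses_per_word):
        word_normal = word.get('normal') or word['text']
        records.append([dict(a, word_normal=word_normal) for a in analyses])
    return records


def apply_rewriter(records, rewrite, target_attributes):
    """Rewrite every record and keep only the target attributes."""
    return [[{key: analysis[key] for key in target_attributes}
             for analysis in rewrite(record)]
            for record in records]


words = [{'text': 'lil-li', 'normal': 'lilli'}, {'text': '.'}]
analyses_per_word = [
    [{'partofspeech': 'S', 'root_tokens': ['lill'], 'ending': 'i',
      'form': 'pl p', 'clitic': '', 'root': 'lill', 'lemma': 'lill'}],
    [{'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '',
      'form': '', 'clitic': '', 'root': '.', 'lemma': '.'}],
]

records = build_records(words, analyses_per_word)
print([record[0]['word_normal'] for record in records])  # ['lilli', '.']

# An identity function stands in for a real rewriter here.
corrected = apply_rewriter(records, lambda record: record, MORPH_ATTRIBUTES)
print('word_normal' in corrected[0][0])  # False
```

The extra `word_normal` attribute is visible to the rewriter but is dropped from the result, which matches the difference between `source_attributes` and `target_attributes` in the cells below.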

In [29]:
# Import the necessary stuff
from estnltk.text import Span, Layer
from estnltk.rewriting.postmorph.vabamorf_corrector import VabamorfCorrectionRewriter

In [30]:
text = Text('Tiit müüs lil-li 10e krooniga.')
text.tag_layer(['normalized'])
VabamorfTagger(premorf_layer='normalized', postmorph_rewriter=None).tag(text)

Text(text="Tiit müüs lil-li 10e krooniga.")

In [31]:
# These are all the attributes of the morf_analysis layer? Why do I need to specify them here? 
morph_attributes = ['form', 'root_tokens', 'clitic', 'partofspeech', 'ending', 'root', 'lemma']

In [32]:
attributes = morph_attributes + ['word_normal']

In [33]:

_morph = Layer(name='words',
               parent='words',
               ambiguous=True,
               attributes=attributes
               )

In [34]:
for word, analyses in zip(text.words, text.morf_analysis):
    for analysis in analyses:
        span = _morph.add_span(Span(parent=word))
        for attr in morph_attributes:
            setattr(span, attr, getattr(analysis, attr))
        setattr(span, 'word_normal', word.normal or word.text)

In [35]:

postmorph_rewriter = VabamorfCorrectionRewriter()

In [36]:
corrected_morph = _morph.rewrite(source_attributes=attributes,
                                 target_attributes=morph_attributes, 
                                 rules=postmorph_rewriter,
                                 name='corrected_morph',
                                 ambiguous=True)

In [37]:
text['corrected_morph'] = corrected_morph

In [38]:
text.corrected_morph

SL[SL[Span(Tiit, {'partofspeech': 'H', 'root_tokens': ['Tiit'], 'ending': '0', 'form': 'sg n', 'clitic': '', 'root': 'Tiit', 'lemma': 'Tiit'})],
SL[Span(müüs, {'partofspeech': 'V', 'root_tokens': ['müü'], 'ending': 's', 'form': 's', 'clitic': '', 'root': 'müü', 'lemma': 'müüma'})],
SL[Span(lil-li, {'partofspeech': 'S', 'root_tokens': ['lill'], 'ending': 'i', 'form': 'pl p', 'clitic': '', 'root': 'lill', 'lemma': 'lill'})],
SL[Span(10e, {'partofspeech': 'N', 'root_tokens': ['10'], 'ending': '0', 'form': 'sg g', 'clitic': '', 'root': '10', 'lemma': '10'})],
SL[Span(krooniga, {'partofspeech': 'S', 'root_tokens': ['kroon'], 'ending': 'ga', 'form': 'sg kom', 'clitic': '', 'root': 'kroon', 'lemma': 'kroon'})],
SL[Span(., {'partofspeech': 'Z', 'root_tokens': ['.'], 'ending': '', 'form': '', 'clitic': '', 'root': '.', 'lemma': '.'})]]