## Reordering ambiguous morphological analyses

By design, Vabamorf's morphological analysis tool does not try to resolve every morphological ambiguity: rather than resolving hard cases incorrectly, it leaves them unresolved so that the end user can decide how to handle them. 
As a result, even after applying Vabamorf with full disambiguation, some words still have more than one morphological analysis. 
Note that these ambiguous analyses _are not sorted by probability_, so picking the first analysis is not a good strategy for handling them (it yields the correct analysis only about 50% of the time).

To address this issue, `MorphAnalysisReorderer` reorders ambiguous analyses so that the first analysis is more likely to be the correct one. Example:

In [1]:
from estnltk import Text

# Create a text with hard-to-solve ambiguities
text=Text("Üks ütles, et 1. mail tähistab palju maid töörahvapüha.")
text.tag_layer('sentences')

# Add Vabamorf's morph analyses (without disambiguation)
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()
vm_analyser.tag( text )

# Use default disambiguation
from estnltk.taggers import VabamorfDisambiguator
vm_disambiguator = VabamorfDisambiguator()
vm_disambiguator.retag( text )

# Examine morph_analysis ambiguities
text.morph_analysis[ lambda word_span: len(word_span.annotations) > 1 ]

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 3 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Üks | Üks | üks | üks | ['üks'] | 0 | | sg n | N |
| | Üks | üks | üks | ['üks'] | 0 | | sg n | P |
| mail | mail | maa | maa | ['maa'] | il | | pl ad | S |
| | mail | mai | mai | ['mai'] | l | | sg ad | S |
| maid | maid | maa | maa | ['maa'] | id | | pl p | S |
| | maid | mai | mai | ['mai'] | d | | sg p | S |


Now, let's use `MorphAnalysisReorderer` (a `Retagger` of the `morph_analysis` layer) to reorder the ambiguous analyses:

In [2]:
from estnltk.taggers import MorphAnalysisReorderer

morph_reorderer = MorphAnalysisReorderer()
morph_reorderer.retag( text )

# Examine the order of ambiguities
text.morph_analysis[ lambda word_span: len(word_span.annotations) > 1 ]

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 3 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Üks | Üks | üks | üks | ['üks'] | 0 | | sg n | P |
| | Üks | üks | üks | ['üks'] | 0 | | sg n | N |
| mail | mail | mai | mai | ['mai'] | l | | sg ad | S |
| | mail | maa | maa | ['maa'] | il | | pl ad | S |
| maid | maid | maa | maa | ['maa'] | id | | pl p | S |
| | maid | mai | mai | ['mai'] | d | | sg p | S |


Note the improved order of analyses for the words _üks_ and _mail_: in both cases the two analyses have swapped places, while the order for _maid_ is unchanged.

For reordering analyses, `MorphAnalysisReorderer` uses a simple frequency-dictionary based approach.

   * Firsthand reordering: if a word with ambiguous analyses is found in the dictionary that maps words to their frequency-sorted analyses, the word's analyses are re-sorted according to the order in the dictionary.


   * Fallback reordering: if there is no data for the firsthand reordering, the analyses of the ambiguous word are sorted by part-of-speech frequency (based on a dictionary of part-of-speech corpus frequencies).
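
The two steps above can be sketched in plain Python. This is only an illustration of the idea, not EstNLTK's implementation: the names `WORD_ORDERINGS`, `POSTAG_FREQ` and `reorder` are hypothetical, the analyses are reduced to `(lemma, partofspeech, form)` tuples, and all frequencies are made up.

```python
# Hypothetical data: neither dictionary reflects EstNLTK's real resources.
# Word -> analyses as (lemma, partofspeech, form), most frequent first.
WORD_ORDERINGS = {
    "mail": [("mai", "S", "sg ad"), ("maa", "S", "pl ad")],
}
# Part of speech -> corpus frequency (made-up counts).
POSTAG_FREQ = {"S": 1000, "P": 400, "N": 50}


def reorder(word, analyses):
    """Sort analyses by the word dictionary, else by part-of-speech frequency."""
    known = WORD_ORDERINGS.get(word)
    if known is not None:
        # Firsthand step: follow the dictionary order; unknown analyses go last.
        rank = {analysis: i for i, analysis in enumerate(known)}
        return sorted(analyses, key=lambda a: rank.get(a, len(rank)))
    # Fallback step: analyses with more frequent part-of-speech tags come first.
    return sorted(analyses, key=lambda a: -POSTAG_FREQ.get(a[1], 0))


mail = [("maa", "S", "pl ad"), ("mai", "S", "sg ad")]
uks = [("üks", "N", "sg n"), ("üks", "P", "sg n")]
print(reorder("mail", mail))  # uses the word dictionary
print(reorder("Üks", uks))    # no dictionary entry, falls back to POS frequency
```

Because Python's `sorted` is stable, analyses that tie (for example, two analyses with the same part of speech in the fallback step) keep their original relative order.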
   
   
The default dictionaries of `MorphAnalysisReorderer` were derived from the training part of the [Estonian Dependency Treebank](https://github.com/UniversalDependencies/UD_Estonian-EDT/tree/5eba261d1ed63507a44063a4e05b77b1db5f4aac).
Evaluation on the dev and test parts of the corpus showed that after reordering, the chance that the first analysis is the correct one increased from ~50% to ~70%. The source code for creating the dictionaries and evaluating the reordering accuracy is available at: https://github.com/estnltk/ambiguous-morph-reordering
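
The "first analysis is the correct one" measure can be read as a top-1 accuracy over ambiguous words. The sketch below shows one way to compute it. It is not the evaluation code from the repository above: the function name `first_analysis_accuracy` is hypothetical, and the gold analyses for the three words of the example sentence are assumptions made for illustration.

```python
def first_analysis_accuracy(items):
    """Share of ambiguous words whose first candidate equals the gold analysis.

    items: list of (candidate_analyses, gold_analysis) pairs.
    """
    if not items:
        return 0.0
    hits = sum(1 for candidates, gold in items
               if candidates and candidates[0] == gold)
    return hits / len(items)


# Assumed gold analyses (lemma, partofspeech, form) for 'Üks', 'mail', 'maid'.
gold = [("üks", "P", "sg n"), ("mai", "S", "sg ad"), ("maa", "S", "pl p")]

# Candidate order before reordering.
before = [
    [("üks", "N", "sg n"), ("üks", "P", "sg n")],
    [("maa", "S", "pl ad"), ("mai", "S", "sg ad")],
    [("maa", "S", "pl p"), ("mai", "S", "sg p")],
]
# Candidate order after reordering.
after = [
    [("üks", "P", "sg n"), ("üks", "N", "sg n")],
    [("mai", "S", "sg ad"), ("maa", "S", "pl ad")],
    [("maa", "S", "pl p"), ("mai", "S", "sg p")],
]

acc_before = first_analysis_accuracy(list(zip(before, gold)))
acc_after = first_analysis_accuracy(list(zip(after, gold)))
print(f"before: {acc_before:.2f}, after: {acc_after:.2f}")
```

Three words are far too few to say anything about real accuracy; the point is only how the measure is defined.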

Things to keep in mind:

   * `MorphAnalysisReorderer` is already integrated into `VabamorfTagger` and `VabamorfCorpusTagger`, and it is enabled by default. If you call `tag_layer('morph_analysis')`, the reordering is included in the process, so normally you do not need to run `MorphAnalysisReorderer` yourself.


   * You get the full effect of `MorphAnalysisReorderer` only on morphologically disambiguated texts, e.g. when applying it after `VabamorfDisambiguator`. If you apply it directly after `VabamorfAnalyzer` (on an ambiguous `morph_analysis` layer), the reordering performance is likely to be suboptimal, because the firsthand reordering dictionary only contains information about words that were left ambiguous after the morphological disambiguation process.
   
   
   * The firsthand reordering dictionary of `MorphAnalysisReorderer` may not be optimal for every text domain. If you need to handle a specific domain, you can make your own dictionary of reorderings and use it in `MorphAnalysisReorderer`. See below for details.
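
As a rough illustration of how a domain-specific dictionary could be derived, the sketch below counts gold analyses per word form and lists them from most to least frequent. Everything here is an assumption: the `gold_tokens` data is made up, `build_orderings` and `keep_words` are hypothetical names, and the project's actual dictionary-building scripts (linked above) may work differently.

```python
from collections import Counter, defaultdict

# Hypothetical gold annotations from a domain corpus:
# one (text, lemma, partofspeech, form) tuple per token.
gold_tokens = [
    ("maid", "mai", "S", "sg p"),
    ("maid", "mai", "S", "sg p"),
    ("maid", "maa", "S", "pl p"),
    ("teine", "teine", "P", "sg n"),
]


def build_orderings(tokens, keep_words=None):
    """Return rows (text, lemma, partofspeech, form, freq), most frequent first.

    keep_words: optional set of word forms to include, e.g. the words that
    remain ambiguous after disambiguation; None keeps every word.
    """
    counts = defaultdict(Counter)
    for text, lemma, pos, form in tokens:
        counts[text][(lemma, pos, form)] += 1
    rows = []
    for text, analyses in counts.items():
        if keep_words is not None and text not in keep_words:
            continue
        # most_common() yields analyses in descending frequency order.
        for (lemma, pos, form), freq in analyses.most_common():
            rows.append((text, lemma, pos, form, freq))
    return rows


rows = build_orderings(gold_tokens)
for row in rows:
    print("\t".join(str(value) for value in row))
```

The resulting rows already have the per-word ordering that a reorderings file needs; they only lack a header line.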

### Using a custom dictionary

`MorphAnalysisReorderer` loads its firsthand reordering data from a CSV file in tab-separated-values format. The first line of the file must be a header specifying (at minimum) the following attributes:
 * `text` -- word surface form;
 * `lemma` -- 'lemma' attribute from 'morph_analysis';
 * `partofspeech` -- 'partofspeech' attribute from 'morph_analysis';
 * `form` -- 'form' attribute from 'morph_analysis';
 * `prob` or `freq` -- probability or frequency of the analysis;
 
Other attributes from the `morph_analysis` layer can also be used if higher precision is needed for differentiating analyses. 
The header is required to determine the order in which the columns appear in the file. 
Each line following the header specifies a single analysis for a word. 
A word with multiple analyses is described on multiple successive lines.
Important: we assume that the analyses in the CSV file are already in the correct order, sorted from most probable to least probable.
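
Since the file is expected to arrive correctly sorted, it can be worth checking it before use. The sketch below does this with the standard `csv` module only. It is not EstNLTK's loader: `load_orderings` and `REQUIRED` are hypothetical names, and reading the file with `dialect='excel-tab'` is an assumption about the exact tab-separated format.

```python
import csv
import io

REQUIRED = {"text", "lemma", "partofspeech", "form"}


def load_orderings(handle):
    """Read a tab-separated reorderings file and check its header and order."""
    reader = csv.DictReader(handle, dialect="excel-tab")
    header = set(reader.fieldnames or [])
    if not REQUIRED <= header or not ({"prob", "freq"} & header):
        raise ValueError("header lacks required columns")
    score_col = "prob" if "prob" in header else "freq"
    orderings = {}
    for row in reader:
        analysis = (row["lemma"], row["partofspeech"], row["form"],
                    float(row[score_col]))
        orderings.setdefault(row["text"], []).append(analysis)
    # Each word's analyses must run from most to least probable.
    for word, analyses in orderings.items():
        scores = [analysis[3] for analysis in analyses]
        if scores != sorted(scores, reverse=True):
            raise ValueError(f"analyses of {word!r} are not sorted")
    return orderings


sample = (
    "text\tlemma\tpartofspeech\tform\tprob\n"
    "maid\tmai\tS\tsg p\t0.8\n"
    "maid\tmaa\tS\tpl p\t0.2\n"
)
orderings = load_orderings(io.StringIO(sample))
print(orderings["maid"][0])
```

For a file on disk, the same function can be called with a handle opened via `open(path, encoding='utf-8', newline='')`.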

An example:

In [3]:
# Create a CSV file with correct orderings
import tempfile
fp = tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', suffix='.csv', delete=False)
# Add header
fp.write( ('\t'.join(['text','lemma','partofspeech','form','prob'])) + '\n' )
# Add analysis reorderings:
# word 'teine'
fp.write( ('\t'.join(['teine','teine','P','sg n','0.75'])) + '\n' )
fp.write( ('\t'.join(['teine','teine','N','sg n','0.25'])) + '\n' )
# word 'maid'
fp.write( ('\t'.join(['maid','mai','S','sg p','0.8'])) + '\n' )
fp.write( ('\t'.join(['maid','maa','S','pl p','0.2'])) + '\n' )
fp.close()

In [4]:
# Create new reorderer that loads the firsthand dictionary from the CSV file
from estnltk.taggers import MorphAnalysisReorderer

morph_reorderer = MorphAnalysisReorderer( reorderings_csv_file = fp.name )

In [5]:
from estnltk import Text

# Create a text with hard-to-solve ambiguities
text=Text("Teine jälle kirus 1. maid.")
text.tag_layer('sentences')

# Add Vabamorf's morph analyses (without disambiguation)
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()
vm_analyser.tag( text )

# Use default disambiguation
from estnltk.taggers import VabamorfDisambiguator
vm_disambiguator = VabamorfDisambiguator()
vm_disambiguator.retag( text )

# Apply reorderer
morph_reorderer.retag(text)

# Examine morph_analysis ambiguities
text.morph_analysis[ lambda word_span: len(word_span.annotations) > 1 ]

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 2 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Teine | Teine | teine | teine | ['teine'] | 0 | | sg n | P |
| | Teine | teine | teine | ['teine'] | 0 | | sg n | O |
| maid | maid | mai | mai | ['mai'] | d | | sg p | S |
| | maid | maa | maa | ['maa'] | id | | pl p | S |


In [6]:
# Clean-up: remove temporary file
import os
os.remove(fp.name)

### Customizing fallback reordering dictionary

In addition to customizing the firsthand dictionary, you can also customize the part of speech dictionary that is used for fallback reordering. 
Simply initialize the reorderer with the parameter `postag_freq_csv_file`:

```python
from estnltk.taggers import MorphAnalysisReorderer
morph_reorderer = MorphAnalysisReorderer( postag_freq_csv_file = 'my_postag_freq.csv' )
``` 
By default, `MorphAnalysisReorderer` assumes that the CSV file is in tab-separated-values format (`dialect='excel-tab'`) and encoded in UTF-8. The first line must be a header specifying the column order (`partofspeech` and `freq`).
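
A file in this format can be produced with the standard `csv` module. The following is a minimal sketch under stated assumptions: the `observed_tags` list is made-up data standing in for tags counted from a domain corpus, and whether the rows need any particular order is not specified here, so sorting by frequency is just a convenience.

```python
import csv
import os
import tempfile
from collections import Counter

# Hypothetical part-of-speech tags observed in a domain corpus.
observed_tags = ["S", "S", "V", "S", "P", "V"]
tag_counts = Counter(observed_tags)

# Write a tab-separated UTF-8 file with the header 'partofspeech', 'freq'.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle, dialect="excel-tab")
    writer.writerow(["partofspeech", "freq"])
    for tag, freq in tag_counts.most_common():
        writer.writerow([tag, freq])

# Read the file back to confirm its layout, then clean up.
with open(path, encoding="utf-8", newline="") as handle:
    rows = list(csv.DictReader(handle, dialect="excel-tab"))
os.remove(path)
print(rows)
```

In real use you would keep the file instead of removing it and pass its path as `postag_freq_csv_file`.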