## <span style="color:purple">Morphological analysis with corpus-based disambiguation</span>

EstNLTK's default morphological analysis uses a probabilistic disambiguator, which relies on the local sentence context when making disambiguation decisions (Kaalep and Vaino 2001).
This works reasonably well for many types of texts: news articles, comments, mixed content etc.

However, the default disambiguator has difficulty getting proper name analyses right when proper names overlap with regular words (e.g. family names that overlap with animal names, such as _Jänes_ or _Karu_).
Furthermore, there are word forms whose ambiguity is rather difficult to resolve locally. 
For instance, _koha_ can be either _a place_ (in genitive), or a _walleye (a fish)_ (in nominative) -- we need some knowledge about the topic of the text to pick the most likely candidate (e.g. whether the text talks about _sports events_ or describes _nature_). 
These disambiguation problems can be addressed by using a broader context (e.g. a collection of texts) in the morphological disambiguation process.
If morphologically ambiguous words reoccur in other parts of the text or in other related texts, one can use the assumption "one lemma per discourse" (inspired by the observation "one sense per discourse" from Word Sense Disambiguation (Gale et al. 1992)) and choose the right analysis based on the most frequently occurring lemma candidate (Kaalep et al. 2012).
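
The counting idea behind "one lemma per discourse" can be illustrated with a small, self-contained sketch. It does not use EstNLTK and is not the library's implementation: the toy corpus, the hand-written candidate lemma lists and the helper names `count_lemmas` and `pick_most_frequent` are assumptions made for illustration only.

```python
from collections import Counter

# Toy corpus: each document is a list of (word form, candidate lemmas).
# The candidate lists are hand-written for illustration only.
toy_corpus = [
    [("kohale", ["koha", "koht"]), ("tuleb", ["tulema"])],
    [("koht", ["koht"]), ("koha", ["koha", "koht"])],
]

def count_lemmas(docs):
    """Count how often each lemma candidate occurs over all documents."""
    counts = Counter()
    for doc in docs:
        for _, lemmas in doc:
            counts.update(lemmas)
    return counts

def pick_most_frequent(docs, counts):
    """For every ambiguous word, keep only the lemma(s) with the highest count."""
    resolved_docs = []
    for doc in docs:
        new_doc = []
        for form, lemmas in doc:
            if len(lemmas) > 1:
                best = max(counts[lemma] for lemma in lemmas)
                lemmas = [lemma for lemma in lemmas if counts[lemma] == best]
            new_doc.append((form, lemmas))
        resolved_docs.append(new_doc)
    return resolved_docs

lemma_counts = count_lemmas(toy_corpus)
resolved = pick_most_frequent(toy_corpus, lemma_counts)
for doc in resolved:
    print(doc)
```

In this toy corpus the lemma _koht_ is a candidate for three word forms and _koha_ for two, so both ambiguous forms end up with _koht_. Ties are deliberately left unresolved in the sketch.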

#### Example: ambiguities unresolved by the default disambiguation

Let's first consider an example of a text collection, which contains ambiguities that cannot be correctly resolved by the default disambiguation:

In [1]:
corpus_texts = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',
                'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '
                'Uus võistlus toimub 2. mail.',
                'Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse '
                'toimumisajaks on 2. mai.']

In [2]:
# Create Texts and apply default morphological analysis
from estnltk import Text
corpus = [Text(t).tag_layer(['morph_analysis']) for t in corpus_texts]

In [3]:
# Examine words with ambiguous analyses
for text in corpus:
    for word in text.words:
        if len(word.morph_analysis) > 1:
            print( word.text,[(a.root,a.partofspeech,a.form) for a in word.morph_analysis] )

kohale [('koha', 'S', 'sg all'), ('koht', 'S', 'sg all')]
kuigi [('kuigi', 'D', ''), ('kuigi', 'J', '')]
Teise [('teine', 'O', 'sg g'), ('teine', 'P', 'sg g')]
koha [('koha', 'S', 'sg g'), ('koht', 'S', 'sg g')]
mail [('maa', 'S', 'pl ad'), ('mai', 'S', 'sg ad')]
summaga [('summa', 'S', 'sg kom'), ('summ', 'S', 'sg kom')]
on [('ole', 'V', 'b'), ('ole', 'V', 'vad')]


Note that the content nouns _kohale_, _koha_, _mail_ and _summaga_ remain ambiguous in this process.

In [4]:
# Examine analyses of titlecase words
for text in corpus:
    for word in text.words:
        if word.text.istitle():
            print( word.text,[(a.root,a.partofspeech,a.form) for a in word.morph_analysis] )

Esimesele [('esimene', 'O', 'sg all')]
Jänes [('jänes', 'S', 'sg n')]
Lõpparvestuses [('lõpp_arvestus', 'S', 'sg in')]
Karule [('karu', 'S', 'sg all')]
Teise [('teine', 'O', 'sg g'), ('teine', 'P', 'sg g')]
Jänes [('jänes', 'S', 'sg n')]
Uus [('uus', 'A', 'sg n')]
Karu [('Karu', 'H', 'sg n')]
Jänesega [('jänes', 'S', 'sg kom')]
Uue [('uus', 'A', 'sg g')]


Note that the proper names _Jänes_, _Karule_ and _Jänesega_ are incorrectly analysed as common nouns (part of speech 'S').

Let's use corpus-based disambiguation to solve these problems!

### Corpus analysis with VabamorfCorpusTagger

EstNLTK provides a special tagger called VabamorfCorpusTagger, which performs morphological analysis, local context morphological disambiguation, and "one lemma per discourse" disambiguation on a list of Text objects:

In [5]:
# Create new VabamorfCorpusTagger. 
# Use a different name for morph analysis layer
from estnltk.taggers import VabamorfCorpusTagger
vm_corpus_tagger = VabamorfCorpusTagger( output_layer='morph_analysis_with_cb_disamb' )

In [6]:
# Retag the corpus, apply corpus-based morph analysis
vm_corpus_tagger.tag( corpus )

[Text(text='Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.'),
 Text(text='Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.'),
 Text(text='Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse toimumisajaks on 2. mai.')]

Let's examine the results:

In [7]:
# Output improvements of corpus-based disambiguation
for text in corpus:
    for word in text.words:
        # old morph analyses
        old_morph = word.morph_analysis
        # new morph analyses
        new_morph = word.morph_analysis_with_cb_disamb
        if old_morph != new_morph:
            # Output changed analyses
            old_analyses = [(a.root,a.partofspeech,a.form) for a in old_morph]
            new_analyses = [(a.root,a.partofspeech,a.form) for a in new_morph]
            print(word.text, old_analyses, '-->', new_analyses)
            print()

kohale [('koha', 'S', 'sg all'), ('koht', 'S', 'sg all')] --> [('koht', 'S', 'sg all')]

Jänes [('jänes', 'S', 'sg n')] --> [('Jänes', 'H', 'sg n')]

Karule [('karu', 'S', 'sg all')] --> [('Karu', 'H', 'sg all')]

koha [('koha', 'S', 'sg g'), ('koht', 'S', 'sg g')] --> [('koht', 'S', 'sg g')]

Jänes [('jänes', 'S', 'sg n')] --> [('Jänes', 'H', 'sg n')]

mail [('maa', 'S', 'pl ad'), ('mai', 'S', 'sg ad')] --> [('mai', 'S', 'sg ad')]

summaga [('summa', 'S', 'sg kom'), ('summ', 'S', 'sg kom')] --> [('summa', 'S', 'sg kom')]

Jänesega [('jänes', 'S', 'sg kom')] --> [('Jänes', 'H', 'sg kom')]



Note that the content nouns _kohale_, _koha_, _mail_ and _summaga_ are no longer ambiguous and that their ambiguities have been resolved correctly. In addition, _Jänes_, _Jänesega_ and _Karule_ are now correctly analysed as proper nouns (part of speech 'H').



_Remarks_:

   * If you do not specify the argument `output_layer` in VabamorfCorpusTagger's constructor, the name 'morph\_analysis' will be used by default. If this layer already exists (as in the previous example), this will result in an error;
   
   
   * You can also change the names of the input layers via the arguments `input_words_layer`, `input_sentences_layer` and `input_compound_tokens_layer`.

#### Details of VabamorfCorpusTagger

Under the hood, VabamorfCorpusTagger applies five processing steps:

  1. **Morphological analysis** of all the words in input texts, including guessing analyses for unknown words and proper names. This is done by applying `VabamorfAnalyzer`;
  
  2. **Post-processing of analyses**, which includes correcting words that contain numbers, correcting the part of speech of compound tokens (such as names with initials, emoticons, abbreviations, numerics etc.), and marking some of the words to be ignored during morphological disambiguation. This is done by `PostMorphAnalysisTagger`;
  
  3. **Corpus-based pre-disambiguation of proper nouns**, which involves picking out proper name analyses based on lemma counts in the corpus. For this, a special analysis component, `CorpusBasedMorphDisambiguator`, is applied (see below for details);
  
  4. **Morphological disambiguation**, which involves picking out the most probable analyses for each word (based on the sentence context). This is done by `VabamorfDisambiguator`;
  
  5. **Corpus-based post-disambiguation**, which involves resolving the remaining ambiguities based on lemma counts in the corpus. For this, `CorpusBasedMorphDisambiguator` is applied (see below for details).
  
By default, all processing steps are enabled. Flags `use_postanalysis`, `use_predisambiguation`, `use_vabamorf_disambiguator`, `use_postdisambiguation` can be used to turn corresponding steps on/off.
For instance, we can initialize VabamorfCorpusTagger without the _post-processing of analyses_:

In [8]:
vm_corpus_tagger = VabamorfCorpusTagger( use_postanalysis=False )
vm_corpus_tagger

| output layer | output attributes | input layers |
|---|---|---|
| `morph_analysis` | `('lemma', 'root', 'root_tokens', 'ending', 'clitic', 'form', 'partofspeech')` | `['words', 'sentences']` |

| | |
|---|---|
| validate_inputs | True |
| use_postanalysis | False |
| use_predisambiguation | True |
| use_vabamorf_disambiguator | True |
| use_postdisambiguation | True |
| vabamorf_analyser | `VabamorfAnalyzer(('words', 'sentences')->morph_analysis)` |
| postanalysis_tagger | |
| vabamorf_disambiguator | `VabamorfDisambiguator(('words', 'sentences', 'morph_analysis')->morph_analysis)` |
| cb_disambiguator | `CorpusBasedMorphDisambiguator(['words', 'sentences', 'morph_analysis']*->morph_analysis*)` |


Notes:
   * You can also change which taggers or components are used in the morphological analysis process by setting the input parameters `vabamorf_analyser`, `postanalysis_tagger`, `vabamorf_disambiguator`, and `cb_disambiguator`. Each custom tagger must be an instance of the original class or of a class that inherits from it. An exception is `postanalysis_tagger`, which can be any custom tagger that implements the interface class `estnltk.taggers.Retagger` and modifies the `output_layer`;


   * If you do not provide the `vabamorf_analyser` parameter, VabamorfCorpusTagger creates a `VabamorfAnalyzer` internally, and you can use the input parameters `propername`, `guess`, `compound` and `phonetic` to change its settings. Set these parameters only if you know what you are doing -- VabamorfCorpusTagger does not guard against errors and conflicts that result from wrong settings.

### Two-level input corpus (experimental)

Corpus-based disambiguation makes disambiguation decisions based on a broader context.
However, the right extent of this "broader context" depends on the types of texts you have, and finding it requires a bit of experimentation.
Sometimes, it is good to "broaden the context" step-wise, especially if the input corpus has an additional level in its structure. 
For instance, in a newspaper corpus, we can first group articles by publishing day (all articles published on a single day make up a group), and then group publishing days into a (publishing) week or month.
This follows a hypothesis that temporally closer groups should be disambiguated before temporally more distant groups.

For tagging input, `VabamorfCorpusTagger` accepts the following corpus structures: 
   * a list of `Text` objects;
   * a list of lists of `Text` objects.

The following example uses a two-level input corpus:

In [9]:
two_lvl_corpus = [['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',
                   'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '
                   'Uus võistlus toimub 2. mail.',
                   'Jänesega jokk.'],
                  ['Karu paistis silma suurima punktide summaga. Uue võistluse toimumisajaks '
                   'on 2. mai.',
                   'Karu ja Jänes jäävad ootama 2. maid.']]

In [10]:
# Prepare texts for morphological analysis
from estnltk import Text
nested_corpus = []
for articles in two_lvl_corpus:
    nested_corpus.append([])
    for t in articles:
        nested_corpus[-1].append( Text(t).tag_layer(['words','sentences']) )

In [11]:
# Create new VabamorfCorpusTagger
from estnltk.taggers import VabamorfCorpusTagger
vm_corpus_tagger = VabamorfCorpusTagger()

In [12]:
# Tag the nested corpus
vm_corpus_tagger.tag(nested_corpus)

[[Text(text='Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.'),
  Text(text='Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.'),
  Text(text='Jänesega jokk.')],
 [Text(text='Karu paistis silma suurima punktide summaga. Uue võistluse toimumisajaks on 2. mai.'),
  Text(text='Karu ja Jänes jäävad ootama 2. maid.')]]

What happens when a two-level input corpus is used? 
A two-level input corpus changes the **corpus-based post-disambiguation strategy**: first, corpus-based post-disambiguation is performed locally within each list of `Text` objects, and then it is performed globally over the whole corpus (the list of lists of `Text` objects).
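
The two-stage strategy can be sketched in plain Python. This is a simplified illustration, not EstNLTK's implementation: the toy groups, the candidate lemma lists and the helper names (`count_lemmas`, `resolve`, `two_level_postdisambiguate`) are hypothetical. Recounting lemmas after the local stage is also an assumption of the sketch.

```python
from collections import Counter

def count_lemmas(docs):
    """Count lemma candidates over a list of documents."""
    counts = Counter()
    for doc in docs:
        for _, lemmas in doc:
            counts.update(lemmas)
    return counts

def resolve(docs, counts):
    """Keep only the top-scoring lemma(s) of each ambiguous word."""
    resolved_docs = []
    for doc in docs:
        new_doc = []
        for form, lemmas in doc:
            if len(lemmas) > 1:
                best = max(counts[lemma] for lemma in lemmas)
                lemmas = [lemma for lemma in lemmas if counts[lemma] == best]
            new_doc.append((form, lemmas))
        resolved_docs.append(new_doc)
    return resolved_docs

def two_level_postdisambiguate(groups):
    # Stage 1: resolve inside every group, using group-local counts only
    local = [resolve(group, count_lemmas(group)) for group in groups]
    # Stage 2: resolve what is left, using counts over the whole corpus
    all_docs = [doc for group in local for doc in group]
    global_counts = count_lemmas(all_docs)
    return [resolve(group, global_counts) for group in local]

# Two groups (e.g. two publishing days), each a list of documents;
# a document is a list of (word form, candidate lemmas)
groups = [
    [[("koha", ["koha", "koht"]), ("koht", ["koht"])]],
    [[("koha", ["koha", "koht"])]],
]
result = two_level_postdisambiguate(groups)
print(result)
```

In the second group the local counts are tied, so _koha_ stays ambiguous after the first stage; it is resolved to _koht_ only once the counts from the whole corpus are used.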

### CorpusBasedMorphDisambiguator

Technically, `VabamorfCorpusTagger` just combines different morphological processing components into a small corpus processing pipeline. 
If you want to make your own processing pipeline (e.g. you want to use more tools, such as `UserDictTagger` or multiple post-analysers), then you can add corpus-based disambiguation into your pipeline with the help of `CorpusBasedMorphDisambiguator`.

An example:

In [13]:
from estnltk.taggers import CorpusBasedMorphDisambiguator
cb_disambiguator = CorpusBasedMorphDisambiguator()

In [14]:
# Input corpus
corpus_texts = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',
                'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '
                'Uus võistlus toimub 2. mail.',
                'Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse '
                'toimumisajaks on 2. mai.']

# Create Texts and add segmentation layers
from estnltk import Text
corpus = [Text(t).tag_layer(['words', 'sentences']) for t in corpus_texts]

In [15]:
# Add morph layer (without disambiguation)
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()
for text in corpus:
    vm_analyser.tag( text )

In [16]:
# Perform corpus-based predisambiguation
# Note: input must be a corpus, not a single Text object
cb_disambiguator.predisambiguate( corpus )

In [17]:
# Disambiguate morph layer (with Vabamorf)
from estnltk.taggers import VabamorfDisambiguator
vm_disambiguator = VabamorfDisambiguator()
for text in corpus:
    vm_disambiguator.retag( text )

In [18]:
# Perform corpus-based postdisambiguation
# Note: input must be a corpus, not a single Text object
cb_disambiguator.postdisambiguate( corpus )

Notes about CorpusBasedMorphDisambiguator:

  * `CorpusBasedMorphDisambiguator`'s constructor takes the following extra arguments:
    
     * `output_layer` -- Name of the morphological analysis layer (in the input Text objects) that needs to be disambiguated;
     
     * `input_words_layer` -- Name of the words layer (in the input Text objects);
     
     * `input_sentences_layer` -- Name of the sentences layer (in the input Text objects);
     
     * `count_position_duplicates_once` -- If set, then duplicate lemmas appearing in one word position will be counted only once during the post-disambiguation. An example: the word 'põhja' is ambiguous between 4 analyses: `[ ('põhi', 'S', 'adt'),  ('põhi', 'S', 'sg g'), ('põhi', 'S', 'sg p'), ('põhja', 'V', 'o') ]`. If `count_position_duplicates_once==False` (default), then the counter will find `{'põhi':3, 'põhja':1}` for the given word, but if `count_position_duplicates_once==True`, then the counts will be `{'põhi':1, 'põhja':1}`. (Default value: `False`) _Note:_ this is an experimental feature, and needs further testing;
    
     * `count_inside_compounds` -- If set, then the last words inside compound words (for example, `pääs` inside `edasipääs` and `luba` inside `ehitusluba`) will be additionally counted during the post-disambiguation. This can help to resolve ambiguities in non-compound words; for example, the non-compound word `pääsu` has a lemma ambiguity between `pääs` and `pääsu`, which can be resolved with the help of additional counts from compound words such as `edasipääs` or `finaalipääs`. (Default value: `False`) _Note:_ this is an experimental feature, and needs further testing;
     
     * `validate_inputs` -- If set (default), then input document collection will be validated for having the appropriate structure, and all documents will be checked for the existence of required layers. Normally, you wouldn't need to change this;


  * `CorpusBasedMorphDisambiguator.predisambiguate` -- tries to determine the correct analysis for words that have both proper name and common noun analyses. This requires that the input texts have been segmented into words and sentences, and that morphological analysis has been done with the settings `guess=True` and `propername=True`. The setting `propername=True` ensures that title-cased common nouns such as _Jänes_ or _Karu_ obtain an additional proper name analysis, which is then either deleted or preserved during the pre-disambiguation process;
  
  
  * `CorpusBasedMorphDisambiguator.postdisambiguate` -- tries to determine the correct analysis for all the remaining ambiguous words in the corpus (except some hard-to-resolve cases, such as ambiguities in _nud / tud / dud_ forms or singular / plural ambiguities in pronouns, which are always excluded from the disambiguation process). The input corpus needs to be morphologically analysed and disambiguated by VabamorfDisambiguator;
  
  
  * Like VabamorfCorpusTagger, CorpusBasedMorphDisambiguator accepts either 1) a list of `Text` objects or 2) a list of lists of `Text` objects as input. Note that input structure 2) only affects the `postdisambiguate` method, which is performed in two stages if the two-level structure is used.
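
The effect of the two experimental counting options, `count_position_duplicates_once` and `count_inside_compounds`, can be illustrated with a plain-Python sketch. It is not the library's code: the helper `count_lemmas`, the hand-written root lists and the use of `_` to mark compound boundaries (as in the root `lõpp_arvestus` shown earlier) are assumptions made for illustration.

```python
from collections import Counter

def count_lemmas(words, count_position_duplicates_once=False,
                 count_inside_compounds=False):
    """Count lemma candidates over a list of word positions.

    Each word position is a list of root strings, one per analysis;
    compound parts are separated by '_'.
    """
    counts = Counter()
    for roots in words:
        if count_position_duplicates_once:
            # Drop duplicate roots of this position, keeping their order
            roots = list(dict.fromkeys(roots))
        for root in roots:
            counts[root.replace("_", "")] += 1
            if count_inside_compounds and "_" in root:
                # Additionally count the last word of the compound
                counts[root.rsplit("_", 1)[1]] += 1
    return counts

# 'põhja': three of its four analyses share the root 'põhi'
pohja = [["põhi", "põhi", "põhi", "põhja"]]
default_counts = count_lemmas(pohja)
once_counts = count_lemmas(pohja, count_position_duplicates_once=True)
print(default_counts, once_counts)

# 'pääsu' is ambiguous; the compounds add evidence for 'pääs'
words = [["pääs", "pääsu"], ["edasi_pääs"], ["finaali_pääs"]]
plain = count_lemmas(words)
with_compounds = count_lemmas(words, count_inside_compounds=True)
print(plain["pääs"], with_compounds["pääs"])
```

With compound counting switched on, _pääs_ outnumbers _pääsu_, so a frequency-based choice would prefer _pääs_ for the ambiguous word _pääsu_.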

## References

* Kaalep, Heiki-Jaan, and Tarmo Vaino. "Complete morphological analysis in the linguist's toolbox." Congressus Nonus Internationalis Fenno-Ugristarum Pars V (2001): 9-16.
* Gale, William A., Kenneth W. Church, and David Yarowsky. "One sense per discourse." Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992.
* Kaalep, Heiki-Jaan, Riin Kirt, and Kadri Muischnek. "A trivial method for choosing the right lemma." Baltic HLT (2012): 82-89.