## <span style="color:purple">Morphological analysis with text-based and corpus-based disambiguation</span>

EstNLTK's default morphological analysis uses a probabilistic disambiguator which relies on the local sentence context in making the disambiguation decisions (Kaalep and Vaino 2001). 
This works reasonably well for many types of texts: news articles, comments, mixed content etc.

However, the default disambiguator has difficulty getting proper name analyses right when proper names overlap with regular words (e.g. family names that coincide with animal names, such as _Jänes_ or _Karu_).
Furthermore, there are word forms whose ambiguity is rather difficult to resolve locally. 
For instance, _koha_ can be either _a place_ (in genitive), or a _walleye (a fish)_ (in nominative) -- we need some knowledge about the topic of the text to pick the most likely candidate ( e.g. whether the text talks about _sports events_ or describes _nature_ ). 
These disambiguation problems can be addressed by using a broader context (e.g. a collection of texts) in the morphological disambiguation process.
If morphologically ambiguous words reoccur in other parts of the text or in other related texts, one can use the assumption "one lemma per discourse" (inspired by the observation "one sense per discourse" from Word Sense Disambiguation (Gale et al. 1992)) and choose the right analysis based on the most frequently occurring lemma candidate (Kaalep et al. 2012).
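
The "one lemma per discourse" idea can be illustrated with a minimal pure-Python sketch. This is a simplification for illustration only, not EstNLTK's actual implementation: the function name `resolve_by_lemma_counts` and the toy input are hypothetical, and each word is reduced to a list of candidate lemmas.

```python
from collections import Counter

def resolve_by_lemma_counts(words):
    """Pick, for each word, the candidate lemma(s) most frequent in the discourse.

    `words` is a list of (surface_form, candidate_lemmas) pairs.
    Ties are left ambiguous.
    """
    # Count every candidate lemma over the whole discourse
    counts = Counter(lemma for _, candidates in words for lemma in candidates)
    resolved = []
    for surface, candidates in words:
        best = max(counts[lemma] for lemma in candidates)
        # dict.fromkeys drops duplicate candidates while keeping their order
        winners = [lemma for lemma in dict.fromkeys(candidates) if counts[lemma] == best]
        resolved.append((surface, winners))
    return resolved

# Toy discourse: 'koha' is ambiguous twice, 'koht' occurs once unambiguously
toy_discourse = [
    ("koha", ["koht", "koha"]),
    ("koha", ["koht", "koha"]),
    ("koht", ["koht"]),
]
resolved = resolve_by_lemma_counts(toy_discourse)
for surface, lemmas in resolved:
    print(surface, lemmas)
```

In this toy input, the unambiguous occurrence of _koht_ tips the counts, so both occurrences of _koha_ end up with the lemma _koht_.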

Considering the context used during the morphological disambiguation, we can distinguish sentence-based (the default), text-based and corpus-based disambiguation.

### Text-based morphological disambiguation

Let's first consider an example text, which contains ambiguities that cannot be resolved by the default disambiguation:

In [1]:
# Create text object
from estnltk import Text
text = Text('''Jänes oli võistluse favoriit, aga ka Karu suutis üllatada.
Kes sai kõrgeima koha? Juba 2. koha peale käis pingeline võitlus.
Karu tegi lõpuspurdi, hüppas üle Jänese ja maandus finiši ees põõsas.
Karu audis, Jänesele esimene koht.''')

# Add morphological analysis with the default disambiguation
text = text.tag_layer('morph_analysis')

In [2]:
# Examine title-cased words and words with ambiguous analyses
for word in text.words:
    if len(word.morph_analysis.annotations) > 1 or word.text.istitle():
        print(word.text,[(a.root, a.partofspeech, a.form) for a in word.morph_analysis.annotations])

Jänes [('jänes', 'S', 'sg n')]
Karu [('Karu', 'H', 'sg n')]
Kes [('kes', 'P', 'sg n'), ('kes', 'P', 'pl n')]
koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')]
Juba [('juba', 'D', '')]
koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')]
Karu [('Karu', 'H', 'sg n')]
Jänese [('Jänes', 'H', 'sg g'), ('Jänese', 'H', 'sg g')]
Karu [('Karu', 'H', 'sg n')]
Jänesele [('jänes', 'S', 'sg all')]


Note that the common noun _koha_ remains ambiguous, the proper noun _Jänes_ is mistakenly analysed as a common noun (part of speech 'S'), and _Jänese_ has lemma ambiguity (_Jänese_ vs _Jänes_).

Now, we can use `VabamorfTagger` with options `predisambiguate=True` and `postdisambiguate=True` to provide an enhanced, text-based disambiguation to the given text:

In [3]:
# Initialize VabamorfTagger for text-based disambiguation
from estnltk.taggers import VabamorfTagger
vm_tagger = VabamorfTagger(output_layer='morph_analysis_with_tb_disamb',
                           predisambiguate=True, 
                           postdisambiguate=True )
text = vm_tagger.tag(text)

Details about the flags `predisambiguate` and `postdisambiguate` are covered in the sections below; here, we can just note that they provide "one lemma per discourse" disambiguation within a single text.

In [4]:
# Compare the results: output differences in annotations
for word_id, word_span in enumerate( text.words ):
    default_annotations    = text['morph_analysis'][word_id].annotations
    text_based_annotations = text['morph_analysis_with_tb_disamb'][word_id].annotations
    if default_annotations != text_based_annotations:
        annotations_1 = [(a.root, a.partofspeech, a.form) for a in default_annotations]
        annotations_2 = [(a.root, a.partofspeech, a.form) for a in text_based_annotations]
        print(word_span.text, annotations_1,'=>',annotations_2)
        print()

Jänes [('jänes', 'S', 'sg n')] => [('Jänes', 'H', 'sg n')]

koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')] => [('koht', 'S', 'sg g')]

koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')] => [('koht', 'S', 'sg g')]

Jänese [('Jänes', 'H', 'sg g'), ('Jänese', 'H', 'sg g')] => [('Jänes', 'H', 'sg g')]

Jänesele [('jänes', 'S', 'sg all')] => [('Jänes', 'H', 'sg all')]



Note that the common noun _koha_ has been correctly disambiguated, and _Jänes_ has been correctly analysed as a proper noun (part of speech 'H').

_Remark_ :

   * If you do not specify the argument `output_layer` in VabamorfTagger's constructor, the name 'morph\_analysis' will be used by default. If this layer already exists (as in the previous example), this will result in an error;

<div class="alert alert-block alert-warning">
<h4><i>Switching on text-based disambiguation via <code>make_resolver</code></i></h4>
<br>
Naturally, you do not need to import <code>VabamorfTagger</code> in order to switch on <i>text-based disambiguation</i> -- you can also pass the flags <code>predisambiguate</code> and <code>postdisambiguate</code> to <code>make_resolver</code>. Example:
<pre>
# Make new resolver
from estnltk.resolve_layer_dag import make_resolver
text_based_disamb_resolver = make_resolver(predisambiguate=True, postdisambiguate=True)
# Create text
text = Text( ... )
# Add morph_analysis with text-based disambiguation
text.tag_layer(resolver=text_based_disamb_resolver)['morph_analysis']
</pre>
</div>

### Corpus-based morphological disambiguation

Sometimes, a context wider than a single text is required to make better disambiguation decisions. 
For this, EstNLTK provides a special tagger called `VabamorfCorpusTagger`.

Let's now consider an example of a text collection. Each text contains some ambiguities that cannot be correctly resolved within the text alone -- other texts must also be taken into account:

In [5]:
corpus_texts = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',\
                'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '+\
                'Uus võistlus toimub 2. mail.', \
                'Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse '+\
                'toimumisajaks on 2. mai.']

In [6]:
# Create Texts and apply default morphological analysis
from estnltk import Text
corpus = [Text(t).tag_layer(['morph_analysis']) for t in corpus_texts]

In [7]:
# Examine words with ambiguous analyses
for text in corpus:
    for word in text.words:
        if len(word.morph_analysis.annotations) > 1:
            print(word.text,[(a.root, a.partofspeech, a.form) for a in word.morph_analysis.annotations])

kohale [('koht', 'S', 'sg all'), ('koha', 'S', 'sg all')]
kuigi [('kuigi', 'J', ''), ('kuigi', 'D', '')]
Teise [('teine', 'P', 'sg g'), ('teine', 'O', 'sg g')]
koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')]
mail [('mai', 'S', 'sg ad'), ('maa', 'S', 'pl ad')]
summaga [('summ', 'S', 'sg kom'), ('summa', 'S', 'sg kom')]
on [('ole', 'V', 'b'), ('ole', 'V', 'vad')]


Note that the content nouns _kohale_ , _koha_ , _mail_ and _summaga_ remain ambiguous in this process.

In [8]:
# Examine analyses of titlecase words
for text in corpus:
    for word in text.words:
        if word.text.istitle():
            print(word.text, [(a.root, a.partofspeech, a.form) for a in word.morph_analysis.annotations])

Esimesele [('esimene', 'O', 'sg all')]
Jänes [('jänes', 'S', 'sg n')]
Lõpparvestuses [('lõpp_arvestus', 'S', 'sg in')]
Karule [('karu', 'S', 'sg all')]
Teise [('teine', 'P', 'sg g'), ('teine', 'O', 'sg g')]
Jänes [('jänes', 'S', 'sg n')]
Uus [('uus', 'A', 'sg n')]
Karu [('Karu', 'H', 'sg n')]
Jänesega [('jänes', 'S', 'sg kom')]
Uue [('uus', 'A', 'sg g')]


Note that the proper names _Jänes_ , _Karule_ and _Jänesega_ are incorrectly analysed as common nouns (part of speech 'S').

Let's use corpus-based disambiguation to solve these problems!

VabamorfCorpusTagger performs morphological analysis, local context morphological disambiguation, and "one lemma per discourse" disambiguation on a list of Text objects:

In [9]:
# Create new VabamorfCorpusTagger. 
# Use a different name for morph analysis layer
from estnltk.taggers import VabamorfCorpusTagger
vm_corpus_tagger = VabamorfCorpusTagger( output_layer='morph_analysis_with_cb_disamb' )

In [10]:
# Retag the corpus, apply corpus-based morph analysis
vm_corpus_tagger.tag( corpus )

[Text(text='Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.'),
 Text(text='Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.'),
 Text(text='Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse toimumisajaks on 2. mai.')]

Let's examine the results:

In [11]:
# Output improvements of corpus-based disambiguation
for text in corpus:
    for word in text.words:
        # old morph analyses
        old_morph = word.morph_analysis
        # new morph analyses
        new_morph = word.morph_analysis_with_cb_disamb
        if old_morph != new_morph:
            # Output changed analyses
            old_analyses = [(a.root, a.partofspeech, a.form) for a in old_morph.annotations]
            new_analyses = [(a.root, a.partofspeech, a.form) for a in new_morph.annotations]
            print(word.text, old_analyses, '-->', new_analyses)
            print()

kohale [('koht', 'S', 'sg all'), ('koha', 'S', 'sg all')] --> [('koht', 'S', 'sg all')]

Jänes [('jänes', 'S', 'sg n')] --> [('Jänes', 'H', 'sg n')]

Karule [('karu', 'S', 'sg all')] --> [('Karu', 'H', 'sg all')]

koha [('koht', 'S', 'sg g'), ('koha', 'S', 'sg g')] --> [('koht', 'S', 'sg g')]

Jänes [('jänes', 'S', 'sg n')] --> [('Jänes', 'H', 'sg n')]

mail [('mai', 'S', 'sg ad'), ('maa', 'S', 'pl ad')] --> [('mai', 'S', 'sg ad')]

summaga [('summ', 'S', 'sg kom'), ('summa', 'S', 'sg kom')] --> [('summa', 'S', 'sg kom')]

Jänesega [('jänes', 'S', 'sg kom')] --> [('Jänes', 'H', 'sg kom')]



Note that the content nouns _kohale_ , _koha_ , _mail_ and _summaga_ are no longer ambiguous, and that ambiguities of these words have been resolved correctly. And _Jänes_ , _Jänesega_ and _Karule_ have been correctly analysed as proper nouns (part of speech 'H').
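
One intuition for why a wider context helps with proper names: a word that is capitalised in the middle of a sentence is good evidence that its lemma is used as a name, and this evidence can then be applied to sentence-initial occurrences elsewhere in the corpus. The sketch below illustrates only this intuition. It is an assumed, simplified heuristic and not necessarily what EstNLTK does internally; the function names and the hand-written lemmas are hypothetical.

```python
from collections import Counter

def capitalised_mid_sentence(sentences):
    """Count lemmas of words that are capitalised in a non-initial position."""
    evidence = Counter()
    for sentence in sentences:
        for position, (surface, lemma) in enumerate(sentence):
            if position > 0 and surface[:1].isupper():
                evidence[lemma] += 1
    return evidence

def looks_like_proper_name(surface, lemma, evidence):
    """A capitalised word is treated as a name if its lemma has corpus evidence."""
    return surface[:1].isupper() and evidence[lemma] > 0

# Toy corpus: sentences as (surface form, lemma) pairs, lemmas written by hand
sentences = [
    [("Esimesele", "esimene"), ("kohale", "koht"), ("tuleb", "tulema"), ("Jänes", "jänes")],
    [("Lõpparvestuses", "lõpparvestus"), ("läks", "minema"), ("Karule", "karu")],
    [("Karu", "karu"), ("paistis", "paistma"), ("silma", "silm")],
    [("Jänesega", "jänes"), ("jokk", "jokk")],
]
evidence = capitalised_mid_sentence(sentences)
for sentence in sentences:
    surface, lemma = sentence[0]
    print(surface, looks_like_proper_name(surface, lemma, evidence))
```

Here the sentence-initial _Karu_ and _Jänesega_ are flagged as likely names because _Karule_ and _Jänes_ occur capitalised mid-sentence in other texts, while _Esimesele_ and _Lõpparvestuses_ are not.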

_Remarks_ :
   * Applying VabamorfCorpusTagger on a list consisting of a single text is the same as applying _text-based disambiguation_ on that text;


   * If you do not specify the argument `output_layer` in VabamorfCorpusTagger's constructor, the name 'morph\_analysis' will be used by default. If this layer already exists (as in the previous example), this will result in an error;
   
   
   * You can also change names of the input layers via arguments `input_words_layer`, `input_sentences_layer` and `input_compound_tokens_layer`;

#### Details of VabamorfCorpusTagger

Under the hood, VabamorfCorpusTagger applies the following processing steps:

  1. **Morphological analysis** of all the words in input texts, including guessing analyses for unknown words and proper names. This is done by applying `VabamorfAnalyzer`;
  
  2. **Post-processing of analyses**, which includes correcting words that contain numbers, correcting part of speech of compound tokens (such as names with initials, emoticons, abbreviations, numerics etc.), and marking some of the words as to be ignored during the morphological disambiguation; This is done by `PostMorphAnalysisTagger`;
  
  3. **Corpus-based pre-disambiguation of proper nouns**, which involves picking out proper name analyses based on lemma counts in the corpus; For this, a special analysis component `CorpusBasedMorphDisambiguator` is applied (see below for details);
  
  4. **Morphological disambiguation**, which involves picking out the most probable analyses for each word (based on the sentence context); This is done by `VabamorfDisambiguator`;
  
  5. **Corpus-based post-disambiguation**, which involves resolving remaining ambiguities based on lemma counts in the corpus; For this, `CorpusBasedMorphDisambiguator` is applied (see below for details);

  6. **Reordering ambiguities**, which involves sorting remaining ambiguous analyses by their corpus frequency, using the frequencies [obtained from the Estonian UD corpus](https://github.com/estnltk/ambiguous-morph-reordering/). This is done by a special tagger called `MorphAnalysisReorderer`;
  
By default, all processing steps are enabled. Flags `use_postanalysis`, `use_predisambiguation`, `use_vabamorf_disambiguator`, `use_postdisambiguation`, `use_reorderer` can be used to turn corresponding steps on/off.
For instance, we can initialize VabamorfCorpusTagger without the _post-processing of analyses_:

In [12]:
vm_corpus_tagger = VabamorfCorpusTagger( use_postanalysis=False )
vm_corpus_tagger

output layer,output attributes,input layers
morph_analysis,"('normalized_text', 'lemma', 'root', 'root_tokens', 'ending', 'clitic', 'form', 'partofspeech')","['words', 'sentences']"

0,1
validate_inputs,True
use_postanalysis,False
use_predisambiguation,True
use_vabamorf_disambiguator,True
use_postdisambiguation,True
use_reorderer,True
slang_lex,False
vabamorf_analyser,"VabamorfAnalyzer(('words', 'sentences')->morph_analysis)"
postanalysis_tagger,
vabamorf_disambiguator,"VabamorfDisambiguator(('words', 'sentences', 'morph_analysis')->morph_analysis)"


Notes:
   * you can also change which taggers or components are used in the morphological analysis process. For this, input parameters `vabamorf_analyser`, `postanalysis_tagger`, `vabamorf_disambiguator`, `analysis_reorderer`, and `cb_disambiguator` can be set.
All custom taggers must be instances of their original class or of a class that inherits from it. An exception is `postanalysis_tagger`, which can be a custom tagger that implements the interface class `estnltk.taggers.Retagger` and modifies the `output_layer`;


   * if you do not provide the `vabamorf_analyser` parameter, then VabamorfCorpusTagger will create the `VabamorfAnalyzer` internally, and you can use the input parameters `propername`, `guess`, `compound` and `phonetic` to change its settings. Additionally, the parameter `slang_lex` can be used to initialize the version of lexicon extended with spoken and slang words. But you should set these parameters only when you know what you are doing -- VabamorfCorpusTagger does not guard you against errors and conflicts resulting from wrong settings;
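
The relationship between the processing steps and the on/off flags described above can be summarised in a small pure-Python sketch. The flag names are the ones documented for VabamorfCorpusTagger, but the table `PIPELINE` and the function `enabled_steps` are hypothetical illustrations and not part of EstNLTK. The sketch also assumes that the first step (morphological analysis) is always on, since no flag is listed for it.

```python
# (step name, flag that switches it on/off); None = no flag documented
PIPELINE = [
    ("morphological analysis", None),
    ("post-processing of analyses", "use_postanalysis"),
    ("corpus-based pre-disambiguation", "use_predisambiguation"),
    ("morphological disambiguation", "use_vabamorf_disambiguator"),
    ("corpus-based post-disambiguation", "use_postdisambiguation"),
    ("reordering ambiguities", "use_reorderer"),
]

def enabled_steps(**flags):
    """Return the names of the steps that would run, in order.

    Every flag defaults to True, mirroring "all steps enabled by default".
    """
    known = {flag for _, flag in PIPELINE if flag is not None}
    unknown = set(flags) - known
    if unknown:
        raise ValueError("Unknown flags: %s" % sorted(unknown))
    return [name for name, flag in PIPELINE
            if flag is None or flags.get(flag, True)]

steps = enabled_steps(use_postanalysis=False)
for number, name in enumerate(steps, start=1):
    print(number, name)
```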

### Two-level input corpus [ Experimental ]

Corpus-based disambiguation makes disambiguation decisions based on a broader context.
However, the extent of this "broader context" depends on the types of texts you have, and finding a suitable one requires a bit of experimentation.
Sometimes, it is good to "broaden the context" step-wise, especially if the input corpus has an additional level in its structure. 
For instance, in a newspaper corpus, we can first group articles by publishing day (all articles published on a single day make up a group), and then group publishing days into a (publishing) week or month.
This follows a hypothesis that temporally closer groups should be disambiguated before temporally more distant groups.

For tagging input, `VabamorfCorpusTagger` accepts the following corpus structures: 
   * a list of Text-s;
   * a list of lists of Text-s;

The following is an example of using a two-level input corpus:

In [13]:
two_lvl_corpus = [['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',\
                   'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '+\
                   'Uus võistlus toimub 2. mail.',\
                   'Jänesega jokk.'], \
                  ['Karu paistis silma suurima punktide summaga. Uue võistluse toimumisajaks '+\
                   'on 2. mai.', \
                   'Karu ja Jänes jäävad ootama 2. maid.']]

In [14]:
# Prepare texts for morphological analysis
from estnltk import Text
nested_corpus = []
for articles in two_lvl_corpus:
    nested_corpus.append([])
    for t in articles:
        nested_corpus[-1].append( Text(t).tag_layer(['words','sentences']) )

In [15]:
# Create new VabamorfCorpusTagger
from estnltk.taggers import VabamorfCorpusTagger
vm_corpus_tagger = VabamorfCorpusTagger()

In [16]:
# Tag the nested corpus
vm_corpus_tagger.tag(nested_corpus)

[[Text(text='Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.'),
  Text(text='Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. Uus võistlus toimub 2. mail.'),
  Text(text='Jänesega jokk.')],
 [Text(text='Karu paistis silma suurima punktide summaga. Uue võistluse toimumisajaks on 2. mai.'),
  Text(text='Karu ja Jänes jäävad ootama 2. maid.')]]

What happens if a two-level input corpus is used? 
A two-level input corpus changes the **corpus-based post-disambiguation strategy**: first, corpus-based post-disambiguation is performed locally within each list of Text-s, and after that, it is performed globally over the whole corpus (on the list of lists of Text-s).
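
The local-then-global order can be sketched in plain Python. This is an illustrative simplification under stated assumptions, not EstNLTK's implementation: the helper names are hypothetical, each text is reduced to a list of (surface form, candidate lemmas) pairs, and the toy lemma candidates are written by hand.

```python
from collections import Counter

def count_lemmas(texts):
    """Count all candidate lemmas over an iterable of texts."""
    return Counter(lemma for text in texts for _, candidates in text for lemma in candidates)

def resolve(text, counts):
    """Keep the candidate lemma(s) with the highest count; ties stay ambiguous."""
    resolved = []
    for surface, candidates in text:
        best = max(counts[lemma] for lemma in candidates)
        resolved.append((surface, [l for l in candidates if counts[l] == best]))
    return resolved

def two_stage_postdisambiguate(groups):
    """`groups` is a list of lists of texts (the two-level corpus)."""
    # Stage 1: resolve each group using only the counts from that group
    local = []
    for group in groups:
        group_counts = count_lemmas(group)
        local.append([resolve(text, group_counts) for text in group])
    # Stage 2: resolve what is left using counts pooled over the whole corpus
    global_counts = count_lemmas(text for group in local for text in group)
    return [[resolve(text, global_counts) for text in group] for group in local]

groups = [
    [[("mail", ["mai", "maa"])]],                      # group 1: one text
    [[("mai", ["mai"])], [("maid", ["mai", "maa"])]],  # group 2: two texts
]
result = two_stage_postdisambiguate(groups)
print(result)
```

In this toy corpus, the local stage resolves _maid_ inside the second group, where an unambiguous _mai_ is available. _mail_ stays tied within the first group and is resolved only in the global stage, using the pooled counts.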

### Disambiguating compound words [ Experimental ]

Lemma-counting in the post-disambiguation can be extended in a way that the last words of _compound words_ will be counted separately, and this additional information is used for disambiguating both the non-compound words and compound words.

Let's first consider an example corpus, which contains ambiguities that cannot be resolved solely based on the default disambiguation strategy:

In [17]:
corpus_texts = ['Kolmas koht Kofu katserajal ja edasipääs tagatud. Võimete lagi pole veel sugugi käes!',\
                'Poolfinaali pääsu korral oleks ta esimene naissoost poolfinalist. Kes ütles, et ei saa klaaslaest edasi?',\
                'Konkurendid rajal on sõnatud, sugu ei määra siin midagi. Aga rada ise siiski ka määrab...',\
                'See soo küsimus jäi eelnevas ikkagi poolikuks.']

In [18]:
# Create Texts and apply default morphological analysis
from estnltk import Text
corpus = [Text(t).tag_layer(['morph_analysis']) for t in corpus_texts]

In [19]:
# Inspect words with ambiguous analyses
for text in corpus:
    for word in text.words:
        if len(word.morph_analysis.annotations) > 1:
            print(word.text,[(a.root, a.partofspeech, a.form) for a in word.morph_analysis.annotations])

katserajal [('katse_rada', 'S', 'sg ad'), ('katse_raja', 'S', 'sg ad')]
tagatud [('taga', 'V', 'tud'), ('taga=tud', 'A', ''), ('taga=tud', 'A', 'pl n'), ('taga=tud', 'A', 'sg n')]
pääsu [('pääs', 'S', 'sg g'), ('pääsu', 'S', 'sg g')]
Kes [('kes', 'P', 'sg n'), ('kes', 'P', 'pl n')]
klaaslaest [('klaas_laad', 'S', 'sg el'), ('klaas_lagi', 'S', 'sg el')]
rajal [('rada', 'S', 'sg ad'), ('raja', 'S', 'sg ad')]
on [('ole', 'V', 'b'), ('ole', 'V', 'vad')]
ise [('ise', 'P', 'sg n'), ('ise', 'P', 'pl n')]
soo [('sugu', 'S', 'sg g'), ('soo', 'S', 'sg g')]


To switch on the disambiguation of compound words, we need to create an instance of `CorpusBasedMorphDisambiguator` (see below for details about the class) with the parameter `disamb_compound_words`:

In [20]:
# Create new VabamorfCorpusTagger & CorpusBasedMorphDisambiguator 
from estnltk.taggers import VabamorfCorpusTagger, CorpusBasedMorphDisambiguator

cb_disambiguator = CorpusBasedMorphDisambiguator( output_layer='morph_analysis_with_cb_disamb', 
                                                  disamb_compound_words=True )
vm_corpus_tagger = VabamorfCorpusTagger( output_layer='morph_analysis_with_cb_disamb',
                                         cb_disambiguator=cb_disambiguator)

# Tag the corpus
vm_corpus_tagger.tag( corpus )

[Text(text='Kolmas koht Kofu katserajal ja edasipääs tagatud. Võimete lagi pole veel sugugi käes!'),
 Text(text='Poolfinaali pääsu korral oleks ta esimene naissoost poolfinalist. Kes ütles, et ei saa klaaslaest edasi?'),
 Text(text='Konkurendid rajal on sõnatud, sugu ei määra siin midagi. Aga rada ise siiski ka määrab...'),
 Text(text='See soo küsimus jäi eelnevas ikkagi poolikuks.')]

In [21]:
# Inspect improvements of corpus-based disambiguation
for text in corpus:
    for word in text.words:
        # old morph analyses
        old_morph = word.morph_analysis
        # new morph analyses
        new_morph = word.morph_analysis_with_cb_disamb
        if old_morph != new_morph:
            # Output changed analyses
            old_analyses = [(a.root, a.partofspeech, a.form) for a in old_morph.annotations]
            new_analyses = [(a.root, a.partofspeech, a.form) for a in new_morph.annotations]
            print(word.text, old_analyses, '-->', new_analyses)
            print()

katserajal [('katse_rada', 'S', 'sg ad'), ('katse_raja', 'S', 'sg ad')] --> [('katse_rada', 'S', 'sg ad')]

pääsu [('pääs', 'S', 'sg g'), ('pääsu', 'S', 'sg g')] --> [('pääs', 'S', 'sg g')]

klaaslaest [('klaas_laad', 'S', 'sg el'), ('klaas_lagi', 'S', 'sg el')] --> [('klaas_lagi', 'S', 'sg el')]

rajal [('rada', 'S', 'sg ad'), ('raja', 'S', 'sg ad')] --> [('rada', 'S', 'sg ad')]

soo [('sugu', 'S', 'sg g'), ('soo', 'S', 'sg g')] --> [('sugu', 'S', 'sg g')]



In the previous example, the non-compound word `pääsu` was disambiguated using the additional information from compound word `edasipääs`. The compound word `katserajal` was disambiguated based on the information from non-compound words `rajal` and `rada`. And the compound word `klaaslaest` was disambiguated based on the information from non-compound word `lagi`.
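
The counting behind this can be sketched in a few lines of plain Python. This is a simplified illustration, not EstNLTK's implementation: the helper names are hypothetical, the roots use the `_` compound separator seen in the outputs above, and the root `edasi_pääs` for _edasipääs_ is an assumption, since it is not shown in the printed analyses.

```python
from collections import Counter

def last_component(root):
    """'katse_rada' -> 'rada'; non-compound roots are returned unchanged."""
    return root.rsplit("_", 1)[-1]

def resolve_with_compounds(words):
    """Count last components of all candidate roots, then keep the best candidates."""
    counts = Counter(last_component(root) for _, candidates in words for root in candidates)
    resolved = []
    for surface, candidates in words:
        best = max(counts[last_component(root)] for root in candidates)
        winners = [root for root in candidates if counts[last_component(root)] == best]
        resolved.append((surface, winners))
    return resolved

# Toy input based on the example corpus: (surface form, candidate roots)
words = [
    ("katserajal", ["katse_rada", "katse_raja"]),
    ("rajal", ["rada", "raja"]),
    ("rada", ["rada"]),
    ("klaaslaest", ["klaas_laad", "klaas_lagi"]),
    ("lagi", ["lagi"]),
    ("pääsu", ["pääs", "pääsu"]),
    ("edasipääs", ["edasi_pääs"]),   # assumed root
]
resolved = dict(resolve_with_compounds(words))
for surface, roots in resolved.items():
    print(surface, roots)
```

Because compound and non-compound words feed the same counter, the evidence flows in both directions: _rada_ supports `katse_rada`, and `edasi_pääs` supports `pääs`.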

### CorpusBasedMorphDisambiguator

Technically, `VabamorfTagger` and `VabamorfCorpusTagger` just combine different morphological processing components into a small corpus processing pipeline. 
If you want to make your own processing pipeline (e.g. you want to use more tools, such as `UserDictTagger` or multiple post-analysers), then you can add corpus-based disambiguation into your pipeline with the help of `CorpusBasedMorphDisambiguator`.

An example:

In [22]:
from estnltk.taggers import CorpusBasedMorphDisambiguator
cb_disambiguator = CorpusBasedMorphDisambiguator()

In [23]:
# Input corpus
corpus_texts = ['Esimesele kohale tuleb Jänes, kuigi tema punktide summa pole kõrgeim.',\
                'Lõpparvestuses läks Karule esimene koht. Teise koha sai seekord Jänes. '+\
                'Uus võistlus toimub 2. mail.', \
                'Karu paistis silma suurima punktide summaga. Jänesega jokk. Uue võistluse '+\
                'toimumisajaks on 2. mai.']

# Create Texts and add segmentation layers
from estnltk import Text
corpus = [Text(t).tag_layer(['words', 'sentences']) for t in corpus_texts]

In [24]:
# Add morph layer (without disambiguation)
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()
for text in corpus:
    vm_analyser.tag( text )

In [25]:
# Perform corpus-based predisambiguation
# Note: input must be a corpus, not a single Text object
cb_disambiguator.predisambiguate( corpus )

In [26]:
# Disambiguate morph layer (with Vabamorf)
from estnltk.taggers import VabamorfDisambiguator
vm_disambiguator = VabamorfDisambiguator()
for text in corpus:
    vm_disambiguator.retag( text )

In [27]:
# Perform corpus-based postdisambiguation
# Note: input must be a corpus, not a single Text object
cb_disambiguator.postdisambiguate( corpus )

In [28]:
# Perform reordering of remaining ambiguities
from estnltk.taggers import MorphAnalysisReorderer
vm_reorderer = MorphAnalysisReorderer()
for text in corpus:
    vm_reorderer.retag( text )

Notes about CorpusBasedMorphDisambiguator:


  * `CorpusBasedMorphDisambiguator`'s constructor takes the following extra arguments:
    
     * `output_layer` -- Name of the morphological analysis layer (in the input Text objects) that needs to be disambiguated;
     
     * `input_words_layer` -- Name of the words layer (in the input Text objects);
     
     * `input_sentences_layer` -- Name of the sentences layer (in the input Text objects);
     
     * `count_position_duplicates_once` -- If set, then duplicate lemmas appearing in one word position will be counted only once during the post-disambiguation. An example: the word 'põhja' is ambiguous between 4 analyses: `[ ('põhi', 'S', 'adt'), ('põhi', 'S', 'sg g'), ('põhi', 'S', 'sg p'), ('põhja', 'V', 'o') ]`. If `count_position_duplicates_once==False` (default), then the counter will find `{'põhi':3, 'põhja':1}` for the given word, but if `count_position_duplicates_once==True`, then the counts will be: `{'põhi':1, 'põhja':1}`. (Default value: `False`) _Note:_ this is an experimental feature, and needs further testing;
    
     * `disamb_compound_words` -- If set, then additionally counts last words inside compound words during the post-disambiguation, and uses the counting information for disambiguating both the non-compound words and compound words. For example, the non-compound word `pääsu` has a lemma ambiguity between `pääs` and `pääsu`, which can be resolved with the help from additional counts from compound words such as `edasipääs` or `finaalipääs`; (Default value: `False`) _Note:_ this is an experimental feature, and needs further testing;

     * `ignore_lemmas_in_compounds` -- Set of lemmas which should be neither counted nor disambiguated as the last words of compound words (if `disamb_compound_words==True`), because their disambiguation is unreliable and likely leads to errors. Default: `{'alus','alune','mai','maa'}`;
     
     * `validate_inputs` -- If set (default), then input document collection will be validated for having the appropriate structure, and all documents will be checked for the existence of required layers. Normally, you wouldn't need to change this;


`CorpusBasedMorphDisambiguator`'s methods:

  * `predisambiguate(collections)` -- processes `collections` and tries to determine the correct analysis for words that have both proper name and common noun analyses. This requires that input texts in `collections` have been segmented into words and sentences, and that morphological analysis has been done with settings `guess=True` and `propername=True`. The setting `propername=True` ensures that title-cased common nouns such as _Jänes_ or _Karu_ will obtain an additional proper name analysis, which will be either deleted or preserved during the pre-disambiguation process;
  
     * `collections` can be either 1) a list of Text-s or, 2) a list of lists of Text-s;


  * `postdisambiguate(collections)` -- processes `collections` and tries to determine the correct analysis for all the remaining ambiguous words in the corpus (except some hard-to-resolve cases, such as ambiguities in _nud / tud / dud_ forms or singular / plural ambiguities in pronouns, which are always excluded from the disambiguation process). The input `collections` needs to be morphologically analysed and disambiguated by VabamorfDisambiguator;
  
     * `collections` can be either 1) a list of Text-s or, 2) a list of lists of Text-s;


  * `_predisambiguate_detached_layers(collections, detached_layers)` -- does the same as `predisambiguate`, but expects that required input layers have been detached from `Text` objects and are available in `detached_layers`. Basically, the interface is similar to that of the `_change_layer` method of Retagger.
  
     * `collections` can be either 1) a list of Text-s or, 2) a list of lists of Text-s;
     * `detached_layers` can be either 1) a list of lists of Layer-s or, 2) a list of lists of lists of Layer-s;


  * `_postdisambiguate_detached_layers(collections, detached_layers)` -- does the same as `postdisambiguate`, but expects that required input layers have been detached from `Text` objects and are available in `detached_layers`. Basically, the interface is similar to that of the `_change_layer` method of Retagger.

     * `collections` can be either 1) a list of Text-s or, 2) a list of lists of Text-s;
     * `detached_layers` can be either 1) a list of lists of Layer-s or, 2) a list of lists of lists of Layer-s;


Note: the input structure 2) only affects the `postdisambiguate` and `_postdisambiguate_detached_layers` methods, which perform the disambiguation in two stages if a two-level structure is used;

## References

* Kaalep, Heiki-Jaan, and Tarmo Vaino. "Complete morphological analysis in the linguist's toolbox." Congressus Nonus Internationalis Fenno-Ugristarum Pars V (2001): 9-16.
* Gale, William A., Kenneth W. Church, and David Yarowsky. "One sense per discourse." Proceedings of the workshop on Speech and Natural Language. Association for Computational Linguistics, 1992.
* Kaalep, Heiki-Jaan, Riin Kirt, and Kadri Muischnek. "A trivial method for choosing the right lemma." In Baltic HLT, pp. 82-89. 2012.