# Evaluation of Vabamorf's disambiguation on spellchecker's suggestions

In this small experiment, we use Vabamorf's spellchecker to add **multiple normalized forms** to the **words layer**, and examine how the increased ambiguity affects the quality of Vabamorf's disambiguation in EstNLTK.

We use EstNLTK version 1.6.4beta (commit [52c921eb3d](https://github.com/estnltk/estnltk/tree/52c921eb3d06ebc0976c0dac84bc9b9f72b0491e)) and evaluate the tools on the Estonian Web Treebank (EWTB) corpus.

### Load Estonian Web Treebank corpus

The EWTB corpus in UD format is available at https://github.com/UniversalDependencies/UD_Estonian-EWT/ (exact commit: [6cd4d14](https://github.com/UniversalDependencies/UD_Estonian-EWT/tree/6cd4d1480c1f3dc89bcdddab56f04dc51bfa8b48)). The loading cell expects the unpacked corpus in the directory `UD_Estonian-EWT-master`.

In [1]:
eval_data_dir = 'UD_Estonian-EWT-master'

import os, os.path
from ewtb_ud_utils import load_EWTB_ud_file_with_corrections

# Load corpus files with corrections
ud_layer_name = 'ud_syntax'
loaded_texts  = []
for fname in os.listdir( eval_data_dir ):
    if fname.endswith('.conllu'):
        fpath = os.path.join( eval_data_dir, fname )
        text = load_EWTB_ud_file_with_corrections( fpath, ud_layer_name )
        text.meta['file'] = fname
        loaded_texts.append( text )

---

## Vabamorf's analysis and disambiguation (baseline: no spelling suggestions)

### 0. No spelling suggestions + Vabamorf's analysis

In [2]:
# 'morph_0' == VabamorfAnalyzer + PostMorphAnalysisTagger
from estnltk.taggers import VabamorfAnalyzer, PostMorphAnalysisTagger

vm_analyser = VabamorfAnalyzer(output_layer='morph_0')
post_corrector = PostMorphAnalysisTagger(output_layer='morph_0')
for text in loaded_texts:
    vm_analyser.tag( text )
    post_corrector.retag( text )

In [3]:
from ewtb_ud_utils import VM2UDMorphFullDiffTagger
vm2ud_diff_tagger = VM2UDMorphFullDiffTagger('morph_0', ud_layer_name, 'morph_0_diff_layer')
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

### 1. No spelling suggestions + Vabamorf's analysis with disambiguation

In [4]:
# 'morph_1' == VabamorfTagger
from estnltk.taggers import VabamorfTagger

vm_tagger = VabamorfTagger(output_layer='morph_1',
                           input_words_layer='words')
for text in loaded_texts:
    vm_tagger.tag( text )

In [5]:
from ewtb_ud_utils import VM2UDMorphFullDiffTagger
vm2ud_diff_tagger = VM2UDMorphFullDiffTagger('morph_1', ud_layer_name, 'morph_1_diff_layer')
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

### Results: Vabamorf's analysis and disambiguation (baseline: no spelling suggestions)

First, define the evaluation function:

In [6]:
def eval_disambiguation_of_all_words( texts, morph_0_diff_layer_name, 
                                             morph_1_diff_layer_name,
                                             morph_2_diff_layer_name ):
    '''Evaluates the disambiguation quality (correctly vs incorrectly disambiguated) on all words of the corpus.
    '''
    words_total = 0                  # total words (including words that cannot be aligned to UD morph)
    correct_words_all = 0            # correct words (disambiguated + undisambiguated)
    words_to_disambiguate = 0        # words needing disambiguation
    words_disambiguated = 0          # words actually disambiguated
    correctly_disambiguated = 0      # correct words (only disambiguated)
    correct_analyses_total_after_disamb = 0
    correct_analyses_total_before_disamb = 0
    for text in texts:
        morph_0_diff = text[morph_0_diff_layer_name]
        morph_1_diff = text[morph_1_diff_layer_name]
        morph_2_diff = text[morph_2_diff_layer_name]
        assert len( morph_0_diff ) == len( morph_1_diff )
        assert len( morph_1_diff ) == len( morph_2_diff )
        for word_1, word_2, word_3 in zip(morph_0_diff, morph_1_diff, morph_2_diff):
            if word_1.annotations[0]['full_match'] or word_2.annotations[0]['full_match']:
                # Consider only words that obtained a full match:
                # 1) with the default morphological analysis, or
                # 2) with the extended morphological analysis based on normalized word forms
                correct_analyses_total_before_disamb += 1
                if (len(word_1.annotations) > 1 or len(word_2.annotations) > 1):
                    words_to_disambiguate += 1
                    if len(word_3.annotations) == 1:
                        words_disambiguated += 1
                        if word_3.annotations[0]['full_match']:
                            correctly_disambiguated += 1
                    if word_3.annotations[0]['full_match']:
                        correct_words_all += 1
                if word_3.annotations[0]['full_match']:
                    correct_analyses_total_after_disamb += 1
        words_total += len( morph_0_diff )
    print('='*80)
    print( ' Words that needed disambiguation:               ', words_to_disambiguate, '/', words_total )
    print( '   Incorrectly disambiguated:                    ', (words_disambiguated-correctly_disambiguated), '/', words_to_disambiguate, '   {:.02f}%'.format(((words_disambiguated-correctly_disambiguated)/words_to_disambiguate)*100.0) )
    print( '   Correctly disambiguated:                      ', correctly_disambiguated, '/', words_to_disambiguate, '   {:.02f}%'.format((correctly_disambiguated/words_to_disambiguate)*100.0) )
    print( '   Disambiguation attempts:                      ', words_disambiguated, '/', words_to_disambiguate, '   {:.02f}%'.format((words_disambiguated/words_to_disambiguate)*100.0) )
    print()
    print( '   Correct words (including undisambiguated):    ', correct_words_all, '/', words_to_disambiguate, '   {:.02f}%'.format((correct_words_all/words_to_disambiguate)*100.0) )
    print('='*80)
    print( ' VM words alignable to UD morph words (before disamb): ', correct_analyses_total_before_disamb,'/', words_total, '   {:.02f}%'.format((correct_analyses_total_before_disamb/words_total)*100.0) )
    print( ' VM words alignable to UD morph words  (after disamb): ', correct_analyses_total_after_disamb,'/', words_total, '   {:.02f}%'.format((correct_analyses_total_after_disamb/words_total)*100.0) )

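Then apply the function. For the baseline, `morph_0_diff_layer` is passed twice, since there is no separate analysis layer based on spelling suggestions yet: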

In [7]:
# get training part of the corpus
#evaluation_texts = [text for text in loaded_texts if 'train' in text.meta['file']]

# ... or evaluate on all texts
evaluation_texts = loaded_texts

eval_disambiguation_of_all_words( evaluation_texts, 'morph_0_diff_layer', 'morph_0_diff_layer', 'morph_1_diff_layer' )

 Words that needed disambiguation:                9851 / 27286
   Incorrectly disambiguated:                     900 / 9851    9.14%
   Correctly disambiguated:                       6526 / 9851    66.25%
   Disambiguation attempts:                       7426 / 9851    75.38%

   Correct words (including undisambiguated):     8937 / 9851    90.72%
 VM words alignable to UD morph words (before disamb):  26458 / 27286    96.97%
 VM words alignable to UD morph words  (after disamb):  25544 / 27286    93.62%
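
The percentages in this output follow directly from the counters of `eval_disambiguation_of_all_words`. As a cross-check, the sketch below (an addition, not part of the original notebook) recomputes them from the reported counts. The dictionary keys and the helper `pct` are illustrative names.

```python
# Counts copied from the baseline output above
counts = {
    'words_total': 27286,
    'words_to_disambiguate': 9851,
    'words_disambiguated': 7426,
    'correctly_disambiguated': 6526,
    'correct_words_all': 8937,
    'aligned_before': 26458,
    'aligned_after': 25544,
}

def pct(part, whole):
    """Percentage of part in whole, rounded to two decimals."""
    return round(100.0 * part / whole, 2)

ambiguous = counts['words_to_disambiguate']
# Words reduced to a single analysis that does not match UD
incorrect = counts['words_disambiguated'] - counts['correctly_disambiguated']
# Words that still have several analyses after disambiguation
left_ambiguous = ambiguous - counts['words_disambiguated']
# Words that matched UD before disambiguation but not after it
lost_alignment = counts['aligned_before'] - counts['aligned_after']
# Ambiguous words whose first analysis no longer matches UD
ambiguous_not_correct = ambiguous - counts['correct_words_all']

print('incorrectly disambiguated:', incorrect, pct(incorrect, ambiguous))
print('correctly disambiguated:  ', counts['correctly_disambiguated'],
      pct(counts['correctly_disambiguated'], ambiguous))
print('left ambiguous:           ', left_ambiguous, pct(left_ambiguous, ambiguous))
print('lost alignment:           ', lost_alignment, ambiguous_not_correct)
```

Both of the last two numbers are 914, so in the baseline every word that loses its match with UD during disambiguation is a word that was ambiguous to begin with.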


---

## Vabamorf's analysis and disambiguation with spelling suggestions

### Reload the data

In [8]:
eval_data_dir = 'UD_Estonian-EWT-master'

import os, os.path
from ewtb_ud_utils import load_EWTB_ud_file_with_corrections

# Load corpus files with corrections
ud_layer_name = 'ud_syntax'
loaded_texts  = []
for fname in os.listdir( eval_data_dir ):
    if fname.endswith('.conllu'):
        fpath = os.path.join( eval_data_dir, fname )
        text = load_EWTB_ud_file_with_corrections( fpath, ud_layer_name )
        text.meta['file'] = fname
        loaded_texts.append( text )

### Create VMSpellingSuggestionsTagger

Make a tagger that creates a special words layer containing spelling suggestions:

In [9]:
from estnltk import Text, Annotation, ElementaryBaseSpan
from estnltk.layer.layer import Layer
from estnltk.taggers import Tagger
from estnltk.vabamorf import morf as vm
from estnltk.taggers.morph_analysis.morf_common import _get_word_texts

class VMSpellingSuggestionsTagger(Tagger):
    '''Creates normalized_words layer which contains spelling suggestions from Vabamorf's spellchecker.'''
    conf_param = []
    output_attributes = []
    
    def __init__(self, words_layer='words', output_layer='normalized_words'):
        self.input_layers = [words_layer]
        self.output_layer = output_layer
        self.output_attributes = ('normalized_form',)
    
    def _make_layer(self, text, layers, status):
        normalized_words = Layer(name=self.output_layer,
                                 attributes=self.output_attributes,
                                 text_object=text,
                                 ambiguous=True)
        words_layer = layers[self.input_layers[0]]
        for word in words_layer:
            if 'normalized_form' in words_layer.attributes:
                word_texts = _get_word_texts(word)
            else:
                word_texts = [word.text]
            # Collect unique suggestions over all forms of the word
            suggestions = set()
            for word_text in word_texts:
                spell_check_result = vm.spellcheck([word_text], suggestions=True)
                for item in spell_check_result:
                    # Keep suggestions only for misspelled words
                    if not item["spelling"]:
                        suggestions.update( item["suggestions"] )
            if suggestions:
                for suggestion in suggestions:
                    normalized_words.add_annotation( word.base_span, normalized_form=suggestion )
            else:
                # No suggestions: keep the span with an empty normalized form
                normalized_words.add_annotation( word.base_span, normalized_form=None )
        return normalized_words


test_text = Text('Ma tahax teada assju.')
test_text.tag_layer(['words'])
VMSpellingSuggestionsTagger().tag(test_text).normalized_words

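| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| normalized_words | normalized_form | | | True | 5 |

| text | normalized_form |
|---|---|
| Ma | |
| tahax | taha |
| | tahaks |
| | tahad |
| teada | |
| assju | asju |
| | assjõu |
| . | |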
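
The core of `_make_layer` is the collection of unique suggestions for each word. The sketch below isolates that step in plain Python so it can be tried without EstNLTK. `fake_spellcheck` is a hypothetical stand-in for `vm.spellcheck(..., suggestions=True)`: it only mimics the `spelling` and `suggestions` keys that the tagger reads, and its small lookup table is copied from the example output above. `collect_suggestions` is likewise an illustrative name.

```python
def fake_spellcheck(words):
    """Hypothetical stand-in for Vabamorf's spellchecker (assumption).

    Returns one dict per word with the two keys the tagger relies on.
    """
    known_misspellings = {
        'tahax': ['taha', 'tahaks', 'tahad'],
        'assju': ['asju', 'assjõu'],
    }
    return [{'spelling': word not in known_misspellings,
             'suggestions': known_misspellings.get(word, [])}
            for word in words]


def collect_suggestions(word_texts, spellcheck):
    """Gather unique suggestions for all misspelled forms of one word."""
    suggestions = set()
    for word_text in word_texts:
        for item in spellcheck([word_text]):
            if not item['spelling']:
                suggestions.update(item['suggestions'])
    return suggestions


for token in ['Ma', 'tahax', 'teada', 'assju', '.']:
    found = collect_suggestions([token], fake_spellcheck)
    # sorted() only makes the printout stable; the tagger itself iterates the set
    print(token, sorted(found))
```

Because a set is used, a suggestion produced for several forms of the same word is stored once, and the order in which annotations are added to the layer is not guaranteed.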


### Apply VMSpellingSuggestionsTagger on the input corpus

In [10]:
spelling_suggestor = VMSpellingSuggestionsTagger()
for text in loaded_texts:
    spelling_suggestor.tag( text )

### 0. No spelling suggestions + Vabamorf's analysis

In [11]:
from estnltk.taggers import VabamorfAnalyzer, PostMorphAnalysisTagger

vm_analyser = VabamorfAnalyzer(output_layer='morph_0')
post_corrector = PostMorphAnalysisTagger(output_layer='morph_0')
for text in loaded_texts:
    vm_analyser.tag( text )
    post_corrector.retag( text )

In [12]:
from ewtb_ud_utils import VM2UDMorphFullDiffTagger
vm2ud_diff_tagger = VM2UDMorphFullDiffTagger('morph_0', ud_layer_name, 'morph_0_diff_layer')
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

### 1. Spelling suggestions + Vabamorf's analysis only

In [13]:
# 'morph_1' == VabamorfAnalyzer + PostMorphAnalysisTagger
from estnltk.taggers import VabamorfAnalyzer, PostMorphAnalysisTagger

vm_analyser = VabamorfAnalyzer(output_layer='morph_1',
                               input_words_layer=spelling_suggestor.output_layer)
post_corrector = PostMorphAnalysisTagger(output_layer='morph_1')
for text in loaded_texts:
    vm_analyser.tag( text )
    post_corrector.retag( text )

In [14]:
from ewtb_ud_utils import VM2UDMorphFullDiffTagger
vm2ud_diff_tagger = VM2UDMorphFullDiffTagger('morph_1', ud_layer_name, 'morph_1_diff_layer')
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

### 2. Spelling suggestions + Vabamorf's analysis with disambiguation

In [15]:
# 'morph_2' == VabamorfTagger
from estnltk.taggers import VabamorfTagger

vm_tagger = VabamorfTagger(output_layer='morph_2',
                           input_words_layer=spelling_suggestor.output_layer)
for text in loaded_texts:
    vm_tagger.tag( text )

In [16]:
from ewtb_ud_utils import VM2UDMorphFullDiffTagger
vm2ud_diff_tagger = VM2UDMorphFullDiffTagger('morph_2', ud_layer_name, 'morph_2_diff_layer')
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

### Results: Vabamorf's analysis and disambiguation after spelling suggestions

In [17]:
# get training part of the corpus
#evaluation_texts = [text for text in loaded_texts if 'train' in text.meta['file']]

# ... or evaluate on all texts
evaluation_texts = loaded_texts

eval_disambiguation_of_all_words( evaluation_texts, 'morph_0_diff_layer', 'morph_1_diff_layer', 'morph_2_diff_layer' )

 Words that needed disambiguation:                10029 / 27286
   Incorrectly disambiguated:                     1112 / 10029    11.09%
   Correctly disambiguated:                       6489 / 10029    64.70%
   Disambiguation attempts:                       7601 / 10029    75.79%

   Correct words (including undisambiguated):     8899 / 10029    88.73%
 VM words alignable to UD morph words (before disamb):  26637 / 27286    97.62%
 VM words alignable to UD morph words  (after disamb):  25418 / 27286    93.15%
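
To compare this run with the baseline, the differences can be computed directly from the two outputs. The sketch below is an addition for illustration (all names are made up); it uses only the counts printed above and in the baseline section.

```python
# Counts copied from the two reports (baseline vs. spelling suggestions)
baseline = {'to_disambiguate': 9851, 'incorrect': 900, 'correct': 6526,
            'aligned_before': 26458, 'aligned_after': 25544}
with_suggestions = {'to_disambiguate': 10029, 'incorrect': 1112, 'correct': 6489,
                    'aligned_before': 26637, 'aligned_after': 25418}

# Absolute change in each counter
delta = {key: with_suggestions[key] - baseline[key] for key in baseline}
for key, value in delta.items():
    print('{:<16} {:+d}'.format(key, value))

def rate(run, key):
    """Share of ambiguous words (in percent) that fall under the given counter."""
    return 100.0 * run[key] / run['to_disambiguate']

# Change of the error rate in percentage points
incorrect_pp = rate(with_suggestions, 'incorrect') - rate(baseline, 'incorrect')
print('incorrectly disambiguated: {:+.2f} percentage points'.format(incorrect_pp))
```

With spelling suggestions, 179 more words have a matching analysis before disambiguation, but 126 fewer words remain matched after it, and the share of incorrectly disambiguated words rises by roughly two percentage points.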


Now, define a function that gives fine-grained statistics on disambiguation quality, restricted to words that received normalizations (spelling suggestions):

In [18]:
from estnltk.taggers.morph_analysis.morf_common import _get_word_texts

def eval_disambiguation_of_normalized_words( texts, normalized_words_layer, 
                                                    morph_1_diff_layer_name,
                                                    morph_2_diff_layer_name ):
    '''Evaluates the disambiguation quality (correctly vs incorrectly disambiguated) only on normalized words of the corpus.
    '''
    words_total = 0
    words_with_normalization = 0
    words_with_normalization_aligned_before_disamb = 0
    words_with_normalization_aligned_after_disamb  = 0
    for text in texts:
        norm_words   = text[normalized_words_layer]
        morph_1_diff = text[morph_1_diff_layer_name]
        morph_2_diff = text[morph_2_diff_layer_name]
        assert len( morph_1_diff ) == len( norm_words )
        assert len( morph_1_diff ) == len( morph_2_diff )
        for norm_word, word_2, word_3 in zip(norm_words, morph_1_diff, morph_2_diff):
            w_texts = _get_word_texts(norm_word)
            if len(w_texts) > 1 or (len(w_texts) == 1 and w_texts[0] != norm_word.text):
                words_with_normalization += 1
                if word_2.annotations[0]['full_match']:
                    words_with_normalization_aligned_before_disamb += 1
                if word_3.annotations[0]['full_match']:
                    words_with_normalization_aligned_after_disamb += 1        
        words_total += len( morph_1_diff )
    print('='*80)
    print( ' Only words with normalizations (spelling suggestions): ', words_with_normalization, '/', words_total )
    print( '   Correctly analysed (without disambiguation):         ', words_with_normalization_aligned_before_disamb, '/', words_with_normalization, '   {:.02f}%'.format((words_with_normalization_aligned_before_disamb/words_with_normalization)*100.0) )
    print( '       - Correctly disambiguated:                       ', words_with_normalization_aligned_after_disamb, '/', words_with_normalization, '   {:.02f}%'.format((words_with_normalization_aligned_after_disamb/words_with_normalization)*100.0) )
    incorrect = words_with_normalization_aligned_before_disamb - words_with_normalization_aligned_after_disamb
    print( '       - Incorrectly disambiguated:                     ', incorrect, '/', words_with_normalization, '   {:.02f}%'.format((incorrect/words_with_normalization)*100.0) )
    norm_words_no_ud_match = words_with_normalization - words_with_normalization_aligned_before_disamb
    print( '   Word\'s analyses cannot be matched to UD word\'s:      ', norm_words_no_ud_match, '/', words_with_normalization, '   {:.02f}%'.format((norm_words_no_ud_match/words_with_normalization)*100.0) )
    print()


And evaluate disambiguation only on normalized words:

In [19]:
# get training part of the corpus
#evaluation_texts = [text for text in loaded_texts if 'train' in text.meta['file']]

# ... or evaluate on all texts
evaluation_texts = loaded_texts

eval_disambiguation_of_normalized_words( evaluation_texts, spelling_suggestor.output_layer, 'morph_1_diff_layer', 'morph_2_diff_layer' )

 Only words with normalizations (spelling suggestions):  637 / 27286
   Correctly analysed (without disambiguation):          187 / 637    29.36%
       - Correctly disambiguated:                        151 / 637    23.70%
       - Incorrectly disambiguated:                      36 / 637    5.65%
   Word's analyses cannot be matched to UD word's:       450 / 637    70.64%
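
Which words count as "words with normalizations" is decided by a single condition in `eval_disambiguation_of_normalized_words`. The sketch below restates that condition as a standalone predicate. `has_normalization` is an illustrative name, and the idea that `_get_word_texts` falls back to the surface text when no normalized form is present is an assumption that the code here does not confirm. The example words are taken from the test sentence used earlier.

```python
def has_normalization(surface_text, word_texts):
    """True if the word has several forms, or one form that differs from the surface text."""
    return len(word_texts) > 1 or (len(word_texts) == 1 and word_texts[0] != surface_text)


# (surface text, forms assumed to be returned for the word)
examples = [
    ('Ma', ['Ma']),                          # no suggestion: assumed fallback to surface text
    ('tahax', ['taha', 'tahaks', 'tahad']),  # several suggestions
    ('assju', ['asju', 'assjõu']),           # several suggestions
]
for surface_text, word_texts in examples:
    print(surface_text, has_normalization(surface_text, word_texts))
```

Under this condition a word with a single suggestion is also counted, as long as the suggestion differs from the original spelling.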



---

## Summary

    Measurements made on training & test parts of the EWTB corpus
    
    A) Vabamorf's analysis & disambiguation (baseline: no spelling corrections)
    ================================================================================
     Words that needed disambiguation:                9851 / 27286
       Incorrectly disambiguated:                      900 / 9851     9.14%
       Correctly disambiguated:                       6526 / 9851    66.25%
       Disambiguation attempts:                       7426 / 9851    75.38%

       Correct words (including undisambiguated):     8937 / 9851    90.72%
    ================================================================================
     VM words alignable to UD morph words (before disamb):  26458 / 27286    96.97%
     VM words alignable to UD morph words  (after disamb):  25544 / 27286    93.62%

    B) Vabamorf's analysis & disambiguation on words with spelling suggestions
    ================================================================================
     Words that needed disambiguation:                10029 / 27286
       Incorrectly disambiguated:                     1112 / 10029    11.09%
       Correctly disambiguated:                       6489 / 10029    64.70%
       Disambiguation attempts:                       7601 / 10029    75.79%

       Correct words (including undisambiguated):     8899 / 10029    88.73%
    ================================================================================
     VM words alignable to UD morph words (before disamb):  26637 / 27286    97.62%
     VM words alignable to UD morph words  (after disamb):  25418 / 27286    93.15%


     ================================================================================
       Only words with normalizations (spelling suggestions):  637 / 27286
         Correctly analysed (without disambiguation):          187 / 637    29.36%
             - Correctly disambiguated:                        151 / 637    23.70%
             - Incorrectly disambiguated:                       36 / 637     5.65%
         Word's analyses cannot be matched to UD word's:       450 / 637    70.64%
     ================================================================================
     