# Evaluation of spellchecker via morphological analysis

In this experiment, we evaluate Vabamorf's spellchecker indirectly, by measuring how much it improves the quality of morphological analysis.
We first analyse the EWTB (Estonian Web TreeBank) corpus with the default _VabamorfAnalyzer_, and then with a _VabamorfAnalyzer_ whose results have been extended with analyses of the suggestions made by Vabamorf's speller.
We measure how much the quality of analysis (the share of words whose analyses match EWTB's manual annotations) improves when new analyses are added according to the speller's suggestions.

## Loading EWTB corpus

First, download the UD-format EWTB corpus from https://github.com/UniversalDependencies/UD_Estonian-EWT/ (exact commit: [6cd4d14](https://github.com/UniversalDependencies/UD_Estonian-EWT/tree/6cd4d1480c1f3dc89bcdddab56f04dc51bfa8b48)). Then set the corpus directory in the variable below:

In [1]:
eval_data_dir = 'UD_Estonian-EWT-master'

Now you can use a function from the module `ewtb_ud_utils` that loads EWTB's _conllu_ files as Text objects and applies annotation post-corrections, which make the UD-format morphological annotations comparable to Vabamorf's morphological annotations.

In [2]:
import os, os.path
from ewtb_ud_utils import load_EWTB_ud_file_with_corrections

ud_layer_name = 'ud_syntax'
loaded_texts  = []
for fname in os.listdir( eval_data_dir ):
    if fname.endswith('.conllu'):
        fpath = os.path.join( eval_data_dir, fname )
        text = load_EWTB_ud_file_with_corrections( fpath, ud_layer_name )
        text.meta['file'] = fname
        loaded_texts.append( text )

## First evaluation: default morphological annotations

In the first part of the evaluation, we look at how well the default morphological analyser performs on the EWTB corpus.
We use `VabamorfAnalyzer` to add morphological annotations to the Text objects, and apply post-corrections with `PostMorphAnalysisTagger`:

In [3]:
from estnltk.taggers import VabamorfAnalyzer, PostMorphAnalysisTagger

vm_analyser = VabamorfAnalyzer()
post_corrector = PostMorphAnalysisTagger()
for text in loaded_texts:
    vm_analyser.tag( text )
    post_corrector.retag( text )

### Finding differences between Vabamorf's annotations and UD annotations

The EWTB corpus contains manually corrected syntactic annotations, which also include manual corrections to the (UD-format) morphological analyses.
We can use these manually corrected morphological analyses to evaluate the morphological analyses that Vabamorf provides automatically.
For this purpose, the module `ewtb_ud_utils` contains a special tagger, `VM2EWTBMorphDiffTagger`, which compares Vabamorf's layer against the UD-format syntax layer and finds differences in morphological annotations.

In [4]:
from ewtb_ud_utils import VM2EWTBMorphDiffTagger
vm2ud_diff_tagger = VM2EWTBMorphDiffTagger('morph_analysis', ud_layer_name, 'morph_diff_layer')
vm2ud_diff_tagger

name,output layer,output attributes,input layers
VM2EWTBMorphDiffTagger,morph_diff_layer,"('vm_root', 'ud_lemma', 'vm_pos', 'ud_pos', 'vm_form', 'ud_form', 'root_match', 'pos_match', 'form_match')","('morph_analysis', 'ud_syntax')"

0,1
vm_morph_layer,morph_analysis
ud_syntax_layer,ud_syntax
compare_function,<function ewtb_ud_utils.align_records>
count_mismatch_details,True
show_lemmas,True
show_postags,True
show_forms,True


In [5]:
# Find differences
for text in loaded_texts:
    vm2ud_diff_tagger.tag(text)

In [6]:
# get training part of the corpus
training_text = [text for text in loaded_texts if 'train' in text.meta['file']][0]

The differences between the two input layers are written to the `'morph_diff_layer'` layer.
By default, the tagger also outputs the values that were compared: 1) lemmas from both layers (`vm_root` and `ud_lemma`), 2) part-of-speech tags from both layers (`vm_pos` and `ud_pos`), and 3) forms from both layers (`vm_form` and `ud_form`).
The boolean values `root_match`, `pos_match` and `form_match` indicate which parts of the annotations matched and which did not:

In [7]:
# Let's examine the last 10 mismatches
training_text['morph_diff_layer'][-10:]

layer name,attributes,parent,enveloping,ambiguous,span count
morph_diff_layer,"vm_root, ud_lemma, vm_pos, ud_pos, vm_form, ud_form, root_match, pos_match, form_match",morph_analysis,,True,10

text,vm_root,ud_lemma,vm_pos,ud_pos,vm_form,ud_form,root_match,pos_match,form_match
Körin,Körin,kõrin,H,S_NOUN,sg n,"OrderedDict([('Case', 'Nom'), ('Number', 'Sing')])",False,False,True
vöib,vöib,või,S,V_AUX,sg n,"OrderedDict([('Mood', 'Ind'), ('Number', 'Sing'), ('Person', '3'), ('Tense', 'Pr ..., type: <class 'collections.OrderedDict'>, length: 6",False,False,False
voi,voi,või,S,J_CCONJ,sg g,OrderedDict(),False,False,False
,voi,või,S,J_CCONJ,sg n,OrderedDict(),False,False,False
imtingimata,imting=im,ilm_tingimata,U,D_ADV,sg ab,OrderedDict(),False,False,False
,imtingi=mata,ilm_tingimata,A,D_ADV,,OrderedDict(),False,False,True
ekraan,ekraan,ekraan,S,S_NOUN,sg n,"OrderedDict([('Case', 'Gen'), ('Number', 'Sing')])",True,True,False
konsentratsiooniks,konsentratsioo_niks,kontsentratsioon,S,S_NOUN,sg n,"OrderedDict([('Case', 'Tra'), ('Number', 'Sing')])",False,True,False
%,%,%,Y,X_SYM,?,OrderedDict(),True,False,False
kül,kül,küll,S,D_ADV,sg n,OrderedDict(),False,False,False


Finally, the `meta` part of the layer contains some overall statistics:

In [8]:
training_text['morph_diff_layer'].meta

{'ambiguous_words': 6454,
 'avg_variants_per_word': 1.6990279960421395,
 'matching_words': 16664,
 'mismatching_propn_words': 44,
 'mismatching_punct_words': 8,
 'mismatching_symb_words': 11,
 'mismatching_words': 517,
 'words_total': 17181}

Note: *mismatching_words* are words for which none of Vabamorf's morphological analyses could be matched to UD's morphological analysis.
If Vabamorf's annotations for a word contain multiple analyses, and at least one of them can be matched to the corresponding UD morphological analysis, the word is considered a _matching word_.
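
The matching rule can be illustrated with a minimal, self-contained sketch. The helper `is_matching_word` and the record layout below are hypothetical and not part of `ewtb_ud_utils`; the sketch also assumes that a single analysis counts as matching only when its lemma, part of speech and form all agree with the UD annotation.

```python
def is_matching_word(analyses):
    """Return True if at least one analysis agrees on root, POS and form.

    `analyses` is a list of dicts carrying the three boolean flags
    (an assumed layout, modelled on the attributes of the diff layer).
    """
    return any(
        a['root_match'] and a['pos_match'] and a['form_match']
        for a in analyses
    )

# An ambiguous word: the second analysis agrees fully, so the word matches.
ambiguous_word = [
    {'root_match': True, 'pos_match': True, 'form_match': False},
    {'root_match': True, 'pos_match': True, 'form_match': True},
]

# No analysis agrees on all three parts, so the word is a mismatch.
unmatched_word = [
    {'root_match': False, 'pos_match': False, 'form_match': False},
    {'root_match': False, 'pos_match': False, 'form_match': True},
]

print(is_matching_word(ambiguous_word))
print(is_matching_word(unmatched_word))
```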

### Summarizing differences

The module `ewtb_ud_utils` also contains a function that aggregates and summarizes statistics from the difference layers of all input Text objects:

In [9]:
from ewtb_ud_utils import diff_statistics_html_table
diff_statistics_html_table( loaded_texts, 'morph_diff_layer')

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,828,(3.03%),[]

0,1,2
+ ROOT | – POSTAG | – FORM,219,(26.45%)
– ROOT | – POSTAG | – FORM,193,(23.31%)
– ROOT | + POSTAG | + FORM,169,(20.41%)
+ ROOT | + POSTAG | – FORM,95,(11.47%)
+ ROOT | – POSTAG | + FORM,62,(7.49%)
– ROOT | + POSTAG | – FORM,51,(6.16%)
– ROOT | – POSTAG | + FORM,39,(4.71%)
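
The row labels in this table combine the three match flags of a difference record. As a rough sketch of how such labels can be built and counted, the hypothetical helper `mismatch_label` below turns the flags into a label (using an ASCII minus where the table shows a dash), and a `Counter` tallies them. The example rows are taken from the unambiguous words in the mismatch listing shown earlier; how `diff_statistics_html_table` itself handles words with several analyses is not covered here.

```python
from collections import Counter

def mismatch_label(root_match, pos_match, form_match):
    """Build a label such as '+ ROOT | - POSTAG | + FORM' from three flags."""
    sign = lambda ok: '+' if ok else '-'
    return '{} ROOT | {} POSTAG | {} FORM'.format(
        sign(root_match), sign(pos_match), sign(form_match))

# (word, root_match, pos_match, form_match), copied from the listing above
rows = [
    ('Körin', False, False, True),
    ('ekraan', True, True, False),
    ('konsentratsiooniks', False, True, False),
    ('%', True, False, False),
    ('kül', False, False, False),
]

label_counts = Counter(mismatch_label(r, p, f) for _, r, p, f in rows)
for label, count in label_counts.most_common():
    print(label, count)
```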


You can also exclude mismatches by UD part of speech. The parameter `leave_out_udpos` takes a list of strings, which are matched as substrings of the UD part-of-speech tags; words with these tags are excluded from the list of mismatching pairs.
For instance, we can exclude proper names ('PROPN'), punctuation ('PUNCT') and symbols ('SYM') from mismatches:

In [10]:
diff_statistics_html_table( loaded_texts, 'morph_diff_layer', leave_out_udpos=['PROPN', 'PUNCT', 'SYM'] )

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,665,(2.44%),"leave_out_udpos: ['PROPN', 'PUNCT', 'SYM']"

0,1,2
– ROOT | – POSTAG | – FORM,186,(27.97%)
+ ROOT | – POSTAG | – FORM,138,(20.75%)
– ROOT | + POSTAG | + FORM,114,(17.14%)
+ ROOT | + POSTAG | – FORM,90,(13.53%)
+ ROOT | – POSTAG | + FORM,61,(9.17%)
– ROOT | – POSTAG | + FORM,39,(5.86%)
– ROOT | + POSTAG | – FORM,37,(5.56%)
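
Because the filter strings are substrings, they also match the combined tags that appear in the diff layer, such as `X_SYM`. The following sketch shows the idea; `is_left_out` is a hypothetical helper written for illustration, not the function used inside `ewtb_ud_utils`.

```python
def is_left_out(ud_pos, leave_out_udpos):
    """Return True if any filter string occurs inside the UD POS tag."""
    return any(fragment in ud_pos for fragment in leave_out_udpos)

leave_out = ['PROPN', 'PUNCT', 'SYM']

# Tags as they appear in the `ud_pos` column of the diff layer
for tag in ['S_NOUN', 'V_AUX', 'D_ADV', 'X_SYM']:
    print(tag, is_left_out(tag, leave_out))
```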


## Second evaluation: morphological annotations with spelling corrections

Now, we'll run Vabamorf's spellchecker on EWTB's words, collect spelling suggestions for misspelled words, analyse the suggested words with Vabamorf's analyser and add the corresponding new analyses.

For that, we first create a rewriter that adds spelling suggestions (and corresponding analyses) to misspelled words:

In [11]:
from estnltk.text import Text, Layer
from estnltk.vabamorf import morf as vm
from estnltk.layer.annotation import Annotation

from estnltk.taggers import VabamorfAnalyzer, PostMorphAnalysisTagger, Tagger

class SpellingCorrectionsMorphAnalysisRewriter:
    '''Rewrites morph_analysis layer with added spelling corrections.'''
    
    def __init__(self):
        self.vm_analyser = VabamorfAnalyzer()
        self.post_corrector = PostMorphAnalysisTagger()

    def rewrite(self, records):
        suggestions = set()
        added_records = []
        for rec in records:
            assert 'text' in rec
            # Perform spellchecking
            spell_check_result = vm.spellcheck([rec['text']], suggestions=True)
            for item in spell_check_result:
                # Check if we have a misspelled word with suggestions
                if not item["spelling"] and len(item["suggestions"]) > 0:
                    for new_suggestion in item["suggestions"]:
                        if new_suggestion not in suggestions:
                            suggestions.add( new_suggestion )
        if suggestions:
            # Perform new morph analysis on suggested variants
            temp_text = Text(' '.join(list(suggestions)))
            temp_text.tag_layer(['words', 'sentences'])
            self.vm_analyser.tag( temp_text )
            self.post_corrector.retag( temp_text )
            # Fetch the new morph annotations
            for morph_span in temp_text['morph_analysis']:
                morph_records = morph_span.to_records()
                # Rewrite coordinates
                for new_morph_rec in morph_records:
                    new_morph_rec['start'] = records[0]['start']
                    new_morph_rec['end']   = records[0]['end']
                added_records.extend( morph_records )
            #print('Suggestions: {!r} -> {!r}'.format(records[0]['text'], suggestions) )
        records.extend( added_records )
        return records

class MorphSpellingCorrectionsTagger(Tagger):
    '''Creates a copy of morph layer with added spelling corrections. Uses SpellingCorrectionsMorphAnalysisRewriter.
       TODO: there should be a more straightforward way for doing this
    '''
    conf_param = ['rewriter']

    def __init__(self, input_layer, output_layer, attributes):
        self.input_layers = [input_layer]
        self.output_layer = output_layer
        self.output_attributes = attributes
        self.rewriter = SpellingCorrectionsMorphAnalysisRewriter()
    
    def _make_layer(self, text, layers, status):
        layer = layers[self.input_layers[0]]
        new_layer = layer.copy() # make a copy of old morph layer
        new_layer.name = self.output_layer
        for span in new_layer:
            records = span.to_records(with_text=True)
            span.clear_annotations()
            records = self.rewriter.rewrite( records )
            for record in records:
                record = { k: record[k] for k in record.keys() if k in new_layer.attributes }
                span.add_annotation(Annotation(span, **record))
        return new_layer
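
The suggestion-collecting step of `rewrite` can be tried out without Vabamorf. The sketch below substitutes a hypothetical stub, `fake_spellcheck`, which returns dictionaries with the same `spelling` and `suggestions` keys that the code above reads from `vm.spellcheck`; the suggestion list in the stub is made up for illustration, and `collect_suggestions` is likewise an illustrative helper.

```python
def fake_spellcheck(words, suggestions=True):
    """Hypothetical stand-in for vm.spellcheck (made-up suggestions)."""
    known_corrections = {'kül': ['küll', 'küla']}
    results = []
    for word in words:
        if word in known_corrections:
            results.append({'text': word, 'spelling': False,
                            'suggestions': list(known_corrections[word])})
        else:
            results.append({'text': word, 'spelling': True,
                            'suggestions': []})
    return results

def collect_suggestions(word_texts, spellcheck=fake_spellcheck):
    """Gather unique suggestions for misspelled words, in first-seen order."""
    collected = []
    for item in spellcheck(word_texts, suggestions=True):
        # Only misspelled words contribute suggestions
        if not item['spelling']:
            for suggestion in item['suggestions']:
                if suggestion not in collected:
                    collected.append(suggestion)
    return collected

print(collect_suggestions(['kül', 'ekraan', 'kül']))
```

Unlike the rewriter above, which gathers suggestions into a `set`, this sketch keeps them in a list so that their order is reproducible between runs.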

Next, we'll apply the tagger (and, through it, the rewriter) to our Text objects:

In [12]:
# Rewrite morph analysis layer with spelling corrections
spelling_corrections_layer = 'morph_analysis_with_spelling_suggestions'
morph_attributes = loaded_texts[0]['morph_analysis'].attributes
spelling_correction_retagger = MorphSpellingCorrectionsTagger('morph_analysis', spelling_corrections_layer, morph_attributes)
for text in loaded_texts:
    spelling_correction_retagger.tag( text )

Then we find the differences between the morphological analyses with spelling corrections and EWTB's manual annotations:

In [13]:
# Find differences once again
from ewtb_ud_utils import VM2EWTBMorphDiffTagger
vm2ud_diff_tagger_2 = VM2EWTBMorphDiffTagger( spelling_corrections_layer, ud_layer_name, 'morph_spelling_diff_layer' )
for text in loaded_texts:
    vm2ud_diff_tagger_2.tag(text)

In [14]:
# Summarize results
from ewtb_ud_utils import diff_statistics_html_table
diff_statistics_html_table( loaded_texts, 'morph_spelling_diff_layer')

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,649,(2.38%),[]

0,1,2
+ ROOT | – POSTAG | – FORM,200,(30.82%)
– ROOT | + POSTAG | + FORM,121,(18.64%)
– ROOT | – POSTAG | – FORM,108,(16.64%)
+ ROOT | + POSTAG | – FORM,97,(14.95%)
+ ROOT | – POSTAG | + FORM,63,(9.71%)
– ROOT | + POSTAG | – FORM,35,(5.39%)
– ROOT | – POSTAG | + FORM,25,(3.85%)


In [15]:
# Summarize results with filters
diff_statistics_html_table( loaded_texts, 'morph_spelling_diff_layer', leave_out_udpos=['PROPN', 'PUNCT', 'SYM'] )

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,492,(1.80%),"leave_out_udpos: ['PROPN', 'PUNCT', 'SYM']"

0,1,2
+ ROOT | – POSTAG | – FORM,122,(24.80%)
– ROOT | – POSTAG | – FORM,101,(20.53%)
+ ROOT | + POSTAG | – FORM,92,(18.70%)
– ROOT | + POSTAG | + FORM,66,(13.41%)
+ ROOT | – POSTAG | + FORM,62,(12.60%)
– ROOT | – POSTAG | + FORM,25,(5.08%)
– ROOT | + POSTAG | – FORM,24,(4.88%)
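
To put the two evaluations side by side, the reduction in mismatching words can be computed directly from the tables above. The snippet below is a small arithmetic sketch: the counts are copied from the reported outputs, while the helper `reduction` and the dictionary layout are introduced here only for illustration.

```python
TOTAL_WORDS = 27286

# (mismatches without spelling suggestions, mismatches with them)
reported = {
    'all words': (828, 649),
    'without PROPN/PUNCT/SYM': (665, 492),
}

def reduction(before, after, total=TOTAL_WORDS):
    """Summarize how many mismatches disappeared and how the rate changed."""
    return {
        'removed': before - after,
        'rate_before_%': round(100.0 * before / total, 2),
        'rate_after_%': round(100.0 * after / total, 2),
        'relative_reduction_%': round(100.0 * (before - after) / before, 2),
    }

summary = {name: reduction(b, a) for name, (b, a) in reported.items()}
for name, stats in summary.items():
    print(name, stats)
```

With these figures, adding spelling suggestions removes 179 of the 828 mismatches (about 22%) over all words, and 173 of the 665 mismatches (about 26%) when proper names, punctuation and symbols are left out.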


## Checking ambiguities

Finally, we can also examine some statistics about ambiguities in Vabamorf's original morphological analyses:

In [16]:
diff_statistics_html_table( loaded_texts, 'morph_diff_layer', show_ambiguity=True)

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,828,(3.03%),[]

ambiguous_words,ambiguous_%,avg_analyses_per_word
10176,(37.29%),1.68


And the same statistics for the analyses with spelling suggestions:

In [17]:
diff_statistics_html_table( loaded_texts, 'morph_spelling_diff_layer', show_ambiguity=True)

texts,total_words,mismatching_words,mismatching_%,filters
2,27286,649,(2.38%),[]

ambiguous_words,ambiguous_%,avg_analyses_per_word
10477,(38.40%),1.86
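
The added analyses come with more ambiguity. As a rough sketch, the change can be quantified from the two outputs above; the variable names are illustrative, and the numbers are copied from the reported tables.

```python
TOTAL_WORDS = 27286

# Ambiguous word counts and average analyses per word, as reported above
ambiguous_before, avg_before = 10176, 1.68
ambiguous_after, avg_after = 10477, 1.86

extra_ambiguous = ambiguous_after - ambiguous_before
share_before = round(100.0 * ambiguous_before / TOTAL_WORDS, 2)
share_after = round(100.0 * ambiguous_after / TOTAL_WORDS, 2)
avg_increase = round(avg_after - avg_before, 2)

print('additional ambiguous words:', extra_ambiguous)
print('ambiguous share (%):', share_before, '->', share_after)
print('increase in average analyses per word:', avg_increase)
```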
