# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Spellcheck and normalization suggestions for words </span>

Sometimes you need to analyse the spelling of documents: which words are misspelled and what their correct forms might be.
In these situations, you can extend the pipeline with EstNLTK's tools for spellchecking and word normalization.

### `SpellCheckRetagger`

`SpellCheckRetagger` adds normalized (corrected) forms to the misspelled words in a text.
Let's consider an example:

In [1]:
from estnltk import Text
from estnltk.taggers import SpellCheckRetagger

# Create a text containing spelling mistakes
text=Text('Vikastes lausetes on trügivigasid!')
# Add words layer
text.tag_layer(['words'])

# Create spellchecker
spelling_tagger=SpellCheckRetagger()
# Add normalizations to misspelled words
spelling_tagger.retag(text)

# Check the results
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,5

text,normalized_form
Vikastes,Vigastes
lausetes,
on,
trügivigasid,trükivigasid
!,


As the results show, the misspelled words `'Vikastes'` and `'trügivigasid'` received `normalized_form` values that suggest the correct spellings.

Under the hood, `SpellCheckRetagger` uses [Vabamorf](https://github.com/Filosoft/vabamorf)'s speller tool to provide the functionality.

#### Flag `add_spellcheck`

The flag `add_spellcheck` makes the spellchecking results explicit: each annotation gets a boolean attribute `spelling` indicating whether the original word was spelled correctly.
Example:

In [2]:
# Create a text containing spelling mistakes
text=Text('Vikastes lausetes on trügivigasid!')
text.tag_layer(['words'])

# Create spellchecker
spelling_tagger=SpellCheckRetagger(add_spellcheck=True)
# Add normalizations to misspelled words
spelling_tagger.retag(text)

# Check the results
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,"normalized_form, spelling",,,True,5

text,normalized_form,spelling
Vikastes,Vigastes,False
lausetes,,True
on,,True
trügivigasid,trükivigasid,False
!,,True


<div class="alert alert-block alert-warning">
<h4><i>Remark about attribute <code>spelling</code></i></h4>
<br>
Please keep in mind that <code>spelling</code> only shows whether the surface form (<code>text</code>) is a correctly spelled word.
A misreading would be to interpret it as indicating the correctness of <code>normalized_form</code>. 
</div>
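To make the remark concrete, here is a minimal sketch that works on plain Python tuples copied from the output table above, not on the EstNLTK layer itself. The `rows` list and the `misspelled` dictionary are names introduced only for this illustration, and empty `normalized_form` cells are assumed to be `None`.

```python
# Rows copied from the output table above: (text, normalized_form, spelling).
# Empty normalized_form cells are represented as None (an assumption of this sketch).
rows = [
    ('Vikastes', 'Vigastes', False),
    ('lausetes', None, True),
    ('on', None, True),
    ('trügivigasid', 'trükivigasid', False),
    ('!', None, True),
]

# `spelling` describes the surface form, so False marks a misspelled original word.
# It says nothing about whether the suggested normalized_form is the right correction.
misspelled = {text: normalized for text, normalized, spelling in rows if not spelling}
print(misspelled)
```

Filtering on `spelling` therefore selects the words whose original form was wrong; whether the suggested correction is appropriate still has to be judged separately.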

#### Spelling corrections and morphological analysis 

Morphological analysis takes spelling corrections into account. If a word's `normalized_form` contains a spelling correction, the morphological analyser processes the corrected form and ignores the surface form (`text`). An example:

In [3]:
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Vikastes,Vigastes,vigane,vigane,['vigane'],tes,,pl in,A
lausetes,lausetes,lause,lause,['lause'],tes,,pl in,S
on,on,olema,ole,['ole'],0,,b,V
,on,olema,ole,['ole'],0,,vad,V
trügivigasid,trükivigasid,trükiviga,trüki_viga,"['trüki', 'viga']",sid,,pl p,S
!,!,!,!,['!'],,,,Z


The 'morph_analysis' layer shows the spelling-corrected word in the attribute `normalized_text`. This corresponds to the `normalized_form` on the words layer.

#### Flag `add_all_suggestions`

The spellchecker can give multiple suggestions for a misspelled word.
By default, `SpellCheckRetagger` picks only the first suggestion, because including multiple normalizations risks lowering the quality of morphological disambiguation.

To experiment, you can enable multiple normalizations by switching on the flag `add_all_suggestions`:

In [4]:
# Create a text containing spelling mistakes
text=Text('Vikastes lausetes on trügivigasid!')
# Add words layer
text.tag_layer(['words'])

# Create spellchecker that can give multiple suggestions
spelling_tagger=SpellCheckRetagger(add_all_suggestions=True)
# Add normalizations to misspelled words
spelling_tagger.retag(text)

# Check the results
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,5

text,normalized_form
Vikastes,Vigastes
,Vihastes
lausetes,
on,
trügivigasid,trükivigasid
!,


EstNLTK's `VabamorfAnalyzer` can also handle multiple normalizations.
If a word's `normalized_form` contains more than one spelling correction, `VabamorfAnalyzer` processes all of them and ignores the surface form (`text`).
Only words without spelling corrections (`normalized_form == None`) are analysed morphologically by their surface form.
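The selection rule described above can be sketched in plain Python. This is only an illustration of the rule, not `VabamorfAnalyzer`'s actual implementation; the helper name `forms_to_analyse` is hypothetical.

```python
from typing import List, Optional


def forms_to_analyse(surface: str, normalized_forms: List[Optional[str]]) -> List[str]:
    """Hypothetical helper: return the strings that would be analysed for one word."""
    # Keep only real corrections; None means "no normalization available"
    corrections = [form for form in normalized_forms if form is not None]
    # Corrections replace the surface form; without them, the surface form is used
    return corrections if corrections else [surface]


print(forms_to_analyse('Vikastes', ['Vigastes', 'Vihastes']))
print(forms_to_analyse('lausetes', [None]))
```

The first call returns both corrections, and the second falls back to the surface form because the word has no normalization.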

Let's consider an example: add a morphological analysis layer to the spelling-corrected text:

In [5]:
# import Vabamorf's analyser
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()

# add required layers to text
text.tag_layer(['sentences'])
# add morph_analysis and check results
vm_analyser.tag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Vikastes,Vigastes,Vigane,Vigane,['Vigane'],tes,,pl in,H
,Vigastes,Vigas,Vigas,['Vigas'],tes,,pl in,H
,Vigastes,Vigaste,Vigaste,['Vigaste'],s,,sg in,H
,Vigastes,Vigastes,Vigastes,['Vigastes'],0,,sg n,H
,Vigastes,vigane,vigane,['vigane'],tes,,pl in,A
,Vigastes,vigane,vigane,['vigane'],tes,,pl in,S
,Vihastes,Vihane,Vihane,['Vihane'],tes,,pl in,H
,Vihastes,Vihas,Vihas,['Vihas'],tes,,pl in,H
,Vihastes,Vihaste,Vihaste,['Vihaste'],s,,sg in,H
,Vihastes,Vihastes,Vihastes,['Vihastes'],0,,sg n,H


Naturally, the resulting 'morph_analysis' layer is also ambiguous.
If multiple normalizations are used, however, you should be careful when applying morphological disambiguation.

<div class="alert alert-block alert-warning">
<h4>Warning: <i>correct morphological disambiguation not guaranteed with <code>add_all_suggestions</code>!</i></h4>
<br>
Please keep in mind that while Vabamorf's morphological disambiguation (<code>VabamorfDisambiguator</code>) can also be applied to a text where words have multiple normalizations, the quality of the disambiguation results is not guaranteed.
The reason is that the disambiguator has only been trained on a corpus of standard language, and the tool is not aware of texts where each word can have multiple alternative normalizations.
Therefore, we do not recommend applying disambiguation if <code>add_all_suggestions</code> is turned on. 
If you really need to do it, you should definitely check first if the disambiguation quality is satisfactory.
</div>

---

### [Legacy] The old interface of spellchecker

EstNLTK 1.6 also contains the old interface of Vabamorf's spellchecker.
This provides the raw spellchecking functionality, without any tokenization and sentence segmentation corrections available in the new pipeline.

The old function `spellcheck()` accepts the following parameters:
  1. list of the word tokens (strings) that will be spell-checked;
  2. optional parameter `suggestions`, indicating whether spelling suggestions should be provided for misspelled words. By default, `suggestions=True`.

A usage example:

In [6]:
text_str = 'Vikastes lausetes on trügivigasid !'
text_tokens = text_str.split()

In [7]:
# NBVAL_IGNORE_OUTPUT
from estnltk.vabamorf.morf import spellcheck
spellcheck(text_tokens)

[{'text': 'Vikastes',
  'spelling': False,
  'suggestions': ['Vigastes', 'Vihastes']},
 {'text': 'lausetes', 'spelling': True, 'suggestions': []},
 {'text': 'on', 'spelling': True, 'suggestions': []},
 {'text': 'trügivigasid', 'spelling': False, 'suggestions': ['trükivigasid']},
 {'text': '!', 'spelling': True, 'suggestions': []}]

The result is a list of dictionaries with spellchecking results.
Each dictionary contains `'text'` (the original token), `'spelling'` (whether the token is spelled correctly) and `'suggestions'` (a list of spelling suggestions for a misspelled word).
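As a rough sketch of how such a result list could be post-processed, the snippet below keeps the first suggestion for each misspelled token, similar in spirit to the default behaviour of `SpellCheckRetagger` described earlier. The `results` list is copied from the output above so that the snippet runs without EstNLTK, and `first_suggestions` is a hypothetical helper, not part of the library.

```python
# Result list copied from the spellcheck() output shown above
results = [
    {'text': 'Vikastes', 'spelling': False, 'suggestions': ['Vigastes', 'Vihastes']},
    {'text': 'lausetes', 'spelling': True, 'suggestions': []},
    {'text': 'on', 'spelling': True, 'suggestions': []},
    {'text': 'trügivigasid', 'spelling': False, 'suggestions': ['trükivigasid']},
    {'text': '!', 'spelling': True, 'suggestions': []},
]


def first_suggestions(spellcheck_results):
    """Hypothetical helper: map each misspelled token to its first suggestion (or None)."""
    corrections = {}
    for item in spellcheck_results:
        if not item['spelling']:
            suggestions = item['suggestions']
            # A misspelled token may have no suggestions at all
            corrections[item['text']] = suggestions[0] if suggestions else None
    return corrections


print(first_suggestions(results))
```

Tokens that are misspelled but have no suggestions are mapped to `None`, so they can still be told apart from correctly spelled tokens, which are left out of the dictionary.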

If not required, spelling suggestions for words can also be turned off:

In [8]:
# NBVAL_IGNORE_OUTPUT
spellcheck(text_tokens, suggestions=False)

[{'text': 'Vikastes', 'spelling': False, 'suggestions': []},
 {'text': 'lausetes', 'spelling': True, 'suggestions': []},
 {'text': 'on', 'spelling': True, 'suggestions': []},
 {'text': 'trügivigasid', 'spelling': False, 'suggestions': []},
 {'text': '!', 'spelling': True, 'suggestions': []}]