## Morphological analysis with user dictionary

If you need to analyse non-standard Estonian texts (such as Internet language, transcribed spoken language, or written texts heavily influenced by regional dialects), the standard morphological analyser will probably perform suboptimally.
But if the errors are regular enough, you can compose (either manually or semi-automatically) a user dictionary with corrections.
You can apply the dictionary to rewrite the `'morph_analysis'` layer, so that words with erroneous analyses receive the correct analyses from the dictionary.

There are two ways to correct morphological analyses, depending on the type of errors you have:
  * if non-standard words can be mapped to standard words (e.g. misspelled words can be mapped to correctly spelled words, such as 'sellged' => 'selged' or 'vxi' => 'või'), then you can use the `make_userdict` function to automatically create a `UserDictTagger` from the given mappings and use it to make the corrections;
  * if the spelling of the words is correct, but the morphological analyser does not produce correct analyses (e.g. the compound word 'abieluettepanek' is analysed with the root 'abi_elu_ette_panek', although the root 'abielu_ettepanek' would be preferable), then you can manually create a `UserDictTagger` that makes specific corrections to the analyses.

## `make_userdict` function

### Basic usage

The function `make_userdict` can be used to automatically create `UserDictTagger` based on given mappings from non-standard words to standard words.

Let's consider an example sentence from the Internet language:

In [1]:
text_str = "see onn hädavajalik vajd merel, xhus vxi metsas"

First, let's try to analyse it with the standard morphological analyser:

In [2]:
from estnltk import Text
text = Text(text_str)
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,see,see,see,['see'],0,,sg n,P
onn,onn,onn,onn,['onn'],0,,sg n,S
hädavajalik,hädavajalik,hädavajalik,häda_vajalik,"['häda', 'vajalik']",0,,sg n,A
vajd,vajd,vajd,vajd,['vajd'],0,,sg n,S
merel,merel,meri,meri,['meri'],l,,sg ad,S
",",",",",",",","[',']",,,,Z
xhus,xhus,xhu,xhu,['xhu'],s,,sg in,S
vxi,vxi,vxi,vxi,['vxi'],0,,sg g,S
metsas,metsas,mets,mets,['mets'],s,,sg in,S


The results are poor: the misspelled words _onn_, _vajd_, _xhus_ and _vxi_ were all analysed as nouns (`S`) with guessed roots.

But we can create a dictionary that maps each misspelled word to a standard word (a correctly spelled word):

In [3]:
from estnltk.taggers import make_userdict

# Create UserDictTagger based on given corrections
userdict = make_userdict({'onn': 'on',
                          'vajd': 'vaid',
                          'xhus': 'õhus',
                          'vxi': 'või'}, ignore_case=True)

`make_userdict` returns an instance of `UserDictTagger`. The parameter `ignore_case` specifies that case differences are ignored when looking up misspelled words in the text.
Now we can apply the `UserDictTagger` to correct the analyses ("retag morphological analyses"):

In [4]:
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,see,see,see,['see'],0,,sg n,P
onn,on,olema,ole,['ole'],0,,b,V
,on,olema,ole,['ole'],0,,vad,V
hädavajalik,hädavajalik,hädavajalik,häda_vajalik,"['häda', 'vajalik']",0,,sg n,A
vajd,vaid,vaid,vaid,['vaid'],0,,,D
,vaid,vaid,vaid,['vaid'],0,,,J
merel,merel,meri,meri,['meri'],l,,sg ad,S
",",",",",",",","[',']",,,,Z
xhus,õhus,õhk,õhk,['õhk'],s,,sg in,S
vxi,või,või,või,['või'],0,,sg g,S


And — _voilà_ — we have obtained corrected analyses for misspelled words.

### Inspecting dictionary. Saving and loading dictionary's contents

If you need to inspect the content of the user dictionary (list all words and their corrected analyses), you can use `UserDictTagger`'s method `save_as_csv( None )`, which returns dictionary's content as a CSV formatted string:

In [5]:
# NBVAL_IGNORE_OUTPUT
print( userdict.save_as_csv( None ) )

text	normalized_text	root	ending	clitic	form	partofspeech
onn	on	ole	0		b	V
onn	on	ole	0		vad	V
vajd	vaid	vaid	0			D
vajd	vaid	vaid	0			J
vxi	või	või	0		sg g	S
vxi	või	või	0		o	V
vxi	või	või	0			D
vxi	või	või	0			J
vxi	või	või	0		sg n	S
xhus	õhus	õhk	s		sg in	S



If you pass a file name to the method `save_as_csv`, then dictionary content will be saved to the corresponding file:

In [6]:
# Save user dictionary into file 'my_corrections.csv'
userdict.save_as_csv( 'my_corrections.csv' )

In order to create `UserDictTagger` from a CSV file, you need to import the tagger and initialize it with `csv_file` parameter pointing to the name of the file:

In [7]:
# Load user dictionary from file 'my_corrections.csv'
from estnltk.taggers import UserDictTagger
userdict2 = UserDictTagger(csv_file='my_corrections.csv', ignore_case=True)

In [8]:
# Check contents of the loaded user dictionary
# NBVAL_IGNORE_OUTPUT
print( userdict2.save_as_csv( None ) )

text	normalized_text	root	ending	clitic	form	partofspeech
onn	on	ole	0		b	V
onn	on	ole	0		vad	V
vajd	vaid	vaid	0			D
vajd	vaid	vaid	0			J
vxi	või	või	0		sg g	S
vxi	või	või	0		o	V
vxi	või	või	0			D
vxi	või	või	0			J
vxi	või	või	0		sg n	S
xhus	õhus	õhk	s		sg in	S



While saving dictionary to a file or loading dictionary from a file, you can also change the formatting parameters of the CSV file. See the sections "Loading analyses from CSV file" and "Saving analyses to CSV file" below for details.

### Configuration of morphological analysis

For creating morphological analyses, the function `make_userdict` uses `VabamorfAnalyzer` with default settings. 
If you want to change the parameters of morphological analysis, you can create an instance of `VabamorfAnalyzer` with desired settings and pass it to `make_userdict` as an argument:

In [9]:
# Create VabamorfAnalyzer that does not guess proper names
from estnltk.taggers import VabamorfAnalyzer
vm_analyzer = VabamorfAnalyzer(propername=False)

# Create UserDictTagger based on given corrections
# and using given VabamorfAnalyzer
userdict = make_userdict({'Jänenene': 'Jänes',
                          'Kissu': 'Kiisu',
                          'Tsuksu':'Suksu'}, 
                          vm_analyzer=vm_analyzer)

In [10]:
# NBVAL_IGNORE_OUTPUT
# Inspect user dictionary content
print( userdict.save_as_csv( None ) )

text	normalized_text	root	ending	clitic	form	partofspeech
Jänenene	Jänes	jänes	0		sg n	S
Kissu	Kiisu	kiisu	0		sg g	S
Kissu	Kiisu	kiisu	0		sg n	S
Tsuksu	Suksu	suksu	0		sg g	S
Tsuksu	Suksu	suksu	0		sg n	S



<div class="alert alert-block alert-warning">
<h4>Notes on morphological ambiguity and disambiguation</h4>
<br>
While creating a user dictionary with <code>make_userdict</code>, you can also map a non-standard word to a list of corresponding standard words in case there is an ambiguity.
For instance, in the previous example, we could define:
<pre>
userdict = make_userdict({'Jänenene': ['Jänes', 'Jänku'],
                          'Kissu': 'Kiisu',
                          'Tsuksu':'Suksu'}, 
                          vm_analyzer=vm_analyzer)
</pre>
As a result, <code>VabamorfAnalyzer</code> generates analyses for each of the listed words, and the entry for word 'Jänenene' will be:
<pre>
text	normalized_text	root	ending	clitic	form	partofspeech
Jänenene	Jänes	jänes	0		sg n	S
Jänenene	Jänku	jänku	0		sg g	S
Jänenene	Jänku	jänku	0		sg n	S
</pre>
<i>Be aware</i> that applying this correction on <code>'morph_analysis'</code> layer produces words with multiple <code>normalized_text</code>-s.
However, with multiple <code>normalized_text</code>-s, there is no guarantee for correct morphological disambiguation. Therefore, if you have made corrections that produce multiple normalized forms, we do not recommend applying disambiguation.
If you really need to do it, you should definitely check first if the disambiguation quality is satisfactory.
</div>

## `UserDictTagger`

### Basic usage: partial overwriting

If you want a more fine-grained control over corrections made on morphological analyses, then you can manually initialize `UserDictTagger` with the corrections you want to make.

Let's consider an example where we want to correct the root of a compound word:

In [11]:
from estnltk import Text
text = Text('Abieluettepanek lükati tagasi')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Abieluettepanek,Abieluettepanek,abieluettepanek,abi_elu_ette_panek,"['abi', 'elu', 'ette', 'panek']",0,,sg n,S
lükati,lükati,lükkama,lükka,['lükka'],ti,,ti,V
tagasi,tagasi,tagasi,tagasi,['tagasi'],0,,,D


We want to change `root` of the word _Abieluettepanek_ from `abi_elu_ette_panek` to `abielu_ettepanek`.
For this, we can create a mapping from _'abieluettepanek'_ to a dictionary that specifies correct attribute values.
In addition to specifying new value for `root`, we also need to specify `partofspeech`, as this is required for automatically creating new values for `lemma` and `root_tokens`:

In [12]:
# Define corrections for root (and partofspeech)
my_corrections = {
    'abieluettepanek': { 'root': 'abielu_ettepanek', 'partofspeech': 'S' } 
}

In [13]:
# Create UserDictTagger based on given corrections
from estnltk.taggers import UserDictTagger
userdict = UserDictTagger( words_dict = my_corrections, ignore_case=True )

In [14]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Abieluettepanek,Abieluettepanek,abieluettepanek,abielu_ettepanek,"['abielu', 'ettepanek']",0,,sg n,S
lükati,lükati,lükkama,lükka,['lükka'],ti,,ti,V
tagasi,tagasi,tagasi,tagasi,['tagasi'],0,,,D


Note:

  * If you need to change any of the attributes `root`, `lemma` or `root_tokens`, you should update all of them to keep the data consistent. The systematic way to do this is to restrict your dictionary entries to `root` and `partofspeech` (and `ending`, if required). The attributes `lemma` and `root_tokens` will then be generated automatically from `root` and `partofspeech`, so there is no need to provide them manually.

Mapping a changeable word to a dictionary of new attribute values initiates **partial overwriting** -- only specific attributes will be corrected and other attributes will remain as they are.

Let's consider another example of partial overwriting:

In [15]:
# Example: word that needs corrections only in the root and partofspeech
text = Text('igapäävased toimetused')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
igapäävased,igapäävased,igapäävask,igapää_vask,"['igapää', 'vask']",d,,pl n,S
toimetused,toimetused,toimetus,toimetus,['toimetus'],d,,pl n,S


In [16]:
# Corrections only for 'root' and 'partofspeech' of the word (leave other attributes as they are)
my_corrections = {
    'igapäävased': { 'root': 'iga_päevane', 'partofspeech': 'A'} 
}
# Create new user dictionary
userdict = UserDictTagger( words_dict=my_corrections )

In [17]:
# Apply corrections:
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
igapäävased,igapäävased,igapäevane,iga_päevane,"['iga', 'päevane']",d,,pl n,A
toimetused,toimetused,toimetus,toimetus,['toimetus'],d,,pl n,S


The minimum requirement for the dictionary of partial overwriting: it must specify at least one of the fields: `'root'`, `'ending'`, `'clitic'`, `'form'`, and `'partofspeech'`. Note: if `'root'` is provided, then `'partofspeech'` must also be provided (so that `'root_tokens'` and `'lemma'` can be automatically created).
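
To make the dependency on `partofspeech` concrete, here is a simplified sketch of how `lemma` and `root_tokens` could be derived from `root`. It is an illustration based only on the outputs shown in this tutorial, not EstNLTK's actual implementation: the helper names `derive_root_tokens` and `derive_lemma` are hypothetical, and the sketch assumes that `_` is the only boundary marker in the root and that verbs (`V`) take the lemma suffix `ma`.

```python
# Simplified illustration only; NOT EstNLTK's actual implementation.
# Assumptions: '_' is the only boundary marker in a root, and the lemma
# of a verb (partofspeech 'V') is its root followed by 'ma'.

def derive_root_tokens(root):
    """Split a root into its compound parts at the '_' markers."""
    return root.split('_')

def derive_lemma(root, partofspeech):
    """Join the compound parts; append 'ma' if the word is a verb."""
    lemma = root.replace('_', '')
    if partofspeech == 'V':
        lemma += 'ma'
    return lemma

print(derive_root_tokens('iga_päevane'))  # ['iga', 'päevane']
print(derive_lemma('iga_päevane', 'A'))   # igapäevane
print(derive_lemma('ole', 'V'))           # olema
```

The verb case shows why `partofspeech` is needed: the same root would yield a different lemma depending on the part of speech.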

### Complete overwriting

If the correction entry maps a word to a list of dictionaries, then all old analyses of the word will be replaced by the listed analyses.
A list with a single dictionary means that the word is unambiguous, and multiple dictionaries represent different analysis variants of an ambiguous word.

In [18]:
# Create new user dictionary with multiple analysis variants
my_corrections = {
    'onn': [{'form': 'b', 'root': 'ole', 'ending': '0', 'partofspeech': 'V', 'clitic': ''},
            {'form': 'vad', 'root': 'ole', 'ending': '0', 'partofspeech': 'V', 'clitic': ''}]
}
userdict = UserDictTagger( words_dict = my_corrections )

In [19]:
# Example: verb needs corrections, but the ambiguities should remain
#          ( because it is not clear from the context, which form is correct )
text = Text('vist onn rahul')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
vist,vist,vist,vist,['vist'],0,,,D
onn,onn,onn,onn,['onn'],0,,sg n,S
rahul,rahul,rahul,rahul,['rahul'],0,,,D


In [20]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
vist,vist,vist,vist,['vist'],0,,,D
onn,,olema,ole,['ole'],0,,b,V
,,olema,ole,['ole'],0,,vad,V
rahul,rahul,rahul,rahul,['rahul'],0,,,D


The minimum requirement for the dictionary used in complete overwriting: it must specify all the fields `'root'`, `'ending'`, `'clitic'`, `'form'`, and `'partofspeech'`.
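
The minimum requirements for the two entry types can be summarised as a small pre-check that you might run on your corrections before building the tagger. This is a sketch under stated assumptions: the function `check_entry` is hypothetical and not part of EstNLTK, and the checks that `UserDictTagger` performs internally may differ in detail.

```python
# Hypothetical helper (not part of EstNLTK): checks a correction entry
# against the minimum requirements described in this tutorial.
ANALYSIS_FIELDS = {'root', 'ending', 'clitic', 'form', 'partofspeech'}

def check_entry(entry):
    """Return 'partial' or 'complete'; raise ValueError if fields are missing."""
    if isinstance(entry, dict):
        # Partial overwriting: at least one analysis field is required
        if not ANALYSIS_FIELDS & entry.keys():
            raise ValueError('entry specifies none of the analysis fields')
        # A new root needs partofspeech for generating lemma and root_tokens
        if 'root' in entry and 'partofspeech' not in entry:
            raise ValueError("'root' requires 'partofspeech'")
        return 'partial'
    if isinstance(entry, list):
        # Complete overwriting: every variant must specify all five fields
        for variant in entry:
            missing = ANALYSIS_FIELDS - variant.keys()
            if missing:
                raise ValueError('missing fields: %s' % sorted(missing))
        return 'complete'
    raise ValueError('entry must be a dict or a list of dicts')

print(check_entry({'root': 'iga_päevane', 'partofspeech': 'A'}))  # partial
print(check_entry([{'form': 'b', 'root': 'ole', 'ending': '0',
                    'partofspeech': 'V', 'clitic': ''}]))         # complete
```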

### More details on `UserDictTagger`

#### About `normalized_text` attribute

A dictionary entry for a word may contain a `normalized_text` value, but this is not mandatory. Note, however, that if `normalized_text` is missing from the entry (and you are using complete overwriting), then by default the value of `normalized_text` will be set to `None` in the `'morph_analysis'` layer.

You can initialize `UserDictTagger` with the parameter `replace_missing_normalized_text_with_text=True`. Then, if `normalized_text` is missing from the dictionary entry, its value will be replaced with the word's text. Note, however, that if the word's text is a non-standard word form (such as _vajd_, _xhus_, _vxi_ in the previous example), the outcome will be misleading (the `normalized_text` will actually be the non-standard form). So, you should use this option when you are correcting analyses of words that already follow the orthographic standard.

#### About dictionary lookup

Words that you add to `UserDictTagger` will be matched against `normalized_text` values of text's morphological analyses. 
If a morphological analysis has `normalized_text` equal to `None`, then the dictionary word will be matched against the `text` of the morphological analysis. By default, the matching is case-sensitive, but you can make it case-insensitive by setting `ignore_case=True` when initializing `UserDictTagger`.
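
The lookup rule can be sketched in plain Python as follows. This only illustrates the rule described above and is not EstNLTK code: the functions `lookup_key` and `find_entry` are hypothetical, and an analysis is modelled as a plain dictionary with `text` and `normalized_text` keys.

```python
# Illustration of the lookup rule only; hypothetical helpers, not EstNLTK code.

def lookup_key(analysis, ignore_case=False):
    """Use normalized_text as the key, falling back to text when it is None."""
    key = analysis.get('normalized_text')
    if key is None:
        key = analysis['text']
    return key.lower() if ignore_case else key

def find_entry(words_dict, analysis, ignore_case=False):
    """Return the user dictionary entry matching the analysis, or None."""
    if ignore_case:
        words_dict = {word.lower(): entry for word, entry in words_dict.items()}
    return words_dict.get(lookup_key(analysis, ignore_case))

corrections = {'abieluettepanek': {'root': 'abielu_ettepanek', 'partofspeech': 'S'}}
analysis = {'text': 'Abieluettepanek', 'normalized_text': 'Abieluettepanek'}

print(find_entry(corrections, analysis))                    # None
print(find_entry(corrections, analysis, ignore_case=True))  # the matching entry
```

This is why the 'Abieluettepanek' example earlier needed `ignore_case=True`: the dictionary key was lowercase, while the word in the text was capitalised.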

#### About matching and matching priorities

If a match is found, and `UserDictTagger`'s entry for the word is a dictionary, then the _partial overwriting strategy_ will be applied: only those attributes of the morphological analysis that are in the dictionary will be overwritten, and all other attributes remain as they are.
If matching word's entry is a list of dictionaries, the _complete overwriting strategy_ will be applied: all morphological analyses of the word will be replaced by analyses from the corresponding `UserDictTagger`'s entry.

If some of word's morphological analyses obtain _partial overwriting_ matches, and some obtain _complete overwriting_ matches, then the final result will be complete overwriting according to the last complete overwriting match.
In a similar vein, if more than one morphological analysis obtains a complete overwriting match, then the final result will be overwriting according to the last complete overwriting match (so all matches before the last one will be ignored).
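
The two strategies and their priority can be sketched on plain Python dictionaries. The function `apply_corrections` below is hypothetical and only models the rules described in this section; it is not how `UserDictTagger` is implemented, and it does not regenerate `lemma` or `root_tokens`.

```python
# Sketch of the overwriting rules; hypothetical function, not EstNLTK code.

def apply_corrections(analyses, words_dict):
    """Apply user dictionary entries to all analyses of a single word."""
    corrected = []
    last_complete = None
    for analysis in analyses:
        # Match on normalized_text, falling back to text when it is None
        key = analysis.get('normalized_text')
        if key is None:
            key = analysis['text']
        entry = words_dict.get(key)
        if isinstance(entry, list):
            # Complete overwriting match: remember the last one
            last_complete = entry
        elif isinstance(entry, dict):
            # Partial overwriting: replace only the listed attributes
            analysis = {**analysis, **entry}
        corrected.append(dict(analysis))
    if last_complete is not None:
        # A complete overwriting match takes priority over partial ones
        return [dict(variant) for variant in last_complete]
    return corrected

old = [{'text': 'igapäävased', 'normalized_text': 'igapäävased',
        'root': 'igapää_vask', 'form': 'pl n', 'partofspeech': 'S'}]
partial = {'igapäävased': {'root': 'iga_päevane', 'partofspeech': 'A'}}
print(apply_corrections(old, partial))
```

In the partial case, `form` keeps its old value `'pl n'`, while `root` and `partofspeech` are replaced.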

#### How to overwrite only unknown words

By default, `UserDictTagger` overwrites all words that can be matched to the user dictionary. This means that unknown words with `None` analyses are overwritten as well as known words with existing analyses.
However, you can use the setting `overwrite_existing=False` on initializing `UserDictTagger` to turn off overwriting of existing analyses.
With this setting, only words with `None` analyses will obtain analyses from the user dictionary, and all words with existing analyses will remain as they are.

#### How to turn off category validation [Advanced]

Analyses added to the `UserDictTagger` will be checked for validity of partofspeech and form categories. 
The validation checks if the respective category values are valid category values for Vabamorf. 
If you want to introduce new categories, then you can turn off category validation with the flag `validate_vm_categories=False` upon initialization.

### Loading analyses from CSV file

`UserDictTagger` takes a parameter `csv_file`, which specifies the name of the CSV file from which the content of the dictionary will be loaded.

Let's consider an example of loading corrections from a customized CSV file. First, create the file:

In [21]:
# Create a CSV file with correct analyses
import tempfile
fp = tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', suffix='.csv', delete=False)
# Add header
fp.write('text,form,root,ending,partofspeech,clitic\n')
# Add analyses
fp.write('mxnel,sg ad,mõni,l,P,\n')
fp.write('igapäävased,pl n,iga_päevane,d,A,\n')
fp.write('kxnekeeleväljändid,pl n,kõne_keele_väljend,d,S,\n')
fp.write('sellged,pl n,selge,d,A,\n')
fp.close()

The first line of the CSV file must be a header that uses the column names `'root'`, `'ending'`, `'clitic'`, `'form'`, `'partofspeech'` and `'text'`. The header determines the order in which the data is loaded from the file.

Each line following the header specifies a single analysis for a word. The word itself must be under the column `'text'`. Note that, as with _complete overwriting_, all the fields `'root'`, `'ending'`, `'clitic'`, `'form'` and `'partofspeech'` must be specified. You can also provide multiple lines describing a single word: these will then be treated as different analysis variants of an ambiguous word.
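
The expected file layout can be illustrated with Python's standard `csv` module. The sketch below is not how `UserDictTagger` reads files internally: the helper `read_words_dict` is hypothetical, and the sample content is made up from analyses shown earlier in this tutorial. It groups rows by the `text` column, so that several rows for one word become several analysis variants, similar to the lists used in complete overwriting.

```python
import csv
import io
from collections import defaultdict

# Sample content for illustration: two rows for the ambiguous word 'onn'
csv_content = (
    'text,form,root,ending,partofspeech,clitic\n'
    'onn,b,ole,0,V,\n'
    'onn,vad,ole,0,V,\n'
    'sellged,pl n,selge,d,A,\n'
)

def read_words_dict(stream, **fmtparams):
    """Group CSV rows by the 'text' column (hypothetical helper)."""
    words = defaultdict(list)
    # The header row tells DictReader which column holds which attribute
    for row in csv.DictReader(stream, **fmtparams):
        word = row.pop('text')
        words[word].append(dict(row))
    return dict(words)

words_dict = read_words_dict(io.StringIO(csv_content), delimiter=',')
print(len(words_dict['onn']))         # 2
print(words_dict['sellged'][0])
```

Because the columns are looked up by header name, the column order in the file does not matter.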

In [22]:
# Create new user dictionary with the analyses loaded from the CSV file
userdict = UserDictTagger( csv_file=fp.name, encoding='utf-8', delimiter=',' )

Note: you can pass optional parameters, such as `dialect` and `delimiter`, to the constructor in order to specify the formatting of the CSV file. Basically, you can use the same parameters as can be used with the method `csv.reader`: https://docs.python.org/3/library/csv.html#csv.reader

In [23]:
# Example: a difficult-to-analyse sentence from the Internet language
text = Text("mxnel ka igapäävased kxnekeeleväljändid sellged")
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
mxnel,mxnel,mxne,mxne,['mxne'],l,,sg ad,S
ka,ka,ka,ka,['ka'],0,,,D
igapäävased,igapäävased,igapäävask,igapää_vask,"['igapää', 'vask']",d,,pl n,S
kxnekeeleväljändid,kxnekeeleväljändid,kxnekeeleväljänd,kxnekeeleväljänd,['kxnekeeleväljänd'],d,,pl n,S
,kxnekeeleväljändid,kxnekeeleväljändi,kxnekeeleväljändi,['kxnekeeleväljändi'],d,,pl n,S
,kxnekeeleväljändid,kxnekeeleväljänt,kxnekeeleväljänt,['kxnekeeleväljänt'],d,,pl n,S
sellged,sellged,sellge,sellge,['sellge'],d,,pl n,S
,sellged,sellged,sellged,['sellged'],0,,sg n,S


In [24]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
mxnel,,mõni,mõni,['mõni'],l,,sg ad,P
ka,ka,ka,ka,['ka'],0,,,D
igapäävased,,igapäevane,iga_päevane,"['iga', 'päevane']",d,,pl n,A
kxnekeeleväljändid,,kõnekeeleväljend,kõne_keele_väljend,"['kõne', 'keele', 'väljend']",d,,pl n,S
sellged,,selge,selge,['selge'],d,,pl n,A


In [25]:
# Clean-up: remove temporary file
import os
os.remove(fp.name)

### Saving analyses to CSV file

`UserDictTagger`'s method `save_as_csv( filename )` can be used for saving the content of the dictionary to a CSV format file. Once the data is saved via `save_as_csv`, it can be loaded via initializing `UserDictTagger` with the parameter `csv_file`. 

Note 1: you can also pass optional parameters, such as `dialect` and `delimiter`, to the `save_as_csv( filename )` in order to change the formatting of the CSV. Basically, you can use the same parameters as can be used with the method `csv.writer`: https://docs.python.org/3/library/csv.html#csv.writer 

Note 2: if you use `None` in place of _filename_, then the method constructs and returns a CSV formatted string instead:

In [26]:
# NBVAL_IGNORE_OUTPUT
print( userdict.save_as_csv( None ) )

text	root	ending	clitic	form	partofspeech
igapäävased	iga_päevane	d		pl n	A
kxnekeeleväljändid	kõne_keele_väljend	d		pl n	S
mxnel	mõni	l		sg ad	P
sellged	selge	d		pl n	A



Note 3: if the dictionary contains partial overwriting entries, then the output CSV will have `'----------'` in places of attribute values that were not described in the (partial overwriting) dictionary.
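
If you want to process the returned CSV string further in Python, one option is to parse it with the standard `csv` module. The sketch below assumes the tab-separated layout seen in the output above and the placeholder described in Note 3; the sample string (including the partial overwriting row) is constructed by hand for illustration, and the helper `parse_dictionary_csv` is hypothetical, not an EstNLTK function.

```python
import csv
import io

# Hand-made sample in the tab-separated layout shown above; the last row
# imitates a partial overwriting entry with placeholder values.
csv_string = (
    'text\troot\tending\tclitic\tform\tpartofspeech\n'
    'igapäävased\tiga_päevane\td\t\tpl n\tA\n'
    'mxnel\tmõni\tl\t\tsg ad\tP\n'
    'abieluettepanek\tabielu_ettepanek\t----------\t----------\t----------\tS\n'
)

PLACEHOLDER = '----------'

def parse_dictionary_csv(content):
    """Parse the CSV string into a list of dicts, dropping placeholder values."""
    rows = []
    for row in csv.DictReader(io.StringIO(content), delimiter='\t'):
        rows.append({key: value for key, value in row.items()
                     if value != PLACEHOLDER})
    return rows

for entry in parse_dictionary_csv(csv_string):
    print(entry)
```

Dropping the placeholder values leaves only the attributes that the partial overwriting entry actually specifies.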