## Morphological analysis with user dictionary

If you need to analyse non-standard Estonian texts (such as Internet language, transcribed spoken language, or written texts heavily influenced by regional dialects), the standard morphological analyser will probably perform poorly.
But if the errors are regular enough, you can compose (either manually or semi-automatically) a user dictionary with corrections.
You can then apply the dictionary to rewrite the `'morph_analysis'` layer, so that words with erroneous analyses receive the correct analyses from the dictionary.

Let's consider an example sentence written in Internet language:

In [1]:
text_str = "see onn hädavajalik vajd merel, xhus vxi metsas"

First, let's try to analyse it with the standard morphological analyser:

In [2]:
from estnltk import Text
text = Text(text_str)
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,see,see,see,['see'],0,,sg n,P
onn,onn,onn,onn,['onn'],0,,sg n,S
hädavajalik,hädavajalik,hädavajalik,häda_vajalik,"['häda', 'vajalik']",0,,sg n,A
vajd,vajd,vajd,vajd,['vajd'],0,,sg n,S
merel,merel,meri,meri,['meri'],l,,sg ad,S
",",",",",",",","[',']",,,,Z
xhus,xhus,xhu,xhu,['xhu'],s,,sg in,S
vxi,vxi,vxi,vxi,['vxi'],0,,sg g,S
metsas,metsas,mets,mets,['mets'],s,,sg in,S


The results are poor: the non-standard words _onn_, _vajd_, _xhus_ and _vxi_ were all analysed as unknown nouns (`S`).

But we can create a user dictionary:

In [3]:
from estnltk.taggers import UserDictTagger

# Create a new user dictionary (stores words in a case-insensitive manner)
userdict = UserDictTagger(ignore_case=True)

... and populate it with correct analyses:

In [4]:
userdict.add_word('onn', [{'normalized_text':'on', 'form': 'b', 'root': 'ole', 'ending':'0', 'partofspeech': 'V', 'clitic':''}] )
userdict.add_word('vajd', [{'normalized_text':'vaid', 'form': '', 'root': 'vaid', 'ending':'0', 'partofspeech': 'D', 'clitic':''}] )
userdict.add_word('xhus', [{'normalized_text':'õhus', 'form': 'sg in', 'root': 'õhk', 'ending':'s', 'partofspeech': 'S', 'clitic':''}] )
userdict.add_word('vxi', [{'normalized_text':'või', 'form': '', 'root': 'või', 'ending':'0', 'partofspeech': 'J', 'clitic':''}] )

... and apply it to correct the analyses:

In [5]:
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,see,see,see,['see'],0,,sg n,P
onn,on,olema,ole,['ole'],0,,b,V
hädavajalik,hädavajalik,hädavajalik,häda_vajalik,"['häda', 'vajalik']",0,,sg n,A
vajd,vaid,vaid,vaid,['vaid'],0,,,D
merel,merel,meri,meri,['meri'],l,,sg ad,S
",",",",",",",","[',']",,,,Z
xhus,õhus,õhk,õhk,['õhk'],s,,sg in,S
vxi,või,või,või,['või'],0,,,J
metsas,metsas,mets,mets,['mets'],s,,sg in,S


_Voilà !_

Notes about `normalized_text`:
   * A dictionary entry for a word may contain a `normalized_text` value, but this is not mandatory. Note, however, that if `normalized_text` is missing from the entry (and you are using complete overwriting), then by default the value of `normalized_text` will be set to `None`.

   * You can initialize `UserDictTagger` with the parameter `replace_missing_normalized_text_with_text=True`. Then, if `normalized_text` is missing from a dictionary entry, its value will be replaced with the word's text. Note, however, that if the word's text is a non-standard word form (such as _vajd_, _xhus_, _vxi_ in the previous example), the outcome will be misleading: `normalized_text` will actually hold the non-standard form. So, you should use this option only if all word texts in your dictionary correspond to standard / normalized words.
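
The two notes above can be modelled in a few lines of plain Python. This is only a sketch of the rule as described here, not EstNLTK's implementation; the helper name `resolve_normalized_text` and its parameter names are hypothetical.

```python
def resolve_normalized_text(entry, word_text, replace_missing_with_text=False):
    """Model which normalized_text a complete-overwriting entry ends up with.

    Hypothetical helper: it mirrors the rule described in the notes only.
    """
    if 'normalized_text' in entry:
        return entry['normalized_text']
    # The entry has no normalized_text: fall back to the word's text or to None
    return word_text if replace_missing_with_text else None


entry_with = {'normalized_text': 'vaid', 'root': 'vaid', 'partofspeech': 'D'}
entry_without = {'root': 'vaid', 'partofspeech': 'D'}

print(resolve_normalized_text(entry_with, 'vajd'))           # vaid
print(resolve_normalized_text(entry_without, 'vajd'))        # None
print(resolve_normalized_text(entry_without, 'vajd', True))  # vajd
```

The last call shows the pitfall: the non-standard form _vajd_ ends up in `normalized_text`.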


---

### Partial overwriting

If the existing analysis only needs partial corrections (e.g. only root is incorrect), you can pass a dictionary with specific corrections to the `add_word` method. Existing analyses of the word will then be merged with the new attributes from the dictionary -- attributes present in the user dictionary will be overwritten, and attributes not present in the dictionary will remain as they are.

In [6]:
# Example: a word that needs corrections in the root and partofspeech
text = Text('igapäävased')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
igapäävased,igapäävased,igapäävask,igapää_vask,"['igapää', 'vask']",d,,pl n,S


In [7]:
# Create new user dictionary
userdict = UserDictTagger()
# Correct only 'root' and 'partofspeech' of the word (leave other attributes as they are)
userdict.add_word('igapäävased', { 'root': 'iga_päevane', 'partofspeech': 'A'} )

In [8]:
# Apply corrections:
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
igapäävased,igapäävased,igapäevane,iga_päevane,"['iga', 'päevane']",d,,pl n,A


The minimum requirement for a partial overwriting dictionary: it must specify at least one of the fields `'root'`, `'ending'`, `'clitic'`, `'form'` and `'partofspeech'`. Note: if `'root'` is provided, then `'partofspeech'` must also be provided (see below for details).
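
If you generate partial overwriting entries semi-automatically, it can help to check them against these two rules before adding them. The sketch below is a stand-alone check in plain Python; `check_partial_entry` is a hypothetical helper, not part of `UserDictTagger`.

```python
OVERWRITABLE_FIELDS = ('root', 'ending', 'clitic', 'form', 'partofspeech')


def check_partial_entry(entry):
    """Raise ValueError if a partial-overwriting entry breaks the stated rules."""
    if not any(field in entry for field in OVERWRITABLE_FIELDS):
        raise ValueError('entry must specify at least one of: '
                         + ', '.join(OVERWRITABLE_FIELDS))
    if 'root' in entry and 'partofspeech' not in entry:
        raise ValueError("'root' requires 'partofspeech'")
    return True


# Valid: root is accompanied by partofspeech
check_partial_entry({'root': 'iga_päevane', 'partofspeech': 'A'})
# Valid: only the form is corrected
check_partial_entry({'form': 'pl n'})
# Invalid: root without partofspeech
try:
    check_partial_entry({'root': 'iga_päevane'})
except ValueError as err:
    print('rejected:', err)
```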

---

#### Overwriting `'root'`, `'lemma'` and `'root_tokens'`

Attributes `'root'`, `'lemma'` and `'root_tokens'` all record information about the morphological base form of the word and its segmentation.
If you need to change one of these attributes, you should update all of them to keep the data consistent.
There is a systematic way to do this.
Restrict your dictionary entries to `'root'` and `'partofspeech'` (and `'ending'`, if required). Attributes `'lemma'` and `'root_tokens'` will then be generated automatically from `'root'` and `'partofspeech'`, so there is no need to provide them manually.
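
To make the dependency concrete, the sketch below reproduces the derived values that appear in this document's output tables. It is a simplified model under two assumptions: `_` is the only separator in the root that needs handling, and verbs (`'V'`) get the lemma suffix `ma`. The function name is hypothetical and the real generation inside EstNLTK may cover more cases.

```python
def derive_lemma_and_root_tokens(root, partofspeech):
    """Simplified model of how lemma and root_tokens follow from root and POS."""
    # Assumption: compound parts are separated by '_' only
    root_tokens = root.split('_')
    lemma = ''.join(root_tokens)
    # Assumption: verb lemmas are given in the ma-infinitive
    if partofspeech == 'V':
        lemma += 'ma'
    return lemma, root_tokens


print(derive_lemma_and_root_tokens('abielu_ettepanek', 'S'))
# ('abieluettepanek', ['abielu', 'ettepanek'])
print(derive_lemma_and_root_tokens('ole', 'V'))
# ('olema', ['ole'])
```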

In [9]:
# Example: a compound word that needs corrections in the root
text = Text('abieluettepanek')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
abieluettepanek,abieluettepanek,abieluettepanek,abi_elu_ette_panek,"['abi', 'elu', 'ette', 'panek']",0,,sg n,S


In [10]:
# Create new user dictionary with corrections to root and pos
userdict = UserDictTagger()
userdict.add_word('abieluettepanek', { 'root': 'abielu_ettepanek', 'partofspeech': 'S' } )

In [11]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
abieluettepanek,abieluettepanek,abieluettepanek,abielu_ettepanek,"['abielu', 'ettepanek']",0,,sg n,S


---

### Complete overwriting

If you pass a list of dictionaries to the method `add_word`, then all old analyses of the word will be replaced by the analyses in the user dictionary. A list with a single dictionary means that the word is unambiguous, while multiple dictionaries represent different analysis variants of an ambiguous word.

In [12]:
# Example: verb needs corrections, but the ambiguities should remain
#          ( because it is not clear from the context, which form is correct )
text = Text('vist onn rahul')
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
vist,vist,vist,vist,['vist'],0,,,D
onn,onn,onn,onn,['onn'],0,,sg n,S
rahul,rahul,rahul,rahul,['rahul'],0,,,D


In [13]:
# Create new user dictionary with multiple analysis variants
userdict = UserDictTagger()
userdict.add_word('onn', [{'form': 'b', 'root': 'ole', 'ending':'0', 'partofspeech': 'V', 'clitic':''},
                          {'form': 'vad', 'root': 'ole', 'ending':'0', 'partofspeech': 'V', 'clitic':''} ] )

In [14]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
vist,vist,vist,vist,['vist'],0,,,D
onn,,olema,ole,['ole'],0,,b,V
,,olema,ole,['ole'],0,,vad,V
rahul,rahul,rahul,rahul,['rahul'],0,,,D


The minimum requirement for the dictionary used in complete overwriting: it must specify all the fields `'root'`, `'ending'`, `'clitic'`, `'form'`, and `'partofspeech'`.

---


### Some details

#### About dictionary lookup

Words that you add to `UserDictTagger` will be matched against the `normalized_text` values of the text's morphological analyses.
If a morphological analysis has `normalized_text` equal to `None`, then the dictionary word will be matched against the `text` of the morphological analysis. By default, the matching is case sensitive, but you can make it case insensitive by setting `ignore_case=True` when initializing `UserDictTagger`.
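
The lookup rule can be sketched in plain Python as follows. Analyses are modelled as simple dicts and the user dictionary as a dict keyed by word; `find_entry` is a hypothetical helper written for illustration, not the tagger's actual code.

```python
def find_entry(entries, analysis, ignore_case=False):
    """Look up the user dictionary entry for one morphological analysis."""
    # Prefer normalized_text; fall back to the surface text if it is None
    word = analysis['normalized_text']
    if word is None:
        word = analysis['text']
    if ignore_case:
        # Assumption: keys in `entries` were lowercased when they were added
        word = word.lower()
    return entries.get(word)


entries = {'onn': {'root': 'ole', 'partofspeech': 'V'}}
analysis = {'text': 'Onn', 'normalized_text': None}

print(find_entry(entries, analysis))                    # None
print(find_entry(entries, analysis, ignore_case=True))  # {'root': 'ole', 'partofspeech': 'V'}
```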

#### About matching and matching priorities

If a match is found, and the `UserDictTagger` entry for the word is a dictionary, then the _partial overwriting strategy_ will be applied: only those attributes of the morphological analysis that are in the dictionary will be overwritten, and all other attributes remain as they are.
If the matching word's entry is a list of dictionaries, the _complete overwriting strategy_ will be applied: all morphological analyses of the word will be replaced by the analyses from the corresponding `UserDictTagger` entry.

If some of the word's morphological analyses obtain _partial overwriting_ matches, and some obtain _complete overwriting_ matches, then the final result will be complete overwriting according to the last complete overwriting match.
In a similar vein, if more than one morphological analysis obtains a complete overwriting match, then the final result will be overwriting according to the last complete overwriting match (so all matches before the last one will be ignored).
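
The priority rules can be summarised with a small plain-Python model. It assumes that each analysis of a word has already been looked up, giving `None` (no match), a dict (partial overwriting) or a list of dicts (complete overwriting). The function `apply_matches` is hypothetical and leaves out the regeneration of `lemma` and `root_tokens`.

```python
def apply_matches(analyses, matches):
    """Combine per-analysis matches into the final analyses of one word.

    analyses: list of dicts, one per existing analysis
    matches:  parallel list of None / dict (partial) / list of dicts (complete)
    """
    complete = [m for m in matches if isinstance(m, list)]
    if complete:
        # Any complete match wins over partial ones; the last one is used
        return [dict(a) for a in complete[-1]]
    result = []
    for analysis, match in zip(analyses, matches):
        merged = dict(analysis)
        if isinstance(match, dict):
            # Partial overwriting: only listed attributes are replaced
            merged.update(match)
        result.append(merged)
    return result


old = [{'root': 'sellge', 'form': 'pl n', 'partofspeech': 'S'},
       {'root': 'sellged', 'form': 'sg n', 'partofspeech': 'S'}]
partial = {'root': 'selge', 'partofspeech': 'A'}
complete = [{'root': 'selge', 'form': 'pl n', 'partofspeech': 'A'}]

print(apply_matches(old, [partial, None]))      # first analysis merged, second kept
print(apply_matches(old, [partial, complete]))  # replaced by the complete entry
```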

#### Browsing the content of dictionary

You can use tagger's method `save_as_csv( None )` to get the full content of the user dictionary as a CSV formatted string:

In [15]:
# NBVAL_IGNORE_OUTPUT
print( userdict.save_as_csv(None) )

text	root	ending	clitic	form	partofspeech
onn	ole	0		b	V
onn	ole	0		vad	V



See the section "Saving analyses to CSV file" below for details about the method.

---

## Loading analyses from CSV file

Instead of specifying analyses via the method `add_word`, you can also use the method `add_words_from_csv_file` to load analyses from a CSV file.

In [16]:
# Example: a difficult-to-analyse sentence written in Internet language
text = Text("mxnel ka igapäävased kxnekeeleväljändid sellged")
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
mxnel,mxnel,mxne,mxne,['mxne'],l,,sg ad,S
ka,ka,ka,ka,['ka'],0,,,D
igapäävased,igapäävased,igapäävask,igapää_vask,"['igapää', 'vask']",d,,pl n,S
kxnekeeleväljändid,kxnekeeleväljändid,kxnekeeleväljänd,kxnekeeleväljänd,['kxnekeeleväljänd'],d,,pl n,S
,kxnekeeleväljändid,kxnekeeleväljändi,kxnekeeleväljändi,['kxnekeeleväljändi'],d,,pl n,S
,kxnekeeleväljändid,kxnekeeleväljänt,kxnekeeleväljänt,['kxnekeeleväljänt'],d,,pl n,S
sellged,sellged,sellge,sellge,['sellge'],d,,pl n,S
,sellged,sellged,sellged,['sellged'],0,,sg n,S


In [17]:
# Create a CSV file with correct analyses
import tempfile
fp = tempfile.NamedTemporaryFile(mode='w', encoding='utf-8', suffix='.csv', delete=False)
# Add header
fp.write('text,form,root,ending,partofspeech,clitic\n')
# Add analyses
fp.write('mxnel,sg ad,mõni,l,P,\n')
fp.write('igapäävased,pl n,iga_päevane,d,A,\n')
fp.write('kxnekeeleväljändid,pl n,kõne_keele_väljend,d,S,\n')
fp.write('sellged,pl n,selge,d,A,\n')
fp.close()

The first line of the CSV file must be a header that uses the column names `'root'`, `'ending'`, `'clitic'`, `'form'`, `'partofspeech'` and `'text'`. The header determines the order in which the data is loaded from the file.

Each line following the header specifies a single analysis for a word. The word itself must be under the column `'text'`. Note that, as with _complete overwriting_, all the fields `'root'`, `'ending'`, `'clitic'`, `'form'` and `'partofspeech'` must be specified. You can also provide multiple lines describing a single word: these will then be treated as different analysis variants of an ambiguous word.
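
To check a CSV file against this format before loading it, you can read it with Python's standard `csv` module. The sketch below only illustrates the format (header-driven column order, one analysis per line, repeated words grouped into variants). `read_analyses` is a hypothetical helper and makes no claim about how `add_words_from_csv_file` is implemented.

```python
import csv
import io
from collections import defaultdict

REQUIRED_COLUMNS = {'text', 'root', 'ending', 'clitic', 'form', 'partofspeech'}


def read_analyses(csv_text, **fmtparams):
    """Group CSV rows by word; each row becomes one analysis variant."""
    reader = csv.DictReader(io.StringIO(csv_text), **fmtparams)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError('missing columns: ' + ', '.join(sorted(missing)))
    analyses = defaultdict(list)
    for row in reader:
        word = row.pop('text')
        analyses[word].append(dict(row))
    return dict(analyses)


csv_text = (
    'text,form,root,ending,partofspeech,clitic\n'
    'mxnel,sg ad,mõni,l,P,\n'
    'onn,b,ole,0,V,\n'
    'onn,vad,ole,0,V,\n'
)
result = read_analyses(csv_text, delimiter=',')
print(len(result['onn']))     # 2 (two variants of an ambiguous word)
print(result['mxnel'][0]['root'])  # mõni
```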

In [18]:
# Create new user dictionary with the analyses loaded from the CSV file
userdict = UserDictTagger()
userdict.add_words_from_csv_file(fp.name, encoding='utf-8', delimiter=',')

Note: you can pass optional parameters, such as `dialect` and `delimiter`, to the method `add_words_from_csv_file( filename )` in order to specify the formatting of the CSV file. Basically, you can use the same parameters as with the function `csv.reader`: https://docs.python.org/3/library/csv.html#csv.reader

In [19]:
# Apply corrections
userdict.retag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
mxnel,,mõni,mõni,['mõni'],l,,sg ad,P
ka,ka,ka,ka,['ka'],0,,,D
igapäävased,,igapäevane,iga_päevane,"['iga', 'päevane']",d,,pl n,A
kxnekeeleväljändid,,kõnekeeleväljend,kõne_keele_väljend,"['kõne', 'keele', 'väljend']",d,,pl n,S
sellged,,selge,selge,['selge'],d,,pl n,A


In [20]:
# Clean-up: remove temporary file
import os
os.remove(fp.name)

## Saving analyses to CSV file

The method `save_as_csv( filename )` saves the content of the user dictionary to a CSV file. The data is saved in such a way that it can be loaded again with the method `add_words_from_csv_file( filename )`.

Note 1: you can also pass optional parameters, such as `dialect` and `delimiter`, to `save_as_csv( filename )` in order to change the formatting of the CSV. Basically, you can use the same parameters as with the function `csv.writer`: https://docs.python.org/3/library/csv.html#csv.writer

Note 2: if you use `None` in place of _filename_, then the method constructs and returns a CSV formatted string instead:

In [21]:
# NBVAL_IGNORE_OUTPUT
print( userdict.save_as_csv( None ) )

text	root	ending	clitic	form	partofspeech
igapäävased	iga_päevane	d		pl n	A
kxnekeeleväljändid	kõne_keele_väljend	d		pl n	S
mxnel	mõni	l		sg ad	P
sellged	selge	d		pl n	A



Note 3: if the dictionary contains partial overwriting entries, then the output CSV will have `'----------'` in place of the attribute values that were not specified in the (partial overwriting) dictionary.
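
If you post-process such a CSV string yourself, you may want to drop the placeholder values again. The sketch below does this with the standard `csv` module. The sample string is hand-written from the description in Note 3 (it is not actual tagger output), the tab delimiter is assumed from the output shown above, and `strip_placeholders` is a hypothetical helper.

```python
import csv
import io

PLACEHOLDER = '----------'


def strip_placeholders(csv_text, delimiter='\t'):
    """Parse CSV text and drop attributes that hold only the placeholder."""
    entries = []
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=delimiter):
        entries.append({key: value for key, value in row.items()
                        if value != PLACEHOLDER})
    return entries


# Hand-written sample: a partial entry that corrects root and partofspeech only
sample = (
    'text\troot\tending\tclitic\tform\tpartofspeech\n'
    'igapäävased\tiga_päevane\t----------\t----------\t----------\tA\n'
)
print(strip_placeholders(sample))
# [{'text': 'igapäävased', 'root': 'iga_päevane', 'partofspeech': 'A'}]
```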