## Morphological analysis with user dictionary

If you need to analyse non-standard Estonian texts (such as Internet language, transcribed spoken language, or written texts heavily influenced by regional dialects), the standard morphological analyser will probably perform suboptimally.
But if the errors are regular enough, you can compose (either manually or semi-automatically) a user dictionary with corrections.
The dictionary is then used to rewrite the 'morph_analysis' layer so that words with erroneous analyses receive new analyses from the dictionary.
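
The rewriting step can be pictured with plain Python data structures. The sketch below is only a conceptual illustration, not EstNLTK's implementation: `apply_user_dict` is a hypothetical helper, and a list of `(word, analyses)` pairs stands in for the 'morph_analysis' layer.

```python
# Conceptual sketch only: plain Python objects stand in for the
# 'morph_analysis' layer. This is NOT how EstNLTK implements it.

def apply_user_dict(layer, user_dict, ignore_case=True):
    """Return a new list in which dictionary entries replace the original analyses.

    `layer` is a list of (word, analyses) pairs; `user_dict` maps a word
    to the list of analyses that should replace the original ones.
    """
    corrected = []
    for word, analyses in layer:
        key = word.lower() if ignore_case else word
        # Words missing from the dictionary keep their original analyses.
        corrected.append((word, user_dict.get(key, analyses)))
    return corrected


layer = [
    ('Onn', [{'root': 'onn', 'partofspeech': 'S', 'form': 'sg n'}]),
    ('merel', [{'root': 'meri', 'partofspeech': 'S', 'form': 'sg ad'}]),
]
user_dict = {'onn': [{'root': 'ole', 'partofspeech': 'V', 'form': 'b'}]}

corrected = apply_user_dict(layer, user_dict)
for word, analyses in corrected:
    print(word, analyses[0]['root'], analyses[0]['partofspeech'])
```

Only the word found in the dictionary gets a new analysis; every other word passes through unchanged.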

Let's consider an example sentence from Internet language:

In [1]:
text_str = "see onn hädavajalik vajd merel, xhus vxi metsas"

First, try to analyse it with the standard morphological analyser:

In [2]:
from estnltk import Text
text = Text(text_str)
text.tag_layer(['morph_analysis'])
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,start,end,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,0,3,see,see,"(see,)",0,,sg n,P
onn,4,7,onn,onn,"(onn,)",0,,sg n,S
hädavajalik,8,19,hädavajalik,häda_vajalik,"(häda, vajalik)",0,,sg n,A
vajd,20,24,vajd,vajd,"(vajd,)",0,,sg n,S
merel,25,30,meri,meri,"(meri,)",l,,sg ad,S
",",30,31,",",",","(,,)",,,,Z
xhus,32,36,xhu,xhu,"(xhu,)",s,,sg in,S
vxi,37,40,vxi,vxi,"(vxi,)",0,,sg g,S
metsas,41,47,mets,mets,"(mets,)",s,,sg in,S


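OK, the results are not so good: the misspelled words 'onn', 'vajd', 'xhus' and 'vxi' were all analysed as nouns (S) with guessed lemmas.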

But we can create a user dictionary:

In [3]:
from estnltk.taggers.morph.userdict_tagger import UserDictTagger

# Create a new user dictionary (stores words in a case-insensitive manner)
userdict = UserDictTagger(ignore_case=True)

... and populate it with correct analyses:

In [4]:
userdict.add_word('onn', [{'form': 'b', 'root': 'ole', 'ending':'0', 'partofspeech': 'V', 'clitic':''}] )
userdict.add_word('vajd', [{'form': '', 'root': 'vaid', 'ending':'0', 'partofspeech': 'D', 'clitic':''}] )
userdict.add_word('xhus', [{'form': 'sg in', 'root': 'õhk', 'ending':'s', 'partofspeech': 'S', 'clitic':''}] )
userdict.add_word('vxi', [{'form': '', 'root': 'või', 'ending':'0', 'partofspeech': 'J', 'clitic':''}] )
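
When the dictionary grows, writing each analysis dict by hand becomes repetitive. One possible way to keep the entries compact is sketched below; `make_analysis` and the `corrections` table are hypothetical helpers introduced here for illustration, not part of EstNLTK. The sketch only builds the same list-of-dicts structure that is passed to `add_word` in the cell above.

```python
# Hypothetical helper (not part of EstNLTK): builds one analysis dict
# with the same five keys used in the add_word() calls above.
def make_analysis(root, partofspeech, form='', ending='0', clitic=''):
    return {
        'root': root,
        'partofspeech': partofspeech,
        'form': form,
        'ending': ending,
        'clitic': clitic,
    }


# Compact table of corrections: (word, root, partofspeech, form, ending)
corrections = [
    ('onn',  'ole',  'V', 'b',     '0'),
    ('vajd', 'vaid', 'D', '',      '0'),
    ('xhus', 'õhk',  'S', 'sg in', 's'),
    ('vxi',  'või',  'J', '',      '0'),
]

# Map each word to a list holding its single corrected analysis.
entries = {
    word: [make_analysis(root, pos, form, ending)]
    for word, root, pos, form, ending in corrections
}

for word, analyses in entries.items():
    print(word, analyses)
```

Each `(word, analyses)` pair in `entries` has the shape of the arguments given to `userdict.add_word(word, analyses)` above, so the entries could be added in a loop.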

... and apply it to correct the analyses:

In [5]:
userdict.tag(text)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,start,end,lemma,root,root_tokens,ending,clitic,form,partofspeech
see,0,3,see,see,"(see,)",0,,sg n,P
onn,4,7,olema,ole,"(ole,)",0,,b,V
hädavajalik,8,19,hädavajalik,häda_vajalik,"(häda, vajalik)",0,,sg n,A
vajd,20,24,vaid,vaid,"(vaid,)",0,,,D
merel,25,30,meri,meri,"(meri,)",l,,sg ad,S
",",30,31,",",",","(,,)",,,,Z
xhus,32,36,õhk,õhk,"(õhk,)",s,,sg in,S
vxi,37,40,või,või,"(või,)",0,,,J
metsas,41,47,mets,mets,"(mets,)",s,,sg in,S


_Voilà !_
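
To see at a glance which words the dictionary changed, the lemmas from the two output tables above can be compared. The sketch below copies the `text` and `lemma` columns from those tables by hand into plain Python lists; it does not query EstNLTK.

```python
# (text, lemma) pairs copied from the output table before the correction
before = [
    ('see', 'see'), ('onn', 'onn'), ('hädavajalik', 'hädavajalik'),
    ('vajd', 'vajd'), ('merel', 'meri'), (',', ','),
    ('xhus', 'xhu'), ('vxi', 'vxi'), ('metsas', 'mets'),
]
# (text, lemma) pairs copied from the output table after the correction
after = [
    ('see', 'see'), ('onn', 'olema'), ('hädavajalik', 'hädavajalik'),
    ('vajd', 'vaid'), ('merel', 'meri'), (',', ','),
    ('xhus', 'õhk'), ('vxi', 'või'), ('metsas', 'mets'),
]

# Keep only the words whose lemma differs between the two tables.
changed = [
    (word, old, new)
    for (word, old), (_, new) in zip(before, after)
    if old != new
]

for word, old, new in changed:
    print(f'{word}: {old} -> {new}')
# onn: onn -> olema
# vajd: vajd -> vaid
# xhus: xhu -> õhk
# vxi: vxi -> või
```

Exactly the four words added to the user dictionary received new lemmas; all other analyses are untouched.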