## NormalizeWordsRetagger

NormalizeWordsRetagger is a retagger that adds possible normalized forms to words with features that occur mostly in Estonian Internet language.
<br>
The retagger detects two groups of features: **repetitions** (*letter repetitions*, e.g. toreeeeee, and *repetitive parts*, e.g. midagigigigi) and **diacritics** (*missing diacritics*, e.g. oun - õun, and *diacritics replaced with other characters*, e.g. l2hme - lähme). If any of these features is detected in a word, the word is processed and, where possible, one or more normalized forms are added to the *normalized_form* attribute of the words layer.

In [4]:
from estnltk import Text
from NormalizeWordsRetagger import NormalizeWordsRetagger

In [5]:
normalize_words_retagger =  NormalizeWordsRetagger()
normalize_words_retagger

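| name | output layer | output attributes | input layers |
|------|--------------|-------------------|--------------|
| NormalizeWordsRetagger | words | () | ('words',) |

| flag | value |
|------|-------|
| use_letter_reps | True |
| use_diacritics_fixes | True |
| use_diacritics_fixes_1 | True |
| use_diacritics_fixes_2 | False |
| use_diacritics_fixes_3 | False |
| use_vabamorf_speller | True |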


Before applying NormalizeWordsRetagger, the input Text object must have a layer "words".

As stated above, NormalizeWordsRetagger handles two groups of features, **repetitions** and **diacritics**. Each group can be turned on or off with its own flag: *use_letter_reps* for repetitions and *use_diacritics_fixes* for diacritics. 
<br>
For example, NormalizeWordsRetagger(use_letter_reps=False) deactivates **use_letter_reps**, which is set to True by default.
<br>
The retagger has 6 flags in total, 4 of which are set to True by default. 
<br>


#### Description of flags -- what kind of words are detected and normalized:

- **use_letter_reps** -- words with repeated letters (at least three identical letters in a row; e.g. *jeeeee, lõõõppppukssss*) and words with repetitive parts (e.g. *hahaha, blaaaablaaa*).
<br>
- **use_diacritics_fixes** -- words with missing diacritics (e.g. *mote - mõte*, *vahemalt - vähemalt*); replaced diacritics (e.g. *möödukalt - mõõdukalt*, *ykskoik - ükskõik*).
<br>
- **use_diacritics_fixes_1** -- the following replacements will be tried: {"y":"ü","6":"õ","2":"ä","å":"ä","ô":"õ","ó":"õ","ō":"õ","û":"ü","ú":"ü"}, e.g. *tyyp - tüüp*.
<br>
- **use_diacritics_fixes_2** -- the following replacements will be tried: {"a":"ä","o":["õ","ö"],"u":"ü"}, e.g. *ragastik - rägastik*.
<br>
- **use_diacritics_fixes_3** -- the following replacements will be tried: {"ö":["ü","õ","ö","ä"],"õ":["ü","õ","ö","ä"],"ü":["ü","õ","ö","ä"], "ä":["ü","õ","ö","ä"], "e":["ä","ö","õ"],"?":["ü","õ","ö","ä"]}, e.g. *tõdruk - tüdruk*.
<br>
- **use_vabamorf_speller** -- if a word has no normalized form and did not get one from the previous fixes, speller suggestions can be used.
<br>
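
The letter-repetition part of **use_letter_reps** can be pictured with a small standalone sketch. It is not the retagger's actual implementation: the helper `collapse_letter_runs` and its `min_run` parameter are hypothetical. The sketch only generates candidates, shortening every run of three or more identical letters to two letters and to one letter, and does not check whether a candidate is a real word.

```python
import re


def collapse_letter_runs(word, min_run=3):
    """Return candidate forms with long letter runs shortened (illustrative only)."""
    # Match any character that occurs at least `min_run` times in a row.
    run = re.compile(r"(.)\1{%d,}" % (min_run - 1))
    candidates = []
    for keep in (2, 1):  # first keep a double letter, then a single one
        candidate = run.sub(lambda m: m.group(1) * keep, word)
        if candidate != word and candidate not in candidates:
            candidates.append(candidate)
    return candidates


print(collapse_letter_runs("määä"))   # ['mää', 'mä']
print(collapse_letter_runs("prrrr"))  # ['prr', 'pr']
print(collapse_letter_runs("tore"))   # []
```

Generating both a double-letter and a single-letter candidate matters for Estonian, where double letters are common in correct words, so neither variant can be ruled out without a dictionary check.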

Note that the flags **use_diacritics_fixes_1**, **use_diacritics_fixes_2** and **use_diacritics_fixes_3** are subparts of the **diacritics** part controlled by **use_diacritics_fixes**. 
<br>
If **use_diacritics_fixes** is set to False, they are left unused as well. 
<br>
If **use_diacritics_fixes** is set to True, you can choose which of the fixes to use. By default only **use_diacritics_fixes_1** is used, as its replacements are less prone to generating false positive forms than those of the other two.
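
To illustrate how a replacement table turns into normalized forms, here is a minimal sketch under stated assumptions. The function `diacritic_candidates` and the set `TOY_LEXICON` are hypothetical and are not part of NormalizeWordsRetagger; the tables are copied from the flag descriptions above (table 1 abridged). The sketch tries every combination of replacements and keeps only the forms found in a small word list that stands in for a real spelling check.

```python
from itertools import product

# Replacement tables from the flag descriptions (FIXES_1 is abridged).
FIXES_1 = {"y": ["ü"], "6": ["õ"], "2": ["ä"]}
FIXES_2 = {"a": ["ä"], "o": ["õ", "ö"], "u": ["ü"]}

# Hypothetical stand-in for a real dictionary or morphological analyser.
TOY_LEXICON = {"tüüp", "mõõda", "mööda"}


def diacritic_candidates(word, replacements, lexicon):
    """Try every combination of replacements and keep forms found in `lexicon`."""
    # For each character: the character itself plus its possible replacements.
    options = [[ch] + replacements.get(ch, []) for ch in word]
    forms = ("".join(chars) for chars in product(*options))
    return [form for form in forms if form != word and form in lexicon]


print(diacritic_candidates("tyyp", FIXES_1, TOY_LEXICON))   # ['tüüp']
print(diacritic_candidates("mooda", FIXES_2, TOY_LEXICON))  # ['mõõda', 'mööda']
```

The second call shows why broader tables produce more false positives: the more characters a table may replace, the more combinations exist, and the more of them happen to be valid words.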

### Example #1

The first example runs NormalizeWordsRetagger on some incorrectly spelled words with features that the retagger should detect and, ideally, give normalized forms for.
<br>
This gives an overview of what NormalizeWordsRetagger does and how words are normalized.

In [6]:
normalize_words_retagger =  NormalizeWordsRetagger(use_diacritics_fixes_2=True, use_diacritics_fixes_3=True)

text = Text('lalalala aaahjaaa määä eee oooot muhahaaaa ahhhaaa väää prrrr blaaaaablaaa mote keige mooda roomus tõdruk nüiteks niioelda körgharidus ykskoik komeedia peerama tõli vötma mõõdas usaldusvaarne vahemalt tyyp raaagastik').tag_layer(['words'])

normalize_words_retagger.retag(text)

for i in text["words"]:
    print("Text form:",i.text,"\nNormalized form(s):",i.normalized_form,"\n")

Text form: lalalala 
Normalized form(s): ['lala'] 

Text form: aaahjaaa 
Normalized form(s): ['ahjaa'] 

Text form: määä 
Normalized form(s): ['mää', 'mä'] 

Text form: eee 
Normalized form(s): ['ee', 'e'] 

Text form: oooot 
Normalized form(s): ['oot', 'ot'] 

Text form: muhahaaaa 
Normalized form(s): ['muhaha'] 

Text form: ahhhaaa 
Normalized form(s): ['ahhaa', 'aha'] 

Text form: väää 
Normalized form(s): ['vä', 'vää', 'vöö'] 

Text form: prrrr 
Normalized form(s): ['prr', 'pr'] 

Text form: blaaaaablaaa 
Normalized form(s): ['blabla'] 

Text form: mote 
Normalized form(s): ['mõte'] 

Text form: keige 
Normalized form(s): ['käige', 'kõige'] 

Text form: mooda 
Normalized form(s): ['mõõda', 'mööda'] 

Text form: roomus 
Normalized form(s): ['rõõmus'] 

Text form: tõdruk 
Normalized form(s): ['tüdruk'] 

Text form: nüiteks 
Normalized form(s): ['näiteks'] 

Text form: niioelda 
Normalized form(s): ['niiöelda'] 

Text form: körgharidus 
Normalized form(s): ['kõrgharidus', 'kärgharidus']

In this example all flags were set to True. 
<br>
- The final example word *raaagastik* shows the processing order in the retagger: problems with repetitions are handled first (r**aaa**gastik -- ragastik). 
    If that part generates a new form that is still not a correct Estonian word, the form is passed on to the next part. The replacement rules in the diacritics part find that *ragastik* could be missing the letter ä, and as *rägastik* is a correct Estonian word, it is added as another normalized form.
<br>
- Some false positive normalized forms generated by the use_diacritics_fixes_3 part can also be seen here: e.g. *körgharidus* - ['kõrgharidus', 'kärgharidus']. 
<br>
- Sometimes it is not possible to give a single normalized form without knowing the context, so all possible forms are added: e.g. *mooda* - ['mõõda', 'mööda'].
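
The processing order described for *raaagastik* can be sketched as a two-stage pipeline. This is a simplified illustration, not the retagger's code: `shorten_runs`, `fix_diacritics`, `normalize` and `TOY_LEXICON` are hypothetical names, the word list stands in for a real check of Estonian words, and the sketch returns only the forms that pass that check.

```python
import re
from itertools import product

FIXES_2 = {"a": ["ä"], "o": ["õ", "ö"], "u": ["ü"]}
TOY_LEXICON = {"rägastik", "tore"}  # hypothetical stand-in for a spelling check


def shorten_runs(word):
    """Shorten runs of 3+ identical letters to two letters and to one letter."""
    forms = {re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, word) for keep in (2, 1)}
    return sorted(forms - {word})


def fix_diacritics(word):
    """Generate every combination of the FIXES_2 replacements for `word`."""
    options = [[ch] + FIXES_2.get(ch, []) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def normalize(word):
    """Repetitions first; forms that are still not valid words go on to diacritics."""
    valid = []
    for form in shorten_runs(word) or [word]:
        if form in TOY_LEXICON:
            if form not in valid:
                valid.append(form)
            continue
        # Not a known word yet: hand the form over to the diacritics stage.
        valid += [f for f in fix_diacritics(form) if f in TOY_LEXICON and f not in valid]
    return valid


print(normalize("raaagastik"))  # ['rägastik']
print(normalize("toreee"))      # ['tore']
```

Here *raaagastik* first becomes *raagastik* and *ragastik*; neither is in the word list, so both go to the diacritics stage, where only *rägastik* survives the check.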

### Example #2

The previous example showed how incorrect text forms get normalized forms. Morphological analysis then uses these normalized forms instead of the incorrect forms.
<br>
Let's see how NormalizeWordsRetagger works on some real tweets from Twitter.

In [None]:
normalize_words_retagger = NormalizeWordsRetagger(use_diacritics_fixes_2=True, use_diacritics_fixes_3=True)

tweets = ["nagu kas sa toesti arvad et su kaitumine on oige voi mis kohaga sa tapselt motled",
          "okei mul kiiremas korras uut tatoveeringut vaja",
          "Ma kogu aeg teen endale peas igasugu stsenaariumeid, mida neverever niikuinii ei juhtu mhhhhhhhhh",
          "ma tahan et niiiii palju asju oleks teisiti.....",
          "kuidas oelda kindel ei???? mind nii lihtne umber veenda"]

for tweet in tweets:
    text_tweet = Text(tweet)
    text_tweet.tag_layer(['words'])
    # Add normalized forms before morphological analysis, so that the analysis can use them
    normalize_words_retagger.retag(text_tweet)
    text_tweet.tag_layer(['morph_analysis'])
    print("-----------------------------------")
    display(text_tweet["morph_analysis"][['text', 'lemma', 'partofspeech']])

Whereas the previous example printed the values of the normalized form attribute, this one shows lemmas along with part-of-speech tags. For some words the lemma differs slightly from the text form (e.g. *toesti - tõesti*), which means that these words got new forms under the normalized form attribute. 
<br>
<br>
We can check how the output for the word *toesti* differs with and without NormalizeWordsRetagger.

In [None]:
normalize_words_retagger = NormalizeWordsRetagger(use_diacritics_fixes_2=True, use_diacritics_fixes_3=True)

# Baseline: morphological analysis without normalization
word_original = Text("toesti").tag_layer(['words'])
word_original.tag_layer(['morph_analysis'])

for word_orig in word_original["morph_analysis"]:
    print("Without NormalizeWordsRetagger:", word_orig.text, word_orig.lemma, word_orig.partofspeech)

# Same word, but normalized forms are added before morphological analysis
word_normalized = Text("toesti").tag_layer(['words'])
normalize_words_retagger.retag(word_normalized)
word_normalized.tag_layer(['morph_analysis'])

for word_norm in word_normalized["morph_analysis"]:
    print("With NormalizeWordsRetagger:", word_norm.text, word_norm.lemma, word_norm.partofspeech)

### Example #3

Many of the words that NormalizeWordsRetagger detects are particles. Morphological analysis has no dedicated part-of-speech (POS) tag for particles, so they are usually analysed into very different, and often incorrect, categories. To solve this problem and gather all such words together, it is possible to use **UserDictTagger**.

In [9]:
from estnltk.taggers import UserDictTagger

normalize_words_retagger =  NormalizeWordsRetagger(use_diacritics_fixes_2=True, use_diacritics_fixes_3=True)

text = Text("vauuu oehhh, ma arvan, et see pole hea mote... Aga mhhhh, ma ei teaaaaa, ahjaaa äkki tahvad lihtsalt head nou! hihihihi")

text.tag_layer(['words'])
normalize_words_retagger.retag(text)
text.tag_layer(['morph_analysis'])

display( text["morph_analysis"][['text', 'lemma', 'partofspeech']] )


| text | lemma | partofspeech |
|------|-------|--------------|
| vauuu | vau | S |
| oehhh | oeh | I |
| , | , | Z |
| ma | mina | P |
| arvan | arvama | V |
| , | , | Z |
| et | et | J |
| see | see | P |
| pole | olema | V |
| hea | hea | A |


The words are normalized and morphologically analysed. As we can see, common Estonian words have all received a correct morphological analysis, but particles have received different POS tags, either I or S. Next we will try to group all such particles under the same POS tag, B. 

In [10]:
userdict = UserDictTagger(validate_vm_categories=False)
userdict.add_words_from_csv_file("particles_userdict.csv", encoding='ISO-8859-1', delimiter=',')
userdict.retag(text)

display( text["morph_analysis"][['text', 'lemma', 'partofspeech']] )

| text | lemma | partofspeech |
|------|-------|--------------|
| vauuu | vau | B |
| oehhh | oeh | I |
| , | , | Z |
| ma | mina | P |
| arvan | arvama | V |
| , | , | Z |
| et | et | J |
| see | see | P |
| pole | olema | V |
| hea | hea | A |


The particles listed in the user dictionary now have the new POS tag B (e.g. *vauuu*). The new tag is assigned with the help of UserDictTagger, which needs a user dictionary in the form of a CSV file. The user dictionary has to contain the words and their new, correct analyses in a format that UserDictTagger can read.
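
As a rough illustration of such a file, the sketch below writes a small comma-delimited CSV file with Python's standard `csv` module. The contents of the real `particles_userdict.csv` are not shown in this document, so everything here is an assumption: the entries are made up, the file name `particles_userdict_sketch.csv` and the helper `write_userdict` are hypothetical, and the header row with a `text` and a `partofspeech` column is assumed to be what `add_words_from_csv_file` reads. It is also an assumption that entries are keyed by the normalized form rather than by the original text form. Check the EstNLTK documentation before relying on this layout.

```python
import csv
import os
import tempfile

# Hypothetical entries: each row names a word and the POS tag it should receive.
entries = [
    {"text": "vau", "partofspeech": "B"},
    {"text": "ahjaa", "partofspeech": "B"},
]


def write_userdict(path, rows, encoding="ISO-8859-1"):
    """Write rows to a comma-delimited CSV file with a header line."""
    with open(path, "w", newline="", encoding=encoding) as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "partofspeech"], delimiter=",")
        writer.writeheader()
        writer.writerows(rows)


# Write to a temporary directory so that no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "particles_userdict_sketch.csv")
write_userdict(path, entries)

with open(path, newline="", encoding="ISO-8859-1") as handle:
    print(handle.read())
```

The encoding and delimiter match the arguments passed to `add_words_from_csv_file` in the cell above, so a file written this way could be loaded with the same call, provided the assumed column layout is correct.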