# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Words </span>

### Words

Words are often considered the smallest meaningful units of language, especially from the perspective of syntactic or semantic analysis.
To obtain words, the outputs of `TokensTagger` and `CompoundTokenTagger` have to be combined.
This is done by `WordTagger`, and the rule is straightforward: every compound token is a word, and every token that is not part of a compound token is also a word. Words are tagged on the raw text in the same way as tokens. This means that the `words` layer does not depend on the `tokens` layer or the `compound_tokens` layer, so these layers may be deleted after the words are tagged.

In the following example, a text object is created, the prerequisite layers (`tokens`, `compound_tokens`) are added to it, and then the `words` layer is tagged:

In [1]:
from estnltk import Text

# Prepare text: add tokens and compound tokens
text = Text('See on v-vä-väga huvitav, aga kas ka ka-su-lik?!')
text.tag_layer(['tokens', 'compound_tokens'])

# Add words
from estnltk.taggers import WordTagger
WordTagger().tag(text)
text['words']

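| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| words | normalized_form | | | True | 10 |

| text | normalized_form |
|---|---|
| See | |
| on | |
| v-vä-väga | väga |
| huvitav | |
| , | |
| aga | |
| kas | |
| ka | |
| ka-su-lik | kasulik |
| ?! | |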
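
The combination rule described earlier (every compound token is a word, and so is every token outside a compound token) can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the actual implementation of `WordTagger`: the function `merge_into_words`, the regex-based toy tokenizer, and the tuple-based span representation are all hypothetical and do not use EstNLTK.

```python
import re

def merge_into_words(tokens, compound_tokens):
    """Combine token spans and compound token spans into word spans.

    tokens          -- list of (start, end) offsets into the raw text
    compound_tokens -- list of (start, end, normalized) triples, where
                       normalized is a string or None
    Returns a list of (start, end, normalized_form) triples sorted by position.
    """
    # Every compound token is a word and carries its normalization over.
    words = list(compound_tokens)
    for start, end in tokens:
        # A token becomes a word only if no compound token covers it.
        covered = any(cs <= start and end <= ce for cs, ce, _ in compound_tokens)
        if not covered:
            words.append((start, end, None))
    return sorted(words, key=lambda w: (w[0], w[1]))

raw = 'See on v-vä-väga huvitav'
# Toy tokenizer (an assumption; far cruder than TokensTagger).
tokens = [m.span() for m in re.finditer(r'\w+|[^\w\s]', raw)]
# One hand-made compound token with its normalization.
start = raw.index('v-vä-väga')
compound_tokens = [(start, start + len('v-vä-väga'), 'väga')]

words = merge_into_words(tokens, compound_tokens)
word_texts = [raw[s:e] for s, e, _ in words]
print(word_texts)
```

The resulting spans refer directly to offsets in the raw text and keep no reference to the token lists, which mirrors why the `tokens` and `compound_tokens` layers can be deleted afterwards.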


### Normalized word forms. Ambiguity of words

The `words` layer has an attribute `normalized_form`, which can contain normalized forms of the surface word. 
By default, this information is taken from the layer `compound_tokens`: if a compound token has the attribute `normalized` filled in, then this information is also carried over to the `normalized_form` of the corresponding word. Otherwise, `normalized_form` remains `None`.

The `words` layer is _ambiguous_: it can hold multiple normalized forms for each word. This is useful when analysing non-standard varieties of Estonian (such as Internet slang or texts written in a dialect): all normalized form candidates that you provide for an unknown word will be analysed by the downstream morphological analyser.

**How normalized forms affect morphological analysis.** If a word has `normalized_form` set to `None`, then only its surface form (`text`) will be analysed morphologically. But if `normalized_form` contains one or more alternative forms (strings), all of these alternatives will be processed by the morphological analyser (`VabamorfAnalyzer`), and the surface form (`text`) will be ignored. 

As an example, let's first replace the normalized forms of a word with new alternative forms:

In [2]:
from estnltk import Text, Annotation
text=Text('Üsna hää!')
text.tag_layer(['tokens', 'compound_tokens', 'words'])

for word in text.words:
    if word.text=='hää':
        # Change word's annotations
        word.clear_annotations()
        word.add_annotation( Annotation(word, normalized_form='hea') )
        word.add_annotation( Annotation(word, normalized_form='head') )
text.words

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| words | normalized_form | | | True | 3 |

| text | normalized_form |
|---|---|
| Üsna | |
| hää | hea |
| | head |
| ! | |


Now, let's use `VabamorfAnalyzer` to provide analyses for all variants:

In [3]:
from estnltk.taggers import VabamorfAnalyzer
vm_analyser = VabamorfAnalyzer()
text.tag_layer(['sentences'])
vm_analyser.tag(text)
text.morph_analysis

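| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 3 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Üsna | Üsna | üsna | üsna | ['üsna'] | 0 | | | D |
| hää | hea | hea | hea | ['hea'] | 0 | | sg g | A |
| | hea | hea | hea | ['hea'] | 0 | | sg n | A |
| | hea | hea | hea | ['hea'] | 0 | | sg g | S |
| | hea | hea | hea | ['hea'] | 0 | | sg n | S |
| | head | hea | hea | ['hea'] | d | | pl n | A |
| | head | hea | hea | ['hea'] | d | | sg p | A |
| | head | hea | hea | ['hea'] | d | | pl n | S |
| | head | hea | hea | ['hea'] | d | | sg p | S |
| ! | ! | ! | ! | ['!'] | | | | Z |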


Note that the `morph_analysis` layer has a special attribute, `normalized_text`, which holds the string value of the `normalized_form` (or the surface form) that was used as the basis for generating the analysis.
In the previous example, we can see that the surface word _'hää'_ received the analyses of both _'hea'_ and _'head'_.
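
The selection rule stated above (normalized forms, when present, replace the surface form as input to the analyser) can be written down as a small function. This is a simplified sketch, not code taken from `VabamorfAnalyzer`; the name `forms_to_analyse` and the list-based representation of a word's annotations are assumptions made for illustration.

```python
def forms_to_analyse(surface, normalized_forms):
    """Return the strings that would be passed to morphological analysis.

    surface          -- the surface form of the word (its `text`)
    normalized_forms -- list with one entry per annotation of the word;
                        each entry is a string or None
    """
    # Keep only the annotations that actually carry a normalized form.
    candidates = [form for form in normalized_forms if form is not None]
    # With at least one normalized form, the surface form is ignored;
    # otherwise the surface form is the only thing to analyse.
    return candidates if candidates else [surface]

print(forms_to_analyse('Üsna', [None]))
print(forms_to_analyse('hää', ['hea', 'head']))
```

For _'hää'_ the function returns both normalized forms and leaves out the surface form, which matches the `normalized_text` values seen in the `morph_analysis` output.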

**(!) How normalized forms affect morphological disambiguation.** 
If every word in the text has at most one `normalized_form` (that is, all analyses of a word in the `morph_analysis` layer correspond to a single normalized form), then `VabamorfDisambiguator` should be able to provide high-quality morphological disambiguation.
However, morphological disambiguation of words that have multiple `normalized_form` values has not been thoroughly tested, and we have reason to suspect that such settings may lower the quality of disambiguation.
So, be careful when adding more than one normalization to a word (check the disambiguation quality!), and, if possible, avoid adding multiple normalized word forms when Vabamorf's disambiguation is required.
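
One way to follow this advice is to list the words that carry more than one normalization before running the disambiguator. The helper below is a hypothetical sketch that works on plain `(surface, normalized_forms)` pairs rather than on EstNLTK objects; extracting such pairs from a `words` layer is left out here.

```python
def words_with_multiple_normalizations(words):
    """Find words that have more than one distinct normalized form.

    words -- iterable of (surface, normalized_forms) pairs, where
             normalized_forms is a list of strings and/or None
    Returns a list of (surface, sorted distinct forms) pairs.
    """
    flagged = []
    for surface, forms in words:
        # Ignore missing normalizations and repeated forms.
        distinct = {form for form in forms if form is not None}
        if len(distinct) > 1:
            flagged.append((surface, sorted(distinct)))
    return flagged

# Hand-made data mirroring the 'Üsna hää!' example.
example = [('Üsna', [None]), ('hää', ['hea', 'head']), ('!', [None])]
flagged = words_with_multiple_normalizations(example)
print(flagged)
```

Words reported by such a check are the ones whose disambiguation results deserve a manual look.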

---