# <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Text segmentation: Words </span>

### Words

Words are often considered the smallest meaningful units of language, especially from the perspective of syntactic or semantic analysis.
To obtain words, the outputs of `TokensTagger` and `CompoundTokenTagger` have to be combined.
This is done by `WordTagger`, and the rule is straightforward: every compound token is a word, and every token that is not part of a compound token is also a word. The words are tagged on the raw text in the same way as the tokens were. This means that the `words` layer does not depend on the `tokens` layer or the `compound_tokens` layer, so these layers may be deleted after the words are tagged.
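
The combination rule can be illustrated without EstNLTK. The following is a minimal sketch in plain Python, not the actual `WordTagger` implementation: the helper `merge_into_words`, the regular-expression tokenizer, and the hand-picked compound spans are all assumptions made for illustration.

```python
import re

def merge_into_words(token_spans, compound_spans):
    """Keep every compound span, plus every token span not inside a compound."""
    def covered(span):
        return any(c_start <= span[0] and span[1] <= c_end
                   for c_start, c_end in compound_spans)
    free_tokens = [span for span in token_spans if not covered(span)]
    return sorted(set(compound_spans) | set(free_tokens))

text = 'See on v-vä-väga huvitav, aga kas ka ka-su-lik?!'

# Hypothetical stand-in for the tokens layer: runs of word characters or single punctuation marks
token_spans = [m.span() for m in re.finditer(r'\w+|[^\w\s]', text)]

# Hypothetical stand-in for the compound_tokens layer: spans chosen by hand
compound_spans = []
for chunk in ('v-vä-väga', 'ka-su-lik', '?!'):
    start = text.index(chunk)
    compound_spans.append((start, start + len(chunk)))

word_spans = merge_into_words(token_spans, compound_spans)
words = [text[start:end] for start, end in word_spans]
print(words)
```

Because the resulting spans refer directly to character offsets in the raw text, they remain valid even if the token and compound token spans are discarded afterwards.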

In the following example, a text object is created, prerequisite layers (`tokens`, `compound_tokens`) are added to it, and then the layer `words` is tagged:

```python
from estnltk import Text

# Prepare text: add tokens and compound tokens
text = Text('See on v-vä-väga huvitav, aga kas ka ka-su-lik?!')
text.tag_layer(['tokens', 'compound_tokens'])

# Add words
from estnltk.taggers import WordTagger
WordTagger().tag(text)
text['words']
```

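| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| words | normalized_form | | | False | 10 |

| text | normalized_form |
|---|---|
| See | |
| on | |
| v-vä-väga | väga |
| huvitav | |
| , | |
| aga | |
| kas | |
| ka | |
| ka-su-lik | kasulik |
| ?! | |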


##### Normalized word forms

The `words` layer has an attribute `normalized_form`, which can contain a normalized (surface) form of the word. 
This information is currently taken from the layer `compound_tokens`: if a compound token has the attribute `normalized` filled in, then this information is also carried over to the `normalized_form` of the corresponding word. Otherwise, `normalized_form` remains `None`.
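
As a rough illustration of this carry-over, the sketch below matches words to compound tokens by their character spans. It is plain Python under stated assumptions, not EstNLTK code: the helper `span_of` and the dictionary `compound_normalized` are hypothetical stand-ins for the `compound_tokens` layer.

```python
text = 'See on v-vä-väga huvitav, aga kas ka ka-su-lik?!'

def span_of(chunk):
    """Return the (start, end) offsets of the first occurrence of chunk in text."""
    start = text.index(chunk)
    return (start, start + len(chunk))

# Hypothetical compound tokens: span -> value of the `normalized` attribute
compound_normalized = {
    span_of('v-vä-väga'): 'väga',
    span_of('ka-su-lik'): 'kasulik',
    span_of('?!'): None,  # a compound token without a normalized form
}

# A few word spans; 'huvitav' does not come from a compound token
word_spans = [span_of('v-vä-väga'), span_of('huvitav'),
              span_of('ka-su-lik'), span_of('?!')]

# Carry `normalized` over where a matching compound token has it, else None
words = [{'text': text[start:end],
          'normalized_form': compound_normalized.get((start, end))}
         for start, end in word_spans]
print(words)
```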

Before morphological analysis, words with `normalized_form != None` have their surface forms replaced with the normalized form, so that the analysis can use the correct form. All other words are morphologically analysed according to their surface forms (`text`).
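
The selection rule amounts to a simple fallback. Here is a minimal sketch, assuming words are represented as plain dictionaries; the function `form_for_analysis` is a hypothetical name and not part of EstNLTK.

```python
def form_for_analysis(word):
    """Prefer the normalized form; fall back to the surface form."""
    normalized = word['normalized_form']
    return normalized if normalized is not None else word['text']

# Illustrative words, using values from the example above
words = [
    {'text': 'See', 'normalized_form': None},
    {'text': 'v-vä-väga', 'normalized_form': 'väga'},
    {'text': 'ka', 'normalized_form': None},
    {'text': 'ka-su-lik', 'normalized_form': 'kasulik'},
]

analysis_input = [form_for_analysis(word) for word in words]
print(analysis_input)
```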

---