## <span style="color:purple">Compound word detection</span>

EstNLTK provides `CompoundWordTagger`, which annotates linguistic compound word boundaries on words.
For instance, the word `'Kaubahoovi'` can be split into the subwords `['Kauba', 'hoovi']`, and `'rehepeksumasina'` into `['rehe', 'peksu', 'masina']`.
`CompoundWordTagger` relies on Vabamorf's stem-based morphological analysis to find compound word boundaries.

An example:

```python
from estnltk import Text
from estnltk.taggers import CompoundWordTagger

# Create an example text
text = Text('Kaubahoovi edelanurgas, rehepeksumasina tagaluugi kõrval. Seal on kosmoselaev.')
# Add pre-requisite layers
text.tag_layer(['words', 'sentences'])

# Create compound word tagger
compound_words_tagger = CompoundWordTagger()
# Add compound word layer
compound_words_tagger.tag(text)

# Check the results
text.compound_words
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| compound_words | normalized_text, subwords | words | | True | 11 |

| text | normalized_text | subwords |
|---|---|---|
| Kaubahoovi | Kaubahoovi | ['Kauba', 'hoovi'] |
| edelanurgas | edelanurgas | ['edela', 'nurgas'] |
| , | , | [','] |
| rehepeksumasina | rehepeksumasina | ['rehe', 'peksu', 'masina'] |
| tagaluugi | tagaluugi | ['taga', 'luugi'] |
| kõrval | kõrval | ['kõrval'] |
| . | . | ['.'] |
| Seal | Seal | ['Seal'] |
| on | on | ['on'] |
| kosmoselaev | kosmoselaev | ['kosmose', 'laev'] |


There are two attributes in the output layer: `normalized_text` and `subwords`.
Attribute `normalized_text` is derived from the `normalized_form` attribute of the `words` layer. For instance, if the `words` layer contains spelling corrections, then compound word detection is applied to the corrected forms rather than to the surface word forms.
Attribute `subwords` contains the surface word form (or its normalized variant) split into (linguistic) compound word tokens. For non-compound words, such as function words or punctuation, the `subwords` list contains just the word itself.

Note that the output layer is ambiguous, because (due to ambiguities in morphological analysis) there can be multiple compound word interpretations for a word.
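Because the layer is ambiguous, it helps to think of each word as carrying a list of `subwords` interpretations rather than a single one. The following is a minimal sketch in plain Python (no EstNLTK required) of how such data could be filtered for compounds. The `tagged` list and the helper `compound_readings` are illustrative names, not part of EstNLTK; the values are copied from the example output above, where every word happens to have exactly one interpretation.

```python
# (word, interpretations) pairs; each interpretation is one subwords list.
# Values are copied from the example output, one interpretation per word.
tagged = [
    ('Kaubahoovi', [['Kauba', 'hoovi']]),
    ('rehepeksumasina', [['rehe', 'peksu', 'masina']]),
    ('kõrval', [['kõrval']]),
    ('Seal', [['Seal']]),
    ('kosmoselaev', [['kosmose', 'laev']]),
]


def compound_readings(interpretations):
    """Keep only the interpretations that split the word into 2+ subwords."""
    return [subwords for subwords in interpretations if len(subwords) > 1]


# Map every word that has at least one compound reading to those readings.
compounds = {}
for word, readings in tagged:
    kept = compound_readings(readings)
    if kept:
        compounds[word] = kept

for word, readings in compounds.items():
    print(word, readings)
```

A word with several interpretations would simply contribute several lists, so downstream code should not assume that the first interpretation is the only one.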

#### Flag `disambiguate`

Flag `disambiguate` (which is set to `True` by default) controls whether morphological disambiguation is applied during the analysis; setting it to `False` switches disambiguation off.
This can help to reveal more compound word interpretations for ambiguous words.
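One way to see the effect is to tag the same text with and without disambiguation and compare the number of interpretations per word. The sketch below makes two assumptions: that the flag is passed as a keyword argument to the `CompoundWordTagger` constructor, and that spans expose their interpretations via `span.annotations`, with attribute access by key. The helper name `tag_compound_words` is hypothetical.

```python
def tag_compound_words(raw_text, disambiguate=True):
    """Tag raw_text and return (word, [subwords, ...]) pairs, one per word."""
    # Imported inside the function so that defining the helper does not
    # load EstNLTK until it is actually called.
    from estnltk import Text
    from estnltk.taggers import CompoundWordTagger

    text = Text(raw_text)
    # Pre-requisite layers, as in the example above
    text.tag_layer(['words', 'sentences'])
    # Assumption: the flag is a constructor keyword argument
    tagger = CompoundWordTagger(disambiguate=disambiguate)
    tagger.tag(text)
    # Assumption: each span lists its interpretations in span.annotations
    return [(span.text, [list(a['subwords']) for a in span.annotations])
            for span in text.compound_words]
```

Calling `tag_compound_words(raw_text, disambiguate=False)` and comparing the result with the default call should show which words gain additional interpretations; which words those are depends on the input and on the morphological analyser.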

#### Flag `correct_case`

Flag `correct_case` (which is set to `True` by default) post-corrects upper and lower case in `subwords` to match the case distinctions in the `normalized_text` attribute.
This is required because Vabamorf's stem-based morphological analysis alters letter case in its output (e.g. it converts titlecase words that are not proper nouns into lowercase).
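The idea behind such a post-correction can be illustrated in plain Python. This is only a sketch of the general technique (copying the casing of the original string onto the split pieces), not EstNLTK's actual implementation; the function name `restore_case` and the lowercased inputs are hypothetical.

```python
def restore_case(normalized_text, subwords):
    """Copy the casing of normalized_text onto subwords, piece by piece.

    If the joined subwords do not spell normalized_text (ignoring case),
    the subwords are returned unchanged.
    """
    joined = ''.join(subwords)
    if (len(joined) != len(normalized_text)
            or joined.lower() != normalized_text.lower()):
        return list(subwords)
    restored, position = [], 0
    for subword in subwords:
        # Take the slice of the original string that this subword covers
        restored.append(normalized_text[position:position + len(subword)])
        position += len(subword)
    return restored


print(restore_case('Kaubahoovi', ['kauba', 'hoovi']))  # ['Kauba', 'hoovi']
print(restore_case('Seal', ['seal']))                  # ['Seal']
```

The fallback branch matters in practice: if the analyser returns pieces that do not concatenate back to the word, slicing by length would produce misleading subwords, so the sketch leaves them untouched.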