# Basic natural language processing toolchain
## Text segmentation
To segment raw text, we perform the following steps in order:
1. tag tokens,
2. tag compound tokens,
3. tag words,
4. tag sentences,
5. tag paragraphs.

### Tokens
Every text consists of whitespace and tokens. Most tokens are words and punctuation marks, but a token may also be part of a word, an abbreviation or a symbol. There are many whitespace characters, but space, tab and newline occur most frequently. We tag the tokens of the text and mostly ignore the whitespace, except that later analysis may take into account whether or not there is whitespace between two words. Tagging the tokens means determining the start and end position of each token. In later analysis we never break tokens into smaller parts; we only join them to form words.

In the following example we create a text object with the tokens layer and print out the tokens layer.

In [1]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))
text['tokens']

layer,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia
?


The text has 11 tokens. If needed, we can also print the start and end position of each token:

In [2]:
[(span.start, span.end) for span in text['tokens']]

[(0, 3),
 (4, 8),
 (9, 11),
 (11, 12),
 (12, 15),
 (16, 19),
 (20, 25),
 (26, 27),
 (27, 28),
 (28, 32),
 (32, 33)]
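
The same start and end positions can be reproduced with a few lines of plain Python. This is only a minimal sketch, assuming that a token is either a run of letters and digits or a single non-whitespace symbol; the pattern and the helper name `tag_tokens` are illustrative assumptions and not the rules TokensTagger actually uses.

```python
import re

# Assumption: a token is a run of word characters or one other
# non-whitespace symbol. This is a simplification for illustration.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tag_tokens(raw_text):
    """Return the (start, end) character positions of each token."""
    return [match.span() for match in TOKEN_PATTERN.finditer(raw_text)]

raw = 'Mis aias sa-das 3me sorti s-saia?'
spans = tag_tokens(raw)
tokens = [raw[start:end] for start, end in spans]
print(tokens)
print(spans)
```

Whitespace is never stored; it can be inferred later by checking whether the end of one span equals the start of the next.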

### Compound tokens
Now it's time to recognise the tokens that form a word. How this is done is up to the compound token tagger, but the result should be a `compound_tokens` layer that envelops the `tokens` layer. This means that every element of the `compound_tokens` layer is a list of elements of the `tokens` layer, that is, a list of tokens. This makes it easy to glue the tokens together into words later on.

No two compound tokens may share a token.

In [3]:
from estnltk.taggers import CompoundTokenTagger
CompoundTokenTagger().tag(text)
text['compound_tokens']

layer,attributes,parent,enveloping,ambiguous,span count
compound_tokens,type,,tokens,False,2

text,type
sa-das,hyphenation
s-saia,hyphenation


In this example two compound tokens are found, both of which consist of three tokens.

Note that the type 'hyphenation' for 's-saia' is incorrect; it should be 'stammer'. This is to be fixed.

Here we can see the list of lists of tokens that make up the compound tokens.

In [4]:
text.compound_tokens.text

[['sa', '-', 'das'], ['s', '-', 'saia']]
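
To make the idea of an enveloping layer concrete, the sketch below models each compound token as a list of indices into the token spans and checks the rule that no token belongs to two compound tokens. The index-based representation and the helper names are assumptions made for illustration; EstNLTK stores span objects instead.

```python
raw = 'Mis aias sa-das 3me sorti s-saia?'
token_spans = [(0, 3), (4, 8), (9, 11), (11, 12), (12, 15), (16, 19),
               (20, 25), (26, 27), (27, 28), (28, 32), (32, 33)]

# Hypothetical representation: a compound token is a list of token indices.
compound_tokens = [[2, 3, 4], [7, 8, 9]]

def compound_texts(raw_text, spans, compounds):
    """Return the token texts of every compound token."""
    return [[raw_text[spans[i][0]:spans[i][1]] for i in compound]
            for compound in compounds]

def are_disjoint(compounds):
    """Check that no token index occurs in more than one compound token."""
    seen = set()
    for compound in compounds:
        for index in compound:
            if index in seen:
                return False
            seen.add(index)
    return True

print(compound_texts(raw, token_spans, compound_tokens))
print(are_disjoint(compound_tokens))
```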

In addition to hyphenated words as in the previous example, the compound token tagger has to glue together some numbers (10,000), e-mail addresses (example@example.com), abbreviations (s.t.) and other items that the tokens tagger has broken into pieces.
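
One way to picture this gluing is to run a few patterns over the raw text and collect the tokens that fall inside each match. The sketch below is a rough illustration under stated assumptions: the two regular expressions, the type labels 'numeric' and 'email', and the function `find_compound_tokens` are all hypothetical and far simpler than what a real tagger needs.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

# Hypothetical patterns and type labels, for illustration only.
COMPOUND_PATTERNS = [
    ('numeric', re.compile(r"\d{1,3}(?:,\d{3})+")),
    ('email', re.compile(r"[\w.]+@[\w.]+\.\w+")),
]

def find_compound_tokens(raw_text):
    """Return (token texts, type) for every pattern match spanning several tokens."""
    token_spans = [m.span() for m in TOKEN_PATTERN.finditer(raw_text)]
    found = []
    for type_name, pattern in COMPOUND_PATTERNS:
        for match in pattern.finditer(raw_text):
            # Keep the tokens that lie entirely inside the matched span.
            parts = [raw_text[s:e] for s, e in token_spans
                     if match.start() <= s and e <= match.end()]
            if len(parts) > 1:
                found.append((match.start(), parts, type_name))
    # Order the compound tokens by their position in the text.
    return [(parts, type_name) for _, parts, type_name in sorted(found)]

compounds = find_compound_tokens('Send 10,000 to example@example.com.')
print(compounds)
```

The sentence-final full stop stays outside the e-mail match, so it remains a separate token.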

### Words
The work of the word tokenizer is quite straightforward: every compound token is a word, and every token that is not part of a compound token is also a word. The words are tagged on the raw text in the same way as the tokens were. This means that the words layer does not depend on the tokens layer or the compound_tokens layer, so these layers may be deleted after the words are tagged.

Note that here we have extended the meaning of the word 'word', as the words layer also contains punctuation, numbers, e-mail addresses and so on.

In [5]:
from estnltk.taggers import WordTokenizer
WordTokenizer().tag(text)
text['words']

layer,attributes,parent,enveloping,ambiguous,span count
words,,,,False,7

text
Mis
aias
sa-das
3me
sorti
s-saia
?
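
The rule for forming words can be sketched in plain Python as follows. The representation of compound tokens as lists of token indices and the function `tag_words` are assumptions for illustration, not the WordTokenizer implementation.

```python
raw = 'Mis aias sa-das 3me sorti s-saia?'
token_spans = [(0, 3), (4, 8), (9, 11), (11, 12), (12, 15), (16, 19),
               (20, 25), (26, 27), (27, 28), (28, 32), (32, 33)]
compound_tokens = [[2, 3, 4], [7, 8, 9]]  # hypothetical index-based form

def tag_words(spans, compounds):
    """Every compound token is a word; so is every token outside a compound."""
    by_first = {compound[0]: compound for compound in compounds}
    inside = {index for compound in compounds for index in compound}
    words = []
    for index, span in enumerate(spans):
        if index in by_first:
            compound = by_first[index]
            # The word runs from the first token's start to the last token's end.
            words.append((spans[compound[0]][0], spans[compound[-1]][1]))
        elif index not in inside:
            words.append(span)
    return words

word_spans = tag_words(token_spans, compound_tokens)
print([raw[start:end] for start, end in word_spans])
```

The resulting spans refer directly to the raw text, which is why the words layer no longer needs the tokens or compound tokens once it has been built.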


### Sentences
A sentence is a list of consecutive words, so the sentences layer is a list of lists of words. The sentence tagger first looks for candidate sentence end points in the raw text and then discards every point that is not the end point of a word. The remaining points are used to split the list of words into a list of sentences. This avoids many common mistakes of sentence tagging, provided that the compound token tagger has done a good job.

In [6]:
from estnltk.taggers import SentenceTokenizer
text = Text('''Esimene lõik. Teine lause.

Teine lõik.''')
text.tag_layer(['words'])
SentenceTokenizer().tag(text)
text['sentences']

layer,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
Esimene lõik.
Teine lause.
Teine lõik.
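
The filtering step is easiest to see on a text where a full stop occurs inside a word. The sketch below assumes that every '.', '!' or '?' is a candidate end point and that the number 3.50 has already been glued into a single word; the example sentence, the word spans and the function names are hypothetical, and this is not the SentenceTokenizer implementation.

```python
import re

raw = 'Hind on 3.50 eurot. Tore.'
# Assumed word spans; '3.50' is a single word thanks to compound tokens.
word_spans = [(0, 4), (5, 7), (8, 12), (13, 18), (18, 19), (20, 24), (24, 25)]

def candidate_end_points(raw_text):
    """Assumption: a sentence may end right after '.', '!' or '?'."""
    return [m.end() for m in re.finditer(r"[.!?]", raw_text)]

def split_sentences(raw_text, words):
    """Split words into sentences at candidate points that are word ends."""
    word_ends = {end for _, end in words}
    end_points = {p for p in candidate_end_points(raw_text) if p in word_ends}
    sentences, current = [], []
    for span in words:
        current.append(span)
        if span[1] in end_points:
            sentences.append(current)
            current = []
    if current:  # words left over after the last end point
        sentences.append(current)
    return sentences

sentences = split_sentences(raw, word_spans)
print([raw[s[0][0]:s[-1][1]] for s in sentences])
```

The candidate point after '3.' is dropped because no word ends there, so the number does not split the sentence.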


### Paragraphs
A paragraph is a list of consecutive sentences. Tagging paragraphs is similar to tagging sentences. First the candidate end points of paragraphs are located in the raw text, and then the list of sentences is split into a list of paragraphs, taking into account only those points that are also end points of sentences.

In [7]:
from estnltk.taggers import ParagraphTokenizer
ParagraphTokenizer().tag(text)
text['paragraphs']

layer,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2

text
Esimene lõik. Teine lause.
Teine lõik.
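
The same idea can be sketched one level up. Here a candidate paragraph end point is assumed to be the position where a blank line begins, plus the end of the text; the sentence spans are written out by hand and `split_paragraphs` is a hypothetical helper, not the ParagraphTokenizer implementation.

```python
import re

raw = 'Esimene lõik. Teine lause.\n\nTeine lõik.'
sentence_spans = [(0, 13), (14, 26), (28, 39)]  # assumed sentence positions

def split_paragraphs(raw_text, sentences):
    """Split sentences into paragraphs at blank lines that follow a sentence."""
    # Assumption: a paragraph may end where a blank line starts or at text end.
    candidates = {m.start() for m in re.finditer(r"\n\s*\n", raw_text)}
    candidates.add(len(raw_text))
    end_points = candidates & {end for _, end in sentences}
    paragraphs, current = [], []
    for span in sentences:
        current.append(span)
        if span[1] in end_points:
            paragraphs.append(current)
            current = []
    if current:
        paragraphs.append(current)
    return paragraphs

paragraphs = split_paragraphs(raw, sentence_spans)
print([raw[p[0][0]:p[-1][1]] for p in paragraphs])
```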


## Morphological analysis
The core of morphological analysis is the VabamorfTagger. Before running VabamorfTagger we run the WordNormalizingTagger, which creates a 'normalized_words' layer. VabamorfTagger then uses 'words' and 'normalized_words' as input layers and tags the 'morph_analysis' layer on the 'words' layer.

### Premorph
Currently the WordNormalizingTagger tags words that contain extra hyphens or stammering, but I think this functionality should be incorporated into CompoundTokenTagger.

In [8]:
from estnltk.taggers import WordNormalizingTagger
text = Text('Mis aias sa-das 2te sorti s-saia?')
text.tag_layer(['words'])
WordNormalizingTagger().tag(text)
text['normalized_words']

layer,attributes,parent,enveloping,ambiguous,span count
normalized_words,normal,words,,False,2

text,normal
sa-das,sadas
s-saia,saia
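
A normalization of this kind can be approximated with a small heuristic. The sketch below is an assumption about how such a rule might look, not the logic of WordNormalizingTagger: single letters that repeat the first letter of the final part are treated as stammering and dropped, and any other hyphens are simply removed.

```python
def normalize_word(word):
    """Return a normalized form, or None if the word needs no normalization.

    Hypothetical heuristic for illustration only.
    """
    parts = word.split('-')
    if len(parts) < 2 or not all(parts):
        return None  # no inner hyphen, or a bare '-' token
    head, last = parts[:-1], parts[-1]
    # Stammer: every leading part is one letter repeating the word's first letter.
    if all(len(p) == 1 and last.lower().startswith(p.lower()) for p in head):
        return last
    # Otherwise treat the hyphens as extra and join the parts.
    return ''.join(parts)

for word in ['Mis', 'sa-das', 's-saia', '?']:
    print(word, normalize_word(word))
```

A heuristic like this would also strip hyphens from words that are legitimately hyphenated, so a real tagger needs extra information to tell the cases apart.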


### VabamorfTagger
At the heart of VabamorfTagger is estnltk's Vabamorf. VabamorfTagger creates a morphological analysis layer on the words layer. This layer is ambiguous, which means that one word can have more than one analysis. Inside VabamorfTagger, the output of Vabamorf is corrected for some words that contain numbers. The result is written to the 'morph_analysis' layer.

In [9]:
from estnltk.taggers import VabamorfTagger
VabamorfTagger().tag(text)
text['morph_analysis']

layer,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2.,2.,[2.],te,,pl g,O
,2,2,[2],0,,adt,N
,2,2,[2],0,,sg p,N
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z
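
An ambiguous layer can be pictured as a list in which every word carries a list of analyses rather than a single one. The sketch below copies a few rows from the table above into plain Python structures; the dictionary layout is an assumption for illustration and not EstNLTK's internal representation.

```python
# A few rows of the table above, with a subset of the attributes.
morph_analysis = [
    ('Mis', [{'lemma': 'mis', 'form': 'pl n', 'partofspeech': 'P'},
             {'lemma': 'mis', 'form': 'sg n', 'partofspeech': 'P'}]),
    ('aias', [{'lemma': 'aed', 'form': 'sg in', 'partofspeech': 'S'}]),
    ('2te', [{'lemma': '2.', 'form': 'pl g', 'partofspeech': 'O'},
             {'lemma': '2', 'form': 'adt', 'partofspeech': 'N'},
             {'lemma': '2', 'form': 'sg p', 'partofspeech': 'N'}]),
]

# A word is ambiguous when it has more than one analysis.
ambiguous_words = [word for word, analyses in morph_analysis
                   if len(analyses) > 1]
print(ambiguous_words)

# All part-of-speech tags proposed for each word.
pos_tags = {word: sorted({a['partofspeech'] for a in analyses})
            for word, analyses in morph_analysis}
print(pos_tags)
```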


In the following example we switch off the Vabamorf corrector. Note how the analysis of '2te' changes.

In [10]:
del text.morph_analysis
VabamorfTagger(postmorph_rewriter=None).tag(text)
text['morph_analysis']

layer,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2sina,2_sina,"(2, sina)",0,,pl g,P
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z
