# <span style="color:darkblue"> Basic NLP toolchain</span> 

## <span style="color:blue"> A. Short Introduction and Tutorial for Linguists</span> 

This tutorial gives an overview of how to use EstNLTK for the basic analysis of text: splitting it into linguistically meaningful units (words and sentences) and performing morphological analysis. These steps are necessary for tackling most language-related problems: if we are able to extract words and sentences from text and filter them by lemmas, part-of-speech tags and morphological forms, we can solve numerous tasks, e.g. automatically find example sentences of different grammatical constructions in large corpora, compose word/lemma frequency lists, compare texts in terms of sentence lengths/structures, etc.

The most important class in EstNLTK is `Text`, which is essentially the main interface to everything EstNLTK is capable of. To use it, we have to import it:

In [1]:
from estnltk import Text

To start working on our text, we have to create a new `Text` object from it. Let's use a simple sentence as an example:

In [2]:
text = Text("Müüja tatsas rahulikult külmiku juurde.")

In [3]:
text

text
Müüja tatsas rahulikult külmiku juurde.


The basic way to use the EstNLTK toolchain is to call the tag_layer() method, which automatically segments the text and performs morphological analysis. From its output, we can see which layers have been tagged on the text, which attributes the layers have, and how many elements belong to each layer (column span count). Details about each layer follow below.

In [4]:
text.tag_layer()

text
Müüja tatsas rahulikult külmiku juurde.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
normalized_words,normal,words,,False,0
words,,,,False,6
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,6


##  <span style="color:purple"> Text segmentation </span>

One of the most basic tasks of any NLP pipeline is text segmentation: splitting the text into smaller meaningful units - words, sentences, paragraphs, etc. This might seem like a trivial task at first - aren't words separated by spaces and sentences by full stops? And yes, question marks and exclamation marks. However, if we take an existing text, we will see that there are lots of exceptions to these rules. Therefore, EstNLTK has dedicated methods for these kinds of tasks that try to tackle the frequent segmentation issues.

###  Tokens vs words
To make a distinction between properly tagged words (incl. punctuation, abbreviations, e-mail addresses, etc) and elements in text that are separated from each other by whitespace (or not... in case of punctuation), we use the term 'tokens' for the latter. For the most part, tokens overlap with words, but a token might also be a part of a word: in later analysis, tokens are not broken into any smaller parts, but only joined if necessary. If we look at our first example above, we can see that the number of words and tokens is equal. However, there are cases where some tokens are joined into one word:

In [5]:
text = Text('Mis aias sa-das 3me sorti s-saia?')
text.tag_layer()

text
Mis aias sa-das 3me sorti s-saia?

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,11
compound_tokens,"type, normalized",,tokens,False,2
normalized_words,normal,words,,False,2
words,,,,False,7
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7


As we can see, in this (quite weird) example sentence, there are 11 tokens but 7 words. To see the tokens (or words for that matter) that have been tagged on text using the tag_layer() method, we can either use the Text object as a typical Python dict:

In [6]:
text['tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia


Or we can use the `tokens` attribute of the Text object:

In [7]:
text.tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text,start,end
Mis,0,3
aias,4,8
sa,9,11
-,11,12
das,12,15
3me,16,19
sorti,20,25
s,26,27
-,27,28
saia,28,32


To get words (the smallest meaningful units of language), some tokens might need to be combined. That is why there are the layers `compound_tokens` and `normalized_words`, which include the tokens that are combined to create words. This happens when the raw text needs some kind of normalization in order to comply with standard orthography. Besides hyphenated words, as in the previous example, some numbers (10,000), e-mail addresses (example@example.com), abbreviations (s.t.) and other entities that have been tagged as separate tokens also have to be joined together.

We can see and use the `words` layer (and all the other layers) the same way as the `tokens` layer:
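
How compound spans turn tokens into words can be sketched in plain Python. This is a simplified illustration and not EstNLTK's implementation: the helper `merge_tokens` is hypothetical, the token spans are copied from the tokens table above (with the final `?` taken from the raw string), and the two compound spans are assumed to cover the hyphenated strings `sa-das` and `s-saia`.

```python
raw = "Mis aias sa-das 3me sorti s-saia?"

# (text, start, end) for every token, including the final question mark
tokens = [
    ("Mis", 0, 3), ("aias", 4, 8), ("sa", 9, 11), ("-", 11, 12),
    ("das", 12, 15), ("3me", 16, 19), ("sorti", 20, 25), ("s", 26, 27),
    ("-", 27, 28), ("saia", 28, 32), ("?", 32, 33),
]

# (start, end) of the spans that should be glued together into one word
compound_spans = [(9, 15), (26, 32)]


def merge_tokens(raw, tokens, compound_spans):
    """Hypothetical helper: replace tokens covered by a compound span with one word."""
    words = []
    for text, start, end in tokens:
        covering = [c for c in compound_spans if c[0] <= start and end <= c[1]]
        if not covering:
            # ordinary token: it becomes a word as it is
            words.append((text, start, end))
        elif covering[0][0] == start:
            # first token of a compound: emit the whole span once
            c_start, c_end = covering[0]
            words.append((raw[c_start:c_end], c_start, c_end))
        # remaining tokens of a compound are skipped, they are already covered
    return words


words = merge_tokens(raw, tokens, compound_spans)
print(len(tokens), "tokens ->", len(words), "words")
for word in words:
    print(word)
```

Tokens are never split here, only joined, which mirrors the relationship between the `tokens` and `words` layers described above.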

In [8]:
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,,,,False,7

text,start,end
Mis,0,3
aias,4,8
sa-das,9,15
3me,16,19
sorti,20,25
s-saia,26,32
?,32,33


### Sentences

A sentence is a list of sequential words and so the sentence layer is a list of lists of words. This means that first, the text is broken into words, and then, sentence borders are determined so that no sentence border would end up inside a word. 
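
The "list of lists of words" idea can be illustrated with a deliberately naive sketch. The function `group_into_sentences` is hypothetical and only closes a sentence at `.`, `!` or `?`; EstNLTK's sentence tagger is considerably more careful (abbreviations, dates and similar cases), so this shows the data structure rather than the actual algorithm.

```python
SENTENCE_ENDINGS = {".", "!", "?"}


def group_into_sentences(words):
    """Naive grouping: a sentence ends at every sentence-final punctuation word."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word in SENTENCE_ENDINGS:
            sentences.append(current)
            current = []
    if current:
        # trailing words without final punctuation still form a sentence
        sentences.append(current)
    return sentences


# The text is first split into words, and only then grouped into sentences,
# so a sentence border can never fall inside a word.
words = ["Jälk", ".", "Juba", "aastaid", "."]
sentences = group_into_sentences(words)
print(sentences)
```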

Let's see an example that has multiple sentences:

In [9]:
text = Text('''Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi.''')

In [10]:
text.tag_layer()

text
"Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,59
compound_tokens,"type, normalized",,tokens,False,2
normalized_words,normal,words,,False,1
words,,,,False,55
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,55


We can see the text split into sentences by using the `sentences` attribute:

In [11]:
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6

text,start,end
"Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis.",0,137
Jälk.,138,143
Köögi akna all laual oli vanaaegne arvuti.,144,186
Juba aastaid.,187,200
"Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus.",201,276
See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi.,277,331


## <span style="color:purple">Morphological analysis</span>

In linguistics, morphology is the identification, analysis, and description of the structure of a given language’s morphemes and other linguistic units, such as root words, lemmas, suffixes, parts of speech etc. When we are processing a morphologically rich language, which Estonian certainly is, getting this kind of information is essential even for the simplest tasks. For example, if we want to find all the mentions of 'maja' in a corpus, we are probably not eager to spell out the 27 different forms that we are interested in ('maja', 'majale', 'majadega'...), but we also do not want to get things like 'majandus' or 'majakas'. If we have morphologically analysed the text, we can just state that we are interested in the lemma 'maja'.

EstNLTK wraps the Vabamorf morphological analyzer. Morphological analysis is also performed with no extra hassle when we use the tag_layer() method on our text. When we look at the text object, we can see that there is a morph_analysis layer and that it has several attributes: lemma, root, etc.
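
The difference between matching on strings and filtering by lemma can be sketched as follows. The (word form, lemma) pairs are written by hand to stand in for analyser output; they are an assumption made for illustration, not something produced by running EstNLTK here.

```python
# Hand-written (word form, lemma) pairs standing in for morphological analysis output
analysed = [
    ("maja", "maja"),
    ("majale", "maja"),
    ("majadega", "maja"),
    ("majandus", "majandus"),
    ("majakas", "majakas"),
]

# Naive string matching also catches unrelated words that share the prefix
prefix_hits = [form for form, _ in analysed if form.startswith("maja")]

# Filtering by lemma keeps only the inflected forms of 'maja'
lemma_hits = [form for form, lemma in analysed if lemma == "maja"]

print("prefix match:", prefix_hits)
print("lemma match: ", lemma_hits)
```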

In [12]:
text = Text("Aga kõik juhtus iseenesest.").tag_layer()

In [13]:
text

text
Aga kõik juhtus iseenesest.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
normalized_words,normal,words,,False,0
words,,,,False,5
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


Therefore, we can either view the whole analysis as a table:

In [14]:
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Aga,aga,aga,"(aga,)",0,,,J
kõik,kõik,kõik,"(kõik,)",0,,pl n,P
,kõik,kõik,"(kõik,)",0,,sg n,P
juhtus,juhtuma,juhtu,"(juhtu,)",s,,s,V
iseenesest,iseenesest,ise_enesest,"(ise, enesest)",0,,,D
.,.,.,"(.,)",,,,Z


Or, using the attributes, we can ask for specific parts of the analysis: lemmas, part-of-speech tags, etc.:

In [15]:
text.partofspeech

[['J'], ['P', 'P'], ['V'], ['D'], ['Z']]
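
Because the layer is ambiguous, every word carries a list of values rather than a single value. A small sketch of working with this nested structure, using the words and part-of-speech lists from the output above copied in by hand (so it runs without EstNLTK):

```python
words = ["Aga", "kõik", "juhtus", "iseenesest", "."]
partofspeech = [["J"], ["P", "P"], ["V"], ["D"], ["Z"]]

# Words that received more than one analysis
ambiguous = [word for word, tags in zip(words, partofspeech) if len(tags) > 1]

# Distinct tags per word: two analyses may share the same part of speech
distinct_tags = [sorted(set(tags)) for tags in partofspeech]

print("ambiguous words:", ambiguous)
for word, tags in zip(words, distinct_tags):
    print(word, tags)
```

Here 'kõik' has two analyses that differ only in form (plural vs singular nominative), so it is ambiguous even though both analyses agree on the part of speech.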

### Parameters of morphological analysis: disambiguation, guessing...

By default, EstNLTK performs morphological analysis with disambiguation (giving out the analysis that is correct in the context), guessing (if the word is not in the dictionary and cannot be resolved as a compound, it is given a 'guessed' analysis) and proper name analysis. While this kind of output is easy to use because each word has been given an analysis and most words receive only one analysis, sometimes we want to use the analyser differently. If we want to enhance the recall of the morphological analysis, we can switch off the disambiguation (which, of course, also affects precision). We can also switch off guessing and proper name analysis, e.g. to verify whether a word exists in Estonian or not.

To change the parameters of the morphological analyser, we have to use a resolver (a register of taggers). To get to know the parameters of the default resolver that is used by the tag_layer() method, we have to import the default resolver:

In [16]:
from estnltk.resolve_layer_dag import DEFAULT_RESOLVER

Then we can see what the default parameters for morph_analysis are:

In [17]:
DEFAULT_RESOLVER.taggers.rules['morph_analysis']

name,layer,attributes,depends_on
VabamorfTagger,morph_analysis,"(lemma, root, root_tokens, ending, clitic, form, partofspeech)","[words, normalized_words]"

0,1
compound,True
guess,True
phonetic,False
postmorph_rewriter,VabamorfCorrectionRewriter
propername,True
premorph_layer,normalized_words
disambiguate,True


The configuration tells us that by default, disambiguation, proper name analysis, compound word analysis, and guessing are all applied during morphological analysis.

Using a resolver, we can write code equivalent to
```python
text.tag_layer()
```
as follows:

In [18]:
from estnltk.resolve_layer_dag import make_resolver

In [19]:
resolver = make_resolver(
                 disambiguate=True,
                 guess=True,
                 propername=True,
                 phonetic=False,
                 compound=True)

In [20]:
text = Text("Kärbes hulbib mees ja naeris puhub sädelevaid mulle.")

In [21]:
text.tag_layer(resolver=resolver)['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Kärbes,kärbes,kärbes,"(kärbes,)",0,,sg n,S
hulbib,hulpima,hulpi,"(hulpi,)",b,,b,V
mees,mees,mees,"(mees,)",0,,sg n,S
ja,ja,ja,"(ja,)",0,,,J
naeris,naerma,naer,"(naer,)",is,,s,V
puhub,puhuma,puhu,"(puhu,)",b,,b,V
sädelevaid,sädelev,sädelev,"(sädelev,)",id,,pl p,A
mulle,mina,mina,"(mina,)",lle,,sg all,P
.,.,.,"(.,)",,,,Z


As we can see from the result, with default morphological analysis, all the words get assigned exactly one analysis, but three of them are not correct. The disambiguator has wrongly deleted the correct analyses this time. 

*Note that this example sentence is a little out of the ordinary, hence the poor performance of the disambiguator. The more 'normal' your text is, the better the results.*

If we want to change the parameters of morphological analysis, we have to change the default values of the flags to create a customized resolver. For example, to switch off disambiguation, we have to set this value to False:

In [22]:
resolver2 = make_resolver(
                 disambiguate=False,
                 guess=True,
                 propername=True,
                 phonetic=False,
                 compound=True)

The resolver tags only those layers on the text that have not been previously tagged. To see the effect of changed parameters we have to create a new text object or delete the affected layer.
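
The reason for deleting the layer first can be shown with a toy model. Everything in this sketch is hypothetical (the function `tag_missing_layers`, the plain dict of layers, the string placeholders); it only mimics the "skip layers that already exist" behaviour described above and says nothing about how the EstNLTK resolver is implemented.

```python
def tag_missing_layers(layers, taggers):
    """Toy stand-in for a resolver: run a tagger only if its layer is absent."""
    for name, tagger in taggers.items():
        if name not in layers:
            layers[name] = tagger()
    return layers


# Two hypothetical tagger registries that differ only in their settings
default_taggers = {"morph_analysis": lambda: "disambiguated analyses"}
custom_taggers = {"morph_analysis": lambda: "all analyses kept"}

layers = {}
tag_missing_layers(layers, default_taggers)

# Tagging again with other settings changes nothing: the layer already exists
tag_missing_layers(layers, custom_taggers)
after_retag = layers["morph_analysis"]

# After deleting the layer, the new settings take effect
del layers["morph_analysis"]
tag_missing_layers(layers, custom_taggers)
after_delete = layers["morph_analysis"]

print(after_retag)
print(after_delete)
```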

In [23]:
del text.morph_analysis
text.tag_layer(resolver=resolver2)['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Kärbes,Kärbe,Kärbe,"(Kärbe,)",s,,sg in,H
,Kärbes,Kärbes,"(Kärbes,)",0,,sg n,H
,kärbes,kärbes,"(kärbes,)",0,,sg n,S
hulbib,hulpima,hulpi,"(hulpi,)",b,,b,V
mees,mees,mees,"(mees,)",0,,sg n,S
,mesi,mesi,"(mesi,)",s,,sg in,S
ja,ja,ja,"(ja,)",0,,,J
naeris,naerma,naer,"(naer,)",is,,s,V
,naeris,naeris,"(naeris,)",0,,sg n,S
,naeris,naeris,"(naeris,)",s,,sg in,S


## <span style="color:purple"> Examples</span>

Here, two simple examples of using EstNLTK basic toolchain for extracting relevant parts of text are presented. Let's use the following short text as our corpus:

"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

### Example 1: Finding all different nouns from the text

Let's assume we want to find all different noun lemmas that appear in the text. So, first we have to turn our text into an EstNLTK Text object:

In [24]:
my_text = Text("Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel.")

In [25]:
my_text

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."


Next, we need to let the taggers do their job. Let's use the automatic tag_layer() method for that:

In [26]:
my_text.tag_layer()

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,124
compound_tokens,"type, normalized",,tokens,False,1
normalized_words,normal,words,,False,0
words,,,,False,122
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,122


Now we can iterate over the lemmas and part-of-speech tags to extract the lemmas that are tagged as nouns. For this, it's easiest to use the zip function:

In [27]:
noun_lemmas = []
for lemma, postag in zip(my_text.lemma, my_text.partofspeech):
    if 'S' in postag:
        noun_lemmas += lemma

**NB!** Note that both the lemmas and the part-of-speech tags of every word are given as lists. That is why we have to check whether 'S' is in `postag`, and why we cannot simply `append` the lemma to the `noun_lemmas` list without iterating over each word's part-of-speech tags and lemmas.
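
One caveat of the loop above: `noun_lemmas += lemma` adds every lemma of a word as soon as any of its analyses is a noun. With disambiguated output this rarely matters, but with ambiguous words it can pull in lemmas of other parts of speech. A stricter variant pairs each lemma with its own tag and also skips repeats. The nested lists below are copied by hand from the non-disambiguated analyses of 'mees', 'ja' and 'naeris' shown earlier, so the sketch runs without EstNLTK; the name `unique_noun_lemmas` is introduced here for illustration.

```python
# Per-word lists of lemmas and part-of-speech tags (three words, hand-copied)
lemmas = [["mees", "mesi"], ["ja"], ["naerma", "naeris", "naeris"]]
postags = [["S", "S"], ["J"], ["V", "S", "S"]]

unique_noun_lemmas = []
for word_lemmas, word_postags in zip(lemmas, postags):
    # pair every lemma with the tag of the same analysis
    for lemma, postag in zip(word_lemmas, word_postags):
        if postag == "S" and lemma not in unique_noun_lemmas:
            unique_noun_lemmas.append(lemma)

print(unique_noun_lemmas)
```

The verb lemma 'naerma' is left out, while the `+=` version would have included it because 'naeris' also has noun analyses.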

In [28]:
noun_lemmas

['nimi',
 'nurgasaag',
 'tööriist',
 'puitdetail',
 'lõikamine',
 'eesmärk',
 'lõikenurk',
 'lõikenurk',
 'seadistamine',
 'võimalus',
 'näide',
 'pildiraam',
 'meisterdamine',
 'detail',
 'lõikenurk',
 'kraad',
 'juht',
 'nurgasaag',
 'tööriist',
 'täpsus',
 'lõige',
 'korratavus',
 'osa',
 'nurgasaag',
 'lõikenurk',
 'suund',
 'lisa',
 'saag',
 'saetera',
 'kaldenurk',
 'seadistamine',
 'kasu',
 'detail',
 'lõikamine',
 'nurgasaag',
 'laius',
 'puulaud',
 'puitdetail',
 'ristlõige',
 'tegemine',
 'järkamine',
 'näide',
 'puitkonstruktsioon',
 'ehitamine',
 'näide',
 'terrassilaud',
 'puitparkett',
 'paigaldamine']

In this case, we can easily guess what the specific text was about. If we had a larger corpus, we could make a frequency list and, e.g., draw some conclusions about topics/word use in a specific publication or variety of language in general.
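
A frequency list of this kind can be built with `collections.Counter` from the Python standard library. In this sketch the noun lemmas are pasted in from the output above, so it does not depend on EstNLTK being installed.

```python
from collections import Counter

# Noun lemmas copied from the output above
noun_lemmas = [
    "nimi", "nurgasaag", "tööriist", "puitdetail", "lõikamine", "eesmärk",
    "lõikenurk", "lõikenurk", "seadistamine", "võimalus", "näide", "pildiraam",
    "meisterdamine", "detail", "lõikenurk", "kraad", "juht", "nurgasaag",
    "tööriist", "täpsus", "lõige", "korratavus", "osa", "nurgasaag",
    "lõikenurk", "suund", "lisa", "saag", "saetera", "kaldenurk",
    "seadistamine", "kasu", "detail", "lõikamine", "nurgasaag", "laius",
    "puulaud", "puitdetail", "ristlõige", "tegemine", "järkamine", "näide",
    "puitkonstruktsioon", "ehitamine", "näide", "terrassilaud", "puitparkett",
    "paigaldamine",
]

lemma_counts = Counter(noun_lemmas)

# Print the five most frequent noun lemmas with their counts
for lemma, count in lemma_counts.most_common(5):
    print(f"{lemma}\t{count}")

# The distinct lemmas, in order of first appearance
distinct_lemmas = list(lemma_counts)
```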

### Example 2: Finding all sentences that contain an infinitive verb

Let's assume we want to extract all sentences containing an infinitive verb form from a corpus. Let's use the same text as in the previous example. Since we have already tagged the necessary layers, we can iterate over the forms and extract the sentences that we want to study:

In [29]:
infinitive_sentences = []
for sent in my_text.sentences:
    for form in sent.form:
        if 'da' in form:
            a = sent.enclosing_text
            infinitive_sentences.append(a)
            break  # avoid including the same sentence twice
infinitive_sentences

['Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus.']

## <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Segmentation </span>

To segment text, the following steps are performed:
1. tagging tokens,
2. tagging compound tokens,
3. tagging words,
4. tagging sentences,
5. tagging paragraphs.

### Tokens

#### General overview

Tagging the tokens means that we determine the start and end position of each token, based on whitespace and/or punctuation. There are many whitespace symbols, of which spaces, tabs, and newlines occur most frequently. When tokens are tagged on the text, the type of whitespace does not matter, but later analysis may take into account whether or not there was whitespace between the tokens.

In the following example, we create a text object with the tokens layer and print out the tokens layer.

In [30]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))
text['tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia


Here we have 11 tokens in the text. To see the start and end position of each token, print out the span list.

In [31]:
text.tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text,start,end
Mis,0,3
aias,4,8
sa,9,11
-,11,12
das,12,15
3me,16,19
sorti,20,25
s,26,27
-,27,28
saia,28,32


#### Under the hood
 The `TokensTagger` applies NLTK's [WordPunctTokenizer](http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.WordPunctTokenizer) to split the text into tokens. The aim is to produce a tokenization where words ("alphanumeric sequences") are separated from each other, and where punctuation symbols are also separated from words and from each other. However, `WordPunctTokenizer` leaves punctuation symbols unsplit in some cases, and thus, `TokensTagger` applies an additional post-correction step to ensure that all punctuation symbols are split into single tokens. For instance, the string `"(1989.a.)."` is tokenized by  `WordPunctTokenizer` into tokens  `['(', '1989', '.', 'a', '.).']`, and in our post-correction step, it is further split into tokens `['(', '1989', '.', 'a', '.', ')', '.']`.
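
The two steps can be imitated with the standard `re` module alone. This is a sketch under stated assumptions: the pattern `\w+|[^\w\s]+` is the regular expression NLTK's `WordPunctTokenizer` is built on, while the function names and the post-correction below are simplifications introduced here and are not the actual `TokensTagger` code.

```python
import re

# Words are runs of alphanumerics; everything else that is not whitespace is punctuation
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")
PUNCT_ONLY = re.compile(r"[^\w\s]+")


def word_punct_tokenize(text):
    """First pass: punctuation runs such as '.).' stay together."""
    return WORD_PUNCT.findall(text)


def split_punctuation(tokens):
    """Post-correction: break every punctuation run into single symbols."""
    result = []
    for token in tokens:
        if PUNCT_ONLY.fullmatch(token):
            result.extend(token)  # one token per punctuation character
        else:
            result.append(token)
    return result


coarse = word_punct_tokenize("(1989.a.).")
fine = split_punctuation(coarse)
print(coarse)
print(fine)

# The earlier example sentence yields 11 tokens with this sketch as well
example = split_punctuation(word_punct_tokenize("Mis aias sa-das 3me sorti s-saia?"))
print(len(example))
```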

### Compound tokens

#### General overview
It was said before that although mostly words and tokens overlap with each other, there are cases where several tokens are combined together to form a word in the traditional sense - the smallest meaningful unit of language. There are also special types of text units -- such as emoticons and web and email addresses -- which need to be detected as a whole (as full token sequences) in order to avoid ambiguities in the following processing steps (for instance, a period inside an email address should not be mistaken for a sentence-ending period).

The compound token tagger takes care of these cases: it adds a `compound_tokens` layer that envelops the `tokens` layer. This means that every element of the `compound_tokens` layer is a list of `tokens` layer elements - tokens. That makes it easy to glue the tokens together to form the words later on.

Compound tokens are formed in a way that they are separate from each other -- no compound token has common tokens with other compound tokens.

In [32]:
from estnltk.taggers import CompoundTokenTagger
CompoundTokenTagger().tag(text)
text['compound_tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
sa-das,hyphenation,
s-saia,hyphenation,


In this example, two compound tokens are found, both of which consist of three tokens.

Note that the type 'hyphenation' for 's-saia' is incorrect: it should be 'stammer'. This is yet to be fixed.

Here we can see the list of lists of tokens that make up the compound tokens.

In [33]:
text.compound_tokens.text

[['sa', '-', 'das'], ['s', '-', 'saia']]

#### Types of compound tokens

The main aim of the `CompoundTokenTagger` is to join together tokens that were produced by the splitting logic of `TokensTagger`. `CompoundTokenTagger` addresses different types of compound tokens, and producing most of these tokens can also be switched on/off by flags passed to the constructor. In the following, `CompoundTokenTagger`'s compounding types will be listed, along with the flags that can be used to switch these compounds off (by default, all flags are switched on).

##### Numeric expressions (`tag_numbers`)

Tags numeric expressions with decimal separators, numbers with digit group separators, and common date and time formats.

In [34]:
text = Text('02.02.2010 22:55 Mati : saad sa mulle 100,50 asemel 10 000 laenata?')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_numbers = True).tag(text) # tagging numbers switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,start,end,type,normalized
02.02.2010,0,10,numeric_date,02.02.2010
22:55,11,16,numeric_time,22:55
"100,50",38,44,numeric,10050
10 000,52,58,numeric,10000


As can be seen from the previous example, a compound token can also have an attribute `normalized`, which contains a normalized string value for the token. In most cases, the normalization involves removal of whitespace from the string (e.g. `'10 000' => '10000'`). If the pattern that captured the string does not use normalization, then `normalized==None`.
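
The most common normalization, whitespace removal, is easy to sketch. The helper `strip_whitespace` is hypothetical, and the input strings are taken from the compound tokens shown in this tutorial. The real tagger decides per pattern whether and how to normalize, so this does not apply to every type: names with initials, for instance, keep their space in the outputs of this tutorial.

```python
def strip_whitespace(token_text):
    """Remove all whitespace from a compound token's text."""
    return "".join(token_text.split())


examples = ["10 000", "e. m. a.", "LinkedIn 'i", "SKT -st", "workshop ' e"]
normalized = [strip_whitespace(example) for example in examples]

for raw_text, norm in zip(examples, normalized):
    print(f"{raw_text!r} => {norm!r}")
```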

In addition, if `tag_numbers` is switched on, numeric expressions are also augmented with sign symbols and percentages:

In [35]:
text = Text('Mati : +100% kindel, et toon tagasi!!')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_numbers = True).tag(text) # tagging numbers switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,1

text,start,end,type,normalized
+100%,7,12,sign+percentage,+100%


Note: if more than one compounding rule is applied, the resulting compound token can have multiple compound types. In this case, compound types are separated by `+` (like `sign+percentage` in the previous example).

##### Units x-per-y (`tag_units`)

Tags x-per-y style units that follow numeric expressions:

In [36]:
text = Text('Tänase seisuga tuleb ikka suur lohe vaiksema tuule (6-12 m/s) jaoks ja teine väiksem tormikaks (12-20 m/s) võtta…')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_units = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,start,end,type,normalized
m/s,57,60,unit,m/s
m/s,102,105,unit,m/s


##### Email and www addresses (`tag_email_and_www`)

In [37]:
text = Text('Saada need e-postiaadressile big@boss.com või tule sisesta lehelt www.iamboss.com')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_email_and_www = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,3

text,start,end,type,normalized
e-postiaadressile,11,28,hyphenation,
big@boss.com,29,41,email,
www.iamboss.com,66,81,www_address,www.iamboss.com


##### Common emoticons (`tag_emoticons`)

Tags common (Western) emoticons:

In [38]:
text = Text('Maja on fantastiline :)) ja mõte on hea :-)')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_emoticons = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,start,end,type,normalized
:)),21,24,emoticon,:))
:-),40,43,emoticon,:-)


##### Names preceded by initials (`tag_initials`)

In [39]:
text = Text('(arhitektid M. Port, M. Meelak, O. Zhemtshugov, R.-L. Kivi)')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_initials = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,start,end,type,normalized
M. Port,12,19,name_with_initial,M. Port
M. Meelak,21,30,name_with_initial,M. Meelak
O. Zhemtshugov,32,46,name_with_initial,O. Zhemtshugov
R.-L. Kivi,48,58,name_with_initial,R.-L. Kivi


##### Common abbreviations (`tag_abbreviations`)

In [40]:
text = Text('Nt. hädas oli juba Vana-Hiina suurim ajaloolane Sima Qian (II—I saj. e. m. a.).')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_abbreviations = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,start,end,type,normalized
Nt.,0,3,non_ending_abbreviation,Nt.
Vana-Hiina,19,29,hyphenation,
saj.,64,68,abbreviation,saj.
e. m. a.,69,77,abbreviation,e.m.a.


Abbreviations are divided into two categories: 1) `non_ending_abbreviation`-s which most likely do not end the sentence (usually it can be expected that some sentence content follows them), and 2) `abbreviation`-s which can also appear at the end of the sentence.

##### Morphological case endings (`tag_case_endings`)

Tags morphological case endings preceded by single tokens, and also by compound tokens:  

In [41]:
text = Text("10 000-st LinkedIn 'i kontaktist mitte üks ei hoolinud meie SKT -st, aga meie workshop ' e väisasid küll.")
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_case_endings = True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,4

text,start,end,type,normalized
10 000-st,0,9,numeric+hyphenation+case_ending,10000-st
LinkedIn 'i,10,21,case_ending,LinkedIn'i
SKT -st,60,67,case_ending,SKT-st
workshop ' e,78,90,case_ending,workshop'e


##### Hyphenations (`tag_hyphenations`)

If consecutive tokens are separated by a hyphen, and these tokens consist of letters, then they are joined together to form a "hyphenated word":

In [42]:
text = Text('See on vää-ää-ääga huvitav, aga kas ka ka-su-lik?!')
text = TokensTagger().tag(text)
CompoundTokenTagger(tag_hyphenations=True).tag(text) # tagging switched on (default setting)
text.compound_tokens

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,start,end,type,normalized
vää-ää-ääga,7,18,hyphenation,
ka-su-lik,39,48,hyphenation,


Note that the language phenomenon covered by "hyphenation compound tokens" is actually wider: in addition to hyphenated words, it also covers stretched out words (such as _'vää-ää-ääga'_) and syllabified words (such as _'ka-su-lik'_).
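
The joining rule can be approximated in a few lines of standard-library Python. This is a rough sketch and not the `CompoundTokenTagger` implementation: the functions `token_spans` and `hyphenated_spans` are hypothetical, tokens are found with a simple regular expression, and a chain is extended only while letter-only tokens and hyphens follow each other without any gap.

```python
import re


def token_spans(text):
    """(text, start, end) for each word and each single punctuation symbol."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def hyphenated_spans(text):
    """Collect chains of the shape letters-letters(-letters...) with no spaces inside."""
    tokens = token_spans(text)
    compounds = []
    i = 0
    while i < len(tokens):
        j = i
        # extend the chain while "<letters>", "-", "<letters>" are directly adjacent
        while (j + 2 < len(tokens)
               and tokens[j][0].isalpha()
               and tokens[j + 1][0] == "-"
               and tokens[j + 2][0].isalpha()
               and tokens[j][2] == tokens[j + 1][1]
               and tokens[j + 1][2] == tokens[j + 2][1]):
            j += 2
        if j > i:
            start, end = tokens[i][1], tokens[j][2]
            compounds.append((text[start:end], start, end))
        i = j + 1
    return compounds


compounds = hyphenated_spans("See on vää-ää-ääga huvitav, aga kas ka ka-su-lik?!")
for compound in compounds:
    print(compound)
```

On the example sentence this finds the same two spans as the table above; a hyphen surrounded by spaces would not be joined, because the adjacency check on start and end positions fails.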

#### Technical details

**`CompoundTokenTagger`** combines the knowledge about token spans (produced by `TokensTagger`), and the knowledge about the original tokenization (i.e. which text units were separated by whitespace in the original text) to determine which tokens should be joined into compound ones. This process consists of the following steps:

1. **Tagging of _strict tokenization hints_**: `estnltk.taggers.RegexTagger` is applied to find non-overlapping text spans that correspond to tokens that need to be joined. In this phase, the following types of compounding hints are tagged:

      1.1 `numeric` expressions like numbers with decimal separators (e.g. `'10,5'`), numbers with digit group separators (e.g. `'10 000 000'`), and common formats of numeric dates (`'02.02.2010'`) and times (`'22:55'`);      
      1.2 "X-per-Y" style `units` (e.g. speed units like `'km/h'` or `'MB/s'`, and emission units like `'g/km'`);          
      1.3 `emails` and `www_addresses` (like `'big@boss.com'` or `'www.neti.ee'` or `'https://www.postimees.ee'`);  
      1.4 commonly used `emoticons` (like `:)`, `:D` or  `:-P`);   
      1.5 `names_with_initial` (like `'A. H. Tammsaare'` or `'J.K. Rowling'`);   
      1.6 commonly used `abbreviations` (like `'s.o.'`, `'st'`, or `'a.'`);  
            
2. **Creating an initial list of compound tokens** based on the _strict tokenization hints_ (produced in the previous step), and  _the hyphenation logic_. 

      2.1 _Strict tokenization hints_ are used in the following way: if a hint's text span starts _exactly_ where a token starts, and hint's text span ends _exactly_ where a sequence of tokens ends, then, and only then, a compound token is created from the sequence of tokens covered by the hint. So, no compound token is created if hint's text span either starts or ends at the middle of a token;
      
      2.2. _Hyphenation logic_ collects consecutive tokens that have a hyphenation symbol '-', but no space in between them, and creates corresponding compound tokens. For instance, the token sequence `['v','-','v','-','v','-','ve','-','ve','-','veri']` (originating from the string `'v-v-v-ve-ve-veri'`) will be joined into a compound token;

3. **Tagging of _non-strict tokenization hints_**: `estnltk.taggers.RegexTagger` is applied with a second set of patterns to find non-overlapping text spans that hint about potential joining places of tokens and/or compound tokens (from the step 2). The following types of compounding hints are tagged:

      3.1. morphological `case_endings` preceded by single tokens (e.g. `"Palace'ist"`), and compound tokens (e.g. in numeric expressions like `"10 000-ni"`, or in web addresses like `"www.neti.ee-st"`);        
      3.2 `sign` symbols (-, +, ±) followed by numbers (like in `'+20'` or `'-10 000'`);        
      3.3 `percentage` symbols preceded by numbers (like in `'20%'` or `'30,567%'`);    

4. **Extending tokens and compound tokens based on the _non-strict tokenization hints_ (produced in the previous step)**.

      _Non-strict tokenization hints_ differ from the _strict ones_ in that they leave one end of the hint's span (either left or right) unspecified. For instance, the pattern detecting `case_endings` leaves the left side of the sequence unspecified: the left side could be a single token, or a compound token of unspecified length. The pattern only describes the end of the sequence, which must consist of a letter (or a number) followed by a case separator (like `'′'` or `'-'`), and finally followed by a case ending in a single token (like `'st'` or `'ni'`). In a similar manner, the pattern adding signs to numbers leaves the right side open (the actual extent of the numeric expression).
      
      Note: as long as the regions described by the hints do not overlap, one token or compound token can be modified by multiple hints, e.g. a `sign` symbol could be added before a numeric token, and a `percentage` symbol could be added after that token.
      
5. **Creating the layer `'compound_tokens'` based on the compound tokens acquired in the previous steps.**
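
The alignment condition of step 2.1 can be sketched in a few lines of plain Python. This is only an illustration of the rule as described above, not EstNLTK's code; the function name `compound_from_strict_hint` and the representation of tokens as `(start, end)` tuples are assumptions made for the example.

```python
def compound_from_strict_hint(token_spans, hint_span):
    """Return the token spans covered by the hint if the hint starts exactly at a
    token start and ends exactly at a token end; otherwise return None."""
    hint_start, hint_end = hint_span
    token_starts = {start for start, _ in token_spans}
    token_ends = {end for _, end in token_spans}
    if hint_start not in token_starts or hint_end not in token_ends:
        return None  # the hint starts or ends in the middle of a token
    return [(s, e) for s, e in token_spans if hint_start <= s and e <= hint_end]

# Token spans of the string '10 000 krooni': '10', '000', 'krooni'
tokens = [(0, 2), (3, 6), (7, 13)]
print(compound_from_strict_hint(tokens, (0, 6)))  # hint aligned with token boundaries
print(compound_from_strict_hint(tokens, (1, 6)))  # hint starts in the middle of '10'
```

The first call yields the two tokens to be joined, while the second yields `None`, because a strict hint that cuts through a token must not produce a compound token.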

##### Tokenization hints

Basically, each tokenization hint is the result of applying a regular expression over the original text. All patterns producing tokenization hints are in the module `estnltk.taggers.text_segmentation.patterns`. The module contains lists of records in the `estnltk.taggers.RegexTagger` vocabulary format. For instance, a pattern for capturing simple email addresses is conveyed by the following entry:

         {'comment': '*) Pattern for detecting common e-mail formats;',
          'example': 'bla@bla.bl',
          'pattern_type': 'email',
          '_group_': 1,
          '_priority_': (0, 0, 1),
          '_regex_pattern_': r'([{ALPHANUM}_.+-]+@[{ALPHANUM}-]+\.[{ALPHANUM}-.]+)'.format(**MACROS),
          'normalized': 'lambda m: None'},
          
Attribute `'comment'` is used to give a short description of the pattern, and `'example'` exemplifies a string captured by the pattern. Although these attributes are not mandatory, it is highly advisable to use them when adding new entries, as they help to maintain the interpretability of the vocabulary file.

Attribute `'pattern_type'` is mandatory and expresses the category of the compound token. If a compound token is created based on the tokenization hint, then the compound token's attribute `'type'` will get its value from the `'pattern_type'`. Note that the name should not contain the symbol `'+'`, because during the application of _non-strict tokenization hints_, the `'type'` values of several compound tokens are merged, and `'+'` is used as the separator character.

If `'pattern_type'` of a _strict tokenization hint_ (a "1st level pattern") contains prefix `negative:` (e.g. `'negative:ps-abbreviation'`), then the pattern does not produce any tokenization hints, but it is used instead to prevent other patterns from matching. Basically, it describes strings that are similar to ones captured by some positive pattern, and that should not be captured (as they would be false positives). For instance, a negative pattern is created to capture temperature units followed by sentence ending (e.g. ... _kuumarekord on 38**º C.** Talved on_ ... ) in order to prevent patterns capturing names with initials from matching (e.g. capturing _**C. Talved**_ as a name with an initial). Note that a negative pattern must have `'_priority_'` value smaller than `'_priority_'` values of patterns it prevents from matching.

Attribute `'_priority_'` describes the priority of the pattern: the smaller the value, the higher the priority. Priority comes into play when multiple patterns capture the same string region, or when the captured regions overlap. In such cases, the string captured by the pattern with the highest priority (lowest priority value) will be chosen. In case of equal `'_priority_'` values, the default strategy is to choose the longest string.

Attribute `'_regex_pattern_'` gives the regular expression for capturing the string of the compound token. It can be a regular expression pattern string, but also a pre-compiled regular expression object. In the previous example, the pattern string is given as a template, in which named placeholders (`{ALPHANUM}`) are filled in using the information from the dictionary `MACROS`.

Attribute `'_group_'` gives the number (or the name) of the group captured by the regular expression which represents the _actual compound token_. So, the regular expression can also describe compound tokens with some added context, and the group number can be used to pick out the _compound token_.

And finally, attribute `'normalized'` gives a lambda function (or a string describing a lambda function) which is to be applied on a match object to produce a normalized version of the captured string. If normalization is not necessary, the value can be `'lambda m: None'` (like in the previous example).
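
To make the roles of these attributes more concrete, here is a minimal sketch that applies vocabulary-style entries with Python's `re` module and resolves overlaps by priority and length. It does not use `RegexTagger` and should not be read as its actual algorithm: `find_hints` and `resolve_conflicts` are hypothetical helpers, the `MACROS` dictionary and all three patterns are simplified assumptions, and `'normalized'` is given as a real lambda (rather than a string) to avoid evaluating strings.

```python
import re

MACROS = {'ALPHANUM': 'A-Za-z0-9'}  # simplified, hypothetical macro definition

PATTERNS = [
    {'pattern_type': 'email', '_group_': 1, '_priority_': (0, 0, 1),
     '_regex_pattern_': r'([{ALPHANUM}_.+-]+@[{ALPHANUM}-]+\.[{ALPHANUM}.-]+)'.format(**MACROS),
     'normalized': lambda m: None},
    {'pattern_type': 'www_address', '_group_': 1, '_priority_': (0, 0, 2),
     '_regex_pattern_': r'(www\.[{ALPHANUM}.-]+)'.format(**MACROS),
     'normalized': lambda m: None},
    {'pattern_type': 'numeric', '_group_': 1, '_priority_': (0, 0, 3),
     '_regex_pattern_': re.compile(r'(\d{1,3}(?: \d{3})+)'),  # pre-compiled object
     'normalized': lambda m: m.group(1).replace(' ', '')},
]

def find_hints(text, patterns):
    """Apply every pattern and record the span of the group named by '_group_'."""
    hints = []
    for entry in patterns:
        regex = re.compile(entry['_regex_pattern_'])  # also accepts compiled patterns
        group = entry['_group_']
        for m in regex.finditer(text):
            hints.append({'start': m.start(group), 'end': m.end(group),
                          'type': entry['pattern_type'],
                          'priority': entry['_priority_'],
                          'normalized': entry['normalized'](m)})
    return hints

def resolve_conflicts(hints):
    """Greedy resolution (an assumption): smaller priority value wins,
    ties are broken in favour of the longer span."""
    ordered = sorted(hints, key=lambda h: (h['priority'], -(h['end'] - h['start'])))
    chosen = []
    for h in ordered:
        if all(h['end'] <= c['start'] or h['start'] >= c['end'] for c in chosen):
            chosen.append(h)
    return sorted(chosen, key=lambda h: h['start'])

text = 'Kirjuta info@www.neti.ee ja maksa 10 000 krooni'
for hint in resolve_conflicts(find_hints(text, PATTERNS)):
    print(hint['type'], repr(text[hint['start']:hint['end']]), hint['normalized'])
```

In this example the `www_address` pattern also matches inside the email address, but the overlapping hint is discarded because the `email` pattern has a smaller `'_priority_'` value.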


#### Comparisons to compounding rules used in the EstSyntax pre-processing module

When building EstNLTK's compounding rules, the tokenization postcorrection rules of the EstSyntax pre-processing module (available at https://github.com/EstSyntax/preprocessing-module and https://github.com/kristiinavaik/ettenten-eeltootlus) were taken as a starting point. A number of these rules were reimplemented in EstNLTK, but not all of them. The following table compares EstNLTK's and EstSyntax's token compounding approaches:

Type of compound token | Examples | Compounded by EstSyntax <br> preprocessing module? \*\* | Compounded by EstNLTK <br> 1.6?
--- | --- | --- | ---
**`numerics`** `with digit grouping` | `20 000` | yes | yes
`numerics with decimal separator` | `3 , 5` <br> `3,5` <br> `3.5` | yes | yes
`numerics followed by period` <br> (ordinal numbers) | `1995.`  <br> `1 .` | yes | yes
`numerics with sign` | `-3` <br> `± 500` | yes | yes
`numerics with percentage sign` | `10 %` <br> `25%` | yes | yes
`common` **`date and time patterns`** | `15. 04. 2005` <br> `16:30` | yes\*\* | yes
**`ranges`** `of numbers` | `40 000-45 000` <br> `8 - 16%` <br> `14.00 – 16.30` <br> `2 ... 3 , 5` | yes | no
**`scales/ratios`** `of numbers` | `1 , 5 : 0 , 5` <br> `0 : 4` | yes | no
**`proportions`** `of numbers` | `5 36-st` | yes | no
`(binary)` **`arithmetic operations`** | `17± 5` <br> `3 x 15` | yes | no
**`arithmetic expressions`** and <br> formula-like expressions | `2 + 3 = 5` <br> `n = 122` | yes | no
**`units`** `"X-per-Y"`  | `km / h` <br> `g/km` | yes | yes
`quantities with units` | `60 km / h` <br> `2,3 h/m` <br> `1,0 mM` | yes | no
`1-letter` **`abbreviations`** <br> `with numbers`  | `E 961` <br> `I 26` | yes | yes
`common` **`abbreviations`** <br> | `s. o.` <br> `Nt .` <br> `Jr.` | yes | yes
`names with` **`initials`** | `A . H . Tammsaare` <br> `A. H. Tammsaare` <br> `D . Trump` | yes | yes
`names with` **`ampersands`** | `Simon & Schusteri` | yes | no
`morphological` **`case endings`** | `4000-le` <br> `SKT-st` <br> `workshop ' e` | yes | yes
**`email addresses`** | `big@boss.com` <br> `user [ -at- ] dumb.com` | yes\*\* | yes
**`www addresses`** | `http : //www.offa.org/ stats` <br> `www.esindus.ee/korteriturg` | yes\*\* | yes
`common` **`emoticons`** | `:-)` <br> `:)))` | no | yes


\*\* The most important difference between EstNLTK's and EstSyntax's token compounding approaches is the following. EstSyntax aims to provide postcorrections -- that is, to fix tokenization that has been broken (e.g. by an earlier automatic tokenization). So, in many cases, EstSyntax's patterns focus only on broken cases and do not specifically address correct cases (e.g. email address patterns can capture the address `"dumb . user [ -at- ] dumb.com"`, but there is no pattern for capturing the address `"big@boss.com"`). EstNLTK, on the other hand, aims to cover correctly tokenized cases, and also to provide postcorrections where necessary (e.g. both email addresses `"dumb . user [ -at- ] dumb.com"` and `"big@boss.com"` are captured).

### Words
To get words, the smallest meaningful units of language, the outputs of the tokens tagger and the compound token tagger have to be combined. This is done by the word tokenizer, and the rule is straightforward: every compound token is a word, and every token that is not part of a compound token is also a word. Words are tagged on the raw text in the same way as tokens. This means that the `words` layer does not depend on the `tokens` or `compound_tokens` layers, so these layers may be deleted after the words are tagged.

In [43]:
from estnltk.taggers import WordTokenizer
WordTokenizer().tag(text)
text['words']

layer name,attributes,parent,enveloping,ambiguous,span count
words,,,,False,10

text
See
on
vää-ää-ääga
huvitav
","
aga
kas
ka
ka-su-lik
?!
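
The combination rule described in this section can also be written down as a small plain-Python sketch. It only illustrates the rule and is not how `WordTokenizer` is implemented; `words_from_tokens` is a hypothetical helper, and the spans are written out by hand for a shortened example sentence.

```python
def words_from_tokens(token_spans, compound_spans):
    """Every compound token is a word; every token that lies outside
    all compound tokens is a word as well."""
    words = list(compound_spans)
    for start, end in token_spans:
        covered = any(c_start <= start and end <= c_end
                      for c_start, c_end in compound_spans)
        if not covered:
            words.append((start, end))
    return sorted(words)

text = 'See on ka-su-lik?'
# Hand-written spans: 'See', 'on', 'ka', '-', 'su', '-', 'lik', '?'
tokens = [(0, 3), (4, 6), (7, 9), (9, 10), (10, 12), (12, 13), (13, 16), (16, 17)]
compound_tokens = [(7, 16)]  # 'ka-su-lik'
words = words_from_tokens(tokens, compound_tokens)
print([text[s:e] for s, e in words])
```

Since the result consists of plain spans over the raw text, it no longer refers to the token or compound token spans it was built from.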


### Sentences
The sentence tagger first looks for the sentence end points in the raw text and then leaves out all the points that are not the ending points of the words. The remaining points are used to split the list of words into the list of sentences. This avoids many common mistakes of sentence tagging provided that the compound token tagger has done a good job.

In [44]:
from estnltk.taggers import SentenceTokenizer
text = Text('''Esimene lõik. Teine lause.

Teine lõik.''')
text.tag_layer(['words'])
SentenceTokenizer().tag(text)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
Esimene lõik.
Teine lause.
Teine lõik.
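
The filtering of sentence end points against word boundaries can be sketched as follows. This is a simplified illustration, not the logic of `SentenceTokenizer`: the end-point regular expression and the helpers `locate` and `split_sentences` are assumptions made for the example, and the word segmentation is given by hand.

```python
import re

def locate(text, pieces):
    """Find (start, end) spans of consecutive pieces in text, left to right."""
    spans, pos = [], 0
    for piece in pieces:
        start = text.index(piece, pos)
        spans.append((start, start + len(piece)))
        pos = start + len(piece)
    return spans

def split_sentences(text, word_spans):
    """Keep only candidate end points that coincide with word ends,
    then split the word list at those points."""
    candidates = {m.end() for m in re.finditer(r'[.!?]+(?=\s|$)', text)}
    word_ends = {end for _, end in word_spans}
    boundaries = candidates & word_ends
    sentences, current = [], []
    for span in word_spans:
        current.append(span)
        if span[1] in boundaries:
            sentences.append(current)
            current = []
    if current:  # words after the last boundary form a final sentence
        sentences.append(current)
    return sentences

text = 'A. H. Tammsaare kirjutas romaane. Rahvas luges neid.'
# 'A. H. Tammsaare' is a single word here, as a compound token would make it
words = locate(text, ['A. H. Tammsaare', 'kirjutas', 'romaane', '.',
                      'Rahvas', 'luges', 'neid', '.'])
for sentence in split_sentences(text, words):
    print(text[sentence[0][0]:sentence[-1][1]])
```

The periods after the initials are candidate end points, but they fall inside the word `'A. H. Tammsaare'` and are therefore discarded. If the initials were separate words, the same text would be split at each of them, which is why good compound tokens matter for sentence tagging.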


### Paragraphs
A paragraph is a sequence of consecutive sentences. Paragraph tagging works similarly to sentence tagging: first, the possible paragraph end points are located in the raw text, and then the list of sentences is split into paragraphs, taking into account only those points that are also end points of sentences.

In [45]:
from estnltk.taggers import ParagraphTokenizer
ParagraphTokenizer().tag(text)
text['paragraphs']

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2

text
Esimene lõik. Teine lause.
Teine lõik.
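
The same alignment idea applies one level up. The sketch below is an assumption-laden illustration rather than the code of `ParagraphTokenizer`: it treats a blank line as a candidate paragraph end, and the helpers `locate` and `split_paragraphs` are hypothetical.

```python
import re

def locate(text, pieces):
    """Find (start, end) spans of consecutive pieces in text, left to right."""
    spans, pos = [], 0
    for piece in pieces:
        start = text.index(piece, pos)
        spans.append((start, start + len(piece)))
        pos = start + len(piece)
    return spans

def split_paragraphs(text, sentence_spans):
    """Split sentences into paragraphs at blank lines, but only where the
    blank line starts exactly at the end of a sentence."""
    candidates = {m.start() for m in re.finditer(r'\n\s*\n', text)}
    sentence_ends = {end for _, end in sentence_spans}
    boundaries = candidates & sentence_ends
    paragraphs, current = [], []
    for span in sentence_spans:
        current.append(span)
        if span[1] in boundaries:
            paragraphs.append(current)
            current = []
    if current:  # sentences after the last boundary form the final paragraph
        paragraphs.append(current)
    return paragraphs

text = 'Esimene lõik. Teine lause.\n\nTeine lõik.'
sentences = locate(text, ['Esimene lõik.', 'Teine lause.', 'Teine lõik.'])
for paragraph in split_paragraphs(text, sentences):
    print(repr(text[paragraph[0][0]:paragraph[-1][1]]))
```

For this text the sketch produces two paragraphs, the first with two sentences and the second with one.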


## <span style="color:purple">Morphological analysis</span>
The core of morphological analysis is the VabamorfTagger. Before VabamorfTagger we run the WordNormalizingTagger that creates a 'normalized_words' layer. VabamorfTagger then uses 'words' and 'normalized_words' as the input layers and tags the 'morph_analysis' layer on the 'words' layer.
### Premorph
Currently we have a WordNormalizingTagger, which tags words that contain extra hyphens or stuttering, but I think this functionality should be incorporated into CompoundTokenTagger.

In [46]:
from estnltk.taggers import WordNormalizingTagger
text = Text('Mis aias sa-das 2te sorti s-saia?')
text.tag_layer(['words']) 
WordNormalizingTagger().tag(text)
text['normalized_words']

layer name,attributes,parent,enveloping,ambiguous,span count
normalized_words,normal,words,,False,2

text,normal
sa-das,sadas
s-saia,saia


### VabamorfTagger
The central part of VabamorfTagger is EstNLTK's Vabamorf. VabamorfTagger creates a morphological analysis layer on the words layer. This layer is ambiguous, meaning that one word can have more than one analysis. Inside VabamorfTagger, the output of Vabamorf is corrected for some words that contain numbers. The result is written to the 'morph_analysis' layer.

In [47]:
from estnltk.taggers import VabamorfTagger
VabamorfTagger().tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2.,2.,"(2.,)",te,,pl g,O
,2,2,"(2,)",0,,adt,N
,2,2,"(2,)",0,,sg p,N
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z


In the following example we switch off the Vabamorf corrector.

In [48]:
del text.morph_analysis
VabamorfTagger(postmorph_rewriter=None).tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2sina,2_sina,"(2, sina)",0,,pl g,P
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z
