# <span style="color:darkblue"> Basic NLP toolchain</span> 

## <span style="color:blue"> A. Short Introduction and Tutorial for Linguists</span> 

This tutorial gives an overview of how to use EstNLTK for the basic analysis of text: splitting it into linguistically meaningful units (words and sentences) and performing morphological analysis. These steps are necessary for most language-related problems: if we can extract words and sentences from text and filter them by lemma, part-of-speech tag and morphological form, we can solve numerous tasks, e.g. automatically find example sentences of different grammatical constructions in large corpora, compile word/lemma frequency lists, or compare texts in terms of sentence length and structure.

The most important class in EstNLTK is `Text`, which is essentially the main interface to everything EstNLTK can do. To use it, we have to import it:

In [196]:
from estnltk import Text

To start working on a text, we have to create a `Text` object from it. Let's use a simple sentence as an example:

In [199]:
text = Text("Müüja tatsas rahulikult külmiku juurde.")

In [200]:
text

text
Müüja tatsas rahulikult külmiku juurde.


The basic way to use the EstNLTK toolchain is to call the `tag_layer()` method, which automatically segments the text and performs morphological analysis. Its output shows which layers have been tagged on the text, which attributes each layer has and how many elements belong to each layer (the span count column). The details of each layer are described below.

In [201]:
text.tag_layer()

text
Müüja tatsas rahulikult külmiku juurde.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
normalized_words,normal,words,,False,0
words,,,,False,6
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,6


##  <span style="color:purple"> Text segmentation </span>

One of the most basic tasks of any NLP pipeline is text segmentation: splitting the text into smaller meaningful units - words, sentences, paragraphs, etc. This might seem like a trivial task at first - aren't words separated by spaces and sentences by full stops? And yes, question marks and exclamation marks. However, if we take an existing text, we will see that there are lots of exceptions to these rules. Therefore, EstNLTK has dedicated methods for these kinds of tasks that try to tackle the frequent segmentation issues.

###  Tokens vs words
To make a distinction between properly tagged words (incl. punctuation, abbreviations, e-mail addresses, etc) and elements in text that are separated from each other by whitespace (or not... in case of punctuation), we use the term 'tokens' for the latter. For the most part, tokens overlap with words, but a token might also be a part of a word: in later analysis, tokens are not broken into any smaller parts, but only joined if necessary. If we look at our first example above, we can see that the number of words and tokens is equal. However, there are cases where some tokens are joined into one word:

In [202]:
text = Text('Mis aias sa-das 3me sorti s-saia?')
text.tag_layer()

text
Mis aias sa-das 3me sorti s-saia?

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,11
compound_tokens,"type, normalized",,tokens,False,2
normalized_words,normal,words,,False,2
words,,,,False,7
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7


As we can see, in this (quite weird) example sentence, there are 11 tokens but 7 words. To see the tokens (or words for that matter) that have been tagged on text using the tag_layer() method, we can either use the Text object as a typical Python dict:

In [47]:
text['tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia
?


Or we can use the `tokens` attribute of the Text object:

In [48]:
text.tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text,start,end
Mis,0,3
aias,4,8
sa,9,11
-,11,12
das,12,15
3me,16,19
sorti,20,25
s,26,27
-,27,28
saia,28,32
?,32,33


To get words - the smallest meaningful units of language - some tokens might need to be combined. That's why there are the layers `compound_tokens` and `normalized_words`, which cover the tokens that are combined to create words. This happens when the raw text needs some kind of normalization to comply with standard orthography. In addition to hyphenated words as in the previous example, some numbers (10,000), e-mail addresses (example@example.com), abbreviations (s.t.) and other entities that have been split into separate tokens also have to be joined together.

We can view and use the `words` layer (and all the other layers) in the same way as the `tokens` layer:

In [191]:
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,,,,False,7

text,start,end
Mis,0,3
aias,4,8
sa-das,9,15
3me,16,19
sorti,20,25
s-saia,26,32
?,32,33


### Sentences

A sentence is a list of sequential words and so the sentence layer is a list of lists of words. This means that first, the text is broken into words, and then, sentence borders are determined so that no sentence border would end up inside a word. 

Let's see an example that has multiple sentences:

In [209]:
text = Text('''Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi.''')

In [210]:
text.tag_layer()

text
"Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,59
compound_tokens,"type, normalized",,tokens,False,2
normalized_words,normal,words,,False,1
words,,,,False,55
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,55


We can see the text split into sentences by using the `sentences` attribute:

In [224]:
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6

text,start,end
"Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis.",0,137
Jälk.,138,143
Köögi akna all laual oli vanaaegne arvuti.,144,186
Juba aastaid.,187,200
"Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus.",201,276
See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi.,277,331


## <span style="color:purple">Morphological analysis</span>

In linguistics, morphology is the identification, analysis, and description of the structure of a given language’s morphemes and other linguistic units, such as root words, lemmas, suffixes, parts of speech etc. When we are processing a morphologically rich language, which Estonian certainly is, getting this kind of information is essential for even the simplest tasks. For example, if we want to find all the mentions of 'maja' in a corpus, we are probably not eager to spell out the 27 different forms that we are interested in ('maja', 'majale', 'majadega'...), but we also do not want to get things like 'majandus' or 'majakas'. If we have morphologically analysed the text, we can just state that we are interested in the lemma 'maja'.
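
To make the difference concrete, here is a minimal sketch in plain Python that does not call EstNLTK at all. The (word form, lemma) pairs are written by hand for illustration and stand in for the output of a morphological analyser; all variable names are hypothetical.

```python
# Hand-written (word form, lemma) pairs; in practice the lemmas would
# come from morphological analysis, not from a hard-coded list.
analysed = [
    ('maja', 'maja'),
    ('majale', 'maja'),
    ('majadega', 'maja'),
    ('majandus', 'majandus'),
    ('majakas', 'majakas'),
]

# A naive search on the surface form also matches unrelated words.
by_prefix = [form for form, lemma in analysed if form.startswith('maja')]

# A search on the lemma returns only the forms of 'maja'.
by_lemma = [form for form, lemma in analysed if lemma == 'maja']

print(by_prefix)
print(by_lemma)
```

The prefix search returns all five forms, including 'majandus' and 'majakas', while the lemma search keeps only the three forms of 'maja'.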

EstNLTK wraps the Vabamorf morphological analyzer. Morphological analysis is performed with no extra hassle when we call the `tag_layer()` method on our text. When we look at the text object, we can see that there is a `morph_analysis` layer and that it has several attributes: lemma, root, etc.

In [234]:
text = Text("Aga kõik juhtus iseenesest.").tag_layer()

In [235]:
text

text
Aga kõik juhtus iseenesest.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
normalized_words,normal,words,,False,0
words,,,,False,5
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


Therefore, we can either view the whole analysis as a table:

In [236]:
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Aga,aga,aga,"(aga,)",0,,,J
kõik,kõik,kõik,"(kõik,)",0,,pl n,P
,kõik,kõik,"(kõik,)",0,,sg n,P
juhtus,juhtuma,juhtu,"(juhtu,)",s,,s,V
iseenesest,iseenesest,ise_enesest,"(ise, enesest)",0,,,D
.,.,.,"(.,)",,,,Z


Or, using the attributes, we can ask for specific parts of the analysis, such as lemmas or part-of-speech tags:

In [58]:
text.partofspeech

[['J'], ['P', 'P'], ['V'], ['D'], ['Z']]

## <span style="color:purple"> Examples</span>

Here are two simple examples of using the basic EstNLTK toolchain to extract relevant parts of a text. Let's use the following short text as our corpus:

"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

### Example 1: Finding all different nouns from the text

Let's assume we want to find all different noun lemmas that appear in the text. So, first we have to turn our text into an EstNLTK Text object:

In [240]:
my_text = Text("Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel.")

In [241]:
my_text

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."


Next, we need to let the taggers do their job. Let's use the automatic tag_layer() method for that:

In [242]:
my_text.tag_layer()

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,124
compound_tokens,"type, normalized",,tokens,False,1
normalized_words,normal,words,,False,0
words,,,,False,122
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,122


Now we can iterate over the lemmas and part-of-speech tags to extract the lemmas that are tagged as nouns. For this, it's easiest to use the zip function:

In [63]:
noun_lemmas = []
for lemma, postag in zip(my_text.lemma, my_text.partofspeech):
    if 'S' in postag:
        noun_lemmas += lemma

**NB!** The lemmas and part-of-speech tags of every word are given as lists, because a word can have several analyses. That is why we check whether 'S' is in `postag` instead of comparing with `==`, and why we extend `noun_lemmas` with `+=` instead of calling `append`, which would add the whole list of lemmas as a single element.
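
The following sketch shows the same pattern on plain Python lists, without EstNLTK. The two nested lists are copied by hand from the `morph_analysis` table of the sentence 'Aga kõik juhtus iseenesest.' shown earlier; the variable names are made up for illustration. It also shows one consequence of working at word level: when a word has several analyses, `+=` adds the lemma of each of them.

```python
# One inner list per word, one item per analysis (copied from the table above).
lemmas = [['aga'], ['kõik', 'kõik'], ['juhtuma'], ['iseenesest'], ['.']]
postags = [['J'], ['P', 'P'], ['V'], ['D'], ['Z']]

# Word-level filter: keep all lemmas of a word if any analysis is a pronoun (P).
word_level = []
for lemma, postag in zip(lemmas, postags):
    if 'P' in postag:
        word_level += lemma

# Analysis-level filter: pair each lemma with its own tag and skip duplicates.
analysis_level = []
for word_lemmas, word_tags in zip(lemmas, postags):
    for lemma, tag in zip(word_lemmas, word_tags):
        if tag == 'P' and lemma not in analysis_level:
            analysis_level.append(lemma)

print(word_level)      # ['kõik', 'kõik']
print(analysis_level)  # ['kõik']
```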

In [64]:
noun_lemmas

['nimi',
 'nurgasaag',
 'tööriist',
 'puitdetail',
 'lõikamine',
 'eesmärk',
 'lõikenurk',
 'lõikenurk',
 'seadistamine',
 'võimalus',
 'näide',
 'pildiraam',
 'meisterdamine',
 'detail',
 'lõikenurk',
 'kraad',
 'juht',
 'nurgasaag',
 'tööriist',
 'täpsus',
 'lõige',
 'korratavus',
 'osa',
 'nurgasaag',
 'lõikenurk',
 'suund',
 'lisa',
 'saag',
 'saetera',
 'kaldenurk',
 'seadistamine',
 'kasu',
 'detail',
 'lõikamine',
 'nurgasaag',
 'laius',
 'puulaud',
 'puitdetail',
 'ristlõige',
 'tegemine',
 'järkamine',
 'näide',
 'puitkonstruktsioon',
 'ehitamine',
 'näide',
 'terrassilaud',
 'puitparkett',
 'paigaldamine']

In this case, we can easily guess what the text is about. If we had a larger corpus, we could make a frequency list and e.g. draw some conclusions about topics or word use in a specific publication or variety of language in general.
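
A frequency list of this kind can be sketched with `collections.Counter` from the Python standard library. To keep the example self-contained, it uses only the first 19 noun lemmas of the output above, copied by hand; `noun_lemmas_sample` is a name introduced here for illustration.

```python
from collections import Counter

# The first 19 noun lemmas from the list above, copied by hand.
noun_lemmas_sample = [
    'nimi', 'nurgasaag', 'tööriist', 'puitdetail', 'lõikamine', 'eesmärk',
    'lõikenurk', 'lõikenurk', 'seadistamine', 'võimalus', 'näide',
    'pildiraam', 'meisterdamine', 'detail', 'lõikenurk', 'kraad', 'juht',
    'nurgasaag', 'tööriist',
]

frequencies = Counter(noun_lemmas_sample)

# Print the three most frequent lemmas with their counts.
for lemma, count in frequencies.most_common(3):
    print(lemma, count)
```

With the full `noun_lemmas` list, the same two lines (`Counter(noun_lemmas)` and `most_common`) would give the frequency list for the whole text.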

### Example 2: Finding all sentences that contain an infinitive verb

Let's assume we want to extract all sentences containing an infinitive verb form from a corpus. Let's use the same text as in the previous example. Since we have already tagged the necessary layers, we can iterate over the forms and extract the sentences that we want to study:

### <span style="color:red">[ Known issue: on this text, the loop below raises a TypeError ]</span>

In [243]:
infinitive_sentences = []
for sent in my_text.sentences:
    for form in sent.form:
        if 'da' in form:
            a = ' '.join(sent.text)
            infinitive_sentences.append(a)
            break  # do not include the same sentence twice

TypeError: unhashable type: 'list'

### <span style="color:red">[ On another text, the same loop works fine ]</span>

In [244]:
my_text2 = Text("Mul on rohkem jõudu vaja, et suudaksin taltsas püsida. Kujunditest on kõrini, tahan oma tuld õhetada. Sööksin põllutäie roogu ja porgandeid oma suhkruhaigusele, et suuta sinusugust lehma oma karjamaal pidada. Anna andeks, et ma sind seni naiseks olen pidanud. Ma ei mõelnud ju muule. Tuleb magada diivanil sest seda sõbrad teevadki. Magavad üksteiste kodudes tiivanitel. Ma ei mõtle, et midagi valesti teeksin. Mu Nganassaan tõuseb ja liigub ukseni, hingab ja vajutab lingile. Hapnik põleb ta sees, ja su voodi jõnksatab tema raskuse all.")

In [245]:
my_text2.tag_layer()

text
"Mul on rohkem jõudu vaja, et suudaksin taltsas püsida. Kujunditest on kõrini, tahan oma tuld õhetada. Sööksin põllutäie roogu ja porgandeid oma suhkruhaigusele, et suuta sinusugust lehma oma karjamaal pidada. Anna andeks, et ma sind seni naiseks olen pidanud. Ma ei mõelnud ju muule. Tuleb magada diivanil sest seda sõbrad teevadki. Magavad üksteiste kodudes tiivanitel. Ma ei mõtle, et midagi valesti teeksin. Mu Nganassaan tõuseb ja liigub ukseni, hingab ja vajutab lingile. Hapnik põleb ta sees, ja su voodi jõnksatab tema raskuse all."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,10
tokens,,,,False,100
compound_tokens,"type, normalized",,tokens,False,0
normalized_words,normal,words,,False,0
words,,,,False,100
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,100


In [247]:
infinitive_sentences = []
for sent in my_text2.sentences:
    for form in sent.form:
        if 'da' in form:
            a = ' '.join(sent.text)
            infinitive_sentences.append(a)
            break  # do not include the same sentence twice

In [248]:
infinitive_sentences

['Mul on rohkem jõudu vaja , et suudaksin taltsas püsida .',
 'Kujunditest on kõrini , tahan oma tuld õhetada .',
 'Sööksin põllutäie roogu ja porgandeid oma suhkruhaigusele , et suuta sinusugust lehma oma karjamaal pidada .',
 'Tuleb magada diivanil sest seda sõbrad teevadki .']
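
Note that joining the words with spaces puts a space before every punctuation mark. Since every span has a start and an end position in the raw text, one way around this is to slice the raw string instead. The sketch below uses plain Python only; the word list is copied from the output above, and the end offset is computed from the string rather than read from a layer, so it is an assumption about what the sentence span would report.

```python
raw = ("Mul on rohkem jõudu vaja, et suudaksin taltsas püsida. "
       "Kujunditest on kõrini, tahan oma tuld õhetada.")

# Words of the first sentence, as they appear in the joined output above.
first_sentence_words = ['Mul', 'on', 'rohkem', 'jõudu', 'vaja', ',', 'et',
                        'suudaksin', 'taltsas', 'püsida', '.']

# Joining the words inserts a space before each punctuation mark.
joined = ' '.join(first_sentence_words)

# Assumed sentence span: from the start of the text to just after the first full stop.
start, end = 0, raw.index('.') + 1
sliced = raw[start:end]

print(joined)
print(sliced)
```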

#### <span style="color:red"> Things that might interest a linguist (e.g. me): </span>
 * Is there any way to get sentence texts as strings rather than lists, other than `' '.join(sent.text)`?
 * How can disambiguation and guessing be switched on/off in morphological analysis? <-- This is important!

## <span style="color:blue"> B. Specific details for programmers: how it works</span>

## <span style="color:purple"> Segmentation </span>

To segment text, the following steps are performed:
1. tagging tokens,
2. tagging compound tokens,
3. tagging words,
4. tagging sentences,
5. tagging paragraphs.

### Tokens

Tagging the tokens means determining the start and end position of each token, based on whitespace and/or punctuation. There are many whitespace symbols, of which spaces, tabs and newlines occur most frequently. When tokens are tagged on the text, the type of whitespace does not matter, but later analysis may take into account whether there was whitespace between two tokens or not.
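
As a rough illustration of the idea, and not of how `TokensTagger` is actually implemented, the sketch below finds token positions with a single regular expression: a token is either a run of word characters or one punctuation character. The function name `tag_tokens` is made up for this example.

```python
import re

def tag_tokens(raw):
    """Return (text, start, end) for each run of word characters
    and for each single punctuation character in raw."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r'\w+|[^\w\s]', raw)]

tokens = tag_tokens('Mis aias sa-das 3me sorti s-saia?')
for token in tokens:
    print(token)
```

Whitespace never becomes part of a token here, but it can still be recovered later by comparing the end of one token with the start of the next.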

In the following example, we create a text object with the tokens layer and print out the tokens layer.

In [40]:
from estnltk import Text
from estnltk.taggers import TokensTagger
text = TokensTagger().tag(Text('Mis aias sa-das 3me sorti s-saia?'))
text['tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia
?


Here we have 11 tokens in the text. To see the start and end position of each token, print out the span list.

In [2]:
text.tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text,start,end
Mis,0,3
aias,4,8
sa,9,11
-,11,12
das,12,15
3me,16,19
sorti,20,25
s,26,27
-,27,28
saia,28,32
?,32,33


### Compound tokens
It was said before that although mostly words and tokens overlap with each other, there are some cases where several tokens are combined together to form a word in the traditional sense - the smallest meaningful unit of language. 

The compound token tagger takes care of this step: it adds the `compound_tokens` layer, which envelops the `tokens` layer. This means that every element of the `compound_tokens` layer is a list of elements of the `tokens` layer. That makes it easy to glue the tokens together to form words later on.

No compound token may have common tokens with another compound token.

In [3]:
from estnltk.taggers import CompoundTokenTagger
CompoundTokenTagger().tag(text)
text['compound_tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
compound_tokens,"type, normalized",,tokens,False,2

text,type,normalized
sa-das,hyphenation,
s-saia,hyphenation,


In [4]:
text.compound_tokens.spans[0][0].legal_attribute_names

()

In this example, two compound tokens are found, both of which consist of three tokens.

Note that the type 'hyphenation' for 's-saia' is incorrect; it should be 'stammer'. This is to be fixed.

Here we can see the list of lists of tokens that make up the compound tokens.

In [5]:
text.compound_tokens.text

[['sa', '-', 'das'], ['s', '-', 'saia']]
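
An enveloping layer can be pictured as a list of lists of spans. The sketch below models this with plain tuples instead of EstNLTK objects; the (start, end) pairs are copied from the tokens table above, and the variable names are hypothetical. It rebuilds the nested text list, glues each compound token together and checks that no token is shared between two compound tokens.

```python
raw = 'Mis aias sa-das 3me sorti s-saia?'

# Each compound token is a list of (start, end) token spans.
compound_tokens = [
    [(9, 11), (11, 12), (12, 15)],
    [(26, 27), (27, 28), (28, 32)],
]

# Text of every enveloped token, grouped by compound token.
texts = [[raw[start:end] for start, end in ct] for ct in compound_tokens]

# Glue the tokens: from the start of the first to the end of the last.
glued = [raw[ct[0][0]:ct[-1][1]] for ct in compound_tokens]

# No token span may appear in more than one compound token.
all_spans = [span for ct in compound_tokens for span in ct]
no_shared_tokens = len(all_spans) == len(set(all_spans))

print(texts)
print(glued)
print(no_shared_tokens)
```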

### Words
To get words - the smallest meaningful units of language - the outputs of the tokens tagger and the compound token tagger have to be combined. This is done by the word tokenizer, and the rule is straightforward: every compound token is a word, and every token that is not part of a compound token is also a word. The words are tagged on the raw text in the same way as the tokens were. This means that the `words` layer does not depend on the `tokens` or `compound_tokens` layer, so these layers may be deleted after the words are tagged.

In [6]:
from estnltk.taggers import WordTokenizer
WordTokenizer().tag(text)
text['words']

layer name,attributes,parent,enveloping,ambiguous,span count
words,,,,False,7

text
Mis
aias
sa-das
3me
sorti
s-saia
?
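
The combination rule can be sketched in a few lines of plain Python. This is only an illustration of the rule as described, not the code of `WordTokenizer`; the spans are copied from the token and compound token outputs above, and `build_words` is a hypothetical helper.

```python
raw = 'Mis aias sa-das 3me sorti s-saia?'

# (start, end) of every token and of every compound token.
token_spans = [(0, 3), (4, 8), (9, 11), (11, 12), (12, 15), (16, 19),
               (20, 25), (26, 27), (27, 28), (28, 32), (32, 33)]
compound_spans = [(9, 15), (26, 32)]

def build_words(tokens, compounds):
    """Every compound token is a word; every token outside
    all compound tokens is a word as well."""
    words = list(compounds)
    for start, end in tokens:
        inside = any(c_start <= start and end <= c_end
                     for c_start, c_end in compounds)
        if not inside:
            words.append((start, end))
    return sorted(words)

word_spans = build_words(token_spans, compound_spans)
words = [raw[start:end] for start, end in word_spans]
print(words)
```

The resulting word spans refer directly to positions in the raw text, which is why they remain valid even if the token lists are discarded afterwards.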


### Sentences
The sentence tagger first looks for possible sentence end points in the raw text and then discards every point that is not also the end point of a word. The remaining points are used to split the list of words into a list of sentences. This avoids many common sentence-tagging mistakes, provided that the compound token tagger has done a good job.

In [7]:
from estnltk.taggers import SentenceTokenizer
text = Text('''Esimene lõik. Teine lause.

Teine lõik.''')
text.tag_layer(['words'])
SentenceTokenizer().tag(text)
text['sentences']

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3

text
Esimene lõik.
Teine lause.
Teine lõik.
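
To see why filtering by word ends helps, consider a decimal number. The sketch below is a simplified stand-in written in plain Python, not the logic of `SentenceTokenizer`: the example sentence is invented, and both regular expressions are assumptions made for illustration (the first imitates a word layer in which '3.5' has already been joined into one word).

```python
import re

raw = 'Hind on 3.5 eurot. Odav!'

# Stand-in for the words layer: numbers with a decimal point stay together.
word_spans = [(m.start(), m.end())
              for m in re.finditer(r'\d+(?:\.\d+)*|\w+|[^\w\s]', raw)]

# Candidate sentence ends: the position right after every '.', '!' or '?'.
candidate_ends = [m.end() for m in re.finditer(r'[.!?]', raw)]

# Keep only the candidates that coincide with the end of a word.
word_ends = {end for _, end in word_spans}
sentence_ends = [p for p in candidate_ends if p in word_ends]

# Split the list of words at the remaining end points.
sentences, current = [], []
for start, end in word_spans:
    current.append((start, end))
    if end in sentence_ends:
        sentences.append(current)
        current = []
if current:
    sentences.append(current)

sentence_texts = [raw[s[0][0]:s[-1][1]] for s in sentences]
print(candidate_ends)
print(sentence_ends)
print(sentence_texts)
```

The candidate after '3.' falls inside the word '3.5' and is dropped, so the text is split into two sentences instead of three.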


### Paragraphs
A paragraph is a list of sequential sentences. Tagging paragraphs works much like tagging sentences: first, possible paragraph end points are located in the raw text, and then the list of sentences is split into a list of paragraphs, taking into account only those points that are also end points of sentences.

In [8]:
from estnltk.taggers import ParagraphTokenizer
ParagraphTokenizer().tag(text)
text['paragraphs']

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2

text
Esimene lõik. Teine lause.
Teine lõik.
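
The same two-step idea can be sketched for paragraphs in plain Python. As before, this is an illustration under stated assumptions rather than the code of `ParagraphTokenizer`: sentences are approximated with a simple regular expression, and a blank line is assumed to be the only paragraph separator.

```python
import re

raw = 'Esimene lõik. Teine lause.\n\nTeine lõik.'

# Stand-in for the sentences layer: (start, end) of each sentence.
sentence_spans = [(m.start(), m.end())
                  for m in re.finditer(r'[^.!?\s][^.!?]*[.!?]', raw)]

# Candidate paragraph ends: the start of each blank-line gap and the end of the text.
candidates = [m.start() for m in re.finditer(r'\n\s*\n', raw)] + [len(raw)]

# Keep only the candidates that coincide with the end of a sentence.
sentence_ends = {end for _, end in sentence_spans}
paragraph_ends = {p for p in candidates if p in sentence_ends}

# Split the list of sentences at the remaining end points.
paragraphs, current = [], []
for span in sentence_spans:
    current.append(span)
    if span[1] in paragraph_ends:
        paragraphs.append(current)
        current = []
if current:
    paragraphs.append(current)

paragraph_texts = [raw[p[0][0]:p[-1][1]] for p in paragraphs]
print(paragraph_texts)
```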


## <span style="color:purple">Morphological analysis</span>
The core of morphological analysis is the `VabamorfTagger`. Before running it, we run the `WordNormalizingTagger`, which creates the `normalized_words` layer. `VabamorfTagger` then uses `words` and `normalized_words` as input layers and tags the `morph_analysis` layer on the `words` layer.

### Premorph
Currently, the `WordNormalizingTagger` tags words that contain extra hyphens or stammering, but this functionality should probably be incorporated into `CompoundTokenTagger`.

In [9]:
from estnltk.taggers import WordNormalizingTagger
text = Text('Mis aias sa-das 2te sorti s-saia?')
text.tag_layer(['words']) 
WordNormalizingTagger().tag(text)
text['normalized_words']

layer name,attributes,parent,enveloping,ambiguous,span count
normalized_words,normal,words,,False,2

text,normal
sa-das,sadas
s-saia,saia


### VabamorfTagger
At the heart of the `VabamorfTagger` is EstNLTK's Vabamorf analyser. `VabamorfTagger` creates a morphological analysis layer on the `words` layer. This layer is ambiguous, which means that one word can have more than one analysis. Inside the `VabamorfTagger`, the output of Vabamorf is corrected for some words that contain numbers. The result is written to the `morph_analysis` layer.

In [10]:
from estnltk.taggers import VabamorfTagger
VabamorfTagger().tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2.,2.,[2.],te,,pl g,O
,2,2,[2],0,,adt,N
,2,2,[2],0,,sg p,N
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z


In the following example, we switch off the Vabamorf corrector by passing `postmorph_rewriter=None`.

In [11]:
del text.morph_analysis
VabamorfTagger(postmorph_rewriter=None).tag(text)
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Mis,mis,mis,"(mis,)",0,,pl n,P
,mis,mis,"(mis,)",0,,sg n,P
aias,aed,aed,"(aed,)",s,,sg in,S
sa-das,sadama,sada,"(sada,)",s,,s,V
2te,2sina,2_sina,"(2, sina)",0,,pl g,P
sorti,sort,sort,"(sort,)",0,,sg p,S
s-saia,sai,sai,"(sai,)",0,,sg p,S
?,?,?,"(?,)",,,,Z
