# <span style="color:darkblue"> Basic NLP toolchain</span> 

## <span style="color:blue"> A. Short Introduction and Tutorial for Linguists</span> 

This tutorial gives an overview of how to use EstNLTK for the basic analysis of text: splitting it into linguistically meaningful units (words and sentences) and performing morphological analysis. These steps are necessary for tackling most language-related problems: if we can extract words and sentences from text and filter them by lemmas, part-of-speech tags and morphological forms, we can solve numerous tasks, e.g. automatically find example sentences of different grammatical constructions in large corpora, compile word/lemma frequency lists, compare texts in terms of sentence lengths/structures, etc.

The most important class in EstNLTK is `Text`, which is essentially the main interface for everything EstNLTK can do. To use it, we have to import it:

In [1]:
from estnltk import Text

To start working on our text, we have to create a new `Text` object from it. Let's use a simple sentence as an example:

In [2]:
text = Text("Müüja tatsas rahulikult külmiku juurde.")

In [3]:
text

text
Müüja tatsas rahulikult külmiku juurde.


The basic way to use the EstNLTK toolchain is to call the `tag_layer()` method, which automatically segments the text and performs morphological analysis. From its output, we can see which layers have been tagged on the text, which attributes the layers have, and how many elements belong to each layer (column `span count`). The details about each layer come in Section B.

In [4]:
text.tag_layer()

text
Müüja tatsas rahulikult külmiku juurde.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,6
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,6


##  <span style="color:purple"> Text segmentation </span>

One of the most basic tasks of any NLP pipeline is text segmentation: splitting the text into smaller meaningful units - words, sentences, paragraphs, etc. This might seem like a trivial task at first - aren't words separated by spaces and sentences by full stops? And yes, question marks and exclamation marks. However, if we take an existing text, we will see that there are lots of exceptions to these rules. Therefore, EstNLTK has dedicated methods for these kinds of tasks that try to tackle the frequent segmentation issues.
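To make the problem concrete, here is a minimal plain-Python sketch that does not use EstNLTK at all. It compares whitespace splitting with a simple regular expression; the regular expression is only an illustrative assumption and is not how EstNLTK's tokenizer works.

```python
import re

text = "Müüja tatsas rahulikult külmiku juurde."

# Naive approach: split on whitespace only.
# The full stop stays glued to the last word.
naive_tokens = text.split()
print(naive_tokens)

# Slightly better: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens)

# The same pattern cuts a hyphenated word into three pieces,
# which is why a later step has to join some pieces back together.
hyphen_tokens = re.findall(r"\w+|[^\w\s]", "sa-das")
print(hyphen_tokens)
```

Whitespace splitting yields five items for this sentence, while the regular expression yields six, with the full stop as a separate element.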

###  Tokens vs words
To make a distinction between properly tagged words (incl. punctuation, abbreviations, e-mail addresses, etc) and elements in text that are separated from each other by whitespace (or not... in case of punctuation), we use the term 'tokens' for the latter. For the most part, tokens overlap with words, but a token might also be a part of a word: in later analysis, tokens are not broken into any smaller parts, but only joined if necessary. If we look at our first example above, we can see that the number of words and tokens is equal. However, there are cases where some tokens are joined into one word:

In [5]:
text = Text('Mis aias sa-das 3me sorti s-saia?')
text.tag_layer()

text
Mis aias sa-das 3me sorti s-saia?

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,11
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,7
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,7


As we can see, in this (quite weird) example sentence, there are 11 tokens but 7 words. To see the tokens (or words for that matter) that have been tagged on text using the tag_layer() method, we can either use the Text object as a typical Python dict:

In [6]:
text['tokens']

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia


Or we can use the attribute `tokens`:

In [7]:
text.tokens

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,11

text
Mis
aias
sa
-
das
3me
sorti
s
-
saia


To get words - the smallest meaningful units of language - some tokens might need to be combined. That's why there are the layers `tokens` and `compound_tokens`, which are combined to create words. This happens when the raw text needs some kind of normalization in order to comply with standard orthography. In addition to hyphenated words as in the previous example, some numbers ( _'10 000'_ ), e-mail addresses ( _'example@example.com'_ ), abbreviations ( _'s.t.'_ ) and other entities that have been tagged as separate tokens also have to be joined together.

We can see and use the `words` layer (and all the other layers) the same way as the `tokens` layer:

In [8]:
text.words

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,7

text,normalized_form
Mis,
aias,
sa-das,sadas
3me,
sorti,
s-saia,saia
?,
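
The joining of tokens into words can be pictured with a small plain-Python sketch. It does not call EstNLTK: the token list follows the example sentence, while the index ranges in `compound_groups` and the helper `join_tokens` are hand-written assumptions made for illustration.

```python
# Tokens of the example sentence, including the final question mark.
tokens = ['Mis', 'aias', 'sa', '-', 'das', '3me', 'sorti', 's', '-', 'saia', '?']

# Hypothetical compound groups as (start, end) token index ranges, end exclusive.
compound_groups = [(2, 5), (7, 10)]

def join_tokens(tokens, groups):
    """Join every group of tokens into one word; keep other tokens as they are."""
    group_end = {start: end for start, end in groups}
    words = []
    i = 0
    while i < len(tokens):
        end = group_end.get(i, i + 1)   # a lone token spans exactly one position
        words.append(''.join(tokens[i:end]))
        i = end
    return words

words = join_tokens(tokens, compound_groups)
print(len(tokens), len(words))
print(words)
```

Eleven tokens turn into seven words, matching the span counts reported for the `tokens` and `words` layers.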


### Sentences

A sentence is a list of sequential words and so the sentence layer is a list of lists of words. This means that first, the text is split into words, and then, sentence borders are determined so that no sentence border would end up inside a word. 
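
The "list of lists" idea can be sketched in plain Python without EstNLTK. The three short sentences below are segmented by hand for illustration; the point is that sentence boundaries are expressed as positions in the word sequence, so they can never fall inside a word.

```python
# Three short sentences, each given as a list of words (hand-segmented).
sentences = [
    ['Jälk', '.'],
    ['Köögi', 'akna', 'all', 'laual', 'oli', 'vanaaegne', 'arvuti', '.'],
    ['Juba', 'aastaid', '.'],
]

# Flattening the sentences gives back the word sequence of the text.
words = [word for sentence in sentences for word in sentence]

# Sentence boundaries as (start, end) indices into the word sequence.
boundaries = []
start = 0
for sentence in sentences:
    boundaries.append((start, start + len(sentence)))
    start += len(sentence)

print(len(words))
print(boundaries)
```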

Let's see an example that has multiple sentences:

In [9]:
text = Text('''Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi.''')

In [10]:
text.tag_layer()

text
"Ka köögis oli kõik endine: vana elektripliit, koorunud värviga ahjutruup ja vanamehe töövorm — rippumas ikka sealsamas ukse küljes nagis. Jälk. Köögi akna all laual oli vanaaegne arvuti. Juba aastaid. Selline kaasaskantav väike kastike, mille klaviatuur ekraani ette kinnitus. See oli sini-valge pildi ja DOS-opsüsteemiga mänguasi."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,59
compound_tokens,"type, normalized",,tokens,False,2
words,normalized_form,,,True,55
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,55


We can see the text split into sentences by using the `sentences` attribute:

In [11]:
text.sentences

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6

text
"['Ka', 'köögis', 'oli', 'kõik', 'endine', ':', 'vana', 'elektripliit', ',', 'koo ..., type: <class 'list'>, length: 23"
"['Jälk', '.']"
"['Köögi', 'akna', 'all', 'laual', 'oli', 'vanaaegne', 'arvuti', '.']"
"['Juba', 'aastaid', '.']"
"['Selline', 'kaasaskantav', 'väike', 'kastike', ',', 'mille', 'klaviatuur', 'ekr ..., type: <class 'list'>, length: 11"
"['See', 'oli', 'sini-valge', 'pildi', 'ja', 'DOS-opsüsteemiga', 'mänguasi', '.']"


## <span style="color:purple">Morphological analysis</span>

In linguistics, morphology is the identification, analysis, and description of the structure of a given language’s morphemes and related linguistic units, such as root words, lemmas, suffixes, parts of speech etc. When we are processing a morphologically rich language, which Estonian certainly is, getting this kind of information is essential for even the simplest tasks. For example, if we want to find all the mentions of _'maja'_ in a corpus, we are probably not eager to spell out the 27 different forms that we are interested in ( _'maja'_ , _'majale'_ , _'majadega'_ ...), but we also do not want to get things like _'majandus'_ or _'majakas'_ . If we have morphologically analysed the text, we can just state that we are interested in the lemma _'maja'_ .
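
The difference between searching by surface form and searching by lemma can be shown with a few hand-written (word form, lemma) pairs. This is a plain-Python sketch for illustration only; the pairs are typed in by hand and are not output from EstNLTK.

```python
# Hand-written (word form, lemma) pairs for illustration.
analysed = [
    ('maja', 'maja'),
    ('majale', 'maja'),
    ('majadega', 'maja'),
    ('majandus', 'majandus'),
    ('majakas', 'majakas'),
]

# Searching by prefix also catches unrelated words.
prefix_hits = [form for form, _ in analysed if form.startswith('maja')]

# Searching by lemma keeps only forms of the word 'maja'.
lemma_hits = [form for form, lemma in analysed if lemma == 'maja']

print(prefix_hits)
print(lemma_hits)
```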

EstNLTK wraps the Vabamorf morphological analyzer. Morphological analysis is performed with no extra hassle when we use the `tag_layer()` method on our text. When we look at the text object, we can see that there is a `morph_analysis` layer and that it has several attributes: lemma, root, etc.

In [12]:
text = Text("Aga kõik juhtus iseenesest.").tag_layer()

In [13]:
text

text
Aga kõik juhtus iseenesest.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


We can view the whole analysis as a table:

In [14]:
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Aga,Aga,aga,aga,['aga'],0,,,J
kõik,kõik,kõik,kõik,['kõik'],0,,pl n,P
,kõik,kõik,kõik,['kõik'],0,,sg n,P
juhtus,juhtus,juhtuma,juhtu,['juhtu'],s,,s,V
iseenesest,iseenesest,iseenesest,ise_enesest,"['ise', 'enesest']",0,,,D
.,.,.,.,['.'],,,,Z


Possible categories of _partofspeech_ are described [here](https://estnltk.github.io/estnltk/1.4.1/tutorials/morphology_tables.html#table-pos-tag-descriptions).
_Verb form_ categories are listed [here](https://estnltk.github.io/estnltk/1.4.1/tutorials/morphology_tables.html#table-verb-form-descriptions-vabamorf) and _noun form_ categories [here](https://estnltk.github.io/estnltk/1.4.1/tutorials/morphology_tables.html#table-noun-form-descriptions-vabamorf).

You can also convert morphological analysis category names to Estonian:

In [15]:
# Add a morph layer that has Estonian category names
text.tag_layer(['morph_analysis_est'])

# Browse the layer
text['morph_analysis_est']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis_est,"normaliseeritud_sõne, algvorm, lõpp, sõnaliik, vormi_nimetus, kliitik",morph_analysis,,True,5

text,normaliseeritud_sõne,algvorm,lõpp,sõnaliik,vormi_nimetus,kliitik
Aga,Aga,aga,0,sidesõna,,
kõik,kõik,kõik,0,asesõna,mitmus nimetav (nominatiiv),
,kõik,kõik,0,asesõna,ainsus nimetav (nominatiiv),
juhtus,juhtus,juhtuma,s,tegusõna,kindel kõneviis lihtminevik 3. isik ainsus aktiiv jaatav kõne,
iseenesest,iseenesest,iseenesest,0,määrsõna,,
.,.,.,,lausemärk,,


However, note that the layer `'morph_analysis_est'` is provided for educational purposes, and it is not standard in EstNLTK. 
All the tools building upon morphological analysis are using the layer `'morph_analysis'` .
In the following examples, we will also continue with the standard layer.

### Accessing details of the morphological analysis 

`text['morph_analysis']` and `text.morph_analysis` give us the whole layer of morphological analyses. 
If we are only interested in specific details, e.g. in part-of-speech tags or lemmas, we can use attributes to access them:

In [16]:
text.partofspeech

Unnamed: 0,partofspeech
0.0,J
1.0,P
,P
2.0,V
3.0,D
4.0,Z


In [17]:
text.lemma

Unnamed: 0,lemma
0.0,aga
1.0,kõik
,kõik
2.0,juhtuma
3.0,iseenesest
4.0,.


The resulting `AmbiguousAttributeList` can also be converted to a list:

In [18]:
list(text.lemma)

[['aga'], ['kõik', 'kõik'], ['juhtuma'], ['iseenesest'], ['.']]
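
Once the lemmas are an ordinary nested list, standard Python is enough to work with the ambiguity. The sketch below starts from the list printed above and shows three common operations; the variable names are only illustrative.

```python
# Nested list of lemmas, copied from the output above: one inner list per word.
lemmas = [['aga'], ['kõik', 'kõik'], ['juhtuma'], ['iseenesest'], ['.']]

# Take the first lemma of every word.
first_lemmas = [word_lemmas[0] for word_lemmas in lemmas]

# Drop duplicate lemmas within a word while keeping their order.
unique_lemmas = [list(dict.fromkeys(word_lemmas)) for word_lemmas in lemmas]

# Positions of words that have more than one analysis.
ambiguous_positions = [i for i, word_lemmas in enumerate(lemmas) if len(word_lemmas) > 1]

print(first_lemmas)
print(unique_lemmas)
print(ambiguous_positions)
```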

Using indexing on `text.morph_analysis`, we can take out analyses of specific words.
For instance, let's take out the 2nd word and its lemmas and forms:

In [19]:
print('word:',   text.morph_analysis[1].text)
print('lemmas:', text.morph_analysis[1].lemma)
print('forms:',  text.morph_analysis[1].form)

word: kõik
lemmas: ['kõik', 'kõik']
forms: ['pl n', 'sg n']


In [20]:
# or use zip to combine lemmas and forms of the 2nd word
list( zip(text.morph_analysis[1].lemma, text.morph_analysis[1].form) )

[('kõik', 'pl n'), ('kõik', 'sg n')]

Because `words` is the parent of the `morph_analysis` layer, we can also access analyses through words:

In [21]:
# lemmas of all words
text.words.lemma

Unnamed: 0,lemma
0.0,aga
1.0,kõik
,kõik
2.0,juhtuma
3.0,iseenesest
4.0,.


Taking full advantage of the relations between layers, we can iterate over sentences of `Text`, and then access morphological analyses from words of the sentences:

In [22]:
text = Text("Olin nägin vaatasin. Ja väga hea oli.").tag_layer()

In [23]:
for sentence in text.sentences:
    print(' Sentence: ', sentence.enclosing_text)
    for word in sentence:
        # Output first lemma and partofspeech of the word
        print( word.morph_analysis.lemma[0], word.morph_analysis.partofspeech[0] )
    print()

 Sentence:  Olin nägin vaatasin.
olema V
nägema V
vaatama V
. Z

 Sentence:  Ja väga hea oli.
ja J
väga D
hea A
olema V
. Z



### Parameters of morphological analysis: disambiguation, guessing...

By default, EstNLTK performs morphological analysis with disambiguation (giving out the analysis that is correct in the context), guessing (if the word is not in the dictionary and cannot be resolved as a compound, it is given a 'guessed' analysis) and proper name analysis. This kind of output is easy to use because each word has been given an analysis and most words receive only one analysis, but sometimes we want to use the analyser differently. If we want to increase the recall of the morphological analysis, we can switch off the disambiguation (which, of course, also affects precision). We can also switch off guessing and proper name analysis, e.g. to verify whether a word exists in Estonian or not. And finally, if we switch on the text-based disambiguation, we will likely get better disambiguation results (especially on proper names), although this may not work on every corpus (e.g. it usually works well on news articles, but be careful when applying it to Internet language).

To change the parameters of the morphological analyser, we have to use a resolver (a register of taggers). To get to know the parameters of the default resolver that is used by the `tag_layer()` method, we have to import the default resolver:

In [24]:
from estnltk.resolve_layer_dag import DEFAULT_RESOLVER

Then we can see what the default parameters for morph_analysis are:

In [25]:
DEFAULT_RESOLVER.taggers.rules['morph_analysis']

name,output layer,output attributes,input layers
VabamorfTagger,morph_analysis,"('normalized_text', 'lemma', 'root', 'root_tokens', 'ending', 'clitic', 'form', 'partofspeech')","('words', 'sentences', 'compound_tokens')"

0,1
guess,True
propername,True
disambiguate,True
compound,True
phonetic,False
slang_lex,False
postanalysis_tagger,"PostMorphAnalysisTagger(('compound_tokens', 'words', 'morph_analysis')->morph_analysis)"
use_postanalysis,True
analysis_reorderer,"MorphAnalysisReorderer(('morph_analysis',)->morph_analysis)"
use_reorderer,True


Basic configuration parameters are:

* `guess` -- if a word is not in the dictionary and cannot be resolved as a compound, then its analyses will be guessed;
* `propername` -- titlecase words will receive additional guesses for propername analyses;
* `disambiguate` -- if there are multiple possible analyses for a word, then only analyses fitting to the context will be picked out. This leaves you only one analysis per word for most words;
* `compound` -- roots in analyses will have compound word markers. Normally, you wouldn't need to change this parameter;
* `phonetic` -- roots in analyses will have phonetic markers. Normally, you wouldn't need to change this parameter;

Other parameters are related to components which enhance the quality of morphological analysis:

* Parameter `slang_lex` switches on an extended version of Vabamorf's lexicon, which contains extra entries for analysing most common spoken and slang words, such as _'muideks'_ , _'kodukas'_ , _'mõnsa'_ , _'mersu'_ , _'kippelt'_ .


* Parameter `postanalysis_tagger` refers to an internal component of `VabamorfTagger`, which makes post-corrections to morphological analyses. Details are covered in the tutorial [B_06_morphological_analysis.ipynb](B_06_morphological_analysis.ipynb). The component can be enabled/disabled by the flag `use_postanalysis`. 


* Parameter `analysis_reorderer` refers to an internal component, which re-orders ambiguous analyses by their corpus frequency (based on [Estonian UD corpus](https://github.com/estnltk/ambiguous-morph-reordering/)). The reordering is applied as a last step, after the disambiguation. Details are covered in the tutorial [B_07c_morph_analysis_reordering.ipynb](B_07c_morph_analysis_reordering.ipynb). The component can be enabled/disabled by the flag `use_reorderer`. 


* Parameter `textbased_disambiguator` refers to an internal component, which analyses ambiguities in the whole text in order to make advanced disambiguation decisions. It consists of two sub-steps. First, pre-disambiguation of ambiguous proper name analyses applied before the standard disambiguation (flag `predisambiguate`). Second, post-disambiguation of remaining ambiguous analyses applied after the standard disambiguation (flag `postdisambiguate`). Details are in the tutorial [B_07b_morph_analysis_with_corpus-based_disambiguation.ipynb](B_07b_morph_analysis_with_corpus-based_disambiguation.ipynb).

The configuration tells us that by default, disambiguation, proper name analysis, compound word analysis, and guessing are all applied during morphological analysis.

Using a resolver, we can write code equivalent to
```python
text.tag_layer()
```
as follows:

In [26]:
from estnltk.resolve_layer_dag import make_resolver

In [27]:
resolver = make_resolver(
                 disambiguate=True,
                 guess=True,
                 propername=True,
                 phonetic=False,
                 compound=True)

In [28]:
text = Text("Kärbes hulbib mees ja naeris puhub sädelevaid mulle.")

In [29]:
text.tag_layer(resolver=resolver)['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Kärbes,Kärbes,kärbes,kärbes,['kärbes'],0,,sg n,S
hulbib,hulbib,hulpima,hulpi,['hulpi'],b,,b,V
mees,mees,mees,mees,['mees'],0,,sg n,S
ja,ja,ja,ja,['ja'],0,,,J
naeris,naeris,naerma,naer,['naer'],is,,s,V
puhub,puhub,puhuma,puhu,['puhu'],b,,b,V
sädelevaid,sädelevaid,sädelev,sädelev,['sädelev'],id,,pl p,A
mulle,mulle,mina,mina,['mina'],lle,,sg all,P
.,.,.,.,['.'],,,,Z


As we can see from the result, with default morphological analysis, all the words get assigned exactly one analysis, but three of them are not correct. The disambiguator has wrongly deleted the correct analyses this time. 

*Note that this example sentence is a little out of the ordinary and hence the bad performance of disambiguator. The more 'normal' your text is, the better the results.*

If we want to change the parameters of morphological analysis, we have to change the default values of the flags to create a customized resolver. For example, to switch off disambiguation, we have to set this value to False:

In [30]:
resolver2 = make_resolver(
                 disambiguate=False,
                 guess=True,
                 propername=True,
                 phonetic=False,
                 compound=True)

The resolver tags only those layers on the text that have not been previously tagged. To see the effect of changed parameters we have to create a new text object or delete the affected layer.

In [31]:
text.pop_layer('morph_analysis')  # remove morph_analysis from text 
text.tag_layer(resolver=resolver2)['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech,_ignore
Kärbes,Kärbes,Kärbe,Kärbe,['Kärbe'],s,,sg in,H,False
,Kärbes,Kärbes,Kärbes,['Kärbes'],0,,sg n,H,False
,Kärbes,kärbes,kärbes,['kärbes'],0,,sg n,S,False
hulbib,hulbib,hulpima,hulpi,['hulpi'],b,,b,V,False
mees,mees,mees,mees,['mees'],0,,sg n,S,False
,mees,mesi,mesi,['mesi'],s,,sg in,S,False
ja,ja,ja,ja,['ja'],0,,,J,False
naeris,naeris,naerma,naer,['naer'],is,,s,V,False
,naeris,naeris,naeris,['naeris'],0,,sg n,S,False
,naeris,naeris,naeris,['naeris'],s,,sg in,S,False


##### Note:
 * If disambiguation is switched off, the layer `morph_analysis` will have one extra attribute named `_ignore`. This is an internal attribute that tells the disambiguator which analyses should be ignored. Once disambiguation is applied, the attribute is removed. ( You can use `VabamorfDisambiguator` to perform disambiguation separately, see the details in the tutorial [B_06_morphological_analysis.ipynb](B_06_morphological_analysis.ipynb) );
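
To see which readings the disambiguator had to choose between, the rows of the undisambiguated table can be grouped by word. The following plain-Python sketch uses (word, lemma, partofspeech) triples copied by hand from the rows shown above; it does not call EstNLTK, and grouping by word form is a simplification that would merge repeated words in a longer text.

```python
# (word, lemma, partofspeech) rows copied from the table above.
rows = [
    ('Kärbes', 'Kärbe', 'H'),
    ('Kärbes', 'Kärbes', 'H'),
    ('Kärbes', 'kärbes', 'S'),
    ('hulbib', 'hulpima', 'V'),
    ('mees', 'mees', 'S'),
    ('mees', 'mesi', 'S'),
    ('ja', 'ja', 'J'),
    ('naeris', 'naerma', 'V'),
    ('naeris', 'naeris', 'S'),
    ('naeris', 'naeris', 'S'),
]

# Collect the candidate (lemma, partofspeech) readings of every word.
candidates = {}
for word, lemma, postag in rows:
    candidates.setdefault(word, []).append((lemma, postag))

# Keep only the words that have more than one candidate reading.
ambiguous = {word: readings for word, readings in candidates.items() if len(readings) > 1}

for word, readings in ambiguous.items():
    print(word, readings)
```

For example, _'mees'_ keeps both the reading with lemma _'mees'_ and the reading with lemma _'mesi'_ , and _'naeris'_ keeps a verb reading next to two noun readings.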

#### Unknown words

If guessing and disambiguation are switched off ( `guess=False`, `propername=False` and `disambiguate=False` ), then the morphological analyser can be used to detect unknown words -- words that are orthographically incorrect, or not common in written language:

In [32]:
from estnltk.resolve_layer_dag import make_resolver
# Switch off guessing and disambiguation
resolver3 = make_resolver(
                 disambiguate=False,
                 guess=False,
                 propername=False,
                 phonetic=False,
                 compound=True)

# Tag morph analysis
text = Text("Ma tahax minna järve ääde")
text.tag_layer(resolver=resolver3)['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore",words,,True,5

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech,_ignore
Ma,Ma,mina,mina,['mina'],0,,sg n,P,False
tahax,,,,,,,,,False
minna,minna,minema,mine,['mine'],a,,da,V,False
järve,järve,järv,järv,['järv'],0,,adt,S,False
,järve,järv,järv,['järv'],0,,sg g,S,False
,järve,järv,järv,['järv'],0,,sg p,S,False
ääde,,,,,,,,,False


In the previous example, the morphological analysis revealed two unknown words: _'tahax'_ and _'ääde'_. Each unknown word still has one analysis, but all attributes of the analysis (lemma, root, ending, partofspeech etc.) are set to `None`.
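
Because unknown words carry `None` in every attribute, picking them out is a simple filter. The sketch below works on (word, lemma) pairs copied by hand from the table above rather than on an EstNLTK layer, so it only illustrates the idea.

```python
# (word, lemma) pairs copied from the table above; None marks a missing analysis.
rows = [
    ('Ma', 'mina'),
    ('tahax', None),
    ('minna', 'minema'),
    ('järve', 'järv'),
    ('järve', 'järv'),
    ('järve', 'järv'),
    ('ääde', None),
]

# A word counts as unknown if its lemma is None.
unknown_words = [word for word, lemma in rows if lemma is None]
print(unknown_words)
```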

##### Remarks on morphological analysis:
 * Switching off guessing ( `guess=False` ) only works if guessing of proper names and disambiguation are also switched off  ( `propername=False` and `disambiguate=False` ). If only `guess=False` is used, then the setting is ignored, and the morphological analysis is performed with the default settings;
 * Be aware that: 
    * if `guess` is switched off, then punctuation symbols (such as `'.'`, `'!'`, `'?'`) also do not receive any analyses;
    * **(!)** if `guess` and `propername` are switched off, but disambiguation is switched on, then an exception will be raised if there are unknown words and/or punctuation symbols in the text. This is because disambiguation requires that all words have been morphologically analysed; 
    
      Note also that if you catch the exception, and proceed with the processing, then the Text object will still have layer `morph_analysis`. But the layer will be incomplete, as its analyses will be ambiguous and contain gaps in places of unknown words;
    * **(!)** if `guess` and `propername` are switched off,  and you switch on the parameter `slang_lex` , then slang words, such as `"kudas"` or `"muideks"` , still get analyses and appear as "known words". So, if you want to detect all non-standard words, you should switch off the parameters `guess` and `propername` and refrain from switching on `slang_lex`;
 * Reordering ( `use_reorderer=True` ) only works with `disambiguate=True`;
 * In practice, the parameters `compound` and `phonetic` rarely need to be changed, so it is advisable to change them only when you really know what you are doing.
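
The interactions between the flags can be summarised as a small checklist. The function below, `check_morph_settings`, is a hypothetical helper written for this tutorial and is not part of EstNLTK; it encodes one reading of the remarks above and returns the warnings that apply to a given combination of flags.

```python
def check_morph_settings(guess=True, propername=True, disambiguate=True,
                         use_reorderer=True):
    """Hypothetical helper (not part of EstNLTK): list the remarks that apply."""
    remarks = []
    if not guess:
        if propername:
            remarks.append("guess=False takes effect only when propername=False "
                           "and disambiguate=False")
        elif disambiguate:
            remarks.append("unknown words or punctuation may raise an exception "
                           "during disambiguation")
        else:
            remarks.append("punctuation symbols receive no analyses")
    if use_reorderer and not disambiguate:
        remarks.append("reordering only works with disambiguate=True")
    return remarks

# Default settings: nothing to warn about.
print(check_morph_settings())

# Guessing and proper name analysis off, disambiguation still on.
print(check_morph_settings(guess=False, propername=False))
```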

## <span style="color:purple"> Examples</span>

Here, two simple examples of using the EstNLTK basic toolchain for extracting relevant parts of text are presented. Let's use the following short text as our corpus:

"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

### Example 1: Finding all different nouns from the text

Let's assume we want to find all different noun lemmas that appear in the text. So, first we have to turn our text into an EstNLTK Text object:

In [33]:
my_text = Text("Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel.")

In [34]:
my_text

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."


Next, we need to let the taggers do their job. Let's use the automatic tag_layer() method for that:

In [35]:
my_text.tag_layer()

text
"Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus. Näiteks pildiraamide meisterdamisel, kus on oluline, et detailide lõikenurgad oleksid kõik täpselt 45 kraadi. Sellisel juhul on nurgasaag täiuslikuks tööriistaks, sest tagab täpsuse ja lõike korratavuse. Üldiselt on valdav osa nurgasaage seadistatavad 45-kraadise lõikenurga alla vähemalt ühes suunas. Lisaks võimaldavad mõned saed veel ka saetera kaldenurga seadistamist, mis tuleb kasuks keerukamate detailide lõikamisel. Nurgasaag on väga tõhus ka kitsamate, kuni 30 cm laiuste puulaudade või muude puitdetailide ristlõigete tegemiseks ehk järkamiseks, mida tuleb palju ette näiteks puitkonstruktsioonide ehitamisel või ka näiteks terrassilaudade või puitparketi paigaldamisel."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,6
tokens,,,,False,124
compound_tokens,"type, normalized",,tokens,False,1
words,normalized_form,,,True,122
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,122


Now we can iterate over the lemmas and part-of-speech tags to extract the lemmas that are tagged as nouns. For this, it's easiest to use the zip function:

In [36]:
noun_lemmas = []
for lemma, postag in zip(my_text.lemma, my_text.partofspeech):
    if 'S' in postag:
        noun_lemmas += lemma

**NB!** Note that both the lemmas and the part-of-speech tags of every word are given as lists. That's why we check whether 'S' is in `postag`, and why we extend `noun_lemmas` with `+=` instead of using `append`, which would add each word's whole list of lemmas as a single nested element.
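
One consequence of working with lists: when a word keeps several analyses, `+=` adds all of its lemmas, including those that belong to non-noun analyses. A stricter variant pairs every lemma with its own tag. The sketch below is plain Python on hand-copied data (the undisambiguated analyses of _'ja'_ and _'naeris'_ from the earlier table); the names `loose` and `strict` are illustrative.

```python
# Per-word lemma and part-of-speech lists for the words 'ja' and 'naeris'.
word_lemmas = [['ja'], ['naerma', 'naeris', 'naeris']]
word_postags = [['J'], ['V', 'S', 'S']]

loose = []   # all lemmas of any word that has at least one noun analysis
strict = []  # only the lemmas whose own analysis is a noun
for lemmas, postags in zip(word_lemmas, word_postags):
    if 'S' in postags:
        loose += lemmas
    strict += [lemma for lemma, postag in zip(lemmas, postags) if postag == 'S']

print(loose)
print(strict)
```

With disambiguated text most words have a single analysis, so the two variants usually give the same result.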

In [37]:
noun_lemmas

['nimi',
 'nurgasaag',
 'tööriist',
 'puitdetail',
 'lõikamine',
 'eesmärk',
 'lõikenurk',
 'lõikenurk',
 'seadistamine',
 'võimalus',
 'näide',
 'pildiraam',
 'meisterdamine',
 'detail',
 'lõikenurk',
 'kraad',
 'juht',
 'nurgasaag',
 'tööriist',
 'täpsus',
 'lõige',
 'korratavus',
 'osa',
 'nurgasaag',
 'lõikenurk',
 'suund',
 'lisa',
 'saag',
 'saetera',
 'kaldenurk',
 'seadistamine',
 'kasu',
 'detail',
 'lõikamine',
 'nurgasaag',
 'laius',
 'puulaud',
 'puitdetail',
 'ristlõige',
 'tegemine',
 'järkamine',
 'näide',
 'puitkonstruktsioon',
 'ehitamine',
 'näide',
 'terrassilaud',
 'puitparkett',
 'paigaldamine']

In this case, we can easily guess what the specific text was about. If we had a larger corpus, we could make a frequency list and, e.g., draw some conclusions about topics/word use in a specific publication or variety of language in general.
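
A frequency list needs nothing more than `collections.Counter` from the standard library. As a sketch, the example below counts the first twenty lemmas of the output above, typed in by hand to keep it short; with a real corpus you would pass the whole `noun_lemmas` list instead.

```python
from collections import Counter

# The first twenty noun lemmas from the output above.
some_noun_lemmas = [
    'nimi', 'nurgasaag', 'tööriist', 'puitdetail', 'lõikamine',
    'eesmärk', 'lõikenurk', 'lõikenurk', 'seadistamine', 'võimalus',
    'näide', 'pildiraam', 'meisterdamine', 'detail', 'lõikenurk',
    'kraad', 'juht', 'nurgasaag', 'tööriist', 'täpsus',
]

lemma_counts = Counter(some_noun_lemmas)

# Print the lemmas that occur more than once, most frequent first.
for lemma, count in lemma_counts.most_common():
    if count > 1:
        print(lemma, count)
```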

### Example 2: Finding all sentences that contain an infinitive verb

Let's assume we want to extract all sentences containing an infinitive verb form (the da-infinitive, which has the form `da`) from a corpus. Let's use the same text as in the previous example. Since we have already tagged the necessary layers, we can iterate over the forms and extract the sentences that we want to study:

In [38]:
infinitive_sentences = []
for sent in my_text.sentences:
    for form in sent.form:
        if 'da' in form:
            a = sent.enclosing_text
            infinitive_sentences.append(a)
            break  # stop at the first match so the same sentence is not added twice
infinitive_sentences

['Nagu nimigi reedab, on nurgasaag kõige tõhusam tööriist erinevate puitdetailide lõikamiseks, kus eesmärgiks on saavutada täpne lõikenurk ning oluline on lõikenurga seadistamise võimalus.']
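
The same filter can be written more compactly with `any()`, which stops at the first match just like the `break` above. The sketch below is plain Python on hand-copied data: two sentences from earlier in this tutorial together with the per-word form lists shown in their analysis tables. The helper `has_form` is a hypothetical name.

```python
# (sentence text, per-word lists of forms), copied from the earlier tables.
corpus = [
    ('Ma tahax minna järve ääde',
     [['sg n'], [None], ['da'], ['adt', 'sg g', 'sg p'], [None]]),
    ('Aga kõik juhtus iseenesest.',
     [[''], ['pl n', 'sg n'], ['s'], [''], ['']]),
]

def has_form(word_forms, wanted):
    """Return True if any word of the sentence has the wanted form."""
    return any(wanted in forms for forms in word_forms)

infinitive_sentences = [text for text, word_forms in corpus if has_form(word_forms, 'da')]
print(infinitive_sentences)
```

Note that `wanted in forms` is a membership test on a list, so only the exact form `da` matches, not forms that merely contain those letters.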