# Basics of EstNLTK 1.7

<ul>
 <li><a href="#Text-object">Text object</a></li>
  <ul><li><a href="#Metadata-about-Text">Metadata about Text</a></li>
  <li><a href="#Layers">Layers</a></li>
    <ul><ul><li><a href="#Adding-layers-via-tag_layer">Adding layers via tag_layer</a></li>
    <li><a href="#Which-layers-can-be-created-via-tag_layer?-(-DEFAULT_RESOLVER-)">Which layers can be created via tag_layer? ( DEFAULT_RESOLVER )</a></li>
    <li><a href="#Importing-taggers-from-estnltk.taggers.-Applying-a-tagger-directly-(-the-method-tag-)">Importing taggers from estnltk.taggers. Applying a tagger directly ( the method tag )</a></li>
    <li><a href="#Changing-taggers-of-the-pipeline-(-make_resolver-)">Changing taggers of the pipeline ( make_resolver )</a></li>
    <li><a href="#Removing-a-layer">Removing a layer</a></li>
  </ul></ul><li><a href="#Accessing-annotations-of-Text.-Iterating-and-querying-annotations">Accessing annotations of Text. Iterating and querying annotations</a></li>
    <ul><ul><li><a href="#Textual-span-of-annotation-(-Span-and-EnvelopingSpan-)">Textual span of annotation ( Span and EnvelopingSpan )</a></li>
    <li><a href="#Informational-content-of-annotation-(-Annotation-)">Informational content of annotation ( Annotation )</a></li>
    <li><a href="#Selecting-multiple-annotations:-indexing-operators">Selecting multiple annotations: indexing operators</a></li>
    <li><a href="#Iterating-over-multiple-layers:-an-example">Iterating over multiple layers: an example</a></li>
    <li><a href="#Grouping-annotations-(-Layer.groupby-)">Grouping annotations ( Layer.groupby )</a></li>
    <li><a href="#Sliding-window-over-annotations-(-Layer.rolling-)">Sliding window over annotations ( Layer.rolling )</a></li>
  </ul></ul><li><a href="#Dividing-Text-object-into-smaller-Text-objects">Dividing Text object into smaller Text objects</a></li>
    <ul><ul><li><a href="#Making-extracts-from-Text-(-extract_sections-)">Making extracts from Text ( extract_sections )</a></li>
    <li><a href="#Splitting-Text-(-split_by-)">Splitting Text ( split_by )</a></li>
  </ul></ul><li><a href="#Removing-layer's-dependencies-(-flatten-)">Removing layer's dependencies ( flatten )</a></li>
 </ul>
</ul>

Online documentation is best viewed with https://nbviewer.jupyter.org/

# `Text` object

The central component of EstNLTK is the `Text` class.
It encapsulates the raw text and lets you apply text analysers (_taggers_) to it.
The analysis results (_annotations_) are also accessed via the `Text` object.

Example: creating a Text object:

In [1]:
from estnltk import Text
text = Text('Ära mine sinna, kuhu viib rada. Mine selle asemel sinna, kus pole ühtki rada ja ole teerajaja.')
text

text
"Ära mine sinna, kuhu viib rada. Mine selle asemel sinna, kus pole ühtki rada ja ole teerajaja."


The attribute `text` can be used to get the initial raw text as a string:

In [2]:
text.text

'Ära mine sinna, kuhu viib rada. Mine selle asemel sinna, kus pole ühtki rada ja ole teerajaja.'

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Initial text is immutable</i></h4> 
<br>
EstNLTK adheres to the principle that the initial raw text should always remain immutable. 
If you need to change the raw text, you should create a new <code>Text</code> object corresponding to the changed text.
<br>
Remark: However, the raw text can be indirectly altered through changing annotations. An example is word normalization: normalized word forms are stored as annotations and thus can be changed in order to improve downstream linguistic analysis. For details, see <a href="nlp_pipeline/A_text_segmentation/03_words.ipynb">this tutorial</a>.
</div>
</p>

## Metadata about Text

The metadata of a `Text` object is a simple dictionary, which can be accessed via the `meta` attribute:

```python
# setting metadata dictionary
text.meta = {'author': 'Tundmatu', 'date': 2015}
# setting metadata items one by one
text.meta['origin'] = 'tsitaadid.ee'
text.meta['url'] = 'https://tsitaadid.ee/quote/576/14'
```
By default, the created `Text` object does not have any metadata -- metadata needs to be added by the user. 
However, EstNLTK's [corpus importing functions](corpus_processing/importing_text_objects_from_corpora.ipynb) try to populate the imported texts with metadata if possible.

Note: if you need to serialize `Text` objects and/or use the Postgres storage, then it is advised to use only the data types `str`, `int`, `float` and `datetime` for metadata.
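The type restriction above can be checked before serializing. The following is a minimal sketch in plain Python; the helper name `find_unsafe_meta` and the `tags` entry are hypothetical and not part of EstNLTK:

```python
from datetime import datetime

# Value types recommended above for serialization / Postgres storage
SAFE_META_TYPES = (str, int, float, datetime)

def find_unsafe_meta(meta):
    """Return the metadata keys whose values are not of a recommended type.

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    return [key for key, value in meta.items()
            if not isinstance(value, SAFE_META_TYPES)]

# The list value under 'tags' is the only entry outside the recommended types
meta = {'author': 'Tundmatu', 'date': 2015, 'tags': ['quote', 'path']}
print(find_unsafe_meta(meta))
# ['tags']
```

With a real `Text` object, the same check would be `find_unsafe_meta(text.meta)`.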

## Layers

A layer is a collection of annotations with the same set of attributes. Each annotation refers to a span, which specifies a text region, and carries the values of the layer's attributes for that region.

### Adding layers via `tag_layer`

The method `tag_layer` adds annotation layers to the `Text` object by using EstNLTK's basic NLP pipeline:

In [3]:
text.tag_layer(['tokens', 'words'])

text
"Ära mine sinna, kuhu viib rada. Mine selle asemel sinna, kus pole ühtki rada ja ole teerajaja."

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,21
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,21


Most of the layers created by the basic pipeline have **dependencies** (an exception is the layer `'tokens'`).
If a layer has dependencies, it can only be created after dependency layers have been created.
The method `tag_layer` resolves dependencies automatically and creates all the prerequisite layers.
In the previous example: in addition to the layers `'tokens'` and `'words'`, the layer `'compound_tokens'` was also created, because it was required by the layer `'words'`.

If `tag_layer` is called without input arguments, the default value `['morph_analysis', 'sentences']` is used:

In [4]:
text.tag_layer()

text
"Ära mine sinna, kuhu viib rada. Mine selle asemel sinna, kus pole ühtki rada ja ole teerajaja."

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,21
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,21
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,21


Remark: if `tag_layer` is called on a `Text` object that already has the input layers, the existing layers remain as they are: there will be no updating nor overwriting.
If you need to update an existing layer (e.g. perform morphological analysis with different settings), then you first need to remove the old layer, and then tag it once again.

What does `tag_layer` return? It returns the `Text` object on which it was called:

In [5]:
annotated_text = text.tag_layer()
assert text == annotated_text

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Adding layers via <code>analyse</code> [Deprecated]</i></h4> 
<br>
EstNLTK's versions 1.6.0beta - 1.6.9beta also allowed adding layers via the method <code>Text.analyse</code>, but this is no longer supported. If your code uses the deprecated <code>analyse</code> method, you can replace it with <code>tag_layer</code> in the following ways:
<ul>
    <li><code>text.analyse('segmentation')</code> is the same as <code>text.tag_layer('paragraphs'); text.pop_layer('tokens');</code></li>
    <li><code>text.analyse('morphology')</code> is the same as <code>text.tag_layer('morph_analysis'); text.pop_layer('tokens');</code></li>
    <li><code>text.analyse('syntax_preprocessing')</code> is the same as <code>text.tag_layer(['sentences','morph_extended']); text.pop_layer('tokens');</code></li>
    <li><code>text.analyse('all')</code> is the same as <code>text.tag_layer(['paragraphs','morph_extended']);</code></li>
</ul>
</div>
</p>

### Which layers can be created via `tag_layer`? ( `DEFAULT_RESOLVER` )

The method `tag_layer` knows how to create layers and how to resolve dependencies between layers thanks to a special component called `LayerResolver`.
You can access this component via `Text` object's attribute `layer_resolver`:

In [6]:
# NBVAL_IGNORE_OUTPUT
text.layer_resolver

layer,depends_on,tagger_name,description
tokens,[],TokensTagger,Preprocessing for word segmentation: segments text into tokens.
compound_tokens,[tokens],CompoundTokenTagger,Preprocessing for word segmentation: joins tokens into compound tokens.
words,"[tokens, compound_tokens]",WordTagger,Segments text into words.
sentences,"[compound_tokens, words]",SentenceTokenizer,Segments text into sentences.
paragraphs,[sentences],ParagraphTokenizer,Segments text into paragraphs.
morph_analysis,"[compound_tokens, words, sentences]",VabamorfTagger,Tags morphological analysis with Vabamorf.
clauses,"[words, sentences, morph_analysis]",ClauseSegmenter,Segments sentences into clauses. (requires Java)
morph_analysis_est,[morph_analysis],VabamorfEstCatConverter,Translates category names of Vabamorf's morphological analyses into Estonian (for educational purposes).
morph_extended,[morph_analysis],MorphExtendedTagger,Converts Vabamorf's morphological analyses to syntax preprocessing (CG3) format.
gt_morph_analysis,"[words, sentences, morph_analysis, clauses]",GTMorphConverter,Converts Vabamorf's morphological analyses to giellatekno's (GT) format.


By default, `LayerResolver`'s representation shows which layers can be created via `tag_layer`, which layers they depend on, and which taggers are responsible for creating them. 
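To illustrate how such a dependency table determines which layers get created, here is a minimal sketch in plain Python. It is not EstNLTK's actual implementation: the function `creation_order` is hypothetical, and the dictionary simply copies the `depends_on` column of the table above.

```python
# Dependencies copied from the `depends_on` column of the table above
DEPENDS_ON = {
    'tokens': [],
    'compound_tokens': ['tokens'],
    'words': ['tokens', 'compound_tokens'],
    'sentences': ['compound_tokens', 'words'],
    'paragraphs': ['sentences'],
    'morph_analysis': ['compound_tokens', 'words', 'sentences'],
    'clauses': ['words', 'sentences', 'morph_analysis'],
    'morph_analysis_est': ['morph_analysis'],
    'morph_extended': ['morph_analysis'],
    'gt_morph_analysis': ['words', 'sentences', 'morph_analysis', 'clauses'],
}

def creation_order(requested, depends_on=DEPENDS_ON):
    """List the requested layers and all their prerequisites, prerequisites first.

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    ordered = []

    def visit(layer):
        if layer in ordered:
            return  # already scheduled
        for dependency in depends_on[layer]:
            visit(dependency)
        ordered.append(layer)

    for layer in requested:
        visit(layer)
    return ordered

print(creation_order(['tokens', 'words']))
# ['tokens', 'compound_tokens', 'words']
```

The result matches the earlier `tag_layer(['tokens', 'words'])` example, where `'compound_tokens'` was created in addition to the two requested layers.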

If you want to see the attributes of the layers, you can change the representation by calling `text.layer_resolver.layer_attributes`:

In [7]:
# NBVAL_IGNORE_OUTPUT
text.layer_resolver.layer_attributes

layer,attributes,tagger_name,description
tokens,(),TokensTagger,Preprocessing for word segmentation: segments text into tokens.
compound_tokens,"(type, normalized)",CompoundTokenTagger,Preprocessing for word segmentation: joins tokens into compound tokens.
words,"(normalized_form,)",WordTagger,Segments text into words.
sentences,(),SentenceTokenizer,Segments text into sentences.
paragraphs,(),ParagraphTokenizer,Segments text into paragraphs.
morph_analysis,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",VabamorfTagger,Tags morphological analysis with Vabamorf.
clauses,"(clause_type,)",ClauseSegmenter,Segments sentences into clauses. (requires Java)
morph_analysis_est,"(normaliseeritud_sõne, algvorm, lõpp, sõnaliik, vormi_nimetus, kliitik)",VabamorfEstCatConverter,Translates category names of Vabamorf's morphological analyses into Estonian (for educational purposes).
morph_extended,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat)",MorphExtendedTagger,Converts Vabamorf's morphological analyses to syntax preprocessing (CG3) format.
gt_morph_analysis,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",GTMorphConverter,Converts Vabamorf's morphological analyses to giellatekno's (GT) format.


Calling `text.layer_resolver.layer_dependencies` switches back to the representation listing dependencies of each layer.

`LayerResolver` available via `text.layer_resolver` corresponds to EstNLTK's basic NLP pipeline. 
This pipeline can also be imported separately as `DEFAULT_RESOLVER`:

```python
from estnltk.default_resolver import DEFAULT_RESOLVER
```

`LayerResolver`'s method `get_tagger(layer)` returns the tagger responsible for creating `layer`.
This also lets you examine the configuration of the tagger. 
Example:

In [8]:
text.layer_resolver.get_tagger('morph_analysis')

name,output layer,output attributes,input layers
VabamorfTagger,morph_analysis,"('normalized_text', 'lemma', 'root', 'root_tokens', 'ending', 'clitic', 'form', 'partofspeech')","('words', 'sentences', 'compound_tokens')"

0,1
guess,True
propername,True
disambiguate,True
compound,True
phonetic,False
slang_lex,False
postanalysis_tagger,"PostMorphAnalysisTagger(('compound_tokens', 'words', 'morph_analysis')->morph_analysis)"
use_postanalysis,True
analysis_reorderer,"MorphAnalysisReorderer(('morph_analysis',)->morph_analysis)"
use_reorderer,True


`LayerResolver` has attribute `default_layers` which lists names of the layers that are created if `tag_layer` is called without input arguments:

In [9]:
text.layer_resolver.default_layers

('morph_analysis', 'sentences')

This attribute can also be changed to different default values:

In [10]:
from estnltk import Text
text = Text('Ma ei hooli juveelidest. Mulle meeldivad lilled.')

# Change default layer to gt_morph_analysis
text.layer_resolver.default_layers = ['gt_morph_analysis']

# Tag gt_morph_analysis (and all its prerequisites)
text.tag_layer()

text
Ma ei hooli juveelidest. Mulle meeldivad lilled.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,9
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9
clauses,clause_type,,words,False,2
gt_morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",morph_analysis,,True,9


Taggers inside `LayerResolver` can also be updated; see below for details.

### Importing taggers from `estnltk.taggers`. Applying a tagger directly ( the method `tag` )

`DEFAULT_RESOLVER` does not include all the taggers available in EstNLTK, for several reasons. 
Some taggers can be applied only in specific contexts, some depend on specific resources (e.g. large models that need to be downloaded separately), and some address specific tasks in specific domains (e.g. detecting dates in medical texts).
However, most of EstNLTK's taggers (except those meant for internal usage) can be imported from `estnltk.taggers` and then applied when needed:

```python
import estnltk.taggers
# List names of taggers that can be imported
dir( estnltk.taggers )
```

Example. Let's create a new text for analysis:

In [11]:
from estnltk import Text
text = Text('Ma ei hooli juveelidest. Mulle meeldivad lilled.')

This time, we want to analyse the text morphologically by applying the corresponding tagger directly to the text. 
First, let's import the `VabamorfTagger`:

In [12]:
from estnltk.taggers import VabamorfTagger
# Create morphological tagger with default settings
morph_tagger = VabamorfTagger()
morph_tagger

name,output layer,output attributes,input layers
VabamorfTagger,morph_analysis,"('normalized_text', 'lemma', 'root', 'root_tokens', 'ending', 'clitic', 'form', 'partofspeech')","('words', 'sentences', 'compound_tokens')"

0,1
guess,True
propername,True
disambiguate,True
compound,True
phonetic,False
slang_lex,False
postanalysis_tagger,"PostMorphAnalysisTagger(('compound_tokens', 'words', 'morph_analysis')->morph_analysis)"
use_postanalysis,True
analysis_reorderer,"MorphAnalysisReorderer(('morph_analysis',)->morph_analysis)"
use_reorderer,True


Note that taggers themselves do not create their dependencies automatically: they raise an exception if a dependency layer is missing.
So, before applying a tagger, you need to make sure that the input text has all the prerequisite layers ( _input layers_ ).

In our example, the input `Text` lacks the required layers `'words'`, `'sentences'` and `'compound_tokens'`. 
We can add these via `tag_layer`:

In [13]:
# Add prerequisite input layers
text.tag_layer(['words', 'sentences', 'compound_tokens'])

text
Ma ei hooli juveelidest. Mulle meeldivad lilled.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,9


Once we have added all dependencies, we can use the method `tag`, which creates a new layer and adds it to the `Text` object:

In [14]:
# apply tagger on the text
morph_tagger.tag( text )

text
Ma ei hooli juveelidest. Mulle meeldivad lilled.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,2
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,9
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9


Further details about EstNLTK's taggers:

🔗 Introduction to the basic NLP pipeline and morphological tagging: [nlp_pipeline/introduction_to_nlp_pipeline.ipynb](nlp_pipeline/introduction_to_nlp_pipeline.ipynb)

🔗 Detailed tutorials about the NLP components: [nlp_pipeline](nlp_pipeline)

🔗 Detailed tutorials about using system taggers and creating your own taggers: [taggers](taggers)

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Taggers and Retaggers</i></h4> 
<br>
Some of EstNLTK's taggers create new layers, while others rewrite existing layers (fix or update annotations).
A tagger inheriting from <b><code>Retagger</code></b> class rewrites an existing layer.
If you use a <b><code>Retagger</code></b>, make sure the target layer (<code>output_layer</code>) has already been created.
Then you can use the method <code>retag( text )</code> to rewrite the layer.
</div>
</p>

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>The name of the output layer</i></h4> 
<br>
If you import a tagger via <code>estnltk.taggers</code>, then you can also change the name of the <code>output_layer</code> via constructor's parameter. For instance:
<pre>
from estnltk.taggers import VabamorfTagger
morph_tagger = VabamorfTagger(output_layer='my_morph_analysis')
</pre>
Changing layer names is useful if you need to compare different configurations or versions of a tagger.
For example, if we named morph analysis layers according to versions of the tagger (such as <code>'morph_analysis_v1'</code>, <code>'morph_analysis_v2'</code>), then we could compare them to one another with <a href="taggers/system/diff_tagger.ipynb"><code>DiffTagger</code></a>.
</div>
</p>

### Changing taggers of the pipeline ( `make_resolver` )

The function `make_resolver` is responsible for creating the `DEFAULT_RESOLVER`. The easiest way of modifying the pipeline is by making a copy of the default pipeline with `make_resolver` and then updating its taggers according to your needs.

An example:

In [15]:
from estnltk.default_resolver import make_resolver

my_resolver = make_resolver()  # Make a copy of the default resolver

Now, we can use the method `update` to replace an existing tagger in the pipeline with a new one:


In [16]:
# Create a new morphological tagger that has disambiguation and guesser components switched off
from estnltk.taggers import VabamorfTagger
vabamorf_tagger = VabamorfTagger( disambiguate=False, guess=False, propername=False )

# Replace the existing tagger in the pipeline with a new one
my_resolver.update( vabamorf_tagger )

In order to apply the new pipeline, you need to specify which `resolver` is to be used when calling `tag_layer`:

In [17]:
from estnltk import Text
text = Text('Metsawahi hobusele om uus laut ehitet.')
text.tag_layer(['morph_analysis'], resolver=my_resolver)
text.morph_analysis

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore",words,,True,7

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech,_ignore
Metsawahi,,,,,,,,,False
hobusele,hobusele,hobune,hobune,['hobune'],le,,sg all,S,False
om,,,,,,,,,False
uus,uus,uus,uus,['uus'],0,,sg n,A,False
laut,laut,laut,laut,['laut'],0,,sg n,S,False
ehitet,,,,,,,,,False
.,,,,,,,,,False


In the previous example, guessing of unknown words was switched off, so the words with old spelling ( _Metsawahi_ , _om_ , _ehitet_ ) received empty analyses (`None` attribute values) from the morphological analysis.
So by changing parameters of morphological tagging, we have successfully detected spelling variants that are unknown to contemporary Estonian.

<p>
<div class="alert alert-block alert-warning"> 
<h4><i><code>make_resolver</code> and the parameters of morphological analysis</i></h4> 
<br>
As the morphological tagger is the central component of EstNLTK's linguistic analysis, it is also possible to change the parameters of morphological analysis directly via <code>make_resolver</code>. 
For details, see the tutorial: <a href="nlp_pipeline/introduction_to_nlp_pipeline.ipynb">introduction_to_nlp_pipeline.ipynb</a>.
</div>
</p>

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Warning: updating layer resolver can lead to conflicts in the pipeline</i></h4> 
<br>
<code>make_resolver</code> and <code>DEFAULT_RESOLVER</code> provide a working version of EstNLTK's pipeline. However, there is no guarantee that the pipeline remains fully functional if you change or update some of its taggers. 
In fact, in the example above, switching off the morphological disambiguation and guesser components <b>will degrade the quality</b> of the components that depend on morphological analysis, and some components will become non-functional (<code>my_resolver</code> cannot be used for named entity recognition, nor for syntactic preprocessing and analysis). We recommend changing the pipeline only when you understand the risks and the dependencies between taggers.
</div>
</p>

If you have accidentally updated the default pipeline (available via `Text.tag_layer`) in a way that makes some taggers non-functional, you can use `make_resolver` to reset the pipeline:

In [18]:
from estnltk.default_resolver import make_resolver
# Reset default resolver
Text.layer_resolver = make_resolver()

<p>
<div class="alert alert-block alert-warning"> 
<h4><i><code>make_resolver</code> and Python's multiprocessing</i></h4> 
<br>
If you want to use EstNLTK's morphological analysis with Python's multiprocessing, you should make a separate <code>LayerResolver</code> for each job, and pass it to <code>tag_layer</code> via the <code>resolver</code> argument inside the job.
Using a single resolver (<code>DEFAULT_RESOLVER</code>) for multiple jobs will result in an error <code>('CFSException: internal error with vabamorf' ... )</code>.
</div>
</p>
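The multiprocessing advice above could be followed along these lines. This is only a sketch under assumptions: it relies on `make_resolver`, `Text`, `tag_layer(..., resolver=...)` and `annotations` exactly as they are used elsewhere in this tutorial, while the function names `lemmas_of` and `analyse_all` are hypothetical. Each job builds its own resolver, which costs extra start-up time but follows the rule literally, and each job returns plain lists of lemmas rather than `Text` objects to keep the results simple to send back to the parent process.

```python
from multiprocessing import Pool

def lemmas_of(raw_text):
    """One job: analyse a single raw text with a resolver of its own.

    Hypothetical helper for illustration; not part of EstNLTK.
    """
    # Imported inside the job, so that EstNLTK is set up in the worker process
    from estnltk import Text
    from estnltk.default_resolver import make_resolver

    resolver = make_resolver()  # a separate LayerResolver for this job
    text = Text(raw_text)
    text.tag_layer(['morph_analysis'], resolver=resolver)
    # Keep only the lemma of the first analysis of each word
    return [span.annotations[0]['lemma'] for span in text['morph_analysis']]

def analyse_all(raw_texts, processes=2):
    """Distribute the jobs over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(lemmas_of, raw_texts)
```

In a script, `analyse_all([...])` should be called under an `if __name__ == '__main__':` guard, which `multiprocessing` needs on platforms that start workers by spawning a fresh interpreter.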

### Removing a layer

The method `pop_layer` removes the layer from the `Text` object and returns it:

In [19]:
# Remove morph analysis layer
text.pop_layer('morph_analysis')

# Make sure the layer is no longer there
text.layers

{'compound_tokens', 'sentences', 'tokens', 'words'}

_NB!_ If a `Text` object has other layers that depend on the removed layer, then these layers will also be removed.

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Remark about older versions</i></h4> 
<br>
In EstNLTK's versions 1.6.0b to 1.6.5b, the <code>del</code> operator was used for removing layers. For instance:
<pre>
>> del text.morph_analysis
</pre>
However, deleting layers with the <code>del</code> operator is no longer supported in newer versions.
</div>
</p>

## Accessing annotations of `Text`. Iterating and querying annotations

There are two equivalent ways to access layers:

* access via index operator:
`text['tokens']`
* access via attribute:
`text.tokens`

Both ways give a `Layer` object, which is basically a collection of annotations.

Example:

In [20]:
# Create a text with words, sentences and morph_analysis annotations
from estnltk import Text
text = Text('Ma ei hooli juveelidest. Mulle meeldivad lilled.')
text.tag_layer(['words', 'sentences', 'morph_analysis'])

# Ask for morph_analysis layer
text['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Ma,Ma,mina,mina,['mina'],0,,sg n,P
ei,ei,ei,ei,['ei'],0,,neg,V
hooli,hooli,hoolima,hooli,['hooli'],0,,o,V
juveelidest,juveelidest,juveel,juveel,['juveel'],dest,,pl el,S
.,.,.,.,['.'],,,,Z
Mulle,Mulle,mina,mina,['mina'],lle,,sg all,P
meeldivad,meeldivad,meeldima,meeldi,['meeldi'],vad,,vad,V
lilled,lilled,lill,lill,['lill'],d,,pl n,S
.,.,.,.,['.'],,,,Z


### Textual span of annotation ( `Span` and `EnvelopingSpan` )

A typical _annotation_ consists of a `Span`, which specifies the location of annotated text fragment, and `Annotation` objects, which specify the information contained in the annotation (attribute-value pairs). 

When accessing a single element of a layer, you will get a span:

In [21]:
# Ask for the first element of 'morph_analysis'
text['morph_analysis'][0]

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Ma,Ma,mina,mina,['mina'],0,,sg n,P


In case of an **_enveloping layer_** , the span is defined as a sequence of spans from some other layer, and it's called `EnvelopingSpan`.
For example, a sentence consists of the words inside the sentence:

In [22]:
# First sentence (a list of words)
text['sentences'][0]

text
Ma ei hooli juveelidest.


Each span has attributes `start`, `end` and `text`, which specify start/end indexes of the annotated text fragment, and the corresponding textual fragment.

Examples:

In [23]:
text['words'][0].start

0

In [24]:
text['words'][0].end

2

In [25]:
text['words'][0].text

'Ma'

In case of an _enveloping layer_ , the attribute `text` returns a list of strings instead of a single string. 
For instance, when asking for the `text` of a sentence, you will get a list of the `text` values of the words belonging to the sentence:

In [26]:
# "text" value of the first sentence
text['sentences'][0].text

['Ma', 'ei', 'hooli', 'juveelidest', '.']

If you need to get the raw string corresponding to an _enveloping span_ , you should use the attribute `enclosing_text` instead:

In [27]:
# Enclosing text of the first sentence
text['sentences'][0].enclosing_text

'Ma ei hooli juveelidest.'

In addition to the `'sentences'` layer, `'clauses'`, `'compound_tokens'` and `'paragraphs'` are also enveloping layers.

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Warning: <code>enclosing_text</code> and discontinuous spans</i></h4> 
<br>
The attribute <code>enclosing_text</code> gives a substring of initial text between indexes <code>start</code> and <code>end</code>.
This holds true even if we have an enveloping span that does not contain a continuous region of spans, but has some gaps in its span list.
This is the reason you should be careful when using <code>enclosing_text</code> with the layer <code>'clauses'</code>, because a clause can be made of discontinuous snippets of word spans, and <code>enclosing_text</code> can give a false impression about the extent of the clause.
</div>
</p>
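The pitfall described in the warning can be reproduced without EstNLTK. The sketch below is a toy model in plain Python: the English sentence, the helper `find_span` and the hand-picked clause are all made up for illustration, and spans are modelled as simple `(start, end)` pairs.

```python
raw_text = 'The dog, which barked loudly, ran away.'

def find_span(word, start_at=0):
    """Return (start, end) offsets of `word` in raw_text, searching from start_at."""
    start = raw_text.index(word, start_at)
    return (start, start + len(word))

# Word spans of the main clause "The dog ... ran away"; the relative clause
# in the middle is deliberately left out, so the span list has a gap
clause_words = []
position = 0
for word in ['The', 'dog', 'ran', 'away']:
    span = find_span(word, position)
    clause_words.append(span)
    position = span[1]

start, end = clause_words[0][0], clause_words[-1][1]

# Counterpart of `text`: only the words that belong to the clause
print([raw_text[s:e] for s, e in clause_words])
# ['The', 'dog', 'ran', 'away']

# Counterpart of `enclosing_text`: everything between start and end,
# including the words of the embedded clause
print(raw_text[start:end])
# The dog, which barked loudly, ran away
```

The second output contains words that are not part of the clause, which is exactly the false impression the warning refers to.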

<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Comparing spans</i> (<code>estnltk_core.layer.span_operations</code>) </h4> 
<br>
EstNLTK has operators for systematic comparison of <code>Span</code> objects. For instance:
<ul> 
 <li> <code>conflict(span_x, span_y)</code> checks if <code>span_x</code> and <code>span_y</code> are nested (one of them is inside the other), or if there is an overlap between them from right or from left side;</li>
 <li> <code>nested(span_x, span_y)</code> checks if one of the spans is inside the other;</li>
 <li> <code>equal(span_x, span_y)</code> checks if the spans are totally overlapping / equal;</li>
</ul>
    
🔗 There are more comparing operators available, see the source for details: <a href="https://github.com/estnltk/estnltk/blob/main/estnltk_core/estnltk_core/layer/span_operations.py">https://github.com/estnltk/estnltk/blob/main/estnltk_core/estnltk_core/layer/span_operations.py</a>
    
NB! Please keep in mind that these functions only compare locations (spans) of annotations, ignoring their informational content (<code>Annotation</code> objects).
</div>
</p>
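To make the three relations listed above concrete, here is a simplified re-implementation on bare offsets. It is a sketch, not the code of `estnltk_core.layer.span_operations`: `ToySpan` is a hypothetical stand-in for a span, and the offsets are taken from the example text used in this tutorial.

```python
from collections import namedtuple

# Toy stand-in for a span: only the start and end offsets matter here
ToySpan = namedtuple('ToySpan', ['start', 'end'])

def equal(x, y):
    """Both spans cover exactly the same region."""
    return x.start == y.start and x.end == y.end

def nested(x, y):
    """One of the spans lies inside the other."""
    return ((x.start <= y.start and y.end <= x.end) or
            (y.start <= x.start and x.end <= y.end))

def conflict(x, y):
    """The spans share at least one character (nested or partially overlapping)."""
    return x.start < y.end and y.start < x.end

word = ToySpan(12, 23)      # 'juveelidest'
sentence = ToySpan(0, 24)   # 'Ma ei hooli juveelidest.'
other = ToySpan(25, 30)     # 'Mulle'

print(nested(word, sentence), conflict(word, sentence), equal(word, sentence))
# True True False
print(conflict(sentence, other))
# False
```

As in the real functions, only locations are compared; nothing here looks at attribute values.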

### Informational content of annotation ( `Annotation` )

The informational content of an annotation -- e.g. lemma and part of speech information in morphological analysis -- resides in an `Annotation` object. 
`Span` and `EnvelopingSpan` objects have the attribute `annotations`, which gives access to their `Annotation` objects:

In [28]:
text['morph_analysis'][0].annotations

[Annotation('Ma', {'normalized_text': 'Ma', 'lemma': 'mina', 'root': 'mina', 'root_tokens': ['mina'], 'ending': '0', 'clitic': '', 'form': 'sg n', 'partofspeech': 'P'})]

An `Annotation` object is similar to a _dict_ object: it contains **attributes** and their **values**, which can be accessed via indexing:

In [29]:
# Get the first annotation
annotation = text['morph_analysis'][0].annotations[0]
# Get attribute 'lemma' from the annotation
annotation['lemma']

'mina'

However, if you need to access many attributes at once and/or you need to make queries over annotations, accessing via `annotations` can be cumbersome.
For this reason, EstNLTK also contains convenient shortcuts for accessing/querying annotations.

### Selecting multiple annotations: indexing operators

Similarly to accessing elements of a list, you can use **slice notation** to select a subset of a layer:

In [30]:
text['words'][5:8]

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,3

text,normalized_form
Mulle,
meeldivad,
lilled,


You can also **select specific spans** via indexing:

In [31]:
text['words'][[3,7]]

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,2

text,normalized_form
juveelidest,
lilled,


You can use a **`lambda` function** to select only spans that satisfy some specific criterion.

For instance, let's select words with length of 2:

In [32]:
text['words'][ lambda span: len(span.text) == 2 ]

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,2

text,normalized_form
Ma,
ei,


You can **combine selecting spans with selecting specific attributes** of annotations. In this case, the result of selection is no longer a layer, but an `AttributeList` (if one attribute is selected) or `AttributeTupleList` (in case of selecting multiple attributes). If the layer is ambiguous, the result is `AmbiguousAttributeList` / `AmbiguousAttributeTupleList`.

For instance, we can select only _part of speech_ values of morphological analysis:

In [33]:
# Select partofspeech of the first 4 words
text.morph_analysis[0:4, 'partofspeech']

Unnamed: 0,partofspeech
0,P
1,V
2,V
3,S


In [34]:
# Select lemma and partofspeech of the first 4 words
text.morph_analysis[0:4, ['lemma','partofspeech']]

Unnamed: 0,lemma,partofspeech
0,mina,P
1,ei,V
2,hoolima,V
3,juveel,S


In [35]:
# Select index attributes start, end, text along with lemma and partofspeech
text.morph_analysis[['start', 'end', 'text', 'lemma','partofspeech']]

Unnamed: 0,start,end,text,lemma,partofspeech
0,0,2,Ma,mina,P
1,3,5,ei,ei,V
2,6,11,hooli,hoolima,V
3,12,23,juveelidest,juveel,S
4,23,24,.,.,Z
5,25,30,Mulle,mina,P
6,31,40,meeldivad,meeldima,V
7,41,47,lilled,lill,S
8,47,48,.,.,Z


If you only need to access a single attribute, you can **combine indexing with the attribute access**:

In [36]:
# Select lemmas
text.morph_analysis[5:].lemma

Unnamed: 0,lemma
0,mina
1,meeldima
2,lill
3,.


Finally, if an annotation layer has a parent layer, you can also select its annotations via the parent layer.
An example -- selecting a `'morph_analysis'` attribute `'lemma'` via `'words'`:

In [37]:
# Select lemmas
text.words[5:].lemma

Unnamed: 0,lemma
0,mina
1,meeldima
2,lill
3,.


<p>
<div class="alert alert-block alert-warning"> 
<h4><i>Remark about ambiguity</i></h4> 
<br>
As names <code>AmbiguousAttributeTupleList</code> and <code>AmbiguousAttributeList</code> indicate, the selection was made from an ambiguous layer in previous examples.
When selecting an attribute from an ambiguous layer, please keep in mind that the result is not a single value, but a list of values.
For instance:
<pre>
>> text.words[5].lemma
['mina']
>> text.words[5].partofspeech
['P']
</pre>
</div>
</p>

### Iterating over multiple layers: an example

Frequently, one needs to select information from multiple layers in combination.
Let's consider an example of processing morphological analyses sentence by sentence.
Because the 'sentences' layer envelops the 'words' layer, and the 'words' layer is the parent of the 'morph_analysis' layer, we can iterate over sentences and access 'morph_analysis' attributes within each sentence:

In [38]:
for sentence in text.sentences:
    print('Sentence:', sentence.enclosing_text)
    for word in sentence:
        print( '  Lemma: ', word.morph_analysis.lemma[0],
               '\t\tPOS:', word.morph_analysis.partofspeech[0] )
    print()

Sentence: Ma ei hooli juveelidest.
  Lemma:  mina 		POS: P
  Lemma:  ei 		POS: V
  Lemma:  hoolima 		POS: V
  Lemma:  juveel 		POS: S
  Lemma:  . 		POS: Z

Sentence: Mulle meeldivad lilled.
  Lemma:  mina 		POS: P
  Lemma:  meeldima 		POS: V
  Lemma:  lill 		POS: S
  Lemma:  . 		POS: Z



### Grouping annotations ( `Layer.groupby` )

EstNLTK's `Layer` has a method `groupby`, which groups annotations **by attributes or by enveloping layers**.

For instance, we can group 'morph_analysis' by 'partofspeech' attributes:

In [39]:
groups = text.morph_analysis.groupby('partofspeech', return_type='spans')

Then we can use `count` to get frequencies of 'partofspeech':

In [40]:
groups.count

{('P',): 2, ('V',): 3, ('S',): 2, ('Z',): 2}
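
Conceptually, this count is a frequency table keyed by tuples of attribute values. The following plain-Python sketch reproduces the numbers above from hand-copied part-of-speech tags; it illustrates the idea only and makes no claim about how EstNLTK implements `groupby`.

```python
from collections import Counter

# Part-of-speech tags of the nine words, copied by hand from the outputs above.
pos_tags = ['P', 'V', 'V', 'S', 'Z', 'P', 'V', 'S', 'Z']

# Keys are 1-tuples, mirroring the shape of the keys in `groups.count`.
pos_counts = Counter((tag,) for tag in pos_tags)
print(dict(pos_counts))
```

Using tuples as keys, even for a single attribute, keeps the key shape uniform when grouping by several attributes at once.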

From the grouped annotations, we can select the annotations that have a specific attribute value (or combination of values).
For instance, let us fetch all verbs:

In [41]:
groups.groups[('V',)]

[Span('ei', [{'normalized_text': 'ei', 'lemma': 'ei', 'root': 'ei', 'root_tokens': ['ei'], 'ending': '0', 'clitic': '', 'form': 'neg', 'partofspeech': 'V'}]),
 Span('hooli', [{'normalized_text': 'hooli', 'lemma': 'hoolima', 'root': 'hooli', 'root_tokens': ['hooli'], 'ending': '0', 'clitic': '', 'form': 'o', 'partofspeech': 'V'}]),
 Span('meeldivad', [{'normalized_text': 'meeldivad', 'lemma': 'meeldima', 'root': 'meeldi', 'root_tokens': ['meeldi'], 'ending': 'vad', 'clitic': '', 'form': 'vad', 'partofspeech': 'V'}])]

We can also group by multiple attributes, e.g. group by 'partofspeech' and 'form' attributes:

In [42]:
groups = text.morph_analysis.groupby(['partofspeech', 'form'], return_type='spans')
groups.count

{('P', 'sg n'): 1,
 ('V', 'neg'): 1,
 ('V', 'o'): 1,
 ('S', 'pl el'): 1,
 ('Z', ''): 2,
 ('P', 'sg all'): 1,
 ('V', 'vad'): 1,
 ('S', 'pl n'): 1}

We can also group by _enveloping layers_.
For example, let's group morphological analyses by sentences and then output text and partofspeech of every word:

In [43]:
for sentence_id, sentence_spanlist in text.morph_analysis.groupby( text.sentences ):
    print([span.text for span in sentence_spanlist])
    for morph_span in sentence_spanlist:
        print('   ',morph_span.text,'\t',morph_span.partofspeech)

['Ma', 'ei', 'hooli', 'juveelidest', '.']
    Ma 	 ['P']
    ei 	 ['V']
    hooli 	 ['V']
    juveelidest 	 ['S']
    . 	 ['Z']
['Mulle', 'meeldivad', 'lilled', '.']
    Mulle 	 ['P']
    meeldivad 	 ['V']
    lilled 	 ['S']
    . 	 ['Z']


🔗 More detailed information about `groupby` can be found in the tutorial [layer_operations.ipynb](system/layer_operations.ipynb)

### Sliding window over annotations ( `Layer.rolling` )

EstNLTK's method `Layer.rolling` processes a layer's annotations through **a sliding window**. It can be used for making _n_-grams from the annotations.

For instance, we can make trigrams out of lemmas from the morphological analysis layer:

In [44]:
for spans in text.morph_analysis.rolling( window=3 ):
    print(spans.text, '=>', spans[0].lemma, spans[1].lemma, spans[2].lemma)

['Ma', 'ei', 'hooli'] => ['mina'] ['ei'] ['hoolima']
['ei', 'hooli', 'juveelidest'] => ['ei'] ['hoolima'] ['juveel']
['hooli', 'juveelidest', '.'] => ['hoolima'] ['juveel'] ['.']
['juveelidest', '.', 'Mulle'] => ['juveel'] ['.'] ['mina']
['.', 'Mulle', 'meeldivad'] => ['.'] ['mina'] ['meeldima']
['Mulle', 'meeldivad', 'lilled'] => ['mina'] ['meeldima'] ['lill']
['meeldivad', 'lilled', '.'] => ['meeldima'] ['lill'] ['.']


In addition to specifying the size of the window, you can also specify an (enveloping) layer to constrain the process, and the minimal length of the _n_-gram (which applies in border situations, e.g. at the beginning or the end of the text).
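
To make the two constraints concrete, here is a plain-Python sketch of a sliding window that stays inside sentence boundaries and accepts a minimal window length. It works on hand-copied lemma lists rather than on EstNLTK layers. The function `rolling_windows` and its parameter `min_len` are hypothetical names, and the border behaviour shown here (shorter windows at the start of a sequence) is an assumption, so the exact behaviour of `Layer.rolling` may differ.

```python
def rolling_windows(items, window, min_len=None):
    """Yield trailing windows of at most `window` items.

    Windows shorter than `min_len` are skipped; by default only
    full-size windows are produced.
    """
    if min_len is None:
        min_len = window
    for end in range(1, len(items) + 1):
        chunk = items[max(0, end - window):end]
        if len(chunk) >= min_len:
            yield chunk

# Lemmas grouped by sentence, copied by hand from the outputs above.
sentences = [['mina', 'ei', 'hoolima', 'juveel', '.'],
             ['mina', 'meeldima', 'lill', '.']]

# Trigrams that never cross a sentence boundary.
trigrams = [w for sent in sentences for w in rolling_windows(sent, window=3)]
for trigram in trigrams:
    print(trigram)

# Allowing shorter windows also yields unigrams and bigrams at the border.
short_windows = list(rolling_windows(sentences[1], window=3, min_len=1))
```

Compared with the unconstrained trigrams above, windows such as `juveel . mina`, which span two sentences, no longer appear.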

🔗 For more detailed information about `rolling`, see the tutorial: [layer_operations.ipynb](system/layer_operations.ipynb)

## Dividing `Text` object into smaller `Text` objects

### Making extracts from `Text` ( `extract_sections` )

The function `extract_sections` can be used to extract sections from a `Text` object. For example:

In [45]:
from estnltk_core.layer_operations import extract_sections

sections = extract_sections(text, sections=[(12, 24), (25,40)])
sections

[Text(text='juveelidest.'), Text(text='Mulle meeldivad')]

Resulting sections are also `Text` objects and all layers are preserved by default:

In [46]:
sections[0]

text
juveelidest.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,0
tokens,,,,False,2
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,2
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2


However, the default setting only preserves annotations that fit completely inside the extracted sections.
In the previous example, none of the sentence annotations were preserved because they did not fit into the section.

If `trim_overlapping=True` is set, `extract_sections` tries to preserve annotations that only partially overlap a section by trimming them to fit inside the section.
In the previous example, `trim_overlapping=True` would have preserved the first sentence annotation in `sections[0]`, but would have caused it to be trimmed into a 2-word sentence ( _juveelidest._ ).
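
The difference between the two settings can be sketched with plain character offsets. The following is a simplified model, not EstNLTK's implementation: the function `select_spans` is a hypothetical helper, spans are bare `(start, end)` tuples kept in the coordinates of the original text, and the real function operates on layers and produces new `Text` objects.

```python
def select_spans(spans, section, trim_overlapping=False):
    """Return the spans that survive extraction of `section`.

    Spans fully inside the section are always kept; partially
    overlapping spans are kept only when `trim_overlapping` is set,
    and are then clipped to the section boundaries.
    """
    sec_start, sec_end = section
    kept = []
    for start, end in spans:
        if sec_start <= start and end <= sec_end:
            kept.append((start, end))
        elif trim_overlapping and start < sec_end and sec_start < end:
            kept.append((max(start, sec_start), min(end, sec_end)))
    return kept

# Character offsets of the two sentences in the example text.
sentence_spans = [(0, 24), (25, 48)]
section = (12, 24)

default_result = select_spans(sentence_spans, section)
trimmed_result = select_spans(sentence_spans, section, trim_overlapping=True)
print(default_result)
print(trimmed_result)
```

With the default setting no sentence survives, while trimming keeps the first sentence clipped to the offsets of _juveelidest._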


🔗 A more detailed tutorial about `extract_sections` can be found in [layer_operations.ipynb](system/layer_operations.ipynb)

### Splitting `Text` ( `split_by` )

The function `split_by` can be used to **split `Text` object by a specific layer.**
For instance, we can split our text into smaller `Text` objects so that each `Text` object corresponds to a sentence:

In [47]:
from estnltk_core.layer_operations import split_by

sentence_texts = split_by(text, 'sentences')
sentence_texts

[Text(text='Ma ei hooli juveelidest.'), Text(text='Mulle meeldivad lilled.')]

In [48]:
sentence_texts[0]

text
Ma ei hooli juveelidest.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5


While `extract_sections` preserves all layers, `split_by` keeps only the layer that was used for splitting and its (direct or indirect) dependent layers.
In the previous example, the `'sentences'` layer was kept because it was used for splitting, the `'words'` layer was kept because `'sentences'` envelops `'words'`, and `'morph_analysis'` was kept because `'words'` is its parent. However, the layers `'tokens'` and `'compound_tokens'` were removed, as they do not depend, directly or indirectly, on the aforementioned 3 layers.
It is also possible to preserve all layers while splitting; see the documentation for details.

🔗 A more detailed tutorial about `split_by` can be found in [layer_operations.ipynb](system/layer_operations.ipynb)

🔗 Splitting can also be reversed (to an extent): you can join multiple `Text` objects back into a single `Text` object. Details at [layer_operations.ipynb](system/layer_operations.ipynb)

## Removing layer's dependencies ( `flatten` )

The function `flatten` turns a layer into **a simple layer** (a layer that is neither enveloping nor a child of another layer).
This is useful when you are only interested in specific layers and you want to reduce the size of the `Text` object -- you can flatten the layers of interest, and then delete other layers.

For instance, let's flatten the sentences layer:

In [49]:
from estnltk_core.layer_operations import flatten

# Make text and add sentences layer
text = Text('Tere, maailm! Kuidas läheb?').tag_layer(['sentences'])

# create flat sentences layer
flat_sentences = flatten(text['sentences'], 'flat_sentences')

# add new layer to the Text
text.add_layer( flat_sentences )

# examine new layer
text.flat_sentences

layer name,attributes,parent,enveloping,ambiguous,span count
flat_sentences,,,,True,2

text
"Tere, maailm!"
Kuidas läheb?


While the original sentences layer enveloped the words layer, the new layer no longer has that dependency.
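
What losing the dependency means can be illustrated with plain offsets. In the sketch below, which does not call EstNLTK, each enveloping sentence is modelled as a list of word spans, and flattening replaces that list with a single `(start, end)` pair that refers directly to the raw text. The variable names are hypothetical, and the word offsets were worked out by hand for this example sentence.

```python
raw_text = 'Tere, maailm! Kuidas läheb?'

# Enveloping view: each sentence is a list of word spans (start, end).
enveloping_sentences = [
    [(0, 4), (4, 5), (6, 12), (12, 13)],   # Tere , maailm !
    [(14, 20), (21, 26), (26, 27)],        # Kuidas läheb ?
]

# Flat view: one span per sentence, from the first word's start
# to the last word's end; the word spans are no longer needed.
flat_spans = [(words[0][0], words[-1][1]) for words in enveloping_sentences]
flat_texts = [raw_text[start:end] for start, end in flat_spans]

print(flat_spans)
print(flat_texts)
```

Once each sentence is stored as its own pair of offsets, the word spans can be discarded without affecting the sentence boundaries.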

If we now remove the words layer, the original sentences layer will also be deleted. 
But the flat sentences layer will remain:

In [50]:
text.pop_layer( 'words' )
text

text
"Tere, maailm! Kuidas läheb?"

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,7
compound_tokens,"type, normalized",,tokens,False,0
flat_sentences,,,,True,2


🔗 For more detailed information about `flatten`, see the tutorial [layer_operations.ipynb](system/layer_operations.ipynb)