# Layer operations
## Extractions

Create a text object.

In [1]:
from estnltk import Text

text = Text('Tere, maailm!').analyse('morphology')
text

text
"Tere, maailm!"

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,False,4
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


### `extract_sections`
Extract the first 9 and last 7 characters from the text.

In [2]:
from estnltk.layer_operations import extract_sections

texts = extract_sections(text, [(0, 9), (6,13)])
texts

[Text(text="Tere, maa"), Text(text="maailm!")]

This is equivalent to writing
```python
extract_sections(text=text,
                 sections=[(0, 9), (6,13)],
                 layers_to_keep=None,
                 trim_overlapping=False)
```
where<br>
**text** is a `Text` object;<br>
**sections** is a list of tuples. Each tuple is a pair `(start, end)`, where `start` is the index of the first character of the extraction and `end` is the index of the first character after the extraction in the text;<br>
**layers_to_keep** is a list of the names of the layers to keep. Dependencies must also be included: if a layer in the list has a parent or is enveloping, then the parent or the enveloped layer must also be in the list. If `None` (the default), all layers are kept;<br>
**trim_overlapping** controls what happens to spans that cross a section boundary. If `False` (the default), such spans are left out of the extracted text. If `True`, they are trimmed to fit the boundaries.

Returns a list of text objects that corresponds to the list of sections.
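
The `(start, end)` pairs follow the same half-open convention as Python slices, so the raw text of each section can be previewed with plain string slicing. The sketch below uses only built-in strings, not EstNLTK, and the helper name `section_texts` is hypothetical.

```python
def section_texts(raw_text, sections):
    """Return the substring covered by each half-open (start, end) section."""
    return [raw_text[start:end] for start, end in sections]

raw_text = 'Tere, maailm!'
previews = section_texts(raw_text, [(0, 9), (6, 13)])
print(previews)
```

Because `end` is exclusive, the section `(6, 13)` reaches the last character of the 13-character text without going out of range.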

Let's take a look at the first of the two extracted texts.

In [3]:
texts[0]

text
"Tere, maa"

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,False,2
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2


Here the span count 2 means that 'Tere' and ',' are tagged as words, but the letters 'maa' are not covered by any span, since they are part of the longer word 'maailm'.

In the next example the span of 'maailm' is trimmed to cover the letters 'maa'. That gives a strange result where the analysis of 'maailm' is attached to the partial word 'maa'. So, use the trimming option with caution.

In [4]:
extract_sections(text, [(0, 9)], ('words', 'morph_analysis'), True)[0]['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Tere,tere,tere,"(tere,)",0.0,,,I
",",",",",","(,,)",,,,Z
maa,maailm,maa_ilm,"(maa, ilm)",0.0,,sg n,S
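
The difference between dropping and trimming can be modelled on bare character offsets. The following is a simplified sketch, not EstNLTK's implementation: the function `select_spans` is hypothetical, spans are plain `(start, end)` tuples, and it assumes that the offsets of the kept spans are shifted to index into the extracted text.

```python
def select_spans(spans, start, end, trim_overlapping=False):
    """Return spans relative to the half-open section [start, end).

    Spans fully inside the section are kept. Spans that cross a boundary
    are dropped, or clipped to the section if trim_overlapping is True.
    """
    selected = []
    for span_start, span_end in spans:
        if span_end <= start or span_start >= end:
            continue  # entirely outside the section
        inside = start <= span_start and span_end <= end
        if not inside and not trim_overlapping:
            continue  # crosses a boundary and trimming is off
        clipped_start = max(span_start, start)
        clipped_end = min(span_end, end)
        # Shift offsets so that they index into the extracted text.
        selected.append((clipped_start - start, clipped_end - start))
    return selected

# Character offsets of 'Tere', ',', 'maailm', '!' in 'Tere, maailm!'
word_spans = [(0, 4), (4, 5), (6, 12), (12, 13)]

dropped = select_spans(word_spans, 0, 9)
trimmed = select_spans(word_spans, 0, 9, trim_overlapping=True)
print(dropped)
print(trimmed)
```

Without trimming, two spans survive; with trimming, the span of 'maailm' is clipped to `(6, 9)`, which gives the three spans seen in the output above.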


A more practical use case of
```python
trim_overlapping=True
```
would be trimming a span of a paragraph while leaving out the last sentence of a text.

`extract_sections` does not yet use binary search of spans and is therefore not efficient on long texts.
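
For illustration only, here is one way a binary search over spans could look. This is a sketch on plain `(start, end)` tuples, not EstNLTK code; the names `build_index` and `spans_inside` are hypothetical, and it assumes the spans are sorted and do not overlap each other.

```python
from bisect import bisect_left, bisect_right

def build_index(spans):
    """Precompute the start and end offsets once per layer.

    For sorted, non-overlapping spans both lists are in ascending order.
    """
    return [s for s, _ in spans], [e for _, e in spans]

def spans_inside(spans, index, start, end):
    """Return the spans that lie fully inside [start, end)."""
    starts, ends = index
    first = bisect_left(starts, start)  # first span starting at or after start
    last = bisect_right(ends, end)      # one past the last span ending at or before end
    return spans[first:last]            # empty when first >= last

word_spans = [(0, 4), (4, 5), (6, 12), (12, 13)]
index = build_index(word_spans)
head = spans_inside(word_spans, index, 0, 9)
tail = spans_inside(word_spans, index, 6, 13)
print(head, tail)
```

Each lookup takes logarithmic time in the number of spans, whereas a linear scan checks every span for every section.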

`extract_sections` does not yet support extracting layers of the following types:
* not ambiguous with parent;
* ambiguous enveloping;
* ambiguous (not enveloping, no parent).

### `extract_section`
To extract only one section from a text, the `extract_section` function can be used.

In [5]:
from estnltk.layer_operations import extract_section
extract_section(text, 0, 9)

text
"Tere, maa"

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,False,2
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2


This is equivalent to writing
```python
extract_section(text=text,
                start=0,
                end=9,
                layers_to_keep=None,
                trim_overlapping=False)
```
where the `layers_to_keep` and `trim_overlapping` parameters are the same as in the `extract_sections` function.

## Splitting
Now let's create a text with three sentences.

In [6]:
t = '''Esimene lause.

Teine lõik. Kolmas lause.'''

from estnltk import Text
text = Text(t)
text.analyse('all')
text

text
Esimene lause. Teine lõik. Kolmas lause.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2
sentences,,,words,False,3
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,9
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,9


### `split_by`
Using the `split_by` function, the text object can be split into pieces by spans of any layer. Here, for instance, we split the text by words.

In [7]:
from estnltk.layer_operations import split_by
texts = split_by(text, 'words')
texts

[Text(text="Esimene"),
 Text(text="lause"),
 Text(text="."),
 Text(text="Teine"),
 Text(text="lõik"),
 Text(text="."),
 Text(text="Kolmas"),
 Text(text="lause"),
 Text(text=".")]
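
Conceptually, every span of the chosen layer becomes one section of the original text. The sketch below imitates this on the raw string. It assumes a crude regular-expression tokenizer as a stand-in for the spans of the `words` layer, which is not how EstNLTK segments words.

```python
import re

raw_text = 'Esimene lause.\n\nTeine lõik. Kolmas lause.'

# Stand-in for the spans of the 'words' layer: runs of word characters
# or single punctuation marks, found with a regular expression.
word_spans = [m.span() for m in re.finditer(r'\w+|[^\w\s]', raw_text)]

# Each span becomes one section, and each section becomes one piece.
pieces = [raw_text[start:end] for start, end in word_spans]
print(pieces)
```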

This is equivalent to writing
```python
split_by(text=text,
         layer='words',
         layers_to_keep=None,
         trim_overlapping=False)
```
If **`layers_to_keep`** is `None`, then the list of layers that are kept is the minimal list with the following properties:
* `layer` is in the list;
* if L is in the list and L is enveloping M, then M is in the list;
* if L is in the list and parent of L is M, then M is in the list;
* if L is in the list and parent of M is L, then M is in the list.
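
The four rules above describe a closure over the layer dependencies. A minimal sketch of that computation follows, using plain dictionaries instead of EstNLTK layers. The function name `default_layers_to_keep` and the two dictionaries are illustrative assumptions; the dependencies themselves are copied from the layer table of the example text.

```python
def default_layers_to_keep(layer, parents, enveloped):
    """Compute the smallest set of layers closed under the listed rules.

    parents:   maps a layer name to the name of its parent layer
    enveloped: maps an enveloping layer name to the layer it envelops
    """
    keep = set()
    pending = [layer]
    while pending:
        current = pending.pop()
        if current in keep:
            continue
        keep.add(current)
        # If the layer is enveloping, the enveloped layer is needed.
        if current in enveloped:
            pending.append(enveloped[current])
        # If the layer has a parent, the parent is needed.
        if current in parents:
            pending.append(parents[current])
        # Layers whose parent is the current layer are kept as well.
        pending.extend(child for child, parent in parents.items()
                       if parent == current)
    return keep

parents = {'morph_analysis': 'words', 'morph_extended': 'morph_analysis'}
enveloped = {'paragraphs': 'sentences', 'sentences': 'words',
             'compound_tokens': 'tokens'}

by_words = default_layers_to_keep('words', parents, enveloped)
by_sentences = default_layers_to_keep('sentences', parents, enveloped)
print(sorted(by_words))
print(sorted(by_sentences))
```

Splitting by `words` keeps `words`, `morph_analysis` and `morph_extended`; splitting by `sentences` adds the `sentences` layer itself.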

If
```python
layers_to_keep=None
```
then `trim_overlapping` has no practical effect.

Print out the first word extracted.

In [8]:
texts[0]

text
Esimene

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,False,1
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,1


### `split_by_sentences`
Using the `split_by_sentences` function, we can turn the text object into a list of text objects, each containing one sentence of the original text.

In [9]:
from estnltk.layer_operations import split_by_sentences
texts = split_by_sentences(text)
texts

[Text(text="Esimene lause."),
 Text(text="Teine lõik."),
 Text(text="Kolmas lause.")]

This is equivalent to writing
```python
split_by_sentences(text=text,
                   layers_to_keep=None,
                   trim_overlapping=False)
```
where the `layers_to_keep` and `trim_overlapping` parameters are the same as in the `split_by` function.

Here is the second sentence.

In [10]:
texts[1]

text
Teine lõik.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,normalized_form,,,False,3
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,3


In the next example we keep all the layers of the text object, trim the overlapping spans, and print out the second extracted sentence.

In [11]:
texts = split_by_sentences(text, layers_to_keep=list(text.layers), trim_overlapping=True)
texts[1]

text
Teine lõik.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,1
tokens,,,,False,3
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,3
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,3


## Rebase

The parent of the `morph_extended` layer is `morph_analysis`. So, if one deletes the `morph_analysis` layer, the `morph_extended` layer is deleted as well. To avoid this, the `parent` attribute of `morph_extended` can be changed to `words` using the `rebase` function.

This can be done because the `_base` attribute of both layers is the same:

In [12]:
text['morph_analysis']._base, text['morph_extended']._base

('words', 'words')

(In the future, it might be a good idea to replace the `parent` attribute with the `_base` attribute.)

In [13]:
from estnltk.layer_operations import rebase
rebase(text, 'morph_extended', 'words')

text
Esimene lause. Teine lõik. Kolmas lause.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2
sentences,,,words,False,3
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,9
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",words,,True,9


In [14]:
del text.morph_analysis
text

text
Esimene lause. Teine lõik. Kolmas lause.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2
sentences,,,words,False,3
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,9
morph_extended,"lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",words,,True,9
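
The effect of rebasing on deletion can be modelled with a dictionary of parent links. This is a toy model under stated assumptions, not the library's internals: the functions `base_of`, `rebase_parent` and `delete_layer` are hypothetical, and the base of a layer is assumed to be the root of its parent chain.

```python
def base_of(layer, parents):
    """Follow the parent chain to its root (a stand-in for `_base`)."""
    while layer in parents:
        layer = parents[layer]
    return layer

def rebase_parent(layer, new_parent, parents):
    """Return new parent links with `layer` attached to `new_parent`.

    The change is allowed only if both layers share the same base.
    """
    if base_of(layer, parents) != base_of(new_parent, parents):
        raise ValueError('layers do not share a base layer')
    updated = dict(parents)
    updated[layer] = new_parent
    return updated

def delete_layer(layer, layers, parents):
    """Remove `layer` together with every layer that descends from it."""
    doomed = {layer}
    changed = True
    while changed:
        changed = False
        for child, parent in parents.items():
            if parent in doomed and child not in doomed:
                doomed.add(child)
                changed = True
    return [name for name in layers if name not in doomed]

layers = ['words', 'morph_analysis', 'morph_extended']
parents = {'morph_analysis': 'words', 'morph_extended': 'morph_analysis'}

before = delete_layer('morph_analysis', layers, parents)
rebased = rebase_parent('morph_extended', 'words', parents)
after = delete_layer('morph_analysis', layers, rebased)
print(before)
print(after)
```

Before rebasing, deleting `morph_analysis` also removes `morph_extended`; after rebasing, `morph_extended` survives because its parent is now `words`.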
