# Layer operations
## Extractions

Create a text object.

In [1]:
from estnltk import Text

text = Text('Tere, maailm!').analyse('morphology')
text

text
"Tere, maailm!"

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


### `extract_sections`
Extract the first 9 and last 7 characters from the text.

In [2]:
from estnltk.layer_operations import extract_sections

texts = extract_sections(text=text,
                         sections=[(0, 9), (6,13)],
                         layers_to_keep=None,  # default: None
                         trim_overlapping=False  # default: False
                         )
texts

[Text(text='Tere, maa'), Text(text='maailm!')]

where<br>
**text** is a Text object<br>
**sections** is a list of tuples. Each tuple is a pair `(start, end)`, where `start` is the index of the first character of the extraction and `end` is the index of the first character after the extraction in the text<br>
**layers_to_keep** is a list of the names of the layers to keep. Dependencies must also be included: if a layer in the list has a parent or is enveloping, then the parent or enveloped layer must also be in the list. If `None` (the default), all layers are kept.<br>
**trim_overlapping**: if `False` (the default), spans that overlap a section boundary are not kept in the extracted text. If `True`, such spans are trimmed to fit the boundaries.

Returns a list of text objects corresponding to the list of sections.
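
The section semantics can be illustrated without EstNLTK. The following is a minimal sketch in plain Python, not the library's implementation: `spans_in_section` is a hypothetical helper, spans are modelled as `(start, end)` tuples, and the word offsets for 'Tere, maailm!' are written out by hand. It assumes that spans in an extracted text are re-indexed relative to the section start.

```python
text = 'Tere, maailm!'
# Hand-written word spans: 'Tere', ',', 'maailm', '!'
word_spans = [(0, 4), (4, 5), (6, 12), (12, 13)]


def spans_in_section(spans, start, end, trim_overlapping=False):
    """Return the spans of one section, re-indexed relative to `start`."""
    kept = []
    for s, e in spans:
        if start <= s and e <= end:
            # The span lies entirely inside the section.
            kept.append((s - start, e - start))
        elif trim_overlapping and s < end and start < e:
            # The span crosses a section boundary: cut it to fit.
            kept.append((max(s, start) - start, min(e, end) - start))
    return kept


section_text = text[0:9]  # characters 0..8
untrimmed = spans_in_section(word_spans, 0, 9)
trimmed = spans_in_section(word_spans, 0, 9, trim_overlapping=True)
print(section_text, untrimmed, trimmed)
```

Without trimming, the span of 'maailm' is dropped because it crosses the right boundary; with trimming, it is shortened to cover only 'maa'.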

Let's take a look at the first of the two extracted texts.

In [3]:
texts[0]

text
"Tere, maa"

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,0
words,normalized_form,,,True,2
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2


Here the span count 2 means that 'Tere' and ',' are tagged as words, but the letters 'maa' are not covered by any span because they are part of the longer word 'maailm'.

In the next example the span of 'maailm' is trimmed to cover the letters 'maa'. That gives a strange result where the analysis of 'maailm' is attached to the partial word 'maa'. So, use the trimming option with caution.

In [4]:
extract_sections(text, [(0, 9)], ('words', 'morph_analysis'), True)[0]['morph_analysis']

layer name,attributes,parent,enveloping,ambiguous,span count
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3

text,normalized_text,lemma,root,root_tokens,ending,clitic,form,partofspeech
Tere,Tere,tere,tere,['tere'],0.0,,,I
",",",",",",",","[',']",,,,Z
maa,maailm,maailm,maa_ilm,"['maa', 'ilm']",0.0,,sg n,S


A more practical use case of `trim_overlapping=True` would be trimming the span of a paragraph when the last sentence of a text is left out.

`extract_sections` does not yet use binary search of spans and is therefore not efficient on long texts.
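
For illustration only, a binary search over sorted, non-overlapping spans could look like the sketch below. It uses Python's standard `bisect` module. The `SpanIndex` class is hypothetical and is not part of EstNLTK.

```python
from bisect import bisect_left, bisect_right


class SpanIndex:
    """Index over (start, end) spans that are sorted and do not overlap."""

    def __init__(self, spans):
        self.spans = list(spans)
        self.starts = [s for s, _ in self.spans]
        self.ends = [e for _, e in self.spans]

    def inside(self, start, end):
        """Return the spans that lie entirely between `start` and `end`."""
        lo = bisect_left(self.starts, start)  # first span starting at or after `start`
        hi = bisect_right(self.ends, end)     # one past the last span ending at or before `end`
        return self.spans[lo:hi]


index = SpanIndex([(0, 4), (4, 5), (6, 12), (12, 13)])
print(index.inside(0, 9))
print(index.inside(6, 13))
```

Once the index is built, each lookup needs only two logarithmic searches instead of a scan over all spans.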

`extract_sections` does not yet support extracting layers of the following types:
* non-ambiguous layers with a parent;
* ambiguous enveloping layers;
* ambiguous layers that are neither enveloping nor have a parent.

### `extract_section`
To extract only one section from a text, the `extract_section` function can be used.

In [5]:
from estnltk.layer_operations import extract_section
extract_section(text=text,
                start=0,
                end=9,
                layers_to_keep=None,  # default: None
                trim_overlapping=False  # default: False
                )

text
"Tere, maa"

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,0
words,normalized_form,,,True,2
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,2


Here the parameters `layers_to_keep` and `trim_overlapping` are the same as in the `extract_sections` function.

## Splitting
Now let's create a text with three sentences.

In [6]:
t = '''Esimene lause.

Teine lõik. Kolmas lause.'''

text = Text(t)
text.analyse('all')
text

text
Esimene lause.Teine lõik. Kolmas lause.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,2
sentences,,,words,False,3
tokens,,,,False,9
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,9
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,9
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,9


### `split_by`
Using the `split_by` function, the text object can be split into pieces by spans of any layer. Here, for instance, we split the text by words.

In [7]:
from estnltk.layer_operations import split_by

texts = split_by(text, 'words')
texts

[Text(text='Esimene'),
 Text(text='lause'),
 Text(text='.'),
 Text(text='Teine'),
 Text(text='lõik'),
 Text(text='.'),
 Text(text='Kolmas'),
 Text(text='lause'),
 Text(text='.')]

This is equivalent to writing
```python
split_by(text=text,
         layer='words',
         layers_to_keep=None,
         trim_overlapping=False)
```
If **`layers_to_keep`** is `None`, then the list of layers that are kept is the minimal list with the following properties:
* `layer` is in the list;
* if L is in the list and L is enveloping M, then M is in the list;
* if L is in the list and parent of L is M, then M is in the list;
* if L is in the list and parent of M is L, then M is in the list.
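
The four properties above describe a closure over the layer dependencies. A minimal sketch in plain Python, assuming the layer structure shown in the table of the three-sentence text, is given below. `default_layers_to_keep` and the `LAYERS` dictionary are hypothetical and only model the rules; they are not EstNLTK code.

```python
# name: (parent, enveloping), copied by hand from the layer table above
LAYERS = {
    'paragraphs': (None, 'sentences'),
    'sentences': (None, 'words'),
    'tokens': (None, None),
    'compound_tokens': (None, 'tokens'),
    'words': (None, None),
    'morph_analysis': ('words', None),
    'morph_extended': ('morph_analysis', None),
}


def default_layers_to_keep(layer, layers):
    """Return the minimal list of layers satisfying the four properties."""
    keep = {layer}
    changed = True
    while changed:
        changed = False
        for name, (parent, enveloping) in layers.items():
            if name in keep:
                # A kept layer pulls in its parent and its enveloped layer.
                for dependency in (parent, enveloping):
                    if dependency is not None and dependency not in keep:
                        keep.add(dependency)
                        changed = True
            elif parent in keep:
                # A layer whose parent is kept is kept as well.
                keep.add(name)
                changed = True
    return [name for name in layers if name in keep]


print(default_layers_to_keep('words', LAYERS))
print(default_layers_to_keep('sentences', LAYERS))
```

Note that `paragraphs` is not pulled in when splitting by `sentences`: a kept layer brings along the layer it envelops, but not the layer that envelops it.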

If `layers_to_keep=None`, then `trim_overlapping` has no practical effect.

Print out the first word extracted.

In [8]:
texts[0]

text
Esimene

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,1
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,1
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,1


### `split_by_sentences`
Using the `split_by_sentences` function, we can turn the text object into a list of text objects, each containing one sentence of the original text.

In [9]:
from estnltk.layer_operations import split_by_sentences

texts = split_by_sentences(text=text,
                           layers_to_keep=None,  # default: None
                           trim_overlapping=False  # default: False
                           )
texts

[Text(text='Esimene lause.'),
 Text(text='Teine lõik.'),
 Text(text='Kolmas lause.')]

Here the `layers_to_keep` and `trim_overlapping` parameters are the same as in the `split_by` function.
Here is the second sentence.

In [10]:
texts[1]

text
Teine lõik.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
words,normalized_form,,,True,3
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,3


In the next example we keep all the layers of the text object, trim the overlapping spans, and print out the second extracted sentence.

In [11]:
texts = split_by_sentences(text=text, 
                           layers_to_keep=list(text.layers),
                           trim_overlapping=True
                           )
texts[1]

text
Teine lõik.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,1
tokens,,,,False,3
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,3
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,3


### `split_by_clauses`

Splitting a text object into clauses requires special splitting logic, because some clauses may be embedded inside other clauses.
This logic is provided by the `split_by_clauses` function, which splits the text object into a list of text objects, each containing exactly one clause of the original text.

In [12]:
from estnltk.layer_operations import split_by_clauses

# Create a text with clause annotations
text = Text('Mees, keda seal kohtasime, oli tuttav ja teretas meid.').tag_layer('clauses')

In [13]:
# Split text into clauses
texts = split_by_clauses(text=text,
                         layers_to_keep=list(text.layers),
                         trim_overlapping=True )

In [14]:
# Display results
for clause_text in texts:
    display(clause_text)

text
Mees oli tuttav ja

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,4
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4
clauses,clause_type,,words,False,1


text
", keda seal kohtasime,"

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,5
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,5
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,5
clauses,clause_type,,words,False,1


text
teretas meid.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,3
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,3
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,3
clauses,clause_type,,words,False,1


The parameters `layers_to_keep` and `trim_overlapping` are the same as in the `split_by` function.
In addition, the parameter `input_clauses_layer` can be used to specify the name of the clauses layer, e.g.

```python
# split by the layer 'my_clauses'
texts = split_by_clauses(text=text, input_clauses_layer='my_clauses')
```
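
To see why embedded clauses need special handling, consider the sketch below in plain Python. It is not how `split_by_clauses` is implemented: the regular-expression tokenisation, the hand-assigned `clause_ids`, and the `clause_texts` helper are all assumptions made for illustration. Joining the non-contiguous pieces of a clause with a single space is also an assumption; it happens to reproduce the clause texts displayed above.

```python
import re

text = 'Mees, keda seal kohtasime, oli tuttav ja teretas meid.'
# Naive tokenisation: runs of word characters or single punctuation marks.
words = [(m.start(), m.end()) for m in re.finditer(r'\w+|[^\w\s]', text)]
# Hypothetical clause id for each word, chosen to match the example above.
clause_ids = [0, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2]


def clause_texts(text, words, clause_ids):
    """Build one string per clause from possibly non-contiguous word runs."""
    pieces = {}
    previous = None
    for (start, end), cid in zip(words, clause_ids):
        runs = pieces.setdefault(cid, [])
        if cid == previous:
            runs[-1][1] = end          # extend the current contiguous run
        else:
            runs.append([start, end])  # start a new run for this clause
        previous = cid
    return [' '.join(text[s:e] for s, e in runs) for runs in pieces.values()]


print(clause_texts(text, words, clause_ids))
```

The first clause consists of two separate runs ('Mees' and 'oli tuttav ja') because the relative clause is embedded in the middle of it.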

## Rebase

In order to exemplify rebasing, let's consider the following example text:

In [15]:
text = Text('''Päike paistab. Lõokene lõõritab.''')
text.analyse('all')

text
Päike paistab. Lõokene lõõritab.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,6
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,6
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",morph_analysis,,True,6


The parent of the `morph_extended` layer is `morph_analysis`, so deleting the `morph_analysis` layer also deletes the `morph_extended` layer. To avoid this, the `parent` attribute of `morph_extended` can be changed to `words` using the `rebase` function.

This is possible because the `_base` attribute of both layers is the same:

In [16]:
from estnltk.layer_operations import rebase

rebase(text, 'morph_extended', 'words')

text
Päike paistab. Lõokene lõõritab.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,6
morph_analysis,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,6
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",words,,True,6


In [17]:
text.pop_layer('morph_analysis') # remove morph_analysis from text
text

text
Päike paistab. Lõokene lõõritab.

layer name,attributes,parent,enveloping,ambiguous,span count
paragraphs,,,sentences,False,1
sentences,,,words,False,2
tokens,,,,False,6
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,True,6
morph_extended,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, punctuation_type, pronoun_type, letter_case, fin, verb_extension_suffix, subcat",words,,True,6
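
The effect of rebasing on layer deletion can be modelled with a plain dictionary of parent links. This is only a conceptual sketch: the `dependents` helper and the `parents` dictionary are hypothetical and do not reflect EstNLTK internals.

```python
def dependents(layer, parents):
    """Return all layers that directly or transitively have `layer` as parent."""
    found = []
    for name, parent in parents.items():
        if parent == layer:
            found.append(name)
            found.extend(dependents(name, parents))
    return found


parents = {'words': None,
           'morph_analysis': 'words',
           'morph_extended': 'morph_analysis'}

# Before rebasing, morph_extended depends on morph_analysis.
before = dependents('morph_analysis', parents)

# Conceptually, rebase re-points the parent link to 'words'.
parents['morph_extended'] = 'words'
after = dependents('morph_analysis', parents)

print(before, after)
```

After the parent link is re-pointed, nothing depends on `morph_analysis` any more, so it can be removed on its own.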


## Flatten

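The flatten operation turns any layer into a simple layer that has no parent and does not envelop another layer (parent==None, enveloping==None); note that the resulting layer is reported as ambiguous in the output below. In the following we flatten the sentences layer.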

In [18]:
from estnltk.layer_operations import flatten

text = Text('Päike paistab. Lõokene lõõritab. Vana karu lööb trummi.').tag_layer(['sentences'])
text.pop_layer('tokens')  # remove tokens from text
text.add_layer(flatten(text['sentences'], 'flat_sentences'))
text

text
Päike paistab. Lõokene lõõritab. Vana karu lööb trummi.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,3
words,normalized_form,,,True,11
flat_sentences,,,,True,3


In [19]:
text.flat_sentences

layer name,attributes,parent,enveloping,ambiguous,span count
flat_sentences,,,,True,3

text
Päike paistab.
Lõokene lõõritab.
Vana karu lööb trummi.
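
Conceptually, flattening an enveloping layer replaces each list of enveloped spans with a single span from the start of the first to the end of the last. The sketch below shows this in plain Python; the tokenisation, the hand-written `sentences` index lists, and the `flatten_enveloping` helper are assumptions for illustration and not the library's code.

```python
import re

text = 'Päike paistab. Lõokene lõõritab. Vana karu lööb trummi.'
# Naive tokenisation: runs of word characters or single punctuation marks.
words = [(m.start(), m.end()) for m in re.finditer(r'\w+|[^\w\s]', text)]
# Each sentence envelops a list of word indices (written by hand here).
sentences = [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9, 10]]


def flatten_enveloping(enveloping_spans, base_spans):
    """Replace lists of base spans with plain (start, end) spans."""
    return [(base_spans[ids[0]][0], base_spans[ids[-1]][1])
            for ids in enveloping_spans]


flat = flatten_enveloping(sentences, words)
print([text[s:e] for s, e in flat])
```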


## Filter annotations
Example layer

In [20]:
from estnltk.tests import new_text

new_text(3).layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,4

text,attr,attr_1
Tere,L1-0,A
,L1-0,B
",",L1-1,C
maailm,L1-2,D
,L1-2,E
!,L1-3,F


### `apply_filter`

The `function` parameter is a callable that takes three arguments: the layer, a span index and an annotation index. An annotation is kept if the callable returns `True` for it.
`preserve_spans=True` forces at least one annotation to be kept for every span.
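
The filtering logic can be sketched in plain Python on a layer modelled as a list of spans, each holding a list of annotation dictionaries. `filter_sketch` is a hypothetical helper, not the EstNLTK implementation: it returns a new list instead of modifying the layer in place, and two details are assumptions, namely that a span left without annotations is removed and that `preserve_spans=True` retains the first annotation of such a span.

```python
# layer_1 of new_text(3), written out by hand: one list of annotations per span
layer = [
    [{'attr_1': 'A'}, {'attr_1': 'B'}],  # 'Tere'
    [{'attr_1': 'C'}],                   # ','
    [{'attr_1': 'D'}, {'attr_1': 'E'}],  # 'maailm'
    [{'attr_1': 'F'}],                   # '!'
]


def filter_sketch(layer, function, preserve_spans=False):
    """Keep annotation j of span i only if function(layer, i, j) is true."""
    result = []
    for i, annotations in enumerate(layer):
        kept = [a for j, a in enumerate(annotations) if function(layer, i, j)]
        if not kept and preserve_spans:
            kept = annotations[:1]  # assumption: the first annotation survives
        if kept:
            result.append(kept)
    return result


def in_bcef(layer, i, j):
    return layer[i][j]['attr_1'] in {'B', 'C', 'E', 'F'}


def reject_all(layer, i, j):
    return False


filtered = filter_sketch(layer, in_bcef)
preserved = filter_sketch(layer, reject_all, preserve_spans=True)
print(filtered)
print(preserved)
```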

In [21]:
from estnltk.layer_operations import apply_filter

def filter_function(layer, i, j):
    return layer[i].annotations[j].attr_1 in {'B', 'C', 'E', 'F'} 
    
text = new_text(3)

apply_filter(layer=text.layer_1,
             function=filter_function,
             preserve_spans=False,  # default: False
             drop_immediately=False  # default: False
            )

text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,4

text,attr,attr_1
Tere,L1-0,B
",",L1-1,C
maailm,L1-2,E
!,L1-3,F


### `keep_annotations`

Use the `keep_annotations` function to keep only those annotations in the layer that have certain attribute values. Here the annotations with `attr_1` value equal to `B`, `C`, `E` or `F` are kept. All other annotations are dropped.

In [22]:
from estnltk.layer_operations import keep_annotations

text = new_text(3)

keep_annotations(layer=text.layer_1,
                 attribute='attr_1',
                 values={'B', 'C', 'E', 'F'},
                 preserve_spans=False  # default: False
                 )
text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,4

text,attr,attr_1
Tere,L1-0,B
",",L1-1,C
maailm,L1-2,E
!,L1-3,F


In the next example, `preserve_spans=True` forces one annotation to be kept for every span even though the set of values is empty.

In [23]:
text = new_text(3)

keep_annotations(layer=text.layer_1,
                 attribute='attr_1',
                 values=set(),  # empty set: no annotation matches
                 preserve_spans=True
                 )
text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,4

text,attr,attr_1
Tere,L1-0,A
",",L1-1,C
maailm,L1-2,D
!,L1-3,F


### `drop_annotations`

Use the `drop_annotations` function to drop only those annotations from the layer that have certain attribute values. The function parameters are similar to the parameters of the `keep_annotations` function.

Here the annotations with `attr_1` value equal to `A` or `D` are dropped. All other annotations are kept.

In [24]:
from estnltk.layer_operations import drop_annotations

text = new_text(3)

drop_annotations(layer=text.layer_1,
                 attribute='attr_1',
                 values={'A', 'D'},
                 preserve_spans=False  # default: False
                 )
text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,4

text,attr,attr_1
Tere,L1-0,B
",",L1-1,C
maailm,L1-2,E
!,L1-3,F
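
Keeping and dropping are complementary: on this layer, dropping `A` and `D` gives the same result as keeping `B`, `C`, `E` and `F`. The plain-Python sketch below demonstrates this on the `attr_1` values written out by hand; `keep` and `drop` are hypothetical stand-ins, not the EstNLTK functions.

```python
# attr_1 values of layer_1 in new_text(3): one list per span
layer = [['A', 'B'], ['C'], ['D', 'E'], ['F']]


def keep(layer, values):
    """Keep only the annotation values contained in `values`."""
    return [[v for v in span if v in values] for span in layer]


def drop(layer, values):
    """Drop the annotation values contained in `values`."""
    return [[v for v in span if v not in values] for span in layer]


kept = keep(layer, {'B', 'C', 'E', 'F'})
dropped = drop(layer, {'A', 'D'})
print(kept == dropped)
```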


## `Layer.groupby`
Creates a `GroupBy` object that groups spans or annotations by attribute values and can be iterated over. The attribute values must be hashable.

Spans can also be grouped by an enveloping layer.
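
The idea of grouping spans by attribute values can be sketched in plain Python with a `defaultdict`. The `group_spans` helper and the dictionary-based span model are hypothetical. The sketch assumes that a span with several annotations is placed in every group one of its annotations belongs to, but only once per group.

```python
from collections import defaultdict


def group_spans(spans, by):
    """Group spans by the tuple of attribute values of their annotations."""
    groups = defaultdict(list)
    for span in spans:
        seen = set()
        for annotation in span['annotations']:
            key = tuple(annotation[name] for name in by)
            if key not in seen:  # add each span to a group only once
                seen.add(key)
                groups[key].append(span)
    return dict(groups)


spans = [
    {'text': 'Sada', 'annotations': [{'attr_1': 'SADA'}]},
    {'text': 'kaks', 'annotations': [{'attr_1': 'KAKS'}]},
    {'text': 'kakskümmend', 'annotations': [{'attr_1': 'KAKS'},
                                            {'attr_1': 'KÜMME'},
                                            {'attr_1': 'KAKSKÜMMEND'}]},
    {'text': 'kümme', 'annotations': [{'attr_1': 'KÜMME'}]},
]

groups = group_spans(spans, by=['attr_1'])
counts = {key: len(members) for key, members in groups.items()}
print(counts)
```

The keys are tuples because several attribute names can be given in `by`, which is also why the attribute values must be hashable.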

In [25]:
text = new_text(5)
text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,19

text,attr,attr_1
Sada,L1-0,SADA
kaks,L1-1,KAKS
kakskümmend,L1-2,KAKS
,L1-2,KÜMME
,L1-2,KAKSKÜMMEND
kümme,L1-3,KÜMME
kolm,L1-4,KOLM
Neli,L1-5,NELI
tuhat,L1-6,TUHAT
viis,L1-7,VIIS


### Group spans by attribute value

In [26]:
groups = text.layer_1.groupby(['attr_1'], return_type='spans')
groups

GroupBy(layer:'layer_1', by=['attr_1'], return_type='spans')

Here<br/>
**layer** is a `Layer` object<br/>
**by** is a sequence of attribute names<br/>
**return_type** is either `'spans'` or `'annotations'`

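The `Layer.groupby` method is a shortcut for creating such a `GroupBy` object. Iterating over it yields pairs of a group key and the spans in that group; here only the first group is displayed: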

In [27]:
for key, spans in groups:
    print(key)
    for span in spans:
        display(span)
    break

('KAHEKSA',)


text,attr,attr_1
kaheksa,L1-15,KAHEKSA


`GroupBy.count` returns the number of spans or annotations in every group as a `dict`.

In [28]:
groups.count

{('SADA',): 3,
 ('KAKS',): 2,
 ('KAKSKÜMMEND',): 1,
 ('KÜMME',): 6,
 ('KOLM',): 1,
 ('NELI',): 1,
 ('TUHAT',): 1,
 ('VIIS',): 2,
 ('VIISSADA',): 1,
 ('KUUS',): 2,
 ('KUUSKÜMMEND',): 1,
 ('SEITSE',): 1,
 ('KOMA',): 1,
 ('KAHEKSA',): 1,
 ('ÜHEKSA',): 2,
 ('ÜHEKSAKÜMMEND',): 1}

`GroupBy.groups` returns a `dict` of all groups.

In [29]:
groups.groups[('KAHEKSA',)]

[Span('kaheksa', [{'attr': 'L1-15', 'attr_1': 'KAHEKSA'}])]

In [30]:
groups.groups[('KAHEKSA',)][0]

text,attr,attr_1
kaheksa,L1-15,KAHEKSA


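`GroupBy.aggregate` applies `func` to the spans of every group, collects the results into a `dict` keyed by group, and passes that `dict` to `combiner`.

In [31]: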
import pandas as pd

def func(spans):
    return [[a.attr for span in spans for a in span.annotations]]

def combiner(d):
    return pd.DataFrame.from_dict(d, orient='index', columns=['attr']).sort_index()

groups.aggregate(func=func,
                 combiner=combiner  # default: lambda d: d
                 )

Unnamed: 0,attr
"(KAHEKSA,)",[L1-15]
"(KAKS,)","[L1-1, L1-2, L1-2, L1-2]"
"(KAKSKÜMMEND,)","[L1-2, L1-2, L1-2]"
"(KOLM,)",[L1-4]
"(KOMA,)",[L1-14]
"(KUUS,)","[L1-10, L1-11, L1-11, L1-11]"
"(KUUSKÜMMEND,)","[L1-11, L1-11, L1-11]"
"(KÜMME,)","[L1-2, L1-2, L1-2, L1-3, L1-11, L1-11, L1-11, L1-12, L1-17, L1-17, L1-17, L1-18]"
"(NELI,)",[L1-5]
"(SADA,)","[L1-0, L1-8, L1-8, L1-8, L1-9]"


### Group by an enveloping layer

In [32]:
text = new_text(5)
for spanlist in text.layer_0.groupby(text.layer_4):
    print(spanlist.text)
    for span in spanlist:
        print(span.attr, '\t',span.attr_0)

{}
['Sada', 'kakskümmend', 'kolm']
L0-0 	 100
L0-2 	 20
L0-4 	 3
[' Neli', 'tuhat', 'viissada', 'kuuskümmend', 'seitse']
L0-5 	 4
L0-6 	 1000
L0-8 	 500
L0-11 	 60
L0-13 	 7
['koma']
L0-14 	 ,
['kaheksa']
L0-15 	 8
['Üheksakümmend', 'tuhat']
L0-17 	 90
L0-19 	 1000


## `Layer.rolling`

Yields span lists from a window rolling over a layer.
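
The windowing logic can be sketched in plain Python on ordinary lists. The `rolling` and `rolling_inside` helpers are hypothetical. They are derived from the outputs shown in this section and are not taken from the EstNLTK source: `min_periods` is treated as the smallest allowed window size at the edges, and `inside` is modelled as a list of groups that a window may not cross.

```python
def rolling(items, window, min_periods=None):
    """Yield slices of `items` seen through a window moving one step at a time."""
    if min_periods is None:
        min_periods = window
    n = len(items)
    for end in range(min_periods, n + window - min_periods + 1):
        chunk = items[max(0, end - window):min(n, end)]
        if len(chunk) >= min_periods:
            yield chunk


def rolling_inside(groups, window, min_periods=None):
    """Roll the window separately inside each group."""
    for group in groups:
        yield from rolling(group, window, min_periods)


full = list(rolling(['Sada', 'kaks', 'kakskümmend', 'kümme'], window=3))
grouped = list(rolling_inside([['Sada', 'kakskümmend', 'kolm'], ['koma']],
                              window=3, min_periods=2))
print(full)
print(grouped)
```

For a group of exactly two elements this sketch yields the same pair twice, as the last two lines of the `inside='layer_4'` output in this section also do.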

In [33]:
text = new_text(5)
text.layer_1

{}


layer name,attributes,parent,enveloping,ambiguous,span count
layer_1,"attr, attr_1",,,True,19

text,attr,attr_1
Sada,L1-0,SADA
kaks,L1-1,KAKS
kakskümmend,L1-2,KAKS
,L1-2,KÜMME
,L1-2,KAKSKÜMMEND
kümme,L1-3,KÜMME
kolm,L1-4,KOLM
Neli,L1-5,NELI
tuhat,L1-6,TUHAT
viis,L1-7,VIIS


In [34]:
for spans in new_text(5).layer_0.rolling(window=3,
                                         min_periods=None,  # default None means that min_periods=window
                                         inside=None):
    print(spans.text)

{}
['Sada', 'kaks', 'kakskümmend']
['kaks', 'kakskümmend', 'kümme']
['kakskümmend', 'kümme', 'kolm']
['kümme', 'kolm', ' Neli']
['kolm', ' Neli', 'tuhat']
[' Neli', 'tuhat', 'viis']
['tuhat', 'viis', 'viissada']
['viis', 'viissada', 'sada']
['viissada', 'sada', 'kuus']
['sada', 'kuus', 'kuuskümmend']
['kuus', 'kuuskümmend', 'kümme']
['kuuskümmend', 'kümme', 'seitse']
['kümme', 'seitse', 'koma']
['seitse', 'koma', 'kaheksa']
['koma', 'kaheksa', 'Üheksa']
['kaheksa', 'Üheksa', 'Üheksakümmend']
['Üheksa', 'Üheksakümmend', 'kümme']
['Üheksakümmend', 'kümme', 'tuhat']


In [35]:
text.layer_4

layer name,attributes,parent,enveloping,ambiguous,span count
layer_4,"attr, attr_4",,layer_0,False,5

text,attr,attr_4
"['Sada', 'kakskümmend', 'kolm']",L4-0,123
"[' Neli', 'tuhat', 'viissada', 'kuuskümmend', 'seitse']",L4-1,4567
['koma'],L4-3,","
['kaheksa'],L4-2,8
"['Üheksakümmend', 'tuhat']",L4-4,90 000


In [36]:
for spans in new_text(5).layer_0.rolling(window=3,
                                         min_periods=2,
                                         inside='layer_4'):
    print(spans.text)

{}
['Sada', 'kakskümmend']
['Sada', 'kakskümmend', 'kolm']
['kakskümmend', 'kolm']
[' Neli', 'tuhat']
[' Neli', 'tuhat', 'viissada']
['tuhat', 'viissada', 'kuuskümmend']
['viissada', 'kuuskümmend', 'seitse']
['kuuskümmend', 'seitse']
['Üheksakümmend', 'tuhat']
['Üheksakümmend', 'tuhat']
