# Basic concepts in EstNLTK

## `Text` object

The `Text` class is the central concept: it encapsulates raw text together with various annotations.
These annotations fall into two categories: metadata about the entire text and annotations for specific text fragments.

### Interface

* **Magic methods**

  ```
  __init__(self, text)
  __setattr__(self, name, value)
  __getattr__(self, item)
  __setitem__(self, key, value)
  __getitem__(self, item)
  __delattr__(self, item)
  __eq__(self, other)
  __str__(self)
  __repr__(self)
  _repr_html_(self)
  
  ```
  
* **Public methods**

  ```
  tag_layer(self, layer_names, resolver)
  analyse(self, analysis_type, resolver)
  list_layers(self)
  text(self)
  diff(self, other)
  ```
  
* **Public attributes**

  ```
  meta
  layers
  attributes
  layers_to_attributes ??
  base_to_dependant ??  
  enveloping_to_enveloped ??
  pairs ??
  ```
 

### Description

The `Text` class is the central concept: it encapsulates raw text together with various annotations.
These annotations fall into two categories:

* metadata about the entire text
* annotations for specific text fragments 

Metadata is stored as a simple dictionary and can be accessed via the `meta` attribute:
```python
text.meta = {'origin': 'Tartu', 'author': 'Jansen', 'date': 1890}
text.meta['origin']                 # read a value
text.meta['date'] = 1790            # update an existing key
text.meta['language'] = 'estonian'  # add a new key
```
Annotations for specific text fragments, such as words and sentences, are stored in layers.
The interface for layers mimics the indexing interfaces of pandas and SQLAlchemy.
There are two equivalent ways to access a layer: through the index operator and through attribute access.
```python
text['tokens']
text.tokens
```
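A minimal sketch of how both access styles can resolve to the same layer object. `MiniText`, `add_layer` and the plain-dict storage are hypothetical stand-ins for illustration, not EstNLTK's actual implementation; the check on the name `text` mirrors the reserved-name restriction listed below.

```python
class MiniText:
    """Toy stand-in for Text; NOT the EstNLTK implementation."""

    def __init__(self, text):
        self.text = text      # raw string, hence the reserved name
        self._layers = {}     # layer name -> layer object

    def add_layer(self, name, layer):
        # Hypothetical helper: 'text' is reserved for the raw string.
        if name == 'text':
            raise ValueError("layer name 'text' is reserved")
        self._layers[name] = layer

    def __getitem__(self, name):
        # Index access: doc['tokens']
        return self._layers[name]

    def __getattr__(self, name):
        # Attribute access: doc.tokens
        # Python calls this only when normal attribute lookup fails.
        try:
            return self.__dict__['_layers'][name]
        except KeyError:
            raise AttributeError(name) from None


doc = MiniText('Tere, maailm!')
doc.add_layer('tokens', ['Tere', ',', 'maailm', '!'])
same_object = doc['tokens'] is doc.tokens
```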
**Restrictions to layer names:**

* Layer `text` is reserved
* Layers `tokens`, `compound_tokens`, `paragraphs`, `sentences`, `words`, `morph_analysis`, `premorph_analysis`, `syntax_analysis` are used by common NLP toolchains. It is possible to use these layer names if you know how to avoid access and format conflicts.

**Precise typing**

* `__getitem__(self, item : str) -> Layer`
* `__getattr__(self, item : str) -> Layer`

## `Layer` object

### Interface


* **Magic methods**

  ```
  ??
  
  ```
  
* **Public methods**

  ```
  ??
  ```
  
* **Public attributes**

  ```
  ???
  ```



### Description

A `Layer` object provides a mechanism for storing annotations for individual text fragments, which may or may not be continuous blocks of characters.
Each text fragment is represented by a `Span`.
Spans can be specified in terms of raw character positions or in terms of spans of another layer (the base layer).
Each span has a fixed set of attributes attached to it that characterise the text fragment.
The list of attributes may be empty if the location alone serves as the annotation.
For example, morphological analysis is stored by fixing the spans that correspond to words and tagging each span with the appropriate morphological attributes.
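
The structure described above can be sketched with two small dataclasses. `ToySpan`, `ToyLayer` and `add_annotation` are hypothetical names chosen for this illustration and do not reflect EstNLTK's real classes; the sketch covers only continuous, character-based spans and rejects the reserved attribute names `start` and `end`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

RESERVED = {'start', 'end'}   # reserved attribute names


@dataclass
class ToySpan:
    start: int
    end: int
    # One dict per reading; more than one dict means the span is ambiguous.
    annotations: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class ToyLayer:
    name: str
    attributes: Tuple[str, ...]
    spans: List[ToySpan] = field(default_factory=list)

    def __post_init__(self):
        clash = RESERVED.intersection(self.attributes)
        if clash:
            raise ValueError(f'reserved attribute names: {sorted(clash)}')

    def add_annotation(self, start, end, **values):
        # Every annotation must set exactly the layer's fixed attribute set.
        if set(values) != set(self.attributes):
            raise ValueError('annotation must set exactly the layer attributes')
        for span in self.spans:
            if (span.start, span.end) == (start, end):
                span.annotations.append(values)
                return span
        span = ToySpan(start, end, [values])
        self.spans.append(span)
        return span


raw = 'Mis on Sinu nimi?'
morph = ToyLayer('morph_analysis', ('lemma', 'form'))
morph.add_annotation(0, 3, lemma='mis', form='pl n')
morph.add_annotation(0, 3, lemma='mis', form='sg n')   # second reading, same span
morph.add_annotation(4, 6, lemma='olema', form='b')
covered = [raw[s.start:s.end] for s in morph.spans]
```

A layer with an empty `attributes` tuple is also valid here: its spans then carry only their location.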


#### Restrictions to attribute names

* Attributes `start` and `end` are reserved
* Attributes `_??_` are system attributes and should not be present in a layer. 
  It is possible to create these attributes as intermediate results, but they must not appear in final layers.

#### Indexing operators

There are three major ways to select parts of a layer:

* Select some spans but keep all attributes

  ```python
  text.layer[0]
  text.layer[0:40]
  text.layer[[0, 2, 5]]
  text.layer[(True, False, True)]
  text.layer[lambda x: True]
  ```
  with precise typing
  
  * `__getitem__(self, item: int) -> Span`
  * `__getitem__(self, item: slice) -> SubSpanList`
  * `__getitem__(self, item: List[int]) -> SubSpanList`
  * `__getitem__(self, item: List[bool]) -> SubSpanList`
  * `__getitem__(self, item: Callable) -> SubSpanList`


* Select some attributes but keep all spans

  ```python
  text.layer['attribute']
  text.layer[['attribute1', 'attribute2']]
  ```
  
  with precise typing
  
  * `__getitem__(self, item: str) -> Union[List[Any], List[List[Any]]]`
  * `__getitem__(self, item: List[str]) -> Union[List[Tuple[Any, ...]], List[List[Tuple[Any, ...]]]]`

  

* Select some spans and some attributes

  ```python
  text.layer[0, 'attribute']
  text.layer[0:40, 'attribute']
  text.layer[0, ['attribute1', 'attribute2']]
  text.layer[0:40, ['attribute1', 'attribute2']]
  ```
  
  with precise typing
  
  * `__getitem__(self, item: Tuple[int, str]) -> Union[Any, List[Any]]`
  * `__getitem__(self, item: Tuple[slice, str]) -> Union[List[Any], List[List[Any]]]`
  * `__getitem__(self, item: Tuple[int, List[str]]) -> Union[Tuple[Any, ...], List[Tuple[Any, ...]]]`
  * `__getitem__(self, item: Tuple[slice, List[str]]) -> Union[List[Tuple[Any, ...]], List[List[Tuple[Any, ...]]]]`
  * `__getitem__(self, item: Tuple[List[int], List[str]]) -> Union[List[Tuple[Any, ...]], List[List[Tuple[Any, ...]]]]`
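
The three selection modes can be served by one `__getitem__` that dispatches on the type of its argument. The sketch below is an assumption-laden illustration: `IndexableLayer` and its helpers are hypothetical, each span is modelled as a plain list of annotation dicts, and every span is treated as potentially ambiguous, so attribute values always come back as a list per span.

```python
def _is_attr_selector(x):
    # An attribute selector is a name or a non-empty list of names.
    if isinstance(x, str):
        return True
    return isinstance(x, list) and bool(x) and all(isinstance(n, str) for n in x)


class IndexableLayer:
    """Hypothetical layer: each span is a list of annotation dicts."""

    def __init__(self, spans):
        self.spans = list(spans)

    def _select_spans(self, item):
        if isinstance(item, slice):
            return self.spans[item]
        if callable(item):
            return [s for s in self.spans if item(s)]
        if isinstance(item, (list, tuple)):
            if all(isinstance(i, bool) for i in item):      # boolean mask
                return [s for s, keep in zip(self.spans, item) if keep]
            return [self.spans[i] for i in item]            # list of indices
        raise TypeError(f'unsupported span selector: {item!r}')

    @staticmethod
    def _project(span, attrs):
        if isinstance(attrs, str):
            return [a[attrs] for a in span]
        return [tuple(a[name] for name in attrs) for a in span]

    def __getitem__(self, item):
        # Mode 3: (span selector, attribute selector)
        if isinstance(item, tuple) and len(item) == 2 and _is_attr_selector(item[1]):
            span_sel, attrs = item
            if isinstance(span_sel, int):
                return self._project(self.spans[span_sel], attrs)
            return [self._project(s, attrs) for s in self._select_spans(span_sel)]
        # Mode 2: attribute selector only, keep all spans
        if _is_attr_selector(item):
            return [self._project(s, item) for s in self.spans]
        # Mode 1: span selector only, keep all attributes
        if isinstance(item, int):
            return self.spans[item]
        return IndexableLayer(self._select_spans(item))


layer = IndexableLayer([
    [{'lemma': 'mis', 'form': 'pl n'}, {'lemma': 'mis', 'form': 'sg n'}],
    [{'lemma': 'olema', 'form': 'b'}, {'lemma': 'olema', 'form': 'vad'}],
    [{'lemma': 'sina', 'form': 'sg g'}],
])
first_lemmas = layer[0, 'lemma']
first_pairs = layer[0, ['lemma', 'form']]
ambiguous = layer[lambda span: len(span) > 1]
all_forms = layer['form']
```

Checking the two-element `(span selector, attribute selector)` case first keeps it from being confused with a boolean mask passed as a tuple.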



#### Big issue in return values

How should ambiguous spans be handled, and what is the corresponding output type?
I propose a multiset of normal answers. What does it do now?
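
One possible reading of the multiset proposal, sketched with `collections.Counter`: the result for an ambiguous span counts each distinct answer, so repeated values stay visible without imposing an order on the readings. The function `ambiguous_values` and the dict-based annotations are assumptions made for this illustration; the sketch says nothing about what the library currently does.

```python
from collections import Counter


def ambiguous_values(annotations, attribute):
    """Return the multiset of values `attribute` takes over all readings.

    Values must be hashable, so list-valued attributes would need
    converting (for example to tuples) first.
    """
    return Counter(annotation[attribute] for annotation in annotations)


readings = [
    {'lemma': 'mis', 'form': 'pl n'},
    {'lemma': 'mis', 'form': 'sg n'},
]
lemmas = ambiguous_values(readings, 'lemma')
forms = ambiguous_values(readings, 'form')
```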

#### External attributes

A layer may have attributes that are inherited from its parent layer or other ancestors.
These are specified by a callable resolver that looks them up on demand.
This way, calls such as

```python
text.morph_layer['words']
text.morph_layer.words
```

make sense, although `morph_layer` does not contain `words` or `word` attributes.

Specifying the resolver correctly is your responsibility.

**Mechanism for defining external attributes**

As inherited attributes are not part of the Layer, they must be computed on the fly.
There are two ways an inherited attribute can be accessed:

* through direct attribute access, `text.morph_layer.words`
* through an indexing operator call

  ```python
  text.morph_layer['words']
  text.morph_layer[1:40,'words']
  text.morph_layer[1:40, ['words', attr]]
  ```

To achieve this, there must be a function that fetches the corresponding attribute for each Span.
As an optimisation, we might also want to fetch the attributes for a whole SpanList at once.

```python
local_resolver(span: Span) -> Any
global_resolver(slice) -> List[Any]
```

where, by default, the global resolver is defined as

```python
[local_resolver(span) for span in self[slice]]
```

Hence, we need a mapping

```python
inherited_attributes: Dict[str, Tuple[Callable, Callable]]
```

and boilerplate code to handle all possible indexing calls.
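
A hedged sketch of what this mapping and boilerplate might look like. `ResolvingLayer` and `add_inherited` are hypothetical names, spans are modelled as plain dicts, and only two call shapes are handled (`layer[name]` and `layer[slice, name]`); a full version would cover every indexing form listed earlier.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Span = Dict[str, Any]                        # toy span: dict of own attributes
LocalResolver = Callable[[Span], Any]
GlobalResolver = Callable[[slice], List[Any]]


class ResolvingLayer:
    def __init__(self, spans: List[Span]):
        self.spans = spans
        self.inherited_attributes: Dict[str, Tuple[LocalResolver, GlobalResolver]] = {}

    def add_inherited(self, name: str, local: LocalResolver,
                      global_: Optional[GlobalResolver] = None) -> None:
        def default_global(sl: slice) -> List[Any]:
            # Default global resolver: apply the local one span by span.
            return [local(span) for span in self.spans[sl]]

        self.inherited_attributes[name] = (local, global_ or default_global)

    def __getitem__(self, item):
        if isinstance(item, tuple):          # layer[1:40, 'words']
            sl, name = item
        else:                                # layer['words']
            sl, name = slice(None), item
        if name in self.inherited_attributes:
            return self.inherited_attributes[name][1](sl)
        return [span[name] for span in self.spans[sl]]

    def __getattr__(self, name):             # layer.words
        inherited = self.__dict__.get('inherited_attributes', {})
        if name in inherited:
            return inherited[name][1](slice(None))
        raise AttributeError(name)


words = ['Mis', 'on', 'Sinu', 'nimi', '?']
lemmas = ['mis', 'olema', 'sina', 'nimi', '?']
morph = ResolvingLayer([{'word_index': i, 'lemma': lemma}
                        for i, lemma in enumerate(lemmas)])
# 'words' is not stored in the layer; it is looked up in the parent on demand.
morph.add_inherited('words', lambda span: words[span['word_index']])
via_index = morph[1:3, 'words']
via_attribute = morph.words
```

Supplying a custom global resolver lets a layer fetch a whole slice from its parent in one step instead of calling the local resolver once per span.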

```python
from estnltk import Text

text = Text('Mis on Sinu nimi?').analyse('morphology')
layer = text.morph_analysis
layer
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 5 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Mis | Mis | mis | mis | ['mis'] | 0 | | pl n | P |
| | Mis | mis | mis | ['mis'] | 0 | | sg n | P |
| on | on | olema | ole | ['ole'] | 0 | | b | V |
| | on | olema | ole | ['ole'] | 0 | | vad | V |
| Sinu | Sinu | sina | sina | ['sina'] | 0 | | sg g | P |
| nimi | nimi | nimi | nimi | ['nimi'] | 0 | | sg n | S |
| ? | ? | ? | ? | ['?'] | | | | Z |


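```python
layer[1]
```

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| on | on | olema | ole | ['ole'] | 0 | | b | V |
| | on | olema | ole | ['ole'] | 0 | | vad | V |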


```python
layer[2:7:2]
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 2 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Sinu | Sinu | sina | sina | ['sina'] | 0 | | sg g | P |
| ? | ? | ? | ? | ['?'] | | | | Z |


```python
layer[[True, False, True, False, True]]
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 3 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Mis | Mis | mis | mis | ['mis'] | 0 | | pl n | P |
| | Mis | mis | mis | ['mis'] | 0 | | sg n | P |
| Sinu | Sinu | sina | sina | ['sina'] | 0 | | sg g | P |
| ? | ? | ? | ? | ['?'] | | | | Z |


```python
layer[lambda span: len(span.annotations) > 1]
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 2 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Mis | Mis | mis | mis | ['mis'] | 0 | | pl n | P |
| | Mis | mis | mis | ['mis'] | 0 | | sg n | P |
| on | on | olema | ole | ['ole'] | 0 | | b | V |
| | on | olema | ole | ['ole'] | 0 | | vad | V |


```python
layer[[1, 3, 4]]
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 3 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| on | on | olema | ole | ['ole'] | 0 | | b | V |
| | on | olema | ole | ['ole'] | 0 | | vad | V |
| nimi | nimi | nimi | nimi | ['nimi'] | 0 | | sg n | S |
| ? | ? | ? | ? | ['?'] | | | | Z |


```python
layer['text', 'lemma']
```

| | text | lemma |
|---|---|---|
| 0 | Mis | mis |
| | Mis | mis |
| 1 | on | olema |
| | on | olema |
| 2 | Sinu | sina |
| 3 | nimi | nimi |
| 4 | ? | ? |


```python
layer[0, 'lemma']
```

```
['mis', 'mis']
```

```python
layer[0, ['lemma', 'form']]
```

```
[['mis', 'pl n'], ['mis', 'sg n']]
```

```python
layer[0:3, ['lemma', 'form']]
```

| | lemma | form |
|---|---|---|
| 0 | mis | pl n |
| | mis | sg n |
| 1 | olema | b |
| | olema | vad |
| 2 | sina | sg g |

```python
layer[:]
```

| layer name | attributes | parent | enveloping | ambiguous | span count |
|---|---|---|---|---|---|
| morph_analysis | normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech | words | | True | 5 |

| text | normalized_text | lemma | root | root_tokens | ending | clitic | form | partofspeech |
|---|---|---|---|---|---|---|---|---|
| Mis | Mis | mis | mis | ['mis'] | 0 | | pl n | P |
| | Mis | mis | mis | ['mis'] | 0 | | sg n | P |
| on | on | olema | ole | ['ole'] | 0 | | b | V |
| | on | olema | ole | ['ole'] | 0 | | vad | V |
| Sinu | Sinu | sina | sina | ['sina'] | 0 | | sg g | P |
| nimi | nimi | nimi | nimi | ['nimi'] | 0 | | sg n | S |
| ? | ? | ? | ? | ['?'] | | | | Z |


## SpanList
A `SpanList` is a plain list of spans, without the extra restrictions that a `Layer` enforces.

**Indexing operators**

```
text.layer.spans[0]
text.layer.spans[0:40]
text.layer.spans[(True, False, True)]
text.layer.spans[lambda x: True]
```

with precise typing

* `__getitem__(self, item: int) -> Span`
* `__getitem__(self, item: slice) -> SubSpanList`
* `__getitem__(self, item: List[bool]) -> SubSpanList`
* `__getitem__(self, item: Callable) -> SubSpanList`

Restrictions to `SubSpanList`: no addition or deletion of elements, no addition or deletion of attributes.
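
The "no addition or deletion of elements" restriction can be approximated by deriving from the immutable `collections.abc.Sequence` and storing the selection in a tuple. `ToySubSpanList` is a hypothetical illustration, with strings standing in for spans; it does not model the restriction on attributes.

```python
from collections.abc import Sequence


class ToySubSpanList(Sequence):
    """Read-only selection of spans (hypothetical, for illustration)."""

    def __init__(self, spans):
        self._spans = tuple(spans)   # tuple: nothing to append to or delete from

    def __len__(self):
        return len(self._spans)

    def __getitem__(self, item):
        if isinstance(item, slice):
            return ToySubSpanList(self._spans[item])
        if callable(item):
            return ToySubSpanList(s for s in self._spans if item(s))
        if isinstance(item, (list, tuple)):   # boolean mask
            return ToySubSpanList(s for s, keep in zip(self._spans, item) if keep)
        return self._spans[item]              # single integer index


spans = ToySubSpanList(['Mis', 'on', 'Sinu', 'nimi', '?'])
short = spans[lambda s: len(s) <= 2]
masked = spans[[True, False, True, False, False]]
tail = spans[3:]
```

Because `Sequence` defines neither `append` nor `__delitem__`, an attempt such as `del spans[0]` raises `TypeError` instead of silently changing the selection.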

## Spans 

What is there?

**Indexing operators**

```
text.sentences[0][0]
text.sentences[0][1:5]
text.sentences[0][(True, False, True)]
text.sentences[0][lambda x: True]
```

A span is defined as a sequence of characters or a sequence of spans of a parent layer.
Thus, we can index the subobjects of a span.
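
A small sketch of this dual definition, assuming a hypothetical `ToySpan` class that is either a character range or a sequence of parent-layer spans. Indexing returns child spans when they exist and characters otherwise; the names and the constructor signature are illustrative only.

```python
class ToySpan:
    """Toy span: a character range, or a sequence of parent-layer spans."""

    def __init__(self, raw, start=None, end=None, children=None):
        self.raw = raw
        self.children = children              # None for a character-level span
        if children:
            # An enveloping span covers its first child through its last child.
            start, end = children[0].start, children[-1].end
        self.start, self.end = start, end

    @property
    def text(self):
        return self.raw[self.start:self.end]

    def __getitem__(self, item):
        # Index into parent-layer spans if present, otherwise into characters.
        parts = self.children if self.children is not None else self.text
        return parts[item]


raw = 'Mis on Sinu nimi?'
bounds = [(0, 3), (4, 6), (7, 11), (12, 16), (16, 17)]
words = [ToySpan(raw, start, end) for start, end in bounds]
sentence = ToySpan(raw, children=words)

first_word = sentence[0].text
middle = [word.text for word in sentence[1:3]]
first_char = words[0][0]
```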



## Taggers and ReTaggers

## Rewriters