# Storing `Text` objects in a PostgreSQL database

This tutorial demonstrates how to store and query EstNLTK `Text` objects in a PostgreSQL database.

In [1]:
from estnltk import Text, logger
from estnltk.taggers import VabamorfTagger, WordTagger, CompoundTokenTagger
from estnltk.storage.postgres import PostgresStorage, create_schema, delete_schema
from estnltk.storage.postgres import LayerQuery, SubstringQuery, IndexQuery, MetadataQuery

## Access to the database

In [2]:
storage = PostgresStorage(host=None,
                          port=None,
                          dbname='test_db',
                          user=None,
                          password=None,
                          pgpass_file='~/.pgpass',
                          schema='my_schema',
                          role=None,
                          temporary=False)

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'


If any of the parameters `host`, `port`, `dbname`, `user` or `password` is `None`, the missing values are looked up in the `pgpass_file`. The first line of the file that matches the given arguments is used to connect to an existing PostgreSQL database.

File line format:

    host:port:dbname:user:password
 
Example file contents:

    # host:port:dbname:user:password
    localhost:5432:test_db:username:password
    example.com:5432:*:exampleuser:kj3dno34
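
The matching rule can be pictured with a small sketch. This is not EstNLTK's actual parser: the helper name `find_pgpass_entry` is hypothetical, and the sketch assumes that a `None` argument is unconstrained and that `*` in a file field matches any value, as the example file suggests. Escaped colons in passwords are not handled.

```python
def find_pgpass_entry(lines, host=None, port=None, dbname=None, user=None):
    """Return the first pgpass entry compatible with the given arguments.

    Hypothetical helper for illustration only (not part of EstNLTK).
    """
    wanted = (host, port, dbname, user)
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split(':')
        if len(fields) != 5:
            continue  # not in host:port:dbname:user:password format
        # An argument left as None is unconstrained; '*' matches anything.
        if all(arg is None or field == '*' or field == str(arg)
               for arg, field in zip(wanted, fields[:4])):
            return dict(zip(('host', 'port', 'dbname', 'user', 'password'), fields))
    return None


pgpass_lines = [
    '# host:port:dbname:user:password',
    'localhost:5432:test_db:username:password',
    'example.com:5432:*:exampleuser:kj3dno34',
]
entry = find_pgpass_entry(pgpass_lines, dbname='test_db')
print(entry['host'], entry['port'], entry['user'])
```

With only `dbname='test_db'` given, the first non-comment line already matches, so the connection details come from the `localhost` entry.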

## Create schema

The schema is probably already set up in the database. If it is not and you have sufficient privileges, you can create it:

In [3]:
create_schema(storage)

## Create collections

Now new collections can be created and displayed. A collection stores `Text` objects in the database and provides a read/write API.

In [4]:
storage['my_first_collection'].create('first demo collection')
storage['my_second_collection'].create('second demo collection')
storage

INFO:collection.py:94: new empty collection 'my_first_collection' created
INFO:collection.py:94: new empty collection 'my_second_collection' created


collection,version,relations,rows,total_size,comment
my_first_collection,3.0,,0,32 kB,first demo collection
my_first_collection,3.0,structure,0,16 kB,
my_second_collection,3.0,,0,32 kB,second demo collection
my_second_collection,3.0,structure,0,16 kB,


The collection names are also available as a list of strings:

In [5]:
storage.collections

['my_first_collection', 'my_second_collection']

Retrieve and display a collection:

In [6]:
collection = storage['my_first_collection']
collection

## Delete collections

In [7]:
del storage['my_first_collection']
# or
storage['my_second_collection'].delete()

Now the storage is empty again:

In [8]:
storage

## Add texts

Let's create a new collection

In [9]:
collection = storage["my_collection"].create(description='demo collection', meta={'author': 'str'})

INFO:collection.py:94: new empty collection 'my_collection' created


and add some data

In [10]:
with collection.insert() as collection_insert:
    text1 = Text('Ööbik laulab.').tag_layer('morph_analysis')
    text1.meta['author'] = 'Kõivupuu'
    collection_insert(text1, meta_data=text1.meta)

    text2 = Text('Öökull ei laula.').tag_layer('morph_analysis')
    text2.meta['author'] = 'Niinepuu'
    key2 = collection_insert(text2, meta_data=text2.meta)
    
    text3 = Text('Karu magab.').tag_layer('morph_analysis')
    text3.meta['author'] = 'Niinemets'
    key3 = collection_insert(text3, meta_data=text3.meta)
    
    text4 = Text('Vana-Karu lõi trummi.').tag_layer('morph_analysis')
    text4.meta['author'] = 'Musumets'
    key4 = collection_insert(text4, meta_data=text4.meta)

INFO:collection_text_object_inserter.py:107: inserted 4 texts into the collection 'my_collection'


All inserted `Text` objects must have the same layers.

You can see what's inside

In [11]:
collection

metadata column,data type
author,text

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
sentences,attached,(),False,False,,words,[]
morph_analysis,attached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,False,words,,[]
words,attached,"(normalized_form,)",True,False,,,[]


The layers inserted with the `Text` objects are stored in the same database table with the `Text` object and are called **attached** layers.

### Create layers

The `create_layer` method creates a new layer for every `Text` object in the collection. These layers are stored in separate database tables and are called **detached** layers.

In [12]:
layer_1 = 'detached_morph_1'
layer_2 = 'detached_morph_2'

tagger = VabamorfTagger(disambiguate=False, output_layer=layer_1)
collection.create_layer(tagger=tagger)

tagger = VabamorfTagger(disambiguate=False, output_layer=layer_2)
collection.create_layer(tagger=tagger)

collection

INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:715: preparing to create a new layer: 'detached_morph_1'
INFO:collection.py:747: inserting data into the 'detached_morph_1' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 4 detached 'detached_morph_1' layers into the collection 'my_collection'
INFO:collection.py:782: layer created: 'detached_morph_1'
INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:715: preparing to create a new layer: 'detached_morph_2'
INFO:collection.py:747: inserting data into the 'detached_morph_2' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 4 detached 'detached_morph_2' layers into the collection 'my_collection'
INFO:collection.py:782: layer created: 'detached_morph_2'


metadata column,data type
author,text

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
sentences,attached,(),False,False,,words,[]
morph_analysis,attached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,False,words,,[]
words,attached,"(normalized_form,)",True,False,,,[]
detached_morph_1,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]
detached_morph_2,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]


Note: after you have added a detached layer to the collection, you can no longer add new `Text` objects to it. 

<p>
<div class="alert alert-block alert-warning">
<h4><i>Size limits on text and layer insertion</i></h4> 
<p>Be aware that database columns have size limits. If you insert large <code>Text</code> objects and/or many layers (especially richly annotated morphological or syntactic layers), you may exceed those limits. This is indicated by the following error message:
<pre>
psycopg2.errors.ProgramLimitExceeded: total size of jsonb array elements exceeds the maximum of 268435455 bytes
</pre>
Unfortunately, this limit cannot be changed in the database configuration. 
To work around it, you can split a large <code>Text</code> into smaller <code>Text</code> objects and insert them separately. For more details about splitting, see the functions <code>extract_sections</code> and <code>split_by</code>: <a href="https://github.com/estnltk/estnltk/blob/version_1.6/tutorials/system/layer_operations.ipynb">https://github.com/estnltk/estnltk/blob/version_1.6/tutorials/system/layer_operations.ipynb</a>. As a rule of thumb, consider splitting if the size of the (raw) text exceeds 1 MB. 

</p>
</div>
</p>
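
The 1 MB rule of thumb can be checked before insertion with a few lines of plain Python. The sketch below is only an illustration under stated assumptions: the helper names `needs_splitting` and `split_on_paragraphs` are hypothetical, the size is measured on the UTF-8 encoded raw text, and the split happens at blank lines. It works on raw strings only, so for texts that already carry layers the EstNLTK functions `extract_sections` and `split_by` mentioned above remain the appropriate tools.

```python
MAX_RAW_BYTES = 1_000_000  # the 1 MB rule of thumb from the note above


def needs_splitting(raw_text, limit=MAX_RAW_BYTES):
    """True if the UTF-8 encoded text is larger than the limit."""
    return len(raw_text.encode('utf-8')) > limit


def split_on_paragraphs(raw_text, limit=MAX_RAW_BYTES):
    """Greedily pack paragraphs into chunks of at most `limit` bytes.

    A single paragraph larger than the limit is kept as its own chunk.
    Hypothetical helper for illustration only (not part of EstNLTK).
    """
    chunks, current, current_size = [], [], 0
    for paragraph in raw_text.split('\n\n'):
        size = len(paragraph.encode('utf-8')) + 2  # +2 for the separator
        if current and current_size + size > limit:
            chunks.append('\n\n'.join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        chunks.append('\n\n'.join(current))
    return chunks


# 100 paragraphs of 16 000 bytes each: about 1.6 MB of raw text
big_text = '\n\n'.join(['Ööbik laulab. ' * 1000] * 100)
chunks = split_on_paragraphs(big_text)
print(needs_splitting(big_text), len(chunks))
```

Each chunk can then be turned into its own `Text` object and inserted separately.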

### Delayed and parallel layer creation

Sometimes you want to add a new detached layer to the collection without filling it with data right away. 
In that case, use the collection's `add_layer` method to add a layer template to the collection. 
The template can be obtained from the tagger's `get_layer_template` method:

In [13]:
from estnltk.taggers import ParagraphTokenizer

paragraph_tokenizer = ParagraphTokenizer()
collection.add_layer( layer_template=paragraph_tokenizer.get_layer_template() )

collection

INFO:collection.py:867: detached layer 'paragraphs' created from template


metadata column,data type
author,text

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
sentences,attached,(),False,False,,words,[]
morph_analysis,attached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,False,words,,[]
words,attached,"(normalized_form,)",True,False,,,[]
detached_morph_1,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]
detached_morph_2,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]
paragraphs,detached,(),False,False,,sentences,[]


To fill in the newly created layer, you can use `collection.create_layer()` with `mode='append'`:

In [14]:
collection.create_layer(tagger=paragraph_tokenizer, mode='append')

INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:704: appending existing layer: 'paragraphs'
INFO:collection.py:747: inserting data into the 'paragraphs' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 4 detached 'paragraphs' layers into the collection 'my_collection'
INFO:collection.py:782: layer created: 'paragraphs'


Alternatively, you can launch several layer creators in parallel, so that each of them creates layers for a non-overlapping block of texts.

In [15]:
# Remove layer
collection.delete_layer( paragraph_tokenizer.output_layer )

INFO:collection.py:1097: layer deleted: 'paragraphs'


In [16]:
# Add the template layer once more
collection.add_layer( layer_template=paragraph_tokenizer.get_layer_template() )

INFO:collection.py:867: detached layer 'paragraphs' created from template


Now you can use `collection.create_layer_block()` to apply the tagger only to a block of the collection's texts, where the block is defined by the method's input parameter `(module, remainder)`. As a result, only texts with `text_id % module == remainder` will be tagged. 
If you are using parallel processing, it is recommended to create a new database connection for each block-creating process, as in the examples below:

In [17]:
# In case of parallel processing: Open a new connection to the database & collection
storage_a = PostgresStorage( dbname='test_db', pgpass_file='~/.pgpass', schema='my_schema' )
collection_a = storage_a["my_collection"]
# Tag the first block
collection_a.create_layer_block( paragraph_tokenizer, (2, 0) )
# Close the connection
storage_a.close()

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'
INFO:collection.py:911: inserting data into the 'paragraphs' layer table block (2, 0)
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'paragraphs' layers into the collection 'my_collection'
INFO:collection.py:985: block (2, 0) of 'paragraphs' layer created


In [18]:
# In case of parallel processing: Open a new connection to the database & collection
storage_b = PostgresStorage( dbname='test_db', pgpass_file='~/.pgpass', schema='my_schema' )
collection_b = storage_b["my_collection"]
# Tag the second block
collection_b.create_layer_block( paragraph_tokenizer, (2, 1) )
# Close the connection
storage_b.close()

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'
INFO:collection.py:911: inserting data into the 'paragraphs' layer table block (2, 1)
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'paragraphs' layers into the collection 'my_collection'
INFO:collection.py:985: block (2, 1) of 'paragraphs' layer created


Note: if you use `collection.create_layer_block()` with `mode='append'`, then the method will continue creating an existing block, tagging only untagged texts inside the block.
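
The block rule `text_id % module == remainder` is easy to check in plain Python. The sketch below does not touch the database; `block_of` is a hypothetical helper, and the text ids `0..3` are those of the four texts inserted above.

```python
def block_of(text_ids, module, remainder):
    """Text ids selected by the block (module, remainder)."""
    return [i for i in text_ids if i % module == remainder]


text_ids = range(4)  # ids of the four texts in 'my_collection'
blocks = [block_of(text_ids, 2, remainder) for remainder in range(2)]
print(blocks)  # [[0, 2], [1, 3]]
```

Every text id falls into exactly one block, which is why the two `create_layer_block` calls above each inserted 2 layers and together covered the whole collection.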

### Sparse layers

Detached layers can be _sparse_, which means that empty layers are not stored in the layer table. 
This saves storage space, and queries over sparse layers can also be faster.

You can use the parameter `sparse=True` to create a sparse layer:

In [19]:
# Create tagger for sparse 'compound_tokens' layer
compound_token_tagger = CompoundTokenTagger(output_layer='compound_tokens_sparse')

# Tag sparse layer
collection.create_layer(tagger=compound_token_tagger, sparse=True)

# Check results
collection

INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:715: preparing to create a new layer: 'compound_tokens_sparse'
INFO:collection.py:747: inserting data into the 'compound_tokens_sparse' layer table
INFO:collection_detached_layer_inserter.py:82: inserted 1 detached 'compound_tokens_sparse' layers into the collection 'my_collection', skipped 3 empty layers
INFO:collection.py:782: layer created: 'compound_tokens_sparse'


metadata column,data type
author,text

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
sentences,attached,(),False,False,,words,[]
morph_analysis,attached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,False,words,,[]
words,attached,"(normalized_form,)",True,False,,,[]
detached_morph_1,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]
detached_morph_2,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]
paragraphs,detached,(),False,False,,sentences,[]
compound_tokens_sparse,detached,"(type, normalized)",False,True,,tokens,[]


The parameter `sparse=True` can also be passed to the `collection.add_layer()` method.

## Iterate collection

Number of `Text` objects in the collection.

In [20]:
len(collection)

4

Don't list the collection elements if the collection is large.

In [21]:
list(collection)

[Text(text='Ööbik laulab.'),
 Text(text='Öökull ei laula.'),
 Text(text='Karu magab.'),
 Text(text='Vana-Karu lõi trummi.')]

Collection yields `Text` objects with selected layers. The selected layers are by default the attached layers.

In [22]:
collection.selected_layers

['tokens', 'compound_tokens', 'sentences', 'morph_analysis', 'words']

The dependencies are included automatically.

In [23]:
collection.selected_layers = [layer_1]
collection.selected_layers

['words', 'detached_morph_1']
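
Why `words` shows up in the list can be illustrated with a small dependency walk over the collection structure. This is a sketch, not EstNLTK's own resolver: the function name `with_dependencies` is hypothetical, and it assumes that a layer's dependencies are exactly the `parent` and `enveloping` layers listed in the collection table above.

```python
# (parent, enveloping) of each layer, copied from the collection table above
structure = {
    'tokens': (None, None),
    'compound_tokens': (None, 'tokens'),
    'sentences': (None, 'words'),
    'morph_analysis': ('words', None),
    'words': (None, None),
    'detached_morph_1': ('words', None),
    'detached_morph_2': ('words', None),
}


def with_dependencies(layers, structure):
    """Return the layers, each preceded by the layers it depends on."""
    resolved = []

    def visit(name):
        if name in resolved:
            return
        for dependency in structure[name]:
            if dependency is not None:
                visit(dependency)
        resolved.append(name)

    for name in layers:
        visit(name)
    return resolved


print(with_dependencies(['detached_morph_1'], structure))
# ['words', 'detached_morph_1']
```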

The indexes start from `0`.

In [24]:
collection[0]

text
Ööbik laulab.

0,1
author,Kõivupuu

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,True,3
detached_morph_1,"normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore",words,,True,3


Now, you can iterate over the whole collection using the `select()` method:

In [25]:
for text_id, text_obj in collection.select():
    print(text_id, text_obj)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')
2 Text(text='Karu magab.')
3 Text(text='Vana-Karu lõi trummi.')


If the collection has metadata columns (i.e. `meta` argument was specified while creating the collection), then `collection_meta` argument can be used to select the metadata along with the index and `Text` object:

In [26]:
for text_id, text_obj, text_meta in collection.select( collection_meta=['author'] ):
    print(text_id, text_obj, text_meta)

0 Text(text='Ööbik laulab.') {'author': 'Kõivupuu'}
1 Text(text='Öökull ei laula.') {'author': 'Niinepuu'}
2 Text(text='Karu magab.') {'author': 'Niinemets'}
3 Text(text='Vana-Karu lõi trummi.') {'author': 'Musumets'}


## Search collection

EstNLTK provides different types of queries for finding `Text` objects in the collection.

`IndexQuery` can be used to search for a particular entry by index:

In [27]:
list(collection.select( query=IndexQuery( [1] ) ))

[(1, Text(text='Öökull ei laula.'))]

`MetadataQuery` can be used to search for `Text` objects with specific metadata. By default, collection's metadata columns will be searched:

In [28]:
q = MetadataQuery( {'author': 'Niinepuu'} )
for key, txt, meta in collection.select( query=q, collection_meta=['author'] ):
    print(key, txt, meta)

1 Text(text='Öökull ei laula.') {'author': 'Niinepuu'}


Alternatively, if you use `MetadataQuery` with `meta_type='TEXT'`, then the query searches for metadata inside `Text` objects (the `meta` field):

In [29]:
q = MetadataQuery( {'author': 'Niinemets'}, meta_type='TEXT' )
for key, txt in collection.select( query=q ):
    print(key, txt, txt.meta)

2 Text(text='Karu magab.') {'author': 'Niinemets'}


`SubstringQuery` finds all `Text` objects that have the given substring in their raw text:

In [30]:
q = SubstringQuery('laula')
for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


`LayerQuery` can be used to search texts by the attribute values in layers.

Find texts that contain lemma `laulma` in the attached `morph_analysis` layer.

In [31]:
q = LayerQuery('morph_analysis', lemma='laulma')
for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


You can also search for multiple layer attributes.

Find texts that contain a span in the detached `detached_morph_1` layer with partofspeech `V` and form `b`.

In [32]:
q = LayerQuery(layer_name=layer_1, partofspeech='V', form='b')

for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')
2 Text(text='Karu magab.')


Find texts that contain a span in the attached `morph_analysis` layer with lemma `laulma` and form `b`.

In [33]:
q = LayerQuery('morph_analysis', lemma='laulma', form='b')

for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')


### Combined conditions with OR and AND operators

You can use `|` ("OR"), and `&` ("AND") operators to create composite queries.

Find texts that contain a span in the `morph_analysis` layer with lemma `ööbik` **or** lemma `öökull`.

In [34]:
q = LayerQuery('morph_analysis', lemma='ööbik') | \
    LayerQuery('morph_analysis', lemma='öökull')

for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Find texts that contain a span in the `detached_morph_2` layer with lemma `ööbik` **and** another span with lemma `öökull`.

In [35]:
q = LayerQuery(layer_2, lemma='ööbik') & \
    LayerQuery(layer_2, lemma='öökull')
for key, txt in collection.select(query=q):
    print(key, txt)

No such text.

Find texts that contain a span in the `detached_morph_2` layer with lemma `ööbik` **and** another span with partofspeech `V` and form `b`:

In [36]:
q = LayerQuery(layer_name=layer_2, lemma='ööbik') & \
    LayerQuery(layer_name=layer_2, partofspeech='V', form='b')

for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')


Find texts that contain a span in the `morph_analysis` layer with lemma `laulma` **and** another span with lemma `ööbik` **or** `öökull`.

In [37]:
q = (LayerQuery('morph_analysis', lemma='ööbik') | LayerQuery('morph_analysis', lemma='öökull')) & \
     LayerQuery('morph_analysis', lemma='laulma')
for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Naturally, we can also combine layer queries over different layers. 

Find texts with lemma `ööbik` **or** `öökull` in the `detached_morph_1` layer **and** lemma `laulma` in the `detached_morph_2` layer:

In [38]:
q = (LayerQuery(layer_1, lemma='ööbik') | LayerQuery(layer_1, lemma='öökull')) & \
     LayerQuery(layer_2, lemma='laulma')

for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Finally, we can also combine different types of queries.

Find texts with lemma `öökull` in the `detached_morph_2` layer **or** with metadata entry `'author': 'Niinemets'`:

In [39]:
q = LayerQuery(layer_2, lemma='öökull') | \
    MetadataQuery({'author': 'Niinemets'}, meta_type='TEXT')

for key, text in collection.select(query=q):
    print(key, text)

1 Text(text='Öökull ei laula.')
2 Text(text='Karu magab.')
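
One way to reason about composite queries is to think of every single-condition query as the set of text ids it matches: `|` then behaves like set union and `&` like set intersection. The sketch below is only a mental model and says nothing about how EstNLTK evaluates queries internally. The id sets are read off (or inferred from) the outputs above, and the variable names are made up for the illustration.

```python
# Text ids matched by single-condition queries in this tutorial's collection
lemma_oobik = {0}
lemma_ookull = {1}
lemma_laulma = {0, 1}
verb_form_b = {0, 2}
author_niinemets = {2}

print(sorted(lemma_oobik | lemma_ookull))                   # In [34]: [0, 1]
print(sorted(lemma_oobik & lemma_ookull))                   # In [35]: []
print(sorted(lemma_oobik & verb_form_b))                    # In [36]: [0]
print(sorted((lemma_oobik | lemma_ookull) & lemma_laulma))  # In [37]: [0, 1]
print(sorted(lemma_ookull | author_niinemets))              # In [39]: [1, 2]
```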


### Queries over blocks of texts (parallelization)

You can use `BlockQuery` to make queries over non-overlapping subsets (blocks) of the collection. 
This can be useful for query parallelization: you can launch several parallel query jobs on the collection.

In [40]:
from estnltk.storage.postgres import BlockQuery

In [41]:
q = LayerQuery('morph_analysis', lemma='ööbik') | \
    LayerQuery('morph_analysis', lemma='öökull')

Now we can add a `BlockQuery` constraint to the query. The block of documents is defined by the input parameter `(module, remainder)`, which selects only texts with `text_id % module == remainder`. So, if we want to cover the whole collection with 2 queries, we add `BlockQuery(module=2, remainder=0)` to the first query and `BlockQuery(module=2, remainder=1)` to the second query.
For parallel querying, it is recommended to create a new database connection for each block query, as in the examples below:

In [42]:
# In case of parallel processing: Open a new connection to the database & collection
storage_a = PostgresStorage( dbname='test_db', pgpass_file='~/.pgpass', schema='my_schema' )
collection_a = storage_a["my_collection"]
# Search the first block
for key, txt in collection_a.select(query=q & BlockQuery(2, 0)):
    print(key, txt)
# Close the connection
storage_a.close()

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'
0 Text(text='Ööbik laulab.')


In [43]:
# In case of parallel processing: Open a new connection to the database & collection
storage_b = PostgresStorage( dbname='test_db', pgpass_file='~/.pgpass', schema='my_schema' )
collection_b = storage_b["my_collection"]
# Search the second block
for key, txt in collection_b.select(query=q & BlockQuery(2, 1)):
    print(key, txt)
# Close the connection
storage_b.close()

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'
1 Text(text='Öökull ei laula.')


### Queries over sparse layers

By default, selecting a sparse layer does not restrict the iteration: all texts (that match the selection query) are yielded, including those whose sparse layer is empty:

In [44]:
# Iterate over collection by selecting a sparse layer (default)
for key, text in collection.select(layers=['compound_tokens_sparse']):
    print(key, text, '| compound_tokens_sparse length:', len(text['compound_tokens_sparse']))

0 Text(text='Ööbik laulab.') | compound_tokens_sparse length: 0
1 Text(text='Öökull ei laula.') | compound_tokens_sparse length: 0
2 Text(text='Karu magab.') | compound_tokens_sparse length: 0
3 Text(text='Vana-Karu lõi trummi.') | compound_tokens_sparse length: 1


However, you can use `keep_all_texts=False` to constrain the query to yield only those texts that have non-empty sparse layers:

In [45]:
# Iterate over collection by selecting texts with non-empty sparse layers
for key, text in collection.select(layers=['compound_tokens_sparse'], keep_all_texts=False):
    print(key, text, '| compound_tokens_sparse length:', len(text['compound_tokens_sparse']))

3 Text(text='Vana-Karu lõi trummi.') | compound_tokens_sparse length: 1


Note that if you have multiple sparse layers selected, then the query yields an intersection of non-empty sparse layers: only texts that have all the selected sparse layers non-empty will be yielded.
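
The intersection rule can be sketched without a database. In the example below, `stored` records which text ids have a non-empty layer (empty sparse layers have no row). `other_sparse_layer` is a hypothetical second sparse layer that does not exist in this tutorial's collection, and `selected_ids` is a made-up helper that mimics only the selection logic described above.

```python
# Text ids with a stored (non-empty) layer, per sparse layer
stored = {
    'compound_tokens_sparse': {3},
    'other_sparse_layer': {1, 3},  # hypothetical layer, for illustration
}
all_text_ids = [0, 1, 2, 3]


def selected_ids(layers, keep_all_texts=True):
    """Ids yielded when the given sparse layers are selected."""
    if keep_all_texts:
        return list(all_text_ids)
    non_empty = set(all_text_ids)
    for layer in layers:
        non_empty &= stored[layer]  # keep only texts where this layer is non-empty
    return sorted(non_empty)


print(selected_ids(['compound_tokens_sparse']))                         # [0, 1, 2, 3]
print(selected_ids(['compound_tokens_sparse'], keep_all_texts=False))   # [3]
print(selected_ids(['compound_tokens_sparse', 'other_sparse_layer'],
                   keep_all_texts=False))                               # [3]
```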

### Creating a sparse layer from a selection

Essentially, `collection.select(...)` yields a read-only subcollection of texts from the collection (see `PgSubCollection` below). 
This subcollection can be used as a basis for creating a new sparse layer that covers only texts from that subcollection. 
You can use `collection.select(...).create_layer(tagger)` to tag a sparse layer over the selection:

In [46]:
# Create tagger for sparse 'paragraphs' layer
paragraph_tokenizer = ParagraphTokenizer(output_layer='paragraphs_sparse')

# Annotate only a subselection of texts with 'paragraphs'
collection.select(query=IndexQuery([0, 1])).create_layer(paragraph_tokenizer)

INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:715: preparing to create a new layer: 'paragraphs_sparse'
INFO:collection.py:747: inserting data into the 'paragraphs_sparse' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'paragraphs_sparse' layers into the collection 'my_collection'
INFO:collection.py:782: layer created: 'paragraphs_sparse'


In a similar manner, you can use `collection.select(...).create_layer_block(tagger, block)` to tag a block over the selection. However, remember to create the layer table beforehand via `collection.add_layer(layer_template=tagger.get_layer_template(), sparse=True)`.

Note that the default iteration over the sparse layer still yields all text objects:

In [47]:
# Browse results: iterate over collection by selecting a sparse layer
for key, text in collection.select(layers=['paragraphs_sparse']):
    print(key, text, '| paragraphs_sparse length:', len(text['paragraphs_sparse']))

0 Text(text='Ööbik laulab.') | paragraphs_sparse length: 1
1 Text(text='Öökull ei laula.') | paragraphs_sparse length: 1
2 Text(text='Karu magab.') | paragraphs_sparse length: 0
3 Text(text='Vana-Karu lõi trummi.') | paragraphs_sparse length: 0


And `keep_all_texts=False` can be used to constrain the query to texts with non-empty sparse layers:

In [48]:
for key, text in collection.select(layers=['paragraphs_sparse'], keep_all_texts=False):
    print(key, text, '| paragraphs_sparse length:', len(text['paragraphs_sparse']))

0 Text(text='Ööbik laulab.') | paragraphs_sparse length: 1
1 Text(text='Öökull ei laula.') | paragraphs_sparse length: 1


And we're done with these examples. Delete the collection:

In [49]:
collection.delete()

## Indexing layers

Create a new collection:

In [50]:
collection = storage.get_collection('collection_with_layers')
collection.create()

with collection.insert() as collection_insert:
    collection_insert(Text('See on esimene lause.').tag_layer(["sentences"]))
    collection_insert(Text('See on teine lause.').tag_layer(["sentences"]))

collection

INFO:collection.py:94: new empty collection 'collection_with_layers' created
INFO:collection_text_object_inserter.py:107: inserted 2 texts into the collection 'collection_with_layers'


layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
sentences,attached,(),False,False,,words,[]
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
words,attached,"(normalized_form,)",True,False,,,[]


An ngram index makes it possible to index ngrams of layer attribute values.
For example, a bigram index on an attribute with values `['see', 'on', 'esimene', 'lause']` will contain pairs *'see-on'*, *'on-esimene'*, *'esimene-lause'*.
Indices of a higher order are also supported.
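
To make the example concrete, the following sketch generates the ngrams of an attribute value sequence. It only illustrates what an ngram is. The helper `ngrams` is hypothetical, and the `-` separator is just the notation used in the text, not a statement about how EstNLTK stores the index.

```python
def ngrams(values, n):
    """All runs of n consecutive values, joined with '-'."""
    return ['-'.join(values[i:i + n]) for i in range(len(values) - n + 1)]


values = ['see', 'on', 'esimene', 'lause']
print(ngrams(values, 2))  # ['see-on', 'on-esimene', 'esimene-lause']
print(ngrams(values, 3))  # ['see-on-esimene', 'on-esimene-lause']
```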

To build an ngram index, provide an argument *ngram_index* when creating a new layer.
The following code creates a bi-gram index on an attribute *lemma* for a newly created layer *indexed_layer*:

In [51]:
indexed_layer = 'indexed_layer'
tagger = VabamorfTagger(disambiguate=False, output_layer=indexed_layer)

collection.create_layer(tagger=tagger, ngram_index={"lemma": 2})

INFO:collection.py:696: collection: 'collection_with_layers'
INFO:collection.py:715: preparing to create a new layer: 'indexed_layer'
INFO:collection.py:747: inserting data into the 'indexed_layer' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'indexed_layer' layers into the collection 'collection_with_layers'
INFO:collection.py:782: layer created: 'indexed_layer'


To search an ngram index, use a `LayerNgramQuery`.

Search for entries containing the lemma bigram 'see-olema':

In [52]:
from estnltk.storage.postgres import LayerNgramQuery

q = LayerNgramQuery( { indexed_layer: {
        "lemma": [("see", "olema")]
    }})
for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
1 Text(text='See on teine lause.')


Search 'teine-lause' OR 'olema-esimene':

In [53]:
q = LayerNgramQuery( { indexed_layer: {
        "lemma":  [("teine", "lause"), ("olema", "esimene")]
    }})
for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
1 Text(text='See on teine lause.')


Search 'see-olema' AND 'olema-esimene':

In [54]:
q = LayerNgramQuery( { indexed_layer: {
        "lemma":  [[("see", "olema"), ("olema", "esimene")]]
    }})
for key, text in collection.select(query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
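
The three queries above suggest how the value of `"lemma"` is read: the outer list is a disjunction (OR) of its items, and an item that is itself a list is a conjunction (AND) of bigrams. The sketch below reproduces that reading in plain Python. It is an interpretation of the examples, not EstNLTK's implementation. The helpers `lemma_bigrams`, `matches` and `search` are hypothetical, and the lemma sequences are simplified to one lemma per word.

```python
def lemma_bigrams(lemmas):
    """Set of consecutive lemma pairs."""
    return set(zip(lemmas, lemmas[1:]))


def matches(bigram_index, condition):
    """OR over the items of `condition`; a list item is an AND of bigrams."""
    for item in condition:
        required = item if isinstance(item, list) else [item]
        if all(bigram in bigram_index for bigram in required):
            return True
    return False


# Simplified lemma sequences of the two texts (assumption: one lemma per word)
texts = {
    0: ['see', 'olema', 'esimene', 'lause', '.'],
    1: ['see', 'olema', 'teine', 'lause', '.'],
}


def search(condition):
    return [key for key, lemmas in texts.items()
            if matches(lemma_bigrams(lemmas), condition)]


print(search([("see", "olema")]))                          # [0, 1]
print(search([("teine", "lause"), ("olema", "esimene")]))  # [0, 1]
print(search([[("see", "olema"), ("olema", "esimene")]]))  # [0]
```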


In [55]:
collection

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
sentences,attached,(),False,False,,words,[]
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
words,attached,"(normalized_form,)",True,False,,,[]
indexed_layer,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]


### Delete layer

Only detached layers can be deleted.

In [56]:
collection

layer,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
sentences,attached,(),False,False,,words,[]
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
words,attached,"(normalized_form,)",True,False,,,[]
indexed_layer,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]


The only detached layer in this collection is the layer `indexed_layer`. Let's delete it.

In [57]:
collection.delete_layer('indexed_layer')

INFO:collection.py:1097: layer deleted: 'indexed_layer'


Finally, delete the collection.

In [58]:
collection.delete()

## `PgSubCollection`

In [59]:
collection = storage.get_collection('my_collection')
collection.create()

texts = ['Esimene tekst.', 'Teine tekst.', 'Kolmas tekst.', 'Neljas tekst.', 'Viies tekst.']

with collection.insert() as collection_insert:
    for t in texts:
        collection_insert(Text(t))

INFO:collection.py:94: new empty collection 'my_collection' created
INFO:collection_text_object_inserter.py:107: inserted 5 texts into the collection 'my_collection'


In [60]:
from estnltk.taggers import TokensTagger

tokens_tagger = TokensTagger()

collection.create_layer(tagger=tokens_tagger)

INFO:collection.py:696: collection: 'my_collection'
INFO:collection.py:715: preparing to create a new layer: 'tokens'
INFO:collection.py:747: inserting data into the 'tokens' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 5 detached 'tokens' layers into the collection 'my_collection'
INFO:collection.py:782: layer created: 'tokens'


The `select` method returns a `PgSubCollection` object that provides read-only access to a subset of the collection.

In [61]:
collection.select(query=None,
                  layers=None,  # Sequence[str]
                  collection_meta=None,  # Sequence[str]
                  progressbar=None,  # str
                  return_index=True,  # bool
                  itersize=10  # int
                  )

PgSubCollection(collection: 'my_collection', selected_layers=[], meta_attributes=(), progressbar=None, return_index=True)

In [62]:
for text in collection.select(progressbar='notebook', return_index=True):
    print(text)

  0%|          | 0/5 [00:00<?, ?doc/s]

(0, Text(text='Esimene tekst.'))
(1, Text(text='Teine tekst.'))
(2, Text(text='Kolmas tekst.'))
(3, Text(text='Neljas tekst.'))
(4, Text(text='Viies tekst.'))


You can also directly access the first and last texts of the `PgSubCollection`. The `head` method selects only the first `N` texts from the subset:

In [63]:
# Select the first 3 texts
for text in collection.select().head(3):
    print(text)

(0, Text(text='Esimene tekst.'))
(1, Text(text='Teine tekst.'))
(2, Text(text='Kolmas tekst.'))


The `tail` method selects the last `N` texts:

In [64]:
# Select only the last 2 texts
for text in collection.select().tail(2):
    print(text)

(3, Text(text='Neljas tekst.'))
(4, Text(text='Viies tekst.'))
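For intuition, the behaviour of `head` and `tail` on the five-text collection above can be mimicked on a plain Python list. This is only an illustrative sketch: `rows`, `head` and `tail` below are hypothetical stand-ins written for this example, not EstNLTK functions, and the real `PgSubCollection` methods work against the database rather than an in-memory list.

```python
from collections import deque
from itertools import islice

# Hypothetical stand-in for the (index, text) pairs yielded by collection.select();
# plain strings are used here instead of Text objects.
rows = list(enumerate(['Esimene tekst.', 'Teine tekst.', 'Kolmas tekst.',
                       'Neljas tekst.', 'Viies tekst.']))


def head(iterable, n):
    """Return the first n items of an iterable."""
    return list(islice(iterable, n))


def tail(iterable, n):
    """Return the last n items, holding at most n items in memory at a time."""
    return list(deque(iterable, maxlen=n))


first_three = head(rows, 3)
last_two = tail(rows, 2)

print(first_three)
print(last_two)
```

The indices in `first_three` and `last_two` match those printed by the two cells above: 0 to 2 for the head and 3 to 4 for the tail.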


To get a detached layer without its `Text` object, use the `detached_layer` method of the `PgSubCollection`:

In [65]:
detached_layers = collection.select(return_index=False).detached_layer('tokens')
detached_layers

PgSubCollectionLayer(collection: 'my_collection', detached_layer='tokens', progressbar=None, return_index=False, skip_empty=False)

In [66]:
next(iter(detached_layers))

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,3

start,end
0,7
8,13
13,14
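The `start` and `end` columns are character offsets into the raw text, with `end` exclusive, so they can be used directly as Python slice bounds. As a quick sanity check, the sketch below slices the first text of the collection with the three spans shown above. Both `raw_text` and `spans` are copied by hand from the outputs in this tutorial rather than read from the database.

```python
# Raw text of the first document and the token spans printed above
raw_text = 'Esimene tekst.'
spans = [(0, 7), (8, 13), (13, 14)]

# Slice the text with each (start, end) pair to recover the token strings
tokens = [raw_text[start:end] for start, end in spans]

print(tokens)  # ['Esimene', 'tekst', '.']
```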


## Working with fragments

In [67]:
collection = storage["collection_with_fragments"].create(description='demo collection')

with collection.insert() as collection_insert:
    text1 = Text('Ööbik laulab.').tag_layer(['morph_analysis'])
    collection_insert(text1)

    text2 = Text('Öökull ei laula.').tag_layer(['morph_analysis'])
    collection_insert(text2)


def fragmenter(layer):
    # Trivial fragmenter: return the whole layer as a single fragment
    return [layer]


tagger = VabamorfTagger(disambiguate=False, output_layer='fragmented_morph')

collection.create_fragmented_layer(tagger=tagger, fragmenter=fragmenter)

collection

INFO:collection.py:94: new empty collection 'collection_with_fragments' created
INFO:collection_text_object_inserter.py:107: inserted 2 texts into the collection 'collection_with_fragments'
INFO:collection.py:566: collection: 'collection_with_fragments'
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'fragmented_morph' layers into the collection 'collection_with_fragments'
INFO:collection.py:614: fragmented layer created: 'fragmented_morph'


Unnamed: 0,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
sentences,attached,(),False,False,,words,[]
morph_analysis,attached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,False,words,,[]
words,attached,"(normalized_form,)",True,False,,,[]
fragmented_morph,fragmented,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]


In [68]:
from estnltk.storage.postgres import RowMapperRecord

table_name = 'fragment_test'
collection = storage.get_collection(table_name)
collection.create()

with collection.insert() as collection_insert:
    text1 = Text('see on esimene lause').tag_layer(["sentences"])
    collection_insert(text1)
    text2 = Text('see on teine lause').tag_layer(["sentences"])
    collection_insert(text2)

layer_fragment_name = "layer_fragment_1"
tagger = VabamorfTagger(disambiguate=False, output_layer=layer_fragment_name)

collection.create_layer(tagger=tagger)

fragment_name = "fragment_1"

def row_mapper(row):
    # Each row is a (parent_id, layer) pair; two fragment records are returned per row
    parent_id, layer = row
    return [{'fragment': layer, 'parent_id': parent_id},
            {'fragment': layer, 'parent_id': parent_id}]

collection.create_fragment(fragment_name,
                           data_iterator=collection.select().fragmented_layer(name=layer_fragment_name),
                           row_mapper=row_mapper,
                           create_index=False,
                           ngram_index=None)

INFO:collection.py:94: new empty collection 'fragment_test' created
INFO:collection_text_object_inserter.py:107: inserted 2 texts into the collection 'fragment_test'
INFO:collection.py:696: collection: 'fragment_test'
INFO:collection.py:715: preparing to create a new layer: 'layer_fragment_1'
INFO:collection.py:747: inserting data into the 'layer_fragment_1' layer table
INFO:collection_detached_layer_inserter.py:86: inserted 2 detached 'layer_fragment_1' layers into the collection 'fragment_test'
INFO:collection.py:782: layer created: 'layer_fragment_1'
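The `row_mapper` in the cell above returns two records for every `(parent_id, layer)` row it receives. The sketch below replays the same mapping over placeholder rows in plain Python to make that fan-out visible. The `rows` list and its string "layers" are hypothetical stand-ins for what the data iterator yields, and the sketch says nothing about how `create_fragment` stores the records.

```python
def row_mapper(row):
    # Same mapping as in the cell above: two records per input row
    parent_id, layer = row
    return [{'fragment': layer, 'parent_id': parent_id},
            {'fragment': layer, 'parent_id': parent_id}]


# Hypothetical stand-in for the (parent_id, layer) rows of a two-text collection
rows = [(0, 'layer of text 0'), (1, 'layer of text 1')]

# Apply the mapper to every row and flatten the resulting lists
records = [record for row in rows for record in row_mapper(row)]
parent_ids = [record['parent_id'] for record in records]

print(len(records))
print(parent_ids)
```

With two input rows and two records per row, the mapper produces four records in total, two for each parent text.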


In [69]:
collection

Unnamed: 0,layer_type,attributes,ambiguous,sparse,parent,enveloping,meta
sentences,attached,(),False,False,,words,[]
tokens,attached,(),False,False,,,[]
compound_tokens,attached,"(type, normalized)",False,False,,tokens,[]
words,attached,"(normalized_form,)",True,False,,,[]
layer_fragment_1,detached,"(normalized_text, lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,False,words,,[]


In [70]:
collection.delete()

In [71]:
delete_schema(storage)
storage.close()