# Storing `Text` objects in a PostgreSQL database

This tutorial demonstrates how to store and query EstNLTK `Text` objects in a PostgreSQL database.

In [1]:
from estnltk import Text, logger
from estnltk.taggers import VabamorfTagger, WordTagger
from estnltk.storage.postgres import PostgresStorage, create_schema, delete_schema
from estnltk.storage.postgres import JsonbTextQuery, JsonbLayerQuery, WhereClause

## Access to the database

In [2]:
storage = PostgresStorage(host=None,
                          port=None,
                          dbname='test_db',
                          user=None,
                          password=None,
                          pgpass_file='~/.pgpass',
                          schema='my_schema',
                          role=None,
                          temporary=False)

INFO:storage.py:41: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'pault'
INFO:storage.py:57: schema: 'my_schema', temporary: False, role: 'pault'


If any of the parameters `host`, `port`, `dbname`, `user` or `password` is `None`, the missing values are looked up in the `pgpass_file`. The first line of the file that matches the given arguments is used to connect to an existing PostgreSQL database.

File line format:

    host:port:dbname:user:password
 
Example file contents:

    # host:port:dbname:user:password
    localhost:5432:test_db:username:password
    example.com:5432:*:exampleuser:kj3dno34
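
The lookup rule described above can be sketched in plain Python. This is only an illustration of the "first matching line" idea, not EstNLTK's implementation: the function name `find_pgpass_entry` is hypothetical, `*` is treated as a wildcard as in PostgreSQL's password file, and backslash-escaped colons are not handled.

```python
def find_pgpass_entry(lines, host=None, port=None, dbname=None, user=None):
    """Return connection parameters from the first matching pgpass line, or None.

    Hypothetical helper for illustration only.
    """
    wanted = {'host': host, 'port': port, 'dbname': dbname, 'user': user}
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split(':')
        if len(fields) != 5:
            continue  # skip malformed lines
        entry = dict(zip(('host', 'port', 'dbname', 'user', 'password'), fields))
        # A field matches if the argument is missing (None), the file has '*',
        # or the two values are equal.
        if all(value is None or entry[name] in ('*', str(value))
               for name, value in wanted.items()):
            # Explicit arguments take precedence over values read from the file.
            for name, value in wanted.items():
                if value is not None:
                    entry[name] = str(value)
            return entry
    return None


pgpass_lines = [
    '# host:port:dbname:user:password',
    'localhost:5432:test_db:username:password',
    'example.com:5432:*:exampleuser:kj3dno34',
]

print(find_pgpass_entry(pgpass_lines, dbname='test_db'))
print(find_pgpass_entry(pgpass_lines, host='example.com', dbname='other_db'))
```

With only `dbname='test_db'` given, the second line of the example file is the first match, so the remaining values come from that line.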

## Create schema

The schema is probably already set up in the database. If it is not and you have sufficient privileges, you can create one:

In [3]:
create_schema(storage)

## Create collections

Now new collections can be created and displayed. A collection stores `Text` objects in the database and provides a read/write API.

In [4]:
storage['my_first_collection'].create('first demo collection')
storage['my_second_collection'].create('second demo collection')
storage

INFO:collection.py:106: new empty collection 'my_first_collection' created
INFO:collection.py:106: new empty collection 'my_second_collection' created


collection,layers,total_size,comment
my_first_collection,,32 kB,first demo collection
my_second_collection,,32 kB,second demo collection


The collection names are also available as a list of strings:

In [5]:
storage.collections

['my_first_collection', 'my_second_collection']

## Delete collections

In [6]:
del storage['my_first_collection']
# or
storage['my_second_collection'].delete()

After this, the storage is empty again:

In [7]:
storage

## Add texts

Let's create a collection:

In [8]:
collection = storage["my_collection"].create(description='demo collection')

INFO:collection.py:106: new empty collection 'my_collection' created


Then add some data:

In [9]:
with collection.insert() as collection_insert:
    text1 = Text('Ööbik laulab.').tag_layer(['morph_analysis'])
    collection_insert(text1)

    text2 = Text('Öökull ei laula.').tag_layer(['morph_analysis'])
    key2 = collection_insert(text2)

INFO:collection.py:325: inserted 1 texts into the collection 'my_collection'


All inserted `Text` objects must have the same layers.

You can inspect the contents of the collection:

In [10]:
collection

layer,layer_type,attributes,ambiguous,parent,enveloping,_base,meta
compound_tokens,attached,"(type, normalized)",False,,tokens,compound_tokens,[]
morph_analysis,attached,"(lemma, root, root_tokens, ending, clitic, for...",True,words,,words,[]
sentences,attached,(),False,,words,sentences,[]
tokens,attached,(),False,,,tokens,[]
words,attached,"(normalized_form,)",False,,,words,[]


The layers inserted with the `Text` objects are stored in the same database table as the `Text` objects and are called **attached** layers.

### Create layers

The `create_layer` method creates a new layer for every `Text` object in the collection. These layers are stored in separate database tables and are called **detached** layers.

In [11]:
layer_1 = 'detached_morph_1'
layer_2 = 'detached_morph_2'

tagger = VabamorfTagger(disambiguate=False, layer_name=layer_1)
collection.create_layer(tagger=tagger)

tagger = VabamorfTagger(disambiguate=False, layer_name=layer_2)
collection.create_layer(tagger=tagger)

collection

INFO:collection.py:925: collection: 'my_collection'
INFO:collection.py:944: preparing to create a new layer: 'detached_morph_1'
INFO:collection.py:977: inserting data into the 'detached_morph_1' layer table
INFO:collection.py:1012: layer created: 'detached_morph_1'
INFO:collection.py:925: collection: 'my_collection'
INFO:collection.py:944: preparing to create a new layer: 'detached_morph_2'
INFO:collection.py:977: inserting data into the 'detached_morph_2' layer table
INFO:collection.py:1012: layer created: 'detached_morph_2'


layer,layer_type,attributes,ambiguous,parent,enveloping,_base,meta
compound_tokens,attached,"(type, normalized)",False,,tokens,compound_tokens,[]
detached_morph_1,detached,"(lemma, root, root_tokens, ending, clitic, for...",True,words,,words,[]
detached_morph_2,detached,"(lemma, root, root_tokens, ending, clitic, for...",True,words,,words,[]
morph_analysis,attached,"(lemma, root, root_tokens, ending, clitic, for...",True,words,,words,[]
sentences,attached,(),False,,words,sentences,[]
tokens,attached,(),False,,,tokens,[]
words,attached,"(normalized_form,)",False,,,words,[]


### Iterate collection

The number of `Text` objects in the collection:

In [12]:
len(collection)

2

Avoid listing all the collection elements if the collection is large.

In [13]:
list(collection)

[Text(text='Ööbik laulab.'), Text(text='Öökull ei laula.')]
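
For a large collection, a few items can be inspected lazily instead of materialising the whole list. The sketch below assumes only that the collection can be iterated item by item; `fake_collection` is a hypothetical stand-in generator, not part of EstNLTK.

```python
from itertools import islice


def fake_collection(size):
    """Hypothetical stand-in for a large collection: yields items one by one."""
    for key in range(size):
        yield 'text number {}'.format(key)


# Take only the first three items; the remaining items are never produced.
first_items = list(islice(fake_collection(10**9), 3))
print(first_items)
```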

A collection yields `Text` objects with the selected layers. By default, the selected layers are the attached layers.

In [14]:
collection.selected_layers

['tokens', 'compound_tokens', 'words', 'sentences', 'morph_analysis']

The dependencies of the selected layers are included automatically.

In [15]:
collection.selected_layers = [layer_1]
collection.selected_layers

['words', 'detached_morph_1']
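
One way to picture this dependency resolution is a depth-first walk over the parent and enveloping relations. The sketch below is an assumption about the general idea, not the library's code; `with_dependencies` and the `dependencies` mapping are hypothetical and written by hand from the collection table shown earlier.

```python
def with_dependencies(selected, dependencies):
    """Return the selected layers preceded by all layers they depend on.

    `dependencies` maps a layer name to the layers it is built on
    (its parent or enveloped layer). Hypothetical helper for illustration.
    """
    ordered = []

    def visit(layer):
        if layer in ordered:
            return
        for dependency in dependencies.get(layer, []):
            visit(dependency)  # dependencies come before the layer itself
        ordered.append(layer)

    for layer in selected:
        visit(layer)
    return ordered


# Dependencies as listed in the parent/enveloping columns of the collection table.
dependencies = {
    'compound_tokens': ['tokens'],
    'sentences': ['words'],
    'morph_analysis': ['words'],
    'detached_morph_1': ['words'],
    'detached_morph_2': ['words'],
}

print(with_dependencies(['detached_morph_1'], dependencies))
# ['words', 'detached_morph_1']
```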

Indexes start from `0`.

In [16]:
collection[0]

text
Ööbik laulab.

layer name,attributes,parent,enveloping,ambiguous,span count
words,normalized_form,,,False,3
detached_morph_1,"lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore",words,,True,3


### Search collection

Search for a particular entry by key:

In [17]:
list(collection.select(keys=[1]))

[(1, Text(text='Öökull ei laula.'))]

In [18]:
collection.select_by_key(1)

text
Öökull ei laula.

layer name,attributes,parent,enveloping,ambiguous,span count
sentences,,,words,False,1
tokens,,,,False,4
compound_tokens,"type, normalized",,tokens,False,0
words,normalized_form,,,False,4
morph_analysis,"lemma, root, root_tokens, ending, clitic, form, partofspeech",words,,True,4


`Text` objects can also be searched by layer attribute values.

Use `JsonbTextQuery` to search the texts by attribute values in the attached layers and `JsonbLayerQuery` to search by attribute values in the detached layers.

In [19]:
q = JsonbTextQuery('morph_analysis', lemma='laulma')
for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


In [20]:
q = {layer_1: JsonbLayerQuery(layer_name=layer_1, lemma='laulma')}

for key, text in collection.select(layer_query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Search using multiple layer attributes:

In [21]:
q = JsonbTextQuery('morph_analysis', lemma='laulma', form='b')

for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')


In [22]:
q = {layer_1: JsonbLayerQuery(layer_name=layer_1, lemma='laulma', form='b')}

for key, text in collection.select(layer_query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')


Search using an "OR" query:

In [23]:
q = JsonbTextQuery('morph_analysis', lemma='ööbik') | \
    JsonbTextQuery('morph_analysis', lemma='öökull')

for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


In [24]:
q = {layer_1: JsonbLayerQuery(layer_name=layer_1, lemma='ööbik') | 
              JsonbLayerQuery(layer_name=layer_1, lemma='öökull')}

for key, text in collection.select(layer_query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Search using an "AND" query (no text contains both lemmas, so nothing is printed):

In [25]:
q = JsonbTextQuery('morph_analysis', lemma='ööbik') & \
    JsonbTextQuery('morph_analysis', lemma='öökull')
for key, txt in collection.select(query=q):
    print(key, txt)

In [26]:
q = {layer_1: JsonbLayerQuery(layer_name=layer_1, lemma='ööbik') & 
              JsonbLayerQuery(layer_name=layer_1, lemma='öökull')}

for key, text in collection.select(layer_query=q):
    print(key, text)
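
The way `|` and `&` combine queries can be illustrated with a small stand-in class. This is a sketch of the operator semantics only: `LemmaQuery` is hypothetical and evaluates against hand-written, simplified lemma sets in memory, which is not how the EstNLTK query classes work internally.

```python
class LemmaQuery:
    """Hypothetical in-memory query: matches texts by their set of lemmas."""

    def __init__(self, predicate):
        self.predicate = predicate

    @classmethod
    def lemma(cls, value):
        return cls(lambda lemmas: value in lemmas)

    def __or__(self, other):
        return LemmaQuery(lambda lemmas: self.predicate(lemmas) or other.predicate(lemmas))

    def __and__(self, other):
        return LemmaQuery(lambda lemmas: self.predicate(lemmas) and other.predicate(lemmas))

    def select(self, texts):
        return [key for key, lemmas in sorted(texts.items()) if self.predicate(lemmas)]


# Hand-written, simplified lemma sets for the two example texts.
texts = {
    0: {'ööbik', 'laulma'},         # 'Ööbik laulab.'
    1: {'öökull', 'ei', 'laulma'},  # 'Öökull ei laula.'
}

either = LemmaQuery.lemma('ööbik') | LemmaQuery.lemma('öökull')
both = LemmaQuery.lemma('ööbik') & LemmaQuery.lemma('öökull')

print(either.select(texts))  # [0, 1]
print(both.select(texts))    # []: no single text contains both lemmas
print((either & LemmaQuery.lemma('laulma')).select(texts))  # [0, 1]
```

Because `&` binds more tightly than `|` in Python, parentheses matter when the two operators are mixed in one expression.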

Search using a composite query:

In [27]:
q = (JsonbTextQuery('morph_analysis', lemma='ööbik') | JsonbTextQuery('morph_analysis', lemma='öökull')) & \
     JsonbTextQuery('morph_analysis', lemma='laulma')
for key, txt in collection.select(query=q):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


In [28]:
q = {layer_1: (JsonbLayerQuery(layer_name=layer_1, lemma='ööbik') | 
               JsonbLayerQuery(layer_name=layer_1, lemma='öökull')) & 
              JsonbLayerQuery(layer_name=layer_1, lemma='laulma')}

for key, text in collection.select(layer_query=q):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Alternatively, use the convenience method `find_fingerprint`:

In [29]:
q = {"layer": "morph_analysis",
     "ambiguous": True,
     "field": "lemma",
     "query": [{'ööbik', 'laulma'}, {'öökull', 'laulma'}] # (ööbik AND laulma) OR (öökull AND laulma)
     }

for key, txt in collection.find_fingerprint(query=q, order_by_key=True):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


In [30]:
for key, txt in collection.find_fingerprint(
                    query={
                        "layer": "morph_analysis",
                        "ambiguous": True,
                        "field": "lemma",
                        "query": ['öökull', 'laulma'] # öökull OR laulma
                    },
                    order_by_key=True):
    print(key, txt)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


In [31]:
for key, txt in collection.find_fingerprint(
                    query={
                        "layer": "morph_analysis",
                        "ambiguous": True,
                        "field": "lemma",
                        "query": [{'öökull', 'laulma'}] # öökull AND laulma
                    },
                    order_by_key=True):
    print(key, txt)

1 Text(text='Öökull ei laula.')
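
The shape of the `query` value in the three examples above can be read as an OR over the list items, where each item is either a single value or a set of values that must all be present. Below is a minimal sketch of that reading, using a hypothetical `matches_fingerprint` function and hand-written lemma sets; it does not reflect how `find_fingerprint` is implemented.

```python
def matches_fingerprint(lemmas, query):
    """Return True if any item of `query` is satisfied by `lemmas`.

    An item is a string (must be present) or a set of strings (all must be present).
    """
    for item in query:
        required = {item} if isinstance(item, str) else set(item)
        if required <= lemmas:
            return True
    return False


# Hand-written, simplified lemma sets for the two example texts.
texts = {
    0: {'ööbik', 'laulma'},         # 'Ööbik laulab.'
    1: {'öökull', 'ei', 'laulma'},  # 'Öökull ei laula.'
}

queries = [
    [{'ööbik', 'laulma'}, {'öökull', 'laulma'}],  # (ööbik AND laulma) OR (öökull AND laulma)
    ['öökull', 'laulma'],                         # öökull OR laulma
    [{'öökull', 'laulma'}],                       # öökull AND laulma
]

for query in queries:
    print([key for key, lemmas in sorted(texts.items())
           if matches_fingerprint(lemmas, query)])
```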


In [32]:
for key, text in collection.find_fingerprint(layer_query={
            layer_1: {
                "field": "lemma",
                "query": ["ööbik", "öökull"],
                "ambiguous": True
            },
            layer_2: {
                "field": "lemma",
                "query": ["laulma"],
                "ambiguous": True
            }}):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Search over multiple layers using `JsonbLayerQuery`:

In [33]:
for key, text in collection.select(layer_query={
        layer_1: JsonbLayerQuery(layer_name=layer_1, lemma='ööbik') | \
                 JsonbLayerQuery(layer_name=layer_1, lemma='öökull'),
        layer_2: JsonbLayerQuery(layer_name=layer_2, lemma='laulma')
        }):
    print(key, text)

0 Text(text='Ööbik laulab.')
1 Text(text='Öökull ei laula.')


Delete the collection:

In [34]:
collection.delete()

### Indexing layers

In [35]:
collection = storage.get_collection('collection_with_layers')
collection.create()

with collection.insert() as collection_insert:
    collection_insert(Text('See on esimene lause.').tag_layer(["sentences"]))
    collection_insert(Text('See on teine lause.').tag_layer(["sentences"]))

collection

INFO:collection.py:106: new empty collection 'collection_with_layers' created
INFO:collection.py:325: inserted 1 texts into the collection 'collection_with_layers'


layer,layer_type,attributes,ambiguous,parent,enveloping,_base,meta
compound_tokens,attached,"(type, normalized)",False,,tokens,compound_tokens,[]
sentences,attached,(),False,,words,sentences,[]
tokens,attached,(),False,,,tokens,[]
words,attached,"(normalized_form,)",False,,,words,[]


An ngram index makes it possible to index ngrams of layer attribute values.
For example, a bigram index on an attribute with values `['see', 'on', 'esimene', 'lause']` will contain pairs *'see-on'*, *'on-esimene'*, *'esimene-lause'*.
Indices of a higher order are also supported.
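
A minimal sketch of how such ngrams can be enumerated from a list of attribute values. The `ngrams` function is a hypothetical illustration and says nothing about how the index is stored in the database.

```python
def ngrams(values, n):
    """Return all runs of `n` consecutive values, joined with '-'."""
    return ['-'.join(values[i:i + n]) for i in range(len(values) - n + 1)]


values = ['see', 'on', 'esimene', 'lause']
print(ngrams(values, 2))  # ['see-on', 'on-esimene', 'esimene-lause']
print(ngrams(values, 3))  # ['see-on-esimene', 'on-esimene-lause']
```
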

To build an ngram index, pass the *ngram_index* argument when creating a new layer.
The following code creates a bigram index on the attribute *lemma* for a newly created layer *indexed_layer*:

In [36]:
indexed_layer = 'indexed_layer'
tagger = VabamorfTagger(disambiguate=False, layer_name=indexed_layer)

collection.create_layer(tagger=tagger, ngram_index={"lemma": 2})

INFO:collection.py:925: collection: 'collection_with_layers'
INFO:collection.py:944: preparing to create a new layer: 'indexed_layer'
INFO:collection.py:977: inserting data into the 'indexed_layer' layer table
INFO:collection.py:1012: layer created: 'indexed_layer'


To search an ngram index, use the `find_fingerprint` method with the `layer_ngram_query` argument.

Search for entries containing the lemma bigram 'see-olema':

In [37]:
q = {indexed_layer: {
        "lemma": [("see", "olema")]
    }}
for key, text in collection.find_fingerprint(layer_ngram_query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
1 Text(text='See on teine lause.')


Search for 'teine-lause' OR 'olema-esimene':

In [38]:
q = {indexed_layer: {
        "lemma": [("teine", "lause"), ("olema", "esimene")]
    }}
for key, text in collection.find_fingerprint(layer_ngram_query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
1 Text(text='See on teine lause.')


Search for 'see-olema' AND 'olema-esimene':

In [39]:
q = {indexed_layer: {
        "lemma": [[("see", "olema"), ("olema", "esimene")]]
    }}
for key, text in collection.find_fingerprint(layer_ngram_query=q):
    print(key, text)

0 Text(text='See on esimene lause.')
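
Read together, the three queries above suggest that a flat list of bigrams acts as an OR, while a nested list requires all of its bigrams to occur. The sketch below mimics that reading in memory; `matches_ngram_query` and the simplified, unambiguous lemma sequences are assumptions made for illustration only.

```python
def bigrams(lemmas):
    """Return the set of adjacent lemma pairs."""
    return set(zip(lemmas, lemmas[1:]))


def matches_ngram_query(text_bigrams, query):
    """Return True if any query item is satisfied.

    An item is one bigram (a tuple) or a list of bigrams that must all occur.
    """
    for item in query:
        required = [item] if isinstance(item, tuple) else item
        if all(bigram in text_bigrams for bigram in required):
            return True
    return False


# Hand-written, simplified lemma sequences for the two example texts.
texts = {
    0: ['see', 'olema', 'esimene', 'lause'],  # 'See on esimene lause.'
    1: ['see', 'olema', 'teine', 'lause'],    # 'See on teine lause.'
}

queries = [
    [('see', 'olema')],                          # a single bigram
    [('teine', 'lause'), ('olema', 'esimene')],  # either bigram
    [[('see', 'olema'), ('olema', 'esimene')]],  # both bigrams
]

for query in queries:
    print([key for key, lemmas in sorted(texts.items())
           if matches_ngram_query(bigrams(lemmas), query)])
```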


In [40]:
collection.delete()

## `PgSubCollection`

In [41]:
collection = storage.get_collection('my_collection')
collection.create()

texts = ['Esimene tekst.', 'Teine tekst.', 'Kolmas tekst.']

with collection.insert() as collection_insert:
    for t in texts:
        collection_insert(Text(t))

INFO:collection.py:106: new empty collection 'my_collection' created
INFO:collection.py:325: inserted 2 texts into the collection 'my_collection'


In [42]:
from estnltk.taggers import TokensTagger

tokens_tagger = TokensTagger()

collection.create_layer(tagger=tokens_tagger)

INFO:collection.py:925: collection: 'my_collection'
INFO:collection.py:944: preparing to create a new layer: 'tokens'
INFO:collection.py:977: inserting data into the 'tokens' layer table
INFO:collection.py:1012: layer created: 'tokens'


The `select` method returns a `PgSubCollection` object that provides read-only access to a subset of the collection.

In [43]:
collection.select(query=None,
                  layer_query=None,
                  layer_ngram_query=None,
                  layers=None,  # Sequence[str]
                  keys=None,  # Sequence[int]
                  collection_meta=None,  # Sequence[str]
                  progressbar=None,  # str
                  missing_layer=None,  # str
                  return_index=True  # bool
                  )

PgSubCollection(collection: 'my_collection', selected_layers=[], meta_attributes=(), progressbar=None, return_index=True)

In [44]:
for text in collection.select(progressbar='notebook', return_index=True):
    print(text)

(0, Text(text='Esimene tekst.'))
(1, Text(text='Teine tekst.'))
(2, Text(text='Kolmas tekst.'))



Get a detached layer without the `Text` object:

In [45]:
detached_layers = collection.select(return_index=False).detached_layer('tokens')
detached_layers

PgSubCollectionLayer(collection: 'my_collection', detached_layer='tokens', progressbar=None, return_index=False)

In [46]:
next(iter(detached_layers))

layer name,attributes,parent,enveloping,ambiguous,span count
tokens,,,,False,3

start,end
0,7
8,13
13,14


## Working with fragments

In [47]:
collection = storage["collection_with_fragments"].create(description='demo collection')

with collection.insert() as collection_insert:
    text1 = Text('Ööbik laulab.').tag_layer(['morph_analysis'])
    collection_insert(text1)

    text2 = Text('Öökull ei laula.').tag_layer(['morph_analysis'])
    key2 = collection_insert(text2)

    
def fragmenter(layer):
    return [layer]


tagger = VabamorfTagger(disambiguate=False, layer_name='fragmented_morph')

collection.create_fragmented_layer(tagger=tagger, fragmenter=fragmenter)

collection

INFO:collection.py:106: new empty collection 'collection_with_fragments' created
INFO:collection.py:325: inserted 1 texts into the collection 'collection_with_fragments'
INFO:collection.py:805: collection: 'collection_with_fragments'
INFO:collection.py:861: fragmented layer created: 'fragmented_morph'


layer,layer_type,attributes,ambiguous,parent,enveloping,_base,meta
compound_tokens,attached,"(type, normalized)",False,,tokens,compound_tokens,[]
fragmented_morph,fragmented,"(lemma, root, root_tokens, ending, clitic, form, partofspeech, _ignore)",True,words,,words,[]
morph_analysis,attached,"(lemma, root, root_tokens, ending, clitic, form, partofspeech)",True,words,,words,[]
sentences,attached,(),False,,words,sentences,[]
tokens,attached,(),False,,,tokens,[]
words,attached,"(normalized_form,)",False,,,words,[]


In [48]:
from estnltk.storage.postgres import select_raw, RowMapperRecord

table_name = 'fragment_test'
collection = storage.get_collection(table_name)
collection.create()

with collection.insert() as collection_insert:
    text1 = Text('see on esimene lause').tag_layer(["sentences"])
    collection_insert(text1)
    text2 = Text('see on teine lause').tag_layer(["sentences"])
    collection_insert(text2)

layer_fragment_name = "layer_fragment_1"
tagger = VabamorfTagger(disambiguate=False, layer_name=layer_fragment_name)
collection.old_slow_create_layer(layer_fragment_name,
                                 data_iterator=collection.select(layers=['sentences', 'compound_tokens']),
                                 row_mapper=lambda row: [
                                     RowMapperRecord(layer=tagger.tag(row[1], return_layer=True), meta=None)])

fragment_name = "fragment_1"

def row_mapper(row):
    text_id, text, meta, detached_layers = row
    parent_layer = detached_layers[layer_fragment_name]['layer']
    parent_id = detached_layers[layer_fragment_name]['layer_id']
    return [{'fragment': parent_layer, 'parent_id': parent_id},
            {'fragment': parent_layer, 'parent_id': parent_id}]

collection.create_fragment(fragment_name,
                           data_iterator=select_raw(collection=collection,
                                                    detached_layers=[layer_fragment_name]),
                           row_mapper=row_mapper,
                           create_index=False,
                           ngram_index=None)

INFO:collection.py:106: new empty collection 'fragment_test' created
INFO:collection.py:325: inserted 1 texts into the collection 'fragment_test'
INFO:collection.py:696: collection: 'fragment_test'
INFO:collection.py:707: preparing to create a new layer: 'layer_fragment_1'
INFO:collection.py:770: layer created: 'layer_fragment_1'


In [49]:
collection.delete()

In [50]:
delete_schema(storage)
storage.close()