# Sampling `Text` objects and layers from a PostgreSQL database

EstNLTK also provides methods for shuffling a text collection and drawing samples from a collection stored in a PostgreSQL database. 

### Preparing data

To demonstrate these methods, let us first define a function that generates example data:

In [1]:
from estnltk import Text, logger

def create_test_collection_of_docs():
    # Generates a test set containing 100 documents and 2500 sentences
    # 1) Define subject words and phrases:
    subj_words = ['kiisumiisu', 'vanahärra', 'vanama', 'neiu', 'tuttav',
                  'filharmoonik', 'sahin', 'kärbes', 'teleskoop', 'võsalendur',
                  'kapsauss', 'klaverijalg', 'sugulane', 'viiuldaja', 'temake',
                  'kvantarvuti', 'puhvet', 'kuldlõige', 'proua', 'kahvel',
                  'peremees', 'kaalujälgija', 'lõkats', 'vintraud', 'vahvel']
    new_subj_words = []
    for adj in ['esimene', 'teine', 'kolmas', 'neljas']:
        for subj in subj_words:
            new_subj_words.append( adj+' '+subj )
    subj_words = new_subj_words
    # 2) Define verbs and objects:
    verb_words = ['loeb', 'keedab', 'kasvatab', 'kiigutab', 'organiseerib']
    obj_words = ['raamatut', 'ruutmeetreid', 'kartuleid', 'kohvrit', 'distantsõpet']
    test_texts = []
    sentence_counter = 0
    # 3) Generate one document per subject
    for subj in subj_words:
        subj_text = []
        # Generate sentences for the text
        for verb in verb_words:
            for obj in obj_words:
                text_str = (' '.join([subj, verb, obj])).capitalize()+'.'
                subj_text.append( text_str )
                sentence_counter += 1
        text = Text( '\n'.join(subj_text) )
        text.meta['text_id'] = len(test_texts)
        text.meta['text_actor'] = subj.capitalize()
        test_texts.append( text )
    logger.info('Total {} sentences generated.'.format(sentence_counter))
    logger.info('Total {} texts generated.'.format(len(test_texts)))
    return test_texts

In [2]:
test_texts = create_test_collection_of_docs()

INFO:<ipython-input-1-48445a78a642>:34: Total 2500 sentences generated.
INFO:<ipython-input-1-48445a78a642>:35: Total 100 texts generated.


Connect to the database and set up the schema:

In [3]:
from estnltk.storage.postgres import PostgresStorage, create_schema, delete_schema

storage = PostgresStorage(pgpass_file='~/.pgpass', dbname='test_db', schema='my_schema')
create_schema(storage)

INFO:storage.py:42: connecting to host: 'localhost', port: '5432', dbname: 'test_db', user: 'postgres'
INFO:storage.py:58: schema: 'my_schema', temporary: False, role: 'postgres'


Create a new collection. Add tokenization layers to `Text` objects and insert them into the collection:

In [4]:
from collections import OrderedDict

collection = storage['sampling_test_collection']
collection.create(meta=OrderedDict([('text_actor', 'str')]))

with collection.insert() as collection_insert:
    for text_obj in test_texts:
        # Add tokenization layers to the Text
        text_obj.tag_layer('sentences')
        # Insert Text into the database
        collection_insert(text_obj, meta_data={'text_actor':text_obj.meta['text_actor'] })

INFO:collection.py:101: new empty collection 'sampling_test_collection' created
INFO:collection_text_object_inserter.py:110: inserted 100 texts into the collection 'sampling_test_collection'


### Permuting the collection

By default, a collection's `select()` returns texts in the order of their insertion (i.e. sorted by `text_id`):

In [5]:
for text_id, text_obj in collection.select().head(10):
    print( text_id, text_obj.text.split('\n')[0]+'..')

0 Esimene kiisumiisu loeb raamatut...
1 Esimene vanahärra loeb raamatut...
2 Esimene vanama loeb raamatut...
3 Esimene neiu loeb raamatut...
4 Esimene tuttav loeb raamatut...
5 Esimene filharmoonik loeb raamatut...
6 Esimene sahin loeb raamatut...
7 Esimene kärbes loeb raamatut...
8 Esimene teleskoop loeb raamatut...
9 Esimene võsalendur loeb raamatut...


To retrieve texts in random order, apply the subcollection's `permutate()` method:

In [6]:
counter = 0
for text_id, text_obj in collection.select().permutate(seed=0.5):
    print( text_id, text_obj.text.split('\n')[0]+'..')
    counter += 1
    if counter > 9:
        break

65 Kolmas kvantarvuti loeb raamatut...
55 Kolmas filharmoonik loeb raamatut...
90 Neljas kvantarvuti loeb raamatut...
45 Teine peremees loeb raamatut...
14 Esimene temake loeb raamatut...
50 Kolmas kiisumiisu loeb raamatut...
26 Teine vanahärra loeb raamatut...
88 Neljas viiuldaja loeb raamatut...
93 Neljas proua loeb raamatut...
36 Teine klaverijalg loeb raamatut...


Fixing `seed` makes the permutation repeatable. The `seed` must be a float in the range -1.0 to 1.0.

<p>
<div class="alert alert-block alert-warning">
   <h4><i>Repeatability across platforms with <code>permutate()</code></i></h4> 
   <p>If you are using a PostgreSQL version &lt; 12.0, be aware that the same <code>seed</code> value gives different results on different server platforms (Windows vs Linux). With PostgreSQL versions &gt;= 12.0, the behaviour should be uniform across platforms.</p>
</div>
</p>
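
The counter-and-break loop used above can also be written with `itertools.islice`, which makes it easy to check that two passes with the same `seed` agree. The sketch below is only an illustration: `head_ids` is a hypothetical helper (not part of EstNLTK), and `fake_permutate` is a stand-in that imitates the `(text_id, text_obj)` pairs a subcollection yields, so the example runs without a database.

```python
import random
from itertools import islice

def head_ids(pairs, n):
    """Return the ids of the first n (text_id, text_obj) pairs of an iterable."""
    return [text_id for text_id, _ in islice(pairs, n)]

def fake_permutate(seed, size=100):
    # Stand-in for collection.select().permutate(seed=...): yields
    # (text_id, text_obj) pairs in a seeded random order. This is an
    # assumption for illustration; the real shuffling happens in PostgreSQL.
    ids = list(range(size))
    random.Random(seed).shuffle(ids)
    for text_id in ids:
        yield text_id, None

first_pass = head_ids(fake_permutate(seed=0.5), 10)
second_pass = head_ids(fake_permutate(seed=0.5), 10)
print(first_pass == second_pass)  # True
```

With a live collection, `head_ids(collection.select().permutate(seed=0.5), 10)` would take the place of the stand-in call.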

### Sampling `Text` objects from the collection

To draw a random sample of texts from a subcollection, apply its `sample()` method. 
The size of the sample must be given as a parameter; by default, it is interpreted as a _percentage_. 
In the following example, we draw approximately 10% of texts from the collection:

In [7]:
for text_id, text_obj in collection.select().sample( 10, seed=55 ):
    print( text_id, text_obj.text.split('\n')[0]+'..')

16 Esimene puhvet loeb raamatut...
17 Esimene kuldlõige loeb raamatut...
19 Esimene kahvel loeb raamatut...
26 Teine vanahärra loeb raamatut...
36 Teine klaverijalg loeb raamatut...
54 Kolmas tuttav loeb raamatut...
77 Neljas vanama loeb raamatut...
78 Neljas neiu loeb raamatut...
94 Neljas kahvel loeb raamatut...
99 Neljas vahvel loeb raamatut...


The `seed` parameter (a positive integer) can be fixed to make the sample repeatable.
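
The word "approximately" is worth illustrating. If each text is included independently with the given probability (Bernoulli-style sampling, which is an assumption made here for illustration and not something this tutorial states about EstNLTK's internals), the sample size varies from seed to seed. A minimal pure-Python sketch, where `bernoulli_sample` is a hypothetical helper:

```python
import random

def bernoulli_sample(ids, percentage, seed):
    """Keep each id independently with probability percentage/100."""
    rng = random.Random(seed)
    return [i for i in ids if rng.random() < percentage / 100.0]

ids = range(100)
# Draw a 10% sample with 20 different seeds and record the sample sizes
sizes = [len(bernoulli_sample(ids, 10, seed)) for seed in range(1, 21)]
# Print the smallest, largest and mean sample size over the 20 draws
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

Under this assumption, a fixed seed reproduces the same sample, but the requested amount only sets the expected size, not the exact one.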

Instead of drawing a percentage, you can use `amount_type='SIZE'` to request a specific number of texts. For instance, let's draw _approximately_ 15 texts from the collection:

In [8]:
for text_id, text_obj in collection.select().sample( 15, amount_type='SIZE', seed=25 ):
    print( text_id, text_obj.text.split('\n')[0]+'..')

0 Esimene kiisumiisu loeb raamatut...
5 Esimene filharmoonik loeb raamatut...
10 Esimene kapsauss loeb raamatut...
14 Esimene temake loeb raamatut...
17 Esimene kuldlõige loeb raamatut...
33 Teine teleskoop loeb raamatut...
35 Teine kapsauss loeb raamatut...
37 Teine sugulane loeb raamatut...
42 Teine kuldlõige loeb raamatut...
47 Teine lõkats loeb raamatut...
54 Kolmas tuttav loeb raamatut...
65 Kolmas kvantarvuti loeb raamatut...
69 Kolmas kahvel loeb raamatut...
73 Kolmas vintraud loeb raamatut...
74 Kolmas vahvel loeb raamatut...
78 Neljas neiu loeb raamatut...
92 Neljas kuldlõige loeb raamatut...


_Note:_ Be aware that regardless of the `amount_type`, the number of returned texts **may not correspond exactly to the given amount** -- there can be some fluctuation in size. If you need a sample of an exact size, it is advisable to sample more texts than needed, shuffle the result (so that every element has a fair chance of ending up in the final sample), and then cut the sample to the required size.
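
The advice above can be sketched in plain Python. `exact_size_sample` is a hypothetical helper, not an EstNLTK function, and the id list simply reuses the ids printed by the `SIZE` example above as a stand-in for an oversampled draw.

```python
import random

def exact_size_sample(candidates, size, seed=None):
    """Shuffle an oversampled list and cut it to exactly `size` items."""
    candidates = list(candidates)
    if len(candidates) < size:
        raise ValueError('oversample more: got {} items, need {}'.format(
            len(candidates), size))
    # Shuffle first, so that the cut does not favour the lowest ids
    random.Random(seed).shuffle(candidates)
    return candidates[:size]

# Ids returned by the SIZE example above (17 texts for a requested 15)
oversampled_ids = [0, 5, 10, 14, 17, 33, 35, 37, 42, 47, 54,
                   65, 69, 73, 74, 78, 92]
final_ids = exact_size_sample(oversampled_ids, 15, seed=1)
print(len(final_ids))  # 15
```

In practice, you would request noticeably more texts than needed, so that the draw is very unlikely to fall short of the target size. `random.sample(candidates, size)` achieves the same shuffle-and-cut in a single call.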

### Sampling from a collection's layer

The subcollection's `sample_from_layer()` method lets you draw a random sample of spans from a specific layer.

To use the method, you first need to include the target layer in `select()`. For example, to sample from the 'sentences' layer, first call `select(layers=['sentences'])` on the collection.
Then you can apply `sample_from_layer()` to the resulting subcollection. For instance, let us sample _approximately_ 5% of sentences from the collection:

In [9]:
for text_id, text_obj in collection.select(layers=['sentences']).sample_from_layer('sentences', 5, seed=0.5):
    # Output text id
    print('selected text id:', text_id, '|', end=' ')
    # Output randomly selected sentences
    print( 'selected sentences ({}):'.format(len(text_obj.sentences)) )
    for sentence in text_obj.sentences:
        print( '  ', sentence.enclosing_text )

selected text id: 4 | selected sentences (2):
   Esimene tuttav keedab distantsõpet.
   Esimene tuttav organiseerib ruutmeetreid.
selected text id: 5 | selected sentences (1):
   Esimene filharmoonik loeb raamatut.
selected text id: 6 | selected sentences (3):
   Esimene sahin loeb ruutmeetreid.
   Esimene sahin keedab ruutmeetreid.
   Esimene sahin kiigutab kartuleid.
selected text id: 7 | selected sentences (2):
   Esimene kärbes kasvatab raamatut.
   Esimene kärbes organiseerib ruutmeetreid.
selected text id: 8 | selected sentences (2):
   Esimene teleskoop kasvatab distantsõpet.
   Esimene teleskoop kiigutab ruutmeetreid.
selected text id: 10 | selected sentences (3):
   Esimene kapsauss loeb kartuleid.
   Esimene kapsauss loeb kohvrit.
   Esimene kapsauss kiigutab raamatut.
selected text id: 11 | selected sentences (2):
   Esimene klaverijalg kiigutab kohvrit.
   Esimene klaverijalg organiseerib ruutmeetreid.
selected text id: 15 | selected sentences (2):
   Esimene kvantarvuti ka

Although `sample_from_layer()` returns `Text` objects with their full textual content, the sampled layer contains only the _randomly chosen spans_:

In [10]:
# Display sampled sentences from the first Text object
for text_id, text_obj in collection.select(layers=['sentences']).sample_from_layer('sentences', 5, seed=0.5):
    print('selected text id:', text_id)
    text_obj.sentences.display()
    break

selected text id: 4


Note that sampling _applies only to the target layer_ -- all other selected layers are returned in full.

Additional notes about `sample_from_layer()`:

   * Fixing `seed` (a float in the range -1.0 to 1.0) makes the sampling repeatable;
   
    * However, with PostgreSQL versions &lt; 12.0, the same `seed` gives different results on different platforms (Windows vs Linux). With versions &gt;= 12.0, the behaviour should be uniform across platforms.


   * Instead of drawing a percentage of spans, you can use `amount_type='SIZE'` to request a specific number of spans;
   
    * However, regardless of the `amount_type`, the number of returned spans **may not correspond exactly to the given amount** -- there can be some fluctuation in size.
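
Because the sampled spans are spread over several `Text` objects, it can be convenient to flatten them into a single list, for example to count them. The sketch below is a minimal illustration: `sampled_sentences` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic only the two attributes used in the examples above (`text_obj.sentences` and `sentence.enclosing_text`).

```python
from types import SimpleNamespace

def sampled_sentences(pairs):
    """Yield (text_id, sentence_text) for every sampled sentence span."""
    for text_id, text_obj in pairs:
        for sentence in text_obj.sentences:
            yield text_id, sentence.enclosing_text

# Stand-in data (an assumption for illustration). With a live database, pass
# collection.select(layers=['sentences']).sample_from_layer('sentences', 5, seed=0.5)
fake_pairs = [
    (4, SimpleNamespace(sentences=[
        SimpleNamespace(enclosing_text='Esimene tuttav keedab distantsõpet.'),
        SimpleNamespace(enclosing_text='Esimene tuttav organiseerib ruutmeetreid.')])),
    (5, SimpleNamespace(sentences=[
        SimpleNamespace(enclosing_text='Esimene filharmoonik loeb raamatut.')])),
]
rows = list(sampled_sentences(fake_pairs))
print(len(rows))  # 3
```

Counting the flattened rows shows how many spans were actually drawn, which can then be compared with the requested amount.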

Finally, clean up the database and disconnect:

In [11]:
collection.delete()

In [12]:
delete_schema(storage)
storage.close()