# Workflow for searching new recall samples

This workflow draws a random sample of potentially positive examples and converts it to Label Studio format for later manual tagging.

For this specific task, we create subsamples based on the part-of-speech tags of the words that precede geographical terms.

## I. Setup

### Loading the source corpus

In [1]:
from helper_functions import load_configuration, connect_to_database

config = load_configuration('config/example_configuration.ini')
storage = connect_to_database(config)

collection = config['source_database']['collection']
collection = storage[collection]

INFO:storage.py:57: connecting to host: 'postgres.keeleressursid.ee', port: 5432, dbname: 'estonian-text-corpora', user: 'soras'
INFO:storage.py:108: schema: 'estonian_text_corpora', temporary: False, role: 'estonian_text_corpora_read'


In [2]:
collection.selected_layers = ['v171_named_entities', 'v172_geo_terms']

Read the geographical terms (extracted from WordNet) that can be part of a named geographical entity:

In [3]:
terms = []
with open('geo_terms.txt', 'r', encoding='UTF-8') as in_f:
    for line in in_f:
        if len(line.strip()) > 0:
            terms.append( line.strip() )
print(f'Loaded {len(terms)} terms.')

Loaded 63 terms.


### Local database for sampling

Initialize a SpanSampler that uses a local SQLite database for sampling.

In [4]:
import os.path
from helper_functions import load_local_configuration

config = load_local_configuration('config/example_configuration.ini')
sampling_db = config['local_database']['sqlite_file']
print(f'local_sampling_db:  {sampling_db} (exists: {os.path.exists(sampling_db)})')

local_sampling_db:  geo_terms_pos_sample.db (exists: True)
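The configuration helpers come from this project's helper_functions module and their internals are not shown here. As a rough sketch, the same section and key names could be read with the standard library's configparser. The INI content below is hypothetical: only the names source_database, collection, local_database and sqlite_file are taken from the cells above, and the collection value is a placeholder.

```python
import configparser

# Hypothetical INI content. Only the section and key names come from the
# notebook; the values and the overall layout are assumptions.
EXAMPLE_INI = """
[source_database]
collection = example_collection

[local_database]
sqlite_file = geo_terms_pos_sample.db
"""


def read_example_configuration(ini_text):
    """Parse INI text and return the resulting ConfigParser object."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser


example_config = read_example_configuration(EXAMPLE_INI)
print(example_config['source_database']['collection'])
print(example_config['local_database']['sqlite_file'])
```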


In [5]:
from span_sampler_sqlite3 import SpanSampler

sampler = SpanSampler(collection=collection, 
                      layer='v172_geo_terms', 
                      attribute='lemma', 
                      termsfile='geo_terms.txt', 
                      db_file_name=config['local_database']['sqlite_file'], 
                      verbose=True)

Loaded 63 terms from geo_terms.txt.


Before sampling, every term that is subject to sampling must be looked up in the large source database and its locations recorded in a local database, so that the sampling itself is quick. This searching and indexing step can take several hours.

**Note:** If the local database is already populated with term locations, the next command produces a warning and skips the creation step. To rebuild the local database from scratch, delete the local database file before creating the SpanSampler.

In [6]:
# Build the term index (can take several hours if done from scratch)
sampler.create_attribute_locations_table()
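The warn-and-skip behaviour described in the note can be illustrated with plain sqlite3. This is a minimal sketch and not SpanSampler's actual implementation: the table name term_locations, its columns and the function create_locations_table are all hypothetical.

```python
import sqlite3
import warnings


def create_locations_table(connection, table='term_locations'):
    """Create the (hypothetical) locations table unless it already exists.

    Returns True if the table was created, False if it was already there.
    """
    cursor = connection.cursor()
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name=?",
        (table,))
    if cursor.fetchone() is not None:
        warnings.warn(f'Table {table!r} already exists, skipping indexing.')
        return False
    # Assumed schema: one row per term occurrence found in the source corpus.
    cursor.execute(
        f'CREATE TABLE {table} ('
        'text_id INTEGER, span_start INTEGER, span_end INTEGER, '
        'preceding_pos TEXT)')
    connection.commit()
    return True


demo_connection = sqlite3.connect(':memory:')
first_call = create_locations_table(demo_connection)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the expected warning in the demo
    second_call = create_locations_table(demo_connection)
print(first_call, second_call)
```

Checking for the table first keeps the expensive indexing step from being repeated by accident when the notebook is re-run.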



Once the indexing is complete, we can start sampling from the local database.
To create samples, call the sampler with the number of samples wanted and a filter. The filter is a list of attribute values, here the part-of-speech tags of the words preceding geographical terms.

As a test, let's sample geographical terms preceded by different types of adjectives ("A", "C", "U"):

In [7]:
samples_adjectives = sampler(count=1000, attribute_values=("A", "C", "U"))
display(samples_adjectives[:3])

[(Text(text='“ Tallinna vanalinnas mistahes tehinguid tehes tuleb arvestada kümneprotsendise altkäemaksuga , ” kinnitavad Luubi hästiinformeeritud allikad .'),
  (54915, 115, 141, 'A')),
 (Text(text='Taas jookseb kuum juga üle selja , käed tõmbuvad higiseks , süda klopib .'),
  (55615, 13, 22, 'A')),
 (Text(text='Mõne saare või terve mandri ( Atlantise ? ) merrevajumisest veelgi rohkem võib kaasaja inimese läbi raputada hoopiski tsivilisatsiooni hukk “ otse meie silme all ” .'),
  (56776, 15, 27, 'A'))]
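Each tuple in the output above pairs a text with what appear to be an identifier, character offsets and the part-of-speech tag of the preceding word. Filtered random sampling of such rows from a local SQLite table can be sketched as follows. The table name, the column names and the function sample_locations are assumptions for illustration; the real SpanSampler schema is not shown in this notebook.

```python
import sqlite3


def sample_locations(connection, count, attribute_values):
    """Draw up to `count` random rows whose tag is in `attribute_values`."""
    values = tuple(attribute_values)
    # One "?" placeholder per value, so a single tag works as well as many.
    placeholders = ', '.join('?' for _ in values)
    query = (
        'SELECT text_id, span_start, span_end, preceding_pos '
        'FROM term_locations '
        f'WHERE preceding_pos IN ({placeholders}) '
        'ORDER BY RANDOM() LIMIT ?')
    return connection.execute(query, values + (count,)).fetchall()


# Small in-memory table with invented rows, only for demonstration.
demo_connection = sqlite3.connect(':memory:')
demo_connection.execute(
    'CREATE TABLE term_locations ('
    'text_id INTEGER, span_start INTEGER, span_end INTEGER, '
    'preceding_pos TEXT)')
demo_rows = [(1, 13, 22, 'A'), (2, 15, 27, 'A'), (3, 0, 9, 'S'), (4, 4, 12, 'C')]
demo_connection.executemany(
    'INSERT INTO term_locations VALUES (?, ?, ?, ?)', demo_rows)

adjective_rows = sample_locations(demo_connection, 2, ('A', 'C', 'U'))
print(adjective_rows)
```

Because the rows are ordered randomly, the printed result differs between runs.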

In [8]:
samples_adjectives[0][0]

text
"“ Tallinna vanalinnas mistahes tehinguid tehes tuleb arvestada kümneprotsendise altkäemaksuga , ” kinnitavad Luubi hästiinformeeritud allikad ."

0,1
file,aja_luup_1998_06.xml
sent_end,6319
sent_start,6176
subcorpus,aja_luup
text_no,1054
title,Tallinna vanalinna viimsed päevad Vanalinna 1510 hoonet ähvardab häving
type,artikkel

layer name,attributes,parent,enveloping,ambiguous,span count
v172_geo_terms,lemma,,,True,1


In [9]:
samples_adjectives[0][0]['v172_geo_terms']

layer name,attributes,parent,enveloping,ambiguous,span count
v172_geo_terms,lemma,,,True,1

text,lemma
allikad,allikas


## II. Creating unlabelled samples

In [10]:
import os, os.path
from copy import copy
from estnltk.converters.label_studio.label_studio import LabelStudioExporter

output_dir = 'unlabelled/pos_terms_1000'
os.makedirs(output_dir, exist_ok=True)

# Take samples for every part-of-speech tag. For the tag list, see:
# https://github.com/estnltk/estnltk/blob/main/tutorials/nlp_pipeline/B_morphology/00_tables_of_morphological_categories.ipynb
# Note that some of these samples are redundant: not every tag can occur inside a named entity phrase
for partofspeech in ('A','C','D','G','H','I','J','K','N','O','P','S','U','V','X','Y','Z'):
    samples = sampler(count=1000, attribute_values="('"+partofspeech+"')")

    for text, sample_span in samples:
        spanstart = sample_span[1]
        spanend = sample_span[2]
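        # The sampled span covers the preceding word plus the geo term
        # (see the offsets in the output of cell 7). Remove the geo term
        # spans that share neither their start nor their end with it.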
        for span in copy(text.v172_geo_terms.spans):
            if span.start != spanstart and span.end != spanend:
                text.v172_geo_terms.remove_span(span)
    
    output_path = os.path.join(output_dir, "pos_"+partofspeech+"_1000.json")
    exporter = LabelStudioExporter(output_path, ['v172_geo_terms'], checkbox=True)

    only_texts = [sample[0] for sample in samples]

    exporter.convert(only_texts, append=False)
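The span filter in the loop above can be checked on plain offset tuples. The sketch below applies the same condition to data modelled on the second sentence in the output of cell 7, where the sampled span (13, 22) covers "kuum juga" and the term "juga" is (18, 22). The function keep_sampled_term and the second span (40, 45) are invented for illustration.

```python
def keep_sampled_term(term_spans, sample_start, sample_end):
    """Drop term spans that share neither boundary with the sampled span."""
    return [
        (start, end) for start, end in term_spans
        # Same condition as in the loop: a span is removed only when both
        # its start and its end differ from the sampled span.
        if not (start != sample_start and end != sample_end)
    ]


# (18, 22) shares its end offset with the sampled span and is kept;
# the invented span (40, 45) shares nothing and is dropped.
kept = keep_sampled_term([(18, 22), (40, 45)], 13, 22)
print(kept)
```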

In [12]:
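# Print the labeling interface configuration. Here partofspeech still holds
# the last tag from the loop above.
output_path = os.path.join(output_dir, "pos_"+partofspeech+"_1000.json")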
exporter = LabelStudioExporter(output_path, ['v172_geo_terms'], checkbox=True)
print(exporter.interface_generator())


        <View>
            <Labels name="label" toName="text">
	<Label value="v172_geo_terms" background="#04DA21"/> 

            </Labels>
        <Text name="text" value="$text"/>
            
            </View>
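The exact JSON written by LabelStudioExporter is not shown in this notebook. For orientation, Label Studio's general pre-annotation task format pairs a data entry with predictions whose from_name and to_name match the names in the interface above ("label" and "text"). The sketch below builds one such task by hand. The function make_task is hypothetical and the exporter's real output may differ.

```python
import json


def make_task(text, spans, label, from_name='label', to_name='text'):
    """Build one Label Studio task dict with pre-annotated text spans."""
    results = []
    for start, end in spans:
        results.append({
            'from_name': from_name,  # name of the <Labels> tag
            'to_name': to_name,      # name of the <Text> tag
            'type': 'labels',
            'value': {
                'start': start,
                'end': end,
                'text': text[start:end],
                'labels': [label],
            },
        })
    # The "text" key under "data" corresponds to value="$text" in the interface.
    return {'data': {'text': text}, 'predictions': [{'result': results}]}


sentence = ('Taas jookseb kuum juga üle selja , käed tõmbuvad higiseks , '
            'süda klopib .')
task = make_task(sentence, [(18, 22)], 'v172_geo_terms')
print(json.dumps([task], ensure_ascii=False, indent=2))
```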
