# Workflow for searching new recall samples

Draws a random sample of potential positive examples and converts it to Label Studio format for manual tagging. For this task, we draw three separate subsamples based on the geographic term. Grouping the samples this way improves the accuracy of the recall estimate, because the tagger is likely to perform differently on each group. The groups are: the most common geographic terms, geographic terms that are homonyms of a more common word, and all remaining terms.
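
The notebook does not show how the three subsamples are combined into one recall figure. The sketch below is one common way to do it, not necessarily the one used in this project: each group's annotation counts are scaled up by the ratio of group size to sample size before they are pooled. The field names (`population`, `sampled`, `positives`, `found`) and all numbers are made up for illustration.

```python
def stratified_recall(strata):
    """Combine per-group annotation counts into one recall estimate.

    Each group is a dict with (hypothetical field names):
      population: number of candidate spans in the group
      sampled:    number of spans sent to manual tagging
      positives:  sampled spans the annotator marked as true entities
      found:      those positives that the tagger also found
    """
    found_total = 0.0
    positive_total = 0.0
    for group in strata:
        # Scale sample counts up to the size of the whole group
        scale = group["population"] / group["sampled"]
        found_total += scale * group["found"]
        positive_total += scale * group["positives"]
    if positive_total == 0:
        raise ValueError("no positives in any group")
    return found_total / positive_total


# Made-up counts for two groups
strata = [
    {"population": 1000, "sampled": 100, "positives": 50, "found": 40},
    {"population": 200, "sampled": 100, "positives": 80, "found": 40},
]
estimate = stratified_recall(strata)
print(round(estimate, 4))
```

Pooling the raw sample counts without scaling would overweight small groups that were sampled as heavily as large ones.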

## I. Setup

### Loading the corpus

In [3]:
from estnltk.storage.postgres import PostgresStorage
import configparser
import os

config = configparser.ConfigParser()
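# Build the path portably; '\e' inside a normal string literal is an invalid escape sequence
config_file = os.path.join('config', 'example_configuration.ini')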

file_name = os.path.abspath(os.path.expanduser(os.path.expandvars(str(config_file))))
config.read(file_name)

dbname = config['source_database']['database']
user = config['source_database']['username']
password = config['source_database']['password']
host = config['source_database']['host']
port = config['source_database']['port']
role = config['source_database']['role']
schema = config['source_database']['schema']
collection = config['source_database']['collection']


storage = PostgresStorage(host=host,
                          port=int(port),
                          dbname=dbname,
                          user=user,
                          password=password,
                          schema=schema,
                          role=role,
                          temporary=False)

display(storage)

collection = storage[collection]

collection.selected_layers = ['v171_named_entities','v172_geo_terms']

INFO:storage.py:58: connecting to host: 'postgres.keeleressursid.ee', port: 5432, dbname: 'estonian-text-corpora', user: 'rasmusm'
INFO:storage.py:108: schema: 'estonian_text_corpora', temporary: False, role: 'estonian_text_corpora_read'


| collection | version | relations | rows | total_size | comment |
|---|---|---|---|---|---|
| koondkorpus_base_subset_of_5000_v2 | 2.0 | | 0 | 12 MB | Collection of 5000 randomly picked Koondkorpus texts (v2) |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_sentences_flat__la | 0 | 5544 kB | created by soras on Fri Jun 12 11:28:06 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_words__layer | 0 | 19 MB | created by soras on Fri Jun 12 09:15:46 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_words_morph_analys | 0 | 96 MB | Morphological analysis from v1.6.2/3, probably based on commit 349a7c2 (2018-11-22) |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | structure | 2 | 32 kB | |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_compound_tokens__layer | 0 | 5472 kB | created by soras on Thu Jun 4 12:29:42 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_morph_analysis__layer | 0 | 97 MB | created by soras on Tue Jun 9 14:13:07 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_sentences__layer | 0 | 21 MB | created by soras on Tue Jun 9 06:01:41 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_tokens__layer | 0 | 20 MB | created by soras on Thu Jun 4 07:40:39 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_words__layer | 0 | 20 MB | created by soras on Fri Jun 5 05:49:26 2020 |
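
The loading cell above reads eight options from a `[source_database]` section. As a minimal sketch of what such a file could contain, the snippet below parses an in-memory INI text and checks that every option is present. The option names are the ones the loading code reads; all values are placeholders, not the real connection details.

```python
import configparser

# Placeholder values only; the option names match those read by the loading cell
EXAMPLE_INI = """
[source_database]
host = localhost
port = 5432
database = example_db
username = example_user
password = example_password
role = example_role
schema = example_schema
collection = example_collection
"""

REQUIRED = ["host", "port", "database", "username",
            "password", "role", "schema", "collection"]

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

source = config["source_database"]
# Collect any option the loading code would fail on
missing = [option for option in REQUIRED if option not in source]
port = source.getint("port")  # parsed as an integer

print("missing options:", missing)
print("port:", port)
```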


Read the geographical terms from WordNet that can be part of a named geographical entity.

In [123]:
# One term per line; blank lines are skipped.
# (The original loop compared strings with `is not`, which tests identity, not equality.)
with open('geo_terms.txt', 'r', encoding='UTF-8') as f:
    terms = [line.strip() for line in f if line.strip()]

### Span Sampler

A Python class for sampling spans. It stores the span locations of each term in a table in a local database and expands them into one row per span for sampling. This means that each term can be sampled separately.

In [124]:
from random import sample, choices
from estnltk.storage.postgres import LayerQuery, IndexQuery
from tqdm import tqdm

class SpanSampler:
    
    def __init__(self, storage, collection, layer, attribute):
        self.storage = storage
        self.conn = storage.conn
        self.cur = self.conn.cursor()
        self.collection = collection
        self.layer = layer
        self.attribute = attribute
    
    def __call__(self, count, attribute, return_index=False, with_replacement=True):
        # Returns a list of (Text, Span) tuples, or (int, Text, Span) tuples if return_index is True.
        # count: the number of samples to draw
        # attribute: tuple of attribute values whose spans are sampled
        # with_replacement: whether the same span may be drawn several times
        self.conn.commit()
        self.create_sampling_matrix(attribute)
        indices = self.find_sampled_indices(count, with_replacement)
        result_list = []
        only_txt_index = [idx[1] for idx in indices]
        # Use the collection given to the constructor, not a global variable
        texts = list(self.collection.select(query=IndexQuery(only_txt_index), layers=[self.layer], return_index=True))
        for text_id, text in texts:
            # Only the first sampled row belonging to this text is used
            idx = [index for index in indices if text_id == index[1]][0]
            if return_index:
                result_list.append((idx[0], text, text[self.layer][idx[2]]))
            else:
                result_list.append((text, text[self.layer][idx[2]]))
        self.clear_sampling_matrix()
        return result_list
    
    def attribute_locations_creation(self):
        self.conn.commit()
        self.cur.execute("""SELECT EXISTS (
           SELECT FROM information_schema.tables 
           WHERE  table_schema = 'public'
           AND    table_name   = 'attribute_locations'
           );""")
        res = self.cur.fetchall()
        if not res[0][0]:
            self.cur.execute("CREATE TABLE attribute_locations (layer_id integer, attribute_value varchar, indices integer[], count integer);")
            self.conn.commit()
            # `terms` is the global list read from geo_terms.txt above
            for term in terms:
                q = LayerQuery(self.layer, **{self.attribute: term})
                for key, txt in tqdm(self.collection.select(query=q, layers=[self.layer])):
                    # Positions of the spans in this text whose first attribute value equals the term
                    indices = [i for i, values in enumerate(txt[self.layer][self.attribute]) if values[0] == term]
                    self.cur.execute("INSERT INTO attribute_locations (layer_id, attribute_value, indices, count) VALUES (%s, %s, %s, %s)", (key, term, indices, len(indices)))

        self.conn.commit()

        
    def create_sampling_matrix(self, attribute_val):
        self.cur.execute("CREATE TABLE sampling_matrix (id serial, layer integer, layer_index integer);")
        # Pass the values as a query parameter instead of concatenating str(tuple):
        # a one-element tuple would otherwise render as ('x',), which is invalid SQL
        self.cur.execute("INSERT INTO sampling_matrix (layer, layer_index) (SELECT layer_id AS layer, unnest(indices) AS layer_index FROM attribute_locations WHERE attribute_value IN %s);", (tuple(attribute_val),))
        self.conn.commit()
    
    def find_sampled_indices(self, count, with_replacement):
        self.cur.execute("SELECT COUNT(*) FROM sampling_matrix;")
        span_count = self.cur.fetchall()[0][0]
        self.conn.commit()
        # The serial id column starts at 1, so sample from 1..span_count
        ids = range(1, span_count + 1)
        if with_replacement:
            sampled = choices(ids, k=count)
        else:
            sampled = sample(ids, count)
        # Note: IN returns each matching row once, so an id drawn several times yields a single row
        self.cur.execute("SELECT * FROM sampling_matrix WHERE id IN %s;", (tuple(sampled),))
        return self.cur.fetchall()
    
    def clear_sampling_matrix(self):
        self.conn.commit()
        self.cur.execute("DROP TABLE sampling_matrix;")
        self.conn.commit()
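
The table logic of the class can be followed without a database. The sketch below mimics it in plain Python under simplifying assumptions: `attribute_locations` is a list of `(text_id, value, span_indices)` tuples, the sampling matrix is a list with 1-based ids like a Postgres `serial` column, and sampling draws whole rows. The function names and the data are invented for illustration and are not part of estnltk.

```python
from random import Random


def build_sampling_matrix(attribute_locations, wanted_values):
    """Expand (text_id, value, span_indices) rows into one row per span."""
    matrix = []
    for text_id, value, span_indices in attribute_locations:
        if value in wanted_values:
            for span_index in span_indices:
                # ids are 1-based, like a serial column
                matrix.append((len(matrix) + 1, text_id, span_index))
    return matrix


def sample_rows(matrix, count, with_replacement=True, seed=None):
    """Draw `count` rows, with or without replacement."""
    rng = Random(seed)
    if with_replacement:
        return rng.choices(matrix, k=count)
    return rng.sample(matrix, count)


# Made-up locations: text 10 contains two "saar" spans and one "linn" span, etc.
locations = [
    (10, "saar", [0, 4]),
    (10, "linn", [2]),
    (11, "saar", [7]),
    (12, "laht", [1, 3]),
]

matrix = build_sampling_matrix(locations, {"saar", "laht"})
print(matrix)

picked = sample_rows(matrix, count=3, with_replacement=False, seed=0)
print(picked)
```

Because every span gets its own row, a term that occurs often is proportionally more likely to be drawn than a rare one.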
        

Initialize a local Postgres storage. It is needed for the temporary table of spans from which the samples are drawn, because that table should not be a public table.

In [1]:
# Second storage to keep the temporary lists used for sampling

# load configuration

import configparser
import os

config = configparser.ConfigParser()
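# Build the path portably; '\e' inside a normal string literal is an invalid escape sequence
config_file = os.path.join('config', 'example_configuration.ini')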

file_name = os.path.abspath(os.path.expanduser(os.path.expandvars(str(config_file))))

if not os.path.exists(file_name):
    raise ValueError("File {file} does not exist".format(file=str(config_file)))

if len(config.read(file_name)) != 1:
    raise ValueError("File {file} is not accessible or is not in valid INI format".format(file=config_file))

for option in ["host", "port", "database", "username", "password", "schema", "collection"]:
    if not config.has_option('target_database', option):
        prelude = "Error in file {}\n".format(file_name) if len(file_name) > 0 else ""
        raise ValueError(
            "{prelude}Missing option {option} in the section [{section}]".format(
                prelude=prelude, option=option, section='target_database'
            )
        )


# connect to database

from estnltk.storage.postgres import PostgresStorage

dbname = config['target_database']['database']
user = config['target_database']['username']
password = config['target_database']['password']
host = config['target_database']['host']
port = config['target_database']['port']
schema = config['target_database']['schema']
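# Use a separate name so that the corpus `collection` object opened earlier is not overwritten
local_collection = config['target_database']['collection']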

localstorage = PostgresStorage(host=host,
                          port=int(port),
                          dbname=dbname,
                          user=user,
                          password=password,
                          schema=schema,
                          role=None,
                          temporary=False)


The sampler is initialized with the localstorage connection and the collection opened above. If the table _attribute_locations_ does not exist yet, the method _attribute_locations_creation_ creates it in the local database: for every text it records the indices of the matching spans in the layer _v172_geo_terms_ together with the value of their _lemma_ attribute.

The samples are then drawn by calling the sampler with the number of samples wanted and a filter, which is a tuple of values of the attribute specified earlier (lemma).

In [126]:
sampler = SpanSampler(storage=localstorage,collection=collection, layer='v172_geo_terms',attribute='lemma')

#localstorage.conn.commit()

#sampler.clear_sampling_matrix()

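# filtered_terms must hold the lemmas of the subsample being drawn;
# it is not defined in the cells shown here
samples = sampler(count=1000, attribute=tuple(filtered_terms))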

display(samples[:3])

## II. Creating unlabelled samples 

After the samples are created, they are written to a pickle file so that they can easily be reused later or elsewhere.

Currently, there is no built-in way to check for duplicates. All pairs of items have to be compared in a loop.
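
One possible alternative to comparing all pairs is to group the items by a key and report the keys that occur more than once. The helper below is a hypothetical sketch, not part of estnltk; the demo data are plain tuples standing in for the real samples.

```python
from collections import defaultdict


def find_duplicates(items, key):
    """Return {key: [positions]} for every key that occurs more than once."""
    positions = defaultdict(list)
    for position, item in enumerate(items):
        positions[key(item)].append(position)
    return {k: v for k, v in positions.items() if len(v) > 1}


# Stand-in data: (text, span start, span end)
demo = [
    ("Tartu linn", 6, 10),
    ("Saaremaa saar", 9, 13),
    ("Tartu linn", 6, 10),
]

duplicates = find_duplicates(demo, key=lambda item: item)
print(duplicates)
```

For the real samples, the key could be something like `lambda s: (s[0].text, s[1].start, s[1].end)`, assuming the sampled `Text` and `Span` objects expose `text`, `start` and `end`.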

In [156]:

import pickle

with open("1000_ulejaanud.pickle",'wb') as f:
    pickle.dump(samples,f)

Once a span is sampled, take its text and remove all other spans from it so that only the sampled span is displayed.

In [157]:
from copy import copy

for text, sample_span in samples:
    for span in copy(text.v172_geo_terms.spans):
        if span != sample_span:
            text.v172_geo_terms.remove_span(span)

### Getting sentences to labelstudio format

The Label Studio exporter writes a Label Studio JSON file to the path given as its first argument. This file should then be imported into the project you set up in Label Studio. Label Studio offers several ready-made labeling interfaces, but an interface can also be defined in code. The code printed from _exporter.labeling_interface_ can be copied into the code view of the labeling interface settings.

In [130]:
from estnltk.converters.label_studio.label_studio import LabelStudioExporter

exporter = LabelStudioExporter("koond_1000_ulejaanud.json",'v172_geo_terms',checkbox=True)

print(exporter.labeling_interface)

only_texts = [sample[0] for sample in samples]

exporter.convert(only_texts,append=False)
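
For orientation, the sketch below builds one task in the general shape that Label Studio accepts for pre-annotated text: a `data` object plus a `predictions` list with character offsets. This is an assumption about the import format in general, not a description of what `LabelStudioExporter` writes; the exporter's actual output, in particular with `checkbox=True`, may differ. The helper `make_task`, the control names `label` and `text`, and the example sentence are all hypothetical.

```python
import json


def make_task(text, start, end, label, from_name="label", to_name="text"):
    """Build one pre-annotated task; from_name/to_name must match the labeling interface."""
    return {
        "data": {"text": text},
        "predictions": [{
            "result": [{
                "from_name": from_name,
                "to_name": to_name,
                "type": "labels",
                "value": {
                    "start": start,
                    "end": end,
                    "text": text[start:end],
                    "labels": [label],
                },
            }]
        }],
    }


task = make_task("Emajogi voolab labi Tartu.", 0, 7, "v172_geo_terms")
print(json.dumps([task], indent=2))
```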