# Workflow for searching new recall samples

This workflow draws a random sample of potential positive examples and converts them to Label Studio format for manual tagging. For this task, we create three subsamples based on the geographic term. Grouping the samples into strata where the tagger is likely to perform differently improves the accuracy of the recall estimate. The three groups are: the most common geographic terms, geographic terms that are homonyms of a more common word, and all the rest.
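The idea behind the grouping can be sketched as a stratified estimate of a proportion: each group contributes its own sample proportion, weighted by the group's share of the whole population. This is only an illustration of the general technique, not necessarily the exact computation used later in this project. The function name and all numbers below are hypothetical.

```python
def stratified_proportion(strata):
    """Combine per-stratum sample proportions into one population estimate.

    strata: iterable of (population_size, sample_size, positives_in_sample).
    """
    strata = list(strata)
    total_population = sum(pop for pop, _, _ in strata)
    return sum(
        (pop / total_population) * (positives / sample_size)
        for pop, sample_size, positives in strata
    )


# Hypothetical counts, for illustration only (not results from the corpus).
example_strata = [
    (50_000, 1000, 120),   # most common terms
    (200_000, 1000, 15),   # ambiguous terms
    (30_000, 1000, 60),    # all the rest
]
estimate = stratified_proportion(example_strata)
print(f"Estimated share of positives: {estimate:.4f}")
```

Weighting by population size matters because the groups are sampled with the same count (1000 each) although they may differ greatly in size.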

## I. Setup

### Loading the source corpus

In [1]:
from helper_functions import load_configuration, connect_to_database

config = load_configuration('config/example_configuration.ini')
storage = connect_to_database(config)

display(storage)

collection = config['source_database']['collection']
collection = storage[collection]

collection.selected_layers = ['v171_named_entities','v172_geo_terms']

INFO:storage.py:57: connecting to host: 'postgres.keeleressursid.ee', port: 5432, dbname: 'estonian-text-corpora', user: 'soras'
INFO:storage.py:108: schema: 'estonian_text_corpora', temporary: False, role: 'estonian_text_corpora_read'


| collection | version | relations | rows | total_size | comment |
|:---|:---|:---|:---|:---|:---|
| koondkorpus_base_subset_of_5000_v2 | 2.0 | | 0 | 12 MB | Collection of 5000 randomly picked Koondkorpus texts (v2) |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_sentences_flat__la | 0 | 5544 kB | created by soras on Fri Jun 12 11:28:06 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_words__layer | 0 | 19 MB | created by soras on Fri Jun 12 09:15:46 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | original_words_morph_analys | 0 | 96 MB | Morphological analysis from v1.6.2/3, probably based on commit 349a7c2 (2018-11-22) |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | structure | 0 | 32 kB | |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_compound_tokens__layer | 0 | 5472 kB | created by soras on Thu Jun 4 12:29:42 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_morph_analysis__layer | 0 | 97 MB | created by soras on Tue Jun 9 14:13:07 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_sentences__layer | 0 | 21 MB | created by soras on Tue Jun 9 06:01:41 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_tokens__layer | 0 | 20 MB | created by soras on Thu Jun 4 07:40:39 2020 |
| koondkorpus_base_subset_of_5000_v2 | 2.0 | v166_words__layer | 0 | 20 MB | created by soras on Fri Jun 5 05:49:26 2020 |


Read the geographical terms from WordNet that can be a part of a named geographical entity:

In [2]:
terms = []
with open('geo_terms.txt', 'r', encoding='UTF-8') as in_f:
    for line in in_f:
        if len(line.strip()) > 0:
            terms.append( line.strip() )
print(f'Loaded {len(terms)} terms.')

Loaded 63 terms.


Initialize a `SpanSampler` that uses a local SQLite database.

In [3]:
import os.path
from helper_functions import load_local_configuration

config = load_local_configuration('config/example_configuration.ini')
sampling_db = config['local_database']['sqlite_file']
print(f'local_sampling_db:  {sampling_db} (exists: {os.path.exists(sampling_db)})')

local_sampling_db:  geo_terms_sample.db (exists: True)


In [4]:
from span_sampler_sqlite3 import SpanSampler

sampler = SpanSampler(collection=collection, 
                      layer=config['source_database']['terms_layer'], 
                      attribute='lemma', 
                      termsfile='geo_terms.txt', 
                      db_file_name=config['local_database']['sqlite_file'], 
                      verbose=True)

Loaded 63 terms from geo_terms.txt.


Before sampling, all terms that are subject to sampling need to be looked up in the large source database and recorded in a smaller local database, so that sampling is quick and smooth. This searching and indexing process can take several hours.

**Note:** If the local database is already populated with term locations, the next command produces a warning and skips the local database creation. If you still want to rebuild the local database from scratch, delete the local database file before creating the `SpanSampler`.
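Deleting the local database file, as described in the note, could look like the sketch below. The helper name `remove_local_db` is hypothetical and not part of `helper_functions` or `span_sampler_sqlite3`. It would have to run before the `SpanSampler` is constructed.

```python
import os


def remove_local_db(db_path):
    """Delete the local sampling database file if it exists.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    if os.path.exists(db_path):
        os.remove(db_path)
        return True
    return False


# Assumed usage, before creating the SpanSampler:
# remove_local_db(config['local_database']['sqlite_file'])
```

Since re-indexing can take several hours, it is worth double-checking the path before removing anything.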

In [5]:
# search and index terms (can take several hours if done from scratch)
sampler.create_attribute_locations_table()



Once indexing is complete, we can start sampling from the local database.
To draw a sample, call the sampler with the number of samples wanted and a filter: a tuple of values for the attribute specified earlier (`lemma`).

In [6]:
samples = sampler(count=1000, attribute_values=tuple(terms))
display(samples[:3])

[(Text(text='Ükstaspuha , mis kanali telekas lahti teed , igal pool võid näha .'),
  Span('lahti', [{'lemma': 'laht'}])),
 (Text(text='Eduard veerib Edda käekirja pärast lahti ja toimetab üle .'),
  Span('lahti', [{'lemma': 'laht'}])),
 (Text(text='Mäletame plaane muuta suurte Siberi jõgede voolusuunda , ehitada tamm Beringi väina jne.'),
  Span('jõgede', [{'lemma': 'jõgi'}]))]

Note that our terms have been divided into **subpopulations** by their frequency/ambiguity:

|Subpopulation     | Description | Examples |
|:--- |:---|:---|
|Levinumad         | Geographic locations with most frequent suffixes | Niiluse jõgi, Aasovi meri, Peipsi järv   |
|Mitmetähenduslikud | Geographic locations with ambiguous suffixes     | Panama kanal, Panga pank, Kura kurk      |
|Ülejäänud         | Other geographic locations                       | Vaikne ookean, Liivi laht, Tehvandi mägi |

Let's load subpopulation information:

In [7]:
from helper_functions import load_term_subpopulations
term_subpopulations = load_term_subpopulations()
term_subpopulations.keys()

dict_keys(['levinumad', 'mitmetahenduslikud', 'ulejaanud'])

In [8]:
term_subpopulations['levinumad']

{'järv', 'jõgi', 'meri', 'saar'}
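Because each subpopulation is sampled separately, it can be worth confirming that the groups do not overlap and that together they cover the full term list. The sketch below assumes the groups are meant to partition the terms, which the notebook implies but does not state. `check_partition` and the toy data are hypothetical stand-ins for `terms` and `term_subpopulations`.

```python
from itertools import combinations


def check_partition(all_terms, groups):
    """Return a list of human-readable problems; an empty list means a clean partition."""
    problems = []
    # Any term that appears in two groups would be sampled twice.
    for (name_a, terms_a), (name_b, terms_b) in combinations(groups.items(), 2):
        overlap = terms_a & terms_b
        if overlap:
            problems.append(f"{name_a} and {name_b} share {sorted(overlap)}")
    # Any term in no group would never be sampled.
    covered = set().union(*groups.values())
    missing = set(all_terms) - covered
    if missing:
        problems.append(f"not in any group: {sorted(missing)}")
    return problems


# Toy stand-ins; with the real data one would pass `terms` and `term_subpopulations`.
toy_terms = ['järv', 'jõgi', 'pank', 'laht']
toy_groups = {
    'levinumad': {'järv', 'jõgi'},
    'mitmetahenduslikud': {'pank'},
    'ulejaanud': {'laht'},
}
problems = check_partition(toy_terms, toy_groups)
print(problems)
```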

Now we can draw samples from each subpopulation separately:

In [9]:
samples_levinumad = sampler(count=1000, attribute_values=tuple(term_subpopulations['levinumad']))
display(samples_levinumad[:3])

[(Text(text='Algul segas armunute vaadet imekaunile järvele ilmatu suur pruun kuur , tänaseks on see maha lõhutud .'),
  Span('järvele', [{'lemma': 'järv'}])),
 (Text(text='Ja teen kontserte selliste rituaalide ja suunitlusega , et soovime Eesti riigile samasugust saatust , mis valitseb praegu Bali saarel .'),
  Span('saarel', [{'lemma': 'saar'}])),
 (Text(text='Üks tehase omanike esindajaist nimetas “ haiglaseks urgitsemiseks ” küsimust , millise summa eest müüdi hotell Bahama saarte firmale Rahmsad Investors Ltd.'),
  Span('saarte', [{'lemma': 'saar'}]))]

In [10]:
samples_mitmetahenduslikud = sampler(count=1000, attribute_values=tuple(term_subpopulations['mitmetahenduslikud']))
display(samples_mitmetahenduslikud[:3])

[(Text(text='Ühelt poolt domineerivad börsil pankade endi aktsiad , teisalt aga tegutsevad pangad ise või oma tütarfirmade kaudu börsil maakleritena .'),
  Span('pankade', [{'lemma': 'pank'}])),
 (Text(text='Peamisi samme oli pankade kapitali adekvaatsusnormatiivi tõstmine , lisaks sellele suurendasime riskikaalusid ka kohalike omavalitsuste laenudele .'),
  Span('pankade', [{'lemma': 'pank'}])),
 (Text(text='Kui uued juhid olid panga paremini tööle pannud , müüsime oma osa järgmisele pangale juba tunduvalt suurema summa eest , ” on pankade saneerimist lähedalt näinud Preatoni konkreetne .'),
  Span('pangale', [{'lemma': 'pank'}]))]

In [11]:
samples_ulejaanud = sampler(count=1000, attribute_values=tuple(term_subpopulations['ulejaanud']))
display(samples_ulejaanud[:3])

[(Text(text='Suudan tõtt vaadata ainult nende asjadega , mis on mu ninast paari sentimeetri kaugusel .'),
  Span('ninast', [{'lemma': 'ninas'}])),
 (Text(text='Kui nooremas ja depressiivsemas vanuses oli mul selles kahtlusi , siis praeguseks olen jõudnud tõdemuseni , et see on väga lahe - lihtsalt elada !'),
  Span('lahe', [{'lemma': 'laht'}])),
 (Text(text='Teise allika sõnul tegi Katariina kunagi New Yorgis hea partii , abielludes kellegi sealse miljonäriga .'),
  Span('allika', [{'lemma': 'allikas'}]))]

## II. Creating unlabelled samples 

There is currently no built-in way to check for duplicates. To detect them, all pairs of samples have to be compared in a loop.
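If each sample can be reduced to a hashable key (for example the sentence string together with the character offsets of the sampled span, assuming both are accessible), a set of seen keys avoids the pairwise comparison. This is an alternative sketch, not a feature of the sampler. The helper `drop_duplicates` and the toy tuples are hypothetical.

```python
def drop_duplicates(samples, key):
    """Keep the first occurrence of each sample, judged by key(sample)."""
    seen = set()
    unique = []
    for sample in samples:
        k = key(sample)
        if k not in seen:
            seen.add(k)
            unique.append(sample)
    return unique


# Toy stand-ins for (sentence text, span start, span end).
toy_samples = [
    ("Peipsi järv on suur .", 7, 11),
    ("Bali saarel on soe .", 5, 11),
    ("Peipsi järv on suur .", 7, 11),  # duplicate of the first sample
]
unique = drop_duplicates(toy_samples, key=lambda s: s)
print(f"{len(toy_samples)} samples, {len(unique)} unique")
```

Removing duplicates shrinks the sample below the requested count, so additional samples may need to be drawn afterwards.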

For each sampled span, remove all other spans from its text so that only the sampled span is displayed.

In [12]:
from copy import copy

# Keep only the sampled span in each text's v172_geo_terms layer.
for subsample in (samples_levinumad, samples_mitmetahenduslikud, samples_ulejaanud):
    for text, sample_span in subsample:
        # Iterate over a copy, because spans are removed from the layer inside the loop.
        for span in copy(text.v172_geo_terms.spans):
            if span != sample_span:
                text.v172_geo_terms.remove_span(span)

Once the samples are finalized, they are saved to pickle files so that they can easily be reused later or on another machine.

In [13]:
import pickle

with open("unlabelled_data/sampled_sentences/1000_levinumad.pickle",'wb') as f:
    pickle.dump(samples_levinumad, f)

with open("unlabelled_data/sampled_sentences/1000_mitmetahenduslikud.pickle",'wb') as f:
    pickle.dump(samples_mitmetahenduslikud, f)
    
with open("unlabelled_data/sampled_sentences/1000_ulejaanud.pickle",'wb') as f:
    pickle.dump(samples_ulejaanud, f)

### Converting sentences to Label Studio format

The Label Studio exporter writes a Label Studio JSON file to the path given as its first argument. This file should then be imported into the project you set up in Label Studio. Label Studio offers several predefined labeling interfaces, but an interface can also be defined in code. The code printed by `exporter.labeling_interface` can be pasted into the code section of the labeling interface settings.

In [7]:
import pickle

samples = {}
for subpopulation in ['levinumad', 'mitmetahenduslikud', 'ulejaanud']:
    with open(f"unlabelled_data/sampled_sentences/1000_{subpopulation}.pickle", 'rb') as f:
        samples[subpopulation] = pickle.load(f)

In [130]:
from estnltk.converters.label_studio.label_studio import LabelStudioExporter

for subpopulation in ['levinumad', 'mitmetahenduslikud', 'ulejaanud']:
    exporter = LabelStudioExporter(f"unlabelled_data/sampled_sentences_ls_format/koond_1000_{subpopulation}.json",
                                   'v172_geo_terms',
                                   checkbox=True)
    print(exporter.labeling_interface)
    only_texts = [sample[0] for sample in samples[subpopulation]]
    exporter.convert(only_texts, append=False)
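For orientation, a task file that Label Studio can import is a JSON list of tasks, each with a `data` dictionary and optional pre-annotations under `predictions`. The sketch below builds one such task by hand. It illustrates the general Label Studio format only; the file written by `LabelStudioExporter` may be structured differently. The `from_name` and `to_name` values, the label name and the example sentence are assumptions and must match the labeling interface actually used.

```python
import json


def make_task(sentence, start, end, label):
    """Build one Label Studio task with a single pre-annotated text span."""
    return {
        "data": {"text": sentence},
        "predictions": [{
            "result": [{
                "from_name": "label",   # assumed name of the Labels tag
                "to_name": "text",      # assumed name of the Text tag
                "type": "labels",
                "value": {
                    "start": start,
                    "end": end,
                    "text": sentence[start:end],
                    "labels": [label],
                },
            }]
        }],
    }


# Hypothetical example sentence and label name.
sentence = "Bali saarel on soe ."
start = sentence.index("saarel")
task = make_task(sentence, start, start + len("saarel"), "v172_geo_terms")
tasks_json = json.dumps([task], ensure_ascii=False, indent=2)
print(tasks_json)
```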