# Workflow for updating the benchmark

This workflow adds texts to the benchmark.

## I. Initial split into subpopulations

The entire recall set consists of all geographic named entities whose last word is among [the predefined suffix words](geo_terms.txt).


|Subpopulation      | Description | Examples |
|:--- |:---|:---|
|Levinumad          | Geographic locations with the most frequent suffixes | Niiluse jõgi, Aasovi meri, Peipsi järv   |
|Mitmetähenduslikud | Geographic locations with ambiguous suffixes         | Panama kanal, Panga pank, Kura kurk      |
|Ülejäänud          | Other geographic locations                           | Vaikne ookean, Liivi laht, Tehvandi mägi |
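
As a rough illustration of the split, the sketch below assigns an entity to a subpopulation by its last word. The two suffix sets are hypothetical and contain only the suffixes from the examples in the table; the actual lists come from `geo_terms.txt` and the subpopulation configuration. The function name `assign_subpopulation` is also an assumption, and the sketch presumes the entity is non-empty, already belongs to the recall set, and is given in its lemma form.

```python
# Hypothetical suffix groups, taken from the table examples only;
# the real lists are defined by geo_terms.txt and the subpopulation config.
FREQUENT_SUFFIXES = {'jõgi', 'meri', 'järv'}
AMBIGUOUS_SUFFIXES = {'kanal', 'pank', 'kurk'}


def assign_subpopulation(entity: str) -> str:
    """Return the subpopulation key for an entity based on its last word."""
    last_word = entity.split()[-1].lower()
    if last_word in FREQUENT_SUFFIXES:
        return 'levinumad'
    if last_word in AMBIGUOUS_SUFFIXES:
        return 'mitmetahenduslikud'
    # Everything else in the recall set falls into the remaining group
    return 'ulejaanud'


for example in ['Peipsi järv', 'Panama kanal', 'Vaikne ookean']:
    print(example, '->', assign_subpopulation(example))
```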

## II. Consistency with the benchmark setup

* Check that the necessary recall sets exist and are readable CSV files
* Check that there are no invalid or duplicate annotations

In [1]:
from helper_functions import load_term_subpopulations
term_subpopulations = load_term_subpopulations()
term_subpopulations.keys()

dict_keys(['levinumad', 'mitmetahenduslikud', 'ulejaanud'])

In [2]:
import os.path
from pandas import read_csv
from helper_functions import DuplicatesChecker
checker = DuplicatesChecker()

for subpopulation in term_subpopulations.keys():
    print(f'Validating {subpopulation!r} ...')
    filename = f'labelled_data/recall_sets/koond_1000_{subpopulation}.csv'
    # Validate file
    assert os.path.exists(filename), f'(!) Missing file: {filename}'
    try:
        data = read_csv(filename)
    except Exception as csv_parsing_err:
        raise ValueError(f'(!) Bad input file format: unable to open {filename!r} as a CSV file: ') from csv_parsing_err
    # Validate file's contents 
    for txt, span in zip(data.text, data.span):
        checker.check_for_duplicates(txt, span)
    print('OK')

Validating 'levinumad' ...
OK
Validating 'mitmetahenduslikud' ...
OK
Validating 'ulejaanud' ...
OK
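
The internals of `DuplicatesChecker` are not shown here. As a minimal sketch of what such a check might look like, the snippet below treats two annotations as duplicates when they share the same `text` and `span` values. That definition, the function name `find_duplicate_annotations`, and the toy CSV content are all assumptions made for illustration; only the column names `text` and `span` come from the cell above.

```python
import csv
from io import StringIO


def find_duplicate_annotations(csv_file):
    """Return (row_number, text, span) for every repeated (text, span) pair."""
    seen = set()
    duplicates = []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        key = (row['text'], row['span'])
        if key in seen:
            duplicates.append((row_number, *key))
        seen.add(key)
    return duplicates


# Toy stand-in for a recall-set file (not real benchmark data)
toy_csv = StringIO(
    'text,span\n'
    'We drove to Peipsi järv.,Peipsi järv\n'
    'Panama kanal was closed.,Panama kanal\n'
    'We drove to Peipsi järv.,Peipsi järv\n'
)
duplicates = find_duplicate_annotations(toy_csv)
print(duplicates)
```

Reporting the row number alongside the duplicated pair makes it easier to locate the offending annotation in the source file.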


## III. Update the benchmark data

### Gather necessary counts

First, we need to obtain total term counts in each subpopulation. 
This information is available in the local database file created by the SpanSampler.
Reload the database and get the information:

In [3]:
from helper_functions import load_configuration, connect_to_database
from helper_functions import load_term_subpopulations, count_terms_by_subpopulations
from span_sampler_sqlite3 import SpanSampler

config = load_configuration('config/example_configuration.ini')
storage = connect_to_database(config)
collection = config['source_database']['collection']
collection = storage[collection]

sampler = SpanSampler(collection=collection, 
                      layer=config['source_database']['terms_layer'], 
                      attribute='lemma', 
                      termsfile='geo_terms.txt', 
                      db_file_name=config['local_database']['sqlite_file'], 
                      verbose=True)

INFO:storage.py:57: connecting to host: 'postgres.keeleressursid.ee', port: 5432, dbname: 'estonian-text-corpora', user: 'soras'
INFO:storage.py:108: schema: 'estonian_text_corpora', temporary: False, role: 'estonian_text_corpora_read'
Loaded 63 terms from geo_terms.txt.


In [4]:
# Load total counts of each subpopulation
subpopulation_totals = \
    count_terms_by_subpopulations(sampler, subpopulations_dir='config/subpopulations')
print(subpopulation_totals)

{'levinumad': 84822, 'mitmetahenduslikud': 81893, 'ulejaanud': 185897}


Second, get the number of positive cases (detected entities) in each subpopulation. Each row of a recall-set file is counted as one positive case.

In [5]:
# Collect numbers of positive cases
from pandas import read_csv

positives = {}
for subpopulation in term_subpopulations.keys():
    positives[subpopulation] = 0
    filename = f'labelled_data/recall_sets/koond_1000_{subpopulation}.csv'
    try:
        data = read_csv(filename)
    except Exception as csv_parsing_err:
        raise ValueError(f'(!) Bad input file format: unable to open {filename!r} as a CSV file: ') from csv_parsing_err
    positives[subpopulation] += len(data.text)

### Create dataset description CSV file

In [6]:
import numpy as np 
from pandas import DataFrame

In [7]:
# Add initial statistics about subpopulations
sorted_populations = sorted(term_subpopulations.keys())
df = DataFrame({
    'population': sorted_populations,
    'occurences': [subpopulation_totals[s_pop] for s_pop in sorted_populations],
    'labelled': [1000 for s_pop in sorted_populations],
    'positive': [positives[s_pop] for s_pop in sorted_populations]
})
# Compute some additional statistics
df['occurence_ratio'] = df['occurences']/sum(df['occurences'])
df['detection_ratio'] = df['positive']/df['labelled']
df['relative_frequency'] = df['occurence_ratio'] * df['positive']/sum(df['occurence_ratio'] * df['positive'])
df

|   | population | occurences | labelled | positive | occurence_ratio | detection_ratio | relative_frequency |
|:--- |:---|:---|:---|:---|:---|:---|:---|
| 0 | levinumad | 84822 | 1000 | 350 | 0.240553 | 0.35 | 0.473287 |
| 1 | mitmetahenduslikud | 81893 | 1000 | 13 | 0.232247 | 0.013 | 0.016972 |
| 2 | ulejaanud | 185897 | 1000 | 172 | 0.5272 | 0.172 | 0.50974 |
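
To make the derived columns easier to check, the sketch below recomputes them in plain Python from the counts shown above. One reading, offered as an interpretation rather than a documented definition, is that `relative_frequency` estimates the share of all positive cases that falls into each subpopulation: the occurrence ratio weights each subpopulation by its size, and the positive count stands in for the detection ratio because every subpopulation has the same number of labelled items (1000).

```python
# Counts copied from the outputs above
occurrences = {'levinumad': 84822, 'mitmetahenduslikud': 81893, 'ulejaanud': 185897}
positives = {'levinumad': 350, 'mitmetahenduslikud': 13, 'ulejaanud': 172}
labelled = 1000  # same sample size for every subpopulation

total = sum(occurrences.values())

# Share of all term occurrences that belongs to each subpopulation
occurrence_ratio = {name: count / total for name, count in occurrences.items()}

# Fraction of labelled items that turned out to be positive
detection_ratio = {name: positives[name] / labelled for name in positives}

# Size-weighted positives, normalised so that the values sum to 1
weights = {name: occurrence_ratio[name] * positives[name] for name in occurrences}
norm = sum(weights.values())
relative_frequency = {name: weight / norm for name, weight in weights.items()}

for name in sorted(occurrences):
    print(f'{name}: {relative_frequency[name]:.6f}')
# roughly 0.473, 0.017 and 0.510, in line with the table
```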


In [8]:
# add file names (full paths)
df['file'] = [f'amundsen_01/data/recall_sets/koond_1000_{s_pop}.csv' for s_pop in sorted_populations]

In [9]:
# export as csv
df.to_csv('data_description.csv')
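
Note that `to_csv` writes the row index by default, so reading the file back with `read_csv` yields an extra `Unnamed: 0` column. The sketch below demonstrates this on a toy frame with an in-memory buffer. Whether the benchmark tooling expects that index column is not stated here, so passing `index=False` is shown only as an option, not as a recommended change.

```python
from io import StringIO
from pandas import DataFrame, read_csv

# Toy frame standing in for the description table
toy = DataFrame({'population': ['levinumad', 'ulejaanud'], 'positive': [350, 172]})

# Default export: the index is written as an unnamed first column
with_index = StringIO()
toy.to_csv(with_index)
with_index.seek(0)
columns_with_index = list(read_csv(with_index).columns)
print(columns_with_index)

# Export without the index: only the data columns are written
without_index = StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(read_csv(without_index).columns)
print(columns_without_index)
```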