# Finding subpopulations for geographic named entities  


## I. Initial split into subpopulations

The entire recall set consists of all geographic named entities for which the last word is among [the predefined suffix_words](geo_terms.txt). 
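
The membership rule above can be sketched as a simple last-word test. This is only an illustration: the suffix words below are copied from the examples in this document rather than from `geo_terms.txt`, and the helper name `in_recall_set` is hypothetical.

```python
# Illustrative suffix words taken from the examples in this document;
# the real list lives in geo_terms.txt and is not reproduced here.
SUFFIX_WORDS = {"jõgi", "meri", "järv", "kanal", "pank", "kurk",
                "ookean", "laht", "mägi"}


def in_recall_set(entity, suffix_words=SUFFIX_WORDS):
    """Return True if the last whitespace-separated word is a suffix word."""
    words = entity.split()
    return bool(words) and words[-1].lower() in suffix_words


candidates = ["Peipsi järv", "Panama kanal", "Tartu Ülikool"]
selected = [entity for entity in candidates if in_recall_set(entity)]
print(selected)
```

A real implementation would probably also have to deal with inflected forms of the suffix words, which this sketch ignores.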


|Subpopulation      | Description | Examples |
|:--- |:---|:---|
|Levinumad          | Geographic locations with the most frequent suffixes | Niiluse jõgi, Aasovi meri, Peipsi järv   |
|Mitmetähenduslikud | Geographic locations with ambiguous suffixes         | Panama kanal, Panga pank, Kura kurk      |
|Ülejäänud          | Other geographic locations                           | Vaikne ookean, Liivi laht, Tehvandi mägi |

Subpopulations were formed by first selecting all occurrences of suffix words and then manually labelling, for a random sample of 1000 occurrences per subpopulation, whether each occurrence is part of a geographic named entity.

Load the description of the manually labelled dataset:

In [1]:
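from pandas import read_csv

# Load the per-subpopulation summary; the first CSV column is the row index
df = read_csv('data_description.csv', index_col=0)
# Use the name 'subpopulation' consistently and drop the file column, which is not needed below
df = df.rename(columns={"population": 'subpopulation'})
df = df.drop(columns=['file'])
df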

| | subpopulation | occurences | labelled | positive | occurence_ratio | detection_ratio | relative_frequency |
|:--- |:---|:---|:---|:---|:---|:---|:---|
| 0 | levinumad | 84822 | 1000 | 350 | 0.240553 | 0.35 | 0.473287 |
| 1 | mitmetahenduslikud | 81893 | 1000 | 13 | 0.232247 | 0.013 | 0.016972 |
| 2 | ulejaanud | 185897 | 1000 | 172 | 0.5272 | 0.172 | 0.50974 |
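
The derived columns can be recomputed from the raw counts, which helps when reading the table. The sketch below assumes that `occurence_ratio` is the share of all suffix-word occurrences, `detection_ratio` is `positive / labelled`, and `relative_frequency` is the normalised product of the two. These relations are inferred from the numbers in the table, not taken from the code that produced `data_description.csv`.

```python
from pandas import DataFrame

# Raw counts copied from the table above
counts = DataFrame({
    "subpopulation": ["levinumad", "mitmetahenduslikud", "ulejaanud"],
    "occurences": [84822, 81893, 185897],
    "labelled": [1000, 1000, 1000],
    "positive": [350, 13, 172],
})

# Share of all suffix-word occurrences that falls into each subpopulation
counts["occurence_ratio"] = counts["occurences"] / counts["occurences"].sum()
# Fraction of labelled occurrences that are geographic named entities
counts["detection_ratio"] = counts["positive"] / counts["labelled"]
# Estimated share of all geographic named entities in each subpopulation
weighted = counts["occurence_ratio"] * counts["detection_ratio"]
counts["relative_frequency"] = weighted / weighted.sum()

print(counts.round(6))
```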


## II. Estimations

Now we can estimate 95% confidence intervals for the detection ratio and the relative frequency of each subpopulation.

In [2]:
import numpy as np 
from pandas import DataFrame
from statsmodels.stats.proportion import proportion_confint

In [3]:
df['detection_ratio_lower_ci'] = np.nan 
df['detection_ratio_upper_ci'] = np.nan 

for idx, (count, obs) in df[['positive', 'labelled']].iterrows():
    df.loc[idx, ['detection_ratio_lower_ci', 'detection_ratio_upper_ci']] = proportion_confint(count, obs)

In [4]:
# For the confidence intervals, let's neglect the uncertainty coming from the variance of the normalising factor in the denominator
df['relative_frequency_lower_ci'] = df['occurence_ratio'] * df['detection_ratio_lower_ci']/sum(df['occurence_ratio'] * df['detection_ratio'])
df['relative_frequency_upper_ci'] = df['occurence_ratio'] * df['detection_ratio_upper_ci']/sum(df['occurence_ratio'] * df['detection_ratio'])

In [5]:
display(df)

| | subpopulation | occurences | labelled | positive | occurence_ratio | detection_ratio | relative_frequency | detection_ratio_lower_ci | detection_ratio_upper_ci | relative_frequency_lower_ci | relative_frequency_upper_ci |
|:--- |:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 0 | levinumad | 84822 | 1000 | 350 | 0.240553 | 0.35 | 0.473287 | 0.320438 | 0.379562 | 0.433312 | 0.513263 |
| 1 | mitmetahenduslikud | 81893 | 1000 | 13 | 0.232247 | 0.013 | 0.016972 | 0.005979 | 0.020021 | 0.007806 | 0.026138 |
| 2 | ulejaanud | 185897 | 1000 | 172 | 0.5272 | 0.172 | 0.50974 | 0.14861 | 0.19539 | 0.440422 | 0.579059 |
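
For reference, `proportion_confint` uses the normal approximation (`method='normal'`) by default, so the detection-ratio bounds above can be reproduced by hand. The following is a minimal sketch using only the standard library; the function name `wald_interval` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(count, nobs, alpha=0.05):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = count / nobs
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    half_length = z * sqrt(p * (1 - p) / nobs)
    return p - half_length, p + half_length


# 350 positives among 1000 labelled occurrences (levinumad)
lower, upper = wald_interval(350, 1000)
print(round(lower, 6), round(upper, 6))
```

With only 13 positives among 1000 labelled occurrences, the normal approximation is rough for the ambiguous subpopulation; `proportion_confint` also accepts `method='wilson'` if a more conservative interval is preferred.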


**Optimal sample sizes:** We can now compute the optimal sample sizes for the recall set and the estimated number of samples one needs to look through in order to collect such sampling sets.
* Our goal here is to minimise the variance of the overall recall estimate, which is computed as a weighted average over subpopulation recalls.
* Next we find the number of samples needed to estimate recall with a precision (half-width of the 95% confidence interval) of 3%, 2% and 1%, provided that recall is above 75%. The sample size is computed for a recall of exactly 75%, which is the worst case in that range because the variance p(1 - p) decreases as p grows beyond 0.5.
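
The required sample size follows from the normal-approximation formula n = z² · p(1 − p) / d², where d is the target half-width and p is the assumed recall. The sketch below evaluates it at p = 0.75 with the standard library only; the function name is hypothetical and the notebook itself relies on `samplesize_confint_proportion` from statsmodels.

```python
from statistics import NormalDist


def sample_size_for_half_length(proportion, half_length, alpha=0.05):
    """Sample size for which a normal-approximation CI has the given half-width."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2 * proportion * (1 - proportion) / half_length ** 2


# Recall assumed to be 0.75; target half-widths of 3%, 2% and 1%
sizes = {d: round(sample_size_for_half_length(0.75, d)) for d in (0.03, 0.02, 0.01)}
print(sizes)
```

Halving the half-width roughly quadruples the required sample size, which is why the 1% target is so much more expensive than the 2% one.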


In [6]:
from pandas import concat, merge
from statsmodels.stats.proportion import samplesize_confint_proportion

In [7]:
import sys
sys.path.append("..")
from common.sampling import balance_sample_sizes

In [8]:
df=df.rename(columns={"population": 'subpopulation'})

In [9]:
target = DataFrame({'precision': [0.03, 0.02, 0.01]})
target['sample_size'] = round(samplesize_confint_proportion(proportion=0.75, half_length=target['precision'], alpha=0.05)).astype(int)
display(target)

designs = [None] * len(target)

for idx, (precision, sample_size) in target.iterrows():
    designs[idx] = (balance_sample_sizes(df.set_index('subpopulation')['relative_frequency'], sample_size)
                    .reset_index()
                    .rename(columns={'relative_frequency': 'recall_sample_size'})
                    .assign(precision=precision))

designs = merge(concat(designs, axis=0), df[['subpopulation', 'detection_ratio', 'relative_frequency']], on='subpopulation')

# Expected number of occurrences to label in order to collect recall_sample_size positive examples
designs['label_sample_size'] = round(designs['recall_sample_size']/designs['detection_ratio']).astype(int)

designs = designs[['precision', 'subpopulation', 'relative_frequency', 'detection_ratio', 'recall_sample_size', 'label_sample_size']]
designs.sort_values(['precision', 'subpopulation'], ascending=[False, True], inplace=True)
display(designs)

| | precision | sample_size |
|:--- |:---|:---|
| 0 | 0.03 | 800 |
| 1 | 0.02 | 1801 |
| 2 | 0.01 | 7203 |


| | precision | subpopulation | relative_frequency | detection_ratio | recall_sample_size | label_sample_size |
|:--- |:---|:---|:---|:---|:---|:---|
| 0 | 0.03 | levinumad | 0.473287 | 0.35 | 379 | 1083 |
| 3 | 0.03 | mitmetahenduslikud | 0.016972 | 0.013 | 14 | 1077 |
| 6 | 0.03 | ulejaanud | 0.50974 | 0.172 | 408 | 2372 |
| 1 | 0.02 | levinumad | 0.473287 | 0.35 | 852 | 2434 |
| 4 | 0.02 | mitmetahenduslikud | 0.016972 | 0.013 | 31 | 2385 |
| 7 | 0.02 | ulejaanud | 0.50974 | 0.172 | 918 | 5337 |
| 2 | 0.01 | levinumad | 0.473287 | 0.35 | 3409 | 9740 |
| 5 | 0.01 | mitmetahenduslikud | 0.016972 | 0.013 | 122 | 9385 |
| 8 | 0.01 | ulejaanud | 0.50974 | 0.172 | 3672 | 21349 |
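
The implementation of `balance_sample_sizes` is not shown here, but the `recall_sample_size` values above are consistent with allocating the total sample in proportion to `relative_frequency` and rounding. The sketch below uses a hypothetical stand-in, `proportional_allocation`, to reproduce the rows for precision 0.03; it should not be read as the actual code in `common.sampling`.

```python
from pandas import Series


def proportional_allocation(weights, total):
    """Hypothetical stand-in: split `total` in proportion to `weights`, rounded."""
    return (weights / weights.sum() * total).round().astype(int)


# Values copied from the tables above
relative_frequency = Series({"levinumad": 0.473287,
                             "mitmetahenduslikud": 0.016972,
                             "ulejaanud": 0.50974})
detection_ratio = Series({"levinumad": 0.35,
                          "mitmetahenduslikud": 0.013,
                          "ulejaanud": 0.172})

# Recall-set sizes for a total of 800 examples (precision 0.03)
recall_sample_size = proportional_allocation(relative_frequency, 800)
# Occurrences to label so that the expected number of positives matches
label_sample_size = (recall_sample_size / detection_ratio).round().astype(int)

print(recall_sample_size.to_dict())
print(label_sample_size.to_dict())
```

Because each part is rounded separately, the allocated sizes can add up to slightly more or less than the requested total (801 instead of 800 here). Note also how the low detection ratio of the ambiguous subpopulation inflates its labelling cost: about 1077 occurrences must be labelled to collect 14 positive examples.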
