# Finding subpopulations for geographic named entities  


In [1]:
import sys
sys.path.append("..")

## I. Initial split into subpopulations

The entire recall set consists of all geographic named entities for which the last word is among [the predefined suffix_words](../../benchmarks/amundsen_01/suffix_words.txt). 


|Subpopulation     | Description | Examples |
|:--- |:---|:---|
|Levinumad         | Geographic locations with most frequent suffixes | Niiluse jõgi, Aasovi meri, Peipsi järv   |  
|Mitmetäheduslikud | Geographic locations with ambiguous suffixes     | Panama kanal, Panga pank, Kura kurk      |
|Ülejäänud         | Other geographic locations                       | Vaikne ookean, Liivi laht, Tehvandi mägi |   

Subpopulations were formed by first selecting all occurrences of suffix words and then manually labelling, for a random sample of occurrences, whether each one is part of a geographic named entity.


|Subpopulation     | Occurrences | Labelled | Positive |
|:--- |---:|---:| --:|
|Levinumad         |  80943 | 1000 | 353 | 
|Mitmetäheduslikud |  75853 | 1000 |  13 |
|Ülejäänud         | 182057 | 1000 | 177 |
|**Total**         | 338853 | 3000 | 543 |

Based on these counts, we can estimate 95% confidence intervals for the detection ratio of each subpopulation and for its relative frequency among all geographic named entities.

In [2]:
import numpy as np 
from pandas import DataFrame
from statsmodels.stats.proportion import proportion_confint

In [3]:
df = DataFrame({
    'subpopulation': ['Levinumad', 'Mitmetäheduslikud', 'Ülejäänud'],
    'occurences': [80943, 75853, 182057],
    'labelled': [1000, 1000, 1000],
    'detections': [353, 13, 177]
})

df['occurence_ratio'] = df['occurences']/sum(df['occurences'])
df['detection_ratio'] = df['detections']/df['labelled']

df['detection_ratio_lower_ci'] = np.nan 
df['detection_ratio_upper_ci'] = np.nan 

for idx, (count, obs) in df[['detections', 'labelled']].iterrows():
    df.loc[idx, ['detection_ratio_lower_ci', 'detection_ratio_upper_ci']] = proportion_confint(count, obs)
    
# For the confidence intervals, let's neglect the uncertainty coming from the variance of the normalising factor in the denominator
df['relative_frequency'] = df['occurence_ratio'] * df['detection_ratio']/sum(df['occurence_ratio'] * df['detection_ratio']) 
df['relative_frequency_lower_ci'] = df['occurence_ratio'] * df['detection_ratio_lower_ci']/sum(df['occurence_ratio'] * df['detection_ratio'])
df['relative_frequency_upper_ci'] = df['occurence_ratio'] * df['detection_ratio_upper_ci']/sum(df['occurence_ratio'] * df['detection_ratio'])


display(df)

| | subpopulation | occurences | labelled | detections | occurence_ratio | detection_ratio | detection_ratio_lower_ci | detection_ratio_upper_ci | relative_frequency | relative_frequency_lower_ci | relative_frequency_upper_ci |
|---:|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 0 | Levinumad | 80943 | 1000 | 353 | 0.238873 | 0.353 | 0.32338 | 0.38262 | 0.462471 | 0.423665 | 0.501277 |
| 1 | Mitmetäheduslikud | 75853 | 1000 | 13 | 0.223852 | 0.013 | 0.005979 | 0.020021 | 0.015961 | 0.007341 | 0.02458 |
| 2 | Ülejäänud | 182057 | 1000 | 177 | 0.537274 | 0.177 | 0.153344 | 0.200656 | 0.521568 | 0.451862 | 0.591275 |
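
The numbers in this output can be checked by hand. The sketch below uses only the standard library and assumes that the intervals are normal-approximation (Wald) intervals, which is what `proportion_confint` returns with its default `method='normal'`. The helper name `wald_interval` and the dictionary layout are illustrative and not part of the original notebook.

```python
from statistics import NormalDist

# Counts copied from the labelling table above.
occurrences = {'Levinumad': 80943, 'Mitmetäheduslikud': 75853, 'Ülejäänud': 182057}
positives = {'Levinumad': 353, 'Mitmetäheduslikud': 13, 'Ülejäänud': 177}
labelled = 1000  # labelled occurrences per subpopulation

# Two-sided 95% quantile of the standard normal distribution.
z = NormalDist().inv_cdf(0.975)


def wald_interval(count, nobs):
    """Normal-approximation confidence interval for a binomial proportion."""
    p = count / nobs
    half_width = z * (p * (1 - p) / nobs) ** 0.5
    return p - half_width, p + half_width


# Unnormalised share of geographic entities contributed by each subpopulation:
# (share of all suffix-word occurrences) * (fraction labelled positive).
total = sum(occurrences.values())
weights = {name: occurrences[name] / total * positives[name] / labelled
           for name in occurrences}

# Normalise so that the relative frequencies sum to one.
norm = sum(weights.values())
relative_frequency = {name: w / norm for name, w in weights.items()}

for name in occurrences:
    lower, upper = wald_interval(positives[name], labelled)
    print(f"{name}: detection ratio in [{lower:.6f}, {upper:.6f}], "
          f"relative frequency {relative_frequency[name]:.6f}")
```

The normal approximation is crude for the Mitmetäheduslikud subpopulation, where only 13 of 1000 labelled occurrences are positive, so that interval should be read with some caution.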


**Optimal sample sizes:** We can now compute the optimal sample sizes for the recall set and estimate how many occurrences one needs to label in order to collect such samples.
* Our goal here is to minimise the variance of the overall recall estimate, which is computed as a weighted average of the subpopulation recalls.
* Next we find the number of samples needed to estimate recall to within ±3%, ±2% and ±1% (the half-width of the 95% confidence interval), provided that recall is at least 75%.
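
The required sample size follows from the normal-approximation interval: a half-width `h` at proportion `p` needs roughly `n = z² · p(1 − p) / h²` samples. Because `p(1 − p)` decreases as `p` moves above 0.5, using `p = 0.75` is the worst case whenever recall is at least 75%. A minimal standard-library sketch of this formula follows; the function name `sample_size_for_half_width` is illustrative and not part of the notebook.

```python
from statistics import NormalDist


def sample_size_for_half_width(proportion, half_width, alpha=0.05):
    """Samples needed so that the normal-approximation confidence interval
    for a proportion has the given half-width (result is not rounded)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2 * proportion * (1 - proportion) / half_width ** 2


for half_width in (0.03, 0.02, 0.01):
    n = sample_size_for_half_width(0.75, half_width)
    print(f"half-width {half_width:.2f}: about {round(n)} samples")
```

Rounding to the nearest integer mirrors the notebook cell; rounding up instead would guarantee that the half-width never exceeds the target.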


In [4]:
from pandas import concat, merge
from common.sampling import balance_sample_sizes
from statsmodels.stats.proportion import samplesize_confint_proportion

In [5]:
target = DataFrame({'precision': [0.03, 0.02, 0.01]})
target['sample_size'] = round(samplesize_confint_proportion(proportion=0.75, half_length=target['precision'], alpha=0.05)).astype(int)
display(target)

designs = [None] * len(target)

for idx, (precision, sample_size) in target.iterrows():
    designs[idx] = (balance_sample_sizes(df.set_index('subpopulation')['relative_frequency'], sample_size)
                    .reset_index()
                    .rename(columns={'relative_frequency': 'recall_sample_size'})
                    .assign(precision=precision))

designs = merge(concat(designs, axis=0), df[['subpopulation', 'detection_ratio', 'relative_frequency']], on='subpopulation')

designs['label_sample_size'] = round(designs['recall_sample_size']/designs['detection_ratio']).astype(int)

designs = designs[['precision', 'subpopulation', 'relative_frequency', 'detection_ratio', 'recall_sample_size', 'label_sample_size']]
designs.sort_values(['precision', 'subpopulation'], ascending=[False, True], inplace=True)
display(designs)

| | precision | sample_size |
|---:|---:|---:|
| 0 | 0.03 | 800 |
| 1 | 0.02 | 1801 |
| 2 | 0.01 | 7203 |


| | precision | subpopulation | relative_frequency | detection_ratio | recall_sample_size | label_sample_size |
|---:|---:|:---|---:|---:|---:|---:|
| 0 | 0.03 | Levinumad | 0.462471 | 0.353 | 370 | 1048 |
| 3 | 0.03 | Mitmetäheduslikud | 0.015961 | 0.013 | 13 | 1000 |
| 6 | 0.03 | Ülejäänud | 0.521568 | 0.177 | 417 | 2356 |
| 1 | 0.02 | Levinumad | 0.462471 | 0.353 | 833 | 2360 |
| 4 | 0.02 | Mitmetäheduslikud | 0.015961 | 0.013 | 29 | 2231 |
| 7 | 0.02 | Ülejäänud | 0.521568 | 0.177 | 939 | 5305 |
| 2 | 0.01 | Levinumad | 0.462471 | 0.353 | 3331 | 9436 |
| 5 | 0.01 | Mitmetäheduslikud | 0.015961 | 0.013 | 115 | 8846 |
| 8 | 0.01 | Ülejäänud | 0.521568 | 0.177 | 3757 | 21226 |
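
The design table can be read as two steps: split the total recall sample across subpopulations, then divide each share by the detection ratio to get the number of occurrences that must be labelled to find that many positives. The sketch below assumes a plain proportional allocation, which reproduces the `recall_sample_size` column above; the actual `balance_sample_sizes` helper is not shown in this notebook and may differ (for example in how it handles rounding). Both function names are hypothetical.

```python
# Estimates copied from the tables above.
relative_frequency = {'Levinumad': 0.462471,
                      'Mitmetäheduslikud': 0.015961,
                      'Ülejäänud': 0.521568}
detection_ratio = {'Levinumad': 0.353,
                   'Mitmetäheduslikud': 0.013,
                   'Ülejäänud': 0.177}


def proportional_allocation(weights, total):
    """Split `total` samples proportionally to `weights` (assumed scheme).

    Naive rounding is used, so in general the parts need not sum to `total`.
    """
    return {name: round(weight * total) for name, weight in weights.items()}


def labelling_effort(recall_sizes, ratios):
    """Expected number of occurrences to label to collect the positives."""
    return {name: round(size / ratios[name]) for name, size in recall_sizes.items()}


for total in (800, 1801, 7203):
    recall_sizes = proportional_allocation(relative_frequency, total)
    label_sizes = labelling_effort(recall_sizes, detection_ratio)
    print(total, recall_sizes, label_sizes)
```

The low detection ratio of the Mitmetäheduslikud subpopulation is what makes it expensive: about 1000 occurrences must be labelled to collect just 13 positives.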
