# Data validation & proxy selection

1. *Which of the proxies we initially considered should be used for the calculations, given data availability?*
2. *How should selected proxies be weighted to reflect our understanding of state capacity?*

## Cases
* ARM, 2018
* GEO, 2003
* KGZ, 2010
* KGZ, 2005
* MDA, 2009
* SRB, 2000
* UKR, 2014
* UKR, 2004

In [1]:
import pandas as pd

In [2]:
def count_missing(data, yearly=False):
    """ Count missing values per proxy (and year). """
    
    df = data.copy()
    if yearly:
        by = ["year", "indicator"]
    else:
        by = "indicator"
        
    missing = (
        df.groupby(by).count()
        .rsub(df.groupby(by).size(), axis=0)
        .rename(columns={"value": "Missing"})
    )
    expected = (
        df.fillna(-1)
        .groupby(by).count()
        .rename(columns={"value": "Max possible"})
    )
    
    return pd.merge(
        missing[["Missing"]], expected[["Max possible"]],
        left_index=True, right_index=True
    )
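
The counting logic above can be checked on a small example. The following is a minimal sketch, not the notebook's own code: it assumes long-format data with `indicator`, `year` and `value` columns (as the function implies), while the `country` column, the proxy names, the values, and the helper name `count_missing_sketch` are hypothetical. It computes the same two quantities more directly, by flagging missing values and aggregating per group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "country" and the proxy names are made up for illustration.
toy = pd.DataFrame({
    "country": ["AAA", "AAA", "BBB", "BBB"],
    "year": [2000, 2000, 2001, 2001],
    "indicator": ["Proxy A", "Proxy B", "Proxy A", "Proxy B"],
    "value": [1.0, np.nan, 2.0, 3.0],
})

def count_missing_sketch(data, by="indicator"):
    """Count missing and total observations of `value` per group."""
    # Flag each row whose value is missing, then aggregate the flags per group.
    flagged = data.assign(is_missing=data["value"].isna())
    return (
        flagged.groupby(by)["is_missing"]
        .agg(["sum", "size"])  # sum of flags = missing count, size = rows in group
        .rename(columns={"sum": "Missing", "size": "Max possible"})
    )

# Per proxy, and per (year, proxy) as with `yearly=True` in the original function.
toy_counts = count_missing_sketch(toy)
toy_counts_yearly = count_missing_sketch(toy, by=["year", "indicator"])
```

Here "Proxy B" has one missing value out of two possible observations, so it would not count as fully covered, while "Proxy A" would.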

In [4]:
# all post-Soviet countries since 1991
full_dataset = pd.read_excel("./../data/interim/world-bank-data_2020-05-27 22_30.xlsx")

# the sample we use to answer RQ #1
selected_cases = pd.read_excel("./../data/interim/world-bank-selected-cases_2020-05-27 22_30.xlsx")

# exclude ARM, 2018 (the only selected case with year 2018)
limited_sample = selected_cases.loc[selected_cases["year"].ne(2018)].copy()
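
Filtering on `year` alone works here because ARM is the only selected case from 2018. If another 2018 case were ever added, it would be dropped as well. A more explicit variant, sketched below, filters on the country and year together. The `country` column name is an assumption (the spreadsheet's actual column names are not shown here), and the values are placeholders, not real data.

```python
import pandas as pd

# Hypothetical stand-in for `selected_cases`; "country" is an assumed column name
# and the values are placeholders, not World Bank figures.
cases = pd.DataFrame({
    "country": ["ARM", "GEO", "UKR", "UKR"],
    "year": [2018, 2003, 2014, 2004],
    "indicator": ["Military expenditure (% of GDP)"] * 4,
    "value": [0.0, 0.0, 0.0, 0.0],
})

# Mark only the (ARM, 2018) rows, then keep everything else.
excluded = cases["country"].eq("ARM") & cases["year"].eq(2018)
limited = cases.loc[~excluded].copy()
```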

In [5]:
full_dataset_counts = count_missing(full_dataset)

# proxies with no missing values in the full dataset
full_dataset_counts.loc[full_dataset_counts["Missing"].eq(0)]

| indicator | Missing | Max possible |
|---|---|---|

(no rows)


In [6]:
selected_cases_counts = count_missing(selected_cases)

# proxies with no missing values in the selected cases
selected_cases_counts.loc[selected_cases_counts["Missing"].eq(0)]

| indicator | Missing | Max possible |
|---|---|---|
| Military expenditure (% of GDP) | 0 | 8 |
| Mortality rate, under-5 (per 1,000 live births) | 0 | 8 |


In [7]:
limited_sample_counts = count_missing(limited_sample)

# proxies with no missing values in the limited sample (ARM, 2018 excluded)
limited_sample_counts.loc[limited_sample_counts["Missing"].eq(0)]

| indicator | Missing | Max possible |
|---|---|---|
| Armed forces personnel (% of total labor force) | 0 | 7 |
| Electric power consumption (kWh per capita) | 0 | 7 |
| Military expenditure (% of GDP) | 0 | 7 |
| Mortality rate, under-5 (per 1,000 live births) | 0 | 7 |
| People using at least basic drinking water services (% of population) | 0 | 7 |
| People using safely managed drinking water services (% of population) | 0 | 7 |


---

Regarding the first question, I think the limited sample, which excludes ARM, 2018, is the best fit.

1. This configuration gives six fully covered proxies to choose from, compared with two for all eight selected cases and none for the full dataset.
2. While it excludes one of the cases we initially selected and reduces the sample size to 7, we can easily justify the exclusion by the methodological limitations.
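
The trade-off between sample size and the number of usable proxies can be illustrated on toy data. This is a sketch under assumptions: the `case` column, the proxy names, the values, and the helper `fully_covered` are all hypothetical, and only the `indicator` and `value` columns follow the notebook. Dropping the single case with a gap turns a partially covered proxy into a fully covered one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three cases, two proxies, placeholder values.
toy = pd.DataFrame({
    "case": ["A", "A", "B", "B", "C", "C"],
    "indicator": ["Proxy 1", "Proxy 2"] * 3,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, np.nan],
})

def fully_covered(data):
    """Return the sorted list of indicators with no missing values in `data`."""
    n_missing = data["value"].isna().groupby(data["indicator"]).sum()
    return sorted(n_missing[n_missing.eq(0)].index.tolist())

# Compare the full toy sample with the sample that excludes case "C".
with_c = fully_covered(toy)
without_c = fully_covered(toy.loc[toy["case"].ne("C")])
```

In this toy setup, keeping case "C" leaves only "Proxy 1" fully covered, while excluding it makes both proxies usable at the cost of one case.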

Regarding the second question, see the next notebook.