# Imputing missing values

We work with the data set as it currently stands and apply a common machine learning method to impute the missing values:

The **weighted k nearest neighbour (w-kNN)** algorithm, which imputes missing values as weighted averages with weights equal to the inverse Euclidean distance.

**Assumptions:** we assume that the missing values lie between the lowest and the highest value present in the data set. This is likely to hold for most missing values, but some of them could also be outliers.

### A note on `countries`

We want to **remind** ourselves that our list `groupings` contains not only countries, but also groups of countries (continents, economic zones, etc.) and islands or territories which belong to a country but could be very different from it. We save the entries which are countries or parts of countries in a new list `countries`.

Furthermore, some sub-indicators are of ordinal type, i.e. they are defined as 'number of countries which...'. For a single country, such a sub-indicator is either 1 ('yes') or 0 ('no'). For all other groupings, it is most likely a number larger than 1. This could be tricky for our imputations later, because they are based on the similarity of two groupings, and this similarity could be strong between, say, France and the World Trade Organisation (WTO). If the WTO had a missing value for an ordinal sub-indicator, the imputation would most likely be very close to the value of France, i.e. 1 or 0. This is unrealistic: all other WTO members which behave similarly to France would only pull the imputation closer to 1, never above 1.

Therefore, we focus on `countries` only from here on.

In [1]:
import numpy as np
import pandas as pd
import math
import os
import pickle
import copy
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')



In [2]:
# loading original and standardised data set
dict_all = pickle.load(open('utils/data/dict_all.pkl', 'rb'))
dict_all_std = pickle.load(open('utils/data/dict_all_std.pkl', 'rb'))

In [3]:
# check
print('Original values: ')
print(dict_all['Africa'].loc['AG_LND_FRST'])

print('--------')

print('Standardised values: ')
print(dict_all_std['Africa'].loc['AG_LND_FRST'])

Original values: 
TimePeriod
1990        NaN
1991        NaN
1992        NaN
1993        NaN
1994        NaN
1995        NaN
1996        NaN
1997        NaN
1998        NaN
1999        NaN
2000    22.3355
2001        NaN
2002        NaN
2003        NaN
2004        NaN
2005     21.787
2006        NaN
2007        NaN
2008        NaN
2009        NaN
2010    21.2175
2011        NaN
2012        NaN
2013        NaN
2014        NaN
2015    20.7224
2016        NaN
2017        NaN
2018        NaN
2019        NaN
Name: AG_LND_FRST, dtype: object
--------
Standardised values: 
TimePeriod
1990          NaN
1991          NaN
1992          NaN
1993          NaN
1994          NaN
1995          NaN
1996          NaN
1997          NaN
1998          NaN
1999          NaN
2000      1.53215
2001          NaN
2002          NaN
2003          NaN
2004          NaN
2005     -1.21414
2006          NaN
2007          NaN
2008          NaN
2009          NaN
2010    0.0937949
2011          NaN
2012          NaN
20

Now, we open the `csv` file in a GUI and delete the groupings which are *not* countries or parts of countries. We call these non-country groupings; examples are North America, Western Asia, Least Developed Countries (LDC), Land Locked Developing Countries (LLDC) and Small Island Developing States (SIDS).

In [4]:
# read amended csv file
c = pd.read_csv('utils/countries.csv', dtype=str)
countries = list(c['Countries'].unique())

#check
countries

['Iraq (Central Iraq)',
 'Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Anguilla',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Aruba',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bermuda',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cabo Verde',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo',
 'Costa Rica',
 "Côte d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czechia',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'French Guiana',
 'French Polynesia',
 'Gabon',
 'Gambia',
 'Georgia'

# 1) weighted k nearest neighbour (w-kNN)

The w-kNN algorithm is straightforward: for each year $y$, we measure how similar countries are with the standardised Euclidean distance $E_y$ and use the inverse of its absolute value, $\frac{1}{|E_y|}$, as the weight when imputing a missing value of sub-indicator $j$ in country $c_i$.

A good starting point for understanding this algorithm is an intuition for high-dimensional space: https://youtu.be/wvsE8jm1GzE

### Euclidean distance
The Euclidean distance $e_y$ for year $y$ between any given pair of countries $(c_i, c_k)$ is calculated over all sub-indicators $j$ by:
$$ e_y(c_{i}, c_{k}) = \lVert c_{i} - c_{k} \rVert_2 = \sqrt{ \sum_{j=1}^J(c_{ij} - c_{kj})^2} $$

We sum the squared differences between the two countries $(c_i, c_k)$ over all sub-indicators $j$ and take the square root of this sum. Here, $c_{ij}$ is the value of sub-indicator $j$ in country $i$, and $i \neq k$. Thus, any unique pair of countries $i$ and $k$, $i \neq k$, has exactly **one** Euclidean distance $e_y$ per year $y$. The country which has the largest distance $e_y$ to country $i$, denoted $c_{k+1}$, is set aside here and serves as the reference for the normalisation.
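
As a minimal sketch of this distance, assume two hypothetical countries with standardised values for five sub-indicators in one year, some of which are missing. Only sub-indicators available in both countries enter the sum. The names `shared_euclidean`, `c_i` and `c_k` are illustrative and not part of this notebook.

```python
import numpy as np

def shared_euclidean(c_i, c_k):
    """Euclidean distance over the sub-indicators available in both vectors."""
    c_i = np.asarray(c_i, dtype=float)
    c_k = np.asarray(c_k, dtype=float)
    shared = ~np.isnan(c_i) & ~np.isnan(c_k)    # mask of shared sub-indicators
    if not shared.any():
        return np.nan    # no common data, so the distance is undefined
    return float(np.sqrt(np.sum((c_i[shared] - c_k[shared]) ** 2)))

# hypothetical standardised values for five sub-indicators in one year
c_i = [0.5, np.nan, -1.0, 2.0, np.nan]
c_k = [-0.5, 1.0, -1.0, np.nan, np.nan]

e = shared_euclidean(c_i, c_k)
print(e)
```

Only the first and the third sub-indicator are shared in this toy example, so the other three entries do not influence the result.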

Afterwards, we normalise this with respect to the country $k+1$ which has the largest distance $e_y$ to country $i$ by the following equation:

$$ E_y(c_{i}, c_{k}) = \frac{e_y(c_{i}, c_{k})}{e_y(c_{i}, c_{k+1})} $$

This is equivalent to the well-known normalisation equation:

$$
x_n = \frac{x - x_{min}}{x_{max}-x_{min}}
$$

since $x_{min}$ is always 0, because the distance of a country to itself is 0.
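
The equivalence can be checked numerically. The following sketch uses made-up distances from one country to itself and to four other countries; the variable names are illustrative only.

```python
import numpy as np

# hypothetical distances e_y from country i to itself (0.0) and to four other countries
e = np.array([0.0, 0.4, 1.0, 1.6, 2.0])

E_ratio = e / e.max()                             # division by the largest distance
E_minmax = (e - e.min()) / (e.max() - e.min())    # min-max normalisation

print(E_ratio)
print(np.allclose(E_ratio, E_minmax))
```

Both expressions agree because the smallest distance is 0, so subtracting $x_{min}$ changes nothing.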

### Imputations
We want our imputations $x^{j}_{i,y}$ for a missing sub-indicator $j$ in country $i$ in year $y$ to be similar to the values of sub-indicator $j$ in countries $k$ which have a **small** Euclidean distance $E_y$ to country $i$, and dissimilar to those in countries $k$ which have a **large** Euclidean distance $E_y$. Consequently, the imputations $x^{j}_{i,y}$ are *weighted* averages where the *weights* are equal to the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$.

First, we compute $E_y$ from all sub-indicators $j$ which are available in both countries $i$ and $k$. Since pairs of countries have different numbers of shared data points, we average by multiplying the sum by $1/J$, where $J$ is the number of sub-indicators taken into account for this pair. Note that $J$ is not necessarily 375, because many sub-indicators have missing values. Second, we sum the weighted values $x^j_{k,y}$ over all countries $k$ and compute the average by dividing by $K$.

$$ x^{j}_{i,y} = \frac{1}{K} \sum_k \frac{1}{|E_y(c_{i}, c_{k})|} \cdot x^j_{k,y} $$
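
A minimal sketch of this equation on toy data, assuming one missing value and four hypothetical neighbouring countries. The function `impute` and the arrays are illustrative and not part of this notebook; neighbours without a value or without a usable distance are skipped.

```python
import numpy as np

def impute(values_k, distances_k):
    """Impute one value as (1/K) * sum_k x_k / |E_k| over the usable neighbours."""
    values_k = np.asarray(values_k, dtype=float)
    distances_k = np.asarray(distances_k, dtype=float)
    # a neighbour is usable if it has a value and a non-zero, non-NaN distance
    usable = ~np.isnan(values_k) & ~np.isnan(distances_k) & (distances_k != 0)
    K = int(usable.sum())
    if K == 0:
        return np.nan    # nothing to impute from
    return float(np.sum(values_k[usable] / np.abs(distances_k[usable])) / K)

# hypothetical neighbours: standardised values x_k and standardised distances E
values_k = [0.2, -0.4, np.nan, 1.0]
distances_k = [0.5, 0.8, 0.1, np.nan]

x_imputed = impute(values_k, distances_k)
print(x_imputed)
```

Because the weights are not divided by their sum, a neighbour with a very small distance can push the result beyond the range of the observed values.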

**Assumptions:** we calculate how similar countries are according to their values for *all* sub-indicators in a given year. We assume that the specific sub-indicators which do not have values in this given year are exactly as similar as the ones which we can calculate a distance for.

Let's have a final check before we start our w-kNN algorithm to compute all missing values. We have here negative values which might sound confusing, but bear in mind that we have standardised the data before, i.e. the data distribution has mean 0 and standard deviation 1.

In [5]:
dict_all_std['Iraq (Central Iraq)'].head()

SeriesCode,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
VC_IHR_PSRC,,,,,,,,,,,...,-1.02039,0.743375,0.606428,1.05681,-1.38622,,,,,
VC_IHR_PSRCN,,,,,,,,,,,...,-1.92745,0.155164,0.23182,0.871678,0.668792,,,,,
SI_POV_DAY1,,,,,,,,,,,...,,,,,,,,,,
SI_POV_EMP1,,,,,,,,,,,...,,,,,,,,,,
SI_POV_NAHC,,,,,,,,,,,...,,,,,,,,,,


In [6]:
# CHECKPOINT
dict_e = pickle.load(open('utils/data/distances_unstd.pkl', 'rb'))
dict_E = pickle.load(open('utils/data/distances_std.pkl', 'rb'))

In [7]:
# check
print('unstandardised distance e:', dict_e['2000', 'Afghanistan', 'Colombia'])
print('standardised distance E:  ', dict_E['2000', 'Afghanistan', 'Colombia'])

unstandardised distance e: 1.2235415051891592
standardised distance E:   0.48166382098188765


### Calculating the Euclidean distance

We can calculate the standardised distance $E_y(c_{i}, c_{k})$ after having prepared everything. We do this for each unique pair of two *countries* in each year. In other words, we skip the case $i = k$ and, because $E_y(c_{i}, c_{k}) = E_y(c_{k}, c_{i})$, we compute each pair only once.

The Python module `itertools` helps us generate the unique pairs of countries.
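
To illustrate on a small scale, the sketch below builds all unordered pairs of four placeholder names. `toy_countries` is a made-up list; the notebook itself uses `countries`.

```python
import itertools

# four placeholder names instead of the full list of countries
toy_countries = ['A', 'B', 'C', 'D']

pairs = list(itertools.combinations(toy_countries, 2))
print(pairs)

# n names yield n * (n - 1) / 2 unordered pairs without self-pairs
n = len(toy_countries)
print(len(pairs), n * (n - 1) // 2)
```

Each pair appears once, so `('A', 'B')` is listed but `('B', 'A')` and `('A', 'A')` are not.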

In [8]:
# create list out of all unique combinations
countrycombinations = list(itertools.combinations(countries, 2))
countrycombinations

[('Iraq (Central Iraq)', 'Afghanistan'),
 ('Iraq (Central Iraq)', 'Albania'),
 ('Iraq (Central Iraq)', 'Algeria'),
 ('Iraq (Central Iraq)', 'Angola'),
 ('Iraq (Central Iraq)', 'Anguilla'),
 ('Iraq (Central Iraq)', 'Antigua and Barbuda'),
 ('Iraq (Central Iraq)', 'Argentina'),
 ('Iraq (Central Iraq)', 'Armenia'),
 ('Iraq (Central Iraq)', 'Aruba'),
 ('Iraq (Central Iraq)', 'Australia'),
 ('Iraq (Central Iraq)', 'Austria'),
 ('Iraq (Central Iraq)', 'Azerbaijan'),
 ('Iraq (Central Iraq)', 'Bahamas'),
 ('Iraq (Central Iraq)', 'Bahrain'),
 ('Iraq (Central Iraq)', 'Bangladesh'),
 ('Iraq (Central Iraq)', 'Barbados'),
 ('Iraq (Central Iraq)', 'Belarus'),
 ('Iraq (Central Iraq)', 'Belgium'),
 ('Iraq (Central Iraq)', 'Belize'),
 ('Iraq (Central Iraq)', 'Benin'),
 ('Iraq (Central Iraq)', 'Bermuda'),
 ('Iraq (Central Iraq)', 'Bhutan'),
 ('Iraq (Central Iraq)', 'Bolivia (Plurinational State of)'),
 ('Iraq (Central Iraq)', 'Bosnia and Herzegovina'),
 ('Iraq (Central Iraq)', 'Botswana'),
 ('Iraq (Cent

In [9]:
# check
countrycombinations[0][0]

'Iraq (Central Iraq)'

Here, we calculate the standardised distance $E_y(c_{i}, c_{k})$ for each unique pair of two countries in each year.

In [10]:
from scipy.spatial import distance

First, we compute the (not standardised) distances $e_y$ and insert them into a new dictionary `dict_e`.

While exploring the data, we see that almost no data are available for the years `1990` to `1999`. Imputations in those years would rest on very weak foundations, so we do not consider these years for now. For our similarity investigations later, the total number of data points per country matters less than all countries having the same number of data points. For now, we also omit the year `2019`, because it seems that not all countries have reported their data yet, so only few data points are available for it either.

We set the `period` of years we want to consider in our computations for $e_y$.

In [11]:
period = ['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018']
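
The same list can also be generated instead of typed out. A minimal sketch, with `period_generated` as an illustrative name:

```python
# years 2000 to 2018 inclusive, as strings to match the column labels
period_generated = [str(year) for year in range(2000, 2019)]

print(period_generated[0], period_generated[-1], len(period_generated))
```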

In [12]:
# call seriescodes again
info = pd.read_csv('utils/info.csv', dtype=str)
seriescodes = list(info['SeriesCode'])
seriescodes

['SI_POV_DAY1',
 'SI_POV_EMP1',
 'SI_POV_NAHC',
 'SI_COV_MATNL',
 'SI_COV_POOR',
 'SI_COV_SOCAST',
 'SI_COV_SOCASTPQ',
 'SI_COV_SOCINS',
 'SI_COV_CHLD',
 'SI_COV_SOCINSPQ',
 'SI_COV_UEMP',
 'SI_COV_VULN',
 'SI_COV_WKINJRY',
 'SI_COV_BENFTS',
 'SI_COV_DISAB',
 'SI_COV_LMKT',
 'SI_COV_LMKTPQ',
 'SI_COV_PENSN',
 'SP_ACS_BSRVH2O',
 'SP_ACS_BSRVSAN',
 'VC_DSR_GDPLS',
 'VC_DSR_MISS',
 'VC_DSR_AFFCT',
 'VC_DSR_MORT',
 'VC_DSR_MTMP',
 'VC_DSR_MTMN',
 'VC_DSR_DAFF',
 'VC_DSR_IJILN',
 'VC_DSR_PDAN',
 'VC_DSR_PDYN',
 'VC_DSR_PDLN',
 'SG_DSR_LGRGSR',
 'SG_DSR_SILS',
 'SG_DSR_SILN',
 'SG_GOV_LOGV',
 'VC_DSR_LSGP',
 'VC_DSR_AGLN',
 'VC_DSR_HOLN',
 'VC_DSR_CILN',
 'VC_DSR_CHLN',
 'VC_DSR_DDPA',
 'SD_XPD_ESED',
 'SN_ITK_DEFC',
 'AG_PRD_FIESSI',
 'AG_PRD_FIESSIN',
 'SN_ITK_DEFCN',
 'SH_STA_STUNT',
 'SH_STA_STUNTN',
 'SH_STA_WASTE',
 'SH_STA_WASTEN',
 'SH_STA_OVRWGT',
 'SH_STA_OVRWGTN',
 'PD_AGR_SSFP',
 'SI_AGR_SSFP',
 'ER_GRF_ANIMRCNT',
 'ER_GRF_PLNTSTOR',
 'ER_RSK_LBREDS',
 'AG_PRD_ORTIND',
 'DC_TOF_A

### Comparing Euclidean distances computed with vectors of different dimensions

Simply computing the Euclidean distance between all pairs of countries will not give us comparable results, because the number of measurements available for both countries differs from pair to pair.

We account for this by counting the number of shared measurements `j` for each pair of countries and setting the weight `w` to the inverse of this number.
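
The effect of this weight can be sketched with plain NumPy, assuming the weight is applied to every squared difference before summing. The helper `weighted_distance` and the toy vectors are illustrative: two hypothetical pairs differ by 1.0 in every shared measurement, but one pair shares 4 measurements and the other 100.

```python
import numpy as np

def weighted_distance(u, v):
    """Euclidean distance with uniform weights w = 1/j on the squared differences."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    j = len(u)    # number of shared measurements
    return float(np.sqrt(np.sum((u - v) ** 2) / j))

# weighted distances do not depend on the number of shared measurements
few = weighted_distance(np.zeros(4), np.ones(4))
many = weighted_distance(np.zeros(100), np.ones(100))

# unweighted distances grow with the number of shared measurements
plain_few = float(np.linalg.norm(np.zeros(4) - np.ones(4)))
plain_many = float(np.linalg.norm(np.zeros(100) - np.ones(100)))

print(few, many)
print(plain_few, plain_many)
```

With the weight, both pairs are equally far apart; without it, the pair with more shared measurements would look five times more dissimilar.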

In [13]:
# very compute-intensive (~3 hours computing time)
# no need to run every time again, just see CHECKPOINT above and load pickle file

dict_e = {}

for year in period:

    for country0, country1 in countrycombinations:

        country0_e = []    # values of the first country
        country1_e = []    # values of the second country, aligned with country0_e

        for seriescode in seriescodes:
            value0 = dict_all_std[country0].loc[seriescode, year]
            value1 = dict_all_std[country1].loc[seriescode, year]
            # we can only consider sub-indicators with data available in both countries
            if not pd.isna(value0) and not pd.isna(value1):
                country0_e.append(value0)
                country1_e.append(value1)

        j = len(country0_e)    # number of shared data points

        if j > 0:
            # uniform weights 1/j make distances comparable across pairs with different j;
            # a 1-D weight vector is passed because recent SciPy versions reject a scalar w
            e = distance.euclidean(country0_e, country1_e, w=np.full(j, 1 / j))
        else:
            e = np.nan    # no shared data points

        # the distance is symmetric, so store it under both key orders
        dict_e[year, country0, country1] = e
        dict_e[year, country1, country0] = e


In [14]:
# better save this precious data
with open('utils/data/distances_unstd.pkl', 'wb') as f:
    pickle.dump(dict_e, f)

In [15]:
# check
print(dict_e['2011', 'Switzerland', 'Iraq (Central Iraq)'])

0.16997936185573015


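We normalise these distances $e_y$ and save them in `dict_E`. Note that the code divides by the largest distance found among *all* pairs of countries in a given year, not by the largest distance per country $i$: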

In [16]:
dict_E = {}

for year in period:

    # auxiliary dictionary with all distances of this year
    # (keys are tuples of the form (year, country, country))
    dict_e_year = {k: e for k, e in dict_e.items() if k[0] == year}

    # largest distance between any two countries in this year, ignoring NaN
    max_e = np.nanmax(list(dict_e_year.values()))

    for k, e in dict_e_year.items():
        # NaN divided by max_e stays NaN, so missing distances are kept as NaN
        dict_E[k] = e / max_e    # standardise distance

In [17]:
# check
print('unstandardised distance e:', dict_e['2018', 'Afghanistan', 'Colombia'])
print('standardised distance E:  ', dict_E['2018', 'Afghanistan', 'Colombia'])

unstandardised distance e: 1.032646617556478
standardised distance E:   0.39086287545542586


In [18]:
# check
print(0 in dict_e.values())
print(0 in dict_E.values())

False
False


In [19]:
# check 
min_value = 0.1

for key, value in dict_E.items():
    if 0 < value < min_value:
        min_value = value
        print('smallest:', key, value)

smallest: ('2013', 'Kenya', 'Iraq (Central Iraq)') 0.02282516469393203
smallest: ('2010', 'Iraq (Kurdistan Region)', 'Lesotho') 0.013714356426899981
smallest: ('2010', 'Czechia', 'Iraq (Kurdistan Region)') 0.005436630255815093
smallest: ('2001', 'Kiribati', 'Western Sahara') 0.0033793946145553744


In [20]:
# better save this precious data
with open('utils/data/distances_std.pkl', 'wb') as f:
    pickle.dump(dict_E, f)

### Imputations for countries

Now, we impute the missing values according to the equation we previously derived:

$$ x^{j'}_{i,y} = \frac{1}{K} \sum_k \frac{1}{|E_y(c_{i}, c_{k})|} \cdot x^j_{k,y} $$

To recap, our imputations $x^{j'}_{i,y}$ for the missing sub-indicator $j'$ in country $i$ in year $y$ should be similar to the sub-indicators $j$ of country $k$, weighted by the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$ between $i$ and $k$.

As mentioned above, $E_y$ depends on the number of sub-indicators for which both countries have data. The entries of `dict_E` are already normalised by this number. For each imputation, we multiply the weight, i.e. the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$ between $i$ and $k$, by the value $x^j_{k,y}$ of the other country $k$ in year $y$ for sub-indicator $j$. We sum these weighted values over $k$ and compute the average by dividing by $K$, the number of countries which have a value available for the sub-indicator to be imputed.

We also know that some indicators are **binary** for countries, i.e. 1 for 'yes' and 0 for 'no', and ordinal for groupings of countries. These indicators usually start with 'number of countries which...'. The imputations for these must be handled differently: we round the imputed value to the nearest integer.
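
A short sketch of the rounding step, using made-up raw imputations. `np.around` rounds to the nearest integer and sends exact halves to the nearest even integer, so the result is an integer but not necessarily 0 or 1.

```python
import numpy as np

# hypothetical raw imputations for a binary sub-indicator
raw = np.array([0.2, 0.5, 0.51, 1.5, -0.3])

rounded = np.around(raw)    # nearest integer, exact halves go to the even integer
print(rounded)
```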

In [21]:
# CHECKPOINT
dict_all_i = pickle.load(open('utils/data/dict_all_i.pkl', 'rb'))

In [22]:
binary = ['1.5.3', '5.1.1', '5.6.2', '5.a.2', '5.c.1', '8.b.1', '10.7.2', '11.b.1', '12.1.1', '12.7.1', '13.1.2', '13.2.1', '13.3.1', '13.3.2', '13.b.1', '14.c.1', '15.6.1', '15.8.1', '16.10.2', '17.5.1', '17.14.1', '17.16.1', '17.18.2', '17.18.3', '17.19.2']

In [23]:
info_binary = info.loc[info['Indicator'].isin(binary)]
binary_seriescodes = list(info_binary['SeriesCode'])

binary_seriescodes

['SG_DSR_LGRGSR',
 'SG_LGL_GENEQLFP',
 'SG_LGL_GENEQVAW',
 'SG_LGL_GENEQEMP',
 'SG_LGL_GENEQMAR',
 'SG_GEN_EQPWN',
 'SG_CPA_MIGR',
 'SG_CPA_MIGRP',
 'SG_DSR_LGRGSR',
 'SG_SCP_CNTRY',
 'SG_SCP_CORMEC',
 'SG_SCP_MACPOL',
 'SG_SCP_POLINS',
 'SG_DSR_LGRGSR',
 'ER_CBD_SMTA',
 'ER_CBD_NAGOYA',
 'ER_CBD_ABSCLRHS',
 'ER_CBD_ORSPGRFA',
 'ER_CBD_PTYPGRFA',
 'SG_INF_ACCSS',
 'SG_PLN_MSTKSDG',
 'SG_STT_NSDSFDGVT',
 'SG_STT_FPOS',
 'SG_STT_NSDSFDDNR',
 'SG_STT_NSDSFDOTHR',
 'SG_STT_NSDSIMPL',
 'SG_STT_NSDSFND',
 'SG_REG_BRTH90',
 'SG_REG_DETH75',
 'SG_REG_CENSUS',
 'SG_REG_CENSUSN',
 'SG_REG_BRTH90N',
 'SG_REG_DETH75N']

In [24]:
# compute-intensive (~3 hours)

dict_all_i = copy.deepcopy(dict_all_std)

for country in countries:

    other_countries = [c for c in countries if c != country]

    for seriescode in seriescodes:
        for year in period:
            # only missing values are imputed
            if not pd.isna(dict_all_std[country].loc[seriescode, year]):
                continue

            weighted_values = []    # x_k / E(i, k) for every usable country k

            for other in other_countries:
                value = dict_all_std[other].loc[seriescode, year]
                E = dict_E[(year, country, other)]
                # the other country needs a value and a non-zero, non-NaN distance
                if not pd.isna(value) and not pd.isna(E) and E != 0:
                    weighted_values.append(value / E)

            K = len(weighted_values)    # number of countries contributing

            if K > 0:
                imputation = np.sum(weighted_values) / K
                if seriescode in binary_seriescodes:
                    imputation = np.around(imputation)    # round to the nearest integer
            else:
                # when data is standardised, values which are unknown can be made 0
                imputation = 0

            dict_all_i[country].loc[seriescode, year] = imputation

In [25]:
# check 
max_values = []

for c in dict_all_i.keys():
    max_values.append(dict_all_i[c].max().max())
    
max(max_values)

46.403227535450206

In [26]:
# check
print('NaN here', dict_all_std['Iraq (Central Iraq)'].loc['SI_POV_DAY1', '2018'])
print('Imputed value', dict_all_i['Iraq (Central Iraq)'].loc['SI_POV_DAY1', '2018'])

NaN here nan
Imputed value 0.0


In [27]:
# check
print('Imputed value', dict_all_i['Iraq (Central Iraq)'].loc['SI_POV_DAY1'])

Imputed value TimePeriod
1990         NaN
1991         NaN
1992         NaN
1993         NaN
1994         NaN
1995         NaN
1996         NaN
1997         NaN
1998         NaN
1999         NaN
2000           0
2001           0
2002           0
2003           0
2004           0
2005           0
2006           0
2007           0
2008           0
2009           0
2010    -1.54223
2011    -1.36914
2012   -0.354238
2013   -0.600999
2014    -1.17889
2015           0
2016           0
2017           0
2018           0
2019         NaN
Name: SI_POV_DAY1, dtype: object


Deleting all groupings which are not countries:

In [28]:
groupings = []

for group in dict_all_i.keys():
    if group not in countries:
        #print(group)
        groupings.append(group)

In [29]:
## delete groupings which we did not impute for
for group in groupings:
    dict_all_i.pop(group)

We want to save the imputations to have another checkpoint here.

In [30]:
# as csv files
os.makedirs('csv_imputed', exist_ok=True)

for c in countries:
    dict_all_i[c].to_csv(r'csv_imputed/{}.csv'.format(c))

# as pkl files
with open('utils/data/dict_all_i.pkl', 'wb') as imp:
    pickle.dump(dict_all_i, imp)

# Averaging and concatenating data to higher levels

### Averaging and concatenating series codes data to indicator-level

1. We can average all series codes, i.e. sub-indicators, belonging to one indicator to this indicator.
2. We can see the series codes as multiple samples of the same indicator and concatenate the series codes into indicators. Consequently, we have more than one measurement per time point for any indicators having more than one series code.
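
The two options can be sketched on toy data as follows. The series codes and values are made up; the notebook itself works on `dict_all_i`.

```python
import numpy as np

# hypothetical imputed values of two series codes belonging to one indicator in one year
series_values = {'series_a': 0.4, 'series_b': -1.0}

# option 1: average, which gives one value per indicator and time point
averaged = float(np.mean(list(series_values.values())))

# option 2: concatenate, which keeps every measurement per indicator and time point
concatenated = list(series_values.values())

print(averaged)
print(concatenated)
```

Averaging keeps the data structure simple, while concatenating preserves the spread between the series codes.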

In [31]:
indicators = list(info.Indicator)

dict_indicators = {}

for indicator in indicators:
    i = info['SeriesCode'].where(info['Indicator'] == indicator)

    dict_indicators[indicator] = [s for s in i if str(s) != 'nan']

In [32]:
# check
float(dict_all_i['Germany'].loc['SI_POV_DAY1', '2001'])

0.21121571845570566

In [33]:
#indicators_values = {}
#indicators_values_std = {}
indicators_values_i = {}

for country in countries:
    #print(country)
    
    #indicators_values[country] = pd.DataFrame(columns=period, index=indicators)
    #indicators_values_std[country] = pd.DataFrame(columns=period, index=indicators)
    indicators_values_i[country] = pd.DataFrame(columns=period, index=list(dict_indicators.keys()))
    
    for year in period:
        
        for indicator in dict_indicators.keys():
            #list_subindicators_values = []
            #list_subindicators_values_std = []
            list_subindicators_values_i = []
    
            for subindicator in list(dict_indicators[indicator]):
                #list_subindicators_values.append(dict_all[country].loc[subindicator, year])
                #list_subindicators_values_std.append(dict_all_std[country].loc[subindicator, year])
                list_subindicators_values_i.append(dict_all_i[country].loc[subindicator, year])
    
            # print(list_subindicators_values_i)
            
            # 1. averaging
            #indicators_values[country].loc[indicator, year] = np.nanmean(list_subindicators_values)
            #indicators_values_std[country].loc[indicator, year] = np.nanmean(list_subindicators_values_std)
            #indicators_values_i[country].loc[indicator, year] = np.nanmean(list_subindicators_values_i)
            
            # 2. concatenating
            indicators_values_i[country].loc[indicator, year] = list_subindicators_values_i

In [34]:
# check (duplicate indicator labels should be in index)
indicators_values_i['Germany'].loc['1.1.1']

2000      [-1.9217680442426106, 2.2008214894842855]
2001       [0.21121571845570566, 1.638789854810854]
2002        [0.4765093257490252, 2.052401791117212]
2003       [-0.8457396181957801, 1.450278128223658]
2004       [1.5231366170848841, 0.9488815360593101]
2005       [1.1967487577633398, 0.6050117060629797]
2006    [0.008352203829039376, 0.23113708789500637]
2007      [0.6330551056315804, 0.18678212275752157]
2008      [1.3311920796158387, -0.2482393953251333]
2009       [0.6036211264341681, -0.674539111104468]
2010     [0.48499639490114665, -0.7730997835682288]
2011      [-1.249270498430822, -0.8026337371230474]
2012        [-0.73336165128779, -0.745337022068899]
2013      [-1.568163034735681, -1.1367546537264637]
2014       [-1.190691630275826, -1.628094875108401]
2015     [-0.6322221582062566, -1.2710366185055673]
2016      [-1.1761956044470434, -1.097135492598621]
2017     [-1.5752372099919585, -1.2303106882176178]
2018                                         [0, 0]
Name: 1.1.1,

In [35]:
# better save these precious data
#ind_val = open('utils/data/indicators_values.pkl', 'wb')
#ind_val_std = open('utils/data/indicators_values_std.pkl', 'wb')
ind_val_i = open('utils/data/indicators_values_i.pkl', 'wb')
#pickle.dump(indicators_values, ind_val)
#pickle.dump(indicators_values_std, ind_val_std)
pickle.dump(indicators_values_i, ind_val_i)
#ind_val.close()
#ind_val_std.close()
ind_val_i.close()

### Averaging and concatenating indicator data to target-level
We must also generate two lists of indicators which are meant to increase and decrease over time.
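
The purpose of these lists is to flip the sign of every indicator which is meant to decrease, so that all time-series point upwards when they improve. A minimal sketch on made-up data, where `values`, `decrease_example` and `values_up` are illustrative names:

```python
import numpy as np

# hypothetical concatenated values for two years and two indicators
values = {
    '1.1.1': [[0.5, -1.0], [0.2, 0.3]],    # meant to decrease over time
    '1.3.1': [[1.0], [1.5]],               # meant to increase over time
}
decrease_example = ['1.1.1']

values_up = {}
for indicator, per_year in values.items():
    # indicators which are meant to decrease are multiplied by -1
    sign = -1 if indicator in decrease_example else 1
    values_up[indicator] = [np.multiply(year_values, sign).tolist()
                            for year_values in per_year]

print(values_up)
```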

In [36]:
# gone through all targets by hand and checked which indicators are meant to increase and which are meant to decrease over time
increase = ['1.3.1', '1.4.1', '1.4.2', '1.5.3', '1.5.4', '1.a.1', '1.a.2', '1.a.3', '1.b.1', '2.3.1', '2.3.2', '2.4.1', '2.5.1', '2.a.1', '2.a.2', '3.1.2', '3.5.1', '3.7.1', '3.8.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1', '4.1.1', '4.2.1', '4.2.2', '4.3.1', '4.4.1', '4.6.1', '4.7.1', '4.a.1', '4.b.1', '4.c.1', '5.1.1', '5.5.1', '5.5.2', '5.6.1', '5.6.2', '5.a.1', '5.a.2', '5.b.1', '5.c.1', '6.1.1', '6.2.1', '6.3.1', '6.3.2', '6.4.1', '6.5.1', '6.5.2', '6.6.1', '6.a.1', '6.b.1', '7.1.1', '7.1.2', '7.2.1', '7.3.1', '7.a.1', '7.b.1', '8.1.1', '8.2.1', '8.3.1', '8.5.1', '8.8.2', '8.9.1', '8.9.2', '8.10.1', '8.10.2', '8.a.1', '8.b.1', '9.1.1', '9.1.2', '9.2.1', '9.2.2', '9.3.1', '9.3.2', '9.5.1', '9.5.2', '9.a.1', '9.b.1', '9.c.1', '10.1.1', '10.4.1', '10.5.1', '10.6.1', '10.7.2', '10.a.1', '10.b.1', '11.2.1', '11.3.2', '11.4.1', '11.6.1', '11.7.1', '11.a.1', '11.b.1', '11.b.2', '11.c.1', '12.1.1', '12.4.1', '12.5.1', '12.6.1', '12.7.1', '12.8.1', '12.a.1', '12.b.1', '13.1.2', '13.1.3', '13.2.1', '13.3.1', '13.3.2', '13.a.1', '13.b.1', '14.2.1', '14.3.1', '14.4.1', '14.5.1', '14.6.1', '14.7.1', '14.a.1', '14.b.1', '14.c.1', '15.1.1', '15.1.2', '15.2.1', '15.4.1', '15.4.2', '15.6.1', '15.8.1', '15.9.1', '15.a.1', '15.b.1', '16.1.4', '16.6.2', '16.7.1', '16.7.2', '16.8.1', '16.9.1', '16.10.2', '16.a.1', '17.1.1', '17.1.2', '17.2.1', '17.3.1', '17.3.2', '17.4.1', '17.5.1', '17.6.1', '17.6.2', '17.7.1', '17.8.1', '17.9.1', '17.11.1', '17.13.1', '17.14.1', '17.15.1', '17.16.1', '17.17.1', '17.18.1', '17.18.2', '17.18.3', '17.19.1', '17.19.2']
decrease = ['1.1.1', '1.2.1', '1.2.2', '1.5.1', '1.5.2', '2.1.1', '2.1.2', '2.2.1', '2.2.2', '2.5.2','2.b.1', '2.c.1', '3.1.1', '3.2.1', '3.2.2', '3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.2', '3.6.1', '3.7.2', '3.8.2', '3.9.1', '3.9.2', '3.9.3', '3.a.1', '4.5.1', '5.2.1', '5.2.2', '5.3.1', '5.3.2', '5.4.1', '6.4.2', '8.4.1', '8.4.2', '8.5.2', '8.6.1', '8.7.1', '8.8.1', '9.4.1', '10.2.1', '10.3.1', '10.7.1', '10.c.1', '11.1.1', '11.3.1', '11.5.1', '11.5.2', '11.6.2', '11.7.2', '12.2.1', '12.2.2', '12.3.1', '12.4.2', '12.c.1', '13.1.1', '14.1.1', '15.3.1', '15.5.1', '15.7.1', '15.c.1', '16.1.1', '16.1.2', '16.1.3', '16.2.1', '16.2.2', '16.2.3', '16.3.1', '16.3.2', '16.4.1', '16.4.2', '16.5.1', '16.5.2', '16.6.1', '16.10.1', '16.b.1', '17.10.1', '17.12.1']

In [37]:
# making all time-series "pointing" upwards when they are meant to increase
#indicators_values_up = copy.deepcopy(indicators_values)
#indicators_values_std_up = copy.deepcopy(indicators_values_std)
indicators_values_i_up = copy.deepcopy(indicators_values_i)

for country in countries:
    
    for indicator in dict_indicators.keys():
        if indicator in decrease:
            #indicators_values_up[country].loc[indicator] = indicators_values[country].loc[indicator]*(-1)
            #indicators_values_std_up[country].loc[indicator] = indicators_values_std[country].loc[indicator]*(-1)
            indicators_values_i_up[country].loc[indicator] = list(np.multiply(list(indicators_values_i[country].loc[indicator]), -1))
        else:
            #indicators_values_up[country].loc[indicator] = indicators_values[country].loc[indicator]
            #indicators_values_std_up[country].loc[indicator] = indicators_values_std[country].loc[indicator]
            indicators_values_i_up[country].loc[indicator] = indicators_values_i[country].loc[indicator]

In [38]:
# better save these precious data
#ind_val = open('utils/data/indicators_values_up.pkl', 'wb')
#ind_val_std = open('utils/data/indicators_values_std_up.pkl', 'wb')
ind_val_i = open('utils/data/indicators_values_i_up.pkl', 'wb')
#pickle.dump(indicators_values_up, ind_val)
#pickle.dump(indicators_values_std_up, ind_val_std)
pickle.dump(indicators_values_i_up, ind_val_i)
#ind_val.close()
#ind_val_std.close()
ind_val_i.close()

Defining dictionaries for targets:

In [39]:
targets = list(info['Target'].unique())

dict_targets = {}

for target in targets:
    t = info['Indicator'].where(info['Target'] == target)

    dict_targets[target] = [i for i in t if str(i) != 'nan']

In [40]:
# check
list(indicators_values_i_up['Germany'].loc['1.1.1', '2000'])

[1.9217680442426106, -2.2008214894842855]

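Now, we can simply average or concatenate the indicator values of each target. Only the concatenation is active below; the averaging variant is commented out.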

In [41]:
targets_values_i = {}
#targets_values_up = {}
#targets_values_std_up = {}
targets_values_i_up = {}    # for Granger-causality

for country in countries:
    
    #targets_values_up[country] = pd.DataFrame(columns=period, index=targets)
    #targets_values_std_up[country] = pd.DataFrame(columns=period, index=targets)
    targets_values_i[country] = pd.DataFrame(columns=period, index=list(dict_targets.keys()))
    targets_values_i_up[country] = pd.DataFrame(columns=period, index=list(dict_targets.keys()))
    
    for year in period:
        
        for target in list(dict_targets.keys()):
            #list_indicators_values = []
            #list_indicators_values_std = []
            list_indicators_values_i = []
            list_indicators_values_i_up = []
    
            for indicator in list(dict_targets[target]):
                #list_indicators_values.append(indicators_values[country].loc[indicator, year])
                #list_indicators_values_std.append(indicators_values_std[country].loc[indicator, year])
                list_indicators_values_i.extend(indicators_values_i[country].loc[indicator, year])
                list_indicators_values_i_up.extend(indicators_values_i_up[country].loc[indicator, year])
    
            #print(list_indicators_values_i)
            
            # 1. averaging
            #targets_values_up[country].loc[target, year] = np.mean(list_indicators_values)
            #targets_values_std_up[country].loc[target, year] = np.mean(list_indicators_values_std)
            #targets_values_i_up[country].loc[target, year] = np.mean(list_indicators_values_i)
            
            # 2. concatenating
            targets_values_i[country].loc[target, year] = list_indicators_values_i
            targets_values_i_up[country].loc[target, year] = list_indicators_values_i_up

In [42]:
# check (each target should have a list in each cell with the values of its sub-indicators)
targets_values_i_up['Germany'].loc['1.1']

2000    [1.9217680442426106, -2.2008214894842855, 1.92...
2001    [-0.21121571845570566, -1.638789854810854, -0....
2002    [-0.4765093257490252, -2.052401791117212, -0.4...
2003    [0.8457396181957801, -1.450278128223658, 0.845...
2004    [-1.5231366170848841, -0.9488815360593101, -1....
2005    [-1.1967487577633398, -0.6050117060629797, -1....
2006    [-0.008352203829039376, -0.23113708789500637, ...
2007    [-0.6330551056315804, -0.18678212275752157, -0...
2008    [-1.3311920796158387, 0.2482393953251333, -1.3...
2009    [-0.6036211264341681, 0.674539111104468, -0.60...
2010    [-0.48499639490114665, 0.7730997835682288, -0....
2011    [1.249270498430822, 0.8026337371230474, 1.2492...
2012    [0.73336165128779, 0.745337022068899, 0.733361...
2013    [1.568163034735681, 1.1367546537264637, 1.5681...
2014    [1.190691630275826, 1.628094875108401, 1.19069...
2015    [0.6322221582062566, 1.2710366185055673, 0.632...
2016    [1.1761956044470434, 1.097135492598621, 1.1761...
2017    [1.575

Each cell holds one flat list, because the values are joined with `extend`: the values of the first indicator's sub-indicators come first, followed by those of the second indicator, etc. Each indicator represents a couple of sub-indicators.
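
The difference between concatenating and averaging, and between `extend` and `append`, can be seen on two made-up indicators. The names and numbers below are hypothetical and only serve as an illustration.

```python
import numpy as np

# Hypothetical values for one target in one year.
indicator_a = [1.0, -2.0]  # an indicator with two sub-indicators
indicator_b = [0.5]        # an indicator with one sub-indicator

concatenated = []  # built with extend: one flat list
nested = []        # built with append: one inner list per indicator
for values in (indicator_a, indicator_b):
    concatenated.extend(values)
    nested.append(values)

# Averaging collapses all sub-indicator values into a single number.
averaged = float(np.mean(concatenated))
```

Concatenation keeps every sub-indicator value but loses the indicator boundaries, whereas `append` would keep the boundaries as inner lists.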

In [43]:
# better save these precious data
#with open('utils/data/targets_values_up.pkl', 'wb') as tar_val:
#    pickle.dump(targets_values_up, tar_val)
#with open('utils/data/targets_values_std_up.pkl', 'wb') as tar_val_std:
#    pickle.dump(targets_values_std_up, tar_val_std)
with open('utils/data/targets_values_i.pkl', 'wb') as tar_val_i:
    pickle.dump(targets_values_i, tar_val_i)
with open('utils/data/targets_values_i_up.pkl', 'wb') as tar_val_i_up:
    pickle.dump(targets_values_i_up, tar_val_i_up)

### Averaging and concatenating target data to goal-level

Defining dictionaries for goals.

In [44]:
goals = list(info['Goal'].unique())

dict_goals = {}

for goal in goals:
    g = info['Target'].where(info['Goal'] == goal)

    dict_goals[goal] = [t for t in g if str(t) != 'nan']

In [45]:
goals_values_i = {}
#goals_values_up = {}
#goals_values_std_up = {}
goals_values_i_up = {}    # for Granger-causality

for country in countries:
    
    #goals_values_up[country] = pd.DataFrame(columns=period, index=goals)
    #goals_values_std_up[country] = pd.DataFrame(columns=period, index=goals)
    goals_values_i[country] = pd.DataFrame(columns=period, index=list(dict_goals.keys()))
    goals_values_i_up[country] = pd.DataFrame(columns=period, index=list(dict_goals.keys()))
    
    for year in period:
        
        for goal in goals:
            #list_targets_values = []
            #list_targets_values_std = []
            list_targets_values_i = []
            list_targets_values_i_up = []
    
            for target in list(dict_goals[goal]):
                #list_targets_values.append(targets_values_up[country].loc[target, year])
                #list_targets_values_std.append(targets_values_std_up[country].loc[target, year])
                list_targets_values_i.extend(targets_values_i[country].loc[target, year])
                list_targets_values_i_up.extend(targets_values_i_up[country].loc[target, year])
    
            #print(list_targets_values_i)
            
            # 1. averaging
            #goals_values_up[country].loc[goal, year] = np.mean(list_targets_values)
            #goals_values_std_up[country].loc[goal, year] = np.mean(list_targets_values_std)
            #goals_values_i_up[country].loc[goal, year] = np.mean(list_targets_values_i)
            
            # 2. concatenating
            goals_values_i[country].loc[goal, year] = list_targets_values_i
            goals_values_i_up[country].loc[goal, year] = list_targets_values_i_up
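
Since the target-level and goal-level loops share the same shape, the roll-up could be written once as a helper. This is a sketch under assumptions, not the code used in this notebook: `concatenate_children` and all `toy_*` names are hypothetical.

```python
import pandas as pd

def concatenate_children(children_values, mapping, columns):
    """Concatenate the list cells of all children of each parent, per column."""
    parent = pd.DataFrame(columns=columns, index=list(mapping.keys()), dtype=object)
    for col in columns:
        for key, children in mapping.items():
            merged = []
            for child in children:
                # each cell is a list of sub-indicator values
                merged.extend(children_values.at[child, col])
            parent.at[key, col] = merged
    return parent

# Hypothetical target-level data for one country and one year.
toy_targets = pd.DataFrame(
    {'2000': [[1.0], [2.0, 3.0], [4.0]]},
    index=['1.1', '1.2', '2.1'],
)
toy_dict_goals = {'1': ['1.1', '1.2'], '2': ['2.1']}

toy_goals = concatenate_children(toy_targets, toy_dict_goals, ['2000'])
```

The same helper would also cover the step from indicators to targets if it were given an indicator-level frame and a target-to-indicator mapping.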

In [46]:
# check (each goal should have a list in each cell with the values of its sub-indicators)
goals_values_i_up['Germany'].loc['1']

2000    [1.9217680442426106, -2.2008214894842855, 1.92...
2001    [-0.21121571845570566, -1.638789854810854, -0....
2002    [-0.4765093257490252, -2.052401791117212, -0.4...
2003    [0.8457396181957801, -1.450278128223658, 0.845...
2004    [-1.5231366170848841, -0.9488815360593101, -1....
2005    [-1.1967487577633398, -0.6050117060629797, -1....
2006    [-0.008352203829039376, -0.23113708789500637, ...
2007    [-0.6330551056315804, -0.18678212275752157, -0...
2008    [-1.3311920796158387, 0.2482393953251333, -1.3...
2009    [-0.6036211264341681, 0.674539111104468, -0.60...
2010    [-0.48499639490114665, 0.7730997835682288, -0....
2011    [1.249270498430822, 0.8026337371230474, 1.2492...
2012    [0.73336165128779, 0.745337022068899, 0.733361...
2013    [1.568163034735681, 1.1367546537264637, 1.5681...
2014    [1.190691630275826, 1.628094875108401, 1.19069...
2015    [0.6322221582062566, 1.2710366185055673, 0.632...
2016    [1.1761956044470434, 1.097135492598621, 1.1761...
2017    [1.575

In [47]:
# better save these precious data
#with open('utils/data/goals_values_up.pkl', 'wb') as goa_val:
#    pickle.dump(goals_values_up, goa_val)
#with open('utils/data/goals_values_std_up.pkl', 'wb') as goa_val_std:
#    pickle.dump(goals_values_std_up, goa_val_std)
with open('utils/data/goals_values_i.pkl', 'wb') as goa_val_i:
    pickle.dump(goals_values_i, goa_val_i)
with open('utils/data/goals_values_i_up.pkl', 'wb') as goa_val_i_up:
    pickle.dump(goals_values_i_up, goa_val_i_up)
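
To reuse the saved objects later, a round trip through `pickle` can be sketched as follows. The dictionary `toy_results` and the temporary path are hypothetical; in this notebook the files live under `utils/data/`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for one of the dictionaries saved above.
toy_results = {'Germany': {'1': [1.0, -2.0, 0.5]}}

# Write to a temporary directory so that the sketch does not touch real data.
path = os.path.join(tempfile.mkdtemp(), 'toy_results.pkl')
with open(path, 'wb') as f:
    pickle.dump(toy_results, f)

# Read the object back; the file is closed automatically after each block.
with open(path, 'rb') as f:
    restored = pickle.load(f)
```

The restored object compares equal to the original, which is a quick way to confirm that a save has worked.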