# Imputing missing values

We work with the data set as it currently stands and apply a common machine learning method to impute the missing values:

The **weighted k nearest neighbour (w-kNN)** algorithm, which imputes missing values with weights equal to the inverse Euclidean distance. 

**Assumptions:** we assume that the missing values lie between the lowest and the highest value present in the data set. This is plausible for most of them, but some missing values may well be outliers.

### A note on `countries`

We want to **remind** ourselves that our list `groupings` does not only contain countries. It also contains groups of countries (continents, economic zones, etc.) and islands or territories which belong to certain countries but can differ considerably from them. We save the entries that are countries, or parts of countries, in a new list `countries`.

Furthermore, some sub-indicators are of ordinal type, i.e. they are defined as 'number of countries which...'. For a single country, the value is either 1 ('yes') or 0 ('no'). For all other groupings, the value is most likely larger than 1. This could be tricky for our imputations later, because they are based on the similarity of two groupings, and this similarity could be strong between, say, France and the World Trade Organisation (WTO). If the WTO had a missing value for an ordinal sub-indicator, the imputation would most likely be close to the value of France, i.e. close to 1 or 0. This is unrealistic: all other WTO member countries that behave similarly to France would only pull the imputation closer to 1, not above 1.

Therefore, we focus on `countries` only from here on.
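The notebook removes the non-country groupings by hand further below. As a rough illustration only, the same restriction could be expressed as a filter; the names `groupings_sketch` and `non_country_groupings` and their contents are hypothetical and not taken from the data set.

```python
# Hypothetical toy lists; the real notebook edits a csv file by hand instead.
groupings_sketch = ['France', 'World', 'Germany',
                    'Small Island Developing States (SIDS)', 'Belgium']
non_country_groupings = {'World', 'Small Island Developing States (SIDS)'}

# Keep the original order and drop every entry flagged as a non-country grouping.
countries_sketch = [g for g in groupings_sketch if g not in non_country_groupings]
print(countries_sketch)
```

A set is used for the lookup so that membership tests stay cheap even for long lists.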

In [1]:
import numpy as np
import pandas as pd
import math
import os
import pickle
import copy
import itertools
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
from tqdm.notebook import tqdm
from sklearn.preprocessing import scale

In [2]:
# loading original and standardised data set
dict_all = pickle.load(open('utils/data/dict_all_wb.pkl', 'rb'))
dict_all_std = pickle.load(open('utils/data/dict_all_wb_std.pkl', 'rb'))

In [3]:
# check
print('Original values: ')
print(dict_all['Belgium'].loc['ER.H2O.FWTL.ZS'])

print('--------')

print('Standardised values: ')
print(dict_all_std['Belgium'].loc['ER.H2O.FWTL.ZS'])

Original values: 
1990          NaN
1991          NaN
1992          NaN
1993          NaN
1994          NaN
1995          NaN
1996          NaN
1997    64.083333
1998          NaN
1999          NaN
2000          NaN
2001          NaN
2002    56.125000
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    51.783333
2008          NaN
2009          NaN
2010          NaN
2011          NaN
2012    50.016667
2013          NaN
2014          NaN
2015          NaN
2016          NaN
2017          NaN
2018          NaN
2019          NaN
Name: ER.H2O.FWTL.ZS, dtype: float64
--------
Standardised values: 
1990         NaN
1991         NaN
1992         NaN
1993         NaN
1994         NaN
1995         NaN
1996         NaN
1997    1.580306
1998         NaN
1999         NaN
2000         NaN
2001         NaN
2002    0.114715
2003         NaN
2004         NaN
2005         NaN
2006         NaN
2007   -0.684838
2008         NaN
2009         NaN
2010         NaN
2011         NaN


Let's calculate the total number of values we have:

In [4]:
# number of values
s = 0
for country in dict_all_std.keys():
    s += np.sum(dict_all_std[country].count())

print(s)

850591


Now, we open the `csv` file in a GUI and delete the groupings which are *not* countries or parts of countries. We call these non-country groupings; examples are North America, Western Asia, Least Developed Countries (LDC), Land Locked Developing Countries (LLDC) and Small Island Developing States (SIDS).

In [5]:
# read amended csv file
c = pd.read_csv('utils/countries_wb.csv', dtype=str, delimiter=';', header=None)
countries = list(c[0])

# check
countries

['Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt, Arab Rep.',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia, The',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Grenada',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 'Hungary',
 'Ic

We append these temperature series to the data frame of indicators. 

In [6]:
temp = pickle.load(open('utils/data/temp.pkl', 'rb'))

In [7]:
temp_years = temp['Germany']['YEAR'].astype(str).tolist()

*Hint:* When the data are downloaded, the names of countries do not match exactly. For example 'Republic of Serbia' is the name in the temperature data set, whereas 'Serbia' is the name in the SDG data set. These are aligned manually.
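A minimal sketch of such an alignment, assuming the data are held in a dictionary keyed by country name. The helper `align_names` and the toy temperature values are hypothetical; the two renamings are examples of the mismatches described above.

```python
# Mapping from names in the temperature data to names in the SDG data.
name_map = {'Republic of Serbia': 'Serbia', 'Slovakia': 'Slovak Republic'}

def align_names(data, name_map):
    """Return a copy of `data` whose keys follow the naming in `name_map`."""
    return {name_map.get(name, name): values for name, values in data.items()}

# Toy values, not real temperatures.
temp_sketch = {'Republic of Serbia': [9.8, 10.1],
               'Slovakia': [7.2, 7.5],
               'Germany': [8.4, 8.9]}
aligned = align_names(temp_sketch, name_map)
print(sorted(aligned))
```

Building a new dictionary avoids changing the keys of a dictionary while iterating over it.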

In [8]:
print('# countries in dict_all_std:', len(dict_all_std.keys()))
print('# countries in temp:', len(temp.keys()))

# countries in dict_all_std: 181
# countries in temp: 183


In [9]:
# which country is in dict_all_std but not in temp?
for key in dict_all_std.keys():
    if key not in temp.keys():
        print(key)

print('--------')

for key in temp.keys():
    if key not in dict_all_std.keys():
        print(key)

Micronesia, Fed. Sts.
--------
Greenland
Liechtenstein
Korea, Dem. People's Rep.


In [10]:
# removes countries in-place
countries.remove('Micronesia, Fed. Sts.')
dict_all_std.pop('Micronesia, Fed. Sts.', None)
temp.pop('Greenland', None)
temp.pop('Liechtenstein', None)
temp.pop("Korea, Dem. People's Rep.", None)

Unnamed: 0,YEAR,JAN,FEB,MAR,APR,MAY,JUN,JUL,AUG,SEP,OCT,NOV,DEC,MAM,JJA,SON,DJF,ANN,AVG
0,1901,-10.5,-10.1,-2.0,6.7,12.1,16.7,19.4,20.9,15.1,8.7,-0.3,-10.4,5.6,19.0,7.8,-10.5,5.5,14.228571
1,1902,-12.0,-9.0,-0.9,5.8,11.5,16.2,19.3,19.3,14.9,8.8,1.5,-7.3,5.5,18.3,8.4,-8.6,5.7,12.300000
2,1903,-10.5,-7.9,-0.6,7.2,11.7,16.6,19.8,21.1,15.8,7.4,-1.1,-10.0,6.1,19.2,7.4,-10.4,5.8,13.085714
3,1904,-12.9,-8.5,-2.7,6.6,12.1,17.6,21.0,21.3,14.9,7.3,-1.0,-8.6,5.3,20.0,7.1,-8.9,5.6,12.971429
4,1905,-8.3,-9.9,-2.1,4.9,11.4,16.8,20.3,19.9,14.9,8.2,-1.0,-7.2,4.8,19.0,7.4,-10.5,5.7,8.757143
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,2015,-9.6,-6.5,0.1,8.0,14.0,18.0,21.2,21.5,16.2,9.0,-0.0,-6.5,7.4,20.2,8.4,-8.6,7.1,10.900000
115,2016,-11.9,-7.3,0.7,8.3,14.2,18.1,21.6,22.4,16.4,8.6,-1.2,-6.8,7.8,20.7,8.0,-7.8,6.9,14.528571
116,2017,-9.7,-6.8,0.4,9.1,14.5,18.0,23.0,21.1,16.2,8.8,-0.8,-9.7,8.0,20.7,8.1,-10.5,7.0,11.328571
117,2018,-12.3,-9.4,0.5,8.5,13.5,18.5,23.2,22.0,15.5,8.1,1.1,-8.3,7.5,21.3,8.2,-7.7,6.7,14.528571


In [None]:
"""
temp['Serbia'] = temp.pop('Republic of Serbia')
temp['Montenegro'] = temp.pop('Republic of Montenegro')
temp['Timor-Leste'] = temp.pop('Timor Leste')
temp['Slovak Republic'] = temp.pop('Slovakia')
temp['Micronesia, Fed. Sts.'] = temp.pop('Federated States of Micronesia')
temp['Yemen, Rep.'] = temp.pop('Yemen')
temp['Syrian Arab Republic'] = temp.pop('Syria')
temp.pop('Swaziland')
temp['Egypt, Arab Rep.'] = temp.pop('Egypt')
temp['Myanmar'] = temp.pop('Myanmar (Burma)')
temp['Congo, Dem. Rep.'] = temp.pop('Congo (Democratic Republic of the)')
temp['Bahamas, The'] = temp.pop('Bahamas')
temp.pop('Northern Mariana Islands')
temp.pop('Marshall Islands')
temp.pop('Monaco')
temp.pop('St. Vincent and the Grenadines')
temp.pop('St. Lucia')
temp.pop('Andorra')
temp.pop('Faroe Islands')
temp.pop('Cape Verde')
temp.pop('Macedonia')
temp['Congo, Rep.'] = temp.pop('Congo (Republic of the)')
temp['Iran, Islamic Rep.'] = temp.pop('Iran')
temp['Brunei Darussalam'] = temp.pop('Brunei')
temp.pop('St. Kitts and Nevis')
temp['Kyrgyz Republic'] = temp.pop('Kyrgyzstan')
temp['Venezuela, RB'] = temp.pop('Venezuela')
temp.pop('New Caledonia')
temp['Lao PDR'] = temp.pop('Laos')
temp['Russian Federation'] = temp.pop('Russia')
temp['Korea, Dem. People\'s Rep.'] = temp.pop('Korea')
temp['Gambia, The'] = temp.pop('Gambia')
"""

In [11]:
for country in temp.keys():
    print(country)
    temp[country] = temp[country].set_index('YEAR').T
    temp[country].columns = temp[country].columns.astype(str)
    temp[country].rename(index={'AVG': 'Temperature'}, inplace=True)
    temp[country] = temp[country].loc['Temperature'] #, period]
    temp[country] = pd.Series(scale(temp[country]), index=temp_years, name='Temperature')

Afghanistan
Albania
Algeria
Angola
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Botswana
Brazil
Bulgaria
Burundi
Cambodia
Cameroon
Canada
Chad
Chile
China
Colombia
Comoros
Croatia
Cuba
Cyprus
Denmark
Djibouti
Dominica
Ecuador
Eritrea
Estonia
Ethiopia
Fiji
Finland
France
Gabon
Georgia
Germany
Ghana
Greece
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Honduras
Hungary
Iceland
India
Indonesia
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Kuwait
Latvia
Lebanon
Lesotho
Liberia
Libya
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Maldives
Mali
Malta
Mauritania
Mauritius
Mexico
Moldova
Mongolia
Montenegro
Morocco
Mozambique
Myanmar
Namibia
Nepal
Netherlands
Nicaragua
Niger
Nigeria
Norway
Oman
Pakistan
Palau
Panama
Paraguay
Peru
Philippines
Poland
Portugal
Qatar
Romania
Rwanda
Samoa
Senegal
Serbia
Seychelles
Singapore
Slovenia
Somalia
Spain
Sudan
Suriname
Sweden
Switzerland
Tajikistan
Ta

In [12]:
# check
temp['Azerbaijan']

1901    0.486906
1902   -0.394132
1903   -0.415886
1904   -0.937982
1905   -0.600795
          ...   
2015    1.726886
2016    0.976372
2017    1.259174
2018    2.499153
2019    1.944426
Name: Temperature, Length: 119, dtype: float64

In [13]:
dict_all_t = {}

for country in temp.keys():
    temp[country].index = temp[country].index.astype(int).astype(str)
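    # DataFrame.append was removed in pandas 2.0; pd.concat adds the temperature series as a new row
    dict_all_t[country] = pd.concat([dict_all_std[country], temp[country].to_frame().T])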
    dict_all_std[country] = dict_all_t[country][temp_years]

In [14]:
countries_to_drop = list(set(list(dict_all_std.keys())).difference(list(temp.keys())))

for c in countries_to_drop:
    dict_all_std.pop(c)

# 1) weighted k nearest neighbour (w-kNN)

The w-kNN algorithm is straightforward: for each year $y$, we measure how similar two countries are with the standardised Euclidean distance $E_y$ and use the inverse of its absolute value, $1/|E_y|$, as the weight when imputing a missing value of sub-indicator $j$ in country $c_i$ in year $y$.

A good starting point for understanding this algorithm is an intuition for high-dimensional space: https://youtu.be/wvsE8jm1GzE

### Euclidean distance
The Euclidean distance $e_y$ for year $y$ between any given pair of countries $(c_i, c_k)$ is calculated over all sub-indicators $j = 1, \dots, J$:
$$ e_y(c_{i}, c_{k}) = \lVert c_{i} - c_{k} \rVert_2 = \sqrt{ \sum_{j=1}^J(c_{ij} - c_{kj})^2} $$

We sum the squared differences between the two countries over all sub-indicators $j$ and take the square root of this sum. $c_{ij}$ is the value of sub-indicator $j$ in country $i$, and $i \neq k$. Thus, any unique pair of countries $i$ and $k$, $i \neq k$, has exactly **one** Euclidean distance $e_y$ per year $y$. The country which has the largest distance $e_y$ to country $i$ is denoted $k+1$ and is set aside for the normalisation.
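A minimal sketch of this distance for two toy countries, assuming missing values are stored as NaN and only sub-indicators present in both countries enter the sum. The function name `shared_euclidean` and the numbers are hypothetical.

```python
import math

def shared_euclidean(a, b):
    """Euclidean distance over positions where neither value is NaN.

    Returns the distance and the number of shared positions."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.nan, 0
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs)), len(pairs)

nan = math.nan
country_i = [0.5, nan, -1.0, 2.0]   # toy standardised values
country_k = [0.1, 0.3, nan, -1.0]
e, j = shared_euclidean(country_i, country_k)
print(e, j)
```

Only the first and the last sub-indicator are available in both toy countries, so the distance is based on two squared differences.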

Afterwards, we normalise this distance by the largest distance $e_y$ from country $i$, i.e. the distance to country $k+1$, using the following equation:

$$ E_y(c_{i}, c_{k}) = \frac{e_y(c_{i}, c_{k})}{e_y(c_{i}, c_{k+1})} $$

This can be seen as equivalent to the well-known normalisation equation:

$$
x_n = \frac{x - x_{min}}{x_{max}-x_{min}}
$$

since $x_{min}$ is always 0: the distance of a country to itself is 0.
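The equivalence can be checked on a few toy distances. This is only a sketch with hypothetical values: dividing by the maximum gives the same numbers as min-max scaling with $x_{min} = 0$.

```python
# Toy distances from one country i to three other countries (hypothetical values).
e = {'A': 0.4, 'B': 1.0, 'C': 2.0}

# Normalisation by the largest distance.
e_max = max(e.values())
E = {k: v / e_max for k, v in e.items()}

# Min-max scaling with x_min = 0 gives the same result.
x_min = 0.0
E_minmax = {k: (v - x_min) / (e_max - x_min) for k, v in e.items()}
print(E)
```

The most distant country always ends up with a standardised distance of 1.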

### Imputations
We want our imputations $x^{j}_{i,y}$ for a missing sub-indicator $j$ in country $i$ in year $y$ to be similar to the values of sub-indicator $j$ in countries $k$ which have a **small** Euclidean distance $E_y$, and dissimilar to those in countries $k$ which have a **large** Euclidean distance $E_y$. Consequently, the imputations $x^{j}_{i,y}$ are the *weighted* averages where the *weights* are equal to the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$.

First, we compute $E_y$ over all sub-indicators $j$ that are available in both countries $i$ and $k$. Since pairs of countries have different numbers of available data points, we average by multiplying the sum by $1/J$, where $J$ is the number of sub-indicators taken into account for this pair. Note that $J$ is not necessarily 375, because many sub-indicators have missing values. Second, we sum the weighted $x^j_{k,y}$ over all countries $k$ and compute the average by dividing by $K$.

$$ x^{j}_{i,y} = \frac{1}{K} \sum_k \frac{1}{|E_y(c_{i}, c_{k})|} \cdot x^j_{k,y} $$

**Assumptions:** we calculate how similar countries are according to their values for *all* sub-indicators in a given year. We assume that the specific sub-indicators which do not have values in this given year are exactly as similar as the ones which we can calculate a distance for.
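A small sketch of the imputation equation as written above, for one sub-indicator and one year. The function name `impute` and the toy numbers are hypothetical; skipping distances of exactly 0 is an added safeguard against division by zero and is an assumption of this sketch.

```python
import math

def impute(values, distances):
    """x_i = (1/K) * sum_k x_k / |E(i, k)|.

    Only neighbours with both a value and a usable distance are counted in K."""
    terms = [x / abs(E) for x, E in zip(values, distances)
             if not math.isnan(x) and not math.isnan(E) and E != 0]
    return sum(terms) / len(terms) if terms else math.nan

# Toy values of the sub-indicator in three other countries and their
# standardised distances to country i; the third country has no value.
x_k = [1.0, -0.5, math.nan]
E_ik = [0.5, 1.0, 0.25]
imputed_value = impute(x_k, E_ik)
print(imputed_value)
```

Note that, in this form, the weights are not rescaled to sum to one, so the result is not guaranteed to lie within the range of the neighbours' values. Dividing by the sum of the weights instead of $K$ would be a common alternative, but it is not what the equation above states.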

Let's have a final check before we start our w-kNN algorithm to compute all missing values. The negative values might look confusing, but bear in mind that we standardised the data beforehand, i.e. the data distribution has mean 0 and standard deviation 1.

In [15]:
dict_all_std['Iraq']

Unnamed: 0,1901,1902,1903,1904,1905,1906,1907,1908,1909,1910,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
EG.CFT.ACCS.ZS,,,,,,,,,,,...,0.641392,0.732207,0.799586,0.877218,0.925555,0.976822,1.011976,,,
EG.ELC.ACCS.ZS,,,,,,,,,,,...,-0.783718,-0.942031,0.694039,0.003795,0.369409,0.768609,1.190220,1.461546,1.449148,
EG.ELC.ACCS.RU.ZS,,,,,,,,,,,...,-0.755535,-1.354031,0.870741,-0.154748,0.206070,0.637647,1.126894,1.450451,1.542347,
EG.ELC.ACCS.UR.ZS,,,,,,,,,,,...,-0.621267,1.023083,-0.269228,0.565059,0.822012,0.971654,1.017390,1.023083,0.606209,
FX.OWN.TOTL.ZS,,,,,,,,,,,...,,-0.744066,,,-0.669492,,,1.413558,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VC_DSR_PDAN,,,,,,,,,,,...,,,,,,,,,0.000000,
VC_DSR_PDLN,,,,,,,,,,,...,,,,,,,,,0.000000,
VC_DSR_PDYN,,,,,,,,,,,...,,,,,,,,,0.000000,
VC_DSR_MISS,,,,,,,,,,,...,,,,,,,,,,


In [None]:
# CHECKPOINT
dict_e = pickle.load(open('utils/data/distances_unstd.pkl', 'rb'))
dict_E = pickle.load(open('utils/data/distances_std.pkl', 'rb'))

In [None]:
# check
print('unstandardised distance e:', dict_e['2000', 'Afghanistan', 'Colombia'])
print('standardised distance E:  ', dict_E['2000', 'Afghanistan', 'Colombia'])

### Calculating the Euclidean distance

Having prepared everything, we can calculate the standardised distance $E_y(c_{i}, c_{k})$. We do this for each unique pair of two *countries* in each year. In other words, we skip the case $i = k$ and compute each pair only once, because $E_y(c_{i}, c_{k}) = E_y(c_{k}, c_{i})$.

The Python package `itertools` can help us generate the unique pairs of countries.

In [16]:
# create list out of all unique combinations
countrycombinations = list(itertools.combinations(countries, 2))
countrycombinations

[('Afghanistan', 'Albania'),
 ('Afghanistan', 'Algeria'),
 ('Afghanistan', 'Angola'),
 ('Afghanistan', 'Antigua and Barbuda'),
 ('Afghanistan', 'Argentina'),
 ('Afghanistan', 'Armenia'),
 ('Afghanistan', 'Australia'),
 ('Afghanistan', 'Austria'),
 ('Afghanistan', 'Azerbaijan'),
 ('Afghanistan', 'Bahamas, The'),
 ('Afghanistan', 'Bahrain'),
 ('Afghanistan', 'Bangladesh'),
 ('Afghanistan', 'Barbados'),
 ('Afghanistan', 'Belarus'),
 ('Afghanistan', 'Belgium'),
 ('Afghanistan', 'Belize'),
 ('Afghanistan', 'Benin'),
 ('Afghanistan', 'Bhutan'),
 ('Afghanistan', 'Bolivia'),
 ('Afghanistan', 'Bosnia and Herzegovina'),
 ('Afghanistan', 'Botswana'),
 ('Afghanistan', 'Brazil'),
 ('Afghanistan', 'Brunei Darussalam'),
 ('Afghanistan', 'Bulgaria'),
 ('Afghanistan', 'Burkina Faso'),
 ('Afghanistan', 'Burundi'),
 ('Afghanistan', 'Cambodia'),
 ('Afghanistan', 'Cameroon'),
 ('Afghanistan', 'Canada'),
 ('Afghanistan', 'Central African Republic'),
 ('Afghanistan', 'Chad'),
 ('Afghanistan', 'Chile'),
 ('Af

In [17]:
# check
countrycombinations[0][0]

'Afghanistan'
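As a quick sanity check on a toy list (the name `countries_sketch` is hypothetical), `itertools.combinations` yields each unordered pair exactly once, so $n$ countries give $n(n-1)/2$ pairs.

```python
import itertools

countries_sketch = ['Afghanistan', 'Albania', 'Algeria', 'Angola']
pairs = list(itertools.combinations(countries_sketch, 2))

# n countries yield n * (n - 1) / 2 unordered pairs, none of them (c, c).
n = len(countries_sketch)
print(len(pairs), n * (n - 1) // 2)
```

Because the reversed pair is never generated, the distance has to be stored under both key orders if it should be looked up in either direction, as the distance loop further below does.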

Here, we calculate the standardised distance $E_y(c_{i}, c_{k})$ for each unique pair of two countries in each year.

In [18]:
from scipy.spatial import distance

First, we compute the (not standardised) distances $e_y$ and insert them into a new dictionary `dict_e`.

While exploring the data, we see that hardly any data are available for the years `1990` to `1999`. Imputations for those years would rest on very weak foundations, so we do not consider them for now. For our similarity investigations later, the total number of data points available per country matters less than all countries having the same number of data points. Data for `2019` are also sparse, because it seems that not all countries have reported yet. Note, however, that the `period` defined below still includes `2019`.

We set the `period` of years we want to consider in our computations for $e_y$.

In [22]:
period = ['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']
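The same list of year labels can be generated instead of typed out. A small sketch, using the hypothetical name `period_sketch` so that it does not overwrite `period`:

```python
# Years 2000 to 2019 inclusive, as strings to match the column labels.
period_sketch = [str(year) for year in range(2000, 2020)]
print(period_sketch[0], period_sketch[-1], len(period_sketch))
```

Changing the bounds of `range` would be enough to drop a sparse year.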

In [23]:
# call seriescodes again
info = pd.read_csv('utils/wb_info.csv', header=None, dtype=str)
info.drop_duplicates(subset=[2], inplace=True)
#seriescodes = list(info['Series Code'])
seriescodes = list(dict_all['Germany'].index)
print(len(seriescodes))
print(len(info))

399
400


In [21]:
# finding which seriescodes are not in other list and adding them
for seriescode in seriescodes:
    if seriescode not in list(info[0]):
        print(seriescode)

print('-------')

for seriescode in list(info[0]):
    if seriescode not in seriescodes:
        print(seriescode)
        seriescodes.append(seriescode)

-------
Temperature


In [22]:
# checking if '8.10' became 8.1 (8 is only SDG with a target number 10 in the current version of data)
# could be amended manually, but does not matter as long as analyses are done on goal-level
info[info[3]=='8']

Unnamed: 0,0,1,2,3,4,5
224,FB.CBK.BRCH.P5,Financial Sector: Access,"Commercial bank branches (per 100,000 adults)",8,8.1,1
225,FX.OWN.TOTL.40.ZS,,Account ownership at a financial institution o...,8,8.1,1
226,FX.OWN.TOTL.60.ZS,,Account ownership at a financial institution o...,8,8.1,1
227,FX.OWN.TOTL.FE.ZS,,Account ownership at a financial institution o...,8,8.1,1
228,FX.OWN.TOTL.MA.ZS,,Account ownership at a financial institution o...,8,8.1,1
229,FX.OWN.TOTL.OL.ZS,,Account ownership at a financial institution o...,8,8.1,1
230,FX.OWN.TOTL.PL.ZS,,Account ownership at a financial institution o...,8,8.1,1
231,FX.OWN.TOTL.SO.ZS,,Account ownership at a financial institution o...,8,8.1,1
232,FX.OWN.TOTL.YG.ZS,,Account ownership at a financial institution o...,8,8.1,1
233,FX.OWN.TOTL.ZS,,Account ownership at a financial institution o...,8,8.1,1


In [23]:
# check
dict_all_std['Iraq'].shape

(400, 119)

### Comparing Euclidean distances computed with vectors of different dimensions

Simply computing the Euclidean distance between all pairs of countries won't give us comparable results, because each pair has a different number of shared measurements to take into account.

We account for this by counting how many measurements `j` we have for each pair of countries and setting the weight `w` to the inverse of the number of measurements.
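To see the effect of this weight, here is a sketch in plain Python. It assumes that the weight multiplies every squared difference, i.e. the weighted distance is $\sqrt{\frac{1}{j}\sum (a - b)^2}$; the function names and toy vectors are hypothetical.

```python
import math

def plain_distance(a, b):
    """Unweighted Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_distance(a, b):
    """Euclidean distance with every squared difference weighted by 1/j."""
    j = len(a)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / j)

# Same difference of 0.5 per measurement, but 2 versus 8 shared measurements.
few_a, few_b = [0.0] * 2, [0.5] * 2
many_a, many_b = [0.0] * 8, [0.5] * 8

print(plain_distance(few_a, few_b), plain_distance(many_a, many_b))
print(weighted_distance(few_a, few_b), weighted_distance(many_a, many_b))
```

The unweighted distance grows with the number of shared measurements, whereas the weighted one is the same for both pairs.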

In [24]:
# ~ one hour computing time
# no need to run every time again, just see CHECKPOINT above and load pickle file

dict_e = {}    

for year in period:
    print(year)
    
    for countrycombination in countrycombinations:
        
        country0_e = []    # create two empty lists for the two groupings we consider at the moment
        country1_e = []    # these lists contain series codes with data available in both groupings
        j = 0    # counter
        
        for seriescode in seriescodes:
            # we can only consider sub-indicators with data available in both groupings
            value0 = dict_all_std[countrycombination[0]].loc[seriescode, year]
            value1 = dict_all_std[countrycombination[1]].loc[seriescode, year]
            if not pd.isna(value0) and not pd.isna(value1):
                country0_e.append(value0)
                country1_e.append(value1)
                
                j += 1
        
        #print('number of data points available: ', j)    # check
        if j > 0:
            e = distance.euclidean(country0_e, country1_e, w=1/j)
        else:
            e = np.nan    # make NaN
            
        #print('e in {} between {} and {}:'.format(year, countrycombination[0], countrycombination[1]), e)
        
        dict_e[year, countrycombination[1], countrycombination[0]] = e
        dict_e[year, countrycombination[0], countrycombination[1]] = dict_e[year, countrycombination[1], countrycombination[0]]

2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019


In [25]:
# better save these precious data
f = open('utils/data/distances_unstd.pkl', 'wb')
pickle.dump(dict_e, f)
f.close()

In [None]:
# CHECKPOINT
dict_e = pickle.load(open('utils/data/distances_unstd.pkl', 'rb'))

In [26]:
# check
print(dict_e['2011', 'Switzerland', 'Iraq'])

0.819235652376648


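Normalise these distances $e_y$ and save them in `dict_E`. Note that the code divides by the largest distance found among *all* pairs of countries in the given year, not by the largest distance per country $i$: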

In [27]:
dict_E = {}

for year in period:
    print(year)
    
    max_e = 0    # maximum value per year
    min_e = 0
    dict_e_year = {}    # auxiliary dictionary with all distances per year
    
    for k in dict_e.keys():
        if year in k:
            dict_e_year[k] = dict_e[k]            

    max_e = np.nanmax(list(dict_e_year.values()))

    #print('------------')
    #print('max_e in {}'.format(year), max_e)
    #print('------------')
       
    for k in dict_e_year.keys():
        #print(k)
        if np.isnan(dict_e_year[k]) == False:
            #print('unstandardised distance e:', dict_e_year[k])
            dict_E[k] = dict_e_year[k] / max_e    # standardise distance
            
        else:
            dict_E[k] = np.nan    # keep as NaN
        
        #print('standardised distance E:', dict_E[k])
        #print('------------')

2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019


In [28]:
# check
print('unstandardised distance e:', dict_e['2016', 'Afghanistan', 'Colombia'])
print('standardised distance E:  ', dict_E['2016', 'Afghanistan', 'Colombia'])

unstandardised distance e: 1.0770541228951773
standardised distance E:   0.5196050842596447


In [29]:
# check (both should be False)
print(0 in dict_e.values())
print(0 in dict_E.values())

False
False


In [30]:
# check 
min_value = 0.1

for key, value in dict_E.items():
    if 0 < value < min_value:
        min_value = value
        print('smallest:', key, value)

In [31]:
# better save these precious data
f = open('utils/data/distances_std.pkl', 'wb')
pickle.dump(dict_E, f)
f.close()

### Imputations for countries

Now, we impute the missing values according to the equation we previously derived:

$$ x^{j'}_{i,y} = \frac{1}{K} \sum_k \frac{1}{|E_y(c_{i}, c_{k})|} \cdot x^j_{k,y} $$

To recap, our imputations $x^{j'}_{i,y}$ for missing sub-indicator $j'$ in country $i$ in year $y$ should be similar to sub-indicators $j$ of country $k$, according to the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$ between $i$ and $k$. 

As mentioned above, $E_y$ depends on the number of sub-indicators for which both countries have data. The entries of our `dict_E` are already normalised with respect to this number of available pairs of data points. We multiply the weight for each imputation, i.e. the inverse standardised Euclidean distance $\frac{1}{|E_y(c_{i}, c_{k})|}$ between $i$ and $k$, by the value $x^j_{k,y}$ of the other country $k$ in year $y$ for sub-indicator $j$. We sum over $k$ to add together all weighted $x^j_{k,y}$ and compute the average by dividing by $K$. $K$ is the number of countries which have a value available for the sub-indicator to be imputed.

We also know that some indicators are **binary** for countries, i.e. 1 for 'yes' and 0 for 'no', and ordinal for groupings of countries. These indicators usually start with 'number of countries which...'. The imputations for these must be handled differently: we round the imputed value to an integer.
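A minimal sketch of such a rounding step, assuming the imputed value is on the original 0/1 scale rather than standardised. The helper `round_binary` is hypothetical, and the clipping to the range 0 to 1 is an addition of this sketch, not something the notebook does.

```python
def round_binary(value):
    """Round an imputed value to 0 or 1, clipping anything outside that range."""
    return min(1, max(0, round(value)))

rounded = [round_binary(v) for v in (0.2, 0.8, 1.3, -0.1)]
print(rounded)
```

Python's built-in `round`, like `np.around`, rounds exact halves to the nearest even integer, so a value of exactly 0.5 would become 0.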

In [None]:
# in UN data set exist binary indicators
binary = ['1.5.3', '5.1.1', '5.6.2', '5.a.2', '5.c.1', '8.b.1', '10.7.2', '11.b.1', '12.1.1', '12.7.1', '13.1.2', '13.2.1', '13.3.1', '13.3.2', '13.b.1', '14.c.1', '15.6.1', '15.8.1', '16.10.2', '17.5.1', '17.14.1', '17.16.1', '17.18.2', '17.18.3', '17.19.2']

In [None]:
info_binary = info.loc[info['Indicator'].isin(binary)]
binary_seriescodes = list(info_binary['SeriesCode'])

binary_seriescodes

In [32]:
# ~30 minutes

dict_all_i = {}

for country in tqdm(countries):
    
    dict_all_i[country] = pd.DataFrame(index=seriescodes, columns=period)
    
    not_countries = [c for c in countries if c != country]
    
    for seriescode in seriescodes:
        for year in period:            
            if pd.isna(dict_all_std[country].loc[seriescode, year]) is True:
                K = 0
                all_k = []
                
                for not_country in not_countries: 
                    if pd.isna(dict_all_std[not_country].loc[seriescode, year]) is False and pd.isna(dict_E[(year, country, not_country)]) is False: # and dict_E[(year, country, not_country)]!=0:    # not_country can also have NaN -> exclude those
                        K += 1
                        # print('value:', dict_all_std[not_country].loc[seriescode, year])
                        # print('distance:', dict_E[(year, country, not_country)])
                        k = (dict_all_std[not_country].loc[seriescode, year]) / (dict_E[(year, country, not_country)])
                        # print('k =', k)
                        all_k.append(k)
                        
                sum_k = np.sum(all_k)
                    
                #print('K =', K)
                    
                if K > 0:
                    # print('sum k =', sum_k)
                    
                    """
                    # only UN data set has binary seriescodes
                    if seriescode in binary_seriescodes:
                        dict_all_i[country].loc[seriescode, year] = np.around(sum_k / K)    # round to have binary
                    else:
                        dict_all_i[country].loc[seriescode, year] = sum_k / K
                    """
                    dict_all_i[country].loc[seriescode, year] = sum_k / K
                            
                else:
                    dict_all_i[country].loc[seriescode, year] = np.nan    # only impute when data of other countries is available, 0 cannot be imputed because time-series are non-stationary
                
                #print('Imputation for {} in {} in {}'.format(seriescode, country, year), dict_all_i[country].loc[seriescode, year])        
                
            else:
                dict_all_i[country].loc[seriescode, year] = dict_all_std[country].loc[seriescode, year]


We need to delete all keys which are not part of the list of countries.

In [33]:
# list of keys to delete
delete_keys = []

for key in dict_all_i.keys():
    if key not in countries:
        delete_keys.append(key)
        
# delete
for dk in delete_keys:
    dict_all_i.pop(dk, None)

In [34]:
delete_keys

[]

In [35]:
# check 
max_values = []

for c in dict_all_i.keys():
    max_values.append(dict_all_i[c].max().max())
    
max(max_values)

7.8232424225359845

In [36]:
# check
print('NaN here', dict_all_std['Afghanistan'].loc['SE.SEC.UNER.LO.ZS', '2016'])
print('Imputed value', dict_all_i['Afghanistan'].loc['SE.SEC.UNER.LO.ZS', '2016'])

NaN here nan
Imputed value -0.4647248878236216


In [37]:
# check
dict_all_i['Afghanistan'].loc['SE.SEC.UNER.LO.ZS']

2000     0.699648
2001     0.812976
2002    0.0895247
2003    0.0136518
2004    -0.338087
2005    -0.394271
2006    -0.320402
2007    -0.281633
2008    -0.177067
2009    0.0878838
2010   -0.0788393
2011    -0.186653
2012    -0.277259
2013    -0.393072
2014     -0.46631
2015    -0.936052
2016    -0.464725
2017     -0.47778
2018    -0.778671
2019     -2.99926
Name: SE.SEC.UNER.LO.ZS, dtype: object

We want to save the imputations to have another checkpoint here.

In [38]:
# as csv files
if not os.path.exists('csv_imputed'):
    os.mkdir('csv_imputed')

for c in countries:
    dict_all_i[c].to_csv(r'csv_imputed/{}_wb.csv'.format(c))
    
# as pkl files
imp = open('utils/data/dict_all_i_wb.pkl', 'wb')
pickle.dump(dict_all_i, imp)
imp.close()
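The save and load steps can also be written with context managers, which close the file even if pickling fails. A sketch with hypothetical helper names, a temporary directory instead of `utils/data`, and a toy distance value:

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write `obj` to `path`; the file is closed automatically."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Read a pickled object back from `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint_sketch.pkl')
    data = {('2000', 'Afghanistan', 'Colombia'): 0.5}   # toy value
    save_pickle(data, path)
    restored = load_pickle(path)

print(restored == data)
```

Tuple keys such as `(year, country, country)` survive the round trip unchanged.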

In [18]:
# CHECKPOINT
dict_all_i = pickle.load(open('utils/data/dict_all_i_wb.pkl', 'rb'))

Let's calculate the total number of values:

In [39]:
# number of values
s_imp = 0
for country in dict_all_i.keys():
    s_imp += np.sum(dict_all_i[country].count())

print('Total # values before imputations:', s)
print('Total # values after imputations:', s_imp)
print('How many have been imputed?', s_imp - s)
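# note: s was counted on dict_all_std as loaded, s_imp on dict_all_i (restricted to `period`),
# so the two counts do not cover exactly the same cells and the share is a rough figure
print('This corresponds to', round(100*(s_imp-s)/s, 2), '% of the number of values before imputation')

Total # values before imputations: 850591
Total # values after imputations: 1172340
How many have been imputed? 321749
This corresponds to 37.83 % of the number of values before imputation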
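The share of imputed values depends on which total is used as the denominator. A small sketch with the counts printed above; the variable names are hypothetical:

```python
before = 850591     # values counted before imputation
after = 1172340     # values counted after imputation
imputed = after - before

# Share relative to the values before imputation and relative to all values afterwards.
share_of_before = round(100 * imputed / before, 2)
share_of_after = round(100 * imputed / after, 2)
print(imputed, share_of_before, share_of_after)
```

Relative to the number of values before imputation, the share is roughly 38 %; relative to all values available afterwards, it is closer to 27 %.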


# Averaging and concatenating data to higher levels

### *(UN data set only)* Averaging and concatenating series codes data to indicator-level

1. We can average all series codes, i.e. sub-indicators, belonging to one indicator to this indicator.
2. We can see the series codes as multiple samples of the same indicator and concatenate the series codes into indicators. Consequently, we have more than one measurement per time point for any indicators having more than one series code.
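The two options can be sketched on toy data for a single country and year. The series codes, the mapping and the values below are hypothetical and only illustrate the difference between averaging and concatenating.

```python
from statistics import mean

# Hypothetical mapping: one indicator with two series codes, one with a single code.
indicator_to_series = {'1.1.1': ['SERIES_A', 'SERIES_B'], '1.2.1': ['SERIES_C']}
values_2015 = {'SERIES_A': 0.2, 'SERIES_B': 0.6, 'SERIES_C': -1.0}

# Option 1: one averaged value per indicator.
averaged = {ind: mean(values_2015[s] for s in codes)
            for ind, codes in indicator_to_series.items()}

# Option 2: keep all series codes as multiple measurements of the indicator.
concatenated = {ind: [values_2015[s] for s in codes]
                for ind, codes in indicator_to_series.items()}
print(averaged)
print(concatenated)
```

Averaging yields exactly one value per indicator and time point, whereas concatenating keeps the spread between series codes.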

In [None]:
indicators = list(info.Indicator)

dict_indicators = {}

for indicator in indicators:
    i = info['SeriesCode'].where(info['Indicator'] == indicator)

    dict_indicators[indicator] = [s for s in i if str(s) != 'nan']

In [None]:
# check
np.isnan(dict_all_i['Germany'].loc['SI_POV_DAY1', '2001'])

In [None]:
#indicators_values = {}
#indicators_values_std = {}
indicators_values_i = {}

for country in countries:
    #print(country)
    
    #indicators_values[country] = pd.DataFrame(columns=period, index=indicators)
    #indicators_values_std[country] = pd.DataFrame(columns=period, index=indicators)
    indicators_values_i[country] = pd.DataFrame(columns=period, index=list(dict_indicators.keys()))
    
    for year in period:
        
        for indicator in dict_indicators.keys():
            #list_subindicators_values = []
            #list_subindicators_values_std = []
            list_subindicators_values_i = []
    
            for subindicator in list(dict_indicators[indicator]):
                # keep only values that are actually available
                if not np.isnan(dict_all_i[country].loc[subindicator, year]):
                    #list_subindicators_values.append(dict_all[country].loc[subindicator, year])
                    #list_subindicators_values_std.append(dict_all_std[country].loc[subindicator, year])
                    list_subindicators_values_i.append(dict_all_i[country].loc[subindicator, year])
            
            # 1. averaging
            #indicators_values[country].loc[indicator, year] = np.nanmean(list_subindicators_values)
            #indicators_values_std[country].loc[indicator, year] = np.nanmean(list_subindicators_values_std)
            #indicators_values_i[country].loc[indicator, year] = np.nanmean(list_subindicators_values_i)
            
            # 2. concatenating
            array_subindicators_values_i = np.asarray(list_subindicators_values_i)
            indicators_values_i[country].loc[indicator, year] = array_subindicators_values_i

In [None]:
# check (duplicate indicator labels should be in index)
indicators_values_i['Germany']

In [None]:
# better save these precious data
#with open('utils/data/indicators_values.pkl', 'wb') as ind_val:
#    pickle.dump(indicators_values, ind_val)
#with open('utils/data/indicators_values_std.pkl', 'wb') as ind_val_std:
#    pickle.dump(indicators_values_std, ind_val_std)
with open('utils/data/indicators_values_i.pkl', 'wb') as ind_val_i:
    pickle.dump(indicators_values_i, ind_val_i)
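
Since the same save pattern recurs throughout this notebook, it could be wrapped in a small helper. The functions `save_pickle` and `load_pickle` below are hypothetical and not part of the notebook; the sketch writes to a temporary directory so that no project files are touched.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Hypothetical helper: write `obj` to `path`, closing the file even on error."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path):
    """Hypothetical helper: read a pickled object back from `path`."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.pkl')
    save_pickle({'Germany': [1.0, 2.0]}, demo_path)
    restored = load_pickle(demo_path)

print(restored)
```

The `with` statement closes the file automatically, so no explicit `close()` call is needed.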

### *(UN data set)* Averaging and concatenating indicator data to target-level
Before aggregating, we also need two lists: the indicators that are meant to increase over time and those that are meant to decrease.

In [None]:
# went through all targets by hand and checked which indicators are meant to increase and which are meant to decrease over time
increase = ['1.3.1', '1.4.1', '1.4.2', '1.5.3', '1.5.4', '1.a.1', '1.a.2', '1.a.3', '1.b.1', '2.3.1', '2.3.2', '2.4.1', '2.5.1', '2.a.1', '2.a.2', '3.1.2', '3.5.1', '3.7.1', '3.8.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1', '4.1.1', '4.2.1', '4.2.2', '4.3.1', '4.4.1', '4.6.1', '4.7.1', '4.a.1', '4.b.1', '4.c.1', '5.1.1', '5.5.1', '5.5.2', '5.6.1', '5.6.2', '5.a.1', '5.a.2', '5.b.1', '5.c.1', '6.1.1', '6.2.1', '6.3.1', '6.3.2', '6.4.1', '6.5.1', '6.5.2', '6.6.1', '6.a.1', '6.b.1', '7.1.1', '7.1.2', '7.2.1', '7.3.1', '7.a.1', '7.b.1', '8.1.1', '8.2.1', '8.3.1', '8.5.1', '8.8.2', '8.9.1', '8.9.2', '8.10.1', '8.10.2', '8.a.1', '8.b.1', '9.1.1', '9.1.2', '9.2.1', '9.2.2', '9.3.1', '9.3.2', '9.5.1', '9.5.2', '9.a.1', '9.b.1', '9.c.1', '10.1.1', '10.4.1', '10.5.1', '10.6.1', '10.7.2', '10.a.1', '10.b.1', '11.2.1', '11.3.2', '11.4.1', '11.6.1', '11.7.1', '11.a.1', '11.b.1', '11.b.2', '11.c.1', '12.1.1', '12.4.1', '12.5.1', '12.6.1', '12.7.1', '12.8.1', '12.a.1', '12.b.1', '13.1.2', '13.1.3', '13.2.1', '13.3.1', '13.3.2', '13.a.1', '13.b.1', '14.2.1', '14.3.1', '14.4.1', '14.5.1', '14.6.1', '14.7.1', '14.a.1', '14.b.1', '14.c.1', '15.1.1', '15.1.2', '15.2.1', '15.4.1', '15.4.2', '15.6.1', '15.8.1', '15.9.1', '15.a.1', '15.b.1', '16.1.4', '16.6.2', '16.7.1', '16.7.2', '16.8.1', '16.9.1', '16.10.2', '16.a.1', '17.1.1', '17.1.2', '17.2.1', '17.3.1', '17.3.2', '17.4.1', '17.5.1', '17.6.1', '17.6.2', '17.7.1', '17.8.1', '17.9.1', '17.11.1', '17.13.1', '17.14.1', '17.15.1', '17.16.1', '17.17.1', '17.18.1', '17.18.2', '17.18.3', '17.19.1', '17.19.2']
decrease = ['1.1.1', '1.2.1', '1.2.2', '1.5.1', '1.5.2', '2.1.1', '2.1.2', '2.2.1', '2.2.2', '2.5.2','2.b.1', '2.c.1', '3.1.1', '3.2.1', '3.2.2', '3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.2', '3.6.1', '3.7.2', '3.8.2', '3.9.1', '3.9.2', '3.9.3', '3.a.1', '4.5.1', '5.2.1', '5.2.2', '5.3.1', '5.3.2', '5.4.1', '6.4.2', '8.4.1', '8.4.2', '8.5.2', '8.6.1', '8.7.1', '8.8.1', '9.4.1', '10.2.1', '10.3.1', '10.7.1', '10.c.1', '11.1.1', '11.3.1', '11.5.1', '11.5.2', '11.6.2', '11.7.2', '12.2.1', '12.2.2', '12.3.1', '12.4.2', '12.c.1', '13.1.1', '14.1.1', '15.3.1', '15.5.1', '15.7.1', '15.c.1', '16.1.1', '16.1.2', '16.1.3', '16.2.1', '16.2.2', '16.2.3', '16.3.1', '16.3.2', '16.4.1', '16.4.2', '16.5.1', '16.5.2', '16.6.1', '16.10.1', '16.b.1', '17.10.1', '17.12.1']

In [None]:
# flip the sign of every time-series that is meant to decrease, so that all series "point" upwards when things improve

indicators_values_i_up = {}

for country in countries:
    indicators_values_i_up[country] = pd.DataFrame(index=list(dict_indicators.keys()), columns=period)
    
    for indicator in dict_indicators.keys():
        if indicator in decrease:
            #indicators_values_up[country].loc[indicator] = indicators_values[country].loc[indicator]*(-1)
            #indicators_values_std_up[country].loc[indicator] = indicators_values_std[country].loc[indicator]*(-1)
            indicators_values_i_up[country].loc[indicator] = list(np.multiply(list(indicators_values_i[country].loc[indicator]), -1))
        else:
            #indicators_values_up[country].loc[indicator] = indicators_values[country].loc[indicator]
            #indicators_values_std_up[country].loc[indicator] = indicators_values_std[country].loc[indicator]
            indicators_values_i_up[country].loc[indicator] = indicators_values_i[country].loc[indicator]
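
For scalar-valued cells, the sign flip in the loop above can also be written in vectorised form. This is a minimal sketch on made-up values; `decrease_demo` and the numbers are assumptions for illustration, and the notebook's cells hold arrays rather than scalars.

```python
import pandas as pd

# Hypothetical indicator labels and values, for illustration only.
decrease_demo = ['1.1.1']            # lower is better
values = pd.DataFrame(
    {'2000': [10.0, 0.2], '2001': [8.0, 0.3]},
    index=['1.1.1', '1.3.1'],
)

# +1 for indicators meant to increase, -1 for those meant to decrease
sign = pd.Series(
    [-1.0 if ind in decrease_demo else 1.0 for ind in values.index],
    index=values.index,
)

# multiply every row by its sign
values_up = values.mul(sign, axis=0)
print(values_up)
```

After the flip, a rising series always indicates progress, regardless of the indicator's original direction.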

In [None]:
# check
indicators_values_i_up['Germany']

In [None]:
# better save these precious data
#with open('utils/data/indicators_values_up.pkl', 'wb') as ind_val:
#    pickle.dump(indicators_values_up, ind_val)
#with open('utils/data/indicators_values_std_up.pkl', 'wb') as ind_val_std:
#    pickle.dump(indicators_values_std_up, ind_val_std)
with open('utils/data/indicators_values_i_up.pkl', 'wb') as ind_val_i:
    pickle.dump(indicators_values_i_up, ind_val_i)

Defining dictionaries for targets:

In [None]:
targets = list(info['Target'].unique())

dict_targets = {}

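for target in targets:
    t = info['Indicator'].where(info['Target'] == target)

    # drop NaNs and repeated indicator labels (info has one row per series code), keeping the original order
    dict_targets[target] = list(dict.fromkeys(i for i in t if str(i) != 'nan'))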
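
The same mapping can be sketched with a `groupby`. The table `info_demo` below is a made-up stand-in for `info`, assuming one row per series code so that indicator labels may repeat; it is not the notebook's actual metadata.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `info`; the real table has more columns and rows.
info_demo = pd.DataFrame({
    'Target':    ['1.1', '1.2', '1.2', '1.2', np.nan],
    'Indicator': ['1.1.1', '1.2.1', '1.2.1', '1.2.2', '1.3.1'],
})

# group indicators by target; rows with a missing target are skipped by groupby,
# and unique() keeps each indicator once, in order of first appearance
dict_targets_demo = (
    info_demo.dropna(subset=['Indicator'])
             .groupby('Target')['Indicator']
             .unique()
             .apply(list)
             .to_dict()
)
print(dict_targets_demo)
```

Unlike the loop above, this variant creates no entry for a missing target label.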

In [None]:
# check
list(indicators_values_i_up['Germany'].loc['1.1.1', '2000'])

Now, we can simply average or concatenate:

In [None]:
targets_values_i = {}
#targets_values_up = {}
#targets_values_std_up = {}
targets_values_i_up = {}    # for Granger-causality

for country in countries:
    
    #targets_values_up[country] = pd.DataFrame(columns=period, index=targets)
    #targets_values_std_up[country] = pd.DataFrame(columns=period, index=targets)
    targets_values_i[country] = pd.DataFrame(columns=period, index=list(dict_targets.keys()))
    targets_values_i_up[country] = pd.DataFrame(columns=period, index=list(dict_targets.keys()))
    
    for year in period:
        
        for target in list(dict_targets.keys()):
            #list_indicators_values = []
            #list_indicators_values_std = []
            list_indicators_values_i = []
            list_indicators_values_i_up = []
    
            for indicator in list(dict_targets[target]):
                #list_indicators_values.append(indicators_values[country].loc[indicator, year])
                #list_indicators_values_std.append(indicators_values_std[country].loc[indicator, year])
                list_indicators_values_i.extend(indicators_values_i[country].loc[indicator, year])
                list_indicators_values_i_up.extend(indicators_values_i_up[country].loc[indicator, year])
    
            #print(list_indicators_values_i)
            
            # 1. averaging
            #targets_values_up[country].loc[target, year] = np.mean(list_indicators_values)
            #targets_values_std_up[country].loc[target, year] = np.mean(list_indicators_values_std)
            #targets_values_i_up[country].loc[target, year] = np.mean(list_indicators_values_i)
            
            # 2. concatenating
            targets_values_i[country].loc[target, year] = list_indicators_values_i
            targets_values_i_up[country].loc[target, year] = list_indicators_values_i_up

In [None]:
# check (each target should have a list of sub-indicator values in every cell)
targets_values_i_up['Germany'].loc['1.1']

The first inner parentheses in each cell contain the values from the first indicator, the second inner parentheses contain the values from the second indicator, and so on. Each indicator in turn represents several sub-indicators.
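
Whether a cell keeps a per-indicator grouping depends on how the values are collected. The following sketch, with made-up numbers, contrasts `extend` (as used in the loop above) with `append`.

```python
import numpy as np

# Hypothetical values for two indicators of one target in one year.
indicator_a = np.array([0.1, 0.2])   # two sub-indicators
indicator_b = np.array([0.3])        # one sub-indicator

flat = []
nested = []
for values in (indicator_a, indicator_b):
    flat.extend(values)    # adds the elements one by one
    nested.append(values)  # adds the whole array as a single item

print(len(flat), len(nested))
```

With `extend`, the result is one flat list of sub-indicator values; with `append`, each indicator keeps its own inner array.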

In [None]:
# better save these precious data
#with open('utils/data/targets_values_up.pkl', 'wb') as tar_val:
#    pickle.dump(targets_values_up, tar_val)
#with open('utils/data/targets_values_std_up.pkl', 'wb') as tar_val_std:
#    pickle.dump(targets_values_std_up, tar_val_std)
with open('utils/data/targets_values_i.pkl', 'wb') as tar_val_i:
    pickle.dump(targets_values_i, tar_val_i)
with open('utils/data/targets_values_i_up.pkl', 'wb') as tar_val_i_up:
    pickle.dump(targets_values_i_up, tar_val_i_up)

### *(UN data set)* Averaging and concatenating target data to goal-level

Defining dictionaries for goals.

In [None]:
goals = list(info['Goal'].unique())

dict_goals = {}

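for goal in goals:
    g = info['Target'].where(info['Goal'] == goal)

    # drop NaNs and repeated target labels (info has one row per series code), keeping the original order
    dict_goals[goal] = list(dict.fromkeys(t for t in g if str(t) != 'nan'))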

In [None]:
goals_values_i = {}
#goals_values_up = {}
#goals_values_std_up = {}
goals_values_i_up = {}    # for Granger-causality

for country in countries:
    
    #goals_values_up[country] = pd.DataFrame(columns=period, index=goals)
    #goals_values_std_up[country] = pd.DataFrame(columns=period, index=goals)
    goals_values_i[country] = pd.DataFrame(columns=period, index=list(dict_goals.keys()))
    goals_values_i_up[country] = pd.DataFrame(columns=period, index=list(dict_goals.keys()))
    
    for year in period:
        
        for goal in goals:
            #list_targets_values = []
            #list_targets_values_std = []
            list_targets_values_i = []
            list_targets_values_i_up = []
    
            for target in list(dict_goals[goal]):
                #list_targets_values.append(targets_values_up[country].loc[target, year])
                #list_targets_values_std.append(targets_values_std_up[country].loc[target, year])
                list_targets_values_i.extend(targets_values_i[country].loc[target, year])
                list_targets_values_i_up.extend(targets_values_i_up[country].loc[target, year])
    
            #print(list_targets_values_i)
            
            # 1. averaging
            #goals_values_up[country].loc[goal, year] = np.mean(list_targets_values)
            #goals_values_std_up[country].loc[goal, year] = np.mean(list_targets_values_std)
            #goals_values_i_up[country].loc[goal, year] = np.mean(list_targets_values_i)
            
            # 2. concatenating
            goals_values_i[country].loc[goal, year] = list_targets_values_i
            goals_values_i_up[country].loc[goal, year] = list_targets_values_i_up

In [None]:
# check (each goal should have a list of sub-indicator values in every cell)
goals_values_i_up['Germany'].loc['13']

In [None]:
# better save these precious data
#with open('utils/data/goals_values_up.pkl', 'wb') as goa_val:
#    pickle.dump(goals_values_up, goa_val)
#with open('utils/data/goals_values_std_up.pkl', 'wb') as goa_val_std:
#    pickle.dump(goals_values_std_up, goa_val_std)
with open('utils/data/goals_values_i.pkl', 'wb') as goa_val_i:
    pickle.dump(goals_values_i, goa_val_i)
with open('utils/data/goals_values_i_up.pkl', 'wb') as goa_val_i_up:
    pickle.dump(goals_values_i_up, goa_val_i_up)

### *(WorldBank data set)* Concatenating indicator data to target-level
As with the UN data set, we concatenate indicator data to target-level. We skip the indicator level because, in the WorldBank data set, sub-indicators are mapped directly to targets rather than to specific indicators.

In [28]:
# check
dict_all_std['Afghanistan']

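| | 1901 | 1902 | 1903 | 1904 | 1905 | 1906 | 1907 | 1908 | 1909 | 1910 | ... | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EG.CFT.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 0.279600 | 0.506205 | 0.746543 | 1.033575 | 1.283527 | 1.573306 | 1.894672 | NaN | NaN | NaN |
| EG.ELC.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -0.695603 | -0.675679 | 0.312007 | 0.305643 | 1.090614 | 0.403608 | 1.403584 | 1.403584 | 1.442255 | NaN |
| EG.ELC.ACCS.RU.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -0.686047 | -0.706401 | 0.279194 | 0.292840 | 1.087533 | 0.396553 | 1.421530 | 1.421297 | 1.458510 | NaN |
| EG.ELC.ACCS.UR.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | -0.896915 | -0.423962 | 0.634500 | 0.342497 | 1.098945 | 0.320685 | 1.199366 | 1.199366 | 1.262129 | NaN |
| FX.OWN.TOTL.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | -0.884230 | NaN | NaN | -0.513711 | NaN | NaN | 1.397941 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| VC_DSR_MTMP | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.000000 | NaN | 1.000000 |
| VC_DSR_PDAN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -0.737766 | -0.675998 | 1.413764 |
| VC_DSR_PDLN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.000000 | 1.000000 | NaN |
| VC_DSR_PDYN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -0.535038 | -0.866192 | 1.401230 |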


In [25]:
# CHECKPOINT
with open('utils/data/indicators_values_i_up_wb.pkl', 'rb') as f:
    indicators_values_i_up = pickle.load(f)

In [31]:
# making all time-series "pointing" upwards when they are meant to increase
# not necessary when computing non-linear dependence

indicators_values_i_up = {}
indicators_values_std_up = {}

for country in countries:
    indicators_values_i_up[country] = pd.DataFrame(index=list(dict_all_i[country].index), columns=period)
    indicators_values_std_up[country] = pd.DataFrame(index=list(dict_all_std[country].index), columns=period)
    
    for seriescode in list(dict_all_std['France'].index):
        #indicators_values_i_up[country].at[seriescode] = list(np.multiply(list(dict_all_i[country].loc[seriescode]), int(info.loc[info[0] == seriescode][5])))
        # leaving them as they are to investigate dependencies between indicators (direction does not matter then)
        indicators_values_i_up[country].at[seriescode] = list(dict_all_i[country].loc[seriescode])
        indicators_values_std_up[country].at[seriescode] = list(dict_all_std[country].loc[seriescode][period])

In [32]:
# check 
dict_all_std['France'].loc['ER.H2O.FWTL.ZS']

1901   NaN
1902   NaN
1903   NaN
1904   NaN
1905   NaN
        ..
2015   NaN
2016   NaN
2017   NaN
2018   NaN
2019   NaN
Name: ER.H2O.FWTL.ZS, Length: 119, dtype: float64

In [33]:
# check
indicators_values_std_up['France'].loc['ER.H2O.FWTL.ZS']

2000          NaN
2001          NaN
2002    0.0832496
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    -0.331485
2008          NaN
2009          NaN
2010          NaN
2011          NaN
2012    -0.688701
2013          NaN
2014    -0.688701
2015          NaN
2016          NaN
2017          NaN
2018          NaN
2019          NaN
Name: ER.H2O.FWTL.ZS, dtype: object

In [34]:
# better save these precious data
with open('utils/data/indicators_values_i_up_wb.pkl', 'wb') as ind_val_i:
    pickle.dump(indicators_values_i_up, ind_val_i)

with open('utils/data/indicators_values_std_up_wb.pkl', 'wb') as ind_val_std:
    pickle.dump(indicators_values_std_up, ind_val_std)

Defining dictionaries for targets:

In [35]:
targets = list(info[4].unique())

dict_targets = {}

for target in targets:
    t = info[0].where(info[4] == target)

    dict_targets[target] = [i for i in t if str(i) != 'nan']

In [36]:
# check
dict_targets['13.1']

['EN.CLC.MDAT.ZS',
 'SG_DSR_SILN',
 'SG_DSR_SILS',
 'SG_GOV_LOGV',
 'VC_DSR_AFFCT',
 'VC_DSR_DAFF',
 'VC_DSR_IJILN',
 'VC_DSR_MISS',
 'VC_DSR_MORT',
 'VC_DSR_MTMN',
 'VC_DSR_MTMP',
 'VC_DSR_PDAN',
 'VC_DSR_PDLN',
 'VC_DSR_PDYN']

In [40]:
indicators_values_std_up['Germany']

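| | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EG.CFT.ACCS.ZS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN |
| EG.ELC.ACCS.ZS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| EG.ELC.ACCS.RU.ZS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN |
| EG.ELC.ACCS.UR.ZS | 0.435264 | -0.12284 | -0.873248 | -1.51735 | -1.94323 | -2.08537 | -1.97205 | -1.65538 | -1.1873 | -0.619743 | -0.0299162 | 0.434609 | 0.695226 | 0.779821 | 0.791425 | 0.791425 | 0.791425 | 0.791425 | 0.791425 | NaN |
| FX.OWN.TOTL.ZS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | -1.31145 | NaN | NaN | 0.1974 | NaN | NaN | 1.11406 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| VC_DSR_PDYN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| VC_DSR_PDAN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| VC_DSR_MISS | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| VC_DSR_AFFCT | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |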


Target values:

In [41]:
# with imputed values

targets_values_i_up = {}
targets_values_i_up_arr = {}
targets_values_i_up_avg = {}

for country in tqdm(countries):
    targets_values_i_up_arr[country] = []
    targets_values_i_up_avg[country] = pd.DataFrame(index=targets, columns=period)
    
    for t, target in enumerate(list(dict_targets.keys())):
        list_indicators_values_i_up_arr = []

        for y, year in enumerate(period):
            list_indicators_values_i_up_avg = []
            for i, indicator in enumerate(list(dict_targets[target])):
                # do not append NaNs
                if not np.isnan(indicators_values_i_up[country].loc[indicator, year]):
                    list_indicators_values_i_up_arr.append(indicators_values_i_up[country].loc[indicator, year])
                    list_indicators_values_i_up_avg.append(indicators_values_i_up[country].loc[indicator, year])
            
            # 1. averaging
            targets_values_i_up_avg[country].loc[target, year] = np.mean(list_indicators_values_i_up_avg)

        # 2. concatenating
        targets_values_i_up_arr[country].append(list_indicators_values_i_up_arr)
    
    targets_values_i_up[country] = pd.DataFrame(data=targets_values_i_up_arr[country], index=list(dict_targets.keys()))

In [42]:
# with standardised values

targets_values_std_up = {}
targets_values_std_up_arr = {}
targets_values_std_up_avg = {}

for country in tqdm(countries):
    targets_values_std_up_arr[country] = []
    targets_values_std_up_avg[country] = pd.DataFrame(index=targets, columns=period)
    
    for t, target in enumerate(list(dict_targets.keys())):
        list_indicators_values_std_up_arr = []

        for y, year in enumerate(period):
            list_indicators_values_std_up_avg = []
            for i, indicator in enumerate(list(dict_targets[target])):
                # do not append NaNs
                if not np.isnan(indicators_values_std_up[country].loc[indicator, year]):
                    list_indicators_values_std_up_arr.append(indicators_values_std_up[country].loc[indicator, year])
                    list_indicators_values_std_up_avg.append(indicators_values_std_up[country].loc[indicator, year])
            
            # 1. averaging
            targets_values_std_up_avg[country].loc[target, year] = np.mean(list_indicators_values_std_up_avg)

        # 2. concatenating
        targets_values_std_up_arr[country].append(list_indicators_values_std_up_arr)
    
    targets_values_std_up[country] = pd.DataFrame(data=targets_values_std_up_arr[country], index=list(dict_targets.keys()))

In [43]:
# check whether averages were correctly computed
print(targets_values_i_up_avg['France'].loc['2.1', '2017'])
print(np.mean(indicators_values_i_up['France'].loc[info[info[4]=='2.1'][0]]['2017']))

0.0870694138599098
0.0870694138599098
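
Some target-year combinations may have no available indicator values, in which case the averaging step calls `np.mean` on an empty list. Because warnings are silenced at the top of this notebook, this happens quietly. The sketch below shows the behaviour and one possible explicit guard; the guard is an assumption for illustration, not what the notebook does.

```python
import warnings
import numpy as np

# record the warnings that np.mean raises on an empty list
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    empty_mean = np.mean([])   # no indicator values for this target and year

print(empty_mean)
print([str(w.message) for w in caught])

# possible explicit guard: return NaN without triggering a warning
values = []
safe_mean = np.mean(values) if len(values) > 0 else np.nan
```

Either way the cell ends up as NaN, which the later goal-level aggregation filters out.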


In [44]:
# check for std
for t, target in enumerate(targets):
    print(target, len(targets_values_std_up_arr['Azerbaijan'][t]))

1.3 18
1.1 5
1.2 7
2.3 18
2.2 71
2.1 29
3.5 6
3.3 136
3.2 80
3.4 30
3.7 21
3.8 48
3.c 30
3.a 18
3.9 19
3.1 35
3.6 2
4.6 70
4.1 371
4.5 34
4.2 76
4.c 163
4.4 167
4.3 39
5.5 24
5.6 0
5.1 22
5.4 42
5.2 1
5.3 4
6.4 26
6.1 72
6.2 130
7.1 74
7.3 16
7.2 32
8.1 79
8.3 11
8.2 260
8.5 282
8.7 6
8.6 0
9.4 68
9.5 19
9.1 74
9.2 40
9.b 19
10.b 38
10.2 0
10.c 7
10.1 0
11.6 50
11.1 60
12.2 133
13.2 0
13.1 1
14.4 51
14.5 3
15.1 40
15.5 4
16.6 13
16.5 6
16.9 5
16.1 40
17.3 60
17.2 19
17.4 19
17.1 22
17.17 2
17.6 32
17.19 48
17.18 16
17.8 17
17.11 20
17.13 447
17.12 60
T 20


In [45]:
# check for i
for t, target in enumerate(targets):
    print(target, len(targets_values_i_up_arr['Azerbaijan'][t]))

1.3 324
1.1 19
1.2 92
2.3 18
2.2 317
2.1 45
3.5 6
3.3 138
3.2 80
3.4 30
3.7 38
3.8 78
3.c 38
3.a 18
3.9 19
3.1 37
3.6 2
4.6 133
4.1 420
4.5 80
4.2 80
4.c 420
4.4 376
4.3 60
5.5 69
5.6 13
5.1 25
5.4 76
5.2 14
5.3 57
6.4 97
6.1 108
6.2 216
7.1 74
7.3 16
7.2 32
8.1 82
8.3 73
8.2 260
8.5 300
8.7 51
8.6 60
9.4 68
9.5 38
9.1 76
9.2 40
9.b 19
10.b 38
10.2 1
10.c 14
10.1 8
11.6 50
11.1 65
12.2 133
13.2 1
13.1 166
14.4 51
14.5 3
15.1 40
15.5 4
16.6 16
16.5 32
16.9 89
16.1 76
17.3 60
17.2 19
17.4 19
17.1 40
17.17 60
17.6 38
17.19 48
17.18 16
17.8 20
17.11 20
17.13 520
17.12 114
T 20


In [46]:
# better save these precious imputed data
with open('utils/data/targets_values_i_up_wb.pkl', 'wb') as tar_val_i_up:
    pickle.dump(targets_values_i_up, tar_val_i_up)

with open('utils/data/targets_values_i_up_arr_wb.pkl', 'wb') as tar_val_i_up_arr:
    pickle.dump(targets_values_i_up_arr, tar_val_i_up_arr)

with open('utils/data/targets_values_i_up_avg_wb.pkl', 'wb') as tar_val_i_up_avg:
    pickle.dump(targets_values_i_up_avg, tar_val_i_up_avg)

# better save these precious standardised data
with open('utils/data/targets_values_std_up_wb.pkl', 'wb') as tar_val_std_up:
    pickle.dump(targets_values_std_up, tar_val_std_up)

with open('utils/data/targets_values_std_up_arr_wb.pkl', 'wb') as tar_val_std_up_arr:
    pickle.dump(targets_values_std_up_arr, tar_val_std_up_arr)

with open('utils/data/targets_values_std_up_avg_wb.pkl', 'wb') as tar_val_std_up_avg:
    pickle.dump(targets_values_std_up_avg, tar_val_std_up_avg)

### *(WorldBank data set)* Concatenating target data to goal-level

Defining dictionaries for goals.

In [47]:
goals = list(info[3].unique())

dict_goals = {}

for goal in goals:
    g = info[4].where(info[3] == goal)

    dict_goals[goal] = [t for t in g if str(t) != 'nan']
    dict_goals[goal] = list(set(dict_goals[goal]))

In [48]:
# check
dict_goals['1']

['1.1', '1.3', '1.2']

Concatenating:

Recall that we define our dimensionality as $d \times T$ and treat countries as independent samples. The next cell illustrates this for SDG 1 by counting the non-missing values, over all years, of all indicators in its targets.
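
As a minimal sketch of this flattening, assume a hypothetical goal with $d = 3$ indicators observed over $T = 4$ years and one missing measurement; all values are made up.

```python
import numpy as np

# Hypothetical goal with d = 3 indicators observed over T = 4 years.
d, T = 3, 4
X = np.arange(d * T, dtype=float).reshape(d, T)
X[1, 2] = np.nan                     # one missing measurement

# one sample for this country: all available indicator-year values in a row
sample = X.flatten()
sample = sample[~np.isnan(sample)]

print(X.shape, sample.shape)
```

The resulting vector has at most $d \cdot T$ entries; missing measurements shorten it, which is why the lengths differ between goals and countries.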

In [49]:
# example with SDG 1
sum(len([x for x in list(targets_values_i_up[country].loc[t]) if str(x) != 'nan']) for t in ['1.1', '1.2', '1.3'])

435

In [50]:
goals_values_i_up = {}
goals_values_i_up_arr = {}
goals_values_i_up_avg = {}

for country in tqdm(countries):
    goals_values_i_up_arr[country] = []
    goals_values_i_up_avg[country] = []   # define this list with target values being averages over indicators
    
    for g, goal in enumerate(list(dict_goals.keys())):
        list_targets_values_i_up = []
        list_targets_values_i_up_avg = []

        for t in dict_goals[goal]:   # do not append NaNs
            list_targets_values_i_up.extend([x for x in list(targets_values_i_up[country].loc[t]) if not np.isnan(x)])
            list_targets_values_i_up_avg.extend([x for x in list(targets_values_i_up_avg[country].loc[t]) if not np.isnan(x)])
            
        # 1. append target averages
        goals_values_i_up_avg[country].append(np.asarray(list_targets_values_i_up_avg))

        # 2. concatenating
        goals_values_i_up_arr[country].append(np.asarray(list_targets_values_i_up))
    
    goals_values_i_up_avg[country] = np.asarray(goals_values_i_up_avg[country])
    goals_values_i_up[country] = pd.DataFrame(data=goals_values_i_up_arr[country], index=list(dict_goals.keys()))

In [51]:
goals_values_std_up = {}
goals_values_std_up_arr = {}
goals_values_std_up_avg = {}

for country in tqdm(countries):
    goals_values_std_up_arr[country] = []
    goals_values_std_up_avg[country] = []   # define this list with target values being averages over indicators
    
    for g, goal in enumerate(list(dict_goals.keys())):
        list_targets_values_std_up = []
        list_targets_values_std_up_avg = []

        for t in dict_goals[goal]:   # do not append NaNs
            list_targets_values_std_up.extend([x for x in list(targets_values_std_up[country].loc[t]) if not np.isnan(x)])
            list_targets_values_std_up_avg.extend([x for x in list(targets_values_std_up_avg[country].loc[t]) if not np.isnan(x)])
            
        # 1. append target averages
        goals_values_std_up_avg[country].append(np.asarray(list_targets_values_std_up_avg))

        # 2. concatenating
        goals_values_std_up_arr[country].append(np.asarray(list_targets_values_std_up))
    
    goals_values_std_up_avg[country] = np.asarray(goals_values_std_up_avg[country])
    goals_values_std_up[country] = pd.DataFrame(data=goals_values_std_up_arr[country], index=list(dict_goals.keys()))

In [52]:
# concatenating target averages
goals_values_std_up_avg['France'].shape

(18,)

In [62]:
# check for std
for g, goal in enumerate(list(dict_goals.keys())):
    print(goal, len(goals_values_std_up_arr['Mozambique'][g]))

1 39
2 140
3 414
4 715
5 96
6 227
7 120
8 516
9 195
10 45
11 115
12 128
13 114
14 54
15 44
16 39
17 855
T 20


In [58]:
# check for i
for g, goal in enumerate(list(dict_goals.keys())):
    print(goal, len(goals_values_i_up_arr['France'][g]))

1 435
2 380
3 484
4 1569
5 254
6 421
7 122
8 826
9 241
10 61
11 115
12 133
13 167
14 54
15 44
16 213
17 974
T 20


The advantage of saving these numbers in arrays rather than dataframes should now be clear: arrays store only the actual numbers, whereas a dataframe has a fixed number of columns and therefore pads the goals that have fewer values with NaNs.
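
The padding can be seen in a small sketch with two hypothetical goals of different lengths; the values and index labels are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical goals with different numbers of available values.
ragged = [np.array([0.1, 0.2, 0.3]), np.array([0.4])]

# a dataframe pads every row to the length of the longest one
as_frame = pd.DataFrame(data=[list(r) for r in ragged], index=['1', '2'])
print(as_frame.shape)
print(as_frame.isna().sum(axis=1))    # number of padded NaNs per goal

# the arrays keep their true lengths
lengths = [len(r) for r in ragged]
print(lengths)
```

This is why the goal-level loops filter out NaNs when reading values back from the target-level dataframes.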

In [61]:
# better save these precious imputed data
with open('utils/data/goals_values_i_up_wb.pkl', 'wb') as goa_val_i_up:
    pickle.dump(goals_values_i_up, goa_val_i_up)

with open('utils/data/goals_values_i_up_arr_wb.pkl', 'wb') as goa_val_i_up_arr:
    pickle.dump(goals_values_i_up_arr, goa_val_i_up_arr)

with open('utils/data/goals_values_i_up_avg_wb.pkl', 'wb') as goa_val_i_up_avg:
    pickle.dump(goals_values_i_up_avg, goa_val_i_up_avg)

# better save these precious standardised data
with open('utils/data/goals_values_std_up_wb.pkl', 'wb') as goa_val_std_up:
    pickle.dump(goals_values_std_up, goa_val_std_up)

with open('utils/data/goals_values_std_up_arr_wb.pkl', 'wb') as goa_val_std_up_arr:
    pickle.dump(goals_values_std_up_arr, goa_val_std_up_arr)

with open('utils/data/goals_values_std_up_avg_wb.pkl', 'wb') as goa_val_std_up_avg:
    pickle.dump(goals_values_std_up_avg, goa_val_std_up_avg)