# Data preparation for the SDG Indicators by (1) UN and (2) World Bank

## (1) UN data set

**We use the UN SDG data set and convert it so that every country, continent, etc. is in a separate <code>csv</code> file.**

To get started, we download the entire available data from https://unstats.un.org/sdgs/indicators/database/ and call it <code>un_data.csv</code>.


Let's load the data set and look at its columns and rows to figure out how it is structured.


**We aim to have one pandas data frame per country, with all indicators. We save them as separate <code>csv</code> files.** 

Let's start with the usual imports and loading the data set.

In [1]:
import numpy as np
import pandas as pd
import math
import os
import pickle
import copy
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import warnings
warnings.filterwarnings('ignore')

In [25]:
# loading data set
all_data = pd.read_csv('utils/data/data.csv', dtype=object)
# the percentage of targets we have data for
print(round(len(all_data.Target.unique())/169, 3)*100, '%')

62.1 %


In [2]:
# UN data for SDG 13
SDG13_data = pd.read_csv('utils/SDG13_data.csv', dtype=object)

In [3]:
# check
SDG13_data.head()

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,GeoAreaCode,GeoAreaName,TimePeriod,Value,Time_Detail,TimeCoverage,UpperBound,LowerBound,BasePeriod,Source,GeoInfoUrl,FootNote,Nature,Reporting Type,Units
0,1,1.5,1.5.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,4,Afghanistan,2017,1,2017,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
1,1,1.5,1.5.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,8,Albania,2011,8,2011,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
2,1,1.5,1.5.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,8,Albania,2012,1,2012,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
3,1,1.5,1.5.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,8,Albania,2015,2,2015,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
4,1,1.5,1.5.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,24,Angola,2006,1,2006,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER


In [4]:
# take out data not belonging to SDG 13
sdg13_data = SDG13_data[SDG13_data['Goal']=='13']

The data set is structured by indicators and years in rows in one large data frame with all countries. We would like to have one data frame per country. Hence, we first extract the names of *regional groupings*, i.e. countries, continents, etc., and the names of so-called *other groupings*.

According to the UN Statistics Division, other groupings include Least Developed Countries (LDC), Land Locked Developing Countries (LLDC), Small Island Developing States (SIDS), Developed Regions, and Developing Regions. 

Developing Regions are Latin America and the Caribbean, South-Eastern Asia, Southern Asia, Southern Asia (excluding India), Caucasus and Central Asia, Eastern Asia (excluding Japan and China), Western Asia (exc. Armenia, Azerbaijan, Cyprus, Israel and Georgia), Eastern Asia (excluding Japan), Oceania (exc. Australia and New Zealand), Sub-Saharan Africa (inc. Sudan), and Northern Africa (exc. Sudan).

**All these groupings can be subject to separate network analyses of the indicators later on.**




Let's first look at all columns of our data frame before we turn to the different groupings.

In [None]:
list(all_data)

We even have a lot of information on a sub-indicator level, which might be subject to more detailed analyses later on. We could, e.g., explore indicator 4.6.1* by age group and by sex.

\* *Indicator 4.6.1: Proportion of population in a given age group achieving at least a fixed level of proficiency in functional (a) literacy and (b) numeracy skills, by sex.*


We keep this possibility open, but now, let's not go further into a sub-indicator level and see the different groupings.

In [5]:
groupings_UN = list(SDG13_data['GeoAreaName'].unique())
groupings_UN

['Afghanistan',
 'Albania',
 'Angola',
 'Argentina',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Brazil',
 'Solomon Islands',
 'Myanmar',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Cabo Verde',
 'Sri Lanka',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Costa Rica',
 'Dominica',
 'Ecuador',
 'El Salvador',
 'Ethiopia',
 'Fiji',
 'Djibouti',
 'Gambia',
 'Ghana',
 'Guatemala',
 'Honduras',
 'Indonesia',
 'Iran (Islamic Republic of)',
 'Italy',
 "Côte d'Ivoire",
 'Japan',
 'Jordan',
 'Kenya',
 'Republic of Korea',
 "Lao People's Democratic Republic",
 'Madagascar',
 'Malawi',
 'Malaysia',
 'Mali',
 'Mauritius',
 'Morocco',
 'Mozambique',
 'Nepal',
 'Vanuatu',
 'Nicaragua',
 'Pakistan',
 'Panama',
 'Papua New Guinea',
 'Paraguay',
 'Peru',
 'Philippines',
 'Guinea-Bissau',
 'Timor-Leste',
 'Romania',
 'Saint Lucia',
 'Saint Vincent and the Grenadines',
 'Senegal',
 'Serbia',
 'Seychelles',
 'Viet Nam',
 'Zimbabwe',
 'South Sudan',
 'Sudan',
 'Thailand',
 'Togo',
 'Tonga',
 'Tunisia',
 

In [6]:
#all_data.replace({"Democratic People's Republic of Korea": "Korea, Dem. People's Rep.", 'Gambia': 'Gambia, The', 'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom', 'Congo': 'Congo, Rep.', 'Democratic Republic of the Congo': 'Congo, Dem. Rep.', 'Czechia': 'Czech Republic', 'Iran (Islamic Republic of)': 'Iran, Islamic Rep.', "Côte d'Ivoire": "Cote d'Ivoire", 'Kyrgyzstan': 'Kyrgyz Republic', "Lao People's Democratic Republic": 'Lao PDR', 'Republic of Moldova': 'Moldova', 'Micronesia (Federated States of)': 'Micronesia, Fed. Sts.', 'Slovakia': 'Slovak Republic', 'Viet Nam': 'Vietnam', 'Egypt': 'Egypt, Arab Rep.', 'United Republic of Tanzania': 'Tanzania','United States of America': 'United States', 'Venezuela (Bolivarian Republic of)': 'Venezuela, RB', 'Yemen': 'Yemen, Rep.', 'Bahamas': 'Bahamas, The', 'Bolivia (Plurinational State of)': 'Bolivia'}, inplace=True)
sdg13_data.replace({"Democratic People's Republic of Korea": "Korea, Dem. People's Rep.", 'Gambia': 'Gambia, The', 'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom', 'Congo': 'Congo, Rep.', 'Democratic Republic of the Congo': 'Congo, Dem. Rep.', 'Czechia': 'Czech Republic', 'Iran (Islamic Republic of)': 'Iran, Islamic Rep.', "Côte d'Ivoire": "Cote d'Ivoire", 'Kyrgyzstan': 'Kyrgyz Republic', "Lao People's Democratic Republic": 'Lao PDR', 'Republic of Moldova': 'Moldova', 'Micronesia (Federated States of)': 'Micronesia, Fed. Sts.', 'Slovakia': 'Slovak Republic', 'Viet Nam': 'Vietnam', 'Egypt': 'Egypt, Arab Rep.', 'United Republic of Tanzania': 'Tanzania','United States of America': 'United States', 'Venezuela (Bolivarian Republic of)': 'Venezuela, RB', 'Yemen': 'Yemen, Rep.', 'Bahamas': 'Bahamas, The', 'Bolivia (Plurinational State of)': 'Bolivia'}, inplace=True)
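
The renaming above matters because the UN and the World Bank spell some country names differently, and later cells match on the exact name. A minimal sketch of the same idea on a toy table follows. The frame `un`, the shortened mapping `un_to_wb` and the list `wb_countries` are made up for illustration; only the column name `GeoAreaName` and the two name pairs come from this notebook. Unlike the cell above, the sketch restricts the replacement to the name column and assigns the result back.

```python
import pandas as pd

# Hypothetical miniature of the UN table; only the column names follow the notebook.
un = pd.DataFrame({
    'GeoAreaName': ['Viet Nam', 'Gambia', 'Chile', 'Gambia'],
    'Value': ['1', '2', '3', '4'],
})

# Two of the UN -> World Bank spellings from the mapping above.
un_to_wb = {'Viet Nam': 'Vietnam', 'Gambia': 'Gambia, The'}

# Replace whole-cell matches in the name column only and assign the result back.
un['GeoAreaName'] = un['GeoAreaName'].replace(un_to_wb)

# Names that are still missing from the (hypothetical) World Bank list
# either need a mapping entry or are simply not covered there.
wb_countries = ['Vietnam', 'Gambia, The']
unmatched = sorted(set(un['GeoAreaName']) - set(wb_countries))
print(unmatched)
```

Printing the unmatched names after every update of the mapping is a cheap way to spot spellings that still differ.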

In [None]:
# list of keys to delete
delete_groups = []

for g in list(groupings):
    if g not in countries:
        delete_groups.append(g)
        
# delete
for dg in delete_groups:
    groupings.remove(dg)

In [7]:
# only take World Bank countries
c = pd.read_csv('utils/countries_wb.csv', dtype=str, delimiter=';', header=None)
countries = list(c[0])
countries

['Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt, Arab Rep.',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia, The',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Greenland',
 'Grenada',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 '

In [9]:
# loading World Bank data set
wb_data = pd.read_csv('utils/data/wb_data.csv', dtype=object)
wb_data.drop(wb_data.tail(5).index,inplace=True)    # 5 last rows are blank / have other info
wb_data.tail()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],...,2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019]
101776,Zimbabwe,ZWE,"Water productivity, total (constant 2010 US$ G...",ER.GDP.FWTL.M3.KD,..,..,..,..,..,..,...,..,..,..,..,..,..,..,..,..,..
101777,Zimbabwe,ZWE,Women Business and the Law Index Score (1-100),SG.LAW.INDX,55,55,57.5,57.5,66.9,66.9,...,86.9,86.9,86.9,86.9,86.9,86.9,86.9,86.9,86.9,86.9
101778,Zimbabwe,ZWE,Women making their own informed decisions rega...,SG.DMK.SRCR.FN.ZS,..,..,..,..,..,..,...,..,58.8,..,..,..,59.9,..,..,..,..
101779,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,..,..,..,..,6.4,..,...,..,3.9,..,..,..,3.7,..,..,..,..
101780,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,..,..,..,..,31,..,...,..,30.5,..,..,33.5,32.4,..,..,..,..


In [10]:
columns = list(wb_data.columns)
for column in columns[4:]:
    columns.append(column[:4])
    columns.remove(column)

wb_data.columns = columns

In [11]:
years = columns[4:]
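
The renaming loop above iterates over the copy `columns[4:]`, so appending to and removing from `columns` inside the loop is safe. An equivalent form that may be easier to read is sketched below. The one-row frame `wb` is a made-up excerpt; only the header style (`'1990 [YR1990]'`) and the four identifier columns are taken from the output above.

```python
import pandas as pd

# Hypothetical two-year excerpt using the World Bank header style shown above.
wb = pd.DataFrame(
    [['Zimbabwe', 'ZWE', 'Some series', 'XX.YY', '..', '55']],
    columns=['Country Name', 'Country Code', 'Series Name', 'Series Code',
             '1990 [YR1990]', '1991 [YR1991]'],
)

# Keep the four identifier columns and cut every year label to its first four characters.
id_cols = list(wb.columns[:4])
year_cols = [col[:4] for col in wb.columns[4:]]
wb.columns = id_cols + year_cols

years = year_cols
print(years)
```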

Let's now save a data frame with all of the meta-information. We delete the columns which are specific in area and time, and of course we do not want to have the values in this data frame. In the end, we delete all duplicate entries in the column **SeriesCode**. So, we are left with the information we wanted: mapping the series codes to the indicators, the Source for the data, the Units measured in, etc.

In [12]:
info = sdg13_data.drop(columns=['GeoAreaCode', 'GeoAreaName', 'TimePeriod', 'Value', 'Time_Detail']).drop_duplicates(subset=['Indicator', 'SeriesCode'])


In [15]:
# check
info.head()

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,TimeCoverage,UpperBound,LowerBound,BasePeriod,Source,GeoInfoUrl,FootNote,Nature,Reporting Type,Units
21300,13,13.1,13.1.1,VC_DSR_MISS,Number of missing persons due to disaster (num...,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
21694,13,13.1,13.1.1,VC_DSR_AFFCT,Number of people affected by disaster (number),,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,,G,NUMBER
22912,13,13.1,13.1.1,VC_DSR_MORT,Number of deaths due to disaster (number),,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER
24103,13,13.1,13.1.1,VC_DSR_MTMP,Number of deaths and missing persons attribute...,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,PER_100000_POP
25299,13,13.1,13.1.1,VC_DSR_MTMN,Number of deaths and missing persons attribute...,,,,,United Nations Office for Disaster Risk Reduct...,,Disclaimer: the data being submitted by UNDRR ...,C,G,NUMBER


In [16]:
# list of all series codes of SDG 13
seriescodes_13 = set(list(info['SeriesCode']))
seriescodes_13

{'SG_DSR_SILN',
 'SG_DSR_SILS',
 'SG_GOV_LOGV',
 'VC_DSR_AFFCT',
 'VC_DSR_DAFF',
 'VC_DSR_IJILN',
 'VC_DSR_MISS',
 'VC_DSR_MORT',
 'VC_DSR_MTMN',
 'VC_DSR_MTMP',
 'VC_DSR_PDAN',
 'VC_DSR_PDLN',
 'VC_DSR_PDYN'}

In [18]:
# count how many we have
len(seriescodes_13)

13

We convert the data set into multiple small data sets by creating a dictionary that contains the groupings' names as keys. 

First, we create empty data frames for each key.

In [19]:
#dict_all = {c: pd.DataFrame() for c in countries}
dict_13 = {c: pd.DataFrame() for c in countries}

In [20]:
# check, should be empty
#dict_all.get('Belize')
dict_13.get('Belize')

Second, we replace each of the empty data frames with the data we have available for it. Note that our dictionary will be the ensemble of all groupings.

In [21]:
for c in countries:    # memory-intensive
    #dict_all[c] = all_data[all_data['GeoAreaName'].isin(['{}'.format(c)])]
    dict_13[c] = sdg13_data[sdg13_data['GeoAreaName'].isin(['{}'.format(c)])]
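
The loop above filters the full table once per country. As a possible alternative, `groupby` splits the table in a single pass. This is only a sketch under assumptions: `long_df`, `per_country` and the short `countries` list are toy stand-ins, and only the column names are taken from the UN table.

```python
import pandas as pd

# Hypothetical long-format excerpt; column names follow the UN table above.
long_df = pd.DataFrame({
    'GeoAreaName': ['Albania', 'Albania', 'Angola'],
    'SeriesCode': ['VC_DSR_MISS', 'VC_DSR_MORT', 'VC_DSR_MISS'],
    'TimePeriod': ['2011', '2011', '2006'],
    'Value': ['8', '3', '1'],
})
countries = ['Albania', 'Angola', 'Belize']  # assumed target list

# One pass over the table: groupby yields (name, sub-frame) pairs.
by_name = {name: frame for name, frame in long_df.groupby('GeoAreaName')}

# Countries without any rows get an empty frame with the same columns.
empty = long_df.iloc[0:0]
per_country = {c: by_name.get(c, empty) for c in countries}

print({c: len(frame) for c, frame in per_country.items()})
```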

In [23]:
# check
dict_13['Azerbaijan']

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,GeoAreaCode,GeoAreaName,TimePeriod,Value,Time_Detail,TimeCoverage,UpperBound,LowerBound,BasePeriod,Source,GeoInfoUrl,FootNote,Nature,Reporting Type,Units


Now, we have one data frame per country. The next step is to have years as columns.

The next cell gives us the series codes in the rows and the years in the columns. These series codes are unique descriptions of the sub-indicators and we match these series codes to indicators and all other information in a different data frame.

In [24]:
for c in countries:
    #dict_all[c] = dict_all.get(c).pivot_table(values='Value', index='SeriesCode', columns='TimePeriod', dropna=False, aggfunc='first')
    if c not in groupings_UN:
        dict_13[c] = pd.DataFrame(index=seriescodes_13, columns=years)
    else:
        dict_13[c] = dict_13.get(c).pivot_table(values='Value', index='SeriesCode', columns='TimePeriod', dropna=False, aggfunc='first')
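
To see what the pivot does, here is a small sketch on made-up rows (the frame `long_df` and its values are hypothetical; the column names and the `aggfunc='first'` choice follow the cell above). Each series code becomes a row, each year a column, and if a (series code, year) pair occurs more than once, only the first value is kept.

```python
import pandas as pd

# Hypothetical long-format rows for a single country.
long_df = pd.DataFrame({
    'SeriesCode': ['VC_DSR_MORT', 'VC_DSR_MORT', 'VC_DSR_MISS', 'VC_DSR_MORT'],
    'TimePeriod': ['2005', '2006', '2006', '2006'],
    'Value': ['15', '4', '1', '99'],   # the last row duplicates (VC_DSR_MORT, 2006)
})

# Series codes as rows, years as columns; duplicates resolve to the first value.
wide = long_df.pivot_table(values='Value', index='SeriesCode',
                           columns='TimePeriod', aggfunc='first')
print(wide)
```

Combinations that never occur in the long table, such as `VC_DSR_MISS` in 2005 here, show up as NaN in the wide table.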

In [27]:
# check
dict_13['Austria']

TimePeriod,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
SeriesCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
SG_DSR_SILN,,,,,,,,,,,,,,850.0
SG_DSR_SILS,,,,,,,,,,,,,,40.55344
SG_GOV_LOGV,,,,,,,,,,,,,,2096.0
VC_DSR_AFFCT,111.0,61.0,109.0,152.0,119.0,65.0,57.0,58.0,583.0,30.0,196.0,60.0,271.0,6630.0
VC_DSR_DAFF,1.34486,0.73624,1.31108,1.82221,1.42129,0.77289,0.67428,0.68217,6.81378,0.34822,2.25841,0.68593,3.0726,75.17007
VC_DSR_IJILN,111.0,61.0,109.0,152.0,119.0,65.0,57.0,58.0,583.0,30.0,196.0,60.0,271.0,29.0
VC_DSR_MORT,15.0,4.0,4.0,8.0,6.0,9.0,1.0,13.0,15.0,3.0,6.0,4.0,11.0,1.0
VC_DSR_MTMN,15.0,4.0,4.0,8.0,6.0,9.0,1.0,13.0,15.0,3.0,6.0,4.0,11.0,1.0
VC_DSR_MTMP,0.18174,0.04828,0.04811,0.09591,0.07166,0.10702,0.01183,0.1529,0.17531,0.03482,0.06914,0.04573,0.12472,0.01134
VC_DSR_PDAN,,,,,,,,,,,,,,6598.0


## Cleaning up and transforming all country data frames into the same dimensions

We have a couple of things to do to make our data frames workable:
1. Some values in the data frames contain characters we do not want, e.g. <code>,</code>, <code> = </code>, <code>N</code>, etc. We replace them with appropriate values, i.e. <code>0</code>, or simply remove them.
2. Some data frames have data from **1990** to **2018**, others from **1992** to **2018**. We want all data frames to cover **1990** to **2019**, i.e. an equal number of columns. The additional columns are filled with <code>NaNs</code>.
3. Some data frames do not list all indicators and sub-indicators, but we would like to have all of them in all data frames. These additional rows are filled with <code>NaNs</code>.

Let's start with the first task, i.e. cleaning up the data frames.

We first need to define lists for all years, i.e. **1990** to **2019**, and for all indicators and sub-indicators, i.e. series codes.

Change <span style="color:red"> 'Germany' </span> in the cell below to a few other countries and you'll see that they can have different lengths. We need to bring them all to the same length. We agree on having data for the **years 1990 to 2019**.

Now, we insert the missing years for all groupings. We want to have NaNs in those columns.

In [28]:
# example
list(dict_13['Germany'])

['2015', '2016', '2017']

Firstly, we insert the missing years as columns for all groupings.

In [29]:
for c in countries:    # memory-intensive
    for year in years:
        if year not in list(dict_13[c]):
            dict_13[c]['{}'.format(year)] = np.nan
    # having the years in order
    dict_13[c] = dict_13[c][years]

In [30]:
# check
dict_13['Azerbaijan']

Unnamed: 0,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
SG_GOV_LOGV,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_DAFF,,,,,,,,,,,...,,,,,,,,,,
SG_DSR_SILS,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_PDYN,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_PDLN,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_MTMP,,,,,,,,,,,...,,,,,,,,,,
SG_DSR_SILN,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_MORT,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_AFFCT,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_IJILN,,,,,,,,,,,...,,,,,,,,,,


Secondly, we insert the missing series codes as rows.

Let's first see how many rows we have for <span style="color:red"> Nicaragua </span>.

In [31]:
len(list(dict_13['Nicaragua'].index))

10

Let's have all $J$ sub-indicators we want for each country as rows. We fill these rows with NaNs. 

In [33]:
for c in countries:        # memory-intensive
    for seriescode in seriescodes_13:
        if seriescode not in list(dict_13[c].index):
            dict_13[c].loc[seriescode] = np.nan    # fill these rows with NaNs

In [34]:
# check: do we have J many?
len(list(dict_13['Nicaragua'].index))

13
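
Both padding steps, missing years as columns and missing series codes as rows, can also be sketched with a single `reindex` call. The frame `df`, the shortened year range and the three series codes below are illustrative assumptions and not the notebook's actual data.

```python
import numpy as np
import pandas as pd

years = [str(y) for y in range(1990, 1994)]   # shortened year range for the demo
series_codes = ['VC_DSR_MISS', 'VC_DSR_MORT', 'SG_GOV_LOGV']

# Hypothetical country frame that only covers two years and one series code.
df = pd.DataFrame({'1992': [3.0], '1993': [5.0]}, index=['VC_DSR_MORT'])

# reindex adds the missing labels (filled with NaN) and fixes the order of both axes.
padded = df.reindex(index=series_codes, columns=years)
print(padded)
```

Passing a sorted list rather than a set for the row labels keeps the row order identical across countries.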

In [35]:
# convert all to floats
for c in countries:
    for year in years:    
        for seriescode in seriescodes_13:
            if not isinstance(dict_13[c].loc[seriescode, year], float):
                dict_13[c].loc[seriescode, year] = float(dict_13[c].loc[seriescode, year].replace(',', '').replace('<', '').replace('>', '').replace('=', '').replace('N', '0').replace(' -   ', '0').replace('0V', '0').replace('. . .', '0'))
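
The cell-by-cell conversion above can be slow for many countries. A column-wise variant is sketched below under simplifying assumptions: `raw` is a made-up frame, the helper `clean_column` is hypothetical, and it only covers some of the replacements (commas and comparison signs are stripped, a cell that is exactly `'N'` becomes `0`). Anything that still cannot be parsed is turned into NaN instead of raising an error, which differs from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values as they might appear after reading everything as strings.
raw = pd.DataFrame({'2017': ['1,250', '<5', np.nan],
                    '2018': ['N', '40.5', '>=3']},
                   index=['A', 'B', 'C'])

def clean_column(col):
    # Strip thousands separators and comparison signs; NaN stays NaN.
    s = col.str.replace(r'[,<>=]', '', regex=True)
    # A cell that consists only of 'N' is set to 0.
    s = s.replace({'N': '0'})
    # Convert to numbers; unparseable leftovers become NaN.
    return pd.to_numeric(s, errors='coerce')

clean = raw.apply(clean_column)
print(clean)
```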

In [36]:
# double-check: are all series codes as rows?
len(list(dict_13['Nicaragua'].index))

13

Finally, we can save all countries as different <code>csv</code> files and as one `dict`.

In [None]:
if not os.path.exists('csv_original'):
    os.mkdir('csv_original')

In [None]:
for c in countries:
    dict_all[c].to_csv(r'csv_original/{}.csv'.format(c))

Having the information file might also be helpful.

In [None]:
if not os.path.exists('utils'):
    os.mkdir('utils')
    
info.to_csv(r'utils/info.csv')

In [37]:
# as one pickle file
dict13 = open('utils/data/dict_13.pkl', 'wb')
pickle.dump(dict_13, dict13)
dict13.close()

In [None]:
# as one pickle file
dictall = open('utils/data/dict_all.pkl', 'wb')
pickle.dump(dict_all, dictall)
dictall.close()

In [None]:
# CHECKPOINT
dictall = open('utils/data/dict_all.pkl', 'rb')
dict_all = pickle.load(dictall)
dictall.close()
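
The save and checkpoint cells above open and close the files by hand. A `with` block closes the file automatically, even if pickling fails. The following round trip is only a sketch: the dictionary `frames` and the file name `dict_demo.pkl` are made up, and a temporary directory is used so that nothing in `utils/data` is touched.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical stand-in for the dictionary of per-country data frames.
frames = {'Belize': pd.DataFrame({'2018': [1.0]}, index=['VC_DSR_MORT'])}

# A temporary directory keeps the demo away from the real checkpoint files.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'dict_demo.pkl')
    with open(path, 'wb') as fh:      # the file is closed even if dump raises
        pickle.dump(frames, fh)
    with open(path, 'rb') as fh:
        restored = pickle.load(fh)

print(restored['Belize'].equals(frames['Belize']))
```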

## Visualising time-series

We quickly visualise the time-series to get a better idea of the characteristics of our data set.

In [None]:
f, (ax, ax2) = plt.subplots(2, 1, sharex=True)
ax2.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_ODA_BDVL'], color='#42B24C', linewidth=5)
ax.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_TRF_TOTL'], color='#DE0E68', linewidth=5)

ax2.set_ylim(0, 251)  # biodiversity ODA
ax.set_ylim(250, 2501)  # total ODA

# hide the spines between ax and ax2
ax.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax.xaxis.tick_top()
ax.tick_params(labelsize=20, labeltop=False)  # don't put tick labels at the top
ax2.xaxis.tick_bottom()

plt.xticks(np.arange(2000, 2019, step=2), size=20)
ax2.tick_params(labelsize=20)

f.set_figheight(8)
f.set_figwidth(12)

plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_ODA_BDVL'], color='#42B24C', linewidth=5)
plt.xticks(np.arange(2000, 2019, step=2), size=20)
plt.yticks(np.arange(0, 251, step=25), size=20)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_TRF_TOTL'], color='#DE0E68', linewidth=5)
plt.xticks(np.arange(2000, 2019, step=2), size=20)
plt.yticks(np.arange(0, 2501, step=250), size=20)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.yticks(np.arange(0, 2001, step=250), size=20)
plt.xticks(np.arange(0, 251, step=25), size=20)
plt.plot(dict_all['Bolivia'].loc['DC_ODA_BDVL'], dict_all['Bolivia'].loc['DC_TRF_TOTL'], '--bo'); #, s=100, color='black')

## Data standardisation
We have saved the original data set, but it is often useful to have the data standardised. Standardising a data set involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. Standardisation is often required by machine learning algorithms when your time series data has input values with differing scales. 

We create a new dictionary `dict_all_std` to keep the standardised values separate from the original ones.
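
For a single series $x$ with mean $\bar{x}$ and standard deviation $\sigma$, the standardised value in year $t$ is $z_t = (x_t - \bar{x}) / \sigma$. The sketch below spells this out with NumPy only. It assumes that missing years are ignored when computing $\bar{x}$ and $\sigma$ and stay missing afterwards, and that $\sigma$ is the population standard deviation (`ddof=0`); both assumptions are consistent with the standardised values printed in the World Bank check further down. The four non-missing numbers are Belgium's `ER.H2O.FWTL.ZS` values from that check, the array `x` itself is shortened for illustration.

```python
import numpy as np

# Shortened series with gaps; the non-missing values are taken from the Belgium check.
x = np.array([np.nan, 64.083333, np.nan, 56.125, np.nan, 51.783333, 50.016667])

mean = np.nanmean(x)     # mean over the observed years only
std = np.nanstd(x)       # population standard deviation (ddof=0), NaNs ignored
z = (x - mean) / std     # NaNs stay NaN

print(np.round(z, 6))
```

After this transformation the observed values of every series have mean 0 and standard deviation 1, so series measured in different units become comparable.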

In [None]:
# CHECKPOINT (we don't want to re-run the entire script every time we continue working on it)
dict_all = pickle.load(open('utils/data/dict_all.pkl', 'rb'))
dict_all_std = pickle.load(open('utils/data/dict_all_std.pkl', 'rb'))

In [None]:
# ~15 minutes computing time
#dict_all_std = copy.deepcopy(dict_all)
dict_13_std = copy.deepcopy(dict_13)   

for country in countries:
    for seriescode in seriescodes_13:
        
        dict_13_std[country].loc[seriescode] = scale(dict_13[country].loc[seriescode])
        #dict_all_std[country].loc[seriescode] = scale(dict_all[country].loc[seriescode])

In [None]:
#check
print('Original value', dict_all['France'].loc['AG_LND_FRST'])
print('-------')
print('Standardised value', dict_all_std['France'].loc['AG_LND_FRST'])

We had better save `dict_all_std`.

In [None]:
# as csv files per grouping
if not os.path.exists('csv_standardised'):
    os.mkdir('csv_standardised')
    
for group in groupings:
    dict_all_std[group].to_csv(r'csv_standardised/{}.csv'.format(group))

# as one pickle file
stand = open('utils/data/dict_all_std.pkl', 'wb')
pickle.dump(dict_all_std, stand)
stand.close()

## (2) World Bank data set

**We use the World Bank data set and convert it so that every country, continent, etc. is in a separate <code>csv</code> file.**

To get started, we download the entire available data from http://datatopics.worldbank.org/sdgs/ and call it <code>wb_data.csv</code>.


Let's load the data set and look at its columns and rows to figure out how it is structured.


**We aim to have one pandas data frame per country, with all indicators. We save them as separate <code>csv</code> files.** 

In [13]:
# only execute when new indicators are added -> call new metadata file wb_info_new.csv

wb_info_new = pd.read_csv('utils/wb_info.csv', header=None, dtype=object)
print(len(wb_info_new))

#wb_info = pd.read_csv('utils/wb_info.csv', header=None, dtype=object)
#wb_info = wb_info.drop(columns=['Topic', 'Indicator Name'])
#print('old:', len(wb_info))

401


In [21]:
# the percentage of targets we have data for
print(round(len(wb_info_new[4].unique())/169, 4)*100, '%')

45.56 %


In [None]:
i = 0

for code in wb_info_new[0]:
    if code not in list(wb_info[0]):
        print(code)
        i += 1

print()
print('# added indicators:', i)
print()

j = 0

for code in wb_info[0]:
    if code not in list(wb_info_new[0]):
        print(code)
        j += 1

print()
print('# deleted indicators:', j)
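
The comparison above is a set difference in both directions. A compact sketch with two made-up code lists (`old_codes` and `new_codes` are hypothetical, the codes themselves appear elsewhere in this notebook):

```python
# Hypothetical series codes of the old and the new metadata file.
old_codes = ['EG.ELC.ACCS.ZS', 'SG.LAW.INDX', 'FX.OWN.TOTL.ZS']
new_codes = ['EG.ELC.ACCS.ZS', 'SG.LAW.INDX', 'ER.H2O.FWTL.ZS']

# Codes that only appear in the new file were added, the others were deleted.
added = sorted(set(new_codes) - set(old_codes))
deleted = sorted(set(old_codes) - set(new_codes))

print('# added indicators:', len(added), added)
print('# deleted indicators:', len(deleted), deleted)
```

Set lookups avoid rebuilding a list for every code, which helps when the metadata files grow.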

In [None]:
all_countries = list(wb_data['Country Name'].unique())
# save countries
np.savetxt('utils/countries_wb_all.csv', all_countries, delimiter=';', fmt='%s')

In [39]:
dict_all_wb = {country: pd.DataFrame() for country in countries}
for country in countries:
    print(country)
    dict_all_wb[country] = wb_data[wb_data['Country Name'].isin(['{}'.format(country)])]
    dict_all_wb[country] = dict_all_wb[country].drop(columns=['Country Name', 'Country Code', 'Series Name'])
    dict_all_wb[country].set_index('Series Code', inplace=True)
    dict_all_wb[country] = pd.concat([dict_all_wb[country], dict_13[country]])    # adding series codes for SDG 13
    dict_all_wb[country] = dict_all_wb[country].replace('..', np.nan).astype(float)

Afghanistan
Albania
Algeria
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas, The
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo, Dem. Rep.
Congo, Rep.
Costa Rica
Cote d'Ivoire
Croatia
Cuba
Cyprus
Czech Republic
Denmark
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt, Arab Rep.
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Fiji
Finland
France
Gabon
Gambia, The
Georgia
Germany
Ghana
Greece
Greenland
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Honduras
Hungary
Iceland
India
Indonesia
Iran, Islamic Rep.
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea, Dem. People's Rep.
Kuwait
Kyrgyz Republic
Lao PDR
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Maldives
M

In [44]:
dict_all_wb['Azerbaijan']

Unnamed: 0,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
EG.CFT.ACCS.ZS,,,,,,,,,,,...,91.380000,92.250000,92.96,93.71,94.440000,95.1,95.54,,,
EG.ELC.ACCS.ZS,,,,,,,,,,97.000000,...,99.938293,99.900000,100.00,100.00,100.000000,100.0,100.00,100.000000,100.0,
EG.ELC.ACCS.RU.ZS,,,,,,,,,,94.450461,...,99.888518,99.900000,100.00,100.00,100.000000,100.0,100.00,100.000000,100.0,
EG.ELC.ACCS.UR.ZS,,,,,,,,,,99.431488,...,99.981720,99.900000,100.00,100.00,100.000000,100.0,100.00,100.000000,100.0,
FX.OWN.TOTL.ZS,,,,,,,,,,,...,,14.900849,,,29.151493,,,28.571203,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VC_DSR_AFFCT,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_IJILN,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_MISS,,,,,,,,,,,...,,,,,,,,,,
VC_DSR_PDAN,,,,,,,,,,,...,,,,,,,,,,
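
The per-country steps for the World Bank data (filter by country, drop the descriptive columns, index by series code, turn the `..` placeholder into NaN, convert to floats) can be tried out on a toy table. The frame `wb` and its values are made up; only the column names and the `..` placeholder follow the World Bank file shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt of the World Bank table with already shortened year labels.
wb = pd.DataFrame({
    'Country Name': ['Zimbabwe', 'Zimbabwe', 'Belgium'],
    'Country Code': ['ZWE', 'ZWE', 'BEL'],
    'Series Name': ['a', 'b', 'a'],
    'Series Code': ['SG.LAW.INDX', 'SP.M15.2024.FE.ZS', 'SG.LAW.INDX'],
    '1990': ['55', '..', '..'],
    '1991': ['55', '6.4', '70'],
}, dtype=object)

country = 'Zimbabwe'
frame = (
    wb[wb['Country Name'] == country]                                # rows of one country
    .drop(columns=['Country Name', 'Country Code', 'Series Name'])   # keep code and years
    .set_index('Series Code')
    .replace('..', np.nan)                                           # '..' marks missing values
    .astype(float)
)
print(frame)
```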


In [41]:
seriescodes_wb = set(list(dict_all_wb['Germany'].index))

In [42]:
# saving data
for country in countries: 
    dict_all_wb[country].to_csv(r'csv_original/{}_wb.csv'.format(country))
    
# as one pickle file
dictall = open('utils/data/dict_all_wb.pkl', 'wb')
pickle.dump(dict_all_wb, dictall)
dictall.close()

## Data standardisation

In [45]:
dict_all_wb_std = copy.deepcopy(dict_all_wb)

for country in countries:
    for seriescode in seriescodes_wb:
        # adding noise as representative for measurement errors
        #noise = np.random.normal(scale=0.1, size=len(dict_all_wb[country].loc[seriescode]))
        
        #dict_all_wb[country].loc[seriescode] = dict_all_wb[country].loc[seriescode] + noise
        
        dict_all_wb_std[country].loc[seriescode] = scale(dict_all_wb[country].loc[seriescode])

In [51]:
#check
print('Original value', dict_all_wb['Belgium'].loc['ER.H2O.FWTL.ZS'])
print('-------')
print('Standardised value', dict_all_wb_std['Belgium'].loc['ER.H2O.FWTL.ZS'])

Original value 1990          NaN
1991          NaN
1992          NaN
1993          NaN
1994          NaN
1995          NaN
1996          NaN
1997    64.083333
1998          NaN
1999          NaN
2000          NaN
2001          NaN
2002    56.125000
2003          NaN
2004          NaN
2005          NaN
2006          NaN
2007    51.783333
2008          NaN
2009          NaN
2010          NaN
2011          NaN
2012    50.016667
2013          NaN
2014          NaN
2015          NaN
2016          NaN
2017          NaN
2018          NaN
2019          NaN
Name: ER.H2O.FWTL.ZS, dtype: float64
-------
Standardised value 1990         NaN
1991         NaN
1992         NaN
1993         NaN
1994         NaN
1995         NaN
1996         NaN
1997    1.580306
1998         NaN
1999         NaN
2000         NaN
2001         NaN
2002    0.114715
2003         NaN
2004         NaN
2005         NaN
2006         NaN
2007   -0.684838
2008         NaN
2009         NaN
2010         NaN
2011         NaN
2012   

We had better save `dict_all_wb_std`.

In [52]:
# as csv files per grouping
if not os.path.exists('csv_standardised'):
    os.mkdir('csv_standardised')
    
for country in countries:
    dict_all_wb_std[country].to_csv(r'csv_standardised/{}_wb.csv'.format(country))

# as one pickle file
stand = open('utils/data/dict_all_wb_std.pkl', 'wb')
pickle.dump(dict_all_wb_std, stand)
stand.close()