# Data preparation for the SDG Indicators by (1) UN and (2) World Bank

## (1) UN data set

**We use the UN SDG data set and convert it so that every country, continent, etc. is in a separate <code>csv</code> file.**

To get started, we download all available data from https://unstats.un.org/sdgs/indicators/database/ and call the file <code>un_data.csv</code>.


Let's load the data set and look at its columns and rows to figure out how it is structured.


**We aim to have one pandas data frame per country, with all indicators. We save them as separate <code>csv</code> files.** 

Let's start with the usual imports and loading the data set.

In [1]:
import numpy as np
import pandas as pd
import math
import os
import pickle
import copy
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale
import warnings
warnings.filterwarnings('ignore')

In [231]:
# loading data set
all_data = pd.read_csv('utils/data/un_data.csv', dtype=object)
all_data.tail()

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,GeoAreaCode,GeoAreaName,TimePeriod,Value,Time_Detail,...,[Name of international institution],[Name of non-communicable disease],[Policy Domains],[Quantile],[Reporting Type],[Sex],[Type of occupation],[Type of product],[Type of skill],[Type of speed]
1083602,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,894,Zambia,2013,85.76164,2013,...,,,,,G,,,,,
1083603,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,894,Zambia,2014,125.47225,2014,...,,,,,G,,,,,
1083604,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,894,Zambia,2015,94.86903,2015,...,,,,,G,,,,,
1083605,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,894,Zambia,2016,93.93723,2016,...,,,,,G,,,,,
1083606,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,894,Zambia,2017,98.11231,2017,...,,,,,G,,,,,


The data set is structured by indicators and years in rows in one large data frame with all countries. We would like to have one data frame per country. Hence, we first extract the names of *regional groupings*, i.e. countries, continents, etc., and the names of so-called *other groupings*.

According to the UN Statistics Division, other groupings include Least Developed Countries (LDC), Land Locked Developing Countries (LLDC), Small Island Developing States (SIDS), Developed Regions, and Developing Regions. 

Developing Regions are Latin America and the Caribbean, South-Eastern Asia, Southern Asia, Southern Asia (excluding India), Caucasus and Central Asia, Eastern Asia (excluding Japan and China), Western Asia (exc. Armenia, Azerbaijan, Cyprus, Israel and Georgia), Eastern Asia (excluding Japan), Oceania (exc. Australia and New Zealand), Sub-Saharan Africa (inc. Sudan), and Northern Africa (exc. Sudan).

**All these groupings can be subject to separate network analyses of the indicators later on.**




Let's first look at all columns of our data frame before we turn to these groupings.

In [232]:
list(all_data)

['Goal',
 'Target',
 'Indicator',
 'SeriesCode',
 'SeriesDescription',
 'GeoAreaCode',
 'GeoAreaName',
 'TimePeriod',
 'Value',
 'Time_Detail',
 ' UpperBound',
 ' LowerBound',
 ' BasePeriod',
 ' Source',
 'FootNote',
 'Nature',
 'Units',
 '[Age]',
 '[Cities]',
 '[Disability status]',
 '[Education level]',
 '[IHR Capacity]',
 '[Level/Status]',
 '[Location]',
 '[Migratory status]',
 '[Mode of transportation]',
 '[Name of international institution]',
 '[Name of non-communicable disease]',
 '[Policy Domains]',
 '[Quantile]',
 '[Reporting Type]',
 '[Sex]',
 '[Type of occupation]',
 '[Type of product]',
 '[Type of skill]',
 '[Type of speed]']

We even have a lot of information at the sub-indicator level, which might be subject to more detailed analyses later on. We could, e.g., explore indicator 4.6.1* by different age groups and by sex.

\* *Indicator 4.6.1: Proportion of population in a given age group achieving at least a fixed level of proficiency in functional (a) literacy and (b) numeracy skills, by sex.*


We keep this possibility open, but for now we do not go down to the sub-indicator level and instead look at the different groupings.

In [233]:
groupings = list(all_data['GeoAreaName'].unique())
groupings

['World',
 'South America',
 'Albania',
 'Oceania',
 'Western Africa',
 'Algeria',
 'Central America',
 'Eastern Africa',
 'Northern Africa',
 'Middle Africa',
 'Southern Africa',
 'Northern America',
 'Angola',
 'Caribbean',
 'Eastern Asia',
 'Azerbaijan',
 'Argentina',
 'Southern Asia',
 'South-Eastern Asia',
 'Australia',
 'Southern Europe',
 'Austria',
 'Bangladesh',
 'Armenia',
 'Australia and New Zealand',
 'Melanesia',
 'Belgium',
 'Polynesia',
 'Central and Southern Asia',
 'Bhutan',
 'Bolivia (Plurinational State of)',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Solomon Islands',
 'Bulgaria',
 'Myanmar',
 'Burundi',
 'Belarus',
 'Cameroon',
 'Canada',
 'Cabo Verde',
 'Caucasus and Central Asia',
 'Central African Republic',
 'Asia',
 'Central Asia',
 'Sri Lanka',
 'Western Asia',
 'Chad',
 'Europe',
 'Eastern Europe',
 'Chile',
 'Northern Europe',
 'Western Europe',
 'China',
 'Colombia',
 'Comoros',
 'Congo',
 'Democratic Republic of the Congo',
 'Costa Rica',
 'Croa

In [234]:
# only take World Bank countries
c = pd.read_csv('utils/countries_wb.csv', dtype=str, delimiter=';', header=None)
countries = list(c[0])
countries

['Afghanistan',
 'Albania',
 'Algeria',
 'Angola',
 'Antigua and Barbuda',
 'Argentina',
 'Armenia',
 'Australia',
 'Austria',
 'Azerbaijan',
 'Bahamas, The',
 'Bahrain',
 'Bangladesh',
 'Barbados',
 'Belarus',
 'Belgium',
 'Belize',
 'Benin',
 'Bhutan',
 'Bolivia',
 'Bosnia and Herzegovina',
 'Botswana',
 'Brazil',
 'Brunei Darussalam',
 'Bulgaria',
 'Burkina Faso',
 'Burundi',
 'Cambodia',
 'Cameroon',
 'Canada',
 'Central African Republic',
 'Chad',
 'Chile',
 'China',
 'Colombia',
 'Comoros',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 'Costa Rica',
 "Cote d'Ivoire",
 'Croatia',
 'Cuba',
 'Cyprus',
 'Czech Republic',
 'Denmark',
 'Djibouti',
 'Dominica',
 'Dominican Republic',
 'Ecuador',
 'Egypt, Arab Rep.',
 'El Salvador',
 'Equatorial Guinea',
 'Eritrea',
 'Estonia',
 'Ethiopia',
 'Fiji',
 'Finland',
 'France',
 'Gabon',
 'Gambia, The',
 'Georgia',
 'Germany',
 'Ghana',
 'Greece',
 'Greenland',
 'Grenada',
 'Guatemala',
 'Guinea',
 'Guinea-Bissau',
 'Guyana',
 'Haiti',
 'Honduras',
 '

In [235]:
# harmonise the UN country names with the World Bank spelling
un_to_wb = {
    "Democratic People's Republic of Korea": "Korea, Dem. People's Rep.",
    'Gambia': 'Gambia, The',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'Congo': 'Congo, Rep.',
    'Democratic Republic of the Congo': 'Congo, Dem. Rep.',
    'Czechia': 'Czech Republic',
    'Iran (Islamic Republic of)': 'Iran, Islamic Rep.',
    "Côte d'Ivoire": "Cote d'Ivoire",
    'Kyrgyzstan': 'Kyrgyz Republic',
    "Lao People's Democratic Republic": 'Lao PDR',
    'Republic of Moldova': 'Moldova',
    'Micronesia (Federated States of)': 'Micronesia, Fed. Sts.',
    'Slovakia': 'Slovak Republic',
    'Viet Nam': 'Vietnam',
    'Egypt': 'Egypt, Arab Rep.',
    'United Republic of Tanzania': 'Tanzania',
    'United States of America': 'United States',
    'Venezuela (Bolivarian Republic of)': 'Venezuela, RB',
    'Yemen': 'Yemen, Rep.',
    'Bahamas': 'Bahamas, The',
    'Bolivia (Plurinational State of)': 'Bolivia',
}
all_data.replace(un_to_wb, inplace=True)

In [236]:
# keep only the groupings that are World Bank countries;
# the names are read again because the cell above has renamed some of them
groupings = [g for g in all_data['GeoAreaName'].unique() if g in countries]
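
Before filtering, it can help to list which World Bank names still have no counterpart among the UN `GeoAreaName` values, since any such country ends up with an empty data frame. The following is a minimal sketch on toy lists; `unmatched_names` is a hypothetical helper and not part of the original notebook.

```python
def unmatched_names(wb_names, un_names):
    """Return the World Bank names that do not occur among the UN names."""
    un_set = set(un_names)
    return sorted(name for name in wb_names if name not in un_set)


# toy example: 'Vietnam' is spelled 'Viet Nam' in the UN data
wb_names = ['Albania', 'Bolivia', 'Vietnam']
un_names = ['World', 'Albania', 'Bolivia', 'Viet Nam']
missing = unmatched_names(wb_names, un_names)
print(missing)
```

In the notebook, a call such as `unmatched_names(countries, all_data['GeoAreaName'].unique())` would point at names that may still need an entry in the renaming dictionary.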

We convert the data set into multiple small data sets by creating a dictionary that contains the country names as keys.

First, we create empty data frames for each key.

In [237]:
dict_all = {c: pd.DataFrame() for c in countries}

In [238]:
# check, should be empty
dict_all.get('Belize')

Second, we replace each of the empty data frames with the data we have available for that country. Note that the dictionary as a whole then holds the entire ensemble of countries.

In [239]:
for c in countries:    # memory-intensive
    dict_all[c] = all_data[all_data['GeoAreaName'] == c]
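
The loop above scans the full table once per country. As an alternative sketch, assuming the same long-format layout, a single `groupby` pass can produce the same per-country split. The toy values and the names `toy`, `wanted` and `split` are made up for illustration.

```python
import pandas as pd

# toy long-format table (values are made up for illustration)
toy = pd.DataFrame({
    'GeoAreaName': ['Albania', 'Albania', 'Belize'],
    'SeriesCode': ['SI_POV_DAY1', 'SI_POV_DAY1', 'SI_POV_DAY1'],
    'TimePeriod': ['2000', '2001', '2000'],
    'Value': ['1.0', '2.0', '3.0'],
})
wanted = ['Albania', 'Belize', 'Chad']

# one pass over the table; each group is the sub-frame of one area
groups = {name: frame for name, frame in toy.groupby('GeoAreaName')}

# countries without any rows get an empty frame with the same columns
split = {c: groups.get(c, toy.iloc[0:0]) for c in wanted}
print({c: len(df) for c, df in split.items()})
```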

In [240]:
# check
dict_all['Bolivia']

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,GeoAreaCode,GeoAreaName,TimePeriod,Value,Time_Detail,...,[Name of international institution],[Name of non-communicable disease],[Policy Domains],[Quantile],[Reporting Type],[Sex],[Type of occupation],[Type of product],[Type of skill],[Type of speed]
224,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,68,Bolivia,2000,28.6,2000,...,,,,,G,,,,,
225,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,68,Bolivia,2001,22.8,2001,...,,,,,G,,,,,
226,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,68,Bolivia,2002,24.7,2002,...,,,,,G,,,,,
227,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,68,Bolivia,2004,13.7,2004,...,,,,,G,,,,,
228,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,68,Bolivia,2005,19.3,2005,...,,,,,G,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1080881,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,68,Bolivia,2013,142.60478,2013,...,,,,,G,,,,,
1080882,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,68,Bolivia,2014,212.98375,2014,...,,,,,G,,,,,
1080883,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,68,Bolivia,2015,152.66092,2015,...,,,,,G,,,,,
1080884,17,17.9,17.9.1,DC_FTA_TOTAL,Total official development assistance (gross d...,68,Bolivia,2016,85.47392,2016,...,,,,,G,,,,,


Now, we have one data frame per country. The next step is to have years as columns.

The next cell gives us the series codes in the rows and the years in the columns. These series codes are unique descriptions of the sub-indicators and we match these series codes to indicators and all other information in a different data frame.

In [241]:
for c in countries:
    dict_all[c] = dict_all.get(c).pivot_table(values='Value', index='SeriesCode', columns='TimePeriod', dropna=False, aggfunc='first')

In [242]:
# check
dict_all['Bolivia'].head()

TimePeriod,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
SeriesCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
AG_FPA_CFPI,,,,,,,,,,,,,,,,,0.2,-0.4,,
AG_FPA_COMM,,,,,,,,,,,,,,,,,0.6,-0.1,,
AG_LND_DGRD,,,,,,,,,,,,,,,,18.0,,,,
AG_LND_FRST,55.47032,,,,,54.21767,,,,,51.88683,,,,,50.55294,,,,
AG_LND_FRSTBIOPHA,133.25,,,,,133.26,,,,,135.6,,,,,135.6,,,,
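
To make the reshaping step concrete, here is a small sketch of what `pivot_table` with `aggfunc='first'` does to a long table. The data are toy values. When a series has several rows for the same year (in the real data this can happen, for example, through disaggregation by sex or age), only the first value is kept.

```python
import pandas as pd

long = pd.DataFrame({
    'SeriesCode': ['A', 'A', 'B', 'B'],
    'TimePeriod': ['2000', '2001', '2000', '2000'],
    'Value': ['1.5', '2.5', '7.0', '9.0'],   # strings, as read with dtype=object
})

# rows become series codes, columns become years
wide = long.pivot_table(values='Value', index='SeriesCode',
                        columns='TimePeriod', aggfunc='first')
print(wide)
```

Series `B` has two rows for 2000, so `'7.0'` is kept and `'9.0'` is dropped; its 2001 entry is missing and shows up as NaN.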


Let's now save a data frame with all of the meta-information. We delete the columns which are specific to area and time, and of course we do not want the values in this data frame. Finally, we drop all duplicate combinations of **Indicator** and **SeriesCode**, so a series code that is listed under several indicators is kept once per indicator. We are left with the information we wanted: a mapping from the series codes to the indicators, the source of the data, the units of measurement, etc.

In [243]:
info = all_data.drop(columns=['GeoAreaCode', 'GeoAreaName', 'TimePeriod', 'Value', 'Time_Detail']).drop_duplicates(subset=['Indicator', 'SeriesCode'])

In [244]:
# check
info.head()

Unnamed: 0,Goal,Target,Indicator,SeriesCode,SeriesDescription,UpperBound,LowerBound,BasePeriod,Source,FootNote,...,[Name of international institution],[Name of non-communicable disease],[Policy Domains],[Quantile],[Reporting Type],[Sex],[Type of occupation],[Type of product],[Type of skill],[Type of speed]
0,1,1.1,1.1.1,SI_POV_DAY1,Proportion of population below international p...,,,,"World Development Indicators database, World Bank","Retrieved on March 20, 2019 from World Bank, P...",...,,,,,G,,,,,
1448,1,1.1,1.1.1,SI_POV_EMP1,Employed population below international povert...,,,,"ILO estimates, November 2018, available in ILO...",,...,,,,,G,BOTHSEX,,,,
10619,1,1.2,1.2.1,SI_POV_NAHC,Proportion of population living below the nati...,,,,"World Development Indicators database, World Bank",Source: National Statictis and Information Aut...,...,,,,,G,,,,,
11921,1,1.3,1.3.1,SI_COV_MATNL,[ILO] Proportion of mothers with newborns rece...,,,,ILO estimates based on country data compled th...,,...,,,,,G,BOTHSEX,,,,
12028,1,1.3,1.3.1,SI_COV_POOR,[ILO] Proportion of poor population receiving ...,,,,ILO Social Security Inquiry (SSI). Available a...,ILO estimates based on country data,...,,,,,G,BOTHSEX,,,,
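
Dropping duplicates on the pair (`Indicator`, `SeriesCode`) keeps one row for every indicator a series is listed under, so a series code can still occur more than once in `info`. A toy sketch of the difference to de-duplicating on `SeriesCode` alone (the indicator assignments below are illustrative, not taken from the data set):

```python
import pandas as pd

meta = pd.DataFrame({
    'Indicator': ['1.5.1', '11.5.1', '13.1.1', '1.1.1'],
    'SeriesCode': ['VC_DSR_MORT', 'VC_DSR_MORT', 'VC_DSR_MORT', 'SI_POV_DAY1'],
})

by_pair = meta.drop_duplicates(subset=['Indicator', 'SeriesCode'])
by_code = meta.drop_duplicates(subset=['SeriesCode'])

# rows kept per pair, distinct codes among them, rows kept per code
print(len(by_pair), by_pair['SeriesCode'].nunique(), len(by_code))
```

This would be consistent with the list of series codes derived from `info` being longer than the number of distinct row labels in a country data frame.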


## Cleaning up and transforming all country data frames into the same dimensions

We have a couple of things to do to make our data frames workable:
1. Some values in the data frames contain characters we do not want, e.g. <code>,</code>, <code> = </code>, <code>N</code>, etc. We either strip them or replace them with <code>0</code>.
2. Not all data frames cover every year from **2000** to **2019**. We want all of them to span **2000** to **2019**, i.e. to have the same number of columns. The additional columns are filled with <code>NaNs</code>.
3. Some data frames do not list all indicators and sub-indicators, but we would like to have all of them in all data frames. These additional rows are filled with <code>NaNs</code>.

Let's start with the first task, i.e. cleaning up the data frames.

We first need to define lists for all years, i.e. **2000** to **2019**, and for all indicators and sub-indicators, i.e. series codes.

In [245]:
# list of all years
years = list(dict_all['France'])    # France is an example of a country that has all columns

Change <span style="color:red"> 'Haiti' </span> in the cell below to a few other countries and you'll see that their lists of years can have different lengths. We need to bring them all to the same length. We settle on data for the **years 2000 to 2019**.

Now, we insert the missing years for all groupings. We want to have NaNs in those columns.

In [246]:
# example
list(dict_all['Haiti'])

['2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019']

In [247]:
# list of all series codes
seriescodes = list(info['SeriesCode'])
seriescodes

['SI_POV_DAY1',
 'SI_POV_EMP1',
 'SI_POV_NAHC',
 'SI_COV_MATNL',
 'SI_COV_POOR',
 'SI_COV_SOCAST',
 'SI_COV_SOCINS',
 'SI_COV_CHLD',
 'SI_COV_UEMP',
 'SI_COV_VULN',
 'SI_COV_WKINJRY',
 'SI_COV_BENFTS',
 'SI_COV_DISAB',
 'SI_COV_LMKT',
 'SI_COV_PENSN',
 'SP_ACS_BSRVH2O',
 'SP_ACS_BSRVSAN',
 'VC_DSR_GDPLS',
 'VC_DSR_MISS',
 'VC_DSR_AFFCT',
 'VC_DSR_MORT',
 'VC_DSR_MTMP',
 'VC_DSR_MTMN',
 'VC_DSR_DAFF',
 'VC_DSR_IJILN',
 'VC_DSR_PDAN',
 'VC_DSR_PDYN',
 'VC_DSR_PDLN',
 'SG_DSR_LGRGSR',
 'SG_DSR_SILS',
 'SG_DSR_SILN',
 'SG_GOV_LOGV',
 'VC_DSR_LSGP',
 'VC_DSR_AGLN',
 'VC_DSR_HOLN',
 'VC_DSR_CILN',
 'VC_DSR_CHLN',
 'VC_DSR_DDPA',
 'SD_XPD_ESED',
 'SN_ITK_DEFC',
 'AG_PRD_FIESSI',
 'AG_PRD_FIESSIN',
 'SN_ITK_DEFCN',
 'SH_STA_STUNT',
 'SH_STA_STUNTN',
 'SH_STA_WASTE',
 'SH_STA_WASTEN',
 'SH_STA_OVRWGT',
 'SH_STA_OVRWGTN',
 'PD_AGR_SSFP',
 'SI_AGR_SSFP',
 'ER_GRF_ANIMRCNT',
 'ER_GRF_PLNTSTOR',
 'ER_RSK_LBREDS',
 'AG_PRD_ORTIND',
 'DC_TOF_AGRL',
 'AG_PRD_AGVAS',
 'AG_XPD_AGSGB',
 'AG_PRD_XSUBDY',


In [248]:
# count how many we have
len(seriescodes)

430

Firstly, we insert the missing years as columns for all groupings.

In [249]:
for c in countries:    # memory-intensive
    for year in years:
        if year not in list(dict_all[c]):
            dict_all[c]['{}'.format(year)] = np.nan
    # having the years in order
    dict_all[c] = dict_all[c][['2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']]
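
As a side note, adding the missing year columns and putting them in order can also be done in one step with `DataFrame.reindex`. A minimal sketch on a toy frame, where `all_years` stands in for the hard-coded list above:

```python
import numpy as np
import pandas as pd

all_years = [str(y) for y in range(2000, 2020)]

# toy frame that only has two of the years, in the wrong order
df = pd.DataFrame({'2015': [18.0], '2000': [55.5]}, index=['AG_LND_FRST'])

# reindex adds the missing years as NaN columns and orders them as in all_years
df_full = df.reindex(columns=all_years)
print(df_full.shape)
```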

In [250]:
# check
dict_all['Nicaragua'].head()

TimePeriod,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
SeriesCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
AG_FPA_CFPI,,,,,,,,,,,,,,,,,-1.6,-1.2,,
AG_FPA_COMM,,,,,,,,,,,,,,,,,-0.2,-0.1,,
AG_LND_DGRD,,,,,,,,,,,,,,,,,,,,
AG_LND_FRST,31.69353,,,,,28.78511,,,,,25.87668,,,,,25.87668,,,,
AG_LND_FRSTBIOPHA,192.45,,,,,192.44,,,,,192.45,,,,,192.45,,,,


Secondly, we insert the missing series codes as rows.

Let's first see how many rows we have for <span style="color:red"> Nicaragua </span>.

In [251]:
len(list(dict_all['Nicaragua'].index))

235

Let's have all $J$ sub-indicators we want for each country as rows. We fill these rows with NaNs. 

In [252]:
for c in countries:        # memory-intensive
    for seriescode in seriescodes:
        if seriescode not in list(dict_all[c].index):
            dict_all[c].loc[seriescode] = np.nan    # fill these rows with NaNs
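
The same row-filling can be sketched without an explicit inner loop. The example below uses toy codes and assumes, as in this notebook, that the list of codes may contain repeats; the names `all_codes` and `missing` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'2000': [1.0], '2001': [2.0]}, index=['B_CODE'])
all_codes = ['A_CODE', 'B_CODE', 'A_CODE', 'C_CODE']   # may contain repeats

# keep the existing rows first, then append each missing code exactly once
missing = [c for c in dict.fromkeys(all_codes) if c not in df.index]
df_full = df.reindex(df.index.append(pd.Index(missing)))
print(list(df_full.index))
```

The appended rows contain only NaNs, and the existing rows keep their position at the top, as in the loop above.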

In [253]:
# check: do we have J many?
len(list(dict_all['Nicaragua'].index))

385

In [254]:
# convert all to floats
for c in countries:
    for year in years:    
        for seriescode in seriescodes:
            if not isinstance(dict_all[c].loc[seriescode, year], float):
                dict_all[c].loc[seriescode, year] = float(dict_all[c].loc[seriescode, year].replace(',', '').replace('<', '').replace('>', '').replace('=', '').replace('N', '0').replace(' -   ', '0').replace('0V', '0').replace('. . .', '0'))
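
The cell-by-cell conversion above is slow because it touches every entry individually. A vectorised variant could look like the sketch below. Note the assumption that differs from the cell above: markers that are not numbers (such as `N` or `. . .`) become NaN here instead of `0`. The helper `to_float` is hypothetical.

```python
import numpy as np
import pandas as pd


def to_float(series):
    """Strip separators and comparison signs, then convert to float."""
    cleaned = (series.astype(str)
                     .str.replace(r'[,<>=]', '', regex=True)
                     .str.strip())
    # whatever is still not a number (e.g. 'N') is coerced to NaN,
    # whereas the notebook cell maps such markers to 0
    return pd.to_numeric(cleaned, errors='coerce')


raw = pd.Series(['1,234.5', '<0.1', np.nan, 'N', ' 7 '])
result = to_float(raw)
print(result.tolist())
```

Applied to a whole country frame, this could be used as `df.apply(to_float)`, which processes one year column at a time.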

In [255]:
# double-check: are all series codes as rows?
list(dict_all['Nicaragua'].index)

['AG_FPA_CFPI',
 'AG_FPA_COMM',
 'AG_LND_DGRD',
 'AG_LND_FRST',
 'AG_LND_FRSTBIOPHA',
 'AG_LND_FRSTCERT',
 'AG_LND_FRSTCHG',
 'AG_LND_FRSTMGT',
 'AG_LND_FRSTN',
 'AG_LND_FRSTPRCT',
 'AG_LND_TOTL',
 'AG_PRD_FIESSI',
 'AG_PRD_FIESSIN',
 'BX_TRF_PWKR',
 'DC_FTA_TOTAL',
 'DC_ODA_BDVL',
 'DC_TOF_AGRL',
 'DC_TOF_HLTHL',
 'DC_TOF_HLTHNT',
 'DC_TOF_INFRAL',
 'DC_TOF_SCHIPSL',
 'DC_TOF_TRDCML',
 'DC_TOF_TRDDBML',
 'DC_TOF_WASHL',
 'DC_TRF_TOTL',
 'DT_TDS_DECT',
 'EG_EGY_CLEAN',
 'EG_EGY_PRIM',
 'EG_ELC_ACCS',
 'EG_FEC_RNEW',
 'EG_IFF_RANDN',
 'EG_TBA_H2CO',
 'EG_TBA_H2COAQ',
 'EG_TBA_H2CORL',
 'EN_ATM_CO2',
 'EN_ATM_CO2GDP',
 'EN_ATM_CO2MVA',
 'EN_ATM_PM25',
 'EN_LND_SLUM',
 'EN_MAT_DOMCMPC',
 'EN_MAT_DOMCMPG',
 'EN_MAT_DOMCMPT',
 'EN_REF_WASCOL',
 'EN_WBE_PMNR',
 'EN_WBE_PMPN',
 'EN_WBE_PMPP',
 'EN_WBE_PMPR',
 'ER_CBD_ABSCLRHS',
 'ER_CBD_NAGOYA',
 'ER_CBD_ORSPGRFA',
 'ER_CBD_PTYPGRFA',
 'ER_CBD_SMTA',
 'ER_FFS_PRTSPC',
 'ER_FFS_PRTSPR',
 'ER_FFS_PRTSST',
 'ER_GRF_ANIMRCNT',
 'ER_H2O_STRESS',
 

Finally, we can save all countries as different <code>csv</code> files and as one `dict`.

In [256]:
if not os.path.exists('csv_original'):
    os.mkdir('csv_original')

In [257]:
for c in countries:
    dict_all[c].to_csv(r'csv_original/{}.csv'.format(c))
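
When one of these files is read back later, the series codes are stored in the first column and need to be restored as the index. A small round-trip sketch, using a temporary directory rather than `csv_original` and a one-row toy frame:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'2000': [55.47032], '2001': [float('nan')]},
                  index=pd.Index(['AG_LND_FRST'], name='SeriesCode'))

path = os.path.join(tempfile.mkdtemp(), 'Bolivia.csv')   # throwaway location
df.to_csv(path)

# index_col=0 restores the series codes as the row index
restored = pd.read_csv(path, index_col=0)
print(restored)
```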

Having the information file might also be helpful.

In [258]:
if not os.path.exists('utils'):
    os.mkdir('utils')
    
info.to_csv(r'utils/info.csv')

In [259]:
# as one pickle file
with open('utils/data/dict_all.pkl', 'wb') as dictall:
    pickle.dump(dict_all, dictall)

In [None]:
# CHECKPOINT
with open('utils/data/dict_all.pkl', 'rb') as dictall:
    dict_all = pickle.load(dictall)

## Visualising time-series

We quickly visualise the time-series to get a better idea of the characteristics of our data set.

In [None]:
f, (ax, ax2) = plt.subplots(2, 1, sharex=True)
ax2.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_ODA_BDVL'], color='#42B24C', linewidth=5)
ax.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_TRF_TOTL'], color='#DE0E68', linewidth=5)

ax2.set_ylim(0, 251)  # biodiversity ODA
ax.set_ylim(250, 2501)  # total ODA

# hide the spines between ax and ax2
ax.spines['bottom'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax.xaxis.tick_top()
ax.tick_params(labelsize=20, labeltop=False)  # don't put tick labels at the top
ax2.xaxis.tick_bottom()

plt.xticks(np.arange(2000, 2019, step=2), size=20)
ax2.tick_params(labelsize=20)

f.set_figheight(8)
f.set_figwidth(12)

plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_ODA_BDVL'], color='#42B24C', linewidth=5)
plt.xticks(np.arange(2000, 2019, step=2), size=20)
plt.yticks(np.arange(0, 251, step=25), size=20)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.plot(list(range(2000, 2020)), dict_all['Bolivia'].loc['DC_TRF_TOTL'], color='#DE0E68', linewidth=5)
plt.xticks(np.arange(2000, 2019, step=2), size=20)
plt.yticks(np.arange(0, 2501, step=250), size=20)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
plt.yticks(np.arange(0, 2001, step=250), size=20)
plt.xticks(np.arange(0, 251, step=25), size=20)
plt.plot(dict_all['Bolivia'].loc['DC_ODA_BDVL'], dict_all['Bolivia'].loc['DC_TRF_TOTL'], '--bo')    # the key is 'Bolivia' after the renaming

## Data standardisation
We have saved the original data set, but it is often useful to have the data standardised. Standardising a data set means rescaling the values so that their mean is 0 and their standard deviation is 1. Machine learning algorithms often require this when the input time series are on differing scales.

We create a new dictionary `dict_all_std` to keep the standardised values separate from the original ones.
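
For a single time series $x_1, \dots, x_T$ the standardised values are $z_t = (x_t - \bar{x}) / s$, where $\bar{x}$ is the mean and $s$ the standard deviation of the observed values. The sketch below spells this out with NumPy only. It assumes that missing years are ignored when computing $\bar{x}$ and $s$, that $s$ is the population standard deviation, and that a series with zero spread is only centred. The function name `standardise` is illustrative.

```python
import numpy as np


def standardise(values):
    """z-score a 1-D series, ignoring NaNs (population standard deviation)."""
    x = np.asarray(values, dtype=float)
    mean = np.nanmean(x)
    std = np.nanstd(x)            # ddof=0
    if std == 0:                  # constant series or a single observation
        std = 1.0
    return (x - mean) / std


z = standardise([2.0, np.nan, 4.0, 6.0])
print(z)
```

With these assumptions, a series that has only one observed value is mapped to `0.0` at that position and stays NaN everywhere else.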

In [None]:
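# CHECKPOINT (we don't want to re-run the entire script every time we continue working on it)
# dict_all_std.pkl only exists once the standardised data have been saved in a previous run
with open('utils/data/dict_all.pkl', 'rb') as f:
    dict_all = pickle.load(f)
with open('utils/data/dict_all_std.pkl', 'rb') as f:
    dict_all_std = pickle.load(f)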

In [None]:
# ~3 hours computing time
dict_all_std = copy.deepcopy(dict_all)    

for group in groupings:
    for seriescode in seriescodes:
        # adding noise as representative for measurement errors
        # (note that this also overwrites the values in dict_all)
        noise = np.random.normal(scale=0.1, size=len(dict_all[group].loc[seriescode]))
        
        dict_all[group].loc[seriescode] = dict_all[group].loc[seriescode] + noise
        
        dict_all_std[group].loc[seriescode] = scale(list(dict_all[group].loc[seriescode]))

In [None]:
#check
print('Original value', dict_all['World'].loc['AG_LND_FRST'])
print('-------')
print('Standardised value', dict_all_std['World'].loc['AG_LND_FRST'])

We'd better save `dict_all_std`.

In [None]:
# as csv files per grouping
if not os.path.exists('csv_standardised'):
    os.mkdir('csv_standardised')
    
for group in groupings:
    dict_all_std[group].to_csv(r'csv_standardised/{}.csv'.format(group))

# as one pickle file
stand = open('utils/data/dict_all_std.pkl', 'wb')
pickle.dump(dict_all_std, stand)
stand.close()

## (2) World Bank data set

**We use the World Bank data set and convert it so that every country, continent, etc. is in a separate <code>csv</code> file.**

To get started, we download all available data from http://datatopics.worldbank.org/sdgs/ and call the file <code>wb_data.csv</code>.


Let's load the data set and look at its columns and rows to figure out how it is structured.


**We aim to have one pandas data frame per country, with all indicators. We save them as separate <code>csv</code> files.** 

In [260]:
# loading data set
wb_data = pd.read_csv('utils/data/wb_data.csv', dtype=object)
wb_data.drop(wb_data.tail(5).index,inplace=True)    # 5 last rows are blank / have other info
wb_data.tail()

Unnamed: 0,Country Name,Country Code,Series Name,Series Code,1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],...,2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019]
98620,Zimbabwe,ZWE,"Wage and salaried workers, total (% of total e...",SL.EMP.WORK.ZS,,36.5120010375977,36.3860015869141,35.9500007629395,36.0699996948242,36.9910011291504,...,34.1699981689453,34.1469993591309,34.007999420166,34.0499992370605,33.9790000915527,33.898998260498,33.8489990234375,33.8880004882813,33.8720016479492,33.8489990234375
98621,Zimbabwe,ZWE,"Water productivity, total (constant 2010 US$ G...",ER.GDP.FWTL.M3.KD,,,,,,,...,,,,,,,,,,
98622,Zimbabwe,ZWE,Women making their own informed decisions rega...,SG.DMK.SRCR.FN.ZS,,,,,,,...,,58.8,,,,59.9,,,,
98623,Zimbabwe,ZWE,Women who were first married by age 15 (% of w...,SP.M15.2024.FE.ZS,,,,,6.4,,...,,3.9,,,,3.7,,,,
98624,Zimbabwe,ZWE,Women who were first married by age 18 (% of w...,SP.M18.2024.FE.ZS,,,,,31.0,,...,,30.5,,,33.5,32.4,,,,


In [261]:
# shorten the year labels, e.g. '1990 [YR1990]' -> '1990'; the first four columns keep their names
columns = list(wb_data.columns)
columns = columns[:4] + [column[:4] for column in columns[4:]]

wb_data.columns = columns

In [262]:
years = columns[4:]

In [263]:
# meta-data
wb_info = pd.read_csv('utils/wb_info.csv', dtype=object)
wb_info = wb_info.drop(columns=['Topic', 'Indicator Name'])

In [264]:
all_countries = list(wb_data['Country Name'].unique())
# save countries
np.savetxt('utils/countries_wb_all.csv', all_countries, delimiter=';', fmt='%s')

In [265]:
dict_all_wb = {country: pd.DataFrame() for country in countries}
for country in countries:
    print(country)
    dict_all_wb[country] = wb_data[wb_data['Country Name'].isin(['{}'.format(country)])]
    dict_all_wb[country] = dict_all_wb[country].drop(columns=['Country Name', 'Country Code', 'Series Name'])
    dict_all_wb[country].set_index('Series Code', inplace=True)
    # adding series codes for SDG 13 (pd.concat, because DataFrame.append was removed in pandas 2.0)
    dict_all_wb[country] = pd.concat([dict_all_wb[country], dict_all[country].loc[['VC_DSR_MTMP', 'VC_DSR_DAFF', 'SG_DSR_SILS']]])
    dict_all_wb[country] = dict_all_wb[country].astype(float)

Afghanistan
Albania
Algeria
Angola
Antigua and Barbuda
Argentina
Armenia
Australia
Austria
Azerbaijan
Bahamas, The
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Brazil
Brunei Darussalam
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Central African Republic
Chad
Chile
China
Colombia
Comoros
Congo, Dem. Rep.
Congo, Rep.
Costa Rica
Cote d'Ivoire
Croatia
Cuba
Cyprus
Czech Republic
Denmark
Djibouti
Dominica
Dominican Republic
Ecuador
Egypt, Arab Rep.
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Fiji
Finland
France
Gabon
Gambia, The
Georgia
Germany
Ghana
Greece
Greenland
Grenada
Guatemala
Guinea
Guinea-Bissau
Guyana
Haiti
Honduras
Hungary
Iceland
India
Indonesia
Iran, Islamic Rep.
Iraq
Ireland
Israel
Italy
Jamaica
Japan
Jordan
Kazakhstan
Kenya
Kiribati
Korea, Dem. People's Rep.
Kuwait
Kyrgyz Republic
Lao PDR
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Madagascar
Malawi
Malaysia
Maldives
M
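
The step that adds the three UN disaster series to each World Bank frame stacks rows from two frames whose year columns only partly overlap (2000 to 2019 versus 1990 to 2019). A toy sketch of how `pd.concat` aligns such frames by column label; the numbers are made up:

```python
import pandas as pd

wb = pd.DataFrame({'1999': ['1.0'], '2000': ['2.0']}, index=['SL.EMP.WORK.ZS'])
un = pd.DataFrame({'2000': [3.0], '2001': [4.0]}, index=['VC_DSR_MTMP'])

# rows are stacked; columns are matched by label and gaps become NaN
combined = pd.concat([wb, un]).astype(float)
print(combined)
```

The UN rows therefore have NaNs for the years that only exist in the World Bank data.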

In [266]:
seriescodes_wb = list(dict_all_wb['Germany'].index)

In [267]:
# saving data
for country in countries: 
    dict_all_wb[country].to_csv(r'csv_original/{}_wb.csv'.format(country))
    
# as one pickle file
dictall = open('utils/data/dict_all_wb.pkl', 'wb')
pickle.dump(dict_all_wb, dictall)
dictall.close()

## Data standardisation

In [268]:
dict_all_wb_std = copy.deepcopy(dict_all_wb)    

for country in countries:
    for seriescode in seriescodes_wb:
        # adding noise as representative for measurement errors
        #noise = np.random.normal(scale=0.1, size=len(dict_all_wb[country].loc[seriescode]))
        
        #dict_all_wb[country].loc[seriescode] = dict_all_wb[country].loc[seriescode] + noise
        
        dict_all_wb_std[country].loc[seriescode] = scale(list(dict_all_wb[country].loc[seriescode]))

In [269]:
#check
print('Original value', dict_all_wb['Belgium'].loc['EN.CLC.MDAT.ZS'])
print('-------')
print('Standardised value', dict_all_wb_std['Belgium'].loc['EN.CLC.MDAT.ZS'])

Original value 1990         NaN
1991         NaN
1992         NaN
1993         NaN
1994         NaN
1995         NaN
1996         NaN
1997         NaN
1998         NaN
1999         NaN
2000         NaN
2001         NaN
2002         NaN
2003         NaN
2004         NaN
2005         NaN
2006         NaN
2007         NaN
2008         NaN
2009    0.001692
2010         NaN
2011         NaN
2012         NaN
2013         NaN
2014         NaN
2015         NaN
2016         NaN
2017         NaN
2018         NaN
2019         NaN
Name: EN.CLC.MDAT.ZS, dtype: float64
-------
Standardised value 1990    NaN
1991    NaN
1992    NaN
1993    NaN
1994    NaN
1995    NaN
1996    NaN
1997    NaN
1998    NaN
1999    NaN
2000    NaN
2001    NaN
2002    NaN
2003    NaN
2004    NaN
2005    NaN
2006    NaN
2007    NaN
2008    NaN
2009    0.0
2010    NaN
2011    NaN
2012    NaN
2013    NaN
2014    NaN
2015    NaN
2016    NaN
2017    NaN
2018    NaN
2019    NaN
Name: EN.CLC.MDAT.ZS, dtype: float64


We'd better save `dict_all_wb_std`.

In [270]:
# as csv files per grouping
if not os.path.exists('csv_standardised'):
    os.mkdir('csv_standardised')
    
for country in countries:
    dict_all_wb_std[country].to_csv(r'csv_standardised/{}_wb.csv'.format(country))

# as one pickle file
stand = open('utils/data/dict_all_wb_std.pkl', 'wb')
pickle.dump(dict_all_wb_std, stand)
stand.close()