# Targets ... of UN Sustainable Development Goals
## Data preprocessing and visualisation

In this notebook, we preprocess and visualise data on the *targets* of the United Nations Sustainable Development Goals. Each target has at least one quantitatively measurable *indicator*. There are 169 targets in total; the UN Statistics Division provides data for 83 of them.

**Background to the data**: the values are aggregated at world level (area code `MDG_WORLD`), so each row describes the entire world.

### Import the required packages

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pystan
import pystan_utils
import os

# matplotlib style options
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 10)

### Use pandas to load the data set

In [2]:
# load csv
#os.chdir('/home/felix/PycharmProjects/MBMLproject')
df = pd.read_csv("SDG_Indicators.csv")
df.head()

Unnamed: 0,Goal,Target,Indicator Ref,IndicatorId,Indicator Description,Series Code,Series Type,Series Description,Parent Country or Area Code,Country or Area Code,...,2013,FN.30,2014,FN.31,2015,FN.32,2016,FN.33,2017,FN.34
0,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_DAY1,SD,Proportion of population below the internation...,,MDG_WORLD,...,10.7,"24, 70",,,,,,,,
1,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.45,"M, 24, 72",15.87,"M, 25, 72",15.51,"M, 26, 72",15.1,"M, 27, 72",,
2,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.37,"M, 24, 72",15.83,"M, 25, 72",15.54,"M, 26, 72",15.18,"M, 27, 72",,
3,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.5,"M, 24, 72",15.89,"M, 25, 72",15.49,"M, 26, 72",15.04,"M, 27, 72",,
4,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,11.14,"M, 24, 72",10.61,"M, 25, 72",10.25,"M, 26, 72",9.87,"M, 27, 72",,


### Modify df to focus on columns of use

Display columns

In [3]:
df.columns

Index(['Goal', 'Target', 'Indicator Ref', 'IndicatorId',
       'Indicator Description', 'Series Code', 'Series Type',
       'Series Description', 'Parent Country or Area Code',
       'Country or Area Code', 'Country or Area Name', 'LDC', 'LLDC', 'SIDS',
       'Frequency', 'Source type', 'Age group', 'Location', 'Sex',
       'Value type', 'Unit', 'Unit multiplier', '1983', 'FN', '1984', 'FN.1',
       '1985', 'FN.2', '1986', 'FN.3', '1987', 'FN.4', '1988', 'FN.5', '1989',
       'FN.6', '1990', 'FN.7', '1991', 'FN.8', '1992', 'FN.9', '1993', 'FN.10',
       '1994', 'FN.11', '1995', 'FN.12', '1996', 'FN.13', '1997', 'FN.14',
       '1998', 'FN.15', '1999', 'FN.16', '2000', 'FN.17', '2001', 'FN.18',
       '2002', 'FN.19', '2003', 'FN.20', '2004', 'FN.21', '2005', 'FN.22',
       '2006', 'FN.23', '2007', 'FN.24', '2008', 'FN.25', '2009', 'FN.26',
       '2010', 'FN.27', '2011', 'FN.28', '2012', 'FN.29', '2013', 'FN.30',
       '2014', 'FN.31', '2015', 'FN.32', '2016', 'FN.33', '2017'

In [4]:
df.describe()

Unnamed: 0,Goal,Parent Country or Area Code,LDC,LLDC,SIDS,1983,1984,FN.1,1985,1986,...,1988,FN.5,1989,1991,1992,1993,1994,1996,1997,1999
count,299.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,...,0.0,0.0,3.0,3.0,6.0,6.0,6.0,6.0,9.0,7.0
mean,7.672241,,,,,33.336667,,,33.333333,,...,,,33.333333,48.2,23.583333,14.611667,37.21,27.341667,77.948889,36.674286
std,5.428572,,,,,18.147543,,,16.119157,,...,,,8.259845,61.588311,13.360255,23.24336,79.044737,55.043911,187.464333,66.865561
min,1.0,,,,,16.67,,,18.18,,...,,,26.84,7.0,7.0,0.82,0.82,0.81,0.81,0.8
25%,3.0,,,,,23.67,,,24.865,,...,,,28.685,12.8,16.625,0.8225,0.82,0.82,0.82,0.81
50%,6.0,,,,,30.67,,,31.55,,...,,,30.53,18.6,21.635,3.915,2.91,3.41,16.1,15.3
75%,12.0,,,,,41.67,,,40.91,,...,,,36.58,68.8,28.1525,15.4,14.6,13.95,26.63,26.5
max,17.0,,,,,52.67,,,50.27,,...,,,42.63,119.0,45.95,60.0,198.0,139.0,576.0,186.0


Drop the columns that are not needed.
All years before 2005 are dropped as well, because they contain very few data points.

In [5]:
df1 = df.drop(['IndicatorId', 'Series Code', 'Series Type',
             'Series Description', 'Parent Country or Area Code',
             'Country or Area Code', 'Country or Area Name', 'LDC', 'LLDC', 'SIDS',
             'Frequency', 'Source type', 'Location', '1983', 'FN', '1984', 'FN.1',
               '1985', 'FN.2', '1986', 'FN.3', '1987', 'FN.4', '1988', 'FN.5', '1989',
               'FN.6', '1990', 'FN.7', '1991', 'FN.8', '1992', 'FN.9', '1993', 'FN.10',
               '1994', 'FN.11', '1995', 'FN.12', '1996', 'FN.13', '1997', 'FN.14',
               '1998', 'FN.15', '1999', 'FN.16', '2000', 'FN.17', '2001', 'FN.18',
               '2002', 'FN.19', '2003', 'FN.20', '2004', 'FN.21','FN.22','FN.23','FN.24',
               'FN.25','FN.26','FN.27','FN.28','FN.29','FN.30','FN.31','FN.32','FN.33','FN.34',
              ], axis=1)
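
The drop list above could also be built programmatically instead of typing every column name. The following is a minimal sketch on a toy frame; the helper name `columns_to_drop`, its `first_year` parameter, and the toy columns are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def columns_to_drop(columns, first_year=2005):
    """Return the footnote columns (FN, FN.1, ...) and the year columns before first_year."""
    drop = []
    for col in columns:
        name = str(col)
        if name == 'FN' or name.startswith('FN.'):
            drop.append(col)
        elif name.isdigit() and int(name) < first_year:
            drop.append(col)
    return drop

# toy frame standing in for df (hypothetical subset of the real columns)
toy = pd.DataFrame(columns=['Goal', 'Unit', '2004', 'FN.21', '2005', 'FN.22', '2006', 'FN.23'])
slim = toy.drop(columns=columns_to_drop(toy.columns))
print(list(slim.columns))
```

The descriptive columns (e.g. 'Series Code') would still have to be listed by hand, since no naming pattern identifies them.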

In [6]:
df1.head(100)

Unnamed: 0,Goal,Target,Indicator Ref,Indicator Description,Age group,Sex,Value type,Unit,Unit multiplier,2005,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,1,1.1,1.1.1,Proportion of population below the internation...,All age ranges or no breakdown by age,Both sexes or no breakdown by sex,,Percent,Units,,...,,,,,,10.70,,,,
1,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Both sexes or no breakdown by sex,,Percent,Units,24.80,...,22.04,21.18,20.22,18.26,17.08,16.45,15.87,15.51,15.10,
2,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Female,,Percent,Units,24.82,...,21.83,20.84,19.84,18.04,16.98,16.37,15.83,15.54,15.18,
3,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Male,,Percent,Units,24.78,...,22.18,21.41,20.47,18.40,17.15,16.50,15.89,15.49,15.04,
4,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Both sexes or no breakdown by sex,,Percent,Units,18.69,...,16.35,15.51,14.62,12.89,11.77,11.14,10.61,10.25,9.87,
5,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Female,,Percent,Units,18.62,...,15.98,15.00,14.05,12.53,11.58,11.00,10.56,10.26,9.93,
6,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Male,,Percent,Units,18.74,...,16.60,15.84,14.99,13.12,11.90,11.23,10.65,10.25,9.84,
7,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Both sexes or no breakdown by sex,,Percent,Units,17.31,...,15.14,14.34,13.50,11.84,10.77,10.17,9.68,9.34,8.99,
8,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Female,,Percent,Units,17.21,...,14.72,13.78,12.89,11.46,10.56,10.02,9.62,9.35,9.05,
9,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Male,,Percent,Units,17.38,...,15.41,14.70,13.90,12.09,10.91,10.27,9.72,9.34,8.95,


In [8]:
df1.describe()  # describe() summarises only numeric columns by default; the year columns are probably not numeric here (check df1.dtypes)

Unnamed: 0,Goal
count,299.0
mean,7.672241
std,5.428572
min,1.0
25%,3.0
50%,6.0
75%,12.0
max,17.0
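
One possible explanation for the short summary is that the year columns hold numbers stored as text, which `describe()` skips by default. A small sketch on a toy frame (the values and the name `toy` are illustrative assumptions) shows the effect and one way to coerce such columns with `pd.to_numeric`:

```python
import pandas as pd

toy = pd.DataFrame({
    'Goal': [1, 2, 3],
    '2005': ['24.8', None, '18.69'],   # numbers stored as text, so the dtype is not numeric
})
before = toy.describe()
print(before.columns.tolist())         # only the numeric column is summarised

# coerce every year-like column to float; unparseable entries become NaN
year_cols = [c for c in toy.columns if c.isdigit()]
numeric = toy.copy()
numeric[year_cols] = numeric[year_cols].apply(pd.to_numeric, errors='coerce')
summary = numeric.describe()
print(summary.loc['count'])            # count is the number of non-missing values per column
```

Whether this applies to the real data depends on what `df1.dtypes` reports for the year columns.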


In [9]:
df1.columns

Index(['Goal', 'Target', 'Indicator Ref', 'Indicator Description', 'Age group',
       'Sex', 'Value type', 'Unit', 'Unit multiplier', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017'],
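      dtype='object')

The remaining columns are:

- "Goal": the overall goal
- "Target": the measurable target within that goal
- "Indicator Ref": reference number of the indicator
- "Indicator Description": description of the indicator
- 'Age group' and 'Sex': the population breakdown the row refers to
- 'Value type': marks rows that are, for example, a lower or upper bound
- 'Unit' and 'Unit multiplier': the unit in which the values are given
- '2005' to '2017': the value measured in that year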

Select the rows that

1. cover *all* age groups and both sexes, and
2. are *not* a lower or upper bound.

@Galina: as you can see in the following cell, some rows have only one data point. We can either keep them and plot the different categories in a histogram, or take the rows with the most data points and plot each row over time. The former is more useful for what we intend to do later, whereas the latter would perhaps look nicer (?). I lean toward the former. *(The next few cells will help you understand what I mean by categories.)*
    

In [10]:
# all age groups
df2 = df1[df1['Age group'].isin(['All age ranges or no breakdown by age'])]
df3 = df2.drop(['Age group'], axis=1)

# all sex
df4 = df3[df3['Sex'].isin(['Both sexes or no breakdown by sex'])]
df5 = df4.drop(['Sex'], axis=1)

# not lower or upper bound
df6 = df5[~df5['Value type'].isin(['Lower bound', 'Upper bound'])]

# check the result
df6.head(20)


Unnamed: 0,Goal,Target,Indicator Ref,Indicator Description,Value type,Unit,Unit multiplier,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,1,1.1,1.1.1,Proportion of population below the internation...,,Percent,Units,,,,,,,,,10.7,,,,
10,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,45.17,
11,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,34.86,
12,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,27.79,
13,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,41.08,
14,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,67.93,
15,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,21.77,
16,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,24.72,
17,2,2.1,2.1.1,Prevalence of undernourishment,,Percent,Units,14.7,14.3,13.7,13.0,12.5,12.1,11.8,11.4,11.2,11.0,10.8,,
48,2,2.5,2.5.1,Number of plant and animal genetic resources f...,,Number,Units,,,,,,,,,11616.0,,,,
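
The three filters can also be written as a single boolean mask. The sketch below uses a toy frame with made-up values; note that `~isin([...])` keeps rows whose 'Value type' is missing, which matches the output above where 'Value type' is empty.

```python
import numpy as np
import pandas as pd

ALL_AGES = 'All age ranges or no breakdown by age'
BOTH_SEXES = 'Both sexes or no breakdown by sex'

# toy frame standing in for df1 (hypothetical values)
toy = pd.DataFrame({
    'Age group': [ALL_AGES, ALL_AGES, ALL_AGES, '15 to 24 years old'],
    'Sex': [BOTH_SEXES, BOTH_SEXES, 'Female', BOTH_SEXES],
    'Value type': [np.nan, 'Upper bound', np.nan, np.nan],
    '2013': [10.7, 12.0, 9.9, 16.45],
})

mask = (
    toy['Age group'].eq(ALL_AGES)
    & toy['Sex'].eq(BOTH_SEXES)
    & ~toy['Value type'].isin(['Lower bound', 'Upper bound'])   # NaN is not in the list, so it passes
)
selected = toy.loc[mask].drop(columns=['Age group', 'Sex'])
print(selected)
```

A single mask avoids the chain of intermediate frames (`df2` to `df5`), at the cost of not being able to inspect each step separately.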


### Sort data by different units

ind = df6['Indicator Description'].unique()
for i in ind:
    print(i)

*Issue*: the data are given in different units: percent, per 1000, 10000, tonnes of CO2 eq, kilometres, etc. (see above output)

*Idea*: organise the data into categories by unit and display each category in its own graph, e.g. one graph for the environmental indicators (in CO2 eq), **another** graph for distances, etc.
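
Splitting the data by unit could be sketched with `groupby`, as below. The toy frame and the name `by_unit` are illustrative assumptions; in the notebook the input would be `df6`.

```python
import pandas as pd

# toy frame standing in for df6 (hypothetical rows)
toy = pd.DataFrame({
    'Indicator Ref': ['2.1.1', '3.3.2', '9.1.2', '9.4.1'],
    'Unit': ['Percent', 'Per 100,000 population', 'Kilometres', 'Percent'],
})

# one sub-frame per unit; rows with a missing unit are left out by groupby
by_unit = {unit: group for unit, group in toy.groupby('Unit')}
for unit, group in by_unit.items():
    print(unit, group['Indicator Ref'].tolist())
```

Each sub-frame can then be plotted or summarised on its own, so that values in different units never share an axis.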

In [14]:
# Trying out some stuff
df6[df6['Unit'].isin(['Kilometres'])]['Indicator Description']

189    Passenger and freight volumes, by mode of tran...
191    Passenger and freight volumes, by mode of tran...
196    Passenger and freight volumes, by mode of tran...
Name: Indicator Description, dtype: object

In [15]:
# show all units
units = df6.Unit.unique()
for u in units:
    print(u)
    print(df6[df6['Unit'].isin([u])]['Indicator Description'].unique())
    print('------------------------------------------------------------------------------------')

Percent
['Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)'
 'Proportion of population covered by social protection floors/systems, by sex, distinguishing children, unemployed persons, older persons, persons with disabilities, pregnant women, newborns, work-injury victims and the poor and the vulnerable'
 'Prevalence of undernourishment'
 'Number of plant and animal genetic resources for food and agriculture secured in either medium or long-term conservation facilities'
 'Proportion of local breeds classified as being at risk, not-at-risk or at unknown level of risk of extinction'
 'Proportion of births attended by skilled health personnel'
 'Participation rate in organized learning (one year before the official primary entry age), by sex'
 'Proportion of population using safely managed drinking water services'
 'Proportion of population using safely managed sanitation services, including a hand-washin

In [17]:
#df6[df['Indicator Description'].isin(['Red List Index'])]

## Category 1: all indicators that measure a proportion of the population

In [18]:
# show all rows whose description contains the word 'population'
df_p1 = df6[df6['Indicator Description'].str.contains('population')]
df_p1

Unnamed: 0,Goal,Target,Indicator Ref,Indicator Description,Value type,Unit,Unit multiplier,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,1,1.1,1.1.1,Proportion of population below the internation...,,Percent,Units,,,,,,,,,10.7,,,,
10,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,45.17,
11,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,34.86,
12,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,27.79,
13,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,41.08,
14,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,67.93,
15,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,21.77,
16,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,24.72,
72,3,3.3,3.3.1,"Number of new HIV infections per 1,000 uninfec...",,"Per 1,000 uninfected population",Units,0.4,,,,,0.33,,,,,0.3,,
76,3,3.3,3.3.2,"Tuberculosis incidence per 100,000 population",,"Per 100,000 population",Units,169.0,166.0,164.0,161.0,159.0,155.0,153.0,150.0,147.0,144.0,142.0,,
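
`str.contains` is case-sensitive by default and returns a missing value for missing descriptions, so a description that starts with 'Population' would be skipped by the filter above. A minimal sketch with made-up descriptions:

```python
import pandas as pd

# hypothetical descriptions, including a capitalised one and a missing one
desc = pd.Series([
    'Proportion of population below the poverty line',
    'Population covered by a mobile network',
    None,
    'Red List Index',
])

strict = desc.str.contains('population', na=False)              # case-sensitive match
loose = desc.str.contains('population', case=False, na=False)   # also matches 'Population'
print(strict.tolist())
print(loose.tolist())
```

Whether the case-insensitive variant is wanted here depends on which indicators should count as population proportions.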


### Bring all numbers to percent

In [None]:
# rows that are already in percent need no conversion; keep them separately
df_p2 = df_p1[df_p1['Unit'].isin(['Percent'])]
df_p2

In [None]:
# see if the rows containing 'population', but are not measured in percentage, contain information
df_p3 = df_p1[~df_p1['Unit'].isin(['Percent'])]
df_p3

In [None]:
set(df_p3['Unit'])

In [None]:
# rows with the unit 'Micrograms per cubic meter' or 'Number' cannot be expressed in percent, so exclude them as well
df_p4 = df_p3[~df_p3['Unit'].isin(['Micrograms per cubic meter', 'Number'])]
df_p4

In [None]:
df_p4.columns

In [None]:
# continue here
# change the unit of those rows to percentage
# only 2005-2017 are listed: the earlier year columns were dropped from df1,
# so looking them up in a row would raise a KeyError
years = [str(y) for y in range(2005, 2018)]

def toPercent(unit, n):
    """Convert a rate per 1,000 or per 100,000 population to percent."""
    if pd.isna(n):  # pd.isna also handles None, unlike np.isnan
        return np.nan
    n = float(n)
    # `unit == a or b` is always true; test membership instead
    if unit in ('Per 1,000 population', 'Per 1,000 uninfected population'):
        return n / 1000 * 100
    elif unit == 'Per 100,000 population':
        return n / 100000 * 100
    return np.nan  # no conversion defined for any other unit

In [None]:
df_to_percent = df_p4.copy()
    
for year in years:
    df_to_percent[year] = df_p4.apply(lambda r: toPercent(r['Unit'], float(r[year])), axis=1)
df_to_percent['Unit'] = 'Percent'

df_to_percent
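
The row-wise `apply` could be replaced by a vectorised conversion that multiplies each row by a per-unit factor. This is only a sketch under stated assumptions: the mapping `TO_PERCENT` and the toy frame are hypothetical, and a rate "per N population" is taken to correspond to `100 / N` percent.

```python
import numpy as np
import pandas as pd

# factor that turns a rate "per N population" into percent: 100 / N
TO_PERCENT = {
    'Percent': 1.0,
    'Per 1,000 population': 100 / 1_000,
    'Per 1,000 uninfected population': 100 / 1_000,
    'Per 100,000 population': 100 / 100_000,
}

# toy frame standing in for df_p4 (the last row has a unit without a known factor)
toy = pd.DataFrame({
    'Unit': ['Per 1,000 uninfected population', 'Per 100,000 population', 'Kilometres'],
    '2005': [0.4, 169.0, 5.0],
    '2006': [np.nan, 166.0, 6.0],
})
year_cols = ['2005', '2006']

factors = toy['Unit'].map(TO_PERCENT)                  # NaN where the unit is not in the mapping
converted = toy.copy()
converted[year_cols] = toy[year_cols].mul(factors, axis=0)
converted.loc[factors.notna(), 'Unit'] = 'Percent'     # relabel only the converted rows
print(converted)
```

Rows with an unmapped unit end up with NaN values and keep their original unit label, which makes them easy to spot and filter out before plotting.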

In [None]:
df_p4

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df_p_combined_percent = pd.concat([df_p2, df_to_percent], ignore_index=True)
df_p_combined_percent

In [None]:
df_p_combined_percent.describe()
# note: 'count' is the number of non-missing values per column, not the number of rows

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

N=5
## the data
menMeans = [18, 35, 30, 35, 27]
menStd =   [2, 3, 4, 1, 2]
womenMeans = [25, 32, 34, 20, 25]
womenStd =   [3, 5, 2, 3, 3]

## necessary variables
ind = np.arange(N)                # the x locations for the groups
width = 0.35                      # the width of the bars

## the bars
rects1 = ax.bar(ind, menMeans, width,
                color='black',
                yerr=menStd,
                error_kw=dict(elinewidth=2,ecolor='red'))

rects2 = ax.bar(ind+width, womenMeans, width,
                    color='red',
                    yerr=womenStd,
                    error_kw=dict(elinewidth=2,ecolor='black'))

# axes and labels
ax.set_xlim(-width,len(ind)+width)
ax.set_ylim(0,45)
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
xTickMarks = ['Group'+str(i) for i in range(1,6)]
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

## add a legend
ax.legend( (rects1[0], rects2[0]), ('Men', 'Women') )

plt.show()

In [None]:
fig1 = plt.figure()
ax = fig1.add_subplot(111)

## the data: one bar series per row of df_p_combined_percent, one group per year
# the year columns are read from the frame itself
year_cols = [c for c in df_p_combined_percent.columns if str(c).isdigit()]
values = df_p_combined_percent[year_cols].apply(pd.to_numeric, errors='coerce')

## necessary variables
ind = np.arange(len(year_cols))      # years: x-location for groups
width = 0.8 / max(len(values), 1)    # the width of the bars; one group spans 0.8

## the bars
# HOW TO VISUALIZE? IT IS A BIT TOO MUCH TO PUT THEM ALL IN ONE PLOT....
for k, (idx, row) in enumerate(values.iterrows()):
    ax.bar(ind + k * width, row.values, width,
           label=df_p_combined_percent.loc[idx, 'Indicator Ref'])

# axes and labels
ax.set_ylabel('Percent')
ax.set_title('Population indicators by year')
ax.set_xticks(ind + 0.4 - width / 2)  # centre each tick under its group of bars
xtickNames = ax.set_xticklabels(year_cols)
plt.setp(xtickNames, rotation=45, fontsize=10)

## add a legend
ax.legend()

plt.show()
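
One way around the crowded single plot is a grid of small plots, one per row. The sketch below is an assumption about how this could look, not the notebook's chosen visualisation; the helper `plot_small_multiples` and its parameters are hypothetical, and the toy values are taken from the tables above.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

def plot_small_multiples(frame, year_cols, label_col='Indicator Ref', ncols=2):
    """Draw one small line plot per row of frame, arranged in a grid."""
    n = len(frame)
    nrows = int(np.ceil(n / ncols))
    fig, axes = plt.subplots(nrows, ncols, squeeze=False, sharex=True)
    x = [int(y) for y in year_cols]
    for ax, (_, row) in zip(axes.ravel(), frame.iterrows()):
        ax.plot(x, row[year_cols].astype(float).values, marker='o')
        ax.set_title(str(row[label_col]))
    for ax in axes.ravel()[n:]:
        ax.set_visible(False)          # hide the unused cells of the grid
    return fig, axes

toy = pd.DataFrame({
    'Indicator Ref': ['2.1.1', '3.3.2', '3.3.1'],
    '2005': [14.7, 169.0, 0.4],
    '2010': [12.1, 155.0, 0.33],
    '2015': [10.8, 142.0, 0.3],
})
fig, axes = plot_small_multiples(toy, ['2005', '2010', '2015'])
```

Because every panel has its own y-axis, indicators on very different scales stay readable; in a notebook the figure is shown by ending the cell with `plt.show()`.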

In [None]:
vis_percent.hist(by='Indicator Ref')  # NOTE: vis_percent is not defined in any cell above

http://people.duke.edu/~ccc14/pcfb/numpympl/MatplotlibBarPlots.html
https://matplotlib.org/devdocs/gallery/api/two_scales.html

*perhaps good* in the very end: descriptive statistics

In [None]:
# descriptive statistics per unit (values in different units are not comparable)
for unit, dfx in df6.groupby('Unit'):
    print(unit)
    print(dfx.describe())

In [None]:
# TODO: Fill the gaps with inference
# TODO: Structure Learning