# Targets ... of UN Sustainable Development Goals
## Data preprocessing and visualisation

In this notebook, we preprocess and visualise data of the *targets* of United Nations Sustainable Development Goals. For each of these targets, at least one quantitatively measurable *indicator* is defined. The total number of targets is 169; of those, data for 83 are given by the UN Statistics Division.

**Background to data**: It represents the entire world

### Import the required packages

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pystan
import pystan_utils
import os

# matplotlib style options
plt.style.use('ggplot')
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 10)

### Use pandas to load the data set

In [2]:
# load csv
#os.chdir('/home/felix/PycharmProjects/MBMLproject')
df = pd.read_csv("SDG_Indicators.csv")
df.head()

Unnamed: 0,Goal,Target,Indicator Ref,IndicatorId,Indicator Description,Series Code,Series Type,Series Description,Parent Country or Area Code,Country or Area Code,...,2013,FN.30,2014,FN.31,2015,FN.32,2016,FN.33,2017,FN.34
0,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_DAY1,SD,Proportion of population below the internation...,,MDG_WORLD,...,10.7,"24, 70",,,,,,,,
1,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.45,"M, 24, 72",15.87,"M, 25, 72",15.51,"M, 26, 72",15.1,"M, 27, 72",,
2,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.37,"M, 24, 72",15.83,"M, 25, 72",15.54,"M, 26, 72",15.18,"M, 27, 72",,
3,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,16.5,"M, 24, 72",15.89,"M, 25, 72",15.49,"M, 26, 72",15.04,"M, 27, 72",,
4,1,1.1,1.1.1,C010101,Proportion of population below the internation...,SI_POV_EMP1,SD,Proportion of employed population below the in...,,MDG_WORLD,...,11.14,"M, 24, 72",10.61,"M, 25, 72",10.25,"M, 26, 72",9.87,"M, 27, 72",,


### Modify df to focus on columns of use

Display columns

In [3]:
df.columns

Index(['Goal', 'Target', 'Indicator Ref', 'IndicatorId',
       'Indicator Description', 'Series Code', 'Series Type',
       'Series Description', 'Parent Country or Area Code',
       'Country or Area Code', 'Country or Area Name', 'LDC', 'LLDC', 'SIDS',
       'Frequency', 'Source type', 'Age group', 'Location', 'Sex',
       'Value type', 'Unit', 'Unit multiplier', '1983', 'FN', '1984', 'FN.1',
       '1985', 'FN.2', '1986', 'FN.3', '1987', 'FN.4', '1988', 'FN.5', '1989',
       'FN.6', '1990', 'FN.7', '1991', 'FN.8', '1992', 'FN.9', '1993', 'FN.10',
       '1994', 'FN.11', '1995', 'FN.12', '1996', 'FN.13', '1997', 'FN.14',
       '1998', 'FN.15', '1999', 'FN.16', '2000', 'FN.17', '2001', 'FN.18',
       '2002', 'FN.19', '2003', 'FN.20', '2004', 'FN.21', '2005', 'FN.22',
       '2006', 'FN.23', '2007', 'FN.24', '2008', 'FN.25', '2009', 'FN.26',
       '2010', 'FN.27', '2011', 'FN.28', '2012', 'FN.29', '2013', 'FN.30',
       '2014', 'FN.31', '2015', 'FN.32', '2016', 'FN.33', '2017'

In [4]:
df.describe()

Unnamed: 0,Goal,Parent Country or Area Code,LDC,LLDC,SIDS,1983,1984,FN.1,1985,1986,...,1988,FN.5,1989,1991,1992,1993,1994,1996,1997,1999
count,299.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,3.0,0.0,...,0.0,0.0,3.0,3.0,6.0,6.0,6.0,6.0,9.0,7.0
mean,7.672241,,,,,33.336667,,,33.333333,,...,,,33.333333,48.2,23.583333,14.611667,37.21,27.341667,77.948889,36.674286
std,5.428572,,,,,18.147543,,,16.119157,,...,,,8.259845,61.588311,13.360255,23.24336,79.044737,55.043911,187.464333,66.865561
min,1.0,,,,,16.67,,,18.18,,...,,,26.84,7.0,7.0,0.82,0.82,0.81,0.81,0.8
25%,3.0,,,,,23.67,,,24.865,,...,,,28.685,12.8,16.625,0.8225,0.82,0.82,0.82,0.81
50%,6.0,,,,,30.67,,,31.55,,...,,,30.53,18.6,21.635,3.915,2.91,3.41,16.1,15.3
75%,12.0,,,,,41.67,,,40.91,,...,,,36.58,68.8,28.1525,15.4,14.6,13.95,26.63,26.5
max,17.0,,,,,52.67,,,50.27,,...,,,42.63,119.0,45.95,60.0,198.0,139.0,576.0,186.0


Delete the columns of no use.
Decided to drop all years before 2005 as they only have very few data points.

In [5]:
df1 = df.drop(['IndicatorId', 'Series Code', 'Series Type',
             'Series Description', 'Parent Country or Area Code',
             'Country or Area Code', 'Country or Area Name', 'LDC', 'LLDC', 'SIDS',
             'Frequency', 'Source type', 'Location', '1983', 'FN', '1984', 'FN.1',
               '1985', 'FN.2', '1986', 'FN.3', '1987', 'FN.4', '1988', 'FN.5', '1989',
               'FN.6', '1990', 'FN.7', '1991', 'FN.8', '1992', 'FN.9', '1993', 'FN.10',
               '1994', 'FN.11', '1995', 'FN.12', '1996', 'FN.13', '1997', 'FN.14',
               '1998', 'FN.15', '1999', 'FN.16', '2000', 'FN.17', '2001', 'FN.18',
               '2002', 'FN.19', '2003', 'FN.20', '2004', 'FN.21','FN.22','FN.23','FN.24',
               'FN.25','FN.26','FN.27','FN.28','FN.29','FN.30','FN.31','FN.32','FN.33','FN.34',
              ], axis=1)

In [6]:
df1.head(100)

Unnamed: 0,Goal,Target,Indicator Ref,Indicator Description,Age group,Sex,Value type,Unit,Unit multiplier,2005,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,1,1.1,1.1.1,Proportion of population below the internation...,All age ranges or no breakdown by age,Both sexes or no breakdown by sex,,Percent,Units,,...,,,,,,10.70,,,,
1,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Both sexes or no breakdown by sex,,Percent,Units,24.80,...,22.04,21.18,20.22,18.26,17.08,16.45,15.87,15.51,15.10,
2,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Female,,Percent,Units,24.82,...,21.83,20.84,19.84,18.04,16.98,16.37,15.83,15.54,15.18,
3,1,1.1,1.1.1,Proportion of population below the internation...,15 to 24 years old,Male,,Percent,Units,24.78,...,22.18,21.41,20.47,18.40,17.15,16.50,15.89,15.49,15.04,
4,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Both sexes or no breakdown by sex,,Percent,Units,18.69,...,16.35,15.51,14.62,12.89,11.77,11.14,10.61,10.25,9.87,
5,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Female,,Percent,Units,18.62,...,15.98,15.00,14.05,12.53,11.58,11.00,10.56,10.26,9.93,
6,1,1.1,1.1.1,Proportion of population below the internation...,15 years old and over,Male,,Percent,Units,18.74,...,16.60,15.84,14.99,13.12,11.90,11.23,10.65,10.25,9.84,
7,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Both sexes or no breakdown by sex,,Percent,Units,17.31,...,15.14,14.34,13.50,11.84,10.77,10.17,9.68,9.34,8.99,
8,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Female,,Percent,Units,17.21,...,14.72,13.78,12.89,11.46,10.56,10.02,9.62,9.35,9.05,
9,1,1.1,1.1.1,Proportion of population below the internation...,25 years old and over,Male,,Percent,Units,17.38,...,15.41,14.70,13.90,12.09,10.91,10.27,9.72,9.34,8.95,


In [7]:
df1.describe() # why doesn't describe show the other values?

Unnamed: 0,Goal
count,299.0
mean,7.672241
std,5.428572
min,1.0
25%,3.0
50%,6.0
75%,12.0
max,17.0


In [8]:
df1.columns

Index(['Goal', 'Target', 'Indicator Ref', 'Indicator Description', 'Age group',
       'Sex', 'Value type', 'Unit', 'Unit multiplier', '2005', '2006', '2007',
       '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016',
       '2017'],
      dtype='object')

- "Goal" defining the overall goal
- "Target" measurable target
- "Indicator Ref" 
- year indicating in which year the measure was taken
- 'Age group'
- 'Sex'
- 'Value type'
- 'Unit'
- 'Unit multiplier'
- "Indicator Description" - description of the target

Select the rows that display data of 
    
1. *all* age groups and sex
2. are *not* lower or upper bound
3. 

@Galina: as you can see in the following cell, some rows have merely one data point. We can do that and plot different categories in a histogram; or we take the rows with the most data points and plot each row over time. The former has more use for our intentions later on, whereas the latter would perhaps look nicer (?). I lean toward the the former. *(The next few cells will help you understanding what I mean with categories)*
    

In [9]:
# all age groups
df2 = df1[df1['Age group'].isin(['All age ranges or no breakdown by age'])]
df3 = df2.drop(['Age group'], axis=1)

# all sex
df4 = df3[df3['Sex'].isin(['Both sexes or no breakdown by sex'])]
df5 = df4.drop(['Sex'], axis=1)

# not lower or upper bound
df6 = df5[~df5['Value type'].isin(['Lower bound', 'Upper bound'])]

# uncomment to check
df6.head(20)


Unnamed: 0,Goal,Target,Indicator Ref,Indicator Description,Value type,Unit,Unit multiplier,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,1,1.1,1.1.1,Proportion of population below the internation...,,Percent,Units,,,,,,,,,10.7,,,,
10,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,45.17,
11,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,34.86,
12,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,27.79,
13,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,41.08,
14,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,67.93,
15,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,21.77,
16,1,1.3,1.3.1,Proportion of population covered by social pro...,,Percent,Units,,,,,,,,,,,,24.72,
17,2,2.1,2.1.1,Prevalence of undernourishment,,Percent,Units,14.7,14.3,13.7,13.0,12.5,12.1,11.8,11.4,11.2,11.0,10.8,,
48,2,2.5,2.5.1,Number of plant and animal genetic resources f...,,Number,Units,,,,,,,,,11616.0,,,,


### Sort data by different units

ind = df6['Indicator Description'].unique()
for i in ind:
    print(i)

*Issue*: data is given in percentage, per 1000, 10000, in tonnes CO2eq, in kilometres, etc. (see above output)

*Idea*: organise data into different categories, display in **another** graph environmental (i.e. taking CO2 eq), display in **another** graph distance, etc.

In [10]:
# Trying out some stuff
df6[df6['Unit'].isin(['Kilometres'])]['Indicator Description']

189    Passenger and freight volumes, by mode of tran...
191    Passenger and freight volumes, by mode of tran...
196    Passenger and freight volumes, by mode of tran...
Name: Indicator Description, dtype: object

In [11]:
# show all units
units = df6.Unit.unique()
for u in units:
    print(u)
    print(df6[df6['Unit'].isin([u])]['Indicator Description'].unique())
    print('------------------------------------------------------------------------------------')

Percent
['Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)'
 'Proportion of population covered by social protection floors/systems, by sex, distinguishing children, unemployed persons, older persons, persons with disabilities, pregnant women, newborns, work-injury victims and the poor and the vulnerable'
 'Prevalence of undernourishment'
 'Number of plant and animal genetic resources for food and agriculture secured in either medium or long-term conservation facilities'
 'Proportion of local breeds classified as being at risk, not-at-risk or at unknown level of risk of extinction'
 'Proportion of births attended by skilled health personnel'
 'Participation rate in organized learning (one year before the official primary entry age), by sex'
 'Proportion of population using safely managed drinking water services'
 'Proportion of population using safely managed sanitation services, including a hand-washin

In [12]:
#df6[df['Indicator Description'].isin(['Red List Index'])]

# Create Dataframes of different categories

In [68]:
# import csv file
indicator_categories = pd.read_csv('Dev_Indicators.csv', delimiter=';')
indicator_categories.head()
del indicator_categories['Unnamed: 2']
del indicator_categories['Unnamed: 3']
indicator_categories.head()

Unnamed: 0,Category,Indicator Description
0,----,"Number of verified cases of killing, kidnappin..."
1,E&P (Percent),Number of countries that have national statist...
2,E&P (Percent),Number of countries with a national statistica...
3,E&P (Percent),Energy intensity measured in terms of primary ...
4,E&P (Percent),Annual growth rate of real GDP per capita


In [82]:
df6['Indicator Description']

0      Proportion of population below the internation...
10     Proportion of population covered by social pro...
11     Proportion of population covered by social pro...
12     Proportion of population covered by social pro...
13     Proportion of population covered by social pro...
14     Proportion of population covered by social pro...
15     Proportion of population covered by social pro...
16     Proportion of population covered by social pro...
17                        Prevalence of undernourishment
48     Number of plant and animal genetic resources f...
49     Number of plant and animal genetic resources f...
50     Number of plant and animal genetic resources f...
51     Number of plant and animal genetic resources f...
52     Number of plant and animal genetic resources f...
53     Number of plant and animal genetic resources f...
54     Number of plant and animal genetic resources f...
55     Proportion of local breeds classified as being...
56     Proportion of local bree

In [77]:
df6Desc = df6['Indicator Description'].unique().sort()
indcat = indicator_categories['Indicator Description'].unique()
df6Desc == indcat.sort()

True

In [86]:
ind_cat_joined = indicator_categories.merge(df6, on='Indicator Description')
ind_cat_joined.head(50)

Unnamed: 0,Category,Indicator Description,Goal,Target,Indicator Ref,Value type,Unit,Unit multiplier,2005,2006,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
0,----,"Number of verified cases of killing, kidnappin...",16,16.10,16.10.1,,Number,Units,,,...,,,65.0,62.0,124.0,90.0,98.0,115.0,102.0,
1,E&P (Percent),Number of countries that have national statist...,17,17.18,17.18.2,,Number,Units,,,...,,,,,,,,,37.0,
2,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,8.0,
3,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,18.0,
4,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,7.0,
5,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,17.0,
6,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,81.0,
7,E&P (Percent),Energy intensity measured in terms of primary ...,7,7.3,7.3.1,,Megajoules per USD constant 2011 PPP GDP,Units,6.37,,...,,,5.95,,,,5.49,,,
8,E&P (Percent),Annual growth rate of real GDP per capita,8,8.1,8.1.1,,Percent,Units,2.35,2.77,...,0.2,-3.24,2.82,1.62,1.01,1.08,1.35,1.45,,
9,E&P (Percent),Annual growth rate of real GDP per employed pe...,8,8.2,8.2.1,,Percent,Units,2.77,3.85,...,1.83,-0.71,4.07,2.66,1.9,1.93,1.8,1.67,1.8,


In [88]:
ind_cat_joined['Category'].unique()

array(['----', 'E&P (Percent)', 'E&P (percent)', 'E&P (USD)',
       'Environment (Percent)', 'Environment (percent)',
       'Environment (Tonnes)', 'Population', 'Population (Percent)',
       '---'], dtype=object)

## Population dataframe

In [91]:
df_population = ind_cat_joined[ind_cat_joined['Category'].str.contains('Population')]
df_population['Unit'].unique()

array(['Percent', 'Per 100,000 live births',
       'Per 1,000 uninfected population', 'Per 100,000 population',
       'Per 1,000 population', 'Number', 'Per million population'],
      dtype=object)

In [94]:
df_population[df_population['Unit'].str.contains('Number')]['Indicator Description'].unique()

array(['Suicide mortality rate',
       'Degree of integrated water resources management implementation (0-100)',
       'Number of victims of intentional homicide per 100,000 population, by sex and age'],
      dtype=object)

In [95]:
df_population[df_population['Indicator Description'].str.contains('Suicide mortality rate')]

Unnamed: 0,Category,Indicator Description,Goal,Target,Indicator Ref,Value type,Unit,Unit multiplier,2005,2006,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
109,Population,Suicide mortality rate,3,3.4,3.4.2,,"Per 100,000 population",Units,11.61,,...,,,11.23,,,,,10.73,,
110,Population,Suicide mortality rate,3,3.4,3.4.2,,Number,Thousands,756.72,,...,,,777.95,,,,,788.09,,


*Fix here*: delete the one that is number

In [96]:
year = ['2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017']

In [127]:
def toPercent(unit, n):
    if np.isnan(n): # does not work!
        return np.nan
    n = float(n.replace(',',''))
    if unit =='Percent':
        return n
    if unit == 'Per 1,000 population' or 'Per 1,000 uninfected population':
        return n/1000
    elif unit == 'Per 100,000 population':
        return n/100000

In [128]:
df_pop_soon_percent = df_population[~df_population['Unit'].isin(['Number'])]

In [129]:
list(df_pop_soon_percent[df_pop_soon_percent.index==96]['2013'])

['10.70']

In [130]:
for y in year:
    test = df_pop_soon_percent[y].unique()
    for t in test:
        print(y)
        print(float(t))


2005
nan
2005
288.0
2005
0.4
2005
169.0
2005
141.25
2005
11.61
2005
18.8
2005
1.83
2005
57.42
2005
47.97
2005
66.09
2005
84.91
2005
30.97
2005
17.62
2005
3.76
2005
35.29
2005
31.34
2005
27.54
2005
80.23
2005
95.44
2005
907.37
2005
63.7
2005
32.0
2005
27.9
2005
35.5
2005
15.76
2005
14.7
2006
nan
2006
166.0
2006
58.66
2006
49.46
2006
67.06
2006
84.97
2006
30.32
2006
17.07
2006
3.59
2006
36.56
2006
32.45
2006
28.41
2006
929.35
2006
79.49
2006
17.46
2006
14.3
2007
nan
2007
164.0
2007
59.07
2007
50.98
2007
68.02
2007
85.02
2007
29.66
2007
16.53
2007
3.43
2007
37.29
2007
33.21
2007
29.11
2007
958.45
2007
82.64
2007
20.54
2007
13.7
2008
nan
2008
161.0
2008
58.89
2008
52.49
2008
68.97
2008
85.07
2008
28.98
2008
15.98
2008
3.26
2008
38.04
2008
33.97
2008
29.82
2008
983.58
2008
84.22
2008
23.18
2008
13.0
2009
nan
2009
159.0
2009
59.41
2009
54.02
2009
69.9
2009
85.12
2009
28.29
2009
15.42
2009
3.09
2009
38.78
2009
34.74
2009
30.52
2009


ValueError: could not convert string to float: '1,007.52'

In [131]:
st = '34456'
float(st.replace(',',''))

34456.0

In [132]:
st2 = '234,56'
float(st2.replace(',',''))

23456.0

In [134]:
df_pop_to_percent = df_pop_soon_percent.copy()
    
for yr in year:
    df_pop_to_percent[yr] = df_pop_to_percent.apply(lambda r: toPercent(r['Unit'], float(r[yr])), axis=1)
df_pop_to_percent['Unit'] = 'Percent'

df_pop_to_percent

AttributeError: ("'float' object has no attribute 'replace'", 'occurred at index 104')

## Economy and Politics

In [136]:
df_ecopol = ind_cat_joined[ind_cat_joined['Category'].str.contains('E&P')]
df_ecopol.head()

Unnamed: 0,Category,Indicator Description,Goal,Target,Indicator Ref,Value type,Unit,Unit multiplier,2005,2006,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
1,E&P (Percent),Number of countries that have national statist...,17,17.18,17.18.2,,Number,Units,,,...,,,,,,,,,37.0,
2,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,8.0,
3,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,18.0,
4,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,7.0,
5,E&P (Percent),Number of countries with a national statistica...,17,17.18,17.18.3,,Number,Units,,,...,,,,,,,,,17.0,


In [138]:
df_ecopol['Category'].unique()

array(['E&P (Percent)', 'E&P (percent)', 'E&P (USD)'], dtype=object)

In [137]:
df_ecopol['Unit'].unique()

array(['Number', 'Megajoules per USD constant 2011 PPP GDP', 'Percent',
       'Constant USD', 'Per 1,000 USD', 'USD', 'Not applicable'],
      dtype=object)

## Environment

In [140]:
df_environment = ind_cat_joined[ind_cat_joined['Category'].str.contains('Environment')]
df_environment.head()

Unnamed: 0,Category,Indicator Description,Goal,Target,Indicator Ref,Value type,Unit,Unit multiplier,2005,2006,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017
37,Environment (Percent),Level of water stress: freshwater withdrawal a...,6,6.4,6.4.2,,Percent,Units,,,...,,,,,,,12.7,,,
38,Environment (Percent),Renewable energy share in the total final ener...,7,7.2,7.2.1,,Percent,Units,16.91,16.97,...,17.14,17.71,17.51,17.54,17.91,18.19,18.33,,,
39,Environment (percent),Proportion of urban solid waste regularly coll...,11,11.6,11.6.1,,Percent,Units,,,...,,,,,,,,,,65.2
40,Environment (percent),Number of parties to international multilatera...,12,12.4,12.4.1,,Percent,Units,,,...,,,,,,,,54.46,,
41,Environment (percent),Number of parties to international multilatera...,12,12.4,12.4.1,,Percent,Units,,,...,,,,,,,,100.0,,


In [141]:
df_environment['Unit'].unique()

array(['Percent', 'Square kilometers', 'Hectares', 'Metric Tons',
       'Not applicable', 'Number', 'Kilograms', 'Tonne kilometres',
       'Kilometres', 'kg CO2 equivalent per USD1 constant 2005 PPP GDP',
       'Kilograms per constant USD', 'Micrograms per cubic meter',
       'Tonnes per hectare'], dtype=object)

## category 1: all indicators which measure proportion of population

In [None]:
# show all rows which contain the word 'population in its description
df_p1 = df6[df6['Indicator Description'].str.contains('population')]
df_p1

### bring all numbers to percent

In [None]:
# do not consider the rows which are already in percentage
df_p2 = df_p1[df_p1['Unit'].isin(['Percent'])]
df_p2

In [None]:
# see if the rows containing 'population', but are not measured in percentage, contain information
df_p3 = df_p1[~df_p1['Unit'].isin(['Percent'])]
df_p3

In [None]:
set(df_p3['Unit'])

In [None]:
# the rows with Units as 'Micrograms per cubic meter' or 'Number' do not want to be considered either
df_p4 = df_p3[~df_p3['Unit'].isin(['Micrograms per cubic meter', 'Number'])]
df_p4

In [None]:
df_p4.columns

In [None]:
# continue here
# change the unit of those rows to percentage
years = ['1983', '1984', '1985', '1986','1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017']
def toPercent(unit, n):
    if np.isnan(n): # does not work!
        return np.nan
    n = float(n)
    if unit == 'Per 1,000 population' or 'Per 1,000 uninfected population':
        return n/1000
    elif unit == 'Per 100,000 population':
        return n/100000

In [None]:
df_to_percent = df_p4.copy()
    
for year in years:
    df_to_percent[year] = df_p4.apply(lambda r: toPercent(r['Unit'], float(r[year])), axis=1)
df_to_percent['Unit'] = 'Percent'

df_to_percent

In [None]:
df_p4

In [None]:
df_p_combined_percent = df_p2.append(df_to_percent)
df_p_combined_percent = df_p_combined_percent.reset_index()
df_p_combined_percent = df_p_combined_percent.drop(['index'], axis=1)
df_p_combined_percent

In [None]:
df_p_combined_percent.describe()
# why does this give the wrong counts?

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

N=5
## the data
menMeans = [18, 35, 30, 35, 27]
menStd =   [2, 3, 4, 1, 2]
womenMeans = [25, 32, 34, 20, 25]
womenStd =   [3, 5, 2, 3, 3]

## necessary variables
ind = np.arange(N)                # the x locations for the groups
width = 0.35                      # the width of the bars

## the bars
rects1 = ax.bar(ind, menMeans, width,
                color='black',
                yerr=menStd,
                error_kw=dict(elinewidth=2,ecolor='red'))

rects2 = ax.bar(ind+width, womenMeans, width,
                    color='red',
                    yerr=womenStd,
                    error_kw=dict(elinewidth=2,ecolor='black'))

# axes and labels
ax.set_xlim(-width,len(ind)+width)
ax.set_ylim(0,45)
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
xTickMarks = ['Group'+str(i) for i in range(1,6)]
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

## add a legend
ax.legend( (rects1[0], rects2[0]), ('Men', 'Women') )

plt.show()

In [None]:
fig1 = plt.figure()
ax = fig1.add_subplot(111)

## the data
# TODO: take out every target for the years

## necessary variables
ind = years                # years: x-location for groups
width = 0.25                      # the width of the bars

## the bars
# HOW TO VISUALIZE? IT IS A BIT TOO MUCH TO PUT THEM ALL IN ONE PLOT....
rects1 = ax.bar(ind, menMeans, width,
                color='black',
                yerr=menStd,
                error_kw=dict(elinewidth=2,ecolor='red'))

rects2 = ax.bar(ind+width, womenMeans, width,
                    color='red',
                    yerr=womenStd,
                    error_kw=dict(elinewidth=2,ecolor='black'))

# axes and labels
ax.set_xlim(-width,len(ind)+width)
ax.set_ylim(0,45)
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
xTickMarks = ['Group'+str(i) for i in range(1,6)]
ax.set_xticks(ind+width)
xtickNames = ax.set_xticklabels(xTickMarks)
plt.setp(xtickNames, rotation=45, fontsize=10)

## add a legend
ax.legend( (rects1[0], rects2[0]), ('Men', 'Women') )

plt.show()

In [None]:
vis_percent.hist(by='Indicator Ref')

http://people.duke.edu/~ccc14/pcfb/numpympl/MatplotlibBarPlots.html
https://matplotlib.org/devdocs/gallery/api/two_scales.html

*perhaps good* in the very end: descriptive statistics

In [None]:
# define different dfx for different units
dfx1.describe()
dfx2.describe()
...
df6.describe()

In [None]:
# TODO: Fill the gaps with inference
# TODO: Structure Learning