# Initial Questions
1. What are the downsides of development? 

# Profile

* Where did the data set come from (provenance)? What's in it?
    * The data is sourced from multiple locations and aggregated by the World Bank. It seems to come mostly from large intergovernmental institutions, such as the United Nations. I did not investigate tertiary sources in the hierarchy.
        * Environmental Center
        * Food and Agriculture Organization
        * Internal Displacement Monitoring Centre
        * World Health Organization
    
* How big is the data set (how many rows? how many variables? file size?)
* What types of data variables are present? What are the dimensions/types?
* What is the overall perceived quality of the data? What's missing? What do you wish it included? Any noticeable outliers? Any other anomalous or curious things that jump out at you?
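The size and type questions above can be answered directly with pandas. The following is a minimal sketch, not the notebook's actual profiling step: `sample` is a made-up stand-in for the wide World Bank export, and `profile` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wide export: one row per (country, series),
# one column per year. All values are made up for illustration.
sample = pd.DataFrame({
    'country_name': ['A', 'A', 'B', 'B'],
    'series_name': ['GDP per capita', 'Forest area', 'GDP per capita', 'Forest area'],
    '1960': [np.nan, 10.0, 100.0, np.nan],
    '1961': [np.nan, 11.0, 110.0, np.nan],
    '1962': [50.0, 12.0, 120.0, np.nan],
})

def profile(frame):
    """Summarize the size and column types of a data frame."""
    return {
        'rows': frame.shape[0],
        'columns': frame.shape[1],
        # deep=True also counts the memory held by string values
        'memory_bytes': int(frame.memory_usage(deep=True).sum()),
        'numeric_columns': frame.select_dtypes(include='number').shape[1],
    }

summary = profile(sample)
print(summary)
```

Applied to the real data, the same helper would report the in-memory size, which can differ from the size of the CSV file on disk.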


# Variables to consider
## Positive indicators
* GDP per capita (current US$)
* GNI per capita, Atlas method (current US$)
* Literacy rate, adult total (% of people ages 15 and above)
* Mortality rate, infant (per 1,000 live births)
* Current health expenditure (% of GDP)
* Access to electricity (% population)
* Industry (including construction), value added (% of GDP)

## Potentially negative indicators
* Rural population (% of total population)
* Urban population (% of total population)
* Total greenhouse gas emissions (kt of CO2 equivalent)
* Forest area (% of land)
* Agriculture, forestry, fishing, value added (% of GDP)
* Level of water stress: freshwater withdrawal as a proportion of available freshwater resources
* Livestock production index (2014-2016 = 100)
* Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)
* Cause of death, by non-communicable diseases (% of total)
* Droughts, floods, extreme temperatures (% of population, average 1990-2009)
* Death rate, crude (per 1,000 people)
* Suicide mortality rate (per 100,000 population)
* Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70 (%)
* PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)

## Dropped indicators
* Bird, fish, mammal, plant species (threatened)
    * The data is too sparse: there is only a single entry per country for each of these series. Also, the number of threatened species is not normalized by the total number of species in that country, so comparisons between countries wouldn't make much sense.
    TODO: visualization that demonstrates sparseness
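One way to quantify the sparseness described above is to count the non-missing observations each country has for each series. This is a minimal sketch under assumptions: `long_df` is a made-up long-format table (one row per country, series, and year), and the column names `year` and `value` are illustrative rather than taken from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with made-up values.
long_df = pd.DataFrame({
    'country_name': ['A'] * 4 + ['B'] * 4,
    'series_name': ['Bird species, threatened', 'Bird species, threatened',
                    'Forest area', 'Forest area'] * 2,
    'year': [2017, 2018, 2017, 2018] * 2,
    'value': [12.0, np.nan, 30.1, 30.0, 7.0, np.nan, 55.2, 55.0],
})

# Number of non-missing observations for every (series, country) pair.
# GroupBy.count() ignores NaN values.
obs = long_df.groupby(['series_name', 'country_name'])['value'].count()

# Median observation count per series: a value of 1 means a typical
# country has only a single entry for that series.
coverage = obs.groupby(level='series_name').median()
print(coverage)
```

If matplotlib is installed, `coverage.plot.barh()` would turn this summary into a simple chart for the TODO.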

I did some filtering on the World Bank website: I looked through the available variables and decided which were relevant to my question. Domain knowledge would have been helpful here. I pulled data for all years (1960 to 2019) for each of the variables above. Where possible, I chose variables that had already been normalized per capita. I do not yet know how complete the data is.
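Completeness can be checked per year once the data is loaded. The sketch below assumes a wide table with one column per year. The frame `wide`, its values, and the plain year labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data with made-up values.
wide = pd.DataFrame({
    'country_name': ['A', 'A', 'B', 'B'],
    'series_name': ['s1', 's2', 's1', 's2'],
    '1960': [np.nan, np.nan, np.nan, 1.0],
    '1990': [1.0, np.nan, 2.0, 3.0],
    '2019': [1.0, 2.0, 3.0, 4.0],
})

year_cols = ['1960', '1990', '2019']

# Share of non-missing cells in each year column (0 = empty, 1 = complete).
fullness = wide[year_cols].notna().mean()
print(fullness)
```

A pattern like the one in this toy example, with early years mostly empty, would suggest restricting the analysis to a later window.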

In [9]:
import pandas as pd
import numpy as np
import requests

pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)

df = pd.read_csv('data/world_indicators.csv', na_values='..')

# Standardize column names: replace spaces with underscores and upper-case with lower-case
df.columns = [c.lower().replace(' ', '_') for c in df.columns]

df.head(1)

Unnamed: 0,country_name,country_code,series_name,series_code,1960_[yr1960],1961_[yr1961],1962_[yr1962],1963_[yr1963],1964_[yr1964],1965_[yr1965],1966_[yr1966],1967_[yr1967],1968_[yr1968],1969_[yr1969],1970_[yr1970],1971_[yr1971],1972_[yr1972],1973_[yr1973],1974_[yr1974],1975_[yr1975],1976_[yr1976],1977_[yr1977],1978_[yr1978],1979_[yr1979],1980_[yr1980],1981_[yr1981],1982_[yr1982],1983_[yr1983],1984_[yr1984],1985_[yr1985],1986_[yr1986],1987_[yr1987],1988_[yr1988],1989_[yr1989],1990_[yr1990],1991_[yr1991],1992_[yr1992],1993_[yr1993],1994_[yr1994],1995_[yr1995],1996_[yr1996],1997_[yr1997],1998_[yr1998],1999_[yr1999],2000_[yr2000],2001_[yr2001],2002_[yr2002],2003_[yr2003],2004_[yr2004],2005_[yr2005],2006_[yr2006],2007_[yr2007],2008_[yr2008],2009_[yr2009],2010_[yr2010],2011_[yr2011],2012_[yr2012],2013_[yr2013],2014_[yr2014],2015_[yr2015],2016_[yr2016],2017_[yr2017],2018_[yr2018],2019_[yr2019],2020_[yr2020]
0,Afghanistan,AFG,Access to electricity (% of population),EG.ELC.ACCS.ZS,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,22.3,28.1,33.9,42.4,45.52,42.7,43.22,69.1,68.98,89.5,71.5,97.7,97.7,98.72,97.7,


In [26]:
# Transform data to be of this form:
# country_name, year, series_1, series_2, ..., series_n

# Get all the column years 
year_columns = df.columns[4:]

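# Reshape from one column per year to one row per (country, series, year)
df_melted = df.melt(id_vars=list(df.columns[:4]), value_vars=list(year_columns), var_name='years')
# Use the columns= keyword: the positional axis argument is no longer accepted in recent pandas
df_melted = df_melted.drop(columns='series_code')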

df_melted.pivot_table(index=['country_name', 'years'], columns='series_name', values='value')


country_name,years,Access to electricity (% of population),Agricultural land (% of land area),Agricultural methane emissions (% of total),Agricultural methane emissions (thousand metric tons of CO2 equivalent),"Bird species, threatened",CO2 emissions (kt),CO2 emissions (metric tons per capita),"Cause of death, by communicable diseases and maternal, prenatal and nutrition conditions (% of total)","Cause of death, by non-communicable diseases (% of total)",Current health expenditure per capita (current US$),"Droughts, floods, extreme temperatures (% of population, average 1990-2009)",Electric power consumption (kWh per capita),"Energy imports, net (% of energy use)","Fish species, threatened",Forest area (% of land area),Forest rents (% of GDP),"GNI per capita, Atlas method (current US$)","Industry (including construction), value added (% of GDP)",Level of water stress: freshwater withdrawal as a proportion of available freshwater resources,"Literacy rate, adult total (% of people ages 15 and above)",Livestock production index (2014-2016 = 100),"Mammal species, threatened","Mortality from CVD, cancer, diabetes or CRD between exact ages 30 and 70 (%)","Mortality rate, infant, male (per 1,000 live births)","Mortality rate, neonatal (per 1,000 live births)","PM2.5 air pollution, population exposed to levels exceeding WHO guideline value (% of total)",Permanent cropland (% of land area),"Plant species (higher), threatened",Rural population (% of total population),"Suicide mortality rate (per 100,000 population)",Terrestrial protected areas (% of total land area),Total greenhouse gas emissions (kt of CO2 equivalent),Urban population (% of total population)
Afghanistan,1960_[yr1960],,,,,,414.37,0.05,,,,,,,,,,,,,,,,,,,,,,91.60,,,,8.40
Afghanistan,1961_[yr1961],,57.75,,,,491.38,0.05,,,,,,,,,,,,,,41.35,,,,,,0.08,,91.32,,,,8.68
Afghanistan,1962_[yr1962],,57.84,,,,689.40,0.07,,,,,,,,,,,,,,41.88,,,243.6,,,0.09,,91.02,,,,8.98
Afghanistan,1963_[yr1963],,57.91,,,,707.73,0.07,,,,,,,,,,,,,,44.52,,,239.3,,,0.09,,90.72,,,,9.28
Afghanistan,1964_[yr1964],,58.01,,,,839.74,0.09,,,,,,,,,,,,,,45.96,,,235.1,,,0.11,,90.41,,,,9.59
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
American Samoa,1972_[yr1972],,15.00,45.64,2.85,,,,,,,,,,,,,,,,,85.97,,,,,,10.00,,28.82,,,13.45,71.18
American Samoa,1973_[yr1973],,15.00,45.07,2.85,,,,,,,,,,,,,,,,,85.70,,,,,,10.00,,28.42,,,13.68,71.58
American Samoa,1974_[yr1974],,15.00,44.51,2.87,,,,,,,,,,,,,,,,,93.36,,,,,,10.00,,28.03,,,13.91,71.97
American Samoa,1975_[yr1975],,15.00,43.98,2.89,,,,,,,,,,,,,,,,,92.83,,,,,,10.00,,27.65,,,14.19,72.35
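
The `years` index in the output above still carries the raw export labels, such as `1960_[yr1960]`. A minimal sketch of converting such labels to integer years follows. The `labels` series is a stand-in, and it assumes every label begins with a four-digit year.

```python
import pandas as pd

# Stand-in for the raw year labels produced by the World Bank export.
labels = pd.Series(['1960_[yr1960]', '1961_[yr1961]', '2020_[yr2020]'])

# Capture the leading four digits and convert them to integers.
# expand=False returns a Series rather than a one-column DataFrame.
years = labels.str.extract(r'^(\d{4})', expand=False).astype(int)
print(years.tolist())  # [1960, 1961, 2020]
```

The same expression could be applied to the `years` column of `df_melted` before pivoting, so that the index sorts and plots as numbers.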
