# Exploring Datasets for the Mid-Term Project

The mid-term project consists of an exploratory data analysis of one of the three datasets we've seen in the course. In this notebook, I'll explore the datasets in search of an interesting research question to use in the project.

## First Choice:

The economic variables dataset.

In [1]:
import pandas as pd

In [2]:
path = '/Users/Propietario/Desktop/Docs/Kaggle/world-development-indicators/'
data = pd.read_csv(path+'indicators.csv')
data.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
0,Arab World,ARB,"Adolescent fertility rate (births per 1,000 wo...",SP.ADO.TFRT,1960,133.5609
1,Arab World,ARB,Age dependency ratio (% of working-age populat...,SP.POP.DPND,1960,87.7976
2,Arab World,ARB,"Age dependency ratio, old (% of working-age po...",SP.POP.DPND.OL,1960,6.634579
3,Arab World,ARB,"Age dependency ratio, young (% of working-age ...",SP.POP.DPND.YG,1960,81.02333
4,Arab World,ARB,Arms exports (SIPRI trend indicator values),MS.MIL.XPRT.KD,1960,3000000.0


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5656458 entries, 0 to 5656457
Data columns (total 6 columns):
CountryName      object
CountryCode      object
IndicatorName    object
IndicatorCode    object
Year             int64
Value            float64
dtypes: float64(1), int64(1), object(4)
memory usage: 258.9+ MB


In [4]:
len(data.IndicatorName.unique())

1344

In [5]:
name_2_code = data[['IndicatorName','IndicatorCode']].drop_duplicates().set_index('IndicatorName').to_dict()
code_2_name = data[['IndicatorName','IndicatorCode']].drop_duplicates().set_index('IndicatorCode').to_dict()
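A note on the shape of these lookups: calling `to_dict()` on a one-column DataFrame returns a nested dict keyed by the column name first, which is why later cells index them as `code_2_name['IndicatorName'][ind]`. The sketch below, on a tiny hand-built frame (the `toy`, `pairs`, `nested` and `flat` names are illustrative, not part of the notebook), compares that nested form with a flat dict obtained by selecting the column before converting.

```python
import pandas as pd

# Toy rows reusing two indicator name/code pairs that appear in this notebook.
toy = pd.DataFrame({
    'IndicatorName': [
        'Adolescent fertility rate (births per 1,000 women ages 15-19)',
        'Adolescent fertility rate (births per 1,000 women ages 15-19)',
        'CPIA gender equality rating (1=low to 6=high)',
    ],
    'IndicatorCode': ['SP.ADO.TFRT', 'SP.ADO.TFRT', 'IQ.CPA.GNDR.XQ'],
})

pairs = toy[['IndicatorName', 'IndicatorCode']].drop_duplicates()

# DataFrame.to_dict() -> {column name: {index value: cell value}}
nested = pairs.set_index('IndicatorCode').to_dict()

# Series.to_dict() -> {index value: cell value}
flat = pairs.set_index('IndicatorCode')['IndicatorName'].to_dict()

print(nested['IndicatorName']['IQ.CPA.GNDR.XQ'])
print(flat['IQ.CPA.GNDR.XQ'])
```

Both lookups return the same name; the flat version just drops the extra `['IndicatorName']` level.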

That's a lot of indicators. Let's print them to see if anything jumps out at me.

In [6]:
data = data.sort_values(by='IndicatorName',ascending=True)
for ind in data.IndicatorName.unique():
    print(ind)

2005 PPP conversion factor, GDP (LCU per international $)
2005 PPP conversion factor, private consumption (LCU per international $)
ARI treatment (% of children under 5 taken to a health provider)
Access to electricity (% of population)
Access to electricity, rural (% of rural population)
Access to electricity, urban (% of urban population)
Access to non-solid fuel (% of population)
Access to non-solid fuel, rural (% of rural population)
Access to non-solid fuel, urban (% of urban population)
Adequacy of social insurance programs (% of total welfare of beneficiary households)
Adequacy of social protection and labor programs (% of total welfare of beneficiary households)
Adequacy of social safety net programs (% of total welfare of beneficiary households)
Adequacy of unemployment benefits and ALMP (% of total welfare of beneficiary households)
Adjusted net enrolment rate, primary, both sexes (%)
Adjusted net enrolment rate, primary, female (%)
Adjusted net enrolment rate, primary, male 

There are many themes that can be tracked through these indicators, which include:

* Macroeconomic behaviour
* Gender equality
* Sustainability

Of these, a comparative analysis of gender equality between countries (or perhaps as a time series) seems like a good, relevant topic in which to find an interesting question. Let's see where this leads us.

In [7]:
equality = data.loc[data.IndicatorName.str.contains('male|gender|inclusion|wom.n',case= False),['CountryName','Year','IndicatorCode','Value']]
for ind in equality.IndicatorCode.unique():
    print(ind, code_2_name['IndicatorName'][ind])

('SE.PRM.TENR.FE', 'Adjusted net enrolment rate, primary, female (%)')
('SE.PRM.TENR.MA', 'Adjusted net enrolment rate, primary, male (%)')
('SP.ADO.TFRT', 'Adolescent fertility rate (births per 1,000 women ages 15-19)')
('SE.ADT.LITR.FE.ZS', 'Adult literacy rate, population 15+ years, female (%)')
('SE.ADT.LITR.MA.ZS', 'Adult literacy rate, population 15+ years, male (%)')
('SL.TLF.0714.SW.FE.TM', 'Average working hours of children, study and work, female, ages 7-14 (hours per week)')
('SL.TLF.0714.SW.MA.TM', 'Average working hours of children, study and work, male, ages 7-14 (hours per week)')
('SL.TLF.0714.WK.FE.TM', 'Average working hours of children, working only, female, ages 7-14 (hours per week)')
('SL.TLF.0714.WK.MA.TM', 'Average working hours of children, working only, male, ages 7-14 (hours per week)')
('IQ.CPA.GNDR.XQ', 'CPIA gender equality rating (1=low to 6=high)')
('IQ.CPA.SOCI.XQ', 'CPIA policies for social inclusion/equity cluster average (1=low to 6=high)')
('SL.AGR.

In [8]:
len(equality.IndicatorCode.unique())

190
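The filter in `In [7]` relies on `str.contains` treating its argument as a regular expression and matching anywhere in the string. A small sketch on a handful of indicator names taken from the outputs above shows what that implies: `male` also matches "female", `wom.n` matches both "women" and "woman", and `case=False` ignores capitalisation. The `names`, `pattern` and `mask` variables are illustrative only.

```python
import pandas as pd

names = pd.Series([
    'Adult literacy rate, population 15+ years, female (%)',
    'Adult literacy rate, population 15+ years, male (%)',
    'Adolescent fertility rate (births per 1,000 women ages 15-19)',
    'CPIA gender equality rating (1=low to 6=high)',
    'Access to electricity (% of population)',
])

# Same pattern as the notebook: alternatives separated by '|',
# '.' stands for any single character.
pattern = 'male|gender|inclusion|wom.n'
mask = names.str.contains(pattern, case=False)

for name, hit in zip(names, mask):
    print(hit, name)
```

Because the match is a plain substring search, the slice also picks up indicators that only mention women (such as the fertility rate) and have no male counterpart, which is worth keeping in mind when reading the count of 190.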

In [9]:
equality.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 571582 entries, 1106987 to 5498228
Data columns (total 4 columns):
CountryName      571582 non-null object
Year             571582 non-null int64
IndicatorCode    571582 non-null object
Value            571582 non-null float64
dtypes: float64(1), int64(1), object(2)
memory usage: 21.8+ MB


**Note:** There's an indicator (not in the equality slice) about gender equality policies, so an interesting question could be whether countries that have better equality policies actually achieve their goals or not. Let's try to answer it.
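One way to make this question testable, sketched here on invented numbers: take a policy score, compute a female-minus-male gap in some outcome, and check whether the two move together across countries. The sketch assumes the CPIA gender equality rating (`IQ.CPA.GNDR.XQ`) as the policy measure and adult literacy as the outcome; both choices, the `literacy_gap` column and the four countries are hypothetical, and the values are made up purely to show the mechanics.

```python
import pandas as pd

# INVENTED values for four hypothetical countries; not taken from the dataset.
wide = pd.DataFrame({
    'CountryName': ['A', 'B', 'C', 'D'],
    'IQ.CPA.GNDR.XQ': [2.5, 3.0, 3.5, 4.0],        # policy rating
    'SE.ADT.LITR.FE.ZS': [50.0, 72.0, 68.0, 87.0],  # female literacy (%)
    'SE.ADT.LITR.MA.ZS': [70.0, 80.0, 80.0, 90.0],  # male literacy (%)
})

# Negative gap means women lag behind men; closer to zero is more equal.
wide['literacy_gap'] = wide['SE.ADT.LITR.FE.ZS'] - wide['SE.ADT.LITR.MA.ZS']

# Rank correlation (Spearman) computed as the Pearson correlation of ranks,
# so no extra dependency is needed.
rho = wide['IQ.CPA.GNDR.XQ'].rank().corr(wide['literacy_gap'].rank())
print(round(rho, 2))
```

A rank correlation is used because the rating is an ordinal 1 to 6 scale. On real data the same steps would need one row per country (for example, a single year or an average over years) and would only indicate association, not that the policies caused the outcome.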

In [10]:
# unstacked = equality.set_index(['CountryName','Year','IndicatorCode']).unstack()
pivoted = pd.pivot_table(equality,index=['CountryName','Year'],columns='IndicatorCode',values='Value').reset_index()
pivoted.head()

IndicatorCode,CountryName,Year,IC.FRM.FEMM.ZS,IC.FRM.FEMO.ZS,IQ.CPA.GNDR.XQ,IQ.CPA.SOCI.XQ,SE.ADT.1524.LT.FE.ZS,SE.ADT.1524.LT.FM.ZS,SE.ADT.1524.LT.MA.ZS,SE.ADT.LITR.FE.ZS,...,SP.DYN.LE00.FE.IN,SP.DYN.LE00.MA.IN,SP.DYN.TFRT.IN,SP.DYN.TO65.FE.ZS,SP.DYN.TO65.MA.ZS,SP.DYN.WFRT,SP.HOU.FEMA.ZS,SP.MTR.1519.ZS,SP.POP.TOTL.FE.ZS,SP.UWT.TFRT
0,Afghanistan,1960,,,,,,,,,...,33.105,31.589,7.45,21.17069,17.59371,,,,48.306615,
1,Afghanistan,1961,,,,,,,,,...,33.557,32.035,7.45,21.70654,18.0825,,,,48.396679,
2,Afghanistan,1962,,,,,,,,,...,34.001,32.476,7.45,22.2424,18.5713,,,,48.481309,
3,Afghanistan,1963,,,,,,,,,...,34.44,32.913,7.45,22.77282,19.06977,,,,48.560586,
4,Afghanistan,1964,,,,,,,,,...,34.875,33.348,7.45,23.30325,19.56825,,,,48.634625,
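The many empty cells in this output follow from how `pivot_table` reshapes the data: each (country, year) pair becomes one row, each indicator code becomes one column, and any combination missing from the long table is filled with `NaN`. A minimal sketch with made-up values (the countries `X` and `Y` and the `long`, `wide` and `coverage` names are hypothetical) shows the reshape and a quick way to measure how complete each indicator column is.

```python
import pandas as pd

# Long format: one row per (country, year, indicator). Values are invented.
long = pd.DataFrame({
    'CountryName': ['X', 'X', 'Y'],
    'Year': [2000, 2000, 2000],
    'IndicatorCode': ['SE.ADT.LITR.FE.ZS', 'SE.ADT.LITR.MA.ZS', 'SE.ADT.LITR.FE.ZS'],
    'Value': [80.0, 90.0, 70.0],
})

# Wide format: one column per indicator; country Y has no male literacy row,
# so that cell becomes NaN.
wide = pd.pivot_table(long, index=['CountryName', 'Year'],
                      columns='IndicatorCode', values='Value').reset_index()

# Share of non-missing rows per indicator column.
coverage = wide.drop(columns=['CountryName', 'Year']).notna().mean()

print(wide)
print(coverage)
```

Note that `pivot_table` aggregates with the mean by default, so duplicate (country, year, indicator) rows would be averaged rather than raising an error.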


In [11]:
inds_de_mex = equality.loc[(equality.CountryName=='Mexico')&(equality.Year > 2000),'IndicatorCode'].unique()
usefuls = equality[equality.IndicatorCode.isin(inds_de_mex)]
for ind in usefuls.IndicatorCode.unique():
    print(ind, code_2_name['IndicatorName'][ind])

('SE.PRM.TENR.FE', 'Adjusted net enrolment rate, primary, female (%)')
('SE.PRM.TENR.MA', 'Adjusted net enrolment rate, primary, male (%)')
('SP.ADO.TFRT', 'Adolescent fertility rate (births per 1,000 women ages 15-19)')
('SE.ADT.LITR.FE.ZS', 'Adult literacy rate, population 15+ years, female (%)')
('SE.ADT.LITR.MA.ZS', 'Adult literacy rate, population 15+ years, male (%)')
('SL.TLF.0714.SW.FE.TM', 'Average working hours of children, study and work, female, ages 7-14 (hours per week)')
('SL.TLF.0714.SW.MA.TM', 'Average working hours of children, study and work, male, ages 7-14 (hours per week)')
('SL.TLF.0714.WK.FE.TM', 'Average working hours of children, working only, female, ages 7-14 (hours per week)')
('SL.TLF.0714.WK.MA.TM', 'Average working hours of children, working only, male, ages 7-14 (hours per week)')
('SL.AGR.0714.FE.ZS', 'Child employment in agriculture, female (% of female economically active children ages 7-14)')
('SL.AGR.0714.MA.ZS', 'Child employment in agriculture, m