# Exploring Available Data



### References

* https://www.kaggle.com/worldbank/world-development-indicators

### Possible Indicators
* Proportion of seats held by women in national parliaments (%)
* Firms with female top manager (% of firms)
* Firms with female participation in ownership (% of firms)
* Employers, female (% of employment)

* Primary to secondary general education transition rate, female (%)
* Pregnant women receiving prenatal care (%)
* Contraceptive prevalence (% of women ages 15-49)

* Unemployment, female (% of female labor force)
* Unemployment, youth female (% of female labor force ages 15-24) (modeled ILO estimate)
* Labor force participation rate for ages 15-24, female (%) (national estimate)
* Labor force participation rate, female (% of female population ages 15+) (national estimate)
* Labor force with primary education, female (% of female labor force)
* Labor force with secondary education, female (% of female labor force)
* Labor force with tertiary education, female (% of female labor force)
* Adult literacy rate, population 15+ years, female (%)
* Youth literacy rate, population 15-24 years, female (%)

* GDP per capita (current US$)
* GDP per capita (constant 2005 US$)
* GDP growth (annual %)

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Indicators Dataset

In [2]:
indicator_data = pd.read_csv('./world-development-indicators/Indicators.csv')
indicator_data.shape

(5656458, 6)

In [3]:
indicator_data.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
0,Arab World,ARB,"Adolescent fertility rate (births per 1,000 wo...",SP.ADO.TFRT,1960,133.5609
1,Arab World,ARB,Age dependency ratio (% of working-age populat...,SP.POP.DPND,1960,87.7976
2,Arab World,ARB,"Age dependency ratio, old (% of working-age po...",SP.POP.DPND.OL,1960,6.634579
3,Arab World,ARB,"Age dependency ratio, young (% of working-age ...",SP.POP.DPND.YG,1960,81.02333
4,Arab World,ARB,Arms exports (SIPRI trend indicator values),MS.MIL.XPRT.KD,1960,3000000.0


In [4]:
indicator_data.dtypes

CountryName       object
CountryCode       object
IndicatorName     object
IndicatorCode     object
Year               int64
Value            float64
dtype: object

In [5]:
countries = indicator_data['CountryName'].unique().tolist()
print(countries,len(countries)) 

['Arab World', 'Caribbean small states', 'Central Europe and the Baltics', 'East Asia & Pacific (all income levels)', 'East Asia & Pacific (developing only)', 'Euro area', 'Europe & Central Asia (all income levels)', 'Europe & Central Asia (developing only)', 'European Union', 'Fragile and conflict affected situations', 'Heavily indebted poor countries (HIPC)', 'High income', 'High income: nonOECD', 'High income: OECD', 'Latin America & Caribbean (all income levels)', 'Latin America & Caribbean (developing only)', 'Least developed countries: UN classification', 'Low & middle income', 'Low income', 'Lower middle income', 'Middle East & North Africa (all income levels)', 'Middle East & North Africa (developing only)', 'Middle income', 'North America', 'OECD members', 'Other small states', 'Pacific island small states', 'Small states', 'South Asia', 'Sub-Saharan Africa (all income levels)', 'Sub-Saharan Africa (developing only)', 'Upper middle income', 'World', 'Afghanistan', 'Albania', '

In [7]:
indicators = indicator_data['IndicatorName'].unique().tolist()
print(indicators,len(indicators)) 

['Adolescent fertility rate (births per 1,000 women ages 15-19)', 'Age dependency ratio (% of working-age population)', 'Age dependency ratio, old (% of working-age population)', 'Age dependency ratio, young (% of working-age population)', 'Arms exports (SIPRI trend indicator values)', 'Arms imports (SIPRI trend indicator values)', 'Birth rate, crude (per 1,000 people)', 'CO2 emissions (kt)', 'CO2 emissions (metric tons per capita)', 'CO2 emissions from gaseous fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (% of total)', 'CO2 emissions from liquid fuel consumption (kt)', 'CO2 emissions from solid fuel consumption (% of total)', 'Death rate, crude (per 1,000 people)', 'Fertility rate, total (births per woman)', 'Fixed telephone subscriptions', 'Fixed telephone subscriptions (per 100 people)', 'Hospital beds (per 1,000 people)', 'International migrant stock (% of population)', 'International migrant stock, total', 'Life expectancy at birth, female (years)', 

In [21]:
print("Indicator year range:",indicator_data['Year'].min(),"-",indicator_data['Year'].max())

Indicator year range: 1960 - 2015


### Country Dataset

In [8]:
country_data = pd.read_csv('./world-development-indicators/Country.csv')
country_data.shape

(247, 31)

In [9]:
country_data.head()

Unnamed: 0,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region,IncomeGroup,Wb2Code,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2013.0,2000.0
1,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,Budgetary central government,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2011.0,2013.0,2006.0
2,DZA,Algeria,Algeria,People's Democratic Republic of Algeria,DZ,Algerian dinar,,Middle East & North Africa,Upper middle income,DZ,...,Budgetary central government,General Data Dissemination System (GDDS),2008,"Multiple Indicator Cluster Survey (MICS), 2012","Integrated household survey (IHS), 1995",,,2010.0,2013.0,2001.0
3,ASM,American Samoa,American Samoa,American Samoa,AS,U.S. dollar,,East Asia & Pacific,Upper middle income,AS,...,,,2010,,,Yes,2007,,,
4,ADO,Andorra,Andorra,Principality of Andorra,AD,Euro,,Europe & Central Asia,High income: nonOECD,AD,...,,,2011. Population data compiled from administra...,,,Yes,,,2006.0,


In [12]:
region = country_data['Region'].unique().tolist()

incomegrp  = country_data['IncomeGroup'].unique().tolist()

len(region), len(incomegrp)

(8, 6)
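
The counts of 8 and 6 are each one higher than the number of categories that `value_counts()` lists for these columns, which is consistent with `unique()` returning the missing value as its own entry. A minimal sketch of the difference, using a small hypothetical frame (`toy` is made up for illustration and is not the real `Country.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for country_data: aggregate rows have no Region.
toy = pd.DataFrame({
    'ShortName': ['Afghanistan', 'Albania', 'Arab World', 'World'],
    'Region': ['South Asia', 'Europe & Central Asia', np.nan, np.nan],
})

n_with_nan = len(toy['Region'].unique())   # NaN is returned as one of the unique values
n_without_nan = toy['Region'].nunique()    # NaN is excluded by default (dropna=True)
print(n_with_nan, n_without_nan)
```

If only the real categories matter, `nunique()` or `dropna().unique()` avoids the off-by-one.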

In [14]:
country_data['Region'].value_counts()

Europe & Central Asia         57
Sub-Saharan Africa            48
Latin America & Caribbean     41
East Asia & Pacific           36
Middle East & North Africa    21
South Asia                     8
North America                  3
Name: Region, dtype: int64

In [16]:
country_data['IncomeGroup'].value_counts()

Upper middle income     53
Lower middle income     51
High income: nonOECD    47
High income: OECD       32
Low income              31
Name: IncomeGroup, dtype: int64

In [19]:
country_data.isnull().any()

CountryCode                                   False
ShortName                                     False
TableName                                     False
LongName                                      False
Alpha2Code                                     True
CurrencyUnit                                   True
SpecialNotes                                   True
Region                                         True
IncomeGroup                                    True
Wb2Code                                        True
NationalAccountsBaseYear                       True
NationalAccountsReferenceYear                  True
SnaPriceValuation                              True
LendingCategory                                True
OtherGroups                                    True
SystemOfNationalAccounts                       True
AlternativeConversionFactor                    True
PppSurveyYear                                  True
BalanceOfPaymentsManualInUse                   True
ExternalDebt

### Test for 2014 data merging

In [96]:
# Filter 2014 data
indicator_14 = indicator_data[indicator_data['Year'] == 2014 ]
print(indicator_14.shape)

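# Filter GDP data
# Build the mask from indicator_14 (not indicator_data) so its index matches the
# frame being filtered, and use a raw string so '\(' is a regex escape.
filter_gdp = indicator_14['IndicatorName'].str.contains(r'GDP per capita \(constant 2005')
indicator_GDP14 = indicator_14[filter_gdp]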
print(indicator_GDP14.shape)

(107389, 6)
(212, 6)



In [97]:
indicator_GDP14.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
5534378,Arab World,ARB,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,4548.529662
5534629,Caribbean small states,CSS,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,7458.860004
5534888,Central Europe and the Baltics,CEB,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,10646.24982
5535235,East Asia & Pacific (all income levels),EAS,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,6465.238232
5535536,East Asia & Pacific (developing only),EAP,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,3253.864486


In [98]:
data_GDP14 = indicator_GDP14.merge(country_data[['Region','CountryCode','IncomeGroup','ShortName']], on='CountryCode', how='inner')

In [99]:
data_GDP14.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value,Region,IncomeGroup,ShortName
0,Arab World,ARB,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,4548.529662,,,Arab World
1,Caribbean small states,CSS,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,7458.860004,,,Caribbean small states
2,Central Europe and the Baltics,CEB,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,10646.24982,,,Central Europe and the Baltics
3,East Asia & Pacific (all income levels),EAS,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,6465.238232,,,East Asia & Pacific (all income levels)
4,East Asia & Pacific (developing only),EAP,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,3253.864486,,,East Asia & Pacific (developing only)


In [100]:
data_GDP14 = data_GDP14.dropna()
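
The merged rows above show that aggregates such as Arab World carry no `Region` or `IncomeGroup`, so `dropna()` removes them. Note that `dropna()` with no arguments drops a row if any column is missing. The sketch below, on hypothetical toy rows (the `Note` column is invented for illustration), compares it with restricting the check to `Region` via `subset`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one aggregate (no Region) and one country
# that is missing a value in an unrelated, made-up column.
merged = pd.DataFrame({
    'CountryCode': ['ARB', 'AFG', 'ALB'],
    'Value': [4548.5, 406.2, 3897.1],
    'Region': [np.nan, 'South Asia', 'Europe & Central Asia'],
    'Note': ['x', np.nan, 'y'],
})

all_cols = merged.dropna()                       # drops ARB and AFG
region_only = merged.dropna(subset=['Region'])   # drops only ARB
print(all_cols['CountryCode'].tolist(), region_only['CountryCode'].tolist())
```

With the four columns selected in the merge here the two calls may well give the same result; `subset` only matters once columns with unrelated gaps are carried along.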

In [101]:
data_GDP14.tail()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value,Region,IncomeGroup,ShortName
207,"Venezuela, RB",VEN,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,6088.02729,Latin America & Caribbean,High income: nonOECD,Venezuela
208,Vietnam,VNM,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,1077.909089,East Asia & Pacific,Lower middle income,Vietnam
209,West Bank and Gaza,WBG,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,1389.881434,Middle East & North Africa,Lower middle income,West Bank and Gaza
210,Zambia,ZMB,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,1032.802532,Sub-Saharan Africa,Lower middle income,Zambia
211,Zimbabwe,ZWE,GDP per capita (constant 2005 US$),NY.GDP.PCAP.KD,2014,458.102904,Sub-Saharan Africa,Low income,Zimbabwe


In [102]:
data_GDP14 = data_GDP14.drop(columns='IndicatorCode')

In [103]:
data_GDP14.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,Year,Value,Region,IncomeGroup,ShortName
33,Afghanistan,AFG,GDP per capita (constant 2005 US$),2014,406.248136,South Asia,Low income,Afghanistan
34,Albania,ALB,GDP per capita (constant 2005 US$),2014,3897.1308,Europe & Central Asia,Upper middle income,Albania
35,Algeria,DZA,GDP per capita (constant 2005 US$),2014,3390.932843,Middle East & North Africa,Upper middle income,Algeria
36,Antigua and Barbuda,ATG,GDP per capita (constant 2005 US$),2014,11881.096762,Latin America & Caribbean,High income: nonOECD,Antigua and Barbuda
37,Argentina,ARG,GDP per capita (constant 2005 US$),2014,7663.728453,Latin America & Caribbean,High income: nonOECD,Argentina


In [105]:
data_GDP14.shape

(179, 8)

### Make the other filtered tables

In [87]:
# Candidate indicators for the remaining tables:
# Proportion of seats held by women in national parliaments (%)
# Firms with female top manager (% of firms)
# Firms with female participation in ownership (% of firms)
# Employers, female (% of employment)
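
Since each of these indicators needs the same "match the name, keep one year" step, it could be wrapped in a small helper. This is only a sketch: `filter_indicator` and the toy frame are hypothetical names and values, not part of the notebook, and the match is a literal substring (`regex=False`) so parentheses in indicator names need no escaping.

```python
import pandas as pd

def filter_indicator(df, name_fragment, year):
    """Rows of df whose IndicatorName contains name_fragment (literal) for one year."""
    mask = (
        df['IndicatorName'].str.contains(name_fragment, regex=False)
        & (df['Year'] == year)
    )
    return df[mask]

# Hypothetical toy data with the same column names as Indicators.csv.
toy_indicators = pd.DataFrame({
    'CountryCode': ['AFG', 'AFG', 'ALB', 'ALB'],
    'IndicatorName': [
        'Proportion of seats held by women in national parliaments (%)',
        'GDP per capita (constant 2005 US$)',
        'Proportion of seats held by women in national parliaments (%)',
        'Proportion of seats held by women in national parliaments (%)',
    ],
    'Year': [2014, 2014, 2014, 2013],
    'Value': [27.7, 406.2, 20.0, 17.9],  # illustrative values only
})

seats_14 = filter_indicator(toy_indicators, 'Proportion of seats held by women', 2014)
gdp_14 = filter_indicator(toy_indicators, 'GDP per capita (constant 2005', 2014)
print(seats_14.shape, gdp_14.shape)
```

Building the mask from the same frame that is indexed keeps the boolean index aligned.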


In [93]:
# Filter Seats in Parliament
filter_pol = indicator_data['IndicatorName'].str.contains('Proportion of seats held by women')
data_pol = indicator_data[filter_pol]
data_pol = data_pol[data_pol['Year'] == 2014]

data_pol.head()

Unnamed: 0,CountryName,CountryCode,IndicatorName,IndicatorCode,Year,Value
5534510,Arab World,ARB,Proportion of seats held by women in national ...,SG.GEN.PARL.ZS,2014,17.79397
5534753,Caribbean small states,CSS,Proportion of seats held by women in national ...,SG.GEN.PARL.ZS,2014,17.158691
5535075,Central Europe and the Baltics,CEB,Proportion of seats held by women in national ...,SG.GEN.PARL.ZS,2014,19.697632
5535360,East Asia & Pacific (all income levels),EAS,Proportion of seats held by women in national ...,SG.GEN.PARL.ZS,2014,18.623405
5535849,East Asia & Pacific (developing only),EAP,Proportion of seats held by women in national ...,SG.GEN.PARL.ZS,2014,19.014539


In [94]:
data_pol.shape

(220, 6)

In [107]:
data_polGDP = data_pol.merge(data_GDP14[['Region','CountryCode','IncomeGroup','ShortName']], 
                             on='CountryCode', how='inner')
data_polGDP.dropna(inplace=True)
data_polGDP.head(), data_polGDP.shape

(           CountryName CountryCode  \
 0          Afghanistan         AFG   
 1              Albania         ALB   
 2              Algeria         DZA   
 3  Antigua and Barbuda         ATG   
 4            Argentina         ARG   
 
                                        IndicatorName   IndicatorCode  Year  \
 0  Proportion of seats held by women in national ...  SG.GEN.PARL.ZS  2014   
 1  Proportion of seats held by women in national ...  SG.GEN.PARL.ZS  2014   
 2  Proportion of seats held by women in national ...  SG.GEN.PARL.ZS  2014   
 3  Proportion of seats held by women in national ...  SG.GEN.PARL.ZS  2014   
 4  Proportion of seats held by women in national ...  SG.GEN.PARL.ZS  2014   
 
    Value                      Region           IncomeGroup  \
 0   27.7                  South Asia            Low income   
 1   20.0       Europe & Central Asia   Upper middle income   
 2   31.6  Middle East & North Africa   Upper middle income   
 3   11.1   Latin America & Caribbea

In [3]:
notes_ind = pd.read_csv('./world-development-indicators/series.csv')
notes_ind.tail()

Unnamed: 0,SeriesCode,Topic,IndicatorName,ShortDefinition,LongDefinition,UnitOfMeasure,Periodicity,BasePeriod,OtherNotes,AggregationMethod,LimitationsAndExceptions,NotesFromOriginalSource,GeneralComments,Source,StatisticalConceptAndMethodology,DevelopmentRelevance,RelatedSourceLinks,OtherWebLinks,RelatedIndicators,LicenseType
1340,SL.UEM.1524.FE.NE.ZS,Social Protection & Labor: Unemployment,"Unemployment, youth female (% of female labor ...",,Youth unemployment refers to the share of the ...,,Annual,,,Weighted average,Data on youth unemployment are drawn from labo...,,"Data are based on labor force sample surveys, ...","International Labour Organization, Key Indicat...",The standard definition of unemployed persons ...,Youth unemployment is an important policy issu...,,,,Open
1341,SL.UEM.1524.MA.ZS,Social Protection & Labor: Unemployment,"Unemployment, youth male (% of male labor forc...",,Youth unemployment refers to the share of the ...,,Annual,,,Weighted average,There may be persons not currently in the labo...,,The unemployment rates presented here are the ...,"International Labour Organization, Key Indicat...",The standard definition of unemployed persons ...,Youth unemployment is an important policy issu...,,,,Open
1342,SL.UEM.1524.MA.NE.ZS,Social Protection & Labor: Unemployment,"Unemployment, youth male (% of male labor forc...",,Youth unemployment refers to the share of the ...,,Annual,,,Weighted average,Data on youth unemployment are drawn from labo...,,"Data are based on labor force sample surveys, ...","International Labour Organization, Key Indicat...",The standard definition of unemployed persons ...,Youth unemployment is an important policy issu...,,,,Open
1343,SL.UEM.1524.ZS,Social Protection & Labor: Unemployment,"Unemployment, youth total (% of total labor fo...",,Youth unemployment refers to the share of the ...,,Annual,,,Weighted average,There may be persons not currently in the labo...,,The unemployment rates presented here are the ...,"International Labour Organization, Key Indicat...",The standard definition of unemployed persons ...,Youth unemployment is an important policy issu...,,,,Open
1344,SL.UEM.1524.NE.ZS,Social Protection & Labor: Unemployment,"Unemployment, youth total (% of total labor fo...",,Youth unemployment refers to the share of the ...,,Annual,,,Weighted average,Data on youth unemployment are drawn from labo...,,"Data are based on labor force sample surveys, ...","International Labour Organization, Key Indicat...",The standard definition of unemployed persons ...,Youth unemployment is an important policy issu...,,,,Open
