## Model one people variables

This notebook extracts the people variables listed in `indicator_list` from the IMF and World Bank (wb) data sources and writes them to a single CSV file.

In [4]:
import warnings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [5]:
warnings.filterwarnings('ignore')
pd.options.display.float_format = '{:20,.2f}'.format

|  variable                 | origin            | source      |granularity|countries|   description                                               | composition                                                       |
| --------------------------|-------------------|-------------|-----------|---------|-------------------------------------------------------------|-------------------------------------------------------------------|
| population                | -                 | world bank  | yearly    | 217     | Population, total                                           | Unweighted sum                                                    |
| population growth         | -                 | world bank  | yearly    | 217     | Population growth (annual %)                                 | -                                                                 |
| infant mortality rate     | -                 | world bank  | yearly    | 217     | Mortality rate, infant (per 1,000 live births)          | -                                                                 |
| birth rate                | -                 | world bank  | yearly    | 217     | Birth rate, crude (per 1,000 people)                     | -                                                                 |
| fertility rate            | -                 | world bank  | yearly    | 217     | Fertility rate, total (births per woman)                  | -                                                                 |
| rural population          | -                 | world bank  | yearly    | 217     | Rural population (% of total population)                 | -                                                                 |
| rural population growth   | -                 | world bank  | yearly    | 217     | Rural population growth (annual %)                          | -                                                                 |
| urban population          | -                 | world bank  | yearly    | 217     | Urban population (% of total)                               | -                                                                 |
| urban population growth   | -                 | world bank  | yearly    | 217     | Urban population growth (annual %)                          | -                                                                 |
| life expectancy           | -                 | world bank  | yearly    | 217     | Life expectancy at birth, total (years)                     | -                                                                 |
| population aged 15-64     | -                 | world bank  | yearly    | 217     | Population ages 15-64 (% of total)                      | -                                                                 |
| population density        | -                 | world bank  | yearly    | 217     | Population density (people per sq. km of land area)        | -                                                                 |
| adults overweight         | -                 | world bank  | yearly    | 217     | Prevalence of overweight (% of adults)                     | -                                                                 |
| informal employment       | -                 | wb econ     | yearly    | 217     | Informal employment (% of total non-agricultural employment)| -                                                                 |
| consumption growth        | -                 | wb econ     | yearly    | 217     | Final consumption expenditure (annual % growth)           | -                                                                 |
| consumer price inflation  | -                 | wb econ     | yearly    | 217     | Inflation, consumer prices (annual %)                  | -                                                                 |
| labor force               | -                 | imf pplt    | monthly   | 208     | Labor Force, Persons, Number of                             | Unweighted sum                                                    |
| wage rates change         | -                 | imf pplt    | monthly   | 208     | Wage rates, Percentage change, previous period, Percent    | -                                                                 |
| unemployed                | -                 | imf pplt    | monthly   | 208     | Unemployment, Persons, Number of                             | Unweighted sum                                                    |
| unemployment rate         | -                 | world bank  | yearly    | 217     | Unemployment, total (% of total labor force) (national estimate)| -                                                             |
| cpi change all items      | -                 | imf cpi     | monthly   | 189     | Consumer Price Index, All items, Percentage change, Previous period| -                                                           |
| cpi all items             | -                 | imf cpi     | monthly   | 189     | Consumer Price Index, All items                             | -                                                                 |
| cpi housing weight        | -                 | imf cpi     | monthly   | 189     | Housing, Water, Electricity, Gas and Other Fuels, Weight, Percent| Percentage                                                    |
| cpi food weight           | -                 | imf cpi     | monthly   | 189     | Food and non-alcoholic beverages, Weight, Percent  | Percentage                                                        |
| cpi education weight      | -                 | imf cpi     | monthly   | 189     | Education, Weight, Percent                                   | Percentage                                                        |
| cpi health weight         | -                 | imf cpi     | monthly   | 189     | Health, Weight, Percent                                     | Percentage                                                        |
| cpi transport weight      | -                 | imf cpi     | monthly   | 189     | Transport, Weight, Percent                                   | Percentage                                                        |
| cpi leisure weight        | -                 | imf cpi     | monthly   | 189     | Recreation and culture, Weight, Percent                     | Percentage                                                        |


In [77]:
indicator_list = ['Population, total', 'Population growth (annual %)', 'Mortality rate, infant (per 1,000 live births)',
                  'Birth rate, crude (per 1,000 people)', 'Fertility rate, total (births per woman)',
                  'Rural population (% of total population)', 'Rural population growth (annual %)',
                  'Urban population (% of total)', 'Urban population growth (annual %)',
                  'Life expectancy at birth, total (years)', 'Population ages 15-64 (% of total)',
                  'Population density (people per sq. km of land area)', 'Prevalence of overweight (% of adults)',
                  'Informal employment (% of total non-agricultural employment)',
                  'Final consumption expenditure (annual % growth)', 'Inflation, consumer prices (annual %)',
                  'Labor Force, Persons, Number of', 'Wage rates, Percentage change, previous period, Percent',
                  'Unemployment, Persons, Number of', 'Unemployment, total (% of total labor force) (national estimate)',
                  'Consumer Price Index, All items, Percentage change, Previous period', 'Consumer Price Index, All items',
                  'Housing, Water, Electricity, Gas and Other Fuels, Weight, Percent',
                  'Food and non-alcoholic beverages, Weight, Percent', 'Education, Weight, Percent',
                  'Health, Weight, Percent', 'Transport, Weight, Percent', 'Recreation and culture, Weight, Percent']

In [78]:
len(indicator_list)

28

### Load IMF monthly data

In [79]:
%%bash
wc -l imf/*.csv

  365536 imf/BOP_11-25-2018 19-15-19-60_timeSeries.csv
      64 imf/COMMP_11-25-2018 19-13-52-15_timeSeries.csv
   14430 imf/CPI_11-25-2018 19-14-47-26_timeSeries.csv
    1693 imf/FDI_11-20-2018 21-39-31-89_timeSeries.csv
 1247714 imf/GFSR_11-25-2018 19-23-39-70_timeSeries.csv
   16732 imf/IRFCL_11-25-2018 19-13-18-05_timeSeries.csv
    7846 imf/ITS_11-14-2018 15-14-06-02_timeSeries.csv
    7425 imf/PPLT_11-25-2018 19-25-01-32_timeSeries.csv
 1661440 total


In [80]:
# Column labels of the monthly series, e.g. '1960M1' ... '2017M12'
time_values = ['%sM%s' % (y, m) for m in range(1, 13) for y in range(1960, 2018)]
imf_columns = ['Country Name', 'Indicator Name'] + time_values

In [81]:
imf_country_aggregates = ['Euro Area']

In [82]:
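def load_imf_monthly(file_name, indicators, imf_columns, country_aggregates):
    """Load an IMF time-series CSV and return the selected indicators in long format."""
    # Missing observations are replaced with 0 in every column at load time.
    csv_df = pd.read_csv('imf/%s' % file_name).fillna(0)
    # Keep only the rows whose 'Attribute' is 'Value', then drop that column.
    base_df = csv_df.loc[csv_df['Attribute'] == 'Value'].drop(columns=['Attribute'])
    monthly_df = base_df.loc[base_df['Indicator Name'].isin(indicators)]
    imf_df = monthly_df[imf_columns]
    # Wide to long: one row per country, indicator and month.
    df = pd.melt(imf_df, id_vars=['Country Name', 'Indicator Name'], var_name='date', value_name='value')
    df['date'] = pd.to_datetime(df['date'], format='%YM%m')
    df.columns = ['country', 'indicator', 'date', 'value']
    # Drop aggregate regions so that only individual countries remain.
    return df.loc[~df['country'].isin(country_aggregates)]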
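
The reshaping step in `load_imf_monthly` can be illustrated on a tiny hand-made frame. This is only a minimal sketch: the two rows and their values are invented for illustration, and only the column labels follow the `1960M1` pattern built in `time_values`.

```python
import pandas as pd

# Toy wide-format frame (invented values) mimicking the IMF column layout.
wide = pd.DataFrame({
    'Country Name': ['France', 'Brazil'],
    'Indicator Name': ['Consumer Price Index, All items'] * 2,
    '1960M1': [9.8, 0.0],
    '1960M2': [9.9, 0.0],
})

# Wide to long: one row per (country, indicator, month).
long_df = pd.melt(wide, id_vars=['Country Name', 'Indicator Name'],
                  var_name='date', value_name='value')

# '1960M1' becomes 1960-01-01; the literal 'M' in the format acts as a separator.
long_df['date'] = pd.to_datetime(long_df['date'], format='%YM%m')
long_df.columns = ['country', 'indicator', 'date', 'value']
```

Two wide rows with two month columns give four long rows, which is why the long frames in this notebook are so much larger than the source files.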

In [83]:
imf_pplt_df = load_imf_monthly('PPLT_11-25-2018 19-25-01-32_timeSeries.csv', indicator_list, imf_columns, imf_country_aggregates)

In [84]:
imf_cpi_df = load_imf_monthly('CPI_11-25-2018 19-14-47-26_timeSeries.csv', indicator_list, imf_columns, imf_country_aggregates)

In [85]:
imf_df = pd.concat([imf_cpi_df, imf_pplt_df], join='outer')

In [86]:
imf_df.size

4047936
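
`DataFrame.size` counts cells (rows times columns) rather than rows, so with the four columns produced by `load_imf_monthly` the figure above corresponds to 4047936 / 4 = 1011984 rows. A minimal sketch of the difference, on an invented three-row frame:

```python
import pandas as pd

# Invented frame with the same four columns as imf_df.
toy = pd.DataFrame({
    'country': ['A', 'B', 'C'],
    'indicator': ['x', 'x', 'x'],
    'date': pd.to_datetime(['1960-01-01'] * 3),
    'value': [1.0, 2.0, 3.0],
})

n_cells = toy.size   # rows * columns
n_rows = len(toy)    # rows only
```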

In [87]:
imf_df.head(15)

Unnamed: 0,country,indicator,date,value
0,Brazil,"Consumer Price Index, All items, Percentage ch...",1960-01-01,0.0
1,Brazil,"Consumer Price Index, All items",1960-01-01,0.0
2,France,"Consumer Price Index, All items",1960-01-01,9.797458
3,Bulgaria,"Consumer Price Index, All items",1960-01-01,0.0
4,France,"Consumer Price Index, All items, Percentage ch...",1960-01-01,1.32398621801728
5,Bulgaria,"Consumer Price Index, All items, Percentage ch...",1960-01-01,0.0
6,Latvia,"Consumer Price Index, All items",1960-01-01,0.0
7,Honduras,"Consumer Price Index, All items",1960-01-01,0.0
8,Latvia,"Consumer Price Index, All items, Percentage ch...",1960-01-01,0.0
9,Colombia,"Consumer Price Index, All items",1960-01-01,0.0


In [88]:
len(imf_df['country'].unique())

191

In [89]:
imf_countries = sorted(list(imf_df['country'].unique()))

### Load world bank yearly data

In [90]:
%%bash
wc -l world_bank/*.csv

   33534 world_bank/ECON.csv
    9589 world_bank/HNP.csv
      38 world_bank/HNP_indicator_definitions.csv
   36174 world_bank/POP.csv
   79335 total


In [91]:
wb_country_aggregates = ['nan', 'Lower middle income', 'Post-demographic dividend', 'High income',
                         'Pre-demographic dividend', 'East Asia & Pacific (IDA & IBRD countries)',
                         'Europe & Central Asia (excluding high income)', 'Heavily indebted poor countries (HIPC)',
                         'Caribbean small states', 'Pacific island small states', 'Middle income',
                         'Late-demographic dividend', 'OECD members', 'IDA & IBRD total', 'Not classified', 
                         'East Asia & Pacific (excluding high income)',
                         'Latin America & the Caribbean (IDA & IBRD countries)', 'Low income', 'Low & middle income',
                         'IDA blend', 'IBRD only', 'Sub-Saharan Africa (excluding high income)', 
                         'Fragile and conflict affected situations', 'Europe & Central Asia (IDA & IBRD countries)',
                         'Euro area', 'Other small states', 'Europe & Central Asia', 'Arab World',
                         'Latin America & Caribbean (excluding high income)', 
                         'Sub-Saharan Africa (IDA & IBRD countries)', 'Early-demographic dividend', 'IDA only',
                         'Small states', 'Middle East & North Africa (excluding high income)', 'East Asia & Pacific',
                         'South Asia', 'European Union', 'Least developed countries: UN classification',
                         'Middle East & North Africa (IDA & IBRD countries)', 'Upper middle income',
                         'South Asia (IDA & IBRD)', 'Central Europe and the Baltics', 'Sub-Saharan Africa', 
                         'Latin America & Caribbean', 'Middle East & North Africa', 'IDA total', 'North America',
                         'Last Updated: 11/14/2018', 'Data from database: World Development Indicators', 'World']

In [92]:
# Yearly column labels of the World Bank export, e.g. '1960 [YR1960]'
wb_cols = ['Country Name', 'Series Name'] + ['%s [YR%s]' % (y, y) for y in range(1960, 2018)]

In [93]:
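def load_wb_yearly(file_name, indicators, wb_columns, country_aggregates):
    """Load a World Bank CSV and return the selected indicators in long format."""
    # Missing observations are replaced with 0 in every column at load time.
    csv_df = pd.read_csv('world_bank/%s' % file_name).fillna(0)
    base_df = csv_df.loc[csv_df['Series Name'].isin(indicators)]
    wb_df = base_df[wb_columns]
    # Wide to long: one row per country, indicator and year.
    df = pd.melt(wb_df, id_vars=['Country Name', 'Series Name'], var_name='date', value_name='value')
    # Labels look like '1960 [YR1960]'; keep the leading year and parse it.
    df['date'] = pd.to_datetime(df['date'].map(lambda x: int(x.split(' ')[0])), format='%Y')
    df.columns = ['country', 'indicator', 'date', 'value']
    # Drop aggregate regions and footer rows so that only individual countries remain.
    return df.loc[~df['country'].isin(country_aggregates)]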
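
The date handling in `load_wb_yearly` can be checked in isolation. The following is a minimal sketch that uses pandas string methods in place of the `lambda` above; the three labels are a hand-picked sample of the `wb_cols` pattern, not the full column list.

```python
import pandas as pd

# Column labels in the World Bank export look like '1960 [YR1960]'.
labels = pd.Series(['1960 [YR1960]', '1961 [YR1961]', '2017 [YR2017]'])

# Keep the part before the first space, then parse it as a year.
years = pd.to_datetime(labels.str.split(' ').str[0], format='%Y')
```

Each yearly observation is stamped with 1 January of its year, which lines it up with the first-of-month dates used for the IMF series.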

In [94]:
wb_econ_df = load_wb_yearly('ECON.csv', indicator_list, wb_cols, wb_country_aggregates)

In [95]:
wb_hnp_df = load_wb_yearly('HNP.csv', indicator_list, wb_cols, wb_country_aggregates)

In [96]:
wb_pop_df = load_wb_yearly('POP.csv', indicator_list, wb_cols, wb_country_aggregates)

In [97]:
wb_df = pd.concat([wb_econ_df, wb_hnp_df, wb_pop_df], join='outer')

In [98]:
wb_df.size

1359288

In [99]:
wb_df.head(15)

Unnamed: 0,country,indicator,date,value
0,Afghanistan,Final consumption expenditure (annual % growth),1960-01-01,0.0
1,Afghanistan,"Inflation, consumer prices (annual %)",1960-01-01,0.0
2,Afghanistan,Informal employment (% of total non-agricultur...,1960-01-01,0.0
3,Albania,Final consumption expenditure (annual % growth),1960-01-01,0.0
4,Albania,"Inflation, consumer prices (annual %)",1960-01-01,0.0
5,Albania,Informal employment (% of total non-agricultur...,1960-01-01,0.0
6,Algeria,Final consumption expenditure (annual % growth),1960-01-01,0.0
7,Algeria,"Inflation, consumer prices (annual %)",1960-01-01,0.0
8,Algeria,Informal employment (% of total non-agricultur...,1960-01-01,0.0
9,American Samoa,Final consumption expenditure (annual % growth),1960-01-01,0.0


In [100]:
len(wb_df['country'].unique())

217

In [101]:
wb_countries = sorted(list(wb_df['country'].unique()))

In [102]:
sorted(list(wb_df['indicator'].unique()))

['Birth rate, crude (per 1,000 people)',
 'Fertility rate, total (births per woman)',
 'Final consumption expenditure (annual % growth)',
 'Inflation, consumer prices (annual %)',
 'Informal employment (% of total non-agricultural employment)',
 'Life expectancy at birth, total (years)',
 'Mortality rate, infant (per 1,000 live births)',
 'Population ages 15-64 (% of total)',
 'Population density (people per sq. km of land area)',
 'Population growth (annual %)',
 'Population, total',
 'Prevalence of overweight (% of adults)',
 'Rural population (% of total population)',
 'Rural population growth (annual %)',
 'Unemployment, total (% of total labor force) (national estimate)',
 'Urban population (% of total)',
 'Urban population growth (annual %)']

### Combine the two datasets

In [103]:
imf_specific = [country for country in imf_countries if country not in wb_countries]

In [104]:
len(imf_specific)

27

In [105]:
imf_to_wb_country_map = {
    'Afghanistan, Islamic Republic of': 'Afghanistan',
    'Armenia, Republic of': 'Armenia',
    'Azerbaijan, Republic of': 'Azerbaijan',
    'Bahrain, Kingdom of': 'Bahrain',
    'China, P.R.: Hong Kong': 'Hong Kong SAR, China',
    'China, P.R.: Macao': 'Macao SAR, China',
    'China, P.R.: Mainland': 'China',
    'Congo, Democratic Republic of': 'Congo, Dem. Rep.',
    'Congo, Republic of': 'Congo, Rep.',
    'Egypt': 'Egypt, Arab Rep.',
    'French Territories: New Caledonia': 'New Caledonia',
    'Iran, Islamic Republic of': 'Iran',
    'Korea, Republic of': 'Korea, Rep.',
    'Kosovo, Republic of': 'Kosovo',
    "Lao People's Democratic Republic": 'Lao PDR',
    'Serbia, Republic of': 'Serbia',
    'Sint Maarten': 'Sint Maarten (Dutch part)',
    'Timor-Leste, Dem. Rep. of': 'Timor-Leste',
    'Venezuela, Republica Bolivariana de': 'Venezuela, RB',
    'Venezuela, República Bolivariana de': 'Venezuela, RB',
    'Yemen, Republic of': 'Yemen'
}

In [106]:
imf_df = imf_df.replace({'country': imf_to_wb_country_map})
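
After renaming, it can be useful to list which IMF names still have no World Bank counterpart. The sketch below shows one possible way to do that on invented miniature lists; `wb_names`, `imf_toy`, `name_map` and the name 'Some Unmapped Territory' are hypothetical stand-ins, not objects from this notebook.

```python
import pandas as pd

# Invented miniature versions of the two country lists.
wb_names = ['Afghanistan', 'Korea, Rep.', 'France']
imf_toy = pd.DataFrame({
    'country': ['Afghanistan, Islamic Republic of', 'Korea, Republic of',
                'France', 'Some Unmapped Territory'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

name_map = {'Afghanistan, Islamic Republic of': 'Afghanistan',
            'Korea, Republic of': 'Korea, Rep.'}

# Column-scoped replace: only values in the 'country' column are rewritten.
imf_toy = imf_toy.replace({'country': name_map})

# Names that still have no World Bank counterpart after the mapping.
unmatched = sorted(set(imf_toy['country']) - set(wb_names))
```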

In [107]:
people_df = pd.concat([wb_df, imf_df], join='outer')

In [108]:
people_df.size

5407224

In [109]:
people_df.head(15)

Unnamed: 0,country,indicator,date,value
0,Afghanistan,Final consumption expenditure (annual % growth),1960-01-01,0.0
1,Afghanistan,"Inflation, consumer prices (annual %)",1960-01-01,0.0
2,Afghanistan,Informal employment (% of total non-agricultur...,1960-01-01,0.0
3,Albania,Final consumption expenditure (annual % growth),1960-01-01,0.0
4,Albania,"Inflation, consumer prices (annual %)",1960-01-01,0.0
5,Albania,Informal employment (% of total non-agricultur...,1960-01-01,0.0
6,Algeria,Final consumption expenditure (annual % growth),1960-01-01,0.0
7,Algeria,"Inflation, consumer prices (annual %)",1960-01-01,0.0
8,Algeria,Informal employment (% of total non-agricultur...,1960-01-01,0.0
9,American Samoa,Final consumption expenditure (annual % growth),1960-01-01,0.0


In [119]:
indicators = sorted(list(people_df['indicator'].unique()))
indicators

['Birth rate, crude (per 1,000 people)',
 'Consumer Price Index, All items',
 'Consumer Price Index, All items, Percentage change, Previous period',
 'Education, Weight, Percent',
 'Fertility rate, total (births per woman)',
 'Final consumption expenditure (annual % growth)',
 'Food and non-alcoholic beverages, Weight, Percent',
 'Health, Weight, Percent',
 'Housing, Water, Electricity, Gas and Other Fuels, Weight, Percent',
 'Inflation, consumer prices (annual %)',
 'Informal employment (% of total non-agricultural employment)',
 'Labor Force, Persons, Number of',
 'Life expectancy at birth, total (years)',
 'Mortality rate, infant (per 1,000 live births)',
 'Population ages 15-64 (% of total)',
 'Population density (people per sq. km of land area)',
 'Population growth (annual %)',
 'Population, total',
 'Prevalence of overweight (% of adults)',
 'Recreation and culture, Weight, Percent',
 'Rural population (% of total population)',
 'Rural population growth (annual %)',
 'Transport,

In [120]:
missing = [i for i in indicator_list if i not in indicators]
assert len(indicators) == len(indicator_list), (
    'The number of retrieved variables (%s) does not match the number of specified variables (%s).\n'
    'The following variables are missing:\n\n %s' % (len(indicators), len(indicator_list), missing)
)
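
The check above compares only the two counts; the list in its message is what identifies the gap. A minimal sketch of that missing-name computation, on invented lists standing in for `indicator_list` and `indicators`:

```python
# Invented lists standing in for the requested and retrieved indicator names.
requested = ['Population, total', 'Labor Force, Persons, Number of',
             'Education, Weight, Percent']
retrieved = ['Population, total', 'Education, Weight, Percent']

# Requested indicators that never appeared in the combined frame.
retrieved_set = set(retrieved)
missing_names = [name for name in requested if name not in retrieved_set]
```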

In [121]:
people_df.to_csv('model_one/people.csv', sep=';', index=False)
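
Because the file is written with `sep=';'`, it has to be read back with the same separator. The following is a minimal sketch of the round trip on an invented two-row frame, using an in-memory buffer instead of `model_one/people.csv`; the values are placeholders.

```python
import io
import pandas as pd

# Invented frame with the same four columns as people_df.
toy = pd.DataFrame({
    'country': ['France', 'Brazil'],
    'indicator': ['Population, total', 'Population, total'],
    'date': pd.to_datetime(['1960-01-01', '1960-01-01']),
    'value': [1.0, 2.0],
})

# Write with ';' as separator, as in the cell above.
buffer = io.StringIO()
toy.to_csv(buffer, sep=';', index=False)
csv_text = buffer.getvalue()

# Read back with the same separator and restore the datetime column.
restored = pd.read_csv(io.StringIO(csv_text), sep=';', parse_dates=['date'])
```

A semicolon is a convenient separator here because many indicator names, such as 'Population, total', contain commas.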