# Case Study 2.3: Do Poor Countries Grow Faster than Rich Countries?

Instructor: Victor Chernozhukov
Activity Type: Optional 
Case Study Description: Answer the question "Do poor countries grow faster than rich countries?" using a high-dimensional dataset.
Why this Case Study? Participants are equipped with tools that can handle high-dimensional datasets and can apply these tools to any such dataset.
Self-Help Package Contents: 

The video that covers this case study is given in Module 2, Segment 2.4.

Self-Help-Package.zip

Codebook.txt contains the names of the variables and a brief description of each.
growth.Rdata: The dataset containing the variables used in the regression.
Regression 2.4.CaseStudy.R: examines how the growth rates of different countries' economies are related to each country's initial wealth level, controlling for several country-specific characteristics. This relationship is estimated in two ways. In the first analysis, a simple linear regression model is used. In the second analysis, the control variables are partialled out using the Lasso method, and the residuals of the dependent variable are then regressed on the residuals of the independent variable.
Regression.2.4.pdf is the set of slides that describes the estimation technique and presents the results.
.Rapp.history
.Rhistory
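
The partialling-out idea described for Regression 2.4.CaseStudy.R can be illustrated on synthetic data. The sketch below is not the case study's code: it uses plain least squares instead of Lasso for the partialling-out step, and all names (`X`, `D`, `Y`, `residualize`) and the simulated coefficients are assumptions made for illustration. With least squares, the residual-on-residual slope coincides with the coefficient on `D` from the full regression (the Frisch-Waugh-Lovell result); Lasso takes over the first step when there are many controls relative to the number of observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5

# Synthetic controls X, treatment D (depends on X), and outcome Y.
X = rng.normal(size=(n, p))
D = X @ rng.normal(size=p) + rng.normal(size=n)
Y = -0.05 * D + X @ rng.normal(size=p) + rng.normal(size=n)

def residualize(target, controls):
    """Return residuals of a least-squares fit of target on controls plus a constant."""
    Z = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)
    return target - Z @ coef

# Step 1: partial the controls out of both Y and D.
rY = residualize(Y, X)
rD = residualize(D, X)

# Step 2: regress residuals of Y on residuals of D. No intercept is needed
# because both residual vectors have mean zero.
beta_partial = (rD @ rY) / (rD @ rD)

# Reference: coefficient on D from the full regression of Y on [1, D, X].
full = np.column_stack([np.ones(n), D, X])
beta_full = np.linalg.lstsq(full, Y, rcond=None)[0][1]

print(beta_partial, beta_full)
```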

In [1]:
#importing the libraries

from IPython.display import HTML, display


import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.sandbox.regression.predstd import wls_prediction_std

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_style("darkgrid")

import pandas as pd
import numpy as np

# Getting Data

In [2]:
# Codebook for growth dataset.
# 05/09/2016

# The data set contains the Barro-Lee growth data. The Barro-Lee data consist of a panel of 138 countries for the period 1960 to 1985. The dependent variable is the national growth rate in GDP per capita for the periods 1965-1975 and 1975-1985. The growth rate in GDP over a period from t1 to t2 is commonly defined as log(GDP_t2 / GDP_t1). The number of covariates is p=62. The number of complete observations is 90.

# The full data set and further details can be found at http://www.nber.org/pub/barro.lee, http://www.barrolee.com, and http://www.bristol.ac.uk//Depts//Economics//Growth//barlee.htm.

# Outcome  : national growth rates in GDP per capita for the periods 1965-1975.
# intercept: Constant.
# gdpsh465 : Real GDP per capita (1980 international prices) in 1965. The values in the data (roughly 5.8 to 9.2) appear to be in logs.
# bmp1l    : Black market premium. Log (1+BMP)
# freeop   : Measure of "Free trade openness".
# freetar  : Measure of tariff restriction
# h65      : Total gross enrollment ratio for higher education in 1965.
# hm65     : Male gross enrollment ratio for higher education in 1965.
# hf65     : Female gross enrollment ratio for higher education in 1965.
# p65      : Total gross enrollment ratio for primary education in 1965.
# pm65     : Male gross enrollment ratio for primary education in 1965.
# pf65     : Female gross enrollment ratio for primary education in 1965.
# s65      : Total gross enrollment ratio for secondary education in 1965.
# sm65     : Male gross enrollment ratio for secondary education in 1965.
# sf65     : Female gross enrollment ratio for secondary education in 1965.
# fert65   : Total fertility rate (children per woman) in 1965.
# mort65   : Infant Mortality Rate in 1965.
# lifee065 : Life expectancy at age 0 in 1965.
# gpop1    : Growth rate of population.
# fert1    : Total fertility rate (children per woman).
# mort1    : Infant Mortality Rate (ages 0-1).
# invsh41  : Ratio of real domestic investment (private plus public) to real GDP.
# geetot1  : Ratio of total nominal government expenditure on education to nominal GDP.
# geerec1  : Ratio of recurring nominal government expenditure on education to nominal GDP.
# gde1     : Ratio of nominal government expenditure on defense to nominal GDP.
# govwb1   : Ratio of nominal government "consumption" expenditure to nominal GDP (using current local currency).
# govsh41  : Ratio of real government "consumption" expenditure to real GDP. (Period average).
# gvxdxe41 : Ratio of real government "consumption" expenditure net of spending on defense and on education to real GDP.
# high65   : Percentage of "higher school attained" in the total pop in 1965.
# highm65  : Percentage of "higher school attained" in the male pop in 1965.
# highf65  : Percentage of "higher school attained" in the female pop in 1965.
# highc65  : Percentage of "higher school complete" in the total pop.
# highcm65 : Percentage of "higher school complete" in the male pop.
# highcf65 : Percentage of "higher school complete" in the female pop.
# human65  : Average schooling years in the total population over age 25 in 1965.
# humanm65 : Average schooling years in the male population over age 25 in 1965.
# humanf65 : Average schooling years in the female population over age 25 in 1965.
# hyr65    : Average years of higher schooling in the total population over age 25.
# hyrm65   : Average years of higher schooling in the male population over age 25.
# hyrf65   : Average years of higher schooling in the female population over age 25.
# no65     : Percentage of "no schooling" in the total population.
# nom65    : Percentage of "no schooling" in the male population.
# nof65    : Percentage of "no schooling" in the female population.
# pinstab1 : Measure of political instability.
# pop65    : Total Population in 1965.
# worker65 : Ratio of total Workers to population.
# pop1565  : Population Proportion under 15 in 1965.
# pop6565  : Population Proportion over 65 in 1965.
# sec65    : Percentage of "secondary school attained" in the total pop in 1965.
# secm65   : Percentage of "secondary school attained" in the male pop in 1965.
# secf65   : Percentage of "secondary school attained" in the female pop in 1965.
# secc65   : Percentage of "secondary school complete" in the total pop in 1965.
# seccm65  : Percentage of "secondary school complete" in the male pop in 1965.
# seccf65  : Percentage of "secondary school complete" in the female pop in 1965.
# syr65    : Average years of secondary schooling in the total population over age 25 in 1965.
# syrm65   : Average years of secondary schooling in the male population over age 25 in 1965.
# syrf65   : Average years of secondary schooling in the female population over age 25 in 1965.
# teapri65 : Pupil/Teacher Ratio in primary school.
# teasec65 : Pupil/Teacher Ratio in secondary school.
# ex1      : Ratio of export to GDP (in current international prices)
# im1      : Ratio of import to GDP (in current international prices) 
# xr65     : Exchange rate (domestic currency per U.S. dollar) in 1965.
# tot1     : Terms of trade shock (growth rate of export prices minus growth rate of import prices).
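
As a small worked example of the log growth-rate definition in the codebook, the sketch below computes growth between two hypothetical GDP-per-capita levels. The numbers are made up for illustration, and the later period is placed in the numerator so that a positive value means the economy grew.

```python
import math

# Hypothetical GDP per capita levels (illustrative values, not from the dataset).
gdp_1965 = 1000.0
gdp_1975 = 1500.0

# Log growth over the period: log(GDP_t2 / GDP_t1) = log(GDP_t2) - log(GDP_t1).
growth = math.log(gdp_1975 / gdp_1965)
growth_from_logs = math.log(gdp_1975) - math.log(gdp_1965)

print(round(growth, 4))  # 0.4055
```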



In [3]:
dataset = pd.read_csv('data.csv', index_col=0)
dataset

Unnamed: 0,Outcome,intercept,gdpsh465,bmp1l,freeop,freetar,h65,hm65,hf65,p65,...,seccf65,syr65,syrm65,syrf65,teapri65,teasec65,ex1,im1,xr65,tot1
1,-0.024336,1,6.591674,0.2837,0.153491,0.043888,0.007,0.013,0.001,0.29,...,0.04,0.033,0.057,0.010,47.6,17.3,0.0729,0.0667,0.348,-0.014727
2,0.100473,1,6.829794,0.6141,0.313509,0.061827,0.019,0.032,0.007,0.91,...,0.64,0.173,0.274,0.067,57.1,18.0,0.0940,0.1438,0.525,0.005750
3,0.067051,1,8.895082,0.0000,0.204244,0.009186,0.260,0.325,0.201,1.00,...,18.14,2.573,2.478,2.667,26.5,20.7,0.1741,0.1750,1.082,-0.010040
4,0.064089,1,7.565275,0.1997,0.248714,0.036270,0.061,0.070,0.051,1.00,...,2.63,0.438,0.453,0.424,27.8,22.7,0.1265,0.1496,6.625,-0.002195
5,0.027930,1,7.162397,0.1740,0.299252,0.037367,0.017,0.027,0.007,0.82,...,2.11,0.257,0.287,0.229,34.5,17.6,0.1211,0.1308,2.500,0.003283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86,0.031196,1,8.991064,0.0000,0.371898,0.014586,0.255,0.336,0.170,0.98,...,11.41,2.226,2.494,1.971,27.5,15.9,0.4407,0.4257,2.529,-0.011883
87,0.034096,1,8.025189,0.0050,0.296437,0.013615,0.108,0.117,0.093,1.00,...,1.95,0.510,0.694,0.362,20.2,15.7,0.1669,0.2201,25.553,-0.039080
88,0.046900,1,9.030137,0.0000,0.265778,0.008629,0.288,0.337,0.237,1.00,...,25.64,2.727,2.664,2.788,20.4,9.4,0.3238,0.3134,4.152,0.005175
89,0.039773,1,8.865312,0.0000,0.282939,0.005048,0.188,0.236,0.139,1.00,...,10.76,1.888,1.920,1.860,20.0,16.0,0.1845,0.1940,0.452,-0.029551


In [4]:
#Descriptive statistics
print ('## DATASET STATISTICS ## ')
display(dataset.describe())

## DATASET STATISTICS ## 


Unnamed: 0,Outcome,intercept,gdpsh465,bmp1l,freeop,freetar,h65,hm65,hf65,p65,...,seccf65,syr65,syrm65,syrf65,teapri65,teasec65,ex1,im1,xr65,tot1
count,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,...,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,0.045349,1.0,7.702907,0.168747,0.220102,0.028334,0.111556,0.137156,0.082233,0.893333,...,6.171778,0.912422,1.045444,0.783767,33.203333,19.412222,0.133981,0.144356,42.663856,0.009236
std,0.051314,0.0,0.896179,0.249116,0.074861,0.021855,0.101361,0.116826,0.091549,0.164938,...,6.857261,0.823222,0.831756,0.839486,9.818516,6.384194,0.118708,0.120937,119.335089,0.059182
min,-0.10099,1.0,5.762051,0.0,0.078488,0.0,0.002,0.004,0.0,0.29,...,0.04,0.033,0.057,0.01,18.2,7.2,0.0178,0.0222,0.003,-0.156878
25%,0.021045,1.0,7.131539,0.0,0.166044,0.011589,0.03225,0.04225,0.014,0.8325,...,1.5,0.35775,0.4675,0.21525,27.425,15.35,0.06345,0.070625,1.099,-0.016877
50%,0.046209,1.0,7.7257,0.0638,0.203972,0.025426,0.089,0.1145,0.055,0.985,...,3.87,0.6785,0.7705,0.5125,32.2,18.4,0.0923,0.1163,4.762,0.00489
75%,0.074029,1.0,8.441914,0.27455,0.286425,0.039745,0.1475,0.181,0.113,1.0,...,8.0425,1.13075,1.24675,1.002,37.475,22.8,0.169225,0.181475,19.619,0.018526
max,0.185526,1.0,9.229849,1.6378,0.416234,0.109921,0.573,0.635,0.527,1.0,...,36.61,4.211,4.227,4.198,62.4,37.1,0.747,0.8489,652.85,0.207492


In [5]:
# Build formula terms for the controls (X) and for the treatment plus controls (D and X)
xnames_df = dataset.iloc[:, 3:]    # control columns X (everything after gdpsh465)
xnames_string = '+'.join(xnames_df.columns)

dandxnames_df = dataset.iloc[:, 2:]     # treatment D (gdpsh465) followed by control columns X
dandxnames_string = '+'.join(dandxnames_df.columns)
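
To make the slicing in the cell above concrete, here is a minimal sketch on a shortened, hypothetical column list (the real dataset has many more columns). It mirrors the `iloc[:, 3:]` and `iloc[:, 2:]` selections and shows the kind of formula strings that are later passed to `ols`.

```python
# Shortened, hypothetical column order matching the dataset's layout:
# outcome first, then the constant, then the treatment, then the controls.
columns = ["Outcome", "intercept", "gdpsh465", "bmp1l", "freeop", "freetar"]

x_names = columns[3:]        # controls only (X)
d_and_x_names = columns[2:]  # treatment gdpsh465 (D) followed by controls (X)

formula_controls = "Outcome ~ " + "+".join(x_names)
formula_full = "Outcome ~ " + "+".join(d_and_x_names)

print(formula_controls)  # Outcome ~ bmp1l+freeop+freetar
print(formula_full)      # Outcome ~ gdpsh465+bmp1l+freeop+freetar
```

The `intercept` column is skipped because the formula interface adds a constant by default.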


In [6]:
## Applying the OLS regression as in the provided R program

ols_model = ols("""Outcome ~ """ + dandxnames_string, data=dataset).fit()
# summarize our model
ols_model_summary = ols_model.summary()
HTML(
(ols_model_summary
    .as_html()
    .replace('<th>  Adj. R-squared:    </th>', '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
    .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
    .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
    .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
    .replace('<th>[0.025</th>    <th>0.975]</th>', '<th style="background-color:#ff9896;">[0.025</th>    <th style="background-color:#ff9896;">0.975]</th>'))
)

Dep. Variable:,Outcome,R-squared:,0.887
Model:,OLS,Adj. R-squared:,0.641
Method:,Least Squares,F-statistic:,3.607
Date:,"Wed, 13 May 2020",Prob (F-statistic):,0.0002
Time:,17:15:18,Log-Likelihood:,238.24
No. Observations:,90,AIC:,-352.5
Df Residuals:,28,BIC:,-197.5
Df Model:,61,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.2472,0.785,0.315,0.755,-1.360,1.854
gdpsh465,-0.0094,0.030,-0.314,0.756,-0.071,0.052
bmp1l,-0.0689,0.033,-2.117,0.043,-0.135,-0.002
freeop,0.0801,0.208,0.385,0.703,-0.346,0.506
freetar,-0.4890,0.418,-1.169,0.252,-1.346,0.368
h65,-2.3621,0.857,-2.755,0.010,-4.118,-0.606
hm65,0.7071,0.523,1.352,0.187,-0.364,1.779
hf65,1.6934,0.503,3.365,0.002,0.663,2.724
p65,0.2655,0.164,1.616,0.117,-0.071,0.602

Omnibus:,0.439,Durbin-Watson:,1.982
Prob(Omnibus):,0.803,Jarque-Bera (JB):,0.417
Skew:,0.158,Prob(JB):,0.812
Kurtosis:,2.896,Cond. No.,752000000.0
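
Two entries of the summary above can be reproduced by hand, which helps in reading the table. The sketch below recomputes the adjusted R-squared from the reported R-squared and degrees of freedom, and the 95% confidence interval for `gdpsh465` from its rounded coefficient and standard error. The critical value 2.048 (the 0.975 quantile of a t distribution with 28 degrees of freedom) is hard-coded as an assumption to keep the example dependency-free.

```python
# Values copied from the OLS summary above.
r_squared = 0.887
n_obs = 90
df_resid = 28

# Adjusted R-squared penalizes the fit for the number of estimated parameters.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# 95% confidence interval for gdpsh465: coef +/- t_crit * std_err.
coef, std_err = -0.0094, 0.030
t_crit = 2.048  # assumed: 0.975 quantile of t with 28 degrees of freedom
ci_low = coef - t_crit * std_err
ci_high = coef + t_crit * std_err

print(round(adj_r_squared, 3))              # 0.641
print(round(ci_low, 3), round(ci_high, 3))  # -0.071 0.052
```

With only 28 residual degrees of freedom, the interval is wide and includes zero, so the plain OLS estimate says little about the sign of the effect.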


In [7]:
# OLS estimate for gdpsh465: coef = -0.0094, std err = 0.030

In [8]:
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Regularized regression of y (outcome) on the controls.
# Note: L1_wt=0.5 is an elastic-net penalty; L1_wt=1.0 would be a pure lasso.
ols_model_y = smf.ols("Outcome ~ " + xnames_string, data=dataset).fit_regularized(alpha=0.2, L1_wt=0.5, refit=True)
# Regularized regression of d (treatment, gdpsh465) on the same controls
ols_model_d = smf.ols("gdpsh465 ~ " + xnames_string, data=dataset).fit_regularized(alpha=0.2, L1_wt=0.5, refit=True)
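
In `fit_regularized`, the penalty `alpha * ((1 - L1_wt) * |b|_2^2 / 2 + L1_wt * |b|_1)` is added to `0.5 * RSS / n`, so `L1_wt=0.5` gives an elastic net and `L1_wt=1.0` a pure lasso. The numpy sketch of coordinate descent below targets that objective as an illustration only; it is not statsmodels' implementation, and the function name `elastic_net_cd` and the synthetic data are assumptions. On a design whose columns are orthonormal after scaling (X'X / n = I), the pure lasso solution is simply the soft-thresholded least-squares coefficients, which makes the effect of the penalty easy to see.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t, returning exactly zero when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, alpha, l1_wt, n_iter=200):
    """Coordinate descent for 0.5*RSS/n + alpha*((1-l1_wt)*|b|^2/2 + l1_wt*|b|_1)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every coordinate except j.
            r_j = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r_j / n
            denom = X[:, j] @ X[:, j] / n + alpha * (1 - l1_wt)
            b[j] = soft_threshold(rho, alpha * l1_wt) / denom
    return b

rng = np.random.default_rng(1)
n, p = 100, 6

# Orthonormalize random columns, then rescale so that X'X / n = I.
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
X = Q * np.sqrt(n)
true_b = np.array([1.0, -0.5, 0.0, 0.0, 0.05, 0.0])
y = X @ true_b + 0.1 * rng.normal(size=n)

alpha = 0.2
b_ols = X.T @ y / n  # least-squares solution when X'X / n = I
b_lasso = elastic_net_cd(X, y, alpha, l1_wt=1.0)
b_enet = elastic_net_cd(X, y, alpha, l1_wt=0.5)

print(np.round(b_lasso, 3))
print(np.round(b_enet, 3))
```

Because the penalty differs between `L1_wt=0.5` and a pure lasso, the selected controls, and hence the residuals used in the next step, can differ as well.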


In [9]:

# Residuals of the outcome and the treatment after partialling out the controls
rY = ols_model_y.resid
rD = ols_model_d.resid

In [10]:
# Regress the outcome residuals on the treatment residuals (no constant is added)
resid_regression = sm.OLS(rY, rD).fit()

In [11]:
resid_regression_summary = resid_regression.summary()
HTML(
(resid_regression_summary
    .as_html()
    .replace('<th>  Adj. R-squared:    </th>', '<th style="background-color:#aec7e8;"> Adj. R-squared: </th>')
    .replace('<th>coef</th>', '<th style="background-color:#ffbb78;">coef</th>')
    .replace('<th>std err</th>', '<th style="background-color:#c7e9c0;">std err</th>')
    .replace('<th>P>|t|</th>', '<th style="background-color:#bcbddc;">P>|t|</th>')
    .replace('<th>[0.025</th>    <th>0.975]</th>', '<th style="background-color:#ff9896;">[0.025</th>    <th style="background-color:#ff9896;">0.975]</th>'))
)

Dep. Variable:,y,R-squared (uncentered):,0.018
Model:,OLS,Adj. R-squared (uncentered):,0.007
Method:,Least Squares,F-statistic:,1.677
Date:,"Wed, 13 May 2020",Prob (F-statistic):,0.199
Time:,17:15:19,Log-Likelihood:,143.91
No. Observations:,90,AIC:,-285.8
Df Residuals:,89,BIC:,-283.3
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x1,-0.0180,0.014,-1.295,0.199,-0.046,0.010

Omnibus:,2.377,Durbin-Watson:,1.557
Prob(Omnibus):,0.305,Jarque-Bera (JB):,1.731
Skew:,-0.239,Prob(JB):,0.421
Kurtosis:,3.483,Cond. No.,1.0
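
The single coefficient in the table above comes from a regression through the origin, so its slope, standard error, and t statistic have closed forms. The sketch below works them out on synthetic residual-like vectors; the data and the names `r_d` and `r_y` are assumptions, not the notebook's `rD` and `rY`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 90

# Synthetic stand-ins for the residualized treatment and outcome.
r_d = rng.normal(size=n)
r_y = -0.05 * r_d + 0.05 * rng.normal(size=n)

# Slope of a no-intercept regression of r_y on r_d.
slope = (r_d @ r_y) / (r_d @ r_d)

# Nonrobust standard error: the residual variance uses n - 1 degrees of
# freedom because one parameter is estimated.
resid = r_y - slope * r_d
sigma2 = (resid @ resid) / (n - 1)
std_err = np.sqrt(sigma2 / (r_d @ r_d))
t_stat = slope / std_err

print(slope, std_err, t_stat)
```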


In [13]:
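# The partialling-out estimate above (-0.0180) does not match the R result (-0.0498).
# A possible reason is that fit_regularized is called here with an elastic-net penalty
# (L1_wt=0.5) and a hand-picked alpha, which need not match the Lasso settings of the R script.
# This remains to be checked with the TA.
# Reference results from the R program:
#                            Estimate Std. Error
# Least Squares              -0.0094      0.030
# Partialling-out via lasso  -0.0498      0.014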
