# Fixed & random effects

## Table of contents
[Intro: Preprocessing data](#Intro:-Preprocessing-data)
   - [I.1 Load data](#I.1-Load-data)
   - [I.2 List variables](#I.2-List-variables)
   - [I.3 Choose x & y variables](#I.3-Choose-x-&-y-variables)

[Question 1:](#Question-1:)

Estimate models for infant or child mortality rates in developing countries with explanatory variables such as GDP per capita, total fertility rate, health expenditures, immunization rates, etc., using 5-yearly averages from the World Development Indicators data set. Select the explanatory variables so as to also reduce the number of missing observations in the estimation.
- [1.1 Basic regression](#1.1-Basic-regression)

[Question 2:](#Question-2:)

Explain how you account for between-country differences in the estimation methods.

- [2.1 Random effects](#2.1-Random-effects)
- [2.2 Fixed effects](#2.2-Fixed-effects)
    - [2.2.1 FE by entity](#2.2.1-FE-by-entity)
    - [2.2.2 FE by entity, time](#2.2.2-FE-by-entity,-time)

[Question 3:](#Question-3:)

Investigate if the OLS estimates differ from estimation methods that treat country specific effects as “fixed” (“dummy”) variables or “random” variables. 

- [3.1 Compare FE & RE](#3.1-Compare-FE-&-RE)
- [3.2. Pooled, cluster entity, cluster entity and time](#3.2.-Pooled,-cluster-entity,-cluster-entity-and-time)


## Intro: Preprocessing data

## I.1 Load data

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

In [2]:
dataset = pd.read_excel('WDI-subset5y_extended 2014 fall.xlsx', 
                   index_col=[0,1], 
                   parse_dates=True
                  )
dataset.head(6)

Country,Year,"Birth rate, crude (per 1,000 people)",Births attended by skilled health staff (% of total),"Fertility rate, total (births per woman)",GDP per capita growth (annual %),"GDP per capita, PPP (constant 2005 international $)",Health expenditure per capita (current US$),"Hospital beds (per 1,000 people)","Immunization, DPT (% of children ages 12-23 months)","Immunization, measles (% of children ages 12-23 months)",Improved sanitation facilities (% of population with access),...,"Mortality rate, infant (per 1,000 live births)","Mortality rate, under-5 (per 1,000 live births)","Physicians (per 1,000 people)","Primary education, pupils","Literacy rate, adult female (% of females ages 15 and above)","Literacy rate, adult male (% of males ages 15 and above)","School enrollment, primary, female (% gross)","School enrollment, primary, male (% gross)","School enrollment, secondary, female (% gross)","School enrollment, secondary, male (% gross)"
Afghanistan,1990-01-01,52.4574,,7.9612,,,,0.2498,27.4,23.4,29.0,...,129.02,191.4,0.1192,631039.5,,,,,,
Afghanistan,1995-01-01,52.7212,,8.0382,,,0.0,,21.8,37.2,29.2,...,108.38,157.88,0.1264,1086724.333,,,21.875147,47.682127,9.001103,26.889893
Afghanistan,2000-01-01,50.5656,12.4,7.7116,,615.042245,14.480312,0.4,29.4,32.2,31.6,...,94.98,136.34,0.186,1287003.75,,,10.816135,51.75907,0.0,22.0938
Afghanistan,2005-01-01,46.737,16.6,6.9932,4.992695,692.240977,28.049842,0.41,54.0,49.0,35.0,...,84.38,119.54,0.2,4383432.6,,,65.697868,115.578456,9.46887,28.542322
Afghanistan,2010-01-01,44.222667,24.0,6.423,6.708179,924.171976,34.395607,0.406667,64.75,60.75,37.0,...,75.375,105.35,0.21,5066598.0,,,75.669057,112.60305,25.435957,53.192453
Albania,1990-01-01,24.5338,91.175,3.2188,-8.541723,3531.372774,,4.02092,91.2,89.4,76.333333,...,35.64,41.18,1.47445,277554.0,,,99.295897,98.641415,85.043295,91.63392

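The `index_col=[0,1]` argument above gives the frame a two-level (entity, time) index, which is the layout the panel estimators used later in this notebook work with. As a minimal sketch on a made-up three-row table (the `flat` frame, its column `mortality`, and all values are illustrative assumptions, not taken from the WDI file), the same structure can be built from flat columns:

```python
import pandas as pd

# Hypothetical flat table; the numbers are placeholders, not WDI values.
flat = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": pd.to_datetime(["1990-01-01", "1995-01-01", "1990-01-01"]),
    "mortality": [100.0, 90.0, 40.0],
})

# Level 0 identifies the entity (country), level 1 the time period.
panel = flat.set_index(["Country", "Year"]).sort_index()

print(list(panel.index.names))
print(panel.loc["A"].shape[0])  # number of periods observed for entity A
```

Keeping the entity in the first level and time in the second means per-country operations reduce to a `groupby(level=0)` or a `.loc` on the country name.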

## I.2 List variables

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 425 entries, ('Afghanistan', Timestamp('1990-01-01 00:00:00')) to ('Zimbabwe', Timestamp('2010-01-01 00:00:00'))
Data columns (total 23 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   Birth rate, crude (per 1,000 people)                          422 non-null    float64
 1   Births attended by skilled health staff (% of total)          306 non-null    float64
 2   Fertility rate, total (births per woman)                      421 non-null    float64
 3   GDP per capita growth (annual %)                              413 non-null    float64
 4   GDP per capita, PPP (constant 2005 international $)           402 non-null    float64
 5   Health expenditure per capita (current US$)                   338 non-null    float64
 6   Hospital beds (per 1,000 people)                              293 non-n

## I.3 Choose x & y variables

In [4]:
# The full list of available variables can be obtained with: dataset.columns

variableslist = [
       #'Birth rate, crude (per 1,000 people)',
       #'Births attended by skilled health staff (% of total)',
       #'Fertility rate, total (births per woman)',
       #'GDP per capita growth (annual %)', -Flow variable, Y is a stock variable
       #'Life expectancy at birth, total (years)', -Highly correlated with Y
       #'Primary education, pupils', -School enrolment primary is a better proxy
       #'Literacy rate, adult female (% of females ages 15 and above)', #Not enough obs
       #'Literacy rate, adult male (% of males ages 15 and above)', -Not enough obs
       #'Hospital beds (per 1,000 people)', #Explanatory variable, -Not enough obs
       #'Health expenditure per capita (current US$)', -Not enough obs
       #'School enrollment, secondary, female (% gross)', -Not enough obs
       #'School enrollment, secondary, male (% gross)', -Not enough obs
    
       #'Mortality rate, infant (per 1,000 live births)', #Alternative Y variable, NOT chosen for model

# The variables commented out above are excluded from the model;
# each comment gives the variable name and the reason it was left out.
       'Mortality rate, under-5 (per 1,000 live births)', # Y variable (dependent variable)
       'GDP per capita, PPP (constant 2005 international $)', #Control for overall wealth       
       'Immunization, DPT (% of children ages 12-23 months)', #Treatment control
       'Immunization, measles (% of children ages 12-23 months)', #Treatment control
       'Improved sanitation facilities (% of population with access)', #General health control
       'Improved water source (% of population with access)', #General health control
       'Incidence of tuberculosis (per 100,000 people)', #Systemic disease control
       'Physicians (per 1,000 people)', #Health system control, related to wealth
       'School enrollment, primary, female (% gross)', #Education control
       'School enrollment, primary, male (% gross)', #Education control
       
]

In [5]:
# Keep only the selected variables, then drop every row with a missing value
dataset_cleaned = dataset[variableslist].dropna()
# Explanatory variables: everything except the dependent variable
x = dataset_cleaned.drop(['Mortality rate, under-5 (per 1,000 live births)'], axis=1)
y = dataset_cleaned[['Mortality rate, under-5 (per 1,000 live births)']]

# Log-transform both sides, so slope coefficients can be read as elasticities
x = np.log(x)
x = sm.add_constant(x)  # Add constant for regression
y = np.log(y)
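
Because `dropna()` removes every row with at least one missing value, each sparsely observed variable added to the list shrinks the estimation sample. A minimal sketch of how that trade-off could be checked before fixing the variable list, using a synthetic frame (the `toy` data, its column names, and the helper `complete_cases` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the WDI panel; column names are hypothetical.
toy = pd.DataFrame({
    "y":          [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "gdp":        [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "literacy":   [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0],
    "sanitation": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
})

def complete_cases(df, columns):
    """Number of rows with no missing value in any of `columns`."""
    return int(df[columns].notna().all(axis=1).sum())

# Including the sparse 'literacy' column costs half of the usable rows here.
full = complete_cases(toy, ["y", "gdp", "literacy", "sanitation"])
reduced = complete_cases(toy, ["y", "gdp", "sanitation"])
print(full, reduced)
```

The same helper could be pointed at `dataset` with candidate subsets of the real column names to see how many of the 425 rows survive each choice.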

# Question 1:
Estimate models for infant or child mortality rates in developing countries with explanatory variables such as GDP per capita, total fertility rate, health expenditures, immunization rates, etc., using 5-yearly averages from the World Development Indicators data set.

Select the explanatory variables so as to also reduce the number of missing observations in the estimation.


## 1.1 Basic regression

In [6]:
# Import the pooled panel estimator (statsmodels is already imported as sm in cell 1)
from linearmodels.panel import PooledOLS

pooled_res = PooledOLS(y, x).fit(cov_type='robust')  # Heteroskedasticity-robust standard errors
pooled_res.summary  # Displays the summary table

| | | | |
|---|---|---|---|
| Dep. Variable: | Mortality rate, under-5 (per 1,000 live births) | R-squared: | 0.7520 |
| Estimator: | PooledOLS | R-squared (Between): | 0.7492 |
| No. Observations: | 300 | R-squared (Within): | 0.5594 |
| Date: | Mon, Mar 02 2020 | R-squared (Overall): | 0.7520 |
| Time: | 16:16:39 | Log-likelihood | -129.09 |
| Cov. Estimator: | Robust | | |
| | | F-statistic: | 97.692 |
| Entities: | 84 | P-value | 0.0000 |
| Avg Obs: | 3.5714 | Distribution: | F(9,290) |
| Min Obs: | 0.0000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 8.4470 | 0.7635 | 11.064 | 0.0000 | 6.9443 | 9.9497 |
| GDP per capita, PPP (constant 2005 international $) | -0.3661 | 0.0550 | -6.6615 | 0.0000 | -0.4743 | -0.2580 |
| Immunization, DPT (% of children ages 12-23 months) | -0.1591 | 0.1963 | -0.8105 | 0.4183 | -0.5455 | 0.2273 |
| Immunization, measles (% of children ages 12-23 months) | -0.0366 | 0.1770 | -0.2070 | 0.8362 | -0.3849 | 0.3117 |
| Improved sanitation facilities (% of population with access) | -0.1271 | 0.0508 | -2.5019 | 0.0129 | -0.2271 | -0.0271 |
| Improved water source (% of population with access) | -0.0529 | 0.1189 | -0.4447 | 0.6568 | -0.2868 | 0.1811 |
| Incidence of tuberculosis (per 100,000 people) | 0.1724 | 0.0282 | 6.1234 | 0.0000 | 0.1170 | 0.2278 |
| Physicians (per 1,000 people) | -0.0870 | 0.0234 | -3.7255 | 0.0002 | -0.1330 | -0.0410 |
| School enrollment, primary, female (% gross) | -0.4531 | 0.2279 | -1.9880 | 0.0477 | -0.9016 | -0.0045 |

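Since both `y` and `x` were log-transformed, each slope in the pooled table is an elasticity. As a worked example, the GDP coefficient of -0.3661 can be turned into the implied percentage change in under-5 mortality associated with a 10% higher GDP per capita, holding the other regressors fixed. The helper name `pct_change_in_y` is an assumption introduced here, and the result describes an association in this sample rather than a causal effect:

```python
import math

beta_gdp = -0.3661  # pooled OLS coefficient on log GDP per capita, from the table above

def pct_change_in_y(beta, pct_change_in_x):
    """Implied % change in y for a given % change in x in a log-log model."""
    return (math.exp(beta * math.log(1 + pct_change_in_x / 100)) - 1) * 100

effect = pct_change_in_y(beta_gdp, 10)
print(round(effect, 2))
```

For small changes the exact figure is close to the shortcut of multiplying the coefficient by the percentage change (here about -3.7), but the two drift apart as the change in `x` grows.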

In [7]:
# Export the summary as CSV; the context manager closes the file automatically
results_csv = pooled_res.summary.as_csv()

with open("table.csv", "w") as result_file:
    result_file.write(results_csv)

# Question 2:

## 2.1 Random effects

In [8]:
from linearmodels import RandomEffects
re_result = RandomEffects(y,x).fit(cov_type='robust') #Robust standard errors 
re_result.summary

| | | | |
|---|---|---|---|
| Dep. Variable: | Mortality rate, under-5 (per 1,000 live births) | R-squared: | 0.7563 |
| Estimator: | RandomEffects | R-squared (Between): | 0.7158 |
| No. Observations: | 300 | R-squared (Within): | 0.6458 |
| Date: | Mon, Mar 02 2020 | R-squared (Overall): | 0.7279 |
| Time: | 16:16:39 | Log-likelihood | 114.38 |
| Cov. Estimator: | Robust | | |
| | | F-statistic: | 100.02 |
| Entities: | 84 | P-value | 0.0000 |
| Avg Obs: | 3.5714 | Distribution: | F(9,290) |
| Min Obs: | 0.0000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 9.9911 | 0.7695 | 12.985 | 0.0000 | 8.4767 | 11.506 |
| GDP per capita, PPP (constant 2005 international $) | -0.4471 | 0.0530 | -8.4399 | 0.0000 | -0.5514 | -0.3428 |
| Immunization, DPT (% of children ages 12-23 months) | 0.1561 | 0.1046 | 1.4919 | 0.1368 | -0.0498 | 0.3620 |
| Immunization, measles (% of children ages 12-23 months) | -0.3908 | 0.0740 | -5.2805 | 0.0000 | -0.5365 | -0.2452 |
| Improved sanitation facilities (% of population with access) | -0.2110 | 0.0634 | -3.3268 | 0.0010 | -0.3358 | -0.0862 |
| Improved water source (% of population with access) | -0.2453 | 0.1449 | -1.6929 | 0.0916 | -0.5305 | 0.0399 |
| Incidence of tuberculosis (per 100,000 people) | 0.1905 | 0.0405 | 4.6982 | 0.0000 | 0.1107 | 0.2703 |
| Physicians (per 1,000 people) | -0.0222 | 0.0259 | -0.8581 | 0.3915 | -0.0732 | 0.0287 |
| School enrollment, primary, female (% gross) | -0.0208 | 0.1835 | -0.1136 | 0.9097 | -0.3820 | 0.3403 |

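One way to see how random effects accounts for between-country differences is through quasi-demeaning: each variable has a fraction `theta` of its country mean subtracted before OLS is run. The sketch below is a textbook simplification, assuming a single `theta` and known variance components (`sigma2_e` for the idiosyncratic error, `sigma2_u` for the country effect); the function names are introduced here for illustration and are not the library's internals:

```python
import numpy as np

def re_theta(sigma2_e, sigma2_u, n_periods):
    """Quasi-demeaning weight for an entity observed in `n_periods` periods."""
    return 1.0 - np.sqrt(sigma2_e / (sigma2_e + n_periods * sigma2_u))

def quasi_demean(values, entity_ids, theta):
    """Subtract theta times the entity mean from each observation."""
    values = np.asarray(values, dtype=float)
    entity_ids = np.asarray(entity_ids)
    out = values.copy()
    for ent in np.unique(entity_ids):
        mask = entity_ids == ent
        out[mask] = values[mask] - theta * values[mask].mean()
    return out

ids = np.array(["A", "A", "B", "B"])
y = np.array([1.0, 3.0, 10.0, 14.0])

theta_none = re_theta(sigma2_e=1.0, sigma2_u=0.0, n_periods=2)  # no entity variance
theta_big = re_theta(sigma2_e=1.0, sigma2_u=1e6, n_periods=2)   # entity variance dominates

print(quasi_demean(y, ids, theta_none))  # theta = 0: data unchanged, the pooled OLS case
print(quasi_demean(y, ids, 1.0))         # theta = 1: fully demeaned, the fixed-effects case
```

Read this way, random effects sits between pooled OLS and fixed effects, moving toward the latter as the country-level variance grows relative to the idiosyncratic variance.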

In [9]:
#re_result.variance_decomposition

## 2.2 Fixed effects

## 2.2.1 FE by entity

In [10]:
from linearmodels import PanelOLS
fe_result = PanelOLS(y,x, entity_effects=True).fit(cov_type='robust')
fe_result

| | | | |
|---|---|---|---|
| Dep. Variable: | Mortality rate, under-5 (per 1,000 live births) | R-squared: | 0.6529 |
| Estimator: | PanelOLS | R-squared (Between): | 0.6745 |
| No. Observations: | 300 | R-squared (Within): | 0.6529 |
| Date: | Mon, Mar 02 2020 | R-squared (Overall): | 0.6929 |
| Time: | 16:16:39 | Log-likelihood | 167.49 |
| Cov. Estimator: | Robust | | |
| | | F-statistic: | 43.674 |
| Entities: | 84 | P-value | 0.0000 |
| Avg Obs: | 3.5714 | Distribution: | F(9,209) |
| Min Obs: | 0.0000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 10.332 | 0.8331 | 12.402 | 0.0000 | 8.6898 | 11.974 |
| GDP per capita, PPP (constant 2005 international $) | -0.4877 | 0.0739 | -6.6013 | 0.0000 | -0.6333 | -0.3420 |
| Immunization, DPT (% of children ages 12-23 months) | 0.1549 | 0.1027 | 1.5084 | 0.1330 | -0.0475 | 0.3573 |
| Immunization, measles (% of children ages 12-23 months) | -0.4004 | 0.0857 | -4.6711 | 0.0000 | -0.5694 | -0.2314 |
| Improved sanitation facilities (% of population with access) | -0.2363 | 0.0917 | -2.5754 | 0.0107 | -0.4171 | -0.0554 |
| Improved water source (% of population with access) | -0.3839 | 0.1879 | -2.0431 | 0.0423 | -0.7543 | -0.0135 |
| Incidence of tuberculosis (per 100,000 people) | 0.2349 | 0.0651 | 3.6069 | 0.0004 | 0.1065 | 0.3633 |
| Physicians (per 1,000 people) | -0.0033 | 0.0331 | -0.0997 | 0.9207 | -0.0685 | 0.0619 |
| School enrollment, primary, female (% gross) | 0.0850 | 0.1706 | 0.4981 | 0.6189 | -0.2514 | 0.4213 |

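Entity fixed effects can be thought of in two equivalent ways: subtracting each country's own mean from every variable (the within transformation), or adding one dummy variable per country (LSDV). The sketch below checks that equivalence with plain NumPy on synthetic data; the data-generating process, the seed, and the helper `group_means` are assumptions made for illustration and are unrelated to the WDI sample:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 5, 4
entity = np.repeat(np.arange(n_entities), n_periods)

# Synthetic data: entity intercepts also enter x; the true slope is 2.
alpha = rng.normal(size=n_entities)
x = alpha[entity] + rng.normal(size=entity.size)
y = 2.0 * x + 3.0 * alpha[entity] + 0.1 * rng.normal(size=entity.size)

def group_means(values, groups):
    """Mean of `values` within each group, broadcast back to every row."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]

# (a) Within estimator: demean by entity, then regress without a constant.
x_w = x - group_means(x, entity)
y_w = y - group_means(y, entity)
beta_within = (x_w @ y_w) / (x_w @ x_w)

# (b) LSDV: regress y on x plus one dummy per entity.
dummies = (entity[:, None] == np.arange(n_entities)).astype(float)
X_lsdv = np.column_stack([x, dummies])
beta_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][0]

# (c) Pooled OLS with a single constant, shown for contrast.
X_pool = np.column_stack([np.ones_like(x), x])
beta_pooled = np.linalg.lstsq(X_pool, y, rcond=None)[0][1]

print(beta_within, beta_lsdv, beta_pooled)
```

The first two slopes agree up to floating-point error, while the pooled slope will generally differ because it also uses variation between entities.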

## 2.2.2 FE by entity, time

In [11]:
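# PanelOLS was imported in the previous cell.
# Note: this reassigns fe_result, so later cells refer to the entity and time effects model.
fe_result = PanelOLS(y, x, entity_effects=True, time_effects=True).fit(cov_type='robust')
fe_result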

| | | | |
|---|---|---|---|
| Dep. Variable: | Mortality rate, under-5 (per 1,000 live births) | R-squared: | 0.3324 |
| Estimator: | PanelOLS | R-squared (Between): | 0.4163 |
| No. Observations: | 300 | R-squared (Within): | 0.3122 |
| Date: | Mon, Mar 02 2020 | R-squared (Overall): | 0.4103 |
| Time: | 16:16:40 | Log-likelihood | 288.99 |
| Cov. Estimator: | Robust | | |
| | | F-statistic: | 11.343 |
| Entities: | 84 | P-value | 0.0000 |
| Avg Obs: | 3.5714 | Distribution: | F(9,205) |
| Min Obs: | 0.0000 | | |

| | Parameter | Std. Err. | T-stat | P-value | Lower CI | Upper CI |
|---|---|---|---|---|---|---|
| const | 5.3628 | 0.7411 | 7.2367 | 0.0000 | 3.9018 | 6.8239 |
| GDP per capita, PPP (constant 2005 international $) | -0.2355 | 0.0621 | -3.7940 | 0.0002 | -0.3579 | -0.1131 |
| Immunization, DPT (% of children ages 12-23 months) | 0.1693 | 0.0673 | 2.5140 | 0.0127 | 0.0365 | 0.3020 |
| Immunization, measles (% of children ages 12-23 months) | -0.1831 | 0.0606 | -3.0227 | 0.0028 | -0.3026 | -0.0637 |
| Improved sanitation facilities (% of population with access) | -0.0536 | 0.0675 | -0.7935 | 0.4284 | -0.1867 | 0.0795 |
| Improved water source (% of population with access) | -0.0936 | 0.1279 | -0.7316 | 0.4653 | -0.3458 | 0.1586 |
| Incidence of tuberculosis (per 100,000 people) | 0.2168 | 0.0508 | 4.2642 | 0.0000 | 0.1166 | 0.3171 |
| Physicians (per 1,000 people) | 0.0321 | 0.0222 | 1.4455 | 0.1498 | -0.0117 | 0.0758 |
| School enrollment, primary, female (% gross) | 0.2023 | 0.1135 | 1.7828 | 0.0761 | -0.0214 | 0.4261 |

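Adding time effects removes shocks and trends common to all countries in a given period, which is one plausible reading of why the GDP coefficient moves from -0.4877 with entity effects only to -0.2355 here. For a balanced panel, the two-way transformation subtracts the entity mean and the period mean and adds back the grand mean. The sketch below verifies on synthetic data that this matches a regression with explicit entity and time dummies; the data, seed, and helper names are illustrative assumptions, and the simple formula does not carry over unchanged to an unbalanced panel such as the one in this notebook:

```python
import numpy as np

rng = np.random.default_rng(1)
n_entities, n_periods = 6, 4
entity = np.repeat(np.arange(n_entities), n_periods)
period = np.tile(np.arange(n_periods), n_entities)

# Synthetic balanced panel: entity effects, a common time trend, true slope 1.5.
x = rng.normal(size=entity.size) + 0.5 * period
y = (1.5 * x + rng.normal(size=n_entities)[entity] - 0.8 * period
     + 0.1 * rng.normal(size=entity.size))

def group_means(values, groups):
    """Mean of `values` within each group, broadcast back to every row."""
    return (np.bincount(groups, weights=values) / np.bincount(groups))[groups]

def two_way_demean(values):
    # Valid as written only for a balanced panel.
    return (values - group_means(values, entity)
            - group_means(values, period) + values.mean())

x_tw, y_tw = two_way_demean(x), two_way_demean(y)
beta_two_way = (x_tw @ y_tw) / (x_tw @ x_tw)

# Same slope from explicit entity and time dummies (one time dummy dropped).
d_entity = (entity[:, None] == np.arange(n_entities)).astype(float)
d_time = (period[:, None] == np.arange(1, n_periods)).astype(float)
X = np.column_stack([x, d_entity, d_time])
beta_dummies = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(beta_two_way, beta_dummies)
```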

# Question 3:

## 3.1 Compare FE & RE

In [12]:
from linearmodels.panel import compare
# fe_result was reassigned in cell 11, so the 'FE' column is the entity and time effects model
compare({'FE':fe_result,'RE':re_result,'Pooled regression':pooled_res}).summary

# To export the comparison table to CSV instead:
#compare({'FE':fe_result,'RE':re_result,'Pooled regression':pooled_res}).summary.as_csv()

| | FE | RE | Pooled regression |
|---|---|---|---|
| Dep. Variable | Mortality rate, under-5 (per 1,000 live births) | Mortality rate, under-5 (per 1,000 live births) | Mortality rate, under-5 (per 1,000 live births) |
| Estimator | PanelOLS | RandomEffects | PooledOLS |
| No. Observations | 300 | 300 | 300 |
| Cov. Est. | Robust | Robust | Robust |
| R-squared | 0.3324 | 0.7563 | 0.7520 |
| R-Squared (Within) | 0.3122 | 0.6458 | 0.5594 |
| R-Squared (Between) | 0.4163 | 0.7158 | 0.7492 |
| R-Squared (Overall) | 0.4103 | 0.7279 | 0.7520 |
| F-statistic | 11.343 | 100.02 | 97.692 |

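The table compares fit statistics, but whether the fixed and random effects coefficients differ systematically is usually judged with a Hausman-type statistic. The sketch below implements the classical form, H = d' (V_FE - V_RE)^(-1) d with d = b_FE - b_RE, as a plain NumPy function. The function name and the numbers fed to it are placeholders, not estimates from this notebook, and the classical test assumes conventional (non-robust) covariance matrices, unlike the robust ones used in the cells above:

```python
import numpy as np

def hausman_statistic(b_fe, b_re, cov_fe, cov_re):
    """Return (statistic, degrees_of_freedom) for the classical Hausman test."""
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    cov_diff = np.asarray(cov_fe, dtype=float) - np.asarray(cov_re, dtype=float)
    stat = float(diff @ np.linalg.solve(cov_diff, diff))
    return stat, diff.size

# Placeholder numbers for illustration only; not estimates from this notebook.
b_fe = [1.0, 2.0]
b_re = [0.9, 2.1]
cov_fe = np.diag([0.02, 0.03])
cov_re = np.diag([0.01, 0.01])

stat, dof = hausman_statistic(b_fe, b_re, cov_fe, cov_re)
print(stat, dof)
```

With fitted models, the inputs would be the slope coefficients and their covariance blocks from an entity fixed effects fit and the random effects fit, with the constant excluded. The statistic is compared against a chi-squared distribution with `dof` degrees of freedom, and a large value points toward fixed effects.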

## 3.2. Pooled, cluster entity, cluster entity and time

In [13]:
mod = PooledOLS(y, x)
robust = mod.fit(cov_type='robust')
clust_entity = mod.fit(cov_type='clustered', cluster_entity=True)
clust_entity_time = mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)

In [14]:
from collections import OrderedDict
res = OrderedDict()
res['Robust'] = robust
res['Entity'] = clust_entity
#res['Entity-Time'] = clust_entity_time
compare(res)

| | Robust | Entity |
|---|---|---|
| Dep. Variable | Mortality rate, under-5 (per 1,000 live births) | Mortality rate, under-5 (per 1,000 live births) |
| Estimator | PooledOLS | PooledOLS |
| No. Observations | 300 | 300 |
| Cov. Est. | Robust | Clustered |
| R-squared | 0.7520 | 0.7520 |
| R-Squared (Within) | 0.5594 | 0.5594 |
| R-Squared (Between) | 0.7492 | 0.7492 |
| R-Squared (Overall) | 0.7520 | 0.7520 |
| F-statistic | 97.692 | 97.692 |

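The fit statistics in the two columns are identical because the covariance choice leaves the pooled OLS coefficients untouched and changes only the standard errors. A minimal NumPy sketch of the two sandwich estimators follows, assuming no small-sample corrections (so its numbers would not match library output exactly) and using synthetic data with a shared entity-level error component; the function `ols_with_covariances` is a name introduced here:

```python
import numpy as np

def ols_with_covariances(X, y, clusters):
    """OLS estimates plus HC0 and one-way cluster-robust covariances
    (no small-sample corrections)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta

    scores = X * resid[:, None]                  # per-observation score x_i * u_i
    meat_hc0 = scores.T @ scores

    meat_cl = np.zeros_like(meat_hc0)
    for g in np.unique(clusters):
        s_g = scores[clusters == g].sum(axis=0)  # sum the scores within each cluster
        meat_cl += np.outer(s_g, s_g)

    return beta, bread @ meat_hc0 @ bread, bread @ meat_cl @ bread

rng = np.random.default_rng(2)
entity = np.repeat(np.arange(8), 5)
x = rng.normal(size=entity.size)
# Errors share an entity-level component, so they are correlated within entity.
u = rng.normal(size=8)[entity] + 0.5 * rng.normal(size=entity.size)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones_like(x), x])

beta, cov_hc0, cov_cluster = ols_with_covariances(X, y, entity)
# Treating every observation as its own cluster reproduces HC0.
_, _, cov_singletons = ols_with_covariances(X, y, np.arange(entity.size))

print(beta)
print(np.sqrt(np.diag(cov_hc0)), np.sqrt(np.diag(cov_cluster)))
```

Clustering by entity allows the errors of one country to be correlated across periods, which is why the standard errors, rather than the rows shown in the comparison table, are the place to look for a difference between the two fits.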