Panel data regression models on UN General Assembly data from https://www.kaggle.com/unitednations/general-assembly (compiled and published by Professor Erik Voeten of Georgetown University).
Panel data is essentially cross-sectional data in which the same units are sampled many times rather than once. This adds a time dimension that can be controlled for, alongside the group variable (in this case, nations). Controlling for time makes it possible to account for factors that change over time but are constant across groups.
This UN General Assembly voting data qualifies as panel data because it:
1.) Samples the same groups multiple times over time
2.) Collects attributes on these groups (yes votes, no votes, abstentions, affinity scores)
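To make the two criteria concrete, here is a minimal sketch in base R on a made-up toy panel. The data frame `toy` and its values are hypothetical and not taken from the UN data. It checks that each country-year pair appears only once and counts how many years each country is observed.

```r
# Hypothetical toy panel: 3 countries observed in 3 years each
toy <- data.frame(
  state_name = rep(c("A", "B", "C"), each = 3),
  year       = rep(2000:2002, times = 3),
  no_votes   = c(5, 7, 6, 2, 3, 4, 9, 8, 10)
)

# Criterion 1: the same groups are observed repeatedly over time
obs_per_country <- table(toy$state_name)

# Each (country, year) pair should identify exactly one row
no_duplicates <- !any(duplicated(toy[, c("state_name", "year")]))

# The panel is balanced if every country has the same number of years
is_balanced <- length(unique(as.vector(obs_per_country))) == 1

print(obs_per_country)
print(c(no_duplicates = no_duplicates, is_balanced = is_balanced))
```

The same checks could be run on paneldata with state_name and year as the index columns.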
This code compares three panel data regression techniques: Pooled OLS, Fixed Effects, and Random Effects.
Assumptions of each model:
- Pooled OLS assumes that effects are the same across countries and over time, i.e. that there is no unobserved individual heterogeneity (or that any such heterogeneity is uncorrelated with the regressors).
- Fixed effects assumes individual heterogeneity that does not vary over time and that may be correlated with the regressors.
- Random effects assumes unique, time-constant attributes of groups/individuals that are not correlated with the regressors.
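The difference between the fixed and random effects assumptions (whether the individual effects are correlated with the regressors) is commonly checked with a Hausman test. The following is a minimal sketch, assuming the plm package is installed. It uses plm's bundled Grunfeld dataset rather than the UN data, and the names `fe_fit` and `re_fit` are illustrative only.

```r
library(plm)

# Grunfeld ships with plm; firm and year form the panel index
data("Grunfeld", package = "plm")

fe_fit <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "within")
re_fit <- plm(inv ~ value + capital, data = Grunfeld,
              index = c("firm", "year"), model = "random")

# Hausman test: the null hypothesis is that the individual effects are
# uncorrelated with the regressors, in which case random effects is consistent
hausman <- phtest(fe_fit, re_fit)
print(hausman)
```

A small p-value is evidence against the random effects assumption and favors the fixed effects model.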
#Packages required
library(plm) #for panel data regressions
library(lmtest) #for coeftest(), used below with robust standard errors
This data has many missing values, so I imputed all NAs in the continuous variables with the column averages. Rows with a missing value for the categorical variable year were excluded in Excel.
# Mean-impute the continuous columns inside paneldata, so that the models below
# (which all use data=paneldata) actually see the imputed values
continuous_vars <- c("abstain", "yes_votes", "no_votes", "idealpoint_estimate",
                     "affinityscore_usa", "affinityscore_russia", "affinityscore_israel",
                     "affinityscore_china", "affinityscore_brazil", "affinityscore_india")
for (v in continuous_vars) {
  paneldata[[v]][is.na(paneldata[[v]])] <- mean(paneldata[[v]], na.rm = TRUE)
}
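The imputation step can also be written as a small reusable helper. This is a sketch in base R; the function name `impute_mean` and the toy data frame are hypothetical and only illustrate the idea.

```r
# Replace NAs in a numeric vector with the mean of its observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical example data, not the UN data
toy <- data.frame(
  yes_votes = c(10, NA, 14),
  abstain   = c(NA, 2, 4),
  year      = c(2000, 2001, 2002)
)

cols <- c("yes_votes", "abstain")   # columns to impute
toy[cols] <- lapply(toy[cols], impute_mean)
print(toy)
```

Mean imputation leaves each column's mean unchanged but reduces its variance, which is worth keeping in mind when reading the standard errors of the models.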
Simple OLS regression that ignores the time and group aspect of the data.
pooled = lm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia,data=paneldata)
summary(pooled)
The adjusted R-squared is 0.65. Adding time dummy variables may improve it.
#Pooled OLS estimator with time dummies:
Pooled2=plm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia+factor(year),data=paneldata,index=c("state_name","year"),model='pooling')
summary(Pooled2)
Many of the year dummies are statistically significant, i.e. the session year affects the number of no votes. Adjusted R-squared increased to 0.75, meaning that roughly 75% of the variation in no_votes is explained by the model.
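# coeftest() (from the lmtest package) reports cluster-robust standard errors, here clustered by time.
# cluster can be "time" or "group"; the wrapper function passes the choice to vcovHC explicitly.
coeftest(Pooled2,vcov.=function(x) vcovHC(x,cluster="time"))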
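To see how the clustering choice changes the standard errors but not the coefficients, here is a minimal sketch, assuming plm and lmtest are installed. It uses plm's bundled Grunfeld dataset instead of the UN data, and `pooled_fit` is an illustrative name.

```r
library(plm)
library(lmtest)   # provides coeftest()

data("Grunfeld", package = "plm")

pooled_fit <- plm(inv ~ value + capital, data = Grunfeld,
                  index = c("firm", "year"), model = "pooling")

# Wrap vcovHC so the cluster choice is passed explicitly
se_by_time  <- coeftest(pooled_fit,
                        vcov. = function(x) vcovHC(x, cluster = "time"))
se_by_group <- coeftest(pooled_fit,
                        vcov. = function(x) vcovHC(x, cluster = "group"))

print(se_by_time)
print(se_by_group)
```

The estimates are identical in both tables. Only the standard errors, t-values and p-values differ.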
Fixed effects (within) model, which takes the group variable (country) into account.
fixedeffects =plm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia,data=paneldata,index=c("state_name","year"),model='within')
summary(fixedeffects)
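The R-squared is fairly low at 0.27. This may be due to missing important time-related factors. Note also that the R-squared plm reports for a within model is computed on the demeaned data, so it measures only within-country variation and is not directly comparable to the R-squared of the OLS models.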
OLS with dummy variables for country
olscountrydv =lm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia+factor(state_name),data=paneldata)
summary(olscountrydv)
This performs quite well, with an adjusted R-squared of 0.786, but it is still missing time-related factors.
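The within estimator and OLS with group dummies are two routes to the same slope coefficients, while their R-squared values differ. The sketch below illustrates this in base R on simulated data. All variable names and values are hypothetical and unrelated to the UN data.

```r
# Hypothetical simulated panel: 4 groups, 5 periods each
set.seed(1)
g     <- rep(1:4, each = 5)
alpha <- c(2, -1, 0.5, 3)[g]          # group-specific intercepts
x     <- rnorm(20) + alpha            # regressor correlated with the group effect
y     <- 1.5 * x + alpha + rnorm(20, sd = 0.1)

# Dummy-variable approach: OLS with one dummy per group
lsdv_fit <- lm(y ~ x + factor(g))

# Within approach: subtract group means, then run OLS without an intercept
x_dm <- x - ave(x, g)
y_dm <- y - ave(y, g)
within_fit <- lm(y_dm ~ x_dm - 1)

slope_lsdv   <- unname(coef(lsdv_fit)["x"])
slope_within <- unname(coef(within_fit)["x_dm"])
print(c(lsdv = slope_lsdv, within = slope_within))

# The dummy-variable R-squared also counts variation explained by the dummies
r2_lsdv   <- summary(lsdv_fit)$r.squared
r2_within <- summary(within_fit)$r.squared
print(c(lsdv_r2 = r2_lsdv, within_r2 = r2_within))
```

This is why a low R-squared from model='within' and a high R-squared from the country-dummy regression can describe the same fitted slopes.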
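Random effects models take the group and time structure into account by treating the unobserved group or time effects as random components of the error term. This is valid only if those effects are uncorrelated with the regressors; otherwise omitted variable bias remains.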
#random effects model
randomeffects=plm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia,data=paneldata,index=c("state_name","year"),model='random')
summary(randomeffects)
Predictive power is still relatively low. Let's try a random effects model with time effects (effect="time") instead:
randomeffect2 =plm(no_votes~yes_votes+abstain+idealpoint_estimate+affinityscore_usa+affinityscore_brazil+affinityscore_china+affinityscore_india
+affinityscore_israel+affinityscore_russia,data=paneldata,index=c("state_name","year"),effect="time",model='random')
summary(randomeffect2)
About 72% of the variation in no_votes is explained by the random effects model with time effects.
Panel Data conclusion:
Fixed effects estimated with country dummies had the highest predictive power, with an adjusted R-squared of 0.786. Second was pooled OLS with time dummies, with an adjusted R-squared of 0.75. Last, random effects with time effects had an adjusted R-squared of 0.72. All three models had similar predictive power. The country dummies explained slightly more of the variation in no_votes than the year dummies did.