# PS 3 - Week 14 - Multivariate Regression

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

## Part 1: Voting for the ACA and 2010 Vote Share as Difference of Means/Regression
We are going to return to the example of the Affordable Care Act, and look at the political implications of the bill. Recall that the Democrats lost 63 seats in the House in the 2010 midterms after the passing of the ACA. Let's explore if there is evidence that this was driven by votes on the bill.

First, we are going to load up some data on these house elections. The data come from <a href="https://journals.sagepub.com/doi/abs/10.1177/1532673X11433768">this paper</a>, and are stored in Stata format. We can read this in using the `read_stata` function from the pandas library.

In [None]:
hcr_mid = pd.read_stata("hcr_midterm.dta")
hcr_mid

How someone voted on the ACA is stored in the variable `hcr_yes`, and whether they are a Republican or Democrat is in `party`. Let's look at the relationship between these two.

In [None]:
pd.crosstab(hcr_mid["hcr_yes"], hcr_mid["party"])

Only <a href="https://en.wikipedia.org/wiki/Joseph_Cao">one Republican</a> voted for the bill, and 39 Democrats voted against it. The main comparison we want to make is whether the Democrats who voted for the bill did better or worse in the 2010 midterms than those who voted against it.

To do this, let's first subset the data to districts with Democratic incumbents who ran in competitive elections.

In [None]:
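# .copy() so that adding columns to the subset later does not raise SettingWithCopyWarning
hcr_mid = hcr_mid[(hcr_mid["dem_n"] > 0) & (hcr_mid["party"] == "D")].copy()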
hcr_mid

The `dem_n` variable is the Democratic vote share in the 2010 midterms. Let's look at the distribution:

In [None]:
np.mean(hcr_mid["dem_n"])

In [None]:
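# histplot replaces distplot, which is deprecated in seaborn and removed in newer releases
sns.histplot(hcr_mid["dem_n"], kde=True)
plt.axvline(50)  # vertical line at 50% of the vote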

The average D running for re-election received almost 60% of the vote, but quite a few lost re-election.

Now let's compare the performance of those who voted for and against the bill. First, let's create separate data files for the Y and N voters.

In [None]:
hcr_mid_y = hcr_mid[hcr_mid["hcr_yes"] == 1]
hcr_mid_n = hcr_mid[hcr_mid["hcr_yes"] == 0]

And take the averages:

In [None]:
mean_y = np.mean(hcr_mid_y["dem_n"])
mean_y

In [None]:
mean_n = np.mean(hcr_mid_n["dem_n"])
mean_n

Here is the raw difference of means:

In [None]:
dom = mean_y - mean_n
dom

This indicates those who voted Y did about 13 percentage points better, which is a huge difference! Let's do a t-test to see if this is statistically significant.

In [None]:
t_model = stats.ttest_ind(hcr_mid_y["dem_n"], hcr_mid_n["dem_n"])
t_model

With a t-statistic above 6, this is easily statistically significant at the p < .01 level.

As discussed in the lecture, we can also test this with a bivariate regression where our IV is the vote and the DV is the vote share.

In [None]:
ols_model = stats.linregress(hcr_mid["hcr_yes"], hcr_mid["dem_n"])
ols_model

The "slope" is equal to the difference of means, and the intercept is the mean of those who voted N. (Why?) Note we also get the same p value as the difference of means test. We can also check the t value is the same by dividing the slope by the standard error:

In [None]:
ols_model.slope / ols_model.stderr  # slope divided by its standard error
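To see why the slope matches the difference of means, here is a minimal sketch on synthetic data. The arrays `vote` and `share` are made-up stand-ins for `hcr_yes` and `dem_n`, not the real data. With a 0/1 independent variable, the fitted line has to pass through the two group means, so the intercept is the mean of the 0 group and the slope is the gap between the groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumed values, not the real data)
vote = np.array([0] * 40 + [1] * 160)                      # 0 = voted N, 1 = voted Y
share = 50 + 10 * vote + rng.normal(0, 8, size=vote.size)  # vote share

mean_n = share[vote == 0].mean()
mean_y = share[vote == 1].mean()

fit = stats.linregress(vote, share)
t_test = stats.ttest_ind(share[vote == 1], share[vote == 0])  # pooled-variance t-test

print("difference of means: ", mean_y - mean_n)
print("regression slope:    ", fit.slope)
print("mean of N group:     ", mean_n)
print("regression intercept:", fit.intercept)
print("t from regression:   ", fit.slope / fit.stderr)
print("t from t-test:       ", t_test.statistic)
```

Each pair of printed numbers should agree up to floating-point error. The t statistics match because `ttest_ind` assumes equal variances by default, which is the same assumption behind the usual regression standard error.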

## Part 2: Adding District Liberalness
Now let's think about some reasons why this relationship might not be causal. A major confounding variable is that those who voted Yes likely represent more liberal districts, making their re-election easier. To check this, we will also bring Obama's 2008 vote share into our analysis.

First, let's look at the relationship between Obama's 2008 vote share and the House members' 2010 vote share.

In [None]:
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid)

As we can see, there is a strong positive relationship. This shouldn't surprise us: most people vote for the same party consistently.

Next, let's look at how Obama's vote share relates to the vote on the ACA.

In [None]:
sns.scatterplot(x='obama', y='hcr_yes', data=hcr_mid)

This looks a little goofy because the `hcr_yes` variable only takes on values of 0 or 1. We can still run a "linear probability model" with Obama vote share as the independent ($X$) variable and the vote as the dependent ($Y$) variable.

In [None]:
vote_model = stats.linregress(hcr_mid["obama"], hcr_mid["hcr_yes"])
vote_model

Since the DV here is binary, we can interpret the slope as meaning "as Obama vote share goes up by 1 percentage point, the probability of voting for the ACA increases by 1.4 percentage points."

We can plot this prediction:


In [None]:
sns.scatterplot(x='obama', y='hcr_yes', data=hcr_mid)
xrange = np.arange(30, 100)
plt.plot(xrange, vote_model[1] + xrange*vote_model[0])

Another way to plot this is by looking at what proportion of D's voted yes within "bins" of Obama vote share. The red dots plot the proportion who voted for the ACA among those in districts where Obama got 30%-39%, 40%-49%, 50%-59%, etc.

In [None]:
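hcr_mid["obama_group"] = 10*np.floor(hcr_mid["obama"]/10)
# Select hcr_yes before averaging so non-numeric columns (e.g. party) are not averaged
bin_means = hcr_mid.groupby("obama_group")["hcr_yes"].mean()
plt.plot(bin_means.index, bin_means.values, "ro")
plt.plot(xrange, vote_model[1] + xrange*vote_model[0])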

This may not be the best way to model the relationship, but it is certainly positive. See Chapter 12 of K&W for more information about alternative approaches.
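One reason a straight line may not be the best model is that its predictions are not forced to stay between 0 and 1. The sketch below uses synthetic data, where `obama` and `voted_yes` are made-up stand-ins and the rule generating them is an assumption, to show the predicted "probabilities" from a linear probability model at a few vote shares.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic stand-ins (assumed values, not the real data): the chance of a
# yes vote rises with district vote share and is capped at 1.
obama = rng.uniform(30, 95, size=1000)
p_yes = np.clip((obama - 30) / 40, 0, 1)
voted_yes = rng.binomial(1, p_yes)

# Linear probability model: regress the 0/1 vote on vote share
lpm = stats.linregress(obama, voted_yes)

grid = np.array([30, 50, 70, 90])
pred = lpm.intercept + lpm.slope * grid

for x, p in zip(grid, pred):
    print(f"obama = {x}: predicted probability = {p:.2f}")
```

In this simulated setup the prediction for the most liberal districts comes out above 1, which cannot be a real probability. The slope is still a useful summary of the average relationship.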

## Multivariate Analysis
Now let's visualize the three variables together by making a scatter plot of Obama vote share and 2010 Democratic vote share, with green dots for those who voted Y and orange dots for those who voted N.

In [None]:
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_y, color="green")
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_n, color="orange")

Note the orange dots are to the left, meaning those who voted N were generally in more moderate/conservative districts.

We can visualize the difference in average vote share by adding horizontal lines:

In [None]:
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_y, color="green")
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_n, color="orange")
plt.axhline(mean_y, color='green')
plt.axhline(mean_n, color='orange')

And we can look at the relationship between the vote and how liberal the district was by plotting the average Obama vote share among the Y and N districts:

In [None]:
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_y, color="green")
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_n, color="orange")
plt.axvline(np.mean(hcr_mid_y["obama"]), color='green')
plt.axvline(np.mean(hcr_mid_n["obama"]), color='orange')

Combining what we know so far: districts where members voted Y did much better in the election, but these were also just more liberal ("safe") districts.

Now let's use multivariate regression to "control" for Obama vote share.

First, let's re-do our bivariate analysis using the `ols` function from the `statsmodels.formula.api` library, which is a nice library for multivariate regression. (As a side note, it mimics the syntax of regression from R.)

We will do this in two steps. In the first, we "fit a model" using `modelname = smf.ols(formula, data=df).fit()`. The formula will always take the form `DV ~ IV1 + IV2 + ...`, using the names of the variables in `df`.

We then get a summary of the output by calling the `.summary()` method on our fitted model.

In [None]:
ols_biv = smf.ols('dem_n ~ hcr_yes', data=hcr_mid).fit()
ols_biv.summary()

Now let's do the same, adding the Obama vote share. The code is the same but we add `+ obama` to the formula to indicate we should include this variable too.

In [None]:
ols_biv = smf.ols('dem_n ~ hcr_yes + obama', data=hcr_mid).fit()
ols_biv.summary()

Now there is a negative coefficient on `hcr_yes`!

To make a plot, we can pull out the coefficients by adding a `.params` after the name of our fitted model.

In [None]:
ols_biv.params

Let's overlay the predicted value for Y and N votes as a function of Obama vote share.

In [None]:
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_y, color="green")
sns.scatterplot(x='obama', y='dem_n', data=hcr_mid_n, color="orange")
xrange = np.arange(30, 100)
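# Look up coefficients by name; positional indexing of .params is deprecated in recent pandas
b0, b_yes, b_obama = ols_biv.params["Intercept"], ols_biv.params["hcr_yes"], ols_biv.params["obama"]
plt.plot(xrange, b0 + xrange*b_obama, color="orange")         # predicted line for N voters
plt.plot(xrange, b0 + b_yes + xrange*b_obama, color="green")  # predicted line for Y voters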

This is a nice illustration of what we mean by "controlling for Obama vote share" or "holding Obama vote share constant". The model accounts for both of these variables, and so the prediction is that *for a fixed level of Obama vote share*, those who vote N do 4.5 percentage points better. Visually, there are two parallel lines, and the N line is always 4.5 points higher.

Knowing the Obama vote share for each Democrat in the House, the model does not predict that those who vote N will do better in general, because they tend to be in more conservative districts. But it does predict that for any two members in districts where Obama got the same vote share, where one votes Y and one votes N, the one voting N will do 4.5 points better.

The takeaway is that the main driver of how well D's running did in 2010 is how well Obama did in 2008. The fact that those who voted for the ACA did better overall is hence misleading!

This model predicts that in a counterfactual world where they had voted for the ACA, those who voted against it would have done about 4.5 points worse in their re-election bids. Conversely, those who voted for it could have done better (and some likely would have been re-elected) if they had voted against it.
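The sign flip can be reproduced with a small simulation. This is only a sketch: the data are synthetic, and the numbers (a 4 point cost of voting Y, a 0.8 slope on Obama vote share) are assumptions chosen for illustration. It also checks the omitted-variable identity for OLS, which says the bivariate coefficient equals the multivariate coefficient plus the `obama` coefficient times the gap in average `obama` between Y and N districts.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2000

# Synthetic data (assumed values): liberal districts make a Y vote more
# likely AND raise vote share; the vote itself costs 4 points.
obama = rng.uniform(35, 90, size=n)
hcr_yes = rng.binomial(1, np.clip((obama - 35) / 30, 0, 1))
dem_n = 10 + 0.8 * obama - 4 * hcr_yes + rng.normal(0, 5, size=n)
df = pd.DataFrame({"obama": obama, "hcr_yes": hcr_yes, "dem_n": dem_n})

short_fit = smf.ols("dem_n ~ hcr_yes", data=df).fit()          # no control
long_fit = smf.ols("dem_n ~ hcr_yes + obama", data=df).fit()   # controls for obama
aux_fit = smf.ols("obama ~ hcr_yes", data=df).fit()            # Y vs N gap in obama

# Omitted-variable identity: short slope = long slope + b_obama * gap in obama
implied = (long_fit.params["hcr_yes"]
           + long_fit.params["obama"] * aux_fit.params["hcr_yes"])

print("bivariate coefficient:   ", short_fit.params["hcr_yes"])
print("multivariate coefficient:", long_fit.params["hcr_yes"])
print("implied by the identity: ", implied)
```

The bivariate coefficient comes out positive even though the simulated vote hurts, because Y voters sit in districts with much higher `obama` values. Once `obama` is in the model, the coefficient on `hcr_yes` turns negative.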

Some important caveats:
- This may not be the "right" model: there might be remaining confounding variables.
- We are implicitly assuming that the effect of voting for the ACA is the same for everyone, which is probably not true: in very liberal districts voters probably would have been unhappy if their representative voted no!
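One way to relax the second assumption is to add an interaction term, so the gap between Y and N voters can change with Obama vote share. The sketch below is not part of the analysis above. It uses synthetic data in which, by assumption, a Y vote costs 10 points where Obama got 40% and the cost shrinks as the district gets more liberal.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000

# Synthetic data (assumed values): the cost of a Y vote is 10 points at
# obama = 40 and shrinks by 0.25 points per extra point of Obama share.
obama = rng.uniform(35, 90, size=n)
hcr_yes = rng.binomial(1, 0.5, size=n)
true_effect = -10 + 0.25 * (obama - 40)
dem_n = 10 + 0.8 * obama + true_effect * hcr_yes + rng.normal(0, 5, size=n)
df = pd.DataFrame({"obama": obama, "hcr_yes": hcr_yes, "dem_n": dem_n})

# 'hcr_yes * obama' expands to hcr_yes + obama + hcr_yes:obama
inter = smf.ols("dem_n ~ hcr_yes * obama", data=df).fit()

def effect_of_yes(obama_share):
    """Estimated Y-minus-N gap at a given Obama vote share."""
    return inter.params["hcr_yes"] + inter.params["hcr_yes:obama"] * obama_share

for share in (40, 60, 80):
    print(f"obama = {share}: estimated effect of voting Y = {effect_of_yes(share):.1f}")
```

With the interaction, the two prediction lines are no longer parallel: the estimated penalty for voting Y is largest in the most conservative districts and fades in the most liberal ones.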

# Ordinal Variables

To get a bit more practice with multivariate regression and how to interpret regression with ordinal variables, we will do some quick analysis with the National Election Study of 2004 (in the United States). This is a major survey of Americans and their political attitudes.

<a href='https://berkeley.app.box.com/file/745549999463'>Here is a codebook</a> with some detail about the variables. In short, the main variables we will look at are:
- `bush_therm`, which is how much people like George W Bush on a scale from 0 to 100
- `education`, an ordinal variable ranging from 1 (8 grades or less) to 7 (Advanced degree)
- `income`, an ordinal variable ranging from 1 (none or less than 2999) to 23 (120,000 and over)

In [3]:
nes = pd.read_stata("nes2004subset.dta")
nes

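| | religion | bush | female | unionhouse | partyid | eval_WoT | eval_HoE | ideology | bush_therm | education | income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.0 | 0.0 | 1.0 | 3.0 | | 0.0 | 4.0 | 70.0 | 7.0 | 17.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | -1.0 | -1.0 | 4.0 | 40.0 | 4.0 | 19.0 |
| 2 | 1.0 | 1.0 | 1.0 | 0.0 | 6.0 | 2.0 | 2.0 | 6.0 | 100.0 | 6.0 | 23.0 |
| 3 | 1.0 | | 0.0 | 0.0 | 3.0 | -2.0 | -1.0 | 4.0 | 50.0 | 2.0 | 3.0 |
| 4 | 1.0 | 1.0 | 1.0 | 0.0 | 6.0 | 2.0 | 2.0 | 6.0 | 100.0 | 3.0 | 12.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1207 | 1.0 | 1.0 | 1.0 | 0.0 | 5.0 | 2.0 | | | 100.0 | 3.0 | 6.0 |
| 1208 | 1.0 | | 1.0 | 0.0 | 5.0 | 2.0 | 1.0 | 7.0 | 70.0 | 4.0 | 13.0 |
| 1209 | 2.0 | 1.0 | 0.0 | 1.0 | 6.0 | 2.0 | 1.0 | 6.0 | 85.0 | 6.0 | 18.0 |
| 1210 | 2.0 | | 1.0 | 1.0 | 1.0 | 2.0 | -1.0 | 4.0 | 70.0 | 5.0 | 20.0 |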


Let's first look at the bivariate relationship between `education` and `bush_therm`.

In [4]:
idmodel_biv1 = smf.ols('bush_therm ~ education', data=nes).fit()
idmodel_biv1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bush_therm | R-squared: | 0.006 |
| Model: | OLS | Adj. R-squared: | 0.005 |
| Method: | Least Squares | F-statistic: | 7.213 |
| Date: | Tue, 24 Nov 2020 | Prob (F-statistic): | 0.00734 |
| Time: | 12:39:55 | Log-Likelihood: | -5948.7 |
| No. Observations: | 1207 | AIC: | 11900.0 |
| Df Residuals: | 1205 | BIC: | 11910.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 61.8608 | 2.751 | 22.490 | 0.000 | 56.464 | 67.257 |
| education | -1.6074 | 0.598 | -2.686 | 0.007 | -2.782 | -0.433 |

| | | | |
|---|---|---|---|
| Omnibus: | 719.143 | Durbin-Watson: | 1.958 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 84.762 |
| Skew: | -0.286 | Prob(JB): | 3.93e-19 |
| Kurtosis: | 1.834 | Cond. No. | 13.7 |


This means that going up by one education category (e.g., "high school degree" to "some college" or "completed college" to "advanced degree") is associated with liking Bush less by 1.6 points on the 100 point scale. This isn't a huge effect, but it is highly statistically significant.
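To make the interpretation concrete, here is a small sketch that plugs each education category into the fitted line, using the intercept and slope reported in the output above. Nothing here is re-estimated. It just does the arithmetic.

```python
# Coefficients copied from the bivariate education model output
intercept = 61.8608
slope = -1.6074

# Predicted bush_therm for each education category (1 through 7)
predictions = {level: intercept + slope * level for level in range(1, 8)}
for level, value in predictions.items():
    print(f"education = {level}: predicted bush_therm = {value:.1f}")

# Implied gap between the highest and lowest categories
gap = predictions[7] - predictions[1]
print(f"category 7 minus category 1: {gap:.1f} points")
```

Treating an ordinal variable as a number in the regression assumes every one-category step is associated with the same change in `bush_therm`, which is why the predictions fall by the same amount at each level.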

Now let's do the same thing for income.

In [5]:
idmodel_biv2 = smf.ols('bush_therm ~ income', data=nes).fit()
idmodel_biv2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bush_therm | R-squared: | 0.008 |
| Model: | OLS | Adj. R-squared: | 0.007 |
| Method: | Least Squares | F-statistic: | 8.305 |
| Date: | Tue, 24 Nov 2020 | Prob (F-statistic): | 0.00403 |
| Time: | 12:41:16 | Log-Likelihood: | -5258.4 |
| No. Observations: | 1066 | AIC: | 10520.0 |
| Df Residuals: | 1064 | BIC: | 10530.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 46.9584 | 2.773 | 16.932 | 0.000 | 41.516 | 52.400 |
| income | 0.4959 | 0.172 | 2.882 | 0.004 | 0.158 | 0.834 |

| | | | |
|---|---|---|---|
| Omnibus: | 638.066 | Durbin-Watson: | 1.908 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 76.193 |
| Skew: | -0.296 | Prob(JB): | 2.85e-17 |
| Kurtosis: | 1.832 | Cond. No. | 43.6 |


Going up by 1 income category (e.g., 17,000-19,999 to 20,000-21,999 or 60,000-74,999 to 75,000-89,999) is associated with liking Bush more by about half a point.

Note that income and education are also going to be positively correlated. So, when looking at an increase in education, we are probably capturing both the effect of education that runs through higher income and its effect through other channels. Similarly, by comparing people in different income categories, we are also comparing people with different levels of education.

Now let's run a multivariate regression with both of these factors, which will allow us to answer "keeping income fixed, what is the relationship between education and liking Bush?" and "keeping education fixed, what is the relationship between income and liking Bush?"

In [6]:
idmodel_multi1 = smf.ols('bush_therm ~ education + income', data=nes).fit()
idmodel_multi1.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bush_therm | R-squared: | 0.027 |
| Model: | OLS | Adj. R-squared: | 0.025 |
| Method: | Least Squares | F-statistic: | 14.48 |
| Date: | Tue, 24 Nov 2020 | Prob (F-statistic): | 6.26e-07 |
| Time: | 12:42:34 | Log-Likelihood: | -5248.2 |
| No. Observations: | 1066 | AIC: | 10500.0 |
| Df Residuals: | 1063 | BIC: | 10520.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 55.3680 | 3.317 | 16.691 | 0.000 | 48.859 | 61.877 |
| education | -3.2054 | 0.708 | -4.527 | 0.000 | -4.595 | -1.816 |
| income | 0.8699 | 0.189 | 4.591 | 0.000 | 0.498 | 1.242 |

| | | | |
|---|---|---|---|
| Omnibus: | 500.403 | Durbin-Watson: | 1.902 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 69.215 |
| Skew: | -0.262 | Prob(JB): | 9.34e-16 |
| Kurtosis: | 1.867 | Cond. No. | 54.8 |


Both of the relationships become *stronger* here. This shouldn't surprise us, since education and income are positively correlated with each other but have opposite correlations with liking Bush.

So, it seems that, for a fixed income level, people with more education like Bush a fair amount less. However, without controlling for income, some of this relationship gets masked because better educated people are also more wealthy, and wealthy people tend to like Bush.

Similarly, for a fixed level of education, people with more income like Bush a fair amount more. However, without controlling for education, some of this relationship gets masked because richer people tend to be better educated.
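This masking pattern is easy to reproduce in a simulation. The sketch below uses synthetic, continuous stand-ins for education and income (not the ordinal survey variables), with assumed coefficients of -3 and +3, to show that each bivariate slope is pulled toward zero when the two predictors are positively correlated but push the outcome in opposite directions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 5000

# Synthetic stand-ins (assumed values): income rises with education, and the
# two have opposite-signed relationships with the thermometer score.
education = rng.normal(0, 1, size=n)
income = 0.6 * education + rng.normal(0, 0.8, size=n)
therm = 50 - 3 * education + 3 * income + rng.normal(0, 10, size=n)
df = pd.DataFrame({"education": education, "income": income, "therm": therm})

biv_edu = smf.ols("therm ~ education", data=df).fit().params["education"]
biv_inc = smf.ols("therm ~ income", data=df).fit().params["income"]
multi = smf.ols("therm ~ education + income", data=df).fit().params

print(f"education: bivariate {biv_edu:.2f}, multivariate {multi['education']:.2f}")
print(f"income:    bivariate {biv_inc:.2f}, multivariate {multi['income']:.2f}")
```

In the bivariate models each variable partly picks up the opposite-signed relationship of the other, so both slopes look weaker than they do once the other variable is held fixed.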

We can also control for lots of other variables:

In [7]:
# New name so the two-variable model fitted above is not overwritten
idmodel_multi2 = smf.ols('bush_therm ~ education + income + female + ideology', data=nes).fit()
idmodel_multi2.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | bush_therm | R-squared: | 0.319 |
| Model: | OLS | Adj. R-squared: | 0.316 |
| Method: | Least Squares | F-statistic: | 96.02 |
| Date: | Tue, 24 Nov 2020 | Prob (F-statistic): | 5.17e-67 |
| Time: | 12:46:57 | Log-Likelihood: | -3931.1 |
| No. Observations: | 825 | AIC: | 7872.0 |
| Df Residuals: | 820 | BIC: | 7896.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 3.8839 | 4.641 | 0.837 | 0.403 | -5.226 | 12.994 |
| education | -1.5650 | 0.686 | -2.280 | 0.023 | -2.912 | -0.218 |
| income | 0.2925 | 0.190 | 1.543 | 0.123 | -0.080 | 0.665 |
| female | -0.6110 | 2.007 | -0.305 | 0.761 | -4.550 | 3.328 |
| ideology | 12.6849 | 0.680 | 18.641 | 0.000 | 11.349 | 14.021 |

| | | | |
|---|---|---|---|
| Omnibus: | 14.679 | Durbin-Watson: | 1.968 |
| Prob(Omnibus): | 0.001 | Jarque-Bera (JB): | 15.225 |
| Skew: | -0.332 | Prob(JB): | 0.000494 |
| Kurtosis: | 2.948 | Cond. No. | 84.9 |


One interesting observation here is that there is no statistically significant relationship between gender and liking Bush. (I suspect this would not be true of more recent Republican presidents!)

Note the ideology variable (measured on a 7 point scale) has a large coefficient which is highly statistically significant. This shouldn't surprise us: more conservative people like Bush better.

Note that our coefficients for education and income shrink in magnitude once we control for ideology. One reason that people with more income like Bush more is that people with more income tend to be more conservative. So, controlling for ideology says "keeping ideology (and other factors) fixed, how is income associated with liking Bush?" But if we want to see the total effect of income on liking Bush, we might not want to control for ideology. We won't have time to get into the details here, but this issue, sometimes called post-treatment bias, is related to <a href="https://catalogofbias.org/biases/collider-bias/#:~:text=The%20collider%20bias%20occurs%20when,effect%20of%20obesity%20on%20mortality.">collider bias</a>.
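The difference between a "total" and a "holding ideology fixed" relationship can be seen in a simulation. The sketch below is synthetic and rests on a strong assumption made purely for illustration: income shifts ideology, and ideology in turn shifts the thermometer score. The real survey data need not work this way.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
n = 5000

# Synthetic data (assumed causal story): income -> ideology -> therm,
# plus a smaller direct path from income to therm.
income = rng.normal(0, 1, size=n)
ideology = 0.5 * income + rng.normal(0, 1, size=n)
therm = 50 + 1.0 * income + 10 * ideology + rng.normal(0, 10, size=n)
df = pd.DataFrame({"income": income, "ideology": ideology, "therm": therm})

total = smf.ols("therm ~ income", data=df).fit().params["income"]
direct = smf.ols("therm ~ income + ideology", data=df).fit().params["income"]

# By construction the total relationship is 1.0 + 0.5 * 10 = 6.0,
# while the part that does not run through ideology is 1.0.
print(f"income coefficient without ideology: {total:.2f}")
print(f"income coefficient with ideology:    {direct:.2f}")
```

Under this assumed story, controlling for ideology removes the part of income's relationship that runs through ideology, so the smaller coefficient answers a different question than the larger one.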