# PS 88 Week 10 Lecture Notebook

Loading Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from IPython.display import display, Markdown

Let's look at the economic performance and election result data we studied in week 2.

First we load the election data, which is stored in .dta (Stata) format, and then subset to election years after 1936.

In [None]:
elec = pd.read_stata("presvote.dta")

The full data contains all years since 1789, but we are only interested in election years with the relevant economic data.

In [None]:
elec = elec[elec['incvote']>0]
elec = elec[elec['year'] > 1936]
elec

The NaN for 1964 in `RDIg_term` creates some problems later, so let's fill it in with a guess based on the `RDIy` column.

In [None]:
# Guess at growth
add = 100*(16748-14350)/14350
elec.loc[elec['year']==1964, 'RDIg_term'] = add
elec
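The guess above is an ordinary percent-change calculation. As a minimal sketch, here is the same arithmetic wrapped in a small helper; the names `pct_growth` and `guess_1964` are hypothetical, and the two numbers are simply the ones typed into the cell above (presumably taken from the `RDIy` column).

```python
def pct_growth(start, end):
    """Percent change going from start to end."""
    return 100 * (end - start) / start

# Same two figures as in the cell above
guess_1964 = pct_growth(14350, 16748)
print(round(guess_1964, 2))  # 16.71
```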

We can make a scatterplot with the `scatterplot` function from seaborn (loaded here as `sns`). The `x` argument names the variable to use for the x axis, the `y` argument names the variable for the y axis, and the `data` argument is the data frame containing these variables.

In [None]:
sns.scatterplot(x='RDIyrgrowth', y='incvote', data=elec)

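Vertical and horizontal lines at the means can be added to a plot with the `axvline` and `axhline` functions. Next, we fit a bivariate regression of `incvote` on `RDIyrgrowth` and look at the summary.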

In [None]:
m1 = smf.ols('incvote~RDIyrgrowth', data=elec).fit()
m1.summary()
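To see where the slope and intercept in a bivariate regression come from, here is a rough sketch that computes them by hand with numpy. The numbers are made up for illustration and are not the election data; `growth` and `vote` are hypothetical names. The slope is the covariance of the two variables divided by the variance of the independent variable, and the line passes through the point of means.

```python
import numpy as np

# Made-up illustrative numbers, NOT the election data
growth = np.array([0.5, 1.0, 2.0, 3.0, 4.0])
vote = np.array([46.0, 49.0, 50.0, 54.0, 56.0])

# Slope = cov(x, y) / var(x); intercept makes the line pass through the means
slope = np.cov(growth, vote, ddof=1)[0, 1] / np.var(growth, ddof=1)
intercept = vote.mean() - slope * growth.mean()

# np.polyfit with degree 1 fits the same least-squares line
slope_check, intercept_check = np.polyfit(growth, vote, 1)
print(slope, intercept)
print(slope_check, intercept_check)
```

Both pairs of numbers should agree up to floating point error, and they play the same role as the `RDIyrgrowth` coefficient and `Intercept` in the `smf.ols` summary.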

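Let's add another variable: how many years the current party has been in office, which is captured by `inc_yrs`. We already ran the bivariate regression above, so now we run a multivariate regression that includes both `RDIyrgrowth` and `inc_yrs`.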

In [None]:
m2 = smf.ols('incvote~RDIyrgrowth + inc_yrs', data=elec).fit()
m2.summary()

## Competing theories

Run a multivariate regression with both `RDIyrgrowth` and `RDIg_term`. 

## Lab Preview

In the lab we will do some work on simulated data about the use of violent tactics in protest movements and whether this affects their success.

Here is a simulation of this process. There are three parameters which we will vary later:
- `b_rep` is the effect of repression on whether the protest movement succeeds. We assume this is negative, meaning success is harder with a more repressive government
- `b_viol` is the effect of violence on movement success. For the first simulation, we set this to zero, meaning there is no real causal effect
- `b_rv` is the effect of repressiveness on the use of violence, which we assume is positive

The following code simulates 1000 protest movements, with three variables we will observe: 
- `rep` is the repressiveness, which we will assume is normally distributed
- `viol` is equal to 1 for a violent movement and 0 for nonviolent. We assume movements are more likely to be violent with a repressive government.
- `succ` is a continuous measure of success, where higher numbers indicate a more successful movement. This is (potentially) a function of repressiveness, the choice of violent tactics, and random noise.

We then put the variables in a pandas data frame.

In [None]:
np.random.seed(89)
b_rep = -1
b_viol = 0
b_rv = 1
# Random repressiveness levels
rep = np.random.normal(0,1,1000)
viol = np.where(b_rv*rep + np.random.normal(0,1,1000) > 0, 1, 0)
succ = b_rep*rep + b_viol*viol + np.random.normal(0,.3,1000)
protest = pd.DataFrame(data={'Repressive': rep, 
                             'Violent': viol, 
                             'Success': succ})
protest

Since violence is binary, we can look at the difference in average success between violent and nonviolent movements. Here is some pandas code to compute the average success of violent movements. 

In general, if we want to pull the values of `Var1` for the subset of rows where `Var2` equals `x` from a data frame `df`, the code is:
`df.loc[df['Var2']==x, 'Var1']`. Think of `.loc` as a combination of what the `.where` and `.column` functions do in the Table library.

In [None]:
suc1 = np.mean(protest.loc[protest['Violent']==1, 'Success'])
suc1
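The general `.loc` pattern described above can be tried on a tiny data frame. This is only an illustrative sketch: the frame `df` and its values are made up, with column names matching the `Var1` and `Var2` placeholders.

```python
import pandas as pd

# Hypothetical toy data frame, not part of the protest simulation
df = pd.DataFrame({'Var1': [10.0, 20.0, 30.0, 40.0],
                   'Var2': ['a', 'b', 'a', 'b']})

# Rows where Var2 equals 'a', keeping only the Var1 column
subset = df.loc[df['Var2'] == 'a', 'Var1']
print(subset.tolist())  # [10.0, 30.0]
print(subset.mean())    # 20.0
```

The first entry inside the brackets selects rows (here with a True/False condition) and the second selects the column.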

Now we can compute the average success of nonviolent movements and the difference of means.

In [None]:
suc0=np.mean(protest.loc[protest['Violent']==0, 'Success'])
suc0

In [None]:
suc1-suc0

In this simulation, violent movements are less successful than nonviolent ones.

While we used regression for continuous variables last week, there is nothing to stop us from using it on binary (0 or 1) variables. Let's run a bivariate regression with `Success` as the dependent variable and `Violent` as the independent variable.

Recall the code to "fit" a model with independent variable IV and dependent variable DV and data frame df is `smf.ols('DV~IV', data=df).fit()`. We will save this fitted model as `succ_ols` and then use the `.summary()` function to get the output.

In [None]:
succ_ols = smf.ols('Success~Violent', data=protest).fit()
succ_ols.summary()

We can also visualize this with a scatterplot, where we use the `sns.regplot` function to draw a best fit line too:

In [None]:
sns.regplot(x='Violent',y='Success', data=protest, ci=0)

The best fit "line" goes from the mean of the nonviolent movements to the mean of the violent movements. Visually:

In [None]:
sns.regplot(x='Violent',y='Success', data=protest)
plt.axhline(np.mean(protest.loc[protest['Violent']==1, 'Success']), color="red")
plt.axhline(np.mean(protest.loc[protest['Violent']==0, 'Success']), color="green")

The constant in the regression is equal to the mean of nonviolent protests, and the coefficient on `Violent` is the difference of means. 
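This fact about regressions on a binary variable can be checked directly. The sketch below uses a separate made-up simulation (the names `group` and `outcome` are hypothetical) and numpy's `np.polyfit` in place of `smf.ols`; both fit the same least-squares line.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: a 0/1 group indicator and a noisy outcome
group = np.repeat([0, 1], 50)  # 50 zeros followed by 50 ones
outcome = 2.0 - 1.5 * group + rng.normal(0, 1, size=100)

mean0 = outcome[group == 0].mean()
mean1 = outcome[group == 1].mean()

# Least-squares line of outcome on the binary indicator
slope, intercept = np.polyfit(group, outcome, 1)

print(intercept, mean0)        # intercept matches the group-0 mean
print(slope, mean1 - mean0)    # slope matches the difference of means
```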

We can also run regressions with dependent variables that take on values 0 or 1. This is called a "linear probability model". In this example, `Violent` is binary, and by the way we generated the data we know that higher values of `Repressive` make it more likely to be 1 versus 0. 

To check this, let's fit and summarize a regression where `Violent` is the dependent variable and `Repressive` is the independent variable.

In [None]:
smf.ols('Violent~Repressive', data=protest).fit().summary()

You should get a positive coefficient on `Repressive`, which confirms that we are more likely to have violence when this variable is high. Or, visually:

In [None]:
sns.regplot(x='Repressive', y='Violent', data=protest, ci=0)

While our dependent variable really only takes on values of 0 and 1, we can interpret the predicted value here as a prediction about the probability of violent protest. One drawback is that the model will sometimes predict negative probabilities or probabilities greater than 1. There are other ways to analyze data like this that don't make predictions outside of 0 and 1, but at the cost of being a bit more complicated to interpret.

The nice thing about this probability interpretation is that the slope then tells us how an increase in our independent variable changes the probability that the dependent variable is a 1. In this case, a 1 unit increase in our `Repressive` measure is associated with an increase of about 0.28 (28 percentage points) in the probability of violent protest.
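Both points, the slope as a change in probability and the possibility of predictions outside 0 and 1, can be seen in a small sketch. This is a fresh simulation with a different random number generator and seed than the notebook's, so the numbers will not match the output above exactly, and `np.polyfit` stands in for `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same data-generating idea as the notebook: violence is more likely
# when repressiveness is high
rep = rng.normal(0, 1, 1000)
viol = (rep + rng.normal(0, 1, 1000) > 0).astype(int)

# Linear probability model: least-squares line of the 0/1 outcome on rep
slope, intercept = np.polyfit(rep, viol, 1)

# Predicted "probabilities" at very low and very high repressiveness
pred_low = intercept + slope * (-3)
pred_high = intercept + slope * 3
print(slope, pred_low, pred_high)
```

With data generated this way, the prediction at `rep = -3` should fall below 0 and the prediction at `rep = 3` should fall above 1, even though neither is a valid probability.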

Now let's run a multivariate regression with `Success` as the dependent variable, and `Violent` and `Repressive` as independent variables.

In [None]:
smf.ols('Success~Violent + Repressive', data=protest).fit().summary()

When we control for how repressive the government is, there is no longer a relationship between violence and protest success. This is because repression is a confounding variable that makes violence more likely and success less likely. So, when we didn't control for repression, it looked like violent movements were less successful.
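The confounding story can be reproduced in a few lines. The sketch below is a separate simulation with the same parameter values (`b_rep = -1`, `b_viol = 0`, `b_rv = 1`) but a different random number generator and seed, and it uses `np.linalg.lstsq` rather than `smf.ols` to fit the two regressions. The names `b_short` and `b_long` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Repression raises the chance of violence and lowers success;
# violence itself has no effect on success (coefficient 0)
rep = rng.normal(0, 1, n)
viol = (rep + rng.normal(0, 1, n) > 0).astype(float)
succ = -1 * rep + 0 * viol + rng.normal(0, 0.3, n)

# "Short" regression: Success on Violent only (plus a constant)
X_short = np.column_stack([np.ones(n), viol])
b_short, *_ = np.linalg.lstsq(X_short, succ, rcond=None)

# "Long" regression: Success on Violent and Repressive (plus a constant)
X_long = np.column_stack([np.ones(n), viol, rep])
b_long, *_ = np.linalg.lstsq(X_long, succ, rcond=None)

print(b_short[1])  # coefficient on viol without the control
print(b_long[1])   # coefficient on viol with the control
```

The coefficient on `viol` should be clearly negative in the short regression and close to zero once `rep` is included, which is the pattern described above.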