# PS 88 Week 9 Lecture Notebook

Loading Libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from IPython.display import display, Markdown

Let's look at the economic performance and election result data we studied in week 2.

First we load the election data, which is stored in .dta (Stata) format, and then subset it to election years after 1936.

Note we are using pandas syntax, which we will learn more about in the lab.

In [None]:
elec = pd.read_stata("presvote.dta")

The full data contains all years since 1789, but we are only interested in election years with the relevant economic data. 

In [None]:
elec = elec[elec['incvote']>0]
elec = elec[elec['year'] > 1936]
elec
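The two filters above can also be written as a single boolean mask. Here is a minimal sketch on a small made-up data frame (the values are hypothetical and are not taken from `presvote.dta`; only the column names mirror the real data):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    'year': [1932, 1936, 1940, 1942, 1944],
    'incvote': [0.41, 0.62, 0.55, 0.0, 0.54],
})

# Keep rows that satisfy both conditions. Each condition needs its own
# parentheses because & binds more tightly than > in Python.
subset = toy[(toy['incvote'] > 0) & (toy['year'] > 1936)]
print(subset['year'].tolist())
```

Doing the filtering in two steps, as in the cell above, gives the same rows; the one-line form is just more compact.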

We can make a scatterplot with the `scatterplot` function from seaborn (loaded here as `sns`). The `x` argument names the variable to use for the x axis, the `y` argument names the variable for the y axis, and the `data` argument is the data frame containing these variables.

In [None]:
sns.scatterplot(x='RDIyrgrowth', y='incvote', data=elec)

We can add vertical and horizontal lines at the means using the `axvline` and `axhline` functions.

In [None]:
sns.scatterplot(x='RDIyrgrowth', y='incvote', data=elec)
plt.axvline(np.mean(elec['RDIyrgrowth']))
plt.axhline(np.mean(elec['incvote']))

In [None]:
np.corrcoef(elec['RDIyrgrowth'], elec['incvote'])
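Note that `np.corrcoef` returns a 2×2 correlation matrix rather than a single number: the diagonal entries are 1 (each variable with itself), and the off-diagonal entries are the correlation between the two variables. A small sketch with made-up numbers (not the election data) shows how to pull out the single coefficient and check it against the definition:

```python
import numpy as np

# Hypothetical values, for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

corr_matrix = np.corrcoef(x, y)
r = corr_matrix[0, 1]  # off-diagonal entry: correlation of x and y

# Same quantity from the definition: the sum of products of deviations,
# divided by the square root of the product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r, r_manual)
```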

Bivariate regression is used to determine how changes in one variable (the independent variable, often denoted $X$) can predict changes in another (the dependent variable, often denoted $Y$). Bivariate regression relies on a linear model, which follows the form $Y_i = a + b X_i$, where $a$ is the y-intercept and $b$ is the slope.

If we assume that the relationship between our variables is not perfect (or, in the real world, if there is some inaccuracy in our measurement), we add an error term $e_i$: $Y_i = a + b X_i + e_i$.
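Rearranging the model, the error for each observation is the gap between the observed value and the value the line predicts. Squaring these gaps and adding them up gives a single number that measures how far a candidate line is from the data:

$$e_i = Y_i - (a + b X_i), \qquad \sum_i e_i^2 = \sum_i \left(Y_i - a - b X_i\right)^2$$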

Here is a function that draws such a line through the data and then computes the *sum of squared residuals* $\sum e_i^2$. A good line will make this small.

In [None]:
def draw_line(slope, intercept):
    #The Linear Model
    def f(x):
        return intercept + slope*x
    x = np.arange(0,7)
    y_pred = f(x)
    display(Markdown(rf'$\hat y$= {slope}$X$ + {intercept}:'))
    #The line
    plt.plot(x,y_pred)
    #The Data
    sns.scatterplot(x='RDIyrgrowth', y='incvote', data=elec)

    #Print the loss
    print("Square Residual Sum:", sum([(y-f(x))**2 for x,y in zip(elec.RDIyrgrowth, elec.incvote)]))
 

In [None]:
draw_line(0, .5)

In [None]:
draw_line(.04, .4)

In [None]:
draw_line(-.05, .6)
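Trying lines by hand suggests searching for the slope and intercept that give the smallest sum of squared residuals. Here is a minimal sketch of that comparison on made-up data (the numbers are hypothetical, and `ssr` is a helper name introduced here), using `np.polyfit` to get the least-squares line:

```python
import numpy as np

# Hypothetical growth rates and vote shares, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.44, 0.49, 0.50, 0.55, 0.57])

def ssr(slope, intercept):
    """Sum of squared residuals for the line intercept + slope * x."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals**2))

# A degree-1 polyfit returns the least-squares slope first, then the intercept.
best_slope, best_intercept = np.polyfit(x, y, 1)

# Compare a few hand-picked lines with the least-squares line.
for slope, intercept in [(0, 0.5), (0.04, 0.4), (-0.05, 0.6)]:
    print(slope, intercept, ssr(slope, intercept))
print("least squares:", best_slope, best_intercept, ssr(best_slope, best_intercept))
```

No hand-picked line can have a smaller sum than the least-squares line, which is what "best fit" means here.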


An easy way to add the best regression line is to use the `regplot` function in seaborn. The `ci=0` option tells it to not plot a confidence interval, which we aren't discussing yet.

In [None]:
sns.regplot(x='RDIyrgrowth', y='incvote', data=elec, ci=0)

For the next plot we are going to want to loop through the elections, which will be a bit more straightforward after the following step (don't worry about the details here).

In [None]:
elec = elec.reset_index(drop=True)

To illustrate the "total sum of squares", we can make the scatterplot, and then loop through each election and draw a line between the realized incumbent vote share and the average.

In [None]:
sns.scatterplot(x='RDIyrgrowth', y='incvote', data=elec)
ybar = np.mean(elec['incvote'])
plt.axhline(ybar)
for el in range(0,elec.shape[0]):
    plt.vlines(elec.RDIyrgrowth[el],ybar, elec.incvote[el])

To do the same on the best fit line, we need the regression output:

In [None]:
m1 = smf.ols('incvote~RDIyrgrowth', data=elec).fit()
m1.summary()

If you want to do it in one line:

In [None]:
smf.ols('incvote~RDIyrgrowth', data=elec).fit().summary()

To retrieve the estimated parameters we can use the `.params` attribute of the fitted model.

In [None]:
# Look up the coefficients by name rather than by position
b0 = m1.params['Intercept']
b1 = m1.params['RDIyrgrowth']

Now we can draw lines between the points and the regression line.

In [None]:
sns.regplot(x='RDIyrgrowth', y='incvote', data=elec, ci=0)
for el in range(0,elec.shape[0]):
    plt.vlines(elec.RDIyrgrowth[el],b0 + b1*elec.RDIyrgrowth[el], elec.incvote[el])
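The two pictures correspond to two sums: the total sum of squares (vertical distances to the mean) and the sum of squared residuals (vertical distances to the regression line). One minus their ratio is the $R^2$ reported in the regression output. A sketch on made-up numbers, with the line fit by `np.polyfit` rather than `smf.ols` so that the example stands alone:

```python
import numpy as np

# Hypothetical growth rates and vote shares, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.44, 0.49, 0.50, 0.55, 0.57])

# Least-squares slope (b1) and intercept (b0).
b1, b0 = np.polyfit(x, y, 1)

# Squared distances from each point to the mean of y.
tss = np.sum((y - y.mean())**2)
# Squared distances from each point to the fitted line.
ssr = np.sum((y - (b0 + b1 * x))**2)

r_squared = 1 - ssr / tss
print(tss, ssr, r_squared)
```

In a bivariate regression this value equals the square of the correlation coefficient between the two variables.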

Unfortunately, seaborn does not have a good function to label points, but we can use the `scatter` function in the plotly.express library for this.

In [None]:
import plotly.express as px

In [None]:
fig=px.scatter(elec, x='RDIyrgrowth',y='incvote', text='initials2', trendline='ols')
fig.update_traces(textposition='top center')
fig.show()