# Regression Homework
In this homework, we review how to fit an Ordinary Least Squares (OLS) regression model and how to read what it tells us about the relationship between variables.

As always, we first need to load a few libraries:

In [197]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
from datascience import Table
from question_maker import question_maker 

Next, we need to load in our election data. This table represents presidential election outcomes from 1880 to now. In each row, we have collected information about different features during that year, such as inflation or the presence of a war.

In [198]:
elections = Table.read_table('jason_data/data/fair.csv')
elections

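| YEAR | VOTE | PARTY | PERSON | DURATION | WAR | GROWTH | INFLATION | GOODNEWS |
|---|---|---|---|---|---|---|---|---|
| 1880 | 50.22 | -1 | 0 | 1.75 | 0 | 3.879 | 1.974 | 9 |
| 1884 | 49.846 | -1 | 0 | 2.0 | 0 | 1.589 | 1.055 | 2 |
| 1888 | 50.414 | 1 | 1 | 0.0 | 0 | -5.553 | 0.604 | 3 |
| 1892 | 48.268 | -1 | 1 | 0.0 | 0 | 2.763 | 2.274 | 7 |
| 1896 | 47.76 | 1 | 0 | 0.0 | 0 | -10.024 | 3.41 | 6 |
| 1900 | 53.171 | -1 | 1 | 0.0 | 0 | -1.425 | 2.548 | 7 |
| 1904 | 60.006 | -1 | 0 | 1.0 | 0 | -2.421 | 1.442 | 5 |
| 1908 | 54.483 | -1 | 0 | 1.25 | 0 | -6.281 | 1.879 | 8 |
| 1912 | 54.708 | -1 | 1 | 1.5 | 0 | 4.164 | 2.172 | 8 |
| 1916 | 51.682 | 1 | 1 | 0.0 | 0 | 2.229 | 4.252 | 3 |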


To see the relationship between vote share and another variable, such as economic growth, we call <code>smf.ols('y_variable ~ x_variable', data=data_table).fit()</code>, where the variable to the left of <code>~</code> is the dependent variable. Calling <code>.summary()</code> on the result displays the regression table. Below, we produce the regression results for the relationship between vote share and inflation.

**NOTE:** Most of the results of this table are outside the scope of this course. The important values for you to consider are the R-squared value, and the coefficients and p-values associated with the different independent variables.
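
As a minimal sketch of where those values live on a fitted model, the example below fits the same kind of regression on a small pandas <code>DataFrame</code> and reads the R-squared, coefficient, and p-value from the result's <code>rsquared</code>, <code>params</code>, and <code>pvalues</code> attributes. The <code>sample</code> frame is an assumption made for illustration: it holds only the ten rows displayed above, so its numbers will not match the full 32-election results.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustration only: the ten rows displayed above (1880-1916), not the full table.
sample = pd.DataFrame({
    'VOTE': [50.22, 49.846, 50.414, 48.268, 47.76,
             53.171, 60.006, 54.483, 54.708, 51.682],
    'INFLATION': [1.974, 1.055, 0.604, 2.274, 3.41,
                  2.548, 1.442, 1.879, 2.172, 4.252],
})

model = smf.ols('VOTE ~ INFLATION', data=sample).fit()

r_squared = model.rsquared              # share of variance in VOTE explained
coef = model.params['INFLATION']        # estimated slope on INFLATION
p_value = model.pvalues['INFLATION']    # p-value for the test that the slope is zero

print(f'R-squared: {r_squared:.3f}')
print(f'INFLATION coefficient: {coef:.4f} (p = {p_value:.3f})')
```

Reading the attributes directly is convenient when you want to reuse a number in a later calculation instead of copying it from the printed summary.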

In [199]:
vote_inflation_ols = smf.ols('VOTE ~ INFLATION', data=elections).fit()
vote_inflation_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | VOTE | R-squared: | 0.024 |
| Model: | OLS | Adj. R-squared: | -0.009 |
| Method: | Least Squares | F-statistic: | 0.7374 |
| Date: | Sun, 22 Sep 2019 | Prob (F-statistic): | 0.397 |
| Time: | 21:20:21 | Log-Likelihood: | -102.22 |
| No. Observations: | 32 | AIC: | 208.4 |
| Df Residuals: | 30 | BIC: | 211.4 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 53.4353 | 1.733 | 30.839 | 0.000 | 49.897 | 56.974 |
| INFLATION | -0.4385 | 0.511 | -0.859 | 0.397 | -1.481 | 0.604 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.798 | Durbin-Watson: | 2.575 |
| Prob(Omnibus): | 0.247 | Jarque-Bera (JB): | 1.511 |
| Skew: | -0.412 | Prob(JB): | 0.47 |
| Kurtosis: | 3.674 | Cond. No. | 5.75 |
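
The columns of the coefficient table are related to one another: the <code>t</code> column is the coefficient divided by its standard error. Here is a short sketch of that arithmetic, using the rounded values printed in the table above, so the results can differ from the table in the last digit:

```python
# Rounded values copied from the coefficient table above.
rows = {
    'Intercept': {'coef': 53.4353, 'std_err': 1.733},
    'INFLATION': {'coef': -0.4385, 'std_err': 0.511},
}


def t_statistic(coef, std_err):
    """Return the t statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


t_values = {name: t_statistic(r['coef'], r['std_err']) for name, r in rows.items()}

for name, t in t_values.items():
    print(f'{name}: t = {t:.3f}')
```

The further a t statistic is from zero, the smaller the p-value reported in the <code>P>|t|</code> column.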


In the table produced by <code>.summary()</code>, there is a row labeled "INFLATION". What does it tell us about the coefficient for the linear relationship between inflation and presidential vote share?

*YOUR ANSWER HERE*

Now, let's produce the OLS regression results for voteshare and economic growth (the "GROWTH" column): 

In [200]:
vote_growth_ols = smf.ols('VOTE ~ GROWTH', data=elections).fit()
vote_growth_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | VOTE | R-squared: | 0.356 |
| Model: | OLS | Adj. R-squared: | 0.334 |
| Method: | Least Squares | F-statistic: | 16.55 |
| Date: | Sun, 22 Sep 2019 | Prob (F-statistic): | 0.000316 |
| Time: | 21:20:21 | Log-Likelihood: | -95.584 |
| No. Observations: | 32 | AIC: | 195.2 |
| Df Residuals: | 30 | BIC: | 198.1 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 51.8598 | 0.882 | 58.821 | 0.000 | 50.059 | 53.660 |
| GROWTH | 0.6536 | 0.161 | 4.068 | 0.000 | 0.325 | 0.982 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.495 | Durbin-Watson: | 2.309 |
| Prob(Omnibus): | 0.473 | Jarque-Bera (JB): | 1.128 |
| Skew: | 0.211 | Prob(JB): | 0.569 |
| Kurtosis: | 2.183 | Cond. No. | 5.53 |


Using the "GROWTH" row, what can we determine about the relationship between growth and presidential vote share? Is the relationship statistically significant?

*YOUR ANSWER HERE*

To use multiple independent variables, we extend the formula like so: <code>smf.ols('dependent_variable ~ independent_var_1 + independent_var_2 + ...', data=data_table).fit()</code>. Below, we regress vote share on two independent variables, inflation and economic growth.
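
Because the formula is an ordinary string, it can also be assembled from a list of column names. The helper below is a hypothetical convenience (the name <code>build_formula</code> is not part of statsmodels) that sketches the pattern:

```python
def build_formula(dependent, independents):
    """Join column names into a formula string such as 'y ~ x1 + x2'."""
    return f"{dependent} ~ {' + '.join(independents)}"


formula = build_formula('VOTE', ['INFLATION', 'GROWTH'])
print(formula)
```

The resulting string can be passed as the first argument to <code>smf.ols</code> in place of a hand-typed formula.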

In [201]:
vote_inflation_growth_ols = smf.ols('VOTE ~ INFLATION + GROWTH', data=elections).fit()
vote_inflation_growth_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | VOTE | R-squared: | 0.359 |
| Model: | OLS | Adj. R-squared: | 0.315 |
| Method: | Least Squares | F-statistic: | 8.132 |
| Date: | Sun, 22 Sep 2019 | Prob (F-statistic): | 0.00157 |
| Time: | 21:20:21 | Log-Likelihood: | -95.49 |
| No. Observations: | 32 | AIC: | 197.0 |
| Df Residuals: | 29 | BIC: | 201.4 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 52.3341 | 1.456 | 35.954 | 0.000 | 49.357 | 55.311 |
| INFLATION | -0.1760 | 0.426 | -0.413 | 0.683 | -1.048 | 0.696 |
| GROWTH | 0.6428 | 0.165 | 3.896 | 0.001 | 0.305 | 0.980 |

| | | | |
|---|---|---|---|
| Omnibus: | 1.103 | Durbin-Watson: | 2.329 |
| Prob(Omnibus): | 0.576 | Jarque-Bera (JB): | 0.973 |
| Skew: | 0.213 | Prob(JB): | 0.615 |
| Kurtosis: | 2.259 | Cond. No. | 9.24 |


Compare the coefficients and p-values for the two independent variables with those from the bivariate regressions we ran on each of them individually. How do these values change?

*YOUR ANSWER HERE*

Now, run the multivariate regression of vote share on "GOODNEWS" and "WAR":

In [202]:
vote_goodnews_war_ols = smf.ols('VOTE ~ GOODNEWS + WAR', data=elections).fit()
vote_goodnews_war_ols.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | VOTE | R-squared: | 0.204 |
| Model: | OLS | Adj. R-squared: | 0.149 |
| Method: | Least Squares | F-statistic: | 3.713 |
| Date: | Sun, 22 Sep 2019 | Prob (F-statistic): | 0.0367 |
| Time: | 21:20:21 | Log-Likelihood: | -98.965 |
| No. Observations: | 32 | AIC: | 203.9 |
| Df Residuals: | 29 | BIC: | 208.3 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 47.0359 | 2.781 | 16.913 | 0.000 | 41.348 | 52.724 |
| GOODNEWS | 0.9843 | 0.443 | 2.224 | 0.034 | 0.079 | 1.890 |
| WAR | 0.3851 | 4.265 | 0.090 | 0.929 | -8.338 | 9.108 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.615 | Durbin-Watson: | 2.697 |
| Prob(Omnibus): | 0.735 | Jarque-Bera (JB): | 0.718 |
| Skew: | -0.225 | Prob(JB): | 0.698 |
| Kurtosis: | 2.42 | Cond. No. | 28.9 |
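
The coefficients in the table above define a prediction equation: predicted VOTE = Intercept + (GOODNEWS coefficient × GOODNEWS) + (WAR coefficient × WAR). As a sketch, the helper below plugs values into that equation. The name <code>predict_vote</code> and the input values are hypothetical, chosen only for illustration, and the example assumes WAR is coded 1 in wartime and 0 otherwise.

```python
# Coefficients copied from the GOODNEWS + WAR regression table above.
INTERCEPT = 47.0359
B_GOODNEWS = 0.9843
B_WAR = 0.3851


def predict_vote(goodnews, war):
    """Predicted vote share from the fitted equation (hypothetical helper)."""
    return INTERCEPT + B_GOODNEWS * goodnews + B_WAR * war


# Hypothetical inputs: the same GOODNEWS value in peacetime and in wartime.
peace = predict_vote(goodnews=5, war=0)
wartime = predict_vote(goodnews=5, war=1)

print(f'Peacetime prediction: {peace:.2f}')
print(f'Wartime prediction:   {wartime:.2f}')
```

Holding GOODNEWS fixed, the two predictions differ by exactly the WAR coefficient, which is how a coefficient on a 0/1 variable is read.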


**Coefficient Review:**
Using the coefficients for the intercept, GOODNEWS, and WAR variables, how many months of good economic news are needed for the incumbent to win during peacetime?

*YOUR ANSWER HERE*

Is GOODNEWS statistically significant at the .05 level? What about at .01? What does this imply about positive economic news and incumbent voteshare?
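
As a reminder of what "significant at a level" means, a coefficient is called significant at level α when its p-value is below α. The sketch below applies that rule to a few hypothetical p-values (they are not taken from the tables in this homework), and the function name <code>is_significant</code> is likewise made up for illustration:

```python
def is_significant(p_value, alpha=0.05):
    """Return True when the p-value falls below the chosen significance level."""
    return p_value < alpha


# Hypothetical p-values, for illustration only.
hypothetical_p_values = [0.20, 0.03, 0.004]

for p in hypothetical_p_values:
    print(f'p = {p}: significant at .05? {is_significant(p, 0.05)}, '
          f'at .01? {is_significant(p, 0.01)}')
```

A result can therefore be significant at one level and not at a stricter one.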

*YOUR ANSWER HERE*

Let's practice generating confidence intervals. As we have seen in past lectures, the 95% confidence interval is calculated as $\hat{\beta} \pm 1.96 \times se(\hat{\beta})$. From the table above, what is the standard error for the GOODNEWS variable?

In [203]:
goodnews_se = 0.443

Using the standard error, calculate the 95% confidence interval. In the cell below, fill out the values for the lower and upper bound of the interval. Does it match what the <code>.summary()</code> function returns?

In [204]:
goodnews_lower = 0.9843 - 1.96 * goodnews_se
goodnews_upper = 0.9843 + 1.96 * goodnews_se
goodnews_lower, goodnews_upper

(0.11602000000000001, 1.85258)
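
One reason a hand-computed interval can differ slightly from the one <code>.summary()</code> prints is that 1.96 is the critical value of the normal distribution, whereas the statsmodels OLS summary uses the t distribution with the residual degrees of freedom (29 here). The sketch below repeats the calculation with 2.045, the 97.5th percentile of the t distribution with 29 degrees of freedom. That constant is taken from a standard t table and hard-coded as an assumption rather than computed.

```python
coef = 0.9843        # GOODNEWS coefficient from the table above
std_err = 0.443      # GOODNEWS standard error from the table above

Z_CRIT = 1.96        # normal critical value used in the formula above
T_CRIT_29 = 2.045    # assumed: t critical value for 29 degrees of freedom (from a t table)

z_interval = (coef - Z_CRIT * std_err, coef + Z_CRIT * std_err)
t_interval = (coef - T_CRIT_29 * std_err, coef + T_CRIT_29 * std_err)

print('normal-based interval:', tuple(round(v, 3) for v in z_interval))
print('t-based interval:     ', tuple(round(v, 3) for v in t_interval))
```

The t-based interval is a little wider, and the gap shrinks as the number of observations grows.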

## OLS Review: Population and Sample Models
In the following questions, the models in focus are bivariate, using the population model $Y_i = \alpha + \beta X_i + u_i$ and the sample model $Y_i = \hat{\alpha} + \hat{\beta} X_i + \hat{u}_i$.
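
To make the distinction between the two models concrete, the simulation below draws data from a population model with known parameters and then estimates the sample model from it. All numbers (<code>alpha = 2.0</code>, <code>beta = 0.5</code>, the sample size, the noise scale) are hypothetical choices for illustration, and the estimates use the standard closed-form bivariate OLS formulas rather than statsmodels.

```python
import random

random.seed(0)

# Hypothetical population parameters (unknown in real data).
alpha, beta = 2.0, 0.5
n = 200

# Population model: Y_i = alpha + beta * X_i + u_i
x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]           # stochastic component
y = [alpha + beta * xi + ui for xi, ui in zip(x, u)]

# Sample model: estimate alpha-hat and beta-hat by least squares.
x_bar = sum(x) / n
y_bar = sum(y) / n
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
alpha_hat = y_bar - beta_hat * x_bar

# Residuals u-hat are the sample counterparts of the unobserved errors u.
u_hat = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]

print(f'alpha_hat = {alpha_hat:.3f}, beta_hat = {beta_hat:.3f}')
print(f'sum of residuals = {sum(u_hat):.2e}')
```

The estimates land near, but not exactly on, the population values, and the residuals sum to zero by construction, which the true errors <code>u</code> need not do.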

Which of the following statements are accurate about the population regression model?

In [205]:
question_maker('$u_i$ is the stochastic component of $Y_i$.')
question_maker(r'$\hat{α}+\hat{β}X_i$ is the systematic component of $Y_i$')
question_maker('Both (a) and (b) are correct')
question_maker('Neither (a) nor (b) are correct')

Which of the following statements are accurate about the population regression model?

In [206]:
question_maker(r'$\hat{u}_i$ is an estimate of $u_i$')
question_maker('$X_i$ is assumed to be measured without error')
question_maker('Both (a) and (b) are correct')
question_maker('Neither (a) nor (b) are correct')

Which of the following statements are accurate?

In [207]:
question_maker('By specifying a bivariate regression model we are assuming that the impact of a one unit increase in $X_i$ will always be β.')
question_maker('By specifying a bivariate regression model we are assuming that there are no other variables that cause $Y_i$.')
question_maker('Both (a) and (b) are correct')
question_maker('Neither (a) nor (b) are correct')
