In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from datascience import Table
from datascience.predicates import are
df = pd.read_stata('rd_analysis.dta').dropna(subset=['turnout_party_share'])
extremist_df = df[[
    'treat',
    'turnout_party_share',
    'low_info_turnout_party',
    'high_info_turnout_party',
    'low_info_turnout_opp_party',
    'high_info_turnout_opp_party',
]].dropna()
extremist = Table.from_df(extremist_df)

# statsmodels OLS introduction

In this problem, we introduce statsmodels, a Python module that provides functions for running linear regression on our data. Regression is useful because it allows us to predict unknown quantities from existing data. In this example, we know all the quantities in our data, but this may not always be the case! Let’s first practice using statsmodels on a toy data set. In the cell below, we create a table with two columns: the first column contains the integers 1 through 20, and the second column is the first column multiplied by 2. Run the cell below to create and view this table, called `toy_data`.

In [10]:
x = np.arange(1, 21)
toy_data = Table.from_df(pd.DataFrame(data={'x': x, 'y': x * 2}))
toy_data

x,y
1,2
2,4
3,6
4,8
5,10
6,12
7,14
8,16
9,18
10,20


Now we will use statsmodels to find a linear model that predicts the 'y' column (the dependent variable) from the 'x' column (the independent variable) of the toy data set. In the first cell of this notebook, we imported `statsmodels.formula.api` as `smf`. We will use the Ordinary Least Squares (`ols`) function provided by `smf` to define a linear regression model of our toy data set.

`smf.ols` takes in two parameters:
  -  a formula string of the form 'dependent variable ~ independent variable'
  -  the data set

Once fitted, the model reports the coefficient by which to multiply the independent variable to estimate the corresponding dependent value, an intercept, an R-squared value that tells us what fraction of the variation in the dependent variable the model explains, and other useful information about the model.

Run the cell below to fit the model.



In [11]:
fit_model = smf.ols('y ~ x', toy_data).fit()

Now that the model is fitted, we can run `.summary()` to view the OLS Regression Results.

In [12]:
fit_model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1.0449999999999998e+32
Date:,"Tue, 17 Nov 2020",Prob (F-statistic):,2.48e-278
Time:,11:30:45,Log-Likelihood:,631.08
No. Observations:,20,AIC:,-1258.0
Df Residuals:,18,BIC:,-1256.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.994e-15,2.34e-15,3.410,0.003,3.07e-15,1.29e-14
x,2.0000,1.96e-16,1.02e+16,0.000,2.000,2.000

0,1,2,3
Omnibus:,3.11,Durbin-Watson:,0.093
Prob(Omnibus):,0.211,Jarque-Bera (JB):,1.307
Skew:,0.156,Prob(JB):,0.52
Kurtosis:,1.787,Cond. No.,25.0


Running `.summary()` yields a lot of information that may seem confusing. Let’s extract some of the useful information from the OLS regression results. 

First, let’s look at the coefficient on `x`. This is the number by which the model multiplies the independent variable to estimate the dependent variable. We expect this coefficient to be ~2 since the 'y' column is simply the 'x' column multiplied by 2. Run the cell below to see if the model’s parameter for 'x' matches our expectation.

In [13]:
x_param = fit_model.params.x
x_param

1.9999999999999993
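Most of the numbers in the summary table can also be read directly from the fitted result, not just `.params`. The sketch below is illustrative only. It fits a model to a small synthetic data set with added noise so that the standard errors and confidence interval are not degenerate; `noisy_df`, `noisy_model`, and the random seed are assumptions made for this example and are not part of the problem set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration): y is roughly 2*x plus noise.
rng = np.random.default_rng(0)
x = np.arange(1, 21)
noisy_df = pd.DataFrame({'x': x, 'y': 2 * x + rng.normal(0, 1, size=x.size)})

noisy_model = smf.ols('y ~ x', data=noisy_df).fit()

slope = noisy_model.params['x']              # coefficient on x
intercept = noisy_model.params['Intercept']  # constant term
slope_se = noisy_model.bse['x']              # standard error of the slope
slope_p = noisy_model.pvalues['x']           # p-value for the slope
ci_low, ci_high = noisy_model.conf_int().loc['x']  # 95% confidence interval
n_obs = int(noisy_model.nobs)                # number of observations

print(slope, intercept, slope_se, slope_p, (ci_low, ci_high), n_obs)
```

Indexing `.params` by name (`params['x']`) behaves the same as the attribute form `params.x` used above, and it also works for column names that are not valid Python identifiers.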

Now we will check the model's accuracy in predicting the 'y' value by adding a new column called 'predicted' to the `toy_data` table. This column will contain the model's predicted 'y', which we find by multiplying the 'x' column by the `x_param` parameter given by the model. (The fitted intercept, about 8e-15, is effectively zero, so we can leave it out here.)

In [14]:
toy_data.with_column('predicted', toy_data.column('x') * x_param)

x,y,predicted
1,2,2
2,4,4
3,6,6
4,8,8
5,10,10
6,12,12
7,14,14
8,16,16
9,18,18
10,20,20
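Multiplying by the slope alone works here only because the intercept is essentially zero. In general a prediction is intercept plus slope times x, and a fitted statsmodels result can produce it directly. The following is a minimal sketch that rebuilds the toy data as a pandas DataFrame; the names `toy_df`, `model`, and `new_points` are assumptions for this example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Rebuild the toy data as a plain DataFrame (names are illustrative).
x = np.arange(1, 21)
toy_df = pd.DataFrame({'x': x, 'y': x * 2})
model = smf.ols('y ~ x', data=toy_df).fit()

# Prediction by hand: intercept + slope * x.
manual = model.params['Intercept'] + model.params['x'] * toy_df['x']

# The same in-sample predictions, as stored on the fitted result.
fitted = model.fittedvalues

# Predictions for x values that were not in the original data.
new_points = pd.DataFrame({'x': [25, 30]})
predicted_new = model.predict(new_points)

print(np.allclose(manual, fitted))
print(predicted_new.values)
```

Passing a DataFrame with the same column names as the formula lets `.predict` apply the formula to new rows.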


The model perfectly predicts the 'y' value! We therefore expect an R-squared value of 1, which represents a perfect fit to the data. Run the cell below to view the R-squared value our model calculated.

In [15]:
fit_model.rsquared

1.0
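For intuition, R-squared can be computed by hand as $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared differences between the observed and predicted values and $SS_{tot}$ is the sum of squared differences between the observed values and their mean. The sketch below uses only NumPy; the helper `r_squared` and the noisy data are assumptions introduced for illustration.

```python
import numpy as np

def r_squared(y, y_hat):
    """Return 1 - SS_res / SS_tot for observed y and predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

x = np.arange(1, 21)

# Perfect fit: predictions equal the observations, so SS_res is 0.
perfect_r2 = r_squared(2 * x, 2 * x)

# Noisy data (synthetic): fit a least-squares line, then score it.
rng = np.random.default_rng(1)
noisy_y = 2 * x + rng.normal(0, 5, size=x.size)
slope, intercept = np.polyfit(x, noisy_y, 1)
noisy_r2 = r_squared(noisy_y, intercept + slope * x)

print(perfect_r2, noisy_r2)
```

With noise added, the line no longer passes through every point, so R-squared drops below 1.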

In the real world, it is unlikely that a model will fit the data this perfectly, but this was a good way to be introduced to the statsmodels library. You will now use statsmodels to find a linear model on some real-world data regarding extremist candidates.

# Problem 3: OLS with extremist data

We will be working with the `extremist` table below. The `extremist` table has 6 columns: 
* `treat`: whether the extremist candidate won in the primary
* `turnout_party_share`: share of the turnout for that party that voted for the extremist candidate
* `low_info_turnout_party`: low-information voter turnout for that party
* `high_info_turnout_party`: high-information voter turnout for that party
* `low_info_turnout_opp_party`: low-information voter turnout for the opposing party
* `high_info_turnout_opp_party`: high-information voter turnout for the opposing party

Run the cell below to view the table.

In [45]:
extremist

treat,turnout_party_share,low_info_turnout_party,high_info_turnout_party,low_info_turnout_opp_party,high_info_turnout_opp_party
0,0.529412,0.333333,0.866667,0.666667,0.842105
0,0.352941,0.666667,0.842105,0.333333,0.866667
0,0.585366,0.722222,0.767442,0.3,0.8125
1,0.741935,0.75,0.875,0.285714,1.0
0,0.792453,0.619048,0.896552,0.142857,1.0
1,0.657534,0.6,0.846154,0.235294,0.714286
0,0.205479,0.235294,0.714286,0.6,0.846154
1,0.630769,0.65,0.757576,0.545455,0.666667
0,0.632353,0.5,0.815789,0.384615,0.722222
1,0.581395,0.5,0.714286,0.444444,0.5


**Part 1: exploratory data analysis**

First, let's do some exploratory data analysis to understand the data. Run a difference-of-means test to determine whether there is a meaningful difference in the party's turnout share when the extremist candidate wins versus when they don’t.

In the cell below, use the `extremist` table to assign to `tps_won` the `turnout_party_share` values for all candidates who won (`treat` value is 1) and to `tps_lost` the `turnout_party_share` values for all candidates who lost (`treat` value is 0).

    (*hint: use `.where` and `.column`*)


In [27]:
tps_won = ...
tps_lost = ...

Use the `stats.ttest_ind` function introduced in previous problem sets to find out if there is a difference in means between `tps_won` and `tps_lost`. 

    Reminder: `stats.ttest_ind` takes in two arrays and outputs the test statistic and the p-value, in that order.
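As a refresher on the call itself, here is a small sketch of `stats.ttest_ind` on two synthetic arrays. The arrays `group_a` and `group_b`, their means, and the seed are assumptions chosen for illustration and are unrelated to the `extremist` data.

```python
import numpy as np
from scipy import stats

# Two synthetic samples whose true means differ by 0.2 (illustrative only).
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.60, scale=0.10, size=50)
group_b = rng.normal(loc=0.40, scale=0.10, size=50)

# Default: Student's t-test, which assumes equal variances in both groups.
result = stats.ttest_ind(group_a, group_b)
t_stat, p_value = result  # the result unpacks as (statistic, pvalue)

# Welch's t-test drops the equal-variance assumption.
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_stat, p_value)
print(welch.statistic, welch.pvalue)
```

A positive statistic means the first array has the larger sample mean; a small p-value means a difference this large would be unlikely if the two true means were equal.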

In [32]:
stats.ttest_ind(..., ...)

TypeError: unsupported operand type(s) for /: 'ellipsis' and 'int'

Was the difference between the `turnout_party_share` of candidates who won and candidates who did not win statistically significant? Was this what you were expecting? Explain why or why not. Write your answer in the cell below.

*Write Answer Here*

It would also be interesting to explore whether there is a difference in means between high-information voter turnout and low-information voter turnout. Use the `high_info_turnout_party` and `low_info_turnout_party` columns to run a t-test in the cell below.

    (*hint: use `.column`*)

In [19]:
stats.ttest_ind(...,...)

Ttest_indResult(statistic=21.44504084901919, pvalue=2.5596814734025318e-79)

It would be interesting to compare these results to the high/low information voter turnout of the opposing party. Do this by running a t-test for the difference in means between the high-information voter turnout and the low-information voter turnout of the opposing party. Use the `high_info_turnout_opp_party` and `low_info_turnout_opp_party` columns to run a t-test in the cell below.

In [22]:
stats.ttest_ind(...,...)

Ttest_indResult(statistic=21.65278779877982, pvalue=1.6689281582221342e-80)

Compare the results found by running the t-tests between the two parties’ high/low information voter turnout. Did you find the differences between high and low information voter turnout to be statistically significant? Write your answer in the cell below.

*Write answer here*

**Part 2: OLS**

Now that you have done some exploratory data analysis with the `extremist` data, you should have a better understanding of when it is a good time to use OLS to run a linear regression and when a linear regression model may not be useful.

From your findings of the t-tests, is it a good idea to run a linear regression model to predict `turnout_party_share` from `treat` (whether the candidate won or lost)? Do you expect the R-squared value to be low or high? Write your answer in the cell below.

*Write answer here*

Use `smf.ols` as introduced earlier to run a linear regression model to predict `turnout_party_share` from `treat` to test whether your intuition was correct.

All you have to do is correctly input the parameters for `smf.ols` in place of the ellipsis in the cell below. With the correct parameters, running the cell below should output the OLS regression results with `turnout_party_share` as the Dep. Variable.
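It can help to know how a regression on a 0/1 variable relates to the t-test from Part 1. With a single binary predictor, the OLS slope equals the difference between the two group means, and its t statistic and p-value match the equal-variance two-sample t-test. The sketch below demonstrates this on synthetic data; the DataFrame `demo_df`, its columns `flag` and `outcome`, and the seed are hypothetical names for this example, not columns from the `extremist` table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Synthetic data (illustrative): 40 rows with flag = 0 and 40 with flag = 1.
rng = np.random.default_rng(7)
flag = np.repeat([0, 1], 40)
outcome = 0.50 + 0.05 * flag + rng.normal(0, 0.15, size=flag.size)
demo_df = pd.DataFrame({'flag': flag, 'outcome': outcome})

demo_model = smf.ols('outcome ~ flag', data=demo_df).fit()

# Split the outcome by group and compare means directly.
ones = demo_df.loc[demo_df['flag'] == 1, 'outcome']
zeros = demo_df.loc[demo_df['flag'] == 0, 'outcome']
mean_diff = ones.mean() - zeros.mean()
t_result = stats.ttest_ind(ones, zeros)  # equal-variance t-test

print(demo_model.params['flag'], mean_diff)          # slope vs. mean difference
print(demo_model.pvalues['flag'], t_result.pvalue)   # regression vs. t-test p-value
print(demo_model.rsquared)
```

In this setting the intercept is the mean of the `flag == 0` group, and R-squared reports how much of the outcome's variation the group split accounts for.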

In [42]:
tps_model = smf.ols(...).fit()
tps_model.summary()

TypeError: from_formula() missing 1 required positional argument: 'data'

Run a linear regression model using `smf.ols` to predict `high_info_turnout_party` from `low_info_turnout_party`. Assign the model to the variable `info_model`. Use `.summary()` to output the OLS regression results.

In [40]:
info_model = ...

0,1,2,3
Dep. Variable:,high_info_turnout_party,R-squared:,0.046
Model:,OLS,Adj. R-squared:,0.043
Method:,Least Squares,F-statistic:,17.29
Date:,"Tue, 17 Nov 2020",Prob (F-statistic):,4.01e-05
Time:,15:56:54,Log-Likelihood:,251.09
No. Observations:,362,AIC:,-498.2
Df Residuals:,360,BIC:,-490.4
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.7490,0.023,32.744,0.000,0.704,0.794
low_info_turnout_party,0.1524,0.037,4.158,0.000,0.080,0.224

0,1,2,3
Omnibus:,223.433,Durbin-Watson:,1.925
Prob(Omnibus):,0.0,Jarque-Bera (JB):,2538.413
Skew:,-2.407,Prob(JB):,0.0
Kurtosis:,15.046,Cond. No.,7.86


Now run a linear regression model to predict `high_info_turnout_opp_party` from `low_info_turnout_opp_party`. Assign the model to the variable `opp_model`. Use `.summary()` to output the OLS regression results.

In [41]:
opp_model = ...

0,1,2,3
Dep. Variable:,high_info_turnout_opp_party,R-squared:,0.041
Model:,OLS,Adj. R-squared:,0.039
Method:,Least Squares,F-statistic:,15.53
Date:,"Tue, 17 Nov 2020",Prob (F-statistic):,9.74e-05
Time:,15:56:55,Log-Likelihood:,229.97
No. Observations:,362,AIC:,-455.9
Df Residuals:,360,BIC:,-448.2
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.7571,0.020,37.840,0.000,0.718,0.796
low_info_turnout_opp_party,0.1329,0.034,3.941,0.000,0.067,0.199

0,1,2,3
Omnibus:,191.499,Durbin-Watson:,1.976
Prob(Omnibus):,0.0,Jarque-Bera (JB):,1662.672
Skew:,-2.055,Prob(JB):,0.0
Kurtosis:,12.661,Cond. No.,6.6


What information can you deduce from the R-squared values of the `opp_model` and the `info_model`? Were the R-squared values similar between the two models? What other interesting features did you see in the OLS regression results? Do you think it is a good idea to run a linear regression between high-information and low-information voter turnout? Write your answer in the cell below.

*Write answer here*