In [217]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
from datascience import Table
from datascience.predicates import are
df = pd.read_stata('rd_analysis.dta').dropna(subset=['turnout_party_share'])

In [218]:
extremist_df = df[['rv', 'treat', 'turnout_party_share', 'low_info_turnout_party', 'high_info_turnout_party','low_info_turnout_opp_party','high_info_turnout_opp_party'  ]].dropna()
extremist=Table.from_df(extremist_df)

**RUNNING T-TESTS**

Use the `stats.ttest_ind` function to do a difference of means test for whether there is a meaningful difference in the turn out of the party when the candidate wins versus when they dont. 

In [219]:
low_won_opp=extremist.where('treat', 1).column('low_info_turnout_opp_party')
low_lose_opp=extremist.where('treat', 0).column('low_info_turnout_opp_party')
stats.ttest_ind(low_lose_opp, low_won_opp)

Ttest_indResult(statistic=-1.0256671245668916, pvalue=0.30573719284698475)

In [220]:
high_won_opp=extremist.where('treat', 1).column('high_info_turnout_opp_party')
high_lose_opp=extremist.where('treat', 0).column('high_info_turnout_opp_party')
stats.ttest_ind(high_won_opp, high_lose_opp)

Ttest_indResult(statistic=1.6139387807001697, pvalue=0.10741684526167859)

In [221]:
high_won=extremist.where('treat', 1).column('high_info_turnout_party')
high_lose=extremist.where('treat', 0).column('high_info_turnout_party')
stats.ttest_ind(high_won, high_lose)

Ttest_indResult(statistic=-0.16101698170341555, pvalue=0.8721703451934503)

In [222]:
low_won=extremist.where('treat', 1).column('low_info_turnout_party')
low_lose=extremist.where('treat', 0).column('low_info_turnout_party')
stats.ttest_ind(lowwon, lowlose)

Ttest_indResult(statistic=-0.346446096980798, pvalue=0.7292098875508181)

In [223]:
won=extremist.where('treat', 1).column('turnout_party_share')
lost=extremist.where('treat', 0).column('turnout_party_share')

In [224]:
t_test = stats.ttest_ind(lost, won)
t_test

Ttest_indResult(statistic=1.5450154343694895, pvalue=0.12322110705731085)

**OLS EXAMPLE**

In this problem, we will be introducing statsmodels, a Python Module that will provide us with helpful functions that we will use to run linear regression on our data. Regression is useful because it allows us to predict unknown quantities from existing data. In this example, we know all the quantities of our data but this may not always be the case! Let’s first practice using statsmodels on a toy data set. In the cell below, we will create a table with two columns. The first column will be numbers 1-10 and the second column will be the first column multiplied by 2. Run the cell below to create and view this table called `toy_data`.

In [225]:
x=np.arange(1, 11)
toy_data=Table.from_df(pd.DataFrame(data= {'x': x, 'y': x*2} ))  
toy_data

x,y
1,2
2,4
3,6
4,8
5,10
6,12
7,14
8,16
9,18
10,20


Now we will use statsmodels to find a linear model that will predict the 'y' column (the dependant variable) from the 'x' column (the independent variable) of the toy data set. In the first cell of this notebook, we imported the `statsmodels.formula.api` as `smf`. We will use Ordinarily Least Squares (ols) function provided by `smf` to define a linear regression model of our toy data set. 

`smf.ols` takes in two parameters:
  -  'dependant variable ~ independant variable'
  -  data set

The function will find the coefficient to multiply the independent value by to find an estimate of the corresponding dependant value, an R-squared value that will tell us how accurate the model is, along with other useful information about the model. 

Run the cell below to fit the model.



In [226]:
fit_model= smf.ols('y ~ x', toyds).fit()

Now that the model is fitted, we can run `.summary()` to view the OLS Regression Results.

In [227]:
fit_model.summary()

0,1,2,3
Dep. Variable:,y,R-squared:,1.0
Model:,OLS,Adj. R-squared:,1.0
Method:,Least Squares,F-statistic:,1.0449999999999998e+32
Date:,"Sat, 14 Nov 2020",Prob (F-statistic):,2.48e-278
Time:,13:31:27,Log-Likelihood:,631.08
No. Observations:,20,AIC:,-1258.0
Df Residuals:,18,BIC:,-1256.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,7.994e-15,2.34e-15,3.410,0.003,3.07e-15,1.29e-14
x,2.0000,1.96e-16,1.02e+16,0.000,2.000,2.000

0,1,2,3
Omnibus:,3.11,Durbin-Watson:,0.093
Prob(Omnibus):,0.211,Jarque-Bera (JB):,1.307
Skew:,0.156,Prob(JB):,0.52
Kurtosis:,1.787,Cond. No.,25.0


Running `.summary()` yields a lot of information that may seem confusing. Let’s extract some of the useful information from the OLS regression results. 

First, let’s look at the parameter of x. This is the number that the model suggests the independent variable be multiplied by to obtain the dependant variable. We expect this parameter to be ~2 since the 'y' column is simply the 'x' column multiplied by 2. Run the cell below to see if the model’s parameter for 'x' matches our expectation. 

In [238]:
x_param=fit_model.params.x
x_param

1.9999999999999993

Now we will check how the model's accuracy in predicting the 'y' value by adding a new column to the `toy_data` table called 'predicted'. This column will contain the models predicted 'y' which we will find by multiplying the 'x' column by the `x_param` parameter given by the model.

In [239]:
toy_data.with_column('predicted', toy_data.column('x')* x_param)

x,y,predicted
1,2,2
2,4,4
3,6,6
4,8,8
5,10,10
6,12,12
7,14,14
8,16,16
9,18,18
10,20,20


The model perfectly predicts the 'y' value! We will expect to see an R-Squared value of 1 to represent the perfect fit of the data. Run the cell below to view the R-Squared value our model calculated.

In [230]:
fit_model.rsquared

1.0

In the real world, it is unlikely to get such a perfect fit on data, but this way a good way to be introduced the the statsmodels library. You will now use statsmodels to find a linear model on some real world data regarding extremists candidates. 

**OLS with extremist by treat**

In [231]:
model = smf.ols('turnout_party_share ~ treat', extr_const).fit()
model.summary()

0,1,2,3
Dep. Variable:,turnout_party_share,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.004
Method:,Least Squares,F-statistic:,2.387
Date:,"Sat, 14 Nov 2020",Prob (F-statistic):,0.123
Time:,13:31:27,Log-Likelihood:,263.46
No. Observations:,362,AIC:,-522.9
Df Residuals:,360,BIC:,-515.1
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,0.5052,0.009,59.268,0.000,0.488,0.522
treat,-0.0191,0.012,-1.545,0.123,-0.043,0.005

0,1,2,3
Omnibus:,2.287,Durbin-Watson:,1.891
Prob(Omnibus):,0.319,Jarque-Bera (JB):,2.263
Skew:,0.193,Prob(JB):,0.323
Kurtosis:,2.959,Cond. No.,2.57


In [232]:
X= sm.add_constant(extremist_df['treat'])
y= extremist_df['turnout_party_share']
lm = sm.OLS(y, X).fit()
lm.summary()

0,1,2,3
Dep. Variable:,turnout_party_share,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.004
Method:,Least Squares,F-statistic:,2.387
Date:,"Sat, 14 Nov 2020",Prob (F-statistic):,0.123
Time:,13:31:27,Log-Likelihood:,263.46
No. Observations:,362,AIC:,-522.9
Df Residuals:,360,BIC:,-515.1
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.5052,0.009,59.268,0.000,0.488,0.522
treat,-0.0191,0.012,-1.545,0.123,-0.043,0.005

0,1,2,3
Omnibus:,2.287,Durbin-Watson:,1.891
Prob(Omnibus):,0.319,Jarque-Bera (JB):,2.263
Skew:,0.193,Prob(JB):,0.323
Kurtosis:,2.959,Cond. No.,2.57


In [233]:
X_var = sm.add_constant(extremist_df.drop(['treat'], axis= 1))
y_var= extremist_df[['treat']]
linearmodel = sm.OLS(y_var, X_var).fit()
linearmodel.summary()

0,1,2,3
Dep. Variable:,treat,R-squared:,0.645
Model:,OLS,Adj. R-squared:,0.639
Method:,Least Squares,F-statistic:,107.5
Date:,"Sat, 14 Nov 2020",Prob (F-statistic):,1.03e-76
Time:,13:31:27,Log-Likelihood:,-74.983
No. Observations:,362,AIC:,164.0
Df Residuals:,355,BIC:,191.2
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,0.6038,0.155,3.894,0.000,0.299,0.909
rv,2.5013,0.100,25.117,0.000,2.305,2.697
turnout_party_share,-0.1009,0.144,-0.701,0.484,-0.384,0.182
low_info_turnout_party,-0.1292,0.096,-1.348,0.179,-0.318,0.059
high_info_turnout_party,-0.1594,0.140,-1.139,0.256,-0.435,0.116
low_info_turnout_opp_party,0.0078,0.086,0.091,0.927,-0.161,0.176
high_info_turnout_opp_party,0.1982,0.133,1.486,0.138,-0.064,0.460

0,1,2,3
Omnibus:,250.084,Durbin-Watson:,2.085
Prob(Omnibus):,0.0,Jarque-Bera (JB):,22.717
Skew:,-0.019,Prob(JB):,1.17e-05
Kurtosis:,1.773,Cond. No.,22.9


In [234]:
sns.lmplot('rv', 'turnout_party_share', data=df, col='treat')



<seaborn.axisgrid.FacetGrid at 0x2b7a66a0438>

In [235]:
df['low_info_turnout_party'].hist()
df['high_info_turnout_party'].hist()

<AxesSubplot:title={'center':'treat = 1.0'}, xlabel='rv'>

In [236]:
df['low_info_turnout_opp_party'].hist(label='low')
df['high_info_turnout_opp_party'].hist()
plt.legend()

<matplotlib.legend.Legend at 0x2b7a6833d68>