In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<h1> Hypothesis Testing and ANOVA</h1>

I used the GapMinder data set to investigate the three variables incomeperperson, armedforcesrate, and polityscore.

<h4>SET UP</h4>

<i>Import the packages to use</i>

In [2]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

<i>Set some options</i>

In [3]:
pd.set_option('display.max_rows', 200)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.float_format', '{:,.2f}'.format)

<i>Read in the data</i>

In [22]:
data = pd.read_csv('../gapminder.csv', low_memory=False).set_index('country')
data = data[['incomeperperson','armedforcesrate','polityscore']]
data = data[(data['incomeperperson'] != ' ') & (data['armedforcesrate'] != ' ') & (data['polityscore'] != ' ')]
data.loc[:,'incomeperperson'] = pd.to_numeric(data.loc[:,'incomeperperson']).astype('int')
data.loc[:,'armedforcesrate'] = pd.to_numeric(data.loc[:,'armedforcesrate']).round(4)

<i>Map some values so the analysis doesn't get too big to interpret</i>

In [23]:
data.polityscore = data.polityscore \
    .map({'-10':'-10:-5','-9':'-10:-5','-8':'-10:-5','-7':'-10:-5','-6':'-10:-5'
         ,'-5':'-10:-5','-4':'-4:-','-3':'-4:-1','-2':'-4:-1','-1':'-4:-1','0':'0'
         ,'1':'1:5','2':'1:5','3':'1:5','4':'1:5','5':'1:5','6':'6:10','7':'6:10'
         ,'8':'6:10', '9':'6:10','10':'6:10'})

<h4>ANOVA ANALYSIS</h4>

<i>Perform the ANOVA test</i>

In [27]:
model1 = smf.ols(formula='incomeperperson~C(polityscore)', data=data)
results1 = model1.fit()
results1.summary()

0,1,2,3
Dep. Variable:,incomeperperson,R-squared:,0.12
Model:,OLS,Adj. R-squared:,0.089
Method:,Least Squares,F-statistic:,3.896
Date:,"Sat, 20 May 2017",Prob (F-statistic):,0.00243
Time:,21:20:14,Log-Likelihood:,-1572.1
No. Observations:,149,AIC:,3156.0
Df Residuals:,143,BIC:,3174.0
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
Intercept,6165.0000,2059.996,2.993,0.003,2093.021 1.02e+04
C(polityscore)[T.-4:-],-4803.6667,4369.912,-1.099,0.274,-1.34e+04 3834.305
C(polityscore)[T.-4:-1],-2983.2000,3191.333,-0.935,0.351,-9291.482 3325.082
C(polityscore)[T.0],-5703.7500,5149.991,-1.108,0.270,-1.59e+04 4476.197
C(polityscore)[T.1:5],-5007.1667,3032.233,-1.651,0.101,-1.1e+04 986.624
C(polityscore)[T.6:10],3335.6471,2300.435,1.450,0.149,-1211.604 7882.899

0,1,2,3
Omnibus:,52.425,Durbin-Watson:,1.702
Prob(Omnibus):,0.0,Jarque-Bera (JB):,100.574
Skew:,1.658,Prob(JB):,1.45e-22
Kurtosis:,5.282,Cond. No.,9.28


<i>Look at some statistics and perform a post hoc test</i>

In [28]:
mean1 = data.groupby('polityscore').mean()
sd1 = data.groupby('polityscore').std()

In [29]:
mc1 = multi.MultiComparison(data['incomeperperson'],data['polityscore'])
res1 = mc1.tukeyhsd()
res1.summary()

group1,group2,meandiff,lower,upper,reject
-10:-5,-4:-,-4803.6667,-17427.1898,7819.8565,False
-10:-5,-4:-1,-2983.2,-12202.1178,6235.7178,False
-10:-5,0,-5703.75,-20580.7147,9173.2147,False
-10:-5,1:5,-5007.1667,-13766.4863,3752.153,False
-10:-5,6:10,3335.6471,-3309.7019,9980.996,False
-4:-,-4:-1,1820.4667,-11352.1594,14993.0927,False
-4:-,0,-900.0833,-18502.7453,16702.5787,False
-4:-,1:5,-203.5,-13058.6667,12651.6667,False
-4:-,6:10,8139.3137,-3379.8136,19658.4411,False
-4:-1,0,-2720.55,-18066.195,12625.095,False


<h4>MODEL INTERPRETATION</h4>

My main question is between two quantitative variables so I have incorporated my further variable in order to perform, which is categorical, so I will be looking at the association between the income per person and the polity score of countries. I have also mapped my values to 5 sets to make the analysis displays more compact, the groups are: -10:-5, -6:-1,0,1:5,6:10. 

My Analysis of Variance (ANVOA) revealed that there is an association between income per person and polity score by country, F = 4.859, p = 0.00105, so p < 0.05. As my categorical variable (polity score) has 5 levels I need to perform a post hoc comparison of mean income per person against polity score grouping, using the Tukey Honestly Significantly Difference test. This revealed that there is a significant difference between countries scoring -4:-1 and 6:10 as well as countries scoring 1:5 and 6:10, with the 6:10 group earning much more income per person than the other groups. All other comparisons were statistically similar.