# Instagram Factorial Example

Import the necessary libraries.

In [1]:
import os
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

Change the working directory.

In [2]:
os.chdir("/Users/nstevens/Dropbox/Teaching/MSDS_629/2023/Lectures/Lecture6")

Read in the data.

In [3]:
data = pd.read_csv('instagram-factorial.csv')
data.head(10)

Unnamed: 0,Time,Frequency,Type
0,7.008647,0,1
1,6.692199,0,1
2,8.486351,0,1
3,7.168847,0,1
4,5.413766,0,1
5,6.524732,0,1
6,7.838157,0,1
7,6.08078,0,1
8,6.315004,0,1
9,8.199489,0,1


Let's next fit the appropriate linear regression models and conduct the relevant F-tests. We begin by fitting the "full" model and evaluating the significance of the interaction effect.

In [4]:
model = smf.ols('Time ~ C(Frequency) * C(Type)', data = data).fit()
model.summary()

0,1,2,3
Dep. Variable:,Time,R-squared:,0.85
Model:,OLS,Adj. R-squared:,0.85
Method:,Least Squares,F-statistic:,6455.0
Date:,"Fri, 13 Jan 2023",Prob (F-statistic):,0.0
Time:,12:52:22,Log-Likelihood:,-10442.0
No. Observations:,8000,AIC:,20900.0
Df Residuals:,7992,BIC:,20960.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.9779,0.028,247.104,0.000,6.922,7.033
C(Frequency)[T.1],-1.9693,0.040,-49.312,0.000,-2.048,-1.891
C(Frequency)[T.2],-3.0020,0.040,-75.173,0.000,-3.080,-2.924
C(Frequency)[T.3],-5.9586,0.040,-149.206,0.000,-6.037,-5.880
C(Type)[T.2],0.1099,0.040,2.753,0.006,0.032,0.188
C(Frequency)[T.1]:C(Type)[T.2],0.5177,0.056,9.166,0.000,0.407,0.628
C(Frequency)[T.2]:C(Type)[T.2],0.7492,0.056,13.266,0.000,0.639,0.860
C(Frequency)[T.3]:C(Type)[T.2],0.3473,0.056,6.150,0.000,0.237,0.458

0,1,2,3
Omnibus:,59.573,Durbin-Watson:,2.027
Prob(Omnibus):,0.0,Jarque-Bera (JB):,89.75
Skew:,-0.045,Prob(JB):,3.24e-20
Kurtosis:,3.511,Cond. No.,12.5
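
To help read the coefficient table, the following minimal sketch rebuilds the fitted mean for each Frequency-by-Type condition from the estimates above. It assumes the default treatment (dummy) coding, with Frequency = 0 and Type = 1 as the baseline levels, which is what the `[T.1]`, `[T.2]`, `[T.3]` labels suggest. The names `cell_mean`, `FREQ`, and `INTERACTION` are hypothetical helpers introduced only for this illustration.

```python
# Coefficient estimates copied from the summary table above.
# Assumption: treatment coding with baseline Frequency = 0, Type = 1.
INTERCEPT = 6.9779
FREQ = {1: -1.9693, 2: -3.0020, 3: -5.9586}      # C(Frequency)[T.j]
TYPE2 = 0.1099                                    # C(Type)[T.2]
INTERACTION = {1: 0.5177, 2: 0.7492, 3: 0.3473}   # C(Frequency)[T.j]:C(Type)[T.2]


def cell_mean(frequency: int, ad_type: int) -> float:
    """Fitted mean Time for one Frequency x Type condition."""
    mean = INTERCEPT + FREQ.get(frequency, 0.0)
    if ad_type == 2:
        # The Type 2 shift plus the extra adjustment for this frequency level.
        mean += TYPE2 + INTERACTION.get(frequency, 0.0)
    return mean


for frequency in range(4):
    for ad_type in (1, 2):
        print(frequency, ad_type, round(cell_mean(frequency, ad_type), 4))
```

For example, the fitted mean for Frequency = 1 and Type = 2 is 6.9779 - 1.9693 + 0.1099 + 0.5177 = 5.6362. The interaction coefficients are what allow the Type 2 shift to differ across frequency levels.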


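The same model can also be specified by writing out the main effects and the interaction explicitly. In the formula syntax, `A * B` is shorthand for `A + B + A:B`, so this fit reproduces the one above.

In [5]: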
model = smf.ols('Time ~ C(Frequency) + C(Type) + C(Frequency) : C(Type)', data = data).fit()
model.summary()

0,1,2,3
Dep. Variable:,Time,R-squared:,0.85
Model:,OLS,Adj. R-squared:,0.85
Method:,Least Squares,F-statistic:,6455.0
Date:,"Mon, 02 Jan 2023",Prob (F-statistic):,0.0
Time:,16:17:17,Log-Likelihood:,-10442.0
No. Observations:,8000,AIC:,20900.0
Df Residuals:,7992,BIC:,20960.0
Df Model:,7,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,6.9779,0.028,247.104,0.000,6.922,7.033
C(Frequency)[T.1],-1.9693,0.040,-49.312,0.000,-2.048,-1.891
C(Frequency)[T.2],-3.0020,0.040,-75.173,0.000,-3.080,-2.924
C(Frequency)[T.3],-5.9586,0.040,-149.206,0.000,-6.037,-5.880
C(Type)[T.2],0.1099,0.040,2.753,0.006,0.032,0.188
C(Frequency)[T.1]:C(Type)[T.2],0.5177,0.056,9.166,0.000,0.407,0.628
C(Frequency)[T.2]:C(Type)[T.2],0.7492,0.056,13.266,0.000,0.639,0.860
C(Frequency)[T.3]:C(Type)[T.2],0.3473,0.056,6.150,0.000,0.237,0.458

0,1,2,3
Omnibus:,59.573,Durbin-Watson:,2.027
Prob(Omnibus):,0.0,Jarque-Bera (JB):,89.75
Skew:,-0.045,Prob(JB):,3.24e-20
Kurtosis:,3.511,Cond. No.,12.5


Compare this "full" model to the "reduced" one without interaction terms, that is, the model obtained when the three interaction coefficients are zero and $$H_0:\beta_5=\beta_6=\beta_7=0$$ is true.
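
For reference, the hypotheses in this example can be read against the full model written out in indicator form. This is a sketch under an assumed notation: the indicator symbols $$x_{F_j}$$ (equal to 1 when Frequency is $$j$$) and $$x_{T_2}$$ (equal to 1 when Type is 2) do not appear in the original, and the coefficients are assumed to be numbered in the order they are listed in the summary table.

$$\text{Time} = \beta_0 + \beta_1 x_{F_1} + \beta_2 x_{F_2} + \beta_3 x_{F_3} + \beta_4 x_{T_2} + \beta_5 x_{F_1}x_{T_2} + \beta_6 x_{F_2}x_{T_2} + \beta_7 x_{F_3}x_{T_2} + \varepsilon$$

Under this numbering, $$\beta_1,\beta_2,\beta_3$$ are the frequency main effects, $$\beta_4$$ is the type main effect, and $$\beta_5,\beta_6,\beta_7$$ are the interaction terms, so setting the last three to zero gives the main effects model.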

In [6]:
model_red1 = smf.ols('Time ~ C(Frequency) + C(Type)', data = data).fit()
sm.stats.anova_lm(model_red1, model)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,7995.0,6522.193335,0.0,,,
1,7992.0,6372.921995,3.0,149.27134,62.398198,7.167357e-40


This very small p-value leads us to reject the null hypothesis above, indicating that the frequency-by-type interaction effect is significant.
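
As a check on where the F statistic comes from, here is a minimal sketch that recomputes it by hand from the residual sums of squares and residual degrees of freedom reported in the table above. The function name `partial_f` is a hypothetical helper, not part of statsmodels.

```python
def partial_f(ssr_reduced, df_reduced, ssr_full, df_full):
    """Partial F statistic for comparing nested least-squares models.

    ssr_* are residual sums of squares; df_* are residual degrees of freedom.
    """
    df_diff = df_reduced - df_full                   # number of coefficients tested
    numerator = (ssr_reduced - ssr_full) / df_diff   # extra variation explained, per coefficient
    denominator = ssr_full / df_full                 # error variance estimate from the full model
    return numerator / denominator


# Values copied from the anova_lm output above (reduced row 0, full row 1).
f_interaction = partial_f(6522.193335, 7995, 6372.921995, 7992)
print(f_interaction)
```

The result should agree with the `F` column, up to rounding of the inputs. The p-value is then the upper-tail probability of an F distribution with 3 and 7992 degrees of freedom.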

Next, purely for illustration, let us consider the main effects model (the one without the interaction effect) and evaluate the significance of each main effect in the context of that model.

First we evaluate the main effect of ad type. The following test statistic and p-value are associated with a test of $$H_0: \beta_4=0 \text{ vs. }H_A: \beta_4 \neq 0$$ in the context of the main effects model.

In [7]:
model_red2 = smf.ols('Time ~ C(Frequency)', data = data).fit()
sm.stats.anova_lm(model_red2, model_red1)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,7996.0,7049.537052,0.0,,,
1,7995.0,6522.193335,1.0,527.343717,646.425642,3.385183e-137


Next we evaluate the main effect of ad frequency. The following test statistic and p-value are associated with a test of $$H_0: \beta_1=\beta_2=\beta_3=0 \text{ vs. }H_A: \beta_j \neq 0 \text{ for some }j=1,2,3$$ in the context of the main effects model.

In [8]:
model_red3 = smf.ols('Time ~ C(Type)', data = data).fit()
sm.stats.anova_lm(model_red3, model_red1)

Unnamed: 0,df_resid,ssr,df_diff,ss_diff,F,Pr(>F)
0,7998.0,41875.132753,0.0,,,
1,7995.0,6522.193335,3.0,35352.939418,14445.383433,0.0


Based on the results of these tests we find that our intuition from the plots was correct: the interaction is small but statistically significant, and both main effects are also significant.