This example is based on the one-way ANOVA example in Table 11.2 from Ken Black, *Business Statistics for Contemporary Decision Making*, 6th Edition, John Wiley and Sons, p. 407, 2010.

Import the required libraries: pandas (`pd`) for data manipulation, statsmodels (`sm`) for statistics, and the statsmodels formula API (`smf`) for building models from formulas.

In [1]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

First, make sure that the data file `data-valve-operator.csv` is accessible to your Python environment. If you are working on your local machine, put it in the same directory as this Jupyter notebook. If you are using a cloud service such as Google Colab, upload the file to the remote machine (click the folder icon on the left side, then click the upload button).
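A quick way to confirm that the file is visible before reading it is to check the path from Python. This is a minimal sketch using only the standard library; it assumes the file name given above and that the file was placed in the notebook's current working directory.

```python
from pathlib import Path

# File name from the text above; assumed to sit in the current working directory.
data_path = Path("data-valve-operator.csv")

if data_path.exists():
    print(f"Found {data_path.resolve()}")
else:
    # Show where Python is looking so the file can be copied or uploaded there.
    print(f"{data_path.name} not found in {Path.cwd()}")
```

If the second message appears, copy or upload the file to the directory it names and rerun the cell.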

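Next, read the data file using pandas. The first column of the file is an unnamed row index, which pandas labels `Unnamed: 0`. Only the first 10 of the 24 rows are reproduced below.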

In [2]:
df = pd.read_csv('data-valve-operator.csv')
display(df)

Unnamed: 0,Operator,ValveOpening
0,1,6.33
1,1,6.26
2,1,6.31
3,1,6.29
4,1,6.4
5,2,6.26
6,2,6.36
7,2,6.23
8,2,6.27
9,2,6.19


Next, build and fit an ordinary least squares (OLS) model using statsmodels. For a one-way ANOVA, the formula models `ValveOpening` as a function of the *categorical* variable `Operator`. The resulting ANOVA table shows that the operator has a statistically significant effect on valve opening (F = 10.18 on 3 and 20 degrees of freedom, p ≈ 0.0003).

In [3]:
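# Fit ValveOpening on Operator; C() marks Operator as categorical, not numeric
model = smf.ols('ValveOpening ~ C(Operator)', data=df).fit()
# Print the one-way ANOVA table for the fitted model
print(sm.stats.anova_lm(model))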

               df    sum_sq   mean_sq          F    PR(>F)
C(Operator)   3.0  0.236580  0.078860  10.181025  0.000279
Residual     20.0  0.154916  0.007746        NaN       NaN
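The columns of the ANOVA table are related by simple arithmetic, which can be checked by hand. The sketch below uses only the degrees of freedom and sums of squares printed above (the variable names are illustrative) and recomputes the mean squares, the F statistic, and the share of variance explained by operator.

```python
# Values copied from the ANOVA table printed above.
df_between, ss_between = 3, 0.236580   # C(Operator) row
df_within, ss_within = 20, 0.154916    # Residual row

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_stat = ms_between / ms_within

# Fraction of the total sum of squares explained by the operator factor.
r_squared = ss_between / (ss_between + ss_within)

print(f"MS between = {ms_between:.6f}")
print(f"MS within  = {ms_within:.6f}")
print(f"F          = {f_stat:.2f}")
print(f"R-squared  = {r_squared:.3f}")
```

Because the inputs are rounded to six decimals, the recomputed values agree with the table only to about that precision.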


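We can also perform a Tukey Honestly Significant Difference (HSD) test to identify which pairs of operators (if any) have significantly different mean valve openings. At a family-wise error rate of 0.05, the pairs (1, 3), (2, 3), and (3, 4) differ significantly. In other words, operator 3 differs from each of the other operators, while operators 1, 2, and 4 cannot be distinguished from one another.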

In [4]:
# Group ValveOpening by Operator, then run all pairwise Tukey HSD comparisons
mc = sm.stats.multicomp.MultiComparison(df['ValveOpening'], df['Operator'])
print(mc.tukeyhsd())

Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower   upper  reject
----------------------------------------------------
     1      2  -0.0405 0.8362 -0.1809  0.0999  False
     1      3   0.1706 0.0169  0.0263  0.3148   True
     1      4   -0.088 0.4625 -0.2533  0.0773  False
     2      3   0.2111  0.001  0.0836  0.3386   True
     2      4  -0.0475 0.7958 -0.1984  0.1034  False
     3      4  -0.2586  0.001  -0.413 -0.1042   True
----------------------------------------------------
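One way to read the Tukey table is that a pair is flagged `reject = True` exactly when its confidence interval excludes zero. The sketch below copies the rows from the table above into plain Python tuples (the tuple layout is an illustrative choice, not part of statsmodels) and applies that rule.

```python
# Rows copied from the Tukey HSD table above:
# (group1, group2, meandiff, lower, upper, reject)
rows = [
    (1, 2, -0.0405, -0.1809, 0.0999, False),
    (1, 3, 0.1706, 0.0263, 0.3148, True),
    (1, 4, -0.0880, -0.2533, 0.0773, False),
    (2, 3, 0.2111, 0.0836, 0.3386, True),
    (2, 4, -0.0475, -0.1984, 0.1034, False),
    (3, 4, -0.2586, -0.4130, -0.1042, True),
]

# A difference is significant when the interval lies entirely on one side of zero.
significant_pairs = [
    (g1, g2) for g1, g2, _, lower, upper, _ in rows if lower > 0 or upper < 0
]

# Confirm that the rule reproduces the "reject" column for every row.
rule_matches_table = all(
    (lower > 0 or upper < 0) == reject for _, _, _, lower, upper, reject in rows
)

print(significant_pairs)   # [(1, 3), (2, 3), (3, 4)]
print(rule_matches_table)  # True
```

Every significant pair involves operator 3, which matches the interpretation given above.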


We can also view a summary of the regression model, which encodes the operators as dummy variables. The intercept is the mean valve opening for operator 1 (the baseline), and each remaining coefficient is that operator's mean minus the baseline mean. Results show that operators 2 and 4 have slightly lower means than operator 1 (negative coefficients that are not statistically significant), while operator 3 has a significantly higher mean.
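The dummy-variable interpretation can be sketched by hand with the numbers already shown. The example below assumes that the five operator 1 rows in the data preview are all of that operator's observations, and it takes the differences from operator 1 from the Tukey table rows where `group1` is 1; the variable names are illustrative.

```python
from statistics import mean

# Operator 1 rows from the data preview (assumed to be all of its observations).
operator_1 = [6.33, 6.26, 6.31, 6.29, 6.40]

# With operator 1 as the baseline, the intercept is simply its group mean.
intercept = mean(operator_1)

# Differences from operator 1, copied from the Tukey table (group1 == 1 rows).
coef = {2: -0.0405, 3: 0.1706, 4: -0.0880}

# Each operator's estimated mean is the intercept plus its coefficient.
group_means = {1: intercept}
for operator, difference in coef.items():
    group_means[operator] = intercept + difference

for operator, value in group_means.items():
    print(f"Operator {operator}: estimated mean = {value:.4f}")
```

The intercept computed this way can be compared with the `Intercept` row of the regression summary.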

In [5]:
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:           ValveOpening   R-squared:                       0.604
Model:                            OLS   Adj. R-squared:                  0.545
Method:                 Least Squares   F-statistic:                     10.18
Date:                Wed, 17 Nov 2021   Prob (F-statistic):           0.000279
Time:                        00:46:05   Log-Likelihood:                 26.461
No. Observations:                  24   AIC:                            -44.92
Df Residuals:                      20   BIC:                            -40.21
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            6.3180      0.039  