# One-way ANOVA

We use one-way analysis of variance (ANOVA) to test whether there are statistically significant differences between the means of three or more independent groups. In this tutorial we will carry out a one-way ANOVA test on a small dataset.


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

In [None]:
%matplotlib inline

### The data
We are going to use a hypothetical example to illustrate how to conduct a one-way ANOVA test. The example comes from [Comparing Three or More Means (One-Way Analysis of Variance)](http://www.napavalley.edu/people/cburditt/Documents/Math%20232%20-%20Statistics/Sullivan%20ANOVA.pdf). "A stock analyst wondered whether the mean rate of return of financial, energy, and utility stocks differed over the past 5 years. He obtained a simple random sample of eight companies from each of the three sectors and obtained the 5-year rates of return shown in the following table (in percent):"

In [None]:
data = pd.read_csv('returns.csv', sep=',')

In [None]:
data

### Explore the data

To perform the one-way ANOVA test, the groups of observations must be independent. For this example, we assume that because the analyst used simple random sampling within each sector, the observations are independent.

Now, let's generate descriptive statistics of the data to explore the data further.

In [None]:
data.describe()

As you can see, the three stock groups have differing means. However, considering the standard deviations, there is some overlap. We want to establish whether these means are actually equal or not. Enter the one-way ANOVA below.

### The hypothesis

We state the null and alternative hypotheses as follows:

$H_0:$ the means of all three groups are equal ($\mu_{Financial} = \mu_{Energy} = \mu_{Utilities}$)

$H_1:$ at least one group mean differs from the others.

We use a significance level of 0.05.



### Visualize the means

In [None]:
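f, ax = plt.subplots(figsize=(9, 8))
# Draw on the axes created above. The line inside each box is the median,
# so showmeans=True adds a marker for each group's mean as well.
sns.boxplot(data=data, ax=ax, showmeans=True)
ax.xaxis.grid(True) # tweak the visual presentation
ax.set(ylabel="5-year rate of return (%)")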

In [None]:
print(data['Financial'].std(), data['Energy'].std(), data['Utilities'].std())

Energy stocks appear to have a higher mean return than the other two sectors. The standard deviations are similar, so returns within each group are spread out to a similar degree, which is consistent with the equal-variance assumption of ANOVA.
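The visual check of the spreads can be backed up with a formal one. The sketch below is one possible approach, not part of the original analysis: it uses Levene's test from `scipy.stats` together with a common rule of thumb (largest standard deviation no more than about twice the smallest). The `demo` dataframe holds synthetic numbers generated only for illustration; they are not the returns in `returns.csv`.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic stand-in data (assumption): 8 made-up values per sector,
# NOT the real returns used in this tutorial.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Financial": rng.normal(10, 3, 8),
    "Energy": rng.normal(13, 3, 8),
    "Utilities": rng.normal(11, 3, 8),
})

# Rule of thumb: compare the largest and smallest sample standard deviations
sds = demo.std()
sd_ratio = sds.max() / sds.min()
print(f"largest SD / smallest SD = {sd_ratio:.2f}")

# Levene's test: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(
    demo["Financial"], demo["Energy"], demo["Utilities"]
)
print(f"Levene W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A large Levene p-value means the test found no evidence of unequal variances. To run this on the tutorial's data, replace `demo` with `data`.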

### The test

We perform the one-way ANOVA test using the `f_oneway()` function in `scipy.stats`.

In [None]:
stats.f_oneway(data['Financial'], data['Energy'], data['Utilities'])

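We obtain a p-value of approximately 0.15, which is greater than the 0.05 significance level. We therefore fail to reject the null hypothesis: the data do not provide sufficient evidence that the mean returns of the three groups differ. Note that this is not the same as showing that the means are equal.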
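To see what `f_oneway()` computes, the F statistic can be built by hand as the ratio of the between-group mean square to the within-group mean square. The following is a minimal sketch on synthetic data (made-up numbers, not the tutorial's returns); the variable names are chosen here for illustration.

```python
import numpy as np
import scipy.stats as stats

# Synthetic stand-in data (assumption): three groups of 8 made-up values
rng = np.random.default_rng(1)
groups = [rng.normal(10, 3, 8), rng.normal(13, 3, 8), rng.normal(11, 3, 8)]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k
f_manual = (ss_between / df_between) / (ss_within / df_within)
# p-value: upper tail of the F distribution with (df_between, df_within) df
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy:  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

The two lines of output should agree, which confirms that the test compares variation between groups to variation within groups.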

### Post-hoc tests

A significant p-value indicates that at least one group mean differs from the others, but it does not indicate which pairs of groups differ. To find out, we perform multiple pairwise comparisons between group means. In practice, there are several tests to choose from, each with its own assumptions and limitations. Here we would use Tukey's HSD (Honestly Significant Difference) test. See the references section at the end of this notebook for more on multiple comparisons.

Our test results do not indicate a significant difference between the means of the three groups, so a post-hoc test is not needed here. If it were, the Tukey HSD test is available from the following library:

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
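For reference, here is a sketch of how `pairwise_tukeyhsd` could be called had the ANOVA been significant. The function expects one array of values and one array of group labels, so wide data is first reshaped with `melt`. The `demo` dataframe and the column names `sector` and `return_pct` are assumptions made for this illustration, and the numbers are synthetic.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in data (assumption): NOT the tutorial's real returns
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "Financial": rng.normal(10, 3, 8),
    "Energy": rng.normal(13, 3, 8),
    "Utilities": rng.normal(11, 3, 8),
})

# Reshape from wide (one column per sector) to long (value + group label)
long_df = demo.melt(var_name="sector", value_name="return_pct")

# Compare every pair of sectors at a family-wise significance level of 0.05
tukey = pairwise_tukeyhsd(
    endog=long_df["return_pct"], groups=long_df["sector"], alpha=0.05
)
print(tukey.summary())
```

The summary table lists, for each pair of groups, the mean difference, an adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected.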

### Reporting on ANOVA

A one-way ANOVA was conducted to compare the mean 5-year returns of the three stock sectors. There was no significant difference between the mean returns of the three sectors (F(2, 21) = 2.0774, p = 0.15). No post hoc comparisons were carried out.
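The reported p-value can be recovered from the F statistic and its degrees of freedom. With 3 groups of 8 observations each, the degrees of freedom are 3 - 1 = 2 between groups and 24 - 3 = 21 within groups. The sketch below uses `scipy.stats.f` for this check; the variable names are chosen here for illustration.

```python
import scipy.stats as stats

k, n_total = 3, 24        # three sectors, eight companies each
f_reported = 2.0774       # F statistic reported above

df_between = k - 1        # 2
df_within = n_total - k   # 21

# Upper-tail probability of the F distribution gives the p-value
p_value = stats.f.sf(f_reported, df_between, df_within)
# Critical value: F must exceed this to be significant at the 0.05 level
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p = {p_value:.2f}, critical F at 0.05 = {f_crit:.2f}")
```

Because the reported F is below the critical value, the result is not significant at the 0.05 level, in line with the p-value of about 0.15.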

### Exercise

#### Question 1
Load the following dataset into a Pandas dataframe: `stcp-Rdataset-Diet.csv`. Investigate the data. Generate summary statistics.

#### Question 2
You need to compare three diets. Generate visualizations that compare the group means. Discuss the results.

#### Question 3
Formulate a hypothesis about the diet groups in the dataset. Test your hypothesis.

In [None]:
#response to question 1

In [None]:
#response to question 2

In [None]:
#response to question 3

## References


For more on multiple comparisons, refer to the resource linked below. It also compares the methods available besides Tukey's and explains when to use each.

[Using Multiple Comparisons](https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/anova/supporting-topics/multiple-comparisons/using-multiple-comparisons-to-assess-differences-in-means/)
