# Steps to conduct one-way ANOVA:

# 1. Check the assumptions of ANOVA.
# 2. Formulate the Null and Alternative Hypotheses.
# 3. Compute the test statistic [here we are using the F test] and the corresponding p-value.
# 4. Decide whether to reject or fail to reject the Null Hypothesis.
# 5. If the Null Hypothesis is rejected, identify which groups differ so that we can take corrective measures.

##### Note : ANOVA uses the F-test to determine whether the variability between group means is larger than the variability of the observations within the groups. 
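
To make the note above concrete, here is a minimal sketch of how the F statistic compares between-group and within-group variability. The helper `one_way_f` and the toy data are hypothetical and only for illustration; they are not part of this notebook's analysis.

```python
import numpy as np

def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Between-group sum of squares: spread of the group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)

    # F = mean square between / mean square within
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical toy data: three groups of three observations each
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_f(toy_groups)
print(f_stat, df_b, df_w)
```

For this toy data the between-group mean square is 13 and the within-group mean square is 1, so F is about 13: the group means vary far more than the observations inside each group.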

### Now that we have checked the assumptions of ANOVA in the previous notebook, let us go ahead and perform the one-way ANOVA.

In [7]:
import numpy as np
import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols      # Fits the linear model used for the ANOVA
from statsmodels.stats.anova import anova_lm # Builds the ANOVA table from a fitted model
from statsmodels.stats.multicomp import pairwise_tukeyhsd # For performing the Tukey-HSD test
from statsmodels.stats.multicomp import MultiComparison # To compare the dependent variable across the levels of
                                                        # the independent variable

%matplotlib inline 

In [8]:
iris = sns.load_dataset('iris')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


The hypotheses for the one-way ANOVA are:

* $H_0$: All the population means are equal.
* $H_a$: At least one population mean differs from the others.


In [9]:
iris['species'] = pd.Categorical(iris['species'])

In [10]:
iris['species'].unique()

[setosa, versicolor, virginica]
Categories (3, object): [setosa, versicolor, virginica]

In [11]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   sepal_length  150 non-null    float64 
 1   sepal_width   150 non-null    float64 
 2   petal_length  150 non-null    float64 
 3   petal_width   150 non-null    float64 
 4   species       150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB


Let us fit the one-way ANOVA of sepal_width on species and look at the ANOVA table.

In [14]:
formula = 'sepal_width ~ C(species)'   # response ~ categorical factor
model = ols(formula, data=iris).fit()  # fit the underlying linear model
aov_table = anova_lm(model)            # one-way ANOVA table
print(aov_table)

               df     sum_sq   mean_sq         F        PR(>F)
C(species)    2.0  11.344933  5.672467  49.16004  4.492017e-17
Residual    147.0  16.962000  0.115388       NaN           NaN
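
As a quick check on how the columns of this table relate, the sketch below recomputes the mean squares, the F statistic and the p-value from the `df` and `sum_sq` values printed above. It assumes SciPy is installed (this notebook does not import it, but statsmodels depends on it); the variable names are illustrative.

```python
from scipy import stats

# Values copied from the ANOVA table printed above
df_species, ss_species = 2.0, 11.344933
df_resid, ss_resid = 147.0, 16.962000

# mean_sq = sum_sq / df
ms_species = ss_species / df_species
ms_resid = ss_resid / df_resid

# F = mean square of the factor / mean square of the residual
f_stat = ms_species / ms_resid

# PR(>F): upper-tail probability of the F distribution with (2, 147) degrees of freedom
p_value = stats.f.sf(f_stat, df_species, df_resid)

alpha = 0.05
reject_null = p_value < alpha
print(f_stat, p_value, reject_null)
```

The recomputed values should agree with the `mean_sq`, `F` and `PR(>F)` columns up to the rounding of the printed inputs.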


The result of the ANOVA is the same as the one we got using the F-test method.

The interpretation of the p-value is the same as for the F-test in the previous notebook.

The p-value (about 4.49e-17) is less than alpha (0.05). Thus, we reject the null hypothesis.
This means that at least one category of the 'species' variable has a mean sepal_width that differs from the other categories.
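
Rejecting the null hypothesis does not say which groups differ; that is step 5 in the list at the top. A minimal sketch of a Tukey HSD post-hoc comparison with `pairwise_tukeyhsd` (imported above) is shown here on hypothetical toy data, so that it runs without downloading the iris dataset. The group labels and values are assumptions made only for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical toy data: groups "a" and "b" share the same mean, group "c" is shifted upwards
values = np.array([1.0, 1.1, 0.9, 1.2, 0.8,
                   1.0, 1.2, 0.9, 1.1, 0.8,
                   5.0, 5.1, 4.9, 5.2, 4.8])
labels = np.repeat(["a", "b", "c"], 5)

# Compare every pair of group means at a family-wise significance level of 0.05
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# One boolean per pair, in the order (a, b), (a, c), (b, c):
# True means the two group means differ significantly
print(tukey.reject)
```

On the notebook's data the analogous call would pass `iris['sepal_width']` as `endog` and `iris['species']` as `groups`; the pairs flagged as rejected are the species whose mean sepal_width differs.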