## ANOVA

In [48]:
import numpy as np
import pandas as pd

In [49]:
from scipy import stats

* One-way F-test (ANOVA): tests whether two or more groups have the same mean by comparing the variation between the group means with the variation within the groups. The ratio of the two is the F-score.
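To make the F-score concrete, the sketch below computes it by hand as the ratio of the between-group mean square to the within-group mean square, then compares the result with `stats.f_oneway`. The helper name `one_way_f` and the three small toy samples are assumptions made up for this illustration; they are not part of the dataset used later.

```python
import numpy as np
from scipy import stats

def one_way_f(*groups):
    """Return the one-way ANOVA F-score (hypothetical helper for illustration)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                          # number of groups
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()

    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

    ms_between = ss_between / (k - 1)        # between-group mean square
    ms_within = ss_within / (n_total - k)    # within-group mean square
    return ms_between / ms_within

# Toy samples, made up for this illustration
g1, g2, g3 = [1, 2, 3], [2, 3, 4], [5, 6, 7]
f_manual = one_way_f(g1, g2, g3)
f_scipy = stats.f_oneway(g1, g2, g3).statistic
print(f_manual, f_scipy)
```

A large F-score means the group means are spread out much more than the noise inside each group would explain.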

### Example: there are 3 categories of plants with recorded weights, and we need to check whether all 3 groups have the same mean weight

In [50]:
df_anova = pd.read_excel(r'datasets/PlantGrowth.xlsx')
df_anova.head()

,Unnamed: 0,weight,group
0,1,4.17,ctrl
1,2,5.58,ctrl
2,3,5.18,ctrl
3,4,6.11,ctrl
4,5,4.5,ctrl


In [51]:
df_anova = df_anova[['weight','group']] # Unnamed: 0 column is not required
df_anova.head()

,weight,group
0,4.17,ctrl
1,5.58,ctrl
2,5.18,ctrl
3,6.11,ctrl
4,4.5,ctrl


In [52]:
df_anova['group'].unique()

array(['ctrl', 'trt1', 'trt2'], dtype=object)

##### For one-way ANOVA, we can use the SciPy function: stats.f_oneway(x1, x2, x3)

* stats.f_oneway(x1, x2, x3)
* x1 = sample 1, x2 = sample 2, ...
* So, segregate the weight data by group

Extract all samples

In [53]:
df_anova[df_anova['group'] == 'ctrl']['weight'].values

array([4.17, 5.58, 5.18, 6.11, 4.5 , 4.61, 5.17, 4.53, 5.33, 5.14])

In [54]:
x1 = df_anova[df_anova['group']=='ctrl']['weight'].values # simple way
x2 = df_anova[df_anova['group']=='trt1']['weight'].values
x3 = df_anova[df_anova['group']=='trt2']['weight'].values
print(x1,x2,x3)

[4.17 5.58 5.18 6.11 4.5  4.61 5.17 4.53 5.33 5.14] [4.81 4.17 4.41 3.59 5.87 3.83 6.03 4.89 4.32 4.69] [6.31 5.12 5.54 5.5  5.37 5.29 4.92 6.15 5.8  5.26]
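The three boolean-mask lines above can also be written with `groupby`, which scales to any number of groups. This is a sketch, not the notebook's own method: the DataFrame is rebuilt here from the values printed above so the block runs on its own, and the names `weights`, `samples`, and `p_value` are assumptions introduced for the example.

```python
import pandas as pd
from scipy import stats

# Rebuild the weight/group table from the values printed above
weights = {
    'ctrl': [4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14],
    'trt1': [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    'trt2': [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26],
}
df = pd.DataFrame(
    [(w, g) for g, ws in weights.items() for w in ws],
    columns=['weight', 'group'],
)

# One array of weights per group, keyed by group name (groupby sorts the keys)
samples = {name: grp['weight'].to_numpy() for name, grp in df.groupby('group')}

# Unpack the arrays as separate positional arguments
F_stat, p_value = stats.f_oneway(*samples.values())
print(list(samples), F_stat, p_value)
```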


In [55]:
alpha = 0.05

In [56]:
# f_oneway returns the F-statistic and its p-value (upper-tail area of the F distribution)
F_stat, p_twosided = stats.f_oneway(x1, x2, x3)
print("F_statistic = ", F_stat, ", p = ", p_twosided)

F_statistic =  4.846087862380136 , p =  0.0159099583256229


In [57]:
if p_twosided < alpha:
    print("Reject null hypothesis")
else:
    print("Fail to reject null hypothesis")

Reject null hypothesis


At the 5% significance level, the three samples x1, x2, x3 do not all come from populations with the same mean: at least one group mean differs.

In [58]:
# Alternatively, we can compare F_stat with the critical value F_crit

In [59]:
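k = 3                                # number of groups
N = len(x1) + len(x2) + len(x3)      # total number of observations

In [60]:
df_b = k - 1    # between-group degrees of freedom
df_w = N - k    # within-group degrees of freedom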

In [61]:
F_crit = stats.f.ppf(1-alpha, df_b, df_w)
F_crit

3.3541308285291986

In [62]:
if F_stat >= F_crit:
    print("Reject null hypothesis.")
else:
    print("Fail to reject null hypothesis.")

Reject null hypothesis.


Same conclusion as with the p-value: at least one of the three samples x1, x2, x3 comes from a population with a different mean.
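The F-test indicates that some group mean differs, but not which one. One simple follow-up, sketched here as an assumption rather than as the notebook's own method, is to run pairwise t-tests with a Bonferroni correction (each p-value is multiplied by the number of comparisons). Tukey's HSD is a common alternative. The names `samples`, `pairs`, and `adjusted` are introduced only for this example.

```python
from itertools import combinations
from scipy import stats

# Weights per group, copied from the printed samples above
samples = {
    'ctrl': [4.17, 5.58, 5.18, 6.11, 4.5, 4.61, 5.17, 4.53, 5.33, 5.14],
    'trt1': [4.81, 4.17, 4.41, 3.59, 5.87, 3.83, 6.03, 4.89, 4.32, 4.69],
    'trt2': [6.31, 5.12, 5.54, 5.5, 5.37, 5.29, 4.92, 6.15, 5.8, 5.26],
}
alpha = 0.05
pairs = list(combinations(samples, 2))   # every unordered pair of group names

adjusted = {}
for a, b in pairs:
    # Student's t-test (equal variances assumed, as in ANOVA itself)
    _, p = stats.ttest_ind(samples[a], samples[b])
    # Bonferroni: scale by the number of comparisons, capped at 1
    adjusted[(a, b)] = min(1.0, p * len(pairs))

for pair, p_adj in adjusted.items():
    verdict = "means differ" if p_adj < alpha else "no evidence of a difference"
    print(pair, round(p_adj, 4), verdict)
```

With these data, only the trt1 vs trt2 comparison stays below 0.05 after the correction.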

### Question (slide)

Q. Three groups of samples of factory emissions of different plants of the same company were collected. The score is computed based on the composition of the emissions. We want to find out if there is any inconsistency or difference across the three groups.

A = 57,56,58,58,56,59,56,55,53,54,53,42,44,34,54,54,34,64,84,24

B = 49,47,49,47,49,47,49,46,45,46,41,42,41,42,42,42,14,14,34

C = 49,48,46,46,49,46,45,55,61,45,45,45,49,54,44,74,54,84,39

In [63]:
xa = [57,56,58,58,56,59,56,55,53,54,53,42,44,34,54,54,34,64,84,24]

xb = [49,47,49,47,49,47,49,46,45,46,41,42,41,42,42,42,14,14,34]

xc = [49,48,46,46,49,46,45,55,61,45,45,45,49,54,44,74,54,84,39]

In [64]:
H_0 = 'Factory emissions are the same across all plants.'
H_a = 'There is a significant difference across the three groups of plant emissions.'

In [65]:
alpha = 0.05

In [66]:
F_stat, p_twosided = stats.f_oneway(xa, xb, xc) 
print("F_statistic = ", F_stat, ", p = ", p_twosided)

F_statistic =  5.605295675427708 , p =  0.0060879156389451235


In [67]:
if p_twosided < alpha:
    print("Reject null hypothesis.", H_a)
else:
    print("Fail to reject null hypothesis.", H_0)

Reject null hypothesis. There is a significant difference across the three groups of plant emissions.


In [68]:
# Alternatively, compare F_stat with the critical value F_crit

In [69]:
N = len(xa) + len(xb) + len(xc) 
k = 3

In [70]:
df_b = k - 1    # between-group degrees of freedom
df_w = N - k    # within-group degrees of freedom
# df_t = N - 1  # total degrees of freedom

In [71]:
F_crit = stats.f.ppf(1-alpha,df_b, df_w) 
F_crit

3.1649933957687586

In [72]:
if F_stat >= F_crit:
    print("Reject null hypothesis.", H_a)
else:
    print("Fail to reject null hypothesis.", H_0)

Reject null hypothesis. There is a significant difference across the three groups of plant emissions.
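The p-value rule and the critical-value rule always give the same decision, because the p-value is the area to the right of `F_stat` under the F distribution with `(df_b, df_w)` degrees of freedom. The sketch below checks this on the emissions data; `p_value` and `p_from_F` are names introduced only for this example.

```python
from scipy import stats

xa = [57, 56, 58, 58, 56, 59, 56, 55, 53, 54, 53, 42, 44, 34, 54, 54, 34, 64, 84, 24]
xb = [49, 47, 49, 47, 49, 47, 49, 46, 45, 46, 41, 42, 41, 42, 42, 42, 14, 14, 34]
xc = [49, 48, 46, 46, 49, 46, 45, 55, 61, 45, 45, 45, 49, 54, 44, 74, 54, 84, 39]

alpha = 0.05
F_stat, p_value = stats.f_oneway(xa, xb, xc)

k = 3
N = len(xa) + len(xb) + len(xc)
df_b, df_w = k - 1, N - k

# Survival function = 1 - CDF: area to the right of F_stat
p_from_F = stats.f.sf(F_stat, df_b, df_w)
# Critical value: the F value with area alpha to its right
F_crit = stats.f.ppf(1 - alpha, df_b, df_w)

print(p_value, p_from_F)                   # the two p-values agree
print(F_stat >= F_crit, p_value < alpha)   # the two decision rules agree
```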


#### Two-way F-test: an extension of the one-way F-test, used when we have 2 independent variables (factors), each with 2 or more groups.
* The two-way F-test does not tell which variable is dominant or which specific groups differ. To check individual significance, post-hoc testing needs to be performed.
* Homework!

https://www.statology.org/two-way-anova-python/
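As a starting point for the homework, here is a minimal sketch of a two-way ANOVA computed by hand with NumPy and SciPy. It rests on several assumptions: the data are synthetic (generated below, not taken from any dataset in this notebook), the factor names A and B are placeholders, and the formulas used apply only to a balanced design (the same number of replicates in every cell).

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption, for illustration only):
# y[i, j, r] = r-th replicate at level i of factor A and level j of factor B
rng = np.random.default_rng(0)
a, b, n = 2, 3, 5                         # levels of A, levels of B, replicates
effect_a = np.array([0.0, 1.0])           # made-up effect of factor A
effect_b = np.array([0.0, 0.5, 1.0])      # made-up effect of factor B
y = (10 + effect_a[:, None, None] + effect_b[None, :, None]
     + rng.normal(0, 1, size=(a, b, n)))

grand = y.mean()
mean_a = y.mean(axis=(1, 2))              # mean per level of A
mean_b = y.mean(axis=(0, 2))              # mean per level of B
mean_cell = y.mean(axis=2)                # mean per (A, B) cell

# Sums of squares for main effects, interaction, and error
ss_a = b * n * ((mean_a - grand) ** 2).sum()
ss_b = a * n * ((mean_b - grand) ** 2).sum()
ss_ab = n * ((mean_cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
ss_err = ((y - mean_cell[:, :, None]) ** 2).sum()

dof = {'A': a - 1, 'B': b - 1, 'A:B': (a - 1) * (b - 1)}
dof_err = a * b * (n - 1)
ms_err = ss_err / dof_err

results = {}
for name, ss in [('A', ss_a), ('B', ss_b), ('A:B', ss_ab)]:
    F = (ss / dof[name]) / ms_err
    p = stats.f.sf(F, dof[name], dof_err)  # upper-tail p-value
    results[name] = (F, p)
    print(name, F, p)
```

Each row is its own F-test: one for factor A, one for factor B, and one for their interaction.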