<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#ANOVA" data-toc-modified-id="ANOVA-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>ANOVA</a></span></li><li><span><a href="#One-Way-F-test-(ANOVA)" data-toc-modified-id="One-Way-F-test-(ANOVA)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>One Way F-test (ANOVA)</a></span><ul class="toc-item"><li><span><a href="#Parameters" data-toc-modified-id="Parameters-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Parameters</a></span></li><li><span><a href="#Returns" data-toc-modified-id="Returns-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Returns</a></span></li><li><span><a href="#Notes" data-toc-modified-id="Notes-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Notes</a></span></li></ul></li><li><span><a href="#Two-Way-F-test" data-toc-modified-id="Two-Way-F-test-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Two Way F-test</a></span></li></ul></div>

# ANOVA
ANOVA (F-test): The t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we would have to compare the means of each level (group) of that variable. We could carry out a separate t-test for each pair of groups, but conducting many tests increases the chance of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.
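
To see why many pairwise t-tests inflate false positives, here is a rough sketch of the family-wise error rate. It assumes the tests are independent, which pairwise t-tests that share groups do not strictly satisfy, so treat the numbers as an illustration only. The helper name `familywise_error_rate` is made up for this example.

```python
from math import comb

def familywise_error_rate(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, which is only an approximation here.
    """
    n_tests = comb(n_groups, 2)            # one t-test per pair of groups
    fwer = 1 - (1 - alpha) ** n_tests      # P(at least one false positive)
    return n_tests, fwer

for k in (3, 5, 10):
    n_tests, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {n_tests} tests, family-wise error rate ~ {fwer:.3f}")
```

With 3 groups there are 3 pairwise tests, and the chance of at least one false positive at alpha = 0.05 is already about 0.143 under this assumption.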

F = Between group variability / Within group variability

![](images/F_test.png)

Unlike the z and t-distributions, the F-distribution has no negative values: both between-group and within-group variability are built from squared deviations, so neither can be negative and their ratio cannot be negative either.
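
The ratio above can be computed by hand in a few lines. This is a minimal sketch in plain Python on made-up numbers (the `groups` dictionary is hypothetical, not data from this notebook); the function name `one_way_f` is also just for illustration.

```python
from statistics import mean

# Hypothetical toy data: three groups of three measurements each.
groups = {"a": [1.0, 2.0, 3.0], "b": [2.0, 3.0, 4.0], "c": [5.0, 6.0, 7.0]}

def one_way_f(samples):
    """F = between-group mean square / within-group mean square."""
    all_values = [x for s in samples for x in s]
    grand = mean(all_values)
    k = len(samples)        # number of groups
    n = len(all_values)     # total number of observations

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand) ** 2 for s in samples)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

f_value = one_way_f(list(groups.values()))
print(f"F = {f_value:.3f}")  # F = 13.000
```

Every term in both sums is a square, which is why the statistic cannot go below zero.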

# One Way F-test (ANOVA)

It tells whether two or more groups are similar, based on how close their means are and on the F-score.
Example: given the weights of plants from 3 different categories, we need to check whether all 3 groups are similar or not.

The one-way ANOVA tests the null hypothesis that two or more groups have
the same population mean.  The test is applied to samples from two or
more groups, possibly with differing sizes.

Parameters
----------
sample1, sample2, ... : array_like
    The sample measurements for each group.

Returns
-------
statistic : float
    The computed F-value of the test.
pvalue : float
    The associated p-value from the F-distribution.

Notes
-----
The ANOVA test has important assumptions that must be satisfied in order
for the associated p-value to be valid.

1. The samples are independent.
2. Each sample is from a normally distributed population.
3. The population standard deviations of the groups are all equal.  This
   property is known as homoscedasticity.

If these assumptions are not true for a given set of data, it may still be
possible to use the Kruskal-Wallis H-test (`scipy.stats.kruskal`) although
with some loss of power.
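
The assumptions above can be checked roughly before trusting the p-value. The following is only a sketch on hypothetical samples (`g1`, `g2`, `g3` are made up, not the PlantGrowth data). It uses `scipy.stats.shapiro` for normality within each group, `scipy.stats.levene` for equal variances, and `scipy.stats.kruskal` as the fallback mentioned in the notes.

```python
from scipy import stats

# Hypothetical measurements for three groups (made-up values).
g1 = [4.2, 5.1, 4.8, 5.5, 4.9, 5.0]
g2 = [4.6, 4.1, 5.3, 4.4, 4.7, 4.0]
g3 = [5.9, 5.4, 6.1, 5.2, 5.8, 5.6]
samples = [g1, g2, g3]

# Assumption 2: normality within each group (Shapiro-Wilk test).
shapiro_pvalues = []
for s in samples:
    _, p_norm = stats.shapiro(s)
    shapiro_pvalues.append(p_norm)

# Assumption 3: equal variances across groups (Levene's test).
_, p_levene = stats.levene(*samples)

# Fallback if the assumptions look doubtful: Kruskal-Wallis H-test.
_, p_kruskal = stats.kruskal(*samples)

print("Shapiro p-values:", [round(p, 3) for p in shapiro_pvalues])
print(f"Levene p-value  : {p_levene:.3f}")
print(f"Kruskal p-value : {p_kruskal:.3f}")
```

A small p-value from Shapiro-Wilk or Levene suggests the corresponding assumption may not hold. Independence of the samples (assumption 1) comes from the study design and cannot be checked this way.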


In [5]:
import numpy as np
import pandas as pd
import seaborn as sns
import os,sys,time

from scipy import stats
from scipy.stats import ttest_1samp
from scipy.stats import ttest_ind # independent means two samples.
from statsmodels.stats import weightstats as stests # stests.ztest

SEED = 100
pd.set_option('display.max_columns',100) # full option key; the short form can be ambiguous in newer pandas
pd.set_option('plotting.backend','plotly') # matplotlib, bokeh, altair, plotly
%load_ext watermark
%watermark -iv

json     2.0.9
numpy    1.18.4
seaborn  0.11.0
pandas   1.1.0
autopep8 1.5.2



In [16]:
df = pd.read_csv('data/PlantGrowth.csv',index_col=0)
print(df.shape)
df.head()

(30, 2)


|   | weight | group |
|---|--------|-------|
| 1 | 4.17 | ctrl |
| 2 | 5.58 | ctrl |
| 3 | 5.18 | ctrl |
| 4 | 6.11 | ctrl |
| 5 | 4.5 | ctrl |


In [10]:
grps = df['group'].unique()
grps

array(['ctrl', 'trt1', 'trt2'], dtype=object)

In [12]:
dic_data = {grp:df['weight'][df['group'] == grp] for grp in grps}

In [13]:
F, p = stats.f_oneway(dic_data['ctrl'], dic_data['trt1'], dic_data['trt2'])

In [15]:
print(f"p-value for significance is: {p:.6f}")
if p<0.05:
    print("reject null hypothesis")
else:
    print("fail to reject null hypothesis")

p-value for significance is: 0.015910
reject null hypothesis


# Two Way F-test
The two-way F-test is an extension of the one-way F-test. It is used when we have 2 independent variables and 2 or more groups. The two-way F-test does not tell which variable is dominant; if we need to check individual significance, post-hoc testing needs to be performed.
Now let’s take a look at the grand mean crop yield (the mean crop yield not split by any sub-group), as well as the mean crop yield by each factor and by the factors grouped together.
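
The grand mean and the per-factor means can be sketched with `groupby`. The DataFrame `toy` below is a hypothetical stand-in that only borrows the column names `Fert`, `Water` and `Yield`; its values are made up and are not the contents of `data/crop_yield.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the crop yield data (made-up values).
toy = pd.DataFrame({
    "Fert":  ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Water": ["High", "High", "Low", "Low", "High", "High", "Low", "Low"],
    "Yield": [30.0, 34.0, 26.0, 28.0, 28.0, 30.0, 22.0, 26.0],
})

# Grand mean: mean yield ignoring every grouping.
grand_mean = toy["Yield"].mean()

# Mean yield for each level of a single factor.
by_fert = toy.groupby("Fert")["Yield"].mean()
by_water = toy.groupby("Water")["Yield"].mean()

# Mean yield for each combination of the two factors.
by_cell = toy.groupby(["Fert", "Water"])["Yield"].mean()

print(f"grand mean: {grand_mean}")
print(by_fert)
print(by_water)
print(by_cell)
```

The same four lines should work on the real data once it is loaded into `df`, assuming it has these three columns.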


In [20]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [29]:
df = pd.read_csv("data/crop_yield.csv")
print(f"shape       : {df.shape}")
print(f"fertilizers : {df['Fert'].unique()}")
print(f"water       : {df['Water'].unique()}")
display(df.head())

# in statsmodels formulas, C(...) marks a variable as categorical
model = smf.ols('Yield ~ C(Fert)*C(Water)', df).fit()

print(f"Overall model F({model.df_model: .0f},{model.df_resid: .0f}) = {model.fvalue: .3f}, p = {model.f_pvalue: .4f}")

# in the results, df is the degrees of freedom.
res = sm.stats.anova_lm(model, typ=2)
res

shape       : (20, 3)
fertilizers : ['A' 'B']
water       : ['High' 'Low']


|   | Fert | Water | Yield |
|---|------|-------|-------|
| 0 | A | High | 27.4 |
| 1 | A | High | 33.6 |
| 2 | A | High | 29.8 |
| 3 | A | High | 35.2 |
| 4 | A | High | 33.0 |


Overall model F( 3, 16) =  4.112, p =  0.0243


|   | sum_sq | df | F | PR(>F) |
|---|--------|----|---|--------|
| C(Fert) | 69.192 | 1.0 | 5.766 | 0.028847 |
| C(Water) | 63.368 | 1.0 | 5.280667 | 0.035386 |
| C(Fert):C(Water) | 15.488 | 1.0 | 1.290667 | 0.272656 |
| Residual | 192.0 | 16.0 |  |  |
