# ANOVA (Analysis of Variance) Example

The steps below compute the quantities needed to perform a one-way ANOVA (Analysis of Variance).

Note that in the context of ANOVA, a "group" is one level (category) of the independent variable, which is also called the factor.

1. State the hypotheses:

    - $H_0:$ There is no statistically significant difference between the performance of the models,
    - $H_1:$ There is a statistically significant difference between the performance of the models.
    
<br/>

2. Calculate the mean of observations:

    - $\bar{X} = \dfrac{1}{N} \displaystyle\sum_{i=1}^{k} \displaystyle\sum_{j=1}^{n_i} X_{ij}$ <br/>

    where: <br/>
    - $N$ is the total number of observations <br/>
    - $k$ is the number of groups <br/>
    - $n_i$ is the number of observations in group $i$ <br/>
    - $X_{ij}$ is the $j$th observation in group $i$ <br/>

<br/>
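As a small illustration of the grand mean, the sketch below uses three hypothetical groups of unequal size (the data and variable names are made up for this example). It shows that $\bar{X}$ pools all $N$ observations, so with unequal $n_i$ it generally differs from the plain average of the group means.

```python
import numpy as np

# Hypothetical scores for k = 3 groups with unequal sizes (illustrative data only)
groups = [
    np.array([4.0, 5.0, 6.0]),        # n_1 = 3
    np.array([7.0, 9.0]),             # n_2 = 2
    np.array([1.0, 2.0, 3.0, 4.0]),   # n_3 = 4
]

N = sum(len(g) for g in groups)                 # total number of observations
grand_mean = sum(g.sum() for g in groups) / N   # double sum over i and j, divided by N

# For comparison: the unweighted average of the k group means
mean_of_group_means = np.mean([g.mean() for g in groups])

print("N =", N)
print("grand mean =", grand_mean)
print("mean of group means =", mean_of_group_means)
```

The two values coincide only when every group has the same number of observations.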

3. Calculate the sum of squares between groups ($SS_{\text{between}}$):

    - $SS_{\text{between}} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2$ <br/>

    where: <br/>
    - $\bar{X}_i$ is the mean of the observations in group $i$ <br/>

<br/>

4. Calculate the sum of squares within groups ($SS_{\text{within}}$):

    - $SS_{\text{within}} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar{X}_i)^2$

<br/>
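The two sums of squares can be computed directly from their definitions. The sketch below uses hypothetical data and illustrative variable names. It also checks the standard decomposition $SS_{\text{total}} = SS_{\text{between}} + SS_{\text{within}}$, which is a convenient sanity check on a hand calculation.

```python
import numpy as np

# Hypothetical scores for k = 3 groups (illustrative data only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 9.0]),
    np.array([1.0, 2.0, 3.0, 4.0]),
]

all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# Between groups: squared distance of each group mean from the grand mean,
# weighted by the group size n_i
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within groups: squared distance of each observation from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Total: squared distance of each observation from the grand mean
ss_total = ((all_obs - grand_mean) ** 2).sum()

print("SS_between =", ss_between)
print("SS_within  =", ss_within)
print("SS_total   =", ss_total)
```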

5. Calculate the mean square between groups (MSB):

    - $MSB = \dfrac{SS_{\text{between}}}{k-1}$

<br/>

6. Calculate the mean square within groups (MSW):

    - $MSW = \dfrac{SS_{\text{within}}}{N-k}$

<br/>

7. Calculate the F-statistic:

    - $F = \dfrac{MSB}{MSW}$

<br/>

8. Determine the critical F-value based on the significance level ($\alpha$) and the degrees of freedom ($df_1 = k-1$ and $df_2 = N-k$) using a statistical table or software.

<br/>

9. Compare the calculated F-value with the critical F-value. If the calculated F-value is greater than the critical F-value, reject the null hypothesis; otherwise, fail to reject it. If the null hypothesis is rejected, a post-hoc test can be used to identify which groups are significantly different from each other.
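The mean squares, the F-statistic, and the decision rule can be sketched end to end as follows, with $F$ computed as MSB divided by MSW. The data, the choice of $\alpha = 0.05$, and the variable names are assumptions made for illustration. The critical value comes from SciPy's F distribution, and `scipy.stats.f_oneway` serves as an independent cross-check of the manual calculation.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for k = 3 groups (illustrative data only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 9.0]),
    np.array([1.0, 2.0, 3.0, 4.0]),
]

k = len(groups)                       # number of groups
N = sum(len(g) for g in groups)       # total number of observations
grand_mean = np.concatenate(groups).mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df1, df2 = k - 1, N - k               # degrees of freedom
msb = ss_between / df1                # mean square between groups
msw = ss_within / df2                 # mean square within groups
f_stat = msb / msw                    # F-statistic

alpha = 0.05                                  # assumed significance level
f_crit = stats.f.ppf(1 - alpha, df1, df2)     # critical F-value
p_value = stats.f.sf(f_stat, df1, df2)        # upper-tail probability of F

# Cross-check against SciPy's built-in one-way ANOVA
f_ref, p_ref = stats.f_oneway(*groups)

reject_h0 = f_stat > f_crit
print("F =", f_stat, "critical F =", f_crit, "p =", p_value)
print("SciPy F =", f_ref, "SciPy p =", p_ref)
print("Reject H0:", reject_h0)
```

Comparing $F$ with the critical value and comparing the p-value with $\alpha$ lead to the same decision.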


In [2]:
# Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [5]:
# Fix the random seed so the synthetic data is reproducible
np.random.seed(1)

# Generate synthetic data, D = {X1, X2, X3, Y}
N = 100  # Number of samples
X1 = np.random.normal(0.1, 1, N) + 0.01 * np.random.normal(0, 1, N)  # Independent variable with added noise
X2 = np.random.normal(-0.2, 1.5, N) + 0.01 * np.random.normal(0, 1, N)  # Additional predictor with added noise
X3 = np.random.normal(0.3, 2, N) + 0.01 * np.random.normal(0, 1, N)  # Additional predictor with added noise
Y = X1 + X2 + X3 + np.random.normal(0, 1, N)  # Dependent variable

In [6]:
# Create a dataframe
df = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'Y': Y})

# Fit the linear regression model 1
model_lm1 = ols('Y ~ X1 + X2', data=df).fit()

# Fit the linear regression model 2
model_lm2 = ols('Y ~ X1 + X3', data=df).fit()

# Compute residuals for linear regression model 1
residuals_lm1 = df['Y'] - model_lm1.fittedvalues

# Compute residuals for linear regression model 2
residuals_lm2 = df['Y'] - model_lm2.fittedvalues

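# Compare the two fitted models with an F-test.
# Note: this comparison assumes nested models; these two have the same number
# of parameters, so df_diff is 0 and the F-statistic is not meaningful here.
anova_results = sm.stats.anova_lm(model_lm1, model_lm2)

# Attach residuals as extra columns. The assignment aligns on the row index,
# so only the residuals of observations 0 and 1 end up in the two-row table.
anova_results['Residuals_lm1'] = residuals_lm1
anova_results['Residuals_lm2'] = residuals_lm2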

# Drop rows with missing values in ss_diff and F
anova_results.dropna(subset=['ss_diff', 'F'], inplace=True)

print(anova_results)

   df_resid         ssr  df_diff     ss_diff    F  Pr(>F)  Residuals_lm1  \
1      97.0  362.171161     -0.0  230.922702 -inf     NaN       0.227269   

   Residuals_lm2  
1       1.634136  
