# **Analysis of Variance (ANOVA)**

Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if there is a significant difference between them. It generalizes the t-test to more than two groups. It is a powerful tool for analyzing data and is used in a wide range of fields, including science, medicine, the social sciences, and engineering.

There are three main types of ANOVA: one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. In this lesson, we focus on one-way ANOVA, which is the most common type, and then look briefly at two-way ANOVA.

## **One-Way ANOVA**

One-way ANOVA is used when there is one independent variable (also known as a factor) with three or more groups.


**Assumptions of One-Way ANOVA**
1. **Normality**: The data should be normally distributed within each group. This assumption can be checked using a normal probability plot or a Shapiro-Wilk test. 
2. **Homogeneity of variance**: The variance of the dependent variable should be equal across all groups. This assumption can be checked using Levene's test or Bartlett's test.
3. **Independence**: The observations should be independent of each other. 
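
The first two assumptions can be checked in a few lines. The following is a minimal sketch, not a definitive procedure: it assumes three small groups of test scores (the same values as the teaching-method example in this lesson) and uses `scipy.stats.shapiro`, `scipy.stats.levene`, and `scipy.stats.bartlett`. The variable names are illustrative.

```python
import scipy.stats as stats

# Illustrative data: test scores for three teaching methods
groups = {
    "Method A": [85, 88, 90, 85, 87],
    "Method B": [78, 82, 84, 80, 81],
    "Method C": [90, 92, 94, 93, 95],
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {}
for name, scores in groups.items():
    w_stat, p = stats.shapiro(scores)
    shapiro_p[name] = p
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p:.3f}")

# Homogeneity of variance (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups.values())
bartlett_stat, bartlett_p = stats.bartlett(*groups.values())
print(f"Levene:   W = {levene_stat:.3f}, p = {levene_p:.3f}")
print(f"Bartlett: T = {bartlett_stat:.3f}, p = {bartlett_p:.3f}")
```

A p-value above the chosen significance level means the test found no evidence against the assumption. With only five observations per group these tests have little power, so a plot of the data is a useful complement.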


If these assumptions are not met, the results of ANOVA may not be reliable.

**Example**

**Research Question**: Do different teaching methods (Method A, Method B, and Method C) result in different average test scores among students?
1. **Hypotheses**
The null hypothesis (H0) in ANOVA is that there is no difference in the means of the groups. The alternative hypothesis (HA) is that at least one group's mean is different from the others.

2. **Test statistic**
The test statistic in ANOVA is the F-statistic, which is calculated as the ratio of the variance between groups to the variance within groups. The formula for the F-statistic is:
$$
F = \frac{\text{MSB}}{\text{MSW}}
$$
where:
- **MSB** is the mean square between groups
- **MSW** is the mean square within groups

3. Compare the F-statistic to the critical value from the F-distribution table, or use the p-value, to make a decision.
4. If the p-value is less than or equal to the significance level (typically α = 0.05), reject the null hypothesis. Otherwise, fail to reject it.
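
The steps above can be sketched by hand. The example below is an illustration under stated assumptions: it uses the teaching-method scores from this lesson, computes MSB as the between-group sum of squares divided by k − 1 and MSW as the within-group sum of squares divided by N − k, and takes the critical value and p-value from `scipy.stats.f`. The variable names are hypothetical.

```python
import numpy as np
import scipy.stats as stats

# Illustrative data: test scores for three teaching methods
groups = [
    np.array([85, 88, 90, 85, 87]),  # Method A
    np.array([78, 82, 84, 80, 81]),  # Method B
    np.array([90, 92, 94, 93, 95]),  # Method C
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Sums of squares between and within groups
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares and the F-statistic
df_between = k - 1
df_within = n_total - k
msb = ss_between / df_between
msw = ss_within / df_within
f_stat = msb / msw

# Decision: critical value and p-value from the F-distribution
alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)
reject_h0 = p_value <= alpha

print(f"MSB = {msb:.3f}, MSW = {msw:.3f}, F = {f_stat:.3f}")
print(f"Critical F = {f_crit:.3f}, p-value = {p_value:.3g}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

The hand-computed F-statistic should agree with the value returned by `stats.f_oneway` on the same data.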


**Post-hoc tests**

If ANOVA reveals a significant difference between the groups, post-hoc tests are needed to determine which specific groups differ from each other. These tests adjust for the multiple pairwise comparisons involved, which keeps the overall Type I error rate under control.

There are several post-hoc tests available, including Tukey's test, Dunnett's test, and Scheffé's test.
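
As an illustration, Tukey's test can be run with `pairwise_tukeyhsd` from `statsmodels`. This is a sketch that assumes the teaching-method scores from this lesson, stored as one array of scores and one array of group labels. The variable names are hypothetical.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Illustrative data: five scores per teaching method, in the order A, B, C
scores = np.array(
    [85, 88, 90, 85, 87, 78, 82, 84, 80, 81, 90, 92, 94, 93, 95],
    dtype=float,
)
methods = np.repeat(["A", "B", "C"], 5)

# Compare every pair of groups while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(tukey.summary())
```

The `reject` column of the summary shows, for each pair of groups, whether the difference in means is significant at the chosen level.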

- **How do you handle violations of the assumptions of one-way ANOVA?**
    - If the assumptions are violated, you can use alternative methods such as the Welch ANOVA (for unequal variances) or non-parametric tests like the Kruskal-Wallis test (for non-normal data).

- **You perform a one-way ANOVA and get an F-statistic of 4.5 with a p-value of 0.02. How do you interpret these results?**
    - The p-value of 0.02 is less than the significance level (α = 0.05), so you reject the null hypothesis. This indicates that at least one group mean differs significantly from the others. A post-hoc test is needed to identify which groups differ.

```python
import scipy.stats as stats

# Sample data: Test scores from three different teaching methods
method_A = [85, 88, 90, 85, 87]
method_B = [78, 82, 84, 80, 81]
method_C = [90, 92, 94, 93, 95]

# Collect the three groups into a list
data = [method_A, method_B, method_C]

# Perform one-way ANOVA
F_stat, p_value = stats.f_oneway(*data)
print(f"F-statistic: {F_stat}")
print(f"P-value: {p_value}")

# Decision based on significance level
alpha = 0.05
if p_value <= alpha:
    print("Reject the null hypothesis (H0).")
    print("There is a significant difference between the group means.")
else:
    print("Fail to reject the null hypothesis (H0).")
    print("There is not enough evidence of a difference between the group means.")
```

Output:

```text
F-statistic: 39.560606060606005
P-value: 5.216395924738024e-06
Reject the null hypothesis (H0).
There is a significant difference between the group means.
```
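
If the normality assumption were in doubt for these scores, the Kruskal-Wallis test mentioned earlier could be used instead. The sketch below assumes the same teaching-method data and uses `scipy.stats.kruskal`. Welch's ANOVA is not shown here.

```python
import scipy.stats as stats

# Same illustrative data as in the one-way ANOVA example
method_A = [85, 88, 90, 85, 87]
method_B = [78, 82, 84, 80, 81]
method_C = [90, 92, 94, 93, 95]

# Kruskal-Wallis H-test: a rank-based alternative to one-way ANOVA
h_stat, kw_p_value = stats.kruskal(method_A, method_B, method_C)
print(f"H-statistic: {h_stat:.3f}")
print(f"P-value: {kw_p_value:.4f}")
```

Because the test works on ranks rather than raw scores, it compares the distributions of the groups rather than their means directly.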


## **Two-Way ANOVA**

Two-way ANOVA is a statistical method used to analyze how two independent variables (also called factors) affect one dependent variable.

It is used to test whether there is a significant interaction between the two independent variables and whether there are any main effects of each independent variable on the dependent variable.

The two independent variables must be categorical. A continuous predictor would call for regression or ANCOVA instead. The dependent variable must be continuous, approximately normally distributed within each group, and have equal variances across all groups.

**Hypotheses in Two-Way ANOVA**

1. **Main Effect of Factor A:**
    - H0: The means of the dependent variable are equal across the levels of Factor A.
    - H1: The means of the dependent variable are not equal across the levels of Factor A.

2. **Main Effect of Factor B:**
    - H0: The means of the dependent variable are equal across the levels of Factor B.
    - H1: The means of the dependent variable are not equal across the levels of Factor B.

3. **Interaction Effect between Factor A and Factor B:**
    - H0: There is no interaction effect between Factor A and Factor B on the dependent variable.
    - H1: There is an interaction effect between Factor A and Factor B on the dependent variable.
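
These three pairs of hypotheses can be written against the usual two-way model. The notation below is an assumption introduced for illustration and is not used elsewhere in this lesson: $Y_{ijk}$ is the $k$-th observation at level $i$ of Factor A and level $j$ of Factor B, $\mu$ is the overall mean, $\alpha_i$ and $\beta_j$ are the main effects, $(\alpha\beta)_{ij}$ is the interaction effect, and $\varepsilon_{ijk}$ is random error.

$$
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}
$$

In this notation, the null hypothesis for Factor A is $\alpha_i = 0$ for all $i$, the null hypothesis for Factor B is $\beta_j = 0$ for all $j$, and the null hypothesis for the interaction is $(\alpha\beta)_{ij} = 0$ for all $i, j$.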

**Example**

**Research Question**: How do different diets (Diet A, Diet B) and different exercise programs (Exercise 1, Exercise 2) affect weight loss?

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = {
    'WeightLoss': [2, 4, 5, 3, 6, 8, 7, 6, 5, 4, 7, 8],
    'Diet': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'B', 'B'],
    'Exercise': ['1', '1', '2', '2', '1', '1', '2', '2', '1', '2', '1', '2']
}

df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('WeightLoss ~ C(Diet) + C(Exercise) + C(Diet):C(Exercise)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)
```

Output:

```text
                        sum_sq   df        F    PR(>F)
C(Diet)              30.083333  1.0  22.5625  0.001445
C(Exercise)           0.083333  1.0   0.0625  0.808887
C(Diet):C(Exercise)   0.083333  1.0   0.0625  0.808887
Residual             10.666667  8.0      NaN       NaN
```
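
In this table, only the Diet row has a small p-value (about 0.0014). Exercise and the Diet × Exercise interaction are both far from significant (p ≈ 0.81). One way to see what this means is to look at the group means directly. The sketch below reuses the sample data from the example and summarizes it with `pandas`. The `interaction_contrast` name is hypothetical and is only a descriptive check, not a test.

```python
import pandas as pd

# Same sample data as in the two-way ANOVA example
df = pd.DataFrame({
    "WeightLoss": [2, 4, 5, 3, 6, 8, 7, 6, 5, 4, 7, 8],
    "Diet": ["A", "A", "A", "A", "B", "B", "B", "B", "A", "A", "B", "B"],
    "Exercise": ["1", "1", "2", "2", "1", "1", "2", "2", "1", "2", "1", "2"],
})

# Mean weight loss for each Diet x Exercise cell
cell_means = df.pivot_table(
    values="WeightLoss", index="Diet", columns="Exercise", aggfunc="mean"
)
print(cell_means.round(2))

# Marginal means for each factor (these drive the main effects)
diet_means = df.groupby("Diet")["WeightLoss"].mean()
exercise_means = df.groupby("Exercise")["WeightLoss"].mean()
print(diet_means.round(2))
print(exercise_means.round(2))

# Difference in the exercise effect between the two diets;
# a value near zero is consistent with no interaction
interaction_contrast = (
    (cell_means.loc["B", "2"] - cell_means.loc["B", "1"])
    - (cell_means.loc["A", "2"] - cell_means.loc["A", "1"])
)
print(f"Interaction contrast: {interaction_contrast:.3f}")
```

The Diet means differ by about three units, while the Exercise means and the interaction contrast are close to zero, which matches the pattern of p-values in the ANOVA table.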
