# Lab 9: ANOVA Test
### Author: Aimal Khan (aimalexe)
### Objective
Perform one-way and two-way ANOVA tests on problems from the lecture slides.

In [66]:
# import the required libraries
import numpy as np
import pandas as pd
from scipy.stats import f

## **Function: one_way_anova**

### **Description**
Performs a One-Way Analysis of Variance (ANOVA) to determine if there are statistically significant differences between the means of two or more independent groups. The function provides a step-by-step calculation of the ANOVA components and outputs detailed results.

---

### **Parameters**
- **`data` (dict)**: A dictionary where:
  - Keys represent group names (e.g., `"Group A"`).
  - Values are lists of numerical observations for each group.

- **`alpha` (float, optional)**: Significance level for hypothesis testing (default: `0.05`).

---

### **Returns**
A dictionary containing:
- **Group Means**: The mean of observations for each group.
- **Grand Mean**: The overall mean of all observations combined.
- **ANOVA Results**:
  - **SSB (Between-Group Variation)**: Variation due to differences between group means.
  - **SSW (Within-Group Variation)**: Variation within each group.
  - **F-statistic**: Ratio of `MSB` to `MSW`.
  - **p-value**: Probability of observing an F-statistic at least as large as the computed one if the null hypothesis is true.
  - **F-critical**: Critical value of F for the given alpha and degrees of freedom.
  - **Conclusion**: Outcome of the hypothesis test.

---

### **Steps and Formulas**
1. **Organize Data**:  
   - Extract group names and observations.

2. **Calculate Descriptive Statistics**:  
   - **Group Means**:  
     $$
     \text{Mean for each group } i: \mu_i = \frac{\sum_{j=1}^{n_i} X_{ij}}{n_i}
     $$
   - **Grand Mean**:  
     $$
     \text{Grand Mean: } \mu_G = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}}{N}
     $$  
     where $k$ = number of groups, $n_i$ = size of group $i$, $N$ = total observations.

3. **Compute Sum of Squares**:
   - **Between-Group Variation (SSB)**:  
     $$
     \text{SSB} = \sum_{i=1}^{k} n_i (\mu_i - \mu_G)^2
     $$
   - **Within-Group Variation (SSW)**:  
     $$
     \text{SSW} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \mu_i)^2
     $$

4. **Calculate Degrees of Freedom**:
   - **Between Groups**: $df_{\text{between}} = k - 1$
   - **Within Groups**: $df_{\text{within}} = N - k$

5. **Mean Squares (MS)**:
   - **Between-Group Mean Square (MSB)**:  
     $$
     \text{MSB} = \frac{\text{SSB}}{df_{\text{between}}}
     $$
   - **Within-Group Mean Square (MSW)**:  
     $$
     \text{MSW} = \frac{\text{SSW}}{df_{\text{within}}}
     $$

6. **F-Statistic**:  
   $$
   F = \frac{\text{MSB}}{\text{MSW}}
   $$

7. **Calculate p-value**:  
   Using the F-distribution:
   $$
   p = 1 - F_{\text{cdf}}(F, df_{\text{between}}, df_{\text{within}})
   $$

8. **Determine F-critical Value**:  
   Using the inverse of the F-distribution:
   $$
   F_{\text{critical}} = F_{\text{ppf}}(1 - \alpha, df_{\text{between}}, df_{\text{within}})
   $$

9. **Hypothesis Test Conclusion**:  
   - Reject $H_0$ (Null Hypothesis) if $F > F_{\text{critical}}$, which is equivalent to $p < \alpha$: at least one group mean differs.
   - Fail to Reject $H_0$ otherwise: the data show no significant difference between group means.
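
Step 7 computes the upper tail as `1 - cdf`. As a side note, SciPy's `f.sf` (the survival function) returns the same tail probability directly and tends to keep precision when the F-statistic is very large. The sketch below compares the two forms and also checks that the F-critical rule and the p-value rule agree. The numbers (`9.13`, `1e6`, and the degrees of freedom) are illustrative assumptions, not values prescribed by the lecture.

```python
from scipy.stats import f

# Illustrative values only (assumed for this sketch)
alpha = 0.05
df_between, df_within = 2, 12

# For a moderate F-statistic the two forms agree closely
f_moderate = 9.13
p_naive = 1 - f.cdf(f_moderate, df_between, df_within)
p_sf = f.sf(f_moderate, df_between, df_within)
print(p_naive, p_sf)

# The two decision rules lead to the same conclusion
f_critical = f.ppf(1 - alpha, df_between, df_within)
print(f_moderate > f_critical, p_sf < alpha)

# For a very large F-statistic, 1 - cdf can lose the tail to rounding,
# while sf still returns a small positive probability
f_large = 1e6
p_naive_large = 1 - f.cdf(f_large, df_between, df_within)
p_sf_large = f.sf(f_large, df_between, df_within)
print(p_naive_large, p_sf_large)
```

The functions in this lab keep the `1 - cdf` form from the formula above; for the lecture problems the difference is negligible.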


In [67]:
def one_way_anova(data, alpha=0.05):
    """
    Perform One-Way ANOVA step-by-step.

    Parameters:
    data (dict): A dictionary where keys are group names and values are lists of observations.
    alpha (float): Significance level for hypothesis testing (default: 0.05).

    Returns:
    dict: A dictionary containing ANOVA results, including F-statistic, p-value, and hypothesis test conclusion.
    """
    # Step 1: Organize the data
    groups = list(data.keys())
    observations = list(data.values())

    # Step 2: Descriptive statistics
    group_means = {group: np.mean(values) for group, values in data.items()}
    total_observations = sum(len(values) for values in observations)
    grand_mean = np.mean([value for values in observations for value in values])

    # Step 3: Compute ANOVA without using f_oneway
    # Calculate SSB (Between-Group Variation)
    ssb = sum(len(values) * (np.mean(values) - grand_mean) ** 2 for values in observations)

    # Calculate SSW (Within-Group Variation)
    ssw = sum(sum((value - np.mean(values)) ** 2 for value in values) for values in observations)

    # Degrees of Freedom
    df_between = len(groups) - 1
    df_within = total_observations - len(groups)

    # Mean Squares
    ms_between = ssb / df_between
    ms_within = ssw / df_within

    # F-Statistic
    f_statistic = ms_between / ms_within

    # Step 4: Calculate p-value (using F-distribution)
    p_value = 1 - f.cdf(f_statistic, df_between, df_within)

    # Step 5: Determine F-critical value
    f_critical = f.ppf(1 - alpha, df_between, df_within)

    # Step 6: Hypothesis Test Conclusion
    if f_statistic > f_critical:
        conclusion = "Reject H0: At least one group mean is significantly different."
    else:
        conclusion = "Fail to Reject H0: No significant difference among group means."

    # Step 7: Prepare detailed results
    results = {
        "Group Means": group_means,
        "Grand Mean": grand_mean,
        "ANOVA Results": {
            "SSB (Between-Group Variation)": ssb,
            "SSW (Within-Group Variation)": ssw,
            "F-statistic": f_statistic,
            "p-value": p_value,
            "F-critical": f_critical,
            "Conclusion": conclusion
        }
    }

    return results

## **Function: two_way_anova**

### **Description**
Performs a Two-Way Analysis of Variance (ANOVA) to determine if there are statistically significant effects of two independent factors and their interaction on a dependent variable. The function calculates the ANOVA components step-by-step, including F-values, p-values, F-critical values, and hypothesis test conclusions.

---

### **Parameters**
- **`data` (dict)**: A nested dictionary where:
  - Outer keys represent levels of Factor A (e.g., `"Diet A"`, `"Diet B"`).
  - Inner keys represent levels of Factor B (e.g., `"No Exercise"`, `"Cardio"`).
  - Values are lists of numerical observations (replications) for each combination of Factor A and Factor B.

- **`alpha` (float, optional)**: Significance level for hypothesis testing (default: `0.05`).

---

### **Returns**
A Pandas DataFrame containing:
- **Source**: Source of variation (Factor A, Factor B, Interaction, Error, Total).
- **SS**: Sum of Squares for each source.
- **df**: Degrees of Freedom for each source.
- **MS**: Mean Squares (SS divided by df).
- **F**: F-statistic for Factor A, Factor B, and Interaction.
- **F-critical**: Critical F-value for hypothesis testing at the given alpha level.
- **p-value**: Probability of observing the F-statistic under the null hypothesis.
- **Conclusion**: Result of hypothesis testing (Reject or Fail to Reject $H_0$).

---

### **Steps and Formulas**

#### **1. Organize Data**
- Flatten data into cells for each combination of Factor A and Factor B.

#### **2. Compute Descriptive Statistics**
- **Grand Total $G$**:  
  $$
  G = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} X_{ijk}
  $$
- **Grand Mean $\mu_G$**:  
  $$
  \mu_G = \frac{G}{N}
  $$  
  where $a$ = levels of Factor A, $b$ = levels of Factor B, and $n_{cell}$ = replications per cell. The design is assumed to be balanced, so $n_{ij} = n_{cell}$ for every cell and $N = a \cdot b \cdot n_{cell}$.

#### **3. Compute Sum of Squares**
- **Total Sum of Squares (SST)**:  
  $$
  SST = \sum_{i=1}^{a} \sum_{j=1}^{b} \sum_{k=1}^{n_{ij}} (X_{ijk} - \mu_G)^2
  $$

- **Sum of Squares for Factor A (SSA)**:  
  $$
  SSA = \frac{1}{b \cdot n_{cell}} \sum_{i=1}^{a} T_i^2 - \frac{G^2}{N}
  $$  
  where $T_i$ = sum of observations for level $i$ of Factor A.

- **Sum of Squares for Factor B (SSB)**:  
  $$
  SSB = \frac{1}{a \cdot n_{cell}} \sum_{j=1}^{b} T_j^2 - \frac{G^2}{N}
  $$  
  where $T_j$ = sum of observations for level $j$ of Factor B.

- **Sum of Squares for Interaction (SSI)**:  
  $$
  SSI = \frac{1}{n_{cell}} \sum_{i=1}^{a} \sum_{j=1}^{b} T_{ij}^2 - SSA - SSB - \frac{G^2}{N}
  $$  
  where $T_{ij}$ = sum of observations in the $(i, j)$ cell.

- **Error Sum of Squares (SSE)**:  
  $$
  SSE = SST - SSA - SSB - SSI
  $$

#### **4. Compute Degrees of Freedom**
- $df_A = a - 1$  
- $df_B = b - 1$  
- $df_{interaction} = (a - 1)(b - 1)$  
- $df_E = N - a \cdot b$  
- $df_T = N - 1$
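
As a quick consistency check, the four component degrees of freedom add up to the total:

$$
df_A + df_B + df_{interaction} + df_E = (a - 1) + (b - 1) + (a - 1)(b - 1) + (N - a \cdot b) = N - 1 = df_T
$$

For the design in Task 3 ($a = 2$, $b = 3$, $N = 18$) this reads $1 + 2 + 2 + 12 = 17$.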

#### **5. Compute Mean Squares**
- **Mean Square for Factor A (MSA)**:  
  $$
  MSA = \frac{SSA}{df_A}
  $$
- **Mean Square for Factor B (MSB)**:  
  $$
  MSB = \frac{SSB}{df_B}
  $$
- **Mean Square for Interaction (MSI)**:  
  $$
  MSI = \frac{SSI}{df_{interaction}}
  $$
- **Mean Square for Error (MSE)**:  
  $$
  MSE = \frac{SSE}{df_E}
  $$

#### **6. Calculate F-Statistics**
- **For Factor A**:  
  $$
  F_A = \frac{MSA}{MSE}
  $$
- **For Factor B**:  
  $$
  F_B = \frac{MSB}{MSE}
  $$
- **For Interaction**:  
  $$
  F_{interaction} = \frac{MSI}{MSE}
  $$

#### **7. Determine F-critical Values**
- **For Factor A**:  
  $$
  F_{\text{critical},A} = F_{\text{ppf}}(1 - \alpha, df_A, df_E)
  $$
- **For Factor B**:  
  $$
  F_{\text{critical},B} = F_{\text{ppf}}(1 - \alpha, df_B, df_E)
  $$
- **For Interaction**:  
  $$
  F_{\text{critical},interaction} = F_{\text{ppf}}(1 - \alpha, df_{interaction}, df_E)
  $$

#### **8. Hypothesis Test Conclusion**
- Reject $H_0$ if $F > F_{\text{critical}}$, which is equivalent to $p < \alpha$: the effect is significant.  
- Fail to Reject $H_0$ otherwise: no significant effect is detected.

In [68]:
def two_way_anova(data, alpha=0.05):
    """
    Perform a two-way ANOVA test.

    Parameters:
    data (dict): Nested dictionary with structure {
                    "Factor A Level 1": {
                        "Factor B Level 1": [values],
                        "Factor B Level 2": [values],
                        ...
                    },
                    ...
                }
    alpha (float): Significance level (default=0.05).

    Returns:
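    pd.DataFrame: ANOVA table with F-values, p-values, F-critical values, and conclusions.

    Raises:
    ValueError: If any cell has fewer than 2 replications, if the cells do not all have
                the same number of replications, or if the error degrees of freedom are not positive.
    """
    # Extract the factor levels and the (common) cell size
    factors_a = list(data.keys())
    factors_b = list(next(iter(data.values())).keys())

    a = len(factors_a)  # Number of levels for Factor A
    b = len(factors_b)  # Number of levels for Factor B
    n_cell = len(next(iter(next(iter(data.values())).values())))  # Replications per cell

    if n_cell < 2:
        raise ValueError("Two-Way ANOVA requires at least 2 replications per cell for valid calculations.")

    # The formulas below assume a balanced design (equal replications in every cell)
    if any(len(values) != n_cell for levels in data.values() for values in levels.values()):
        raise ValueError("Two-Way ANOVA requires the same number of replications in every cell.")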

    # Compute grand total and total number of observations
    grand_total = sum(sum(sum(values) for values in level.values()) for level in data.values())
    N = a * b * n_cell
    
    # Compute grand mean
    grand_mean = grand_total / N

    # Compute row totals (Factor A)
    row_totals = {
        factor_a: sum(sum(values) for values in levels.values())
        for factor_a, levels in data.items()
    }

    # Compute column totals (Factor B)
    col_totals = {
        factor_b: sum(data[factor_a][factor_b][i] for factor_a in factors_a for i in range(n_cell))
        for factor_b in factors_b
    }

    # Compute cell totals
    cell_totals = {
        (factor_a, factor_b): sum(values)
        for factor_a, levels in data.items()
        for factor_b, values in levels.items()
    }

    # Compute SSA (Factor A)
    SSA = (1 / (b * n_cell)) * sum(t ** 2 for t in row_totals.values()) - (grand_total ** 2) / N

    # Compute SSB (Factor B)
    SSB = (1 / (a * n_cell)) * sum(t ** 2 for t in col_totals.values()) - (grand_total ** 2) / N

    # Compute SSI (Interaction)
    SSI = (1 / n_cell) * sum(t ** 2 for t in cell_totals.values()) - SSA - SSB - (grand_total ** 2) / N

    # Compute SST (Total Sum of Squares)
    SST = sum((x - grand_mean) ** 2 for factor_a in data.values() for values in factor_a.values() for x in values)

    # Compute SSE (Error)
    SSE = SST - SSA - SSB - SSI

    # Degrees of freedom
    df_A = a - 1
    df_B = b - 1
    df_interaction = (a - 1) * (b - 1)
    df_E = N - (a * b)
    df_T = N - 1

    if df_E <= 0:
        raise ValueError("Degrees of freedom for error (df_E) must be greater than zero. Increase replications per cell.")

    # Mean squares
    MS_A = SSA / df_A
    MS_B = SSB / df_B
    MS_interaction = SSI / df_interaction
    MS_E = SSE / df_E

    # F-values
    F_A = MS_A / MS_E
    F_B = MS_B / MS_E
    F_interaction = MS_interaction / MS_E

    # F-critical values
    F_critical_A = f.ppf(1 - alpha, df_A, df_E)
    F_critical_B = f.ppf(1 - alpha, df_B, df_E)
    F_critical_interaction = f.ppf(1 - alpha, df_interaction, df_E)

    # p-values
    p_A = 1 - f.cdf(F_A, df_A, df_E)
    p_B = 1 - f.cdf(F_B, df_B, df_E)
    p_interaction = 1 - f.cdf(F_interaction, df_interaction, df_E)

    # Conclusions based on F-critical
    conclusion_A = "Reject H0" if F_A > F_critical_A else "Fail to Reject H0"
    conclusion_B = "Reject H0" if F_B > F_critical_B else "Fail to Reject H0"
    conclusion_interaction = "Reject H0" if F_interaction > F_critical_interaction else "Fail to Reject H0"

    # Create ANOVA table
    anova_table = pd.DataFrame({
        "Source": ["Factor A", "Factor B", "Interaction", "Error", "Total"],
        "SS": [SSA, SSB, SSI, SSE, SST],
        "df": [df_A, df_B, df_interaction, df_E, df_T],
        "MS": [MS_A, MS_B, MS_interaction, MS_E, None],
        "F": [F_A, F_B, F_interaction, None, None],
        "F-critical": [F_critical_A, F_critical_B, F_critical_interaction, None, None],
        "p-value": [p_A, p_B, p_interaction, None, None],
        "Conclusion": [conclusion_A, conclusion_B, conclusion_interaction, None, None]
    })

    return anova_table



## **Function: print_nested_dict**

### **Description**
Prints a nested dictionary (`dict` containing sub-dictionaries) in a structured, readable format. Entries of a sub-dictionary are indented by one tab. Only one level of nesting is handled, which is all the output of `one_way_anova` needs.

In [69]:
def print_nested_dict(nested_dict):
    """
    Prints a nested dictionary in a structured format with proper indentation for sub-dictionaries.

    Parameters:
    nested_dict (dict): The dictionary to be printed. Can contain sub-dictionaries.

    Returns:
    None
    """
    for key, value in nested_dict.items():
        if isinstance(value, dict):
            print(f"{key}:")
            for sub_key, sub_value in value.items():
                print(f"\t{sub_key}: {sub_value}")
        else:
            print(f"{key}: {value}")
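
If deeper nesting ever needs to be displayed, a recursive variant is one option. The sketch below is an assumption layered on top of this lab, not part of the original helper: `format_nested_dict` is a hypothetical name, and it returns a string instead of printing so that the result is easy to inspect.

```python
def format_nested_dict(nested_dict, indent=0):
    """Return a tab-indented rendering of a dict nested to any depth (hypothetical helper)."""
    lines = []
    prefix = "\t" * indent
    for key, value in nested_dict.items():
        if isinstance(value, dict):
            lines.append(f"{prefix}{key}:")
            # Recurse one level deeper for sub-dictionaries
            nested = format_nested_dict(value, indent + 1)
            if nested:  # skip empty sub-dictionaries
                lines.append(nested)
        else:
            lines.append(f"{prefix}{key}: {value}")
    return "\n".join(lines)


# Made-up example with two levels of nesting
example = {"Grand Mean": 4.0, "ANOVA Results": {"SS": {"SSB": 70.0, "SSW": 46.0}}}
print(format_nested_dict(example))
```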


## Task 1
Lecture 11, Slide 13

You want to see if three different golf clubs yield different distances. You randomly select five measurements from trials on an automated driving machine for each club. At the 0.05 significance level, is there a difference in mean distance?

In [70]:
golf_data = {
    "Club 1": [254, 241, 263, 237, 251],
    "Club 2": [234, 218, 235, 227, 216],
    "Club 3": [200, 222, 197, 206, 204]
}

golf_results = one_way_anova(golf_data)
print_nested_dict(golf_results)

Group Means:
	Club 1: 249.2
	Club 2: 226.0
	Club 3: 205.8
Grand Mean: 227.0
ANOVA Results:
	SSB (Between-Group Variation): 4716.399999999995
	SSW (Within-Group Variation): 1119.6
	F-statistic: 25.275455519828483
	p-value: 4.985235046039982e-05
	F-critical: 3.8852938346523946
	Conclusion: Reject H0: At least one group mean is significantly different.
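
The step-by-step result can be cross-checked against SciPy's built-in `scipy.stats.f_oneway`, which the function above deliberately avoids. This is only a sanity check on the same golf data; the variable name `check` is introduced here for illustration.

```python
from scipy.stats import f_oneway

golf_data = {
    "Club 1": [254, 241, 263, 237, 251],
    "Club 2": [234, 218, 235, 227, 216],
    "Club 3": [200, 222, 197, 206, 204],
}

# f_oneway takes each group's observations as a separate argument
check = f_oneway(*golf_data.values())
print(check.statistic, check.pvalue)
```

The F-statistic and p-value should agree with the manual calculation up to floating-point rounding.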


## Task 2
Lecture 11, Slide 17

In [71]:
groups_data = {
    "Group 1": [0, 6, 2, 4, 3],
    "Group 2": [1, 4, 3, 2, 0],
    "Group 3": [5, 6, 10, 8, 6]
}

groups_results = one_way_anova(groups_data)
print_nested_dict(groups_results)

Group Means:
	Group 1: 3.0
	Group 2: 2.0
	Group 3: 7.0
Grand Mean: 4.0
ANOVA Results:
	SSB (Between-Group Variation): 70.0
	SSW (Within-Group Variation): 46.0
	F-statistic: 9.130434782608695
	p-value: 0.0038886517793826902
	F-critical: 3.8852938346523946
	Conclusion: Reject H0: At least one group mean is significantly different.
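
Because the numbers in this task are small, the output can be verified by hand with the formulas from the one-way section. Each group has $n_i = 5$ observations, with $k = 3$ and $N = 15$:

$$
\text{SSB} = 5\left[(3 - 4)^2 + (2 - 4)^2 + (7 - 4)^2\right] = 5\,(1 + 4 + 9) = 70
$$

$$
\text{SSW} = 20 + 10 + 16 = 46
$$

$$
F = \frac{\text{SSB} / (k - 1)}{\text{SSW} / (N - k)} = \frac{70 / 2}{46 / 12} = \frac{35}{3.8333} \approx 9.13
$$

Here 20, 10, and 16 are the sums of squared deviations from the group mean within Groups 1, 2, and 3.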


## Task 3
A researcher wants to study the effects of **Diet Type** (Factor A) and **Exercise Type** (Factor B) on weight loss.  
- **Factor A (Diet Type):** 2 levels (Diet A, Diet B).  
- **Factor B (Exercise Type):** 3 levels (No Exercise, Cardio, Strength).  
- **Response Variable:** Weight loss (in kg).  
- **Replications:** 3 individuals for each combination of Diet and Exercise.

### Data
The weight loss data (in kg) is as follows:

| **Diet**  | **No Exercise** | **Cardio** | **Strength** |
|-----------|-----------------|------------|--------------|
| **Diet A** | 1, 2, 1         | 4, 5, 6    | 3, 4, 3      |
| **Diet B** | 2, 1, 3         | 5, 6, 5    | 4, 4, 6      |

In [72]:
# Example usage
dietData = {
    "Diet A": {
        "No Exercise": [1, 2, 1],
        "Cardio": [4, 5, 6],
        "Strength": [3, 4, 3]
    },
    "Diet B": {
        "No Exercise": [2, 1, 3],
        "Cardio": [5, 6, 5],
        "Strength": [4, 4, 6]
    }
}

dietResult = two_way_anova(dietData)
print(dietResult)

        Source         SS  df         MS          F  F-critical   p-value  \
0     Factor A   2.722222   1   2.722222   3.769231    4.747225  0.076052   
1     Factor B  38.111111   2  19.055556  26.384615    3.885294  0.000040   
2  Interaction   0.777778   2   0.388889   0.538462    3.885294  0.597110   
3        Error   8.666667  12   0.722222        NaN         NaN       NaN   
4        Total  50.277778  17        NaN        NaN         NaN       NaN   

          Conclusion  
0  Fail to Reject H0  
1          Reject H0  
2  Fail to Reject H0  
3               None  
4               None  
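
The sums of squares in this table can be cross-checked with the equivalent deviation-from-mean form, using a NumPy array instead of the totals-based formulas. The sketch below is an alternative calculation, not the method used inside `two_way_anova`; the array layout `(a, b, n)` and all variable names are assumptions made for this example.

```python
import numpy as np

# Task 3 data as an array of shape (a, b, n): diet x exercise x replication
x = np.array([
    [[1, 2, 1], [4, 5, 6], [3, 4, 3]],   # Diet A: No Exercise, Cardio, Strength
    [[2, 1, 3], [5, 6, 5], [4, 4, 6]],   # Diet B: No Exercise, Cardio, Strength
], dtype=float)
a, b, n = x.shape

grand_mean = x.mean()
mean_a = x.mean(axis=(1, 2))   # one mean per diet
mean_b = x.mean(axis=(0, 2))   # one mean per exercise type
mean_cell = x.mean(axis=2)     # one mean per (diet, exercise) cell

ss_a = b * n * np.sum((mean_a - grand_mean) ** 2)
ss_b = a * n * np.sum((mean_b - grand_mean) ** 2)
ss_cells = n * np.sum((mean_cell - grand_mean) ** 2)
ss_i = ss_cells - ss_a - ss_b                      # interaction
ss_e = np.sum((x - mean_cell[:, :, None]) ** 2)    # within-cell (error)
ss_t = np.sum((x - grand_mean) ** 2)

print(ss_a, ss_b, ss_i, ss_e, ss_t)
```

The error sum of squares is computed directly from within-cell deviations here, so the identity `ss_a + ss_b + ss_i + ss_e == ss_t` (up to rounding) serves as an independent check.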


## Task 4
Lecture 12, Page 7.

A study is conducted to analyze the study hours of four students (A, B, C, D) over three days (Monday, Tuesday, and Wednesday). The goal is to determine:  
1. If there is a significant difference in the study hours across the days.  
2. If there is a significant difference in the study hours among the students.  
3. If there is a significant interaction between the day and the student affecting the study hours.  

The results will indicate whether there are significant main effects for **Days** (Factor A), **Students** (Factor B), and their interaction.

In [73]:
studyDataFromPage = {
    "Monday": {"A": [2], "B": [3], "C": [4], "D": [5]},
    "Tuesday": {"A": [4], "B": [5], "C": [6], "D": [7]},
    "Wednesday": {"A": [6], "B": [7], "C": [8], "D": [9]},
}
try:
    studyResultFromPage = two_way_anova(studyDataFromPage)
    print(studyResultFromPage)
except ValueError as e:
    print(f"Error: {e}")

Error: Two-Way ANOVA requires at least 2 replications per cell for valid calculations.


In [74]:
studyDataIncreasedReplication = {
    "Monday": {"A": [2, 3], "B": [3, 4], "C": [4, 5], "D": [5, 6]},
    "Tuesday": {"A": [4, 5], "B": [5, 6], "C": [6, 7], "D": [7, 8]},
    "Wednesday": {"A": [6, 7], "B": [7, 8], "C": [8, 9], "D": [9, 10]},
}

try:
    studyResultIncreasedReplication = two_way_anova(studyDataIncreasedReplication)
    print(studyResultIncreasedReplication)
except ValueError as e:
    print(f"Error: {e}")


        Source     SS  df    MS     F  F-critical       p-value  \
0     Factor A   64.0   2  32.0  64.0    3.885294  3.965695e-07   
1     Factor B   30.0   3  10.0  20.0    3.490295  5.818935e-05   
2  Interaction    0.0   6   0.0   0.0    2.996120  1.000000e+00   
3        Error    6.0  12   0.5   NaN         NaN           NaN   
4        Total  100.0  23   NaN   NaN         NaN           NaN   

          Conclusion  
0          Reject H0  
1          Reject H0  
2  Fail to Reject H0  
3               None  
4               None  


## Task 5
Lecture 12, Page 9

The provided data are scores recorded under three water temperature conditions (Cold, Warm, Hot) and three detergent types (A, B, C), with one observation per combination.

In [75]:
tempData = {
    "Cold": {"A": [47], "B": [45], "C": [50]},
    "Warm": {"A": [39], "B": [42], "C": [52]},
    "Hot": {"A": [44], "B": [36], "C": [48]},
}

try:
    tempResult = two_way_anova(tempData)
    print(tempResult)
except ValueError as e:
    print(f"Error: {e}")


Error: Two-Way ANOVA requires at least 2 replications per cell for valid calculations.
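
With a single observation per cell, the interaction cannot be separated from the error, which is why `two_way_anova` refuses the data. One common alternative is a two-way ANOVA without replication: assuming there is no interaction, the residual after removing both main effects serves as the error term, with $(a - 1)(b - 1)$ degrees of freedom. The sketch below applies that model to the original Task 5 data. It is a different model from the one implemented in `two_way_anova`, and the variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import f

# Task 5 data, one observation per cell: rows = temperature, columns = detergent A, B, C
x = np.array([
    [47, 45, 50],   # Cold
    [39, 42, 52],   # Warm
    [44, 36, 48],   # Hot
], dtype=float)
a, b = x.shape
alpha = 0.05
grand_mean = x.mean()

ss_a = b * np.sum((x.mean(axis=1) - grand_mean) ** 2)   # temperature
ss_b = a * np.sum((x.mean(axis=0) - grand_mean) ** 2)   # detergent
ss_t = np.sum((x - grand_mean) ** 2)
ss_e = ss_t - ss_a - ss_b   # residual; absorbs any interaction

df_a, df_b = a - 1, b - 1
df_e = df_a * df_b

ms_e = ss_e / df_e
f_a = (ss_a / df_a) / ms_e
f_b = (ss_b / df_b) / ms_e
p_a = f.sf(f_a, df_a, df_e)
p_b = f.sf(f_b, df_b, df_e)
f_crit_a = f.ppf(1 - alpha, df_a, df_e)
f_crit_b = f.ppf(1 - alpha, df_b, df_e)

print(f"Temperature: F = {f_a:.3f}, F-critical = {f_crit_a:.3f}, p = {p_a:.4f}")
print(f"Detergent:   F = {f_b:.3f}, F-critical = {f_crit_b:.3f}, p = {p_b:.4f}")
```

Under this no-interaction assumption, neither F-statistic exceeds its critical value, so neither main effect is significant at the 0.05 level for the unreplicated data.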


In [76]:
tempDataIncreasedReplication = {
    "Cold": {"A": [47, 48], "B": [45, 46], "C": [50, 51]},
    "Warm": {"A": [39, 40], "B": [42, 43], "C": [52, 53]},
    "Hot": {"A": [44, 45], "B": [36, 37], "C": [48, 49]},
}

try:
    tempResultIncreasedReplication = two_way_anova(tempDataIncreasedReplication)
    print(tempResultIncreasedReplication)
except ValueError as e:
    print(f"Error: {e}")

        Source          SS  df          MS           F  F-critical  \
0     Factor A   67.111111   2   33.555556   67.111111    4.256495   
1     Factor B  261.777778   2  130.888889  261.777778    4.256495   
2  Interaction   98.222222   4   24.555556   49.111111    3.633089   
3        Error    4.500000   9    0.500000         NaN         NaN   
4        Total  431.611111  17         NaN         NaN         NaN   

        p-value Conclusion  
0  3.908809e-06  Reject H0  
1  1.060350e-08  Reject H0  
2  4.087683e-06  Reject H0  
3           NaN       None  
4           NaN       None  
