
# ANOVA - Lab

## Introduction

In this lab, you'll get some brief practice generating an ANOVA table (AOV) and interpreting its output. You'll also perform some investigations to compare the method to the t-tests you previously employed to conduct hypothesis testing.

## Objectives

In this lab you will: 

- Use ANOVA for testing multiple pairwise comparisons 
- Interpret results of an ANOVA and compare them to a t-test

## Load the data

Start by loading in the data stored in the file `'ToothGrowth.csv'`: 

In [1]:
# Your code here
import pandas as pd

# Load the ToothGrowth.csv dataset
df = pd.read_csv('ToothGrowth.csv')

# Display the first few rows to verify
print(df.head())

    len supp  dose
0   4.2   VC   0.5
1  11.5   VC   0.5
2   7.3   VC   0.5
3   5.8   VC   0.5
4   6.4   VC   0.5


## Generate the ANOVA table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length:

In [2]:
# Your code here
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit a two-way ANOVA model
# Formula: len ~ supp + dose + supp:dose (main effects and interaction)
model = ols('len ~ C(supp) + C(dose) + C(supp):C(dose)', data=df).fit()

# Generate the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                      sum_sq    df          F        PR(>F)
C(supp)           205.350000   1.0  15.571979  2.311828e-04
C(dose)          2426.434333   2.0  91.999965  4.046291e-18
C(supp):C(dose)   108.319000   2.0   4.106991  2.186027e-02
Residual          712.106000  54.0        NaN           NaN


## Interpret the output

Make a brief comment regarding the statistics and the effect of supplement and dosage on tooth length: 

**Interpretation:**

- **Supplement (supp):** The p-value (2.31e-04) is less than 0.05, indicating a statistically significant main effect of supplement type (OJ vs. VC) on tooth length once dosage is included in the model. OJ appears to promote greater tooth growth on average.
- **Dosage (dose):** The p-value (4.05e-18) is far below 0.05, showing a highly significant effect of dosage (0.5, 1.0, 2.0 mg) on tooth length. Higher doses are associated with increased tooth length.
- **Interaction (supp:dose):** The p-value (0.0219) is less than 0.05, suggesting a significant interaction between supplement and dosage: the effect of dosage on tooth length depends on the supplement type. The ANOVA table alone does not show the direction of this interaction; comparing the mean tooth length for each supplement-dose group would be needed to say at which doses OJ or VC is more effective.
- **Recommendations:** Flatiron Health Insurance (FHI) could use these findings to inform nutritional supplement strategies, emphasizing higher doses and potentially favoring OJ for certain dosages to maximize tooth growth in relevant applications.
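
To make the table easier to read, here is a minimal sketch of how each F-statistic and p-value follows from the `sum_sq` and `df` columns. It assumes the standard fixed-effects calculation (mean square of the effect divided by the residual mean square) and uses the numbers printed in the table above rather than the fitted model; the `rows` and `results` names are illustrative only.

```python
from scipy import stats

# Sum of squares and degrees of freedom, copied from the ANOVA table above
rows = {
    "C(supp)": (205.350000, 1),
    "C(dose)": (2426.434333, 2),
    "C(supp):C(dose)": (108.319000, 2),
}
ss_resid, df_resid = 712.106000, 54

# Residual mean square: the within-group variance estimate
ms_resid = ss_resid / df_resid

results = {}
for name, (ss, dof) in rows.items():
    f_stat = (ss / dof) / ms_resid                 # effect mean square / residual mean square
    p_val = stats.f.sf(f_stat, dof, df_resid)      # upper tail of the F distribution
    results[name] = (f_stat, p_val)
    print(f"{name}: F = {f_stat:.4f}, p = {p_val:.3e}")
```

The printed values should agree with the `F` and `PR(>F)` columns up to rounding of the copied inputs.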

## Compare to t-tests

Now that you've had a chance to generate an ANOVA table, it's interesting to compare the results to those from the t-tests you were working with earlier. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterward, you'll conduct a t-test to compare the tooth length of these two samples: 

In [3]:
# Your code here
from scipy.stats import ttest_ind

# Split data into OJ and VC groups
oj = df[df['supp'] == 'OJ']['len']
vc = df[df['supp'] == 'VC']['len']

Now run a t-test between these two groups and print the associated two-sided p-value: 

In [4]:
# Calculate the 2-sided p-value for a t-test comparing the two supplement groups
# (equal_var=True gives the pooled-variance Student's t-test)
t_stat, p_value = ttest_ind(oj, vc, equal_var=True)
print("T-test results for OJ vs. VC:")
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")

T-test results for OJ vs. VC:
T-statistic: 1.9153, P-value: 0.0604


## A 2-Category ANOVA F-test is equivalent to a 2-tailed t-test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is the same as performing a 2-tailed t-test! So, the p-value in the table should be identical to your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to floating-point rounding differences between implementations. 

In [5]:
# Your code here; conduct an ANOVA F-test of the oj and vc supplement groups.
# Compare the p-value to that of the t-test above. 
# They should match (there may be a tiny fractional difference due to rounding errors in varying implementations)

# Fit a one-way ANOVA model with only supplement
model_supp = ols('len ~ C(supp)', data=df).fit()

# Generate the ANOVA table
anova_supp = sm.stats.anova_lm(model_supp, typ=2)
print("ANOVA results for supplement only:")
print(anova_supp)

# Compare p-value to t-test
t_test_p_value = p_value  # From previous t-test
anova_p_value = anova_supp.loc['C(supp)', 'PR(>F)']  # label-based lookup; [0] on a labeled Series is deprecated
print(f"\nT-test p-value: {t_test_p_value:.4f}")
print(f"ANOVA p-value: {anova_p_value:.4f}")

ANOVA results for supplement only:
               sum_sq    df         F    PR(>F)
C(supp)    205.350000   1.0  3.668253  0.060393
Residual  3246.859333  58.0       NaN       NaN

T-test p-value: 0.0604
ANOVA p-value: 0.0604
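
The two p-values agree because, with only two groups, the F-statistic is the square of the pooled-variance t-statistic; in the output above, 1.9153² ≈ 3.668, which matches the `F` column. The sketch below checks the same relationship on synthetic data. The arrays `a` and `b` are made-up samples (not the ToothGrowth measurements), and `scipy.stats.f_oneway` is used here as a stand-in for the `statsmodels` table.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Two synthetic samples; the means and spreads are arbitrary, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(loc=20.0, scale=7.0, size=30)
b = rng.normal(loc=17.0, scale=8.0, size=30)

# Pooled-variance two-sided t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(a, b, equal_var=True)
f_stat, f_p = f_oneway(a, b)

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
print(f"t-test p = {t_p:.6f}, ANOVA p = {f_p:.6f}")
```

Note that the equivalence relies on `equal_var=True`; Welch's t-test (`equal_var=False`) does not, in general, reproduce the ANOVA p-value.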


## Run multiple t-tests

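While the 2-category ANOVA test is identical to a 2-tailed t-test, performing multiple t-tests leads to the multiple comparisons problem: each additional test is another chance of a false positive, so the probability of at least one spurious "significant" result grows with the number of tests. To investigate this, look at the various sample groups you could create from the 2 features: 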

In [10]:
for group_name, _ in df.groupby(['supp', 'dose']):
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)


Although it is bad practice, examine the effects of calculating multiple t-tests across the various combinations of these groups. To do this, generate all pairwise combinations of the above groups and, for each pair, calculate the p-value of a 2-sided t-test. Print each group combination with its associated p-value.

In [None]:
# Your code here; reuse your t-test code above to calculate the p-value for a 2-sided t-test
# for all combinations of the supplement-dose groups listed above. 
# (Since there isn't a control group, compare each group to every other group.)
from itertools import combinations
from scipy.stats import ttest_ind

# Map each (supp, dose) pair to its tooth-length measurements
groups = df.groupby(['supp', 'dose'])['len']
group_data = {name: values for name, values in groups}
group_names = list(group_data)

# Generate all pairwise combinations
pairwise_combinations = list(combinations(group_names, 2))

# Perform t-tests for each pair
print("Pairwise t-test results:")
for group1, group2 in pairwise_combinations:
    data1 = group_data[group1]
    data2 = group_data[group2]
    t_stat, p_value = ttest_ind(data1, data2, equal_var=True)
    print(f"{group1} vs. {group2}: P-value = {p_value:.4f}")
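
With six groups, the loop above runs 15 pairwise tests. The sketch below estimates how quickly the chance of at least one false positive grows. It assumes the tests are independent, which they are not here (each group appears in several pairs), so treat the figure as a rough approximation. It also shows a Bonferroni-adjusted threshold, one common and conservative correction that this lab does not otherwise cover.

```python
from math import comb

n_groups = 6     # supplement-dose groups listed above
alpha = 0.05     # per-test significance level

# Number of distinct pairs among the groups
n_tests = comb(n_groups, 2)

# Probability of at least one false positive if all null hypotheses were true
# (approximation: assumes independent tests)
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni: divide alpha by the number of tests
bonferroni_alpha = alpha / n_tests

print(f"Pairwise tests: {n_tests}")
print(f"Approximate family-wise error rate: {fwer:.3f}")
print(f"Bonferroni-adjusted threshold: {bonferroni_alpha:.4f}")
```

Under this approximation, more than half of such 15-test families would contain at least one false positive, which is why a single ANOVA (or a corrected threshold) is preferred over uncorrected pairwise t-tests.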

## Summary

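In this lab, you implemented the ANOVA technique to generalize testing methods to multiple groups and factors. You also confirmed that a 2-category ANOVA F-test matches a 2-tailed t-test, and saw why running many pairwise t-tests raises the multiple comparisons problem.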