# The Complete Guide to ANOVA (Analysis of Variance) for Data Science

## 1. What is ANOVA?
ANOVA is a statistical hypothesis test used to check whether the **means** of three or more independent groups differ significantly.

* **Why not just use T-tests?**
    If you have 3 groups (A, B, C) and run three separate T-tests (A vs B, B vs C, A vs C), the false-positive risk compounds. A single T-test has a 5% chance of a false positive ($\alpha = 0.05$). Across three tests, the chance of at least one false positive rises to $1 - 0.95^3 \approx 14\%$ (treating the tests as independent).
    * **ANOVA Solution:** It compares all group means in a single test, so the overall error rate stays at 5%.

* **Use Cases in Data Science:**
    * **Feature Selection:** Determining if a categorical feature (e.g., *Region*) statistically impacts a numerical target (e.g., *House Price*).
    * **A/B/C Testing:** Comparing multiple versions of a webpage or algorithm.
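
The roughly 14% figure for three T-tests can be reproduced with a few lines of Python. This is a minimal sketch that assumes the tests are independent, which is only approximately true for pairwise tests on shared groups; the helper name `familywise_error_rate` is illustrative, not a library function.

```python
from math import comb


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


# The number of pairwise T-tests grows quickly with the number of groups
for n_groups in (3, 4, 5):
    n_pairs = comb(n_groups, 2)  # every unordered pair of groups
    rate = familywise_error_rate(0.05, n_pairs)
    print(f"{n_groups} groups -> {n_pairs} tests -> error rate {rate:.1%}")
```

The more groups you compare pairwise, the faster the overall false-positive risk grows, which is the motivation for a single omnibus test.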

---

## 2. The Hypothesis

* **Null Hypothesis ($H_0$):** All group means are equal.
    $$\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$$
    *(Translation: The groups are essentially the same; any difference is due to luck.)*

* **Alternative Hypothesis ($H_1$):** At least one group mean is different.
    *(Translation: The category/grouping actually matters.)*

---

## 3. The Logic (F-Statistic)
ANOVA compares two types of variation using the **F-Ratio**:

$$F = \frac{\text{Variance BETWEEN Groups}}{\text{Variance WITHIN Groups}}$$

* **Variance Between Groups:** How different the group means are from the global mean (Signal).
* **Variance Within Groups:** How spread out the data is inside each group (Noise).

**Interpretation:**
* **High F-value:** Signal > Noise. The groups are distinct. If F exceeds the critical value (equivalently, $p < \alpha$), reject $H_0$.
* **Low F-value:** Noise > Signal. The groups overlap. If F stays below the critical value, fail to reject $H_0$.
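
To make the ratio concrete, the sketch below computes F by hand and checks it against `scipy.stats.f_oneway`. It reuses the website-design numbers from Section 5; the variable names are illustrative. "Variance between" and "variance within" are taken here as mean squares: sums of squares divided by their degrees of freedom.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([25, 30, 28, 35, 29], dtype=float),  # design A
    np.array([45, 55, 60, 50, 58], dtype=float),  # design B
    np.array([30, 32, 31, 33, 29], dtype=float),  # design C
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Signal: how far each group mean sits from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# p-value: upper tail of the F distribution with (k-1, N-k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)
# Critical value that F must exceed at alpha = 0.05
f_crit = stats.f.ppf(0.95, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.2f}, scipy F = {f_scipy:.2f}, critical F = {f_crit:.2f}")
```

The manual and SciPy values should agree, which shows that `f_oneway` is computing this same ratio.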

---

## 4. Assumptions (Crucial for Interviews)
If your data violates these assumptions, your ANOVA results may be unreliable.

1.  **Normality:** Residuals should be normally distributed (Shapiro-Wilk Test).
2.  **Homogeneity of Variance:** Variances in each group must be roughly equal (Levene's Test).
3.  **Independence:** Samples must be independent (Random sampling).

*Note: If data is NOT normal, use the **Kruskal-Wallis H Test** (Non-parametric ANOVA).*
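
The checks named above are all available in `scipy.stats`. The following is a minimal sketch using the website-design numbers from Section 5; the variable names and the 0.05 cutoff are illustrative choices. Independence cannot be tested this way; it depends on how the data was collected.

```python
import numpy as np
from scipy import stats

design_a = np.array([25, 30, 28, 35, 29], dtype=float)
design_b = np.array([45, 55, 60, 50, 58], dtype=float)
design_c = np.array([30, 32, 31, 33, 29], dtype=float)
groups = [design_a, design_b, design_c]

# 1. Normality: Shapiro-Wilk on the residuals
#    (each observation minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# 2. Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f} (p < 0.05 suggests non-normal residuals)")
print(f"Levene p       = {levene_p:.3f} (p < 0.05 suggests unequal variances)")

# Fallback: Kruskal-Wallis H test, which does not assume normality
h_stat, kruskal_p = stats.kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {kruskal_p:.4f}")
```

With samples this small, both checks have little power, so a non-significant result is weak evidence that the assumption holds.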

---

## 5. Implementation in Python

### A. One-Way ANOVA (Scipy)
Comparing one categorical factor against a numerical target.

```python
import scipy.stats as stats

# Example: Website Time Spent (seconds) for 3 different Designs
design_A = [25, 30, 28, 35, 29]
design_B = [45, 55, 60, 50, 58]
design_C = [30, 32, 31, 33, 29]

# 1. Run the test
f_stat, p_value = stats.f_oneway(design_A, design_B, design_C)

print(f"F-Stat: {f_stat:.2f} | P-Value: {p_value:.5f}")

# 2. Interpret
alpha = 0.05
if p_value < alpha:
    print("REJECT Null Hypothesis: At least one design is significantly different.")
else:
    print("FAIL TO REJECT: No significant difference found.")
```

### B. Post-Hoc Analysis (Tukey's HSD)
ANOVA tells you *that* there is a difference, but not *where*. Use Tukey's HSD to find which specific pairs of groups differ.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Prepare Data
df = pd.DataFrame({
    'score': design_A + design_B + design_C,
    'group': ['A']*5 + ['B']*5 + ['C']*5
})

# Run Tukey
tukey = pairwise_tukeyhsd(endog=df['score'], groups=df['group'], alpha=0.05)
print(tukey)
```

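**Sample Output (abridged):**
```text
group1 group2  meandiff   p-adj    reject
-----------------------------------------
     A      B      24.2  <0.001      True  <-- Significant Diff
     A      C       1.6   >0.05     False  <-- No Diff
     B      C     -22.6  <0.001      True  <-- Significant Diff
```

`meandiff` is the group2 mean minus the group1 mean. The actual printout also includes confidence-interval columns and exact `p-adj` values.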

---

## 6. Feature Selection (Scikit-Learn)
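In ML pipelines, use ANOVA (`f_classif`) to score and select features. The roles are reversed relative to Section 1: the features are numerical and the target is categorical, so each feature gets an F-score measuring how well it separates the classes.

```python
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif

# X = Numerical Features, y = Categorical Target
# (the wine dataset is only a stand-in so the snippet runs; use your own X, y)
X, y = load_wine(return_X_y=True)

# Keep the 5 features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# Get scores for each feature
print(selector.scores_)
```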
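
To see that `f_classif` is the same one-way ANOVA applied feature by feature, the sketch below compares its scores with `scipy.stats.f_oneway` on synthetic data. The data is made up purely for illustration: one feature whose mean shifts with the class and one pure-noise feature.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 3 classes, 30 samples each
y = np.repeat([0, 1, 2], 30)
informative = rng.normal(loc=y * 2.0, scale=1.0)  # mean depends on the class
noise = rng.normal(size=y.size)                   # unrelated to the class
X = np.column_stack([informative, noise])

# scikit-learn: one F-score and p-value per feature
f_scores, p_values = f_classif(X, y)

# SciPy: split each feature by class and run a one-way ANOVA on it
f_manual = [
    stats.f_oneway(*(X[y == c, j] for c in np.unique(y))).statistic
    for j in range(X.shape[1])
]

print("f_classif scores:", f_scores)
print("f_oneway scores :", f_manual)
```

The informative feature should receive a much larger F-score than the noise feature, which is why `SelectKBest` would rank it first.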
