# Regression as Hypothesis Testing

Linear regression is not just a predictive tool.
It is a **generalized hypothesis testing framework**.

In this notebook, we show that:
- each regression coefficient corresponds to a hypothesis test
- t-tests and ANOVA are special cases of regression
- regression allows controlled, multivariate inference


### 🟦 Imports & Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

from src.data_generation import generate_student_dataset

sns.set_theme(style="whitegrid")

df = generate_student_dataset(n=4000, random_state=42)


### 🟦 Why Regression?

Earlier notebooks tested one factor at a time.

Regression allows us to:
- test multiple hypotheses simultaneously
- control for confounding variables
- estimate adjusted effects
- quantify uncertainty with confidence intervals

This is how hypothesis tests are typically run in applied research, where several factors vary at once.


## 🟩 Part I: Simple Linear Regression

### Research Question

Does **study time** influence exam score?

We model:

$$
\text{score} = \beta_0 + \beta_1 \cdot \text{study\_hours} + \varepsilon
$$


### 🟦 Hypotheses (Coefficient Test)

For the slope coefficient:

$$
H_0: \beta_1 = 0
$$

$$
H_1: \beta_1 \neq 0
$$

This tests whether study hours have any linear association with score.


### 🟦 Fit Simple Regression

In [None]:
model_simple = smf.ols("score ~ study_hours", data=df).fit()
model_simple.summary()


#### Interpretation

- The slope coefficient is the expected change in score
  per additional study hour
- The t-statistic and p-value test $ H_0: \beta_1 = 0 $
- A narrow confidence interval indicates precise estimation

The test has the same form as a one-sample t-test: the estimated slope
divided by its standard error, compared against a t distribution with
$ n - 2 $ degrees of freedom.
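
The slope test can be reproduced by hand. The sketch below is a minimal illustration on simulated stand-in data (the notebook's `df` is not used, and the numbers are made up): it computes the slope, its standard error, and the t-statistic from the textbook formulas, then cross-checks them against `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data; names mirror the notebook, values are simulated.
rng = np.random.default_rng(0)
n = 200
study_hours = rng.uniform(0, 10, size=n)
score = 50 + 2.0 * study_hours + rng.normal(0, 5, size=n)

# OLS estimates for score = b0 + b1 * study_hours
x_centered = study_hours - study_hours.mean()
b1 = np.sum(x_centered * (score - score.mean())) / np.sum(x_centered**2)
b0 = score.mean() - b1 * study_hours.mean()

# Residual variance uses n - 2 degrees of freedom (two estimated parameters)
resid = score - (b0 + b1 * study_hours)
dof = n - 2
sigma2 = np.sum(resid**2) / dof
se_b1 = np.sqrt(sigma2 / np.sum(x_centered**2))

# t-statistic and two-sided p-value for H0: beta_1 = 0
t_stat = b1 / se_b1
p_value = 2 * stats.t.sf(abs(t_stat), dof)

# Cross-check against SciPy's simple linear regression
ref = stats.linregress(study_hours, score)
print("t by hand:", t_stat, "| t from linregress:", ref.slope / ref.stderr)
print("p by hand:", p_value, "| p from linregress:", ref.pvalue)
```

The same quantities appear in the `coef`, `std err`, `t`, and `P>|t|` columns of a statsmodels summary table.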


In [None]:
plt.figure(figsize=(6, 4))
sns.regplot(x="study_hours", y="score", data=df, line_kws={"color": "red"})
plt.title("Exam Score vs Study Hours")
plt.show()


## 🟩 Part II: Multiple Linear Regression


To isolate effects, we include multiple predictors:

$$
\text{score} =
\beta_0
+ \beta_1 \cdot \text{study\_hours}
+ \beta_2 \cdot \text{attendance\_rate}
+ \beta_3 \cdot \text{previous\_gpa}
+ \varepsilon
$$

Each coefficient corresponds to a **separate hypothesis test**.


### 🟦 Fit Multivariate Model

In [None]:
model_multi = smf.ols(
    "score ~ study_hours + attendance_rate + previous_gpa",
    data=df
).fit()

model_multi.summary()


#### Coefficient Interpretation

For each predictor $ X_i $:

$$
H_0: \beta_i = 0
$$

$$
H_1: \beta_i \neq 0
$$

Interpretation example:
> Holding attendance and GPA constant, one additional study hour
> is associated with a change of $ \beta_1 $ points in the expected exam score.
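
To see what "holding constant" buys us, here is a small sketch on simulated data where the true study-hours effect is known. The data-generating numbers are assumptions chosen for illustration, not taken from the notebook's dataset. Because study hours are made to depend on GPA, the unadjusted slope absorbs part of the GPA effect, while the adjusted slope recovers the true value.

```python
import numpy as np

# Hypothetical simulation: true effect of study_hours on score is 1.5
rng = np.random.default_rng(1)
n = 5000
previous_gpa = rng.normal(3.0, 0.5, size=n)
# Study hours depend on GPA, so the two predictors are correlated
study_hours = 2.0 * previous_gpa + rng.normal(0, 1.0, size=n)
score = 40 + 1.5 * study_hours + 8.0 * previous_gpa + rng.normal(0, 3.0, size=n)


def ols(columns, y):
    """Least-squares coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Index 1 is the study_hours coefficient in both fits
unadjusted = ols([study_hours], score)[1]
adjusted = ols([study_hours, previous_gpa], score)[1]

print("unadjusted slope:", unadjusted)  # biased upward by the omitted GPA
print("adjusted slope:  ", adjusted)    # close to the true value 1.5
```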


## 🟩 Part III: Regression and ANOVA Connection

### Regression with Categorical Variables

ANOVA is a special case of regression using indicator variables.

We include teaching method as a categorical predictor:

$$
\text{score} = \beta_0 + \beta_1 D_B + \beta_2 D_C + \varepsilon
$$

where $ D_B, D_C $ are dummy variables.
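
The dummy coding can be made concrete with a short sketch. It assumes three hypothetical method labels `"A"`, `"B"`, `"C"` with `"A"` as the reference level, and uses simulated scores. With this coding the intercept equals the reference group mean, and each dummy coefficient equals the difference between that group's mean and the reference mean.

```python
import numpy as np

# Hypothetical labels and group means; the notebook's data is not used here.
rng = np.random.default_rng(2)
group = rng.choice(np.array(["A", "B", "C"]), size=300)
true_means = {"A": 70.0, "B": 74.0, "C": 67.0}
score = np.array([true_means[g] for g in group]) + rng.normal(0, 5, size=300)

# Indicator columns; "A" is the reference level and gets no column
D_B = (group == "B").astype(float)
D_C = (group == "C").astype(float)
X = np.column_stack([np.ones_like(D_B), D_B, D_C])

beta, *_ = np.linalg.lstsq(X, score, rcond=None)

mean_A = score[group == "A"].mean()
mean_B = score[group == "B"].mean()
mean_C = score[group == "C"].mean()

print("beta_0:", beta[0], "| mean of A:", mean_A)
print("beta_1:", beta[1], "| mean of B minus mean of A:", mean_B - mean_A)
print("beta_2:", beta[2], "| mean of C minus mean of A:", mean_C - mean_A)
```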


### 🟦 Fit Categorical Model

In [None]:
model_cat = smf.ols(
    "score ~ C(teaching_method)",
    data=df
).fit()

model_cat.summary()

### Global Hypothesis Test

The overall F-test evaluates:

$$
H_0: \beta_1 = \beta_2 = \dots = 0
$$

This is **exactly the ANOVA null hypothesis**: when every dummy coefficient is zero, all groups share the same mean, $ \beta_0 $.
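
As a check on this equivalence, the sketch below computes the regression F-statistic from sums of squares and compares it with `scipy.stats.f_oneway`. The three groups are simulated with assumed means and sizes, so the specific values carry no meaning; only the agreement between the two routes matters.

```python
import numpy as np
from scipy import stats

# Hypothetical groups with made-up means and unequal sizes
rng = np.random.default_rng(3)
groups = {
    "A": rng.normal(70, 5, size=80),
    "B": rng.normal(73, 5, size=90),
    "C": rng.normal(68, 5, size=70),
}
score = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups.keys()), [len(v) for v in groups.values()])

# Regression on an intercept plus dummies for B and C (A is the reference)
X = np.column_stack([np.ones(len(score)), labels == "B", labels == "C"]).astype(float)
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
fitted = X @ beta

# Overall F-test: explained vs residual mean squares
ss_model = np.sum((fitted - score.mean()) ** 2)
ss_resid = np.sum((score - fitted) ** 2)
df_model = X.shape[1] - 1
df_resid = len(score) - X.shape[1]
f_reg = (ss_model / df_model) / (ss_resid / df_resid)
p_reg = stats.f.sf(f_reg, df_model, df_resid)

# Classical one-way ANOVA on the same groups
f_anova, p_anova = stats.f_oneway(*groups.values())

print("regression F:", f_reg, "| ANOVA F:", f_anova)
print("regression p:", p_reg, "| ANOVA p:", p_anova)
```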


### 🟦 ANOVA Table from Regression

In [None]:
sm.stats.anova_lm(model_cat, typ=2)


#### Interpretation

- The F-test determines whether teaching method matters at all
- Individual coefficients compare each group to the reference category
- This unifies ANOVA and regression into a single framework
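
With only two groups the same framework reduces to the two-sample t-test. The following sketch uses a simulated two-method data frame (the column names mirror the notebook, the values are assumptions) and compares the dummy coefficient's t-statistic with Student's pooled-variance t-test from SciPy.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

# Hypothetical two-group data; "A" is the reference level
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "teaching_method": np.repeat(["A", "B"], [60, 75]),
    "score": np.concatenate([rng.normal(70, 6, size=60), rng.normal(73, 6, size=75)]),
})

fit = smf.ols("score ~ C(teaching_method)", data=demo).fit()
t_reg = fit.tvalues["C(teaching_method)[T.B]"]
p_reg = fit.pvalues["C(teaching_method)[T.B]"]

# Pooled-variance t-test, ordered B minus A to match the coefficient's sign
a = demo.loc[demo["teaching_method"] == "A", "score"]
b = demo.loc[demo["teaching_method"] == "B", "score"]
t_test = stats.ttest_ind(b, a, equal_var=True)

print("regression t:", t_reg, "| t-test t:", t_test.statistic)
print("regression p:", p_reg, "| t-test p:", t_test.pvalue)
print("overall F:", fit.fvalue, "| t squared:", t_reg**2)
```

The match holds for the equal-variance version of the t-test; Welch's test (`equal_var=False`) relaxes the homoscedasticity assumption and will generally differ.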


## 🟩 Part IV: Model Diagnostics (Assumptions)

### Regression Assumptions

1. Linearity
2. Independence
3. Homoscedasticity
4. Normality of residuals

Violations affect inference, not necessarily prediction.
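
A small simulation can illustrate that last point. In the sketch below, which rests on assumed, made-up parameters, the noise spread grows with study hours. The slope estimate still lands near the true value, but the classical standard error differs from a heteroscedasticity-robust one (White's HC0 estimator, a technique brought in here for illustration and not used elsewhere in this notebook).

```python
import numpy as np

# Hypothetical heteroscedastic data: noise spread grows with study_hours
rng = np.random.default_rng(5)
n = 5000
study_hours = rng.uniform(0, 10, size=n)
noise_sd = 0.15 * study_hours**2
score = 50 + 2.0 * study_hours + rng.normal(0, 1, size=n) * noise_sd

X = np.column_stack([np.ones(n), study_hours])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ score
resid = score - X @ beta

# Classical covariance assumes one common error variance
sigma2 = resid @ resid / (n - X.shape[1])
se_classic = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC0 sandwich covariance lets each observation have its own variance
meat = X.T @ (X * resid[:, None] ** 2)
se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("slope estimate:", beta[1])  # still near the true value 2.0
print("classical SE:  ", se_classic[1])
print("robust SE:     ", se_robust[1])
```

In statsmodels, robust standard errors of this family can be requested with the `cov_type` argument of `fit`, for example `cov_type="HC0"`.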


### 🟦 Residual Diagnostics

In [None]:
residuals = model_multi.resid
fitted = model_multi.fittedvalues


#### Residuals vs Fitted

In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(x=fitted, y=residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Fitted")
plt.show()


#### Q–Q Plot

In [None]:
# fit=True standardizes the residuals so the 45-degree reference line is meaningful
sm.qqplot(residuals, fit=True, line="45")
plt.title("Q–Q Plot of Residuals")
plt.show()


#### Diagnostics Interpretation

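- Residuals are approximately centered around zero
- No strong heteroscedasticity is visible
- Normality is reasonable, and mild departures matter little at this
  sample size (n = 4000) because the coefficient estimates are
  approximately normal in large samples

Under these conditions, inference from the regression is reliable.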
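
Visual checks can be complemented with formal tests. The sketch below is a hedged example on simulated data (not the notebook's `df`): it runs the Breusch-Pagan test for heteroscedasticity and the Jarque-Bera test for normality of residuals, both from statsmodels. Small p-values would signal a violation; with thousands of observations these tests can also flag departures too small to matter in practice, so they are best read alongside the plots.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import jarque_bera

# Hypothetical well-behaved data: constant-variance, normal errors
rng = np.random.default_rng(6)
n = 500
demo = pd.DataFrame({
    "study_hours": rng.uniform(0, 10, size=n),
    "attendance_rate": rng.uniform(0.5, 1.0, size=n),
})
demo["score"] = (
    40 + 2.0 * demo["study_hours"] + 20.0 * demo["attendance_rate"]
    + rng.normal(0, 5, size=n)
)

fit = smf.ols("score ~ study_hours + attendance_rate", data=demo).fit()

# Breusch-Pagan: H0 is constant error variance
bp_lm, bp_lm_p, bp_f, bp_f_p = het_breuschpagan(fit.resid, fit.model.exog)

# Jarque-Bera: H0 is normally distributed residuals
jb_stat, jb_p, skew, kurtosis = jarque_bera(fit.resid)

print("Breusch-Pagan LM p-value:", bp_lm_p)
print("Jarque-Bera p-value:     ", jb_p)
```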


## 🟩 Part V: Confidence Intervals & Practical Meaning

### Confidence Intervals for Coefficients

Each coefficient estimate has a confidence interval:

$$
\hat{\beta}_i \pm t_{\alpha/2} \cdot SE(\hat{\beta}_i)
$$

A $ (1 - \alpha) $ interval that excludes zero corresponds to rejecting
$ H_0: \beta_i = 0 $ at level $ \alpha $.
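
The interval formula and its link to the p-value can be verified numerically. This is a minimal sketch on simulated data with assumed coefficients (one predictor has a true effect of zero): it builds 95% intervals by hand and confirms that an interval excludes zero exactly when the two-sided p-value falls below 0.05.

```python
import numpy as np
from scipy import stats

# Hypothetical data: x2 has no true effect on y
rng = np.random.default_rng(7)
n = 400
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.8 * x1 + 0.0 * x2 + rng.normal(0, 2, size=n)

X = np.column_stack([np.ones(n), x1, x2])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# Residual degrees of freedom: observations minus estimated parameters
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * XtX_inv))

# 95% interval: estimate plus or minus critical t times standard error
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, dof)
lower = beta_hat - t_crit * se
upper = beta_hat + t_crit * se

# Two-sided p-values for H0: beta_i = 0
p_values = 2 * stats.t.sf(np.abs(beta_hat / se), dof)
excludes_zero = (lower > 0) | (upper < 0)

for name, lo, hi, p in zip(["intercept", "x1", "x2"], lower, upper, p_values):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]  p = {p:.4f}")
```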


### 🟦 Extract Confidence Intervals

In [None]:
model_multi.conf_int()

## Final Summary

This notebook demonstrated that:

- Regression generalizes hypothesis testing
- Each coefficient corresponds to a null hypothesis
- ANOVA and t-tests are special cases of regression
- Diagnostics are essential for valid inference

Regression provides a flexible, unified framework for hypothesis
testing that covers these classical tests as special cases.
