<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex05_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01RAD Exercise 05

Learning objectives:
- Recap: Understand and interpret OLS output (t-tests, F-test, R², adj. R²).
- Explain why $\hat{\beta}$ and $s^2$ are independent under the classical assumptions.
- Compute the overall F-test and select variables.


**Setup & Requirements**

- Python packages: `pandas`, `numpy`, `seaborn`, `matplotlib`, `statsmodels`, `scikit-learn`, `scipy`, `mlxtend` (optional for stepwise).
- If running locally, install missing packages as needed:

```bash
pip install pandas numpy seaborn matplotlib statsmodels scikit-learn scipy mlxtend
```

- Notes:
  - You may set a global random seed for reproducibility.
  - Consider setting a plotting theme via `seaborn` for consistent visuals.
  - Formula API (`statsmodels.formula.api.ols`) vs matrix API (`statsmodels.api.OLS`): we use both for clarity.


**Feature Selection: Stepwise**

Stepwise procedures can be unstable; prefer cross-validation and/or penalized regression (Ridge/LASSO) when prediction is the goal.
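As a rough illustration of the penalized alternative, the sketch below fits a cross-validated LASSO on synthetic data in which only the first two of six predictors matter. The data-generating process, the seed, and the pipeline step names (`scale`, `lasso`) are assumptions for this example, not part of the exercise data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: six predictors, only the first two drive the response
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)

# Standardize inside the pipeline so the penalty treats all predictors alike;
# the penalty strength is chosen by 5-fold cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pipe = Pipeline([("scale", StandardScaler()), ("lasso", LassoCV(cv=cv))])
pipe.fit(X, y)

coefs = pipe.named_steps["lasso"].coef_
print("chosen alpha:", pipe.named_steps["lasso"].alpha_)
print("coefficients:", np.round(coefs, 3))
```

Coefficients of irrelevant predictors are shrunk toward zero (often exactly to zero), so selection and estimation happen in one step rather than through a sequence of significance tests.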



In [None]:
# Imports: data handling, modeling, plots, and utilities
import warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.utils import resample

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.stats.outliers_influence import variance_inflation_factor

from scipy import stats
from scipy.stats import pearsonr, spearmanr

try:
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    HAS_MLXTEND = True
except ImportError:
    HAS_MLXTEND = False

# Configure plotting style and reproducibility
sns.set_theme(style="whitegrid", context="notebook")
pd.set_option("display.float_format", lambda value: f"{value:,.3f}")
RNG = np.random.default_rng(42)
warnings.filterwarnings("ignore", category=FutureWarning, module="statsmodels")


**Data Overview**

We use the classic `mpg` dataset from `seaborn`, containing fuel efficiency and vehicle characteristics. Key variables include:
- `mpg`: miles per gallon (response)
- `horsepower`, `displacement`, `weight`, `acceleration`: numeric predictors
- `origin`: region (categorical), `model_year`, etc.

Preprocessing notes:
- Remove rows with missing values (done here via `.dropna()`).
- Ensure numeric dtype for key predictors (e.g., `horsepower` can be non-numeric in some sources).
- Decide upfront whether the goal is inference (explain relationships) or prediction (forecast new `mpg`).


In [None]:

# Load the dataset and enforce numeric dtype where needed
data = sns.load_dataset("mpg").dropna().copy()

# Ensure columns have the expected dtype (direct assignment, because
# `.loc[:, col] = ...` can keep the existing dtype in recent pandas versions)
data["mpg"] = data["mpg"].astype(float)
data["horsepower"] = data["horsepower"].astype(float)
data["origin"] = data["origin"].astype("category")

# Reset the index to keep resampling operations straightforward
data.reset_index(drop=True, inplace=True)


**Exploratory Analysis**

- Inspect dtypes and first rows to confirm data integrity.
- Explore pairwise relationships (scatterplots) and correlations among predictors.
- Look for skewness/outliers and potential non-linear patterns.


In [None]:

# Inspect dtypes to confirm numeric variables for modeling and look at the first few rows
display(data.dtypes.to_frame(name="dtype"))
display(data.head())

In [None]:
# Spot-check a random sample of rows
display(data.sample(5, random_state=RNG.integers(0, 10_000)))


In [None]:
# Summaries via describe()
print(f"Dataset shape: {data.shape[0]} rows × {data.shape[1]} columns")
display(data.describe().T)


In [None]:
# Pairplot to inspect linear relationships and potential non-linearity
pairplot_features = ["mpg", "horsepower", "weight", "displacement", "acceleration"]
plot_sample = data[pairplot_features]
if len(plot_sample) > 200:
    plot_sample = plot_sample.sample(n=200, random_state=RNG.integers(0, 10_000))
sns.pairplot(plot_sample, diag_kind="kde", corner=True)
plt.suptitle("Pairplot of Selected Features", y=1.02)
plt.show()


In [None]:
# Correlations among selected predictors
corr_subset = data[["displacement", "horsepower", "weight", "acceleration"]].corr()
display(corr_subset)


In [None]:

# Correlation heatmap to quantify linear association
corr = data[["mpg", "horsepower", "weight", "displacement", "acceleration"]].corr()
plt.figure(figsize=(6, 4))
sns.heatmap(corr, annot=True, cmap="vlag", center=0, fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()


In [None]:

# Distribution of the response variable to check skewness and outliers
plt.figure(figsize=(6, 4))
sns.histplot(data["mpg"], bins=20, kde=True, color="#4472C4")
plt.xlabel("mpg")
plt.title("Distribution of MPG")
plt.show()


**Simple OLS: mpg ~ horsepower**

We begin with a simple linear model to build intuition.

Interpretation tips:
- Coefficient sign and magnitude reflect average linear association (holding nothing else constant).
- Check residual plots later for non-linearity or heteroscedasticity.
- R² in simple regression equals the squared Pearson correlation between `mpg` and `horsepower`.


In [None]:

# Fit simple OLS: mpg ~ horsepower and review key statistics
simple_model = smf.ols("mpg ~ horsepower", data=data).fit()

print(f"R-squared: {simple_model.rsquared:.3f}  (Adjusted: {simple_model.rsquared_adj:.3f})")
pearson_r, pearson_p = pearsonr(data["mpg"], data["horsepower"])
print(f"Pearson r: {pearson_r:.3f}; r^2: {pearson_r**2:.3f}; p-value: {pearson_p:.3g}")

display(simple_model.summary2().tables[1])


In [None]:

# Residual diagnostics for the simple model
residuals_simple = simple_model.resid
fitted_simple = simple_model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.scatterplot(x=fitted_simple, y=residuals_simple, ax=axes[0])
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set_title("Residuals vs Fitted")
axes[0].set_xlabel("Fitted values")
axes[0].set_ylabel("Residuals")

sm.qqplot(residuals_simple, ax=axes[1])
axes[1].set_title("Q-Q Plot")

plt.tight_layout()
plt.show()


# Theory: Independence of $\hat{\beta}$ and $s^2$

Consider the linear model $y = X\beta + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. The OLS estimator and residuals are:
- $\hat{\beta} = (X'X)^{-1}X'y$, so $\hat{\beta} \sim \mathcal{N}(\beta, \sigma^2 (X'X)^{-1})$.
- Residuals: $e = y - X\hat{\beta} = (I - H)y$, where $H = X(X'X)^{-1}X'$ is the hat matrix.
- Error variance estimator: $s^2 = \mathrm{RSS}/(n - p)$, with $(n-p)s^2/\sigma^2 \sim \chi^2_{n-p}$.

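Key fact: $X'e = 0$ (the residuals are orthogonal to the columns of $X$, and hence to the fitted values). Both $\hat{\beta}$ and $e$ are linear functions of $y$, so they are jointly Gaussian, and $\mathrm{Cov}(\hat{\beta}, e) = \sigma^2 (X'X)^{-1}X'(I - H) = 0$ because $X'(I - H) = 0$. Uncorrelated jointly Gaussian vectors are independent, so $\hat{\beta}$ and $e$ are independent.
Therefore, $\hat{\beta}$ and $s^2 = e'e/(n-p)$ are independent random variables.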

Proof sketch with the hat matrix:
- Decompose $y$ into orthogonal parts: $Hy$ (in span of $X$) and $(I-H)y$ (in its orthogonal complement).
- $\hat{\beta}$ depends only on $Hy$; $s^2$ depends only on $(I-H)y$. Under Gaussian errors, these parts are independent.

Practical note: Empirically, you may see small sample correlations between estimates and residual variance, but theory predicts independence under the classical assumptions.
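
The algebra behind the proof sketch can be checked numerically. The snippet below is a minimal sketch on a hypothetical random design matrix (its size and seed are arbitrary assumptions). It verifies that $H$ is idempotent, that $X'(I-H) = 0$, and that the cross-covariance factor $(X'X)^{-1}X'(I-H)$ vanishes up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
n, p = 50, 3

# Hypothetical design matrix: intercept plus two random predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

XtX_inv = np.linalg.inv(X.T @ X)
A = XtX_inv @ X.T        # beta_hat = A y
H = X @ A                # hat matrix: fitted values = H y
M = np.eye(n) - H        # residual maker: e = M y

# Cov(beta_hat, e) = sigma^2 * A M', so A M' should be numerically zero
cross_cov = A @ M.T
print("max |A M'|:", np.abs(cross_cov).max())
print("H idempotent:", np.allclose(H @ H, H))
print("X'M = 0:", np.allclose(X.T @ M, 0))
print("trace(M) equals n - p:", np.isclose(np.trace(M), n - p))
```

The trace check connects to the degrees of freedom: $I - H$ projects onto a subspace of dimension $n - p$, which is where the $\chi^2_{n-p}$ distribution of $(n-p)s^2/\sigma^2$ comes from.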


**Empirical Check: Bootstrap Independence**

We resample the observed data with replacement and refit the model many times to obtain empirical distributions of coefficients and $s^2$.

Notes:
- Bootstrap corroborates the lack of association in this dataset, but strict independence is a theoretical result (under normality).
- Use sufficient resamples (e.g., 1000+) for stability, mindful of runtime.
- Visualize $s^2$ vs each $\hat{\beta}$ (scatter with LOWESS) and summarize with correlation estimates/intervals.


In [None]:
# Bootstrap loop to empirically examine association between s^2 and coefficients
predictors = ["horsepower", "weight"]
X = sm.add_constant(data[predictors])
y = data["mpg"]

n_bootstraps = 1000
beta_estimates = np.empty((n_bootstraps, X.shape[1]))
sigma_squared_estimates = np.empty(n_bootstraps)

for i in range(n_bootstraps):
    sample_idx = RNG.integers(0, len(data), len(data))
    X_resampled = X.iloc[sample_idx]
    y_resampled = y.iloc[sample_idx]
    bootstrap_model = sm.OLS(y_resampled, X_resampled).fit()
    beta_estimates[i] = bootstrap_model.params.values
    sigma_squared_estimates[i] = bootstrap_model.mse_resid

beta_df = pd.DataFrame(beta_estimates, columns=X.columns)
beta_df["s_n_squared"] = sigma_squared_estimates


In [None]:

# Summarize Pearson and Spearman correlations between each coefficient and s^2
corr_summary = []
for col in X.columns:
    pearson_r, pearson_p = pearsonr(beta_df[col], beta_df["s_n_squared"])
    spearman_r, spearman_p = spearmanr(beta_df[col], beta_df["s_n_squared"])
    corr_summary.append(
        {
            "parameter": col,
            "pearson_r": pearson_r,
            "pearson_p": pearson_p,
            "spearman_r": spearman_r,
            "spearman_p": spearman_p,
        }
    )

corr_summary_df = pd.DataFrame(corr_summary)
display(corr_summary_df)


In [None]:

# Visualize relationship between s^2 and each coefficient estimate
for col in X.columns:
    plt.figure(figsize=(6, 4))
    sns.regplot(x=beta_df["s_n_squared"], y=beta_df[col], lowess=True, scatter_kws={"alpha": 0.3})
    plt.title(f"s^2 vs {col} (Bootstrap estimates)")
    plt.xlabel("s^2 (bootstrap)")
    plt.ylabel(f"{col} estimate")
    plt.show()


In [None]:

# Estimate uncertainty around the correlation via a paired bootstrap:
# resample (s^2, const) pairs together so their association is preserved
n_resamples = 1000
s_values = beta_df["s_n_squared"].to_numpy()
const_values = beta_df["const"].to_numpy()
correlations = []
for _ in range(n_resamples):
    idx = RNG.integers(0, len(beta_df), len(beta_df))
    correlations.append(np.corrcoef(s_values[idx], const_values[idx])[0, 1])

ci_lower, ci_upper = np.percentile(correlations, [2.5, 97.5])
print(f"Bootstrap correlation 95% CI for const: [{ci_lower:.4f}, {ci_upper:.4f}]")

plt.figure(figsize=(6, 4))
sns.histplot(correlations, bins=30, kde=True, color="#70AD47")
plt.axvline(ci_lower, color="red", linestyle="--", label="95% CI")
plt.axvline(ci_upper, color="red", linestyle="--")
plt.title("Bootstrap Distribution of Correlation (const vs s^2)")
plt.xlabel("Correlation")
plt.legend()
plt.show()




## Simulation Exercise

To empirically demonstrate the independence of $\hat{\beta}$ and $s_n^2$, we can use simulation.

### Task

1. **Generate Data**: Simulate data based on a simple linear regression model.
   - Set up a design matrix $X$ with an intercept and one or more predictors.
   - Generate response values $Y$ based on a known linear relationship with added Gaussian noise.
   
2. **Estimate $\hat{\beta}$ and $s_n^2$**:
   - Use OLS to compute $\hat{\beta}$ and $s_n^2$ for each simulated dataset.

3. **Repeat Simulations**:
   - Perform the simulation multiple times (e.g., 1000 times) to generate distributions for $\hat{\beta}$ and $s_n^2$.

4. **Calculate Correlation**:
   - Calculate the correlation between the simulated values of $\hat{\beta}$ and $s_n^2$.
   - If $\hat{\beta}$ and $s_n^2$ are independent, the correlation should be close to zero.
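
The four steps above can be sketched as a Monte Carlo study that draws fresh Gaussian noise for every replicate while keeping the design fixed. The sample size, true coefficients, noise level, and seed below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(123)     # arbitrary seed
n, n_sims = 100, 2000
beta_true = np.array([1.0, 2.0])     # hypothetical intercept and slope
sigma = 1.5                          # hypothetical error standard deviation

# Step 1: fixed design matrix with an intercept and one predictor
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]
ols_map = np.linalg.inv(X.T @ X) @ X.T   # maps y to beta_hat

betas = np.empty((n_sims, p))
s2 = np.empty(n_sims)
for i in range(n_sims):
    # Steps 2-3: new Gaussian errors each time, then OLS estimates
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    b = ols_map @ y
    resid = y - X @ b
    betas[i] = b
    s2[i] = resid @ resid / (n - p)

# Step 4: correlations between each coefficient and s_n^2
corr_intercept = np.corrcoef(betas[:, 0], s2)[0, 1]
corr_slope = np.corrcoef(betas[:, 1], s2)[0, 1]
print(f"corr(intercept, s_n^2) = {corr_intercept:.3f}")
print(f"corr(slope, s_n^2)     = {corr_slope:.3f}")
print(f"mean of s_n^2 = {s2.mean():.3f} (true sigma^2 = {sigma**2:.3f})")
```

Unlike a bootstrap of one observed dataset, every replicate here satisfies the Gaussian model exactly, so the correlations should differ from zero only by Monte Carlo error of order $1/\sqrt{\text{n\_sims}}$.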



**Simulation Study**

We simulate data from a known linear model to study sampling behavior.

Design ideas:
- Control multicollinearity by making predictors correlated (e.g., `X2 = X1 + noise`).
- Compare Pearson and Spearman correlation between $s^2$ and each $\hat{\beta}$ across resamples.
- Inspect residual diagnostics to verify assumptions used in the theory.


In [None]:

# Simulated data to study sampling behavior and multicollinearity
n_samples = 200
X1 = RNG.normal(2, 1, n_samples)
X2 = X1 + RNG.normal(0, 0.5, n_samples)
X3 = RNG.normal(4, 1, n_samples)
intercept = np.ones(n_samples)
noise = RNG.normal(0, 1, n_samples)
y_sim = 3 + 2 * X1 - 1 * X2 + 5 * X3 + noise

X_sim = pd.DataFrame({"const": intercept, "X1": X1, "X2": X2, "X3": X3})
model_sim = sm.OLS(y_sim, X_sim).fit()

n_bootstrap_sim = 1000
beta_sim = np.empty((n_bootstrap_sim, X_sim.shape[1]))
sigma_sim = np.empty(n_bootstrap_sim)

for i in range(n_bootstrap_sim):
    idx = RNG.integers(0, n_samples, n_samples)
    X_boot = X_sim.iloc[idx]
    y_boot = y_sim[idx]
    model_boot = sm.OLS(y_boot, X_boot).fit()
    beta_sim[i] = model_boot.params.values
    sigma_sim[i] = model_boot.mse_resid

beta_sim_df = pd.DataFrame(beta_sim, columns=X_sim.columns)
beta_sim_df["s_n_squared"] = sigma_sim

display(model_sim.summary2().tables[1])


In [None]:

# Correlation matrix between coefficients and s^2 in the simulated setting
sim_corr = beta_sim_df.corr()
display(sim_corr.loc[["const", "X1", "X2", "X3"], ["s_n_squared"]])

plt.figure(figsize=(6, 4))
sns.heatmap(sim_corr, annot=True, cmap="crest", fmt=".2f")
plt.title("Simulation: Correlation Matrix of Bootstrap Estimates")
plt.show()


In [None]:
# Correlation matrix of the mpg bootstrap estimates (beta_df), for comparison with the simulation
correlation_matrix = beta_df.corr()
print("Correlation matrix of mpg bootstrap estimates (coefficients and s_n^2):")
print(correlation_matrix)

**Correlation Demo (Toy Example)**

This small example contrasts Pearson (linear) and Spearman (rank-based, monotonic) correlation.
- Pearson measures the strength of linear association.
- Spearman measures monotonic association, so it still detects monotonic non-linear relationships and is less sensitive to outliers.


In [None]:

# Generate a non-linear relationship to compare Pearson vs Spearman correlation
x = np.linspace(-3, 3, 200)
y = x**2 + RNG.normal(0, 0.5, size=x.size)

plt.figure(figsize=(6, 4))
sns.scatterplot(x=x, y=y, alpha=0.6)
plt.title("Non-linear Relationship")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)
print(f"Pearson r = {pearson_r:.3f}, p = {pearson_p:.3g}")
print(f"Spearman rho = {spearman_r:.3f}, p = {spearman_p:.3g}")


In [None]:
# Generate a monotonic but non-linear relationship to compare Pearson vs Spearman correlation
x = np.linspace(-5, 5, 400)
y = 1 / (2 + np.exp(-x)) + RNG.normal(0, 0.1, x.size)  # monotonic, non-linear
plt.figure(figsize=(6, 4))
sns.scatterplot(x=x, y=y, alpha=0.6)
plt.title("Non-linear Monotonic Relationship")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)
print(f"Pearson r = {pearson_r:.3f}, p = {pearson_p:.3g}")
print(f"Spearman rho = {spearman_r:.3f}, p = {spearman_p:.3g}")


**Unrelated Predictors and the Global F-test**

We add random predictors that are unrelated to `mpg` by construction. Under the null, each one should appear significant in only about a fraction $\alpha$ of repetitions.

Theory reminder:
- The overall F-test evaluates the joint null that all slopes are zero.
- In a model with a single predictor (intercept plus one slope), $F = t^2$ for that slope.
- With many predictors, the F-test can reject even when no individual t-test is significant (shared variance, collinearity).


In [None]:
# Add unrelated (random) predictors to test spurious significance
for col in ["random1", "random2", "random3"]:
    data.loc[:, col] = RNG.normal(0, 1, len(data))

random_model = smf.ols("mpg ~ random1 + random2 + random3", data=data).fit()
print("Random-only model (expect low explanatory power)")
display(random_model.summary2().tables[0])
display(random_model.summary2().tables[1])


In [None]:
# Estimate false-positive rate when regressing mpg on noise predictors
alpha = 0.05
n_trials = 200
significant_counts = []
for _ in range(n_trials):
    temp = data.copy()
    for col in ["noise_a", "noise_b", "noise_c"]:
        temp.loc[:, col] = RNG.normal(0, 1, len(temp))
    noise_model = smf.ols("mpg ~ noise_a + noise_b + noise_c", data=temp).fit()
    sig = (noise_model.pvalues.drop("Intercept") < alpha).sum()
    significant_counts.append(sig)

false_positive_rate = np.mean(np.array(significant_counts) > 0)
print(f"Proportion of trials with at least one false positive (alpha={alpha}): {false_positive_rate:.3f}")


**Comparing F-test and t-tests**

Interpretation tips:
- If the global F-test is significant but many t-tests are not, suspect multicollinearity or insufficient power for individual effects.
- Compare partial $R^2$ and standardized effects to gauge practical importance.
- Avoid p-hacking: pre-register hypotheses or use corrections when testing many predictors.
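
The first tip can be illustrated with a small sketch in which two predictors are near-duplicates. The data-generating process, seed, and noise levels are assumptions for this example. The predictors jointly carry a clear signal, but their individual standard errors are heavily inflated, so the overall F-test is typically far more decisive than either t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)   # arbitrary seed
n = 60

# Hypothetical predictors: x2 is x1 plus a tiny perturbation
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.02, size=n)
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.normal(scale=2.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1]
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y

resid = y - X @ beta
rss = resid @ resid
tss = np.sum((y - y.mean()) ** 2)
s2 = rss / (n - p)

# Individual t-tests for each coefficient
se = np.sqrt(s2 * np.diag(XtX_inv))
t_stats = beta / se
t_pvalues = 2 * stats.t.sf(np.abs(t_stats), df=n - p)

# Overall F-test: all slopes are zero
q = p - 1
f_stat = ((tss - rss) / q) / (rss / (n - p))
f_pvalue = stats.f.sf(f_stat, q, n - p)

print("standard errors:", np.round(se, 3))
print("t-test p-values:", np.round(t_pvalues, 4))
print(f"F = {f_stat:.2f}, p-value = {f_pvalue:.3g}")
```

The data determine the sum of the two slopes well but not how it splits between `x1` and `x2`, which is exactly what the inflated standard errors express.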


In [None]:
# Compare the F-test with individual t-tests for each regression coefficient
# F-test p-value (overall model significance)
f_test_pvalue = random_model.f_pvalue

# Individual t-test p-values
t_test_pvalues = random_model.pvalues

print("\nF-test p-value (for the entire model):", f_test_pvalue)
print("Individual t-test p-values (for each coefficient):")
print(t_test_pvalues)


In [None]:
# Combine real and random predictors; examine F-test vs individual t-tests
combined_formula = "mpg ~ random1 + random2 + random3 + horsepower + weight + acceleration"
combined_model = smf.ols(combined_formula, data=data).fit()
print("Model with both signal and noise variables")
display(combined_model.summary2().tables[1])

f_test_pvalue_combined = combined_model.f_pvalue
print(f"Global F-test p-value: {f_test_pvalue_combined:.4g}")


In [None]:
# Compare the F-test with individual t-tests for each regression coefficient
f_test_pvalue_combined = combined_model.f_pvalue
t_test_pvalues_combined = combined_model.pvalues

print("\nF-test p-value (for the combined model):", "{:.4f}".format(f_test_pvalue_combined) )
print("Individual t-test p-values (for each coefficient in the combined model):")
t_test_pvalues_combined_formatted = t_test_pvalues_combined.apply(lambda x: f"{x:.4f}")
print(t_test_pvalues_combined_formatted)


In [None]:

# Specify a candidate final model based on significance and interpretability
final_model_formula = "mpg ~ horsepower + weight + acceleration"
final_model = smf.ols(final_model_formula, data=data).fit()
print("Final model summary")
display(final_model.summary2().tables[1])
print(f"Adjusted R^2: {final_model.rsquared_adj:.3f}")


In [None]:

# Demonstrate that, for a single predictor, F equals t^2
single_model = smf.ols("mpg ~ horsepower", data=data).fit()
t_value = single_model.tvalues["horsepower"]
f_value = single_model.fvalue
print(f"t^2 = {t_value**2:.4f}, F = {f_value:.4f}")


**Manual Calculations: SE, t, and F**

Formulas used:
- $s = \sqrt{\mathrm{RSS}/(n-p)}$
- $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, so $se(\hat{\beta}_i) = s \sqrt{(X'X)^{-1}_{ii}}$
- $t_i = \hat{\beta}_i / se(\hat{\beta}_i)$ with $\mathrm{df}=n-p$
- Nested-model F-test: $F = \frac{(RSS_r - RSS_f)/q}{RSS_f/(n - p_f)}$

We verify that manual calculations align with `statsmodels` outputs. Any remaining differences are at the level of floating-point error.
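
The nested-model F-test from the last formula can be computed by hand as follows. This is a sketch on synthetic data. The helper `rss_of`, the seed, and the data-generating process are assumptions introduced for illustration, with $q$ the number of restrictions (predictors dropped) and $p_f$ the number of columns in the full design.

```python
import numpy as np
from scipy import stats

def rss_of(X, y):
    """Residual sum of squares of the least-squares fit of y on X (hypothetical helper)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

rng = np.random.default_rng(1)   # arbitrary seed
n = 120
x1, x2, x3 = rng.normal(size=(3, n))
y = 2.0 + 1.5 * x1 + rng.normal(size=n)   # x2 and x3 are irrelevant by construction

X_full = np.column_stack([np.ones(n), x1, x2, x3])
X_restricted = X_full[:, :2]              # intercept and x1 only

rss_f = rss_of(X_full, y)
rss_r = rss_of(X_restricted, y)
p_f = X_full.shape[1]
q = p_f - X_restricted.shape[1]           # number of restrictions tested

F = ((rss_r - rss_f) / q) / (rss_f / (n - p_f))
p_value = stats.f.sf(F, q, n - p_f)
print(f"F = {F:.3f} on ({q}, {n - p_f}) df, p-value = {p_value:.3f}")
```

With fitted `statsmodels` results, calling `compare_f_test` on the full model with the restricted model as argument should report the same statistic. The overall F-test is the special case where the restricted model contains only the intercept.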


In [None]:

# (X'X)^(-1) used to compute standard errors manually
X_design = final_model.model.exog
y_obs = final_model.model.endog

XtX_inv = np.linalg.inv(X_design.T @ X_design)
n_obs, n_params = X_design.shape
rss = np.sum(final_model.resid**2)
s_hat = np.sqrt(rss / (n_obs - n_params))
se_manual = s_hat * np.sqrt(np.diag(XtX_inv))


In [None]:
# Extract statsmodels' coefficients, standard errors, and residual degrees of freedom
# as the reference values for the manual calculations
beta_hat = final_model.params
se_beta_hat = final_model.bse
degrees_of_freedom = final_model.df_resid
print(se_beta_hat)


In [None]:

# Manual t statistics and p-values compared to statsmodels
params = final_model.params
se_sm = final_model.bse

t_manual = params / se_manual
p_manual = 2 * stats.t.sf(np.abs(t_manual), df=final_model.df_resid)  # sf avoids precision loss in 1 - cdf

comparison_df = pd.DataFrame(
    {
        "coef": params,
        "se_manual": se_manual,
        "se_sm": se_sm,
        "t_manual": t_manual,
        "t_sm": final_model.tvalues,
        "p_manual": p_manual,
        "p_sm": final_model.pvalues,
    }
)
display(comparison_df)

np.testing.assert_allclose(se_manual, se_sm, rtol=1e-6, atol=1e-8)
np.testing.assert_allclose(t_manual, final_model.tvalues.values, rtol=1e-6, atol=1e-8)


In [None]:
# Compare with statsmodels' computed t-values and p-values
print("\nStatsmodels t-test results:")
print(final_model.summary2().tables[1][['Coef.', 't', 'P>|t|']])


In [None]:

# Overall model F-test and interpretation
f_statistic = final_model.fvalue
f_p_value = final_model.f_pvalue
print(f"F-statistic: {f_statistic:.4f} with p-value {f_p_value:.4g}")
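if f_p_value < 0.05:
    print("Interpretation: reject H0 that all slopes are zero; at least one predictor is associated with mpg.")
else:
    print("Interpretation: no evidence at the 5% level that any predictor is associated with mpg.")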


Interpret the overall F-test and the individual t-tests together:
- Does the model explain a meaningful portion of variance (adjusted $R^2$)?
- Are key predictors significant and practically important (effect sizes, units)?
- Could multicollinearity mask individual effects despite a significant F-test?
- Do diagnostics support assumptions (normality, homoscedasticity, linearity)?


**Multicollinearity & VIF (Future Lessons)**

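We artificially create highly correlated predictors to illustrate multicollinearity. The variance inflation factor, $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the other predictors, quantifies it.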

Guidelines:
- VIF > 5 suggests moderate collinearity; VIF > 10 is often considered high.
- High VIF inflates standard errors and destabilizes coefficient estimates.
- Consider dropping/recombining variables, or applying regularization.
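
A minimal sketch of the VIF calculation, written with NumPy only so that the definition $\mathrm{VIF}_j = 1/(1 - R_j^2)$ is explicit. The function name `vif` and the synthetic predictors `a`, `b`, `c` are assumptions for this example.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (predictors only, without an intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept and all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(3)   # arbitrary seed
n = 300
a = rng.normal(size=n)
b = a + rng.normal(scale=0.1, size=n)   # near-duplicate of a
c = rng.normal(size=n)                  # unrelated predictor

vifs = vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], np.round(vifs, 2))))
```

The near-duplicates `a` and `b` should show very large VIFs while `c` stays close to 1. The `variance_inflation_factor` function imported from `statsmodels` takes a design matrix and a column index; when that design matrix includes a constant column it should agree with this calculation.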


In [None]:

# Create near-duplicates to illustrate multicollinearity
data["horsepower_noise"] = data["horsepower"] + RNG.normal(0, 0.1, len(data))
data["weight_noise"] = data["weight"] + RNG.normal(0, 0.1, len(data))

model_with_noise = smf.ols(
    "mpg ~ horsepower + weight + displacement + horsepower_noise + weight_noise",
    data=data,
).fit()
print("Model including highly correlated predictors")
display(model_with_noise.summary2().tables[0])
display(model_with_noise.summary2().tables[1])


**Residual Diagnostics**

We assess OLS assumptions via diagnostic plots. Look for:
- Normality (histogram, Q-Q plot).
- Homoscedasticity (residuals vs fitted/predictors).
- Influence (Cook’s distance, leverage) - will be covered later.
- Linearity (partial regression plots) - will be covered later.


In [None]:

# Residual diagnostics: normality, variance, and patterns
residuals = final_model.resid
fitted_values = final_model.fittedvalues
studentized_residuals = final_model.get_influence().resid_studentized_internal

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

sns.histplot(residuals, bins=20, ax=axes[0], edgecolor="black")
axes[0].set_title("Histogram of Residuals")
axes[0].set_xlabel("Residual")
axes[0].set_ylabel("Frequency")

sm.qqplot(studentized_residuals, line="45", ax=axes[1])
axes[1].set_title("Q-Q Plot of Studentized Residuals")

plt.tight_layout()
plt.show()


In [None]:

# Scale-location plot (spread-location) to assess homoscedasticity
sqrt_abs = np.sqrt(np.abs(studentized_residuals))
plt.figure(figsize=(6, 4))
sns.scatterplot(x=fitted_values, y=sqrt_abs, alpha=0.6)
plt.axhline(sqrt_abs.mean(), color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("sqrt(|Studentized residuals|)")
plt.title("Scale-Location Plot")
plt.show()


In [None]:

# Studentized residuals vs. fitted values to check for heteroscedasticity
plt.figure(figsize=(6, 4))
sns.scatterplot(x=fitted_values, y=studentized_residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Studentized residuals")
plt.title("Studentized Residuals vs Fitted Values")
plt.show()


In [None]:

# Residuals vs. fitted values with LOWESS smoothing
fig, ax = plt.subplots(figsize=(6, 4))
sns.residplot(x=fitted_values, y=residuals, lowess=True, color="#4472C4", ax=ax)
ax.axhline(0, color="red", linestyle="--")
ax.set_title("Residuals vs Fitted (LOWESS)")
ax.set_xlabel("Fitted values")
ax.set_ylabel("Residuals")
plt.show()


In [None]:

# Studentized residuals vs. each predictor to check for non-linearity
for predictor in final_model_formula.split("~")[1].split("+"):
    predictor = predictor.strip()
    plt.figure(figsize=(6, 4))
    sns.scatterplot(x=data[predictor], y=studentized_residuals, alpha=0.6)
    plt.axhline(0, color="red", linestyle="--")
    plt.xlabel(predictor)
    plt.ylabel("Studentized residuals")
    plt.title(f"Studentized Residuals vs. {predictor}")
    plt.show()


In [None]:
data

In [None]:

# Examine multicollinearity via correlation matrix and remove highly correlated predictors
corr_matrix = data.corr(numeric_only=True).abs()
display(corr_matrix)

corr_threshold = 0.9
high_corr_pairs = [
    (corr_matrix.index[i], corr_matrix.columns[j])
    for i in range(corr_matrix.shape[0])
    for j in range(i + 1, corr_matrix.shape[1])
    if corr_matrix.iloc[i, j] > corr_threshold and corr_matrix.index[i] != "mpg"
]

print(f"Highly correlated pairs (threshold > {corr_threshold}): {high_corr_pairs}")

data_reduced = data.drop(columns={pair[0] for pair in high_corr_pairs if pair[0] in data.columns})
print(f"Remaining predictors after dropping columns: {sorted(data_reduced.columns)}")


In [None]:

# Stepwise selection using mlxtend's SequentialFeatureSelector
if HAS_MLXTEND:
    y_mtx = data["mpg"]
    X_mtx = data.drop(columns=["mpg"])
    numeric_cols = X_mtx.select_dtypes(include=[np.number]).columns
    X_mtx = X_mtx[numeric_cols]

    sfs = SFS(
        LinearRegression(),
        k_features="best",
        forward=True,
        floating=True,
        scoring="r2",
        cv=5,
        n_jobs=1,
    )
    sfs = sfs.fit(X_mtx, y_mtx)
    selected_features = list(sfs.k_feature_names_)
    formula_mlxtend = "mpg ~ " + " + ".join(selected_features)
    model_mlxtend = smf.ols(formula=formula_mlxtend, data=data).fit()
    print(f"Selected features (mlxtend): {selected_features}")
    display(model_mlxtend.summary2().tables[1])
else:
    print("mlxtend is not installed. Install mlxtend to run the stepwise selection demo.")


### Individual Student Work — HW

1. Data Exploration and Preprocessing
   - Load the `mpg` dataset; keep relevant variables.
   - For this HW, use `weight` as the response variable (explicit change).
   - Convert `origin` to a categorical variable with three levels (USA, Europe, Japan).
   - Plot relationships between `weight` and each predictor, including grouped plots by `origin`.

2. Initial Model Fitting
   - Fit OLS: `weight ~ horsepower + displacement + acceleration + C(origin)`.
   - Interpret coefficients and p-values in context (units, direction, practical magnitude).

3. Overall F-test vs Individual t-tests
   - Report the global F-test and compare with individual t-tests.
   - Compare $R^2$ with adjusted $R^2$ and explain the difference.

4. Investigate Correlation
   - Identify the two most correlated predictors; remove one and refit.
   - Compare the lighter model to the full model (adjusted $R^2$, F).
   - At fixed values of common predictors (3 settings), vary the removed predictor randomly and compare prediction intervals across models.

5. Categorical Interactions
   - Fit a model with interactions: e.g., `weight ~ displacement * C(origin)` or `horsepower * C(origin)`.
   - Interpret interaction terms and discuss how `origin` moderates effects.

6. Model Selection
   - Perform a stepwise selection procedure (e.g., forward or backward) and state the selection criterion you used.
   - Compare final vs initial models in adjusted $R^2$ and F; discuss trade-offs.

7. Diagnostics
   - Produce residual plots and a Q–Q plot. Comment on normality, homoscedasticity, and any influential points.

