<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex04_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 01RAD Exercise 04

- Residual diagnostics
- Post-hoc analysis checks


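Variance of the $i$-th fitted value, where $\mathbf{x}_i^\top$ is the $i$-th row of $\mathbf{X}$ and $h_{ii}$ is the $i$-th diagonal element of the hat matrix $\mathbf{H}$ introduced below:

$\operatorname{Var}(\hat{Y}_i) = \sigma^2 \cdot \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i = \sigma^2 h_{ii}$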


# Projection Hat Matrix $\mathbf{H}$ and Matrix $\mathbf{M}$

## $\mathbf{H}$: Hat Matrix

$$
\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top
$$

- Projects the observation vector $\mathbf{Y}$ onto the column space generated by $\mathbf{X}$.
- "Puts a hat" on $\mathbf{Y}$, because $\hat{\mathbf{Y}} = \mathbf{H} \mathbf{Y}$.
- In code: `hat_Y = H @ Y`.

**Properties**

- **Symmetry and idempotency**: $\mathbf{H} = \mathbf{H}^\top$ and $\mathbf{H}^2 = \mathbf{H}$.
- **Trace (degrees of freedom)**: $\operatorname{tr}(\mathbf{H}) = p$.
- **Diagonal elements ($h_{ii}$) are leverage values**, quantifying how much each observation influences its own fitted value.
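
The properties listed above can be checked numerically. The following is a minimal sketch on a small random design matrix; the sizes `n = 8`, `p = 3` and the seed are arbitrary choices for illustration, not values used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3  # hypothetical sizes, chosen only for illustration

# Design matrix: intercept column plus p - 1 random regressors
X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
Y = rng.normal(size=n)

# H = X (X'X)^{-1} X', computed with solve() instead of an explicit inverse
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)

# H @ Y must agree with the fitted values from a least-squares solve
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
fits_match = np.allclose(H @ Y, X @ beta_hat)

print("symmetric:", is_symmetric)
print("idempotent:", is_idempotent)
print("trace(H):", trace_H, "expected p =", p)
print("H @ Y equals X @ beta_hat:", fits_match)
print("leverages:", np.round(leverage, 3))
```

Up to floating-point error, the trace should equal the number of columns of $\mathbf{X}$, and the leverages should sum to the same value.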

## $\mathbf{M}$ Matrix

$$
\mathbf{M} = \mathbf{I} - \mathbf{H}
$$

- Complement of the hat matrix.
- Projects onto the orthogonal complement of the column space of $\mathbf{X}$.
- Residual vector: $\hat{\mathbf{e}} = \mathbf{M} \mathbf{Y}$.

**Properties**

- **Symmetry and idempotency**: $\mathbf{M} = \mathbf{M}^\top$ and $\mathbf{M}^2 = \mathbf{M}$.
- **Orthogonality with $\mathbf{H}$**: $\mathbf{H}\mathbf{M} = \mathbf{0}$ and $\mathbf{M}\mathbf{H} = \mathbf{0}$.
- **Trace (residual degrees of freedom)**: $\operatorname{tr}(\mathbf{M}) = n - p$.
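
A similar numerical check works for $\mathbf{M}$. This sketch again assumes a small random design (sizes and seed are illustrative only) and verifies the three properties above, plus the fact that the residual vector $\mathbf{M}\mathbf{Y}$ is orthogonal to every column of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 8, 3  # hypothetical sizes, chosen only for illustration

X = np.column_stack((np.ones(n), rng.normal(size=(n, p - 1))))
Y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H

m_symmetric = np.allclose(M, M.T)
m_idempotent = np.allclose(M @ M, M)
hm_zero = np.allclose(H @ M, 0) and np.allclose(M @ H, 0)
trace_M = np.trace(M)

# Residuals are obtained by projecting Y with M
residuals = M @ Y
resid_orthogonal_to_X = np.allclose(X.T @ residuals, 0)

print("symmetric:", m_symmetric)
print("idempotent:", m_idempotent)
print("HM = MH = 0:", hm_zero)
print("trace(M):", trace_M, "expected n - p =", n - p)
print("X' e_hat = 0:", resid_orthogonal_to_X)
```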



## Questions

- If $\mathbf{X}$ has $n$ rows and $p$ linearly independent columns, what is the dimension of the original vector space on which $\mathbf{H}$ and $\mathbf{M}$ operate?
- What is the dimension (rank) of the column space $\operatorname{Col}(\mathbf{X})$?
- What is the dimension of $\operatorname{Nul}(\mathbf{X}^\top)$, the orthogonal complement of $\operatorname{Col}(\mathbf{X})$, and what do we call this space?
- Is the decomposition $\mathbb{R}^n = \operatorname{Col}(\mathbf{X}) \oplus \operatorname{Nul}(\mathbf{X}^\top)$ correct?
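
One way to check your answers numerically is to build orthonormal bases of both subspaces from the full singular value decomposition. The sketch below assumes a random Gaussian design, whose columns are linearly independent with probability 1; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 7, 3  # hypothetical sizes, chosen only for illustration
X = rng.normal(size=(n, p))

rank_X = np.linalg.matrix_rank(X)

# Full SVD: the first rank_X columns of U span Col(X),
# the remaining columns span Nul(X'), its orthogonal complement
U, svals, Vt = np.linalg.svd(X, full_matrices=True)
col_basis = U[:, :rank_X]
null_basis = U[:, rank_X:]

dim_col = col_basis.shape[1]
dim_null = null_basis.shape[1]

# Every vector in null_basis is annihilated by X'
null_check = np.allclose(X.T @ null_basis, 0)

# The orthogonal projector built from col_basis equals the hat matrix
H = X @ np.linalg.solve(X.T @ X, X.T)
projector_matches_H = np.allclose(col_basis @ col_basis.T, H)

print("dim Col(X):", dim_col)
print("dim Nul(X'):", dim_null)
print("dimensions add up to n:", dim_col + dim_null == n)
print("projector equals H:", projector_matches_H)
```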



### Types of residuals in linear regression

$$
Y_i = \mathbf{x}_i^\top \beta + e_i, \quad e_i \sim \mathcal{N}(0, \sigma^2) \text{ independent}
$$

Residuals measure the difference between observed and predicted values.

#### 1. Raw residuals
$$
\hat{e}_i = Y_i - \hat{Y}_i
$$

#### 2. Standardized residuals (known $\sigma$)
Raw residuals scaled by their standard deviation when $\sigma$ is known:
$$
\hat{Z}_i = \frac{\hat{e}_i}{\sigma \sqrt{1 - h_{ii}}}
$$
Question: how is the variance of $\hat{e}_i$ derived from $\sigma^2$ and $h_{ii}$?
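
A simulation does not replace the derivation, but it can confirm the result that the scaling above relies on, namely $\operatorname{Var}(\hat{e}_i) = \sigma^2 (1 - h_{ii})$. The sketch below keeps the design fixed and redraws the errors many times; the sample size, number of replications, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma = 10, 2.0  # hypothetical values, chosen only for illustration

x = rng.normal(10, 2, n)
X = np.column_stack((np.ones(n), x))
H = X @ np.linalg.solve(X.T @ X, X.T)
h_ii = np.diag(H)
M = np.eye(n) - H

n_sim = 50_000
# Each row of Y_sim is one simulated response vector for the same design
Y_sim = 5 + 2 * x + rng.normal(0, sigma, size=(n_sim, n))

# M is symmetric, so multiplying rows by M gives the residual vector of each replication
resid_sim = Y_sim @ M

empirical_var = resid_sim.var(axis=0)
theoretical_var = sigma**2 * (1 - h_ii)
max_rel_error = np.max(np.abs(empirical_var / theoretical_var - 1))

print("empirical:  ", np.round(empirical_var, 3))
print("theoretical:", np.round(theoretical_var, 3))
print("max relative error:", max_rel_error)
```

Observations with high leverage have residuals with smaller variance, which is why raw residuals are not directly comparable across observations.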

#### 3. Internally studentized residuals (unknown $\sigma$)
Adjust raw residuals using the OLS variance estimate $s^2 = \frac{1}{n - p} \sum_{j=1}^n \hat{e}_j^2$:
$$
\hat{r}_i = \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}}
$$
These residuals account for leverage but still use the variance estimate computed from all observations.

#### 4. Externally studentized residuals
Raw residuals scaled using the variance estimate $s_{(-i)}^2$ computed without the $i$-th observation:
$$
\hat{r}_{(-i)} = \frac{\hat{e}_i}{s_{(-i)} \sqrt{1 - h_{ii}}}, \quad s_{(-i)}^2 = \frac{(n - p) s^2 - \frac{\hat{e}_i^2}{1 - h_{ii}}}{n - p - 1}
$$
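
The closed form for $s_{(-i)}^2$ avoids refitting the model $n$ times. As a sanity check, the sketch below compares it with a brute-force refit that drops one observation at a time. The simulated data, sizes, and seed are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 12, 2  # hypothetical sizes, chosen only for illustration

x = rng.normal(10, 2, n)
X = np.column_stack((np.ones(n), x))
Y = 5 + 2 * x + rng.normal(0, 2, n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h_ii = np.diag(H)
resid = Y - H @ Y
s2 = resid @ resid / (n - p)

# Closed-form leave-one-out variance estimates
s2_loo_formula = ((n - p) * s2 - resid**2 / (1 - h_ii)) / (n - p - 1)

# Brute force: refit without observation i and recompute the variance estimate
s2_loo_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    r = Y[keep] - X[keep] @ beta_i
    s2_loo_refit[i] = r @ r / (n - 1 - p)

formula_matches_refit = np.allclose(s2_loo_formula, s2_loo_refit)

# Externally studentized residuals from the closed form
r_external = resid / np.sqrt(s2_loo_formula * (1 - h_ii))

print("closed form matches refit:", formula_matches_refit)
print("externally studentized residuals:", np.round(r_external, 3))
```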


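#### 5. Externally studentized leave-one-out residuals
The leave-one-out (deleted) residual is $\hat{e}_{(-i)} = Y_i - \mathbf{x}_i^\top \hat{\beta}_{(-i)} = \frac{\hat{e}_i}{1 - h_{ii}}$, where $\hat{\beta}_{(-i)}$ is the OLS estimate computed without the $i$-th observation. Its variance is $\frac{\sigma^2}{1 - h_{ii}}$, so scaling it by its own estimated standard deviation gives
$$
\hat{r}_{(-i)} = \frac{\hat{e}_{(-i)} \sqrt{1 - h_{ii}}}{s_{(-i)}} = \frac{\hat{e}_i}{s_{(-i)} \sqrt{1 - h_{ii}}}
$$
which coincides with the externally studentized residual from item 4.
Externally studentized residuals highlight observations that are poorly predicted by the model fitted without them. Under the normal linear model, $\hat{r}_{(-i)} \sim t_{n - p - 1}$.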

In R:
* `rstandard(model, ...)`: internally studentized
* `rstudent(model, ...)`: externally studentized
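
The leave-one-out residual $Y_i - \mathbf{x}_i^\top \hat{\beta}_{(-i)}$ can also be checked by brute force. The sketch below, on simulated data with arbitrary sizes and seed, refits the model without each observation, compares the prediction error with $\hat{e}_i / (1 - h_{ii})$, and then studentizes it with its own standard deviation estimate $s_{(-i)} / \sqrt{1 - h_{ii}}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 12, 2  # hypothetical sizes, chosen only for illustration

x = rng.normal(10, 2, n)
X = np.column_stack((np.ones(n), x))
Y = 5 + 2 * x + rng.normal(0, 2, n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h_ii = np.diag(H)
resid = Y - H @ Y

e_loo = np.empty(n)   # leave-one-out prediction errors
s_loo = np.empty(n)   # residual standard deviation of each reduced fit
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    e_loo[i] = Y[i] - X[i] @ beta_i
    r = Y[keep] - X[keep] @ beta_i
    s_loo[i] = np.sqrt(r @ r / (n - 1 - p))

# Identity: deleted residual equals the raw residual divided by (1 - h_ii)
identity_holds = np.allclose(e_loo, resid / (1 - h_ii))

# Studentizing the deleted residual reproduces the externally studentized residual
r_from_deleted = e_loo * np.sqrt(1 - h_ii) / s_loo
r_external = resid / (s_loo * np.sqrt(1 - h_ii))
definitions_agree = np.allclose(r_from_deleted, r_external)

print("e_(-i) = e_i / (1 - h_ii):", identity_holds)
print("both studentizations agree:", definitions_agree)
```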


# Let's code

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

import scipy.stats as stats
import matplotlib.pyplot as plt

In [None]:
# Generate sample data for simple regression
np.random.seed(42)
n = 30
X = np.random.normal(10, 2, n)
sigma = 2  # Known true standard deviation
Y = 2 * X + 5 + np.random.normal(0, sigma, n)  # Linear relationship


In [None]:

# Compute coefficients manually
# Construct design matrix with intercept
X_matrix = np.column_stack((np.ones(n), X))
Y_matrix = Y
beta_hat = np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T @ Y_matrix

# Predicted values and residuals
Y_hat = X_matrix @ beta_hat
residuals = Y_matrix - Y_hat  # classical residuals by hand

# Assemble tidy DataFrame for statsmodels formulas
data = pd.DataFrame({'Y': Y_matrix, 'X': X})

# Variance and standardized residuals
s_squared = np.sum(residuals**2) / (n - 2)  # unbiased variance estimate
# Alternative (biased) estimator would be s2 = np.var(residuals)

# Hat matrix and leverage values
H = X_matrix @ np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T
h_ii = np.diag(H)
standardized_residuals_known_sigma = residuals / np.sqrt(sigma**2 * (1 - h_ii))




In [None]:
# Compute trace of H
trace_H_direct = np.trace(H)
print("Trace of H:", trace_H_direct)
print("Sum of leverages h_ii:", h_ii.sum())

In [None]:

# Fit model using statsmodels (formula API)
model_formula = smf.ols('Y ~ X', data=data).fit()

# Fit model using statsmodels OLS with explicit design matrix
model = sm.OLS(Y_matrix, X_matrix).fit()

# Predicted values and residuals from formula fit
Y_hat = model_formula.fittedvalues
residuals = model_formula.resid

# Leverage values (diagonal of Hat matrix)
influence_formula = model_formula.get_influence()
h_ii = influence_formula.hat_matrix_diag

influence_matrix = model.get_influence()
h_ii_matrix = influence_matrix.hat_matrix_diag

print('Sum of leverages (formula fit):', h_ii.sum())
print('Sum of leverages (matrix fit):', h_ii_matrix.sum())


In [None]:

# Compute residual variants
p = 2  # parameters: intercept and slope

# 1. Standardized residuals (using known true sigma)
standardized_residuals = residuals / np.sqrt(sigma**2 * (1 - h_ii))

# 2. Internally studentized residuals (unknown sigma)
s_squared = np.sum(residuals**2) / (n - p)
studentized_residuals_internal = residuals / np.sqrt(s_squared * (1 - h_ii))

# 3. Externally studentized residuals (deleted residuals)
sigma_i_sq = ((n - p) * s_squared - (residuals**2) / (1 - h_ii)) / (n - p - 1)
sigma_i_sq = np.clip(sigma_i_sq, a_min=0, a_max=None)
sigma_i = np.sqrt(sigma_i_sq)
studentized_residuals_external = residuals / (sigma_i * np.sqrt(1 - h_ii))

# Reference residuals from statsmodels
influence = model_formula.get_influence()
model_studentized_residuals_internal = influence.resid_studentized
model_studentized_residuals_external = influence.resid_studentized_external

residuals_df = pd.DataFrame({
    'Residuals (statsmodels)': residuals,
    'Standardized (known sigma)': standardized_residuals,
    'Studentized internal (manual)': studentized_residuals_internal,
    'Studentized external (manual)': studentized_residuals_external,
    'Studentized internal (statsmodels)': model_studentized_residuals_internal,
    'Studentized external (statsmodels)': model_studentized_residuals_external
})

residuals_df


In [None]:

# Compare variance estimates
s_squared_manual = np.sum(residuals**2) / (n - 2)
s_squared_statsmodels = model.mse_resid
print("s_squared (manual):", s_squared_manual)
print("s_squared (statsmodels):", s_squared_statsmodels)
print("Difference:", abs(s_squared_manual - s_squared_statsmodels))



### Visual checks of residual distributions
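We compare raw residuals with their internally and externally studentized versions to see how scaling affects the normality assessment. The externally studentized residuals are plotted against quantiles of the $t_{n - p - 1}$ distribution rather than the normal.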


In [None]:

# Q-Q plots for different residual definitions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Q-Q plots for residual diagnostics")

stats.probplot(residuals, dist="norm", plot=axes[0])
axes[0].set_title("Raw residuals vs normal")

stats.probplot(studentized_residuals_internal, dist="norm", plot=axes[1])
axes[1].set_title("Studentized residuals (internal)")

stats.probplot(studentized_residuals_external, dist="t", sparams=(n - p - 1,), plot=axes[2])
axes[2].set_title("Studentized residuals (external)")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


### Residual shape examples


The four-row layout shows, for each simulated scenario, the data with the fitted regression line, the true disturbances, the fitted residuals, and a Q-Q plot of the residuals.


In [None]:

n_samples = 100

# Generate datasets with specific residual characteristics
def generate_data(case):
    x = np.linspace(0, 10, n_samples)
    if case == "Right skewed":
        y = 2 * x + 5 + np.random.exponential(scale=1, size=n_samples)
    elif case == "Left skewed":
        y = 2 * x + 5 - np.random.exponential(scale=1, size=n_samples)
    elif case == "Tails too light":
        y = 2 * x + 5 + np.random.uniform(low=-1, high=1, size=n_samples)
    elif case == "Tails too heavy":
        y = 2 * x + 5 + np.random.standard_t(df=1, size=n_samples)
    elif case == "Bimodal distribution":
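        # Shuffle the mixture so that the mode does not depend on x;
        # otherwise the fitted slope absorbs the shift between the two halves
        y = 2 * x + 5 + np.random.permutation(np.concatenate([
            np.random.normal(loc=-2, scale=0.5, size=n_samples//2),
            np.random.normal(loc=2, scale=0.5, size=n_samples//2)
        ]))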
    elif case == "True normal distribution":
        y = 2 * x + 5 + np.random.normal(scale=1, size=n_samples)
    else:
        raise ValueError(f"Unknown case: {case}")
    return x, y

# Titles for different residual characteristics
titles = ["Right skewed", "Left skewed", "Tails too light",
          "Tails too heavy", "Bimodal distribution", "True normal distribution"]

# Right = positive skewed, Left = Negative skewed

# 4 rows (scatter plot, disturbances, residuals, Q-Q plot) x 6 scenarios
fig, axes = plt.subplots(4, 6, figsize=(24, 16))

# Generate data, fit model, and plot scatter plots with regression lines, histograms, and QQ plots
for i, title in enumerate(titles):
    # Generate data with specific residual characteristics
    x, y = generate_data(title)

    # Plot scatter plot of data with regression line
    x_with_const = sm.add_constant(x)
    model = sm.OLS(y, x_with_const).fit()
    y_pred = model.predict(x_with_const)

    axes[0, i].scatter(x, y, color='orange', alpha=0.6, edgecolor='black')
    axes[0, i].plot(x, y_pred, color='red', lw=2)
    axes[0, i].set_title(f"{title} Scatterplot")
    axes[0, i].set_xlabel("X")
    axes[0, i].set_ylabel("Y")

    # Plot histogram of disturbances
    disturbances = y - (2 * x + 5)
    axes[1, i].hist(disturbances, bins=20, edgecolor='black', alpha=0.7)
    axes[1, i].set_title(f"{title} Disturbances")

    # Plot histogram of residuals
    residuals = model.resid
    axes[2, i].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
    axes[2, i].set_title(f"{title} Residuals")

    # Plot QQ plot of residuals
    sm.qqplot(residuals, line='s', ax=axes[3, i])
    axes[3, i].set_title(f"{title} QQ Plot")

# Adjust layout and show plot
plt.tight_layout()
plt.show()


# HW in separate notebook