<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex04_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01RAD Exercise 04

 - Residuals
 - Post hoc analysis



$$
\text{Var}(\hat{y}_i) = \sigma^2 \, \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i
$$
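
The variance of a fitted value can be derived in two steps. This is a sketch under the usual assumption of uncorrelated errors with common variance, $\text{Var}(\mathbf{Y}) = \sigma^2 \mathbf{I}$. The notation $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$ for the OLS estimate and $\mathbf{x}_i^\top$ for the $i$-th row of $\mathbf{X}$ is assumed here.

$$
\text{Var}(\hat{\boldsymbol{\beta}}) = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top (\sigma^2 \mathbf{I}) \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}
$$

$$
\text{Var}(\hat{y}_i) = \text{Var}(\mathbf{x}_i^\top \hat{\boldsymbol{\beta}}) = \mathbf{x}_i^\top \, \text{Var}(\hat{\boldsymbol{\beta}}) \, \mathbf{x}_i = \sigma^2 \, \mathbf{x}_i^\top (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{x}_i
$$

The quadratic form on the right is the $i$-th diagonal element of $\mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$.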

# Projection Hat Matrix $\mathbf{H}$ and Matrix $\mathbf{M}$  


##  $ \mathbf{H} $: Hat Matrix

$$
\mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top
$$

 - projects the vector of observations $ \mathbf{Y} $ onto the column space generated by the design matrix $\mathbf{X}$.
 - puts a hat on $ \mathbf{Y} $
 - $\hat{\mathbf{Y}} =  \mathbf{H Y}$
 - `hat_Y = H @ Y`
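
As a minimal sketch of the projection, the following builds $\mathbf{H}$ for a small simulated design matrix and compares `H @ Y` with the fitted values from a least-squares solver. The data, the seed, and the sizes `n` and `p` are illustrative assumptions, not part of the exercise data.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed (assumption)
n, p = 8, 2                              # illustrative sizes

# Design matrix with an intercept column and one regressor
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T
H = X @ np.linalg.inv(X.T @ X) @ X.T
hat_Y = H @ Y

# Fitted values from an independent least-squares solve
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

print("H shape:", H.shape)
print("H @ Y equals X @ beta_hat:", np.allclose(hat_Y, X @ beta_hat))
```

Forming the full $n \times n$ matrix is convenient for teaching, but for large $n$ one would usually work with `beta_hat` directly.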


 **Properties of the Hat Matrix**

- **Symmetry and Idempotency**:
   $$
   \mathbf{H} = \mathbf{H}^\top \quad \text{and} \quad \mathbf{H}^2 = \mathbf{H}
   $$

- **Trace (Degrees of Freedom)**:
   $$
   \text{tr}(\mathbf{H}) = p
   $$

- **Diagonal Elements as Leverage Values**:

   The diagonal elements $ h_{ii} $ of $ \mathbf{H} $ measure how strongly each observation $Y_i$ influences its own fitted value $\hat{Y}_i$, since $ h_{ii} = \partial \hat{Y}_i / \partial Y_i $.
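
The properties listed above can be checked numerically. This is a small sketch on a simulated design matrix (the seed and the sizes `n`, `p` are assumptions chosen for illustration). It verifies symmetry, idempotency, the trace, and that the leverages lie in $[0, 1]$.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary seed (assumption)
n, p = 10, 3                             # illustrative sizes

# Intercept plus p - 1 random regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# H = X (X^T X)^{-1} X^T, using solve instead of an explicit inverse
H = X @ np.linalg.solve(X.T @ X, X.T)
h_ii = np.diag(H)

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)
trace_H = np.trace(H)
leverage_in_unit_interval = bool(np.all((h_ii > -1e-12) & (h_ii < 1 + 1e-12)))

print("symmetric:", is_symmetric)
print("idempotent:", is_idempotent)
print("trace:", trace_H, "p:", p)
print("leverages in [0, 1]:", leverage_in_unit_interval)
```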



## $\mathbf{M} $ Matrix
$$
\mathbf{M} = \mathbf{I} - \mathbf{H}
$$

- the complement of the Hat matrix.
- projects onto the orthogonal complement of the column space of $ \mathbf{X} $.
- $\hat{e} = \mathbf{MY}$

**Properties of the $ \mathbf{M} $ Matrix**

- **Symmetry and Idempotency**:
   $$
   \mathbf{M} = \mathbf{M}^\top \quad \text{and} \quad \mathbf{M}^2 = \mathbf{M}
   $$


- **Orthogonality with $\mathbf{H}$**:
   $$
   \mathbf{H} \mathbf{M} = \mathbf{0} \quad \text{and} \quad \mathbf{M} \mathbf{H} = \mathbf{0}
   $$


- **Trace (Residual Degrees of Freedom)**:
   $$
   \text{tr}(\mathbf{M}) = n - p
   $$
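
A short numerical sketch of the $\mathbf{M}$ matrix, again on simulated data with an arbitrary seed and illustrative sizes (all assumptions). It checks the properties above and confirms that the residuals $\mathbf{MY}$ are orthogonal to every column of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)           # arbitrary seed (assumption)
n, p = 10, 3                             # illustrative sizes

X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H

e_hat = M @ Y                            # residual vector
hat_Y = H @ Y                            # fitted values

m_symmetric = np.allclose(M, M.T)
m_idempotent = np.allclose(M @ M, M)
hm_is_zero = np.allclose(H @ M, 0) and np.allclose(M @ H, 0)
trace_M = np.trace(M)
resid_orthogonal_to_X = np.allclose(X.T @ e_hat, 0)

print("M symmetric:", m_symmetric, "| M idempotent:", m_idempotent)
print("HM = MH = 0:", hm_is_zero)
print("trace(M):", trace_M, "n - p:", n - p)
print("X^T e_hat = 0:", resid_orthogonal_to_X)
print("Y = H Y + M Y:", np.allclose(Y, hat_Y + e_hat))
```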

---



## Questions:

- If $ \mathbf{X} $ has $ n $ rows and $ p $ linearly independent columns, what is the dimension of the vector space on which $\mathbf{H}$ and $\mathbf{M}$ operate?

- What is the dimension of the range of $\mathbf{X}$ (the span of its column vectors, i.e. the set of all their linear combinations), denoted $\text{Col}(\mathbf{X})$? In other words, what is $\text{rank}(\mathbf{X})$?

- What is the dimension of the orthogonal complement of the column space of $ \mathbf{X} $, i.e. $ \text{Nul}(\mathbf{X}^\top) $, and what is it called?

- Is the following equation correct? $$ \mathbb{R}^n = \text{Col}(\mathbf{X}) \oplus \text{Nul}(\mathbf{X}^\top) $$

### Types of Residuals in Linear Regression

$$
Y_i = X_i \beta + e_i, \ \text{where} \ e_i \sim N(0, \sigma^2)
$$

Residuals measure the difference between observed and predicted values.

#### 1. Raw Residuals

The raw residuals are simply the differences between each observed value $ Y_i $ and its corresponding predicted value $\hat{Y}_i $:
$$
\hat{e}_i = Y_i - \hat{Y}_i
$$

#### 2. Standardized Residuals (known sigma):

- Each raw residual divided by its standard deviation, with $\sigma$ known.
$$
\hat{Z}_i = \frac{\hat{e}_i}{\sigma \sqrt{(1 - h_{ii})}}
$$

Question: How is the variance of the residuals computed?
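
One way to explore this question empirically is a small Monte Carlo experiment. The sketch below assumes the denominator of $\hat{Z}_i$ reflects $\text{Var}(\hat{e}_i) = \sigma^2 (1 - h_{ii})$ and compares that value with the sample variance of simulated residuals. It uses the fact that $\hat{\mathbf{e}} = \mathbf{M}\mathbf{Y} = \mathbf{M}\mathbf{e}$, because $\mathbf{M}\mathbf{X} = \mathbf{0}$. The seed, sample size, and number of replications are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)           # arbitrary seed (assumption)
n, sigma = 15, 2.0                       # illustrative sample size and known sigma
n_rep = 20000                            # number of simulated error vectors

x = rng.normal(10, 2, size=n)
X = np.column_stack([np.ones(n), x])

H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H
h_ii = np.diag(H)

# Each row of E is one draw of the error vector e ~ N(0, sigma^2 I)
E = rng.normal(0, sigma, size=(n_rep, n))

# Residuals for every draw: e_hat = M e (M is symmetric, so E @ M applies M row-wise)
resid = E @ M

empirical_var = resid.var(axis=0)
theoretical_var = sigma**2 * (1 - h_ii)

print("max relative deviation:",
      np.max(np.abs(empirical_var / theoretical_var - 1)))
```

Observations with high leverage have residuals with smaller variance, which is why raw residuals are not directly comparable across observations.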


#### 3. Internally Studentized Residuals (unknown sigma)

Internally studentized residuals adjust each residual to account for the leverage $ h_{ii} $ of each observation.

$$
\hat{r}_i = \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}}
$$

where $s^2 = \hat{\sigma}^2 = \frac{1}{n - p}\sum_{j=1}^n \hat{e}_j^2 $ is the variance estimate from OLS, using all $n$ observations.


Studentized residuals put the observations on a comparable scale by normalizing each residual with its own estimated standard deviation. Because $s$ is computed from all $n$ observations, including observation $i$ itself, internally studentized residuals do not fully capture how the fit would change if that observation were removed from the model.

#### 4. Externally Studentized Residuals

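Externally Studentized Residuals $\hat{r}_{(-i)}$
 - take the PRESS (leave-one-out) residuals $\hat{e}_{(-i)} = \hat{e}_i / (1 - h_{ii})$, i.e. the prediction errors obtained when observation $i$ is left out of the model fit, and divide them by an estimate of their standard deviation, $s_{(-i)} / \sqrt{1 - h_{ii}}$.
$$
\hat{r}_{(-i)} = \frac{\hat{e}_{(-i)} \sqrt{1 - h_{ii}}}{s_{(-i)}} = \frac{\hat{e}_i}{s_{(-i)} \sqrt{1 - h_{ii}}}
$$
where $s_{(-i)}$ is the residual standard deviation estimated without observation $i$ (with $p$ the number of columns of $\mathbf{X}$, as in $s^2$ above):
$$
s_{(-i)} = \sqrt{\frac{(n - p)s^2 - \frac{\hat{e}_i^2}{1 - h_{ii}}}{n - p - 1}}
$$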
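
The leave-one-out quantities do not require refitting the model $n$ times. The sketch below compares closed-form expressions with a brute-force refit. It assumes $p$ counts all columns of $\mathbf{X}$ including the intercept, so the full fit has $n - p$ residual degrees of freedom and each leave-one-out fit has $n - p - 1$. The simulated data and seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)           # arbitrary seed (assumption)
n, p = 12, 2                             # illustrative sizes; p includes the intercept

x = rng.normal(10, 2, size=n)
X = np.column_stack([np.ones(n), x])
Y = 5 + 2 * x + rng.normal(0, 2, size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h_ii = np.diag(H)
e_hat = Y - H @ Y
s2 = e_hat @ e_hat / (n - p)

# Closed form: dropping observation i lowers the SSE by e_i^2 / (1 - h_ii)
s2_loo_closed = ((n - p) * s2 - e_hat**2 / (1 - h_ii)) / (n - p - 1)
r_ext_closed = e_hat / np.sqrt(s2_loo_closed * (1 - h_ii))

# Brute force: refit the model n times, each time without observation i
s2_loo_refit = np.empty(n)
press = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i, *_ = np.linalg.lstsq(X[keep], Y[keep], rcond=None)
    resid_i = Y[keep] - X[keep] @ beta_i
    s2_loo_refit[i] = resid_i @ resid_i / (n - 1 - p)
    press[i] = Y[i] - X[i] @ beta_i      # PRESS (leave-one-out) residual

# PRESS residual divided by its estimated standard deviation s_(-i) / sqrt(1 - h_ii)
r_ext_refit = press * np.sqrt(1 - h_ii) / np.sqrt(s2_loo_refit)

print("PRESS = e_i / (1 - h_ii):", np.allclose(press, e_hat / (1 - h_ii)))
print("s2_(-i) agrees:", np.allclose(s2_loo_closed, s2_loo_refit))
print("external residuals agree:", np.allclose(r_ext_closed, r_ext_refit))
```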




# Let's code

In [None]:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm

import scipy.stats as stats
import matplotlib.pyplot as plt

In [None]:
# Generate sample data for simple regression
np.random.seed(42)
n = 30
X = np.random.normal(10, 2, n)
sigma = 2  # Known true standard deviation
Y = 2 * X + 5 + np.random.normal(0, sigma, n)  # True model: Y = 5 + 2 X + e, e ~ N(0, sigma^2)


In [None]:
# Compute coefficients manually (by hand)
data = pd.DataFrame({'X': X, 'Y': Y})

# Add intercept
data['Intercept'] = 1

X_matrix = data[['Intercept', 'X']].values
Y_matrix = data['Y'].values
beta_hat = np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T @ Y_matrix

# Predicted values and residuals
Y_hat = X_matrix @ beta_hat
residuals = Y_matrix - Y_hat  # classical residuals by hand

# Variance estimate
s_squared = np.sum(residuals**2) / (n - 2)  # estimate of sigma^2 with n - p = n - 2 degrees of freedom
# Can we use the sample variance instead? s2 = np.var(residuals)


# Compute Hat matrix H
H = X_matrix @ np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T

h_ii = np.diag(H)  # Leverage values (diagonal of Hat matrix)
standardized_residuals = residuals / (sigma * np.sqrt(1 - h_ii))  # Standardized residuals by hand (known sigma)


In [None]:
# Compute trace of H
trace_H_direct = np.trace(H)
print("Trace of H:", trace_H_direct)
print("Sum of leverages h_ii:", np.sum(h_ii))

In [None]:
# Fit model using statsmodels
model = smf.ols('Y ~ X', data=data).fit()

# Predicted values and residuals
Y_hat = model.fittedvalues
#X_matrix = model.model.exog

residuals = model.resid  # Classical residuals from statsmodels

# Leverage values (diagonal of Hat matrix)
influence = model.get_influence()
h_ii = influence.hat_matrix_diag
np.sum(h_ii)  # sum of leverages = tr(H) = p

In [None]:
# Compute Residuals
# 1. Standardized Residuals (using known true sigma)
standardized_residuals = residuals / np.sqrt(sigma**2 * (1 - h_ii))

# 2. Studentized Residuals (Internal, matches resid_studentized in statsmodels)
s_squared = np.sum(residuals**2) / (n - 2)  # OLS variance estimate
studentized_residuals_internal = residuals / np.sqrt(s_squared * (1 - h_ii))

# 3. Studentized Residuals (External, matches resid_studentized_external in statsmodels)
# Leave-one-out variance: removing observation i lowers the SSE by e_i^2 / (1 - h_ii)
s_squared_loo = ((n - 2) * s_squared - residuals**2 / (1 - h_ii)) / (n - 3)
studentized_residuals_external = residuals / np.sqrt(s_squared_loo * (1 - h_ii))

# Gather residuals from statsmodels for comparison
model_studentized_residuals_internal = model.get_influence().resid_studentized  # internal studentized residuals
model_studentized_residuals_external = model.get_influence().resid_studentized_external  # external studentized residuals

# Create a DataFrame with all residuals
residuals_df = pd.DataFrame({
    'Classical Residuals (StatsModels)': residuals,
    'Standardized Residuals (Hand)': standardized_residuals,
    'Studentized Residuals (Internal - Hand)': studentized_residuals_internal,
    'Studentized Residuals (External - Hand)': studentized_residuals_external,
    'Studentized Residuals (Internal - StatsModels)': model_studentized_residuals_internal,
    'Studentized Residuals (External - StatsModels)': model_studentized_residuals_external
})

residuals_df


In [None]:
# Calculate s_squared manually
residuals = model.resid
s_squared_manual = np.sum(residuals**2) / (n - 2)

# Get s_squared from statsmodels
s_squared_statsmodels = model.mse_resid

# Display both values
print("s_squared (manual calculation):", s_squared_manual)
print("s_squared (statsmodels):", s_squared_statsmodels)
print("Difference:", abs(s_squared_manual - s_squared_statsmodels))


In [None]:
# Plot Q-Q plots to test distributions
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("Q-Q Plots for Residuals Distributions")

# 1. Standardized residuals with known sigma (Z-scores): N(0, 1) reference
stats.probplot(standardized_residuals, dist="norm", plot=axes[0])
axes[0].set_title("Standardized Residuals (Known Sigma) - Z Scores")

# 2. Internally studentized residuals (estimated sigma): N(0, 1) as an approximate reference
stats.probplot(studentized_residuals_internal, dist="norm", plot=axes[1])
axes[1].set_title("Internally Studentized Residuals (Estimated Sigma)")

# 3. Externally studentized residuals (statsmodels values): t reference with n - p - 1 = n - 3 df
stats.probplot(model_studentized_residuals_external, dist="t", sparams=(n - 3,), plot=axes[2])
axes[2].set_title("Externally Studentized Residuals - t Statistic")

plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()


In [None]:
n_samples = 100

# Generate datasets with specific residual characteristics
def generate_data(case):
    x = np.linspace(0, 10, n_samples)
    if case == "Right skewed":
        y = 2 * x + 5 + np.random.exponential(scale=1, size=n_samples)
    elif case == "Left skewed":
        y = 2 * x + 5 - np.random.exponential(scale=1, size=n_samples)
    elif case == "Tails too light":
        y = 2 * x + 5 + np.random.uniform(low=-1, high=1, size=n_samples)
    elif case == "Tails too heavy":
        y = 2 * x + 5 + np.random.standard_t(df=1, size=n_samples)
    elif case == "Bimodal distribution":
        y = 2 * x + 5 + np.concatenate([
            np.random.normal(loc=-2, scale=0.5, size=n_samples//2),
            np.random.normal(loc=2, scale=0.5, size=n_samples//2)
        ])
    elif case == "True normal distribution":
        y = 2 * x + 5 + np.random.normal(scale=1, size=n_samples)
    else:
        raise ValueError(f"Unknown case: {case}")
    return x, y

# Titles for different residual characteristics
titles = ["Right skewed", "Left skewed", "Tails too light",
          "Tails too heavy", "Bimodal distribution", "True normal distribution"]

# 4 x 6 grid, one column per case: scatter plot with fitted line,
# histogram of true disturbances, histogram of residuals, and QQ plot
fig, axes = plt.subplots(4, 6, figsize=(24, 16))

# Generate data, fit model, and plot scatter plots with regression lines, histograms, and QQ plots
for i, title in enumerate(titles):
    # Generate data with specific residual characteristics
    x, y = generate_data(title)

    # Plot scatter plot of data with regression line
    x_with_const = sm.add_constant(x)
    model = sm.OLS(y, x_with_const).fit()
    y_pred = model.predict(x_with_const)

    axes[0, i].scatter(x, y, color='orange', alpha=0.6, edgecolor='black')
    axes[0, i].plot(x, y_pred, color='red', lw=2)
    axes[0, i].set_title(f"{title} Scatterplot")
    axes[0, i].set_xlabel("X")
    axes[0, i].set_ylabel("Y")

    # Plot histogram of disturbances
    disturbances = y - (2 * x + 5)
    axes[1, i].hist(disturbances, bins=20, edgecolor='black', alpha=0.7)
    axes[1, i].set_title(f"{title} Disturbances")

    # Plot histogram of residuals
    residuals = model.resid
    axes[2, i].hist(residuals, bins=20, edgecolor='black', alpha=0.7)
    axes[2, i].set_title(f"{title} Residuals")

    # Plot QQ plot of residuals
    sm.qqplot(residuals, line='s', ax=axes[3, i])
    axes[3, i].set_title(f"{title} QQ Plot")

# Adjust layout and show plot
plt.tight_layout()
plt.show()


# HW in separate notebook