<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_Ex03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01RAD Exercise 03

Last exercise: simple linear regression and different approaches to adding a categorical variable.
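As a reminder of what "adding a categorical variable" means at the level of the design matrix, here is a minimal NumPy sketch. The toy `weight` and `origin` values are made up for illustration, and the column layout only approximates what the formula interface builds for `weight + C(origin)` and `weight * C(origin)`.

```python
import numpy as np

# Toy data (made up for illustration): weight in 1000 lb and an origin label
weight = np.array([2.0, 2.5, 3.0, 3.5, 2.2, 2.8])
origin = np.array(["usa", "usa", "europe", "europe", "japan", "japan"])

# Treatment (dummy) coding: the alphabetically first level is the baseline
levels = np.unique(origin)                       # sorted: europe, japan, usa
dummies = (origin[:, None] == levels[1:]).astype(float)

ones = np.ones_like(weight)

# 1) mpg ~ weight: one intercept, one slope
X1 = np.column_stack([ones, weight])
# 2) mpg ~ weight + C(origin): level-specific intercepts, common slope
X2 = np.column_stack([ones, dummies, weight])
# 3) mpg ~ weight * C(origin): level-specific intercepts and slopes
X3 = np.column_stack([ones, dummies, weight, dummies * weight[:, None]])

print(X1.shape, X2.shape, X3.shape)
```

Each extra dummy column shifts the intercept for one non-baseline level, and each `dummy * weight` column shifts the slope for that level.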

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

import statsmodels.api as sm
import statsmodels.formula.api as smf

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.datasets import get_rdataset
from scipy.stats import t,norm

from sklearn.linear_model import LinearRegression

In [None]:
# Load the dataset
cars = sns.load_dataset('mpg').dropna()  # Dropping rows with missing values

# Check the first few rows
print(cars.head())

In [None]:
# OLS model: mpg ~ weight (single slope and intercept)
model1 = smf.ols('mpg ~ weight', data=cars)
results1 = model1.fit()
print(results1.summary())


# Scatter plot + regression line
plt.figure(figsize=(8, 6))
sns.scatterplot(x='weight', y='mpg', data=cars, color='blue')
plt.plot(cars['weight'], results1.fittedvalues, color='red', label='Regression line')

plt.title('Simple Linear Regression (mpg ~ weight)')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend()
plt.show()


In [None]:
# OLS model: mpg ~ weight + origin (three intercepts, one slope)
model2 = smf.ols('mpg ~ weight + C(origin)', data=cars)
results2 = model2.fit()
print(results2.summary())

# Fix one color per origin level so that points and lines share colors
origin_levels = cars['origin'].unique()
palette = dict(zip(origin_levels, sns.color_palette('Set1', len(origin_levels))))

# Scatter plot with points colored by origin
plt.figure(figsize=(8, 6))
sns.scatterplot(x='weight', y='mpg', hue='origin', data=cars, palette=palette)

# Plot regression lines for each origin group (same slope, different intercepts)
for origin_level in origin_levels:
    subset = cars[cars['origin'] == origin_level]
    plt.plot(subset['weight'], results2.predict(subset), label=f'Origin {origin_level}', color=palette[origin_level])

plt.title('Multiple Intercepts (mpg ~ weight + origin)')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend(title='Origin')
plt.show()

In [None]:
# OLS model: mpg ~ weight + origin + weight:origin (three intercepts, three slopes)
model3 = smf.ols('mpg ~ weight * C(origin)', data=cars)
results3 = model3.fit()
print(results3.summary())

# Fix one color per origin level so that points and lines share colors
origin_levels = cars['origin'].unique()
palette = dict(zip(origin_levels, sns.color_palette('Set1', len(origin_levels))))

# Scatter plot with points colored by origin
plt.figure(figsize=(8, 6))
sns.scatterplot(x='weight', y='mpg', hue='origin', data=cars, palette=palette)

# Plot regression lines for each origin group (different slopes and intercepts)
for origin_level in origin_levels:
    subset = cars[cars['origin'] == origin_level]
    plt.plot(subset['weight'], results3.predict(subset), label=f'Origin {origin_level}', color=palette[origin_level])

plt.title('Multiple Intercepts and Slopes (mpg ~ weight * origin)')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend(title='Origin')
plt.show()


**The confidence interval for the mean predicted value** $\hat{y}_i$ at a given value of $x_i$ is calculated as:

$
\hat{y}_i \pm z_{\alpha/2} \cdot \sqrt{\text{Var}(\hat{y}_i)}
$

Where:
- $\hat{y}_i$ is the predicted value at $x_i$, computed from the regression equation $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$,
- $z_{\alpha/2}$ is the critical value of the standard normal distribution (determined by the desired confidence level, typically 95%),
- $\text{Var}(\hat{y}_i)$ is the variance of the predicted value $\hat{y}_i$, given by:

$
\text{Var}(\hat{y}_i) = \sigma^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \right)
$

Where:
- $\sigma^2$ is the residual variance,
- $n$ is the number of observations,
- $x_i$ is the specific value of the independent variable for which the confidence interval is being calculated,
- $\bar{x}$ is the mean of the independent variable values.
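The scalar variance formula above is the simple-regression case of the matrix expression $\sigma^2 \, \mathbf{x}_0^T (X^T X)^{-1} \mathbf{x}_0$. The following sketch checks numerically that the two agree; the data, the value of `sigma2`, and the evaluation point `x0` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.5, 5.0, size=30)          # made-up predictor values
X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
sigma2 = 4.0                                # assumed known residual variance

x0 = 3.2                                    # point at which Var(y_hat) is evaluated
x0_vec = np.array([1.0, x0])

# Scalar formula from the text
n = len(x)
var_scalar = sigma2 * (1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))

# Matrix form: sigma^2 * x0^T (X^T X)^{-1} x0, using solve instead of an explicit inverse
var_matrix = sigma2 * x0_vec @ np.linalg.solve(X.T @ X, x0_vec)

print(var_scalar, var_matrix)
```

The matrix form carries over unchanged to models with several predictors, where the scalar formula is no longer available.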


**The prediction interval for an individual predicted value** $y_i$ at a given $x_i$ is computed as:

$
\hat{y}_i \pm z_{\alpha/2} \cdot \sqrt{\text{Var}(\hat{y}_i) + \sigma^2}
$
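One way to probe an interval formula is to simulate its coverage. The sketch below does this for the prediction interval in the idealised case where $\sigma$ is known, which is the setting in which the normal quantile is exact. The true parameters, the design, and the new point are all made up; in practice $\sigma$ must be estimated, and this simulation does not cover that case.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
beta0, beta1, sigma = 40.0, -6.0, 2.0        # made-up true parameters
x = np.linspace(1.5, 5.0, 25)                # fixed design
X = np.column_stack([np.ones_like(x), x])
x0 = np.array([1.0, 3.0])                    # new point (intercept term, x value)
z = NormalDist().inv_cdf(0.975)              # z_{alpha/2} for alpha = 0.05

# Var(y_hat_0) and the half-width of the prediction interval, sigma known
var_mean = sigma**2 * x0 @ np.linalg.solve(X.T @ X, x0)
half_width = z * np.sqrt(var_mean + sigma**2)

n_rep, hits = 5000, 0
for _ in range(n_rep):
    # Simulate a training sample, fit OLS, then draw one new observation at x0
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_new = beta0 + beta1 * x0[1] + rng.normal(0.0, sigma)
    hits += abs(y_new - x0 @ beta_hat) <= half_width

coverage = hits / n_rep
print(f"Empirical coverage: {coverage:.3f}")
```

The empirical coverage should land close to the nominal 95%. Repeating the experiment with an estimated $\sigma$ and a small sample is a useful follow-up.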

## Questions:

* Is this computation correct?
* If so, can I use it in practice?
* How can we derive the formula for $\hat{y}_i$ in general?

<!--


The confidence interval for the mean predicted value with unknown $\sigma$:

$
\hat{y}_i \pm t_{\alpha/2, n-m-1} \cdot \sqrt{\hat{\sigma}^2 \left( \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \right)}
$

Where the unbiased estimate of the residual variance $\hat{\sigma}^2$ in a regression model is given by:

$
s_n^2 = \hat{\sigma}^2 = \frac{1}{n - m - 1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$

$
\text{Var}(\hat{y}_i) = \sigma^2 \cdot \mathbf{x}_i^T (X^T X)^{-1} \mathbf{x}_i
$

-->


In [None]:
# Generate new data for weight (for a smooth line plot)
weight_range = np.linspace(cars['weight'].min(), cars['weight'].max(), 100)
new_data = pd.DataFrame({'weight': weight_range})

# Predict the mean mpg and get confidence and prediction intervals
predictions = results1.get_prediction(new_data)
prediction_summary = predictions.summary_frame(alpha=0.05)  # 95% intervals
prediction_summary.head()


In [None]:
# Plotting
plt.figure(figsize=(8, 6))

# Scatter plot of original data
sns.scatterplot(x='weight', y='mpg', data=cars, color='blue', label='Data')

# Plot the regression line (mean prediction)
plt.plot(weight_range, prediction_summary['mean'], color='red', label='Regression line')

# Plot the confidence interval
plt.fill_between(weight_range,
                 prediction_summary['mean_ci_lower'],
                 prediction_summary['mean_ci_upper'],
                 color='red', alpha=0.3, label='Confidence interval')

# Plot the prediction interval
plt.fill_between(weight_range,
                 prediction_summary['obs_ci_lower'],
                 prediction_summary['obs_ci_upper'],
                 color='green', alpha=0.2, label='Prediction interval')

plt.title('Regression Line with Confidence and Prediction Intervals')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend()
plt.show()


Manual computation

In [None]:
# Extracting the regression coefficients and residuals from statsmodels
intercept, slope = results1.params
y_hat = results1.fittedvalues
residuals = results1.resid
n = len(residuals)

# Estimate of sigma^2 (unbiased residual variance) using statsmodels result
sigma_squared_hat = results1.mse_resid
sigma_hat = np.sqrt(sigma_squared_hat)

# Variance-covariance matrix of the coefficients
var_beta_hat = results1.cov_params()

# Generate new data for weight for smooth line plotting
weight_range = np.linspace(cars['weight'].min(), cars['weight'].max(), 100)
X_range_with_intercept = sm.add_constant(weight_range)

# Predicted mean mpg for the new data (regression line)  - classic way
y_hat_range = X_range_with_intercept @ results1.params

# You can use predict function instead of manually calculating
# y_hat_range = results1.predict(new_data)

# Standard error of the predicted mean (for confidence interval)
se_mean_prediction = np.sqrt(np.sum(X_range_with_intercept @ var_beta_hat * X_range_with_intercept, axis=1))

# Confidence interval (95%)
alpha = 0.05
t_value = t.ppf(1 - alpha / 2, df=n - 2)  # Critical t-value for 95% confidence interval
confidence_interval_lower = y_hat_range - t_value * se_mean_prediction
confidence_interval_upper = y_hat_range + t_value * se_mean_prediction

# Standard error for the prediction interval (adds the residual variance to the variance of the mean)
se_prediction_interval = np.sqrt(se_mean_prediction**2 + sigma_squared_hat)

# Prediction interval (95%)
prediction_interval_lower = y_hat_range - t_value * se_prediction_interval
prediction_interval_upper = y_hat_range + t_value * se_prediction_interval


# Plotting
plt.figure(figsize=(8, 6))

# Scatter plot of original data
sns.scatterplot(x=cars['weight'], y=cars['mpg'], color='blue', label='Data')

# Plot the regression line (mean prediction)
plt.plot(weight_range, y_hat_range, color='red', label='Regression line')

# Plot the confidence interval
plt.fill_between(weight_range, confidence_interval_lower, confidence_interval_upper,
                 color='red', alpha=0.3, label='Confidence interval')

# Plot the prediction interval
plt.fill_between(weight_range, prediction_interval_lower, prediction_interval_upper,
                 color='green', alpha=0.2, label='Prediction interval')

plt.title('Regression Line with Confidence and Prediction Intervals (Manual Calculation)')
plt.xlabel('Weight')
plt.ylabel('MPG')
plt.legend()
plt.show()

# Print estimated sigma and standard errors for reference
print(f"Estimated sigma^2 (residual variance): {sigma_squared_hat}")
print(f"Estimated sigma (residual standard deviation): {sigma_hat}")

## Introduction to multivariable regression

Recap: $\hat{\beta}^{OLS} = \operatorname{arg\,min}_{\beta \in \mathbb{R}^p} \sum_{i=1}^n (Y_i - X_i \beta)^2$

From the lecture: $\hat{\beta}^{OLS}  = (X^TX)^{-1}X^TY$

Questions:
* When does it hold?
* How many solutions do we have?
* What should we check and how?
* How do we compute $\hat{\beta}^{OLS}$ in practice?
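A minimal sketch of what these checks can look like in NumPy, on made-up data: the closed form requires $X^TX$ to be invertible, which is equivalent to $X$ having full column rank, and in practice one solves a linear system or calls a least-squares routine instead of forming the inverse. The variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # made-up design
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

# The closed form needs X^T X to be invertible, i.e. X has full column rank
rank = np.linalg.matrix_rank(X)
print("full column rank:", rank == X.shape[1])

# Solve the normal equations X^T X beta = X^T y without forming an inverse
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# A least-squares solver works on X directly
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(beta_solve - beta_lstsq)))

# Duplicate a column: the rank drops and the minimiser is no longer unique
X_def = np.column_stack([X, X[:, 1]])
rank_def = np.linalg.matrix_rank(X_def)
print("rank with duplicated column:", rank_def, "of", X_def.shape[1])
```

With a duplicated column any split of the coefficient between the two copies gives the same fit, so there are infinitely many minimisers.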

In [None]:
# Generating multivariate data X with an intercept
np.random.seed(42)  # For reproducibility
n_samples = 100
n_features = 3

# Generating random explanatory variables
X = np.random.randn(n_samples, n_features)

# Adding intercept (column of ones)
X_intercept = np.hstack((np.ones((n_samples, 1)), X))

# True coefficients beta (including intercept)
beta_true = np.array([5, 2, -3, 1])

# Generating noise epsilon
epsilon = np.random.randn(n_samples)

# Calculating the response Y
Y = X_intercept @ beta_true + epsilon

# Applying various methods to estimate coefficients

## a) Analytical solution using normal equations
beta_hat_norm_eq = np.linalg.inv(X_intercept.T @ X_intercept) @ X_intercept.T @ Y

## b) Using numpy.linalg.lstsq
beta_hat_lstsq, residuals, rank, s = np.linalg.lstsq(X_intercept, Y, rcond=None)

## c) Linear regression using sklearn
model_sk = LinearRegression(fit_intercept=False)
model_sk.fit(X_intercept, Y)
beta_hat_sk = model_sk.coef_

## d) Linear regression using statsmodels
model_sm = sm.OLS(Y, X_intercept).fit()
beta_hat_sm = model_sm.params

# Comparing the results

# Creating a DataFrame for comparison
df_results = pd.DataFrame({
    'True beta': beta_true,
    'Normal equations': beta_hat_norm_eq,
    'Numpy lstsq': beta_hat_lstsq,
    'Sklearn': beta_hat_sk,
    'Statsmodels': beta_hat_sm
})

print(df_results)

# Graphical comparison
methods = ['Normal equations', 'Numpy lstsq', 'Sklearn', 'Statsmodels']
x = np.arange(len(beta_true))  # Indices of coefficients

plt.figure(figsize=(10, 6))
plt.plot(x, beta_true, 'o-', label='True beta', linewidth=3)
for method in methods:
    plt.plot(x, df_results[method], 'x--', label=method)

plt.xticks(x, ['Intercept'] + [f'X{i}' for i in range(1, n_features+1)])
plt.xlabel('Coefficients')
plt.ylabel('Value')
plt.title('Comparison of Estimated Coefficients by Different Methods')
plt.legend()
plt.grid(True)
plt.show()


Overview:

* NumPy (`numpy.linalg.lstsq`): Uses Singular Value Decomposition (SVD)

* SciPy (`scipy.linalg.lstsq`): Offers methods using QR decomposition and SVD

* scikit-learn (`LinearRegression`): Uses SVD via `scipy.linalg.lstsq`

* statsmodels (`OLS`): Uses the Moore-Penrose pseudoinverse (`method='pinv'`) by default; QR decomposition is available via `fit(method='qr')`

Methods:
* **Cholesky Decomposition** $(X^T X = L L^T)$ is efficient but sensitive to data conditions. Use it when you are confident that $(X^TX)$ is positive definite.

* **QR Decomposition** ($X = QR$) is a stable method suitable for most linear regression problems, especially when multicollinearity is a concern.

* **SVD** ($X = U \Sigma V^T$, with $(X^TX)^{-1}X^T = X^{+}$ when $X$ has full column rank) provides the most robust solution, particularly in the presence of multicollinearity or rank deficiency, at a higher computational cost.
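One reason methods based on $X^TX$ (normal equations, Cholesky) are more sensitive than QR or SVD is that forming $X^TX$ squares the 2-norm condition number. The sketch below illustrates this on made-up, nearly collinear predictors.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
z = rng.normal(size=n)
# Intercept plus two nearly collinear made-up predictors
X = np.column_stack([np.ones(n), z, z + 1e-3 * rng.normal(size=n)])

cond_X = np.linalg.cond(X)            # ratio of largest to smallest singular value
cond_XtX = np.linalg.cond(X.T @ X)

print(f"cond(X)     = {cond_X:.3e}")
print(f"cond(X^T X) = {cond_XtX:.3e}")
print(f"cond(X)^2   = {cond_X**2:.3e}")
```

QR and SVD work on $X$ itself, so they lose roughly half as many digits as a solver that starts from $X^TX$.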

In [None]:
# 1. Generating multivariate data X with an intercept
np.random.seed(42)  # For reproducibility
n_samples = 100
n_features = 3

# Function to generate correlated features
def generate_correlated_data(n_samples, n_features, correlation):
    """
    Generates a dataset with specified correlation between features.

    Parameters:
    - n_samples: Number of samples
    - n_features: Number of features (excluding intercept)
    - correlation: Desired correlation between features (between 0 and 1)

    Returns:
    - X: Generated data matrix with an intercept column
    """
    # Create a covariance matrix with the desired correlation
    cov_matrix = np.full((n_features, n_features), correlation)
    np.fill_diagonal(cov_matrix, 1)

    # Generate multivariate normal data
    mean = np.zeros(n_features)
    X_no_intercept = np.random.multivariate_normal(mean, cov_matrix, size=n_samples)

    # Add intercept (column of ones)
    X = np.hstack((np.ones((n_samples, 1)), X_no_intercept))

    return X

# Set the desired correlation level (0.99 makes the features nearly collinear)
correlation_level = 0.99

# Generate correlated data
X = generate_correlated_data(n_samples, n_features, correlation_level)

# True coefficients beta (including intercept)
beta_true = np.array([5, 2, -3, 1])  # Length should be n_features + 1

# 2. Generating noise epsilon
epsilon = np.random.randn(n_samples)

# 3. Calculating the response Y
Y = X @ beta_true + epsilon

# 4. Calculating the condition number of X^T X
condition_number = np.linalg.cond(X.T @ X)
print(f"Condition number of X^T X: {condition_number:.2e}")

# 5. Applying various methods to estimate coefficients

## a) Analytical solution using normal equations
try:
    beta_hat_norm_eq = np.linalg.inv(X.T @ X) @ X.T @ Y
except np.linalg.LinAlgError:
    beta_hat_norm_eq = np.linalg.pinv(X.T @ X) @ X.T @ Y
    print("Used pseudo-inverse due to singular matrix in normal equations.")

## b) Using numpy.linalg.lstsq
beta_hat_lstsq, residuals, rank, s = np.linalg.lstsq(X, Y, rcond=None)

## c) Linear regression using sklearn
model_sk = LinearRegression(fit_intercept=False)
model_sk.fit(X, Y)
beta_hat_sk = model_sk.coef_

## d) Linear regression using statsmodels
model_sm = sm.OLS(Y, X).fit()
beta_hat_sm = model_sm.params

# 6. Comparing the results

# Creating a DataFrame for comparison
df_results = pd.DataFrame({
    'True beta': beta_true,
    'Normal equations': beta_hat_norm_eq,
    'Numpy lstsq': beta_hat_lstsq,
    'Sklearn': beta_hat_sk,
    'Statsmodels': beta_hat_sm
})

print(df_results)

# Graphical comparison
methods = ['Normal equations', 'Numpy lstsq', 'Sklearn', 'Statsmodels']
x = np.arange(len(beta_true))  # Indices of coefficients

plt.figure(figsize=(10, 6))
plt.plot(x, beta_true, 'o-', label='True beta', linewidth=3)
for method in methods:
    plt.plot(x, df_results[method], 'x--', label=method)

plt.xticks(x, ['Intercept'] + [f'X{i}' for i in range(1, n_features+1)])
plt.xlabel('Coefficients')
plt.ylabel('Value')
plt.title(f'Comparison of Estimated Coefficients (Correlation={correlation_level})')
plt.legend()
plt.grid(True)
plt.show()


In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Number of samples and features
n_samples = 100
n_features = 3

# Function to generate correlated features
def generate_correlated_data(n_samples, n_features, correlation):
    """
    Generates a dataset with specified correlation between features.

    Parameters:
    - n_samples: Number of samples
    - n_features: Number of features (excluding intercept)
    - correlation: Desired correlation between features (between 0 and 1)

    Returns:
    - X: Generated data matrix with an intercept column
    """
    # Mean and covariance matrix
    mean = np.zeros(n_features)
    cov = np.full((n_features, n_features), correlation)
    np.fill_diagonal(cov, 1)

    # Generate multivariate normal data
    X_no_intercept = np.random.multivariate_normal(mean, cov, size=n_samples)

    # Add intercept (column of ones)
    X = np.hstack((np.ones((n_samples, 1)), X_no_intercept))

    return X

# Generate data with high correlation to test ill-conditioned scenarios
correlation_level = 0.99
X = generate_correlated_data(n_samples, n_features, correlation_level)

# True coefficients (including intercept)
beta_true = np.array([5, 2, -3, 1])  # Length should be n_features + 1

# Generate noise
epsilon = np.random.randn(n_samples)

# Calculate response variable
Y = X @ beta_true + epsilon


In [None]:
def linear_regression_cholesky(X, Y):
    """
    Solves the linear regression problem using normal equations and Cholesky decomposition.

    Parameters:
    - X: Design matrix (n_samples x n_features)
    - Y: Response vector (n_samples,)

    Returns:
    - beta_hat: Estimated coefficients (n_features,)
    """
    # Compute X^T X and X^T Y
    XtX = X.T @ X
    XtY = X.T @ Y

    # Perform Cholesky decomposition of XtX
    try:
        L = np.linalg.cholesky(XtX)
    except np.linalg.LinAlgError:
        raise np.linalg.LinAlgError("Matrix X^T X is not positive definite.")

    # Solve L * z = X^T Y
    z = np.linalg.solve(L, XtY)

    # Solve L^T * beta_hat = z
    beta_hat = np.linalg.solve(L.T, z)

    return beta_hat


In [None]:
def linear_regression_qr(X, Y):
    """
    Solves the linear regression problem using QR decomposition.

    Parameters:
    - X: Design matrix (n_samples x n_features)
    - Y: Response vector (n_samples,)

    Returns:
    - beta_hat: Estimated coefficients (n_features,)
    """
    # Compute the QR decomposition of X
    Q, R = np.linalg.qr(X)

    # Compute Q^T Y
    QtY = Q.T @ Y

    # Solve R * beta_hat = Q^T Y
    beta_hat = np.linalg.solve(R, QtY)

    return beta_hat


In [None]:
def linear_regression_svd(X, Y):
    """
    Solves the linear regression problem using Singular Value Decomposition (SVD).

    Parameters:
    - X: Design matrix (n_samples x n_features)
    - Y: Response vector (n_samples,)

    Returns:
    - beta_hat: Estimated coefficients (n_features,)
    """
    # Compute the SVD of X
    U, S, Vt = np.linalg.svd(X, full_matrices=False)

    # Compute beta_hat = V * S_inv * U^T * Y
    S_inv = np.diag(1 / S)
    beta_hat = Vt.T @ S_inv @ U.T @ Y

    return beta_hat


In [None]:
def linear_regression_pinv(X, Y):
    """
    Solves the linear regression problem using the Moore-Penrose pseudoinverse.

    Parameters:
    - X: Design matrix (n_samples x n_features)
    - Y: Response vector (n_samples,)

    Returns:
    - beta_hat: Estimated coefficients (n_features,)
    """
    # Compute the pseudoinverse of X
    X_pinv = np.linalg.pinv(X)

    # Compute beta_hat = X_pinv * Y
    beta_hat = X_pinv @ Y

    return beta_hat


In [None]:
# Estimate coefficients using Cholesky decomposition
try:
    beta_cholesky = linear_regression_cholesky(X, Y)
except np.linalg.LinAlgError as e:
    print(f"Cholesky method failed: {e}")
    beta_cholesky = np.full(beta_true.shape, np.nan)

# Estimate coefficients using QR decomposition
beta_qr = linear_regression_qr(X, Y)

# Estimate coefficients using SVD
beta_svd = linear_regression_svd(X, Y)

# Estimate coefficients using the pseudoinverse
beta_pinv = linear_regression_pinv(X, Y)

# For reference, use NumPy's lstsq method
beta_lstsq, residuals, rank, s = np.linalg.lstsq(X, Y, rcond=None)

# Create a DataFrame for comparison
import pandas as pd

df_results = pd.DataFrame({
    'True beta': beta_true,
    'Cholesky': beta_cholesky,
    'QR Decomposition': beta_qr,
    'SVD': beta_svd,
    'Pseudoinverse': beta_pinv,
    'NumPy lstsq': beta_lstsq
})

print(df_results)


In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
from numpy.linalg import eigvals, svd, inv, cholesky
from sklearn.linear_model import LinearRegression


# 1. VIF (Variance Inflation Factor)
X_df = pd.DataFrame(X, columns=['Intercept'] + [f'X{i}' for i in range(1, n_features + 1)])
vif_data = pd.DataFrame({
    'Feature': X_df.columns,
    'VIF': [variance_inflation_factor(X_df.values, i) for i in range(X_df.shape[1])]
})
print("VIF:")
print(vif_data)

# 2. Condition Number
cond_number = np.linalg.cond(X)
print(f"\nCondition number of X: {cond_number:.2e}")

# 3. Condition Indices
def condition_indices(X):
    U, s, Vt = svd(X, full_matrices=False)
    cond_indices = s[0] / s
    return cond_indices

cond_indices = condition_indices(X)
print("\nCondition Indices:")
for i, ci in enumerate(cond_indices):
    print(f"Index {i+1}: {ci:.2f}")

# 4. Eigenvalues of X^T X
eigenvalues = eigvals(X.T @ X)
print("\nEigenvalues of X^T X:")
print(eigenvalues.real)

# 5. Correlation Matrix
corr_matrix = X_df.iloc[:, 1:].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# 6. Normal Equations using Cholesky Decomposition
try:
    XtX = X.T @ X
    XtY = X.T @ Y
    L = cholesky(XtX)
    z = np.linalg.solve(L, XtY)
    beta_cholesky = np.linalg.solve(L.T, z)
    print("\nCoefficients using Cholesky Decomposition:")
    print(beta_cholesky)
except np.linalg.LinAlgError as e:
    print(f"\nCholesky decomposition failed: {e}")

# 7. Linear Regression using QR Decomposition
def linear_regression_qr(X, Y):
    Q, R = np.linalg.qr(X)
    QtY = Q.T @ Y
    beta_hat = np.linalg.solve(R, QtY)
    return beta_hat

beta_qr = linear_regression_qr(X, Y)
print("\nCoefficients using QR Decomposition:")
print(beta_qr)

# 8. Linear Regression using Singular Value Decomposition (SVD)
def linear_regression_svd(X, Y):
    U, S, Vt = svd(X, full_matrices=False)
    S_inv = np.diag(1 / S)
    beta_hat = Vt.T @ S_inv @ U.T @ Y
    return beta_hat

beta_svd = linear_regression_svd(X, Y)
print("\nCoefficients using SVD:")
print(beta_svd)

# 9. Linear Regression using Pseudoinverse
X_pinv = np.linalg.pinv(X)
beta_pinv = X_pinv @ Y
print("\nCoefficients using Pseudoinverse:")
print(beta_pinv)

# 10. Comparing All Methods
df_comparison = pd.DataFrame({
    'Feature': ['Intercept'] + [f'X{i}' for i in range(1, n_features + 1)],
    'True beta': beta_true,
    'Cholesky': beta_cholesky if 'beta_cholesky' in locals() else np.nan,
    'QR Decomposition': beta_qr,
    'SVD': beta_svd,
    'Pseudoinverse': beta_pinv,
})

print("\nComparison of Estimated Coefficients:")
print(df_comparison)

# 11. Plotting the Coefficients Comparison
plt.figure(figsize=(12, 8))
methods = ['True beta', 'Cholesky', 'QR Decomposition', 'SVD', 'Pseudoinverse']
x = np.arange(len(beta_true))

for method in methods:
    plt.plot(x, df_comparison[method], marker='o', linestyle='--', label=method)

plt.xticks(x, ['Intercept'] + [f'X{i}' for i in range(1, n_features + 1)])
plt.xlabel('Coefficients')
plt.ylabel('Value')
plt.title('Comparison of Estimated Coefficients by Different Methods')
plt.legend()
plt.grid(True)
plt.show()

# Additional Note: VIF Calculation Details
# The VIF for each predictor X_i is calculated as: VIF(X_i) = 1 / (1 - R_i^2)
# where R_i^2 is the coefficient of determination obtained by regressing X_i on all other predictors.

# Final Note on Numerical Stability
# Cholesky decomposition requires X^T X to be positive definite. If not, it will raise an exception.
# QR decomposition and SVD are more stable options when dealing with ill-conditioned matrices.
# The pseudoinverse method handles any matrix, including rank-deficient ones, but can be computationally expensive.


In [None]:
X

## Hat matrix H
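The hat matrix $H = X (X^T X)^{-1} X^T$ is the orthogonal projection onto the column space of $X$. As a small sketch on made-up data, it can also be built from the thin QR factorisation $X = QR$ as $H = QQ^T$, which avoids inverting $X^TX$. The checks below cover its trace, idempotency, and the fact that $I - H$ annihilates $X$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # made-up design

# With the thin QR factorisation X = QR, the hat matrix is H = Q Q^T
Q, _ = np.linalg.qr(X)
H = Q @ Q.T

leverage = np.diag(H)   # diagonal entries h_ii, each between 0 and 1
print("trace(H) =", round(np.trace(H), 10), "with p =", p)
print("max |H @ H - H| =", np.max(np.abs(H @ H - H)))
print("max |(I - H) @ X| =", np.max(np.abs((np.eye(n) - H) @ X)))
```

The diagonal entries stored in `leverage` sum to the number of columns of $X$.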

In [None]:
# Compute X^T X
XtX = X.T @ X

# Calculate the hat matrix H
H = X @ np.linalg.inv(XtX) @ X.T

print("Dimensions of H:", H.shape)
print("Dimensions of X:", X.shape)

In [None]:
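# Eigenvalues of H (H is symmetric, so eigvalsh returns real values)
eigenvalues = np.linalg.eigvalsh(H)
print("Eigenvalues of H:", np.round(eigenvalues, 10))

# Check if H is idempotent: the largest entry of |H^2 - H| should be close to 0
idempotent_diff = np.max(np.abs(H @ H - H))
print("Max absolute difference between H^2 and H:", idempotent_diff)

# Check if H is symmetric: the largest entry of |H^T - H| should be close to 0
symmetry_diff = np.max(np.abs(H.T - H))
print("Max absolute difference between H^T and H:", symmetry_diff)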

# Predicted values
hat_Y = H @ Y

In [None]:
model = sm.OLS(Y, X)
results = model.fit()

# Predicted values from statsmodels
Y_hat_sm = results.predict(X)
# (X @ results.params)


# Compare predicted values
difference = hat_Y - Y_hat_sm
print("Difference between predicted values from Hat matrix and statsmodels predict():")
print(np.round(difference, 10).sum())
max_difference = np.max(np.abs(difference))
print(f"Maximum absolute difference: {max_difference}")

In [None]:
print((X @ results.params).mean())
print(results.fittedvalues.mean())
print((H @ Y).mean())

In [None]:
# M matrix: I - H
M = np.identity(H.shape[0]) - H
e = (M @ Y)
e

In [None]:
# Residuals computed using M matrix
residuals_M = M @ Y

# Residuals from statsmodels
residuals_statsmodels = results.resid

# Compare the two sets of residuals
difference = residuals_M - residuals_statsmodels
print("Difference between residuals from M matrix and statsmodels:")
print(np.round(difference, 10).sum())

# Maximum absolute difference
max_difference = np.max(np.abs(difference))
print(f"Maximum absolute difference: {max_difference}")


# Individual student work

# **Exercise: Developing a Marketing Plan Based on Advertising Data**

Imagine that you are statistical consultants tasked with building a marketing plan for the next year to maximize product sales. You have access to a dataset that contains information on the advertising budget allocated to three different media channels—**TV**, **Radio**, and **Newspaper**—and the corresponding **Sales** figures.

## **Dataset Description**

- **Variables:**
  - **TV**: Advertising budget allocated to TV (in thousands of dollars)
  - **Radio**: Advertising budget allocated to Radio (in thousands of dollars)
  - **Newspaper**: Advertising budget allocated to Newspaper (in thousands of dollars)
  - **Sales**: Product sales (in thousands of units)

## **Tasks**

Based on this data and your final regression model, answer the following questions:

1. **Relationship Between Advertising Budget and Sales**
   - Is there a statistically significant relationship between the advertising budget and sales?

2. **Contribution of Each Media**
   - Do all three media channels—TV, Radio, and Newspaper—contribute to sales?
   - Which media have significant effects on sales?

3. **Media Generating the Biggest Boost in Sales**
   - Which advertising medium generates the largest increase in sales per unit increase in budget?

4. **Strength of the Relationship**
   - How strong is the relationship between the advertising budget and sales?
   - What is the coefficient of determination (R-squared) of your model?

5. **Effect of TV Advertising**
   - How much increase in sales is associated with a given increase in TV advertising budget?

6. **Effect of Radio Advertising**
   - How much increase in sales is associated with a given increase in Radio advertising budget?

7. **Accuracy of Estimated Effects**
   - How accurately can we estimate the effect of each medium on sales?
   - Provide the confidence intervals for the coefficients of each medium.

8. **Predicting Future Sales**
   - How accurately can we predict future sales based on the advertising budgets?
   - What is the standard error of the estimate?

9. **Optimal Allocation of Advertising Budget**
    - Imagine you have a budget of \$100,000. What is the best strategy to allocate this budget among TV, Radio, and Newspaper advertising to maximize sales?

10. **Predicting Sales for Specific Budget Allocation**
    - If you spend \$10,000 on TV advertising and \$20,000 on Radio advertising, how much increase in sales can you expect?

11. **Confidence Interval for Predicted Sales**
    - What is the 95% confidence interval for the predicted sales in the previous question?

12. **Checking Correlation Between Independent Variables**
    - Are there significant correlations between the advertising budgets for different media?
    - How might multicollinearity affect your regression model?




In [None]:
# Load the data
Advert = pd.read_csv("https://raw.githubusercontent.com/francji1/01RAD/main/data/Advert.csv", sep=",")
Advert.head()