<a href="https://colab.research.google.com/github/francji1/01RAD/blob/main/code/01RAD_ex02_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01RAD Exercise 02

# Data exploration

Data exploration is essential for understanding the characteristics and relationships in the dataset before fitting any models.



This notebook continues the exploratory analysis of the `mpg` dataset. We summarise the variables, compare fuel efficiency across regions, and revisit linear regression building blocks as preparation for the next lecture.


In [None]:

# Core scientific stack and statistical helpers
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats

import statsmodels.api as sm
import statsmodels.formula.api as smf

from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.datasets import get_rdataset
from scipy.stats import t, norm


In [None]:

# Load the MPG dataset from seaborn and drop rows with missing values for a clean baseline
cars = sns.load_dataset('mpg').dropna()

# Peek at the first rows to confirm structure
print(cars.head())



### Dataset overview
We begin by inspecting dataset dimensions and data types to understand what variables are available.


In [None]:

# print data summary with number of rows, columns, and missing cells
dataset_summary = {
    'rows': len(cars),
    'columns': cars.shape[1],
    'missing_cells': int(cars.isna().sum().sum())
}
print(dataset_summary)


In [None]:

# print dtypes sorted by column name
print(cars.dtypes.sort_index())


In [None]:

# Additional preview
print(cars.head())


In [None]:

# Full descriptive statistics across numeric and categorical columns
print(cars.describe(include='all'))



### Correlation structure among numeric features
Correlations identify which variables move together and highlight potential multicollinearity for later modelling.


In [None]:

# Select only numeric columns for correlation analysis
numeric_cars = cars.select_dtypes(include='number')

# Compute the Pearson correlation matrix
corr_matrix = numeric_cars.corr()

# Visualise the correlations using a heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()



### Pairwise relationships
Pair plots reveal non-linear patterns and potential outliers across combinations of numeric variables.


In [None]:

# Generate pairwise scatter plots, with histograms of each variable on the diagonal
sns.pairplot(numeric_cars)
plt.suptitle('Pair Plot for Cars Dataset', y=1.02)
plt.show()


We choose `mpg` as the response and `weight` as the regressor.


### Focus on MPG and weight
Weight is a prime candidate regressor for fuel efficiency. We examine marginal distributions and their relationship.


In [None]:
# Set up the plot grid with 3 subplots
plt.figure(figsize=(12, 6))

# 1. Scatter plot of MPG vs Weight
plt.subplot(1, 3, 1)
sns.scatterplot(data=cars, x='weight', y='mpg')
plt.title("Scatter plot of Weight vs MPG")
plt.xlabel("Weight")
plt.ylabel("MPG")

# 2. Histogram and Density Plot of Weight
plt.subplot(1, 3, 2)
sns.histplot(cars['weight'], kde=True, color='green', label='Weight')
plt.legend()
plt.title("Histogram and Density Plot of Weight")
plt.xlabel("Weight")

# 3. Histogram and Density Plot of MPG
plt.subplot(1, 3, 3)
sns.histplot(cars['mpg'], kde=True, color='blue', label='MPG')
plt.legend()
plt.title("Histogram and Density Plot of MPG")
plt.xlabel("MPG")

plt.tight_layout()
plt.show()


In [None]:
# Scatter plot of Weight vs MPG with colors by Country (origin)
plt.figure(figsize=(10, 6))
sns.scatterplot(data=cars, x='weight', y='mpg', hue='origin', palette='Set1')

# Add labels and title
plt.title("Scatter plot of Weight vs MPG (Colored by Country)")
plt.xlabel("Weight")
plt.ylabel("MPG")

# Display the plot
plt.show()


### Mean MPG by origin
Simple group summaries help motivate formal hypothesis tests.


In [None]:

# Compute mean MPG per region and compare differences
mean_mpg_by_country = cars.groupby('origin')['mpg'].mean()
print('Mean MPG by Country (Origin):')
print(mean_mpg_by_country)

# Calculate pairwise differences between regional means
import itertools
country_pairs = list(itertools.combinations(mean_mpg_by_country.index, 2))

print('Differences between mean MPG by country pairs:')
for country1, country2 in country_pairs:
    mean_diff = mean_mpg_by_country[country1] - mean_mpg_by_country[country2]
    print(f'Difference between {country1} and {country2}: {mean_diff:.4f}')


In [None]:

# print summary statistics for MPG by origin using groupby
mpg_by_origin = cars.groupby('origin')['mpg'].agg(['mean', 'median', 'std', 'count'])
print(mpg_by_origin)



### Two-sample t-tests across regions
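We test whether mean MPG differs for each pair of regions. By default, `stats.ttest_ind` runs Student's pooled-variance test, which allows unequal sample sizes but assumes equal variances; passing `equal_var=False` gives Welch's test instead. The p-values below are not adjusted for multiple comparisons.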


In [None]:

# Loop over unique pairs of regions and conduct independent-sample t-tests
country_list = cars['origin'].unique()

for i, country1 in enumerate(country_list):
    for country2 in country_list[i + 1:]:
        mpg1 = cars[cars['origin'] == country1]['mpg']
        mpg2 = cars[cars['origin'] == country2]['mpg']
        t_stat, p_value = stats.ttest_ind(mpg1, mpg2)
        print(f'T-test between {country1} and {country2}:')
        print(f't-statistic: {t_stat:.4f}, p-value: {p_value:.4f}')
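
To make the equal-variance assumption concrete, here is a minimal sketch on synthetic data (the two groups and their parameters are made up for illustration and are not the `mpg` data). It compares the pooled test with Welch's test and reproduces the Welch statistic by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical groups with different sizes and different spreads
group_a = rng.normal(loc=20.0, scale=6.0, size=250)
group_b = rng.normal(loc=30.0, scale=3.0, size=70)

# Student's t-test: pools the two sample variances (SciPy default)
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)
# Welch's t-test: keeps the variances separate
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# Welch statistic by hand: mean difference over the unpooled standard error
se_welch = np.sqrt(group_a.var(ddof=1) / group_a.size
                   + group_b.var(ddof=1) / group_b.size)
t_manual = (group_a.mean() - group_b.mean()) / se_welch

print(f'pooled t = {t_pooled:.3f}, Welch t = {t_welch:.3f}, manual Welch t = {t_manual:.3f}')
```

When the group variances and sizes are both unequal, the two statistics can differ noticeably, so it is worth checking the group standard deviations before choosing between them.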



### One-way ANOVA
We test the global null hypothesis that all regional means are equal.


In [None]:
# Perform ANOVA to compare means across all countries
model = smf.ols('mpg ~ C(origin)', data=cars).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print("\nANOVA Table:")
print(anova_table)



### One-way ANOVA via SciPy
A direct call to `scipy.stats.f_oneway` reaches the same conclusion as the regression-based ANOVA table.


In [None]:
import scipy.stats as stats

# Separate MPG data by country
mpg_usa = cars[cars['origin'] == 'usa']['mpg']
mpg_japan = cars[cars['origin'] == 'japan']['mpg']
mpg_europe = cars[cars['origin'] == 'europe']['mpg']

# Perform ANOVA
f_stat, p_value = stats.f_oneway(mpg_usa, mpg_japan, mpg_europe)

print(f"ANOVA Results: F-statistic = {f_stat:.4f}, p-value = {p_value:.4f}")
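
The F-statistic behind both ANOVA calls is the ratio of between-group to within-group mean squares. The sketch below rebuilds it from sums of squares on synthetic groups (hypothetical data, chosen only for illustration) and checks it against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three hypothetical groups with unequal sizes and a common spread
groups = [rng.normal(20, 5, 60), rng.normal(27, 5, 40), rng.normal(30, 5, 50)]

all_values = np.concatenate(groups)
n, k = all_values.size, len(groups)
grand_mean = all_values.mean()

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f'manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}')
```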



### Tukey HSD post-hoc comparison
Once the global null is rejected, Tukey's test highlights which pairs differ.


In [None]:
# Perform Tukey's HSD test for pairwise comparison
tukey = pairwise_tukeyhsd(endog=cars['mpg'], groups=cars['origin'], alpha=0.05)
print("\nTukey's HSD Test:")
print(tukey.summary())

In [None]:
# 1. Simple Linear Regression: MPG ~ weight
model = smf.ols('mpg ~ weight', data=cars)
fit = model.fit()
print(fit.summary())

# Plotting the regression line
sns.regplot(x='weight', y='mpg', data=cars, ci=None, line_kws={"color": "red"})
plt.title('MPG vs Weight')
plt.show()


In [None]:
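# t-value of the intercept: estimate divided by its standard error
# (values as read from the regression summary above)
46.2165 / 0.799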

### Questions:


- What is the difference between **Mean Squared Error (MSE)** and $\hat{\sigma}^2$ in the context of linear regression?
  - Why is MSE sometimes defined as $\frac{\text{RSS}}{n}$, and why do we divide RSS by the degrees of freedom for $\hat{\sigma}^2$?
  - How does adjusting for degrees of freedom impact the estimate of the error variance?

- How is the **covariance matrix** of the estimated coefficients $\hat{\beta}$ calculated?
  - Write down the formula for the covariance matrix $\text{Cov}(\hat{\beta})$.
  - Why is it important to compute the **diagonal elements** of this matrix, and how do these relate to the standard errors of the coefficients?

- How is the t-test used to assess the statistical significance of the coefficients in linear regression?
  - What is the null hypothesis in the t-test for a regression coefficient?
  - How is the **t-value** calculated, and how do we use it to obtain the **p-value**?
  - What is the relationship between the t-value, the standard error of $\hat{\beta}$, and the confidence intervals for the parameter?

- If a parameter's t-test returns a high p-value, what does that suggest about the significance of the parameter in the model?
  - Should this parameter be kept or removed from the model, and why?
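
For reference, the quantities these questions refer to can be summarised as follows. This is a sketch rather than a full derivation, and it assumes a design matrix $X$ with $n$ rows and $p$ columns (intercept included) and homoscedastic, uncorrelated errors:

$$\text{MSE} = \frac{\text{RSS}}{n}, \qquad \hat{\sigma}^2 = \frac{\text{RSS}}{n - p}$$

$$\widehat{\text{Cov}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1}, \qquad \hat{\text{se}}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}$$

$$t_j = \frac{\hat{\beta}_j}{\hat{\text{se}}(\hat{\beta}_j)} \quad \text{(compared with } t_{n-p} \text{ under } H_0: \beta_j = 0\text{)}, \qquad \hat{\beta}_j \pm t_{n-p,\,1-\alpha/2} \, \hat{\text{se}}(\hat{\beta}_j)$$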


In [None]:
# Compute residuals from the fitted model (reuse `fit` instead of refitting)
residuals = fit.resid

# Compute and print statistics
print("Mean of residuals:", residuals.mean())
print("Standard deviation of residuals (ddof=1):", residuals.std())
print("Mean Squared Error (RSS / n):", (residuals**2).sum() / len(cars))
print("Sample variance of residuals (RSS / (n - 1)):", residuals.var())
print("Unbiased error variance estimate (RSS / (n - 2)):", (residuals**2).sum() / (len(cars) - 2))
print("Skewness of residuals:", residuals.skew())
print("Excess kurtosis of residuals:", residuals.kurtosis())

In [None]:
# Compare the two error-variance estimators: RSS / n versus RSS / (n - p)
mse1 = np.mean(residuals**2)
print("MSE as mean of squared residuals (RSS / n):", mse1)
mse2 = (residuals**2).sum() / len(cars)
print("MSE computed as RSS / n:", mse2)
resid_mse = fit.mse_resid
print("statsmodels mse_resid (RSS / df_resid):", resid_mse)
resid_mse2 = (residuals**2).sum() / (len(cars) - 2)
print("Unbiased estimate RSS / (n - 2):", resid_mse2)
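
The difference between dividing by $n$ and by $n - p$ is easiest to see in a small simulation. The sketch below uses a hypothetical true line with a known error variance (all parameter values are illustrative assumptions) and averages both estimators over many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, sigma2 = 20, 2, 4.0            # small n makes the bias visible
x = np.linspace(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])

# Residual-maker matrix M = I - H, so that residuals = M @ y
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H

n_sim = 20000
noise = rng.normal(0.0, np.sqrt(sigma2), size=(n_sim, n))
y = 1.0 + 2.0 * x + noise            # hypothetical true regression line
resid = y @ M                         # one row of residuals per simulated sample (M is symmetric)
rss = (resid ** 2).sum(axis=1)

mse_mean = (rss / n).mean()               # biased: expectation is sigma2 * (n - p) / n
sigma2_hat_mean = (rss / (n - p)).mean()  # unbiased: expectation is sigma2

print(f'true sigma^2 = {sigma2}, mean RSS/n = {mse_mean:.3f}, mean RSS/(n-p) = {sigma2_hat_mean:.3f}')
```

With $n = 20$ and $p = 2$, the average of RSS/$n$ should settle near $4 \cdot 18 / 20 = 3.6$, while RSS/$(n - p)$ should settle near the true value 4.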

In [None]:
# Plot residuals as a histogram
plt.hist(residuals, bins=20, edgecolor='k', alpha=0.65)
plt.title('Histogram of Residuals')
plt.xlabel('Residual')
plt.ylabel('Frequency')
plt.show()

# Q-Q plot of residuals
stats.probplot(residuals, plot=plt)
plt.title('Q-Q Plot of Residuals')
plt.show()

In [None]:
# Plotting results
fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(model.fit(), 1, ax=ax)
plt.show()

In [None]:
def get_regression(X, Y):
    """
    Calculate linear regression coefficients, standard errors,
    t-values, p-values, and 95% confidence intervals.

    Parameters:
    - X: DataFrame or array-like of independent variables.
    - Y: Series or array-like of dependent variable.

    Returns:
    - DataFrame with coefficients, standard errors, t-values, p-values,
      and 95% confidence intervals.
    """

    # Ensure X and Y are DataFrames or numpy arrays
    X = pd.DataFrame(X.copy())  # X must be a DataFrame
    Y = np.array(Y)  # Convert Y to NumPy array if it's a Series

    # Add constant (intercept) term to X matrix
    X['const'] = 1  # Adds intercept term
    X = X[['const'] + [col for col in X if col != 'const']]  # Ensure 'const' is the first column

    # Calculate regression coefficients (beta_hat) using the formula: (X'X)^(-1) X'Y
    # Not efficient but simple
    beta_hat = np.linalg.inv(X.values.T @ X.values) @ X.values.T @ Y

    # Predicted values and residuals
    Y_pred = X.values @ beta_hat  # Predicted Y
    residuals = Y - Y_pred  # Residuals (actual Y - predicted Y)

    # Residual Sum of Squares (RSS) = Sum of Squared Errors (SSE)
    RSS = residuals.T @ residuals

    # Mean Squared Error (MSE)
    MSE = RSS / Y.shape[0]

    # Residual degrees of freedom (n - p), where n is the number of observations
    # and p is the number of columns of X, including the intercept
    df = Y.shape[0] - X.shape[1]
    # Unbiased estimate of the error variance (RSS divided by degrees of freedom)
    sigma2_hat = RSS / df

    # Standard errors of coefficients (sqrt of diagonal of covariance matrix)
    se_beta_hat = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.values.T @ X.values)))

    # t-values and p-values
    t_values = beta_hat / se_beta_hat
    p_values = 2 * t.sf(np.abs(t_values), df)  # survival function avoids cancellation in 1 - cdf

    # Critical t-value for 95% confidence intervals
    alpha = 0.05
    t_critical = t.ppf(1 - alpha/2, df)

    # 95% Confidence Intervals
    ci_lower = beta_hat - t_critical * se_beta_hat
    ci_upper = beta_hat + t_critical * se_beta_hat

    # Create a DataFrame for the output
    return pd.DataFrame({
        'coef': beta_hat,
        'std err': se_beta_hat,
        't': t_values,
        'P > |t|': p_values,
        '95% CI Lower': ci_lower,
        '95% CI Upper': ci_upper
    }, index=X.columns)


In [None]:

# Independent variable (X) - weight
X = cars[['weight']]
# Dependent variable (Y) - mpg
Y = cars['mpg']

# Compare manual OLS implementation with statsmodels
manual_ols = get_regression(X, Y)
print('Manual OLS Results:')
print(manual_ols)

X_with_const = sm.add_constant(X[['weight']])
model = sm.OLS(Y, X_with_const)
results = model.fit()
print('Statsmodels OLS Results:')
print(results.summary())



## Simple linear regression recap
We first revisit the single-regressor case (MPG on weight) using a hand-crafted OLS routine and `statsmodels` for validation.


**Task:** In the simple linear regression model, construct a Wald test for $H_0 : \beta_1 = 17 \beta_0$ versus $H_1 : \beta_1 \neq 17 \beta_0$.

**Solution**. Let $\delta = \beta_1 - 17 \beta_0$. The MLE is $\hat{\delta} = \hat{\beta}_1 - 17 \hat{\beta}_0$, with estimated standard error $\hat{\text{se}}(\hat{\delta})$. Because $\hat{\beta}_0$ and $\hat{\beta}_1$ are in general correlated, the variance of the difference includes a covariance term:

$$\hat{\text{se}}(\hat{\delta})^2 = \hat{\text{se}}(\hat{\beta}_1 - 17 \hat{\beta}_0)^2 = \hat{\text{se}}(\hat{\beta}_1)^2 + 17^2 \hat{\text{se}}(\hat{\beta}_0)^2 - 2 \cdot 17 \, \widehat{\text{Cov}}(\hat{\beta}_0, \hat{\beta}_1)$$


The Wald test rejects $H_0$ at level $\alpha$ when $|W| > z_{\alpha / 2}$, where

$$W = \frac{\hat{\delta} - 0}{\hat{\text{se}}(\hat{\delta})}
= \frac{\hat{\beta}_1 - 17 \hat{\beta}_0}{\sqrt{\hat{\text{se}}(\hat{\beta}_1)^2 + 17^2 \hat{\text{se}}(\hat{\beta}_0)^2 - 2 \cdot 17 \, \widehat{\text{Cov}}(\hat{\beta}_0, \hat{\beta}_1)}}$$
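
A Wald statistic for a linear combination of coefficients can be computed from the full estimated covariance matrix $\hat{\sigma}^2 (X^\top X)^{-1}$. The sketch below does this on synthetic data (the data-generating line is a made-up assumption, not the `mpg` fit) and also shows how much the standard error changes when the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$ is ignored.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n = 200
x = rng.uniform(1.0, 5.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=n)   # hypothetical true line

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])
cov_beta = sigma2_hat * XtX_inv                     # estimated Cov(beta_hat)

# delta = c' beta = beta_1 - 17 * beta_0
c = np.array([-17.0, 1.0])
delta_hat = c @ beta_hat
se_delta = np.sqrt(c @ cov_beta @ c)                # includes the covariance term
W = delta_hat / se_delta
p_value = 2 * norm.sf(abs(W))

# Standard error if the covariance term were (incorrectly) dropped
se_naive = np.sqrt(cov_beta[1, 1] + 17 ** 2 * cov_beta[0, 0])

print(f'W = {W:.3f}, p-value = {p_value:.4g}')
print(f'se with covariance = {se_delta:.4f}, se ignoring covariance = {se_naive:.4f}')
```

When the regressor has a positive mean, the intercept and slope estimates are negatively correlated, so dropping the covariance term understates the standard error of $\hat{\beta}_1 - 17 \hat{\beta}_0$.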


### Model comparison via F-test
We compare an intercept-only baseline against the weight model using the classic nested-model F-test.


In [None]:
from scipy.stats import f

# Fit the two models
fit0 = smf.ols('mpg ~ 1', data=cars).fit()  # Restricted model mpg ~ 1 (intercept only)
fit1 = smf.ols('mpg ~ weight', data=cars).fit() # Full model mpg ~ weight

# Get RSS for both models
RSS0 = np.sum(fit0.resid ** 2)  # Residual sum of squares for restricted model
RSS1 = np.sum(fit1.resid ** 2)  # Residual sum of squares for full model

# Number of observations and number of parameters
n = len(cars)
p0 = 1  # Number of parameters in the restricted model (intercept)
p1 = 2  # Number of parameters in the full model (intercept + weight)

# Degrees of freedom for both models
df0 = n - p0  # Degrees of freedom for fit0
df1 = n - p1  # Degrees of freedom for fit1

# Compute the F-statistic
numerator = (RSS0 - RSS1) / (p1 - p0)  # Improvement in RSS
denominator = RSS1 / df1  # Error in the full model

F_stat = numerator / denominator
print(f"F-statistic: {F_stat}")

# Compare with critical value from F-distribution
alpha = 0.05  # Significance level
F_critical = f.ppf(1 - alpha, p1 - p0, df1)
print(f"Critical F-value at 5% significance: {F_critical}")

# p-value from the F-distribution
p_value = 1 - f.cdf(F_stat, p1 - p0, df1)
print(f"P-value: {p_value}")
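
In simple linear regression, the nested-model F-statistic for adding the single regressor equals the square of the slope's t-statistic. The following sketch checks this identity on synthetic data (the data-generating values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = 1.0 - 0.8 * x + rng.normal(size=n)   # hypothetical true line

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

rss1 = ((y - X @ beta_hat) ** 2).sum()   # full model: intercept + slope
rss0 = ((y - y.mean()) ** 2).sum()       # restricted model: intercept only

# Nested-model F-statistic with one restriction
f_stat = (rss0 - rss1) / (rss1 / (n - 2))

# t-statistic of the slope
sigma2_hat = rss1 / (n - 2)
se_slope = np.sqrt(sigma2_hat * np.linalg.inv(X.T @ X)[1, 1])
t_slope = beta_hat[1] / se_slope

print(f'F = {f_stat:.4f}, t^2 = {t_slope ** 2:.4f}')
```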


In [None]:

# Compare R-squared definitions for models with and without intercepts
fit1 = smf.ols('mpg ~ weight', data=cars).fit()
fit2 = smf.ols('mpg ~ -1 + weight', data=cars).fit()

RSS1 = np.sum(fit1.resid ** 2)
RSS2 = np.sum(fit2.resid ** 2)

TSS_with_intercept = np.sum((cars['mpg'] - cars['mpg'].mean()) ** 2)
TSS_no_intercept = np.sum(cars['mpg'] ** 2)

R2_1_manual = 1 - (RSS1 / TSS_with_intercept)
R2_2_manual = 1 - (RSS2 / TSS_no_intercept)

R2_1_sm = fit1.rsquared
R2_2_sm = fit2.rsquared

print(f'Manual R-squared for fit1 (with intercept): {R2_1_manual}')
print(f'Manual R-squared for fit2 (without intercept): {R2_2_manual}')

print(f'R-squared from statsmodels for fit1 (with intercept): {R2_1_sm}')
print(f'R-squared from statsmodels for fit2 (without intercept): {R2_2_sm}')


In [None]:

# Demonstrate the pitfall of using the centered TSS with a no-intercept model
TSS_wrong = np.sum((cars['mpg'] - cars['mpg'].mean()) ** 2)
R2_wrong = 1 - (RSS2 / TSS_wrong)
print(f'Wrong R-squared for fit2 (using intercept-based formula): {R2_wrong}')
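
The same pitfall can be reproduced on synthetic data where the true intercept is far from zero (the numbers below are made up for illustration). For a regression through the origin, the uncentered $R^2$ always lies in $[0, 1]$, whereas the centered formula can turn negative.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(1.0, 3.0, size=50)
# Hypothetical data with a large intercept and a negative slope
y = 10.0 - 2.0 * x + rng.normal(0.0, 0.5, size=50)

# OLS through the origin: slope = sum(x * y) / sum(x^2)
slope = (x @ y) / (x @ x)
rss = ((y - slope * x) ** 2).sum()

r2_uncentered = 1 - rss / (y ** 2).sum()             # matches the no-intercept model
r2_centered = 1 - rss / ((y - y.mean()) ** 2).sum()  # formula meant for models with an intercept

print(f'uncentered R^2 = {r2_uncentered:.3f}, centered R^2 = {r2_centered:.3f}')
```

A high uncentered $R^2$ here says only that the fit beats predicting zero everywhere, so it is not comparable with the $R^2$ of a model that includes an intercept.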



## Multiple regression extension
We add regional information to the weight regressor and compare manual vs. library-based OLS fits.


In [None]:

# Build design matrix with weight and origin dummies
X = cars[['weight', 'origin']]
X = pd.get_dummies(X, drop_first=True).astype(float)
Y = cars['mpg']

manual_ols = get_regression(X, Y)
print('Manual OLS Results:')
print(manual_ols)

X_with_const = sm.add_constant(X)
model = sm.OLS(Y, X_with_const)
results = model.fit()
print('Statsmodels OLS Results:')
print(results.summary())


In [None]:

# Equivalent fit using the formula interface
model = smf.ols('mpg ~ weight + origin', data=cars)
results = model.fit()
print('Statsmodels OLS Results:')
print(results.summary())
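
Dropping one dummy level matters because a full set of dummies sums to the intercept column, which makes the design matrix rank deficient. The sketch below illustrates this on a tiny hand-made table (the six rows are hypothetical values, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Tiny hypothetical table with one numeric and one categorical predictor
df = pd.DataFrame({
    'weight': [2100.0, 3400.0, 2600.0, 1900.0, 3900.0, 2300.0],
    'origin': ['japan', 'usa', 'europe', 'japan', 'usa', 'europe'],
})

# All three dummies plus an intercept: the dummies sum to the constant column
full = pd.get_dummies(df, columns=['origin']).astype(float)
full.insert(0, 'const', 1.0)

# Drop the first level (europe), which becomes the reference category
reduced = pd.get_dummies(df, columns=['origin'], drop_first=True).astype(float)
reduced.insert(0, 'const', 1.0)

rank_full = np.linalg.matrix_rank(full.to_numpy())
rank_reduced = np.linalg.matrix_rank(reduced.to_numpy())

print(f'full design: {full.shape[1]} columns, rank {rank_full}')
print(f'reduced design: {reduced.shape[1]} columns, rank {rank_reduced}')
```

With the reference level dropped, each remaining dummy coefficient is read as a shift in the intercept relative to that reference category.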


In [None]:
# Linear Regression for different countries
countries = cars['origin'].unique()

for country in countries:
    country_data = cars[cars['origin'] == country]
    model_country = smf.ols('mpg ~ horsepower', data=country_data).fit()
    print(f'Regression for Country: {country}')
    print(model_country.summary())

    # Plot for each country
    plt.figure()
    sns.regplot(x='horsepower', y='mpg', data=country_data, ci=None, line_kws={"color": "red"})
    plt.title(f'MPG vs Horsepower - {country}')
    plt.show()



## Key takeaways
- Group comparisons (t-tests, ANOVA, Tukey) reveal how fuel efficiency varies by origin before modelling.
- Manual OLS derivations mirror library output and highlight the impact of including an intercept.
- Extending to multiple regression requires careful handling of categorical predictors via dummy coding.


# Student Individual Work

### 1. Convert `mpg` to liters per 100 km
- **Task**: Convert the fuel consumption in `mpg` (miles per gallon) to liters per 100 kilometers (L/100km).
- **Formula**:
  $$
  \text{L/100km} = \frac{235.215}{\text{mpg}}
  $$
- **Question**: What is the average fuel consumption in liters per 100 km for the dataset?
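
The constant in the formula follows from the definitions of the US gallon and the mile. The sketch below derives it; the helper name `mpg_to_l_per_100km` is a hypothetical choice for illustration, and US (not imperial) gallons are assumed.

```python
LITERS_PER_US_GALLON = 3.785411784   # exact by definition
KM_PER_MILE = 1.609344               # exact by definition

# (liters per gallon) / (km per mile) gives liters per km at 1 mpg; times 100 for 100 km
factor = 100 * LITERS_PER_US_GALLON / KM_PER_MILE


def mpg_to_l_per_100km(mpg):
    """Convert miles per US gallon to liters per 100 km (hypothetical helper)."""
    return factor / mpg


print(f'conversion factor = {factor:.3f}')
print(f'30 mpg = {mpg_to_l_per_100km(30.0):.2f} L/100km')
```

Because the conversion is non-linear, converting each car first and then averaging generally gives a different number than converting the average `mpg`.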

---

### 2. Convert `horsepower` to kilowatts (kW)
- **Task**: Convert the engine power from `horsepower (hp)` to `kilowatts (kW)`.
- **Formula**:
  $$
  \text{kW} = \text{hp} \times 0.7355
  $$
- **Question**: What is the range of engine power in kilowatts for the dataset?

---

### 3. Run regression analysis on how `liters_per_100km` depends on `kw` (engine power)
- **Task**: Perform regression analysis to understand the relationship between fuel consumption (`liters_per_100km`) and engine power (`kw`).
- **Question**: What are the coefficients of the regression model? How do you interpret them in terms of the relationship between fuel consumption and engine power?

---

### 4. Run the same regression analysis using a model **with and without intercept**
- **Task**: Run two regression models: one with an intercept and one without.
- **Question**: How do the models differ? What are the key differences in the interpretation of the results between the two models?

---

### 5. Discuss the F-statistic and R-squared for both models
- **Task**: Compare the F-statistic and R-squared for both models (with and without intercept).
- **Question**: Which model better explains the data, and why? Which one would you choose and under what circumstances?

---

### 6. Test if the regression coefficient for the Intercept is equal to 10 times the regression coefficient for engine power
- **Task**: Test the hypothesis that the intercept is equal to 10 times the regression coefficient for engine power.
  $$
  H_0: \beta_0 = 10 \times \beta_1 \quad \text{vs.} \quad H_1: \beta_0 \neq 10 \times \beta_1
  $$
- **Question**: Can we reject the null hypothesis? What does this tell us about the relationship between the intercept and engine power?

---

### 7. Compare fuel consumption for cars from Europe and Japan
- **Task**: Compare the fuel consumption (in liters per 100 km) between European and Japanese cars at different engine power levels (kW).
- **Question**: For what engine power (`kw`) do European cars have smaller fuel consumption than Japanese cars?

---

### 8. Investigate the impact of `weight` on fuel consumption
- **Task**: Add `weight` as a second predictor in the regression model to see how it affects the relationship between engine power and fuel consumption.
- **Question**: Does `weight` significantly improve the model? How does it affect the coefficients and interpretation of `kw`?

---


### 9. Predict the fuel consumption of a car with 150 kW engine power and discuss the prediction interval
- **Task**: Use the regression model to predict the fuel consumption of a car with 150 kW engine power (for each origin).
- **Question**: What is the predicted fuel consumption? How confident are we in this prediction?

