This worksheet was created with the help of Claude 3.7 based on the notebook "A primer on predictive inference."

# Worksheet: Conformal prediction with Galton's height data

In this exercise, you will implement conformal prediction using Galton's data on the heights of fathers and their children. The goal is to gain hands-on experience with the conformal prediction methodology and verify its finite-sample validity by computing the empirical coverage.

## Background

Conformal prediction is a technique that wraps around any point prediction method to produce prediction intervals with valid coverage guarantees. Traditional prediction intervals in linear regression rely on distributional assumptions (a correctly specified linear model with homoscedastic Gaussian errors). Conformal prediction intervals achieve valid coverage *without* those assumptions: they only require the data points to be exchangeable (for example, i.i.d.).

The key idea is to use a calibration dataset to determine how large the prediction intervals should be. By computing nonconformity scores on the calibration data, we can establish a quantile that ensures the desired coverage level.
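
For reference, the split conformal recipe that this worksheet builds toward can be summarized as follows. Here $\hat{\mu}$ is the predictor fitted on the training set, $(X_i, Y_i)$ for $i = 1, \dots, m$ are the calibration points, and $\hat{q}$ is the calibrated quantile; the guarantee in the last line assumes the calibration points and the new point $(X_{m+1}, Y_{m+1})$ are exchangeable.

$$
\begin{aligned}
s_i &= |Y_i - \hat{\mu}(X_i)|, \qquad i = 1, \dots, m, \\
\hat{q} &= \text{the } \lceil (1-\alpha)(m+1) \rceil\text{-th smallest value among } s_1, \dots, s_m, \\
C(x) &= \big[\hat{\mu}(x) - \hat{q},\; \hat{\mu}(x) + \hat{q}\big], \\
\mathbb{P}\big(Y_{m+1} \in C(X_{m+1})\big) &\ge 1 - \alpha .
\end{aligned}
$$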

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy.stats import t
import seaborn as sns

## Part 1: Load and Explore the Data

First, let's load Galton's dataset and perform some exploratory data analysis.

In [None]:
# Load the data
df = pd.read_csv("Galton.txt", sep="\t")

# Display the first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Summary statistics
print("\n Summary statistics:")
print(df.describe())

# Extract the relevant columns
X = df["Father"].astype(float)  # Father's height
Y = df["Height"].astype(float)  # Child's height

# Find the total number of observations
n = X.shape[0]
print(f"\n Total number of observations: {n}")

In [None]:
# Visualize the relationship between father's and child's height
plt.figure(figsize=(10, 6))
plt.scatter(X, Y, alpha=0.5)
plt.xlabel("Father's Height (inches)")
plt.ylabel("Child's Height (inches)")
plt.title("Relationship between Father's and Child's Heights")
plt.grid(True, alpha=0.3)
plt.show()

## Part 2: Split the Data

For conformal prediction, we need to split our data into three parts:
1. Training set: Used to fit the model
2. Calibration set: Used to compute the nonconformity scores and determine the quantile
3. Test set: Used to evaluate the prediction intervals

Let's split the data into these three parts.

In [None]:
# Split the data into training, calibration, and test sets (50%, 25%, 25%)
np.random.seed(0)  # arbitrary seed, fixed so that the split is reproducible
idx = np.random.permutation(n)
n_train = int(np.floor(0.5 * n))
n_calib = int(np.floor(0.25 * n))

idx_train = idx[:n_train]
idx_calib = idx[n_train:n_train + n_calib]
idx_test = idx[n_train + n_calib:]  # the remaining observations (about 25%)

# Create the datasets (.iloc selects rows by position, regardless of index labels)
X_train, Y_train = X.iloc[idx_train], Y.iloc[idx_train]
X_calib, Y_calib = X.iloc[idx_calib], Y.iloc[idx_calib]
X_test, Y_test = X.iloc[idx_test], Y.iloc[idx_test]

print(f"Training set size: {len(X_train)}")
print(f"Calibration set size: {len(X_calib)}")
print(f"Test set size: {len(X_test)}")

## Part 3: Fit the OLS Model

Now, let's fit an Ordinary Least Squares (OLS) model using the training data.

In [None]:
# Add an intercept to the training features
X_train_with_intercept = sm.add_constant(X_train)

# Fit the OLS model
ols_model = sm.OLS(Y_train, X_train_with_intercept).fit()

# Print the model summary
print(ols_model.summary())

In [None]:
# Create a range of x values for plotting
x_range = np.linspace(X.min(), X.max(), 100)
x_range_with_intercept = sm.add_constant(x_range)

# Generate predictions
y_pred = ols_model.predict(x_range_with_intercept)

# Plot the data and fitted line
plt.figure(figsize=(10, 6))
plt.scatter(X_train, Y_train, alpha=0.5, label="Training Data")
plt.plot(x_range, y_pred, color="red", linewidth=2, label="OLS Fitted Line")
plt.xlabel("Father's Height (inches)")
plt.ylabel("Child's Height (inches)")
plt.title("OLS Regression: Child's Height vs Father's Height")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

## Part 4: Traditional OLS Prediction Intervals

Before implementing conformal prediction, let's compute the traditional OLS prediction intervals for comparison.
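
As a reminder of what is being compared, the textbook $(1-\alpha)$ prediction interval for simple linear regression at a new point $x$ is written out below. Here $n_{\text{train}}$ is the training set size, $\bar{x}$ is the mean of the training fathers' heights, and $\hat{\sigma}^2$ is the residual sum of squares divided by $n_{\text{train}} - 2$. The `obs_ci_lower` and `obs_ci_upper` columns used in the next cell should correspond to this interval, and its validity rests on a linear mean with homoscedastic Gaussian errors.

$$
\hat{\mu}(x) \;\pm\; t_{1-\alpha/2,\; n_{\text{train}}-2}\;\hat{\sigma}\,
\sqrt{1 + \frac{1}{n_{\text{train}}} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n_{\text{train}}} (x_i - \bar{x})^2}}
$$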

In [None]:
# Significance level
alpha = 0.1  # 90% prediction intervals

# Generate predictions and prediction intervals for the test set
X_test_with_intercept = sm.add_constant(X_test)
test_predictions = ols_model.get_prediction(X_test_with_intercept).summary_frame(alpha=alpha)

# Extract the predictions and intervals
y_pred_test = test_predictions['mean']
lower_bound = test_predictions['obs_ci_lower']
upper_bound = test_predictions['obs_ci_upper']

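# Calculate the coverage (compare by position so that index labels cannot interfere)
ols_coverage = np.mean((np.asarray(Y_test) >= np.asarray(lower_bound)) & (np.asarray(Y_test) <= np.asarray(upper_bound)))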
print(f"OLS prediction interval coverage: {ols_coverage:.4f} (target: {1-alpha:.1f})")

In [None]:
# Visualize OLS prediction intervals
prediction_summary = ols_model.get_prediction(x_range_with_intercept).summary_frame(alpha=alpha)
plt.figure(figsize=(10, 6))
plt.scatter(X_train, Y_train, alpha=0.3, label="Training Data")
plt.scatter(X_test, Y_test, alpha=0.5, color="green", label="Test Data")
plt.plot(x_range, prediction_summary['mean'], color="red", linewidth=2, label="OLS Fitted Line")
plt.fill_between(x_range, 
                 prediction_summary['obs_ci_lower'], 
                 prediction_summary['obs_ci_upper'], 
                 color="red", alpha=0.2, label=f"{100*(1-alpha)}% OLS Prediction Interval")
plt.xlabel("Father's Height (inches)")
plt.ylabel("Child's Height (inches)")
plt.title("OLS Prediction Intervals")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

## Part 5: Conformal Prediction

Now, let's implement conformal prediction using the residual nonconformity score.

**Task 1**: Compute the nonconformity scores for the calibration set.

*Hint*: The nonconformity score is the absolute difference between the observed value and the predicted value.

In [None]:
# TODO: Compute predictions for the calibration set
X_calib_with_intercept = sm.add_constant(X_calib)
y_pred_calib = ols_model.predict(X_calib_with_intercept)

# TODO: Compute nonconformity scores
nonconformity_scores = # YOUR CODE HERE

# Visualize the nonconformity scores
plt.figure(figsize=(10, 6))
plt.hist(nonconformity_scores, bins=20, alpha=0.7)
plt.xlabel("Nonconformity Score (|Y - Ŷ|)")
plt.ylabel("Frequency")
plt.title("Distribution of Nonconformity Scores on Calibration Set")
plt.grid(True, alpha=0.3)
plt.show()

**Task 2**: Determine the quantile for the desired coverage level.
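
To make the rank arithmetic concrete, here is a minimal sketch that computes $k = \lceil (1-\alpha)(m+1) \rceil$ for a few hypothetical calibration sizes. The helper name `conformal_rank` is illustrative and not part of the worksheet. Note that $k$ is a 1-based rank, and that when $k > m$ the calibration set is too small to give a finite interval at that level.

```python
import math


def conformal_rank(m, alpha):
    """Return the 1-based rank k = ceil((1 - alpha) * (m + 1)).

    Returns None when k > m, i.e. when the calibration set is too small
    to yield a finite interval at level 1 - alpha.
    """
    # Round before taking the ceiling so that floating-point noise cannot
    # push an exact integer just above itself.
    k = math.ceil(round((1 - alpha) * (m + 1), 10))
    return k if k <= m else None


# Hypothetical calibration sizes, chosen only for illustration
for m in (5, 19, 99):
    print(f"m = {m}: k = {conformal_rank(m, alpha=0.1)}")
```

With 0-based Python indexing, the score of rank `k` sits at position `k - 1` of the sorted scores.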

In [None]:
# TODO: Compute the quantile corresponding to the desired coverage level
# For a 90% prediction interval (alpha = 0.1), q_hat is the k-th smallest calibration score,
# where k = ceil((1 - alpha) * (m + 1)) and m is the size of the calibration set.
# Note: k is a 1-based rank, while Python indexing is 0-based.

m = len(X_calib)
quantile_index = # YOUR CODE HERE
sorted_scores = # YOUR CODE HERE
q_hat = # YOUR CODE HERE

print(f"Conformal quantile (q_hat): {q_hat:.4f}")

**Task 3**: Construct conformal prediction intervals for the test set.

In [None]:
# TODO: Compute conformal prediction intervals for the test set
y_conf_pred_test = ols_model.predict(X_test_with_intercept)
lower_bound_conf = # YOUR CODE HERE
upper_bound_conf = # YOUR CODE HERE

# Calculate the conformal coverage
conformal_coverage = # YOUR CODE HERE
print(f"Conformal prediction interval coverage: {conformal_coverage:.4f} (target: {1-alpha:.1f})")

**Task 4**: Visualize the conformal prediction intervals and compare them with the OLS prediction intervals.

In [None]:
# TODO: Generate conformal prediction intervals for the range of x values
y_conf_pred_range = ols_model.predict(x_range_with_intercept)
lower_bound_conf_range = # YOUR CODE HERE
upper_bound_conf_range = # YOUR CODE HERE

# Visualize both OLS and conformal prediction intervals
plt.figure(figsize=(12, 7))
plt.scatter(X_train, Y_train, alpha=0.3, label="Training Data")
plt.scatter(X_test, Y_test, alpha=0.5, color="green", label="Test Data")
plt.plot(x_range, y_conf_pred_range, color="red", linewidth=2, label="OLS Fitted Line")

# Plot OLS prediction intervals
plt.fill_between(x_range, 
                 prediction_summary['obs_ci_lower'], 
                 prediction_summary['obs_ci_upper'], 
                 color="red", alpha=0.2, label=f"{100*(1-alpha)}% OLS Prediction Interval")

# Plot conformal prediction intervals
plt.fill_between(x_range, 
                 lower_bound_conf_range, 
                 upper_bound_conf_range, 
                 color="blue", alpha=0.2, label=f"{100*(1-alpha)}% Conformal Prediction Interval")

plt.xlabel("Father's Height (inches)")
plt.ylabel("Child's Height (inches)")
plt.title("Comparison of OLS and Conformal Prediction Intervals")
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

## Part 6: Empirical Validation of Finite-Sample Validity

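The conformal guarantee is marginal: it holds on average over the random draw of the calibration and test data, so the coverage observed on any single split will fluctuate from split to split. To check the finite-sample validity empirically, let's perform multiple random splits of the data and compute the coverage for each split.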
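
As a rough guide to how much spread to expect, a standard result for split conformal prediction says that, when the scores have no ties and the data are exchangeable, the coverage conditional on a fixed training and calibration set (averaged over infinitely many test points) follows a Beta$(k, m+1-k)$ distribution with $k = \lceil (1-\alpha)(m+1) \rceil$. The sketch below evaluates the mean and standard deviation of that distribution for a few hypothetical calibration sizes `m`; the function name `coverage_beta_summary` is illustrative and not part of the worksheet. The coverages computed in this part will be somewhat noisier than this, since each split also has a finite test set.

```python
import math


def coverage_beta_summary(m, alpha):
    """Mean and standard deviation of Beta(k, m + 1 - k), k = ceil((1 - alpha)(m + 1))."""
    k = math.ceil(round((1 - alpha) * (m + 1), 10))
    if k > m:
        raise ValueError("calibration set too small for a finite interval at this level")
    a, b = k, m + 1 - k
    mean = a / (a + b)  # equals k / (m + 1), which is at least 1 - alpha
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(var)


# Hypothetical calibration sizes, chosen only for illustration
for m in (50, 200, 1000):
    mean, sd = coverage_beta_summary(m, alpha=0.1)
    print(f"m = {m}: mean coverage = {mean:.4f}, sd = {sd:.4f}")
```

The mean never drops below the target, and the spread shrinks as the calibration set grows.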

In [None]:
# TODO: Implement a function to compute the coverage for a given random split
def compute_coverage(X, Y, train_size=0.5, calib_size=0.25, alpha=0.1, seed=None):
    if seed is not None:
        np.random.seed(seed)
    
    # Split the data
    n = len(X)
    idx = np.random.permutation(n)
    n_train = int(np.floor(train_size * n))
    n_calib = int(np.floor(calib_size * n))
    
    idx_train = idx[:n_train]
    idx_calib = idx[n_train:n_train + n_calib]
    idx_test = idx[n_train + n_calib:]
    
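    # Select rows by position (.iloc) so the split does not depend on index labels
    X_train, Y_train = X.iloc[idx_train], Y.iloc[idx_train]
    X_calib, Y_calib = X.iloc[idx_calib], Y.iloc[idx_calib]
    X_test, Y_test = X.iloc[idx_test], Y.iloc[idx_test]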
    
    # Fit the OLS model
    X_train_with_intercept = sm.add_constant(X_train)
    model = sm.OLS(Y_train, X_train_with_intercept).fit()
    
    # Compute OLS prediction intervals
    X_test_with_intercept = sm.add_constant(X_test)
    test_predictions = model.get_prediction(X_test_with_intercept).summary_frame(alpha=alpha)
    ols_coverage = np.mean((np.asarray(Y_test) >= np.asarray(test_predictions['obs_ci_lower'])) &
                           (np.asarray(Y_test) <= np.asarray(test_predictions['obs_ci_upper'])))
    
    # Compute conformal prediction intervals (fill in the blanks below)
    X_calib_with_intercept = sm.add_constant(X_calib)
    y_pred_calib = model.predict(X_calib_with_intercept)
    nonconformity_scores = # YOUR CODE HERE
    
    m = len(X_calib)
    quantile_index = # YOUR CODE HERE
    sorted_scores = # YOUR CODE HERE
    q_hat = # YOUR CODE HERE
    
    y_pred_test = model.predict(X_test_with_intercept)
    lower_bound_conf = # YOUR CODE HERE
    upper_bound_conf = # YOUR CODE HERE
    conformal_coverage = # YOUR CODE HERE
    
    return ols_coverage, conformal_coverage

In [None]:
# Perform multiple random splits and compute coverage
n_splits = 100
ols_coverages = []
conformal_coverages = []

for i in range(n_splits):
    ols_cov, conf_cov = compute_coverage(X, Y, seed=i)
    ols_coverages.append(ols_cov)
    conformal_coverages.append(conf_cov)
    
print(f"Average OLS coverage: {np.mean(ols_coverages):.4f} (std: {np.std(ols_coverages):.4f})")
print(f"Average conformal coverage: {np.mean(conformal_coverages):.4f} (std: {np.std(conformal_coverages):.4f})")
print(f"Target coverage: {1-alpha:.1f}")

In [None]:
# Visualize the distribution of coverages
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
plt.hist(ols_coverages, bins=20, alpha=0.7, color='red')
plt.axvline(1-alpha, color='black', linestyle='dashed', linewidth=2, label=f"Target ({1-alpha:.1f})")
plt.axvline(np.mean(ols_coverages), color='blue', linestyle='solid', linewidth=2, label=f"Mean ({np.mean(ols_coverages):.4f})")
plt.xlabel("Coverage")
plt.ylabel("Frequency")
plt.title("OLS Prediction Interval Coverage")
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.hist(conformal_coverages, bins=20, alpha=0.7, color='blue')
plt.axvline(1-alpha, color='black', linestyle='dashed', linewidth=2, label=f"Target ({1-alpha:.1f})")
plt.axvline(np.mean(conformal_coverages), color='red', linestyle='solid', linewidth=2, label=f"Mean ({np.mean(conformal_coverages):.4f})")
plt.xlabel("Coverage")
plt.ylabel("Frequency")
plt.title("Conformal Prediction Interval Coverage")
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 7: Discussion Questions

1. How do the OLS and conformal prediction intervals compare in terms of coverage?
2. Are the conformal prediction intervals wider or narrower than the OLS intervals? Why?
3. What are the key assumptions of OLS prediction intervals? How does conformal prediction differ in terms of assumptions?
4. How might you modify the nonconformity score to get potentially narrower prediction intervals for this data?
5. Would you expect the same coverage guarantee if the relationship between fathers' and children's heights were non-linear? Explain.

## Bonus Task: Exploring Different Nonconformity Scores

The basic nonconformity score we used was $s(x, y) = |y - \hat{\mu}(x)|$. Can you think of and implement a different nonconformity score? For example, you might consider normalized residuals or studentized residuals. How do these affect the resulting prediction intervals?
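
As one possible starting point, here is a minimal sketch of a normalized score $s(x, y) = |y - \hat{\mu}(x)| / \hat{\sigma}(x)$, which yields intervals $\hat{\mu}(x) \pm \hat{q}\,\hat{\sigma}(x)$ whose width varies with $x$. Everything in it is an assumption made for illustration: the data are synthetic (not Galton's), the noise is made to grow with `x` on purpose, and $\hat{\sigma}$ is a crude linear fit to the absolute training residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1

# Synthetic heteroscedastic data (NOT Galton's): the noise level grows with x.
n = 3000
x = rng.uniform(60, 80, size=n)
y = 30 + 0.5 * x + rng.normal(scale=0.1 * (x - 55), size=n)

# Three-way split: training / calibration / test (the data are already in random order).
x_tr, x_cal, x_te = x[:1000], x[1000:2000], x[2000:]
y_tr, y_cal, y_te = y[:1000], y[1000:2000], y[2000:]

# Fit the mean by least squares, then fit a line to the absolute training
# residuals as a rough estimate of the local noise scale.
mu_coef = np.polyfit(x_tr, y_tr, deg=1)
abs_res_tr = np.abs(y_tr - np.polyval(mu_coef, x_tr))
sigma_coef = np.polyfit(x_tr, abs_res_tr, deg=1)


def mu_hat(x_new):
    return np.polyval(mu_coef, x_new)


def sigma_hat(x_new):
    # Clip from below so the scale estimate stays strictly positive.
    return np.maximum(np.polyval(sigma_coef, x_new), 1e-3)


# Normalized nonconformity scores on the calibration set.
scores = np.abs(y_cal - mu_hat(x_cal)) / sigma_hat(x_cal)
m = len(scores)
k = int(np.ceil((1 - alpha) * (m + 1)))  # 1-based rank of the score used as q_hat
q_hat = np.sort(scores)[k - 1]

# Intervals on the test set: the half-width now depends on x.
half_width = q_hat * sigma_hat(x_te)
lower, upper = mu_hat(x_te) - half_width, mu_hat(x_te) + half_width
coverage = np.mean((y_te >= lower) & (y_te <= upper))

print(f"Coverage with normalized scores: {coverage:.4f} (target: {1 - alpha:.1f})")
print(f"Half-width range: {half_width.min():.2f} to {half_width.max():.2f}")
```

The marginal coverage guarantee is unchanged because $\hat{\sigma}$ is fitted on the training set only; whether the intervals get narrower on Galton's data depends on whether the residual spread actually varies with the father's height.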