This tutorial has been modified from meps-cqr.ipynb from conformal-prediction (https://github.com/aangelopoulos/conformal-prediction). It is based on Romano et al. (2019) (https://proceedings.neurips.cc/paper/2019/hash/5103c3584b063c431bd1268e9b5e76fb-Abstract.html). This worksheet in particular was created with the help of Claude 3.7.

# Worksheet: Improved conditional coverage with *conformalized quantile regression* (CQR) using the Medical Expenditure Panel Survey (MEPS) data

This tutorial explores how to apply conformal prediction techniques to quantile regression outputs using the Medical Expenditure Panel Survey (MEPS) data. The goal is to create prediction intervals that can adapt to different levels of noise with reliable coverage.

### What is conformal prediction?

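Conformal prediction is a framework for constructing prediction intervals with guaranteed coverage in finite samples, assuming only that the data are exchangeable (for example, i.i.d.) rather than following a particular distribution. The guarantee is marginal: it holds on average over test points, not for every value of the features. The key idea is to use a held-out calibration dataset to determine how much to adjust our predictions to achieve the desired coverage level.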
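
As a minimal sketch of the calibration idea (not part of the original notebook), the snippet below runs split conformal prediction on synthetic data, using the absolute residual of a deliberately crude predictor as the score. The data-generating process, the function `predict`, and all variable names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): y = 2x + Gaussian noise
n = 2000
x = rng.uniform(0, 1, size=n)
y = 2 * x + rng.normal(0, 0.5, size=n)

def predict(x):
    """Hypothetical pre-trained point predictor; deliberately imperfect."""
    return 1.8 * x

# Hold out the first m points for calibration; the rest are test points
m = 500
x_cal, y_cal = x[:m], y[:m]
x_te, y_te = x[m:], y[m:]

alpha = 0.1  # target miscoverage rate

# Score each calibration point by its absolute residual, then sort
scores = np.sort(np.abs(y_cal - predict(x_cal)))

# Threshold with finite-sample correction:
# the ceil((m + 1) * (1 - alpha))-th smallest calibration score
k = int(np.ceil((m + 1) * (1 - alpha)))
qhat = scores[k - 1]

# Widen the point prediction by qhat on each side and check test coverage
lower, upper = predict(x_te) - qhat, predict(x_te) + qhat
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"qhat = {qhat:.3f}, test coverage = {coverage:.3f}")
```

With this score every interval has the same width, `2 * qhat`, regardless of `x`. That lack of adaptivity is what the quantile-regression-based score used later in this worksheet is meant to address.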

### What is quantile regression?

Quantile regression estimates conditional quantiles of the response variable given the features. For a target quantile level $\gamma$, we estimate the function $t_\gamma(x)$ such that $\mathbb{P}(Y \le t_\gamma(x) \mid X = x) = \gamma$. This is particularly useful for constructing prediction intervals, as we can use the lower and upper quantiles (e.g., $\gamma = 0.1/2 = 0.05$ and $\gamma = 1 - 0.1/2 = 0.95$ for a 90% interval).
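
Quantile regression models are commonly fit by minimizing the pinball (quantile) loss. This notebook loads pre-computed quantile estimates and does not show a training step, so the following is background illustration only. It checks numerically, on a synthetic sample with no features, that the constant minimizing the average pinball loss at level $\gamma$ is approximately the empirical $\gamma$-quantile. The helper `pinball_loss`, the exponential sample, and the search grid are assumptions for this sketch.

```python
import numpy as np

def pinball_loss(y, t, gamma):
    """Average pinball (quantile) loss of predicting the constant t at level gamma."""
    diff = y - t
    # Under-predictions (diff > 0) are weighted by gamma,
    # over-predictions (diff < 0) by (1 - gamma)
    return np.mean(np.maximum(gamma * diff, (gamma - 1) * diff))

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=5000)  # skewed synthetic response (illustrative)

gamma = 0.95
grid = np.linspace(0, 8, 801)  # candidate constants t, spaced 0.01 apart
losses = np.array([pinball_loss(y, t, gamma) for t in grid])
t_best = grid[np.argmin(losses)]

print(f"pinball-loss minimizer on the grid: {t_best:.2f}")
print(f"empirical {gamma} quantile:          {np.quantile(y, gamma):.2f}")
```

The two printed values should agree up to the grid spacing. A quantile regression model does the same thing with $t$ replaced by a function of the features.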

### Conformalized quantile regression (CQR)

Conformalized quantile regression (CQR) combines these ideas: we start with quantile regression estimates and then "conformalize" them to guarantee coverage. This approach maintains the adaptivity of quantile regression while providing the coverage guarantees of conformal prediction.

In [None]:
import os
import json
import numpy as np
import matplotlib.pyplot as plt
!pip install -U --no-cache-dir gdown --pre

## Loading the MEPS data

We'll use the Medical Expenditure Panel Survey (MEPS) data, which contains information about healthcare expenditures. Our goal is to predict medical expenses (Y) based on various patient characteristics (X).

The data includes:
- X: Features related to patient demographics and health status
- Y: Medical expenses (our target variable)
- L: Lower quantile estimates from a pre-trained quantile regression model
- U: Upper quantile estimates from the same model

In [None]:
# Load cached data
if not os.path.exists('../data'):
    os.system('gdown 1h7S6N_Rx7gdfO3ZunzErZy6H7620EbZK -O ../data.tar.gz')
    os.system('tar -xf ../data.tar.gz -C ../')
    os.system('rm ../data.tar.gz')
    
data = np.load('../data/meps/meps-gbr.npz')
X, Y, L, U = data['X'], data['y'], data['lower'], data['upper']

In [None]:
# Let's look at the data dimensions
print(f"X shape: {X.shape}")
print(f"Y shape: {Y.shape}")
print(f"L shape: {L.shape}")
print(f"U shape: {U.shape}")

# Plot the distribution of medical expenses
plt.figure(figsize=(10, 6))
plt.hist(Y, bins=50, alpha=0.7)
plt.title('Distribution of Medical Expenses')
plt.xlabel('Expenses ($)')
plt.ylabel('Frequency')
plt.show()

## Setting up the experiment

We need to:
1. Define the desired coverage level $1-\alpha$
2. Split our data into calibration and test sets

In [None]:
# EXERCISE 1: Set the target miscoverage rate alpha and calibration set size m
alpha = # Your code here - target miscoverage rate (e.g., 0.1 for 90% coverage)
m = # Your code here - number of calibration points

In [None]:
# EXERCISE 2: Split the data into calibration and test sets
# Create a boolean mask for selecting calibration points
idx = # Your code here
np.random.shuffle(idx)  # Shuffle to randomly select calibration points

# Use the mask to split the data
Y_cal, Y_te = # Your code here
L_cal, L_te = # Your code here
U_cal, U_te = # Your code here

## Conformalized quantile regression (CQR)

The CQR nonconformity score for an observation $(x,y)$ is:

$$s(x,y) = \max\left\{\hat{t}_{\alpha/2}(x)-y, y-\hat{t}_{1-\alpha/2}(x)\right\}$$

**Think.** What is this nonconformity score capturing?
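
As a small worked example with made-up numbers (an assumption for illustration, not values from MEPS), suppose the estimated quantiles at some $x$ are $\hat{t}_{\alpha/2}(x) = 2$ and $\hat{t}_{1-\alpha/2}(x) = 10$. Then for three possible responses:

$$
\begin{aligned}
y = 5:&\quad s(x,y) = \max\{2 - 5,\; 5 - 10\} = \max\{-3,\, -5\} = -3,\\
y = 1:&\quad s(x,y) = \max\{2 - 1,\; 1 - 10\} = \max\{1,\, -9\} = 1,\\
y = 12:&\quad s(x,y) = \max\{2 - 12,\; 12 - 10\} = \max\{-10,\, 2\} = 2.
\end{aligned}
$$

Compare the sign and size of each score with where $y$ sits relative to the interval $[2, 10]$.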

In [None]:
# EXERCISE 3: Calculate nonconformity scores for the calibration set
S_cal = # Your code here

# Sort the scores
S_cal = np.sort(S_cal)

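# Calculate the conformity threshold (qhat)
# This is the (1-alpha) quantile of the nonconformity scores with a finite-sample
# correction: the ceil((m+1)*(1-alpha))/m empirical quantile of the m calibration scores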
qhat = # Your code here

## Constructing conformalized prediction intervals

Now, we can construct our conformalized prediction intervals for the test set. The interval is:

$$\hat C(x) = [\hat{t}_{\alpha/2}(x) - \hat{q}, \hat{t}_{1-\alpha/2}(x) + \hat{q}],$$

where $\hat{q}$ is our conformity threshold (qhat).

In [None]:
# EXERCISE 4: Construct conformalized prediction intervals
Chat = # Your code here

## Evaluating coverage

Let's check if our conformalized prediction intervals achieve the desired coverage. We'll also compare them with the original (non-conformalized) intervals $[\hat{t}_{\alpha/2}(x), \hat{t}_{1-\alpha/2}(x)]$.

In [None]:
# EXERCISE 5: Calculate and compare empirical coverage
# Coverage before conformalization
empirical_coverage0 = # Your code here
print(f"The empirical coverage before conformalization is: {empirical_coverage0}")

# Coverage after conformalization
empirical_coverage = # Your code here
print(f"The empirical coverage after conformalization is: {empirical_coverage}")

## Examining conditional coverage

One major advantage of CQR is that it tends to improve conditional coverage across subgroups, even though conformal prediction formally guarantees only marginal coverage. Let's check whether coverage is roughly uniform across the different cancer diagnosis categories.

In [None]:
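# Process cancer diagnosis variables (test rows only; idx is the boolean calibration mask)
X_cancer = X[~idx, 40:45]
# Re-binarize each of the five indicator columns: entries equal to the column maximum become 1
for col in range(X_cancer.shape[1]):
    one_val = X_cancer[:, col].max()
    X_cancer[:, col] = (X_cancer[:, col] == one_val).astype(int)
# Collapse the indicators into a single category label in 1..5
cancer_dx = X_cancer.dot(np.arange(5)+1).astype(int)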

# Count observations in each cancer diagnosis category
counts = [(cancer_dx == dx).sum() for dx in np.arange(5)+1]
print("Number of observations per cancer diagnosis category:")
for i, count in enumerate(counts):
    print(f"Category {i+1}: {count}")

In [None]:
# EXERCISE 6: Calculate stratified coverage for each cancer diagnosis category
# Coverage before conformalization
stratified_coverage0 = # Your code here

# Coverage after conformalization
stratified_coverage = # Your code here

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(np.arange(5)+1, stratified_coverage0, "o", label="Pre-conformalization")
plt.plot(np.arange(5)+1, stratified_coverage, "o", label="Post-conformalization")
plt.hlines(1-alpha, 0.5, 5.5, 'r', label="Target coverage")
plt.xlim(0.75, 5.25)
plt.ylim(0.6, 1.0)
plt.xlabel("Cancer Diagnosis Category")
plt.ylabel("Empirical Coverage")
plt.title("Coverage by Cancer Diagnosis Category")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Key takeaways

1. **The Coverage Problem**: Traditional quantile regression methods produce intervals that often don't achieve their target coverage in finite samples.

2. **Conformal Prediction Solution**: By applying conformal prediction to quantile regression outputs, we can guarantee marginal coverage of at least 1-α without distributional assumptions beyond exchangeability of the calibration and test data.

3. **The CQR Procedure**:
   - Start with estimated conditional quantiles from any quantile regression method
   - Calculate nonconformity scores on a calibration set
   - Adjust the original interval based on these scores

4. **Empirical Results**:
   - Pre-conformalization intervals achieved ~73% coverage when targeting 90%
   - Post-conformalization intervals achieved ~93% coverage, meeting our target
   - Coverage was more uniform across different subgroups after conformalization

5. **Advantages**:
   - Distribution-free coverage guarantees
   - Can be applied on top of any existing quantile regression method
   - Often improves conditional coverage across different subpopulations in practice (not formally guaranteed)

6. **Limitations**:
   - Requires a separate calibration set
   - May produce wider intervals than uncalibrated methods
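
The CQR procedure summarized in takeaway 3 can be sketched end to end on synthetic data rather than MEPS. This is a minimal illustration under stated assumptions, not the notebook's reference solution: the heteroscedastic data, the stand-in quantile estimates `lo` and `hi` (made deliberately too narrow to mimic undercoverage), and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Heteroscedastic synthetic data (illustrative): the noise level grows with x
n = 4000
x = rng.uniform(0, 1, size=n)
sigma = 0.2 + x
y = x + sigma * rng.normal(size=n)

# Stand-in for quantile regression output. The true 5%/95% quantiles are
# x -/+ 1.645 * sigma; using 1.0 * sigma makes the intervals too narrow.
lo = x - 1.0 * sigma
hi = x + 1.0 * sigma

# Step 1: randomly split into calibration and test sets with a boolean mask
alpha = 0.1
m = 1000
is_cal = np.zeros(n, dtype=bool)
is_cal[:m] = True
rng.shuffle(is_cal)

# Step 2: CQR scores on the calibration set and the corrected threshold
scores = np.sort(np.maximum(lo[is_cal] - y[is_cal], y[is_cal] - hi[is_cal]))
k = int(np.ceil((m + 1) * (1 - alpha)))  # rank of the threshold score
qhat = scores[k - 1]

# Step 3: shift both endpoints outward by qhat on the test set
y_te, lo_te, hi_te = y[~is_cal], lo[~is_cal], hi[~is_cal]
lo_cqr, hi_cqr = lo_te - qhat, hi_te + qhat

cov_before = np.mean((y_te >= lo_te) & (y_te <= hi_te))
cov_after = np.mean((y_te >= lo_cqr) & (y_te <= hi_cqr))
print(f"qhat = {qhat:.3f}")
print(f"coverage before: {cov_before:.3f}, after: {cov_after:.3f}")
```

Because `qhat` is added to endpoints that already vary with `x`, the conformalized intervals stay wider where the noise is larger. If the original intervals were too wide instead, `qhat` would come out negative and the intervals would shrink.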

## Discussion questions

1. How does the conformalization process improve coverage compared to the original quantile regression intervals?

2. How would you expect the width of the prediction intervals to change after conformalization? Why?

3. What might explain any differences in coverage across the different cancer diagnosis categories?

4. How might the choice of the base quantile regression model affect the final conformalized intervals?

5. How would the choice of α affect our results? What tradeoffs are involved in selecting different values?

6. How should we decide the size of the calibration set? What happens if it's too small or too large?

7. How does CQR compare to other uncertainty quantification methods like bootstrap or Bayesian approaches?

8. In what real-world scenarios would guaranteed coverage be particularly important when predicting medical expenses?

9. How might conformal prediction help ensure fair treatment across different demographic groups?

10. How would you explain the concept of conformalized prediction intervals to non-technical stakeholders?