<a href="https://colab.research.google.com/github/Valentin-Laurent/MAPIE-DataCraft/blob/main/notebooks/regression-tutorial-correction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measuring Model Uncertainty in Regression with MAPIE
---

In this notebook, we will estimate prediction intervals with MAPIE.

We will assess the quality of our prediction intervals using two metrics:

- The "effective" coverage, which is the percentage of test points whose true value falls inside its prediction interval. For example, for a target confidence level of 90%, about 90% of the test points should fall within the produced intervals.
- The average width of the prediction intervals, which should be as close as possible to the "theoretical" width implied by the noise used to generate the data.
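To make the two metrics concrete, here is a minimal hand-rolled sketch using plain Python lists. The function name `coverage_and_mean_width` and the toy values are illustrative assumptions; the notebook itself relies on MAPIE's `regression_coverage_score` and `regression_mean_width_score` instead.

```python
def coverage_and_mean_width(y_true, lower, upper):
    """Return (coverage, mean width) for one interval per observation."""
    n = len(y_true)
    # A point is covered when it lies between its lower and upper bound.
    n_covered = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    # The width of an interval is simply upper bound minus lower bound.
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return n_covered / n, mean_width


# Toy values, for illustration only: the third point falls outside its interval.
y_true = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.2, 3.0]
upper = [1.5, 2.5, 4.0, 5.0]

coverage, mean_width = coverage_and_mean_width(y_true, lower, upper)
print(f"coverage={coverage:.2f}, mean width={mean_width:.2f}")
```

Three of the four toy points are covered, so the coverage is 0.75, below what a 90% target would call for.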

In [1]:
!rm -rf /content/MAPIE-DataCraft
!git clone https://github.com/Valentin-Laurent/MAPIE-DataCraft

Cloning into 'MAPIE-DataCraft'...
remote: Enumerating objects: 82, done.
remote: Counting objects: 100% (82/82), done.
remote: Compressing objects: 100% (50/50), done.
remote: Total 82 (delta 26), reused 82 (delta 26), pack-reused 0 (from 0)
Receiving objects: 100% (82/82), 1.53 MiB | 8.30 MiB/s, done.
Resolving deltas: 100% (26/26), done.


In [2]:
!pip install mapie

Collecting mapie
  Downloading mapie-1.0.1-py3-none-any.whl.metadata (11 kB)
Downloading mapie-1.0.1-py3-none-any.whl (173 kB)
Installing collected packages: mapie
Successfully installed mapie-1.0.1


# Imports

In [4]:
import json
import sys

sys.path.append('/content/MAPIE-DataCraft/notebooks/utils')

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, QuantileRegressor
from mapie.metrics.regression import regression_coverage_score, regression_mean_width_score
from mapie.regression import CrossConformalRegressor, ConformalizedQuantileRegressor

from dataset import (
    x_sinx,
    get_1d_data_with_constant_noise,
    get_1d_data_with_heteroscedastic_noise,
    get_1d_data_with_normal_distribution,
)
from viz import (
    plot_regression,
    plot_uncertainties,
    plot_prediction_interval_width,
)

# Uncertainty in Regression

## Homoscedastic Noise

Let's start by building an artificial dataset. We will use the function $f(x) = x\sin(x)$ to which we add constant Gaussian noise.

In [5]:
X, y, X_test, y_test, y_mesh = get_1d_data_with_constant_noise(
    funct=x_sinx,
    min_x=-5,
    max_x=5,
    n_samples=600,
    noise=0.5
)

Let's visualize the dataset and its generating function.

In [None]:
plot_regression(
    X_test,
    y_test,
    y_mesh,
    name_mesh="Generator",
    title="Homoscedastic Problem",
)

We will learn a polynomial model to fit the data.

In [None]:
polyn_model = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=10)),
        ("linear", LinearRegression())
    ]
)

**Exercise 1**: We now want to train this model with MAPIE and obtain 95% prediction intervals.
- Instantiate a `CrossConformalRegressor` wrapping our polynomial model with the CV+ method with 5 cross-validation folds, and a confidence level of 95%.
- Train and conformalize the `CrossConformalRegressor` on the dataset.
- Predict on the test set.

In [None]:
mapie_regressor = CrossConformalRegressor(  # correction
    estimator=polyn_model,  # correction
    confidence_level=0.95,  # correction
    cv=5,  # correction
    method="plus",  # correction
    random_state=1  # correction
)  # correction
mapie_regressor.fit_conformalize(X, y)  # correction
y_preds, y_pred_intervals = mapie_regressor.predict_interval(X_test)  # correction

Let's visualize the prediction intervals obtained on the test set.

In [None]:
plot_uncertainties(
    X_test,
    y_test,
    y_preds,
    y_pred_intervals,
    title="Prediction Intervals with 95% Confidence Level"
)

Let's visualize the width of the prediction intervals as a function of $x$.

In [None]:
plot_prediction_interval_width(
    X_test,
    y_pred_intervals,
    title="Width of Prediction Intervals",
    yaxis_title="Width"
)

Here we see that the width of the prediction intervals is roughly constant, which is expected given the homoscedasticity of the problem!
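One way to see why the width is constant is to look at a simplified conformal procedure. The sketch below uses plain split conformal prediction with absolute residuals, not the CV+ method that `CrossConformalRegressor` applies with `method="plus"`, so it only conveys the intuition. The function name and the toy residuals are assumptions made for illustration.

```python
import math


def split_conformal_half_width(residuals, confidence_level):
    """Half-width of a split-conformal interval built from held-out residuals."""
    scores = sorted(abs(r) for r in residuals)
    n = len(scores)
    # Finite-sample correction: take the ceil((n + 1) * level)-th smallest score.
    rank = math.ceil((n + 1) * confidence_level)
    # With too few held-out points the guarantee requires an infinite interval.
    return scores[rank - 1] if rank <= n else math.inf


# Toy held-out residuals (y - y_hat); values are illustrative only.
residuals = [0.1, -0.4, 0.3, -0.2]
half_width = split_conformal_half_width(residuals, confidence_level=0.5)

# The same half-width is applied around every prediction, whatever x is.
intervals = [(y_hat - half_width, y_hat + half_width) for y_hat in (-1.0, 0.0, 2.5)]
widths = [upper - lower for lower, upper in intervals]
```

Because a single residual quantile is added and subtracted around each prediction, every interval in this sketch has the same width.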

**Exercise 2**: Calculate the uncertainty metrics:
- Coverage rate (`regression_coverage_score`)
- Average width of prediction intervals (`regression_mean_width_score`)
- Did we achieve the target coverage rate of 95%?
- The theoretical width of the intervals is `1.96`. Is the average width of the intervals produced by MAPIE larger or smaller?
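The `1.96` figure can be recovered from the noise level, assuming that the `noise=0.5` argument passed to `get_1d_data_with_constant_noise` is the standard deviation of the Gaussian noise (an assumption about the helper, which is not shown here). Under that assumption, the central 95% interval of the noise spans plus or minus the 97.5% standard normal quantile times the standard deviation. The helper name `theoretical_width` below is hypothetical.

```python
from statistics import NormalDist


def theoretical_width(noise_std, confidence_level):
    """Width of the central interval of a zero-mean Gaussian with std noise_std."""
    alpha = 1 - confidence_level
    # Two-sided interval: leave alpha / 2 probability in each tail.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * noise_std


print(f"{theoretical_width(0.5, 0.95):.2f}")  # prints 1.96
```

The match with the standard normal quantile 1.96 is a coincidence of the noise level: the factor 2 for the two sides cancels the standard deviation of 0.5.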

In [None]:
print(f"Empirical coverage rate: {regression_coverage_score(y_test, y_pred_intervals)[0]:.3f}")  # correction
print(f"Average interval width: {regression_mean_width_score(y_pred_intervals)[0]:.3f}")  # correction

## Heteroscedastic Noise

Let's start by building an artificial dataset. We will use the function $f(x) = x\sin(x)$ to which we add Gaussian noise proportional to $x$.

In [None]:
X, y, X_test, y_test, y_mesh = get_1d_data_with_heteroscedastic_noise(
    funct=x_sinx,
    min_x=0,
    max_x=5,
    n_samples=600,
    noise=0.5
)

Let's visualize the dataset and its generating function.

In [None]:
plot_regression(
    X_test,
    y_test,
    y_mesh,
    name_mesh="Generator",
    title="Heteroscedastic Problem",
)

In this setting, a `CrossConformalRegressor` would produce prediction intervals of roughly constant width, even though the noise in the data is far from constant!
Fortunately, there is a way to obtain adaptive prediction intervals: conformalized quantile regression. Let's first instantiate a quantile model.

In [None]:
polyn_model_quant = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=10)),
        ("linear", QuantileRegressor(solver="highs", alpha=0))
    ]
)

**Exercise 3**: We now want to train this quantile model with MAPIE and obtain 95% prediction intervals.
- Split the input data (`X` and `y`) into `X_train`, `X_conformalize`, `y_train`, `y_conformalize`
- Instantiate a `ConformalizedQuantileRegressor` wrapping our polynomial quantile model with a confidence level of 95%
- Train the `ConformalizedQuantileRegressor` on the training set, and conformalize it on the conformalization set
- Predict on the test set.

In [None]:
X_train, X_conformalize, y_train, y_conformalize = train_test_split(X, y)  # correction
mapie_regressor = ConformalizedQuantileRegressor(estimator=polyn_model_quant, confidence_level=0.95)  # correction
mapie_regressor.fit(X_train, y_train)  # correction
mapie_regressor.conformalize(X_conformalize, y_conformalize)  # correction
y_preds, y_pred_intervals = mapie_regressor.predict_interval(X_test)  # correction

Let's visualize the prediction intervals obtained on the test set.

In [None]:
plot_uncertainties(
    X_test,
    y_test,
    y_preds,
    y_pred_intervals,
    title="Prediction Intervals with 95% Confidence Level"
)

Let's visualize the width of the prediction intervals as a function of $x$.

In [None]:
plot_prediction_interval_width(
    X_test,
    y_pred_intervals,
    title="Width of Prediction Intervals",
    yaxis_title="Width"
)

Ah, there it is! We have captured the heteroscedasticity well!
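The adaptivity comes from the quantile model itself: conformalization only shifts its lower and upper quantile predictions by a single constant. The sketch below illustrates the usual conformalized quantile regression recipe on toy numbers. The function names and values are assumptions made for illustration, and MAPIE's internals may differ in their details.

```python
import math


def cqr_correction(y_cal, lower_cal, upper_cal, confidence_level):
    """Constant added to both sides of the quantile band after conformalization."""
    # Score: signed distance outside the band (negative when y lies inside it).
    scores = sorted(
        max(lo - y, y - hi) for y, lo, hi in zip(y_cal, lower_cal, upper_cal)
    )
    n = len(scores)
    rank = math.ceil((n + 1) * confidence_level)
    return scores[rank - 1] if rank <= n else math.inf


def cqr_interval(lower, upper, correction):
    """Widen (or shrink, if correction < 0) a predicted quantile band."""
    return lower - correction, upper + correction


# Toy conformalization set with lower/upper quantile predictions (illustrative).
y_cal = [1.0, 2.0, 3.0, 4.0]
lower_cal = [0.5, 2.5, 2.0, 3.0]
upper_cal = [1.5, 3.5, 2.5, 6.0]

correction = cqr_correction(y_cal, lower_cal, upper_cal, confidence_level=0.5)
narrow = cqr_interval(1.0, 2.0, correction)  # band of width 1 before correction
wide = cqr_interval(1.0, 4.0, correction)    # band of width 3 before correction
```

Both bands receive the same correction, so the difference in width predicted by the quantile model is preserved after conformalization.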

**Exercise 4**: Calculate the uncertainty metrics:
- Coverage rate (`regression_coverage_score`)
- Average width of prediction intervals (`regression_mean_width_score`)
- Did we achieve the target coverage rate of 95%?

In [None]:
print(f"Empirical coverage rate: {regression_coverage_score(y_test, y_pred_intervals)[0]:.3f}")  # correction
print(f"Average interval width: {regression_mean_width_score(y_pred_intervals)[0]:.3f}")  # correction

Bingo! The coverage rate is still good, and the average size of our intervals is significantly lower than if we had used a `CrossConformalRegressor`!

## Epistemic Uncertainty

Let's start by building an artificial dataset. We will use the function $f(x) = x\sin(x)$ to which we add constant Gaussian noise, but with data points distributed non-uniformly.

In [None]:
X, y, X_test, y_test, y_mesh = get_1d_data_with_normal_distribution(
    funct=x_sinx,
    mu=0,
    sigma=2,
    n_samples=600,
    noise=0.5
)

Let's visualize the dataset and its generating function.

In [None]:
plot_regression(
    X_test,
    y_test,
    y_mesh,
    name_mesh="Generator",
    title="Epistemic Problem",
)

**Exercise 5**: We now want to train the quantile model with MAPIE on this dataset and obtain 95% prediction intervals.
- Split the input data (`X` and `y`) into `X_train`, `X_conformalize`, `y_train`, `y_conformalize`
- Instantiate a `ConformalizedQuantileRegressor` wrapping our polynomial quantile model with a confidence level of 95%
- Train the `ConformalizedQuantileRegressor` on the training set, and conformalize it on the conformalization set
- Predict on the test set.

In [None]:
X_train, X_conformalize, y_train, y_conformalize = train_test_split(X, y)  # correction
mapie_regressor = ConformalizedQuantileRegressor(estimator=polyn_model_quant, confidence_level=0.95)  # correction
mapie_regressor.fit(X_train, y_train)  # correction
mapie_regressor.conformalize(X_conformalize, y_conformalize)  # correction
y_preds, y_pred_intervals = mapie_regressor.predict_interval(X_test)  # correction

Let's visualize the prediction intervals obtained on the test set.

In [None]:
plot_uncertainties(
    X_test,
    y_test,
    y_preds,
    y_pred_intervals,
    title="Prediction Intervals with 95% Confidence Level"
)

Let's visualize the width of the prediction intervals as a function of $x$.

In [None]:
plot_prediction_interval_width(
    X_test,
    y_pred_intervals,
    title="Width of Prediction Intervals",
    yaxis_title="Width"
)

We see that the prediction intervals widen sharply where the density of the dataset decreases, capturing the epistemic uncertainty well!
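To get a feel for how sparse the data becomes away from the center, the sketch below estimates how many of the 600 samples are expected in two intervals of equal length. It assumes that `mu=0` and `sigma=2` are the mean and standard deviation of the normal distribution from which the inputs are drawn, which is an assumption about the `get_1d_data_with_normal_distribution` helper. The function `expected_count` is hypothetical.

```python
from statistics import NormalDist

# Assumed distribution of the inputs x (mean 0, standard deviation 2).
x_dist = NormalDist(mu=0, sigma=2)


def expected_count(low, high, n_samples=600):
    """Expected number of samples whose x falls in [low, high]."""
    return n_samples * (x_dist.cdf(high) - x_dist.cdf(low))


center = expected_count(-1, 1)  # interval of length 2 around the mean
tail = expected_count(4, 6)     # interval of length 2, two to three sigmas away

print(f"center: {center:.0f} points, tail: {tail:.0f} points")
```

With so few points in the tails, the model has little evidence there, which is consistent with the wider intervals seen in the plot.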

**Exercise 6**: Calculate the uncertainty metrics:
- Coverage rate (`regression_coverage_score`)
- Average width of prediction intervals (`regression_mean_width_score`)
- Did we achieve the target coverage rate of 95%?

In [None]:
print(f"Empirical coverage rate: {regression_coverage_score(y_test, y_pred_intervals)[0]:.3f}")  # correction
print(f"Average interval width: {regression_mean_width_score(y_pred_intervals)[0]:.3f}")  # correction

Congratulations, you have mastered uncertainties in regression with MAPIE!