# The ``DualBounds`` class

In [1]:
# Import packages
import sys; sys.path.insert(0, "../../../")
import numpy as np
import dualbounds as db
from dualbounds.generic import DualBounds

## General usage

The ``DualBounds`` class is the main class in the package, used to bound quantities of the form $E[f(Y(1), Y(0), X)]$. Its usage is as follows. 

**Step 1**: initialize the ``DualBounds`` class, which takes as inputs (i) the data, (ii) the function $f$ (which defines the estimand $\theta = E[f(Y(1), Y(0), X)]$), and (iii) a description of the outcome model. The user can also supply a vector of propensity scores if they are known; otherwise they are estimated from the data. Note that in code, ``f`` takes its arguments in the order ``(y0, y1, x)``.

For example, below we show how to bound $E[\mathbb{I}(Y(1) > Y(0))]$, the probability that the treatment effect is positive.

In [2]:
# Generate synthetic data from a linear model
data = db.gen_data.gen_regression_data(n=900, p=30, sample_seed=123)

# Initialize dual bounds object
dbnd = DualBounds(
    f=lambda y0, y1, x: y0 < y1, # the estimand is E[f(y0, y1, x)]
    covariates=data['X'], # n x p covariate matrix
    treatment=data['W'], # n-length treatment vector
    outcome=data['y'], # n-length outcome vector
    propensities=data['pis'], # n-length propensity scores (optional)
    outcome_model='ridge', # description of model for Y | X, W
)


**Step 2**: after initialization, the ``fit`` method fits the underlying outcome model and produces the final estimates and confidence bounds for the sharp partial identification bounds $\theta_L \le \theta \le \theta_U$. 

In [3]:
# Compute dual bounds and observe output
dbnd.fit(
    nfolds=5, # number of cross-fitting folds
    alpha=0.05, # nominal level
    verbose=True # show progress bars
)
print(dbnd.results().to_markdown())

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |     Lower |     Upper |
|:-----------|----------:|----------:|
| Estimate   | 0.6832    | 0.934563  |
| SE         | 0.0210876 | 0.0125664 |
| Conf. Int. | 0.641869  | 0.959193  |
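
The printed confidence interval appears to combine the two endpoints: the lower estimate minus a normal critical value times its SE, and the upper estimate plus the same critical value times its SE. The sketch below checks this reading against the numbers in the table above. The use of the two-sided quantile $z_{1-\alpha/2}$ is inferred from the output, not taken from the package source, so treat it as an assumption.

```python
from statistics import NormalDist

alpha = 0.05
# Values copied from the results table above
lower_est, upper_est = 0.6832, 0.934563
lower_se, upper_se = 0.0210876, 0.0125664

# Assumption: the critical value is the standard normal quantile z_{1 - alpha/2}
z = NormalDist().inv_cdf(1 - alpha / 2)

ci_lower = lower_est - z * lower_se
ci_upper = upper_est + z * upper_se
print(round(ci_lower, 6), round(ci_upper, 6))
```

Under this reading, the interval is meant to cover the whole identified set $[\theta_L, \theta_U]$ rather than a single point estimate.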


Note that there are two estimates---a lower and an upper estimate---because $\theta$ is not identified. One can also produce a more verbose output using the ``summary`` method:

In [4]:
dbnd.summary()

___________________Inference_____________________
               Lower     Upper
Estimate    0.683200  0.934563
SE          0.021088  0.012566
Conf. Int.  0.641869  0.959193

_________________Outcome model___________________
                      Model  No covariates
Out-of-sample R^2  0.931781       0.000000
RMSE               1.055246       4.040167
MAE                0.828872       3.228010

_________________Treatment model_________________
                            Model  No covariates
Out-of-sample R^2        0.001111       0.000000
Accuracy                 0.500000       0.516667
Likelihood (geom. mean)  0.500000       0.499721

______________Nonrobust plug-in bounds___________
               Lower     Upper
Estimate    0.684959  0.923693
SE          0.012243  0.007051
Conf. Int.  0.660962  0.937513

_______________Technical diagnostics_____________
                            Lower     Upper
Loss from gridsearch     0.015206 -0.000077
Max leverage             0.017024  0.01758
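
To see why $\theta$ is only partially identified, recall that each unit reveals only one of $Y(0)$ and $Y(1)$, so the data pin down the two marginal laws but not how they are coupled. The sketch below uses marginals chosen purely for illustration ($Y(0) \sim N(0,1)$, $Y(1) \sim N(1,1)$) and shows two couplings with identical marginals that give very different values of $E[\mathbb{I}(Y(1) > Y(0))]$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Marginals used for illustration only: Y(0) ~ N(0, 1) and Y(1) ~ N(1, 1)
y0 = rng.normal(loc=0.0, scale=1.0, size=n)

# Coupling A (comonotone): Y(1) = Y(0) + 1, so Y(1) ~ N(1, 1)
y1_a = y0 + 1.0
# Coupling B (countermonotone): Y(1) = 1 - Y(0), which is also N(1, 1)
y1_b = 1.0 - y0

f = lambda y0, y1: y0 < y1  # same estimand as in the example above

theta_a = f(y0, y1_a).mean()  # every unit benefits from treatment
theta_b = f(y0, y1_b).mean()  # only units with Y(0) < 0.5 benefit
print(theta_a, theta_b)
```

Both couplings are consistent with the same observed data distribution, which is why the output reports a lower and an upper estimate instead of a single number.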

Another example below bounds a different estimand, the expected positive part of the treatment effect $E[\max(Y(1) - Y(0), 0)]$, using a different underlying ML model (a k-nearest neighbors regressor).

In [5]:
dbnd = DualBounds(
    f=lambda y0, y1, x: np.maximum(y1-y0,0), # new estimand
    covariates=data['X'],
    treatment=data['W'], 
    outcome=data['y'], 
    propensities=data['pis'], 
    outcome_model='knn', 
)
dbnd.fit()
print(dbnd.results().to_markdown())

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |    Lower |    Upper |
|:-----------|---------:|---------:|
| Estimate   | 2.92715  | 4.305    |
| SE         | 0.175792 | 0.151109 |
| Conf. Int. | 2.58261  | 4.60117  |


## Choosing the outcome model

Dual bounds are built on top of an underlying model that estimates the conditional distributions of $Y(1) \mid X$ and $Y(0) \mid X$. There are three ways to specify the underlying model, listed below in order of increasing flexibility.

### Method 1: String identifiers

The easiest method is to use one of the string identifiers, such as ``'ridge', 'lasso', 'elasticnet', 'randomforest', 'knn'`` (see the API reference for a complete list):

In [6]:
dbnd = DualBounds(
    f=lambda y0, y1, x: np.maximum(y1-y0,0), # estimand
    covariates=data['X'], 
    treatment=data['W'], 
    outcome=data['y'],
    # use a random forest to predict E[Y | X]
    outcome_model='randomforest',
)

For binary data, these string identifiers assume a nonparametric model where $Y_i \sim \text{Bern}(\mu(X_i, W_i))$ and the conditional mean function $\mu$ is estimated via one of the models listed above (e.g., a random forest classifier).
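
For intuition on the binary case, the classical Fréchet-Hoeffding inequalities already bound the estimand once the two conditional means are known: with binary outcomes, $Y(1) > Y(0)$ means $Y(1) = 1$ and $Y(0) = 0$. The sketch below computes these pointwise bounds from hypothetical fitted values of $\mu(x, 0)$ and $\mu(x, 1)$ and averages them. It is a naive plug-in calculation for illustration, not the estimator the package implements.

```python
import numpy as np

def frechet_bounds(p0, p1):
    """Pointwise bounds on P(Y(1)=1, Y(0)=0 | X) given the two marginals."""
    p0, p1 = np.asarray(p0, dtype=float), np.asarray(p1, dtype=float)
    lower = np.maximum(0.0, p1 - p0)
    upper = np.minimum(p1, 1.0 - p0)
    return lower, upper

# Hypothetical fitted values of mu(x, 0) and mu(x, 1) for four units
mu0 = np.array([0.2, 0.5, 0.7, 0.9])
mu1 = np.array([0.6, 0.5, 0.4, 0.95])

lower, upper = frechet_bounds(mu0, mu1)
# Averaging over units gives plug-in bounds on P(Y(1) > Y(0))
print(lower.mean(), upper.mean())
```

The better the covariates predict the outcome (fitted means close to 0 or 1), the tighter these pointwise bounds become.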

For nonbinary data, these string identifiers use a semiparametric regression model:

$$Y_i = \mu(X_i, W_i) + \epsilon_i  $$

where the conditional mean function $\mu(\cdot, \cdot)$ is approximated using one of the models listed above (e.g., a random forest or k-nearest neighbors regressor). All methods automatically create interaction terms between the covariates and the treatment.
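
As a rough illustration of what "interaction terms between the covariates and the treatment" means, the sketch below builds a design matrix of the form $[1, X, W, W \cdot X]$ and fits a closed-form ridge regression on synthetic data. The helper names (`interaction_design`, `fit_ridge`) and the data-generating process are hypothetical and are not the package's internal code.

```python
import numpy as np

def interaction_design(X, W):
    """Stack [1, X, W, W * X] so the fitted mean can differ by treatment arm."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=float).reshape(-1, 1)
    return np.hstack([np.ones((X.shape[0], 1)), X, W, W * X])

def fit_ridge(Z, y, lam=1.0):
    """Closed-form ridge solution; for simplicity the intercept is penalized too."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ y)

# Synthetic data (hypothetical): the effect of W varies with X[:, 0]
rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.normal(size=(n, p))
W = rng.integers(0, 2, size=n)
y = X[:, 1] + W * (1.0 + 2.0 * X[:, 0]) + rng.normal(scale=0.5, size=n)

beta = fit_ridge(interaction_design(X, W), y)

# Predict both potential-outcome means for every unit
mu0 = interaction_design(X, np.zeros(n)) @ beta
mu1 = interaction_design(X, np.ones(n)) @ beta
cate = mu1 - mu0  # estimated conditional average treatment effect
print(np.corrcoef(cate, 1.0 + 2.0 * X[:, 0])[0, 1])
```

Without the $W \cdot X$ columns, the fitted treatment effect would be forced to be the same constant for every unit.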

**Default 1: Homoskedasticity.** By default, these string identifiers estimate a homoskedastic model where the variance of $\epsilon_i$ does not depend on $X_i$. However, one can also specify a model to use to estimate the heteroskedasticity pattern, as shown below:

In [7]:
dbnd = DualBounds(
    f=lambda y0, y1, x: np.maximum(y1-y0,0), # estimand
    covariates=data['X'], 
    treatment=data['W'], 
    outcome=data['y'],
    # use a random forest to predict E[Y | X]
    outcome_model='randomforest', 
    # use lasso to predict Var(Y | X)
    heterosked_model='lasso',
)

That said, we emphasize that the default (homoskedastic) approach yields valid bounds even under arbitrary heteroskedasticity patterns.
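
One common way to estimate a heteroskedasticity pattern is a two-stage fit: estimate the conditional mean, then regress the squared residuals on features of $X$. The sketch below does this with ordinary least squares on a toy one-dimensional example. It only illustrates the general idea; the features, the least-squares fits, and the clipping threshold are assumptions made here, not a description of what ``heterosked_model`` does internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-1.0, 1.0, size=n)
sigma_true = 0.5 + np.abs(x)          # noise level grows with |x|
y = 2.0 * x + sigma_true * rng.normal(size=n)

# Step 1: fit the conditional mean (here: ordinary least squares on [1, x])
Z_mean = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(Z_mean, y, rcond=None)[0]
resid = y - Z_mean @ beta

# Step 2: regress squared residuals on features to estimate Var(Y | X).
# The features |x| and x^2 are chosen by hand for this toy example.
Z_var = np.column_stack([np.ones(n), np.abs(x), x**2])
gamma = np.linalg.lstsq(Z_var, resid**2, rcond=None)[0]
var_hat = np.clip(Z_var @ gamma, 1e-6, None)  # keep variances positive
sigma_hat = np.sqrt(var_hat)

print(np.corrcoef(sigma_hat, sigma_true)[0, 1])
```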

**Default 2: Nonparametric residual estimates.** By default, these string identifiers estimate the law of $\epsilon_i$ using the empirical law of the training residuals (or, for ridge estimators, the leave-one-out residuals). This can be changed via the ``eps_dist`` parameter.

In [8]:
dbnd = DualBounds(
    f=lambda y0, y1, x: np.maximum(y1-y0,0), # estimand
    covariates=data['X'], 
    treatment=data['W'], 
    outcome=data['y'],
    propensities=data['pis'],
    # use a random forest to predict E[Y | X]
    outcome_model='randomforest',
    # assume a parametric model for the residuals
    # (the default is nonparametric)
    eps_dist='laplace', 
)
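
The difference between the two residual models can be sketched as follows. The empirical option places equal mass on each observed residual, while a parametric option such as Laplace summarizes the residuals by a fitted location and scale. The residuals and the predicted mean `mu_hat_x` below are simulated stand-ins, and the code is an illustration of the two ideas, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Stand-in for residuals y_i - mu_hat(x_i, w_i)
resid = rng.laplace(loc=0.0, scale=1.5, size=5000)

# Nonparametric option: the empirical law puts mass 1/m on each residual, so the
# predicted law of Y(w) | X = x is supported on mu_hat(x, w) + resid.
mu_hat_x = 3.0                       # hypothetical predicted mean at some x
support = mu_hat_x + resid
weights = np.full(resid.size, 1.0 / resid.size)

# Parametric option: Laplace maximum likelihood estimates
loc_hat = np.median(resid)                     # MLE of the location
scale_hat = np.mean(np.abs(resid - loc_hat))   # MLE of the scale

# Compare one upper-tail probability under the two models
t = mu_hat_x + 2.0
p_empirical = weights[support > t].sum()
p_laplace = 0.5 * np.exp(-(t - mu_hat_x - loc_hat) / scale_hat)
print(p_empirical, p_laplace)
```

When the parametric family fits the residuals well, the two answers agree closely; when it does not, the empirical law is the safer choice.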

### Method 2: A ``dist_reg.DistReg`` class

Analysts can also specify the outcome model by passing in a model which inherits from ``dualbounds.dist_reg.DistReg``, including the ``CtsDistReg``, ``QuantileDistReg``, and ``BinaryDistReg`` classes in the ``dualbounds.dist_reg`` submodule. One example is given below:

In [9]:
Y_model = db.dist_reg.CtsDistReg(
    model_type='elasticnet', 
    eps_dist='empirical',
    how_transform='interactions', # create interactions btwn X and W
    heterosked_model='lasso',
    heterosked_kwargs=dict(cv=3), # kwargs for model for Var(Y|X)
)
dbnd = DualBounds(
    outcome_model=Y_model, # use new model
    f=lambda y0, y1, x: np.maximum(y1-y0,0), # estimand
    covariates=data['X'], 
    treatment=data['W'], 
    outcome=data['y'],
    propensities=data['pis'],
)
dbnd.fit(alpha=0.05)
print(dbnd.results().to_markdown())

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |   Lower |     Upper |
|:-----------|--------:|----------:|
| Estimate   | 3.08487 | 3.2955    |
| SE         | 0.10184 | 0.0950885 |
| Conf. Int. | 2.88527 | 3.48187   |


One can also directly input ``sklearn`` or ``sklearn``-like classes. For example, below we show how to use the ``AdaBoostClassifier`` from sklearn for binary data.

In [10]:
import sklearn.ensemble as ensemble
Y_model = db.dist_reg.BinaryDistReg(
    model_type=ensemble.AdaBoostClassifier,
    algorithm='SAMME'
)
dbnd = DualBounds(
    outcome_model=Y_model, # use new model
    f=lambda y0, y1, x: y0 < y1, # estimand
    outcome=data['y'] > 0, # make the outcome binary
    # other data
    treatment=data['W'], 
    covariates=data['X'],
    propensities=data['pis'],
)
dbnd.fit(alpha=0.05)
print(dbnd.results().to_markdown())

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |     Lower |     Upper |
|:-----------|----------:|----------:|
| Estimate   | 0.294189  | 0.397205  |
| SE         | 0.0284652 | 0.0226536 |
| Conf. Int. | 0.238398  | 0.441606  |


Analysts can also create custom classes inheriting from ``dualbounds.dist_reg.DistReg``, which makes it possible to use (e.g.) custom conditional variance estimators---see the API reference for more details.

### Method 3: Input predicted conditional distributions

For maximum flexibility, one can also directly input predicted conditional distributions of $Y(1) \mid X$ and $Y(0) \mid X$, in the form of either a single batched scipy distribution or a list of batched scipy distributions whose shapes sum to the number of datapoints.

This is illustrated below, although for simplicity the inputs have nothing to do with the true distributions of $Y(1) \mid X$ and $Y(0) \mid X$. Note that in real applications, it is extremely important that the estimates of $Y(1) \mid X$ and $Y(0) \mid X$ are computed using cross-fitting; otherwise the dual bounds may not be valid.

In [11]:
from scipy import stats
n = len(data['y']) # number of data-points

# Initialize object
dbnd = DualBounds(
    outcome_model='lasso', # this will be ignored
    f=lambda y0, y1, x : y0 < y1, # estimand
    # data
    outcome=data['y'],
    treatment=data['W'], 
    covariates=data['X'],
    propensities=data['pis'],
)

# Either of the following input formats works
y0_dists = stats.norm(loc=np.zeros(n))
y1_dists = [
    stats.norm(loc=np.zeros(int(n/2)), scale=2), 
    stats.norm(loc=np.zeros(int(n/2)), scale=3)
]
# Compute dual bounds using y0_dists and y1_dists
dbnd.fit(
    y0_dists=y0_dists,
    y1_dists=y1_dists,
    suppress_warning=True,
)
print(dbnd.results().to_markdown())

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |     Lower |     Upper |
|:-----------|----------:|----------:|
| Estimate   | 0.318794  | 1.2159    |
| SE         | 0.0391768 | 0.0261528 |
| Conf. Int. | 0.242008  | 1.26716   |
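
The requirement that the shapes "sum to the number of datapoints" can be checked before fitting. The sketch below counts the observations covered by a single frozen scipy distribution or a list of them, using the fact that a frozen distribution broadcasts its parameters. The helper `total_batch_size` is hypothetical and is not part of the package.

```python
import numpy as np
from scipy import stats

def total_batch_size(dists):
    """Count the observations covered by one frozen dist or a list of them."""
    if not isinstance(dists, (list, tuple)):
        dists = [dists]
    # A frozen distribution broadcasts its parameters; mean() has that batch shape
    return sum(int(np.size(d.mean())) for d in dists)

n = 900
single = stats.norm(loc=np.zeros(n))
halves = [
    stats.norm(loc=np.zeros(n // 2), scale=2),
    stats.norm(loc=np.zeros(n - n // 2), scale=3),
]
print(total_batch_size(single), total_batch_size(halves))
```

Splitting with `n // 2` and `n - n // 2` keeps the total equal to `n` even when `n` is odd.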


This syntax is useful in simulations when one wants to compute an "oracle dual bound," which has perfect knowledge of the conditional distributions of $Y(0) \mid X$ and $Y(1) \mid X$, as illustrated below.

In [12]:
# Compute oracle dual bounds using the true conditional dists of Y0/Y1
dbnd.fit(
    y0_dists=data['y0_dists'],
    y1_dists=data['y1_dists'],
    suppress_warning=True,
)
print(dbnd.results().to_markdown())

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

|            |     Lower |     Upper |
|:-----------|----------:|----------:|
| Estimate   | 0.675722  | 0.929035  |
| SE         | 0.0208299 | 0.0126281 |
| Conf. Int. | 0.634896  | 0.953786  |


Note that the output of the oracle dual bounds is extremely similar to the output of the initial dual bounds in the third cell.
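
Outside of simulations the true conditional distributions are unknown, so any distributions passed in by hand must come from cross-fitting, as emphasized earlier. A minimal sketch of that procedure is given below: each observation's prediction comes from a model that never saw that observation. The ridge fit, the fold assignment, and the single residual scale per fold are assumptions made for illustration, not the package's internals.

```python
import numpy as np

def cross_fit_predictions(Z, y, nfolds=5, lam=1.0, seed=0):
    """Out-of-fold ridge predictions and a residual scale for each observation."""
    n, p = Z.shape
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(n) % nfolds   # balanced random fold labels
    mu_hat = np.empty(n)
    sigma_hat = np.empty(n)
    for k in range(nfolds):
        test = fold_ids == k
        train = ~test
        # Fit using only the training folds
        A = Z[train].T @ Z[train] + lam * np.eye(p)
        beta = np.linalg.solve(A, Z[train].T @ y[train])
        # Predict on the held-out fold; the scale comes from training residuals
        mu_hat[test] = Z[test] @ beta
        sigma_hat[test] = np.std(y[train] - Z[train] @ beta)
    return mu_hat, sigma_hat, fold_ids

rng = np.random.default_rng(3)
n, p = 600, 5
Z = rng.normal(size=(n, p))
y = Z @ np.arange(1.0, p + 1.0) + rng.normal(size=n)

mu_hat, sigma_hat, fold_ids = cross_fit_predictions(Z, y)
r2 = 1.0 - np.mean((y - mu_hat) ** 2) / np.var(y)
print(round(r2, 3))
```

If one is willing to assume Gaussian residuals, the out-of-fold means and scales could then be wrapped as a batched distribution such as `stats.norm(loc=mu_hat, scale=sigma_hat)`, computed separately for the $W = 0$ and $W = 1$ predictions.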

## Choosing the propensity scores

Dual bounds can also apply to observational data where the propensity scores must be estimated. In this case, analysts can specify the model used to estimate the propensity scores---the ``propensity_model``---with one of three methods. First, one can use a string identifier:

In [13]:
dbnd = DualBounds(
    propensity_model='ridge', # logistic ridge for prop. scores
    outcome_model='lasso',
    f=lambda y0, y1, x: y0 < y1, # estimand
    outcome=data['y'],
    treatment=data['W'], 
    covariates=data['X'],
)
dbnd.fit().summary()

Fitting propensity scores.


  0%|          | 0/5 [00:00<?, ?it/s]

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

___________________Inference_____________________
               Lower     Upper
Estimate    0.692290  0.931634
SE          0.020940  0.013118
Conf. Int.  0.651249  0.957345

_________________Outcome model___________________
                      Model  No covariates
Out-of-sample R^2  0.931776       0.000000
RMSE               1.055278       4.040167
MAE                0.827924       3.228010

_________________Treatment model_________________
                            Model  No covariates
Out-of-sample R^2        0.007721       0.000000
Accuracy                 0.535556       0.516667
Likelihood (geom. mean)  0.501630       0.499721

______________Nonrobust plug-in bounds___________
               Lower     Upper
Estimate    0.687313  0.926805
SE          0.012143  0.006866
Conf. Int.  0.663513  0.940263

_______________Technical diagnostics_____________
                            Lower     Upper
Loss from gridsearch     0.015727 -0.000082
Max leverage             0.017218  0.04643

Second, one can directly input an sklearn classifier.

In [14]:
dbnd = DualBounds(
    propensity_model=ensemble.AdaBoostClassifier(algorithm='SAMME'), 
    f=lambda y0, y1, x: y0 < y1, # estimand
    outcome=data['y'],
    treatment=data['W'], 
    covariates=data['X']
)
dbnd.fit().summary()

Fitting propensity scores.


  0%|          | 0/5 [00:00<?, ?it/s]

Cross-fitting the outcome model.


  0%|          | 0/5 [00:00<?, ?it/s]

Estimating optimal dual variables.


  0%|          | 0/900 [00:00<?, ?it/s]

___________________Inference_____________________
               Lower     Upper
Estimate    0.685896  0.933438
SE          0.021124  0.012638
Conf. Int.  0.644494  0.958209

_________________Outcome model___________________
                      Model  No covariates
Out-of-sample R^2  0.931781       0.000000
RMSE               1.055246       4.040167
MAE                0.828872       3.228010

_________________Treatment model_________________
                            Model  No covariates
Out-of-sample R^2       -0.014126       0.000000
Accuracy                 0.501111       0.516667
Likelihood (geom. mean)  0.496089       0.499721

______________Nonrobust plug-in bounds___________
               Lower     Upper
Estimate    0.684959  0.923693
SE          0.012243  0.007051
Conf. Int.  0.660962  0.937513

_______________Technical diagnostics_____________
                            Lower     Upper
Loss from gridsearch     0.015206 -0.000077
Max leverage             0.015009  0.02433

Lastly, analysts can also directly estimate the vector of propensity scores and input it, although in this case they should make sure that cross-fitting is employed correctly to ensure validity.
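
A minimal sketch of cross-fitted propensity estimates is given below, using a small L2-penalized logistic regression written in numpy. The helper names, the Newton solver, and the clipping threshold are all assumptions made for illustration. The resulting vector is the kind of input that could be passed through the ``propensities`` argument shown in the first example.

```python
import numpy as np

def fit_logistic(X, w, lam=1.0, n_iter=25):
    """L2-penalized logistic regression fitted by Newton's method.

    For simplicity the intercept is penalized along with the slopes.
    """
    Z = np.column_stack([np.ones(X.shape[0]), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (p - w) + lam * beta
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z + lam * np.eye(Z.shape[1])
        beta -= np.linalg.solve(hess, grad)
    return beta

def cross_fit_propensities(X, w, nfolds=5, seed=0, clip=0.01):
    """Out-of-fold estimates of P(W = 1 | X), clipped away from 0 and 1."""
    n = X.shape[0]
    fold_ids = np.random.default_rng(seed).permutation(n) % nfolds
    pis = np.empty(n)
    for k in range(nfolds):
        test = fold_ids == k
        beta = fit_logistic(X[~test], w[~test])  # fit without the held-out fold
        Z_test = np.column_stack([np.ones(test.sum()), X[test]])
        pis[test] = 1.0 / (1.0 + np.exp(-Z_test @ beta))
    return np.clip(pis, clip, 1.0 - clip)

# Synthetic observational data (hypothetical): treatment depends on X[:, 0]
rng = np.random.default_rng(4)
n = 1000
X = rng.normal(size=(n, 3))
true_pis = 1.0 / (1.0 + np.exp(-X[:, 0]))
w = rng.binomial(1, true_pis)

pis_hat = cross_fit_propensities(X, w)
print(np.corrcoef(pis_hat, true_pis)[0, 1])
```

Clipping keeps the estimated scores away from 0 and 1, which avoids extreme inverse-propensity weights in downstream calculations.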