<a href="https://colab.research.google.com/github/VectorInstitute/Causal_Inference_Laboratory/blob/main/notebooks/estimation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparation

## Upload Code

Run this code to clone the repository and prepare it. 


In [3]:
!git clone https://github.com/VectorInstitute/Causal_Inference_Laboratory.git
!mv Causal_Inference_Laboratory code
!mv code/data ./data
!mv code/utils ./utils
!mv code/models ./models
!mv code/estimation_results ./estimation_results

Cloning into 'Causal_Inference_Laboratory'...
remote: Enumerating objects: 120, done.
remote: Counting objects: 100% (120/120), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 120 (delta 17), reused 109 (delta 12), pack-reused 0
Receiving objects: 100% (120/120), 14.93 MiB | 22.59 MiB/s, done.
Resolving deltas: 100% (17/17), done.


In [4]:
!pip install xgboost==1.3.3

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting xgboost==1.3.3
  Downloading xgboost-1.3.3-py3-none-manylinux2010_x86_64.whl (157.5 MB)
     157.5/157.5 MB 7.1 MB/s eta 0:00:00
Installing collected packages: xgboost
  Attempting uninstall: xgboost
    Found existing installation: xgboost 1.7.5
    Uninstalling xgboost-1.7.5:
      Successfully uninstalled xgboost-1.7.5
Successfully installed xgboost-1.3.3


## Imports

In [None]:
import os
import tensorflow as tf
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split

import utils.estimators as models
import utils.preprocessing as helper
from utils.preprocessing import sys_config
import utils.metrics as metrics
from utils.evaluation import *

import warnings
warnings.filterwarnings('ignore')

In [None]:
datasets_folder = sys_config["datasets_folder"]
results_folder = sys_config["results_folder"]

seed = 0
np.random.seed(seed)

# Description of datasets
We briefly discuss the datasets here.

## IHDP
The Infant Health and Development Program (IHDP) dataset [1] comes from a
randomized experiment studying the effect of home visits by specialists on the
future cognitive test scores of children. The children of non-white mothers in
the treated group are removed to de-randomize the experiment. Both a treated
and a control outcome are simulated for each unit, so the ground-truth
individual treatment effects are known.



Each realization of the IHDP datasets is already split 90/10 into a training
set (672 units) and a test set (75 units). Each `.npz` file contains the keys
`x`, `t`, `yf`, `ycf`, `mu0`, and `mu1`, which are respectively the covariates,
treatment, factual outcome, counterfactual outcome, noiseless potential control
outcome, and noiseless potential treated outcome.
- IHDP-100: 100 realizations of the IHDP dataset (included in our repo);
- IHDP-1000: 1000 realizations of the IHDP dataset
(downloadable from https://www.fredjo.com/).
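
The key layout described above can be illustrated with a small synthetic file. This is a minimal sketch, not the repository's loader: the array shapes (units, covariates, realizations), the constant treatment effect of 4.0, and the in-memory buffer are all assumptions made for illustration, with realizations placed on the last axis as in the estimation code of this notebook.

```python
import io

import numpy as np

rng = np.random.default_rng(0)
# Illustrative sizes only (assumed): 672 training units, 25 covariates, 3 realizations.
n_units, n_covariates, n_realizations = 672, 25, 3

# Build a synthetic stand-in with the same keys as an IHDP `.npz` file.
x = rng.normal(size=(n_units, n_covariates, n_realizations))
t = rng.integers(0, 2, size=(n_units, n_realizations))
mu0 = rng.normal(size=(n_units, n_realizations))  # noiseless control outcome
mu1 = mu0 + 4.0                                   # noiseless treated outcome
yf = np.where(t == 1, mu1, mu0) + rng.normal(size=t.shape)   # factual outcome
ycf = np.where(t == 1, mu0, mu1) + rng.normal(size=t.shape)  # counterfactual outcome

buffer = io.BytesIO()  # stands in for a file on disk
np.savez(buffer, x=x, t=t, yf=yf, ycf=ycf, mu0=mu0, mu1=mu1)
buffer.seek(0)

data = np.load(buffer)
keys = sorted(data.files)

# Select the first realization and compute ground-truth effects.
x_first = data["x"][:, :, 0]
true_ite = data["mu1"] - data["mu0"]   # individual treatment effects
true_ate = true_ite.mean(axis=0)       # one ATE per realization
print(keys, x_first.shape, true_ate)
```

Because `mu0` and `mu1` are noiseless, their difference gives the ground-truth individual treatment effect that metrics such as PEHE compare against.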

## Jobs
Jobs is a dataset derived from LaLonde [2], where the original dataset has job
training as the treatment and income and employment status after training as
outcomes. The Jobs dataset was proposed in [3] and combines the LaLonde
experimental sample (297 treated, 425 control) with the PSID comparison group
(2490 control).




Each realization of the Jobs dataset is already split 80/20 into a training set
(2570 units) and a test set (642 units). Each `.npz` file contains the keys
`x`, `t`, `yf`, and `ate`, which are respectively the covariates, treatment,
factual outcome, and average treatment effect (scalar).
- Jobs: 10 realizations of the Jobs dataset (included in our repo).

## Twins

The TWINS dataset [4] is built from data on twin births in the USA between 1989 and 1991 [5] and concerns the effect of the relative weight of each twin on their mortality. The treatment is whether a twin is born heavier than the other twin (T = 1 means heavier), and the outcome is the first-year mortality of the twin. It has 23968 units (11984 treated, 11984 control) and 46 covariates relating to the parents, the pregnancy, and the birth.

In [None]:
dataset_name = "IHDP-100" #@param ["IHDP-100", "Jobs", "TWINS"]
if dataset_name == "Jobs":
  x_all, t_all, yf_all = helper.load_Jobs_observational(
      datasets_folder, dataset_name, details=False
  )
  x_test_all = helper.load_Jobs_out_of_sample(
      datasets_folder, dataset_name, details=False
  )
elif dataset_name == "TWINS":
  x_all, t_all, yf_all = helper.load_TWINS_observational(
      datasets_folder, dataset_name, details=False
  )
  x_test_all, t_test_all = helper.load_TWINS_out_of_sample(
      datasets_folder, dataset_name, details=False
  )
elif dataset_name == "IHDP-100":
  x_all, t_all, yf_all = helper.load_IHDP_observational(
      datasets_folder, dataset_name, details=False
  )
  x_test_all = helper.load_IHDP_out_of_sample(
      datasets_folder, dataset_name, details=False
  )

## Estimation

### DML

The double machine learning (DML) method works as follows. We first partition the data set into two subsets $I$ and $I^{c}$. Using only $I^{c}$, we train any model $M_{t}$ to estimate $T$ from $X$ and any model $M_{y}$ to estimate $Y$ from $X$. On $I$, we then calculate the residuals $Y_{R} = Y - M_{y}(X)$ and $T_{R} = T - M_{t}(X)$. Finally, we regress $Y_{R}$ on $T_{R}$; the fitted coefficient is the estimated ATE.
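
The steps above can be sketched with NumPy only. This is a minimal illustration under stated assumptions, not the repository's implementation: the nuisance models $M_{t}$ and $M_{y}$ are plain least-squares fits, the helper names `fit_linear`, `predict_linear`, and `dml_ate` are hypothetical, and the data are simulated with a constant treatment effect of 2.0.

```python
import numpy as np


def fit_linear(x, target):
    """Least-squares fit with an intercept; stands in for 'any model'."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, x):
    return np.column_stack([np.ones(len(x)), x]) @ coef


def dml_ate(x, t, y, rng):
    """Single-split DML: fit nuisance models on I^c, residualize on I."""
    idx = rng.permutation(len(x))
    fold_i, fold_ic = idx[: len(x) // 2], idx[len(x) // 2:]
    m_t = fit_linear(x[fold_ic], t[fold_ic])  # M_t: T from X
    m_y = fit_linear(x[fold_ic], y[fold_ic])  # M_y: Y from X
    t_res = t[fold_i] - predict_linear(m_t, x[fold_i])  # T_R
    y_res = y[fold_i] - predict_linear(m_y, x[fold_i])  # Y_R
    # Regress Y_R on T_R (no intercept): slope = <T_R, Y_R> / <T_R, T_R>.
    return float(t_res @ y_res / (t_res @ t_res))


# Simulated data (assumed for illustration): confounded treatment, true ATE = 2.
rng = np.random.default_rng(0)
n = 4000
x = rng.normal(size=(n, 3))
propensity = 1.0 / (1.0 + np.exp(-x[:, 0]))
t = rng.binomial(1, propensity).astype(float)
y = 2.0 * t + x @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

ate_hat = dml_ate(x, t, y, rng)
print(f"DML estimate of the ATE: {ate_hat:.3f}")
```

In practice the roles of $I$ and $I^{c}$ are often swapped and the two estimates averaged (cross-fitting), so that no data are wasted.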

### IPW

Define the propensity score as the probability that a unit with covariates $x$ receives the treatment $T=1$, i.e., $\pi(x) = \mathbb{P}(T=1 \mid X=x)$. For unit $i$ with covariates $x_{i}$ and treatment $T_{i}$, we define the IPW outcomes as:
\begin{align*}
    Y_{i}(1)^{\text{IPW}} = Y_{i}(1) \frac{\mathbb{1}(T_{i}=1)}{\pi(x_{i})},\quad Y_{i}(0)^{\text{IPW}} = Y_{i}(0) \frac{\mathbb{1}(T_{i}=0)}{1-\pi(x_{i})}
\end{align*}
If the true propensity score is known, it can be easily verified that $\mathbb{E}[Y_{i}(1)^{\text{IPW}}] = Y_{i}(1)$ and $\mathbb{E}[Y_{i}(0)^{\text{IPW}}] = Y_{i}(0)$, where the expectation is taken over the treatment assignment $T_{i}$ given $X_{i} = x_{i}$, because $\mathbb{E}[\mathbb{1}(T_{i}=1) \mid X_{i}=x_{i}] = \pi(x_{i})$. In other words, the IPW outcomes are unbiased estimators of each individual's potential outcomes.

The inverse propensity weighted (IPW) estimator models the propensity score instead of modeling the potential outcomes. However, since the propensity score appears in the denominator, this approach tends to have very high variance, especially when propensity scores are close to 0 or 1.
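
The IPW outcomes can be averaged to estimate the ATE, as the following sketch shows. It is an illustration under assumptions, not the repository's `IPW` estimator: the data are simulated, the true propensity score is used instead of a fitted model, and the helper name `ipw_ate` is hypothetical. Only factual outcomes are needed, because $Y_{i}(1)\mathbb{1}(T_{i}=1) = Y_{i}\mathbb{1}(T_{i}=1)$.

```python
import numpy as np


def ipw_ate(y, t, propensity):
    """Average the IPW outcomes; only factual outcomes y are required."""
    y1_ipw = y * (t == 1) / propensity          # estimates Y(1)
    y0_ipw = y * (t == 0) / (1.0 - propensity)  # estimates Y(0)
    return float(np.mean(y1_ipw) - np.mean(y0_ipw))


# Simulated data (assumed for illustration): x confounds t and y, true ATE = 2.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-x))  # true pi(x), known by construction
t = rng.binomial(1, propensity)
y = 1.0 + 2.0 * t + x + rng.normal(size=n)

ate_ipw = ipw_ate(y, t, propensity)
ate_naive = float(y[t == 1].mean() - y[t == 0].mean())  # ignores confounding
print(f"IPW: {ate_ipw:.3f}, naive difference in means: {ate_naive:.3f}")
```

The naive difference in means is biased upward here because units with larger `x` are both more likely to be treated and have larger outcomes; the weighting removes that bias at the cost of extra variance.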

### OLS1

TBC.

### OLS2

TBC.

### NN1

TBC.

### NN2

TBC.

### RF1

TBC.

### RF2

TBC.

### Dragonnet

TBC.

### TARNet

TARNet [3] was proposed to overcome two problems. In conditional outcome modeling (COM), where a single model takes both $X$ and $T$ as input, the model may ignore $T$ when $X$ is high dimensional. In grouped COM (GCOM), where a separate model is fit for each treatment group, data are used inefficiently because the units are split into groups. TARNet uses all units to learn a common embedding and has two heads, one for each treatment.
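
The shared-embedding, two-head structure can be sketched as a forward pass in NumPy. This is only an illustration of the architecture under assumptions: the layer sizes, the single ReLU layer, the linear heads, and the names `init_tarnet`, `tarnet_forward`, and `factual_loss` are hypothetical and do not reflect the repository's TensorFlow model.

```python
import numpy as np


def init_tarnet(d_in, d_rep, rng):
    """Random weights for one shared layer and two linear outcome heads."""
    return {
        "W_rep": rng.normal(scale=0.1, size=(d_in, d_rep)),
        "b_rep": np.zeros(d_rep),
        "w_h0": rng.normal(scale=0.1, size=d_rep),  # control head
        "w_h1": rng.normal(scale=0.1, size=d_rep),  # treated head
    }


def tarnet_forward(params, x):
    """All units pass through the common embedding; each head predicts one outcome."""
    phi = np.maximum(0.0, x @ params["W_rep"] + params["b_rep"])  # ReLU embedding
    y0_hat = phi @ params["w_h0"]
    y1_hat = phi @ params["w_h1"]
    return y0_hat, y1_hat


def factual_loss(params, x, t, yf):
    """Squared error on the head matching each unit's observed treatment."""
    y0_hat, y1_hat = tarnet_forward(params, x)
    y_pred = np.where(t == 1, y1_hat, y0_hat)
    return float(np.mean((y_pred - yf) ** 2))


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5))
t = rng.integers(0, 2, size=8)
yf = rng.normal(size=8)

params = init_tarnet(d_in=5, d_rep=4, rng=rng)
y0_hat, y1_hat = tarnet_forward(params, x)
loss = factual_loss(params, x, t, yf)
ite_hat = y1_hat - y0_hat  # estimated individual treatment effects
```

Each unit contributes a gradient to the shared embedding but only to the head of its own treatment group, which is how the design avoids both problems described above.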

Run and save estimation results using a specified estimator.

In [None]:
def estimate(estimator_name):
    num_realizations = x_all.shape[-1]
    print("Number of realizations:", num_realizations)
    y0_in_all, y1_in_all, y0_out_all, y1_out_all = [], [], [], []
    ate_in_all, ate_out_all = [], []
    for i in range(num_realizations):
        text = f" Estimation of realization {i} via {estimator_name}"
        print(f"{text:-^79}")
        x, t, yf = x_all[:, :, i], t_all[:, i], yf_all[:, i]
        x_test = x_test_all[:, :, i]
        # train the estimator and predict for this realization
        (
            y0_in,
            y1_in,
            ate_in,
            y0_out,
            y1_out,
            ate_out,
        ) = models.train_and_evaluate(x, t, yf, x_test, estimator_name)
        y0_in_all.append(y0_in)
        y1_in_all.append(y1_in)
        ate_in_all.append(ate_in)
        y0_out_all.append(y0_out)
        y1_out_all.append(y1_out)
        ate_out_all.append(ate_out)
    # follow the dimension order of the dataset,
    # i.e., realizations are captured by the last index
    y0_in_all = np.squeeze(np.array(y0_in_all).transpose()).reshape((-1, num_realizations))
    y1_in_all = np.squeeze(np.array(y1_in_all).transpose()).reshape((-1, num_realizations))
    y0_out_all = np.squeeze(np.array(y0_out_all).transpose()).reshape((-1, num_realizations))
    y1_out_all = np.squeeze(np.array(y1_out_all).transpose()).reshape((-1, num_realizations))
    ate_in_all = np.array(ate_in_all).reshape((num_realizations,))
    ate_out_all = np.array(ate_out_all).reshape((num_realizations,))

    # save estimation results
    estimation_result_folder = os.path.join(
        results_folder, dataset_name, estimator_name
    )
    print(f"Saving {estimation_result_folder}.")
    helper.save_in_and_out_results(
        estimation_result_folder,
        y0_in_all,
        y1_in_all,
        ate_in_all,
        y0_out_all,
        y1_out_all,
        ate_out_all,
    )

In [None]:
estimator = "Dragonnet" #@param ["IPW", "OLS1", "OLS2", "NN1", "NN2", "RF1", "RF2", "Dragonnet", "TARNet"]


estimate(estimator)

## Evaluation

TODO: description
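
Two of the metrics selectable below, `PEHE` and `MAE`, rely on ground-truth effects. The sketch that follows assumes their standard definitions (root mean squared error of the estimated individual treatment effects, and absolute error of the estimated ATE); the repository's `metrics.calculate_metrics` may differ in detail, and the function names `pehe` and `ate_abs_error` are hypothetical.

```python
import numpy as np


def pehe(y0_hat, y1_hat, mu0, mu1):
    """Root mean squared error between estimated and true ITEs."""
    ite_hat = y1_hat - y0_hat
    ite_true = mu1 - mu0
    return float(np.sqrt(np.mean((ite_hat - ite_true) ** 2)))


def ate_abs_error(ate_hat, ate_true):
    """Absolute error of the estimated average treatment effect."""
    return float(np.abs(ate_hat - ate_true))


# Toy values (assumed for illustration).
mu0 = np.zeros(4)
mu1 = np.array([1.0, 2.0, 3.0, 4.0])
y0_hat = np.zeros(4)
y1_hat = np.array([1.0, 2.0, 3.0, 6.0])  # only the last unit is off, by 2

pehe_value = pehe(y0_hat, y1_hat, mu0, mu1)
mae_value = ate_abs_error(np.mean(y1_hat - y0_hat), np.mean(mu1 - mu0))
print(pehe_value, mae_value)  # 1.0 0.5
```

The remaining metrics do not need ground truth; they score the estimated effects against nuisance models fitted on the evaluation split.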

In [None]:
def evaluate(estimator_name, metrics_set):
    print(f'{" Evaluation ":-^79}')
    results_in = {}
    results_out = {}
    if dataset_name == "Jobs":
        ate_in_gt, ate_out_gt = helper.load_Jobs_ground_truth(
            datasets_folder, dataset_name, details=False
        )
        mu0_in, mu1_in, mu0_out, mu1_out = None, None, None, None
        x_all, t_all, yf_all = helper.load_Jobs_observational(
            datasets_folder, dataset_name, details=False
        )
        x_test_all = helper.load_Jobs_out_of_sample(
            datasets_folder, dataset_name, details=False
        )
    elif "IHDP" in dataset_name:
        mu0_in, mu1_in, mu0_out, mu1_out = helper.load_IHDP_ground_truth(
            datasets_folder, dataset_name, details=False
        )
        ate_in_gt = np.mean(mu1_in - mu0_in)
        ate_out_gt = np.mean(mu1_out - mu0_out)
        x_all, t_all, yf_all = helper.load_IHDP_observational(
            datasets_folder, dataset_name, details=False
        )
        x_test_all = helper.load_IHDP_out_of_sample(
            datasets_folder, dataset_name, details=False
        )
    elif dataset_name == "TWINS":
        mu0_in, mu1_in, mu0_out, mu1_out = helper.load_TWINS_ground_truth(
            datasets_folder, dataset_name, details=False
        )
        ate_in_gt = np.mean(mu1_in - mu0_in)
        ate_out_gt = np.mean(mu1_out - mu0_out)
        x_all, t_all, yf_all = helper.load_TWINS_observational(
            datasets_folder, dataset_name, details=False
        )
        x_test_all, t_test_all = helper.load_TWINS_out_of_sample(
            datasets_folder, dataset_name, details=False
        )

    indices_all = np.arange(x_all.shape[0])

    x_train, x_eval, t_train, t_eval, yf_train, yf_eval, indices_train, indices_eval = train_test_split(
        x_all, t_all, yf_all, indices_all, test_size=0.2, random_state=seed
        )
    
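    data_size = x_eval.shape[0]
    num_realizations = 1
    num_covariates = x_eval.shape[1]
    if x_eval.ndim == 3:
        num_realizations = x_eval.shape[2]
        data_size = x_eval.shape[0] * num_realizations
        # move the realization axis next to the unit axis so that, after
        # reshaping, each row holds the covariates of one (unit, realization)
        # pair, in the same order as the flattened t_eval and yf_eval
        x_eval = np.transpose(x_eval, (0, 2, 1))

    # flatten all eval data over units and realizations
    x_eval = np.reshape(x_eval, (data_size, num_covariates))
    t_eval = np.reshape(t_eval, (data_size))
    yf_eval = np.reshape(yf_eval, (data_size))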

    # Compute the relevant evaluation metrics for the ensemble
    nuisance_stats_dir = results_folder + '//..//models//' + dataset_name + '//'
    # Nuisance Models
    prop_prob, prop_score = get_nuisance_propensity_pred(x_eval, t_eval, save_dir=nuisance_stats_dir)
    outcome_s_pred = get_nuisance_outome_s_pred(x_eval, t_eval, save_dir=nuisance_stats_dir)
    outcome_t_pred = get_nuisance_outcome_t_pred(x_eval, t_eval, save_dir=nuisance_stats_dir)
    outcome_r_pred = get_nuisance_outcome_r_pred(x_eval, save_dir=nuisance_stats_dir)

    estimation_result_folder = os.path.join(
        results_folder, dataset_name, estimator_name
    )
    (
        y0_in,
        y1_in,
        ate_in,
        y0_out,
        y1_out,
        ate_out,
    ) = helper.load_in_and_out_results(estimation_result_folder)
    
    if dataset_name == "TWINS":
        y0_in = y0_in.reshape((-1, 1))
        y1_in = y1_in.reshape((-1, 1))
        y0_out = y0_out.reshape((-1, 1))
        y1_out = y1_out.reshape((-1, 1))
        ate_in = ate_in.reshape((-1, 1))
        ate_out = ate_out.reshape((-1, 1))
            
    results_in[estimator_name] = {}
    results_out[estimator_name] = {}

    for metric in metrics_set:
        metric_in = None
        ite_estimate_eval = (y1_in[indices_eval] - y0_in[indices_eval])
        if metric in ["MAE", "PEHE"]:
            metric_in = metrics.calculate_metrics(
                y0_in, y1_in, ate_in, mu0_in, mu1_in, ate_in_gt, metric=metric
            )
        elif metric == "value_score":
            metric_in = metrics.calculate_value_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, dataset_name=dataset_name, prop_score=prop_score
            )
        elif metric == "value_dr_score":
            metric_in = metrics.calculate_value_dr_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, dataset_name=dataset_name, prop_score=prop_score
            )
        elif metric == "value_dr_clip_prop_score":
            metric_in = metrics.calculate_value_dr_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, dataset_name=dataset_name, prop_score=prop_score, min_propensity=0.1
            )
        elif metric == "tau_match_score":
            metric_in = metrics.calculate_tau_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval
            )
        elif metric == "tau_iptw_score":
            metric_in = metrics.calculate_tau_iptw_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, prop_score=prop_score
            )
        elif metric == "tau_iptw_clip_prop_score":
            metric_in = metrics.calculate_tau_iptw_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, prop_score=prop_score, min_propensity=0.1
            )
        elif metric == "tau_dr_score":
            metric_in = metrics.calculate_tau_dr_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, prop_score=prop_score
            )
        elif metric == "tau_dr_clip_prop_score":
            metric_in = metrics.calculate_tau_dr_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, prop_score=prop_score, min_propensity=0.1
            )
        elif metric == "tau_s_score":
            metric_in = metrics.calculate_tau_s_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_s_pred
            )
        elif metric == "tau_t_score":
            metric_in = metrics.calculate_tau_t_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred
            )
        elif metric == "influence_score":
            metric_in = metrics.calculate_influence_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, prop_prob=prop_prob
            )
        elif metric == "influence_clip_prop_score":
            metric_in = metrics.calculate_influence_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_t_pred, prop_prob=prop_prob, min_propensity=0.1
            )
        elif metric == "r_score":
            metric_in = metrics.calculate_r_risk(
                ite_estimate_eval, x_eval, t_eval, yf_eval, outcome_pred=outcome_r_pred, treatment_prob=prop_prob[:, 1]
            )

        if metric_in is None:
            results_in[estimator_name][metric] = {"mean": None}
        else:
            results_in[estimator_name][metric] = {
                "mean": np.mean(metric_in, where=(metric_in != 0)),
            }

    print(f'{" In-sample results ":-^79}')
    for metric in metrics_set:
        print(metric, estimator_name, results_in[estimator_name][metric])

In [None]:
metric = "value_dr_score" #@param [ "MAE", "PEHE", "value_score", "value_dr_score", "value_dr_clip_prop_score", "tau_t_score", "tau_s_score", "tau_match_score", "tau_iptw_score", "tau_iptw_clip_prop_score", "tau_dr_score", "tau_dr_clip_prop_score", "influence_score", "influence_clip_prop_score", "r_score", "ALL"]
if metric == "ALL":
  metric_set = [ "MAE", "PEHE", "value_score", "value_dr_score", "value_dr_clip_prop_score", "tau_t_score", "tau_s_score", "tau_match_score", "tau_iptw_score", "tau_iptw_clip_prop_score", "tau_dr_score", "tau_dr_clip_prop_score", "influence_score", "influence_clip_prop_score", "r_score"]
else:
  metric_set = [metric]

evaluate(estimator, metric_set)

--------------------------------- Evaluation ----------------------------------
------------------------------ In-sample results ------------------------------
value_dr_score Dragonnet {'mean': nan}


# References
[1] J. L. Hill, “Bayesian nonparametric modeling for causal inference,” Journal
of Computational and Graphical Statistics, vol. 20, no. 1, pp. 217–240, 2011.
[Online]. Available: https://doi.org/10.1198/jcgs.2010.08162

[2] R. J. LaLonde, “Evaluating the econometric evaluations of training programs
with experimental data,” The American Economic Review, vol. 76, no. 4, pp.
604–620, 1986. [Online]. Available: http://www.jstor.org/stable/1806062

[3] U. Shalit, F. D. Johansson, and D. Sontag, “Estimating individual treatment
effect: generalization bounds and algorithms,” in Proceedings of the 34th
International Conference on Machine Learning, ser. Proceedings of Machine
Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. PMLR, 06–11 Aug
2017, pp. 3076–3085. [Online].
Available: https://proceedings.mlr.press/v70/shalit17a.html

[4] C. Louizos, U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling,
“Causal effect inference with deep latent-variable models,” in Proceedings of
the 31st International Conference on Neural Information Processing Systems
(NIPS'17), 2017, pp. 6449–6459.

[5] D. Almond, K. Y. Chay, and D. S. Lee, “The costs of low birth weight,” The
Quarterly Journal of Economics, vol. 120, no. 3, pp. 1031–1083, 2005.
