A novel class of Neyman-orthogonal learners for causal quantities defined at the representation level
The project is built with the following Python libraries:
- PyTorch
- Hydra - simplified command-line argument management
- MlFlow - experiment tracking
- normflows - a PyTorch package for normalizing flows
First, create a virtual environment and install the requirements:

```
pip3 install virtualenv
python3 -m virtualenv -p python3 --always-copy venv
source venv/bin/activate
pip3 install -r requirements.txt
```

To start an experiments server, run:
```
mlflow server --port=5000 --gunicorn-opts "--timeout 280"
```
To access the MlFlow web UI with all the experiments, connect via ssh:

```
ssh -N -f -L localhost:5000:localhost:5000 <username>@<server-link>
```

Then open http://localhost:5000 in a local browser.
Before running semi-synthetic experiments, place the datasets in the corresponding folders:
- IHDP100 dataset: ihdp_npci_1-100.train.npz and ihdp_npci_1-100.test.npz to data/ihdp100/
- ACIC 2016 dataset: to data/acic2016/

```
data/acic2016
├── synth_outcomes
│   ├── zymu_<id0>.csv
│   ├── ...
│   └── zymu_<id14>.csv
├── ids.csv
└── x.csv
```
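A small sanity-check sketch that assembles the file paths implied by the layout above, so one can verify the data folder before launching experiments. This helper is hypothetical (not part of the repo), and the subset ids are passed in as parameters since the actual `<id0>`…`<id14>` values are dataset-specific:

```python
from pathlib import Path

def expected_acic2016_files(root="data/acic2016", subset_ids=("id0",)):
    """Hypothetical helper: list the files the README layout implies
    under data/acic2016/. `subset_ids` are the zymu_<id>.csv ids."""
    root = Path(root)
    files = [root / "ids.csv", root / "x.csv"]
    files += [root / "synth_outcomes" / f"zymu_{i}.csv" for i in subset_ids]
    return files

def missing_files(root="data/acic2016", subset_ids=("id0",)):
    """Return the expected files that are not present on disk."""
    return [p for p in expected_acic2016_files(root, subset_ids)
            if not p.exists()]
```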
The main training script is universal across methods and datasets. For details on the mandatory arguments, see the main configuration file config/config.yaml and the other files in the config/ folder.
A generic run with logging and a fixed random seed looks as follows:

```
PYTHONPATH=. python3 runnables/train.py +dataset=<dataset> +repr_net=<model> exp.seed=10
```

One needs to specify a dataset / dataset generator (and possibly additional parameters, e.g. the train size for the synthetic data, dataset.n_samples_train=1000, or a subset index for the ACIC 2016 data, dataset.dataset_ix=0):
- Synthetic data (adapted from https://arxiv.org/abs/1810.02894): +dataset=synthetic
- IHDP dataset: +dataset=ihdp100
- ACIC 2016 dataset: +dataset=acic2016
One needs to choose a model and then fill in the specific hyperparameters (they are left blank in the configs):
- TARNet/TARFlow: +repr_net=tarnet +repr_net_type=dense / +repr_net=tarnet +repr_net_type=res_flow
- BNN/BNNFlow: +repr_net=bnnet +repr_net_type=dense / +repr_net=bnnet +repr_net_type=res_flow
- CFR/CFRFlow: +repr_net=cfrnet +repr_net_type=dense / +repr_net=cfrnet +repr_net_type=res_flow
- RCFR/RCFRFlow: +repr_net=rcfrnet +repr_net_type=dense / +repr_net=rcfrnet +repr_net_type=res_flow
- CFR-ISW/CFRNet-ISW: +repr_net=cfrisw +repr_net_type=dense / +repr_net=cfrisw +repr_net_type=res_flow
- BWCFR/BWCFRFlow: +repr_net=bwcfr +repr_net_type=dense / +repr_net=bwcfr +repr_net_type=res_flow
Models already have the best hyperparameters saved for each model-dataset pair and for different sizes of the representation. One can access them via +repr_net/<dataset>_hparams/<model>=<n_samples_train> or +model/<dataset>_hparams/<model>/<n_samples_train>=<ipm_params>, etc. To perform manual hyperparameter tuning, use the flag repr_net.tune_hparams=True and see repr_net.hparams_grid.
Stage 1 models are propensity networks (src/models/prop_nets.py) and outcome networks (src/models/mu_nets.py). Their hyperparameters were tuned together with the stage 0 models and are stored in the same YAML files. To perform manual hyperparameter tuning, use the flags prop_net_cov.tune_hparams=True and mu_net_cov.tune_hparams=True.
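As a rough illustration of what a stage-1 propensity model estimates, here is a minimal plain-NumPy logistic-regression stand-in. The actual models in src/models/prop_nets.py are PyTorch networks with their own interface, so everything below is a hypothetical sketch of the estimand P(A=1 | X) only:

```python
import numpy as np

def fit_propensity(X, a, lr=0.1, n_iter=500):
    """Minimal stand-in for a stage-1 propensity model: logistic
    regression fitted by batch gradient descent on the log-loss.
    Returns a function mapping covariates to P(A=1 | X)."""
    Xb = np.c_[np.ones(len(X)), X]          # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))   # sigmoid
        w -= lr * Xb.T @ (p - a) / len(a)   # gradient of the log-loss
    return lambda Xn: 1.0 / (1.0 + np.exp(-np.c_[np.ones(len(Xn)), Xn] @ w))
```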
Stage 2 models are defined in config/config.yaml and src/models/target_net.py. One needs to specify the list of second-stage models to fit for exp.targets:
- CAPOs estimation with the $\text{DR}^{\text{K}}_a$-learner: exp.targets="['mu0', 'mu1']"
- CAPOs estimation with the $\text{DR}^{\text{FS}}_a$-learner: exp.targets="['y0', 'y1']"
- CATE estimation with $\text{DR}^{\text{K}}$: exp.targets="['cate']"
- CATE estimation with $\text{R}$: exp.targets="['rcate']"
- CATE estimation with $\text{IVW}$: exp.targets="['ivw_pi_cate']"
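The DR-style second-stage learners regress pseudo-outcomes built from the stage-1 nuisance estimates. As an illustrative sketch only, here is the textbook doubly-robust (AIPW) pseudo-outcome for the CATE; the repo's $\text{DR}^{\text{K}}$/$\text{DR}^{\text{FS}}$ learners act at the representation level, so this is not their exact formula:

```python
import numpy as np

def dr_pseudo_outcome(y, a, pi, mu0, mu1):
    """Textbook doubly-robust (AIPW) pseudo-outcome for the CATE.
    Its conditional expectation equals the CATE if either the
    propensity pi or the outcome models (mu0, mu1) are correct.
    Illustrative only; not the repo's representation-level variant."""
    mu_a = np.where(a == 1, mu1, mu0)       # fitted outcome under the observed arm
    w = (a - pi) / (pi * (1.0 - pi))        # equals 1/pi if a=1, -1/(1-pi) if a=0
    return w * (y - mu_a) + mu1 - mu0       # residual correction + plug-in CATE
```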
Example of running TARFlow without tuning based on synthetic data with n_train = 500:

```
CUDA_VISIBLE_DEVICES=<devices> PYTHONPATH=. python3 runnables/train.py -m +dataset=synthetic +repr_net=tarnet +repr_net/synthetic_hparams/tarnet_res_flow=\'500\' exp.logging=True exp.device=cuda exp.seed=10 exp.targets="[]"
```

Example of running BWCFRFlow without tuning based on synthetic data with n_train = 500 and a sweep over repr_net.alpha:

```
CUDA_VISIBLE_DEVICES=<devices> PYTHONPATH=. python3 runnables/train.py -m +dataset=synthetic +repr_net=bwcfr +repr_net/synthetic_hparams/bwcfr_res_flow=\'500\' exp.logging=True repr_net.ipm=mmd repr_net.alpha=0.01,0.02,0.05 exp.device=cuda exp.seed=10 exp.targets="['mu0', 'mu1', 'y0', 'y1', 'cate', 'rcate']"
```

Example of all-stages tuning of CFR based on the 0-th subset of the IHDP100 dataset with IPM = WM and repr_net.alpha=0.1:

```
CUDA_VISIBLE_DEVICES=<devices> PYTHONPATH=. python3 runnables/train.py -m +dataset=ihdp100 +repr_net=cfrnet exp.logging=True exp.device=cuda dataset.dataset_ix=0 repr_net.ipm=wass repr_net.alpha=0.1 repr_net.tune_hparams=True prop_net_cov.tune_hparams=True mu_net_cov.tune_hparams=True exp.targets="['mu0', 'mu1', 'y0', 'y1', 'cate', 'rcate']"
```

Project based on the cookiecutter data science project template. #cookiecutterdatascience