Finding Influential Training Samples for Gradient Boosted Decision Trees
This repository implements the LeafRefit and LeafInfluence methods described in the paper Finding Influential Training Samples for Gradient Boosted Decision Trees.
The paper deals with the problem of finding influential training samples using the Influence Functions framework from classical statistics, recently revisited in the paper "Understanding Black-box Predictions via Influence Functions" (code). The classical approach, however, is only applicable to smooth parametric models. In our paper, we introduce LeafRefit and LeafInfluence, methods for extending the Influence Functions framework to non-parametric Gradient Boosted Decision Tree ensembles.
We recommend using the Anaconda Python distribution for easy installation.
The following Python 2.7 packages are required:
Note: the package versions specified below are the versions with which the experiments reported in the paper were run.
- ipywidgets>=7.0.0 (for Jupyter Notebook rendering)
The create_influence_boosting_env.sh script creates the influence_boosting Conda environment with the required packages installed. You can run the script by running the following in the
The code in this repository uses CatBoost as its GBDT implementation. We tested our package with CatBoost version 0.6 built from GitHub. Installation instructions are available in the documentation.
Note: if you are using the influence_boosting environment described above, make sure to install CatBoost specifically for this environment.
Since CatBoost is written in C++, in order to use CatBoost models with our Python package, we also include export_catboost, a binary that exports a saved CatBoost model to human-readable JSON.
This repository assumes that a program named export_catboost is available in the shell. To ensure that, you can do the following:
- Select one of the two binaries, export_catboost_linux, depending on your OS.
- Copy it to export_catboost in the root repository directory.
- Add the path to the root repository directory to the
Note: since CatBoost's treatment of categorical features can be fairly complicated, export_catboost currently supports numerical features only.
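To confirm that a program named export_catboost is actually visible to the shell after the steps above, a minimal sketch such as the following may help. The find_program helper is hypothetical and not part of this repository; it only walks the directories listed in PATH and is written to run under both Python 2.7 and Python 3.

```python
import os


def find_program(name, search_path=None):
    """Return the full path of an executable called `name`, or None.

    Hypothetical helper for illustration; not part of this repository.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        if not directory:
            # Ignore empty PATH entries in this simplified lookup.
            continue
        candidate = os.path.join(directory, name)
        # Accept only regular files that the current user may execute.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


location = find_program("export_catboost")
if location is None:
    print("export_catboost was not found in PATH")
else:
    print("export_catboost found at " + location)
```

If the lookup reports that nothing was found, the copy step or the PATH change most likely did not take effect in the current shell session.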
An example experiment showing the API and a use-case of Influence Functions can be found in the
Note: in this notebook, CatBoost parameters are loaded from the catboost_params.json file. In particular, the task_type parameter is set to CPU by default. If you have a GPU with CUDA available on your machine and have compiled CatBoost with GPU support, you can change this parameter to GPU in order to train CatBoost faster. The majority of the experiments in the paper were conducted using the
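Switching the task_type parameter can be done by editing catboost_params.json by hand or with a small script. The sketch below assumes that the file holds a flat JSON object with task_type as a top-level key; the set_task_type helper is hypothetical and not part of this repository.

```python
import json


def set_task_type(params_path, task_type):
    """Rewrite the `task_type` entry of a JSON parameter file in place.

    Hypothetical helper; assumes a flat JSON object with a top-level
    `task_type` key, which may not match the real file layout.
    """
    if task_type not in ("CPU", "GPU"):
        raise ValueError("task_type must be 'CPU' or 'GPU'")
    with open(params_path) as f:
        params = json.load(f)
    params["task_type"] = task_type
    with open(params_path, "w") as f:
        json.dump(params, f, indent=2)
    return params
```

For example, set_task_type("catboost_params.json", "GPU") would select GPU training, assuming the file sits in the current working directory; all other parameters in the file are left unchanged.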