# BoXHED 2.0 quick start

BoXHED 2.0 is a software package for nonparametrically estimating hazard functions via gradient boosting. It extends BoXHED 1.0, which is described in the paper [BoXHED: Boosted eXact Hazard Estimator with Dynamic covariates](http://proceedings.mlr.press/v119/wang20o/wang20o.pdf).

This section demonstrates BoXHED 2.0 on a synthetic data example.

### 1. Importing convenience functions from main.py

Here we introduce the functions we import from main.py (the script we use for evaluating BoXHED 2.0).

**_read_synth** reads the synthetic data for training and returns a pandas dataframe.

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates (0, 20, or 40)

output:

a pandas dataframe consisting of the following columns:
* *patient*: the unit (patient) identifier, numbered from 1 to the number of patients in the dataset
* *t_start*: the start time of the observation
* *t_end*: the end time of the observation
* *X_i*: other covariates (their names do not matter)

In [1]:
from main import _read_synth
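To make the column layout described above concrete, here is a minimal sketch that uses plain Python dicts in place of dataframe rows. The covariate name `X_0`, the values, and the helper `check_long_format` are all hypothetical. The checks it performs (each row has `t_start < t_end`, and rows of the same patient are time-ordered and non-overlapping) are assumptions about what well-formed input looks like, not rules taken from main.py.

```python
# Toy rows in the long format described above; plain dicts stand in for
# dataframe rows. The covariate name "X_0" and all values are made up.
rows = [
    {"patient": 1, "t_start": 0.0, "t_end": 0.4, "X_0": 0.21},
    {"patient": 1, "t_start": 0.4, "t_end": 0.9, "X_0": 0.35},
    {"patient": 2, "t_start": 0.0, "t_end": 0.7, "X_0": 0.80},
]


def check_long_format(rows):
    """Hypothetical sanity check (an assumption, not part of main.py).

    Returns True if every row has t_start < t_end and, within each
    patient, rows appear in time order without overlapping.
    """
    last_end = {}  # patient id -> t_end of that patient's previous row
    for row in rows:
        if not row["t_start"] < row["t_end"]:
            return False
        prev_end = last_end.get(row["patient"])
        if prev_end is not None and row["t_start"] < prev_end:
            return False
        last_end[row["patient"]] = row["t_end"]
    return True


format_ok = check_long_format(rows)
```

Each patient contributes one row per observation interval, so a patient whose covariates change over time appears on several rows.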

**_read_synth_test** reads the synthetic data for testing and returns a pandas dataframe as well as the true hazard function values (for RMSE calculation).

input:
* *ind_exp*: hazard function number, based on the paper
* *num_irrelevant*: number of irrelevant covariates (0, 20, or 40)

output:

* a numpy array holding the true hazard value for each row of the test data.
* a pandas dataframe consisting of the following columns:
  * *t_start*: the start time of the observation
  * *X_i*: other covariates (their names do not matter)

In [2]:
from main import _read_synth_test

**drop_rows** drops rows randomly to introduce censoring.

input:
* *data*: input data as read by *_read_synth*
* *keep_prob*: probability of each row staying in the dataset

output:
* a pandas dataframe similar to the input, but typically with fewer rows and with discontinuities in time.

In [3]:
from main import drop_rows
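The row-dropping idea can be illustrated with a short standard-library sketch. This is not the implementation in main.py: the function name `drop_rows_sketch`, the `seed` argument, and the use of plain dicts instead of a dataframe are all assumptions made for illustration. Each row is kept independently with the given probability, which is one straightforward reading of the description above.

```python
import random


def drop_rows_sketch(rows, keep_prob, seed=0):
    """Keep each row independently with probability keep_prob.

    Hypothetical stand-in for drop_rows; the real function operates on
    a pandas dataframe and may differ in its details.
    """
    if not 0 < keep_prob <= 1:
        raise ValueError("keep_prob must be in (0, 1]")
    rng = random.Random(seed)  # local generator, so results are repeatable
    return [row for row in rows if rng.random() < keep_prob]


# Ten consecutive observation intervals for a single made-up patient.
rows = [
    {"patient": 1, "t_start": 0.1 * i, "t_end": 0.1 * (i + 1)}
    for i in range(10)
]

kept_all = drop_rows_sketch(rows, keep_prob=1.0)   # nothing is dropped
kept_some = drop_rows_sketch(rows, keep_prob=0.8)  # some rows may be dropped
```

When a row in the middle of a patient's history is dropped, the surviving rows no longer cover a contiguous time span, which is the discontinuity in time mentioned in the output description.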

As an example, we select arbitrary values for the simulation parameters and train and test BoXHED 2.0 with them.

### 2. Running an example

Selecting specific simulation parameters:

In [4]:
exp_num   = 41   # experiment index; could also be 42, 43, or 44
num_irr   = 20   # number of irrelevant features; could also be 0 or 40
keep_prob = .8   # probability of keeping a row (1 - dropout probability); any number in (0, 1]
num_quant = 256  # number of quantiles; any integer in [8, 256]

Reading in the data:

In [5]:
data = _read_synth(exp_num, num_irr)

Importing the BoXHED model (for simplicity we have omitted hyperparameter tuning, but it is implemented in *main.py*):

In [6]:
from boxhed import boxhed
boxhed_ = boxhed(max_depth    = 1,
                 n_estimators = 150)

Now we call the preprocessor on the input data:

In [7]:
subjects, X, w, delta = boxhed_.preprocess(
        data             = data,
        quant_per_column = num_quant,
        weighted         = True,
        nthreads         = 1)

boxhed.preprocess has 4 inputs. The only one needing clarification is the boolean *weighted*, which decides whether the quantiles are weighted in training or not.

It also has 4 outputs:
* *subjects*: the patient for each row of *X*, *w*, and *delta*.
* *X*: input covariates as fed to BoXHED 2.0. It consists of covariates as well as *t_start*
* *w*: duration of each trajectory
* *delta*: denotes whether the event of interest has happened at the end of the trajectory or not.
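As a rough illustration of what these outputs represent, the sketch below derives them from toy rows using only the definitions in the list above. It is not what boxhed.preprocess actually does: it ignores the quantile handling controlled by *quant_per_column* and *weighted*, the helper name `split_rows_sketch` is hypothetical, and it assumes each input row carries an event indicator in a `delta` field, which is an assumption rather than something stated in this guide.

```python
def split_rows_sketch(rows, covariates):
    """Hypothetical illustration of the four outputs, without quantization.

    subjects: patient id for each row
    X:        t_start followed by the covariates
    w:        duration of each row, t_end - t_start
    delta:    event indicator at the end of each row (assumed input field)
    """
    subjects = [row["patient"] for row in rows]
    X = [[row["t_start"]] + [row[name] for name in covariates] for row in rows]
    w = [row["t_end"] - row["t_start"] for row in rows]
    delta = [row["delta"] for row in rows]
    return subjects, X, w, delta


# Two made-up rows for one patient; the event occurs at the end of the second.
rows = [
    {"patient": 1, "t_start": 0.0, "t_end": 0.5, "X_0": 0.2, "delta": 0},
    {"patient": 1, "t_start": 0.5, "t_end": 0.75, "X_0": 0.3, "delta": 1},
]

subjects, X, w, delta = split_rows_sketch(rows, ["X_0"])
```

All four outputs have one entry per row, so they can be passed to the model side by side.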

Fitting a BoXHED model looks like:

In [8]:
boxhed_.fit(X, delta, w)

boxhed(n_estimators=150)

We now read the test set and its corresponding true hazard values:

In [11]:
true_haz, test_x = _read_synth_test(exp_num, num_irr) 

Making a prediction on the test set:

In [12]:
preds = boxhed_.predict(test_x)

We now measure the RMSE:

In [13]:
from main import calc_L2
L2 = calc_L2(preds, true_haz)

The point estimate along with the CI is as follows:

In [14]:
L2

[0.1752634914366061, [0.17166798400756458, 0.17885899886564763]]
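The result above has the shape `[point_estimate, [lower, upper]]`. This guide does not say how *calc_L2* builds its interval, so the sketch below should be read only as one possible way to produce output of the same shape: it computes the RMSE directly and pairs it with a percentile bootstrap interval, a technique swapped in for illustration. The function names, the number of resamples, and the toy data are all assumptions.

```python
import math
import random


def rmse(preds, truth):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(preds, truth)) / len(preds))


def rmse_with_bootstrap_ci(preds, truth, n_boot=1000, alpha=0.05, seed=0):
    """Return [rmse, [lower, upper]] using a percentile bootstrap.

    Hypothetical helper; calc_L2 in main.py may compute its interval
    in a different way.
    """
    rng = random.Random(seed)
    n = len(preds)
    stats = []
    for _ in range(n_boot):
        # Resample row indices with replacement and recompute the RMSE.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rmse([preds[i] for i in idx], [truth[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return [rmse(preds, truth), [lower, upper]]


# Made-up predictions and true hazard values.
toy_preds = [0.9, 1.1, 2.2, 2.8]
toy_truth = [1.0, 1.0, 2.0, 3.0]
result = rmse_with_bootstrap_ci(toy_preds, toy_truth)
```

With real predictions, `preds` and `true_haz` from the cells above would take the place of the toy lists.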