# Variable Importance Estimation

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/174JPxk6-AKrjnyqkM1dvKtthPYJu31Sd#scrollTo=FQtHhn1Y7CQr)

**This is a simple, unified framework for nonlinear variable importance estimation that incorporates uncertainty in the prediction function and is compatible with a wide range of machine learning models (e.g., tree ensembles, kernel methods, and neural networks).**

This tutorial walks through examples of using the VIE package to estimate variable importance. The package implements three model classes: Featurized Decision Tree (FDT), Random Fourier Features (RFF), and Neural Network (NN). For each one, we report the mean squared prediction error (MSPE, a measure of out-of-sample predictive accuracy) and the variable importance scores.
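
As a point of reference, the MSPE reported throughout is the average squared difference between observed and predicted outcomes on held-out data. A minimal sketch follows; the helper name `mspe` and the numbers are made up for illustration and are not part of the VIE package.

```python
import numpy as np

def mspe(y_true, y_pred):
    """Mean squared prediction error between observed and predicted outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up values, for illustration only.
y_obs = [1.0, 2.0, 3.0]
y_hat = [1.0, 2.5, 2.0]
example_mspe = mspe(y_obs, y_hat)  # squared errors are 0, 0.25 and 1
print(example_mspe)
```

The cells below compute the same quantity inline with `np.mean((y_test - pred) ** 2)`.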

In [1]:
import os
import sys
import pandas as pd
import jax
import jax.numpy as jnp
import numpy as np
from tensorflow import keras  # needed for keras.callbacks.EarlyStopping in the NN section
from sklearn.metrics import roc_auc_score

# Connect to your Google Drive, make sure there's a `GitHub` folder in `My Drive`.
#from google.colab import drive
#drive.mount('/content/drive/')

#sys.path.append('/content/drive/My Drive/GitHub')
# make sure you have installed the vie package
import vie  

# you can download all the data from 
# https://github.com/wdeng5120/featurized-decision-tree/tree/main/python/experiments/expr/datasets
# and change the path to your local one
#root_path = "/content/drive/MyDrive"
data_path = os.path.join("./datasets/")

## Define prepare_training_data function

In [2]:
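# @title prepare_training_data
def prepare_training_data(data, n_obs):
  # Copy the slices so that pop() below does not touch (or warn about) `data`.
  df_train = data.head(n_obs).copy()
  df_test = data.tail(40).copy()  # the last 40 rows are held out for testing

  # pop() removes the outcome `y` and the `f` column (unused here),
  # leaving only the feature columns in df_train / df_test.
  y_train, _ = df_train.pop("y"), df_train.pop("f")
  y_test, _ = df_test.pop("y"), df_test.pop("f")

  x_train = df_train.to_numpy()
  x_test = df_test.to_numpy()

  # Flatten the outcomes to 1-D arrays of shape (n_samples,).
  y_train = y_train.to_numpy().ravel()
  y_test = y_test.to_numpy().ravel()
  return x_train, y_train, x_test, y_test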

We read the data with the following arguments and prepare the training and test sets.

In [3]:
dataset_name = 'cont' # @param ['cat', 'cont', 'adult', 'heart', 'mi'] 
outcome_type = 'rbf' # @param ['linear', 'rbf', 'matern32', 'complex']
n_obs = 200 # @param [100, 200, 500, 1000]
dim_in = 25 # @param [25, 50, 100, 200]
rep = 7 # @param 

data_file = f"{outcome_type}_n{n_obs}_d{dim_in}_i{rep}.csv"
data_file_path = os.path.join(data_path, dataset_name, data_file)
print(f"Data '{data_file}'", end='\t', flush=True)  

data = pd.read_csv(data_file_path, index_col=0)  
x_train, y_train, x_test, y_test = prepare_training_data(data, n_obs)

Data 'rbf_n200_d25_i7.csv'	
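
The file name above is assembled from the cell's parameters. A small sketch of that naming convention, wrapped in a hypothetical helper `build_data_path` (not part of the package) and assuming the directory layout `<data_path>/<dataset_name>/<file>` used in the cell:

```python
import os

def build_data_path(data_path, dataset_name, outcome_type, n_obs, dim_in, rep):
    """Hypothetical helper mirroring the naming pattern used in this tutorial."""
    data_file = f"{outcome_type}_n{n_obs}_d{dim_in}_i{rep}.csv"
    return os.path.join(data_path, dataset_name, data_file)

example_path = build_data_path("./datasets/", "cont", "rbf", 200, 25, 7)
print(os.path.basename(example_path))
```

Changing any of the form parameters (for example `n_obs` or `rep`) selects a different CSV file under the same dataset folder, provided that file has been downloaded.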

Now let's go over the three models one by one.

## Featurized Decision Tree (FDT)

There are a few parameters that we need to specify for FDT. 

- **`x_train`: array-like of shape (n_samples, n_features)** 

  The training input samples. 

- **`y_train`: array-like of shape (n_samples,)** 

  The target values (class labels in classification, real numbers in regression).

- **`c`: float, default=1.0** 

  The smoothing parameter controlling the similarity between our softmax approximation tree and the original decision tree. 

- **`sig2`: float, default=0.01** 

  The estimated noise variance, which can be chosen by cross-validation. 

- **`n_tree`: int, default=20**

  The number of trees in the forest.

- **`compute_psi`: bool, default=True**

  Whether to compute variable importance scores. If False, the vector of variable importance scores is returned as zeros of shape (n_features,).

- **`batch_size`: int, default=20**

  Batch size for computing the variable importance scores. Only used if `compute_psi=True`.

- **`n_samp`: int, default=100**

  Number of posterior samples of beta. Only used if `compute_psi=True`.

- **`seed`: int, default=0**

  Controls both the randomness of the bootstrapping of the samples used when building trees and the sampling of the features to consider when looking for the best split at each node.
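
To give some intuition for the smoothing parameter, the sketch below compares a hard split indicator with a sigmoid relaxation (the two-class case of a softmax). This is only an illustration: the name `temp` is hypothetical, and how `c` actually enters the package's approximation (including whether larger or smaller values mean a sharper split) is not specified here.

```python
import numpy as np

def hard_split(x, threshold):
    """Hard decision at a tree node: 1.0 if x > threshold, else 0.0."""
    return (np.asarray(x, dtype=float) > threshold).astype(float)

def soft_split(x, threshold, temp):
    """Sigmoid relaxation of the hard split.

    `temp` is a hypothetical temperature: smaller values give a sharper split.
    """
    z = (np.asarray(x, dtype=float) - threshold) / temp
    return 1.0 / (1.0 + np.exp(-z))

x_demo = np.array([-1.0, -0.2, 0.3, 1.0])
gaps = []
for temp in (1.0, 0.1, 0.01):
    gap = np.max(np.abs(soft_split(x_demo, 0.0, temp) - hard_split(x_demo, 0.0)))
    gaps.append(float(gap))
print(gaps)  # the gap to the hard split shrinks as temp decreases
```

A smooth split is differentiable in `x`, which is what makes gradient-based importance scores well defined for a tree-like model.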

In [4]:
fdt_out = vie.get_fdt_model(x_train, y_train, c=0.1, sig2=0.01, n_tree=20, 
                            compute_psi=True, batch_size=20, n_samp=10, seed=0)

Shapes: psi_est_all: (100, 20, 20, 25), grad_train: (20, 25), psi_est: (25,)


Here we set `n_samp=10` for ease of illustration. `fdt_out` is a tuple of four elements:

- **`psi_est`: ndarray of shape (n_features,)** 

  The vector of variable importance scores. An array of zeros if compute_psi=False.

- **`f`: prediction function** 

  The prediction function. It takes five parameters and returns the predicted values of shape (n_samples, n_tree).
    - **`X`: array-like of shape (n_samples, n_features)** 

    The input samples. 
    - **`map_matrix`: array-like of shape (n_tree, 2*(n_leaf_nodes - 1), n_leaf_nodes)** 

    The mapping matrix maps each sample from the path to the leaf node.

    - **`feature`: array-like of shape (n_tree, n_leaf_nodes - 1)**

    Features used for splitting nodes.

    - **`threshold`: array-like of shape (n_tree, n_leaf_nodes - 1)** 

    Threshold values.

    - **`beta`: array-like of shape (n_tree, n_leaf_nodes)**

    Posterior means of the coefficients for leaf nodes.
  
- **`grad_fs`: function to compute variable importance scores** 

  The function to compute variable importance scores of shape (n_samp, n_samples, n_tree, n_features). It takes five parameters: the first four are the same as those in `f`, and the fifth is

    - **`betas`: array-like of shape (n_tree, n_samp, n_leaf_nodes)**

    Posterior samples of the coefficients for leaf nodes.

- **`out_set`: dictionary containing five items: `map_matrix`, `feature`, `threshold`, `beta`, `betas`** 

Let's look at the MSPE on the test set.

In [5]:
f = fdt_out[1]
map_matrix = fdt_out[3]["map_matrix"]
feature = fdt_out[3]["feature"]
threshold = fdt_out[3]["threshold"]
beta = fdt_out[3]["beta"]
pred_test = np.array(f(x_test, map_matrix, feature, threshold, beta))
np.mean((y_test - np.mean(pred_test, 1)) ** 2)

1.593397296777089

The first output of `fdt_out` is the vector of variable importance scores computed using `x_train`.

In [6]:
psi_train = fdt_out[0]
psi_train
import matplotlib.pyplot as plt
#plt.imshow(psi_train[1])
psi_train.shape

(100, 20, 20, 25)

We know that the first five variables are the ones that generated `y`, so we can calculate the AUROC score.

In [8]:
true = np.concatenate((np.repeat(1, 5), np.repeat(0, x_train.shape[1] - 5)))
#roc_auc_score(true, psi_train)
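
To make the metric concrete: the AUROC is the probability that a randomly chosen relevant variable receives a higher score than a randomly chosen irrelevant one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the helper `auroc_pairwise` and the score vectors are made up for illustration.

```python
import numpy as np

def auroc_pairwise(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]            # all positive/negative pairs
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Same label layout as in the tutorial: 5 relevant variables, then 20 irrelevant.
true_demo = np.concatenate((np.repeat(1, 5), np.repeat(0, 20)))

# Made-up scores: every relevant variable outranks every irrelevant one.
perfect_scores = np.concatenate((np.full(5, 0.9), np.full(20, 0.1)))
auc_perfect = auroc_pairwise(true_demo, perfect_scores)

# Swap one relevant and one irrelevant score to introduce ranking errors.
swapped_scores = perfect_scores.copy()
swapped_scores[[0, 5]] = swapped_scores[[5, 0]]
auc_swapped = auroc_pairwise(true_demo, swapped_scores)
print(auc_perfect, auc_swapped)
```

A score of 1.0 therefore means the importance scores rank all five true variables above the rest, and 0.5 corresponds to a random ranking.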

We can also compute the variable importance scores using `x_test` and calculate the AUROC score. Because the output of `grad_fs` has shape (n_samp, n_samples, n_tree, n_features), we average its squared values over the first two axes (posterior samples and observations) and then take the median across trees (for robustness; the mean can also be used).

In [9]:
grad_fs = fdt_out[2]
betas = fdt_out[3]["betas"]
psi_est = np.array(grad_fs(x_test, map_matrix, feature, threshold, betas))
grad_train = np.mean(psi_est**2, axis=(0,1))
psi_test = np.median(grad_train, axis=0)
roc_auc_score(true, psi_test)

1.0
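
The aggregation in the cell above can be checked on a stand-in array. The sketch below uses random numbers in place of the real `grad_fs` output (an assumption for illustration only), with the first five features given a larger scale so that the expected ranking is known.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samp, n_samples, n_tree, n_features = 10, 40, 20, 25

# Stand-in for the grad_fs output: random values, NOT real model gradients.
# Features 0-4 get a larger scale to mimic the relevant variables.
scale = np.where(np.arange(n_features) < 5, 3.0, 0.3)
grads = rng.normal(size=(n_samp, n_samples, n_tree, n_features)) * scale

# Average squared values over posterior samples and observations.
per_tree = np.mean(grads ** 2, axis=(0, 1))   # shape (n_tree, n_features)

# Combine trees: median (robust to a few odd trees) or mean.
psi_median = np.median(per_tree, axis=0)      # shape (n_features,)
psi_mean = np.mean(per_tree, axis=0)          # shape (n_features,)
print(per_tree.shape, psi_median.shape)
```

The median and the mean usually agree closely when the trees behave similarly; the median is less sensitive to a single tree with unusually large values.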

## Random Fourier Features (RFF)

Next we look at the performance of our variable importance measure when applied to an (approximate) kernel method, Random Fourier Features. We set the number of features to $O(\sqrt{n}\log(n))$, following [Rudi, A. and Rosasco, L.](https://proceedings.neurips.cc/paper/2017/hash/61b1fb3f59e28c67f3925f3c79be81a1-Abstract.html). To select the length-scale parameter of the RBF kernel that RFF approximates, we loop over a list of candidates and select the one with the smallest MSPE. 
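
For readers unfamiliar with RFF, the sketch below shows the standard random-cosine construction for the RBF kernel and checks that inner products of the features approximate the kernel. It is a generic illustration, not the `vie.rff` implementation, whose internals may differ; `rff_features` and `rbf_kernel` are hypothetical helper names. The feature-count rule is the one used in this tutorial.

```python
import numpy as np

def rff_features(x, n_features, lengthscale, rng):
    """Random cosine features whose inner products approximate an RBF kernel."""
    d = x.shape[1]
    w = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))  # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)             # random phases
    return np.sqrt(2.0 / n_features) * np.cos(x @ w + b)

def rbf_kernel(x, lengthscale):
    """Exact RBF kernel matrix exp(-||x - y||^2 / (2 * lengthscale^2))."""
    sq_dist = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * lengthscale ** 2))

# Number of features used in this tutorial for n_obs = 200.
n_obs = 200
n_rff = int(np.sqrt(n_obs) * np.log(n_obs)) + 1

# Approximation check on made-up inputs, with many features for a tight fit.
rng = np.random.default_rng(0)
x = rng.normal(size=(30, 5))
z = rff_features(x, n_features=5000, lengthscale=2.0, rng=rng)
max_err = np.max(np.abs(z @ z.T - rbf_kernel(x, 2.0)))
print(n_rff, max_err)
```

The approximation error shrinks as the number of random features grows, which is why the feature count is tied to the sample size.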

In [10]:
n_rff = int(np.sqrt(n_obs) * np.log(n_obs)) + 1
pred_mse_rff = []
l_lst = [5.0, 10.0, 16.0, 23.0]
for l in l_lst:
    m = vie.rff(x_train, y_train, dim_hidden=n_rff, sig2=0.01, lengthscale=l, seed=0)
    m.train()
    pred = m.predict(x_test)[0]
    pred_mse_rff.append(np.mean((pred - y_test) ** 2))

l_best = l_lst[np.argmin(pred_mse_rff)]
m = vie.rff(x_train, y_train, dim_hidden=n_rff, sig2=0.01, lengthscale=l_best, seed=0)
m.train()

Similarly, we can compute the MSPE using `x_test`.


In [11]:
pred_test = m.predict(x_test)[0]
np.mean((y_test - pred_test) ** 2)

1.1967065724486168

Compute the variable importance scores using `x_train` and calculate the AUROC score. The same can be done using `x_test`.

In [12]:
psi_train = m.estimate_psi(x_train)[0]
roc_auc_score(true, psi_train)

0.81

In [17]:
psi_test = m.estimate_psi(x_test)[0]
roc_auc_score(true, psi_test)

array([0.05374935, 0.09295999, 0.00228801, 0.02916229, 0.00893562,
       0.02350537, 0.01223522, 0.00875522, 0.00095611, 0.0028615 ,
       0.00158964, 0.00312815, 0.00289245, 0.01179507, 0.00149384,
       0.00064764, 0.00203501, 0.00327557, 0.00317923, 0.00255488,
       0.00289723, 0.01099721, 0.00123574, 0.0012253 , 0.01066384])

## Neural Network (NN)

Finally, we look at the performance of our variable importance measure when applied to a neural network. 

We use a one-hidden-layer NN with regularization imposed on the layer between the input and the hidden layer. 
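
As a rough illustration of how input-layer sparsity relates to importance scores, the sketch below builds a tiny one-hidden-layer network in NumPy and scores each input by its mean squared gradient, mirroring the `np.mean(psi_est**2, ...)` aggregation used in the FDT section. Whether `vie` computes NN scores exactly this way is an assumption; the network, weights, and helper names are made up.

```python
import numpy as np

def nn_forward(x, w1, b1, w2):
    """One-hidden-layer network: f(x) = tanh(x W1 + b1) w2."""
    return np.tanh(x @ w1 + b1) @ w2

def nn_input_grad(x, w1, b1, w2):
    """Analytic gradient of f with respect to the inputs, shape (n_samples, dim_in)."""
    h = np.tanh(x @ w1 + b1)               # hidden activations
    return ((1.0 - h ** 2) * w2) @ w1.T    # chain rule through tanh

rng = np.random.default_rng(0)
dim_in_demo, dim_hidden_demo = 6, 16
w1 = rng.normal(size=(dim_in_demo, dim_hidden_demo))
w1[3:, :] = 0.0  # inputs 3-5 have zero input-layer weights, so f ignores them
b1 = rng.normal(size=dim_hidden_demo)
w2 = rng.normal(size=dim_hidden_demo)
x_demo = rng.normal(size=(50, dim_in_demo))

# Importance of each input: mean squared partial derivative over the sample.
psi_demo = np.mean(nn_input_grad(x_demo, w1, b1, w2) ** 2, axis=0)
print(np.round(psi_demo, 3))
```

Inputs whose first-layer weights are driven to zero receive a score of exactly zero here, which is the effect that regularizing the input-to-hidden layer is meant to encourage.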



In [13]:
lr = 1e-3  # @param
l1 = 1e2  # @param
l2 = 1e2
dim_hidden = 512  # @param
outlier_mse_cutoff = [100] # remove outliers
pred_mse_lst = [] # store the MSPE for each NN
psi_lst = [] # store the variable importance scores for each NN
batch_size = 20
sig2 = 0.01

We train the model five times with different seeds. 

In [14]:
n_model = 5
for i in range(n_model):
    m = vie.get_nn_model(dim_in=dim_in, dim_hidden=dim_hidden,
                         seed=i, lr=lr, l1=l1, l2=l2,
                         outlier_mse_cutoff=outlier_mse_cutoff)

    # Define callbacks.
    early_stop_callback = keras.callbacks.EarlyStopping(
        monitor="val_robust_mse_100", min_delta=1e-6, patience=25, verbose=0)
    metrics_callback = vie.MetricsCallback(x_train=x_train,
                                           x_test=x_test,
                                           y_test=y_test,
                                           outlier_mse_cutoff=outlier_mse_cutoff,
                                           dim_in=dim_in)

    m.fit(x_train, y_train, batch_size=32, epochs=500, 
          validation_data=(x_test, y_test), 
          callbacks=[early_stop_callback, metrics_callback],
          verbose=0)

    # Decide best epoch by MSPE.
    mse_history = metrics_callback.history['y_mse_100']
    psi_history = metrics_callback.history['psi']
    best_epoch_mse = np.argmin(mse_history)
    best_mse_by_mse = mse_history[best_epoch_mse]
    best_psi_by_mse = psi_history[best_epoch_mse]
    pred_mse_lst.append(best_mse_by_mse)
    psi_lst.append(best_psi_by_mse)

In [39]:
np.mean(pred_mse_lst) # MSPE

0.902294902450946

In [40]:
psi_train = np.mean(np.array(psi_lst), 0)
roc_auc_score(true, psi_train)

0.78
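
The per-model selection and the final averaging can be sketched on toy data. The histories below are made-up stand-ins for `metrics_callback.history` (two hypothetical models, three epochs, two features), used only to show the bookkeeping.

```python
import numpy as np

# Made-up training histories, NOT real results from the package.
histories = [
    {"y_mse_100": [1.4, 0.9, 1.1],
     "psi": [np.array([0.1, 0.0]), np.array([0.6, 0.2]), np.array([0.5, 0.3])]},
    {"y_mse_100": [1.0, 1.2, 0.7],
     "psi": [np.array([0.2, 0.1]), np.array([0.3, 0.1]), np.array([0.4, 0.0])]},
]

pred_mse_demo, psi_lst_demo = [], []
for history in histories:
    # Pick the epoch with the smallest MSPE and keep that epoch's scores.
    best_epoch = int(np.argmin(history["y_mse_100"]))
    pred_mse_demo.append(history["y_mse_100"][best_epoch])
    psi_lst_demo.append(history["psi"][best_epoch])

mean_mse = float(np.mean(pred_mse_demo))             # MSPE averaged over models
mean_psi = np.mean(np.array(psi_lst_demo), axis=0)   # scores averaged over models
print(mean_mse, mean_psi)
```

Averaging over seeds reduces the dependence of the scores on any single random initialization.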

Unfortunately, unlike FDT and RFF, we cannot compute the variable importance scores using `x_test`. This is because during training, the model only computes the variable importance scores using `x_train`.