
[FEA] Return the observations propagated to leaf values #424

Closed
edwardwliu opened this issue Dec 5, 2022 · 5 comments · Fixed by #442
@edwardwliu

Background / Motivation

As highlighted in a previous exploration, identifying which training observations were used to make a prediction can be useful for calculating out-of-bag errors and for causal applications. Although the existing Treelite interface can be manipulated to return which observations were involved per tree, other features, such as generating a random forest's weight matrix (see explanation and a specific implementation in a library), require identifying which observations were involved per leaf. This yields useful insights, since a single prediction can be thought of as a weighted set of observations from the training set.

Although it would be great to implement OOB predictions and weight matrix features natively, it may require a non-lossy representation of leaf nodes in the checkpoint format. Instead of just storing the average value of a leaf, the training observations used to calculate that average value would be needed as well. An alternative implementation that avoids changing the checkpoint format would be to expose an interface that returns which input observations were propagated to each leaf. Calculating OOB error and a weight matrix would then require a two-pass approach in which the training data is re-propagated to nodes, then compared against new test data handled by the same interface. This approach requires downstream implementation for Treelite users, but is flexible enough to support other applications. For instance, it would be possible to perturb the training data and generate custom weight matrices for other specific insights.

Potential Interface

Similar to the pred_margin parameter, treelite.gtil.predict() (or the standard runtime library) could accept an additional parameter that specifies whether to return the observations associated with each leaf instead of the raw prediction.

The returned value could remain the same numpy.ndarray type, though memory usage may be a concern depending on the size of the forest. It also requires that, during serialization of a tree, each leaf node be assigned a static index. The interface could return a vector with a length equal to the number of trees. Each tree index could then hold a mapping of leaf index to the input observation indices propagated to that leaf. This could be represented by a Boolean matrix (number of leaves by number of observations) or by a vector of the observations in each leaf. As long as the leaf indices remain consistent across calls, a Treelite user can implement OOB error and weight matrix calculations themselves.
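As an illustrative sketch (the variable names and data layout here are assumptions, not an implemented Treelite API), the two candidate representations described above, per-leaf observation lists and a Boolean matrix, could be related like this:

```python
import numpy as np

# Hypothetical return value for a 2-tree model over 5 observations:
# one dict per tree, mapping a stable leaf index to the indices of the
# input observations propagated to that leaf.
leaves_per_tree = [
    {0: [0, 3], 2: [1], 3: [2, 4]},  # tree 0
    {1: [0, 1, 4], 4: [2, 3]},       # tree 1
]

def to_boolean_matrix(leaf_map, n_obs):
    """Convert one tree's leaf -> observations mapping into the
    (number of leaves) x (number of observations) Boolean matrix
    representation."""
    leaf_ids = sorted(leaf_map)
    mat = np.zeros((len(leaf_ids), n_obs), dtype=bool)
    for row, leaf in enumerate(leaf_ids):
        mat[row, leaf_map[leaf]] = True
    return mat

print(to_boolean_matrix(leaves_per_tree[0], n_obs=5))
```

Because each observation reaches exactly one leaf per tree, each column of the per-tree Boolean matrix contains exactly one True entry.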

Example Application

Pseudo-code for potential implementation of a weight matrix:

training_leaves = predict(treelite_model, training_data, pred_leaf=True)
test_leaves = predict(treelite_model, test_data, pred_leaf=True)
i, j = number of training observations, number of test observations
weight_matrix = empty matrix of size i x j
for leaf_index, leaf in enumerate(test_leaves):
    for test_obs_index in leaf:
        for training_obs_index in training_leaves[leaf_index]:
            weight_matrix[training_obs_index, test_obs_index] += 1
weight_matrix = rescale weight_matrix to get final representation
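To make the pseudocode above concrete, here is a runnable sketch that uses scikit-learn's apply() as a stand-in for the proposed pred_leaf interface (the train/test split, model settings, and normalization choice are illustrative assumptions, not part of this proposal):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: 400 training rows, the rest held out as "test".
X, y = load_diabetes(return_X_y=True)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

reg = RandomForestRegressor(n_estimators=5, max_depth=4, random_state=0)
reg.fit(X_train, y_train)

# apply() plays the role of the proposed pred_leaf=True interface:
# leaf index per (observation, tree), shape (n_samples, n_trees).
train_leaves = reg.apply(X_train)
test_leaves = reg.apply(X_test)

n_train, n_test = X_train.shape[0], X_test.shape[0]
weight_matrix = np.zeros((n_train, n_test))

# For each tree, count co-occurrences: training observation i and test
# observation j contribute whenever they land in the same leaf.
for t in range(reg.n_estimators):
    same_leaf = train_leaves[:, t][:, None] == test_leaves[:, t][None, :]
    weight_matrix += same_leaf

# One possible rescaling: normalize each test observation's weights
# over the training set to sum to 1.
weight_matrix /= weight_matrix.sum(axis=0, keepdims=True)
```

The vectorized per-tree comparison replaces the triple loop in the pseudocode; each column of weight_matrix then gives the training-set weights behind one test prediction.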
@hcho3
Collaborator

hcho3 commented Dec 6, 2022

Thanks for your input! I'm out this week and will review in the following week.

@hcho3 hcho3 self-assigned this Dec 13, 2022
@hcho3
Collaborator

hcho3 commented Dec 13, 2022

@edwardwliu

How about creating a predict_leaf function that returns the leaf destination for each observation? It will be like scikit-learn's apply function:

apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters:

  • X: {array-like, sparse matrix} of shape (n_samples, n_features)
    The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Returns:

  • X_leaves: ndarray of shape (n_samples, n_estimators)
    For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Users of scikit-learn will find this primitive quite familiar. Also, this interface will return a simple 2D numpy.ndarray; no nested data structure is required.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 14, 2022

It is straightforward to obtain the list of observations per leaf, given a predict_leaf function. Here's working code using scikit-learn's apply function:

import numpy as np
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.datasets import load_diabetes

ntree = 5

X, y = load_diabetes(return_X_y=True)
clf = rfr(n_estimators=ntree, max_depth=4)
clf.fit(X, y)

# Leaf index for each (observation, tree) pair; shape (n_samples, ntree)
pred_leaf = clf.apply(X)
print(pred_leaf.shape)
print(pred_leaf)

# Invert the mapping: for each tree, collect the observations per leaf
leaf_map = [{} for _ in range(ntree)]

for i in range(X.shape[0]):
    for j in range(ntree):
        if pred_leaf[i, j] not in leaf_map[j]:
            leaf_map[j][pred_leaf[i, j]] = []
        leaf_map[j][pred_leaf[i, j]].append(i)

for j in range(ntree):
    print(f"Tree {j}")
    for k, v in sorted(leaf_map[j].items()):
        print(k, v)

@edwardwliu
Author

edwardwliu commented Dec 15, 2022

Thanks @hcho3 for your consideration. Yes, the proposal sounds good. The outlined predict_leaf() (or apply()) would be immensely helpful for deriving the weight matrix.

As an aside, we explored some examples with this proposal and would be able to implement OOB predictions with predict_leaf(). The implementation would likely be inefficient for our purposes, so we may continue to explore hacking the interface outlined here, i.e. returning an array of predictions per tree instead of just the average. However, we look forward to testing out a real example if predict_leaf() is implemented.

@edwardwliu
Author

Hi @hcho3, appreciate you taking the time to review this earlier. We previously discussed the value of supporting a weight matrix and out-of-bag predictions. Do you know if there are plans to develop this request?

Although supporting both the weight matrix and fast out-of-bag predictions would be ideal, I understand there may be resource constraints. As such, we decided to explicitly separate fast out-of-bag predictions from this request. If one request could be prioritized, it would be native support for out-of-bag predictions (#435) over this one for the weight matrix. Happy to provide any additional details. Thanks!
