
[FEA] Return the observations propagated to leaf values #424

Closed
edwardwliu opened this issue Dec 5, 2022 · 5 comments · Fixed by #442
@edwardwliu

Background / Motivation

As highlighted in a previous exploration, identifying which training observations were used to make a prediction can be useful for calculating out-of-bag errors and for causal applications. Although the existing Treelite interface can be manipulated to return which observations were involved per tree, other features, such as generating a random forest's weight matrix (see explanation and a specific implementation in a library), require identifying which observations were involved per leaf. This yields useful insights, since a single prediction can be thought of as a weighted set of observations from the training set.

Although it would be great to implement OOB predictions and weight matrix features natively, it may require a non-lossy representation of leaf nodes in the checkpoint format. Instead of just storing the average value of a leaf, the training observations used to calculate that average value would be needed as well. An alternative implementation that avoids changing the checkpoint format would be to expose an interface that returns which input observations were propagated to each leaf. Calculating OOB error and a weight matrix would then require a two-pass approach in which the training data is re-propagated to nodes, then compared against new test data handled by the same interface. This approach requires downstream implementation for Treelite users, but is flexible enough to support other applications. For instance, it would be possible to perturb the training data and generate custom weight matrices for other specific insights.

Potential Interface

Similar to the pred_margin parameter, treelite.gtil.predict() (or the standard runtime library) could accept an additional parameter that specifies whether to return the observations associated with each leaf instead of the raw prediction.

The returned value could remain the same numpy.ndarray type, though memory usage may be a concern depending on the size of the forest. It also requires that, during serialization of a tree, each leaf node be assigned a static index. The interface could return a vector with a length equal to the number of trees. Each tree index could then hold a mapping of leaf index to the input observation indices propagated to that leaf. This could be represented by a Boolean matrix (number of leaves by number of observations) or by a vector of the observations in each leaf. As long as the leaf indices remain consistent across calls, a Treelite user can implement OOB error and weight matrix calculations themselves.
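As an illustrative sketch (the variable names and data layout here are assumptions, not an implemented Treelite API), the two candidate representations described above, per-leaf observation lists and a Boolean matrix, could be related like this:

```python
import numpy as np

# Hypothetical return value for a 2-tree model over 5 observations:
# one dict per tree, mapping a stable leaf index to the indices of the
# input observations propagated to that leaf.
leaves_per_tree = [
    {0: [0, 3], 2: [1], 3: [2, 4]},  # tree 0
    {1: [0, 1, 4], 4: [2, 3]},       # tree 1
]

def to_boolean_matrix(leaf_map, n_obs):
    """Convert one tree's leaf -> observations mapping into the
    (number of leaves) x (number of observations) Boolean matrix
    representation."""
    leaf_ids = sorted(leaf_map)
    mat = np.zeros((len(leaf_ids), n_obs), dtype=bool)
    for row, leaf in enumerate(leaf_ids):
        mat[row, leaf_map[leaf]] = True
    return mat

print(to_boolean_matrix(leaves_per_tree[0], n_obs=5))
```

Because each observation reaches exactly one leaf per tree, each column of the per-tree Boolean matrix contains exactly one True entry.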

Example Application

Pseudo-code for potential implementation of a weight matrix:

training_leaves = predict(treelite_model, training_data, pred_leaf=True)
test_leaves = predict(treelite_model, test_data, pred_leaf=True)
i, j = number of training observations, number of test observations
weight_matrix = empty matrix of size i x j
for leaf_index, leaf in enumerate(test_leaves):
    for test_obs_index in leaf:
        for training_obs_index in training_leaves[leaf_index]:
            weight_matrix[training_obs_index, test_obs_index] += 1
weight_matrix = rescale weight_matrix to get final representation
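To make the pseudocode above concrete, here is a runnable sketch that uses scikit-learn's apply() as a stand-in for the proposed pred_leaf interface (the train/test split, model settings, and normalization choice are illustrative assumptions, not part of this proposal):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

# Stand-in data: 400 training rows, the rest held out as "test".
X, y = load_diabetes(return_X_y=True)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

reg = RandomForestRegressor(n_estimators=5, max_depth=4, random_state=0)
reg.fit(X_train, y_train)

# apply() plays the role of the proposed pred_leaf=True interface:
# leaf index per (observation, tree), shape (n_samples, n_trees).
train_leaves = reg.apply(X_train)
test_leaves = reg.apply(X_test)

n_train, n_test = X_train.shape[0], X_test.shape[0]
weight_matrix = np.zeros((n_train, n_test))

# For each tree, count co-occurrences: training observation i and test
# observation j contribute whenever they land in the same leaf.
for t in range(reg.n_estimators):
    same_leaf = train_leaves[:, t][:, None] == test_leaves[:, t][None, :]
    weight_matrix += same_leaf

# One possible rescaling: normalize each test observation's weights
# over the training set to sum to 1.
weight_matrix /= weight_matrix.sum(axis=0, keepdims=True)
```

The vectorized per-tree comparison replaces the triple loop in the pseudocode; each column of weight_matrix then gives the training-set weights behind one test prediction.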
@hcho3
Collaborator

hcho3 commented Dec 6, 2022

Thanks for your input! I'm out this week and will review in the following week.

@hcho3 hcho3 self-assigned this Dec 13, 2022
@hcho3
Collaborator

hcho3 commented Dec 13, 2022

@edwardwliu

How about creating a predict_leaf function that returns the leaf destination for each observation? It will be like scikit-learn's apply function:

apply(X)
Apply trees in the forest to X, return leaf indices.
Parameters:

  • X: {array-like, sparse matrix} of shape (n_samples, n_features)
    The input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Returns:

  • X_leaves: ndarray of shape (n_samples, n_estimators)
    For each datapoint x in X and for each tree in the forest, return the index of the leaf x ends up in.

Users of scikit-learn will find this primitive quite familiar. Also, this interface will return a simple 2D numpy.ndarray; no nested data structure is required.

@hcho3
Copy link
Collaborator

hcho3 commented Dec 14, 2022

It is straightforward to obtain the list of observations per leaf, given a predict_leaf function. Here's working code using scikit-learn's apply function:

import numpy as np
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.datasets import load_diabetes

ntree = 5

X, y = load_diabetes(return_X_y=True)
clf = rfr(n_estimators=ntree, max_depth=4)
clf.fit(X, y)

# Leaf index for each (observation, tree) pair; shape (n_samples, ntree)
pred_leaf = clf.apply(X)
print(pred_leaf.shape)
print(pred_leaf)

# Invert the mapping: for each tree, collect the observations per leaf
leaf_map = [{} for _ in range(ntree)]

for i in range(X.shape[0]):
    for j in range(ntree):
        if pred_leaf[i, j] not in leaf_map[j]:
            leaf_map[j][pred_leaf[i, j]] = []
        leaf_map[j][pred_leaf[i, j]].append(i)

for j in range(ntree):
    print(f"Tree {j}")
    for k, v in sorted(leaf_map[j].items()):
        print(k, v)

@edwardwliu
Author

edwardwliu commented Dec 15, 2022

Thanks @hcho3 for your consideration. Yes, the proposal sounds good. The outlined predict_leaf() (or apply()) would be immensely helpful for deriving the weight matrix.

As an aside, we explored some examples with this proposal and would be able to implement OOB predictions with predict_leaf(). The implementation would likely be inefficient for our purposes, so we may continue to explore hacking the interface outlined here, i.e. returning an array of predictions per tree instead of just the average. However, we look forward to testing out a real example if predict_leaf() is implemented.

@edwardwliu
Author

Hi @hcho3, appreciate you taking the time to review this earlier. We previously discussed the value of supporting a weight matrix and out-of-bag predictions. Do you know if there are plans to develop this request?

Although supporting both the weight matrix and fast out-of-bag predictions would be ideal, I understand there may be resource constraints. As such, we decided to explicitly separate fast out-of-bag predictions from this request. If one request could be prioritized, it would be native support for out-of-bag predictions (#435) over this one for the weight matrix. Happy to provide any additional details. Thanks!
