[FEA] Return the observations propagated to leaf values #424
Comments
Thanks for your input! I'm out this week and will review in the following week. |
How about creating a
Users of scikit-learn will find this primitive quite familiar. Also, this interface will return a simple 2D |
It is straightforward to obtain the list of observations per leaf, given a

import numpy as np
from sklearn.ensemble import RandomForestRegressor as rfr
from sklearn.datasets import load_diabetes

ntree = 5
X, y = load_diabetes(return_X_y=True)
clf = rfr(n_estimators=ntree, max_depth=4)
clf.fit(X, y)

# Leaf index reached by every observation in every tree: shape (n_samples, ntree)
pred_leaf = clf.apply(X)
print(pred_leaf.shape)
print(pred_leaf)

# One dict per tree, mapping leaf index -> list of observation indices
leaf_map = [{} for _ in range(ntree)]
for i in range(X.shape[0]):
    for j in range(ntree):
        if pred_leaf[i, j] not in leaf_map[j]:
            leaf_map[j][pred_leaf[i, j]] = []
        leaf_map[j][pred_leaf[i, j]].append(i)

for j in range(ntree):
    print(f"Tree {j}")
    for k, v in sorted(leaf_map[j].items()):
        print(k, v)
Thanks @hcho3 for your consideration. Yes, the proposal sounds good. The outlined As an aside, we explored some examples with this proposal and would be able to implement OOB predictions with |
Hi @hcho3 appreciate you taking the time to review this earlier. We previously discussed the value of supporting a weight matrix and out-of-bag predictions. Do you know if there are plans to develop this request? Although supporting both the weight matrix and fast out-of-bag predictions would be ideal, I understand there may be resource constraints. As such, we decided to explicitly separate fast out-of-bag predictions from this request. If one request could be prioritized, it would be native support for out-of-bag predictions (#435) over this one for the weight matrix. Happy to provide any additional details. Thanks! |
Background / Motivation
As highlighted in a previous exploration, identifying which training observations were used to make a prediction can be useful for calculating out-of-bag errors and for causal applications. Although the existing Treelite interface can be manipulated to return which observations were involved per tree, other features such as generating a random forest's weight matrix (see explanation and a specific implementation in a library) require identifying which observations were involved per leaf. This generates useful insights, since a single prediction can be thought of as a weighted set of observations from the training set.
Although it would be great to implement OOB predictions and weight matrix features natively, doing so may require a non-lossy representation of leaf nodes in the checkpoint format. Instead of storing just the average value of a leaf, the training observations used to calculate that average would be needed as well. An alternative implementation that avoids changing the checkpoint format would be to expose an interface that returns which input observations were propagated to each leaf. Calculating OOB error and a weight matrix would then require a two-pass approach in which the training data is re-propagated to the leaves, then compared against new test data handled by the same interface. This approach requires downstream implementation by Treelite users, but is flexible enough to support other applications. For instance, it would be possible to perturb the training data and generate custom weight matrices for other specific insights.
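To make the two-pass idea concrete, here is a minimal sketch for a single tree. It assumes the interface hands back one leaf index per observation; the arrays `train_leaf` and `test_leaf` are made-up values for illustration, not output from Treelite. The first pass builds a Boolean leaf-by-observation matrix from the training data, and the second pass looks up which training observations share a leaf with each test observation.

```python
import numpy as np

# Hypothetical leaf indices for ONE tree (illustrative values only).
train_leaf = np.array([2, 5, 2, 6, 5, 2])  # first pass: training data
test_leaf = np.array([5, 2])               # second pass: new test data

# Boolean matrix of shape (number of leaves, number of training observations).
leaf_ids = np.unique(train_leaf)                       # sorted leaf indices
membership = leaf_ids[:, None] == train_leaf[None, :]

# For each test observation, find the training observations in the same leaf.
# Assumes every test leaf index also appears in the training pass.
rows = np.searchsorted(leaf_ids, test_leaf)
neighbors = [np.flatnonzero(membership[r]) for r in rows]

for leaf, idx in zip(test_leaf, neighbors):
    print(f"test leaf {leaf}: training observations {idx.tolist()}")
```

This only works if leaf indices stay consistent between the two passes, which is why a static index per leaf matters.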
Potential Interface
Similar to the pred_margin parameter, there could be an additional parameter that specifies whether to return the observations associated with each leaf instead of the raw prediction in the treelite.gtil.predict() function or in the standard runtime library.
The returned value could remain the same numpy.ndarray type, though memory may become a concern depending on the size of the forest. It also requires that each leaf node has a static index during serialization of a tree. The interface could return a vector with a length equal to the number of trees. Each tree index could then hold a mapping of leaf index to the input observation indices propagated to that leaf. This could be represented by a Boolean matrix (number of leaves by number of observations) or a vector of the observations in each leaf. As long as the leaf indices remain consistent across calls, a Treelite user can implement OOB and the weight matrix themselves.
Example Application
Pseudo-code for potential implementation of a weight matrix:
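A minimal sketch of how such a weight matrix could be computed follows. Since the proposed Treelite interface does not exist yet, scikit-learn's RandomForestRegressor.apply() stands in for it here; the train/test split, the variable names, and the hyperparameters are all assumptions made for illustration. The sketch also assumes bootstrap=False, so that each leaf value is the plain mean of the training targets that reach it; with bootstrapping, the in-bag counts per tree would be needed as well.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:400], y[:400]   # arbitrary split for illustration
X_test = X[400:]

# bootstrap=False: every tree sees all training rows (assumption, see above).
clf = RandomForestRegressor(n_estimators=5, max_depth=4, bootstrap=False,
                            max_features=0.5, random_state=0)
clf.fit(X_train, y_train)

# Two passes through the same "leaf index" interface.
train_leaf = clf.apply(X_train)   # shape (n_train, n_trees)
test_leaf = clf.apply(X_test)     # shape (n_test, n_trees)

n_trees = train_leaf.shape[1]
weights = np.zeros((X_test.shape[0], X_train.shape[0]))
for t in range(n_trees):
    # same_leaf[i, j] is True when test row i and training row j share a leaf.
    same_leaf = test_leaf[:, [t]] == train_leaf[:, t][None, :]
    # Each training row in the leaf gets weight 1 / (leaf size) for this tree.
    weights += same_leaf / same_leaf.sum(axis=1, keepdims=True)
weights /= n_trees                # average over trees

# Each prediction is a weighted sum of the training targets.
print(np.allclose(weights @ y_train, clf.predict(X_test)))
```

Each row of `weights` sums to one, so a prediction can be read as a weighted average over the training set, which is the property the request relies on.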