[WIP] Add decision path dump feature #9151
base: master
Conversation
Hi, thank you for working on the new feature, I hope you enjoy hacking the code! One question comes to mind: should we align the API to what's in sklearn?

Personally, I don't think it's necessary. Nonetheless, it can be implemented when all else is ready, if you insist on it.
There are several issues that I am not so sure about. Please advise!
I think the scikit-learn representation makes sense. It represents the decision path with a numeric data structure instead of strings. The former is more efficient and reusable, one can easily build a human-readable text representation based on the matrix output when needed. In addition, the result can be returned back to Python instead of being kept as C++-only structure.
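For reference, this is what the numeric representation looks like in scikit-learn: `decision_path` on a forest returns a sparse indicator matrix plus an offsets array. A small sketch of that API (the dataset and model settings here are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=20, n_features=4, random_state=0)
clf = RandomForestClassifier(n_estimators=3, random_state=0).fit(X, y)

indicator, n_nodes_ptr = clf.decision_path(X)
# indicator: CSR matrix of shape (n_samples, total nodes over all trees);
# entry (i, j) is nonzero iff sample i passes through global node j.
# n_nodes_ptr: columns n_nodes_ptr[t]:n_nodes_ptr[t + 1] belong to tree t.
```

A human-readable path for a sample can then be derived from the matrix when needed, which is the point made above.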
It's not necessary to have support for all language bindings at the moment. We can iterate on that.
We can work on it later.
There are some tests in …

Then I must refactor things to align them with sklearn, and add tests.

Let me know if there's anything I can help with.
Please correct me if I'm mistaken; here are my thoughts on the format of the returned representation. Sklearn's random forest allocates node IDs globally, whereas xgboost's tree node IDs are local to each tree, so different trees can have the same ID for their nodes. My proposal then comes in two forms of the same idea.

On second thought, the second one would not be feasible because the n_node value might not be the same for different trees.
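The first form of the idea can be sketched as follows: make (tree_id, local_node_id) pairs globally unique by offsetting each tree's IDs with a running node count. This is a hypothetical illustration, not code from the PR; n_nodes_per_tree holds made-up example counts.

```python
# Made-up node counts for a model with three trees.
n_nodes_per_tree = [7, 5, 9]

# Running offsets: tree t owns global IDs offsets[t] .. offsets[t + 1] - 1.
offsets = [0]
for n in n_nodes_per_tree:
    offsets.append(offsets[-1] + n)

def global_node_id(tree_id: int, local_node_id: int) -> int:
    """Map a tree-local node ID to an ID unique across the whole model."""
    return offsets[tree_id] + local_node_id

print(global_node_id(0, 6))  # -> 6
print(global_node_id(1, 3))  # -> 10
```

This is the same bookkeeping sklearn exposes through its n_nodes_ptr array.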
I have implemented the first option.
Apologies for the slow reply. I will raise the issue with @hcho3 and @RAMitchell today. In the meantime, I will look into your solution. Thank you for the nice work!
Is there any update here?
I'm not sure why XGBoost should return multiple indicator matrices. In my opinion, it should be possible to return a single (sparse) matrix, along with an array like sklearn's n_nodes_ptr.
For sklearn:

Let me explain my understanding of this API: the second dimension takes a node ID as the index. Does my understanding align with what sklearn has? If so, xgboost's trees do not share node IDs globally, to my knowledge, and therefore cannot return the node IDs as the trees have them. Nonetheless, we could have a mapping from (tree_id, node_id) to some global node ID for each node.
From the scikit-learn doc: the columns indicator[:, n_nodes_ptr[i]:n_nodes_ptr[i+1]] give the indicator values for the i-th estimator.

So each tree's local node IDs are simply shifted by a per-tree column offset.

The wording there is a bit strange to me. I thought I should alter my implementation accordingly.
The current implementation is too complex and cannot be admitted as it is. Here's my guidance for implementing the feature:

```python
n_total_nodes = ...  # call a C API function to compute this
indicator = np.zeros((n_rows, n_total_nodes), dtype=np.int8)
n_trees = len(bst.get_dump(dump_format="json"))

# Prefix sum of per-tree node counts, analogous to sklearn's n_nodes_ptr.
n_nodes_ptr[0] = 0
for i in range(1, n_trees + 1):
    n_nodes_ptr[i] = n_nodes_ptr[i - 1] + get_n_nodes_for_tree(i)
```

Use `leaf_ids = bst.predict(X, pred_leaf=True)`:

```python
for row_id in range(X.shape[0]):
    for tree_id in range(n_trees):
        ancestor_list = ...  # nodes on the path from the root to leaf_ids[row_id, tree_id]
        for ancestor in ancestor_list:
            indicator[row_id, n_nodes_ptr[tree_id] + ancestor] = 1
```

Note the use of `n_nodes_ptr[tree_id]` as the column offset. Let me know if you need any help. In particular, I will help you with creating C API functions and how to call them from the Python side.
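One way the ancestor list in the guidance above could be obtained without new C API work is to parse the JSON dump and walk parent pointers up from the predicted leaf. This is only a sketch: build_parent_map and ancestors_of are hypothetical helpers, though the "nodeid"/"children" keys do match xgboost's JSON dump format.

```python
import json

def build_parent_map(tree):
    """Map each child nodeid to its parent nodeid in one dumped tree."""
    parents = {}
    stack = [tree]
    while stack:
        node = stack.pop()
        for child in node.get("children", []):
            parents[child["nodeid"]] = node["nodeid"]
            stack.append(child)
    return parents

def ancestors_of(leaf_id, parents):
    """Node IDs on the path from the root down to leaf_id, inclusive."""
    path = [leaf_id]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path[::-1]

# Usage sketch with one tree out of bst.get_dump(dump_format="json"):
#   tree = json.loads(dump_str)
#   ancestors_of(leaf_ids[row_id, tree_id], build_parent_map(tree))
```

Doing this purely in Python would be slow for large models, which is why the discussion leans toward dedicated C API functions.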
Thanks for the suggestion. I should fork another branch to do this.
Feel free to reach out to me for help.
Sorry about the delay. I was busy with other work. Now that I am a bit more free, I will submit another PR in the coming days.
No rush! Take your time.
Has this feature been written by someone else yet? I am truly free now. Here to fulfill my long-awaited destiny...
I don't think any of the core contributors are actively working on it. But let me refresh my memory; I want to have a deeper look before giving any advice.
Is there any update?
Here's another proposal for the API, after looking deeper into the issue.

### Format

Let's assume we have two trees and two rows, with decision paths like:

|       | tree 0     | tree 1     |
|-------|------------|------------|
| row 0 | 0, 1, 4    | 0, 2, 3, 5 |
| row 1 | 0, 1, 3, 5 | 0, 2       |

The question is now how we represent this table in Python efficiently while maintaining some level of interpretability. My suggestion is that we return a list of CSR matrices with a size equal to the number of trees:

```python
decision_path = [
    CSR(values=[0, 1, 4, 0, 1, 3, 5], indptr=[0, 3, 7]),
    CSR(values=[0, 2, 3, 5, 0, 2], indptr=[0, 5, 6]),
]
```

However, complications arise when we have multi-class models or boosted random forests, as it's quite difficult for users to find the correct tree. Without further grouping strategies, it's almost impossible for a normal user to find "the trees that represent the second class".

### Implementation

I agree with @hcho3 that we need a 2-phase API for this feature; using thread-local memory would be too expensive. However, I haven't decided what exactly the C function should return. For instance, we can return a single tree and let the Python function loop over all trees, or we can return all the paths in one go. The grouping concern from the previous section applies here as well.

Would love to hear others' opinions on these issues.
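Under this proposal, reading one row's path out of a tree's CSR matrix is just an indptr slice. A small sketch using the example numbers from the first tree above, with plain arrays standing in for the CSR fields:

```python
import numpy as np

# Fields of the first tree's proposed CSR matrix.
values = np.array([0, 1, 4, 0, 1, 3, 5])  # node IDs, row paths concatenated
indptr = np.array([0, 3, 7])              # row r spans values[indptr[r]:indptr[r + 1]]

def path_for_row(row: int) -> list:
    """Node IDs visited by the given row in this tree."""
    return values[indptr[row]:indptr[row + 1]].tolist()

print(path_for_row(0))  # -> [0, 1, 4]
print(path_for_row(1))  # -> [0, 1, 3, 5]
```

Note this stores node IDs in the CSR values rather than using them as column indices, which is a different layout from sklearn's 0/1 indicator matrix.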
This patch follows issue #9128. I have already implemented a prototype for dumping decision paths in tree models.
`XGBoosterPredictFromDMatrix` calls this new version of `predict` and works as expected when called from Python. A dumped decision path looks like the following. For now, dumping is directed onto stderr.
Some future work directions: