Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-15569: Split on the right subtree when node value == threshold on MultipleAdditiveTreesModel #242

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

spyk
Copy link

@spyk spyk commented Jul 30, 2021

https://issues.apache.org/jira/browse/SOLR-15569

Description

Fixes an issue where the MultipleAdditiveTreesModel does not split correctly when the tree node value equals the split threshold. This was discovered while testing a translated XGBoost model for LTR and getting slightly different score results.

NOTE: The previous logic split to the left and also added a NODE_SPLIT_SLACK that has been removed. Not sure if this original logic was part of another model, or served another purpose.

Solution

  • Changed split logic to split on the right subtree when the node value equals the threshold
  • Removed NODE_SPLIT_SLACK completely

Tests

  • Added TestMultipleAdditiveTreesModel.testMultipleAdditiveTreesSplitAtThreshold test to showcase the issue

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

if (featureVector[regressionTreeNode.featureIndex] <= regressionTreeNode.threshold) {
regressionTreeNode = regressionTreeNode.left;
} else {
if (featureVector[regressionTreeNode.featureIndex] >= regressionTreeNode.threshold) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it consistent, isn't <= in XGboost and similar?
This is a pretty big change, and I think it shouldn't be necssary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The == going left or right is an interesting question! One way to maintain backwards compatibility whilst supporting "go right instead of left" could be via an optional configuration element e.g.

- "params" : { "trees" : [ ... ] }
+ "params" : { "trees" : [ ... ], "splitToRight" : true }

And if there is a scoreNode change then explainNode requires a matching change I think.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run into case with a specific XGBoost model where a variable might take on 0 or 1 and the split condition being 1.0. So, in that case if we go left by default the right path is never evaluated, which results in silently decreased model performance.

But, agree it's a rather big change, there might be issues with a) other tree models that split to the left b) backwards incompatibility, so @cpoerschke suggestion makes a lot of sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think == should be always on the left. But I agree with splitToRight for backward compatibility.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@diegoceccarelli not sure if there are different tree models that either split to left or right when equal. At least in the case of the XGBoost model, it seems to split to the right by default. I'll check if I can add that flag functionality on the PR.

"weight" : "1f",
"root": {
"feature": "constantScoreToForceMultipleAdditiveTreesScoreAllDocs",
"threshold": "1.0f",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should verify this, I think branching is generated on <=left and > right in the training algorithm

@@ -249,5 +274,5 @@ public void multipleAdditiveTreesTestUnknownFeature(){
});
assertEquals(expectedException.toString(), ex.toString());
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unintended?

@spyk
Copy link
Author

spyk commented Sep 24, 2021

I took a stab on the 'splitToRight' flag implementation. Please review and let me know of any issues, thanks!

Copy link

This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the dev@solr.apache.org mailing list. Thank you for your contribution!

@github-actions github-actions bot added the stale PR not updated in 60 days label Feb 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale PR not updated in 60 days
Projects
None yet
4 participants