
[New Feature] Interaction constraints #3135

Closed
BlueTea88 opened this issue Feb 26, 2018 · 11 comments

@BlueTea88
Contributor

Hi, I would like to add interaction constraints functionality to tree building.

Basically, this would constrain the combination of variables in each tree based on constraints specified by the user. Initial variables will still get selected in a greedy fashion (best variable at the time will be chosen) but subsequent variables will be limited to variables that have permitted interactions with the initial variables.

Potential benefits include:

  • Better predictive performance from focusing on interactions that work - whether through domain specific knowledge or algorithms that rank interactions
  • Less noise in predictions
  • More control to the user on what the model can fit (for example, the user may want to exclude some interactions even if they perform well due to regulatory constraints)
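The selection rule described above (greedy first split, then candidates restricted to permitted interactions) can be sketched in Python. This is illustrative pseudologic under one reasonable reading of the rule, not the actual tree-builder code; the function name and data layout are assumptions:

```python
def allowed_features(used, constraint_sets, all_features):
    """Return the features that may still be split on in a branch.

    used: set of features already split on along the current branch.
    constraint_sets: list of sets; each set is one permitted interaction group.
    all_features: every candidate feature.
    """
    if not used:
        # No split yet: the first feature is chosen greedily among all features.
        return set(all_features)
    # Otherwise a feature is allowed only if some permitted group
    # contains it together with every feature already used.
    return {f for f in all_features
            if any(used | {f} <= group for group in constraint_sets)}

# Example: V1 may interact with V2; V3, V4 and V5 may interact with each other.
groups = [{'V1', 'V2'}, {'V3', 'V4', 'V5'}]
feats = ['V1', 'V2', 'V3', 'V4', 'V5']
print(sorted(allowed_features({'V3'}, groups, feats)))  # ['V3', 'V4', 'V5']
```

Note that a feature already used remains allowed, so repeated splits on the same feature down a branch are not ruled out by this scheme.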

The idea is discussed briefly in the paper:
Delta Boosting Machine and its Application in Actuarial Modelling (Antonio et al., 2015)
https://actuaries.asn.au/Library/Events/ASTINAFIRERMColloquium/2015/AntonioEtAlDeltaBoostingPaper.pdf

@BlueTea88
Contributor Author

I have an experimental version in the repo below.
https://github.com/BlueTea88/xgboost/tree/int_cont

If possible, I would love to merge it to this main repository (see #3136).

It has two arguments:

  • int_constraints_flag - TRUE/FALSE whether to impose interaction constraints
  • int_constraints_list - permitted interactions specified as a list, where each item of the list is a set of column names that represents one permitted interaction (all column names listed in the set can be interacted with each other)

As an example, please see:
https://github.com/BlueTea88/xgboost/blob/int_cont/R-package/demo/interaction_constraints.R

Limitations

  • Currently only supports the exact greedy algorithm on multi-core
  • API only updated for R

@cassieqiao

I followed your example, and I assume int_constraints_list = list(c('V1','V2'), c('V3','V4','V5')) means V1 can only interact with V2, and vice versa; same idea for V3, V4 and V5. But I got the results below, where the constraints are basically not applied. Can you clarify? Thanks.

temp.int
[[1]]
[1] "V3" "V5"

[[2]]
[1] "V3" "V4"

[[3]]
[1] "V4" "V5"

[[4]]
[1] "V3" "V4" "V5"

[[5]]
[1] "V1" "V3" "V4" "V5"

[[6]]
[1] "V2" "V3" "V4" "V5"

[[7]]
[1] "V3" "V4" "V5" "V6"

[[8]]
[1] "V4" "V5" "V8"

[[9]]
[1] "V4" "V5" "V7"

[[10]]
[1] "V2" "V4" "V5"

Best,
Cassie
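Results like the list above can be checked mechanically by testing each tree's feature set against the constraint groups. A minimal sketch (the helper function is hypothetical; the variable names follow the example):

```python
def violates(tree_features, constraint_sets):
    """True if the features used in one tree are not all covered
    by a single permitted interaction group."""
    s = set(tree_features)
    return len(s) > 1 and not any(s <= g for g in constraint_sets)

# Constraints from the example, and a few of the reported trees.
groups = [{'V1', 'V2'}, {'V3', 'V4', 'V5'}]
trees = [['V3', 'V5'], ['V1', 'V3', 'V4', 'V5'], ['V4', 'V5', 'V8']]
print([violates(t, groups) for t in trees])  # [False, True, True]
```

Applied to the full list, trees such as [[5]] through [[10]] mix groups and would be flagged, which matches the observation that the constraints were not applied.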

@BlueTea88
Contributor Author

Hi Cassie,

The changes haven't been merged into the master repo yet. Just checking: did you try the code using xgboost built from my experimental branch?
https://github.com/BlueTea88/xgboost/tree/int_cont

Cheers,
Andrew

@BlueTea88
Contributor Author

If you are using Windows and you want an easy way to test interaction constraints, you can download the binary file here:
https://github.com/BlueTea88/xgboost/releases/download/v0.70-int/xgboost_0.7.0.zip

And install in R using:

remove.packages('xgboost')
# 'your-file-directory' is the folder containing the downloaded zip
install.packages(file.path('your-file-directory', 'xgboost_0.7.0.zip'), repos = NULL)

@hcho3
Collaborator

hcho3 commented Jul 4, 2018

Consolidating to #3439. This issue should be re-opened when you and others decide to actively work on implementing this feature. I look forward to working with you to get feature interaction constraints implemented.

@BlueTea88
Contributor Author

@hcho3 I've submitted a PR #3466 for this. Any chance this issue could be re-opened? Happy to make changes if required.

hcho3 reopened this Aug 16, 2018
@hcho3
Collaborator

hcho3 commented Aug 16, 2018

@BlueTea88 Sure. Let me know when your pull request is ready for review.

@BlueTea88
Contributor Author

@hcho3 It is ready for review now. Thanks

@yanyachen

Is there any way to constrain a feature not to interact with itself? I didn't find it in the documentation. Thanks.

@hcho3
Collaborator

hcho3 commented Sep 10, 2018

@yanyachen I don't think it's possible yet. Is it something you or others would find useful? If so, why and how?

@yanyachen

I think it's useful in these two slightly different scenarios:

  1. There is a very strong feature, but we suspect a small leakage (risk of overfitting because of that feature) and want to regularize against it.
  2. There is a small group of features, set A (e.g. FICO score), that is significantly more important than a set of weaker features, set B (e.g. credit utilization, age, education level). Note that the same information A contains can often also be learned from B; the features in B are simply overridden by A. So sometimes people may want to regularize the model to not rely too much on feature set A by intentionally fitting weaker trees using more of the weak features.

I think this is useful, but I also admit the feature may not be popular with others, or that a similar effect can be achieved through feature bootstrapping and then model bagging.
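The restriction being asked for, no repeated splits on one feature along a branch, could be expressed as an extra filter on the candidate set. An illustrative sketch, not an existing xgboost option:

```python
def candidates_without_self_interaction(used, all_features):
    """Exclude features already split on along the current branch,
    so a feature cannot 'interact' with itself via repeated splits."""
    return [f for f in all_features if f not in used]

# Once 'FICO' has been split on, it is removed from the candidate pool.
print(candidates_without_self_interaction({'FICO'}, ['FICO', 'age', 'utilization']))
# ['age', 'utilization']
```

Limiting a strong feature this way would cap its contribution at one split per branch, which is one interpretation of the regularization described above.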

lock bot locked as resolved and limited conversation to collaborators Dec 9, 2018