
xgbfi C-API #2846

Closed · wants to merge 1 commit

Conversation

@Far0n (Contributor) commented Oct 30, 2017

  • C++ implementation of xgbfi
  • XGBoosterGetFeatureInteractions C-API

@RAMitchell (Member) left a comment

The code looks really good in general. I wondered if it would be better to try to use the existing tree model classes but I think it's fine to define your own so long as they are encapsulated within the .cc file as per my comment.

I would like to see some form of unit test.

Do you have plans to add any Python/R bindings to this functionality? This is not necessary for this PR but just curious.

I am also interested in the choice to return the interactions as a vector of strings. What are the advantages of doing it this way instead of, for example, returning pairs of integer indices?

xgbfi::XgbModel model = xgbfi::XgbModelParser::GetXgbModelFromDump(dump, ntrees);
xgbfi::FeatureInteractions fi = model.GetFeatureInteractions(max_fi_depth,
                                                             max_tree_depth,
                                                             max_deepening);
Member commented on the diff

I think you don't need all of these classes here. Is it possible to have one single function (not method) GetFeatureInteractions() here? That way you do not need to define these classes in xgbfi.h, you can instead define them in the .cc file. This is cleaner as these classes are only specific to xgbfi and are not needed in other code.

@Far0n (Contributor, Author) replied

Sure, will do.

@Far0n (Contributor, Author) commented Oct 30, 2017

Thank you for the timely review @RAMitchell! I wanted to ship unit tests along with the Python bindings in a second commit to this PR. I do not plan to add R bindings here.

A feature interaction has the structure of

struct FI {
  std::vector<std::string> feature_names;  // varying size
  double gain;
  double fscore;
  // ... or
  std::vector<float> stats;
};

One reason for strings is that I want to avoid a feature-id -> feature-name mapping in Python, and that this csv-like format is easily converted into a pandas DataFrame.

@Far0n Far0n force-pushed the varimp branch 8 times, most recently from 7c27386 to 2fbadd0 Compare November 1, 2017 07:32
@codecov-io commented Nov 1, 2017

Codecov Report

Merging #2846 into master will decrease coverage by 0.76%.
The diff coverage is 6.82%.


@@             Coverage Diff              @@
##             master    #2846      +/-   ##
============================================
- Coverage     43.08%   42.32%   -0.77%     
  Complexity      200      200              
============================================
  Files           151      152       +1     
  Lines         11518    11767     +249     
  Branches       1167     1191      +24     
============================================
+ Hits           4963     4980      +17     
- Misses         6225     6457     +232     
  Partials        330      330
Impacted Files                   Coverage Δ              Complexity Δ
src/analysis/xgbfi.cc            0% <0%> (ø)             0 <0> (?)
src/c_api/c_api.cc               17.45% <0%> (-0.47%)    0 <0> (ø)
python-package/xgboost/core.py   81.12% <85%> (+0.15%)   0 <0> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78d0bd6...f843045. Read the comment docs.

@Far0n Far0n force-pushed the varimp branch 3 times, most recently from b1dc3a1 to 0e26ea2 Compare November 1, 2017 12:42
@Far0n (Contributor, Author) commented Nov 1, 2017

@RAMitchell Could you please review the changes?

I don't know what to do with the following lint error, because I think the method get_feature_interactions is in the right place:

 python-package/xgboost/core.py:706: refactor (R0904, too-many-public-methods, Booster) Too many public methods (21/20)

Thank you!

@Far0n Far0n force-pushed the varimp branch 2 times, most recently from d626d90 to ace0b82 Compare November 1, 2017 15:20
@RAMitchell (Member) commented

Looks good. My inclination would be to disable the lint error; I agree with you that the method is in the correct place. For the R build you probably need to add the new C++ file to the amalgamation. One of the Python failures is also due to flake8 warnings, so please address those as well.

@khotilov can I get a second review?

@khotilov (Member) commented Nov 2, 2017

I'll try to take a better look this weekend, but so far I've had just under an hour to look through the code and to write some notes and questions.

@Far0n My first question is about understanding your intentions better: do you see this addition as sort of a "plugin" to xgboost that you would like to have here, and you would like to keep it as similar as reasonably possible in structure and function to your original tool (which you want to keep developing)? Or would you prefer this to be absorbed/integrated into xgboost? I'm asking because I see that your code is currently very loosely coupled with the core (just by passing a text dump of a model).

  • The code has little in common with the stuff under src/tree/, not only based on the code itself but also based on its purpose (keep in mind that src/tree/ is about building a tree). Maybe it would make more sense to create an src/analysis/ directory?
  • It would be better to have the stuff in StringUtils under common/. Some of it can be reused.
  • Note that instead of having multiple separate small classes, xgboost's C++ style tends to use struct for fully public classes, and to nest internal auxiliary classes/data structures within the classes that utilize them.
  • I'm not sure why there must be a static XgbModelParser::xgb_node_lists_ container.
  • A fair amount of code could be saved by dealing with GBTreeModel directly instead of parsing a model dump.
  • But if the model-dump parsing is to stay, why not have functionality to restore a GBTreeModel from it?
  • An interface to filter specific trees would be useful (e.g., to determine interactions that are important to a specific class in a multiclass setting).

Again, those are just my observations so far, which are more for a discussion than for action.

@Far0n (Contributor, Author) commented Nov 2, 2017

@khotilov

For obvious reasons, I don't like this dump parsing either, but my journey starting from the c_api hit a private wall at booster -> learner. So how do I get at the model without changing core code? I like the idea of keeping it loosely coupled, because I have high hopes that something like treelite will provide this for any tree-based model. GBTreeModel also seems very optimized for internal use.

@khotilov (Member) commented Nov 3, 2017

Yes, the tree model is well hidden, in order to keep the same interface across all GradientBoosters. However, we can change the core code for a good cause, and doing various tree-model analysis tasks is a good cause. I did it previously for my own hacking experiments; we'll think about some clean option.

treelite is more for deployment than analysis. There's dmlc/treelite/issues/5 about the possibility of doing some transformations, though... Maybe @hcho3 can comment.

@Far0n Far0n force-pushed the varimp branch 4 times, most recently from 40ac241 to 3254874 Compare November 5, 2017 09:39
@hcho3 (Collaborator) commented Nov 6, 2017

@Far0n Yes, treelite is currently geared toward deployment. I am actually debating whether there should be a separate project for model analysis and transformation. The workflow would be:

   Model (xgboost, lightgbm, scikit-learn, etc.)
=> Common model schema
=> Transformed model (pruning, etc.)
=> Optimize for deployment
=> Compiled prediction library, for deployment

The original plan was to have the transformation step as part of treelite, but it may be beneficial to have a separate project for transformation and analysis. Let me get back to you on this decision.

@hcho3 (Collaborator) commented Nov 6, 2017

Also, it would be nice if we could set a time to chat about what kind of model analysis you'd like to do. Can we schedule a time for a Google Hangout?

@RAMitchell (Member) commented

@hcho3, if you are able to come to h2oworld, @Far0n will also be there.

@hcho3 (Collaborator) commented Nov 7, 2017

@RAMitchell That's great! I really want to come now. I'll let you know by the end of this week whether I'd be able to attend.

@Far0n (Contributor, Author) commented Nov 8, 2017

Make it happen, @hcho3 :)

@hcho3 (Collaborator) commented Nov 13, 2017

@Far0n I am able to attend h2oworld. I will see you next month.

@Far0n (Contributor, Author) commented Nov 17, 2017

Awesome @hcho3, looking forward to it!

@Far0n closed this Feb 13, 2018
@khotilov (Member) commented

So, what was the outcome of your meeting in December?

@hcho3 (Collaborator) commented Feb 20, 2018

@khotilov @Far0n I am currently working to separate out the front-end of treelite to enable model analysis and editing. I've been swamped by other duties as a maintainer, but I'll try to get it done by next month.

@Far0n (Contributor, Author) commented Feb 20, 2018

Thank you @hcho3. I'm looking forward to it. Take your time though. :)

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019