
xgbfi C-API #2846

Closed · wants to merge 1 commit

Conversation

@Far0n (Contributor) commented Oct 30, 2017

  • C++ implementation of xgbfi
  • XGBoosterGetFeatureInteractions C-API

@RAMitchell (Member) left a comment

The code looks really good in general. I wondered if it would be better to try to use the existing tree model classes but I think it's fine to define your own so long as they are encapsulated within the .cc file as per my comment.

I would like to see some form of unit test.

Do you have plans to add any Python/R bindings to this functionality? This is not necessary for this PR but just curious.

I am also interested in the choice to return the interactions as a vector of strings. What are the advantages of doing it this way instead of, for example, returning pairs of integer indices?

xgbfi::XgbModel model = xgbfi::XgbModelParser::GetXgbModelFromDump(dump, ntrees);
xgbfi::FeatureInteractions fi = model.GetFeatureInteractions(max_fi_depth,
                                                             max_tree_depth,
                                                             max_deepening);
Member commented on the diff

I think you don't need all of these classes here. Is it possible to have one single function (not method) GetFeatureInteractions() here? That way you do not need to define these classes in xgbfi.h, you can instead define them in the .cc file. This is cleaner as these classes are only specific to xgbfi and are not needed in other code.

@Far0n (Contributor, Author) replied

Sure, will do.

@Far0n (Contributor, Author) commented Oct 30, 2017

Thank you for the timely review @RAMitchell! I wanted to ship unit tests along with the Python bindings in a second commit to this PR. I do not plan to add R bindings here.

A feature interaction has the structure of

struct FI {
  std::vector<std::string> feature_names;  // varying size
  double gain;
  double fscore;
  // ... or
  std::vector<float> stats;
};

One reason for strings is that I want to avoid a feature-id -> feature-name mapping in Python, and that this csv-like format is easily converted into a pandas DataFrame.

@Far0n Far0n force-pushed the varimp branch 8 times, most recently from 7c27386 to 2fbadd0 Compare November 1, 2017 07:32
@codecov-io commented Nov 1, 2017

Codecov Report

Merging #2846 into master will decrease coverage by 0.76%.
The diff coverage is 6.82%.


@@             Coverage Diff              @@
##             master    #2846      +/-   ##
============================================
- Coverage     43.08%   42.32%   -0.77%     
  Complexity      200      200              
============================================
  Files           151      152       +1     
  Lines         11518    11767     +249     
  Branches       1167     1191      +24     
============================================
+ Hits           4963     4980      +17     
- Misses         6225     6457     +232     
  Partials        330      330
Impacted Files                   Coverage Δ              Complexity Δ
src/analysis/xgbfi.cc            0% <0%> (ø)             0 <0> (?)
src/c_api/c_api.cc               17.45% <0%> (-0.47%)    0 <0> (ø)
python-package/xgboost/core.py   81.12% <85%> (+0.15%)   0 <0> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 78d0bd6...f843045. Read the comment docs.

@Far0n Far0n force-pushed the varimp branch 3 times, most recently from b1dc3a1 to 0e26ea2 Compare November 1, 2017 12:42
@Far0n (Contributor, Author) commented Nov 1, 2017

@RAMitchell Could you please review the changes?

I don't know what to do with the following lint error, because I think the method get_feature_interactions is in the right place:

 python-package/xgboost/core.py:706: refactor (R0904, too-many-public-methods, Booster) Too many public methods (21/20)

Thank you!

@Far0n Far0n force-pushed the varimp branch 2 times, most recently from d626d90 to ace0b82 Compare November 1, 2017 15:20
@RAMitchell (Member) commented

Looks good. My inclination would be to disable the lint error; I agree with you that the method is in the correct place. For the R build you probably need to add the new C++ file to the amalgamation. One of the Python failures is also due to flake8 warnings, so please address those as well.

@khotilov can I get a second review?

@khotilov (Member) commented Nov 2, 2017

I'll try to take a better look this weekend, but so far I've had just under an hour to look through the code and to write some notes and questions.

@Far0n My first question is about understanding your intentions better: do you see this addition as sort of a "plugin" to xgboost that you would like to have here, and you would like to keep it as similar as reasonably possible in structure and function to your original tool (which you want to keep developing)? Or would you prefer this to be absorbed/integrated into xgboost? I'm asking because I see that your code is currently very loosely coupled with the core (just by passing a text dump of a model).

  • The code has little in common with the stuff under src/tree/, not only based on the code itself but also based on its purpose (keep in mind that src/tree/ is about building a tree). Maybe it would make more sense to create an src/analysis/ directory?
  • It would be better to have the stuff in StringUtils under common/. Some of it can be reused.
  • Note that instead of having multiple separate small classes, xgboost's C++ style tends to use struct for fully public classes, and to nest internal auxiliary classes/data structures within the classes that utilize them.
  • I'm not sure why there must be a static XgbModelParser::xgb_node_lists_ container.
  • A fair amount of code could be saved by dealing with GBTreeModel directly instead of parsing a model dump.
  • But if the model-dump parsing is to stay, why not have functionality to restore a GBTreeModel from it?
  • An interface to filter specific trees would be useful (e.g., to determine interactions that are important to a specific class in a multiclass setting).

Again, those are just my observations so far, which are more for a discussion than for action.

@Far0n (Contributor, Author) commented Nov 2, 2017

@khotilov

For obvious reasons, I don't like this dump parsing either, but my journey starting from the c_api hit a private wall at booster -> learner. So how do I get at the model without changing core code? I like the idea of keeping it loosely coupled, because I have high hopes that something like treelite will provide this for any tree-based model. GBTreeModel also seems very optimized for internal use.

@khotilov (Member) commented Nov 3, 2017

Yes, the tree model is well hidden, in order to keep the same interface across all GradientBoosters. However, we can change the core code for a good cause, and doing various tree-model analysis tasks is a good cause. I did it previously for my own hacking experiments; we'll think about some clean option.

treelite is more for deployment than analysis. There's dmlc/treelite/issues/5 about the possibility of doing some transformations, though... Maybe @hcho3 can comment.

@Far0n Far0n force-pushed the varimp branch 4 times, most recently from 40ac241 to 3254874 Compare November 5, 2017 09:39
@hcho3 (Collaborator) commented Nov 6, 2017

@Far0n Yes, treelite is currently geared toward deployment. I am actually debating whether there should be a separate project for model analysis and transformation. The workflow would be:

   Model (xgboost, lightgbm, scikit-learn, etc.)
=> Common model schema
=> Transformed model (pruning, etc.)
=> Optimize for deployment
=> Compiled prediction library, for deployment

The original plan was to have the transformation step as part of treelite, but it may be beneficial to have a separate project for transformation and analysis. Let me get back to you on this decision.

@hcho3 (Collaborator) commented Nov 6, 2017

Also, it would be nice if we could set a time to chat about what kind of model analysis you'd like to do. Can we schedule a time for a Google Hangout?

@RAMitchell (Member) commented

@hcho3, if you are able to come to h2oworld, @Far0n will also be there.

@hcho3 (Collaborator) commented Nov 7, 2017

@RAMitchell That's great! I really want to come now. I'll let you know by the end of this week whether I'd be able to attend.

@Far0n (Contributor, Author) commented Nov 8, 2017

Make it happen, @hcho3 :)

@hcho3 (Collaborator) commented Nov 13, 2017

@Far0n I am able to attend h2oworld. I will see you next month.

@Far0n (Contributor, Author) commented Nov 17, 2017

Awesome @hcho3, looking forward to it!

@Far0n closed this Feb 13, 2018
@khotilov (Member) commented

So, what was the outcome of your meeting in December?

@hcho3 (Collaborator) commented Feb 20, 2018

@khotilov @Far0n I am currently working to separate out the front-end of treelite to enable model analysis and editing. I've been swamped by other duties as a maintainer, but I'll try to get it done by next month.

@Far0n (Contributor, Author) commented Feb 20, 2018

Thank you @hcho3. I'm looking forward to it. Take your time though. :)

@lock lock bot locked as resolved and limited conversation to collaborators Jan 18, 2019