[SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml #10607
Conversation
cc @holdenk @jkbradley Could you review when you get a chance?
@@ -87,6 +87,14 @@ final class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val
    trees.head.asInstanceOf[DecisionTreeRegressionModel]
  }

  /** (private[ml]) Train a decision tree on an RDD */
  private[ml] def train(data: RDD[LabeledPoint],
GBTs in spark.ml are handled by converting a DataFrame to an RDD of LabeledPoint and then working with that during training. I added a new train method to accept an RDD that can be used to train the trees in the GBT ensemble. I appreciate feedback on this approach or alternative approaches.
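For illustration, the new entry point could look roughly like the sketch below. This is a sketch only, based on the description above: the parameter list and the call into the existing tree runner are assumptions, not the exact PR code.

```scala
// Sketch, not the exact PR code: an internal entry point on DecisionTreeRegressor
// that accepts the RDD[LabeledPoint] form used inside the GBT boosting loop,
// bypassing the DataFrame-based fit() path.
private[ml] def train(
    data: RDD[LabeledPoint],
    oldStrategy: OldStrategy): DecisionTreeRegressionModel = {
  // Reuse the existing single-tree runner on the RDD directly (assumed call shape).
  val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
    featureSubsetStrategy = "all", seed = 0L)
  trees.head.asInstanceOf[DecisionTreeRegressionModel]
}
```

The point of the RDD-accepting overload is that the boosting loop retrains on reweighted labels every iteration, so keeping the per-iteration tree training off the DataFrame conversion path avoids repeated conversion overhead.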
I like this approach - and it looks to mirror the approach taken when ALS was ported over (namely, there is a train function in the new ALS code marked as a developer API taking the old format of inputs). We could also convert the RDD of LabeledPoints to a DataFrame (which is something I remember being asked to do in one of my previous PRs). @jkbradley & @dbtsai what are your thoughts on this?
This seems fine for now as it is private. The main idea with this PR will be to start the migration. As part of the remaining steps in SPARK-12326, there should be plenty of opportunity to clean things up.
Test build #48795 has finished for PR 10607 at commit
Test build #49197 has finished for PR 10607 at commit
Test build #49202 has finished for PR 10607 at commit
Test build #49403 has finished for PR 10607 at commit
ping @jkbradley I have updated the comments with a link to performance testing. It was my first time using spark-perf, so please let me know if I need to reconfigure and run again (or if the cluster size is not sufficient). Also, if I need to aggregate the results more cleanly (e.g. into a summary table) I can do that as well.

@sethah I did find the perf-test results very difficult to read. Would it be ok to summarize into a readable table to make it easier to compare the before and after numbers (for posterity)?

@MLnick I updated the doc with cleaner results. I can do some further analysis on the results if needed. I wanted to make sure the test setup was valid first.
/**
 * Method to train a gradient boosting model
 * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
 * @return a gradient boosted trees model that can be used for prediction
Docstring for return should be updated to reflect the new return tuple (rather than a model class).
@sethah from my reading of the perf test results, there doesn't appear to be any major difference between before and after (in most cases it seems the same or slightly better, in a few cases slightly worse), so no regressions that I can see. It may be a good idea to try some larger scale tests, if it's possible for you to get the cluster resources for that?

@MLnick thanks for your comments! I have updated the scaladoc return types to reflect the tuple. I will look into running bulkier performance tests soon; hopefully we can continue reviewing in the meantime.
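Concretely, the return-type change discussed above means the internal boosting routine hands back the ensemble pieces as a tuple rather than a wrapped model class. A rough sketch follows; the method and type names are assumed from the discussion, not copied from the PR:

```scala
// Sketch: the shared boosting loop returns (trees, treeWeights) so that both
// spark.ml and spark.mllib callers can wrap the result in their own model type.
private[ml] def boost(
    input: RDD[LabeledPoint],
    boostingStrategy: OldBoostingStrategy)
  : (Array[DecisionTreeRegressionModel], Array[Double]) = {
  // ... boosting loop: fit one DecisionTreeRegressionModel per iteration,
  // record its weight, and retrain on the updated pseudo-residuals ...
}

// A spark.ml caller would then assemble its model from the tuple, e.g.:
// val (trees, treeWeights) = boost(oldDataset, boostingStrategy)
// new GBTRegressionModel(uid, trees, treeWeights)
```

Returning the raw pieces keeps the boosting core API-agnostic, which matches the migration goal of letting spark.mllib delegate to the spark.ml implementation later.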
Test build #52324 has finished for PR 10607 at commit
@sethah This looks fine to me, though there are merge conflicts that need to be resolved. It would be good to get this in ASAP so the work (and clean up that can happen) in SPARK-12381 and SPARK-12382 can begin. @jkbradley can you take a quick pass?
Test build #52678 has finished for PR 10607 at commit
@sethah If no further comments overnight, I'm going to merge this so you can move ahead with porting the helper classes to ML and removing the old MLLIB impl. As part of those tickets, I think we can clean up this ML impl and interfaces if required (e.g. we could look at removing the

If any other issues arise from this (including perf regression from larger scale perf-tests) we can also clean up subsequently.
I'm still working on getting some performance tests on a larger cluster up and running. I can continue this effort in parallel as the other changes progress, or I can try to expedite if we feel the tests are blocking the rest of these efforts.

@sethah Given it's a copy of the MLlib impl, and given the tests pass and the smaller spark-perf results look fine, I'm comfortable moving ahead. We should still run the larger scale tests ASAP to check, but I don't think it should be a blocker.

@sethah Merged to master. Ping me when the remaining work on SPARK-12326 is ready for review.

ping @MLnick @jkbradley I got some performance test results on a larger cluster (3x100G) and linked them in the original comment at the top. To be honest, I am not great at interpreting those results and what normal variance is, so I have a hard time drawing a solid conclusion from them. Let me know if those seem adequate.
Thanks for doing this migration. I checked the PR and it LGTM. Your tests look good to me: they all seem fairly close, except for a couple of outliers, but even those seem within a standard deviation or so (the 2nd value in spark-perf results). Thanks for running them!

Also @MLnick: if the ML implementation uses RDDs underneath, it will be nice to call directly into that implementation from spark.mllib in order to avoid serialization overhead.

Sure, makes sense - it was my impression that the ML impl would be improved
Oh I see. That will be great to do eventually, but there are some issues right now supporting iterative algorithms using DataFrames (b/c of query plans growing very large). Those will be addressed, but it's (sort of) a blocker for now. |
Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation. Performance testing should be done to ensure there were no regressions. Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#10607 from sethah/SPARK-12379.
Large scale performance tests are here