
[SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml #10607

Closed · 11 commits

Conversation

sethah
Contributor

sethah commented Jan 5, 2016

Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation.
Performance testing should be done to ensure there are no regressions.

Performance testing results are here

Large scale performance tests are here

@sethah
Contributor Author

sethah commented Jan 5, 2016

cc @holdenk @jkbradley Could you review when you get a chance?

@@ -87,6 +87,14 @@ final class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val
trees.head.asInstanceOf[DecisionTreeRegressionModel]
}

/** (private[ml]) Train a decision tree on an RDD */
private[ml] def train(data: RDD[LabeledPoint],
Contributor Author

GBTs in spark.ml are handled by converting a dataframe to an RDD of LabeledPoint and then working with that during training. I added a new train method to accept an RDD that can be used to train the trees in the GBT ensemble. I appreciate feedback on this approach or alternative approaches.
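As a rough sketch of what this overload could look like (hypothetical body; the `RandomForest.run` call and its parameters are assumptions based on the existing single-tree code path shown in the diff above):

```scala
// Hypothetical sketch of the added overload: train directly on an
// RDD[LabeledPoint], bypassing the DataFrame extraction step, so the
// GBT boosting loop can re-fit trees on re-labeled RDDs each iteration.
private[ml] def train(
    data: RDD[LabeledPoint],
    oldStrategy: OldStrategy): DecisionTreeRegressionModel = {
  val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
    featureSubsetStrategy = "all", seed = 0L, parentUID = Some(uid))
  trees.head.asInstanceOf[DecisionTreeRegressionModel]
}
```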

Contributor

I like this approach - it mirrors the approach taken when ALS was ported over (namely, there is a train function in the new ALS code, marked as a developer API, that takes the old input format). We could also convert the RDD of LabeledPoints to a DataFrame (which is something I remember being asked to do in one of my previous PRs). @jkbradley & @dbtsai what are your thoughts on this?
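For reference, the DataFrame-conversion alternative mentioned here would look roughly like the following (a sketch, not the PR's code; it assumes the default `label`/`features` schema produced by converting LabeledPoints):

```scala
// Hypothetical sketch of the alternative: wrap the RDD in a DataFrame
// and go through the public Estimator API instead of a new overload.
import sqlContext.implicits._

val df = data.toDF()  // LabeledPoint maps to "label" and "features"
val dt = new DecisionTreeRegressor()
  .setMaxDepth(strategy.maxDepth)
val model: DecisionTreeRegressionModel = dt.fit(df)
```

The trade-off is an extra RDD-to-DataFrame conversion inside the boosting loop, which is part of why the RDD-based overload was proposed.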

Contributor

This seems fine for now as it is private. The main idea with this PR will be to start the migration. As part of the remaining steps in SPARK-12326, there should be plenty of opportunity to clean things up.

@SparkQA

SparkQA commented Jan 5, 2016

Test build #48795 has finished for PR 10607 at commit b2490e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2016

Test build #49197 has finished for PR 10607 at commit 2ac6219.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 12, 2016

Test build #49202 has finished for PR 10607 at commit dc7d3eb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 14, 2016

Test build #49403 has finished for PR 10607 at commit cbbb92d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Contributor Author

sethah commented Jan 20, 2016

ping @jkbradley

I have updated the comments with a link to performance testing. It was my first time using spark-perf so please let me know if I need to reconfigure and run again (or if the cluster size is not sufficient). Also if I need to aggregate the results more cleanly (e.g. into a summary table) I can do that as well.

@MLnick
Contributor

MLnick commented Feb 16, 2016

@sethah I did find the perf-test results very difficult to read. Would it be ok to summarize into a readable table to make it easier to compare the before and after numbers (for posterity)?

@sethah
Contributor Author

sethah commented Feb 16, 2016

@MLnick I updated the doc with cleaner results. I can do some further analytics on the results if needed. I wanted to make sure the test setup was valid first.

/**
* Method to train a gradient boosting model
* @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
* @return a gradient boosted trees model that can be used for prediction
Contributor

Docstring for return should be updated to reflect the new return tuple (rather than a model class).
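For example, the updated scaladoc could read as follows (a sketch, assuming the method now returns the ensemble's trees and their weights as a pair):

```scala
/**
 * Method to train a gradient boosting model
 * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
 * @return tuple of ensemble models and weights:
 *         (array of decision tree models, array of model weights)
 */
```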

@MLnick
Contributor

MLnick commented Mar 2, 2016

@sethah from my reading of the perf test results, there doesn't appear to be any major difference between before and after (in most cases it seems same or slightly better, in a few cases slightly worse), so no regressions that I can see.

It may be a good idea to try some larger scale tests, if it's possible for you to get the cluster resources for that?

@sethah
Contributor Author

sethah commented Mar 2, 2016

@MLnick thanks for your comments! I have updated the scaladoc return types to reflect the tuple. I will look into running bulkier performance tests soon, hopefully we can continue reviewing in the meantime.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52324 has finished for PR 10607 at commit 7c5f384.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Mar 4, 2016

@sethah This looks fine to me though there are merge conflicts that need to be resolved.

It would be good to get this in ASAP so the work (and clean up that can happen) in SPARK-12381 and SPARK-12382 can begin.

@jkbradley can you take a quick pass?

@SparkQA

SparkQA commented Mar 8, 2016

Test build #52678 has finished for PR 10607 at commit 74ee41f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Mar 14, 2016

@sethah If no further comments overnight, I'm going to merge this so you can move ahead with porting the helper classes to ML and removing the old MLlib impl. As part of those tickets, I think we can clean up this ML impl and its interfaces if required (e.g. we could look at removing the private[ml] train method in favour of one in MLlib that converts RDDs to DataFrames and calls ML, and we can make more things private where possible). But I think it'll be a lot easier to clean things up once everything is in ML.

If any other issues arise from this (including perf regression from larger scale perf-tests) we can also clean up subsequently.

@sethah
Contributor Author

sethah commented Mar 15, 2016

I'm still working on getting some performance tests on a larger cluster up and running. I can continue this effort in parallel as the other changes progress, or I can try to expedite if we feel the tests are blocking the rest of these efforts.

@MLnick
Contributor

MLnick commented Mar 15, 2016

@sethah given that it's a copy of the MLlib impl, and given the passing tests and the smaller spark-perf results, I'm comfortable moving ahead. We should still run the larger scale tests ASAP to check, but I don't think they should be a blocker.

@asfgit closed this in dafd70f on Mar 15, 2016
@MLnick
Contributor

MLnick commented Mar 15, 2016

@sethah merged to master. Ping me when the remaining work on SPARK-12326 is ready for review.

@sethah
Contributor Author

sethah commented Mar 17, 2016

ping @MLnick @jkbradley

I got some performance test results on a larger cluster (3x100G) and linked them in the original comment at the top. To be honest, I'm not great at interpreting these results or knowing what normal variance looks like, so I have a hard time drawing a solid conclusion from them. Let me know if they seem adequate.

@jkbradley
Member

jkbradley commented Mar 17, 2016
Thanks for doing this migration. I checked the PR and it LGTM

Your tests look good to me. The tests all seem fairly close, except for a couple of outliers, but even those seem within a standard deviation or so (the 2nd value in spark-perf results). Thanks for running them!

Also @MLnick

As part of those tickets, I think we can clean up this ML impl and its interfaces if required (e.g. we could look at removing the private[ml] train method in favour of one in MLlib that converts RDDs to DataFrames and calls ML, and we can make more things private where possible). But I think it'll be a lot easier to clean things up once everything is in ML.

If the ML implementation uses RDDs underneath, it will be nice to call directly into that implementation from spark.mllib in order to avoid serialization overhead.
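Concretely, the spark.mllib wrapper could delegate to the new implementation without round-tripping through a DataFrame. A sketch of that idea (class and method names here, such as `NewGBT`, are illustrative assumptions, not the merged code):

```scala
// Hypothetical sketch: the old mllib entry point calls the ported
// RDD-based ml implementation directly, then converts the resulting
// ml trees back to the old model class. No DataFrame is created, so
// there is no extra serialization/conversion overhead.
def run(input: RDD[LabeledPoint]): GradientBoostedTreesModel = {
  val (trees, treeWeights) =
    NewGBT.run(input, boostingStrategy, seed.toLong)
  new GradientBoostedTreesModel(
    boostingStrategy.treeStrategy.algo,
    trees.map(_.toOld),
    treeWeights)
}
```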

@MLnick
Contributor

MLnick commented Mar 18, 2016

Sure, makes sense - it was my impression that the ML impl would be improved, and I assumed part of that may involve using more of DataFrames/Datasets, hence my comment.
@jkbradley
Member

Oh I see. That will be great to do eventually, but there are some issues right now supporting iterative algorithms using DataFrames (b/c of query plans growing very large). Those will be addressed, but it's (sort of) a blocker for now.

roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016

Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#10607 from sethah/SPARK-12379.