[SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.ml #10607
Conversation
cc @holdenk @jkbradley Could you review when you get a chance?
@@ -87,6 +87,14 @@ final class DecisionTreeRegressor @Since("1.4.0") (@Since("1.4.0") override val
    trees.head.asInstanceOf[DecisionTreeRegressionModel]
  }

  /** (private[ml]) Train a decision tree on an RDD */
  private[ml] def train(data: RDD[LabeledPoint],
GBTs in spark.ml are handled by converting a DataFrame to an RDD of LabeledPoint and then working with that during training. I added a new train method to accept an RDD that can be used to train the trees in the GBT ensemble. I appreciate feedback on this approach or alternative approaches.
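For illustration, the new entry point could look roughly like the sketch below. This is a sketch only, based on the description above: the parameter list and the call into the existing tree runner are assumptions, not the exact PR code.

```scala
// Sketch, not the exact PR code: an internal entry point on DecisionTreeRegressor
// that accepts the RDD[LabeledPoint] form used inside the GBT boosting loop,
// bypassing the DataFrame-based fit() path.
private[ml] def train(
    data: RDD[LabeledPoint],
    oldStrategy: OldStrategy): DecisionTreeRegressionModel = {
  // Reuse the existing single-tree runner on the RDD directly (assumed call shape).
  val trees = RandomForest.run(data, oldStrategy, numTrees = 1,
    featureSubsetStrategy = "all", seed = 0L)
  trees.head.asInstanceOf[DecisionTreeRegressionModel]
}
```

The point of the RDD-accepting overload is that the boosting loop retrains on reweighted labels every iteration, so keeping the per-iteration tree training off the DataFrame conversion path avoids repeated conversion overhead.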
I like this approach - and it looks to mirror the approach taken when ALS was ported over (namely, there is a train function in the new ALS code marked as a developer API taking the old format of inputs). We could also convert the RDD of LabeledPoints to a DataFrame (which is something I remember being asked to do in one of my previous PRs). @jkbradley & @dbtsai what are your thoughts on this?
This seems fine for now as it is private. The main idea with this PR will be to start the migration. As part of the remaining steps in SPARK-12326, there should be plenty of opportunity to clean things up.
Test build #48795 has finished for PR 10607 at commit
Test build #49197 has finished for PR 10607 at commit
Test build #49202 has finished for PR 10607 at commit
Test build #49403 has finished for PR 10607 at commit
ping @jkbradley I have updated the comments with a link to performance testing. It was my first time using spark-perf, so please let me know if I need to reconfigure and run again (or if the cluster size is not sufficient). Also, if I need to aggregate the results more cleanly (e.g. into a summary table) I can do that as well.

@sethah I did find the perf-test results very difficult to read. Would it be ok to summarize into a readable table to make it easier to compare the before and after numbers (for posterity)?

@MLnick I updated the doc with cleaner results. I can do some further analysis on the results if needed. I wanted to make sure the test setup was valid first.
/**
 * Method to train a gradient boosting model
 * @param input Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
 * @return a gradient boosted trees model that can be used for prediction
Docstring for return should be updated to reflect the new return tuple (rather than a model class).
@sethah from my reading of the perf test results, there doesn't appear to be any major difference between before and after (in most cases it seems the same or slightly better, in a few cases slightly worse), so no regressions that I can see. It may be a good idea to try some larger scale tests, if it's possible for you to get the cluster resources for that?

@MLnick thanks for your comments! I have updated the scaladoc return types to reflect the tuple. I will look into running bulkier performance tests soon; hopefully we can continue reviewing in the meantime.
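Concretely, the return-type change discussed above means the internal boosting routine hands back the ensemble pieces as a tuple rather than a wrapped model class. A rough sketch follows; the method and type names are assumed from the discussion, not copied from the PR:

```scala
// Sketch: the shared boosting loop returns (trees, treeWeights) so that both
// spark.ml and spark.mllib callers can wrap the result in their own model type.
private[ml] def boost(
    input: RDD[LabeledPoint],
    boostingStrategy: OldBoostingStrategy)
  : (Array[DecisionTreeRegressionModel], Array[Double]) = {
  // ... boosting loop: fit one DecisionTreeRegressionModel per iteration,
  // record its weight, and retrain on the updated pseudo-residuals ...
}

// A spark.ml caller would then assemble its model from the tuple, e.g.:
// val (trees, treeWeights) = boost(oldDataset, boostingStrategy)
// new GBTRegressionModel(uid, trees, treeWeights)
```

Returning the raw pieces keeps the boosting core API-agnostic, which matches the migration goal of letting spark.mllib delegate to the spark.ml implementation later.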
Test build #52324 has finished for PR 10607 at commit
@sethah This looks fine to me, though there are merge conflicts that need to be resolved. It would be good to get this in ASAP so the work (and clean up that can happen) in SPARK-12381 and SPARK-12382 can begin. @jkbradley can you take a quick pass?
Test build #52678 has finished for PR 10607 at commit
@sethah If no further comments overnight, I'm going to merge this so you can move ahead with porting the helper classes to ML and removing the old MLLIB impl. As part of those tickets, I think we can clean up this ML impl and interfaces if required (e.g. we could look at removing the

If any other issues arise from this (including perf regression from larger scale perf-tests) we can also clean up subsequently.
I'm still working on getting some performance tests on a larger cluster up and running. I can continue this effort in parallel as the other changes progress, or I can try to expedite if we feel the tests are blocking the rest of these efforts.

@sethah Given it's a copy of the MLlib impl, and given the tests pass and the smaller spark-perf results look fine, I'm comfortable moving ahead. We should still run the larger scale tests ASAP to check, but I don't think it should be a blocker.

@sethah Merged to master. Ping me when the remaining work on SPARK-12326 is ready for review.

ping @MLnick @jkbradley I got some performance test results on a larger cluster (3x100G) and linked them in the original comment at the top. To be honest, I am not great at interpreting those results and what normal variance is, so I have a hard time drawing a solid conclusion from them. Let me know if those seem adequate.
Thanks for doing this migration. I checked the PR and it LGTM. Your tests look good to me: they all seem fairly close, except for a couple of outliers, but even those seem within a standard deviation or so (the 2nd value in spark-perf results). Thanks for running them!

Also @MLnick: if the ML implementation uses RDDs underneath, it will be nice to call directly into that implementation from spark.mllib in order to avoid serialization overhead.

Sure, makes sense - it was my impression that the ML impl would be improved
Oh I see. That will be great to do eventually, but there are some issues right now supporting iterative algorithms using DataFrames (b/c of query plans growing very large). Those will be addressed, but it's (sort of) a blocker for now. |
Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation. Performance testing should be done to ensure there were no regressions. Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing)

Author: sethah <seth.hendrickson16@gmail.com>

Closes apache#10607 from sethah/SPARK-12379.
Large scale performance tests are here