[SPARK-7000] [ml] Refactor prediction and tree abstractions to be under ml.prediction subpackage #5585

jkbradley · 2015-04-20T03:25:54Z

From JIRA:

spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added.

I refactored using IntelliJ.

The only additional changes I made were:

- Better doc for DecisionTreeExample to warn users that it can require more memory than run-example provides.
- Line 120 of treeParams.scala (a small correction for a warning).

CC: @mengxr

SparkQA · 2015-04-20T05:12:18Z

Test build #30574 has finished for PR 5585 at commit b2cb6ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.
This patch does not change any dependencies.

mengxr · 2015-04-20T16:49:11Z

@jkbradley I don't quite understand the purpose of this PR, e.g., why tree.Node should live under ml.prediction.impl. At a high level, what classes do we want to put under ml.prediction, and what do we mean by impl: developer API, or private classes that implement some interfaces?

jkbradley · 2015-04-20T17:01:25Z

It's mainly to reduce clutter in the spark.ml namespace. We'll get more and more items shared between classification and regression:

Public interfaces:
- Predictor (private now, but should be public later)
- tree abstractions: Node, Split, models
- ensembles: boosting & bagging

impl:
- tree params
- ensemble params

Once the prediction Dev APIs are made public (Predictor, etc.), then we'll have a spark.ml.prediction subpackage anyways. At that point, tree and ensemble abstractions seem like they would belong in that subpackage, rather than in the spark.ml namespace.

I'm OK if you prefer to keep these items in the .ml namespace, but if you're ambivalent, then I'd prefer fewer subpackages under spark.ml.
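
To make the "items shared between classification and regression" idea concrete, here is a minimal Scala sketch. It is only an illustration under stated assumptions: `PredictionModel`, `ThresholdClassifier`, `ConstantRegressor`, and `GenericUse` are hypothetical names, the nested objects stand in for subpackages, and none of this reproduces Spark's actual Predictor API.

```scala
// Hypothetical, simplified sketch; NOT Spark's actual Predictor API.
// Nested objects stand in for subpackages so the snippet is self-contained.
object prediction {
  // Shared abstraction: maps a feature vector to a numeric prediction.
  trait PredictionModel {
    def predict(features: Array[Double]): Double
  }
}

object classification {
  import prediction.PredictionModel

  // Toy classifier: thresholds one feature into label 0.0 or 1.0.
  final case class ThresholdClassifier(featureIndex: Int, threshold: Double)
      extends PredictionModel {
    def predict(features: Array[Double]): Double =
      if (features(featureIndex) > threshold) 1.0 else 0.0
  }
}

object regression {
  import prediction.PredictionModel

  // Toy regressor: always predicts a fixed constant.
  final case class ConstantRegressor(value: Double) extends PredictionModel {
    def predict(features: Array[Double]): Double = value
  }
}

object GenericUse {
  // Generic code depends only on the shared abstraction.
  def predictAll(
      model: prediction.PredictionModel,
      rows: Seq[Array[Double]]): Seq[Double] =
    rows.map(model.predict)
}
```

In this sketch the shared trait earns its own namespace only because `GenericUse.predictAll` calls classifiers and regressors through the same interface.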

jkbradley · 2015-04-20T17:04:08Z

Oh, I misread one thing you wrote: tree.Node lives under ml.prediction (not ml.prediction.impl) since it's a public interface.

jkbradley · 2015-04-20T17:05:20Z

One more thought: later on, I could imagine us having other types of trees, such as for hierarchical clustering. Those would live under the ml.clustering namespace.

mengxr · 2015-04-20T20:57:26Z

prediction sounds too general here, and I don't know what should go into this package. Many models can make predictions, but only tree nodes are under prediction now.

jkbradley · 2015-04-20T21:14:45Z

Ok, so you'd vote for having separate subpackages for each type of classification/prediction abstraction?

- ml.prediction.Predictor (once it is public)
- ml.tree.*
- ml.ensembles.* (once we add general boosting, bagging)

(There may be more which are not on the roadmap.)

mengxr · 2015-04-20T22:09:05Z

ml.tree and ml.ensemble look good. If we want to distinguish decision trees from the tree elements used in hierarchical clustering, we can put them under separate packages, e.g., ml.tree and ml.clustering.hierarchical. It is not necessary to create common base classes if the subclasses are not expected to be called in a generic way.
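
As a rough illustration of keeping the two kinds of tree separate, the Scala sketch below defines unrelated node types with no common base class. All names here (`Node`, `LeafNode`, `InternalNode`, `ClusterNode`, `ClusterLeaf`, `ClusterMerge`) are hypothetical and simplified, and the nested objects merely stand in for the ml.tree and ml.clustering.hierarchical packages; this is not Spark's actual code.

```scala
// Hypothetical sketch; names are illustrative, not Spark's actual classes.
// Nested objects stand in for packages so the snippet is self-contained.
object tree {
  // Decision-tree node: either a leaf prediction or a split on one feature.
  sealed trait Node {
    def predict(features: Array[Double]): Double
  }

  final case class LeafNode(prediction: Double) extends Node {
    def predict(features: Array[Double]): Double = prediction
  }

  final case class InternalNode(
      featureIndex: Int,
      threshold: Double,
      left: Node,
      right: Node) extends Node {
    // Go left when the feature value is at or below the threshold.
    def predict(features: Array[Double]): Double =
      if (features(featureIndex) <= threshold) left.predict(features)
      else right.predict(features)
  }
}

object clustering {
  object hierarchical {
    // Cluster-tree node: has no notion of prediction, only cluster size.
    sealed trait ClusterNode {
      def size: Int
    }

    final case class ClusterLeaf(pointId: Long) extends ClusterNode {
      val size: Int = 1
    }

    final case class ClusterMerge(
        left: ClusterNode,
        right: ClusterNode,
        distance: Double) extends ClusterNode {
      val size: Int = left.size + right.size
    }
  }
}
```

Because nothing in the sketch needs to treat a `tree.Node` and a `ClusterNode` interchangeably, a shared supertype would add coupling without enabling any generic call.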

What do we want to put under ml.prediction besides Predictor?

jkbradley · 2015-04-20T22:21:47Z

I'm not sure what else would go under ml.prediction. I have, however, started to wonder if evaluation metrics should sit under the relevant subpackage (to make it easier for users to match evaluators with models), in which case there might be an evaluation abstraction under ml.prediction.

jkbradley · 2015-04-21T18:22:42Z

Closing this pending further discussion.

jkbradley · 2015-04-21T18:25:19Z

I'm copying the 2 small edits to #5567.

jkbradley added 2 commits April 19, 2015 15:19

Refactored prediction and tree abstractions to be under ml.prediction sub-package

775cecf

Fixed build issues after refactor, and improved doc for DecisionTreeExample

b2cb6ad

jkbradley mentioned this pull request Apr 20, 2015

[SPARK-6113] [ml] Small cleanups after original tree API PR #5567

Closed

jkbradley closed this Apr 21, 2015

jkbradley deleted the dt-api-dt3 branch May 4, 2015 23:02
