Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-7000] [ml] Refactor prediction and tree abstractions to be under ml.prediction subpackage #5585

Closed
wants to merge 2 commits into from

Conversation

jkbradley
Copy link
Member

From JIRA:

spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added.

I refactored using IntelliJ.

The only additional changes I made were:

  • Better doc for DecisionTreeExample to warn users that it can require more memory than run-example provides.
  • line 120 of treeParams.scala (a small correction for a warning)

CC: @mengxr

@SparkQA
Copy link

SparkQA commented Apr 20, 2015

Test build #30574 has finished for PR 5585 at commit b2cb6ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2015

@jkbradley I don't quite understand the purpose of this PR, e.g., why tree.Node should live under ml.prediction.impl. On the high-level, what classes do we want to put under ml.prediction, and what do we mean by impl, developer API or private classes that implements some interfaces?

@jkbradley
Copy link
Member Author

It's mainly to reduce clutter in the spark.ml namespace. We'll get more and more items shared between classification and regression:

  • public interfaces
    • Predictor (private now, but should be public later)
    • tree abstractions: Node, Split, models
    • ensembles: boosting & bagging
  • impl
    • tree params
    • ensemble params

Once the prediction Dev APIs are made public (Predictor, etc.), then we'll have a spark.ml.prediction subpackage anyways. At that point, tree and ensemble abstractions seem like they would belong in that subpackage, rather than in the spark.ml namespace.

I'm OK if you prefer to keep these items in the .ml namespace, but if you're ambivalent, then I'd prefer fewer subpackages under spark.ml

@jkbradley
Copy link
Member Author

Oh, I misread one thing you wrote: tree.Node lives under ml.prediction (not ml.prediction.impl) since it's a public interface.

@jkbradley
Copy link
Member Author

One more thought: Later on, I could imagine us having other types of trees, such as for hierarchical clustering. Those would live under the ml.clustering namespace

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2015

prediction sounds too general here, and I don't know what should go into this package. Many models can make predictions, but only tree nodes are under prediction now.

@jkbradley
Copy link
Member Author

Ok, so you'd vote for having separate subpackages for each type of classification/prediction abstraction?

  • ml.prediction.Predictor (once it is public)
  • ml.tree.*
  • ml.ensembles.* (once we add general boosting, bagging)
  • (There may be more which are not on the roadmap.)

@mengxr
Copy link
Contributor

mengxr commented Apr 20, 2015

ml.tree and ml.ensemble look good. If we want to distinguish decision tree from tree elements used in hierarchical clustering, we can put them under separate packages, e.g., ml.tree and ml.clustering.hierachical. It is not necessary to create common base classes if the subclasses are not expected to be called in a generic way.

What do we want to put under ml.prediction beside Predictor?

@jkbradley
Copy link
Member Author

I'm not sure what else would go under ml.prediction. I have, however, started to wonder if evaluation metrics should sit under the relevant subpackage (to make it easier for users to matches evaluators with models), in which case there might be an evaluation abstraction under ml.prediction.

@jkbradley
Copy link
Member Author

Closing this pending discussions

@jkbradley jkbradley closed this Apr 21, 2015
@jkbradley
Copy link
Member Author

I'm copying the 2 small edits to [https://github.com//pull/5567]

@jkbradley jkbradley deleted the dt-api-dt3 branch May 4, 2015 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants