[SPARK-5597] [mllib] Save/load for Decision Trees and ensembles #4444

Closed
jkbradley wants to merge 3 commits from the ml-io-trees branch

Conversation

jkbradley
Member

This adds save/load methods for trees and ensembles. For details on the design, please see the design doc.

Notes:

  • This also modifies/adds for InformationGainStats and Predict.
  • There are a bunch of case classes for model formats. We could potentially change the model classes themselves to be case classes, rather than creating new ones. I'd be OK with that, though it's an API change.

CC: @mengxr

@SparkQA

SparkQA commented Feb 6, 2015

Test build #26963 has finished for PR 4444 at commit 45873a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeModel(val topNode: Node, val algo: Algo) extends Serializable with Saveable
    • case class PredictData(predict: Double, prob: Double)
    • case class InformationGainStatsData(
    • case class SplitData(
    • case class NodeData(
    • case class NodeWithKids(node: Node, leftChildId: Int, rightChildId: Int)
    • case class Metadata(
    • case class EnsembleNodeData(treeId: Int, node: NodeData)
    • case class NodeData(

@mengxr
Contributor

mengxr commented Feb 9, 2015

To help review, this is the schema for saving a tree node:

id: Int
predict/
       |- predict: Double
       |- prob: Double
impurity: Double
isLeaf: Boolean
split/
     |- feature: Int
     |- threshold: Double
     |- featureType: Int
     |- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
stats/
     |- gain: Double
     |- impurity: Double
     |- leftImpurity: Double
     |- rightImpurity: Double
     |- leftPredict/
                   |- predict: Double
                   |- prob: Double
     |- rightPredict/
                    |- predict: Double
                    |- prob: Double
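
As a reading aid, the schema above could be mirrored by nested case classes roughly as follows. This is only a sketch: the class names echo the ones SparkQA listed, but the field lists are taken from the schema rather than from the patch, and modelling the nullable `Integer` ids and the leaf-only absence of `split`/`stats` as `Option` is an assumption. `Seq[Double]` stands in for `Array[Double]` so that the example values compare structurally.

```scala
// Hypothetical case classes mirroring the node schema above.
// Field names follow the schema; Option-typed fields are an assumption.
case class PredictData(predict: Double, prob: Double)

case class SplitData(
    feature: Int,
    threshold: Double,
    featureType: Int,
    categories: Seq[Double])

case class InformationGainStatsData(
    gain: Double,
    impurity: Double,
    leftImpurity: Double,
    rightImpurity: Double,
    leftPredict: PredictData,
    rightPredict: PredictData)

case class NodeData(
    id: Int,
    predict: PredictData,
    impurity: Double,
    isLeaf: Boolean,
    split: Option[SplitData],
    leftNodeId: Option[Int],
    rightNodeId: Option[Int],
    stats: Option[InformationGainStatsData])

// Example rows: one internal node and one of its leaf children.
val leftLeaf = NodeData(2, PredictData(0.0, 1.0), 0.0, isLeaf = true,
  split = None, leftNodeId = None, rightNodeId = None, stats = None)

val root = NodeData(1, PredictData(1.0, 0.6), 0.48, isLeaf = false,
  split = Some(SplitData(feature = 0, threshold = 0.5, featureType = 0, categories = Seq.empty)),
  leftNodeId = Some(2), rightNodeId = Some(3),
  stats = Some(InformationGainStatsData(0.3, 0.48, 0.0, 0.2,
    PredictData(0.0, 1.0), PredictData(1.0, 0.9))))
```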

@mengxr
Contributor

mengxr commented Feb 9, 2015

The question is whether we want to save stats, which, apart from gain, contains only information that is already stored in other nodes. (Correct me if I'm wrong.) I'm considering replacing stats with infoGain: Double and reconstructing the other fields of InformationGainStats at load(). The schema becomes:

id: Int
predict/
       |- predict: Double
       |- prob: Double
impurity: Double
isLeaf: Boolean
split/
     |- feature: Int
     |- threshold: Double
     |- featureType: Int
     |- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
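
To illustrate the idea, here is a minimal sketch of how the dropped fields might be rebuilt at load time from the node's own impurity and its children's impurity and prediction. The names `SavedNode`, `Stats`, and `rebuildStats` are hypothetical, the split field is omitted for brevity, and treating `infoGain` and the child ids as `Option` (absent on leaves) is an assumption rather than something stated in this thread.

```scala
// Hypothetical types for illustration; not the classes from the patch.
case class Predict(predict: Double, prob: Double)

case class SavedNode(
    id: Int,
    predict: Predict,
    impurity: Double,
    isLeaf: Boolean,
    leftNodeId: Option[Int],
    rightNodeId: Option[Int],
    infoGain: Option[Double])

case class Stats(
    gain: Double,
    impurity: Double,
    leftImpurity: Double,
    rightImpurity: Double,
    leftPredict: Predict,
    rightPredict: Predict)

// Rebuild the stats of one node from the stored gain plus the child rows.
// Returns None for leaves or when a child row is missing.
def rebuildStats(node: SavedNode, byId: Map[Int, SavedNode]): Option[Stats] =
  for {
    gain  <- node.infoGain
    left  <- node.leftNodeId.flatMap(byId.get)
    right <- node.rightNodeId.flatMap(byId.get)
  } yield Stats(gain, node.impurity, left.impurity, right.impurity,
    left.predict, right.predict)

val nodes = Seq(
  SavedNode(1, Predict(1.0, 0.6), 0.5, isLeaf = false, Some(2), Some(3), Some(0.3)),
  SavedNode(2, Predict(0.0, 1.0), 0.0, isLeaf = true, None, None, None),
  SavedNode(3, Predict(1.0, 0.9), 0.1, isLeaf = true, None, None, None))
val byId = nodes.map(n => n.id -> n).toMap

val rootStats = rebuildStats(byId(1), byId)
val leafStats = rebuildStats(byId(2), byId)
```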

asfgit pushed a commit that referenced this pull request Feb 10, 2015
This is based on #4444 from jkbradley with the following changes:

1. Node schema updated to
   ~~~
treeId: Int
nodeId: Int
predict/
       |- predict: Double
       |- prob: Double
impurity: Double
isLeaf: Boolean
split/
     |- feature: Int
     |- threshold: Double
     |- featureType: Int
     |- categories: Array[Double]
leftNodeId: Integer
rightNodeId: Integer
infoGain: Double
~~~

2. Some refactoring of the implementation.
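
With `treeId` in each row, an ensemble can be stored as one flat collection of node rows. The sketch below shows one way such rows might be turned back into trees at load time: group by `treeId`, index by `nodeId`, and recurse from the root. `NodeRow`, `Tree`, `buildTree`, and `buildEnsemble` are hypothetical names, only a subset of the schema fields is kept, and the root id of 1 is an assumption.

```scala
// Hypothetical flat row; only the fields needed to rebuild the shape.
case class NodeRow(
    treeId: Int,
    nodeId: Int,
    predict: Double,
    isLeaf: Boolean,
    leftNodeId: Option[Int],
    rightNodeId: Option[Int])

sealed trait Tree
case class Leaf(id: Int, predict: Double) extends Tree
case class Internal(id: Int, left: Tree, right: Tree) extends Tree

// Rebuild a single tree from its rows, starting at the assumed root id.
def buildTree(rows: Seq[NodeRow], rootId: Int = 1): Tree = {
  val byId = rows.map(r => r.nodeId -> r).toMap
  def build(id: Int): Tree = {
    val r = byId(id)
    if (r.isLeaf) Leaf(id, r.predict)
    // Internal nodes are assumed to carry both child ids.
    else Internal(id, build(r.leftNodeId.get), build(r.rightNodeId.get))
  }
  build(rootId)
}

// Split the flat rows by treeId and rebuild the trees in treeId order.
def buildEnsemble(rows: Seq[NodeRow]): Seq[Tree] =
  rows.groupBy(_.treeId).toSeq.sortBy(_._1).map {
    case (_, treeRows) => buildTree(treeRows)
  }

val rows = Seq(
  NodeRow(0, 1, 0.0, isLeaf = false, Some(2), Some(3)),
  NodeRow(0, 2, 0.0, isLeaf = true, None, None),
  NodeRow(0, 3, 1.0, isLeaf = true, None, None),
  NodeRow(1, 1, 1.0, isLeaf = true, None, None))

val trees = buildEnsemble(rows)
```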

Closes #4444.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #4493 from mengxr/SPARK-5597 and squashes the following commits:

75e3bb6 [Xiangrui Meng] fix style
2b0033d [Xiangrui Meng] update tree export schema and refactor the implementation
45873a2 [Joseph K. Bradley] org imports
1d4c264 [Joseph K. Bradley] Added save/load for tree ensembles
dcdbf85 [Joseph K. Bradley] added save/load for decision tree but need to generalize it to ensembles

(cherry picked from commit ef2f55b)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@asfgit asfgit closed this in ef2f55b Feb 10, 2015
@jkbradley jkbradley deleted the ml-io-trees branch May 4, 2015 22:58