[SPARK-34047][ML] tree models saving: compute numParts according to numNodes #31090
Conversation
Kubernetes integration test starting
Test build #133831 has finished for PR 31090 at commit
Kubernetes integration test status success
ping @srowen
Do we do this for other models? My only concern is whether this makes it harder to load when the model is large. These are parquet files, so it's not likely anything would want to read them as a single file, the way CSV output maybe should be.
@srowen reasonable.
Yes, for most classification and regression models, we save them as a single partition.
Sounds like a reasonable heuristic to me.
Kubernetes integration test starting
Test build #133992 has finished for PR 31090 at commit
Kubernetes integration test status failure
@srowen For RF & GBT, maybe we can determine the number of partitions by the total number of tree nodes?
I think that's kind of arbitrary. I suppose if anything we should follow suit and save 1 partition per tree, by this logic. I'd simply favor making whatever change improves consistency.
I just created another RF model with 10 trees and 2,789,824 nodes in total:
saved it to disk, its size is 49M.
Since the model size is proportional to the number of nodes, what about determining the number of partitions with a formula like
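A formula along these lines could be sketched as follows (a sketch only, not the eventual Spark change; the object and parameter names here are hypothetical, and the 2,000,000-nodes-per-partition figure is taken from the ~40MB measurement discussed in this thread):

```scala
// Hypothetical sketch: scale the number of parquet partitions with the
// total number of tree nodes, so small models stay in one file while
// very large forests are split up.
object TreeModelSaveSketch {
  // Assumption from this thread: ~2,000,000 nodes is about 40MB of parquet.
  val NodesPerPartition: Double = 2000000.0

  def numDataParts(numNodes: Long): Int =
    math.max(1, math.ceil(numNodes / NodesPerPartition).toInt)
}
```

Under this sketch, the 2,789,824-node model above would be written as 2 partitions, and any model under 2M nodes as a single partition.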
The current impl doesn't save one tree per partition; do you mean changing
Hm, the description says this is all to make GBT/DT consistent with other impls that save in 1 partition? That's a fine reason to make this change; I'm saying that seems like fine logic. Basing it on node count also seems healthy if you want to change all implementations of tree models to work that way.
I prefer determining the numParts by numNodes; I will update the description and PR.
Kubernetes integration test starting
Test build #134173 has finished for PR 31090 at commit
Kubernetes integration test status failure
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
      val (nodeData, _) = NodeData.build(instance.rootNode, 0)
      val dataPath = new Path(path, "data").toString
      sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
      // 2,000,000 nodes is about 40MB
      val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt
OK - my rule of thumb about partition sizes is "128MB" going back to the days of Hadoop. Any number in that range is about as good as the next, but I might increase this.
ok, I will increase this
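Working backwards from the measurement earlier in the thread (49M on disk for 2,789,824 nodes, i.e. roughly 18 bytes per node), a 128MB target partition corresponds to roughly 7.3 million nodes. A sketch of that arithmetic (the variable names are illustrative only):

```scala
// Sketch: derive a nodes-per-partition figure from the measured model size.
// Assumption from this thread: a 2,789,824-node model is 49M on disk.
val bytesPerNode = 49.0 * 1024 * 1024 / 2789824        // roughly 18 bytes per node
val nodesPer128MB = 128.0 * 1024 * 1024 / bytesPerNode // roughly 7.3 million nodes
```

This is consistent with the 7,280,000 constant used in the updated patch.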
Kubernetes integration test starting
Kubernetes integration test status success
Test build #134209 has finished for PR 31090 at commit
@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
      DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
      val (nodeData, _) = NodeData.build(instance.rootNode, 0)
      val dataPath = new Path(path, "data").toString
      sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
      // 7,280,000 nodes is about 128MB
      val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt
Is there any easy place to expose a small shared method for this rather than duplicate it in several places?
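One way such a shared method could look (a sketch only, not the actual change merged into Spark; the object name, method name, and visibility here are hypothetical):

```scala
// Hypothetical shared helper for the DecisionTree/RandomForest/GBT writers,
// so the partition-count heuristic lives in one place instead of being
// duplicated per model class.
private[ml] object NodeDataPartitioner {
  // Assumption from this thread: ~7,280,000 nodes is about 128MB of parquet.
  val NodesPerPartition: Double = 7280000.0

  def numDataParts(numNodes: Long): Int =
    math.max(1, math.ceil(numNodes / NodesPerPartition).toInt)
}

// A writer would then use it along these lines (sketch):
//   sparkSession.createDataFrame(nodeData)
//     .repartition(NodeDataPartitioner.numDataParts(instance.numNodes))
//     .write.parquet(dataPath)
```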
Kubernetes integration test starting
Test build #134265 has finished for PR 31090 at commit
Kubernetes integration test status success
Merged to master, thanks @srowen for reviewing!
Closes apache#31090 from zhengruifeng/treemodel_single_part.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
What changes were proposed in this pull request?
Determine the numParts according to numNodes.
Why are the changes needed?
Current model saving may generate too many small files, and a tree model can be too large for a single partition (a RandomForestClassificationModel with numTrees=100 and depth=20 is 226M).
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suites.