
[SPARK-34047][ML] tree models saving: compute numParts according to numNodes #31090

Closed

Conversation

@zhengruifeng (Contributor) commented Jan 8, 2021

What changes were proposed in this pull request?

Determine the number of data partitions (numParts) according to the number of tree nodes (numNodes).
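A rough sketch of the resulting write path (the repartition call and the 7,280,000-node threshold settled on later in this thread are assumptions here, not a verbatim quote of the patch):

// Sketch only: size the saved node data by node count rather than
// inheriting whatever partitioning the node data happens to have.
val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt  // ~128MB of NodeData per partition
sparkSession.createDataFrame(nodeData)
  .repartition(numDataParts)
  .write.parquet(dataPath)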

Why are the changes needed?

The current model saving code may generate too many small files, and a tree model can also be too large for a single partition (a RandomForestClassificationModel with numTrees=100 and depth=20 is about 226MB).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

@SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/

@SparkQA commented Jan 8, 2021

Test build #133831 has finished for PR 31090 at commit ef555de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/

The github-actions bot added the ML label on Jan 8, 2021.
@zhengruifeng (Contributor, Author)

ping @srowen

@srowen (Member) left a comment:

Do we do this for other models? My only concern is whether this makes it harder to load when the model is large. These are Parquet files, so it's unlikely anything would need to read them from a single file, in the way that CSV output sometimes should be.

@zhengruifeng (Contributor, Author) commented Jan 13, 2021

@srowen That's reasonable.
I just created a RandomForestClassificationModel with numTrees=100 and depth=20 and found that the model size is 226MB. So I think for RF and GBT we should keep the current behavior.
But for a DecisionTree, whose size is definitely small enough (I also created a decision tree with depth=30; its size is 3.9MB), I think it is safe to use a single partition.

@zhengruifeng changed the title from "[SPARK-34047][ML] save tree model in single partition" to "[SPARK-34047][ML] save decisiontree model in single partition" on Jan 13, 2021.
@zhengruifeng (Contributor, Author)

Do we do this for other models?

Yes, for most classification and regression models we save them in a single partition.

@srowen (Member) commented Jan 13, 2021

Sounds like a reasonable heuristic to me.

@SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/

@SparkQA commented Jan 13, 2021

Test build #133992 has finished for PR 31090 at commit 9f6dffa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/

@zhengruifeng (Contributor, Author) commented Jan 14, 2021

@srowen For RF & GBT, maybe we can determine the number of partitions from the total number of tree nodes?

@srowen (Member) commented Jan 14, 2021

I think that's kind of arbitrary. I suppose, if anything, we should follow suit and save one partition per tree, by this logic. I'd simply favor making whatever change improves consistency.

@zhengruifeng (Contributor, Author)

I just created another RF model with 10 trees and 2,789,824 nodes in total:

scala> rfcm.trees.length
res3: Int = 10

scala> rfcm.trees.map(_.numNodes).sum
res4: Int = 2789824

scala> rfcm.save("/tmp/rfcm")

Saving it to disk, its size is 49MB:

du -sh /tmp/rfcm 
49M	/tmp/rfcm

Since the model size is proportional to the number of nodes, what about determining the number of partitions with a formula like numNodes / 1,000,000?
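For illustration only (not from the session above), applying that formula to the model just measured:

scala> (rfcm.trees.map(_.numNodes).sum / 1000000.0).ceil.toInt
res5: Int = 3

i.e. the 49MB model would be written as 3 partitions.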

@zhengruifeng (Contributor, Author)

The current impl doesn't save one tree per partition. Do you mean changing sql.createDataFrame(nodeDataRDD).write.parquet(dataPath) to sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)? @srowen
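For clarity, the two alternatives read roughly like this (assuming the ensemble node data carries a treeID column, as the partitionBy call implies):

// Current behavior: one Parquet directory whose file count follows
// the partitioning of nodeDataRDD.
sql.createDataFrame(nodeDataRDD).write.parquet(dataPath)

// "One partition per tree": Hive-style directory partitioning by tree,
// i.e. data/treeID=0/, data/treeID=1/, ...
sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)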

@srowen (Member) commented Jan 14, 2021

Hm, the description says this is all to make GBT/DT consistent with other impls that save in one partition? That's a fine reason to make this change; I'm saying that seems like fine logic. Basing it on node count also seems healthy if you want to change all implementations of tree models to work that way.

@zhengruifeng (Contributor, Author)

I prefer determining numParts by numNodes; I will update the description and the PR.

@zhengruifeng changed the title from "[SPARK-34047][ML] save decisiontree model in single partition" to "[SPARK-34047][ML] tree models saving: compute numParts according to numNodes" on Jan 18, 2021.
@SparkQA commented Jan 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/

@SparkQA commented Jan 18, 2021

Test build #134173 has finished for PR 31090 at commit c4a77bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/

@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
// 2,000,000 nodes is about 40MB
val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt
@srowen (Member) commented on the diff:

OK - my rule of thumb about partition sizes is "128MB" going back to the days of Hadoop. Any number in that range is about as good as the next, but I might increase this.
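For reference, the measurement in the diff above (2,000,000 nodes ≈ 40MB, so roughly 21 bytes per node) scales to roughly 6 to 7 million nodes per ~128MB partition; the threshold adopted in the updated diff below is 7,280,000.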

@zhengruifeng (Contributor, Author) replied:

ok, I will increase this

@SparkQA commented Jan 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/

@SparkQA commented Jan 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/

@SparkQA commented Jan 19, 2021

Test build #134209 has finished for PR 31090 at commit 8d5b076.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
// 7,280,000 nodes is about 128MB
val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt
@srowen (Member) commented on the diff:

Is there any easy place to expose a small shared method for this rather than duplicate it in several places?
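A minimal sketch of what such a shared helper could look like (the object and method names below are hypothetical, not necessarily what was merged):

// Hypothetical shared helper; name and location are illustrative only.
private[ml] object TreeModelWriteUtils {
  // Roughly 7,280,000 NodeData rows fit in a ~128MB Parquet partition.
  private val NodesPerPartition = 7280000.0

  def inferNumPartitions(numNodes: Int): Int =
    (numNodes / NodesPerPartition).ceil.toInt.max(1)
}

The DecisionTree, RandomForest, and GBT writers could then all call TreeModelWriteUtils.inferNumPartitions(instance.numNodes) instead of repeating the constant.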

@SparkQA commented Jan 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/

@SparkQA commented Jan 20, 2021

Test build #134265 has finished for PR 31090 at commit 08733c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/

@zhengruifeng (Contributor, Author)

Merged to master, thanks @srowen for reviewing!

@zhengruifeng deleted the treemodel_single_part branch on January 21, 2021 02:30.
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
[SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Closes apache#31090 from zhengruifeng/treemodel_single_part.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>