
[SPARK-34047][ML] tree models saving: compute numParts according to numNodes #31090

Closed

Conversation

@zhengruifeng (Contributor) commented Jan 8, 2021

What changes were proposed in this pull request?

Determine the number of data partitions (numParts) according to the number of tree nodes (numNodes).
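A rough sketch of the resulting write path (the repartition call and the 7,280,000-node threshold settled on later in this thread are assumptions here, not a verbatim quote of the patch):

// Sketch only: size the saved node data by node count rather than
// inheriting whatever partitioning the node data happens to have.
val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt  // ~128MB of NodeData per partition
sparkSession.createDataFrame(nodeData)
  .repartition(numDataParts)
  .write.parquet(dataPath)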

Why are the changes needed?

The current model saving code may generate too many small files, and a tree model can also be too large for a single partition (a RandomForestClassificationModel with numTrees=100 and depth=20 is about 226MB).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suites.

@SparkQA commented Jan 8, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/

@SparkQA commented Jan 8, 2021

Test build #133831 has finished for PR 31090 at commit ef555de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 8, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38420/

The github-actions bot added the ML label on Jan 8, 2021.
@zhengruifeng (Contributor, Author)

ping @srowen

@srowen (Member) left a comment:

Do we do this for other models? My only concern is whether this makes it harder to load when the model is large. These are Parquet files, so it's unlikely anything would need to read them from a single file, in the way that CSV output sometimes should be.

@zhengruifeng (Contributor, Author) commented Jan 13, 2021

@srowen That's reasonable.
I just created a RandomForestClassificationModel with numTrees=100 and depth=20 and found that the model size is 226MB. So I think for RF and GBT we should keep the current behavior.
But for a DecisionTree, whose size is definitely small enough (I also created a decision tree with depth=30; its size is 3.9MB), I think it is safe to use a single partition.

@zhengruifeng changed the title from "[SPARK-34047][ML] save tree model in single partition" to "[SPARK-34047][ML] save decisiontree model in single partition" on Jan 13, 2021.
@zhengruifeng (Contributor, Author)

Do we do this for other models?

Yes, for most classification and regression models we save them in a single partition.

@srowen (Member) commented Jan 13, 2021

Sounds like a reasonable heuristic to me.

@SparkQA commented Jan 13, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/

@SparkQA commented Jan 13, 2021

Test build #133992 has finished for PR 31090 at commit 9f6dffa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 13, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38580/

@zhengruifeng (Contributor, Author) commented Jan 14, 2021

@srowen For RF & GBT, maybe we can determine the number of partitions from the total number of tree nodes?

@srowen (Member) commented Jan 14, 2021

I think that's kind of arbitrary. I suppose, if anything, we should follow suit and save one partition per tree, by this logic. I'd simply favor making whatever change improves consistency.

@zhengruifeng (Contributor, Author)

I just created another RF model with 10 trees and 2,789,824 nodes in total:

scala> rfcm.trees.length
res3: Int = 10

scala> rfcm.trees.map(_.numNodes).sum
res4: Int = 2789824

scala> rfcm.save("/tmp/rfcm")

Saving it to disk, its size is 49MB:

du -sh /tmp/rfcm 
49M	/tmp/rfcm

Since the model size is proportional to the number of nodes, what about determining the number of partitions with a formula like numNodes / 1,000,000?
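For illustration only (not from the session above), applying that formula to the model just measured:

scala> (rfcm.trees.map(_.numNodes).sum / 1000000.0).ceil.toInt
res5: Int = 3

i.e. the 49MB model would be written as 3 partitions.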

@zhengruifeng (Contributor, Author)

The current impl doesn't save one tree per partition. Do you mean changing sql.createDataFrame(nodeDataRDD).write.parquet(dataPath) to sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)? @srowen
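For clarity, the two alternatives read roughly like this (assuming the ensemble node data carries a treeID column, as the partitionBy call implies):

// Current behavior: one Parquet directory whose file count follows
// the partitioning of nodeDataRDD.
sql.createDataFrame(nodeDataRDD).write.parquet(dataPath)

// "One partition per tree": Hive-style directory partitioning by tree,
// i.e. data/treeID=0/, data/treeID=1/, ...
sql.createDataFrame(nodeDataRDD).write.partitionBy("treeID").parquet(dataPath)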

@srowen (Member) commented Jan 14, 2021

Hm, the description says this is all to make GBT/DT consistent with other impls that save in one partition? That's a fine reason to make this change; I'm saying that seems like fine logic. Basing it on node count also seems healthy if you want to change all implementations of tree models to work that way.

@zhengruifeng (Contributor, Author)

I prefer determining numParts by numNodes; I will update the description and the PR.

@zhengruifeng changed the title from "[SPARK-34047][ML] save decisiontree model in single partition" to "[SPARK-34047][ML] tree models saving: compute numParts according to numNodes" on Jan 18, 2021.
@SparkQA commented Jan 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/

@SparkQA commented Jan 18, 2021

Test build #134173 has finished for PR 31090 at commit c4a77bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 18, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38757/

@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
// 2,000,000 nodes is about 40MB
val numDataParts = (instance.numNodes / 2000000.0).ceil.toInt
@srowen (Member) commented on the diff:

OK - my rule of thumb about partition sizes is "128MB" going back to the days of Hadoop. Any number in that range is about as good as the next, but I might increase this.
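For reference, the measurement in the diff above (2,000,000 nodes ≈ 40MB, so roughly 21 bytes per node) scales to roughly 6 to 7 million nodes per ~128MB partition; the threshold adopted in the updated diff below is 7,280,000.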

@zhengruifeng (Contributor, Author) replied:

ok, I will increase this

@SparkQA commented Jan 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/

@SparkQA commented Jan 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38794/

@SparkQA commented Jan 19, 2021

Test build #134209 has finished for PR 31090 at commit 8d5b076.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -288,7 +288,9 @@ object DecisionTreeClassificationModel extends MLReadable[DecisionTreeClassifica
DefaultParamsWriter.saveMetadata(instance, path, sc, Some(extraMetadata))
val (nodeData, _) = NodeData.build(instance.rootNode, 0)
val dataPath = new Path(path, "data").toString
sparkSession.createDataFrame(nodeData).write.parquet(dataPath)
// 7,280,000 nodes is about 128MB
val numDataParts = (instance.numNodes / 7280000.0).ceil.toInt
@srowen (Member) commented on the diff:

Is there any easy place to expose a small shared method for this rather than duplicate it in several places?
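A minimal sketch of what such a shared helper could look like (the object and method names below are hypothetical, not necessarily what was merged):

// Hypothetical shared helper; name and location are illustrative only.
private[ml] object TreeModelWriteUtils {
  // Roughly 7,280,000 NodeData rows fit in a ~128MB Parquet partition.
  private val NodesPerPartition = 7280000.0

  def inferNumPartitions(numNodes: Int): Int =
    (numNodes / NodesPerPartition).ceil.toInt.max(1)
}

The DecisionTree, RandomForest, and GBT writers could then all call TreeModelWriteUtils.inferNumPartitions(instance.numNodes) instead of repeating the constant.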

@SparkQA commented Jan 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/

@SparkQA commented Jan 20, 2021

Test build #134265 has finished for PR 31090 at commit 08733c8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 20, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38851/

@zhengruifeng (Contributor, Author)

Merged to master, thanks @srowen for reviewing!

@zhengruifeng deleted the treemodel_single_part branch on January 21, 2021 02:30.
skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
[SPARK-34047][ML] tree models saving: compute numParts according to numNodes

Closes apache#31090 from zhengruifeng/treemodel_single_part.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>