[SPARK-33043][ML] Handle spark.driver.maxResultSize=0 in RowMatrix heuristic computation #29925

srowen · 2020-10-01T19:10:22Z

What changes were proposed in this pull request?

RowMatrix contains a heuristic computation based on spark.driver.maxResultSize. When this value is set to 0, the computation fails (log of 0). The fix handles this setting, which means an unlimited result size, by using a tree depth of 1 in the RowMatrix method.
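A minimal standalone sketch of the kind of guard described above, assuming a logarithm-based depth formula. The object name `DepthHeuristicSketch`, the method `idealDepth`, its parameters, and the clamp to the range 1 to 10 are illustrative assumptions, not the actual RowMatrix code. It only shows the zero case being handled before any logarithm is taken.

```scala
// Hypothetical sketch; names and the exact formula are assumptions for illustration.
object DepthHeuristicSketch {
  def idealDepth(numPartitions: Int, objectSizeBytes: Long, maxResultSizeBytes: Long): Int = {
    require(numPartitions > 0, "number of partitions must be positive")
    require(objectSizeBytes > 0, "aggregated object size must be positive")
    if (maxResultSizeBytes == 0) {
      // 0 means an unlimited result size, so the shallowest tree is acceptable.
      1
    } else {
      require(maxResultSizeBytes > objectSizeBytes,
        "aggregated object does not fit within the max result size")
      val numerator = math.log(numPartitions.toDouble)
      val denominator =
        math.log(maxResultSizeBytes.toDouble) - math.log(objectSizeBytes.toDouble)
      val depth = math.ceil(numerator / denominator)
      // Clamp to an assumed sensible range of depths.
      math.min(math.max(1.0, depth), 10.0).toInt
    }
  }
}
```

Under these assumptions, `DepthHeuristicSketch.idealDepth(100, 1000L, 0L)` takes the early branch and never evaluates a logarithm of 0.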

Why are the changes needed?

Simple bug fix to make several Spark ML functions which use RowMatrix run correctly in this case.

Does this PR introduce any user-facing change?

None, other than the bug fix itself.

How was this patch tested?

Existing RowMatrix tests plus a new test.

karenfeng · 2020-10-01T19:12:56Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

    require(aggregatedObjectSizeInBytes > 0,
      "Cannot compute aggregate depth heuristic based on a zero-size object to aggregate")

    val maxDriverResultSizeInBytes = rows.conf.get[Long](MAX_RESULT_SIZE)
+    if (maxDriverResultSizeInBytes == 0) {
+      // Unlimited result size, so 1 is OK
+      return 1


Out of curiosity, why is this 1 given that the default argument for depth in rdd.treeAggregate is 2?

srowen · 2020-10-01

Good question - 2 could be OK too. I suspect that was chosen to line up with the default max result size. Higher depths are needed when the result size is smaller, so I figured when the result size is unlimited, the depth can be as low as possible, 1.
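The trade-off in this reply can be made concrete with a rough model. The sketch below assumes (this is not stated in the thread) that a tree of depth d over n partitions merges about n^(1/d) partial results per level, so the last level delivers roughly that many objects to the driver. The names `FanInSketch` and `approxDriverBytes` are hypothetical.

```scala
// Hypothetical model for illustration only; not how Spark schedules treeAggregate.
object FanInSketch {
  // Approximate bytes arriving at the driver in the final merge step,
  // assuming an even fan-in of numPartitions^(1/depth) at every level.
  def approxDriverBytes(numPartitions: Int, objectSizeBytes: Long, depth: Int): Double = {
    require(depth >= 1, "depth must be at least 1")
    math.pow(numPartitions.toDouble, 1.0 / depth) * objectSizeBytes
  }
}
```

In this model, 100 partitions of 1000-byte objects send about 100000 bytes to the driver at depth 1 and about 10000 bytes at depth 2. A smaller result-size limit therefore calls for a deeper tree, while an unlimited one places no constraint on depth.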

SparkQA · 2020-10-01T19:53:15Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33930/

SparkQA · 2020-10-01T20:09:56Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33930/

SparkQA · 2020-10-01T20:49:54Z

Test build #129315 has finished for PR 29925 at commit 0d15226.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-10-02T13:54:36Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33958/

SparkQA · 2020-10-02T14:13:28Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33958/

SparkQA · 2020-10-02T14:29:46Z

Test build #129348 has finished for PR 29925 at commit 2274f52.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-10-03T18:13:14Z

Merged to master/3.0

…uristic computation

Closes #29925 from srowen/SPARK-33043.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit f86171a)
Signed-off-by: Sean Owen <srowen@gmail.com>

Handle spark.driver.maxResultSize=0 in RowMatrix heuristic computation

0d15226

srowen self-assigned this Oct 1, 2020

HyukjinKwon approved these changes Oct 2, 2020

Review changes

2274f52

srowen closed this in f86171a Oct 3, 2020

srowen deleted the SPARK-33043 branch March 9, 2021 02:21
